Hindawi Publishing Corporation International Journal of Distributed Sensor Networks Volume 2014, Article ID 820715, 9 pages http://dx.doi.org/10.1155/2014/820715

Research Article

A Shared Interest Discovery Model for Coauthor Relationship in SNS

Xin An,1 Shuo Xu,2 Yali Wen,1 and Mingxing Hu1

1 School of Economics and Management, Beijing Forestry University, No. 35 Qinghua East Road, Haidian District, Beijing 100083, China
2 Information Technology Supporting Center, Institute of Scientific and Technical Information of China, No. 15 Fuxing Road, Haidian District, Beijing 100038, China

Correspondence should be addressed to Yali Wen; [email protected]

Received 6 December 2013; Revised 14 March 2014; Accepted 2 April 2014; Published 28 April 2014

Academic Editor: Goreti Marreiros

Copyright © 2014 Xin An et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A social network service (SNS) is a platform to build social networks or social relations among people. Many users enjoy SNS with their smart devices, which are mostly equipped with sensory devices. The sensitive information produced by these sensory devices is uploaded to SNS, which may raise many potential risks. In order to share one's sensitive data with random people without security and privacy concerns, this paper proposes a shared interest discovery model for coauthor relationships in SNS, named the coauthor topic (coAT) model, to identify users with similar interests from social networks; a collapsed Gibbs sampling method is utilized for inferring the model parameters. Thus, one can reduce the possibility that recommended users are not friends but attackers. Finally, extensive experimental results on the NIPS dataset indicate that our coAT model is feasible and efficient.

1. Introduction

A social network service (SNS) [1] is a platform to build social networks or social relations among people who, for example, share interests, activities, backgrounds, or real-life connections. An SNS consists of a representation of each user (often a profile), his/her social links, and a variety of additional services, such as a potential-friend recommendation service. Most social network services, such as Facebook, LinkedIn, and Twitter, are web based and provide means for users to interact over the Internet. Recently, many people have come to enjoy SNS through their smart devices, including phones and tablets, which are mostly equipped with sensory devices such as a global positioning system (GPS) receiver and a camera. The information produced by these sensory devices is often sensitive with respect to people's privacy. Nevertheless, many users upload such sensitive information to SNS for various reasons, which raises many potential risks [2]. For example, attackers often pretend to originate from a trusted SNS and post a comment containing a malicious URL.

If one follows the instructions, he/she may disclose sensitive information or compromise the security of his/her system. This paper focuses on the following problem: how to share one's sensitive data with random people without security and privacy concerns. Our main idea is to identify users with similar interests from social networks, so that recommended friends are indeed friends rather than attackers. This idea is motivated by the following observation. In order not to be recognized as an attacker, he/she often posts a comment related to your posted article; he/she does not actually care about the article but entices you to click the malicious URL hidden in the comment. If we are able to analyze in advance whether there are shared interests, the SNS may directly filter such comments. In this paper, an academic social network, which reflects the interests of academic researchers, is taken as the research object. For example, given the coauthors of Koch C in Figure 1 and the coauthored papers in Table 1, the model described in this article can easily discover the shared interests from this social network.

Figure 1: Coauthor social network of Koch C in NIPS dataset.

Table 1: Papers coauthored by Bair W, Horiuchi T, Luo J, Douglas R, and Manwani A with Koch C in NIPS dataset.

Bair W
  1. Real-time computer vision and robotics using analog VLSI circuits
  2. An analog VLSI chip for finding edges from zero-crossings
  3. Visual motion computation in analog VLSI using pulses
  4. Correlated neuronal response: time scales and mechanisms
Horiuchi T
  1. Real-time computer vision and robotics using analog VLSI circuits
  2. A delay-line based motion detection chip
  3. An analog VLSI saccadic eye movement system
  4. Analog VLSI circuits for attention-based, visual tracking
Luo J
  1. Computing motion using resistive networks
  2. Real-time computer vision and robotics using analog VLSI circuits
  3. Object-based analog VLSI vision circuits
Douglas R
  1. Network activity determines spatiotemporal integration in single cells
  2. Amplifying and linearizing apical synaptic inputs to cortical pyramidal cells
  3. Direction selectivity in primary visual cortex using massive intracortical connections
Manwani A
  1. Synaptic transmission: an information-theoretic perspective
  2. Multielectrode spike sorting by clustering transfer functions
  3. Memory capacity of linear versus nonlinear models of dendritic integration

The organization of the rest of this work is as follows. In Section 2, we first briefly discuss the author topic (AT) model and then introduce in detail our proposed coauthor topic (coAT) model. Section 3 describes the collapsed Gibbs sampling method used for inferring the model parameters. In Section 4, extensive experimental evaluations are conducted, and Section 5 concludes this work.

2. Generative Models for Documents

Before presenting our coauthor topic (coAT) model, the author topic (AT) model is described first. The notations are summarized in Table 2.

2.1. Author Topic (AT) Model. Rosen-Zvi et al. [3–5] propose the author topic (AT) model for extracting information about authors and topics from large text collections. They model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented by a probability distribution over words. The probability distribution over topics in a multiauthor paper is a mixture of the distributions associated with its authors. The graphical model representation of the AT model is shown in Figure 2.
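As a toy numerical illustration of this mixing assumption (our own sketch, not from the paper; all distributions below are made up), the topic distribution of a two-author paper is the uniform mixture of its authors' topic distributions:

import numpy as np

# Toy author-topic distributions over K = 4 topics (made-up values; rows sum to 1).
theta = {
    "author_A": np.array([0.70, 0.10, 0.10, 0.10]),
    "author_B": np.array([0.05, 0.80, 0.10, 0.05]),
}

# Toy topic-word distributions over a V = 5 word vocabulary (made-up values).
phi = np.array([
    [0.50, 0.20, 0.10, 0.10, 0.10],   # topic 0
    [0.10, 0.50, 0.20, 0.10, 0.10],   # topic 1
    [0.10, 0.10, 0.50, 0.20, 0.10],   # topic 2
    [0.10, 0.10, 0.10, 0.20, 0.50],   # topic 3
])

# Under the AT model each token first picks one of the coauthors uniformly, so the
# paper's effective topic distribution is the uniform mixture of the authors' rows.
doc_topic = np.mean(list(theta.values()), axis=0)
doc_word = doc_topic @ phi   # marginal word distribution of the two-author paper

print("document topic mixture:", doc_topic)
print("document word distribution:", doc_word)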


The AT model can be viewed as a generative process, which can be described as follows.

(1) For each topic k ∈ [1, K]:
    (i) draw a multinomial φ_k from Dirichlet(β);
(2) for each author a ∈ [1, A]:
    (i) draw a multinomial ϑ_a from Dirichlet(α);
(3) for each word n ∈ [1, N_m] in document m ∈ [1, M]:
    (i) draw an author assignment x_{m,n} uniformly from the group of authors a_m;
    (ii) draw a topic assignment z_{m,n} from Multinomial(ϑ_{x_{m,n}});
    (iii) draw a word w_{m,n} from Multinomial(φ_{z_{m,n}}).

Figure 2: The graphical model representation of the AT model.

Table 2: Notations used in the generative models.

Symbol      Description
K           Number of topics
M           Number of documents
V           Number of unique words
A           Number of unique authors
N_m         Number of word tokens in document m
A_m         Number of authors in document m (usually A_m > 1)
a_m         Authors in document m
ϑ_{i,j}     Multinomial distribution over topics specific to the coauthor relationship (i, j) in the coAT model
ϑ_a         Multinomial distribution over topics specific to author a in the AT model
φ_k         Multinomial distribution over words specific to topic k
z_{m,n}     Topic assignment associated with the nth token in document m
w_{m,n}     The nth token in document m
x_{m,n}     One chosen author associated with the word token w_{m,n}
y_{m,n}     Another chosen author associated with the word token w_{m,n}
α           Dirichlet prior (hyperparameter) on the multinomial distributions ϑ
β           Dirichlet prior (hyperparameter) on the multinomial distributions φ

2.2. Coauthor Topic (coAT) Model. The graphical model representation of the coAT model is shown in Figure 3(a). The coAT model can be viewed as a generative process, which can be described as follows.

(1) For each topic k ∈ [1, K]:
    (i) draw a multinomial φ_k from Dirichlet(β);
(2) for each author pair (i, j) with i ∈ [1, A − 1], j ∈ [i + 1, A]:
    (i) draw a multinomial ϑ_{i,j} from Dirichlet(α);
(3) for each word n ∈ [1, N_m] in document m ∈ [1, M]:
    (i) draw an author x_{m,n} uniformly from the group of authors a_m;
    (ii) draw another author y_{m,n} uniformly from the group of authors a_m \ x_{m,n};
    (iii) if x_{m,n} > y_{m,n}, swap x_{m,n} with y_{m,n};
    (iv) draw a topic assignment z_{m,n} from Multinomial(ϑ_{x_{m,n},y_{m,n}});
    (v) draw a word w_{m,n} from Multinomial(φ_{z_{m,n}}).

As shown in the above process, the posterior distribution of topics depends on the information from both the text and the authors. The parameterization of the coAT model is

\vartheta_{i,j} \mid \alpha \sim \mathrm{Dirichlet}(\alpha),
\varphi_{k} \mid \beta \sim \mathrm{Dirichlet}(\beta),
z_{m,n} \mid \vartheta_{x_{m,n},y_{m,n}} \sim \mathrm{Multinomial}(\vartheta_{x_{m,n},y_{m,n}}),
w_{m,n} \mid \varphi_{z_{m,n}} \sim \mathrm{Multinomial}(\varphi_{z_{m,n}}),
(x_{m,n}, y_{m,n}) \mid \mathbf{a}_m \sim \mathrm{Multinomial}\left(\frac{2}{A_m(A_m-1)}\right).        (1)

In fact, if we do not draw one author x_{m,n} from a_m and then another author y_{m,n} from a_m \ x_{m,n} as in Figure 3(a), but instead draw an author pair (x_{m,n}, y_{m,n}) simultaneously from a_m as in Figure 3(b), the coAT model is equivalent in principle to the AT model. One can verify this point by comparing Figure 3(b) with Figure 2.
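As a concrete illustration, the following forward-sampling sketch (our own toy example; the corpus sizes, author lists, and variable names are assumptions, not from the paper) generates documents according to steps (1)-(3) of the coAT process above:

import numpy as np

rng = np.random.default_rng(0)

K, V, A = 5, 50, 6                                 # topics, vocabulary size, authors (toy sizes)
alpha, beta = 0.5, 0.1                             # symmetric Dirichlet hyperparameters
authors_per_doc = [[0, 1, 2], [1, 3], [2, 4, 5]]   # a_m for three toy documents
N_m = 20                                           # word tokens per document

# Step (1): one word distribution per topic.
phi = rng.dirichlet(np.full(V, beta), size=K)      # K x V

# Step (2): one topic distribution per author pair (i, j) with i < j.
theta = {(i, j): rng.dirichlet(np.full(K, alpha))
         for i in range(A) for j in range(i + 1, A)}

docs = []
for a_m in authors_per_doc:
    words = []
    for _ in range(N_m):
        # Steps (3)(i)-(iii): draw two distinct authors and order the pair.
        x, y = rng.choice(a_m, size=2, replace=False)
        i, j = (int(x), int(y)) if x < y else (int(y), int(x))
        # Steps (3)(iv)-(v): draw a topic for the pair, then a word for the topic.
        z = rng.choice(K, p=theta[(i, j)])
        words.append(rng.choice(V, p=phi[z]))
    docs.append(words)

print(docs[0])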

3. Inference Algorithm

For inference, the task is to estimate the following sets of unknown parameters in the coAT model: (1) Φ = {φ_k}_{k=1}^{K} and Θ = {{ϑ_{i,j}}_{i=1}^{A−1}}_{j=i+1}^{A}; (2) the corresponding topic and author pair assignments z_{m,n} and (x_{m,n}, y_{m,n}) for each word token w_{m,n}. In fact, inference cannot be done exactly in this model.


Figure 3: The graphical model representations of the coAT model: (a) the two authors x_{m,n} and y_{m,n} are drawn sequentially; (b) the author pair (x_{m,n}, y_{m,n}) is drawn jointly.

A variety of algorithms have been used to estimate the parameters of topic models, such as variational EM (expectation maximization) [6, 7], expectation propagation [8, 9], belief propagation [10], and Gibbs sampling [11–15]. In this work, the collapsed Gibbs sampling algorithm [11–15] is used, since it provides a simple method for obtaining parameter estimates under Dirichlet priors and allows estimates to be combined from several local maxima of the posterior distribution. In the Gibbs sampling procedure, we need to calculate the full conditional distribution P(z_{m,n}, x_{m,n}, y_{m,n} | w, z_{¬(m,n)}, x_{¬(m,n)}, y_{¬(m,n)}, a, α, β), where z_{¬(m,n)} and (x_{¬(m,n)}, y_{¬(m,n)}) represent the topic and author pair assignments for all tokens except w_{m,n}, respectively. We begin with the joint distribution P(w, z, x, y | a, α, β) of a dataset and, using the chain rule, obtain the conditional probability conveniently as

P(z_{m,n}, x_{m,n}, y_{m,n} \mid \mathbf{w}, \mathbf{z}_{\neg(m,n)}, \mathbf{x}_{\neg(m,n)}, \mathbf{y}_{\neg(m,n)}, \mathbf{a}, \alpha, \beta)
  \propto \frac{n_{z_{m,n}}^{(w_{m,n})} + \beta_{w_{m,n}} - 1}{\sum_{v=1}^{V} \left( n_{z_{m,n}}^{(v)} + \beta_v \right) - 1}
  \times \frac{n_{x_{m,n},y_{m,n}}^{(z_{m,n})} + \alpha_{z_{m,n}} - 1}{\sum_{k=1}^{K} \left( n_{x_{m,n},y_{m,n}}^{(k)} + \alpha_k \right) - 1},        (2)

where n_k^{(v)} is the number of times word v is assigned to topic k, and n_{i,j}^{(k)} is the number of times author pair (i, j) is assigned to topic k. A detailed derivation of Gibbs sampling for coAT is provided in the appendix. If one further manipulates the above equation (2), one can turn it into separated update equations for the author pair and the topic of each token, suitable for random or systematic scan updates:

P(x_{m,n}, y_{m,n} \mid \mathbf{x}_{\neg(m,n)}, \mathbf{y}_{\neg(m,n)}, \mathbf{z}, \mathbf{a}, \alpha)
  \propto \frac{n_{x_{m,n},y_{m,n}}^{(z_{m,n})} + \alpha_{z_{m,n}} - 1}{\sum_{k=1}^{K} \left( n_{x_{m,n},y_{m,n}}^{(k)} + \alpha_k \right) - 1},        (3)

P(z_{m,n} \mid \mathbf{w}, \mathbf{z}_{\neg(m,n)}, \mathbf{x}, \mathbf{y}, \alpha, \beta)
  \propto \frac{n_{z_{m,n}}^{(w_{m,n})} + \beta_{w_{m,n}} - 1}{\sum_{v=1}^{V} \left( n_{z_{m,n}}^{(v)} + \beta_v \right) - 1}
  \times \frac{n_{x_{m,n},y_{m,n}}^{(z_{m,n})} + \alpha_{z_{m,n}} - 1}{\sum_{k=1}^{K} \left( n_{x_{m,n},y_{m,n}}^{(k)} + \alpha_k \right) - 1}.        (4)

During parameter estimation, the algorithm keeps track of two large data structures: an (A(A − 1)/2) × K count matrix n_{i,j}^{(k)} and a K × V count matrix n_k^{(v)}. From these data structures, one can easily estimate Φ and Θ as follows:

\varphi_{k,v} = \frac{n_k^{(v)} + \beta_v}{\sum_{v=1}^{V} \left( n_k^{(v)} + \beta_v \right)},        (5)

\vartheta_{i,j,k} = \frac{n_{i,j}^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_{i,j}^{(k)} + \alpha_k \right)}.        (6)
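In code, the read-out in (5) and (6) amounts to a smoothed row normalization of the two count matrices; a minimal numpy sketch (ours, assuming symmetric priors) is:

import numpy as np

def read_out_parameters(n_kv, n_pk, alpha=0.5, beta=0.1):
    # n_kv: K x V topic-word count matrix n_k^{(v)}
    # n_pk: (A(A-1)/2) x K pair-topic count matrix n_{i,j}^{(k)}
    # Equation (5): Phi, one smoothed word distribution per topic.
    phi = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
    # Equation (6): Theta, one smoothed topic distribution per author pair.
    theta = (n_pk + alpha) / (n_pk + alpha).sum(axis=1, keepdims=True)
    return phi, theta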

With (3)–(6), the Gibbs sampling algorithm for the coAT model is summarized in Algorithm 1. The procedure itself uses only seven larger data structures: the count variables n_{i,j}^{(k)} and n_k^{(v)}, which have dimension (A(A − 1)/2) × K and K × V, respectively; their row sums n_{i,j} and n_k, with dimension A(A − 1)/2 and K; and the state variables x_{m,n}, y_{m,n}, z_{m,n}, each with dimension W = Σ_{m=1}^{M} N_m.

4. Experimental Results and Discussions

The NIPS proceedings dataset is utilized to evaluate the performance of our model; it consists of the full text of the 13 years (1987–1999) of proceedings of the Neural Information Processing Systems (NIPS) conference. The dataset contains 1,740 research papers and 2,037 unique authors. The distribution of the number of papers over the number of authors is shown in Table 3, which indicates that the percentages of papers with at most 3, 4, and 5 authors are 87.8736%, 95.8046%, and 97.9885%, respectively.


Algorithm coATGibbs({w}, {a}, α, β, K)
Input: word vectors {w}, author vectors {a}, hyperparameters α, β, topic number K
Global data: count statistics {n_{i,j}^{(k)}}, {n_k^{(v)}} and their sums {n_{i,j}}, {n_k}
Output: topic associations {z}, author pair associations {x} and {y}, multinomial parameters Φ and Θ, hyperparameter estimates α, β

// initialization
zero all count variables: n_{i,j}^{(k)}, n_{i,j}, n_k^{(v)}, n_k
for all documents m ∈ [1, M] do
    for all words n ∈ [1, N_m] in document m do
        sample topic index z_{m,n} ~ Multinomial(1/K)
        sample one author index x_{m,n} ~ Multinomial(p) with p_i = 1/A_m if i ∈ a_m, and 0 otherwise
        sample another author index y_{m,n} ~ Multinomial(p) with p_i = 1/(A_m − 1) if i ∈ a_m \ x_{m,n}, and 0 otherwise
        if x_{m,n} > y_{m,n} then swap x_{m,n} with y_{m,n}
        // increment counts and sums
        n_{x_{m,n},y_{m,n}}^{(z_{m,n})} += 1; n_{x_{m,n},y_{m,n}} += 1; n_{z_{m,n}}^{(w_{m,n})} += 1; n_{z_{m,n}} += 1

// Gibbs sampling over burn-in period and sampling period
while not finished do
    for all documents m ∈ [1, M] do
        for all words n ∈ [1, N_m] in document m do
            // decrement counts and sums
            n_{x_{m,n},y_{m,n}}^{(z_{m,n})} −= 1; n_{x_{m,n},y_{m,n}} −= 1; n_{z_{m,n}}^{(w_{m,n})} −= 1; n_{z_{m,n}} −= 1
            sample a new author pair index (i, j) according to (3)
            sample a new topic index z according to (4)
            // increment counts and sums
            n_{i,j}^{(z)} += 1; n_{i,j} += 1; n_z^{(w_{m,n})} += 1; n_z += 1
    if converged and L sampling iterations since last read out then
        // the different parameter read-outs are averaged
        read out parameter set Φ according to (5)
        read out parameter set Θ according to (6)

Algorithm 1: Gibbs sampling algorithm for coAT model.
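The procedure can also be written directly in code. The following is a minimal, unoptimized numpy sketch of the sampler (our own illustration, not the authors' implementation; it assumes every document has at least two authors and omits burn-in handling, convergence checks, and sample averaging; function and variable names are ours):

import itertools
import numpy as np

def coat_gibbs(docs, doc_authors, num_authors, vocab_size, K, alpha=0.5, beta=0.1, iters=200, seed=0):
    # docs[m]: list of word ids; doc_authors[m]: list of author ids (at least 2 per document).
    rng = np.random.default_rng(seed)
    pairs = list(itertools.combinations(range(num_authors), 2))   # all (i, j) with i < j
    pid = {p: r for r, p in enumerate(pairs)}                     # pair -> row index in n_pk
    n_pk = np.zeros((len(pairs), K))                              # n_{i,j}^{(k)}: pair-topic counts
    n_kv = np.zeros((K, vocab_size))                              # n_k^{(v)}: topic-word counts
    cand = [[pid[p] for p in itertools.combinations(sorted(set(a)), 2)] for a in doc_authors]
    state = []                                                    # per-token (pair row, topic)

    # initialization: uniform author pair and uniform topic for every token
    for m, words in enumerate(docs):
        st = []
        for w in words:
            p = int(rng.choice(cand[m]))
            z = int(rng.integers(K))
            n_pk[p, z] += 1
            n_kv[z, w] += 1
            st.append([p, z])
        state.append(st)

    # Gibbs sweeps: the current token is excluded by decrementing its counts,
    # which plays the role of the "-1" terms in (3) and (4).
    for _ in range(iters):
        for m, words in enumerate(docs):
            for n, w in enumerate(words):
                p, z = state[m][n]
                n_pk[p, z] -= 1
                n_kv[z, w] -= 1
                # equation (3): author pair given the current topic z
                pp = (n_pk[cand[m], z] + alpha) / (n_pk[cand[m]].sum(axis=1) + K * alpha)
                p = cand[m][int(rng.choice(len(cand[m]), p=pp / pp.sum()))]
                # equation (4): topic given the new author pair and the word w
                pz = (n_kv[:, w] + beta) / (n_kv.sum(axis=1) + vocab_size * beta) * (n_pk[p] + alpha)
                z = int(rng.choice(K, p=pz / pz.sum()))
                n_pk[p, z] += 1
                n_kv[z, w] += 1
                state[m][n] = [p, z]
    return n_pk, n_kv   # Phi and Theta then follow from (5) and (6)

Φ and Θ can then be read out from the returned count matrices via (5) and (6), as in the read-out sketch above.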

In addition to downcasing and removing stopwords and numbers, we also remove the words appearing fewer than five times in the corpus. After this preprocessing, the dataset contains 13,649 unique words and 2,301,375 word tokens in total. Each document's timestamp is determined by the year of the proceedings. In our experiments, K is fixed at 100, the symmetric Dirichlet priors α and β are set to 0.5 and 0.1, respectively, and Gibbs sampling is run for 2000 iterations.
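As a rough illustration of this preprocessing pipeline (our own sketch; the tokenizer regex and the stopword excerpt are assumptions, not the paper's exact choices):

import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "we", "with"}   # excerpt only

def preprocess(raw_docs, min_count=5):
    # Downcase and keep alphabetic tokens only (which also drops numbers), removing stopwords.
    token_docs = [[t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOPWORDS]
                  for doc in raw_docs]
    # Remove words occurring fewer than min_count times in the whole corpus.
    counts = Counter(t for doc in token_docs for t in doc)
    vocab = sorted(t for t, c in counts.items() if c >= min_count)
    index = {t: i for i, t in enumerate(vocab)}
    docs = [[index[t] for t in doc if t in index] for doc in token_docs]
    return docs, vocab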

Table 3: Distribution of #papers over #authors in NIPS dataset.

#Authors    #Papers
1           414 (23.7931%)
2           743 (42.7011%)
3           372 (21.3793%)
4           138 (7.9310%)
5           38 (2.1839%)
6           23 (1.3218%)
7           7 (0.4023%)
8           1 (0.0575%)
9           3 (0.1724%)
10          1 (0.0575%)

4.1. Examples of Topic and Coauthor Relationship Distributions. Table 4 illustrates examples of 8 topics learned by the coAT model. The topics are extracted from a single sample at the 2000th iteration of the Gibbs sampler. Each topic is illustrated with (a) the top 10 words most likely to be generated conditioned on the topic and (b) the top 10 coauthor relationships with the highest probability conditioned on the topic.

4.2. Shared Interest Discovery. In order to further analyze the interests shared by an author pair, one can read Table 4 from the viewpoint of author pairs. Table 5 shows the interests shared with Koch C in the NIPS dataset. In Table 5, the meaning of each topic is given in Table 4, and × means that the resulting strength is very low. From Table 5, one can see that the interests shared by Koch C and Manwani A include Topic 97, Topic 50, and Topic 56, with strengths of 0.31870, 0.22405, and 0.10003, respectively. By comparing Table 5 with Table 1, it is not difficult to see that our discovered shared interests make sense.

4.3. Predictive Power Analysis. In order to compare the performance of the AT and coAT models, we further divide the NIPS papers into a training set D_train of 1,557 papers and a test set D_test of 183 papers. Each author in D_test must have authored at least one of the training papers. The perplexity, originally used in language modeling [16], is a standard measure for estimating the performance of a probabilistic model. The perplexity is defined as the reciprocal geometric mean of the token likelihoods in the test set D_test = {w̃_{m,·}, ã_{m,·}} under the AT or coAT model:

\mathrm{perplexity}_{\mathrm{AT}}\left(\{\tilde{\mathbf{w}}_{m,\cdot}\} \mid \{\tilde{\mathbf{a}}_{m,\cdot}\}, \mathcal{M}\right)
  = \exp\left[ -\frac{\sum_{m} \log P_{\mathrm{AT}}(\tilde{\mathbf{w}}_{m,\cdot} \mid \tilde{\mathbf{a}}_{m,\cdot}, \mathcal{M})}{\sum_{m} \left( N_m \times A_m \right)} \right],

\mathrm{perplexity}_{\mathrm{coAT}}\left(\{\tilde{\mathbf{w}}_{m,\cdot}\} \mid \{\tilde{\mathbf{a}}_{m,\cdot}\}, \mathcal{M}\right)
  = \exp\left[ -\frac{\sum_{m} \log P_{\mathrm{coAT}}(\tilde{\mathbf{w}}_{m,\cdot} \mid \tilde{\mathbf{a}}_{m,\cdot}, \mathcal{M})}{\sum_{m} \left( N_m \times \frac{A_m(A_m-1)}{2} \right)} \right],        (7)

with the token likelihood under the coAT model given by

P_{\mathrm{coAT}}\left(\tilde{\mathbf{w}}_{m,\cdot} \mid \tilde{\mathbf{a}}_{m,\cdot}, \mathcal{M}\right)
  = \prod_{n=1}^{N_m} \sum_{k=1}^{K} \sum_{i \in \tilde{\mathbf{a}}_m,\, j \in \tilde{\mathbf{a}}_m,\, i < j} \varphi_{k, \tilde{w}_{m,n}} \, \vartheta_{i,j,k}.        (8)

Figure 4: Perplexity of the test set D_test for the AT and coAT models with different numbers of topics (5, 10, 50, and 100).

Figure 4 shows the results for the AT model and the coAT model on the test set D_test. It is not difficult to see that the perplexity of the coAT model is smaller than that of the AT model, which indicates that the coAT model outperforms the AT model.
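As an illustration of how the coAT perplexity can be computed from estimated parameters, the following sketch follows our reading of (7) and (8) (ours, not the authors' code; pair_index is a hypothetical mapping from an author pair (i, j), i < j, to its row in Θ, and test documents are assumed to reuse the training vocabulary and authors):

import itertools
import numpy as np

def coat_perplexity(test_docs, test_authors, phi, theta, pair_index):
    # phi: K x V word distributions; theta: (#pairs) x K topic distributions.
    total_log_lik, total_norm = 0.0, 0.0
    for words, a_m in zip(test_docs, test_authors):
        rows = [pair_index[p] for p in itertools.combinations(sorted(a_m), 2)]
        # Equation (8): per-token likelihood summed over topics and author pairs.
        word_probs = theta[rows].sum(axis=0) @ phi             # shape (V,)
        total_log_lik += np.log(word_probs[words]).sum()
        total_norm += len(words) * len(rows)                    # N_m * A_m(A_m - 1) / 2
    # Equation (7): reciprocal geometric mean of the token likelihoods.
    return float(np.exp(-total_log_lik / total_norm))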

5. Conclusions