2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
Coauthor Network Topic Models with Application to Expert Finding

Jia Zeng1,2, William K. Cheung2, Chun-hung Li2 and Jiming Liu2
1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China
2 Department of Computer Science, Hong Kong Baptist University, Hong Kong
Email: [email protected]
Abstract—This paper presents the coauthor network topic (CNT) model, constructed based on Markov random fields (MRFs) with higher-order cliques. Regularized by the complex coauthor network structure, the CNT can simultaneously learn topic distributions and the expertise of authors from large document collections. Besides modeling pairwise relations, we also model higher-order coauthor relations and investigate their effects on topic and expertise modeling. We derive efficient inference and learning algorithms from the Gibbs sampling procedure. To confirm its effectiveness, we apply the CNT to the expert finding problem on a DBLP corpus of titles from six computer science conferences. Experiments show that higher-order relations among coauthors improve the topic and expertise modeling performance over pairwise relations alone, and thus help find more relevant experts given a query topic or document.

Keywords-Topic models; coauthor document network; expert finding; higher-order relation; Gibbs sampling.

I. INTRODUCTION

Confronted with large document repositories such as scientific papers, online forums and web pages, one fundamental problem is topic modeling, which segments the whole repository into related topical communities and extracts a set of semantically meaningful topics. Besides content information, document repositories are also characterized by their structural relations. For example, scientific documents are generally composed of fields like titles, abstracts, keywords and texts containing the word content. Documents can be implicitly linked based on the cosine similarity of their word content. Other fields of documents, like journals (proceedings), authors, affiliations and references, also define explicit links between them, resulting in the complex structure of the document network. Regularizing topic models by network structures is the subject of intense research, and the incorporation of coauthor and citation links has been widely considered in recent network topic models (NTMs) [2], [5], [7], [11].

Other than topic segmentation, expert finding is an important application of topic models [3], [8], where an author's expertise is modeled as a mixture of topics. Given a query document, we rank a list of authors or experts with relevant topics. Accurate expert finding has many practical applications, such as assigning reviewers to research grant applications and peer-reviewed scientific papers. However, recent NTMs have not been used for expert finding, partly
due to their limited expressive ability for modeling higher-order relations in coauthor document networks.

[Figure 1. Coauthor document networks. Documents o1-o5 are linked through the coauthor network to the expertise distributions f1-f3 of their authors; the topics to be learned are marked by question marks, and dashed lines denote the relevance between a query document and an author's expertise.]

Fig. 1 shows a typical scenario of a coauthor document network with higher-order relations. In Fig. 1, o denotes a document and f denotes the expertise distribution characterizing an author. In this example, the document o3 has three coauthors, whose expertise is characterized by {f1, f2, f3}. If we model only pairwise relations, f1 and f3 influence f2 through separate links, which is not correct: there often exist cases where the author f1 may not influence f2 unless the relation with f3 is established. Coauthors usually collaborate collectively rather than separately to write a document, so simply projecting higher-order relations onto pairwise ones loses information for expertise modeling. It is therefore more natural to incorporate the more complex higher-order relations, where {f1, f3} jointly influence f2 in Fig. 1. Such higher-order relations can accurately reflect complex coauthor network structures, and thus have the potential to improve both topic summarization and expertise modeling performance.

In this paper, as illustrated in Fig. 1, we aim to address two problems: 1) how to learn topics (denoted by question marks) from a coauthor document network with higher-order relations considered; and 2) how to predict the relevance (denoted by the dashed lines) between a query document (e.g., o5) and an author's expertise (e.g., f3). We develop the coauthor network topic (CNT) model based on Markov random fields (MRFs) with higher-order cliques [10]. The MRF framework has been widely used in computer vision and image processing because of
its expressive power for local statistical interactions [12]-[14]. The coauthors of the same document form a clique. We consider only 2-order, 3-order and 4-order cliques to describe the complex higher-order relations in coauthor document networks, because scientific documents are generally written by four or fewer coauthors (see Fig. 3A for statistics of our dataset). Compared with pairwise (2-order) cliques alone, the additional 3-order and 4-order cliques express finer details of the coauthor network structure, and so may help identify more meaningful topic communities in document networks. Meanwhile, the higher-order network structure regularizes the expertise modeling of the authors in a community, which is very useful for tasks like matching papers with reviewers [8] or expert finding [3] on a selected topic in a professional network.

Our main contributions in this paper are summarized as follows: 1) we propose the CNT for modeling both pairwise and higher-order relations in coauthor document networks; 2) we develop efficient inference and learning algorithms for the CNT based on Gibbs sampling; and 3) we apply the CNT to the expert finding problem.

We organize this paper as follows. Section II discusses related work on coauthor network topic modeling. Section III defines the problem, formulates two variants of the CNT, the single-author CNT (CNT-S) and the multiple-author CNT (CNT-M), and derives efficient inference and learning algorithms from Gibbs sampling. Section IV applies the CNT to the expert finding problem and presents experimental results on a corpus of titles from six computer science conferences. Section V draws conclusions.
II. RELATED WORK

Scientific documents have fields like titles, abstracts, keywords and texts, which induce implicit content similarity links between each document pair for content-based document clustering [15], [16] and topic modeling, e.g., latent Dirichlet allocation (LDA) [1]. They also have fields like authors and references that build explicit coauthor and citation links. These links can further improve the performance of topic modeling [2], [5], [7], [11]. The relational topic model (RTM) [2] describes pairwise citation relations as conditional probabilities on latent topic variables. The topic-link latent Dirichlet allocation (TLLDA) [5] introduces pairwise community information of authors as regularization; it focuses only on citation links between documents regularized by the authors' community information, and does not take the coauthor network structure into account. The multi-relational topic model (MRTM) [11] combines both citation and coauthor links into a unified MRF framework. All the aforementioned topic models neglect the complex higher-order coauthor relations, which may limit their expressive ability for topic and expertise modeling.

Combining coauthor network structure and topic modeling has attracted intense interest. The basic idea is that the expertise of each author is represented as a mixture of topics, and the expertise correlations established via coauthor relations affect the topic assignment to each word of their documents. In the Author-Recipient-Topic (ART) model [6], each pair of coauthors, e.g., an email sender and recipient, is represented as an expertise distribution for generating topics. The ART is consistent with the real-world coauthor relation in that each pair of authors focuses on topics different from those of the single authors alone. The Author-Person-Topic (APT) model [8] incorporates a cluster of expertise distributions in order to diversify each author's expertise. Topic models have recently been used for expert finding problems [3], [8], where the similarity between the query document and the expertise is used to rank a list of relevant authors as candidate reviewers. Again, these topic models ignore the higher-order coauthor network structure for both topic and expertise modeling.

III. COAUTHOR NETWORK TOPIC MODELS

In our proposed CNT, the coauthor network structure is explicitly modeled as statistical interactions of the coauthors' expertise. Besides pairwise relations, higher-order coauthor relations are described by higher-order clique potentials within the MRF framework. Through maximizing the corresponding joint probabilities, we partition the coauthor document networks into topical communities and estimate the expertise of each author. Given a query document in the test set, we can retrieve relevant authors in the training set based on the learned topics for expert finding.

A. Notations

We have a document network O = {o_1, ..., o_d, ..., o_D} with 1 ≤ d ≤ D documents, where o_d represents the "bag of words" of the document d. By cascading all these words, we can rewrite O = {o_1, ..., o_i, ..., o_I}, where the word o_{di} ∈ o_d and I = Σ_{d=1}^{D} |o_d|. Each word can take one of W unique words from the vocabulary, W = {1, ..., w, ..., W}. The topic model assigns a topic label z_i to each word o_i in the document network. Each topic label z_i takes one of J topics from the set J = {1, ..., j, ..., J}. We define the topic configuration as Z = {z_1, ..., z_i, ..., z_I}, which partitions all these words into different topics. Each word is generated by a topic according to a multinomial distribution, P(o_i = w | z_i = j) = φ_w(j), with a Dirichlet hyperparameter β on the multinomial distribution φ. All the documents O altogether have 1 ≤ a ≤ A authors, in which each author a is associated with a multinomial expertise distribution f_a = [f_a(j)] over J topics, where each element f_a(j) is the expertise strength satisfying f_a(j) ≥ 0 and Σ_j f_a(j) = 1. The expertise configuration over all the authors composes the matrix F. The document d also has document-specific topic proportions over J topics, with strength h_d = (h_d(j)), h_d(j) ≥ 0, Σ_j h_d(j) = 1. A Dirichlet hyperparameter α is on both f and h. Each document d determines a clique of coauthors c_d, where the authors (a, a′) ∈ c_d are pairwise neighbors of each other defined in the clique. For simplicity, we consider 1, 2, 3, 4-order cliques, denoted as C = {C_1, C_2, C_3, C_4}, where c_{da} ∈ C is a clique defined by the document d containing the author a. As a result, the document network can be represented as a graph, G = {O, C}, where C encodes the coauthor network structure.
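As a concrete illustration of this notation, the sketch below (with toy data and variable names of our own, not from the paper) builds the clique sets C_1-C_4 from a document-to-authors mapping:

```python
# Build the clique sets C_1..C_4 from a document -> author-list mapping.
# doc_authors is hypothetical toy data for illustration only.
doc_authors = {
    "o1": ["a1"],              # 1-order clique: a single author
    "o2": ["a1", "a2"],        # 2-order (pairwise) clique
    "o3": ["a1", "a2", "a3"],  # 3-order clique
}

C = {1: set(), 2: set(), 3: set(), 4: set()}
for doc, authors in doc_authors.items():
    k = len(authors)
    if 1 <= k <= 4:  # the CNT models 1, 2, 3, 4-order cliques only
        C[k].add(tuple(sorted(authors)))

# C encodes the coauthor network structure of the graph G = {O, C}
print({k: sorted(v) for k, v in C.items()})
```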
B. Formulation

Our objective is to find the best topic labeling configuration Ĝ = {Z, F} to generate the observed graph G = {O, C}. To this end, we need to maximize the joint probability P(Ĝ, G|α) in terms of Ĝ. The generative process for each word in the document network is as follows. For the document d, the hyperparameter α first generates a clique of coauthors c_d. From the expertise distribution f_{c_d} over the clique, a topic z_i is sampled for the word o_i. We have two variants of the CNT based on two different generative strategies α → f_{c_d} → z_i. The first strategy randomly samples an author a from c_d according to a uniform distribution, and then samples a topic z_i from the expertise distribution f_a for the word o_{di}. We call this CNT the single-author CNT (CNT-S). Such a strategy has also been used in the AT model [9], where each word is associated with one of the coauthors. The second strategy calculates f_{c_d} as the average of f_a, a ∈ c_d. In this sense, we may view f_{c_d} as the average expertise distribution, and sample a topic z_i from f_{c_d} for the word o_i. We call this CNT the multiple-author CNT (CNT-M). The CNT-S assumes that one of the coauthors is responsible for each word, while the CNT-M assumes that all coauthors contribute equally to each word.

More specifically, Fig. 2A shows the graphical representation of the CNT-S. Two documents o_d and o_{d′} determine two cliques of coauthors c_d and c_{d′}. There is one author who writes both documents d and d′, so that the two documents are linked. The hyperparameter α regularizes the coauthor network. For the document d, we randomly select one author a from c_d, whose expertise distribution f_a generates a topic label z_i for the word o_i. Similarly, Fig. 2B shows the graphical representation of the CNT-M. The only difference from the CNT-S lies in the strategy to generate the topic label z_i: the CNT-M calculates the average expertise f_{c_d} = (1/|c_d|) Σ_{a∈c_d} f_a, and then generates a topic label z_i from f_{c_d}. Note that, in Fig. 2, both CNT-S and CNT-M encode rich coauthor network structures by both pairwise and higher-order cliques c_d ∈ C.

[Figure 2. (A) Single-author CNT. (B) Multiple-author CNT. Plate diagrams in which α generates the coauthor cliques c_d and c_{d′}, which determine the expertise distributions f_a (CNT-S) or f_{c_d} (CNT-M) that generate the topic labels z_i of the words in documents o_d and o_{d′}.]

For the CNT-S (Fig. 2A), we need to maximize P(Ĝ, G|α) with respect to Ĝ. Using Bayes' rule, we obtain

P(Ĝ, G|α) = P(O|Z) P(Z|C, F) P(C, F|α),   (1)

with {C, F} denoting the expertise configurations F over cliques C in the coauthor network. We model the coauthor network P(C, F|α) as an MRF with the following Gibbs distribution form [13],

P(C, F|α) = (1/Z_α) exp(−Σ_C V_C),   (2)

where V_C denotes the clique potentials and the constant normalization factor Z_α can be neglected. In principle, we may arbitrarily design the clique potentials V_C, provided they increase the corresponding Gibbs probability in terms of the labeling configurations. In particular, we define the 1, 2, 3, 4-order clique potentials as the corresponding negative log-likelihood functions,
V_{C_1} = −ln P(f_a|α),  V_{C_2} = −ln P(f_{C_2}|α),  V_{C_3} = −ln P(f_{C_3}|α),  V_{C_4} = −ln P(f_{C_4}|α).
By combining Eqs. (1) and (2), we can rewrite Eq. (1) as

P(Ĝ, G|α) = Π_{i=1}^{I} P(o_i|z_i) P(z_i|f_a) P(a|c_d) P(f_a|α) × Π_{C_2} P(f_{C_2}|α) Π_{C_3} P(f_{C_3}|α) Π_{C_4} P(f_{C_4}|α),   (3)

where both pairwise C_2 and higher-order relations {C_3, C_4} are accounted for. The probability P(a|c_d) is a uniform distribution for generating the author a from the clique c_d. Similarly, for the CNT-M (Fig. 2B), the objective is

P(Ĝ, G|α) = Π_{i=1}^{I} P(o_i|z_i) P(z_i|f_{c_d}) P(f_{c_d}|α) × Π_{C_2} P(f_{C_2}|α) Π_{C_3} P(f_{C_3}|α) Π_{C_4} P(f_{C_4}|α),   (4)

where the coauthor expertise interactions are modeled in the same way as in the CNT-S.
For simplicity, we approximate

P(z_i|f_{c_d}) P(f_{c_d}|α) ∝ Π_{a∈c_d} P(z_i|f_a) P(f_a|α),   (5)

which means the coauthors of a document contribute equally to generate the topic label for each word.

C. Pairwise and Higher-order Relation Modeling

To model the higher-order relations, we need to describe characteristics of higher-order relations among the authors' expertise. One characteristic is the expertise similarity in the coauthor network. We define the coauthor expertise similarity of 2, 3, 4-order cliques,

f_{C_2} = Σ_{j=1}^{J} f_a(j) f_{a′}(j) / (‖f_a‖ ‖f_{a′}‖),   (6)

f_{C_3} = Σ_{j=1}^{J} f_a(j) f_{a′}(j) f_{a″}(j) / (‖f_a‖ ‖f_{a′}‖ ‖f_{a″}‖),   (7)

f_{C_4} = Σ_{j=1}^{J} f_a(j) f_{a′}(j) f_{a″}(j) f_{a‴}(j) / (‖f_a‖ ‖f_{a′}‖ ‖f_{a″}‖ ‖f_{a‴}‖),   (8)
where C_2 = {a, a′}, C_3 = {a, a′, a″} and C_4 = {a, a′, a″, a‴}; f_{C_2} is the cosine similarity between two expertise vectors of coauthors, while f_{C_3} and f_{C_4} have mathematical formulations similar to that of f_{C_2}. Generalized linear models (GLMs) have been used for pairwise relation modeling in recent NTMs [2], [5], [11]. We also use GLMs for coauthor relation modeling given α,

P(f_{C_2}|α) = σ(w f_{C_2} + b),   (9)

where w is the weight matrix, b is the bias vector, and σ is the logistic activation function. Similarly, the higher-order relations P(f_{C_3}|α) and P(f_{C_4}|α) are also described by GLMs. To estimate the parameters of Eq. (9), we need negative samples randomly selected from groups of authors who have not collaborated on a document.
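For concreteness, here is a minimal sketch of the clique similarities of Eqs. (6)-(8) and the GLM of Eq. (9). The scalar weight and bias are placeholders of our own; the paper fits them by IRLS (Section III-D):

```python
import numpy as np

def clique_similarity(F):
    """Expertise similarity of a 2-, 3- or 4-order clique, Eqs. (6)-(8).
    F: (k, J) array whose rows are the coauthors' expertise
    distributions f_a over J topics, k in {2, 3, 4}."""
    num = np.prod(F, axis=0).sum()            # sum_j of elementwise product
    den = np.prod(np.linalg.norm(F, axis=1))  # product of norms ||f_a||
    return num / den

def glm_prob(f_C, w=1.0, b=0.0):
    """P(f_C | alpha) = sigma(w * f_C + b), Eq. (9).
    w and b are placeholder values, to be fitted by IRLS."""
    return 1.0 / (1.0 + np.exp(-(w * f_C + b)))

# example: a 2-order clique, where Eq. (6) reduces to cosine similarity
fa = np.array([0.7, 0.2, 0.1])
fb = np.array([0.6, 0.3, 0.1])
f_C2 = clique_similarity(np.vstack([fa, fb]))
print(f_C2, glm_prob(f_C2))
```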
D. Inference and Learning

In the CNT, our objective is to estimate the topic configuration Z, the expertise configuration F, the topic distribution over words φ, and the parameters of the GLM (9) from training data. We extend the Gibbs sampling algorithm [4], [9], which iteratively infers the best topic labeling configuration over words in the document network according to the objective (3). After the best topic configuration for all words is obtained, the other parameters can be estimated using a generalized expectation-maximization (EM) algorithm. With the CNT-S defined, we need to derive the full conditional probability for the Gibbs sampler,

P(z_i = j, u_i = a | o_i = w, Z_{−i}, U_{−i}, O_{−i}, C_{−a}) ∝ (n^w_{−i,j} + β) / (Σ_w n^w_{−i,j} + Wβ) × (n^a_{−i,j} + α) / (Σ_j n^a_{−i,j} + Jα) × Q_a(j),   (10)

where z_i = j and u_i = a represent the assignments of the word o_i in a document to the topic j and the author a respectively, o_i = w represents the word o_i taking the word w in the vocabulary, Z_{−i} and U_{−i} represent all topic and author assignments excluding the word o_i, n^w_{−i,j} and n^a_{−i,j} are the number of times the word w is assigned to the topic j and the number of times the author a is assigned to the topic j excluding the current instance, α and β are fixed Dirichlet hyperparameters, and J and W are the number of topics and the number of words in the vocabulary. The term Q_a(j) is the support from the collaborators (neighbors) of the author a,

Q_a(j) = Σ_{c_a ∈ C_2, C_3, C_4} P(z_i = j, f_{c_a}|α),   (11)

where c_a is a clique containing the author a, and C_2, C_3, C_4 are the sets of 2, 3, 4-order cliques. Furthermore, we obtain

P(z_i = j, f_{c_a}|α) = P(z_i = j|f_{c_a}) P(f_{c_a}|α),   (12)

where, if c_a ∈ C_2, the term P(f_{c_a}|α) is defined in Eq. (9) and the term P(z_i = j|f_{c_a}) ∝ f_a(j) f_{a′}(j)/(‖f_a‖ ‖f_{a′}‖) as defined in Eq. (6). We can thus calculate by Eq. (11) how the neighbors of the author a support the assignment of the topic j: a higher Q_a(j) implies that the topic j is more likely to be generated for the words in documents coauthored by the author a. After a topic z_i = j and an author u_i = a are assigned to the word o_i = w, we can update φ_w(j) and f_a(j) by

φ̂_w(j) = (n^w_j + β) / (Σ_w n^w_j + Wβ),   (13)

f̂_a(j) = (n^a_j + α) / (Σ_j n^a_j + Jα),   (14)

where n^w_j is the total number of times the word w is associated with the topic j, and n^a_j is the total number of times the topic j is associated with the author a. After we obtain the updated f_a, we can estimate the GLM (9) using the iteratively reweighted least squares (IRLS) algorithm.

The Gibbs sampler for the CNT-M is slightly different because we use the different generative strategy of Eq. (5). In the case of the CNT-M (4), the full conditional probability for the Gibbs sampler becomes

P(z_i = j | o_i = w, Z_{−i}, O_{−i}, C) ∝ (n^w_{−i,j} + β) / (Σ_w n^w_{−i,j} + Wβ) × (n^a_{−i,j} + α) / (Σ_j n^a_{−i,j} + Jα) × Π_{a∈c_d} Q_a(j),   (15)

where c_d is the clique of coauthors for the document d, and Q_a(j) is defined in Eq. (11). In the CNT-M, because all coauthors are responsible for all the words in the document, all the authors a ∈ c_d share the same number of words in the document as n^a_j when estimating the expertise distribution in Eq. (14).
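A sketch of the CNT-S full conditional of Eq. (10) for a single word follows. It assumes the author assignment a has already been drawn from the clique c_d and the neighbor support Q_a(j) of Eq. (11) has been precomputed; all counts are toy values:

```python
import numpy as np

def cnt_s_conditional(w, a, n_wj, n_aj, alpha, beta, Q_a):
    """Normalized P(z_i = j | ...) of Eq. (10) for word w and author a.
    n_wj: (W, J) word-topic counts excluding the current word
    n_aj: (A, J) author-topic counts excluding the current word
    Q_a:  (J,)   neighbor support Q_a(j) of Eq. (11) for author a"""
    W, J = n_wj.shape
    word_term = (n_wj[w] + beta) / (n_wj.sum(axis=0) + W * beta)
    author_term = (n_aj[a] + alpha) / (n_aj[a].sum() + J * alpha)
    p = word_term * author_term * Q_a
    return p / p.sum()

# toy example with J = 3 topics and W = 10 words; alpha = 50/J, beta = 200/W
rng = np.random.default_rng(0)
n_wj = rng.integers(0, 5, size=(10, 3)).astype(float)
n_aj = rng.integers(0, 5, size=(4, 3)).astype(float)
p = cnt_s_conditional(w=2, a=1, n_wj=n_wj, n_aj=n_aj,
                      alpha=50 / 3, beta=200 / 10, Q_a=np.ones(3))
z = rng.choice(3, p=p)  # sample a topic for this word
```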
E. Computational Complexity

The computational complexity of the Gibbs samplers (10) and (15) is O(IJC), where I is the total number of words, J is the number of topics, and C is the total number of cliques determined by all documents. In practice, we first run the Gibbs sampler for several iterations without the neighboring support Q_a(j), because the expertise estimates f_a are not yet reliable enough for estimating the GLM (9). After several iterations, we add the coauthor network support Q_a(j) to refine the topic configuration Z. Because the expertise changes at each iteration, we would need to update the GLM (9) continuously during sampling, which is computationally intractable; in practice, we update the GLM every hundred iterations of the Gibbs sampling process. Note that if every document has a single author, both CNT-S and CNT-M reduce to the standard LDA model [1], because the coauthor network structure disappears.
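The training schedule just described might be organized as below; the helper functions are illustrative stubs and the burn-in length is our own assumption (the paper only says "several iterations"):

```python
# Training-schedule sketch; helpers are stubs, not the paper's code.
def gibbs_sweep(use_support):
    pass  # one pass of Eq. (10)/(15) over all words

def estimate_expertise():
    return None  # update all f_a via Eq. (14)

def fit_glm_irls(F):
    pass  # refit Eq. (9) by IRLS on positive/negative cliques

N_ITER, BURN_IN, GLM_EVERY = 2000, 200, 100  # burn-in length assumed
for it in range(N_ITER):
    # early iterations skip the neighbor support Q_a(j), since the
    # expertise estimates f_a are not yet reliable enough for the GLM
    gibbs_sweep(use_support=(it >= BURN_IN))
    if it >= BURN_IN and it % GLM_EVERY == 0:
        fit_glm_irls(estimate_expertise())  # "every hundred iterations"
```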
Number of papers
2000
1500
1000
500
0 0
2
4
6
8
10
12
14
16
18
35
40
20
Number of coauthors
B 3500
Number of coauthors
3000
F. Expert Finding

We apply the proposed CNT to two practical expert finding tasks. The first task is collaborator finding: given a query author and his/her documents, the CNT has to retrieve a list of experts who could potentially be his/her collaborators. This task is useful when a student wants to find an advisor (or a research group) for further research, or when a human resources department wants to find relevant reviewers to evaluate the academic background of applicants. The second task is to retrieve a list of experts in the training set who may have the expertise to review a query paper in the test set, which is useful for matching papers with reviewers in the peer-review process [8].

For the first task, given a query author's expertise, we use the author expertise similarity and the GLM (9) to retrieve the most likely authors in the training set. We estimate the parameters of the GLMs (9) from the training set. Then we use the Gibbs sampling algorithm to estimate the expertise distribution over topics of the query author in the test set, using the parameter φ (13) learned on the training set. Finally, we calculate the likelihood of the query author with all authors in the training data, and predict that a candidate clique exists if P(C|α) in Eq. (9) is above 0.5; a higher likelihood corresponds to a more relevant clique. This is a standard information retrieval problem, and we use F-measures to evaluate the relation prediction performance.

For the second task, we first estimate the topic proportions of the query paper based on φ (13), and compute the cosine similarity with all authors' expertise distributions in the training set. Then we rank the similarity values and retrieve the top N = 5 authors as candidate reviewers, deleting the coauthors of the query paper from the ranking list to avoid retrieving them as reviewers. A sketch of this ranking step follows.
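The sketch below implements the cosine-similarity ranking with coauthor exclusion; the array shapes and toy data are our own illustration:

```python
import numpy as np

def rank_reviewers(h_query, F, authors, coauthors, top_n=5):
    """Rank training-set authors by cosine similarity between the query
    paper's topic proportions h_query (J,) and each row of the (A, J)
    expertise matrix F, skipping the query paper's own coauthors."""
    sims = (F @ h_query) / (np.linalg.norm(F, axis=1) * np.linalg.norm(h_query))
    order = np.argsort(-sims)  # descending similarity
    return [authors[i] for i in order if authors[i] not in coauthors][:top_n]

# toy example: three authors, two topics, "a2" is a coauthor of the query
F = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
print(rank_reviewers(np.array([0.8, 0.2]), F, ["a1", "a2", "a3"], {"a2"}))
```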
[Figure 3. Statistics of coauthor document networks: (A) distribution of the number of coauthors per paper; (B) distribution of the number of papers per author.]
IV. EXPERIMENTAL RESULTS

A. Datasets

To evaluate the proposed CNT, the coauthor document network was extracted from the PROXIMITY DBLP dataset (http://kdl.cs.umass.edu/data/dblp/dblp-info.html). We selected six computer science conferences: ICCV (International Conference on Computer Vision) and CVPR (International Conference on Computer Vision and Pattern Recognition) for computer vision (CV) with 1944 papers, SIGIR (International ACM SIGIR Conference) and KDD (International Conference on Knowledge Discovery and Data Mining) for data mining (DM) with 1165 papers, and NIPS (Neural Information Processing Systems Conference) and ICML (International Conference on Machine Learning) for machine learning (ML) with 2233 papers. The network contains a total of 5342 papers with titles and 5951 authors. There are a total of 12098 pairwise coauthor cliques, 10958 3-order coauthor cliques and 11542 4-order coauthor cliques. After stemming and removing stop words, the vocabulary includes 3209 unique words. Fig. 3 shows the statistics of the coauthor document network. We see that around 90% of the papers have 2, 3 or 4 coauthors in Fig. 3A, and over 45% of the authors write more than one document in Fig. 3B. Therefore, the cliques C_2, C_3 and C_4 suffice to characterize the coauthor network structure.

For each paper and each author, we assign a true topic label as ground truth. Each paper has one of the three topic labels CV, DM and ML according to its conference. Each author belongs to one of the three topic communities CV, DM and ML according to the topic of his/her most frequent papers. These true labels are used to evaluate the accuracy of topic identification for both documents and authors. We divided the dataset into five folds. For each pair of training and test sets, we removed the shared authors so that the test set is completely unknown to the training set, and we report the average performance over the five-fold cross-validation.
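One possible reading of this five-fold protocol, as a sketch (the exact removal rule is our interpretation):

```python
import numpy as np

def five_fold_splits(doc_ids, doc_authors, seed=0):
    """Yield (train, test) document folds. Test documents sharing an
    author with the training fold are dropped, so the test authors are
    completely unknown to the training set (our reading of the setup)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(doc_ids), 5)
    for k in range(5):
        train = [d for i, f in enumerate(folds) if i != k for d in f]
        train_authors = {a for d in train for a in doc_authors[d]}
        test = [d for d in folds[k]
                if not set(doc_authors[d]) & train_authors]
        yield train, test
```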
B. Experimental Settings

We investigate the performance of the CNT-S and CNT-M and compare them against two benchmark topic models: the LDA model [4] and the AT model [9]. Note that we do not compare the CNT with other state-of-the-art NTMs [2], [5], [7], [11] using pairwise relations, because we believe their performance is comparable to that of our CNT without higher-order relations. We therefore have two conditions for both CNT-S and CNT-M: the first uses only pairwise relations, while the second uses all 2, 3, 4-order relations. We call these conditions pairwise CNT-S, higher-order CNT-S, pairwise CNT-M and higher-order CNT-M, respectively. The LDA and AT models are implemented in Matlab code available online (http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm). All models are trained using the Gibbs sampling algorithm with 2000 iterations. For simplicity, we use the same hyperparameters α = 50/J and β = 200/W, where J is the number of topics and W is the number of unique words in the vocabulary.
[Figure 4. Average word log likelihood over 5-25 topics for LDA, AT, pairwise CNT-S/CNT-M and higher-order CNT-S/CNT-M.]
C. Generative Ability

The word log likelihood is a standard measure of a generative topic model's ability to predict unknown test data [2]; it is defined as ln P(O_test|φ)/n_test, where φ is the topic distribution over words learned from the training set in Eq. (13). The higher the word log likelihood, the better the generative performance of the model. We examined the performance of all models with different numbers of topics, J = 5, 10, 15, 20, 25. Fig. 4 shows the average word log likelihood under five-fold cross-validation. In terms of word generative ability, the pairwise CNT-S/CNT-M and higher-order CNT-S/CNT-M perform significantly better than the baseline LDA and AT models; the higher-order CNT-M outperforms the LDA and AT models by 6% and 4%, respectively. We also observe that the higher-order CNT-S/CNT-M consistently outperform the corresponding pairwise CNT-S/CNT-M. This observation demonstrates that higher-order relation modeling indeed improves the generative ability of the model. Another interesting finding is that the pairwise CNT-M performs better than the higher-order CNT-S. This is reasonable because the CNT-M considers higher-order relations when generating topics, while the CNT-S selects only one of the authors when generating topics for words. Note that the higher-order CNT-M is still better than the pairwise CNT-M, which reconfirms that higher-order relations are beneficial in topic modeling. One may also wonder about the performance of the higher-order CNT without pairwise relation modeling; we find little improvement without incorporating pairwise relations. We therefore conclude that pairwise relations play the most important role in CNTs, but 3, 4-order relations can further improve the overall performance.
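As a sketch, the held-out measure can be computed as follows, where scoring each test word by mixing φ with the document topic proportions h is our assumption:

```python
import numpy as np

def avg_word_log_likelihood(docs, phi, h):
    """ln P(O_test | phi) / n_test. docs is a list of word-id lists,
    phi is the (W, J) word-topic matrix of Eq. (13), h is the (D, J)
    matrix of document topic proportions (mixing assumption ours)."""
    total, n = 0.0, 0
    for d, words in enumerate(docs):
        pw = phi @ h[d]                  # (W,) per-word probability in doc d
        total += np.log(pw[words]).sum()
        n += len(words)
    return total / n
```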
We are also interested in whether the predicted topics are close to the true topics. We use the winner-take-all strategy to get the predicted topic label for each test document and author, i.e., j* = arg max_j h_d(j) and j* = arg max_j f_a(j). To evaluate the closeness between predicted labels and true labels, we use the normalized mutual information (NMI) [15], [16],

NMI = I(X; Y) / √(H(X) H(Y)),

where X and Y are the predicted topic labels of documents and the true topic labels of these documents, I(X; Y) is the mutual information between X and Y, and H(X) and H(Y) are the entropies of X and Y. The NMI value ranges from zero to one: an NMI value of zero means that the result is close to a random partitioning, and an NMI value close to one means that the result is almost identical to the true topic labeling. A sketch of the computation follows.
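A direct implementation of this NMI from label counts might look like:

```python
import numpy as np

def nmi(x, y):
    """NMI = I(X;Y) / sqrt(H(X) H(Y)) for two label vectors."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    joint = {}
    for xi, yi in zip(x, y):                 # joint label counts
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
    px = {v: np.mean(x == v) for v in set(x)}
    py = {v: np.mean(y == v) for v in set(y)}
    I = sum((c / n) * np.log((c / n) / (px[a] * py[b]))
            for (a, b), c in joint.items())  # mutual information
    Hx = -sum(p * np.log(p) for p in px.values())
    Hy = -sum(p * np.log(p) for p in py.values())
    return I / np.sqrt(Hx * Hy)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition up to relabeling
```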
[Figure 5. Average NMI values over 5-25 topics, for documents (left panel) and authors (right panel), comparing LDA, AT, pairwise CNT-S/CNT-M and higher-order CNT-S/CNT-M.]

[Figure 6. Average F-measure for coauthor prediction over 5-25 topics, for pairwise, third-order and fourth-order cliques.]
Fig. 5 shows the average NMI values for all the models; a higher NMI value corresponds to a closer match between predicted and true topic labels. We see that the NMI values in Fig. 5 are consistent with the word log likelihood in Fig. 4. This observation shows that better generative ability corresponds to better ability to predict the true topic labels. All models can predict topic labels for each document, but the LDA model cannot predict topic labels for each author. The higher-order CNT-M gains nearly 45% improvement on average over LDA in terms of NMI values for documents. Likewise, the higher-order CNT-M gains nearly 15% improvement on average over the AT model in terms of NMI values for authors. The higher-order CNTs consistently perform better than the corresponding pairwise CNTs, by 6% on average for documents and 4% on average for authors.
D. Expert Finding

To evaluate the collaborator finding performance, we need to predict whether there is a clique between the authors in the test set and the authors in the training set. If a clique is correctly predicted, we count it as a true positive (TP). If a non-clique is correctly predicted, we count it as a true negative (TN). If a non-clique is predicted as a clique, we count it as a false positive (FP). If a clique is predicted as a non-clique, we count it as a false negative (FN). The sensitivity (Se) is calculated as Se = TP/(TP + FN), and the positive predictive value (PPV) as PPV = TP/(TP + FP). Based on Se and PPV, we use the F-measure

F = 2 × Se × PPV / (Se + PPV)

to evaluate the collaborator finding performance; the F-measure reflects the balance between Se and PPV, and a higher F-measure generally means better prediction performance. From both the training and test sets, we randomly extract negative samples from author groups verified to have no coauthor clique, and we use the same number of positive and negative samples for both training and testing. Collaborator finding is performed on 3025 pairwise, 2126 3-order and 1301 4-order positive and negative samples from the test data. Neither the LDA nor the AT model can predict coauthor relations, so we compare only the CNTs. Fig. 6 shows the average F-measure for collaborator finding. Generally, the higher-order CNTs perform significantly better than the corresponding pairwise CNTs. In terms of 3, 4-order cliques, we find that the higher-order CNT-M consistently outperforms the higher-order CNT-S. On average, the higher-order CNTs gain nearly 8% improvement in terms of average F-measure.

As far as matching papers with reviewers is concerned, it is difficult to evaluate the quality of query relevance rankings due to the scarcity of data that can be examined publicly. In this paper, we verify the retrieved authors' expertise relevance to the query paper using Google Scholar. We use J = 5 and randomly select three query paper titles from the test set. Table I shows the top five reviewers having the highest expertise similarity with the topic proportions of the query papers using the AT model, the pairwise CNT-M and the higher-order CNT-M, respectively. Aided by searching the authors' names and the query paper's title in Google Scholar, we manually confirm whether each retrieved author's expertise is relevant to the query paper. Compared with the AT model, both the pairwise CNT-M and the higher-order CNT-M retrieve relatively more relevant reviewers with related expertise, which reconfirms the effectiveness of the CNTs in terms of author expertise modeling.
Table I. MATCHING PAPERS WITH REVIEWERS.

Query paper: "Scene Modeling for Wide Area Surveillance and Image Synthesis"
AT: Mubarak Shah, Glenn Healey, Takeo Kanade, Andrew Zisserman, Amnon Shashua
Pairwise CNT-M: Rachid Deriche, Paul A. Viola, Jeremy S. De Bonet, Steven W. Zucker, Yunmei Chen
Higher-order CNT-M: Thomas S. Huang, Pietro Perona, Andrew Zisserman, Trevor Darrell, Larry S. Davis

Query paper: "Efficiently decodable and searchable natural language adaptive compression"
AT: Ming-Syan Chen, David Jensen, Atsushi Fujii, Pedro Domingos, Edward Y. Chang
Pairwise CNT-M: Rachid Deriche, Paul A. Viola, Jeremy S. De Bonet, Steven W. Zucker, Yunmei Chen
Higher-order CNT-M: W. Bruce Croft, Joemon M. Jose, Ian Ruthven, Tetsuya Sakai, ChengXiang Zhai

Query paper: "Optimizing spatio-temporal filters for improving Brain-Computer Interfacing"
AT: Wolfgang Maass, Marc Pollefeys, Jiawei Han, Nicu Sebe, Dan Pelleg
Pairwise CNT-M: Satinder P. Singh, Wolfgang Maass, Peter Dayan, Huan Liu, Carl Edward Rasmussen
Higher-order CNT-M: Dimitrios Gunopulos, Partha Niyogi, Richard S. Sutton, Daniel B. Neill, Xiaofei He

V. CONCLUSIONS

In this paper, we have developed the CNT for topic modeling of the coauthor document network within the higher-order MRF framework. We focus on the higher-order relations in the coauthor network, and find that they further improve the topic and expertise modeling performance over pairwise relations alone. On the controlled DBLP dataset, we apply the CNT to expert finding tasks and find that the higher-order CNT retrieves more relevant experts from the training set. Although higher-order relation modeling requires considerably more computation, it enhances the topic and expertise modeling performance by a reasonable margin. In future work, more experiments on larger datasets are needed to compare with other state-of-the-art expert finding algorithms. We shall also develop more powerful topic models that can handle both multiplex relations (multiple link types such as citation and coauthor) and the higher-order relations that exist widely in document and image networks.
REFERENCES

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3(4-5):993-1022, 2003.
[2] J. Chang and D. Blei. Relational topic models for document networks. In AISTATS, pages 81-88, 2009.
[3] H. Deng, I. King, and M. R. Lyu. Formal models for expert finding on DBLP bibliography data. In ICDM, pages 163-172, 2008.
[4] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc. Natl. Acad. Sci., 101:5228-5235, 2004.
[5] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-Link LDA: Joint models of topic and author community. In ICML, pages 665-672, 2009.
[6] A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in social networks. In IJCAI, pages 786-791, 2005.
[7] Q. Mei, D. Cai, D. Zhang, and C. X. Zhai. Topic modeling with network regularization. In WWW, pages 101-110, 2008.
[8] D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In KDD, pages 500-509, 2007.
[9] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI, pages 487-494, 2004.
[10] C. Rother, P. Kohli, W. Feng, and J. Jia. Minimizing sparse higher order energy functions of discrete variables. In CVPR, pages 1382-1389, 2009.
[11] J. Zeng, W. K.-W. Cheung, C.-H. Li, and J. Liu. Multirelational topic models. In ICDM, pages 1070-1075, 2009.
[12] J. Zeng, W. Feng, L. Xie, and Z.-Q. Liu. Cascade Markov random fields for stroke extraction of Chinese characters. Information Sciences, 180:301-311, 2010.
[13] J. Zeng and Z.-Q. Liu. Markov random field-based statistical character structure modeling for handwritten Chinese character recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(5):767-780, 2008.
[14] J. Zeng and Z.-Q. Liu. Type-2 fuzzy Markov random fields and their application to handwritten Chinese character recognition. IEEE Trans. Fuzzy Syst., 16(3):747-760, 2008.
[15] S. Zhu, I. Takigawa, J. Zeng, and H. Mamitsuka. Field independent probabilistic model for clustering multi-field documents. Information Processing & Management, 45(5):555-570, 2009.
[16] S. Zhu, J. Zeng, and H. Mamitsuka. Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity. Bioinformatics, 25(15):1944-1951, 2009.