Mining Latent Associations of Objects Using a Typed Mixture Model
--A case study on expert/expertise mining

Shenghua Bao, Yunbo Cao, Bing Liu, Yong Yu, and Hang Li

Shenghua Bao and Yong Yu: Computer Science & Engineering, Shanghai Jiao Tong University, No.800 Dongchuan Road, Shanghai, China, 200240. {shhbao,yyu}@apex.sjtu.edu.cn
Yunbo Cao and Hang Li: Microsoft Research Asia, 5F Sigma Center, No.49 Zhichun Road, Haidian, Beijing, China, 100080. {yucao, hangli}@microsoft.com
Bing Liu: Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607. [email protected]

Abstract

This paper studies the problem of discovering latent associations among objects in text documents. Specifically, given two sets of objects and various types of co-occurrence data concerning the objects existing in texts, we aim to discover the hidden or latent associative relationships between the two sets of objects. Existing methods are not directly applicable as they are unable to consider all this information. For example, the probabilistic mixture model called Separable Mixture Model (SMM) proposed by Hofmann can use only one type of co-occurrences to mine latent associations. This paper proposes a more general probabilistic mixture model called the Typed Separable Mixture Model (TSMM), which is able to use all types of co-occurrences within a single framework. Experimental results based on the expert/expertise mining task show that TSMM outperforms SMM significantly.

1. Introduction

This paper addresses the problem of finding latent associative relationships between two sets of objects, using various types of co-occurrence data existing in texts concerning the sets of objects. A formal model is proposed to perform this task.

Given two sets of objects, X and Y, discovering latent association relationships between the objects in X and Y means finding ‘soft clusters’ over the objects. Although our goal is to mine latent associations between objects in X and Y, the co-occurrences among the objects in X, and the co-occurrences among the objects in Y may also be helpful. Furthermore, co-occurrences between the objects in X (or Y) and the objects in another set Z can be useful too. To add to the complexity, the co-occurrences may be from different contexts. How to exploit all the co-occurrence information in mining becomes an important issue.

In this paper, we propose a general mixture model, called Typed Separable Mixture Model (TSMM), which is an extension of SMM [5] and is able to exploit the information contained in all possible co-occurrences concerning the object sets. The key feature of TSMM is that it contains a type variable, which represents the type of co-occurrences. Letting the type variable t respectively denote co-occurrences between objects in X and Y, co-occurrences among objects in X, and co-occurrences among objects in Y, for example, TSMM becomes a generative model of all co-occurrences and their types. Using various types of co-occurrence data and TSMM, we can significantly improve the accuracy of mining of associations between the objects in X and Y.

TSMM can be used to solve many object association problems. As a case study, we apply it to expert/expertise mining. We have conducted extensive experiments using two data sets. One is from the TREC expert search task and the other is from the web site of Microsoft Research. Experimental results on expert/expertise mining show that more precise soft clusters were found by TSMM. As an application, we considered expert search, which is also a competition task in TREC-2005 [1]. Experimental results on expert search show that TSMM outperforms SMM significantly.

The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 formulates the problem of object association mining from co-occurrence data. Section 4 presents our general model of TSMM. Section 5 gives the experimental results. Finally, we make some concluding remarks in Section 6.

2. Related work

2.1. Mining latent association

Hofmann proposed the Separable Mixture Model (SMM) to represent latent associations between objects. SMM is a model consisting of a number of soft clusters which can give rise to co-occurrence data [5]. Mining with SMM is to estimate the model based on the observed co-occurrences. Modifications of SMM lead to other models including Asymmetric Clustering Model (ACM), Symmetric Clustering Model (SCM), and Hierarchical Clustering Model (HCM) [5]. These models have been applied to information retrieval [6], natural language processing [7], and collaborative filtering [8].

All the models proposed by Hofmann can only make use of one kind of direct co-occurrences. In practice, it is more advantageous if one can mine associations from more complex co-occurrence information. TSMM is proposed for this purpose. It is able to make use of complex co-occurrences by explicitly modeling the types of the co-occurrences. In this paper, we confine ourselves to extending only SMM; the same idea can be applied to ACM, SCM, and HCM as well.

2.2. Expert/expertise mining

Discovering the right associations between people (experts) and topics (expertise) within or outside an organization is becoming increasingly important for conducting business in an enterprise. A simple approach to solving this problem is to manually develop an expert/expertise database, often called a skill inventory [9, 10]. Recently, methods for automatically finding experts from documents were introduced [2, 3]. Many systems designed for expert/expertise mining and search have been developed, e.g., [4]. At TREC-2005, a task on expert search was organized [1].

3. Object association mining

Object association mining is a text mining problem. Specifically, given two sets of objects and various types of co-occurrence data in text documents related to the objects, we want to discover the latent associative relationships between the objects. In this paper, we assume that objects can be entities, properties, or concepts.

First, we assume that there are multiple sets of objects, for example, three sets of objects, X={x1, x2, …, xM}, Y={y1, y2, …, yN}, and Z={z1, z2, …, zR}. Without loss of generality, suppose that we want to mine the association relationship between the objects in X and Y. Next, we assume that multiple types of co-occurrences exist, i.e., between objects in X×Y, Y×Z, X×Z, X×X, Y×Y, and Z×Z. Furthermore, the co-occurrences between objects can be observed in multiple contexts. That is to say, even co-occurrences of the same type, for example, X×X, can come from different contexts of the documents (e.g., titles or bodies). The goal is to mine the latent associations between the objects in X and Y using all the co-occurrence information.

4. Typed separable mixture model

We now propose the Typed Separable Mixture Model (TSMM) for mining latent associations between objects.

4.1. TSMM - General model

Let X = {x1, x2, …, xM}, Y = {y1, y2, …, yN} represent the two sets of objects in question. To model different co-occurrences concerning the objects, we introduce a factor of ‘type’ t. Thus (a, b, t) denotes the co-occurrence of objects a and b of type t. Different types of co-occurrences can be written as S’ = {(al, bl, tl): al∈X∪Y, bl∈X∪Y, al ≠ bl, tl∈T, l=1, 2, …, L} where T denotes the set of all possible types. The co-occurrence data set becomes a sample of triples generated from a probability distribution. We further assume that the following generative model has given rise to the sample data.

$$P(a, b, t) = \sum_{\alpha=1}^{K} P(t)\, P(a, b, \alpha \mid t) \tag{1}$$

The log-likelihood of the model with respect to the data can be calculated as:

$$LL = \sum_{l=1}^{L} \log P(a_l, b_l, t_l) \tag{2}$$

Due to the use of the type variable t, we call the new model Typed Separable Mixture Model (TSMM). This addition makes TSMM much more powerful and general than SMM as we will see below.

4.2. TSMM – Case 1

We consider the case in which co-occurrences between X and X and co-occurrences between Y and Y can help the mining between X and Y. TSMM can be employed here. The sample data is represented as S’={(al, bl, tl): al∈X∪Y, bl∈X∪Y, tl∈T, l=1, 2, …, L} where T = {tXY, tXX, tYY}, and tXY, tXX, and tYY represent the co-occurrences (a, b) from X×Y, X×X and Y×Y respectively. In TSMM, all the co-occurrences between the objects are put together into the model. We write the joint probability of (a, b, t) as:

$$P(a, b, t) = \sum_{\alpha=1}^{K} P(t)\, P(\alpha \mid t)\, P(a, b \mid \alpha, t) = \sum_{\alpha=1}^{K} P(t)\, P(\alpha \mid t)\, P(a \mid \alpha, t)\, P(b \mid \alpha, t) \tag{3}$$

More precisely, ai and bj become xi and yj, xi and xj, or yi and yj, when type t changes. Equation (3) can be rewritten as:

$$P(a, b, t) = \begin{cases} \sum_{\alpha=1}^{K} P(t_{XY})\, P(\alpha)\, P(x \mid \alpha)\, P(y \mid \alpha), & t = t_{XY} \\ \sum_{\alpha=1}^{K} P(t_{XX})\, P(\alpha)\, P(x \mid \alpha)\, P(x' \mid \alpha), & t = t_{XX} \\ \sum_{\alpha=1}^{K} P(t_{YY})\, P(\alpha)\, P(y \mid \alpha)\, P(y' \mid \alpha), & t = t_{YY} \end{cases} \tag{4}$$

where we assume that x and y are only dependent on α and α is independent of t. This is because α is a soft cluster over the two sets of objects, not related to the co-occurrence data. Note that in the model, through the use of α, co-occurrence data of different types are ‘linked’ together. The log-likelihood with respect to the given data S’ becomes:

$$\begin{aligned} LL = {} & \sum_{i=1}^{M} \sum_{j=1}^{N} f(x_i, y_j, t_{XY}) \log \left[ \sum_{\alpha=1}^{K} P(t_{XY})\, P(\alpha)\, P(x_i \mid \alpha)\, P(y_j \mid \alpha) \right] \\ & + \sum_{i=1}^{M} \sum_{j=1}^{M} f(x_i, x_j, t_{XX}) \log \left[ \sum_{\alpha=1}^{K} P(t_{XX})\, P(\alpha)\, P(x_i \mid \alpha)\, P(x_j \mid \alpha) \right] \\ & + \sum_{i=1}^{N} \sum_{j=1}^{N} f(y_i, y_j, t_{YY}) \log \left[ \sum_{\alpha=1}^{K} P(t_{YY})\, P(\alpha)\, P(y_i \mid \alpha)\, P(y_j \mid \alpha) \right] \end{aligned}$$

Here f(a, b, t) denotes the frequency of the co-occurrence (a, b) of type t in S’.

The EM algorithm for approximately estimating the parameters θ=(P(t), P(α), P(x|α), P(y|α)) in the TSMM model can be constructed.
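One way such an EM procedure for Case 1 could be organized is sketched below in Python. This is an illustrative derivation from Equation (4), not the authors' implementation: the function name tsmm_em, the type labels "XY", "XX", and "YY", the random initialization, and the toy counts are all assumptions made for the example. The E-step computes the posterior of α for each observed triple (P(t) cancels in the normalization), and the M-step renormalizes the expected counts. Because P(t) does not interact with α, it is estimated once as the relative frequency of each type.

```python
import math
import random

# Which object set each position of a co-occurrence draws from, per type.
# The labels "XY", "XX", "YY" stand in for t_XY, t_XX, t_YY (our naming).
POSITIONS = {"XY": ("x", "y"), "XX": ("x", "x"), "YY": ("y", "y")}


def tsmm_em(counts, xs, ys, k, iters=50, seed=0):
    """Fit TSMM Case 1 by EM (illustrative sketch).

    counts maps (a, b, t) -> frequency f(a, b, t); xs and ys list the objects.
    Returns P(t), P(alpha), P(x|alpha), P(y|alpha) and the log-likelihood trace.
    """
    rng = random.Random(seed)
    total = sum(counts.values())

    def random_dist(keys):
        weights = {key: rng.random() + 0.1 for key in keys}
        z = sum(weights.values())
        return {key: w / z for key, w in weights.items()}

    # P(t) is estimated as the relative frequency of each type.
    p_t = {}
    for (_, _, t), f in counts.items():
        p_t[t] = p_t.get(t, 0.0) + f / total

    p_alpha = [1.0 / k] * k
    p = {"x": [random_dist(xs) for _ in range(k)],
         "y": [random_dist(ys) for _ in range(k)]}
    trace = []

    for _ in range(iters):
        exp_alpha = [0.0] * k
        exp_obj = {"x": [dict.fromkeys(xs, 0.0) for _ in range(k)],
                   "y": [dict.fromkeys(ys, 0.0) for _ in range(k)]}
        ll = 0.0
        for (a, b, t), f in counts.items():
            sa, sb = POSITIONS[t]
            # E-step: posterior of alpha given (a, b, t); P(t) cancels out.
            joint = [p_alpha[c] * p[sa][c][a] * p[sb][c][b] for c in range(k)]
            z = sum(joint)
            ll += f * math.log(p_t[t] * z)
            for c in range(k):
                r = f * joint[c] / z
                exp_alpha[c] += r
                exp_obj[sa][c][a] += r
                exp_obj[sb][c][b] += r
        trace.append(ll)
        # M-step: renormalize the expected counts.
        p_alpha = [v / total for v in exp_alpha]
        for s in ("x", "y"):
            for c in range(k):
                norm = sum(exp_obj[s][c].values())
                if norm > 0:
                    p[s][c] = {o: v / norm for o, v in exp_obj[s][c].items()}
    return p_t, p_alpha, p["x"], p["y"], trace


# Hypothetical toy data: people (X) and topics (Y) with typed co-occurrences.
toy_xs = ["alice", "bob", "carol", "dave"]
toy_ys = ["xml", "dom", "speech", "audio"]
toy_counts = {
    ("alice", "xml", "XY"): 5, ("bob", "xml", "XY"): 3,
    ("carol", "speech", "XY"): 4, ("dave", "audio", "XY"): 1,
    ("bob", "speech", "XY"): 1,
    ("alice", "bob", "XX"): 2, ("carol", "dave", "XX"): 2,
    ("xml", "dom", "YY"): 3, ("speech", "audio", "YY"): 2,
}
```

Calling tsmm_em(toy_counts, toy_xs, toy_ys, k=2) returns the fitted distributions together with a log-likelihood trace, which should be non-decreasing across iterations. The sketch works with plain probabilities, so a large-scale run would likely need log-space arithmetic or smoothing to avoid underflow.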

4.3. TSMM – Case 2

Next, we consider a case in which there are more than two sets of objects. Let X={x1, x2, …, xM}, Y={y1, y2, …, yN}, and Z={z1, z2, …, zR} represent three sets of objects. Again, we use (a, b, t) to denote an instance of co-occurrence and its type. The sample data is S’={(al, bl, tl): al, bl∈X∪Y∪Z, tl∈T, l=1, 2, …, L}, where T={tXY, tXZ, tYZ} and tXY, tXZ, and tYZ represent the co-occurrences from X×Y, X×Z, and Y×Z respectively. The log-likelihood can be defined with the involvement of type t. Similarly, we can employ EM to approximately estimate the parameters.

4.4. TSMM – Case 3

Co-occurrences can be observed from different contexts. For example, in expert/expertise mining, co-occurrences between people and topics can be observed not only in the bodies of the documents, but also in the title and the author fields of the documents. Co-occurrences in different contexts need to be combined together, which can be easily done within the framework of TSMM.

For simplicity, suppose there are two sets of co-occurrence data; each set is from one context. The co-occurrence data is about direct co-occurrences. We can use the type variable in TSMM to represent the two contexts, i.e., t∈{tXY, t’XY}, where tXY and t’XY denote the two contexts. In this way the sample data becomes S’= {(al, bl, tl): al∈X, bl∈Y, tl∈T, l=1, 2, …, L}. Again the log-likelihood can be defined with the involvement of type t. The EM algorithm can be used to approximately estimate the parameters.

4.5. TSMM - Combination

So far we have seen that TSMM can make use of co-occurrences of different forms. It is easy to make a further extension such that it can utilize all types together. Specifically, we use the type variable to define all the possible forms of co-occurrences and define TSMM as a generative model of all the co-occurrences and their types.

In this paper we extend SMM into TSMM. We can also make similar extensions on other mixture models proposed by Hofmann. We focus on SMM, because it is the most basic model.

5. Experimental results

To test the effectiveness of the proposed TSMM model, we applied it to expert/expertise mining. We also compared it with SMM in the experiments.

5.1. Experiment setup

5.1.1. Data sets. The experiments were conducted on two data sets, the W3C corpus and the MSR corpus. The W3C corpus is the data set used in the expert search task of the enterprise search track at TREC-2005 [1]. It consists of 331,037 web pages. The ground truth includes 60 topics. Each topic is associated with a list of people who are experts on the topic. All the people are from a list of 1,092 staff members at W3C. The MSR corpus comes from the web site of Microsoft Research (http://research.microsoft.com/). It consists of 31,754 web pages. We selected 32 research topics and then built a ground truth against the topics based on the annotations by the related researchers. All the people are from a list of 667 researchers.

5.1.2. Personal name identification. We identify persons by matching each candidate’s email, full name, abbreviated name, etc. using heuristic rules. Table 1 shows the number of experts found in the two data sets with our name identification method.

5.1.3. Topic identification. A string matching method was used to identify the topics in the two data sets. The statistics of the identified topics are given in Table 2.

5.1.4. Co-occurrence extraction. We used co-occurrences within sections. If two entities (person and topic) appear in the same section, we consider that they co-occur once. A section is identified with HTML tags. Table 3 gives the statistics of the co-occurrence data within sections.

Table 1. Statistics on persons

| Data set | Num. of Persons | Num. of Persons in Ground Truth |
|---|---|---|
| W3C | 765 | 765 |
| MSR | 554 | 174 |

Table 2. Statistics on topics

| Data set | Num. of Topics | Num. of Topics in Ground Truth |
|---|---|---|
| W3C | 222 | 43 |
| MSR | 141 | 24 |

Table 3. Number of co-occurrences within sections

| Data set | Expert-Expertise | Expert-Expert | Expertise-Expertise |
|---|---|---|---|
| W3C | 59,277,035 | 15,163,303 | 15,744,566,410 |
| MSR | 132,869 | 247,886 | 288,121 |
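The section-level counting rule of Section 5.1.4 can be sketched in Python as follows. The sketch assumes that person and topic identification and the splitting of pages into sections have already been done, so each section arrives as a set of entity names. The function name count_cooccurrences and the type labels "XY", "XX", and "YY" are hypothetical, and applying the same once-per-section rule to expert-expert and expertise-expertise pairs is an assumption, since the text states the rule only for person-topic pairs.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sections, persons, topics):
    """Count typed co-occurrences, one per pair per section.

    sections: iterable of sets of entity names, one set per section.
    persons, topics: sets of known person and topic names.
    Returns a Counter keyed by (a, b, type).
    """
    counts = Counter()
    for entities in sections:
        # Sort so that each unordered pair always gets the same key.
        found_persons = sorted(e for e in entities if e in persons)
        found_topics = sorted(e for e in entities if e in topics)
        # Expert-expertise pairs: every person with every topic in the section.
        for person in found_persons:
            for topic in found_topics:
                counts[(person, topic, "XY")] += 1
        # Expert-expert and expertise-expertise pairs (assumed same rule).
        for a, b in combinations(found_persons, 2):
            counts[(a, b, "XX")] += 1
        for a, b in combinations(found_topics, 2):
            counts[(a, b, "YY")] += 1
    return counts
```

The resulting counter has the form of the frequencies f(a, b, t) that enter the TSMM log-likelihood.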

5.2. Expert/expertise mining

We conducted expert/expertise mining using the co-occurrence data in the two data sets. Both TSMM and SMM were utilized. In this section, we show soft clusters obtained from the models. We can observe a general tendency that the clusters obtained by TSMM appear to be more meaningful than those obtained by SMM.

5.2.1. Mining with TSMM-multiple relations. We first performed expert/expertise mining using co-occurrence data between expert-expertise, expert-expert and expertise-expertise within sections. Table 4 shows the top two clusters generated by TSMM on the W3C data with the number of clusters set to 20. We see that cluster-1 is mainly related to XML, and cluster-2 is mainly related to web technologies. Two similar clusters obtained from SMM are shown in Table 5. We see that cluster-2 is difficult to interpret. It contains both web technologies (e.g. ‘Server’, ‘Architecture’) and semantic web (e.g. ‘RDF’).

Table 4. Clusters generated by TSMM on W3C data

| Cluster ID | Top-4 Ranked Expertise in Cluster |
|---|---|
| 1 | XML, DOM, Binary XML, SVG |
| 2 | Server, HTTP, URI, User Agent |

Table 5. Clusters generated by SMM on W3C data

| Cluster ID | Top-4 Ranked Expertise in Cluster |
|---|---|
| 1 | XML, HTTP, XPATH, XSL |
| 2 | RDF, HTTP, Server, Architecture |

5.2.2. Mining with TSMM-multiple sets. We evaluated the results of using TSMM on mining expert/expertise with not only the people set and the topic set, but also an additional group set with the MSR data. The group set contains information on which researcher belongs to which group. The corresponding co-occurrence data between group and person can also be obtained. Table 6 shows two clusters on MSR data generated by TSMM using all the information. The cluster number is set to 20. It is easy to see that cluster-1 and cluster-2 are related to graphics and speech respectively. The similar clusters learned by SMM are shown in Table 7. We can see that the results in Table 6 are more consistent than those in Table 7. There are noisy terms in the clusters in Table 7, e.g., ‘Programming Principles and Tools’ in cluster 1.

Table 6. Clusters generated by TSMM on MSR data

| Cluster ID | Top-4 Ranked Expertise in Cluster |
|---|---|
| 1 | Graphics, Computer Vision, Pattern Recognition, Graph Theory |
| 2 | Speech Recognition, Speech, Pattern Recognition, Audio processing |

Table 7. Clusters generated by SMM on MSR data

| Cluster ID | Top-4 Ranked Expertise in Cluster |
|---|---|
| 1 | Graphics, Magpie, Speech, Programming Principles and Tools |
| 2 | Speech, Signal Processing, Speech Recognition, Audio Processing |

5.3. Expert search

5.3.1. Expert search methods. We can also perform expert search based on mined associations between experts and expertise. The experts are ranked by the conditional probability P(xi|yj), where xi is a person (expert) and yj is a topic (expertise). A baseline method for calculating P(xi|yj) is the Frequency Based Method (FBM), which directly estimates the conditional probabilities using frequencies.

$$P(x_i \mid y_j) = \frac{f(x_i, y_j)}{\sum_{i'} f(x_{i'}, y_j)} \tag{5}$$

Here, f(xi, yj) denotes the co-occurrence frequency of expert xi and expertise yj. For the mixture models (SMM and TSMM), we use the following equation to calculate the conditional probabilities.

$$P(x_i \mid y_j) = \sum_{\alpha=1}^{K} P(x_i \mid \alpha)\, P(\alpha \mid y_j) \tag{6}$$

Here P(xi|α) is estimated from the mixture model and P(α|yj) is calculated using the Bayes rule.
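The two ranking rules can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names rank_fbm and rank_mixture and the data layout (dictionaries keyed by expert and topic names, one dictionary per cluster) are assumptions. The Bayes step uses P(α|yj) = P(α)P(yj|α) / Σα' P(α')P(yj|α'), which is the standard form of the rule and assumes the topic has nonzero probability under the model.

```python
def rank_fbm(freq, topic):
    """Frequency Based Method: P(x | y) = f(x, y) / sum over x' of f(x', y).

    freq maps (expert, topic) -> co-occurrence frequency.
    """
    column = {x: f for (x, y), f in freq.items() if y == topic}
    z = sum(column.values())
    scores = {x: f / z for x, f in column.items()}
    return sorted(scores.items(), key=lambda pair: -pair[1])


def rank_mixture(p_alpha, p_x, p_y, topic):
    """Mixture-model ranking: P(x | y) = sum over alpha of P(x|alpha) P(alpha|y).

    p_alpha: list of P(alpha); p_x[c] and p_y[c] are dicts holding
    P(x | alpha=c) and P(y | alpha=c).
    """
    k = len(p_alpha)
    # Bayes rule: P(alpha | y) is proportional to P(alpha) * P(y | alpha).
    joint = [p_alpha[c] * p_y[c][topic] for c in range(k)]
    z = sum(joint)
    posterior = [j / z for j in joint]
    scores = {
        x: sum(p_x[c][x] * posterior[c] for c in range(k))
        for x in p_x[0]
    }
    return sorted(scores.items(), key=lambda pair: -pair[1])
```

Both functions return (expert, probability) pairs sorted from most to least likely, so the same evaluation code can be applied to FBM, SMM, and TSMM outputs.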

5.3.2. Evaluation measure. We used mean average precision (MAP) to evaluate expert search. MAP was also used as the main measure in the expert search task of TREC-2005.

Table 8. Expert search results in MAP (100 clusters)

| Methods | W3C | MSR |
|---|---|---|
| FBM | 0.2144 | 0.4550 |
| SMM | 0.2197 | 0.4320 |
| TSMM | 0.2470 | 0.4736 |
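For reference, mean average precision can be computed as in the following Python sketch. It uses the common definition in which average precision is normalized by the number of relevant experts for the topic; the exact conventions of the TREC evaluation tooling (for example, ranking cut-offs) are not given here, so this is an assumption, and the function names are illustrative.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at the rank of each relevant item retrieved.
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Here each ranked list would be the experts ordered by P(xi|yj) for one topic, and the relevant set would be the ground-truth experts for that topic.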

[Figure 1. Search results with different number of clusters. x-axis: Num. of Clusters (0 to 100); y-axis: MAP; curves: SMM, TSMM, and FBM on W3C and on MSR.]

5.3.3. TSMM for expert search. We conducted expert search experiments on both W3C and MSR data sets. The number of clusters was set to 100. In TSMM, expert-expert, and expertise-expertise co-occurrences as well as expert-expertise co-occurrence were all utilized. Table 8 shows the results. From the table, we observe that 1) TSMM performs better than SMM by 15.21% and 9.63% on the W3C and MSR data sets respectively. We performed t-tests on the results. The p-values for W3C and MSR data sets are 0.0187 and 0.0470 (< 0.05) respectively. Thus the improvement of TSMM over SMM is statistically significant. We can conclude that TSMM can indeed effectively utilize additional co-occurrence information to enhance the accuracy of mining results. 2) TSMM outperforms FBM as well, because it can effectively use the latent clusters to help enhance the mining results.

Figure 1 shows the expert search results of TSMM, SMM, and FBM in terms of MAP when the number of clusters changes. We see that TSMM always outperforms SMM. When the number of clusters is small, SMM is not able to work better than FBM.

6. Conclusion

In this paper, we studied the problem of latent object association mining from texts. The main contributions are:
1) The proposal of studying the new problem of mining latent object associations based on various types of co-occurrence data. The co-occurrences can represent different relations, of different types, and from different contexts.
2) The proposal of the Typed Separable Mixture Model (TSMM) which is able to utilize all types of co-occurrences in mining within a single framework. To our knowledge, no existing technique can exploit such diverse types of co-occurrence data.
3) The application of the TSMM model to expert/expertise mining, and the verification of its effectiveness.

In our future work, we plan to apply TSMM to other tasks and also extend it to other mixture models.

References

[1] Craswell, N., de Vries, A. P., and Soboroff, I. Overview of the TREC-2005 Enterprise Track. In Proc. of TREC’05, 2005, 199-204.
[2] Craswell, N., Hawking, D., Vercoustre, A. M., and Wilkins, P. P@NOPTIC Expert: Searching for Experts not just for Documents. In Proc. of Ausweb, 2001.
[3] D’Amore, R. J. Expertise Community Detection. In Proc. of SIGIR’04. ACM Press, 2004, 498-499.
[4] ExpertNet: http://www.cvcp.ac.uk/expertnet.htm.
[5] Hofmann, T., and Puzicha, J. Statistical Models for Co-occurrence Data. Technical report, A.I. Memo 1635, MIT, 1998.
[6] Hofmann, T. Probabilistic Latent Semantic Indexing. In Proc. of SIGIR’99, 1999, 50-57.
[7] Hofmann, T. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, Volume 42, Issue 1-2, 2001, 177-196.
[8] Hofmann, T. Latent Semantic Models for Collaborative Filtering. ACM TOIS, Volume 22, Issue 1, January 2004, 89-115.
[9] ProfNet: http://www.profnet.com/.
[10] Skillview: http://www.skillview.com