Injecting Structured Data to Generative Topic Model in Enterprise Settings

Han Xiao 1,2,*, Xiaojie Wang 2, and Chao Du 3

1 Technische Universität München, D-85748 Garching bei München, Germany. [email protected]
2 Beijing University of Posts and Telecommunications, 100876 Beijing, China. [email protected]
3 Beihang University, 100191 Beijing, China. [email protected]

* This work was done during H.X.'s internship at Hewlett-Packard Laboratories, China.

Abstract. Enterprises have steadily accumulated both structured and unstructured data as computing resources improve. However, previous research on enterprise data mining often treats these two kinds of data independently and overlooks their mutual benefits. We explore an approach to incorporating a common type of structured data, the organigram, into a generative topic model. Our approach, the Partially Observed Topic model (POT), not only considers the unstructured words, but also takes the structured information into account in its generative process. By integrating the structured data, the topic mixture of a document is partially observed during the Gibbs sampling procedure. This allows POT to learn topics in a targeted, directed manner, which makes it easy to tune and suitable for end-user applications. We evaluate the proposed model on a real-world dataset and show improved expressiveness over traditional LDA. In the task of document classification, POT also demonstrates more discriminative power than LDA.

Keywords: probabilistic topic model, structured data, document classification.

1 Introduction

The increased use of computers has greatly increased the amount of data stored in them. To take advantage of the rich storehouse of knowledge that this data represents, we need to harness it to extract information and turn it into business intelligence. Any well-structured dataset or database with a large number of observations and variables can be mined for valuable information; many retail businesses now use sales records to adjust their marketing strategies. On the other hand, it is equally important to analyze unstructured data such as text documents and webpages. A business developer might need to examine the latest news on webpages, manually code it into categories, and assign it to the related division. However, enterprises with information repositories do not yet make full use of these two kinds of data. Although text mining has been well studied over the last few decades, text is rarely analyzed with the support of structured data in the enterprise. Only a few works [1, 2] have discussed extracting interesting patterns from the



structured and unstructured data jointly with machine learning approaches. In this paper, we present a straightforward way to incorporate enterprise structured data into a generative topic model via Gibbs sampling. The structured data we consider in this work is the organigram, one of the most common kinds of structured information in an enterprise. Our new model, which we refer to henceforth as the Partially Observed Topic model (POT), defines a one-to-one correspondence between LDA's latent topics and the organizations constrained by the organigram. This allows the structured words to guide the evolution of topics in our model.

Fig. 1. Example of a three-level organigram, consisting of organization names and the hierarchical structure that links an organization with its parts: the State Food and Drug Administration at the top level; the Department of Finance Planning, Department of Food Safety Coordination and Department of Drug Registration below it; and divisions such as the Division of Secretaries, Division of Comprehensive Coordination, Division of Biological Products and Division of Pharmaceuticals at the third level.

The rest of the paper is organized as follows. In Section 2, we discuss past work related to topic models. In Section 3, we describe the POT model in detail, including its generative process and the collapsed Gibbs sampling used for parameter inference. Section 4 presents qualitative and quantitative evaluations of our proposed model against traditional LDA; by feeding the document-specific mixing proportions to a downstream SVM classifier, we demonstrate the advantages of POT in the document classification task. Section 5 concludes the paper and discusses future work.

2 Related Work

Research in statistical models of co-occurrence has led to the development of a variety of useful topic models. Topic models, such as Latent Dirichlet Allocation (LDA) [3], have been an effective tool for the statistical analysis of document collections and other discrete data. LDA assumes that the words of each document arise from a mixture of topics. The topics are shared by all documents in the collection; the topic proportions are document-specific and randomly drawn from a Dirichlet distribution. LDA allows each document to exhibit multiple topics with different proportions, and it can


thus capture the heterogeneity in grouped data that exhibits multiple patterns. In the recent past, several extensions to this model have been proposed, such as the Topics Over Time (ToT) model [4] that permits analyzing the popularity of topics as a function of time, the Hidden Markov Model-LDA [5] that integrates topic modeling with syntax, and the Author-Persona-Topic model [6] that models words and their authors. In each case, the graphical model structure is carefully designed to capture the relevant structure and co-occurrence dependencies among the data. However, most of these generative topic models are designed in an unsupervised fashion, which makes them suffer from unstable and unidentifiable topics. Due to the lack of an a priori ordering on the topics, they are unidentifiable between or even within runs of the algorithm. Some topics are stable and reappear across samples, while other topics are idiosyncratic to a particular solution. To avoid this problem, some modifications of LDA that incorporate supervision have been proposed. Supervised LDA [10] posits that a label is generated from each document's empirical topic mixture distribution. DiscLDA [11] associates a single categorical label variable with each document and associates a topic mixture with each label. A third model, presented in [2], showed that structured and unstructured data can naturally benefit each other in the tasks of entity identification and document categorization. We follow previous approaches in incorporating structured data into generative topic models, and propose a semi-supervised approach that learns from a document collection and a backend organigram database. We show how the organigram can reinforce topic clustering and document categorization. In the proposed Partially Observed Topic model (POT), we not only learn the topical components, but also map each topic to an organization. Thus each document is represented as a mixture of organizations, which are distributions over the words in the corpus. The POT model interprets the organization names and the other unstructured words in a joint generative process, and automatically learns the posterior distribution of each word conditioned on each organization. We also derive a Gibbs sampling technique that establishes topic hierarchies from the structure embedded in the organigram.

3 Hierarchical Organization Topic Model

The notation used in this paper is summarized in Table 1, and the graphical model of POT is shown in Figure 2. The Partially Observed Topic (POT) model is a generative model of the organization names and unstructured words in documents. Like standard LDA, it models each document as a mixture of underlying topics and generates each word from one topic. In contrast to LDA, topic discovery in POT is influenced not only by word co-occurrences, but also by organigram information, namely organization names and hierarchical structure. This allows POT to predict the word distribution of each topic under the guidance of the organigram, which results in organization-specific multinomials. The robustness of the topic clusters is greatly enhanced by integrating this structured data. The generative process of POT, which corresponds to the process used in Gibbs sampling for parameter estimation, is given as follows:


1. Draw T multinomials φz from a Dirichlet prior β, one for each organization z.
2. For each document d, draw a multinomial θd from a Dirichlet prior α.
   For each organization name edj in document d:
   3. Draw the organization name edj from the multinomial θd.
   For each word wdi in document d:
   4. Draw an organization zdi from θd.
   5. Draw the word wdi from the multinomial φzdi.

(A short code sketch of this process is given after Table 1.)

As shown in the process, each document is represented as a mixture of organizations, denoted by the multinomial θ. θ plays the key role in connecting the organization names and the unstructured words in a document. The dependency of θ on both e and z is the only additional dependency we introduce into LDA's graphical model. As in LDA, frequent co-occurrence of some words often indicates that they belong to the same topic or organization. On the other hand, if an organization's name is observed in a document, it raises the corresponding parameter of θd, so that the unstructured words in this document are more likely to be related to this organization. Although unstructured words do not directly mention an organization, they may provide significant indirect evidence for inferring it. Especially in a document without any organization names, the unstructured words imply the latent organizations of the document, just as they do in a traditional multinomial naive Bayes classifier [12].

Table 1. Notations used in the POT model

Symbol   Description
T        number of organizations
D        number of documents
V        number of unique words
θd       multinomial distribution over organizations specific to document d
φz       multinomial distribution over words specific to organization z
zdi      organization associated with the ith token in document d
z^sup    parent organization of z
z^sub    child organizations of z
wdi      ith token (unstructured word) in document d
edi      ith organization name mentioned in document d
α        Dirichlet prior of θ
β        Dirichlet prior of φ
γ        smoothing factor for unseen organization names (denoted r in Eq. (1))
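To make the generative story concrete, the following is a minimal Python sketch of the process above. The toy sizes, the hyper-parameter values (borrowed from Section 4.3) and all variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the POT generative process above; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

T, V = 4, 1000            # toy number of organizations and vocabulary size
alpha, beta = 0.2, 0.02   # symmetric Dirichlet priors (values from Sec. 4.3)

# Step 1: one word multinomial phi_z per organization z
phi = rng.dirichlet(np.full(V, beta), size=T)

def generate_document(n_names=2, n_words=50):
    # Step 2: document-specific mixture over organizations
    theta = rng.dirichlet(np.full(T, alpha))
    # Step 3: organization names mentioned in the document are drawn from theta
    e = rng.choice(T, size=n_names, p=theta)
    # Steps 4-5: each unstructured word gets an organization, then a word from phi_z
    z = rng.choice(T, size=n_words, p=theta)
    w = np.array([rng.choice(V, p=phi[zi]) for zi in z])
    return e, z, w

names, assignments, words = generate_document()
```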

Fig. 2. Graphical model of POT


We use Gibbs sampling to conduct parameter inference. The Gibbs sampler provides a clean method of exploring the parameter space. Note that we adopt conjugate (Dirichlet) priors for the multinomial distributions, so we can easily integrate out θ and φ, analytically capturing the uncertainty associated with them. Since we need not sample θ and φ at all, the sampling procedure becomes straightforward. Starting from the joint distribution P(w, z, e | α, β), we calculate the conditional probabilities P(zdi | z-di, w, e, α, β) conveniently using Bayes' rule, where z-di denotes the organization assignments for all word tokens except wdi. In the Gibbs sampling procedure, we draw the organization assignment zdi iteratively for every unstructured word token wdi according to the following conditional distribution:

P(z_{di} \mid z_{-di}, w, e, \alpha, \beta) \propto (n_{d,e_{dj}} + r - 1)\,(n_{d,z_{di}} + \alpha - 1) \times \frac{n_{z_{di},\,w_{di}} - m_{z^{\mathrm{sup}},\,w_{di}} + \beta_{w_{di}} - 1}{\sum_{v=1}^{V} (n_{z_{di},\,v} + \beta_v) - 1}    (1)

where n_{d,e} represents how many times the name of organization e is mentioned in document d; we interpolate r here to keep the probability from becoming zero, since some documents may not mention any organization names. n_{d,z} is the number of tokens in document d assigned to organization z, and n_{z,v} is the number of tokens of word v assigned to organization z. m_{z^sup,v} describes how many times the tokens of word v are assigned to the parent organizations of z. Following the hierarchical structure of the organigram, it can be worked out level by level recursively as follows:

m_{z,v} = \frac{n_{z,v} + m_{z^{\mathrm{sup}},v}}{n_{z^{\mathrm{sub}}}}    (2)

where n_{z^sub} is the number of child organizations of z. Intuitively, using counts that eliminate words co-occurring with parent organizations refines the distribution for the current organization. The sampling algorithm gives direct estimates of z for every word. The word-organization distributions φ and the organization-document distributions θ can then be obtained from the count matrices as follows:

\phi_{zv} = \frac{n_{z,v} + \beta_v}{\sum_{v'=1}^{V} (n_{z,v'} + \beta_{v'})}    (3)

\theta_{dz} = \frac{n_{d,z} + \alpha_z}{\sum_{z'=1}^{T} (n_{d,z'} + \alpha_{z'})}    (4)
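For readers who prefer code to notation, the following Python sketch mirrors Eqs. (1)-(4): one collapsed Gibbs sweep over the unstructured tokens, the recursive parent discount of Eq. (2), and the count-based estimates of φ and θ. The count arrays (n_dz, n_zv, n_de, where n_de[d, z] is read as the number of mentions of organization z's name in document d), the parent/children maps and the numerical details are our own shorthand, not the authors' released implementation; removing the current token from the counts before sampling plays the role of the "-1" terms in Eq. (1).

```python
# Hedged sketch of one collapsed Gibbs sweep for POT following Eqs. (1)-(4).
import numpy as np

def m_sup(z, v, n_zv, parent, n_children):
    """Return m_{z^sup, v} of Eqs. (1)-(2), worked out level by level up the organigram."""
    p = parent.get(z)
    if p is None:                      # top-level organization: nothing to discount
        return 0.0
    return (n_zv[p, v] + m_sup(p, v, n_zv, parent, n_children)) / max(n_children.get(p, 1), 1)

def gibbs_sweep(docs, z_assign, n_dz, n_zv, n_de, parent, n_children,
                alpha, beta, r, rng):
    T, V = n_zv.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            old = z_assign[d][i]
            n_dz[d, old] -= 1          # exclude the current token from the counts
            n_zv[old, w] -= 1
            prob = np.empty(T)
            for z in range(T):         # Eq. (1): unnormalized conditional over organizations
                m = m_sup(z, w, n_zv, parent, n_children)
                prob[z] = ((n_de[d, z] + r) * (n_dz[d, z] + alpha) *
                           (n_zv[z, w] - m + beta) / (n_zv[z].sum() + V * beta))
            prob = np.clip(prob, 1e-12, None)
            new = rng.choice(T, p=prob / prob.sum())
            z_assign[d][i] = new
            n_dz[d, new] += 1
            n_zv[new, w] += 1
    phi = (n_zv + beta) / (n_zv + beta).sum(axis=1, keepdims=True)       # Eq. (3)
    theta = (n_dz + alpha) / (n_dz + alpha).sum(axis=1, keepdims=True)   # Eq. (4)
    return phi, theta
```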


In state-of-the-art hierarchical Bayesian models, it is important to get the right values for the hyper-parameters; for many topic models, the final distribution is sensitive to them. One could use the Gibbs EM algorithm [7] to infer the hyper-parameter values. In this paper, for simplicity, we use fixed symmetric Dirichlet distributions and an empirical value for the smoothing factor r in our experiments. As highlighted in Section 1, we seek an approach that can automatically learn word-organization correspondences. Towards this objective, we separate the structured words from the unstructured words in our generative process to refine the topic evolution. We also note that there are some variants of our proposed model. For example, one could consider a more restrictive model [14] that restricts θd to be defined only over the topics corresponding to the document's organization mentions, while keeping zero probability on all others. All topic assignments over words would then be limited to the organizations mentioned in the current document. However, this restriction ignores the effect of common and background topics. Background topics may crop up throughout the corpus without being explicitly mentioned. For instance, as illustrated in Figure 1, the "Department of Drug Registration (DDR)" and the "Department of Food Safety Coordination (DFSC)" are sub-organizations attached to the "State Food and Drug Administration (SFDA)". In a news story about food and drug safety, "DDR" and "DFSC" each have their own representative words, but they may also share words such as "health", "legal", "safety" and "permission" that are actually generated by "SFDA". If "SFDA" itself is not directly mentioned in a document, the restrictive version of the model will fail to recover the word distribution of "SFDA" accurately. Our treatment is reasonable because the organigram itself embeds a hierarchy: from the view of sub-organizations, their parents can be seen as background topics. However, the parent organizations are not necessarily mentioned in every document, and therefore become latent topics. In contrast to standard LDA and the restrictive model, we model the topics of a document as neither fully latent nor fully observed. When a structured word is seen in a document, the related topic is convincingly assigned a high probability; meanwhile the unstructured words in the document are free to be sampled from any of the T organizations, swinging the probability in favor of the corresponding topics. This makes a full-scale yet supervised exploration of the latent topic space.

4 Experiments

4.1 Dataset

In this section, we present experiments with the POT model on the HPWEB dataset, which consists of a database and a webpage collection. The database includes information on all organizations and departments in the Hewlett-Packard Company, and the webpage collection contains 50 thousand webpages crawled from the Hewlett-Packard external web. On the structured side, we extract the name, parent and children of each organization from the HPWEB database. This forms the small backend database (i.e. the organigram) used in our experiments, as shown in Table 2.

Table 2. Example organigram based on the HPWEB database

ID   Name                    Parent   Child
0    Personal System Group   NULL     1, 2
1    Marketing               0        3, 6
2    Supply Chain            0        4, 5
3    Notebook Sales          1        NULL
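To illustrate how Table 2-style rows map onto the tree structure described next, here is a small sketch that loads such rows into linked nodes. The field and function names are ours, not the HPWEB database schema.

```python
# Hedged sketch: turning Table 2-style rows (ID, Name, Parent, Child) into a
# linked organigram; purely illustrative, not the HPWEB schema.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class OrgNode:
    org_id: int
    name: str
    parent: Optional[int] = None                         # at most one parent
    children: List[int] = field(default_factory=list)    # zero or more children

def build_organigram(rows) -> Dict[int, OrgNode]:
    nodes = {r["id"]: OrgNode(r["id"], r["name"], r["parent"]) for r in rows}
    for node in nodes.values():
        if node.parent is not None:
            nodes[node.parent].children.append(node.org_id)
    return nodes

rows = [
    {"id": 0, "name": "Personal System Group", "parent": None},
    {"id": 1, "name": "Marketing", "parent": 0},
    {"id": 2, "name": "Supply Chain", "parent": 0},
    {"id": 3, "name": "Notebook Sales", "parent": 1},
]
organigram = build_organigram(rows)
```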

This table describes a hierarchical tree structure with a set of linked nodes, where each node represents an organization. As in the usual definition of a tree, each organization has zero or more child organizations and at most one parent organization. The organigram we build for the experiments has two levels and 68 organizations in all (10 parent organizations and 58 sub-organizations). To create our unstructured document collection, we parsed all webpages, tokenized the words, and then omitted stop-words and words that occurred fewer than 100 times. This resulted in a vocabulary of 34,036 unique words. For each run of the experiment, we randomly draw 10,000 documents from the HPWEB dataset and omit pages shorter than 5 tokens.

4.2 Preprocessing

Each document was processed to obtain its list of unstructured words and its list of organization mentions. To locate the organization names in a document, we need to annotate the entities that may indicate organizations. This requires capturing all the variations, abbreviations, acronyms and spelling mistakes. For example, for the organization 'HP Finance Service', the variations to be considered include 'Hewlett Packard Finance Service', 'Finance Service Group', 'HP Finance Group' and 'HPFS'. To address this problem, we first use the LEX algorithm [8] to locate the named entities in a document. LEX is based on the observation that complex names tend to be multi-word units (MWUs), that is, sequences of words whose meaning cannot be determined by examining their constituent parts. This lexical-statistics-based approach is simple, fast and has high recall. Next we compute the candidate matches for each entity annotated by LEX against the organization names in our organigram database. We use the open-source Java toolkit SecondString (http://secondstring.sourceforge.net/) for matching names and records. After experimenting with several similarity metrics, we chose the JaroWinklerTFIDF [9] metric, a hybrid of cosine similarity and the Jaro-Winkler method. By computing the similarity scores for these pairs and filtering out scores below 0.8, we obtain the list of organization names mentioned in the document. The list of unstructured words is found simply by removing the organizations' names from the document.
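The matching step can be sketched as follows. The paper uses LEX plus SecondString's JaroWinklerTFIDF metric; in this hedged Python sketch, difflib's ratio is only a crude stand-in for that metric, and the candidate mentions are assumed to come from LEX.

```python
# Hedged sketch of the organization-name matching step; difflib is NOT the
# JaroWinklerTFIDF metric, just a stand-in for illustration.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Stand-in string similarity in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_organizations(candidate_entities, org_names, threshold=0.8):
    """Return (mention, organization) pairs whose similarity clears the threshold."""
    matches = []
    for entity in candidate_entities:
        best = max(org_names, key=lambda name: similarity(entity, name))
        if similarity(entity, best) >= threshold:
            matches.append((entity, best))
    return matches

orgs = ["HP Finance Service", "Supply Chain", "Notebook Sales"]
mentions = ["Hewlett Packard Finance Service", "HPFS", "Notebook Sales Team"]
print(match_organizations(mentions, orgs))
```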


Fig. 3. Topic clustering result of 8 topics learned by POT (left) and traditional LDA (right). The descriptors are the 10 highest-probability words under each organization. POT's topics are named by their associated organizations (Printing Group, Printer Service, Web Service, Technology Group, Research Lab, Software Tech., Solution Group, Software Solution) and linked by their hierarchy in the organigram. For a fair comparison, we also selected the 8 most expressive topics from LDA.

4.3 Topic Interpretation

We first examine the topic interpretation of POT and how it compares with standard LDA. We fit a 68-topic POT and a standard unsupervised LDA on the HPWEB dataset (6499 documents, 8,191,292 words in total). Both models ran for 4500 Gibbs sampling iterations with symmetric priors α = 0.2 and β = 0.02; the smoothing factor in POT is γ = 2. In POT, topics are directly mapped to the appropriate organization names, and the model has nicely captured the words with the help of the auxiliary organigram. Examining the results of POT, we find that the words under each organization are quite representative. For example, the parent organizations are often associated with generic words (such as "product", "service", "customer" and "company"). Although these words are too general to interpret, they still imply the differences in business area between organizations: "product" and "service" are ranked high in both "Printing Group (PG)" and "Solution Group (SG)", but do not appear in the top word lists of "Technology Group (TG)". This indicates that PG and SG are closer to each other in business area than either is to TG. At the next level, the topic descriptors specify the domain of the sub-organizations more clearly. The cluster of "Research Lab (RL)" provides an extremely salient summary of its research area, while "Software Technology (ST)" assembles many common words used in network administration. The POT model separates the words pertaining to different organizations. At the same time, sub-organizations also inherit some words from their parents: "computing" and "execution", which appear in RL and ST respectively, are also listed among TG's descriptors.


On the side of standard LDA, although its topic descriptors are also quite expressive, it offers no obvious clues for mapping the topics to appropriate organizations. LDA seems to discover topics that model the fine details of documents with no regard to their discriminative power on organizations.

4.4 Document Visualization

We next evaluate the performance of POT on document classification. The first group of experiments is a qualitative evaluation: we use popular visualization methods to demonstrate intuitively the effectiveness of POT for document modeling. It is worth noting that the HPWEB dataset does not lend itself well to document classification analysis; the lack of class labels on webpages is a major obstacle for evaluation. In this experiment, we first use simple heuristic rules to label the document classes and then verify each page's label manually. The rules used here include the organization mentioned in the domain name (for instance, "HR" in http://www.hr.hp.com), the page title, or the HTML meta data. In addition, webpages are full of noise and contain highly overlapping content: because all webpages are derived from several base templates, pages that share the same template naturally overlap a great deal. After filtering out low-quality and duplicate webpages, we finally categorized 1192 documents into 10 classes, namely the 10 top-level organizations. Figure 4 illustrates that the POT model is well suited to document classification. It shows the 2D embedding of the expected topic proportions of POT and standard LDA using t-SNE stochastic neighborhood embedding [13], where each dot represents a document and color-shape pairs represent class labels; each pair is marked with an organization acronym in the corner legend. The POT model clearly produces a better grouping and separation of the documents in different categories. In contrast, standard LDA does not produce a well-separated embedding, and documents in different categories tend to mix together.
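A sketch of how such a figure can be produced is given below; it uses scikit-learn's TSNE as a stand-in for the t-SNE implementation of [13], and the plotting details are our own assumptions.

```python
# Hedged sketch of the Fig. 4 visualization: t-SNE on the expected topic
# proportions theta (documents x topics), colored by class label.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding(theta, labels):
    coords = TSNE(n_components=2, random_state=0).fit_transform(theta)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
    plt.title("t-SNE embedding of document-topic proportions")
    plt.show()
```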

Fig. 4. t-SNE embedding of topic representations by: (a) traditional unsupervised LDA (left) and (b) the POT model (right)


Fig. 5. Document-organization co-occurrence matrix. Each row represents a document, and each column corresponds to an organization. The upper part of the figure shows the LEX result: if LEX finds an organization's name in the document, the corresponding co-occurrence value is set to 1. The middle and bottom parts of the figure show the document-organization co-occurrence matrices recovered by POT and traditional LDA respectively; there each row corresponds to a multinomial distribution over organizations specific to the document. Grayscale indicates the value of θdz, with black being the highest probability and white being zero.
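The matrix view described in the caption above can be reproduced with a few lines of plotting code; this is an illustrative sketch, not the authors' plotting code.

```python
# Hedged sketch of the Fig. 5 matrix view: rows are documents, columns are
# organizations, grayscale encodes theta_dz (Eq. (4)).
import matplotlib.pyplot as plt

def plot_cooccurrence(theta):
    plt.imshow(theta, cmap="gray_r", aspect="auto")   # black = high probability
    plt.xlabel("organization")
    plt.ylabel("document")
    plt.show()
```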

It is also interesting to examine the discovered topics and their association with the organization mentions in documents. In Figure 5, we randomly sample 100 documents and 20 organizations (10 top-level organizations and 10 sub-organizations) from our dataset and visualize their co-occurrence matrix. Each row represents a document, and each column corresponds to an organization. It can be clearly seen that POT's interpretation of documents uses the LEX annotations of organization mentions as guidance, yet is less arbitrary. It treats the structured data as important evidence and uses it to supervise the topic evolution implicitly. Unlike LEX, by inspecting the unstructured words, POT can infer organization labels for documents even when no organization name is mentioned. We can also see that, by ignoring this supervision, standard LDA produces a co-occurrence matrix that is hard to interpret, which makes it inappropriate for document classification. POT trades off between the arbitrary LEX labeling and the totally unsupervised LDA. Since each document is soft-assigned to every organization with a certain probability and described by a multinomial distribution, POT gives a more precise description of the document across the organizations. It should be apparent that POT can be adapted into an effective multi-label classifier for documents. Straightforward approaches, such as inferring a document's most likely labels by suitably thresholding its posterior probability over topics, or training a downstream classifier, are natural extensions for further investigation. Another notable feature of the POT model is that it discovers latent relationships between different organizations. Table 3 lists three documents together with the organizations found by LEX and the top four organizations assigned by POT. By comparing the LEX and POT results, some hidden relations between organizations can be uncovered. For example, the "Linux" department is strongly related to "Business PC" in the context of document 36. Document 148 introduces a high-speed video and graphics transmission technology developed by "Research Lab", which is relevant to "Digital Entertainment",

Table 3. The interpretation of documents on organizations by POT

Doc. ID   Annotated organizations   Top four organizations assigned by POT (Name, Prob.)
36        Linux                     Business PC 0.28; Linux 0.15; Human Resources 0.09; Service Group (America) 0.06
148       Research Lab              Digital Entertainment 0.49; IT Management 0.10; Worldwide Marketing 0.09; Business Service 0.02
691       Large-scale Computing     Software 0.37; Large-scale Computing 0.19; Business Service 0.05; IT Management 0.05

a high-performance video game console. Finally, "Large-scale Computing" is involved in some areas of "Software" in document 691. These results suggest that POT can be an effective tool for mining the business relationships of organizations in a given context. On the right side of Table 3 (omitted here), we also show the document-topic distributions recovered by POT and standard LDA respectively. POT yields a sharper and sparser distribution over topics, which results in a sharper low-dimensional representation.

4.5 Document Classification

To study the practical benefit of injecting structured data into topic models, we also provide a quantitative evaluation of POT on document classification. We perform multi-class classification on the HPWEB dataset used above. However, many topic models based on LDA are not suited to classification: neither LDA nor POT can be used directly as a classifier. To obtain a baseline, we fit standard LDA models with different numbers of topics, and then use the latent topic representations of the training documents as features to build a multi-class SVM classifier. For POT, we first build a straightforward classifier by selecting the topic (organization) with the highest posterior probability. A second classifier uses the same feature selection and training strategy as LDA+SVM, except that the number of topics is fixed to 68. We denote these three classifiers as LDA+SVM, POT and POT+SVM, and report their average precision over five runs. We use SimpleSVM (http://gaelle.loosli.fr/research/tools/simplesvm.html), which is popular and used in many previous works, to build one-vs-all SVM classifiers. The parameter C is chosen by 5-fold cross-validation on the 1192 documents. The precision of the different models with respect to the number of topics is shown in Figure 6.
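A hedged sketch of this pipeline is given below. It uses scikit-learn as a stand-in for the SimpleSVM toolbox used in the paper; the C grid and the mapping from POT topics (organizations) to the 10 top-level classes are our own illustrative assumptions.

```python
# Hedged sketch of the LDA+SVM / POT+SVM pipeline. theta is the learned
# document-topic matrix (Eq. (4)); labels are the 10 top-level organizations.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def topic_feature_classifier(theta: np.ndarray, labels: np.ndarray):
    # One-vs-rest linear SVM on the doc-topic proportions, with C chosen by
    # 5-fold cross-validation as described in the text.
    grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
    grid.fit(theta, labels)
    return grid.best_estimator_

def pot_argmax_classifier(theta: np.ndarray, topic_to_class: np.ndarray):
    # The "pure POT" baseline: pick the organization with the highest posterior,
    # then map it to its top-level parent class (mapping assumed given).
    return topic_to_class[np.argmax(theta, axis=1)]
```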



Fig. 6. Precision of multi-class document classification on the selected HPWEB dataset (1192 documents, 10 classes in all). We first fit the POT and LDA models on our dataset, and then use the learned doc-topic distributions as input features to a downstream SVM classifier.

We can clearly see that both the pure POT classifier and POT+SVM outperform traditional unsupervised LDA significantly. The average precision of POT+SVM is 74% and POT yields 58%, while standard LDA only reaches 55% on average. By injecting the structured words into the parameter estimation as shown in Eq. (1), the latent topic space of POT gains better discriminative power and is thus more suitable for document classification. Although LDA performs slightly better as the number of topics grows, it still suffers from low precision even with 5 to 6 times as many topics as POT.

5 Conclusion

In this paper, we have explored a way to formally incorporate enterprise structured data into a generative topic model. We have presented the Partially Observed Topic model, which generates both organization mentions and unstructured words. In contrast to traditional LDA, the topic mixtures are partially observed during parameter estimation. The topic evolution is guided by the structured words in a document, which results in one-to-one correspondences between latent topics and organizations. Results on a real-world dataset have shown that POT can use a back-end organigram to supervise topic evolution implicitly. In contrast to traditional unsupervised LDA, POT discovers more expressive topics that are appropriate for document classification.


Structured and unstructured enterprise data are often mined separately. Our work opens a new direction of research in injecting structured data into generative topic models. In ongoing work, we will explore how to benefit further from structured data in enterprise databases (e.g. employee information) and discover meaningful patterns that are valuable to the enterprise. Moreover, we note that POT can be adapted to more general settings beyond the enterprise. Given a training set of documents with human-annotated labels or tags in context, POT can derive supervision from these annotations and automatically learn the posterior distribution of each unstructured word over the label set. This mechanism can be utilized and extended to automatically summarize contextual information about labels.

Acknowledgement

We thank Yuhong Xiong and Ping Luo for helpful comments and constructive criticism on a previous draft. We also thank Hewlett-Packard Laboratories for providing the HPWEB dataset.

References

[1] Lu, Y., Zhai, C.: Opinion integration through semi-supervised topic modeling. In: Proceedings of the International World Wide Web Conference (WWW), pp. 121–130 (2008)
[2] Bhattacharya, I., Godbole, S., Joshi, S.: Structured entity identification and document categorization: two tasks with one joint model. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008)
[3] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
[4] Wang, X., McCallum, A.: Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006)
[5] Griffiths, T., Steyvers, M., Blei, D., Tenenbaum, J.: Integrating topics and syntax. In: Advances in Neural Information Processing Systems 17, pp. 537–544. MIT Press, Cambridge (2005)
[6] Mimno, D., McCallum, A.: Expertise modeling for matching papers with reviewers. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007)
[7] Andrieu, C., de Freitas, N., Doucet, A., Jordan, M.: An introduction to MCMC for machine learning. Machine Learning 50, 5–43 (2003)
[8] Downey, D., Broadhead, M., Etzioni, O.: Locating complex named entities in web text. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) (2007)
[9] Cohen, W.W., Ravikumar, P., Fienberg, S.: A Comparison of String Metrics for Matching Names and Records. In: Proceedings of the KDD Workshop on Data Cleaning and Object Consolidation (2003)
[10] Blei, D.M., McAuliffe, J.D.: Supervised Topic Models. In: Advances in Neural Information Processing Systems (NIPS) (2007)


[11] Lacoste-Julien, S., Sha, F., Jordan, M.I.: DiscLDA: Discriminative learning for dimensionality reduction and classification. In: Advances in Neural Information Processing Systems (NIPS) (2008)
[12] McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization. Tech. rep. WS-98-05. AAAI Press, Stanford (1998), http://www.cs.cmu.edu/~mccallum
[13] van der Maaten, L.J.P., Hinton, G.E.: Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)
[14] Ramage, D., Hall, D., Nallapati, R., Manning, C.D.: Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 248–256 (2009)