Proceedings of IEEE CCIS2012
HLDA BASED TEXT CLUSTERING

Pingan Liu, Lei Li, Wei Heng, Boyuan Wang

Center for Intelligence Science and Technology, School of Computer Science and Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
[email protected],
[email protected],
[email protected],
[email protected]

Abstract: The LDA (Latent Dirichlet Allocation) topic model has been applied in many applications in recent years. However, LDA cannot cope well with changes in the data set, which has become a limitation for its applications. Hierarchical Latent Dirichlet Allocation (hLDA) is a generalization of LDA that adapts itself to a growing data set automatically. hLDA can mine latent topics from a large amount of discrete data and organize these topics into a hierarchy, in which higher-level topics are more abstract while lower-level topics are more specific. This hierarchy achieves a deeper semantic model that is closer to the human mind. Given a set of documents, hLDA generates a Bayesian nonparametric prior distribution using a nested Chinese restaurant process (nCRP) [1]. Documents sharing similar topics are organized onto the same path, forming a cluster. hLDA learns the distribution of topics by Bayesian posterior inference. This paper studies the hLDA model in detail and applies it to Chinese text clustering. Experiments show that hLDA is a very promising model for text clustering.
In these topic models, the number of reduced dimensions is simply the number of topics. The pLSI method gives each document its own topic distribution, so its mixture-proportion parameters grow with the number of documents, which can lead to overfitting. LDA differs from pLSI: it builds a unified prior distribution for all documents with a limited number of mixture-proportion parameters. Different documents take different values of these parameters, which avoids the overfitting problem. However, LDA requires the number of topics to be predefined by a human, an important shortcoming that limits its use. hLDA helps overcome this problem: it is a fully unsupervised method in which the number of topics grows with the data set automatically. Moreover, there are no relations between the topics in LDA [7], whereas hLDA arranges topics at different levels, from abstract to specific. For example, in the biology domain, the topic of "prokaryote" is more abstract than "virus", and the topic of "sun bacillus" is more specific than "virus".
Keywords: Text clustering; Hierarchical Latent Dirichlet Allocation (hLDA); Nested Chinese restaurant process (nCRP); Bayesian nonparametrics
The rest of this paper is organized as follows: section 2 introduces the nested Chinese restaurant process, and sections 3 and 4 describe the principles of the Dirichlet process and hLDA respectively. Section 5 presents the Gibbs sampler as the method of posterior inference. In section 6, we put forward our system for text clustering and describe the experiments, including the corpus, evaluation metrics, results, and analysis. Finally, conclusions are given in section 7.
1 Introduction

Documents usually contain a large vocabulary in many applications of natural language processing (NLP). According to Artificial Intelligence theory [2-3], the purpose of text clustering is to transform "data" into "information", and "information" into "knowledge" [4]. We need to reduce the dimensionality of text features (usually words) so as to represent documents in a less complex model while keeping as much of the important information of the original text as possible. Traditional methods for text dimension reduction are mostly based on word features, such as TF-IDF and LSI (Latent Semantic Indexing) [5]. LSI applies singular value decomposition (SVD) to the TF-IDF representation to strengthen the dimension reduction, and studies have shown that LSI can combine synonyms into a common feature. Beyond these word-feature methods, researchers have found that documents are composed of various topics. Using topics as features not only reduces the dimensionality greatly but also yields an understanding closer to the human mind. Thus methods based on topic features have appeared, such as pLSI [6], LDA, and hLDA. All of these methods are generative processes that cluster words with similar topics.
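To make the word-feature baseline mentioned above concrete, the following is a minimal sketch (not part of the paper) of TF-IDF followed by SVD-based LSI using scikit-learn; the toy documents and the number of components are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["topic models cluster words",
        "latent semantic indexing uses svd",
        "hierarchical topics form a tree"]

tfidf = TfidfVectorizer().fit_transform(docs)              # TF-IDF term-document matrix
lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)    # LSI: SVD-based dimension reduction
print(lsi.shape)                                           # (3 documents, 2 latent dimensions)
```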
2 Introduction of nCRP

2.1 CRP (Chinese Restaurant Process)

Suppose there is an unlimited number of tables in a Chinese restaurant, and each table can serve an unlimited number of guests. Guests choose tables one after another according to some probability. Let the guests be (X_1, X_2, X_3, ...), and suppose the first n guests have already chosen their tables from the table set {1, 2, ..., m}; then the (n+1)-th guest chooses a table according to the following probability:

p(occupied table k | X_{1:n}) = N_k(X_{1:n}) / (n + γ),
p(new table | X_{1:n}) = γ / (n + γ).

N_k(X_{1:n}) denotes the number of guests already seated at the k-th table, and γ is a hyper-parameter that
controls the probability of the guest choosing a new table. As we can see, the guests are seated in sequence, and the probability of a guest choosing a table depends on all the previously seated guests and the tables they have chosen.
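The seating rule above can be simulated directly. The following is a small illustrative sketch in Python (not part of the paper); the names crp_seating, gamma, and table_counts are ours, with gamma playing the role of the hyper-parameter γ.

```python
import random

def crp_seating(num_guests, gamma=1.0, rng=random.Random(0)):
    """Simulate Chinese Restaurant Process seating for num_guests guests."""
    table_counts = []                     # N_k: number of guests at table k
    assignments = []                      # table index chosen by each guest
    for n in range(num_guests):
        # occupied table k has weight N_k, a new table has weight gamma,
        # so p(table k) = N_k / (n + gamma) and p(new table) = gamma / (n + gamma)
        weights = table_counts + [gamma]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_counts):        # the guest opens a new table
            table_counts.append(1)
        else:
            table_counts[k] += 1
        assignments.append(k)
    return assignments, table_counts

print(crp_seating(10))
```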
The stick-breaking method breaks a stick of unit length into an unlimited number of pieces, and the length of each piece broken off can be used as the prior probability of a topic in nCRP.
2.2 nCRP (nested Chinese Restaurant Process)

Suppose there is an unlimited number of Chinese restaurants in a city, each restaurant has an unlimited number of tables, and each table can seat an unlimited number of guests. On the first day, all guests have dinner in one restaurant, called the root restaurant, and each guest chooses a table according to the CRP. Suppose the tables chosen are (T_1, T_2, T_3, ...). On the second day, each guest goes to the restaurant corresponding to the number of the table chosen in the root restaurant and again chooses a table according to the CRP. This process continues. After L days, a guest has passed along a path of restaurants (P_1, P_2, P_3, ..., P_L). All the paths together form a structure like an upside-down tree, in which each restaurant is a node on some path and has its own level.
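For illustration, the nested process can be sketched as drawing a root-to-leaf path through a growing tree, applying the CRP rule at each node. This is a sketch under our own naming (Node, ncrp_path), not the paper's implementation.

```python
import random

class Node:
    """One 'restaurant' in the nested CRP tree."""
    def __init__(self):
        self.children = []        # child restaurants reached from this node's tables
        self.counts = []          # number of previous guests who chose each child

def ncrp_path(root, depth, gamma=1.0, rng=random.Random(0)):
    """Draw a root-to-leaf path of the given depth using the nested CRP."""
    path = [root]
    node = root
    for _ in range(depth - 1):
        weights = node.counts + [gamma]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(node.children):          # open a new table, i.e. a new child restaurant
            node.children.append(Node())
            node.counts.append(1)
        else:
            node.counts[k] += 1
        node = node.children[k]
        path.append(node)
    return path

root = Node()
paths = [ncrp_path(root, depth=3) for _ in range(5)]   # 5 documents sharing a 3-level tree
```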
4 Hierarchical topic model

The concept of the hierarchical topic model starts from the tree-like topological structure of the document set constructed by nCRP. The hLDA modeling process is shown in Figure 1.
In the hLDA model for texts, each document is mapped to a topic path. Each word in the document is assigned its own topic, that is, a level on that topic path. A Dirichlet process generates the prior distribution for each document and for each word in the document, and nCRP generates the tree structure of topics.
Figure 1 hLDA modeling process
First, the prior distribution of topics is obtained from a Dirichlet distribution. For a document d, nCRP generates a path c_d. hLDA views document d as a mixture of multiple topics, with mixing proportions θ_d, and all the topics conform to the Dirichlet distribution. Next, to generate a word w_{d,n}, a topic must first be generated. Within the path c_d of document d, the topic z_{d,n} is simply the level of the word, chosen through the mixing proportions θ_d. Since a topic in hLDA is a distribution over words, once the topic is decided, the word can be generated from the multinomial distribution of words under that topic.
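The generative process just described can be sketched as follows for a single document with a fixed path c_d. This is an illustrative simplification, not the paper's implementation: θ_d is drawn from a symmetric Dirichlet over the L levels, and path_topics stands for the word distributions attached to the nodes of the chosen path; all names and parameter values are ours.

```python
import numpy as np

def generate_document(path_topics, num_words, alpha=1.0, rng=np.random.default_rng(0)):
    """Illustrative hLDA-style generation of one document along a fixed path c_d.

    path_topics : list of word distributions, one per level of the chosen path.
    theta_d is drawn from a symmetric Dirichlet over the L levels; each word
    picks a level z_{d,n}, then a word w_{d,n} from that level's topic
    (a multinomial over the vocabulary).
    """
    L = len(path_topics)
    theta = rng.dirichlet([alpha] * L)          # mixing proportions over levels
    words = []
    for _ in range(num_words):
        z = rng.choice(L, p=theta)              # level / topic assignment z_{d,n}
        w = rng.choice(len(path_topics[z]), p=path_topics[z])   # word w_{d,n}
        words.append((z, w))
    return words

# toy example: a 3-level path over a vocabulary of 4 words
path_topics = [np.array([0.7, 0.1, 0.1, 0.1]),
               np.array([0.1, 0.7, 0.1, 0.1]),
               np.array([0.1, 0.1, 0.4, 0.4])]
print(generate_document(path_topics, num_words=8))
```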
3 Dirichlet process

The Dirichlet process is also called a distribution over distributions. It has a hyper-parameter α that can be set to different values so as to generate different probability distributions:

X = (x_1, x_2, x_3, ..., x_k) ~ Dir(α) = Dir(α_1, α_2, α_3, ..., α_k)

In the above distribution, each x_i ∈ [0, 1] is a real value with Σ_i x_i = 1, and k is the number of topics. The k-dimensional parameter vector (α_1, α_2, α_3, ..., α_k) assigns different generative probabilities to different X. In this way, an unlimited number of X can be generated under control, so the Dirichlet distribution can generate probability distributions for an unlimited number of documents. When α_1 = α_2 = α_3 = ... = α_k, we call it a symmetric Dirichlet distribution.
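For illustration, a draw from a Dirichlet distribution can be obtained with NumPy; the parameter values below are arbitrary examples, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A draw from Dir(alpha_1, ..., alpha_k) is a k-dimensional probability vector:
# each x_i lies in [0, 1] and the components sum to 1.
alpha = [2.0, 1.0, 0.5]                  # asymmetric parameters
x = rng.dirichlet(alpha)
print(x, x.sum())                        # components sum to 1

# Symmetric case: all alpha_i equal; smaller values give sparser vectors.
x_sym = rng.dirichlet([0.1] * 5)
print(x_sym)
```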
5 Posterior inference

Through the above processes, the joint distribution of observable variables and hidden variables is constructed, that is, the joint distribution of the observed documents and the hidden structured topics. Posterior inference is to recover the topic structure, namely which documents share a topic path and the level of each word on that path, given the observed documents. Here a specific MCMC (Markov Chain Monte Carlo) [10] method, the Gibbs sampler, is used to estimate the posterior probability distribution of hLDA.
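As a concrete illustration of one Gibbs sampling step, the sketch below resamples the level assignment z_{d,n} of each word with the document paths held fixed. It is not the paper's full sampler: for brevity it uses a symmetric Dirichlet prior over levels in place of the GEM prior and omits the resampling of the paths c_d; all names (gibbs_levels, docs, paths, alpha, eta) are ours.

```python
import numpy as np

def gibbs_levels(docs, paths, num_levels, vocab_size, alpha=1.0, eta=0.1,
                 num_iters=50, rng=np.random.default_rng(0)):
    """Collapsed Gibbs resampling of word levels z_{d,n}, with paths held fixed.

    docs[d]  : list of word ids;  paths[d] : list of topic-node ids, one per level.
    """
    ndl = np.zeros((len(docs), num_levels))                 # n_{d,l}: per-document level counts
    node_ids = sorted({t for p in paths for t in p})
    nid = {t: i for i, t in enumerate(node_ids)}
    ntw = np.zeros((len(node_ids), vocab_size))             # n_{t,w}: per-topic word counts
    nt = np.zeros(len(node_ids))                            # n_{t}: per-topic totals
    z = []
    for d, doc in enumerate(docs):                          # random initialization
        zd = rng.integers(num_levels, size=len(doc))
        z.append(zd)
        for w, l in zip(doc, zd):
            ndl[d, l] += 1; t = nid[paths[d][l]]; ntw[t, w] += 1; nt[t] += 1

    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            t_ids = [nid[paths[d][k]] for k in range(num_levels)]
            for n, w in enumerate(doc):
                l = z[d][n]; t = nid[paths[d][l]]
                ndl[d, l] -= 1; ntw[t, w] -= 1; nt[t] -= 1   # remove word n from the counts
                # conditional probability of each level on this document's path
                p = (ndl[d] + alpha) * (ntw[t_ids, w] + eta) / (nt[t_ids] + vocab_size * eta)
                l = rng.choice(num_levels, p=p / p.sum())
                z[d][n] = l; t = nid[paths[d][l]]
                ndl[d, l] += 1; ntw[t, w] += 1; nt[t] += 1   # add it back with the new level
    return z
```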
A probabilistic model can represent the observable variables through machine learning. There are two kinds of machine learning models: parametric and nonparametric [8]. A traditional parametric model uses a predefined and limited number of parameters to model the data set. When the number of documents grows dynamically, the number of parameters needed also grows, which may lead to overfitting or underfitting. A nonparametric model does not mean that it has no parameters; rather, its parameters adapt to changes in the data set. As the extension of the Dirichlet distribution to an unlimited number of dimensions, the Dirichlet process uses the stick-breaking method to construct an unlimited number of components: let θ ~ GEM(m, π) [9], where each component of θ is the length of one piece of the stick broken off and lies in (0, 1).
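For illustration, the stick-breaking construction can be sketched as below. The Beta-based breaking rule follows the two-parameter GEM(m, π) parameterization commonly used with hLDA; the function name and parameter values are our own illustrative choices.

```python
import numpy as np

def stick_breaking(m, pi, num_pieces, rng=np.random.default_rng(0)):
    """Break a unit-length stick into num_pieces pieces (truncated GEM-style construction)."""
    remaining = 1.0
    lengths = []
    for _ in range(num_pieces):
        v = rng.beta(m * pi, (1.0 - m) * pi)   # proportion of the remaining stick broken off
        lengths.append(v * remaining)
        remaining *= (1.0 - v)
    return np.array(lengths)                   # piece lengths; the truncated sum is below 1

print(stick_breaking(0.5, 10.0, 5))
```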