The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data (to appear in Proceedings of IJCAI'99)
Thomas Hofmann
Computer Science Division, UC Berkeley & International CS Institute, Berkeley, CA
[email protected]
Abstract
This paper presents a novel statistical latent class model for text mining and interactive information access. The described learning architecture, called Cluster-Abstraction Model (CAM), is purely data driven and utilizes context-specific word occurrence statistics. In an intertwined fashion, the CAM extracts hierarchical relations between groups of documents as well as an abstractive organization of keywords. An annealed version of the Expectation-Maximization (EM) algorithm for maximum likelihood estimation of the model parameters is derived. The benefits of the CAM for interactive retrieval and automated cluster summarization are investigated experimentally.
1 Introduction
Intelligent processing of text and documents ultimately has to be considered as a problem of natural language understanding. This paper presents a statistical approach to learning of language models for context-dependent word occurrences and discusses the applicability of this model for interactive information access. The proposed technique is purely data-driven and does not make use of domain-dependent background information, nor does it rely on predefined document categories or a given list of topics. The Cluster-Abstraction Model (CAM) is a statistical latent class or mixture model [McLachlan and Basford, 1988] which organizes groups of documents in a hierarchy. Compared to most state-of-the-art techniques based on agglomerative clustering (e.g., [Jardine and van Rijsbergen, 1971; Croft, 1977; Willett, 1988]) it has several advantages and additional features. As a probabilistic model, the most important advantages are: a sound foundation in statistics and probabilistic inference, a principled evaluation of generalization performance for model selection, efficient model fitting by the EM algorithm, and an explicit representation of conditional independence relations. Additional advantages are provided by the hierarchical nature of the model, namely: multiple levels of document clustering, discriminative topic descriptors for document groups, and a coarse-to-fine approach by annealing. The following section will first introduce a non-hierarchical probabilistic clustering model for documents, which is then extended to the full hierarchical model.
2 Probabilistic Clustering of Documents
Let us emphasize the clustering aspect by first introducing a simplified, non-hierarchical version of the CAM which performs `flat' probabilistic clustering and is closely related to the distributional clustering model [Pereira et al., 1993] that has been used for word clustering and text categorization [Baker and McCallum, 1998]. Let $d \in \mathcal{D} = \{d_1, \dots, d_I\}$ denote documents and $w \in \mathcal{W} = \{w_1, \dots, w_J\}$ denote words or word stems. Moreover, let $w_d$ refer to the vector (sequence) of words $w_{dt}$ constituting $d$. Word frequencies are summarized using count variables $n(d,w)$ which indicate how often a word $w$ occurred in a document $d$; $n(d) = \sum_w n(d,w)$ denotes the document length. Following the standard latent class approach, it is assumed that each document $d$ belongs to exactly one cluster $c \in \mathcal{C} = \{c_1, \dots, c_K\}$, where the number of clusters $K$ is assumed to be fixed for now. Introducing class-conditional word distributions $P(w|c)$ and class prior probabilities $P(c)$ (stacked in a parameter vector $\theta$), the model is defined by $P(c_d = c; \theta) = P(c)$ and

$$P(w_d \,|\, c_d = c; \theta) = \prod_{t=1}^{n(d)} P(w_{dt}|c) = \prod_{w} P(w|c)^{n(d,w)} \qquad (1)$$

The factorial expression reflects conditional independence assumptions about word occurrences in $w_d$ (bag-of-words model).
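As a concrete illustration of the bag-of-words likelihood in (1), the following is a minimal sketch (assuming a NumPy count matrix and hypothetical parameter arrays; none of these names come from the paper) of evaluating $\log P(w_d, c_d = c; \theta)$ for all documents and clusters at once:

```python
import numpy as np

def log_joint(n, log_P_c, log_P_w_given_c):
    """Log of P(c) * prod_w P(w|c)^n(d,w) for every (document, cluster) pair.

    n               : (D, W) array of counts n(d, w)
    log_P_c         : (K,)   array of log class priors log P(c)
    log_P_w_given_c : (K, W) array of log class-conditional word probabilities
    Returns a (D, K) array with entry [d, c] = log P(w_d, c_d = c; theta).
    """
    # sum_w n(d, w) * log P(w|c), vectorized over all documents and clusters
    return log_P_c[None, :] + n @ log_P_w_given_c.T
```

Working in log space simply avoids numerical underflow for long documents; the model itself is unchanged.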
Starting from (1), the standard EM approach [Dempster et al., 1977] to latent variable models is employed. In EM, two re-estimation steps are alternated: an Expectation (E)-step for estimating the posterior probabilities of the unobserved clustering variables $P(c_d = c \,|\, w_d; \theta)$ for a given parameter estimate $\theta$, and a Maximization (M)-step, which involves maximization of the so-called expected complete data log-likelihood $\mathcal{L}$ for given posterior probabilities with respect to the parameters. The EM algorithm is known to increase the observed likelihood in each step, and converges to a (local) maximum under mild assumptions. An application of Bayes' rule to (1) yields the following E-step re-estimation equation for the distributional clustering model:

$$P(c_d = c \,|\, w_d; \theta) = \frac{P(c) \prod_w P(w|c)^{n(d,w)}}{\sum_{c'} P(c') \prod_w P(w|c')^{n(d,w)}} \qquad (2)$$

The M-step stationary equations obtained by differentiating $\mathcal{L}$ are given by

$$P(c) = \frac{1}{I} \sum_d P(c_d = c \,|\, w_d; \theta) \qquad (3)$$

$$P(w|c) = \frac{\sum_d P(c_d = c \,|\, w_d; \theta)\, n(d,w)}{\sum_d P(c_d = c \,|\, w_d; \theta)\, n(d)} \qquad (4)$$

These equations are very intuitive: the posteriors $P(c_d = c \,|\, w_d; \theta)$ encode a probabilistic clustering of documents, while the conditionals $P(w|c)$ represent average word distributions for documents belonging to group $c$. Of course, the simplified flat clustering model defined by (1) has several deficits. Most severe are the lack of a multi-resolution structure and the inadequacy of the `prototypical' distributions $P(w|c)$ to emphasize discriminative or characteristic words (they are in fact typically dominated by the most frequent word occurrences). To cure these flaws is the main goal of the hierarchical extension.
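The E- and M-steps (2)-(4) translate directly into a few lines of array code. The sketch below is one possible implementation under the same assumed array layout as above (illustrative names, not part of the paper):

```python
import numpy as np

def _log_norm(x, axis):
    # log-sum-exp normalization, keeps the E-step numerically stable
    m = x.max(axis=axis, keepdims=True)
    return x - (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True)))

def em_step(n, log_P_c, log_P_w_given_c, eps=1e-12):
    """One EM iteration for the flat distributional clustering model, Eqs. (2)-(4)."""
    # E-step, Eq. (2): posteriors P(c_d = c | w_d; theta) via Bayes' rule
    post = np.exp(_log_norm(log_P_c[None, :] + n @ log_P_w_given_c.T, axis=1))  # (D, K)

    # M-step, Eq. (3): P(c) = (1/I) * sum_d P(c_d = c | w_d; theta)
    P_c = post.mean(axis=0)                                                     # (K,)

    # M-step, Eq. (4): P(w|c) proportional to sum_d P(c_d = c | w_d; theta) n(d, w)
    P_w_given_c = post.T @ n                                                    # (K, W)
    P_w_given_c /= P_w_given_c.sum(axis=1, keepdims=True)

    return np.log(P_c + eps), np.log(P_w_given_c + eps), post
```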
3 Document Hierarchies and Abstraction

3.1 The Cluster-Abstraction Model
Most hierarchical document clustering techniques utilize agglomerative algorithms which generate a cluster hierarchy or dendrogram as a by-product of successive cluster merging (cf. [Willett, 1988]). In the CAM we will instead use an explicit abstraction model to represent hierarchical relations between document groups. This is achieved by extending the `horizontal' mixture model of the previous section with a `vertical' component that captures the specificity of a particular word $w_{dt}$ in the context of a document $d$. It is assumed that each word occurrence $w_{dt}$ has an associated abstraction node $a$, the latter being identified with inner or terminal nodes of the cluster hierarchy (cf. Figure 1 (a)).

To formalize the sketched ideas, additional latent variable vectors $a_d$ with components $a_{dt}$ are introduced which assign words in $d$ to exactly one of the nodes in the hierarchy. Based on the topology of the nodes in the hierarchy, the following constraints between the cluster variables $c_d$ and the abstraction variables $a_{dt}$ are imposed:

$$a_{dt} \in \{ a : a \text{ is above } c_d \text{ in the hierarchy} \} \qquad (5)$$

The notation $a \uparrow c$ will be used as a shortcut to refer to nodes $a$ above the terminal node $c$ in the hierarchy. Eq. (5) states that the admissible values of the latent abstraction variables $a_{dt}$ for a particular document with latent class $c_d$ are restricted to those nodes in the hierarchy that are predecessors of $c_d$. This breaks the permutation symmetry of the abstraction nodes as well as of the document clusters. An abstraction node $a$ at a particular place in the hierarchy can only be utilized to "explain" words of documents associated with terminal nodes in the subtree of $a$. A pictorial representation can be found in Figure 1 (b): if $d$ is assigned to $c$, the choices of abstraction nodes for word occurrences $w_{dt}$ are restricted to the `active' (highlighted) vertical path. One may think of the CAM as a mixture model with a horizontal mixture of clusters and a vertical mixture of abstraction levels. Each horizontal component is a mixture of vertical components on the path to the root, vertical components being shared by different horizontal components according to the tree topology. Generalizing the non-hierarchical model in (1), a probability distribution $P(w|a)$ over words is attached to each node (inner or terminal) of the hierarchy. After application of the chain rule, the complete data model (i.e., the joint probability of all observed and latent variables) can be specified in three steps: $P(c_d = c; \theta) = P(c)$, $P(a_{dt} = a \,|\, c_d = c; \theta) = P(a|c,d)$, and

$$P(w_d \,|\, a_d; \theta) = \prod_{t=1}^{n(d)} P(w_{dt} \,|\, a_{dt}) \qquad (6)$$
Note that additional document-specific vertical mixing proportions $P(a|c,d)$ over abstraction nodes above cluster $c$ have been introduced, with the understanding that $P(a|c,d) = 0$ whenever it is not the case that $a \uparrow c$. If one makes the simplifying assumption that the same mixing proportions are shared by all documents assigned to a particular cluster (i.e., $P(a|c,d) = P(a|c)$), the solution degenerates to the distributional clustering model, since one may always choose $P(a|c) = \delta_{ac}$. However, we propose to use this more parsimonious model and fit $P(a|c)$ from held-out data (a fraction of words held out from each document), which is in the spirit of model interpolation techniques [Jelinek and Mercer, 1980].
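The constraint (5) and the per-cluster mixing proportions $P(a|c)$ only require knowing which nodes lie above each terminal node. One possible encoding, sketched for a complete binary tree with hypothetical helper names (the paper does not prescribe any particular data structure or tree shape):

```python
import numpy as np

def predecessor_sets(depth):
    """Complete binary tree with `depth` levels, nodes numbered 0 (root) to
    2**depth - 2 in heap order; the leaves are the document clusters.
    Returns {leaf c: list of nodes a with a 'above' c (Eq. 5), leaf included}."""
    n_nodes, first_leaf = 2 ** depth - 1, 2 ** (depth - 1) - 1
    above = {}
    for c in range(first_leaf, n_nodes):
        path, node = [], c
        while node > 0:
            path.append(node)
            node = (node - 1) // 2      # parent in the implicit heap layout
        path.append(0)                  # the root is above every cluster
        above[c] = path[::-1]           # root ... leaf c
    return above

# Vertical mixing proportions P(a|c): nonzero only for admissible nodes,
# initialized uniformly here and later re-fit on held-out words (cf. Eq. 10).
above = predecessor_sets(depth=4)
P_a_given_c = {c: np.full(len(nodes), 1.0 / len(nodes)) for c, nodes in above.items()}
```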
3.2 EM Algorithm
Figure 1: (a) Sketch of the cluster-abstraction structure, (b) the corresponding representation for assigning occurrences to abstraction levels in terms of latent class variables.

As for the distributional clustering model before, we will derive an EM algorithm for model fitting. The E-step requires computing (joint) posterior probabilities of the form $P(c_d = c, a_{dt} = a \,|\, w_d; \theta)$. After applying the chain rule one obtains:

$$P(c_d = c \,|\, w_d; \theta) \propto P(c) \prod_w \Big[ \sum_a P(w|a)\, P(a|c) \Big]^{n(d,w)} \qquad (7)$$

$$P(a_{dt} = a \,|\, w_d, c_d = c; \theta) = \frac{P(w_{dt}|a)\, P(a|c)}{\sum_{a'} P(w_{dt}|a')\, P(a'|c)} \qquad (8)$$

The M-step re-estimation equations for the conditional word distributions are given by

$$P(w|a) = \frac{\sum_d \sum_{t: w_{dt} = w} P(a_{dt} = a \,|\, w_d; \theta)}{\sum_d \sum_t P(a_{dt} = a \,|\, w_d; \theta)} \qquad (9)$$

where $P(a_{dt} = a \,|\, w_d; \theta) = \sum_c P(c_d = c \,|\, w_d; \theta)\, P(a_{dt} = a \,|\, w_d, c_d = c; \theta)$. Moreover, we have the update equation (3) for the class priors $P(c)$ and the formula

$$P(a|c) \propto \sum_d P(c_d = c \,|\, w_d; \theta) \sum_t P(a_{dt} = a \,|\, w_d, c_d = c; \theta) \qquad (10)$$

which is evaluated on the held-out data. Finally, it may be worth taking a closer look at the predictive word probability distribution $P(w|d)$ in the CAM, which is given by

$$P(w|d) = \sum_c P(c_d = c \,|\, w_d; \theta) \sum_a P(a|c)\, P(w|a) \qquad (11)$$

If we assume for simplicity that $P(c_d = c \,|\, w_d; \theta) = 1$ for some $c$ (hard clustering case), then the word probability of $d$ is modeled as a mixture of occurrences from different abstraction levels $a$. This reflects the reasonable assumption that each document contains a certain mixture of words ranging from general terms of ordinary language to highly specific technical terms and specialty words.
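The quantities in Eqs. (7), (8) and (11) can all be computed from the same intermediate array $\sum_a P(w|a)P(a|c)$. A dense-array sketch for a single document (hypothetical names; `P_a_given_c[a, c]` is assumed to be zero whenever $a$ is not above $c$, as required by Eq. (5)):

```python
import numpy as np

def cam_e_step(n_d, P_c, P_w_given_a, P_a_given_c, eps=1e-12):
    """n_d: (W,) counts of one document; P_c: (K,); P_w_given_a: (A, W);
    P_a_given_c: (A, K), zero wherever the a-above-c constraint fails."""
    # mix[w, c] = sum_a P(w|a) P(a|c), the bracketed term in Eq. (7)
    mix = P_w_given_a.T @ P_a_given_c                          # (W, K)

    # Eq. (7): P(c_d = c | w_d) proportional to P(c) * prod_w mix[w, c]^n(d, w)
    log_post_c = np.log(P_c + eps) + n_d @ np.log(mix + eps)   # (K,)
    post_c = np.exp(log_post_c - log_post_c.max())
    post_c /= post_c.sum()

    # Eq. (8): post_a[w, a, c] = P(w|a) P(a|c) / mix[w, c]
    post_a = (P_w_given_a.T[:, :, None] * P_a_given_c[None, :, :]) / (mix[:, None, :] + eps)

    # Eq. (11): predictive word distribution P(w|d) = sum_c post_c[c] * mix[w, c]
    P_w_given_d = mix @ post_c                                 # (W,)

    return post_c, post_a, P_w_given_d
```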
3.3 Annealed EM Algorithm
There are three important problems which also need to be addressed in a successful application of the CAM: First and most importantly, one has to avoid the problem of overfitting. Second, it is necessary to specify a method to determine a meaningful tree topology, including the maximum number of terminal nodes. And third, one may also want to find ways to reduce the sensitivity of the EM procedure to local maxima. An answer to all three questions is provided by a generalization called annealed EM [Hofmann and Puzicha, 1998]. Annealed EM is closely related to a technique known as deterministic annealing that has been applied to many clustering problems (e.g., [Rose et al., 1990; Pereira et al., 1993]). Since a thorough discussion of annealed EM is beyond the scope of this paper, the theoretical background is skipped and we focus on a procedural description instead. The key idea in deterministic annealing is the introduction of a temperature parameter $T \in \mathbb{R}^+$. Applying the annealing principle to the clustering variables, the posterior calculation in (7) is generalized by replacing $n(d,w)$ in the exponent by $n(d,w)/T$. For $T > 1$ this dampens the likelihood contribution linearly on the log-probability scale and will in general increase the entropy of the (annealed) posterior probabilities. In annealed EM, $T$ is utilized as a control parameter which is initialized at a high value and successively lowered until the performance on the held-out data starts to decrease. Annealing is advantageous for model fitting, since it offers a simple and inexpensive regularization which avoids overfitting and improves the average solution quality. Moreover, it also offers a way to generate tree topologies, since annealing leads through a sequence of so-called phase transitions, where clusters split. In our experiments, $T$ has been lowered until the perplexity (i.e., the log-averaged inverse word probability) on held-out data starts to increase, which automatically defines the number of terminal nodes in the hierarchy. More details on this subject can be found in [Hofmann and Puzicha, 1998].
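A procedural sketch of the annealed E-step and the cooling schedule described above (illustrative Python; the only change relative to Eq. (7) is the $n(d,w)/T$ exponent, and the schedule parameters are assumptions, not values from the paper):

```python
import numpy as np

def annealed_posterior_c(n_d, P_c, mix, T, eps=1e-12):
    """Annealed variant of Eq. (7): counts n(d, w) are replaced by n(d, w)/T,
    which for T > 1 flattens (raises the entropy of) the cluster posterior.
    mix[w, c] = sum_a P(w|a) P(a|c), as in the earlier sketch."""
    log_post = np.log(P_c + eps) + (n_d / T) @ np.log(mix + eps)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def anneal(fit_at_temperature, heldout_perplexity, T0=8.0, cooling=0.9):
    """Lower T until the held-out perplexity starts to increase; the caller
    supplies the EM routine and the held-out evaluation as callables."""
    T, best, model = T0, np.inf, None
    while True:
        candidate = fit_at_temperature(T)
        perp = heldout_perplexity(candidate)
        if perp > best:
            return model, T / cooling      # keep the last model that improved
        best, model, T = perp, candidate, T * cooling
```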
4 Results and Conclusion
All documents used in the experiments have been preprocessed by word suffix stripping with a word stemmer. A standard stop word list has been utilized to eliminate the most frequent words; in addition, very rarely occurring words have also been eliminated. An example abstract and its index term representation is depicted in Figure 2 (a),(b).

Figure 2: (a) Abstract from the generated LEARN document collection, (b) representation in terms of word stems, (c) words with lowest perplexity under the CAM for words not occurring in the abstract (differentiated according to the hierarchy level).

Figure 3: Group descriptions for exemplary inner nodes by most frequent words and by the highest probability words from the respective CAM node.

The experiments reported are some typical examples selected from a much larger number of performance evaluations. They are based on two datasets which form the core of our current prototype system: a collection of 3609 recent papers with `learning' as a title word, including all abstracts of papers from Machine Learning Vol. 10-28 (LEARN), and a dataset of 1568 recent papers with `cluster' in the title (CLUSTER). The first problem we consider is to estimate the probability for a word occurrence in a text based on the statistical model. Figure 2 (c) shows the most probable words from different abstraction levels which did not occur in the original text of Figure 2 (a). The abstractive organization is very helpful to distinguish layers with trivial and unspecific word suggestions (like 'paper') up to highly specific technical terms (like 'replica'). One of the most important benefits of the CAM is the resolution-specific extraction of characteristic keywords. In Figures 4 and 6 we have visualized the top 6 levels for the datasets LEARN and CLUSTER, respectively. The overall hierarchical organization of the documents is very satisfying; the topological relations between clusters seem to capture important aspects of the inter-document similarities. In contrast to most multi-resolution approaches, the distributions at inner nodes of the hierarchy are not obtained by a coarsening procedure which typically performs some sort of averaging over the respective subtree of the hierarchy. The abstraction mechanism in fact leads to a specialization of the inner nodes. This specialization effect makes the probabilities $P(w|a)$ suitable for cluster summarization. Notice how the low-level nodes capture the specific vocabulary of the documents associated with clusters in the subtree below. The specific terms automatically become the most probable words in the component distribution, because higher-level nodes account for more general terms. To stress this point we have compared the abstraction result with probability distributions obtained by averaging over the respective subtree. Figure 3 summarizes some exemplary comparisons, showing that averaging mostly results in high probabilities for rather unspecific terms, while the CAM node descriptions are highly discriminative. The node-specific word distributions thus offer a principled and very satisfying solution to the problem of finding resolution-specific index terms for document groups, as opposed to many circulating ad hoc heuristics to distinguish between typical and topical terms. An example run for an interactive coarse-to-fine retrieval with the CLUSTER collection is depicted in Figure 5, where we pretend to be interested in documents on clustering for texture-based image segmentation. In a real interactive scenario, one would of course display more than just the top 5 words to describe document groups and use a more advanced shifting window approach to represent the actual focus in a large hierarchy. In addition to the description of document groups by inner node word distributions, the CAM also offers the possibility to attach prototypical documents to each of the nodes (the ones with maximal probability $P(a|d)$), to compute most probable documents for a given query, etc.
Figure 4: Top 6 levels of the cluster hierarchy for the LEARN dataset. Nodes are represented by their most probable words. Left/right successors of nodes in row 4 are depicted in rows 5a and 5b, respectively. Similarly, left successors of nodes in rows 5a/5b can be found in rows 6aa/6ba and right successors in rows 6ab/6bb.

Figure 5: Example run of an interactive image retrieval for documents on `texture-based image segmentation' with one level look-ahead in the CAM hierarchy.
All types of information, the cluster summaries by (locally) discriminant keywords, the keyword distributions over nodes, and the automatic selection of prototypical documents, are particularly beneficial to support an interactive retrieval process. Due to the abstraction mechanism the cluster summaries are expected to be more comprehensible than descriptions derived by simple averaging. The hierarchy offers a direct way to refine queries and can even be utilized to actively ask the user for additional specifications.

Conclusion: The cluster-abstraction model is a novel statistical approach to text mining which has a sound foundation on the likelihood principle. The dual organization of document cluster hierarchies and keyword abstractions makes it a particularly interesting model for interactive retrieval. The experiments carried out on small/medium scale document collections have emphasized some of the most important advantages. Since the model extracts hierarchical structures and supports resolution-dependent cluster summarizations, the application to large-scale databases seems promising.
Figure 6: Top 6 levels of the cluster hierarchy for the CLUSTER dataset (cf. comments for Figure 4).
Acknowledgments: Helpful comments of Jan Puzicha, Sebastian Thrun, Andrew McCallum, Tom Mitchell, Hagit Shatkay, and the anonymous reviewers are gratefully acknowledged. Thomas Hofmann has been supported by a DAAD postdoctoral fellowship.
References
[Baker and McCallum, 1998] L.D. Baker and A.K. McCallum. Distributional clustering of words for text classification. In SIGIR, 1998.
[Croft, 1977] W.B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28:341-344, 1977.
[Dempster et al., 1977] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.
[Hofmann and Puzicha, 1998] T. Hofmann and J. Puzicha. Statistical models for co-occurrence data. Technical Report AI Memo 1625, AI Lab/CBCL, M.I.T., 1998.
[Jardine and van Rijsbergen, 1971] N. Jardine and C.J. van Rijsbergen. The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7:217-240, 1971.
[Jelinek and Mercer, 1980] F. Jelinek and R. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, 1980.
[McLachlan and Basford, 1988] G.J. McLachlan and K.E. Basford. Mixture Models. Marcel Dekker, Inc., New York, Basel, 1988.
[Pereira et al., 1993] F.C.N. Pereira, N.Z. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the Association for Computational Linguistics, pages 183-190, 1993.
[Rose et al., 1990] K. Rose, E. Gurewitz, and G. Fox. Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8):945-948, 1990.
[Willett, 1988] P. Willett. Recent trends in hierarchical document clustering: a critical review. Information Processing & Management, 24(5):577-597, 1988.