HIERARCHIC DOCUMENT CLUSTERING USING WARD'S METHOD

A. EL-HAMDOUCHI and P. WILLETT
Sheffield University, Western Bank, Sheffield, S10 2TN, UK


ABSTRACT

In this paper, we discuss the application of a recent hierarchic clustering algorithm to the automatic classification of files of documents. Whereas most hierarchic clustering algorithms involve the generation and updating of an inter-object dissimilarity matrix, this new algorithm is based upon a series of nearest neighbour searches. Such an approach is appropriate to several clustering methods, including Ward's method, which has been shown to perform well in experimental studies of hierarchic document clustering. A description is given of heuristics which can increase the efficiency of the new algorithm when it is used to cluster three document collections by Ward's method.

1. INTRODUCTION

The subject of document retrieval involves the study of computerised methods for the storage and retrieval of textual items such as memoranda, technical reports and references to journal articles (VANR79, SALT83). Document retrieval systems have long been the preserve of the librarian or information worker, but the spread of office automation facilities is leading to increased interest in the use of such systems in a wide range of environments.

Documents are generally represented for retrieval purposes by a series of keywords that have been assigned using either manual or automatic indexing, and similar lists of keywords are used to represent the queries which are presented to the system. The sets of document and query keywords form the basis for a matching operation which identifies those documents that are most similar to the query, and thus those that are most likely to be relevant to the person who submitted the query. A very wide range of methods has been suggested by which the machine-readable file may be structured in order to facilitate the query-document matching operations, and one such application involves the use of document clustering techniques.

Document clustering involves the application of a clustering method to the lists of keywords representing each of the documents in a collection, resulting in the identification of clusters of documents that have many terms in common and that might accordingly be expected to be about the same, or related, subjects. A query may then be matched against the set of clusters, rather than against each of the documents. Not only may this increase the efficiency of a search, since fewer matching operations are required, but it may also increase the effectiveness of a search, in terms of the amount of relevant material that is retrieved, since the inter-document term similarities have been taken into consideration in the structuring of the file.

Early work on document clustering focussed upon the use of non-hierarchic partitioning algorithms which permitted a rapid clustering of a file, but which were found to give classifications that were less effective for retrieval purposes than conventional, non-clustered search strategies (SALT71). Accordingly, attention has focussed upon the use of hierarchic clustering methods, specifically hierarchic agglomerative methods, which result in binary tree-like classifications in which small clusters of strongly related documents are nested within larger and larger clusters of less strongly related documents.



The first detailed studies of the use of hierarchic agglomerative methods for document clustering were by Jardine and van Rijsbergen (JARD71b) and van Rijsbergen (VANR74), who used the single linkage method to cluster several sets of documents and queries; further experiments with the single linkage method have been described by van Rijsbergen and Croft (VANR75) and Croft (CROF77, CROF80). More recently, Griffiths et al. (GRIF84, GRIF86) have described comparative experiments which involved not only single linkage but also the complete linkage, group average and Ward, or minimum variance, methods. Their results suggest that the single linkage method is the least appropriate of the methods studied, with the most useful results generally being obtained from the use of Ward's method.



This method was found to result in balanced hierarchies that gave high levels of retrieval effectiveness in a range of cluster searches based upon the clusters in the bottom level of the hierarchy. However, the experiments were restricted to collections containing 800 documents or less, and while such a file is large in comparison with the data sets that are encountered in most applications of hierarchic cluster analysis, it is trivially small in comparison with the file sizes encountered in document retrieval systems, which may contain tens or hundreds of thousands of documents. Thus, although the experimental results to date have suggested that document clustering may indeed have substantial merits for retrieval purposes, there is considerable scepticism as to the use of document clustering on files of realistic size. A notable exception is the work of Croft (CROF77), who has reported document clustering experiments using part of the UKCIS document test collection; however, he used only the single linkage method, and until recently there have been no algorithms available which would permit other hierarchic clustering methods, such as Ward's method, to be applied to data sets of comparable size. Recently, however, several new and efficient algorithms for hierarchic agglomerative clustering have been reported which are based upon a series of repeated nearest neighbour searches (DAYE84, MURT84a, MURT84b). In this paper, we study the application of these algorithmic developments to automatic document clustering using Ward's method; the description is at a sufficient level of detail to allow other workers to implement this clustering method without undue difficulty should they so desire. We have used the resulting program in an extended comparative study of hierarchic clustering methods; the results of this study will be reported in a future communication.

2. HIERARCHIC AGGLOMERATIVE CLUSTERING

2.1 Clustering algorithms


The traditional algorithm for obtaining a hierarchic agglomerative classification of a collection of N points is as shown below; we shall refer to this as algorithm A. The reader should note that the term 'point' refers initially to individual documents but encompasses clusters of documents as the agglomeration proceeds.

  Calculate all N(N-1)/2 distinct inter-point dissimilarities
    and store them in an inter-point dissimilarity matrix ;
  WHILE more than one point remains DO
  BEGIN
    Search the matrix to identify the least dissimilar
      remaining pair of points ;
    Fuse these two points to form a new point ;
    Update the matrix by calculating the dissimilarities
      between this new point and each of the other points
  END.

It will be seen that algorithm A involves the generation of a matrix which initially contains the dissimilarities between all pairs of documents in the collection. The matrix is repeatedly scanned to identify the least dissimilar pair of points, and these are then fused together and the dissimilarities calculated that involve this new point. Different clustering methods may be accommodated within this general framework by

varying the precise definition that is used in the calculation of the dissimilarities (LANC67). Since there are N(N-1)/2 entries in the dissimilarity matrix, this algorithm has a storage requirement of O(N^2). The main loop of the algorithm is performed N-1 times, once for each cluster formed during the generation of the hierarchy, and each such iteration may require the scanning of all of the O(N^2) entries in the matrix to identify the least dissimilar pair of points; thus the time requirement of the algorithm is O(N^3). More efficient algorithms are known for certain specific methods. Thus Sibson (SIBS73) has described an O(N) space and O(N^2) time algorithm for the single linkage method, and Defays (DEFA77) has reported a modification of the Sibson algorithm which may be used to generate a non-unique complete linkage hierarchy; in both cases, the dissimilarities in the matrix are calculated one row at a time and are used to update the hierarchy recursively. However, there are no comparable algorithms for Ward's method.
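
By way of illustration, a minimal Python sketch of algorithm A is given below. The sketch is ours rather than part of the original study: single linkage is used as the matrix update rule, since it is the simplest method that fits the framework, and all identifier names are illustrative. The nested scan over the matrix makes the O(N^2) storage and O(N^3) time requirements directly visible.

import numpy as np

def algorithm_a(dissim):
    # Agglomerate N points given an N x N dissimilarity matrix; returns
    # the N-1 fusions as (point, point, dissimilarity) triples.
    d = dissim.astype(float).copy()
    np.fill_diagonal(d, np.inf)              # never fuse a point with itself
    active = set(range(d.shape[0]))
    fusions = []
    while len(active) > 1:                   # the main loop runs N-1 times
        # scan the O(N^2) matrix for the least dissimilar remaining pair
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda pair: d[pair])
        fusions.append((i, j, d[i, j]))
        # fuse j into i; the single linkage update keeps the minimum
        for k in active - {i, j}:
            d[i, k] = d[k, i] = min(d[i, k], d[j, k])
        active.remove(j)
    return fusions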

The new algorithms for hierarchic agglomerative classification are well reviewed by Murtagh (MURT83, MURT84a), who emphasises the central importance of nearest neighbour (NN) searching, where the nearest neighbour of some point I, NN(I), is defined to be that point which is least dissimilar to I. An NN-chain is defined to be the sequence of points I, J=NN(I), K=NN(J), ..., Y=NN(X), X=NN(Y); X and Y are said to be reciprocal nearest neighbours, or RNNs. We note in passing that the use of NN-chains for document retrieval has been described by Goffman (GOFF69). The particular algorithm studied here, algorithm B, is as follows:

  Select some arbitrary point I ;
  WHILE there is still more than one point DO
  BEGIN
    REPEAT
      J=NN(I), K=NN(J), L=NN(K) etc.
    UNTIL Y=NN(X) and X=NN(Y) ;
    Fuse X and Y together and update the data matrix by
      replacing them with the centroid of the new point ;
    I := either the point prior to X in the NN-chain or,
      if no such point exists, an arbitrary point
  END.
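
A Python sketch of algorithm B is given below; again, this is our own illustration rather than the authors' program. The dissimilarity measure and the fusion rule are passed in as functions since, as noted below, it is these two definitions that select the particular geometric clustering method.

def algorithm_b(points, dissim, fuse):
    # points: a dict mapping a point identifier to a point record
    # (for Ward's method, a cluster size and a centroid vector).
    fusions = []
    chain = []                               # the current NN-chain
    while len(points) > 1:
        if not chain:
            chain.append(next(iter(points)))     # arbitrary starting point I
        while True:
            q = chain[-1]
            # nearest neighbour of the current point among the others
            nn = min((p for p in points if p != q),
                     key=lambda p: dissim(points[q], points[p]))
            if len(chain) > 1 and nn == chain[-2]:
                break                            # reciprocal nearest neighbours
            chain.append(nn)
        x, y = chain.pop(), chain.pop()          # the RNN pair X and Y
        fusions.append((x, y))
        points[(x, y)] = fuse(points.pop(x), points.pop(y))
        # the search resumes from the point prior to X, still on the chain
    return fusions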

Murtagh (MURT83) describes this as a single cluster algorithm for geometric clustering methods. It is a single cluster algorithm since only one NN-chain, and hence only one cluster, is grown at a time; this is in contrast with multiple cluster algorithms, where several NN-chains may be grown simultaneously. Geometric clustering methods are those where the new point that is formed by the fusion of two previous points can be represented by a centroid, or point centre; this is not the case with the alternative linkage methods such as single or complete linkage. Examples of clustering methods that can be implemented by algorithm B include the centroid, median and Ward methods; they may be obtained merely by altering the dissimilarity measure used to generate the NN-chains, and the definition of the point centre used to characterise each new point as it is formed.

If the current point in the NN-chain, i.e. the point whose nearest neighbour is being sought, is called the query point, Q, a Ward classification may be obtained by, firstly, setting the dissimilarity, D(Q,I), between Q and the I'th point in the collection (I ≠ Q) equal to

  D(Q,I) = SIZE(Q)*SIZE(I)*DIST(Q,I) / (SIZE(Q)+SIZE(I))

where SIZE(Q) and SIZE(I) are the numbers of documents in the points Q and I, and where DIST(Q,I) is the Euclidean distance between them, and, secondly, defining the point centre representing the point arising from the fusion of the two points Q and I by the expression

  (SIZE(Q)*VEC(Q) + SIZE(I)*VEC(I)) / (SIZE(Q)+SIZE(I))

where VEC(Q) and VEC(I) are the index term vectors characterising the points Q and I.
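
These two definitions may be written directly as functions for use with the sketch of algorithm B given above; the dense vector representation and the field names are our own illustrative choices (the program described in Section 3 works from an inverted file instead).

import numpy as np

def ward_dissimilarity(q, i):
    # D(Q,I) = SIZE(Q)*SIZE(I)*DIST(Q,I) / (SIZE(Q)+SIZE(I))
    dist = np.linalg.norm(q["vec"] - i["vec"])   # Euclidean distance DIST(Q,I)
    return q["size"] * i["size"] * dist / (q["size"] + i["size"])

def ward_fuse(q, i):
    # point centre of the fused point, weighted by the cluster sizes
    size = q["size"] + i["size"]
    return {"size": size,
            "vec": (q["size"] * q["vec"] + i["size"] * i["vec"]) / size}

A Ward classification of a collection of vectors could then be obtained by a call such as algorithm_b({n: {"size": 1, "vec": v} for n, v in enumerate(vectors)}, ward_dissimilarity, ward_fuse).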

2.2 Nearest neighbour searching

It will be seen that the repeated scanning and updating of the dissimilarity matrix in the simple algorithm A is replaced in algorithm B by a sequence of N-1 RNN searches. Thus the overall running time will be crucially dependent upon the efficiency of the search procedure that is used to identify the nearest neighbours at each stage of the clustering. The obvious, brute-force algorithm for nearest neighbour searching is to compute the dissimilarity between the current point in the NN-chain, Q, and each of the other points in the file, a procedure that is much too slow for all but the smallest data sets. To overcome this problem, a range of procedures has been suggested that avoid the calculation of most of the dissimilarities while still calculating them for those few points which are, in fact, similar to the query; examples include the projection of the multi-dimensional points onto a space of lower dimensionality, where most of the dissimilarity calculations are performed, and the grouping of points into clusters so that several points may be searched, or eliminated from the search, simultaneously (MURT84b).

Unfortunately, most of the nearest neighbour algorithms in the literature are not directly applicable to searching and clustering files of documents, since they generally assume that M, the dimensionality of the space that contains the records, is very low, typically 2 or 3, so that multiplicative terms involving M in the equation describing the complexity of the search algorithm may be neglected. In text retrieval systems, conversely, many thousands or tens of thousands of keywords may be used to characterise the documents in the file, and such algorithms are thus quite impracticable. An example of this is the O(log N) expected time nearest neighbour search procedure of Friedman et al. (FRIE77), for which Weiss (WEIS) suggests a constant of proportionality of about 1.6^M. Again, the search algorithm of Bentley et al. (BENT80) involves the inspection of all of the 3^M - 1 cells adjacent to a given cell in the space, an undertaking that is quite infeasible if the dimensionality is at all large, and also the assumption that the data records have been chosen independently from a uniform point distribution throughout the space. As a further example of such problems, Kittler (KITT78) describes a heuristic for nearest neighbour searching using the city block distance, the efficiency of which is proportional to M.

For these reasons, considerable work has gone into the design of nearest neighbour algorithms that can be used with the high dimensionality data which characterise textual data bases. The majority of the algorithms which have been developed involve the use of the inverted file structure (NORE77, SMEA81, MURT82, PERR83), and we have used one such algorithm as the basis for a program for Ward's clustering method. The exact manner in which this is achieved is described in the next section of the paper.
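
The brute-force search mentioned at the start of this section can be stated in two lines (our sketch); its cost, O(N) dissimilarity calculations of O(M) operations each for every query point, is what the inverted file algorithms of Section 3 are designed to avoid.

def brute_force_nn(q, others, dissim):
    # one full scan of the file for each query point
    return min(others, key=lambda p: dissim(q, p))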

3. THE PROGRAM

3.1 Introduction


The strategy that is chosen for NN-searching will depend critically upon the inter-document dissimilarity measure used. In the case of Ward's clustering method, the dissimilarity D(Q,I) between the query point Q and some arbitrary point I is as given in Section 2.1 above. For I to be the NN of Q, two situations may be identified, depending upon whether Q and I do or do not have terms in common. It is normally assumed in document clustering that documents which have no terms in common should not be considered as being related in any way. However, such a restriction may mean that the true NN for some Q may not be identified, resulting in an approximate and inexact hierarchy; as we show below in Section 4.1, NNs which do not have terms in common with the current Q occur quite frequently in at least one of the data sets which we have studied. Moreover, if such NNs are ignored, the situation can, and does, arise that it is impossible to create a full hierarchy; instead, a series of discrete sub-hierarchies is created which cannot be linked into a single classification. In the practical context of a clustered document file, this need not be too serious a problem if bottom-up cluster searches are employed, since it is then unlikely that the upper reaches of the hierarchy will need to be inspected (CROF80); this will not be the case if top-down algorithms are used to search the classification (VANR74). Accordingly, we shall assume for the moment that an exact hierarchy is to be generated, in which NN(Q) is allowed to have no terms in common with Q. We shall refer to the set of points having terms in common with Q as COMM, and the set of points having no terms in common with Q as NCOMM. Then the identification of NN(Q) reduces to that of identifying NN(Q)C, the NN for COMM, and NN(Q)NC, the NN for NCOMM, and comparing the two dissimilarities to identify the true nearest neighbour. The identification of NN(Q)C and NN(Q)NC are described in Sections 3.3 and 3.4 below. Before doing so, however, it is convenient to list briefly the main data structures that are required by our program.
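
The comparison of the two candidate nearest neighbours may be sketched as follows; the two search procedures are passed in as functions, standing for the algorithms of Sections 3.3 and 3.4, and the convention that each returns None when its candidate set is empty is our own assumption.

def nearest_neighbour(q, comm_search, ncomm_search):
    # comm_search(q) returns (NN(Q)C, its dissimilarity) or None;
    # ncomm_search(q) returns (NN(Q)NC, its dissimilarity) or None
    candidates = [c for c in (comm_search(q), ncomm_search(q))
                  if c is not None]
    return min(candidates, key=lambda c: c[1])   # the true NN(Q)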

3.2 Main data structures

Each of the N documents in the collection which is to be clustered is characterised by a list of integers, corresponding to the indexing terms which have been assigned to the document. The set of N such lists comprises the basic data matrix DOCFILE. The document collections considered in this paper all involve binary, i.e. present or absent, indexing without the use of any within-document weighting scheme. If such a scheme were to be used, DOCFILE would need to contain the weight of each term in each document, in addition to its presence.

The inverted file, INVFILE, consists of a set of M lists, one for each of the terms in the vocabulary that has been used to index the documents. The I'th such list (1 <= I <= M) contains the identifiers of those points which have been indexed by the I'th term, together with the associated term coefficients.
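
For binary indexing, the two structures can be sketched as follows; the Python data layout is our own, and under a within-document weighting scheme the inverted file lists would hold (identifier, weight) pairs instead of bare identifiers.

from collections import defaultdict

def build_invfile(docfile):
    # docfile: dict mapping a document identifier to its list of term ids
    invfile = defaultdict(list)
    for doc_id, terms in docfile.items():
        for term in terms:
            invfile[term].append(doc_id)     # presence only: binary indexing
    return invfile

docfile = {1: [3, 7, 19], 2: [7, 42], 3: [3, 42]}
invfile = build_invfile(docfile)             # e.g. invfile[7] == [1, 2]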

3.3 Identification of NN(Q)C

The inverted file nearest neighbour algorithms mentioned in Section 2.2 all involve consideration of the points in COMM, and thus result in the identification of NN(Q)C. A critical review of these algorithms (PERR83) suggested that the most efficient algorithm was that due to Noreault et al. (NORE77), and we have used this in our clustering program for the identification of NN(Q)C. The algorithm involves the processing of the point identifiers and term coefficients in the inverted file lists corresponding to the terms in the query point, which we shall refer to as query lists. The processing of the query lists results in a new list containing the identifiers of all of those points that have at least one term in common with the query, together with the dot product between the term vectors representing Q and each other point. The dot products may be calculated as follows. When a point identifier, I say, is encountered for the first time in a query list, a counter, DOTPROD(I), is allocated to that point and set equal to the product of the term coefficients for I and for Q; DOTPROD(I) is incremented by further such products each time that I is encountered in subsequent query lists. When all of the lists have been processed in this way, each counter will contain the sum of the term coefficient products between the query and one of the points in the collection that is being searched. These sums may then be used for the calculation of the distance DIST(Q,I) and hence the dissimilarity D(Q,I); the D(Q,I) values may then be used to identify the nearest neighbour of Q in COMM, NN(Q)C. This procedure may be summarised by the following pseudo-code, in which TC(Q) and TC(I) are the term coefficients for Q and I in the current query list, MINDISSIM is the smallest dissimilarity value calculated at any stage of the NN-search, and SUMSQ(I) denotes the sum of the squared term coefficients of the vector representing point I:

  FOR each query list DO
  BEGIN
    FOR each point, I, in this query list DO
      IF I has been identified in a previous query list THEN
        DOTPROD(I) := DOTPROD(I) + TC(Q)*TC(I)
      ELSE
        DOTPROD(I) := TC(Q)*TC(I)
  END ;
  NN(Q)C := 0 ;
  MINDISSIM := MAXINT ;
  FOR each point, I, in COMM DO
  BEGIN
    DIST(Q,I) := SQRT(SUMSQ(Q) + SUMSQ(I) - 2*DOTPROD(I)) ;
    D(Q,I) := SIZE(Q)*SIZE(I)*DIST(Q,I) / (SIZE(Q)+SIZE(I)) ;
    IF D(Q,I) < MINDISSIM THEN
    BEGIN
      NN(Q)C := I ;
      MINDISSIM := D(Q,I)
    END
  END .
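
The procedure may be sketched in Python as follows. The parameter layout is our own: binary indexing is assumed, as in Section 3.2, so that every term coefficient is 1; size[i] holds SIZE(I), sumsq[i] holds SUMSQ(I), and the function returns None when COMM is empty, matching the combination step sketched in Section 3.1.

import math

def nn_q_c(q_id, q_terms, invfile, size, sumsq):
    dotprod = {}
    for term in q_terms:                     # process each query list
        for i in invfile.get(term, []):
            if i != q_id:                    # Q is never its own neighbour
                dotprod[i] = dotprod.get(i, 0) + 1   # TC(Q)*TC(I) = 1*1
    best, mindissim = None, math.inf
    for i, dp in dotprod.items():            # exactly the points in COMM
        dist = math.sqrt(sumsq[q_id] + sumsq[i] - 2 * dp)
        d = size[q_id] * size[i] * dist / (size[q_id] + size[i])
        if d < mindissim:
            best, mindissim = i, d
    return None if best is None else (best, mindissim)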
