Incremental Clustering for Dynamic Information Processing

FAZLI CAN
Miami University

Clustering of very large document databases is useful for both searching and browsing. The periodic updating of clusters is required due to the dynamic nature of document databases. An algorithm for incremental clustering is introduced. The complexity analysis of the algorithm together with an investigation of its expected behavior are presented. Through empirical testing it is shown that the algorithm achieves cost effectiveness and generates statistically valid clusters that are compatible with those of reclustering. The experimental evidence shows that the algorithm creates an effective and efficient retrieval environment.

Categories and Subject Descriptors: H.2.2 [Database Management]: Physical Design—access methods; H.3.2 [Information Storage and Retrieval]: Information Storage—file organization; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—clustering, retrieval models, search process

General Terms: Algorithms, Design, Performance

Additional Key Words and Phrases: Best-match cluster search, cluster validity, cover coefficient, dynamic information retrieval environment, information retrieval effectiveness, information retrieval efficiency

1. INTRODUCTION

This work differs from our previous research [8, 9] because it studies multiple maintenance steps with a dynamic indexing vocabulary. Furthermore, the algorithm is based on a new data structure yielding very low computational complexity with negligible storage overhead. This makes C²ICM (Cover-Coefficient-based Incremental Clustering Methodology) suitable for very large dynamic document databases (for examples see Table I). Section 2 reviews the clustering and cluster maintenance problem. In Section 3 the base concept of the incremental clustering algorithm, the cover coefficient (CC) concept, and its relation to clustering are described. Section 4 introduces the incremental clustering methodology, C²ICM, its complexity and cost analysis, and an analytical investigation of its expected behavior in dynamic database environments. Section 5 covers the experimental design and evaluation; the experiments use our INSPEC database of 12,684 documents and 77 queries.

This work was supported by a MUCFR grant. Author's address: Department of Systems Analysis, Miami University, Oxford, OH 45056; email: fc74sanf@miamiu.bitnet.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1993 ACM 1046-8188/93/0400-0143 $01.50

ACM Transactions on Information Systems, Vol. 11, No. 2, April 1993, Pages 143-164.

Table I. File Size and Update Characteristics of Some Commercial Databases [11]

Document Database        File Size    Updates
Computer Database          330,091    Every two weeks
Confer. Papers Index     1,309,711    Six times per year
ERIC                       660,868    Monthly
McGraw-Hill News            79,640    Every 15 minutes
Medline                  6,200,000    Approx. 25,000/month

2. CLUSTERING AND CLUSTER MAINTENANCE

Three basic categories of clustering methodologies are known: graph-theoretical, single-pass, and iterative algorithms. A graph-theoretical algorithm is based on a similarity matrix representing the similarity between individual documents; a similarity threshold is applied, and the documents that are closely related are then represented by a connected graph. Depending on the connectivity structures of the graph, the clusters formed are known as "single link," "complete link," or "group average" clusters. The most known example of the single-pass approach is seed-oriented clustering. In this approach, the algorithm must determine the number of clusters to be formed and a seed for each cluster (e.g., an individual document, called the initiator); documents are then usually assigned to the cluster of the seed to which they are most similar. For implementation, the number of clusters to be formed is usually provided as an input parameter. Iterative algorithms try to embellish an initial clustering structure according to an optimization function [1].

The problem of cluster maintenance deals with the modification of a clustering structure due to the addition of new documents or the deletion of old documents (or both). An examination of the clustering literature reveals that most of the algorithms are developed for static databases; very few of them deal with growing document databases. In general, most of the clustering algorithms are unsuitable for maintenance [26, p. 58]; however, some of them, such as the seed-oriented and single-pass algorithms, can also be used as cluster maintenance algorithms for the generation of new cluster structures after document addition and deletion.
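To make the seed-oriented scheme above concrete, the following sketch assigns each nonseed document to its most similar seed under a simple dot-product similarity. It is a minimal illustration, not the paper's algorithm: the seed choice is taken as given, and the function names and toy data are hypothetical.

```python
def dot(u, v):
    """Similarity between two term-frequency dicts (shared terms only)."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def seed_oriented_clusters(docs, seed_ids):
    """Assign every nonseed document to the cluster of its most similar seed."""
    clusters = {s: [s] for s in seed_ids}
    for i, doc in enumerate(docs):
        if i in seed_ids:
            continue
        best = max(seed_ids, key=lambda s: dot(doc, docs[s]))
        clusters[best].append(i)
    return clusters

docs = [
    {"cluster": 2, "seed": 1},    # doc 0 (seed)
    {"index": 3, "search": 1},    # doc 1 (seed)
    {"cluster": 1, "query": 2},   # doc 2
    {"index": 1, "search": 2},    # doc 3
]
print(seed_oriented_clusters(docs, seed_ids=[0, 1]))
# {0: [0, 2], 1: [1, 3]}
```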

The cluster-splitting approach for maintenance treats a new document as a query, compares it with the existing clusters, and assigns it to the cluster that yields the best match [24]. The major problems with this approach [21, p. 494] are: (1) the generated clustering structure depends on the order of processing of the documents; and (2) it starts to degenerate with relatively small increases (such as 25% to 50%) in the size of the database.

Adaptive clustering is based on user queries. The method introduced in [31] assigns a random position on the real line (the hypothetical line from negative infinity to positive infinity) to each document. For each query, the positions of the documents relevant to it are shifted closer to each other. To ensure that all documents are not bunched up together, the centroid of these documents is moved away from the centroid of the entire database. This approach has some disadvantages: (1) the clusters depend on the order of query processing; (2) the definition of clusters (separation between the members of different clusters) requires an input parameter, and the appropriate value for this parameter is database dependent; and (3) convergence can be slow. The applicability of such algorithms in dynamic environments is not yet known.

For the maintenance of clusters generated by graph-theoretical methods (single link, group average, complete link), similarity values need to be used, and this may be very costly. The update cost of the single-link method is reasonable; its IR effectiveness is known to be unsatisfactory, however [12, 27, 29]. The time and space requirements of the group average and the complete link approaches are prohibitive, since they require the complete knowledge of similarities among old documents [27, pp. 29-30].

Another possibility for handling database growth is to use a cheap clustering algorithm of m log m complexity to recluster the entire database after document additions and deletions. However, even though some algorithms have a theoretical complexity of m log m, experimental evaluations have shown that, due to the large proportionality constant, the actual time taken for clustering is greater than that of an m² algorithm [26, p. 59]. These points imply that an efficient maintenance algorithm would be better than reclustering the whole database.

3. PRELIMINARY CONCEPTS

The base of C²ICM, the CC (cover-coefficient) concept, provides a measure of similarities among documents (see also [10]). This concept is first used to determine the number of clusters and cluster seeds, and then to assign nonseed documents to the clusters initiated by the seed documents to generate a partitioned clustering structure. (To aid in reading, the definition of the notation introduced in Sections 3 and 4 is provided in Table II.)

The CC concept determines document relationships in a multidimensional term space by a two-stage probability experiment. It maps an m × n (document by term) D matrix [10] into an m × m cover coefficient matrix, C. The individual entries of C, c_ij (1 ≤ i, j ≤ m), indicate the probability of selecting any term of d_i from d_j, and are defined as follows:

    c_{ij} = \alpha_i \sum_{k=1}^{n} d_{ik} \, \beta_k \, d_{jk}, \qquad 1 \le i, j \le m,


Table II. Notation Introduced in Sections 3, 4

Symbol         Definition / Meaning
c_ij           Extent to which d_i is covered by d_j (according to the CC concept)
d_c            Average cluster size, max(1, m/n_c), 1 ≤ d_c ≤ m
D_r            Set of documents to be clustered during an increment
k              No. of incremental updates
m (D)          No. of documents (current set of documents)
m_1            Initial no. of documents before all increments
m' (D_m')      No. of added documents in an increment (set of such documents), D_m' ⊆ D_r
m'' (D_m'')    No. of deleted documents in an increment (set of such documents)
n (n_k)        No. of terms (no. of terms after kth increment)
n_c (n_c,k)    No. of clusters (no. of clusters after kth increment), 1 ≤ n_c ≤ min(m, n)
p_i            Seed power of d_i
r              |D_r|/m' (r ≥ 1), based on the no. of reclustered documents in k increments
t              Total no. of nonzero entries in D
t_g            Average no. of documents per term, term generality (t/n)
t_gs           Average no. of seed documents per term
x_d            Average no. of distinct terms per document, depth of indexing (t/m)
α_i            Reciprocal of ith row sum of D
β_j            Reciprocal of jth column sum of D
δ_i            Decoupling coefficient of d_i (δ_i = c_ii, 0 < δ_i ≤ 1)
ψ_i            Coupling coefficient of d_i (ψ_i = 1 − δ_i, 0 ≤ ψ_i < 1)
δ'_j, ψ'_j     Like δ_i and ψ_i, but for t_j

where α_i and β_k are the reciprocals of the ith row sum and kth column sum, respectively. To apply the above formula, each document must have at least one term and each term must appear in at least one document. By definition, the entries of the D matrix can be any nonnegative real number.

The two-stage probability experiment behind c_ij can be better explained by an intuitive analogy (see also [17, p. 94]). Suppose that we have many urns (d_i contains many terms), and suppose that each urn contains balls of different colors (the terms of d_i may appear in many documents). Then what is the probability of selecting a ball of a particular color (i.e., d_j)? The experimental computation of this probability first requires the selection of an urn at random, and second involves an attempt at drawing a ball of that particular color from the selected urn. In terms of the D matrix, in the first stage, from the terms (urns) of d_i, we choose any term, t_k, at random; in c_ij this random selection is provided by the product α_i × d_ik. In the second stage, from the selected term (urn), t_k, we try to draw d_j (the ball of that particular color); in c_ij the product β_k × d_jk signifies the probability of selecting d_j. In this experiment, we have to consider the contribution of each urn (the terms of d_i) to the selection probability of a ball of that particular color (d_j); that is why we have the summation over all terms in the definition of c_ij. Accordingly, this two-stage experiment reveals the probability of selecting any of the terms of d_i from d_j. Intuitively this is a measure of similarity, since the documents that have many common terms will have a high probability of being selected because they appear together in most term lists. The experiment is also affected by the distribution of terms in the whole database: if d_i and d_j share rare terms, this increases c_ij.


Some characteristics of the C matrix are as follows: (1) 0 < c_ii ≤ 1 and 0 ≤ c_ij < 1 for i ≠ j; (2) Σ_{j=1}^{m} c_ij = 1, i.e., each row sum is 1; (3) c_ii ≥ c_ij; (4) in general c_ij ≠ c_ji; (5) c_ii = c_jj = c_ij = c_ji if d_i and d_j are identical; (6) if d_i is distinct, then c_ii = 1.

Thereby from (4) it is seen that c_ij measures an asymmetric similarity between d_i and d_j. In the C matrix, a document having terms in common with several documents will have a c_ii value considerably less than 1. Conversely, if a document has terms in common with very few documents, then its c_ii value will be close to 1. The c_ij entry indicates the extent to which d_i is "covered" by d_j. Identical documents are covered equally by all other documents; if the document vector of d_i is a subset of the document vector of d_j, then d_i will cover itself equally as d_j covers d_i. It is also true that if two documents have no terms in common, then they will not cover each other at all, and the corresponding c_ij and c_ji values will be zero.

Because higher values of c_ii result from document d_i having terms found in very few other documents, we let δ_i = c_ii and call δ_i the decoupling coefficient of d_i. It is a measure of how different d_i is from all other documents. Conversely, we call ψ_i = 1 − δ_i the coupling coefficient of d_i. Similar to documents, the relationship between terms can be defined via the C' matrix of size n × n, whose entries are defined as follows:

    c'_{ij} = \beta_i \sum_{k=1}^{m} d_{ki} \, \alpha_k \, d_{kj}, \qquad 1 \le i, j \le n.
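To make these definitions concrete, the sketch below computes the C matrix and the decoupling and coupling coefficients from a small binary D matrix with NumPy. It is an illustration of the formulas above rather than the paper's implementation; the sample documents and the helper name cover_coefficient_matrix are assumptions.

```python
import numpy as np

def cover_coefficient_matrix(D):
    """Compute C with c_ij = alpha_i * sum_k d_ik * beta_k * d_jk,
    where alpha_i (beta_k) is the reciprocal row (column) sum of D."""
    D = np.asarray(D, dtype=float)
    alpha = 1.0 / D.sum(axis=1)   # reciprocal row sums (each document has >= 1 term)
    beta = 1.0 / D.sum(axis=0)    # reciprocal column sums (each term occurs >= 1 time)
    # diag(alpha) @ D @ diag(beta) @ D.T expresses the double summation at once.
    return (alpha[:, None] * D) @ (beta[:, None] * D.T)

# A toy m x n (document by term) matrix; rows are documents, columns are terms.
D = np.array([
    [1, 1, 0, 0],   # d1
    [1, 0, 1, 0],   # d2
    [0, 0, 0, 1],   # d3 shares no terms with the others
])

C = cover_coefficient_matrix(D)
delta = np.diag(C)    # decoupling coefficients, delta_i = c_ii
psi = 1.0 - delta     # coupling coefficients

assert np.allclose(C.sum(axis=1), 1.0)   # row sums are 1 (characteristic 2)
print(np.round(C, 3))                    # c_33 = 1 since d3 is distinct (characteristic 6)
print(np.round(delta, 3), np.round(psi, 3))
```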
The case n_t > n_t,r suggests that the tested clustering structure is invalid, since it is unsuccessful in placing the documents relevant to the same query into a smaller number of clusters than that of the random case. The case n_t < n_t,r is an indication of the validity of the clustering structure; however, to decide validity, one must show that n_t is significantly less than n_t,r.

Table VII gives the n_t and n_t,r values for all incremental clustering experiments. The characteristics of the queries are given in Section 5.5.3. To show that n_t is significantly less than n_t,r, we must know the probability density function (baseline distribution) of n_t,r. For this purpose, we generated 1000 random clustering structures for each clustering structure produced by C²ICM and obtained the average number of target clusters for each random case, n_t(r). For example, for matrix 1 of the INSPEC database, the minimum and maximum n_t(r) values, respectively, are 29.623 and 30.792 (the standard deviation is 0.175). This shows that the n_t value of 23.558 for this experiment is significantly less than the random case, since all of the n_t(r) observations have a value greater than n_t. The same is observed in all of the incremental clustering experiments. This shows that the clusters are not an artifact of the C²ICM algorithm, but are indeed valid.

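The baseline test above can be sketched as follows, assuming a cluster assignment per document and a relevant-document list per query. Random structures with the same cluster-size distribution are obtained by shuffling the assignment, and target clusters are those containing at least one relevant document. All names and data here are hypothetical.

```python
import random

def n_target(assign, relevant_docs):
    """Number of target clusters: distinct clusters holding >= 1 relevant document."""
    return len({assign[d] for d in relevant_docs})

def avg_n_target(assign, queries):
    return sum(n_target(assign, rel) for rel in queries) / len(queries)

def random_baseline(assign, queries, trials=1000, seed=42):
    """Average target-cluster counts over random clusterings of identical cluster sizes."""
    rng = random.Random(seed)
    labels = list(assign)
    samples = []
    for _ in range(trials):
        rng.shuffle(labels)   # random structure, same size distribution
        samples.append(avg_n_target(labels, queries))
    return samples

# Toy data: assign[i] is the cluster of document i; each query lists relevant doc ids.
assign = [0, 0, 1, 1, 2, 2, 3, 3]
queries = [[0, 1, 2], [4, 5], [0, 6, 7]]

nt = avg_n_target(assign, queries)
baseline = random_baseline(assign, queries)
# The structure looks valid if nt falls below (nearly) all baseline samples.
print(nt, min(baseline), max(baseline))
```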
5.5 Information Retrieval Experiments

In this section we assess the effectiveness of C²ICM by comparing its CBR performance with that of C³M. In [10] it is shown that C³M is 15.1 to 63.5 (with an average of 47.5) percent better than four other clustering algorithms in CBR using the INSPEC database. These methods are the single link, complete link, average link, and Ward methods [12].

Table VII. Comparison of C²ICM and Random Clustering in Terms of Average Number of Target Clusters for All Queries

Matrix No.      1        2        3        4        5
n_t         23.558   23.597   23.639   23.831   24.143
n_t,r       30.436   30.418   30.413   30.506   30.630

It is also shown that the CBR performance of C³M is compatible with a very demanding (in terms of storage and CPU time) implementation of a complete link method [28]. In this section we want to show that the CBR behavior of C²ICM is as effective as that of C³M, and therefore better than the other clustering methods. In the experiments, m_1 is taken as 4014, and the clustering structure obtained after the fifth increment is used in the retrieval experiments (see Table III). Notice that this clustering structure is the outcome of the most stringent conditions of our experimental environment.

5.5.1 Evaluation Approach. In this study, for CBR, we use the best-match retrieval policy. That is, in best-match CBR, the clusters are first ranked according to the similarity of their centroids with the user query. Then, n_s clusters are selected and d_s documents

of the selected clusters are chosen according to their similarity to the query. The selected clusters and documents must have nonzero similarity to the query. Typically, a user evaluates the first 10 to 20 documents returned by the system and after that is either satisfied or reformulates the query. In this study, d_s values of 10 and 20 are used. The effectiveness measures used in this study are the average precision¹ for all queries and the total number of queries with no relevant documents retrieved, Q.

5.5.2 Query Matching. For query matching there are several matching functions, depending on the term-weighting components of document and query terms. Term weighting has three components: the term frequency component (TFC), the collection frequency component (CFC), and the normalization component (NC) [23]. The weights of the terms of a document and a query (denoted by w_dj and w_qj, 1 ≤ j ≤ n) are obtained by multiplying the respective weights of these three weighting components. After obtaining the term weights, the matching function for a document and a query is defined as the following vector product:

    \text{similarity}(d, q) = \sum_{k=1}^{n} w_{dk} \times w_{qk}.
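As a minimal sketch of the two effectiveness measures just defined, the code below computes precision over the first d_s documents of each ranked result list, averages it over the queries, and counts Q as the queries whose top d_s contain no relevant document. The inputs and function names are assumptions.

```python
def precision_at(ranked, relevant, ds):
    """Fraction of the first ds retrieved documents that are relevant."""
    top = ranked[:ds]
    return sum(1 for d in top if d in relevant) / len(top)

def evaluate(runs, ds=10):
    """runs: list of (ranked result list, set of relevant doc ids) per query."""
    precisions = [precision_at(r, rel, ds) for r, rel in runs]
    q = sum(1 for p in precisions if p == 0)   # queries retrieving no relevant docs
    return sum(precisions) / len(precisions), q

runs = [([1, 2, 3, 4], {2, 9}), ([5, 6, 7, 8], {0})]
print(evaluate(runs, ds=4))   # (0.125, 1)
```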

Salton and Buckley [23] obtained 287 unique term-weight combinations (assignments). In their study, six of these combinations are recommended due to their superior IR performance.

¹ Precision is defined as the ratio of the number of retrieved relevant documents to the number of retrieved documents.

Table VIII. Characteristics of the Queries

No. of queries: 77
Avg. no. of terms per query: 15.82
Avg. no. of relevant docs per query: 33.03
Total no. of relevant documents: 2543
No. of distinct docs retrieved: 1940
No. of distinct terms for query definition: 577

In this study we use these six matching functions in addition to the cosine matching function. Each of these matching functions enables us to test the performance of C²ICM under a different condition. For brevity, the definition of the matching functions is skipped. In this study the cosine similarity function is referred to as TW1 (term weighting 1) and the other six are referred to as TW2 through TW7. In [10], they are again referred to as TW1 through TW7; in [23] they are, respectively, defined as (txc.txx), (tfc.nfx), (tfc.tfx), (tfc.bfx), (nfc.nfx), (nfc.tfx), and (nfc.bfx). In a given experiment the same matching function is used both for cluster and document selection.
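The following sketch illustrates the vector-product matching above with one plausible tf-based, cosine-normalized weighting; it is a generic member of this family of functions, not one of the specific TW1 through TW7 combinations of [23], and the weighting chosen here is an assumption.

```python
import math

def cosine_weights(tf):
    """Hypothetical tfc-style weighting: raw term frequency, cosine-normalized."""
    norm = math.sqrt(sum(f * f for f in tf.values()))
    return {term: f / norm for term, f in tf.items()}

def similarity(doc_weights, query_weights):
    """similarity(d, q) = sum_k w_dk * w_qk over the shared terms."""
    shared = doc_weights.keys() & query_weights.keys()
    return sum(doc_weights[t] * query_weights[t] for t in shared)

doc_tf = {"cluster": 3, "search": 1, "index": 2}
query_tf = {"cluster": 1, "search": 1}

score = similarity(cosine_weights(doc_tf), cosine_weights(query_tf))
print(round(score, 3))
```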

5.5.3 Retrieval Environment: Queries and Centroids. The query characteristics of INSPEC are presented in Table VIII. The query terms are weighted; the weight information is used whenever needed. The cluster representatives (centroids) are constructed using the terms with the highest total number of occurrences within the documents of a cluster. The maximum centroid length (i.e., the number of different terms in a centroid), which is a parameter, is set to 250. This value is just 1.729% of the total number of distinct terms used for the description of the database. In our previous research we showed that, after some point, an increase in centroid length does not increase the effectiveness of IR, and the above centroid length is observed to be suitable for INSPEC [10]. Similar results are also obtained for hierarchical clustering of various document databases [28].

The characteristics of the centroids produced for C³M and C²ICM are very similar to each other, as shown in Table IX. In this table, x_C indicates the average number of distinct terms per centroid; %n indicates the percentage of terms that appear in at least one centroid; t_gc indicates the average number of centroids per term for the terms that appear in at least one centroid; and %D indicates the total size of the centroid vectors as a percentage of t in the corresponding D matrix. The storage cost of the centroids is less than one third of that of the D matrix (refer to column %D). This is good in terms of search and storage efficiency. Actually, if we had used all the terms of each cluster, the value of %D would have been around 50 and 52, respectively, for C³M and C²ICM. However, as stated earlier, we know that longer centroids do not result in superior retrieval effectiveness. With centroid length 250, the percentage of cluster terms used in their centroids, 100 × (x_C / average number of distinct terms per cluster) = %x_C, is 51 for both algorithms.
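A small sketch of this centroid construction, assuming each document is represented as a term-frequency dictionary: the centroid keeps the (at most) max_len terms with the highest total number of occurrences in the cluster. The function name and data are hypothetical.

```python
from collections import Counter

def build_centroid(cluster_docs, max_len=250):
    """Centroid = the max_len terms with the highest total occurrences in the cluster."""
    totals = Counter()
    for doc in cluster_docs:   # doc: {term: frequency}
        totals.update(doc)
    return dict(totals.most_common(max_len))

cluster = [
    {"retrieval": 4, "cluster": 2, "seed": 1},
    {"retrieval": 1, "index": 3},
]
print(build_centroid(cluster, max_len=3))
# {'retrieval': 5, 'index': 3, 'cluster': 2}
```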

Table IX. Characteristics of the Centroids (the second row is taken from [10])

Table X. Effectiveness of Reclustering (R) and Incremental Clustering (I), for n_s = 50 (figures for R are taken from [10])

5.5.4 Retrieval Effectiveness. In this section we compare the CBR effectiveness of C²ICM and C³M using the matching functions defined in Section 5.5.2 and show that the former is as effective as the latter, which is known to be very effective for CBR.

For the implementation of the best-match CBR, we first have to decide on the number of clusters to be selected, n_s. In the best-match CBR process, the retrieval effectiveness increases up to a certain "saturation" point for n_s; after this point, it either remains the same or increases very slowly [21, p. 376]. In our experiments the saturation point is observed for n_s = 50, i.e., when 11% of the clusters are selected (n_c = 475). This percentage is typical for best-match CBR [10, Figure 6]. The results of the IR experiments for n_s = 50 are shown in Table X. The first and second rows of each effectiveness measure (precision, Q) are for d_s values of 10 and 20, respectively.

Table X shows that the Q values of reclustering (R) and incremental clustering (I) are comparable. Similarly, the precision observations for both algorithms are very close to each other. The maximum difference is observed for TW1: the CBR precision of C²ICM is 3.5% lower than that of C³M (with d_s = 10). On the other hand, for TW2 and TW3 with d_s = 10, the precision of incremental clustering is slightly better than that of reclustering. All the differences are less than 10%, and can therefore be considered insignificant [2]. The results of this section indicate that the incremental clustering algorithm C²ICM provides a retrieval effectiveness comparable with that of reclustering using C³M; thus it is superior to the algorithms that are outperformed by C³M.

Given that the foundation of the clustering methodology is CC (with the seeds that it produces), these seeds are also used as cluster centroids. This approach represents a cluster with a typical member, and is similar to the

method defined in [25]. As expected, the use of a typical member (in this case the seed) of a cluster as its representative is less effective (about 18% less in terms of precision). However, one direct advantage of this approach is that CBR becomes as efficient as FS, since it is based on inverted index searches; moreover, in a dynamic environment, it completely eliminates the overhead of centroid maintenance.

5.6 Efficiency Considerations

In this discussion only the more crucial efficiency component, time, is considered. The storage efficiency of the retrieval environment is analyzed in [6]. As stated before, retrieval effectiveness reaches its saturation when n_s = 50. In other words, for INSPEC, 11% (n_c = 475) of the clusters are selected, and about the same percentage of the documents is considered for retrieval. As we have stated previously, this percentage is typical for best-match CBR. However, it might be considerably greater than that of the case of hierarchical searches. The exact number of documents matched during hierarchical CBR is unavailable [12, 27, 28], but as shown in the following, the method used in this study is much more efficient than the top-down search [27, 28] that we used in our effectiveness comparison. We are unable to compare our results with those of the bottom-up search strategies reported in [12], since no efficiency figure is available for them. Hierarchical searches usually involve the matching of query vectors with many centroid and document vectors that have little in common with the queries [22]. For example, the top-down search of the INSPEC database is performed on the complete link tree structure containing 11,878 clusters in 16 levels [27, p. 60].

The best-match CBR can be implemented very efficiently, independent of both the number of clusters to be selected (n_s) and the number of documents to be retrieved (d_s), as shown below. In the conventional implementation of best-match CBR (for brevity we use CBR to indicate best-match CBR), query vectors are first matched with the inverted indexes of the centroids (IC) and then with the document vectors (DV) of the members of the selected clusters. This implementation of CBR is referred to as ICDV because of its search components. In ICDV, the query vectors are only matched with the centroids that have common terms with the query vectors, but the same is not true for the DV component. Notice that the generation of separate inverted indexes for the members of each cluster is impractical. However, we can still introduce a new implementation strategy and exploit the efficiency of the inverted index search (IIS) of the complete database in CBR. In this new approach, which is referred to as ICIIS because of its components, the clusters are again selected using IC; the documents of the selected clusters are then ranked using the results of the IIS performed on the complete database. By definition, the efficiency of the ICIIS approach is independent of n_s and d_s. The ICIIS strategy contains some unproductive computation in its IIS component, since only a subset of the database is in the selected clusters. This drawback is offset by the shortness of the query vectors: in IR it is known that query vectors are usually short [10, 14, 23].
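The ICIIS idea can be sketched as follows, under assumed in-memory structures: one inverted index over the centroids (the IC step), one over the complete document collection (the IIS step), and a document-to-cluster map used to keep only the documents of the selected clusters. The names, such as iciis_search, and the toy data are hypothetical; a real system would keep the postings on disk.

```python
from collections import defaultdict

def score_with_inverted_index(index, query):
    """One pass over the postings of the query terms (the IIS step)."""
    scores = defaultdict(float)
    for term, q_w in query.items():
        for obj_id, w in index.get(term, ()):
            scores[obj_id] += q_w * w
    return scores

def iciis_search(doc_index, centroid_index, doc_cluster, query, ns=2, ds=3):
    # IC step: rank clusters through the centroid inverted index.
    c_scores = score_with_inverted_index(centroid_index, query)
    selected = {c for c, _ in sorted(c_scores.items(), key=lambda x: -x[1])[:ns]}
    # IIS step: score the whole collection, then keep docs of selected clusters.
    d_scores = score_with_inverted_index(doc_index, query)
    hits = [(d, s) for d, s in d_scores.items() if doc_cluster[d] in selected]
    return sorted(hits, key=lambda x: -x[1])[:ds]

# Toy data: term -> [(doc or cluster id, weight)], plus a doc -> cluster map.
doc_index = {"cluster": [(0, 0.8), (1, 0.5)], "search": [(1, 0.6), (2, 0.9)]}
centroid_index = {"cluster": [("A", 0.7)], "search": [("A", 0.4), ("B", 0.9)]}
doc_cluster = {0: "A", 1: "A", 2: "B"}

print(iciis_search(doc_index, centroid_index, doc_cluster,
                   {"cluster": 1.0, "search": 0.5}, ns=1))
```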

Table XI. Average Query Processing Time Using Different Retrieval Techniques

Retrieval Technique                               Average Query Processing Time (s)
IIS (FS)                                           1.53
ICIIS (best-match CBR)                             2.41
ICDV (best-match CBR), n_s = 50                   13.07
Top-down search on Complete Link Hierarc. [28]     5.31

The efficiency of the retrieval techniques defined in this section is measured in terms of the average query processing time; the results of the experiments are shown in Table XI. The query processing times are obtained using the matching function TW2 with the maximum centroid length of 250, and the top-down search time is taken from [28]. The experiments are performed on the clustering structure generated by C²ICM; the average values for the clustering structures generated by C³M and C²ICM are very close to each other and, due to rounding, are exactly the same.

The average query processing time of IIS, i.e., of FS, is 1.53s. The ICIIS and ICDV implementations of best-match CBR require, respectively, 2.41s and 13.07s, and the top-down search on the complete link hierarchy requires 5.31s. In other words, CBR implemented with ICIIS is almost as efficient as FS and much more efficient than both ICDV (which requires 442% more processing time) and the top-down search. The efficiency of ICIIS stems from the shortness of the query vectors: the query vectors are matched only with the centroid terms they contain, and the effect of a longer centroid length on the search time is negligible (less than 5%); the same is almost true for ICDV. Furthermore, the search model of [6] shows that this retrieval environment can easily accommodate much larger databases, such as a database of one million documents.

6. CONCLUSION

Clustering is useful for both searching and browsing of very large document databases. The periodic updating of clusters is required due to the dynamic nature of document databases. In this study an algorithm for incremental clustering, C²ICM, has been introduced. Its complexity analysis, a cost comparison with reclustering, and an analytical analysis of the expected clustering behavior in dynamic IR environments are provided. The experimental observations show that the algorithm creates an effective and efficient retrieval environment; it is cost effective with respect to reclustering and can be used for many increments of various sizes.

ACKNOWLEDGMENTS

The author would like to acknowledge the constructive criticism and helpful suggestions made by the three anonymous referees and R. B. Allen. Thanks also to N. D. Drochak II and to D. J. Anderson for programming some parts of the experimental environment, to E. A. Ozkarahan for his comments on an earlier draft of this paper, and to J. M. Patton for helpful discussion.

REFERENCES

1. ANDERBERG, M. R. Cluster Analysis for Applications. Academic Press, New York, 1973.
2. BELKIN, N. J., AND CROFT, W. B. Retrieval techniques. Annu. Rev. Inf. Sci. Technol. 22 (1987), 109-145.
3. BUCKLEY, C., AND LEWIT, A. F. Optimization of inverted vector searches. In Proceedings of the 8th Annual International ACM-SIGIR Conference (Montreal, Que., June 1985). ACM, New York, 1985, 97-110.
4. CAN, F. Experiments of incremental clustering. In Proceedings of the Canadian Conference on Electrical and Computer Engineering (Montreal, Que., Sept. 1989). IEEE, 572-575.
5. CAN, F. Validation of clustering structures for dynamic information retrieval. Miami Univ., Oxford, Oh., 1990.
6. CAN, F. On the efficiency of best-match cluster searches. Working Paper 91-002, Dept. of Systems Analysis, Miami Univ., Oxford, Oh., Aug. 1991. (Submitted for publication.)
7. CAN, F., AND DROCHAK II, N. D. Incremental clustering for dynamic document databases. In Proceedings of the 1990 Symposium on Applied Computing (Fayetteville, Ark., April 1990). IEEE, Los Alamitos, Calif., 61-67.
8. CAN, F., AND OZKARAHAN, E. A. A dynamic cluster maintenance system for information retrieval. In Proceedings of the 10th Annual International ACM-SIGIR Conference (New Orleans, La., June 1987). ACM, New York, 1987, 123-131.
9. CAN, F., AND OZKARAHAN, E. A. Dynamic cluster maintenance. Inf. Process. Manage. 25, 3 (1989), 275-291.
10. CAN, F., AND OZKARAHAN, E. A. Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Trans. Database Syst. 15, 4 (Dec. 1990), 483-517.
11. DIALOG. Dialog Database Catalog. Dialog Information Services, Inc., 1989.
12. EL-HAMDOUCHI, A., AND WILLETT, P. Comparison of hierarchic agglomerative clustering methods for document retrieval. Comput. J. 32, 3 (June 1989), 220-227.
13. FALOUTSOS, C. Access methods for text. ACM Comput. Surv. 17, 1 (Mar. 1985), 49-74.
14. GRIFFITHS, A., LUCKHURST, H. C., AND WILLETT, P. Using interdocument similarity information in document retrieval systems. J. Am. Soc. Inf. Sci. 37, 1 (1986), 3-11.
15. HALL, J. L. Online Bibliographic Databases: A Directory and Sourcebook, 4th ed. Aslib, Great Britain, 1986.
16. HEAPS, H. S. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York, 1978.
17. HODGES, J. L., AND LEHMANN, E. L. Basic Concepts of Probability and Statistics. Holden-Day, San Francisco, Calif., 1964.
18. JAIN, A. K., AND DUBES, R. C. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, N.J., 1988.
19. JARDINE, N., AND VAN RIJSBERGEN, C. J. The use of hierarchical clustering in information retrieval. Inf. Storage Retrieval 7 (1971), 217-240.
20. NIELSEN, J. Hypertext and Hypermedia. Academic Press, San Diego, Calif., 1990.
21. SALTON, G. Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, N.J., 1975.
22. SALTON, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, Mass., 1989.
23. SALTON, G., AND BUCKLEY, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5 (1988), 513-523.
24. SALTON, G., AND WONG, A. Generation and search of clustered files. ACM Trans. Database Syst. 3, 4 (Dec. 1978), 321-346.
25. VAN RIJSBERGEN, C. J. The best-match problem in document retrieval. Commun. ACM 17, 11 (Nov. 1974), 648-649.
26. VAN RIJSBERGEN, C. J. Information Retrieval, 2nd ed. Butterworths, London, 1979.
27. VOORHEES, E. M. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. Ph.D. dissertation, Dept. of Computer Science, Cornell Univ., Ithaca, N.Y., 1986.
28. VOORHEES, E. M. The efficiency of inverted index and cluster searches. In Proceedings of the 9th Annual International ACM-SIGIR Conference (Pisa, Italy, Sept. 1986). ACM, New York, 164-174.
29. WILLETT, P. Recent trends in hierarchic document clustering: A critical review. Inf. Process. Manage. 24, 5 (1988), 577-597.
30. YAO, S. B. Approximating block accesses in database organizations. Commun. ACM 20, 4 (April 1977), 260-261.
31. YU, C. T., AND CHEN, C. H. Adaptive document clustering. In Proceedings of the 8th Annual International ACM-SIGIR Conference (Montreal, Que., June 1985). ACM, New York, 1985, 197-203.

Received December 1990; revised August 1992; accepted November 1992
