Incremental Clustering for Dynamic Information Processing

FAZLI CAN
Miami University

Clustering of very large document databases is useful for both searching and browsing. The periodic updating of clusters is required due to the dynamic nature of databases. An algorithm for incremental clustering is introduced. The complexity and cost analysis of the algorithm together with an investigation of its expected behavior are presented. Through empirical testing it is shown that the algorithm achieves cost effectiveness and generates statistically valid clusters that are compatible with those of reclustering. The experimental evidence shows that the algorithm creates an effective and efficient retrieval environment.

Categories and Subject Descriptors: H.2.2 [Database Management]: Physical Design—access methods; H.3.2 [Information Storage and Retrieval]: Information Storage—file organization; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—clustering, retrieval models, search process

General Terms: Algorithms, Design, Performance

Additional Key Words and Phrases: Best-match cluster search, cluster validity, cover coefficient, dynamic information retrieval environment, retrieval effectiveness, retrieval efficiency

1. INTRODUCTION
This work differs from our previous research [8, 9] because it studies multiple maintenance steps with a dynamic indexing vocabulary. Furthermore, the algorithm is based on a new data structure yielding very low computational complexity with negligible storage overhead. This makes C2ICM (Cover-Coefficient-based Incremental Clustering Methodology) suitable for very large dynamic document databases (for examples see Table I). Section 2 reviews the clustering and cluster maintenance problem. In Section 3 the base concept of the incremental clustering algorithm, the cover coefficient (CC) concept, and its relation to clustering are described. Section 4 introduces the incremental clustering methodology, C2ICM, its complexity and cost analysis, and an analytical investigation of its expected behavior in dynamic database environments. Section 5 covers the experiment design and evaluation; the experiments use our experimental INSPEC database of 12,684 documents and 77 queries.
This work was supported by a MUCFR grant. Author's address: Department of Systems Analysis, Miami University, Oxford, OH 45056; email: fc74sanf@miamiu.bitnet.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1993 ACM 1046-8188/93/0400-0143 $01.50
ACM Transactions on Information Systems, Vol. 11, No. 2, April 1993, Pages 143-164.
Table I. File Size and Update Characteristics of Some Commercial Document Databases [11]

Document Database       File Size   Updates
Computer Database         330,091   Every two weeks
Confer. Papers Index    1,309,711   Six times per year
ERIC                      660,868   Monthly
McGraw-Hill News           79,640   Every 15 minutes
Medline                 6,200,000   Approx. 25,000/month

2. CLUSTERING AND CLUSTER MAINTENANCE

Three basic categories of clustering methodologies are graph-theoretical, single-pass, and iterative algorithms. A typical graph-theoretical algorithm applies a threshold to a similarity matrix representing the similarity between documents; documents that are closely related are then connected in a graph. Depending on the connectivity structure required for the documents of a cluster, the approach is known, for example, as "single link," "group average," or "complete link" clustering, and the clusters are formed from the resulting graph.

In the single-pass approach, each document is processed once and is either assigned to an existing cluster or initiates a new one. In seed-oriented clustering, cluster seeds are determined first (e.g., individual documents); each seed forms the initiator of a cluster to be formed, and the remaining documents are assigned to the cluster of the seed to which they are most similar. The number of clusters is usually provided as an input parameter. Iterative algorithms try to embellish an initial clustering structure according to an optimization function [1].

The cluster maintenance problem deals with the modification of a clustering structure due to the addition of new documents or the deletion of old documents (or both). Most clustering algorithms are developed for static databases and are unsuitable for growing, dynamic document databases; an examination of the literature reveals very few cluster maintenance algorithms [26, p. 58]. In general, however, seed-oriented algorithms can also be used for cluster maintenance, since the seeds obtained during cluster generation can be used to cluster the documents of the updated database.
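To make the seed-oriented assignment step concrete, here is a minimal sketch; the helper names, the seed selection, and the similarity function are assumptions for illustration, not the specific procedure of any of the algorithms discussed above.

```python
# Illustrative sketch of seed-oriented assignment (hypothetical helpers;
# seed selection and the similarity measure are assumed, not specified here).

def assign_to_seeds(doc_vectors, seed_ids, similarity):
    """Assign each nonseed document to the cluster of its most similar seed."""
    clusters = {s: [s] for s in seed_ids}   # each seed initiates one cluster
    for doc_id, vec in doc_vectors.items():
        if doc_id in clusters:
            continue                         # seeds already anchor their clusters
        best_seed = max(seed_ids,
                        key=lambda s: similarity(vec, doc_vectors[s]))
        clusters[best_seed].append(doc_id)
    return clusters

# Example with a trivial dot-product similarity over {term: weight} vectors.
dot = lambda u, v: sum(w * v.get(t, 0.0) for t, w in u.items())
docs = {"d1": {"a": 1}, "d2": {"b": 1}, "d3": {"a": 1, "b": 1}}
print(assign_to_seeds(docs, ["d1", "d2"], dot))  # d3 ties; max keeps d1
```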
The cluster-splitting approach to maintenance treats a new document as a query, compares it with the existing clusters, and assigns it to the cluster that yields the best match [24]. The major problems with this approach [21, p. 494] are: (1) the generated clustering structure depends on the order of processing of the documents; and (2) it starts to degenerate with relatively small increases (such as 25% to 50%) in the size of the database.

Adaptive clustering is based on user queries. The method introduced in [31] assigns a random position on the real line (the hypothetical line from negative infinity to positive infinity) to each document. For each query, the positions of the documents relevant to it are shifted closer to each other. To ensure that all documents are not bunched up together, the centroid of these documents is moved away from the centroid of the entire database. This approach has some disadvantages: (1) the clusters depend on the order of query processing; (2) the definition of clusters (separation between the members of different clusters) requires an input parameter, and the appropriate value for this parameter is database dependent; and (3) convergence can be slow. The applicability of such algorithms in dynamic environments is not yet known.

For the maintenance of clusters generated by graph-theoretical methods (single link, group average, complete link), similarity values need to be used, and this may be very costly. The update cost of the single-link method is reasonable; its IR effectiveness is known to be unsatisfactory, however [12, 27, 29]. The time and space requirements of the group average and the complete link approaches are prohibitive, since they require the complete knowledge of similarities among old documents [27, pp. 29-30].

Another possibility for handling database growth is to use a cheap clustering algorithm of m log m complexity to recluster the entire database after document additions and deletions. However, even though some algorithms have a theoretical complexity of m log m, experimental evaluations have shown that, due to the large proportionality constant, the actual time taken for clustering is greater than that of an m^2 algorithm [26, p. 59]. These points imply that an efficient maintenance algorithm would be better than reclustering the whole database.
3. PRELIMINARY CONCEPTS
The base of C2ICM, the CC (cover-coefficient) concept, provides a measure of similarities among documents (see also [10]). This concept is first used to determine the number of clusters and the cluster seeds, and then to assign nonseed documents to the clusters initiated by the seed documents, generating a partitioned clustering structure. (To aid in reading, the definition of the notation introduced in Sections 3 and 4 is provided in Table II.) The CC concept determines document relationships in a multidimensional term space by a two-stage probability experiment. It maps an m x n (document by term) D matrix [10] into an m x m cover coefficient (C) matrix. The individual entries of C, c_ij (1 <= i, j <= m), indicate the probability of selecting any term of d_i from d_j, and are defined as follows:

$$c_{ij} = \alpha_i \sum_{k=1}^{n} d_{ik}\,\beta_k\,d_{jk}, \qquad 1 \le i, j \le m$$
where α_i and β_k are the reciprocals of the ith row sum and the kth column sum of D, respectively. To apply the above formula, each document must have at least one term and each term must appear in at least one document. By definition, the entries of the D matrix can be any nonnegative real number.

Table II. Notation Introduced in Sections 3 and 4

Symbol        Definition / Meaning
c_ij          Extent to which d_i is covered by d_j (according to the CC concept)
d_c           Average cluster size, max(1, m/n_c)
D_r           Set of documents to be clustered during an increment
k             No. of incremental updates
m (D)         No. of documents (current set of documents)
m_1           Initial no. of documents before all increments
m' (D_m')     No. of added documents in an increment (set of such documents), D_m' ⊆ D_r
m'' (D_m'')   No. of deleted documents in an increment (set of such documents)
n (n_k)       No. of terms (no. of terms after the kth increment)
n_c (n_c,k)   No. of clusters (no. of clusters after the kth increment), 1 <= n_c <= min(m, n)
p_i           Seed power of d_i
r             |D_r|/m' (r >= 1)
A             No. of reclustered documents in k increments
t             Total no. of nonzero entries in D
t_g           Average no. of documents per term, term generality (t/n)
t_gs          Average no. of seed documents per term
x_d           Average no. of distinct terms per document, depth of indexing (t/m)
α_i           Reciprocal of the ith row sum of D
β_j           Reciprocal of the jth column sum of D
δ_i           Decoupling coefficient of d_i (δ_i = c_ii, 0 < δ_i <= 1)
ψ_i           Coupling coefficient of d_i (ψ_i = 1 - δ_i, 0 <= ψ_i < 1)
δ'_j, ψ'_j    Like δ_i and ψ_i, but for t_j
The two-stage probability experiment behind c_ij can be better explained by an intuitive analogy [17, p. 94]. Suppose that each urn contains many balls of different colors (d_i contains many terms), and also suppose that we can have many urns (the terms of d_i may appear in many documents). Then what is the probability of selecting a ball of a particular color (i.e., d_j)? The experimental computation of this probability first requires the selection of an urn at random, and second involves an attempt at drawing the ball of that particular color from the selected urn. In terms of the D matrix, in the first stage, from the terms (urns) of d_i, we choose any term, t_k, at random; in c_ij this random selection is provided by the product α_i x d_ik. In the second stage, from the selected term (urn), t_k, we try to draw d_j (the ball of that particular color); in c_ij the product β_k x d_jk signifies the probability of selecting d_j. In this experiment, we have to consider the contribution of each urn (each term of d_i) to the selection probability of a ball of that particular color (d_j); that is why we have the summation over all terms in the definition of c_ij. Accordingly, this two-stage experiment reveals the probability of selecting any of the terms of d_i from d_j. Intuitively this is a measure of similarity, since documents that have many common terms will have a high probability of being selected from each other, because they appear together in most term lists. The experiment is also affected by the distribution of terms in the whole database: if d_i and d_j share rare terms, this increases c_ij.
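The definition of c_ij translates directly into a few lines of code. The following sketch (my illustration, not code from the paper) computes the C matrix for a small binary D matrix, checks that each row of C sums to 1 (a property noted below), and prints the diagonal entries c_ii, which are introduced below as the decoupling coefficients.

```python
import numpy as np

def cover_coefficients(D):
    """m x n document-by-term matrix -> m x m cover-coefficient matrix C.
    Requires every document to have a term and every term to occur in some
    document, so that all row and column sums of D are nonzero."""
    D = np.asarray(D, dtype=float)
    alpha = 1.0 / D.sum(axis=1)               # reciprocal row sums
    beta = 1.0 / D.sum(axis=0)                # reciprocal column sums
    # c_ij = alpha_i * sum_k d_ik * beta_k * d_jk
    return (alpha[:, None] * D) @ (D * beta).T

D = [[1, 1, 0],        # d1: terms t1, t2
     [1, 0, 1],        # d2: terms t1, t3
     [0, 1, 1]]        # d3: terms t2, t3
C = cover_coefficients(D)
assert np.allclose(C.sum(axis=1), 1.0)        # rows of C sum to 1
print(np.diag(C))                             # c_ii values (decoupling)
```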
Some characteristics of the C matrix are as follows: (1) 0 < c_ii <= 1; (2) 0 <= c_ij < 1 for i ≠ j; (3) Σ_j c_ij = 1; (4) c_ii >= c_ij, and in general c_ij ≠ c_ji; (5) c_ii = c_jj = c_ij = c_ji iff d_i and d_j are identical; (6) d_i is distinct iff c_ii = 1.
Thereby, from (4) it is seen that c_ij measures an asymmetric similarity between d_i and d_j. In the C matrix, a document having terms in common with several documents will have a c_ii value considerably less than 1. Conversely, if a document has terms in common with very few documents, then its c_ii value will be close to 1. The c_ij entry indicates the extent to which d_i is "covered" by d_j. Identical documents are covered equally by all other documents; if the document vector of d_i is a subset of the document vector of d_j, then d_i covers itself to the same extent that d_j covers d_i. It is also true that if two documents have no terms in common, then they will not cover each other at all, and the corresponding c_ij and c_ji values will be zero.

Because higher values of c_ii result from document d_i having terms found in very few other documents, we let δ_i = c_ii and call δ_i the decoupling coefficient of d_i. It is a measure of how different d_i is from all other documents. Conversely, we call ψ_i = 1 - δ_i the coupling coefficient of d_i. Similar to documents, the relationship between terms can be defined via the C' matrix of size n x n, whose entries are defined as follows:

$$c'_{ij} = \beta_i \sum_{k=1}^{m} d_{ki}\,\alpha_k\,d_{kj}, \qquad 1 \le i, j \le n$$

The case n_t > n_tr suggests that the tested clustering structure is invalid, since it is unsuccessful in placing the documents relevant to the same query into a smaller number of clusters than that of the random case (here n_t denotes the average number of target clusters obtained with the tested clustering structure, and n_tr the corresponding number for random clustering; see Table VII). The case n_t < n_tr is an indication of the validity of the clustering structure; however, to decide validity, one must show that n_t is significantly less than n_tr.

Table VII gives the n_t and n_tr values for all incremental clustering experiments. The characteristics of the queries are given in Section 5.5.3. To show that n_t is significantly less than n_tr, we must know the probability density function (baseline distribution) of n_tr. For this purpose, we generated 1000 random clustering structures for each clustering structure produced by C2ICM and obtained the average number of target clusters for each random case, n_t(r). For example, for matrix 1 of the INSPEC database, the minimum and maximum n_t(r) values, respectively, are 29.623 and 30.792 (the standard deviation is 0.175). This shows that the n_t value of 23.558 for this experiment is significantly less than the random case, since all of the n_t(r) observations have a value greater than n_t. The same is observed in all of the incremental clustering experiments. This shows that the clusters are not an artifact of the C2ICM algorithm, but are indeed valid.
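The baseline just described is straightforward to re-create. The sketch below (an illustration under stated assumptions, not the original experiment code) shuffles the documents among clusters while preserving the cluster size distribution and returns the average number of target clusters for each random trial.

```python
import random

def avg_target_clusters(clusters, relevant_sets):
    """Average, over queries, of the number of clusters holding at least
    one relevant document (the "target clusters")."""
    doc2cl = {d: i for i, cl in enumerate(clusters) for d in cl}
    hits = [len({doc2cl[d] for d in rel}) for rel in relevant_sets]
    return sum(hits) / len(hits)

def random_baseline(clusters, relevant_sets, trials=1000, seed=1):
    """n_t(r) values for `trials` random clusterings with the same sizes."""
    rng = random.Random(seed)
    docs = [d for cl in clusters for d in cl]
    sizes = [len(cl) for cl in clusters]
    values = []
    for _ in range(trials):
        rng.shuffle(docs)
        it, shuffled = iter(docs), []
        for s in sizes:
            shuffled.append([next(it) for _ in range(s)])
        values.append(avg_target_clusters(shuffled, relevant_sets))
    return values  # compare min/max of these against the observed n_t
```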
5.5 Information Retrieval Experiments
In this section we assess the effectiveness of C2ICM by comparing its CBR performance with that of C3M. In [10] it is shown that C3M is 15.1 to 63.5 (with an average of 47.5) percent better than four other clustering algorithms in CBR using the INSPEC database. These methods are the single link, complete link, average link, and Ward methods [12]. It is also shown that the CBR performance of C3M is compatible with a very demanding (in terms of storage and CPU time) implementation of a complete link method [28].
Table VII. Comparison of C2ICM and Random Clustering in Terms of Average Number of Target Clusters for All Queries

Matrix No.      1        2        3        4        5
n_t         23.558   23.597   23.639   23.831   24.143
n_tr        30.436   30.418   30.413   30.506   30.630
In this section we want to show that the CBR behavior of C2ICM is as effective as that of C3M, and therefore better than the other clustering methods. In the experiments, m_1 is taken as 4014, and the clustering structure obtained after the fifth increment is used in the retrieval experiments (see Table III). Notice that this clustering structure is the outcome of the most stringent conditions of our experimental environment.

5.5.1 Evaluation Approach. In this study, for CBR, we use the best-match retrieval policy. That is, in CBR, clusters are first ranked according to the similarity of their centroids with the user query. Then, the n_s best-matching clusters are selected, and d_s documents of the selected clusters are chosen according to their similarity to the query. The selected clusters and documents must have nonzero similarity to the query. Typically, a user evaluates the first 10 to 20 documents returned by the system and after that is either satisfied or the query is reformulated. In this study, d_s values of 10 and 20 are used. The effectiveness measures used in this study are the average precision for all queries (precision is defined as the ratio of the number of retrieved relevant documents to the number of retrieved documents) and the total number of queries with no relevant documents retrieved, Q.

5.5.2 Query Matching. For query matching there are several matching functions, depending on the term-weighting components of document and query terms. Term weighting has three components: the term frequency component (TFC), the collection frequency component (CFC), and the normalization component (NC) [23]. The weights of the terms of a document and a query (denoted by w_dj and w_qj, 1 <= j <= n) are obtained by multiplying the respective weights of these three weighting components. After obtaining the term weights, the matching function for a document and a query is defined as the following vector product:
$$\mathrm{similarity}(d, q) = \sum_{k=1}^{n} w_{dk} \times w_{qk}$$
Salton and Buckley [23] examined 287 unique combinations of term-weight assignments. In their study, six of these combinations are recommended due to their superior IR performance.
Table VIII. Characteristics of the Queries

No. of    Avg. No. of       Avg. No. of Rel.   Total No. of         No. of Distinct   No. of Dist. Terms
Queries   Terms per Query   Docs per Query     Relevant Documents   Docs Retrieved    for Query Def.
77        15.82             33.03              2543                 1940              577
In this study we use these six matching functions in addition to the cosine matching function. Each of these matching functions enables us to test the performance of C2ICM under a different condition. For brevity, the definitions of the matching functions are skipped. In this study the cosine similarity function is referred to as TW1 (term weighting 1) and the other six are referred to as TW2 through TW7. In [10] they are again referred to as TW1 through TW7; in [23] they are, respectively, defined as (txc.txx), (tfc.nfx), (tfc.tfx), (tfc.bfx), (nfc.nfx), (nfc.tfx), and (nfc.bfx). In a given experiment the same matching function is used both for cluster and document selection.
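For illustration, the sketch below evaluates the vector product with one simple stand-in weighting (cosine-normalized term frequencies); the actual TW1-TW7 component combinations are defined in [23] and are not reproduced here.

```python
import math

def cosine_normalize(tf):
    """{term: raw frequency} -> cosine-normalized weights. A stand-in for
    one term-weighting combination; see [23] for the real TFC/CFC/NC parts."""
    norm = math.sqrt(sum(f * f for f in tf.values()))
    return {t: f / norm for t, f in tf.items()}

def vector_product(doc_tf, query_tf):
    """similarity(d, q) = sum over shared terms of w_dk * w_qk."""
    wd = cosine_normalize(doc_tf)
    wq = cosine_normalize(query_tf)
    return sum(w * wq[t] for t, w in wd.items() if t in wq)

print(vector_product({"cluster": 3, "search": 1},
                     {"cluster": 1, "retrieval": 2}))
```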
5.5.3 Retrieval Environment: Queries and Centroids. The query characteristics of INSPEC are presented in Table VIII. The query terms are weighted; the weight information is used whenever needed. The cluster representatives (centroids) are constructed using the terms with the highest total number of occurrences within the documents of a cluster. The maximum centroid length (i.e., the number of different terms in a centroid), which is a parameter, is set to 250. This value is just 1.729% of the total number of distinct terms used for the description of the database. In our previous research we showed that, after some point, an increase in centroid length does not increase the effectiveness of IR, and the above centroid length is observed to be suitable for INSPEC [10]. Similar results are also obtained for hierarchical clustering of various document databases [28].

The characteristics of the centroids produced for C3M and C2ICM are very similar to each other, as shown in Table IX. In this table, x_c indicates the average number of distinct terms per centroid; %n indicates the percentage of terms that appear in at least one centroid; t_gc indicates the average number of centroids per term for the terms that appear in at least one centroid; and %D indicates the total size of the centroid vectors as a percentage of t in the corresponding D matrix. The storage cost of the centroids is less than one third of that of the D matrix (refer to column %D). This is good in terms of search and storage efficiency. Actually, if we had used all the terms of each cluster, the value of %D would have been around 50 and 52, respectively, for C3M and C2ICM. However, as stated earlier, we know that longer centroids do not result in superior retrieval effectiveness. With centroid length 250, the percentage of cluster terms used in their centroids, 100 x (x_c / average number of distinct terms per cluster) = %x_c, is 51 for both algorithms.
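The centroid construction described above reduces to keeping the highest-total-frequency terms of a cluster, capped at the maximum centroid length (250 in these experiments). A minimal sketch, with the function name my own:

```python
from collections import Counter

def build_centroid(member_docs, max_len=250):
    """member_docs: iterable of {term: frequency} vectors for one cluster.
    Returns the centroid as the max_len terms with the highest total number
    of occurrences within the cluster, weighted by those totals."""
    totals = Counter()
    for tf in member_docs:
        totals.update(tf)
    return dict(totals.most_common(max_len))
```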
Table IX. Characteristics of the Centroids (the second row is taken from [10])
Table X. Effectiveness of Reclustering (R) and Incremental Clustering (I) for n_s = 50 (figures for R are taken from [10])

Effectiveness          TW1        TW2        TW3        TW4        TW5        TW6        TW7
Measure              R     I    R     I    R     I    R     I    R     I    R     I    R     I
Precision, d_s=10  .287  .277 .418  .419 .396  .401 .401  .397 .388  .383 .382  .379 .352  .343
Precision, d_s=20  .230  .226 .336  .329 .323  .314 .329  .320 .310  .310 .313  .311 .288  .283
Q, d_s=10            17    17    6     6    8     7    6     6    5     4    9     8    7     6
Q, d_s=20            13    14    3     2    3     4    3     3    4     2    4     2    4     3
5.5.4 Retrieval Effectiveness. In this section we compare the CBR effectiveness of C2ICM and C3M using the matching functions defined in Section 5.5.2 and show that the former is as effective as the latter, which is known to be very effective for CBR.

For the implementation of best-match CBR, we first have to decide on the number of clusters to be selected, n_s. In the best-match CBR process, the retrieval effectiveness either increases up to a certain "saturation" point or remains the same for increases in n_s; after the saturation point, effectiveness increases very slowly [10, Figure 6]. In our experiments the saturation is observed when n_s = 50, i.e., when 11% of the clusters of the INSPEC database are selected (n_c = 475). This percentage is typical for best-match CBR [21, p. 376].

The results of the IR experiments for n_s = 50 are shown in Table X. The first and second rows of each effectiveness measure (precision, Q) are for d_s values of 10 and 20, respectively. Table X shows that the Q values of reclustering (R) and incremental clustering (I) are comparable. Similarly, the precision observations for the two algorithms are very close to each other. The maximum difference is observed for TW1: the CBR precision of C2ICM is 3.5% lower than that of C3M (with d_s = 10). On the other hand, for TW2 and TW3 with d_s = 10, the precision of incremental clustering is slightly better than that of reclustering. All the differences are less than 10%, and can therefore be considered insignificant [2]. The results of this section indicate that the incremental clustering algorithm C2ICM provides a retrieval effectiveness comparable with that of reclustering using C3M; thus it is superior to the algorithms that are outperformed by C3M.

Given that the foundation of the clustering methodology is CC (with the seeds that it produces), these seeds are also used as cluster centroids. This approach represents a cluster with a typical member, and is similar to the
each effectiveness respectively. Table X shows that Q values of reclustering (R) and incremental clustering (l) are comparable. Similarly, the precision observations for both algorithms are very close to each other. The maximum difference is observed for TW1: the CBR precision of C 2ICM is 3.5% lower than that of C 3M (with d, = 10). On the other hand, for TW2 and TW3 with d, = 10, the precision of incremental clustering is slightly better than that of reclustering. All the differences are less than 109ZZ,and can therefore be considered insignificant [2]. The results of this section indicate that the incremental clustering algorithm C 2ICM provides a retrieval effectiveness quality comparable with that of reclustering using C 3M; thus it is superior to the algorithms that are outperformed by C 3M. Given that the foundation of the clustering methodology is CC (with the seeds that it produces), these seeds are also used as cluster centroids. This approach represents a cluster with a typical member, and is similar to the ACM
Transactions
on Information
Systems,
Vol.
11, No. 2, April
1993
Incremental
Clustering
.
161
method defined in [25]. As expected, the use of a typical member (in this case the seed) of a cluster as its representative is less effective (about 18% less in terms of precision). However, one direct advantage of this approach is that CBR becomes as efficient as FS based on inverted index searches; moreover, in a dynamic environment, it completely eliminates the overhead of centroid maintenance.

5.6 Efficiency Considerations
In this discussion only the more crucial efficiency component, time, is considered. The storage efficiency of the retrieval environment is analyzed in [6]. As stated before, retrieval effectiveness reaches its saturation when n_s = 50. In other words, for INSPEC, 11% (n_c = 475) of the clusters are selected, and about the same percentage of the documents is considered for retrieval. As we have stated previously, this percentage is typical for best-match CBR. However, it might be considerably greater than that of the case of hierarchical searches. The exact number of documents matched during hierarchical CBR is unavailable [12, 27, 28], but as shown in the following, the method used in this study is much more efficient than the top-down search [27, 28] that we used in our effectiveness comparison. We are unable to compare our results with those of the bottom-up search strategies reported in [12], since no efficiency figures are available for them. Hierarchical searches usually involve the matching of query vectors with many centroid and document vectors that have little in common with the queries [22]. For example, the top-down search of the INSPEC database is performed on the complete link tree structure containing 11,878 clusters in 16 levels [27, p. 60].

The best-match CBR can be implemented very efficiently, independent of both the number of clusters to be selected (n_s) and the number of documents to be retrieved (d_s), as shown below. In the conventional implementation of best-match CBR (for brevity we use CBR to indicate best-match CBR), query vectors are first matched with the inverted indexes of the centroids (IC) and then with the document vectors (DV) of the members of the selected clusters. This implementation of CBR is referred to as ICDV because of its search components. In ICDV, the query vectors are only matched with the centroids that have common terms with the query vectors, but the same is not true for the DV component. Notice that the generation of separate inverted indexes for the members of each cluster is impractical. However, we can still introduce a new implementation strategy and exploit the efficiency of the inverted index search (IIS) of the complete database in CBR. In this new approach, which is referred to as ICIIS because of its components, the clusters are again selected using IC; the documents of the selected clusters are then ranked using the results of the IIS performed on the complete database. By definition, the efficiency of the ICIIS approach is independent of n_s and d_s. The ICIIS strategy contains some unproductive computation in its IIS component, since only a subset of the database is a member of the selected clusters. This drawback is compensated for by the shortness of the query vectors: in IR it is known that query vectors are usually short [10, 14, 23].
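The ICIIS strategy can be sketched as follows; `centroid_iis` and `document_iis` stand for inverted index searches over the centroids and over the complete database (hypothetical helpers, not the paper's code).

```python
def iciis(query, centroid_iis, document_iis, doc2cluster, n_s, d_s):
    """Best-match CBR via ICIIS: select clusters with the centroid inverted
    index (IC), rank documents with one IIS over the whole database, and
    keep only documents that fall in the selected clusters."""
    cluster_scores = centroid_iis(query)          # {cluster_id: score}, IC step
    selected = set(sorted(cluster_scores, key=cluster_scores.get,
                          reverse=True)[:n_s])
    doc_scores = document_iis(query)              # {doc_id: score}, IIS step
    ranked = sorted(((d, s) for d, s in doc_scores.items()
                     if s > 0 and doc2cluster[d] in selected),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:d_s]
```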
Table XI. Average Query Processing Time Using Different Retrieval Techniques

Retrieval Technique                               Average Query Processing Time (s)
IIS (FS)                                           1.53
ICIIS (best-match CBR)                             2.41
ICDV (best-match CBR), n_s = 50                    5.31
Top-down search on complete link hierarchy [28]   13.07
The efficiency of the retrieval techniques defined in this section (IIS, ICIIS, and ICDV) is measured in terms of the average query processing time in our experimental environment. The results are given in Table XI; the top-down search result is taken from [28], where it was measured on the complete link clustering structure of the INSPEC database. The ICIIS and ICDV experiments are performed using the clustering structure generated by C2ICM; the average query processing times obtained with the clustering structures generated by C3M and C2ICM are very close to each other (due to rounding, the values in Table XI are exactly the same).

Table XI shows that the most efficient search is IIS (FS), with an average query processing time of 1.53s. The ICIIS implementation of best-match CBR requires 2.41s; in other words, it is almost as efficient as a full search. ICDV requires 5.31s, and the top-down search requires 13.07s, i.e., 442% more time than ICIIS. The efficiency of ICIIS with respect to ICDV and the top-down search stems from the shortness of the query vectors: an average query contains a small fraction (less than 5%) of the distinct terms of the database, and the IIS component of ICIIS involves only the query terms. Notice also that the efficiency of ICIIS is practically unaffected by a longer centroid length (the maximum centroid length in these experiments is 250), since the selected centroids may not necessarily contain the query terms; the same is almost true for ICDV, whose DV component must match the query with all documents of the selected clusters. Negligible differences aside, the same relative efficiency of the retrieval techniques is observed for the other matching functions, including TW2 and TW6 (for example, an ICDV time of 4.82s instead of 5.31s with a different matching function). In [6] it is shown, using a query processing time model with a similar matching function and query length, that the ICIIS approach can easily accommodate databases in the one million document range.
6. CONCLUSION
Clustering is useful for both searching and browsing of very large document databases. The periodic updating of clusters is required due to the dynamic nature of document databases. In this study an algorithm for incremental clustering, C2ICM, has been introduced. Its complexity analysis, a cost comparison with reclustering, and an analytical analysis of its expected clustering behavior in dynamic IR environments are provided. The experimental observations show that the algorithm creates an effective and efficient retrieval environment; it is cost effective with respect to reclustering and can be used for many increments of various sizes.

ACKNOWLEDGMENTS
The author would like to acknowledge the constructive criticism and helpful suggestions made by the three anonymous referees and R. B. Allen. Thanks also to N. D. Drochak II and D. J. Anderson for programming some parts of the experimental environment, to E. A. Ozkarahan for his comments on an earlier draft of this paper, and to J. M. Patton for helpful discussion.

REFERENCES

1. Anderberg, M. R. Cluster Analysis for Applications. Academic Press, New York, 1973.
2. Belkin, N. J., and Croft, W. B. Retrieval techniques. Annu. Rev. Inf. Sci. Technol. (ARIST) 22 (1987), 109-145.
3. Buckley, C., and Lewit, A. F. Optimization of inverted vector searches. In Proceedings of the 8th Annual International ACM-SIGIR Conference (Montreal, Que., June 1985). ACM, New York, 1985, 97-110.
4. Can, F. Experiments in incremental clustering. In Proceedings of the 1989 Canadian Conference on Electrical and Computer Engineering (Montreal, Que., 1989). IEEE, 572-575.
5. Can, F. Validation of clustering structures in information retrieval. Working paper, Miami Univ., Oxford, Oh., Sept. 1990.
6. Can, F. On the efficiency of best-match cluster searches. Working Paper 91-002, Dept. of Systems Analysis, Miami Univ., Oxford, Oh., Aug. 1991. (Submitted for publication.)
7. Can, F., and Drochak II, N. D. Incremental clustering for dynamic document databases. In Proceedings of the 1990 Symposium on Applied Computing (Fayetteville, Ark., April 1990). IEEE, Los Alamitos, Calif., 1990, 61-67.
8. Can, F., and Ozkarahan, E. A. A dynamic cluster maintenance system for information retrieval. In Proceedings of the 10th Annual International ACM-SIGIR Conference (New Orleans, La., June 1987). ACM, New York, 1987, 123-131.
9. Can, F., and Ozkarahan, E. A. Dynamic cluster maintenance. Inf. Process. Manage. 25, 3 (1989), 275-291.
10. Can, F., and Ozkarahan, E. A. Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Trans. Database Syst. 15, 4 (Dec. 1990), 483-517.
11. DIALOG. Dialog Database Catalog. Dialog Information Services, Inc., 1989.
12. El-Hamdouchi, A., and Willett, P. Comparison of hierarchical agglomerative clustering methods for document retrieval. Comput. J. 32, 3 (June 1989), 220-227.
13. Faloutsos, C. Access methods for text. ACM Comput. Surv. 17, 1 (Mar. 1985), 49-74.
14. Griffiths, A., Luckhurst, H. C., and Willett, P. Using interdocument similarity information in document retrieval systems. J. Am. Soc. Inf. Sci. 37, 1 (1986), 3-11.
15. Hall, J. L. Online Bibliographic Databases: A Directory and Sourcebook, 4th ed. Aslib, Great Britain, 1986.
16. Heaps, H. S. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York, 1978.
17. Hodges, J. L., and Lehmann, E. L. Basic Concepts of Probability and Statistics. Holden-Day, San Francisco, Calif., 1964.
18. Jain, A. K., and Dubes, R. C. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, N.J., 1988.
19. Jardine, N., and van Rijsbergen, C. J. The use of hierarchical clustering in information retrieval. Inf. Storage Retrieval 7 (1971), 217-240.
20. Nielsen, J. Hypertext and Hypermedia. Academic Press, San Diego, Calif., 1990.
21. Salton, G. Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, N.J., 1975.
22. Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, Mass., 1989.
23. Salton, G., and Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5 (1988), 513-523.
24. Salton, G., and Wong, A. Generation and search of clustered files. ACM Trans. Database Syst. 3, 4 (Dec. 1978), 321-346.
25. van Rijsbergen, C. J. The best-match problem in document retrieval. Commun. ACM 17, 11 (Nov. 1974), 648-649.
26. van Rijsbergen, C. J. Information Retrieval, 2nd ed. Butterworths, London, 1979.
27. Voorhees, E. M. The effectiveness and efficiency of agglomerative hierarchical clustering in document retrieval. Ph.D. dissertation, Dept. of Computer Science, Cornell Univ., Ithaca, N.Y., 1986.
28. Voorhees, E. M. The efficiency of inverted index and cluster searches. In Proceedings of the 9th Annual International ACM-SIGIR Conference (Pisa, Sept. 1986). ACM, New York, 1986, 164-174.
29. Willett, P. Recent trends in hierarchical document clustering: A critical review. Inf. Process. Manage. 24, 5 (1988), 577-597.
30. Yao, S. B. Approximating block accesses in database organizations. Commun. ACM 20, 4 (April 1977), 260-261.
31. Yu, C. T., and Chen, C. H. Adaptive document clustering. In Proceedings of the 8th Annual International ACM-SIGIR Conference (Montreal, Que., June 1985). ACM, New York, 1985, 197-203.

Received December 1990; revised August 1992; accepted November 1992