Coefficients for Combining Concept Classes in a Collection* by

Edward A. Fox§, Gary L. Nunn†, Whay C. Lee
Department of Computer Science
562 McBryde Hall
Virginia Polytechnic Institute and State University
Blacksburg, VA 24061 USA

Abstract

This report considers combining information to improve retrieval. The vector space model has been extended so different classes of data are associated with distinct concept types and their respective subvectors. Two collections with multiple concept types are described, ISI-1460 and CACM-3204. Experiments indicate that regression methods can help predict relevance, given query-document similarity values for each concept type. After sampling and transformation of data, the coefficient of determination for the best model was .48 (.66) for ISI (CACM). Average precision for the two collections was 11% (31%) better for probabilistic feedback with all types versus with terms only. These findings may be of particular interest to designers of document retrieval or hypertext systems since the role of links is shown to be especially beneficial.

1. Introduction

A great deal of the work in information retrieval has focussed on exact or partial matching of document representations made up of terms, keywords, or other types of descriptors. According to the taxonomy proposed in [BELK 87], these feature-based methods usually contrast with network approaches where groupings of or relationships between documents are emphasized. Thus, there are commercially available retrieval systems that require Boolean queries to be supplied in order to find documents with the right word combinations, and quite separate systems for citation searching. However, we believe that these and other schemes should and can be unified, in accordance with the following obvious but rarely followed Principle of Combination: Effective integration of more information should lead to better information retrieval.

* This material is based upon work supported by the National Science Foundation under Grant Numbers IST8017589, IST-84 X8877 and IRI-8703580; by the Virginia Center for Innovative Technology under Grant Numbers INF-85-016 and IW-87-012; by Nimbus Records Inc.; and by AT&T equipment contributions.
§ All correspondence regarding this submission should be directed to the first author at the address indicated.
† Current address is Dept. of Computer Science, Radford University, Radford, VA 24142.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1988 ACM 0-89791-274-8/88/0400/0291 $1.50

This report describes experimental studies of one method for combination of term and citation information that is based on a very simple linear model. We demonstrate how the proposed hybrid scheme is more effective than either of the separate approaches. In particular, we use regression methods to identify proper coefficients to aid in combination, and to determine the relative “importance” of different classes of information. Following a discussion of some of the kinds of links that can be found between documents, a short review of methods for feedback is given, since feedback mechanisms making use of links are explored later. Section 2 discusses related work, leading into the exposition of an extended vector model in section 3. Section 4 describes two test collections developed to support these investigations, and section 5 summarizes some early exploratory studies with those collections. Section 6 presents new work regarding determination of coefficients for effective combination, and section 7 concludes this report.

1.1 Links

Recently there has been a great deal of attention given to the need for retrieval systems to support the concept of hypertext [CONK 87]. As originally proposed in [BUSH 45], users should be able to follow trains of associations created during earlier system interactions or by others, and to thereby explore the vast stores of available knowledge. Links between documents are easy to create in message handling systems based on notions of time sequence, referral, or inclusion [BABA 85] but are not always so easily developed for bibliographic collections. Some work has considered making use of links arising from citations and references [KOCH 82]. Section 2.2 below describes some other methods for obtaining links that have been successfully exploited in retrieval studies, but there are still many important open problems regarding developing hypertext links. It is hoped that later sections of this report will shed some light on a few of those problems.

1.2 Feedback

Rocchio proposed using easy-to-obtain relevance judgments provided by system users in order to improve a query, “moving” it closer to useful portions of a document collection [ROCC 71]. His “vector feedback” method was later refined with schemes for computing term precision [SALT 75a] or term relevance [ROBE 76] that are at the heart of current probabilistic retrieval systems. During the feedback process, documents found by some initial search scheme that are judged to be relevant can be used to expand an original query. While most concern with feedback studies has focussed on test collections with terms only, it is here shown that feedback of other classes of information can lead to significant performance improvements over terms-only feedback methods. Thus we deal not only with problems of using relevance feedback to compute better query weights, but also with feedback-based methods for query expansion.
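As an illustration of how vector feedback can both reweight and expand a query, the following minimal sketch uses the commonly cited textbook form of Rocchio's update. It is not taken from this report: the function name `rocchio`, the sparse dictionary representation, and the default values of `alpha`, `beta`, and `gamma` are all assumptions made for the example.

```python
def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move a sparse query vector toward relevant documents.

    Vectors are dicts mapping a concept to its weight. The default
    parameter values are conventional choices, assumed for illustration.
    """
    new_query = {t: alpha * w for t, w in query.items()}
    # Add the mean relevant vector, subtract the mean non-relevant vector.
    for docs, factor in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        for doc in docs:
            for t, w in doc.items():
                new_query[t] = new_query.get(t, 0.0) + factor * w / len(docs)
    # Keep only positively weighted concepts; new concepts expand the query.
    return {t: w for t, w in new_query.items() if w > 0}


# Hypothetical data: one judged-relevant and one judged-non-relevant document.
expanded = rocchio({"retrieval": 1.0},
                   [{"retrieval": 1.0, "citation": 2.0}],
                   [{"compiler": 1.0}])
```

Concepts that occur only in relevant documents (here "citation") enter the new query, which corresponds to the query expansion discussed above.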

2. Related Work

When bibliographic retrieval systems are searched by different people, or using different classes of information about documents (e.g., titles vs. abstracts vs. descriptors), there is often a rather small overlap in what useful documents are found by each method [KATZ 82]. Furthermore, many methods seem to have roughly the same (low) level of performance. It has been suggested, therefore, that retrieval systems should adapt to users and situations and select the most appropriate method for any given situation [CROF 84]. Indeed, there are efforts now underway to build systems that use artificial intelligence techniques to recognize important retrieval situations and follow rules to adapt appropriately to those situations [CROF 87, FOXE 87]. Making use of available information is further complicated by the existence of formatted data to go along with the text portions [DESA 86], or by even richer multimedia resources [CHRI 86]. In the discussion below, text, factual, and relational data will all be integrated, and it is expected that multimedia objects could also be handled using these methods in concert with more sophisticated analysis and retrieval techniques, perhaps like those described in [FOXE 87].

2.1 Vector Extensions

The basis for the approach discussed below is the use of vectors of features for describing each document [SALT 75b]. The original vector space model has been criticized because of its naive assumption of term independence, and several generalized approaches have been suggested and validated [RAGH 86]. However, for the purposes of this report it is adequate to describe a document in terms of numerical values associated with the different features that characterize that document. While it may seem like we are ignoring dependencies, though, we are actually exploring dependencies between classes of features in a more thorough way than has been previously considered.

2.2 Citation Data and Retrieval

Kessler noticed that documents could be described and classified not only by the terms present, but by the data in their bibliographies. He defined the “bibliographic coupling” measure in terms of the degree of overlap between the set of references for a pair of articles [KESS 63]. Since then, bibliographic coupling has been used for a wide range of clustering and other investigations [WEIN 74]. Small defined a similar measure for pairs of articles, but from the different end of the time spectrum relating to documents. Thus, while bibliographic coupling relates documents based on their view of past publications, Small’s cocitation measure relates documents based on the number of later documents referring to both elements of the pair [SMAL 73]. He and others have shown that cocitation data and related contextual information can help with the understanding and portrayal of scientific fields [SMAL 80]. Garfield has, through his work in developing citation indexes, enabled researchers to locate relevant articles based on citation relationships [GARF 78]. The cycling between starting with cited works and following hints given in references can be studied mathematically [CUMM 73] and has real practical value as a manual approach to searching. Further extensions are possible if automatic methods can help find and make use of citing statements in texts [OCON 82]. In addition, there appears to be promise if both titles and cited titles are considered in representing document content [KWOK 75]. Similarly, retrieval seems to be more effective when certain methods are combined; in particular, use of bibliographic coupling together with a cocitation measure yields better retrieval than either method alone [BICH 80].
Based on these studies it was suggested that the classic vector space model be extended so that additional types of information could be considered, and that testing of that approach be made using new collections rich in bibliographic relationship data [FOXE 83a].

3. Extended Vector Model

In the standard vector space model, each document in a collection C of N documents is characterized by the various “concepts” present in that collection. Let T be the set of such concepts (also called “terms” because collections often have mostly terms and few other types of concepts) and let $M_{trm} = |T|$ be the size of that set. Note that because of authors’ habits with language, for any reasonable size collection $M_{trm}$ is usually much smaller than N. The document collection can be represented by a document-term matrix with N rows and $M_{trm}$ columns, so that each document is of the form

$$D_i = (trm_{i1}, trm_{i2}, \ldots, trm_{iM_{trm}}) \tag{1}$$

where $trm_{ik}$ is the (possibly zero) weight of term k in $D_i$. Queries can be put in the same form. Later, query-document similarities can be computed using, for example, a cosine correlation measure. When the weights for terms in documents and queries are determined as real values that reflect both the within-document and within-collection characteristics of terms, then reasonably good retrieval results from this approach.

3.1 Adding Bibliographic Concepts

Following our Principle of Combination we considered adding various types of readily available bibliographic data to the above mentioned term-based document representation. Salton carried out an early study of this type, using links between documents [SALT 63]. To motivate the discussion below, and to ensure that terminology is clearly understood, precise definitions are given for links and other related classes of bibliographic data.

First, consider the practice of having a document refer to another, usually accomplished by including the proper information in the set of references provided at the end of the “source” document. We can then define “direct reference” using this terminology (where for symmetry we include the alternate verb form of the word “cite”).

Direct Reference: $A \rightarrow D$ when A refers to (cites) document D, so that D is referred to (cited by) A. (2)

By definition, $D \rightarrow D$ always holds.

Given the notion of reference, we can also consider “indirect reference” where one or more intermediate documents are present in the chain of reference leading from a source to a target document. Further we define bibliographic coupling:

B and C are bibliographically coupled [KESS 63] if some document, say E, is referred to by both B and C. (3)

Similarly, cocitation can be defined.

F and G are cocited [SMAL 73] if some document, say D, refers to both of them in its bibliography. (4)

Finally, “links” between a pair of documents can be defined, again preserving symmetry.

A and D are linked if either $A \rightarrow D$ or $D \rightarrow A$ [SALT 63]. (5)

Note that these measures relate any pair of documents in a collection. Each one, then, could be represented by data stored in an N x N symmetric matrix. Since intuition does not as readily help with adapting these definitions to a pair of documents when the pair only involves a single document, the diagonal entries can be freely defined in as useful a manner as might help with other computations and operations on the data.

$bbc_{ii}$ = number of references in bibliography of the ith document.
$coc_{ii}$ = number of articles that each refer to the ith document.
$lnk_{ii}$ = 1 (6)

Here the coc value indicates the incoming citation count and the lnk entry reflects the fact that a document is itself and so is linked to itself. Off diagonal entries are as expected.

$bbc_{ij}$ = number of articles referred to by both $D_i$ and $D_j$
$coc_{ij}$ = number of articles that each refer to both $D_i$ and $D_j$
$lnk_{ij}$ = 1 if the ith document refers to the jth, or vice versa. (7)
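The definitions in (6) and (7) can all be derived from a table of direct references. The sketch below is one possible way to do so, not the procedure used for the collections described later. The function name `bibliographic_matrices` and the input format are assumptions, and self-references are excluded from the input so that the diagonal counts match (6).

```python
def bibliographic_matrices(refs, n):
    """Derive bbc, coc, and lnk matrices from direct references.

    refs maps a document index to the indices it cites. Only references
    to documents 0..n-1 are counted, and self-references are ignored
    (an assumption, so the diagonals agree with definition (6)).
    """
    cites = [{j for j in refs.get(i, ()) if 0 <= j < n and j != i}
             for i in range(n)]
    cited_by = [set() for _ in range(n)]
    for i, targets in enumerate(cites):
        for j in targets:
            cited_by[j].add(i)
    # bbc: shared outgoing references; diagonal = size of bibliography.
    bbc = [[len(cites[i] & cites[j]) for j in range(n)] for i in range(n)]
    # coc: shared incoming citations; diagonal = incoming citation count.
    coc = [[len(cited_by[i] & cited_by[j]) for j in range(n)]
           for i in range(n)]
    # lnk: symmetric direct reference, with ones on the diagonal.
    lnk = [[1 if i == j or j in cites[i] or i in cites[j] else 0
            for j in range(n)] for i in range(n)]
    return bbc, coc, lnk


# Hypothetical four-document collection: 0 and 1 both cite 2 and 3; 2 cites 3.
bbc, coc, lnk = bibliographic_matrices({0: {2, 3}, 1: {2, 3}, 2: {3}}, 4)
```

In this toy collection documents 0 and 1 are bibliographically coupled through two shared references, while documents 2 and 3 are cocited by the two articles that refer to both.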

3.2 Separate Subvectors

In Salton’s 1963 study he used direct citations between documents as well as terms, and showed that associations between documents could be computed with that additional data present in the resulting longer vectors [SALT 63]. Michelson et al. applied vector feedback to that type of long vector, made up of terms and other information, and even ran experiments, but only made tests with a document collection containing 82 documents [MICH 71]. In both of these studies, the same type of vector as was given in (1) above was employed, except that the vector length was greater. Considering, for example, what would happen if a document collection that was characterized by terms was also characterized by authors, one would have long vectors with term weights and author-related weights as shown in (8) below.

$$D_i = (trm_{i1}, \ldots, trm_{iM_{trm}}, aut_{i1}, \ldots, aut_{iM_{aut}}) \tag{8}$$

If separate subvectors are associated with each class of concepts (i.e., each “concept type”) then this same document vector could more easily be described as in (9) below.

$$D_i = (\mathbf{trm}_i, \mathbf{aut}_i) \tag{9}$$

The resulting matrix for the collection would then be made of two submatrices.

$$C = \begin{pmatrix} \mathbf{trm}_1 & \mathbf{aut}_1 \\ \vdots & \vdots \\ \mathbf{trm}_N & \mathbf{aut}_N \end{pmatrix} = (TRM \;\; AUT) \tag{10}$$

It should be noted that this scheme has quite a number of advantages (see chapter 6 of [FOXE 83a] for more discussion). Different weighting methods could be used for different submatrices: binary weights for author names, real-valued weights for terms, and special counts for weights based on bibliographic connections. Different similarity measures could be used when comparing one subvector from each of two different vectors than might be used when comparing another pair of subvectors. It is even possible that different storage schemes might be used for the different submatrices. The term submatrix might be stored both by row and by column, so that inverted file access as well as direct document access is possible. On the other hand, citation-related submatrices might be stored only by row (compressed as usual by omitting zero entries), since otherwise an inverted index would become enormous, with O(N) entries for each such submatrix. To explore the implications and use of this model, two collections were developed that would each contain three or more concept classes, thus including at least terms, author names, and some data reflecting bibliographic relations.
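To make the storage discussion concrete, here is a small sketch of a document held as separate sparse subvectors, with an inverted file built for one concept type only. The class name `ExtendedVector`, the helper `inverted_index`, and the sample weights are hypothetical and chosen only for illustration; they do not describe any particular system's file layout.

```python
from dataclasses import dataclass, field


@dataclass
class ExtendedVector:
    """A document as a set of sparse subvectors, one per concept type."""
    # Maps a concept type such as "trm" or "aut" to {concept: weight}.
    subvectors: dict = field(default_factory=dict)

    def add(self, ctype, concept, weight=1.0):
        self.subvectors.setdefault(ctype, {})[concept] = weight


def inverted_index(docs, ctype):
    """Build a concept -> sorted document ids index for one concept type."""
    index = {}
    for doc_id, doc in docs.items():
        for concept in doc.subvectors.get(ctype, {}):
            index.setdefault(concept, []).append(doc_id)
    return {concept: sorted(ids) for concept, ids in index.items()}


# Hypothetical two-document collection.
d1 = ExtendedVector()
d1.add("trm", "retrieval", 0.8)   # real-valued term weight
d1.add("aut", "fox")              # binary author weight
d1.add("coc", 2, 3.0)             # cocitation count with document 2
d2 = ExtendedVector()
d2.add("trm", "retrieval", 0.4)
d2.add("trm", "citation", 0.9)
docs = {1: d1, 2: d2}

# Only the term submatrix gets an inverted file; the rest stay row-only.
term_index = inverted_index(docs, "trm")
```

Keeping each concept type in its own dictionary lets the weighting, similarity measure, and storage choice vary per subvector, which is the flexibility argued for above.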

4. Data Collection Characteristics

During the period 1980-82, in connection with research proposed by Fox and described in [FOXE 83a], two IR test collections were put together through the combined efforts of a number of individuals at Cornell University and other locations. While these collections have been used by a number of researchers during the intervening years, there are probably many others who are unaware of their existence, since details have never been published. This section therefore describes some of the key characteristics; for more information the reader is referred to [FOXE 83c]. Incidentally, these and other useful test collections are included on Virginia Disc One, a CD-ROM produced in 1988 by Nimbus Records and available for widespread distribution to interested researchers.

4.1 ISI

The first collection, called “ISI-1460” or “ISI” for short, had its origin in a tape provided by Henry Small of the Institute of Scientific Information (ISI). Provided in response to a request for help in developing a collection rich in citation information, the tape included titles, bibliographic data, and citation counts for the 1627 articles in the information science file for the period 1969-1977 that had received at least 5 citations (from a “source” group of 4150 articles). Using the titles and bibliographic data, 1460 articles out of the 1627 were located; the remainder were errors or could not be found. From the hardcopy versions, titles, author names, and abstracts were entered into a computer collection. Automatic indexing techniques were used to construct vectors with terms, and also with author name indicators. By combining the tape data and the indexed collection data, a matrix of cocitation counts for all pairs of the 1460 articles was constructed and converted to vector format. Using the terminology of the previous section, a collection was thus constructed with three submatrices, for terms, authors, and cocitations.

The lengths of the several subvectors are described in Table 1. It should be noted that the mean lengths of the term and cocitation subvectors are approximately the same. This is exactly the situation hoped for, since it is hard to compare the utility of citation data with term occurrence data when the quantities of each are vastly different. Put another way, if it were to turn out that in this collection retrieval methods using both citation and term information were no better than retrieval with just terms, then it would be rather unlikely that such a combination approach would be of value in more standard collections where far more term data was present than bibliographic data.

Figure 1 illustrates the distribution of subvector lengths. While the term subvector lengths as shown in part (a) seem to approximate a normal distribution, the lengths for the other two subvectors, shown in parts (b) and (c), are clearly bunched at the low end of the range, but do vary widely as shown by the fairly long tail, especially for cocitations.

Table 1. ISI Subvector Length Statistics
(statistics measured: mean, median, min, max, stdv, and total no. of concepts; subvectors: AUT, COC, TRM)

Figure 1. Top Trimmed Histograms for ISI Subvector Lengths
(panels: trm subvector length, aut subvector length, coc subvector length)

To go along with the document collection, a set of queries was built. Thirty-five queries relating to information science had been used in earlier studies with the ADI collection. These were especially useful since three different searchers had previously constructed Boolean queries for each one. Forty-one more queries were obtained to bring the total to 76 that could be used for experiments with “natural language” query input. Some of the remaining ones had been used with the ISPRA collection, and again were related to information science. The rest were “documents as queries” that were formed by taking the title, author, and abstract data that appeared in several issues of ACM SIGIR Forum.

Finally, two graduate students (E. Voorhees and E. Fox) involved in doctoral studies relating to information retrieval prepared relevance judgments for the entire collection. Though the query collection may seem rather ad hoc, it did include a number of different types of questions, and though the procedures employed may seem rather artificial, the relevance judgments were obtained by exhaustive analysis by “experts” who compared the text of each query and each document. All in all, then, a collection of 1460 documents, 76 queries, and complete relevance judgments was obtained.

4.2 CACM-3204

The second collection, called “CACM-3204” or “CACM” for short, had its origin in a tape provided by Robert Dattola of Xerox Corporation. Dattola had supervised entry of data about each of the articles appearing in the Communications of the ACM, starting with the first issue in 1958, and running through the last issue of 1979. Included were titles, author names, abstracts, author chosen keywords, category numbers according to the Computing Reviews scheme, and dates of publication. It should be noted, however, that in the early years there were no abstracts and no categories, so those fields were omitted from the relevant records.

Though the CACM collection of 3204 articles published during this 22 year period did have a variety of term and factual data, there was no data on bibliographic relationships. Carol Fox and another volunteer went through all of the articles, examined the list of references at the end, and recorded each instance of an article in the collection referring to another article in the collection. This record-keeping of “internal” citations contrasts with the “external” citations provided with the ISI collection. While labor intensive, the process of taking a journal and finding references to other articles in the same journal is much easier for a researcher or publisher to undertake than is the all-encompassing type of effort carried out by ISI. Thus obtaining CACM citation data in this fashion provides a realistic contrast to that used for the ISI collection. From this database of direct references between articles, link, cocitation, and bibliographic coupling counts were computed. Table 2 lists the various classes of information collected along with the abbreviations used to identify the different concept types.

Table 2. Subvector Abbreviations

AUT    Author
CRC    Computing Reviews Category
DTE    Date of Publication
TRM    Term
BBC    Bibliographic Coupling
LNK    Bibliographic Link
COC    Cocitation

A number of computer scientists at Cornell and elsewhere were asked to prepare real questions of interest to them that might retrieve suitable articles from this collection. A simple retrieval scheme was used to obtain some documents, and the users were asked to supply relevance judgments. Two individuals knowledgeable in the field and familiar with retrieval practices were asked to each prepare Boolean queries for each question as well. Then, seven additional searches were undertaken and results merged so that the original users could look at another set of retrieved documents. Various heuristics were employed so that the users were not burdened with more than a total of 100 documents to judge. Altogether, 52 queries were collected along with a good many of the relevant documents for each. Though full relevance data is not available, it was approximated, in a fairly realistic test situation where real questions were obtained.

All told there were seven subvectors. Table 3 provides statistics for each of those concept types, illustrating the amount of data available for the various subvectors. It should be noted that the mean length of term subvectors, while reasonably high, was only about half of that for ISI. However, this situation is in part explained by the fact that no abstracts were printed in the early years of the journal. More important, however, is the fact that all of the other subvectors were on average very short. Clearly, only making use of “internal” references inside one journal, even one as central to a field as CACM has been to computer science, does reduce the amount of data available to characterize bibliographic relationships. Nevertheless, there are numerous pairs of articles in the collection with a great many cocitations or with high degrees of bibliographic coupling.

Table 3. CACM Subvector Length Statistics
(statistics measured: mean, median, min, max, stdv, and total no. of concepts; subvectors: AUT, BBC, COC, CRC, LNK, TRM)

*Note: The minimum length for lnk is one since a document is, by definition, linked to itself, and so the diagonal of the submatrix is set to ones.

To provide more insight into the data present in the various subvectors, distributions of lengths for several of the concept types are given in Figure 2. All are peaked at the low end, and have fairly long tails. All in all, the CACM and ISI collections represent meaningful vehicles for exploring the value of mixing term, factual, and bibliographic relationship data.

Figure 2. CACM Frequency Distribution for Subvector Lengths
(panels: lnk subvector length, crc subvector length, bbc subvector length, coc subvector length)

5. Initial Exploration

While these collections were being developed, a modification to the normal vector processing approach was specified and incorporated into a new version of the SMART retrieval system [FOXE 83b]. Further effort on SMART was continued, leading to a version distributed to many interested researchers [BUCK 85]. While in earlier versions of SMART and other retrieval systems each document was characterized by a single vector, in the enhanced version a document is described by any number of subvectors. In particular, prior to indexing, documents are split into separate fields, such as for author, title, abstract, journal name, etc. While there could be repeated separate sections all of the same type of field (e.g., the ordering might be text, date, name, text, name, etc.), any part of the document must be assigned to exactly one field. Special files describe what type of indexing processing is needed for each field (e.g., stemming vs. plural removal). Ultimately, then, all concepts in a document are identified and classified as to “concept type.” Vectors are built with separate subvectors, and query-document similarity is separately determined as appropriate for each of the subvectors. An overall similarity is then computed as a linear combination of the similarities for each subvector. Thus, for the ISI collection

$$sim(Q, D) = c_{trm} \cdot sim_{trm}(Q_{trm}, D_{trm}) + c_{aut} \cdot sim_{aut}(Q_{aut}, D_{aut}) + c_{coc} \cdot sim_{coc}(Q_{coc}, D_{coc}) \tag{10}$$

where the $c_i$ values are coefficients used in this linear combination procedure.
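The linear combination can be sketched in a few lines. This is an illustrative sketch rather than SMART's implementation: it assumes sparse dictionaries for subvectors and uses the cosine measure for every concept type, whereas the model allows a different measure per subvector. The names `cosine` and `combined_similarity` and the sample weights are hypothetical.

```python
import math


def cosine(q, d):
    """Cosine correlation between two sparse vectors (dicts)."""
    dot = sum(w * d.get(k, 0.0) for k, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def combined_similarity(query, doc, coefficients):
    """Weighted sum of per-subvector similarities.

    query and doc map a concept type to a sparse subvector; a missing
    subvector contributes a similarity of zero.
    """
    return sum(c * cosine(query.get(ctype, {}), doc.get(ctype, {}))
               for ctype, c in coefficients.items())


# Hypothetical query and document with trm, aut, and coc subvectors.
query = {"trm": {"retrieval": 1.0, "citation": 1.0}, "coc": {17: 1.0}}
doc = {"trm": {"retrieval": 1.0}, "aut": {"fox": 1.0}, "coc": {17: 2.0}}
score = combined_similarity(query, doc, {"trm": 7.0, "aut": 1.0, "coc": 1.0})
```

Setting every coefficient to one reproduces the equal-weight case; changing the ratios shifts how much each concept type contributes to the ranking.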

With the collections and extended version of SMART, two preliminary studies were undertaken as described in [FOXE 83a]. In the first study, for each query documents were retrieved using a simple term matching scheme until a relevant document was found. Then, using that new document as the feedback query, another simple matching operation was performed. Results were compared when the feedback query was limited to be only terms, only authors, etc. so that the relative value of each concept type could be determined. Pairs of concept types, and combinations with more types were also considered, with all coefficients (as in Equation 10) set to one. For the ISI collection, based on an ascending ordering of resulting average precision scores (determined as the average of precision values for recall levels .25, .50, and .75), the cases were: aut, coc, all, trm+aut, trm+coc, trm. Thus, more effective retrieval occurred when cocitations were considered as opposed to authors, and the most effective retrieval occurred when terms alone were used. However, with coefficients for terms and cocitations set to be roughly in the ratio 7:1, the trm+coc case was 6% better than the trm only case. For CACM, the single subvectors had values according to the ordering: aut, bbc, crc, coc, lnk, trm. Combination tests were not run since the CACM collection has insufficient data in most of the subvectors, and many of the new “feedback” queries had few terms in subvectors other than trm.

A second type of feedback study was accordingly undertaken. After initial retrieval, a probabilistic feedback operation was carried out. The new queries, however, could be constituted from any number of the set of allowable concept types, and weights could be applied as in Equation 10 to compute an overall “similarity.” For the CACM case, using equal weights led only to small improvements, and from prior work with the ISI collection, it seemed obvious that equal weights would probably not be very valuable. Therefore, it became clear that a new problem had to be faced: how to find the right coefficients. Since early work in probabilistic feedback had explored the results from parameter estimation in a retrospective case, it was decided that such could be applied to the coefficient determination process. A simple scheme was tested, and for the CACM collection the average precision was 29.7% better than for the terms-only feedback, when regression coefficients were used on all the “similarities” computed separately for each subvector. Because of this success, further study was deemed appropriate, and has recently been completed.
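The average precision score used in these comparisons can be illustrated as follows. The sketch assumes interpolated precision (the best precision at any point where recall has reached the given level), which is a common convention but is not stated in this report; the function name and sample ranking are hypothetical.

```python
def three_point_average_precision(ranking, relevant,
                                  levels=(0.25, 0.50, 0.75)):
    """Average of precision values at fixed recall levels.

    Precision at a level is interpolated: the highest precision seen at
    any rank where recall is at least that level (an assumption here).
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    points = []  # (recall, precision) after each relevant document found
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / len(levels)


# Hypothetical ranked output and relevance judgments for one query.
ranking = ["d3", "d1", "d7", "d2", "d9", "d4"]
relevant = {"d1", "d2", "d4", "d8"}
score = three_point_average_precision(ranking, relevant)
```

Averaging this score over all queries in a collection gives a single figure for comparing runs, such as terms-only feedback against feedback with all concept types.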

6. Coefficients for Probabilistic Feedback

As a follow up to the work discussed in [FOXE 83a], Gary L. Nunn carried out further investigations during the period 1985-87, with the assistance of Whay C. Lee and the supervision of Edward A. Fox, as discussed in [NUNN 87]. The highlights of that research are summarized in the remainder of this section.

As in the earlier studies, the ISI and CACM collections were utilized. However, since it was felt that more than just retrospective analysis was needed, both collections were split in half, randomly, leading to ISI1, ISI2, CACM1 and CACM2. The bulk of the work discussed below involved CACM1 and ISI1. The basic scheme was to set up a regression situation with data similar to that of Equation 10, where similarities were obtained using term relevance weighting [...] 0.0001), trm*coc (0.0001), aut*coc (0.0008). The coefficients for these interaction terms and the sum of squares values are so low as to allow them to be ignored when trying to predict relevance, but it is interesting to note that there are significant interactions. For CACM1, the only significant interaction is aut*crc (Pr > F: 0.0002), which suggests that there is a relationship between authors and the categorization they assign to their works. Data is probably too limited for other interactions to be significant.

All in all the regression results appeared quite promising. Plotting predicted vs. residual values for both collections showed that the best schemes are very good at predicting relevant documents being relevant, but frequently err when applied to non-relevant documents.
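The regression setup can be sketched as an ordinary least squares fit of binary relevance on the per-subvector similarities, reporting the coefficient of determination (RSQ). This is a minimal illustration under stated assumptions, not the statistical procedure used in the study: it omits sampling, log transformation, and interaction terms, and the function name `fit_coefficients` and the sample data are hypothetical.

```python
def fit_coefficients(rows, relevance):
    """Least squares fit of relevance on subvector similarities.

    rows: one list of similarities per query-document pair.
    relevance: matching list of 0/1 judgments.
    Returns ([intercept, c_1, ..., c_k], rsq).
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    M = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, relevance))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        if abs(M[col][col]) < 1e-12:
            raise ValueError("similarity columns are linearly dependent")
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    beta = [M[i][k] / M[i][i] for i in range(k)]
    predicted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(relevance) / len(relevance)
    ss_total = sum((y - mean_y) ** 2 for y in relevance)
    ss_resid = sum((y - p) ** 2 for y, p in zip(relevance, predicted))
    rsq = 1.0 - ss_resid / ss_total if ss_total else 0.0
    return beta, rsq


# Hypothetical (sim_trm, sim_coc) pairs with binary relevance judgments.
sims = [[0.9, 0.4], [0.7, 0.0], [0.2, 0.5], [0.1, 0.0], [0.6, 0.3]]
judged = [1, 1, 0, 0, 1]
coefficients, rsq = fit_coefficients(sims, judged)
```

The fitted coefficients after the intercept play the role of the $c_i$ values in Equation 10, and RSQ indicates how much of the variation in relevance the similarities explain.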

6.3 Retrieval Performance

While the RSQ measure shown earlier is a useful measure of system performance, a more complete picture of the situation can be seen if recall and precision data is used. Thus, Table 8 shows the average precision results for the ISI1 cases. The base runs are for equal weighting on all concept types, as would result from application of the usual probabilistic weighting methods. The first row shows behavior for terms only. There does not seem to be a great deal of effect from applying the log transformation in the base runs when equal weights are applied, and in some cases there is a loss of performance, so it appears to not be worthwhile. Similarly, the use of sampling appears to lead to a decrease in performance or only a minor increase. Indeed, the base case with raw data seems comparable to other situations. It is clear, however, that average precision increases when more data is considered, as we move from trm to coc+trm to all types. The improvement in going from terms only to using all data is around 11%. For the CACM1 collection there are similar results, as can be seen in Table 9.
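As an illustration of the kind of measure being compared, the sketch below computes average precision for a ranked list and the percentage change between two runs. This section does not state how the averaging was done in the experiments, so the sketch uses one common definition (the mean of the precision values at the ranks of the relevant documents). The document identifiers, rankings, and function names are hypothetical.

```python
from typing import Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision values taken at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0


def percent_change(base: float, new: float) -> float:
    """Relative change of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# Hypothetical rankings for one query under two runs.
relevant = {"d1", "d2"}
terms_only = ["d3", "d1", "d7", "d2", "d9"]
all_types = ["d1", "d3", "d2", "d7", "d9"]

ap_terms = average_precision(terms_only, relevant)
ap_all = average_precision(all_types, relevant)
improvement = percent_change(ap_terms, ap_all)
```

In a full evaluation the per-query values would be averaged over all queries before the percentage change between runs is taken.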

The overall increase in average precision for the CACM1 base run with raw data is 31.2%. This is a very significant result, and intermediate values are achieved as expected for each of the simpler models. Once again the other runs are not appreciably better and in most cases are worse. In most cases sampling hurts and using logs with coefficients is only slightly better than using the raw data.

Table 7. Regression Coefficients, Ranks, and RSQ's for CACM Binary Relevances

Model Variables; Raw Data: Rank, All Data Coefficients, Rank, Sample Coefficients; Log Data: Rank, All Data Coefficients, Rank, Sample Coefficients

AUT 2 .00093* 5 kzl :1 .00270* .00060* .00192*
BBC LNK COC ; 7 .00098* .00094* .00021*
CRC
Model RSQ i?i$
:2 .00061* .00509 .00300 .00059*
: 7 .00243* .00074* .00029
.3721
G .0193 4; .0740* .0228* .0167* .0084*
41 .0604* -.0876* .0249
; 6 -.00066* .05729* .00655
1 6 1 .4386 -.00167* .4064 ::?E* .6659
43 .00312* .00099* 43 .00596* .00071* ; .0222* .0777* 43 .05829* .01007
TRM 1 2 .0070* .00125* :.
LNK .00084* .00128* 2 .01047* .06138* ; .08679* .06903*
Model RSQ
TRM LNK .3577 .00096* .00151* ;
Model RSQ .3212 .4056 :. .00121* .00147* .3678 .4024 2 1 .01997* .09478* .6640 .10446* :. .3294 .08194* .6538

* = significant at the .05 level

Table 8. Precision Values from Base and Coefficient Runs for the ISI1 Collection

A final investigation was made regarding the carryover of results to the other halves of the collection. In particular, the feedback queries constructed on ISI1 and CACM1 were tried on ISI2 and CACM2, respectively. For the base case of terms only retrieval, the average precision decreased dramatically, from .3220 to .1444 for ISI2 and from .4813 to .1889 for CACM2.

Clearly the queries did not fit well, possibly because term distributions varied in the two halves, leading to estimation errors. Given this degree of error in applying the basic probabilistic approach, it is not surprising that even in the base cases the results for all concept types combined were not much different than for the terms only situation (i.e., they were down to .1265 for ISI2 and up to .2185 for CACM2). It appears that further study is needed regarding estimation of parameters across halves of split collections.
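The size of the carryover loss can be worked out directly from the average precision values quoted above. The short Python sketch below does only that arithmetic; the dictionary layout and the name `relative_change` are illustrative choices, and the labels assume that the quoted figures refer to the first and second halves as the text describes.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Average precision values quoted in the text for the base runs.
reported = {
    "ISI": {"half1_terms": 0.3220, "half2_terms": 0.1444, "half2_all": 0.1265},
    "CACM": {"half1_terms": 0.4813, "half2_terms": 0.1889, "half2_all": 0.2185},
}

for name, v in reported.items():
    # Loss when terms-only feedback queries are moved to the other half.
    carryover = relative_change(v["half1_terms"], v["half2_terms"])
    # Effect of adding all concept types, measured on the second half.
    all_vs_terms = relative_change(v["half2_terms"], v["half2_all"])
    print(f"{name}: carryover {carryover:+.1f}%, all types vs. terms {all_vs_terms:+.1f}%")
```

Run as written, this reports a carryover change of about -55.2% for ISI and -60.8% for CACM, while combining all concept types on the second half changes average precision by about -12.4% for ISI and +15.7% for CACM.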

Table 9. Precision Values from Base and Coefficient Runs for the CACM1 Collection

Base Runs

Model                   Raw      Log
TRM                     .4813    .4775
TRM, LNK                .5714    .5584
AUT, CRC, TRM, LNK      .5855    .5965
All concept types       .6315    .6066

7. Conclusions and Future Work

This report has discussed a number of efforts relating to the basic Principle of Combination. It does appear that whenever we combine additional information in the “proper” fashion, improvements in retrieval effectiveness do result. Thus, our main hypothesis appears to be supported by the evidence presented.

We have described a model whereby information besides terms can be incorporated into an extended vector system, and have discussed two collections, ISI and CACM, that have terms, factual data (e.g., author names), and bibliographic relations (e.g., cocitation counts). We briefly explained how the SMART system was enhanced to allow indexing of collections with a variety of fields in each document, to lead to a document-concept matrix made of several submatrices, one for each concept type. SMART now allows different query-document similarities to be computed for each subvector, and for an overall value to be computed using a linear combination scheme.

Because of improvements found in earlier studies, a more elaborate statistical analysis was undertaken using halves of the collections. Regression studies of ISI1 and CACM1 suggest that coefficients for the linear combination mentioned above can be computed and that a fairly good fit is obtained when sampling and transformation methods are employed. The regression studies suggest that while terms are the most important predictors of relevance, direct references (actually generalized into the symmetric definition of “links”) and cocitations (for CACM and ISI, respectively) are bibliographic relations that should be considered. Regression also suggests that there are indeed statistically significant interactions between some of the subvectors.

When average precision values are considered for feedback runs with the ISI1 and CACM1 collections, using standard probabilistic retrieval where equal weights are employed, it is clear that adding in more concept types leads to significant improvement in performance: 11% for ISI1 and 31% for CACM1. However, the regression coefficients do not make much difference. Further, the feedback queries perform poorly on the other half of the collection, and using coefficients for combination is not helpful there either.

Regarding the points raised in section 6, it does appear that sampling and log transformation do lead to good regression fits, and that adding in more data is valuable. It does not appear that sampling and log transformations make much difference in retrieval performance, though adding in more data does help significantly. The probabilistic model does seem to be robust enough to handle multiple concept types in a simple fashion, but the carryover of query weights and coefficients from one half of a collection to another half is not reliable.

Future work is needed in a number of areas. First, it is appropriate to further test the Principle of Combination and to perhaps apply it to situations where various types of data are present. Particularly promising is to use combinations of terms and bibliographic relation counts like links or cocitations. In hypertext systems this suggests that the links present could be used to enhance searching as well as to help with browsing. Second, more work on regression analysis would be valuable as a means of testing what types of phenomena lead to better fit when trying to predict relevance. A clearer understanding of how to relate good fits with better retrieval performance is certainly needed, and might be found if other measures of retrieval performance besides average precision are considered. Third, sampling issues should be investigated so that carryover of probabilistic feedback query weights and values for coefficients could be effected in a more robust and more widely applicable fashion. Fourth, theoretical study of the probabilistic model in this situation of multiple concept types, and empirical studies of vector feedback techniques, are both in order to help ensure that even better methods for combining different classes of concepts are identified. Finally, there is need to apply the Principle of Combination to other systems and to make more realistic tests. One important design feature of the CODER system is its adaptability to situations, to allow combination of evidence of different types when deciding what should be shown to a user next [FOXE 87]. In the SMART system, an initial study with some 300,000 library catalog entries [FOXE 88a] should be repeated with an even larger collection and feedback with multiple concept types could be studied and tuned. Finally, building upon experience in adding other advanced features to systems like SIRE [FOXE 88b], it should eventually be possible to have generally available systems where these techniques are incorporated.

Acknowledgements

Thanks go to Joy Davis for secretarial assistance with this report, to Carol Fox and the many others who helped with providing and preparing needed data and judgments, to Henry Small of ISI and Robert Dattola of Xerox for supplying data, and to the various students at Cornell and VPI&SU who have worked on SMART and related software.

References

[BABA 85] Babatz, R. and M. Bogen. Semantic Relations in Message Handling Systems: Referable Documents. In Proc. IFIP WG 6.5 Symposium, Sept. 1985.

[BELK 87] Belkin, Nicholas J. and W. Bruce Croft. Retrieval Techniques. Annual Review of Information Science and Technology, 22: 109-145, 1987.

[BICH 80] Bichteler, J. and Eaton III, E.A. The Combined Use of Bibliographic Coupling and Cocitation for Document Retrieval. Journal of the American Society for Information Science, 31(4): 278-282, July 1980.

[BUCK 85] Buckley, C. Implementation of the SMART Information Retrieval System. TR 85-686, Cornell Univ., Dept. of Comp. Sci., May 1985.

[BUSH 45] Bush, V. As We May Think. Atlantic Monthly, 176: 101-108, July 1945.

[CHRI 86] Christodoulakis, S., M. Theodoridou, F. Ho, M. Papa, and A. Pathria. Multimedia Document Presentation, Information Extraction, and Document Formation in MINOS: A Model and a System. ACM Transactions on Office Information Systems, 4(4): 345-383, Oct. 1986.

[CONK 87] Conklin, Jeff. Hypertext: a Survey and Introduction. IEEE Computer, 20(9): 17-41, Sept. 1987.

[CROF 84] Croft, W.B. and R.H. Thompson. The use of adaptive mechanisms for selection of search strategies in document retrieval systems. In: Res. & Dev. in Information Retrieval, Proc. 3rd Joint BCS and ACM Symp., Cambridge, 1984, Cambridge: Cambridge Univ. Press, 95-110.

[CROF 87] Croft, W.B. and R.H. Thompson. I3R: A New Approach to the Design of Document Retrieval Systems. Journal of the American Society for Information Science, 38(6): 389-404, 1987.

[CUMM 73] Cummings, L.J. and D.A. Fox. Some Mathematical Properties of Cycling Strategies Using Citation Indexes. Information Storage and Retrieval, 9(12): 713-719, December 1973.

[DESA 86] Desai, B.C., P. Goyal, and F. Sadri. A Data Model for Use with Formatted and Textual Data. Journal of the American Society for Information Science, 37(3): 158-165, May 1986.

[FOXE 83a] Fox, E.A. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. Dissertation, Cornell University, University Microfilms Int., Ann Arbor MI, Aug. 1983.

[FOXE 83b] Fox, E.A. Some Considerations for Implementing the SMART Information Retrieval System under UNIX. TR 83-560, Cornell Univ., Dept. of Comp. Sci., Sept. 1983.

[FOXE 83c] Fox, E.A. Characterization of Two New Experimental Collections in Computer and Information Science Containing Textual and Bibliographic Concepts. TR 83-561, Cornell Univ., Dept. of Comp. Sci., Sept. 1983.

[FOXE 84] Fox, E.A. Combining Information in an Extended Automatic Information Retrieval System for Agriculture. In The Infrastructure of an Information Society, ed. B. El-Hadidy and E.E. Horne, North-Holland, Amsterdam, 449-466, 1984.

[FOXE 85] Fox, E.A. Composite Document Extended Retrieval: An Overview. In Res. & Dev. in Inf. Ret., Eighth Annual Int. ACM SIGIR Conf., Montreal, 42-53, June 1985.

[FOXE 87] Fox, Edward A. Development of the CODER System: A Testbed for Artificial Intelligence Methods in Information Retrieval. Information Processing and Management, 23(4): 341-366, 1987.

[FOXE 88a] Fox, Edward A. Testing the Applicability of Intelligent Methods for Information Retrieval. Information Services and Use, in press for Volume 7 (1987).

[FOXE 88b] Fox, Edward A. and Matthew B. Koll. Practical Enhanced Boolean Retrieval: Experiences with the SMART and SIRE Systems. Information Processing and Management, in press for 24(3), 1988.

[GARF 78] Garfield, E. Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. John Wiley & Sons, New York, 1978.

[KATZ 82] Katzer, J., et al. A Study of the Overlap Among Document Representations. Inf. Tech.: Res. & Dev., 1(4): 261-274, Oct. 1982.

[KESS 63] Kessler, M.M. Bibliographic Coupling Between Scientific Papers. American Documentation, 14(1): 10-24, January 1963.

[KOCH 82] Kochtanek, Thomas R. Bibliographic Compilation using Reference and Citation Links. Information Processing and Management, 18(1): 33-39, 1982.

[KWOK 75] Kwok, K.L. The Use of Title and Cited Titles as Document Representation for Automatic Classification. Information Processing and Management, 11(8-12): 201-206, 1975.

[MICH 71] Michelson et al. An Experiment in the Use of Bibliographic Data As a Source of Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, ed. G. Salton, Prentice Hall, Englewood Cliffs, NJ, 1971.

[NUNN 87] Nunn, Gary L. Regression Analysis of Extended Vectors to Obtain Coefficients for Use in Probabilistic Information Retrieval Systems. MS Report, VPI&SU Dept. of Comp. Sci., Blacksburg VA, Dec. 1987.

[OCON 82] O’Connor, John. Citing Statements: Recognition by Computer and Use to Improve Retrieval. Information Processing and Management, 18(3): 125-131, July 1982.

[RAGH 86] Raghavan, Vijay V. and S.K.M. Wong. A Critical Analysis of Vector Space Model for Information Retrieval. Journal of the American Society for Information Science, 37(5): 279-287, Sept. 1986.

[ROBE 76] Robertson, S.E. and K. Sparck Jones. Relevance Weighting of Search Terms. Journal of the American Society for Information Science, 27(3): 129-146, 1976.

[ROCC 71] Rocchio, Jr., J.J. Relevance Feedback in Information Retrieval. In The SMART Retrieval System, Experiments in Automatic Document Processing, ed. by G. Salton, Prentice Hall, Englewood Cliffs, NJ, 1971.

[SALT 63] Salton, G. Associative Document Retrieval Techniques using Bibliographic Information. Journal of the American Society for Information Science, 10(4): 440-457, Oct. 1963.

[SALT 71] Salton, G. Automatic Indexing Using Bibliographic Citations. J. Doc., 27(2), June 1971.

[SALT 75a] Salton, G., Yang, C.S., and C.T. Yu. A Theory of Term Importance in Automatic Text Analysis. Journal of the American Society for Information Science, 26(1): 33-44, Jan.-Feb. 1975.

[SALT 75b] Salton, G., Wong, A., and C.S. Yang. A Vector Space Model for Automatic Indexing. Commun. ACM, 18(11): 613-620, Nov. 1975.

[SMAL 73] Small, H.G. Co-Citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, 24(4), July-Aug. 1973.

[SMAL 80] Small, H. Co-Citation Context Analysis and the Structure of Paradigms. J. Doc., 36(3): 183-196, Sept. 1980.

[WEIN 74] Weinberg, B.H. Bibliographic Coupling: A Review. Information Storage and Retrieval, 10(5-6): 189-196, 1974.
