On the Use of Spreading Activation Methods in Automatic Information

0 downloads 0 Views 905KB Size Report
The node activation process used in the spreading model starts by placing a specified ncliualiort wcighl at some starting term or document node. Typically.
On the Use of Spreading Activation Methods in Automatic Information Retrieval Gerard

Salton

and Chris

Buckley*

Abstract Spreading activation methods have been recommended in information retrieval to expand the search vocabulary and to complement the retrieved document sets. The spreading activation strategy is reminiscent of earlier associative indexing and retrieval systems. Some spreading activation procedures are briefly described, and evaluation output is given, reflecting the effectiveness of one of the proposed procedures.

1. Associative

Information

Retrieval

It

is well-known that simple matching procedures between the vocabularies contained in the query formulations and the stored documents do not Fur this reason, methods have always produce acceptable retrieval output. been introduced to expand the query formulations by adding to the initial queries new terms, or expressions, that are related to the originally available the retrieved document subsets can be augmented by terms. Analogously, or related, to the originally available items. using items that are similar, [DOYLE75, SALTON The difficulty in applying such associrr~iuc retrieval methods is the identification of related terms and documents that wili improve the retrieval operations. If few related terms, or documents, are identified, the retrieval output is On the other hand, when the query unlikely to be substantially improved. vocabulary or the retrieved document sets are drastically altered, the advantages gained from some useful added terms, or documents, are often lost because inappropriate terms will also have been supplied,

Permission to copy without buted for direct commercial appear, and notice is given otherwise, or to republish, C

1988

ACM

fee all part of this material is granted provided that the copies are not made or distriadvantage, the ACM copyright notice and the title of the publication and its date that copying is by permission of the Association for Computing Machinery. To copy requires a fee and/or specific permission.

O-89791-274-8

88

0600

--147-

0147

$ 130

In principle, it is possible to use generally valid term or document associations for the expansion operations. Thus term associat.ions might be derived by using available term tI~csc~u~.~~scs, specifying groups of related terms, or word ckdiorrarics that contain dedinitions and other characteristics of the stored vocabulary. Analogously, associations be twccn documents could be defined formally by observing similarities between documents derived from library classification schedules. In practice, generally valid vocabulary or doc:ument relations are difficult to find, SO that it is necessary to rely on 1ocall.y valid relations thal are applicable only for particular document collections in particular situations. The well-known rclcuc~~~o /;lcrlOack process, which augments the initial query vocabulary by adding terms contained in previously retrieved documents known to be relevant to the query, is an example of such a local term augmentation process. [ROCCIII071, IDE7la, IDE7lb] Various methods have been suggested for computing locally valid term and One such process, known as ussociutive lirrcc.tr rclriad, document associations. consists in taking the original assignmen of terms to documents specified in matrix C of Figure l(a), and using it to compute a matrix ‘f of pairwise term asso8ciation factors, as well as a matrix D of pairwise document association facD specifying the tors. Assuming that C is of dimension ~1 x L, the cl x rf matrix associations between pairs of documents is then obtained by using similarities For example, the similarity in the term assignments to pairs of documents. between documents L), and I),, stored as the iJth element of matrix L), is computed by comparing rows i and j of matrix C (see Fig. f(b)). Similarly the I x 1 matrix T giving the term relations is obtained by noting similar postings patl(c)). In a linear retrieval system, the terns for pairs of terms (see Fig. response vector r giving the retrieval values for all d documents of the collection can be computed by using the linear transformations shown in Fig. l(d). [GIIJLIAN073, JONES671 While attempts have been made to produce workable term and document expansion systems, it is fair to say that completely satisfactory models have For example, a query expansion method using term assonever been designed. ciations derived from a maximum spanning tree of term similarities led to the following conclusions: [ROBERTSON811 “Our results on query expansion using the NPL data are disappointing. We have not, been able to achieve any significant improvements over non-expansion... This leaves the whole question of the cffectiveness of query expansion methods unresolved.” Other recent attempts to supply expanded document representations using tions and other bibliographic indicators attached to texts and documents also led to the conclusion that effective term expansion methods valid variety of different collections are difficult to generate. [FOX831

citahave for a

Various reasons may be given for the lack of consistent improvements in First, the associations derived from particthe associative retrieval operations: ular pairs of documents, or pairs of terms, may be valid only locally in the particular environment from which the associations are derived. [LESK793 Second, most practical methods for computing the linear document associations are based on the assumption that the terms are originally uncorrelated, that is, independent of each other; furthermore, the term association computations require an analogous assumption about the independence of the documents. -148-

lRAGHAVAN861 Finally, methods based on simplified not reflect the reality of existing relationships between operational situations. lvanRIJSBERGEN77, YU831

theoretical documents

models may and terms in

Recently, the term and document expansion methods have been revised using so-called spreading activation techniques based on mechanisms of human memory operations. ICOLLINS75, ANDERSON83, COHEN871 These developments are briefly covered in the remainder of this note.

2, Spreading

Activation

Techniques

The spreading activation techniques used in information retrieval are based on the existence of maps specifying the existence of particular relations between terms or concepts, or between documents, as the case may be. A typical arrangement of concepts of interest in cardiology is shown in Fig. 2. An alternative map including both terms as well as documents appears in Fig. 3. In both cases, the terms or documents are represented by network nodes, and the relationships between terms or documents are specified by labelled links between the nodes. In the examples of Figs. 2 and 3, the documents may be related through bibliographic citation (when one document cites another one), or by being close neighbors of each other (because the term sets assigned to the respective items are sufficiently similar). Analogously, two terms may be synonyms of each other, or one of the terms may be specified as a part, or a component of some other term. The node activation process used in the spreading model starts by placing a specified ncliualiort wcighl at some starting term or document node. Typically the starting node might represent a term included in an initial query formulation, or a document retrieved in an carlicr search operation. The initial activation weight then spreads through the network along the links originating at the starting node. The spreading action first affects those nodes located closest to the starting node, and spreads through the network one link at-a-time. Normally, the activation weight of a node is computed as a function of the weighted sum of the inputs to that node from directly connected nodes. If ~Lj is the original activation weight of node J, and lo,j is the link weight between nodes i and j, representing the influence of node j on node i, the new activation weight (1: of node i may then be computed as

(1)

where node. ing

the

summation

extends

over

all

nodes

j directly

connected

Most spreading activation systems differ from the earlier and retrieval system in several respects: lANDERSON83,

a)

Each node is assigned a special activation the starting activation weight, and traversed in the activation process.

-149-

weight, the link

to a given

associative COHEN871 which and

index-

depends on node types

b)

Distance

stopping

constraints the activity

ma,y be imposed in ?,he activation at some specified distance fronn

process by the original

node, c>

Nodes with a large branching ratio (fan-out) that many other nodes (therefore pclssibly representing general notions or documents) may be bypassed process, or may otherwise receive special treatment.

d)

The activation process follows specific rules. Thus tions, the activation rules might be of the form [(topic

Cx) with zmplics

weight

w(x)) the length

factor length.

which For

normalization

-151-

reduces a given might

of an 1, is

computed as )L, of term i

all document and document vector

be l/

It is known that an effective term weighting system for document terms uses a composite of x id/ (term frequency multiplied by inverse document frequency) factor that is normalized for document length so that all documents are treated equally, regardless of the number ancl the weight of the assigned terms. the length normalization hecom.es unnecessary because For query terms, queries are normally short and more homogeneous in len.gth than documents. However, the query terms may be given extra importance by insisting that the query term frequency factors lie in the range from 0.5 to 1. Useful term weighting factors for documents and queries ma,y then be computed in the following way [SALTON87]:

a)

For document

terms w,/ =

r/- . log (fvfr1) -_-

(5)

,‘.:d-l,Jr b)

For query

terms

K, = 0.5 + 0.5max I

lf

log N I

II

Using the term weight of expressions (5) and (6), value may be obtained for each document by computing product between query and document vectors as follows:

(6)

a document retrieval the standard vector

,,l

(7)

When the vectors jointly contain a number of highly weighted terms, the similarity value of expression (7) wiI1 be large, and the corresponding documents may then be retrieved early in a given search operation. By interpreting the query and document activation factors used in the spreading activation system as ordinary term weights, a comparison becomes possible between the spreading activation method of expression (4) and other used vector processing methods. previously In the spreading activation system, the inverse document frequency factor is absent for the query Lerms, and the document terms are not normalized for vector length. Furthermore, the inverse document frequency factors used for document terms differ substantially between the spreading activation formula (4) and the C/f x it/l) system of expressions (5) and (6). A number of query and document term weighting coefficients are included in Table 1. In each case the query document similarity is obtainable by multiplying the weights of equivalent query and document terms, and summing over all terms. The first four weighting systems of Table 1 are derived from the spreading activation melhod of expression (4); the other four syst.ems represent m0.re general vector processing strategies. The first method of Table 1 is the exact spreading acLivation system of expression (4). Method (2) adds a normaIization factor for document length. Method (3) represents the basic spreading activation with more conventional

--152-

Method (4) is equivalent to (21, (L2) normalization factors in the denominator. but uses an L2 normalization. The four vector processing methods (5) to (8) represent a standard term frequency weight for both queries and documents (5), a standard term frequency multiplied by inverse document frequency (l/x idl) (6), a length normalized term frequency for documents and enhanced (tf x idfi I/X id/ weight for documents weights for queries (71, and finally a normalized straand enhanced (If x irffi weight for queries (8). These last two weighting tegies are known to produce a high order of retrieval effectiveness. [SAL-

TON873 The weighting systems of Table 1 are evaluated using six document collecin size from 1033 documents and 30 tions in different subject arcas, ranging queries in biomedicine (MED) to 12,684 documents and 84 queries in electrical engineering (INSPEC). The collection characteristics arc listed in Table 2. The evaluation output is shown in Table 3 for the eight term weighting sysThe performance is measured tcms using the six sample document collections. by giving average search precision values evaluated at three fixed recall points of 0.25, 0.50, and 0.75. The average precision values are further averaged over the number of queries available for each collection. The base run is the spreading activation system of expression (4) described earlier in rJONES871. Percentage improvement or deterioration figures are shown in Table 3 reflecting the differences in performance between each particular method and the corresponding base run. Table 3 shows that the simple spreading activation system does not proare duce a high performance level. Improvement in retrieval effectiveness easily obtained by using better normalization factors, and in particular by taking into account the document length. Furthermore, with the exception of the term frequency weights of method (5), the vector processing model produces better output than the spreading activation system. The weighting system of method (8) improves the output by factors ranging from 23 percent for the NPL colIection to over 40 percent for CISI and CRAN above the performance level of the spreding activation system. The simple spreading activation system considered in this study may not. be sufficiently powerful to produce acceptable retrieval output. More refined systems may become available in the future; such systems must be evaluated before actual implementations in retrieval environments appear warranted.

References [ANDERSON831

Anderson

University

Press,

1983,

[COHEN871 Spreading Marrugcmcnl,

Cohen P.R. and Kjeldscn Activation in Semantic 23~4, 1987, 255-268.

ICOLLINS Semantic

Collins Processing,

J.R.,

Cambridge,

The MA,

Architecture Chapter 3.

of

Cognition,

R., Information Retrieval Networks, ltl~urmrr~iot~

by Constrained Proccssirlg and

A.M. and Loftus E.F., A Spreading Activation P.s~~c:f~o?o~ic~l Review, 82, 1975, 407-428.

[DOYLE753 Doyle L.B., In/b-maliotr Rc/ricrvn? ing Company, Los Angeles, CA, 1975.

-153-

crrrtf Processing,

Harvard

Melville

Theory Publish-

of

I FOX833

Pox E.A.,

tion Retrieval Dissertation, 1983.

Extending

the Boolean

with P-Norm Queries Department of Computer

and Vector Space Models 01: Informaand Multiple Concept Types, Doctoral Cornell University, Ithaca., NY, Science,

Associative Information [G[ULAN073] Eiuliano V.E. and cJoncs I’.;E.. Linem Retrieval, in P. Howerton (ed,) Vi:;~rz,s in Irl/;,rrr~tr/ion flrr.rtdl&:, Spartan Books, Inc., Washington, DC 1983. [IDE71a] Idc E., New Experiments in Relevance Feedback, in ‘1’11~ SllZART Rehicwtl Syskrn --E’xpccrinlcrr/s in AUIU~CI!~C Dut*r~nrer~l I-‘roccssi~~g, cd. G. Salton, Englewood Cliffs, NJ, ‘Prentice-Hall, Inc., 1971, 337-354. IIDE7lbl Ide E. and Salton G., Interactive Search Strategies in Tflif SMART Organization in Information Retrieval, Ex1~rirrtcrlf.s it1 Auborrlcrlic Dvcurncn/. t’roc*ossi~lg, ed. G. Cliffs, NJ, Prentice-Ilall, Inc., 1971, Chapter 18, 373-393. [JONES671 Jones P.E., Giuliano V.E. and Curticc A.E., L~lig,cuage Proccssiq, Vol. 2, Linear Models for Associative tle Inc., Report ESD TK 67-202, February 1967.

File and Dyn.amic Relricoal Syslcm-Salton, E:ngIewood

Popcrs OIL Aulomaled Retrieval, A.D. Lit-

[JONES871 Jones W.P. and Furnas G.W., Pictures of Relevance--A Analysis of Similarity Measures, Jourrmi of the American Society lion Scierlcc, 38:6, 1987, 420-442. [LESK79] Ancricnr~

Lesk M-E., Word-word Associations Documcnhaliort, 2011, January 1989,

in Document 27-38.

for

Geometric Informa-

Retrieval

Systems,

IRAGHAVAN861 Raghavan V.V. and Wong S.K.M., A Critical Analysis Vector Space Model for Information Retrieval, Jounrd o/’ lhu Anzcrictm fbr Irr[~~rnsa~ior: Scic’rLcc, 37:5, September 1986, 279-287.

of the Socidy

[ROBERTSON81] Robertson S.E., van Rijsbergcn C.J. and Porter M.F., ProbaaliotL Rclrieunl Research, in Irr/i,rrri bilistic Models of Indexing and Searching, R.N. Oddy, S.E. Robertson, C.J. van Rijsbergen, P.W. Williams, editors, Butterworths, London, 1981, 35-56. LROCCHI071 The SMART ed. C. Salton,

323. [SALTON McGraw

JR occhio, Jr J.J., Relevance Feedback in Information Retrieval, in Relrieoal Sl_)lsdens--E=x~~crirrrcnIs in Aulornalic Documeul Processing, Englcwood Cliffs, NJ, Prentice-Hall, Inc., 1971, Chapter 14, 313-

Salton Hill

ISALTON731 Automatic

Book

Salton Indexing,

G. Auhmcrlic Company, NY,

Solton G. Text Retrieval, Cornell University,

[VAN RIJSBERGEN773 Co-occurrence Data 1977, 106-l 19.

Orgarzizadiwt

arid

Rclricval,

G. and Yang C.S., On the Specification of Term JoIL~I~(~~ o~D~curr;cnlalisln , 29:4, December 1973,

[SALTON Salton G., Yang in Automatic Text Analysis, Scicltc~‘, 2611, January-February [SALTON Automatic Sci’ence,

Irl[ormrrlio/r 1968.

in

C.S. and Yu C.T., A Theory of Term JourllNI u/’ lhc American Socicly fh1975, 33-44.

and Buckley, C., Term Weighting Technical Report 87-881, Department Ithaca, NY, November 1987. van Rijsbergen C.J., Information Retrieval,

--154-

Values in 351-372.

Importance Irr[ormalion

Approaches vs of Computer

A Theoretical Basis for the Journal or DocumenLaliort,

Use

of 33,

[YU831 Yu C.T., Buckley C., Lam K. and Salton G., A Generalized Term Dependence Model in Information Retrieval, Inforrnatio~~ Tct:hrrology: Research atrd Devclopmcrr~, 214, October 1983, 129-154.

Term Weight Documents

(1)

Spreading Activation L i term normalization (document normalization extends over jth term in all N documents)

(2)

Spreading Activation L i term normalization and document length normalization

(3)

Spreading Activation L z term normalization

(4)

Spreading Activation f,z term normalization and document length normalization

(5)

Vector Process standard I/ weights and queries

in

Term

l/d, d-

k~l(~f2*)2

for documents

(6)

Vector Process standard (I/ x idl) weights for documents and queries

(7 ‘1

Vet tor Process l/ documents length normalized; enhanced tf x id/ queries

(8)

Vector Process (best) length; I/ x id/ documents normalized enhanced rf x idf queries

Weighting Formulas for Document and Query Terms (N documents in collection, no distinct terms) Table

1,

--155-

Weight Queries

in

-

-

Number of Queries

Average Number of Document Terms -

Number of Documents

Collection -

------

---

Average

of @wry

Number Terms

-------

(CACM

3204

24.5

64

10.1

CISI

1460

46.6

112.

28.3

CRAN

1398

53.1

225

9.2

INSPEC

12684

32.5

84

15.6

MED

1033

51.6

30

10.1

NPL

11429

20.0

100

7.2

Collectlon

SpeciIication6 Table

Prlxesriing

Method

2

CACM 3204 dots 64 queries

CISI 1460 dots 112 queries

CHAN 13!)7 dots 225 queries

INSPEC 12684 dOC6 84 queries

MED 1033 dots 30 queries

NPL 11429 dots 100 queries

.2H30

.1844

.2724

.2110

.4539

.I577

.2174

+ 3%

.4732 + 21%

-

.1486 6%

.4279 - 6%

+

.1697 8%

A779 -t- 5%

.1848 +i796

(11

(3)

(4)

(51 (6)

Sttindard Activation

Spreading (haee runs)

SIpreading Activation with Document Length Normalizution

-

.2636 7%

.1991 + 8%

.3115 f- 14%

Spreading (1,~ norm)

.3167 + 12%

.1745 - 5%

.2744

.2142

-I- 1%

+ 2%

Spreading Activation CL,, norm) with Length Normtllization

.3299 -t 17%

208 1 + 13%

Vi&or Process stendurd /,/weights

.1647 -42%

.1322

.lH34

.lO60

- 28%

-- 37%

- 50%

.3899 - 14%

.1197 -24%

Vmsctor Process X stzi;:t;d

.3246 + 15%

.2266

+ 40%

.x991 + 10%

.2365 -t12%

.5177 + 14%

.I.846 -t- 17%

.3252 + 15%

.21H9 f 42’X,

.:1950 +4h%

.2626

+ 24%

.5542 t22%

2170 + 38%

.3630 + 2n3;

.261H +42’16,

.3H4 1 +412

+ 24Q

.5628 C 24%

.1933 + 23%

Activation

tf

idf

Vector Process l/length normed documents; ehhunced 1f X idf

.:x40

.2434

+. 30%

+ 15%

queries Vector Process tf X idf length ntirmed documentn; eohtlhced X iiff queries

a2626

If

Performence

Evaluation ror 6 Document (Z-point precision averuges) Table

3

--156-

Collection

c= -

Dl

Cl1

Cl2

--*

tit

Pi?

c21

c22

---

C2t

.

a

.

.

*

.

Dd

Cdl

a)

*--

Cd2

Original

matrix

assignment

cdt

C showing to documents

Di

b)

Computation overlapping

Fig.

term

Dj

of document

association

vocabulary

between

1. Linear

Associative

--157-

matrix document

Retrieval

D using pairs.

/ I

component of

s

l/as- ~‘CT,
Computation of term association matrix postings to documents between

T using overlapping pairs of terms

r (d)

=

c (d x 1)

*

Retrieval with augmented terms

r WI

=

c (d x 1)

*

T u XL)

9 (1)

Retrieval with augmented documents

r (d)

=

D (d x d)

.

C Cdx 1)

q tt)

Retrieval with augmented terms and documents

r Cd)

=

D (d Xd)

.

C (d x 1)

1initial retrievnl , operntion

d) Computation

Fig.

I

Linear

of document

Associutive

--159-

retrieval

Rctricvtil

vuhes

(conl.1

Query 0 /-WQl

;--:.k?iY:;qq

Term in Ofi \

1

Term

1

Term

in DB

Document

in DC

1

Term

2

in DA

Term 2 In OR

Document

A

Simplified

Term

2

Term

3

1t-1DB

in DO

8

Document

Spreading to Corresponding

Activation from Documents

Fig.

--140-

4

Term

3

in 00

C

Query

Term

m

in DC

Term

m

Term

in Do

Document

m

in DE

D

Suggest Documents