An Incremental Algorithm to find Asymmetric Word Similarities for Fuzzy Text Mining

T. P. Martin and M. Azmi-Murad
Artificial Intelligence Group, Department of Engineering Mathematics, University of Bristol, BS8 1TR, UK
email: {masrah.azmi-murad, trevor.martin}@bris.ac.uk

Abstract. Synonymy – different words with the same meaning – is a major problem for text mining systems. We have proposed asymmetric word similarities as a possible solution to this problem, where the similarity between words is computed on the basis of the similarities between contexts in which the words appear, rather than on their syntactic identity. In this paper, we give details of an incremental algorithm to compute word similarities and outline some tests which show the method’s effectiveness.

1 Introduction

Text mining aims to extract information from text sources to produce new and useful insights [1]. The case for text mining as an essential component of business intelligence, web navigation and information management has been well rehearsed elsewhere. Text mining should be distinguished from
• data mining, which is concerned with extracting predictive patterns from predominantly numerical data
• information retrieval, where we search for something that is already known but may be hidden in a mass of irrelevant material
• information extraction, which filters text for small, structured items such as names, addresses, etc.
Current approaches to text mining fall into two camps:
• natural language processing, which is generally thought to be many years away from feasibility
• statistically based approaches, which treat documents as bags of words

The “bag of words” approach leads to additional problems, as humans use different words to refer to the same concept. For example, Blake and Pratt [2] point out that “hair loss” and “alopecia” are semantically identical, but will be counted as three separate and unrelated words (hair, loss and alopecia). A possible solution to this problem is offered by the use of an external dictionary or lexical database such as WordNet [3]; unfortunately, such systems tend to be very general and give too many synonymy candidates, as well as neglecting domain-specific synonyms. For example, WordNet 2.0 correctly gives automobile as a synonym for car, but also gives elevator, gondola and railcar. Conversely, when considering telecoms companies, one might like to count Orange, T-Mobile, Vodafone etc. as synonymous with mobile telco. Domain-specific information must be added by an expert or gathered from documents. Soft computing has much to offer in extracting domain-specific information and in generalising statistically-based approaches, including an ability to incorporate partial matching between words and short phrases. We have previously proposed asymmetric word similarities [4, 5] as a tool for automatically computing similarities between words on the basis of their contexts. In this paper, we give details of an efficient algorithm for computing such similarities. A key feature of the algorithm is that it is incremental, i.e. it can be updated each time a new document is encountered (or each time a word is read) without extensive re-computation. We also outline experiments where this method has been successfully applied.

2 Similarity

Similarity is a fundamental concept in the representation of vague knowledge and approximate reasoning, and has been studied by many researchers [6]. It is difficult, if not impossible, to define similarity. Goodman [7] suggested that two objects “a and b are more alike than c and d if the cumulative importance of the properties shared by a and b is greater than that of the properties shared by c and d”, going on to say “But importance is a highly volatile matter, varying with every shift of context and interest, and quite incapable of supporting the fixed distinctions that philosophers so often seek to rest upon it”. A common approach is to treat similarity as a fuzzification of an equivalence relation, generalising reflexivity, symmetry and transitivity:

µ(x, x) = 1
µ(x, y) = µ(y, x)
µ(x, z) ≥ T(µ(x, y), µ(y, z))

where T is a t-norm and µ(x, y) represents the degree to which x and y are similar (see for example [8] or [9]). Under this approach, similarity is related to a distance metric and hence has many useful mathematical properties. However, we argue that this is not an appropriate definition of similarity for syntactic text mining. The transitivity axiom is not always correct – for example, fish is quite similar to goldfish and goldfish is quite similar to pet, but pet and fish are not at all similar. Even without assigning numerical values, it can be seen that this breaks transitivity. Using different senses of a word can obviously violate transitivity – banana is similar to orange (both fruit) and orange is similar to vodafone (both mobile telcos) – but banana and vodafone have no similarity. This effect may appear in a more subtle form – hammer is similar to nail and nail is similar to screw, but hammer and screw are not similar [10].

Similarity is closely related to the concept of typicality, which can lead to asymmetry in the similarity relation. Saying that goldfish is similar to pet implies that the “distance” from goldfish to pet is small, so that one could replace the word goldfish by pet in a sentence without losing much meaning – for example, I keep a goldfish in my garden pond and I keep a pet in my garden pond. However, the converse substitution may be less appropriate – I like to stroke my pet is quite natural, but I like to stroke my goldfish sounds odd. For these reasons, we do not look for transitivity or symmetry in the similarity relation for text mining.

The above argument is informal and (to some extent) unfair, in that it mixes different meanings (orange as telco and orange as fruit) or superclasses (pet) with more specific subclasses (goldfish). However, our point is that a statistically-based text-mining system typically does not take account of semantic categories or word senses, and has little to go on other than the symbols representing the words. We prefer to regard similarity as the degree to which one word can be replaced by another in a given sentence. This is clearly asymmetric in many cases.
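To make the failure of transitivity concrete, the short Python fragment below checks the t-norm condition for the goldfish example; the numerical degrees are invented purely for illustration and are not taken from our data.

# illustrative similarity degrees, invented for this example
mu = {
    ("fish", "goldfish"): 0.8, ("goldfish", "fish"): 0.8,
    ("goldfish", "pet"): 0.8,  ("pet", "goldfish"): 0.3,   # asymmetry arising from typicality
    ("fish", "pet"): 0.1,      ("pet", "fish"): 0.1,
}
t_norm = min   # a common choice of t-norm

# transitivity would require mu(fish, pet) >= T(mu(fish, goldfish), mu(goldfish, pet))
print(mu[("fish", "pet")] >= t_norm(mu[("fish", "goldfish")], mu[("goldfish", "pet")]))   # False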

3 The Method

The AWS method is based on the observation that it is frequently possible to guess the meaning of an unknown word from its context. For example, consider these sentences (taken from [11]): A bottle of tezgüno is on the table. Everyone likes tezgüno. Tezgüno makes you drunk. We make tezgüno out of corn. It is a reasonable guess that the word “tezgüno” refers to an alcoholic beverage. The idea that words occurring in the same contexts tend to have similar meanings is formalised as the Distributional Hypothesis [12]. The method described in this paper assumes that similar words appear in similar contexts, and measures the similarity by comparing word contexts. We use a frequency-based method to make this comparison; in order to smooth statistical fluctuations, we convert the frequency distributions to fuzzy sets [13, 14] which represent a family of distributions, and find their conditional probabilities. This (non-symmetric) quantity measures the degree of similarity between two words. We define a document D_k as an ordered sequence of n_k words, which may be systematically pre-processed to convert case, stem words, remove stop words, etc. Ordering of the words is preserved.

D_k = (w_0, w_1, …, w_{n_k})

The vocabulary of a document, V(D_k), is the set of all words in the document:

V(D_k) = { w | (w) is a sub-sequence of D_k }

For any sub-sequence S_m(i) = (w_i, w_{i+1}, …, w_{i+m}) of words in D_k we define

pre_p^m : N → V^p,   pre_p^m(i) = (w_{i−p}, w_{i−p+1}, …, w_{i−1})
suc_p^m : N → V^p,   suc_p^m(i) = (w_{i+m+1}, w_{i+m+2}, …, w_{i+m+p})

for suitably chosen values of m and p, giving a block of m words preceded and succeeded by blocks of p words. Special treatment is required near the start and end of the document. For the remainder of this paper, we take m = p = 1 and drop the suffixes, i.e. we consider context sets of one word on each side of a single word. The method is applicable to larger or variable groupings. To illustrate, consider a document D containing the sentences

The quick brown fox jumps over the lazy dog. The quick brown cat jumps onto the active dog. The slow brown fox jumps onto the quick brown cat. The quick brown cat leaps over the quick brown fox.

Then S(1) = quick, pre(1) = the, suc(1) = brown, and S(2) = brown, pre(2) = quick, suc(2) = fox. Clearly

pre(i) = S(i−1)
suc(i) = S(i+1)

We also define the predecessor and successor set functions

PRE : V → P(V),      PRE(S(i)) = { pre(j) | S(i) = S(j) }
SUC : V × V → P(V),  SUC(p, S(i)) = { suc(j) | S(i) = S(j) ∧ pre(j) = p }

Informally, in a document or corpus, PRE gives the set of words preceding a specified word S(i), and SUC gives the set of words succeeding a specified (predecessor, word) pair. Let N(pre(i), S(i), suc(i)) be the number of times the sequence pre(i), S(i), suc(i) occurs in D. Then the frequency context set of w is

f_w = { (P, S) : N(P, w, S) | (P, w, S) is a sub-sequence of D }

For each word w, we incrementally build up a fuzzy set context-of-w containing pairs of words that surround w, with a membership calculated from the corresponding frequencies. For example, for brown in the document above, we have (quick, cat) three times, (quick, fox) twice and (slow, fox) once. We use mass assignment theory to convert these frequencies to fuzzy sets; in the case above, we would have the fuzzy set

{ (quick, cat) : 1, (quick, fox) : 0.833, (slow, fox) : 0.5 }

For any two words w1 and w2, the value Pr(context-of-w1 | context-of-w2) measures the degree to which w1 could replace w2, based on the similarity of their context sets calculated by semantic unification. This is abbreviated to Pr(w1 | w2) below. For example, assume further sentences give the fuzzy context set of grey as

{ (quick, cat) : 1, (slow, fox) : 0.75 }

We then calculate Pr(brown | grey) = 0.8125 and Pr(grey | brown) = 0.625. By semantic unification of the fuzzy context sets of each pair of words, we obtain an asymmetric word similarity matrix.

A straightforward implementation would take O(n_i × n_j) operations per semantic unification (where n_i is the cardinality of the fuzzy context set of word i), and would require O(v²) semantic unifications, where v is the size of the vocabulary. Adding a new document, or a new word, would require recomputation of the entire matrix. The algorithms presented below significantly reduce these complexity estimates.
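Before giving the algorithms, here is a small Python sketch of the context-gathering step with m = p = 1, applied to the example document above; it reproduces the counts for brown quoted earlier. The tokenisation and the decision simply to skip the first and last word of the text are our simplifications of the boundary treatment.

from collections import defaultdict
import re

def context_counts(text):
    # Count (predecessor, successor) pairs around each word (m = p = 1).
    # Words at the very start and end of the text have no full context and are skipped,
    # which is one simple way of handling the boundary cases mentioned above.
    words = re.findall(r"[a-z]+", text.lower())
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(1, len(words) - 1):
        counts[words[i]][(words[i - 1], words[i + 1])] += 1
    return counts

doc = ("The quick brown fox jumps over the lazy dog. "
       "The quick brown cat jumps onto the active dog. "
       "The slow brown fox jumps onto the quick brown cat. "
       "The quick brown cat leaps over the quick brown fox.")

print(dict(context_counts(doc)["brown"]))
# {('quick', 'fox'): 2, ('quick', 'cat'): 3, ('slow', 'fox'): 1}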

Algorithm 1 - Calculation of membership from frequency

Inputs – f_w = array of frequency counts; T = total frequency count for this word = Σ_{P,S} f_w(P, S)
Output – m_w = array of memberships

sort the frequency counts into decreasing order, f(P_0, S_0), …, f(P_{n−1}, S_{n−1}), such that f(P_i, S_i) ≥ f(P_j, S_j) whenever i < j
set the membership corresponding to the maximum count, m(P_0, S_0) = 1
for i = 1 … n−1, i.e. for each remaining frequency count
    m(P_i, S_i) = m(P_{i−1}, S_{i−1}) − ( f(P_{i−1}, S_{i−1}) − f(P_i, S_i) ) × i / T
endfor

The complexity of Algorithm 1 is determined by the sorting step; the remaining steps are linear in the size of the array and (effectively) negligible.
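A minimal Python sketch of Algorithm 1, applied to the frequency counts for brown from the example document; the function and variable names are ours.

def memberships_from_frequencies(freq):
    # Algorithm 1: convert context-pair frequency counts into fuzzy memberships
    # using the mass assignment construction described above.
    if not freq:
        return {}
    total = sum(freq.values())
    items = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)   # decreasing counts
    memberships = {items[0][0]: 1.0}                                   # largest count gets membership 1
    prev_pair, prev_count = items[0]
    for i in range(1, len(items)):
        pair, count = items[i]
        memberships[pair] = memberships[prev_pair] - (prev_count - count) * i / total
        prev_pair, prev_count = pair, count
    return memberships

print(memberships_from_frequencies({("quick", "cat"): 3, ("quick", "fox"): 2, ("slow", "fox"): 1}))
# {('quick', 'cat'): 1.0, ('quick', 'fox'): 0.833..., ('slow', 'fox'): 0.5}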



Algorithm 2 - Semantic unification for w1 | w2

Inputs – m_w1, f_w2; T = total frequency count for w2 = Σ_{P,S} f_w2(P, S)
Output – semantic unification value, SU = Pr(w1 | w2)

SU := 0
for i ∈ PRE(w1) ∩ PRE(w2)
    for j ∈ SUC(i, w1) ∩ SUC(i, w2)
        SU := SU + m_w1(i, j) × f_w2(i, j)
    endfor
endfor
SU := SU / T
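A corresponding Python sketch of Algorithm 2. The frequency counts assumed for grey (5 and 3) are one choice consistent with its stated fuzzy context set; with them the sketch reproduces the values Pr(brown | grey) = 0.8125 and Pr(grey | brown) = 0.625 quoted above.

def semantic_unification(m_w1, f_w2):
    # Pr(w1 | w2): only context pairs shared by both words contribute
    total = sum(f_w2.values())
    return sum(m_w1.get(pair, 0.0) * count for pair, count in f_w2.items()) / total

m_brown = {("quick", "cat"): 1.0, ("quick", "fox"): 0.833, ("slow", "fox"): 0.5}
f_brown = {("quick", "cat"): 3, ("quick", "fox"): 2, ("slow", "fox"): 1}
m_grey  = {("quick", "cat"): 1.0, ("slow", "fox"): 0.75}
f_grey  = {("quick", "cat"): 5, ("slow", "fox"): 3}    # assumed counts, consistent with m_grey

print(semantic_unification(m_brown, f_grey))   # 0.8125 = Pr(brown | grey)
print(semantic_unification(m_grey, f_brown))   # 0.625  = Pr(grey | brown)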

The final algorithm calculates the AWS matrix. There are important efficiency considerations in making this a totally incremental process, i.e. words (and documents) can be added or subtracted without having to recalculate the whole matrix of values. Each word is stored with a list of its context pairs and the number of times each context pair has been observed. Once the word similarities have been calculated, they only need to be updated when one of the word contexts changes, i.e. if we read a word wi then we just mark the non-zero elements Pr(w | wi) and Pr(wi | w) for recalculation. The only exception occurs when a new context is read, i.e. a sequence P-w-S which has not been seen before. In this case, we need to look for other words wj which have the same context P-wj-S and mark the elements Pr(w | wj) and Pr(wj | w) as needing recalculation.

Algorithm 3 - Creation of AWS

Inputs – word w with associated frequency f_w and membership m_w, predecessor P and successor S; Sim_k(wi, wj), the current asymmetric word similarities
Outputs – Sim_{k+1}(wi, wj), the updated asymmetric word similarities

update f_w(P, S)
if m_w(P, S) not marked for recalculation then
    mark m_w(P, S) as needing recalculation
    mark Sim(w, w) as needing recalculation
    for all words wi such that Sim(w, wi) > 0
        mark Sim(wi, w) as needing recalculation
        mark Sim(w, wi) as needing recalculation
    endfor
endif
if f_w(P, S) = 1 then    // i.e. the first time we have seen this triple
    for wj ∈ PRE(S), wj ≠ w
        if f_wj(P, S) > 0 then
            mark Sim(w, wj) as needing recalculation
            mark Sim(wj, w) as needing recalculation
        endif
    endfor
endif
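The book-keeping behind Algorithm 3 can be sketched in Python as follows; the class and attribute names are ours, the recalculation itself (via Algorithms 1 and 2) is omitted, and marking is done unconditionally, which is equivalent here because the set of dirty entries makes repeated marking idempotent.

from collections import defaultdict

class IncrementalAWS:
    # Sketch of the incremental marking in Algorithm 3 (recalculation omitted).
    def __init__(self):
        self.freq = defaultdict(lambda: defaultdict(int))  # w -> {(P, S): count}
        self.preceders = defaultdict(set)                  # S -> words seen immediately before S, i.e. PRE(S)
        self.nonzero = defaultdict(set)                    # w -> words wi with Sim(w, wi) > 0 so far
        self.dirty = set()                                 # Sim entries needing recalculation

    def read_triple(self, P, w, S):
        # Process one predecessor-word-successor triple from the input stream.
        self.freq[w][(P, S)] += 1
        self.preceders[S].add(w)
        self.dirty.add((w, w))                             # similarities involving w may change
        for wi in self.nonzero[w]:
            self.dirty.update({(w, wi), (wi, w)})
        if self.freq[w][(P, S)] == 1:                      # first time this triple has been seen
            for wj in self.preceders[S]:                   # other words wj with the same context P-wj-S
                if wj != w and self.freq[wj].get((P, S), 0) > 0:
                    self.dirty.update({(w, wj), (wj, w)})

aws = IncrementalAWS()
for triple in [("quick", "brown", "fox"), ("quick", "grey", "fox")]:
    aws.read_triple(*triple)
print(aws.dirty)   # brown and grey now share the context quick-?-fox, so both Sim entries are marked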

Algorithm 3 creates an asymmetric word similarity matrix Sim, whose rows and columns are labelled by all words encountered in the document collection. Each cell Sim(i, j) holds a value between 0 and 1, indicating the extent to which a word i is contextually similar to word j. The algorithm is lazy: it does not recalculate elements of Sim until values are needed. Updating with a new word is a very easy process; in contrast, a method such as TF-IDF requires the entire set of weights to be recalculated because all frequencies must be updated. Many of the elements are zero, and we can save considerable space by storing only non-zero elements. Related words appear in the same context (e.g. grey and brown above); however, unrelated words may also appear, e.g. the phrase slow fat fox would lead to a non-zero similarity between fat and brown. We have two options to deal with these unrelated words. In [15] we filter the AWS output through WordNet's [3] synonym, antonym and hypernym sets. This gives a document-based set of related words as the basis for a taxonomy [16]. Alternatively, we can treat all similarities from the AWS algorithm as valid and use the results for document similarity and clustering, as reported below. Further details of these tests are given in [4, 17, 18].

3.1 Test Results 1 – Document Similarity

The AWS values were used to compute document similarities on the Reuters collection (formerly available from www.research.att.com/~lewis/reuters21578.html) to retrieve documents relevant to a query. Measuring pairwise document similarity is an essential task in grouping documents and can help the user to navigate easily through similar documents. The similarity measure is defined as

sim(doc1, doc2) = Σ_{a ∈ doc1} Σ_{b ∈ doc2} f_a × Sim(a, b) × f_b

where f is the relative frequency of a word in a document and Sim is the word similarity outlined above. To evaluate the AWS method, we compared it to TF-IDF. A set of 4535 documents was used for “training”, i.e. extracting the AWS matrix or creating TF-IDF values, and 4521 documents were used as a “test” set. As with many other document retrieval techniques, we applied Porter stemming and discarded common words using a stop list. Several short queries were used to ensure that the system produces expected results, and these are reported elsewhere. Here we present preliminary results using two queries, Query 1: “bomb attack in Madrid” and Query 2: “digital library”. Based on the queries, the first (training) set has thirty-eight relevant documents retrieved for Query 1 and twenty-two documents for Query 2, while the second (test) set has fifty-four documents for Query 1 and thirty for Query 2. Both TF-IDF and the AWS method produce the same sets of documents in this case. The document-document similarity measures using AWS are consistently about 10% higher than those using TF-IDF, taken over both document sets, with much less computation required for the AWS calculations.
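A short Python sketch of this document-document similarity measure; the dictionary lookup standing in for the sparse AWS matrix is an assumption, and the value 0.223 is the attack/raid entry from Table 1.

from collections import Counter

def document_similarity(doc1, doc2, sim):
    # Sum of f_a * Sim(a, b) * f_b over word pairs, with f the relative frequency.
    c1, c2 = Counter(doc1), Counter(doc2)
    n1, n2 = len(doc1), len(doc2)
    return sum((fa / n1) * sim.get((a, b), 0.0) * (fb / n2)
               for a, fa in c1.items() for b, fb in c2.items())

sim = {("attack", "raid"): 0.223, ("raid", "attack"): 0.018}   # toy excerpt of the AWS matrix
print(document_similarity(["bomb", "attack"], ["raid", "report"], sim))   # 0.5 * 0.223 * 0.5 = 0.05575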

Table 1 - AWS measure of contextual word similarity

Training:
word i      word j       Sim(wi, wj)   Sim(wj, wi)
injured     wound        0.143         0.857
said        quote        0.488         0.024
attack      raid         0.223         0.018
output      produce      0.059         0.111
price       policies     0.111         0.042
target      estimate     0.333         0.333
quantity    bag          0.057         0.2
aid         money        0.5           0.2
arrest      custody      1             0.333
grain       currency     0.2           0.5
surplus     unwanted     1             1

Test:
word i      word j        Sim(wi, wj)   Sim(wj, wi)
police      shepherd      0.303         0.091
said        confirm       0.109         0.013
attack      threaten      0.667         0.030
military    Pentagon      0.385         0.045
Iranian     Palestinian   0.173         0.023
Spain       country       0.027         0.115
reach       get           0.417         0.167
produce     output        0.182         0.059
increase    prolong       1             0.125
output      amount        0.333         0.091
slump       fell          0.333         0.5

Table 1 shows a selection of values from the asymmetric word similarity matrix. By inspection, most word pairs are plausible (for example, surplus and unwanted), although some (for example grain and currency) are not obviously related. We emphasise that words are considered to be similar because they frequently share contexts, although some words are quite distant in meaning.

3.2 Test Results 2 – Grouping Documents

We use the document similarity measures above to form document groupings based on their similarity values. In this case, we show the document groupings using two short queries, i.e. 'apple' and 'cocoa plantation'. For the clustering method, we used a K-means algorithm with K=3. Our method produces better document groupings compared to TF.IDF, with only relevant subjects appearing in each cluster – see Tables 2 and 3.

Table 2. Document grouping using the query 'apple'; documents are listed cluster by cluster for each method

AWS                               TF.IDF
Document   General Subject        Document   General Subject
D1         Apple computer         D6         Apple fruit
D3         Apple computer         D7         Apple fruit
D4         Apple computer         D1         Apple computer
D5         Apple computer         D5         Apple computer
D2         Fresh apple            D2         Fresh apple
D6         Apple fruit            D4         Apple computer
D7         Apple fruit            D3         Apple computer

Table 3. Document grouping using the query 'cocoa plantation'; documents are listed cluster by cluster for each method

AWS                                 TF.IDF
Document   General Subject          Document   General Subject
D3         Cocoa                    D5         Cocoa
D4         Cocoa                    D6         Cocoa
D2         Rubber plantation        D1         Coffee plantation
D7         Rubber plantation        D7         Rubber plantation
D1         Coffee plantation        D3         Cocoa
D5         Cocoa                    D4         Cocoa
D6         Cocoa                    D2         Rubber plantation

3.3 Test 3 – Document Summarisation

In DUC 2002 (http://duc.nist.gov), there are 59 document sets consisting of 567 related newswire documents from Associated Press, Wall Street Journal, Financial Times, LA Times and San Jose Mercury News, spanning the period from 1988 to 1992. The average length of a document is 30 sentences. We randomly selected eight document sets (68 documents) from the corpus for use in our experiment. Task 1 is a single-document summarisation task in which the goal is to extract 100-word summaries from each document in the corpus. Our summariser picks the most representative sentences to appear in the summaries by calculating sentence-sentence similarities using the AWS method. The popular TF.ISF method was also used to extract the most representative sentences. The automatically generated summaries were compared sentence by sentence with the manually created summaries. On average, AWS produces summaries that are 60% similar to the manually created summaries, while TF.ISF produces summaries that are 30% similar. For comparison, the human summarisers P1 and P2 produce summaries that are on average 70% similar to each other.
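The sentence-selection step can be sketched in Python as below; the scoring rule (summed similarity to the other sentences) and the greedy fill of the 100-word budget are our assumptions, as the exact selection rule is not spelled out here.

def extract_summary(sentences, sent_sim, word_limit=100):
    # Pick representative sentences up to a word budget.
    # sent_sim[i][j] is the sentence-sentence similarity, e.g. the document
    # similarity measure of Section 3.1 applied to pairs of sentences.
    scores = [sum(row) - row[i] for i, row in enumerate(sent_sim)]   # similarity to all other sentences
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= word_limit:
            chosen.append(i)
            used += n
    return [sentences[i] for i in sorted(chosen)]   # keep original document order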

4 Summary and future work

We have presented a detailed algorithm to calculate the asymmetric similarity between words using fuzzy context sets. Limited experiments have shown that using the AWS method to measure pairwise document similarity produces higher similarity values than TF-IDF. Our method shows satisfactory results when compared with a conventional similarity measure. We are currently continuing to test the system with a large document collection and longer queries, to ensure that it produces consistently satisfactory results. Future work will investigate comparative timings and more details of the complexity of the AWS algorithm, compared with its competitors.

References

1. Hearst, M.A.: Invited Talk: Untangling Text Data Mining. Annual Meeting of the Association for Computational Linguistics 37 (1999) 3-10.
2. Blake, C. and Pratt, W.: Better Rules, Few Features: A Semantic Approach to Selecting Features from Text. In: Proc. IEEE Data Mining, San Jose, CA (2001) 59-66.
3. Miller, G.A.: WordNet: A Lexical Database for English. Comm. ACM 38 (1995) 39.
4. Azmi-Murad, M. and Martin, T.P.: Information Retrieval using Contextual Word Similarity. In: Proc. UKCI-04, UK Workshop on Computational Intelligence (2004) 144-149.
5. Azmi-Murad, M. and Martin, T.P.: Asymmetric Word Similarities for Information Retrieval, Document Grouping and Taxonomic Organisation. In: Proc. EUNITE 2004, Aachen, Germany (2004) 277-282.
6. Hahn, U. and Ramscar, M. (eds.): Similarity and Categorization. Oxford University Press (2001).
7. Goodman, N.: Seven Strictures on Similarity. In: Goodman, N. (ed.): Problems and Projects. Bobbs-Merrill (1972) 437-447.
8. Zadeh, L.A.: Similarity Relations and Fuzzy Orderings. Information Sciences 3 (1971) 177-200.
9. Dubois, D. and Prade, H.: Fuzzy Sets and Systems: Theory and Applications. Academic Press (1980).
10. Love, B.C.: Similarity and Categorization: A Review. AI Magazine 23 (2002) 103-105.
11. Pantel, P. and Lin, D.: Discovering Word Senses from Text. In: Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002) 613-619.
12. Harris, Z.: Distributional Structure. In: Katz, J.J. (ed.): The Philosophy of Linguistics. Oxford University Press (1985) 26-47.
13. Baldwin, J.F.: The Management of Fuzzy and Probabilistic Uncertainties for Knowledge Based Systems. In: Shapiro, S.A. (ed.): Encyclopedia of AI. Wiley (1992) 528-537.
14. Baldwin, J.F., Martin, T.P. and Pilsworth, B.W.: FRIL - Fuzzy and Evidential Reasoning in AI. Research Studies Press (John Wiley), UK (1995).
15. Martin, T.P., Azvine, B. and Assadian, B.: Acquisition of Soft Taxonomies. In: Proc. IPMU-04, Perugia, Italy (2004) 1027-1032.
16. Martin, T.P. and Azvine, B.: Acquisition of Soft Taxonomies for Intelligent Personal Hierarchies and the Soft Semantic Web. BT Technology Journal 21 (2003) 113-122.
17. Azmi-Murad, M. and Martin, T.P.: Using Fuzzy Sets in Contextual Word Similarity. In: Intelligent Data Engineering and Automated Learning, LNCS 3177, Springer (2004) 517-522.
18. Azmi-Murad, M. and Martin, T.P.: Sentence Extraction using Asymmetric Word Similarity and Topic Similarity. In: Proc. 9th Online World Conference on Soft Computing in Industrial Applications, http://wsc9.softcomputing.net/ (2004).
