Finding Similar RSS News Articles Using Correlation-Based Phrase Matching
Maria Soledad Pera and Yiu-Kai Ng
Computer Science Dept., Brigham Young University, Provo, Utah 84602, U.S.A.

Abstract. Traditional phrase matching approaches, which can discover documents containing exactly the same phrases, fail to detect documents that include phrases that are semantically relevant but not exact matches. We propose a correlation-based phrase matching (CPM) model that can detect RSS news articles which contain not only phrases that are exactly the same but also phrases that are semantically relevant, which dictate the degrees of similarity of any two articles. As the number of RSS news feeds over the Internet continues to increase, our CPM approach becomes more significant, since it minimizes the workload of the user, who is otherwise required to scan through a huge number of news articles to find related articles of interest, a tedious and often impossible task. Experimental results show that our CPM model on matching bigrams and trigrams outperforms other phrase- and keyword-matching approaches.
1 Introduction
Phrase queries are frequently used to retrieve documents from the Web. A phrase, which is often defined as a sequence of words [3], can be characterized in two ways: (i) the syntactic structure in which its words are organized, and (ii) the semantic content it delivers. Changing either representation may result in a phrase with a different meaning. Traditional phrase matching techniques aim to retrieve documents that include phrases matching the query phrase exactly, although some advanced approaches tolerate errors to some extent (e.g., proximity of words, word order, and missing words in a phrase). These inherent characteristics restrict their potential usage, i.e., they may fail to detect potentially relevant phrases and hence documents. For example, the phrase “heterogeneous node” (on wireless networks) is semantically relevant to “heterogeneous device” and “heterogeneous transport,” which could be used along with “heterogeneous node” in retrieving closely related documents. Neither keyword matching nor traditional phrase matching, as mentioned earlier, can solve this inexact phrase matching problem. Using the keywords “heterogeneous” and “node” individually in a keyword search could match documents that include either the word “heterogeneous” or “node,” but not necessarily both, and thus the content of retrieved documents might be totally unrelated to “heterogeneous node.” Some of these documents may address “heterogeneous alloys,” whereas others may discuss “homogeneous nodes.” Even when the “matched” documents include both words, the words are not necessarily in the same order, which runs into the same “content mismatched” problem. More sophisticated similarity matching approaches, such as [15], can detect documents that
include similar (not necessarily the same) words; they, however, cannot resolve the word-ordering problem. For example, consider the sentences “They jog for thirty minutes and walk for an hour” and “They run for an hour and stroll for thirty minutes.” Ignoring the word order and simply considering the degrees of (single-)word similarity, i.e., jog versus run and walk versus stroll, causes these sentences to be treated as closely related, even though they are semantically different, and filtering out mismatched documents manually is a waste of time. We propose a correlation-based phrase matching (CPM) model that can detect RSS news articles which contain phrases that are semantically relevant, in addition to exact matches. We are interested in RSS news articles, since the amount of online news that can be accessed by Internet users these days is unprecedented. Thus, the problem of seeking information in online news articles is not the lack of them but being overwhelmed by them. This poses a huge challenge in finding related online news with distinct information automatically, instead of manually, which is a labor-intensive and impractical process. The proposed CPM model measures the degrees of similarity among different RSS news articles using phrase similarity to detect redundant, and discover similar, news articles. We call the proposed model correlation-based, since we adapt the correlation factors in fuzzy sets to model the similarity relationships among different phrases. For each phrase p, a fuzzy set S is constructed that captures the degrees of membership, i.e., closeness, of p with respect to all the other phrases in S, which are called phrase correlation factors. The rest of the paper is organized as follows. In Section 2, we discuss research work in phrase matching. In Section 3, we present the design of CPM. In Section 4, we verify the accuracy of CPM in detecting related documents using various test cases. In Section 5, we draw a conclusion and include future work on CPM.
2 Related Work
Phrase matching has been applied in solving different problems, such as ranking relevant documents, document clustering [3], and Web document retrieval [1]. In [1], a system for matching phrases in XML documents, called PIX, is presented. PIX allows users to specify both (i) tags and annotations in an XML document to ignore and (ii) phrases in the document to be matched. This technique relies on exact and proximity phrase matching (i.e., words in a phrase that are within a distance of k (≥ 1) words in a document) in retrieving relevant documents. [3] cluster Web documents based on matched phrases and their levels of significance (e.g., the title and the body) in the documents. This method uses exact phrase matching to determine the degrees of overlap among documents, which yield their degrees of similarity. [14] use phrase matching for ranking medical documents. The similarity of any two phrases in [14] is detected by the number of consecutive three-letter triples they have in common, with various scores assigned to different triples of letters, e.g., uncommon three-letter triples are given a higher weight, three-letter triples at the beginning of a word are more important than the ones at the end of the word, and long phrases are discounted to avoid bias toward their lengths. In [14], phrases are treated as documents and trigrams of letters are treated as words.
[8], who use phrase and proximity terms for Web document retrieval and treat every word in a query as a phrase, show that the usage of phrases and proximity terms is highly beneficial. However, their experimental results show that even though phrases and proximity terms have a positive impact on 2- or 3-word queries, they have less, or even negative, effects on other types of queries. [2] present a compression method that searches for words and phrases in natural-language text. This method performs an exact search for words and phrases directly on compressed text using any sequential pattern-matching algorithm, in addition to a word-based approximate search and an extended search. Thus, searches can be conducted for approximate occurrences of a phrase pattern. [9] emphasize the importance of phrase extraction, representation, and weighting, and claim that phrases obtained by syntactic (instead of statistical) processing often increase the effectiveness of retrieval when proximity and weighting information are adequately attached to a query phrase representation. [12], however, determines the degree of similarity between any two documents by computing the number of common phrases in the documents and dividing it by the total number of phrases in both, which is intuitively another exact phrase matching approach.
3 Correlation-Based Phrase Matching
Semantically relevant phrases detected by our CPM model hold the same syntactic features as in other phrase matching approaches, i.e., a phrase is treated as a sequence of words and the order of words is significant. Unlike existing phrase matching approaches, we develop novel phrase correlation factors for n-gram (1 ≤ n ≤ 5) phrases. Using one of these chosen sets of n-gram phrase correlation factors, the n-gram phrases in an RSS news article are matched against the n-gram phrases in another RSS news article to determine their degrees of similarity. We detail the design of our n-gram CPM approach on RSS news articles below.
3.1 Content Descriptors of RSS News Articles
Two of the essential elements in an RSS (XML) news feed file, in which RSS news articles are posted, are the title and description of an item (i.e., a news article), since the former contains the headline and the latter includes the first few sentences of the article. Furthermore, several items can appear in the same RSS feed file. (See Figure 1 for an example of an RSS news feed file.) We treat the title and description of each item as the content descriptor of the corresponding article and determine its degree of similarity with the content descriptor of another item (in the same or a different RSS news feed file) according to the correlation factors of phrases in the two content descriptors.
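To make the notion of a content descriptor concrete, the following minimal Python sketch extracts the title and description of each item in an RSS feed; the function name is ours, and a namespace-free RSS 2.0 layout is assumed, which is not specified in the paper.

```python
import xml.etree.ElementTree as ET

def content_descriptors(rss_xml):
    """Return the content descriptor (title + description) of every
    <item> in an RSS 2.0 feed string, as described in Section 3.1."""
    root = ET.fromstring(rss_xml)
    descriptors = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        descriptors.append((title + " " + description).strip())
    return descriptors

feed = """<rss version="2.0"><channel>
<item><title>Ice Storm Threatens Chaos</title>
<description>A brutal ice storm is threatening to turn Oklahoma
into a sheet of ice.</description></item>
</channel></rss>"""

print(content_descriptors(feed))  # one descriptor per <item>
```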
3.2 Computing the Phrase Correlation Factors
Prior to computing n-gram (1 ≤ n ≤ 5) phrase correlation factors, we first decide at what level the correlation factors are to be calculated, which dictates how the subsequent process of phrase comparison should be conducted.
Fig. 1. Portion of an RSS news feed file.
The major drawback of the phrase-level granularity is its excessive overhead. A phrase may start at any position in a document, and the lengths of phrases vary in practical usage. Thus, the number of possible phrases to be considered could be huge. For example, consider a portion of a paragraph randomly chosen from www.cnn.com: “. . . the organ’s unwrinkled surface resembled that of the brain of an idiot. . . . Researchers contend that if the plant-eating beasts . . . .” Even considering only trigram phrases, there are 71 trigram phrases in the entire paragraph. However, not all of them, such as the phrases “that of the” and “that if the,” are useful in determining the content of the paragraph, or of its corresponding document in general. Thus, our CPM model, which precomputes the correlation factors of any two n-gram phrases, considers only non-stop, stemmed words¹ in an RSS news article to form the phrases to be matched.

¹ Stopwords are words that appear very frequently (e.g., “him,” “with,” “a,” etc.), which include articles, conjunctions, prepositions, punctuation marks, numbers, non-alphabetic characters, etc., and are typically not useful for analyzing the informational content of a document. Stemmed words are variant word forms reduced to a common grammatical root with the same meaning.

The Unigram Correlation Factors. We construct the unigram (i.e., single-word) correlation factors using the documents in the Wikipedia Database Dump (http://en.wikipedia.org/wiki/Wikipedia:Database download). We chose the Wikipedia documents for constructing each of the n-gram (1 ≤ n ≤ 5) correlation factors, since the 850,000 Wikipedia documents were written by more than 89,000 authors on various topics. The diversity of the authorships leads to a representative group of documents with different writing styles and a variety of subject areas. Thus, the set of Wikipedia documents is an effective representative set of documents that is appropriate for computing the general correlation factors among unigram, as well as other n-gram (2 ≤ n ≤ 5), phrases. The correlation factors of the unigrams are computed according to the distance and frequency of occurrence of the unigrams in each Wikipedia document.

Prior to constructing the unigram correlation factors, we first removed all the words in the Wikipedia documents that are stopwords. Eliminating stopwords is a common practice, since the process (i) filters the noise within both a query and a document [4, 7] and (ii) enhances the retrieval performance [13], which enriches the quality of our unigram phrase correlation factors. After stopwords were removed, we stemmed the remaining words by using the Porter Stemmer [11], which reduces each word to its grammatical root while retaining its semantic meaning, e.g., “driven” and “drove” are reduced to the stemmed word “drive.” The final count of non-stop, stemmed unigrams is 57,926. The unigram correlation value of word w_i with respect to word w_j, which is constructed by using the Wikipedia documents without stopwords or unstemmed words, is defined as

c_{i,j} = \sum_{w_i \in V(w_i)} \sum_{w_j \in V(w_j)} \frac{1}{d(w_i, w_j)}    (1)
where d(w_i, w_j) = |Position(w_i) − Position(w_j)| is the distance, i.e., the number of words, between w_i and w_j in a Wikipedia document, and V(w_i) (V(w_j), respectively) denotes the set of stem variations of w_i (w_j, respectively). Correlation factors among unigrams that co-occur more frequently than others in a document are assigned higher values. To avoid bias toward the frequency of occurrences in a “long” Wikipedia document, we normalize c_{i,j} as

nc_{i,j} = \frac{c_{i,j}}{|V(w_i)| \times |V(w_j)|}    (2)
where |V(w_i)| (|V(w_j)|, respectively) is the number of words in V(w_i) (V(w_j), respectively). Given k different nc_{i,j} values, one from each of the Wikipedia documents in which both w_i and w_j occur, the unigram correlation factor cf_{i,j} of w_i and w_j is

cf_{i,j} = \frac{\sum_{m=1}^{k} nc^m_{i,j}}{k}    (3)

where nc^m_{i,j} is the normalized correlation value nc_{i,j} (as defined in Equation 2) of w_i and w_j in the m-th (1 ≤ m ≤ k) document in which both w_i and w_j occur, and k is the total number of Wikipedia documents in which w_i and w_j co-occur.

Example 1. Consider the following two RSS news articles represented by their content descriptors:

Article 1. Ice storm threatens chaos: A brutal ice storm is threatening to turn Oklahoma into a sheet of ice and cause dangerous conditions from Texas to New York. “This is a one-in-maybe-15-to-25-year event,” CNN severe weather expert Chad Myers said.

Article 2. Ice storm threatens chaos: Freezing rain hit Oklahoma today, the start of what forecasters say could be a brutal ice storm. Millions of people from Texas through Oklahoma to Missouri are being warned that conditions will deteriorate this afternoon. “This is a one-in-15- to 25-year event,” CNN severe weather expert Chad Myers said.

Table 1 shows some of the unigrams in the two news articles and their correlation factors generated by Equation 3 using the Wikipedia documents.
Table 1. Correlation factors of some unigrams in two different RSS news articles.

Article 1/Article 2   warned    conditions  deteriorate  afternoon  severe    weather   expert
brutal                1.9E-7    2.3E-7      2.5E-7       3.3E-8     2.1E-7    8.7E-8    6.3E-8
ice                   1.4E-7    1.6E-7      9.5E-8       7.9E-8     1.5E-7    3.2E-7    5.8E-8
storm                 1.4E-6    1.7E-7      9.5E-8       5.0E-7     7.9E-7    1.3E-6    7.9E-4
threatening           3.8E-7    2.4E-5      2.6E-7       1.4E-7     3.8E-7    1.6E-7    6.6E-8
turn                  1.5E-7    1.1E-7      1.2E-7       1.2E-7     1.1E-7    1.3E-7    8.5E-8
Oklahoma              3.7E-8    1.5E-8      9.7E-9       4.1E-8     4.0E-8    1.0E-7    3.1E-8
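A sketch of the computation behind Equations 1–3 is given below. This is our illustrative reconstruction, not the paper's code: it assumes the input documents are already stopword-filtered and stemmed (so each stem-variation set V(w) collapses to a single token and |V(w)| = 1), and it combines the k per-document values by averaging, as in Equation 3.

```python
from collections import defaultdict
from itertools import combinations

def unigram_correlation_factors(documents):
    """Compute cf_{i,j} (Equations 1-3) from a list of documents, where
    each document is a list of non-stop, stemmed words."""
    nc_sums = defaultdict(float)  # running sum of nc_{i,j} over documents
    co_docs = defaultdict(int)    # k: number of documents with both words
    for doc in documents:
        positions = defaultdict(list)
        for pos, word in enumerate(doc):
            positions[word].append(pos)
        for wi, wj in combinations(sorted(positions), 2):
            # Equation 1: sum of inverse word distances over all
            # occurrence pairs of wi and wj in this document.
            c = sum(1.0 / abs(pi - pj)
                    for pi in positions[wi] for pj in positions[wj])
            # Equation 2: normalize by |V(wi)| * |V(wj)|, which is
            # 1 * 1 here because the input is pre-stemmed.
            nc_sums[(wi, wj)] += c
            co_docs[(wi, wj)] += 1
    # Equation 3: combine the k per-document values nc^m_{i,j}.
    return {pair: nc_sums[pair] / co_docs[pair] for pair in nc_sums}

docs = [["ice", "storm", "threaten", "chaos", "ice", "storm"],
        ["freeze", "rain", "ice", "storm", "oklahoma"]]
cfs = unigram_correlation_factors(docs)
print(cfs[("ice", "storm")])  # ~1.77 on this toy corpus; real values
                              # over Wikipedia are far smaller (Table 1)
```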
The N-gram Phrase Correlation Factors. The phrase correlation factors of any two n-grams (2 ≤ n ≤ 5) are calculated according to the correlation factors of their corresponding unigrams, since unigram correlation factors are reliable in detecting similar words. (See the experimental results in Section 4.) We compute the bigram, trigram, 4-gram, and 5-gram phrase correlation factors using the (prior) Odds (Odds for short) [5], which measures the predictive or prospective support for a hypothesis H by the prior knowledge p(H) alone to determine the strength of a belief, which is the unigram correlation factor in our case:

O(H) = \frac{p(H)}{1 − p(H)}    (4)

Based on the computed unigram correlation factors cf_{i,j} in Equation 3, we generate the n-gram (2 ≤ n ≤ 5) phrase correlation factor between any two n-gram phrases p1 and p2 using Equation 4 such that p(H) is defined as the product of the unigram correlation factors of the corresponding unigrams in p1 and p2, i.e.,

pcf_{p1,p2} = \frac{\prod_{i=1}^{n} cf_{p1_i,p2_i}}{1 − \prod_{i=1}^{n} cf_{p1_i,p2_i}}    (5)

where p1_i and p2_i (1 ≤ i ≤ n) are the i-th words in p1 and p2, respectively. Tables 2, 3, 4, and 5 show the phrase correlation factors of some bigrams, trigrams, 4-grams, and 5-grams, respectively, in the two articles in Example 1.
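Equations 4 and 5 transcribe directly into code, given a dictionary of unigram correlation factors such as the cfs computed in the previous sketch. Treating the self-correlation of a word as 1 and guarding the p(H) = 1 case are our assumptions, not statements from the paper.

```python
import math

def word_cf(w1, w2, cfs):
    if w1 == w2:
        return 1.0  # assumption: a word correlates perfectly with itself
    return cfs.get(tuple(sorted((w1, w2))), 0.0)

def phrase_correlation_factor(p1, p2, cfs):
    """pcf_{p1,p2} (Equation 5): the Odds (Equation 4) of the product of
    the unigram correlation factors of the corresponding words in the
    equal-length phrases p1 and p2 (tuples of non-stop, stemmed words)."""
    p = math.prod(word_cf(w1, w2, cfs) for w1, w2 in zip(p1, p2))
    return float("inf") if p >= 1.0 else p / (1.0 - p)

# e.g., a bigram pair from Example 1:
# phrase_correlation_factor(("ice", "storm"), ("freeze", "rain"), cfs)
```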
3.3 Phrase Comparison
Phrases in the content descriptor of an RSS news article are compared against their counterparts in another RSS news article. CPM can detect phrases in RSS news articles that are semantically relevant (or identical) to phrases in other articles. To accomplish this, each phrase of a chosen length k (1 ≤ k ≤ 5) in a news article A1 is compared with each phrase of the same length in another article A2. If there are m (n, respectively) different words in A1 (A2, respectively), then there are m − k + 1 different phrases in A1 to be compared with n − k + 1 different phrases in A2, which include overlapping phrases. In computing the degree of similarity of A1 and A2, the correlation factors of phrases of the chosen length, i.e., between 1 and 5, in A1 and A2 are used. Since the average content descriptor of an RSS news article is 25 words in length, the computation time for matching phrases in two news articles is negligible.
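Enumerating the m − k + 1 overlapping phrases of a content descriptor is a one-liner; the following sketch (names ours) also illustrates the (m − k + 1) × (n − k + 1) comparison count.

```python
def phrases(words, k):
    """All m - k + 1 overlapping k-gram phrases of a content descriptor,
    given as a list of m non-stop, stemmed words."""
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

a1 = ["ice", "storm", "threaten", "chaos"]           # m = 4 words
a2 = ["freeze", "rain", "ice", "storm", "oklahoma"]  # n = 5 words
print(phrases(a1, 2))                 # 3 bigrams: m - k + 1 = 4 - 2 + 1
print(len(phrases(a1, 2)) * len(phrases(a2, 2)))  # 12 pairwise comparisons
```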
Table 2. Correlation factors of some bigrams in two different RSS news articles.

Article 1/Article 2   warned conditions  conditions deteriorate  deteriorate afternoon  weather expert
brutal ice            3.0E-14            2.2E-14                 2.0E-14                5.1E-15
ice storm             2.3E-14            1.5E-14                 4.7E-14                2.5E-10
storm threatening     3.2E-11            4.3E-14                 1.3E-14                8.2E-14
threatening turn      4.2E-14            2.8E-12                 3.1E-14                1.3E-14

Table 3. Correlation factors of some trigrams in two different RSS news articles.

Article 1/Article 2        warned conditions deteriorate  conditions deteriorate afternoon  severe weather expert
brutal ice storm           2.8E-21                        1.1E-20                           5.4E-17
ice storm threatening      5.9E-21                        2.1E-21                           1.2E-20
storm threatening turn     3.8E-18                        5.3E-21                           1.0E-20
threatening turn Oklahoma  4.0E-22                        1.1E-19                           1.5E-21

Table 4. Correlation factors of some 4-grams in two different RSS news articles.

Article 1/Article 2              warned conditions deteriorate afternoon  severe weather expert Chad
brutal ice storm threatening     4.0E-28                                  4.4E-24
ice storm threatening turn       7.2E-28                                  7.2E-28
storm threatening turn Oklahoma  1.5E-25                                  1.6E-27

Table 5. Correlation factors of some 5-grams in two different RSS news articles.

Article 1/Article 2                    Missouri warned conditions deteriorate afternoon  severe weather expert Chad Myers
brutal ice storm threatening turn      1.2E-35                                           2.1E-31
ice storm threatening turn Oklahoma    5.0E-33                                           7.2E-35
storm threatening turn Oklahoma sheet  4.5E-37                                           5.1E-35
3.4 Similarity Ranking of RSS News Articles
In CPM, we use the correlation factors of n-gram (1 ≤ n ≤ 5) phrases of a chosen length to define the degrees of similarity of two RSS news articles A1 and A2. The degree of similarity of A1 with respect to A2 is not necessarily the same as the degree of similarity of A2 with respect to A1, since A1 and A2 may share common information but also include information unique to each of them. Using the n-gram (1 ≤ n ≤ 5) phrase correlation factors, we define a fuzzy association of each n-gram phrase in A1 with respect to all the n-gram phrases in A2. The degree of correlation between a phrase p_i in A1 and all the phrases in A2, denoted μ_{p_i,2}, is calculated as the complement of the negated algebraic product of the correlation factors of p_i and each distinct phrase p_k in A2, i.e.,

μ_{p_i,2} = 1 − \prod_{p_k ∈ A_2} (1 − pcf_{i,k})   or   μ_{p_i,2} = 1 − \prod_{p_k ∈ A_2} (1 − cf_{i,k})    (6)

which is adapted from the fuzzy word-document correlation factor in [10]; the first (second, respectively) formula in Equation 6 is used for n-gram (2 ≤ n ≤ 5) phrases (unigram phrases, respectively). The correlation value μ_{p_i,2} falls in the interval [0, 1] and reaches its maximum when pcf_{i,k} (cf_{i,k}, respectively) = 1, i.e., when p_i (∈ A1) = p_k (∈ A2). The degree of similarity of A1 with respect to A2, denoted Sim(A1, A2), using the chosen n-gram phrase correlation factors is calculated as the average of the values μ_{p_i,2} over all p_i ∈ A1 (1 ≤ i ≤ m), where m is the total number of n-gram phrases in A1:

Sim(A_1, A_2) = \frac{μ_{p_1,2} + μ_{p_2,2} + \cdots + μ_{p_m,2}}{m}    (7)
Sim(A1, A2) ∈ [0, 1]. When Sim(A1, A2) = 0, there is no n-gram phrase in A1 that can be considered similar to any n-gram phrase in A2. If Sim(A1, A2) = 1, then either A1 is (semantically) identical to A2, or A1 is subsumed by A2, i.e., all the n-gram phrases in A1 are (semantically) the same as (some of) the n-gram phrases in A2, and in this case A1 is treated as a redundant article and can be ignored. Sim(A2, A1) is defined accordingly. Given any pair of articles A1 and A2, we use the Stanford Certainty Factor (SCF) [6] to combine Sim(A1, A2) and Sim(A2, A1), which yields the relative degree of similarity that reflects how closely related A1 and A2 are. We adopt SCF, since it is a monotonically increasing (decreasing) function on combined assumptions for creating confidence measures and is easy to compute:

SCF(A_1, A_2) = \frac{Sim(A_1, A_2) + Sim(A_2, A_1)}{1 − MIN(Sim(A_1, A_2), Sim(A_2, A_1))}    (8)
The following table shows that when the degrees of similarity of two articles are both high (see row 1), their SCF is also high; when one of the degrees of similarity is high (e.g., one of the two articles is subsumed by the other), their SCF still yields a comparatively high relative degree of similarity (see row 2). Furthermore, if both degrees of similarity are low, their SCF is also low, indicating that the articles are not related (see row 3).

     Sim(A1, A2)   Sim(A2, A1)   SCF
1    0.60          0.76          3.40
2    0.52          0.23          0.97
3    2.5E-6        3.26E-6       5.76E-6

Example 2. Figure 2, which includes the SCFs computed for different pairs of the articles shown in Table 6, demonstrates the applicability of our CPM on bigrams and trigrams. According to the content of the RSS news articles:
– Articles 1 and 2, also shown in Example 1, are very closely related, since both give an account of a storm that took place in Oklahoma, and their SCFs on bigrams and trigrams are higher than the SCFs on other n-grams.
– Articles 4 and 5 (7 and 8, respectively) report last year's bird flu outbreak (the new iPhone, respectively), and again the SCFs of bigrams and trigrams correctly identify their similarity.
– Although articles 1, 2, and 6 all report on the weather in different parts of the world, they are not all related. Unigrams in this case provide a more accurate measure for detecting that the article pairs (1, 6) and (2, 6) are not related.
Fig. 2. SCFs of the eight different RSS news articles in Table 6.

Table 6. Sample articles (Art) used for demonstrating their SCFs as a result of n-gram (1 ≤ n ≤ 5) phrase matching.

Art  Title and Description
1    Ice Storm Threatens Chaos. Freezing rain hit Oklahoma today, the start of what . . .
2    Ice Storm Threatens Chaos. A brutal ice storm is threatening to turn Oklahoma . . .
3    Fashion Designers Issue Model Guidelines. The American fashion industry says . . .
4    New Outbreak of Bird Flu Hits Nigeria. A new outbreak of H5N1 bird flu has hit . . .
5    Indonesian Woman 59th Bird Flu Death. Indonesian woman died from bird flu . . .
6    EU proposes ambitious climate target. . . . “the most ambitious policy ever” to . . .
7    The iPhone: Revolution? Gamble? Flop? The hype around Apple Inc.’s upcoming . . .
8    The iPhone Phenomenon. There’s a new cell phone that has had consumers . . .
– The rest of the pairs are unrelated, and in most cases the use of unigrams is more effective than any other n-grams. However, the SCFs computed by using bigrams and trigrams are also low when the pairs of articles are unrelated, yet higher than those of unigrams when the article pairs are related, and thus they are useful in detecting and ranking related news articles.
– Many SCFs computed by using 4-grams and 5-grams are not consistent with the SCFs obtained using unigrams, bigrams, or trigrams, and as a result, they are not dependable in detecting (un-)related articles.
4 Experimental Results
In evaluating the accuracy of using our n-gram (1 ≤ n ≤ 5) CPM approach to determine the degree of similarity of any two RSS news articles, we collected thousands of articles from different sources as partially shown in Table 7.
Table 7. Sources of RSS news feeds, the number of feeds (Fd) of each source (200 total), and the number of news articles (Art) from each source (1059 total).

Sources                      Fd   Art
1115.org                      3    19
abcnews.go.com               10   135
adn.com                       2     5
blogs.zdnet.com               1     6
boston.com                    8    26
businessweek.com              8    41
cbsnews.com                   8    24
chron.com                     7    21
cnn.com                       8    29
dailymail.co.uk               1     4
english.people.com            6    54
forbes.com                    2     8
foxnews.com                  10    35
guardian.co.uk                4    12
health.telegraph.co.uk        1     4
hosted.ap.org                11    39
iht.com                       9    27
latimes.com                   3     9
microsoftwatch.com            1     4
money.cnn.com                 4    17
money.telegraph.co.uk         1     4
msnbc.com                     1     6
news.bbc.co.uk                2    13
news.ft.com                   9    47
news.telegraph.co.uk          3    25
news.yahoo.com                3    37
nytimes.com                  10    60
online.wsj.com                6    70
politics.guardian.co.uk       1     3
portal.telegraph.co.uk        1     3
primezone.com                 2     8
prnewswire.com                8    24
seattletimes.nwsource.com     2     6
slashdot.org                  1     5
sltrib.com                    3     9
sportsillustrated.cnn.com    12   112
timesonline.com               3    10
today.reuters.com            10    27
usatoday.com                  3     9
washingtonpost.com            2     6
wired.com                     2     6
worldpress.org                8    50
In order to guarantee the impartiality of our experiments, the articles were randomly selected from hundreds of different news feeds, collected between July 2006 and July 2007. The variety of subject areas covered by the chosen RSS news feeds demonstrates the suitability of our CPM model for news articles independent of their content, and Table 8 shows a few of the collected news articles. We determined the relative degree of similarity of each pair of the 1059 news articles extracted from 200 RSS news feeds using each n-gram (1 ≤ n ≤ 5) phrase matching approach. In this empirical study, we (i) randomly selected 410 pairs, (ii) manually examined each pair to determine their relative degree of similarity, and (iii) compared the manually determined similarity with the automatically computed SCFs based on each one of the five n-gram phrase matching approaches. To verify which n-gram CPM is the most accurate in determining related pairs of news articles, we consider the number of False Positives (FPs), i.e., unrelated pairs with high SCFs, and False Negatives (FNs), i.e., related pairs with low SCFs, generated by using CPM on each type of n-grams as follows:

Accuracy = \frac{Total\ Number\ of\ Examined\ Pairs − Misclassified\ Pairs}{Total\ Number\ of\ Examined\ Pairs}    (9)
where Misclassified Pairs is the sum of the FPs and FNs encountered. Figure 3(a) shows the number of correctly classified pairs, as well as incorrectly identified pairs, i.e., the sum of the FPs and FNs, on the randomly chosen 410 pairs of news articles, which are a subset of the news articles listed in Table 7.
Table 8. Portion of the thousands of RSS news articles collected for the empirical study.

Source          Date       URL
abcnews.com     7/3/2007   abcnews.go.com/TheLaw/Politics/story?id=3339302&page=1&CMP=OTC-RSSFeeds0312
cbsnews.com     7/2/2007   feeds.cbsnews.com/~r/CBSNewsBusiness/~3/130066324/main3010949.shtm
cnn.com         1/10/2007  rss.cnn.com/~r/rss/cnn_topstories/~3/74617352/index.html
latimes.com     1/12/2007  feeds.latimes.com/~r/latimes/news/nationworld/world/~3/74277710/la-fg-somalia12jan12,1,2987342.story
news.bbc.co.uk  1/16/2007  news.bbc.co.uk/go/rss/-/1/hi/world/south_asia/6268487.stm
Fig. 3. Classified 410 pairs of news articles and their accuracy: (a) (in)correctly classified, related articles using our n-gram CPM; (b) accuracy of using our n-gram CPM to detect related articles.
Figure 3(b) shows the accuracy computed by using Equation 9 for each of the n-gram CPM approaches. Clearly, bigrams and trigrams yield the lowest number of misclassified pairs of articles, while achieving the highest proportion (≥ 90%) of correctly detected pairs among all the n-grams. The use of 4- and 5-grams reduces the accuracy to as low as 60%, whereas unigrams achieve an accuracy of 86%. When using bigrams and trigrams, misclassified pairs occur because even a single common bigram or trigram between two articles increases their Odds value. Also, the degrees of similarity for unigrams, bigrams, and trigrams are relatively higher, and thus their SCFs are comparatively higher, whereas the degrees of similarity generated by using 4-grams and 5-grams tend to be much lower, and thus their SCFs are often extremely low, which might explain their low accuracy in detecting similar pairs of RSS news articles. In general, the unrelated pairs of articles detected by using bigrams and trigrams have an SCF on the order of 1E-6, whereas related pairs have an SCF above 0.1. Neither FPs nor FNs are desired, and in this study they constitute only 10% of the bigram (trigram) pairs, out of which close to 90% are FP pairs. In fact, FPs are less harmful in our similarity detection approach, since we do not lose many similar pairs. Based on the empirical study, we conclude that (i) bigrams and trigrams outperform other n-grams in detecting similar RSS news articles, and in most cases the SCFs computed by using bigrams and trigrams on similar RSS news articles are higher than the ones computed by using unigrams; and (ii) 4-grams and 5-grams are not
reliable in determining the relevance between any two RSS news articles, as explained earlier. Our empirical study further verifies the claims made in [8, 9], which state that the use of bigrams and trigrams is often more effective than the use of other n-gram phrases in retrieving information.
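For illustration, the sketches above can be chained into a naive end-to-end relatedness check. The tiny stoplist, the omission of stemming (the paper applies the Porter Stemmer [11], which, e.g., nltk.stem.PorterStemmer provides), and the 0.1 decision threshold (suggested by, but not prescribed in, the SCF gap reported above) are all our assumptions.

```python
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "this",
             "that", "are", "be", "from", "into", "for", "in", "on"}

def preprocess(text):
    """Lowercase, tokenize, and drop stopwords; a stand-in for the
    paper's full stopword list and Porter stemming."""
    tokens = (w.strip('.,:;"?!') for w in text.lower().split())
    return [w for w in tokens if w and w not in STOPWORDS]

def related(text1, text2, cfs, n=2, threshold=0.1):
    """Declare two content descriptors related if their bigram-based
    SCF clears the (assumed) 0.1 threshold."""
    p1 = phrases(preprocess(text1), n)
    p2 = phrases(preprocess(text2), n)
    return scf(p1, p2, cfs) >= threshold
```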
5 Conclusions
We have presented a novel approach for finding similar RSS news articles using n-gram (1 ≤ n ≤ 5) phrase matching and shown that bigrams and trigrams outperform other n-grams in detecting similar articles. We have also verified the accuracy of our correlation-based phrase matching (CPM) approach by analyzing hundreds of pairs of randomly selected RSS news articles from multiple sources and concluded that CPM on bigrams and trigrams is highly accurate (≥ 90%) and requires little overhead (using precomputed correlation factors) in finding related articles. Our CPM can also be used for (i) detecting (similar) junk emails and spam Web pages, (ii) clustering (Web) documents with similar content, and (iii) discovering plagiarism, which form the core future work for our CPM model.
References
1. Amer-Yahia, S., Fernandez, M., Srivastava, D., Xu, Y.: PIX: Exact and Approximate Phrase Matching in XML. In: ACM SIGMOD, pp. 664-667. ACM (2003)
2. de Moura, E., Navarro, G., Ziviani, N., Baeza-Yates, R.: Fast and Flexible Word Searching on Compressed Text. ACM TOIS, vol. 18(2), 113-139 (2000)
3. Hammouda, K., Kamel, M.: Efficient Phrase-Based Document Indexing for Web Document Clustering. IEEE TKDE, vol. 16(10), 1279-1296 (2004)
4. Haveliwala, T., Gionis, A., Klein, D., Indyk, P.: Evaluating Strategies for Similarity Search on the Web. In: WWW Conf., pp. 432-442 (2002)
5. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann (1988)
6. Luger, G.: Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 5th Edition. Addison Wesley (2005)
7. Luo, G., Tang, C., Tian, Y.: Answering Relationship Queries on the Web. In: WWW Conf., pp. 561-570 (2007)
8. Mishne, G., de Rijke, M.: Boosting Web Retrieval through Query Operations. In: European Conf. on Information Retrieval, pp. 502-516 (2005)
9. Narita, M., Ogawa, Y.: The Use of Phrases from Query Texts in Information Retrieval. In: ACM SIGIR, pp. 318-320 (2000)
10. Ogawa, Y., Morita, T., Kobayashi, K.: A Fuzzy Document Retrieval System Using the Keyword Connection Matrix and a Learning Method. Fuzzy Sets and Systems, vol. 39, 163-179 (1991)
11. Porter, M.: An Algorithm for Suffix Stripping. Program, vol. 14(3), 130-137 (1980)
12. Toud, S.: Creating a Custom Metrics Tool. MSDN Magazine, http://msdn.microsoft.com/msdnmag/issues/05/04/EndBracket/ (April 2005)
13. Tzong-Han, T., Chia-Wei, W.: Enhance Genomic IR with Term Variation and Expansion: Experience of the IASL Group. In: Text Retrieval Conf. (2005)
14. Wilbur, W., Kim, W.: Flexible Phrase Based Query Handling Algorithms. In: American Society for Information Science and Technology, pp. 438-449 (2001)
15. Yerra, R., Ng, Y.-K.: Detecting Similar HTML Documents Using a Fuzzy Set Information Retrieval Approach. In: IEEE GrC'05, pp. 693-699. IEEE (2005)