Identifying Spam Web Pages Based on Content Similarity Maria Soledad Pera and Yiu-Kai Ng Computer Science Department, Brigham Young University, Provo, Utah, U.S.A.
Abstract. The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spam Web pages, that are ranked among legitimate Web pages. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. In order to improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or a low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 94% of spam/legitimate Web pages, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.
1 Introduction
The Web is populated with all possible subject areas, from personal health care to constitutional laws to religious beliefs, as presented in online news articles, research papers, and customer-generated media (i.e., blogs), to name a few, and virtually all kinds of information can be found on the Web. With the huge amount of information to sort through, users turn to Web search engines for assistance in locating needed information. Hence, analyzing the content of relevant Web pages and ranking them accordingly is a crucial process in Web information retrieval (IR). As a result, a search engine spider crawling the Web not only gathers new information, but also has a significant financial impact on users, e.g., on commercial transactions on the Web, through the ranking of (the contents of) Web pages determined by the search engine [19]. As mentioned in [1], there is an economic incentive for manipulating search engines' rankings by creating pages that score high independently of their real contents, even though the intention is unethical. Gradually, more pages are introduced on the Web that are considered legitimate when in fact they are not and that are ranked high by existing search engines when they should not be. As reported in [16], a significant portion of existing Web pages (14% in 2006) are spam.

O. Gervasi et al. (Eds.): ICCSA 2008, Part II, LNCS 5073, pp. 204–219, 2008. © Springer-Verlag Berlin Heidelberg 2008
Spamming is a serious IR problem, since it (i) affects the quality of Web searches, (ii) damages the search engine's reputation, and (iii) weakens the user's confidence in the retrieved results. In general, Web spamming is treated as an attempt by spammers to receive an unjustifiably favorable relevance or high ranking for their Web pages, regardless of the true values of the pages. Spamming can also be defined as an attempt to deceive a search engine's relevancy ranking algorithm [18]. A number of existing spamming approaches rely on links within Web pages [8] to manipulate Web ranking algorithms such as PageRank [4], whereas others rely on infecting the content of Web pages [16], e.g., stuffing popular words or concatenating words in Web pages, to increase their chance of matching Web queries. However, due to their complexity, neither the analysis of links nor the analysis of statistical features of Web page contents is an effective approach for identifying spam Web pages, as the volume of Web pages is huge. Fetterly et al. [9] use a semantic technique, i.e., actual word count, on Web pages for detecting spam. Instead, we consider the actual content (i.e., words) of Web pages and use the word-similarity values to identify and eliminate spam Web pages. We will show that by (i) computing the degree of similarity among the words in the title and the body of a Web page P, which is computed by using their word-correlation factors, (ii) using the percentage of hidden content within P, and/or (iii) considering the bigram or trigram phrase-similarity values of P, we can determine whether P is spam with high accuracy.

The remaining sections are organized as follows. In Section 2, we discuss existing anti-spam methods and address their differences with ours. In Section 3, we describe in detail our spam-detection approach. In Section 4, we present the experimental results, which verify the accuracy of our spam-detection method. In Section 5, we give a conclusion and include the future directions of our work.
2 Related Work
Previous anti-spam work focuses on applying two different strategies: content analysis and link connection. [16] introduce and combine several anti-spam heuristics based on the content of a Web page, which include (i) the number of words, (ii) the average length of the words, (iii) the amount of anchor text, (iv) the fraction of visible content, (v) the fraction of globally popular words, and (vi) the likelihood of independent n-grams within the page. These heuristics are treated as features in classifying spam Web pages. [11] also present several anti-spam heuristics according to page content that include (i) detecting the inclusion of terms in anchor text that are unrelated to the referenced Web pages, (ii) computing the amount of repetitive terms in a Web page P introduced by a spammer in an attempt to increase its relevance score, (iii) verifying the existence of a large number of unrelated terms, and (iv) identifying the presence of phrase stitching in P, which should increase its degree of relevance to several posted queries. Likewise, [9] analyze the statistical features of the host component of a URL and the excessive replication of content to establish content-based features that can be used by Web spam classifiers.
As a link-analysis approach, [2] introduce a damping function, which classifies spam and non-spam Web pages using (incoming and outgoing) links within a page without considering its content. Other well-known anti-spam techniques that rely on link analysis are described in [10,12]. [10] present a semi-automatic algorithm, TrustRank, which uses the link structure of the Web to discover pages that are likely to be legitimate with respect to a small set of (seed) legitimate pages that were manually identified by experts. Since this approach requires human assistance, it is not fully automated. [12], on the other hand, define a spam mass metric that reflects the impact of link spamming on the ranking of a page and use it to determine Web pages that are significant beneficiaries of link spamming. As yet another link-based method, [3] introduce SpamRank, which is based on PageRank, to identify pages linked by a large number of other pages with the intention of misleading search engines into ranking their target higher. None of the approaches discussed above relies on the actual semantic meaning of the words in the content of a given Web page to detect spam Web pages, which is the basis of our spam-detection approach. In [17] we demonstrate that using the content of emails for detecting junk emails is effective, and we adopt the same strategy for finding spam Web pages in this paper.
3 A Content-Based Spam-Detection Approach
As discussed earlier, spam Web pages are a burden for (i) the Web servers, since (among others) they waste storage space and processing time for indexing and maintenance, and (ii) the users, who must deal with low-quality retrieved results (caused by spamming) when performing Web searches. In order to neutralize existing spamming tricks, we rely on the content (i.e., words and phrases and/or proportion of markup content) of a given Web page P to determine, in a timely and accurate manner, whether P should be treated as spam.

3.1 Our Web Spam-Detection Approach
[14] claim that the title of a document often reflects its content, and we are confident that this same concept applies to Web pages as well, since a legitimate Web page is a regular document with a title that describes its content, whereas the title of a spam Web page often does not. Consider the legitimate Web page (http://www.mothersbliss.co.uk/shopping/index) in Figure 1, in which the title reflects the content of the page, whereas Figure 2 shows a spam Web page (http://extc.co.uk) whose content does not match its title. We analyze the content of a Web page to determine how closely its title is related to its body (as discussed in Sections 3.2 and 3.4) and calculate the fraction of hidden content of a Web page (see Section 3.3), if necessary, in detecting spam Web pages.

3.2 Web Page Similarity
Fig. 1. The title and (portion of the) content of a legitimate Web page

Fig. 2. The title and (portion of the) content of a spam Web page

We rely on the similarity measure between the content of (represented by a sequence of words in) the title T and the body B of a Web page P to identify spam Web pages. We determine the degree of similarity between T and B using the correlation factors of words in T and B as defined in a precomputed word-correlation matrix, which was generated by analyzing a set of 880,000 Wikipedia documents (downloaded from http://www.wikipedia.org/) to calculate the correlation factor (i.e., similarity value) of any two words¹ based on their (i) frequency of co-occurrence and (ii) relative distance, as defined below.

$$c_{i,j} = \sum_{w_i \in V(w_i)} \sum_{w_j \in V(w_j)} \frac{1}{d(w_i, w_j)}, \qquad (1)$$

where $d(w_i, w_j)$ denotes the distance between any two words $w_i$ and $w_j$ in any Wikipedia document D, and $V(w_i)$ ($V(w_j)$, respectively) is the set of stem variations of $w_i$ ($w_j$, respectively) that appear in D. If $w_i$ and $w_j$ are consecutive words, then $d(w_i, w_j) = 1$, whereas if $w_i$ and $w_j$ do not co-occur in the same document, then $d(w_i, w_j) = \infty$, i.e., $\frac{1}{d(w_i, w_j)} = \frac{1}{\infty} = 0$. To avoid the bias that occurs in documents of large size, we normalize the word-correlation factors as follows:

$$C_{i,j} = \frac{c_{i,j}}{|V(w_i)| \times |V(w_j)|}, \qquad (2)$$
where $c_{i,j}$ is as defined in Equation 1.

¹ Words in the documents were stemmed (i.e., reduced to their grammatical roots, e.g., computer, computing, and computation are converted to compute) after all the stopwords (i.e., words that carry little meaning, e.g., articles, prepositions, and conjunctions) were removed, which minimized the number of words to be considered.

Table 1. Word-correlation factors and the μ-values among some of the words in the title and body of the legitimate Web page as shown in Figure 1

            names       announce    time        diet        baby        answer       ...  μ-value
baby        8.9×10^-8   1.6×10^-7   9.8×10^-8   7.1×10^-8   1           7.3×10^-8    ...  1
pregnancy   5.8×10^-4   4.1×10^-5   1.4×10^-7   6.1×10^-7   3.7×10^-3   2.5×10^-10   ...  4.3×10^-3
discover    6.1×10^-8   7.1×10^-2   1.4×10^-8   3.3×10^-8   8.9×10^-8   8.1×10^-9    ...  7.1×10^-2
...         ...         ...         ...         ...         ...         ...          ...  ...
Average     1.5×10^-4   1.8×10^-2   8.6×10^-8   1.9×10^-7   2.5×10^-1   3.6×10^-8    ...  0.3

Table 2. Word-correlation factors and the μ-values among the words in the title and some of the words in the body of the spam Web page as shown in Figure 2

            computers   internet    electronics  mortgages   credit      flights     ...  μ-value
compare     9.1×10^-8   2.5×10^-7   5.9×10^-8    2.3×10^-8   1.1×10^-8   6.1×10^-8   ...  5.5×10^-7
find        2.2×10^-7   2.1×10^-7   1.4×10^-8    5.3×10^-9   1.6×10^-8   2.0×10^-8   ...  5.9×10^-7
resources   6.2×10^-8   4.0×10^-8   5.4×10^-8    4.5×10^-8   1.7×10^-7   2.0×10^-8   ...  4.4×10^-7
Average     1.3×10^-7   1.7×10^-7   4.2×10^-8    2.4×10^-8   6.6×10^-8   3.4×10^-8   ...  5.3×10^-7

Title-Body Similarity. Using the normalized word-correlation factors, we can compute the degree of similarity (between the words) of the title T and the body B of a Web page P, denoted SimTB. We focus only on T and B of P, since as shown in the experimental results (see Section 4), SimTB of P can accurately determine whether P is spam or legitimate. In the case in which there is no title in P, we consider the hidden content of P (see Section 3.3). To determine the SimTB value of P, we calculate the similarity value of each word t in T with respect to each word b in B of P. The higher the correlation factors between t and the words in B, the higher the similarity value between t and B, denoted $\mu_{t,B}$:

$$\mu_{t,B} = 1 - \prod_{b \in B} (1 - C_{t,b}), \qquad (3)$$
where $\mu_{t,B}$ is defined as the complement of a negated algebraic product, instead of the algebraic sum, and $C_{t,b}$ is as defined in Equation 2. Once the μ-value of each word in T with respect to all the words in B has been calculated, we can determine the degree of similarity between T and B as the average of these μ-values:

$$SimTB(T, B) = \frac{\mu_{1,B} + \mu_{2,B} + \ldots + \mu_{n,B}}{n}, \qquad (4)$$
where n is the number of words in T. A high (low, respectively) SimTB(T, B) value reflects a high (low, respectively) degree of similarity between T and B.

Example 1. Table 1 (Table 2, respectively) shows the correlation factors among the words in the title and the body of the Web page P1 as shown in Figure 1 (P2 in Figure 2, respectively). The degree of similarity, i.e., SimTB, of the title T1 and the body B1 of P1 is 0.88, whereas the degree of similarity between the title T2 and the body B2 of P2 is 3.2×10^-7. According to the computed SimTB values, P1 (P2, respectively) is highly likely a legitimate (spam, respectively) page. □

Similarity Threshold Value. Having defined the SimTB value of a given Web page P, we must determine an appropriate word-similarity threshold value V so that if SimTB of P ≥ V, then P is considered legitimate; otherwise, P is treated as spam. An ideal similarity threshold should (i) reduce to a minimum the number of spam Web pages identified as legitimate, i.e., false negatives (FNs), and (ii) avoid treating legitimate Web pages as spam, i.e., false positives (FPs). In order to determine the correct similarity threshold, we consider (i) Web pages in a test set, called the Threshold set, and (ii) a number of possible similarity threshold values, which yield different numbers of FPs and FNs. The Threshold set is a collection of 370 previously classified spam and non-spam Web pages (170 spam and 200 non-spam) randomly selected from the WEBSPAM-UK2006 dataset (http://www.yr-bcn.es/Webspam/datasets/), which is a well-known, publicly available reference collection for Web spam research that consists of 77.9 million spam and non-spam pages. As shown in Figure 3(a), the optimal word-similarity threshold is 0.80, since at 0.80 the total number of FPs and FNs is reduced to a minimum and neither the number of FPs nor the number of FNs dominates the other.² Hence, we declare a Web page P as

$$Status(P) = \begin{cases} \text{Legitimate} & \text{if } SimTB(P_T, P_B) \geq 0.80 \\ \text{Spam} & \text{otherwise,} \end{cases} \qquad (5)$$

where $P_T$ ($P_B$, respectively) denotes the title (body, respectively) of P. Using Equation 5, we classify the Web page P1 in Figure 1 as legitimate, since SimTB(T1, B1) = 0.88 ≥ 0.80, and the Web page P2 in Figure 2 as spam, since SimTB(T2, B2) = 3.2×10^-7 < 0.80.

Fig. 3. Number of False Positives (FPs) and False Negatives (FNs) computed by using different possible similarity threshold values on the Web pages in the Threshold set: (a) SimTB threshold values; (b) hidden-content threshold values

3.3 Fraction of the Hidden Content
Even if a Web page P lacks a title, we can still determine whether P is spam by considering the fraction of hidden content in P. [16] define the fraction of visible content of a Web page P as the length (in characters) of all non-markup words in P divided by the total size (in characters) of P and claim that spam Web pages often contain less markup than normal pages. We adapt this heuristic, but instead we compute the fraction of hidden content, i.e., the proportion of markup content, of P, denoted HC, as

$$HC(P) = \frac{\text{Size of markup content of } P}{\text{Total size of } P},$$

where the size of markup content and the total size of P are in characters.

Again, upon defining the HC value of a Web page P, we have to determine an appropriate threshold value, denoted HC-threshold, so that if HC(P) ≥ HC-threshold, then P is considered legitimate; otherwise, P is treated as spam. To determine an appropriate HC-threshold value, we used the same Threshold set previously described and computed the number of FPs and FNs according to different HC-threshold values. Figure 3(b) shows that the ideal HC-threshold value is 0.75, since the total number of FPs and FNs is reduced to a minimum at 0.75. Excluding the title, the fraction of hidden content of the legitimate Web page P1 in Figure 1 (the spam Web page P2 in Figure 2, respectively) is 0.87 (0.23, respectively). Using the chosen HC-threshold value, i.e., 0.75, we correctly classify P1 and P2.

² We verified the correctness of the similarity threshold value and other threshold values using another Threshold set S, which consists of 100 (38 spam and 62 non-spam) pages from WEBSPAM-UK2006, and S yields the same threshold values.

3.4 The Use of Bigrams and Trigrams
We have observed that whenever there is at least one word³ in the title T of a spam Web page P that also appears in the body B of P, SimTB(T, B) is high, which causes our spam-detection approach to misclassify P as legitimate. As a result, our spam-detection method yields a higher than expected number of FNs. In order to further enhance our spam-detection approach, we consider bigram and trigram, instead of unigram (i.e., single-word, as presented in Section 3.2), phrase-correlation factors of T and B in determining the content similarity between T and B. We consider bigrams and trigrams, since as claimed by [15] and verified by us, short phrases (i.e., bigrams and trigrams) increase the retrieval effectiveness, whereas using phrases of 4 or more words tends to retrieve unreliable results.

³ After stopwords are removed and the remaining words are reduced to their stems.

The Phrase-Similarity Value. In computing the phrase-correlation factor, denoted pcf, of any two bigrams or trigrams, $p_1$ and $p_2$, we apply the Odds ratio [13], i.e., $Odds(H) = \frac{p(H)}{1 - p(H)}$, on the normalized word-similarity factors as defined in Equation 2. Odds measures the prospective support based on a hypothesis H (i.e., n-grams) using prior knowledge p(H) (i.e., the word-correlation factors of the n-grams) to determine the strength of a belief, which is pcf.

$$pcf_{p_1,p_2} = \frac{\prod_{i=1}^{n} C_{p_{1i},p_{2i}}}{1 - \prod_{i=1}^{n} C_{p_{1i},p_{2i}}}, \qquad (6)$$

where $p_{1i}$ and $p_{2i}$ are the i-th (1 ≤ i ≤ n) words in $p_1$ and $p_2$, respectively, and $C_{p_{1i},p_{2i}}$ is the normalized word-similarity value as defined in Equation 2. By using the computed phrase-correlation factors, we can replace $C_{t,b}$ in Equation 3 by $pcf_{p_1,p_2}$ to determine (i) the μ-value between an n-gram (2 ≤ n ≤ 3) in T and all the n-grams in B, as well as (ii) the degree of similarity between T and B, i.e., SimTB, using Equation 4, which overcomes the unigram problem that arises when a unigram in T appears in B. In adopting Equation 4 to compute the degree of similarity, n in the equation represents the number of bigrams (trigrams, respectively), instead of unigrams, in T. Table 3 (Table 4, respectively) shows (some of) the phrase-correlation factors between the bigrams (trigrams, respectively) in the title and body of the legitimate page in Figure 1.

Table 3. The phrase-correlation factors and μ-values of (some of) the bigrams in the title with respect to the bigrams in the body of the legitimate Web page in Figure 1

                       baby name    name birth   birth announce   announce ready   ...  μ-value
baby pregnancy         4.1×10^-5    1.2×10^-14   1.3×10^-11       5.9×10^-10       ...  4.1×10^-5
pregnancy motherhood   7.7×10^-11   2.4×10^-11   2.8×10^-15       3.7×10^-12       ...  1.1×10^-10
motherhood discover    6.9×10^-15   8.1×10^-13   4.2×10^-8        2.9×10^-13       ...  4.2×10^-8
...                    ...          ...          ...              ...              ...  ...
Average                1.0×10^-5    6.3×10^-12   1.1×10^-8        9.1×10^-10       ...  1.0×10^-5

Table 4. The phrase-correlation factors and μ-values of (some of) the trigrams in the title with respect to the trigrams in the body of the legitimate Web page in Figure 1

                                baby name birth   name birth announce   birth announce ready   announce ready baby   ...  μ-value
baby pregnancy motherhood       2.4×10^-11        2.4×10^-22            7.8×10^-17             6.7×10^-17            ...  2.4×10^-11
pregnancy motherhood discover   4.7×10^-16        1.7×10^-12            3.9×10^-20             1.2×10^-19            ...  1.7×10^-12
...                             ...               ...                   ...                    ...                   ...  ...
Average                         6.1×10^-12        4.3×10^-13            1.0×10^-15             1.9×10^-16            ...  6.5×10^-12

Example 2. Figure 4 shows a spam Web page P (http://khs.co.uk) in which the word KHS in its title T is repeated in its body B, yielding SimTB(T, B) = 0.84 by Equation 4 on word-similarity measures. Using the word-similarity threshold value 0.80 as defined earlier, P is misclassified as legitimate. However, when considering the bigrams in T and B, SimTB(T, B) = 0.57 and, using the bigram-similarity threshold value (defined below), P is correctly classified as spam. □

Fig. 4. A sample spam Web page that is misclassified as legitimate when the (single-)word-similarity measure is applied, but is correctly classified as spam when the phrase-similarity value is considered

Fig. 5. Number of FPs and FNs computed by using different possible bigram- and trigram-similarity threshold values on the Web pages in the Threshold set: (a) bigram-similarity threshold values; (b) trigram-similarity threshold values

The Phrase-Similarity Threshold Value. Prior to using the phrase-correlation factors, we define the bigram- (trigram-, respectively) similarity threshold value V so that for any Web page P, if SimTB($T_P$, $B_P$) ≥ V, where SimTB($T_P$, $B_P$) is computed by using the bigram- or trigram-correlation factors, then P is considered legitimate; otherwise, P is treated as spam. In determining an ideal phrase-similarity threshold, we use the same Threshold set (in Section 3.2) to compute the number of FPs and FNs according to different phrase-similarity threshold values and choose the value V such that the total number of FPs and FNs at V is reduced to a minimum. Figure 5(a) shows that the optimal bigram-similarity threshold value is 0.75, whereas Figure 5(b) indicates that the optimal trigram-similarity threshold value is 0.65.
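As an illustration of how Equations 3, 4, and 6 fit together for bigrams, the following Python sketch may help. It is not the authors' implementation: the `corr` lookup and its single toy value are hypothetical stand-ins for the precomputed word-correlation matrix, all function names are ours, and capping pcf at 1 (including the exact-match case, where Equation 6 would divide by zero) is an assumption.

```python
from math import prod

# Hypothetical, hand-made value; the paper uses a matrix precomputed
# from Wikipedia documents (Equations 1 and 2).
TOY_FACTORS = {frozenset(("name", "birth")): 0.2}


def corr(w1, w2):
    """Toy lookup of the normalized word-correlation factor C (Equation 2)."""
    if w1 == w2:
        return 1.0  # identical stems
    return TOY_FACTORS.get(frozenset((w1, w2)), 0.0)


def ngrams(words, n):
    """Return the consecutive n-word phrases of `words` as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def pcf(p1, p2):
    """Phrase-correlation factor (Equation 6): odds of the product of the
    position-wise word-correlation factors of two equal-length phrases."""
    p = prod(corr(w1, w2) for w1, w2 in zip(p1, p2))
    if p >= 1.0:
        return 1.0  # assumption: exact match, avoids division by zero
    return min(1.0, p / (1.0 - p))  # assumption: cap the odds at 1


def mu(item, body_items, factor):
    """Equation 3: complement of the product of (1 - factor) over the body."""
    return 1.0 - prod(1.0 - factor(item, b) for b in body_items)


def sim_tb(title_items, body_items, factor):
    """Equation 4: average mu-value over the items of the title."""
    if not title_items:
        raise ValueError("title has no items to compare")
    return sum(mu(t, body_items, factor) for t in title_items) / len(title_items)


# Stemmed, stopword-free toy title and body (hypothetical input).
title_bigrams = ngrams(["baby", "name"], 2)
body_bigrams = ngrams(["baby", "birth", "announce"], 2)
score = sim_tb(title_bigrams, body_bigrams, pcf)
is_legitimate = score >= 0.75  # bigram-similarity threshold from Figure 5(a)
```

Passing `corr` instead of `pcf` as `factor`, with single words as items, gives the unigram variant of Section 3.2.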
3.5 An Enhanced Similarity-Measure Method
We have considered alternative approaches to augment the use of phrase-correlation factors in computing the SimTB value of a Web page that can further enhance the performance of our spam-detection approach. One alternative is to determine the similarity among n-gram phrases⁴ (1 ≤ n ≤ 3) in the title T with respect to the ones in the body B of a Web page P and penalize P with a lower SimTB value if B contains phrases that are similar to only a few phrases in T and reward P with a higher SimTB value if B contains phrases that are related to a number of phrases in T.

⁴ In the case when no bigrams or trigrams are available in the title T, i.e., only one word remains in T after stopword removal and stemming, the (single-)word similarity is considered.
The Enhanced Phrase-Similarity Approach. The enhanced phrase-similarity approach assures that if the phrases in the body B are closely related to most of the phrases in the title T, then the corresponding Web page P is more likely legitimate; otherwise, P is likely spam. We compute the enhanced similarity value between T and B of P by calculating the sum of the phrase-correlation factors of each bigram (trigram, respectively) pt in T with respect to each bigram (trigram, respectively) in B:

$$spcf_{pt,B} = \sum_{j=1}^{m} pcf_{pt,j}, \qquad (7)$$

where $pcf_{pt,j}$ is the phrase-correlation factor as defined in Equation 6 and m is the total number of bigrams (trigrams, respectively) in B. Once the spcf-value of each bigram (trigram, respectively) in T has been calculated, we can compute the enhanced degree of similarity between T and B, denoted enSimTB, as

$$enSimTB(T, B) = \sum_{i=1}^{n} Min(spcf_{i,B}, 1), \qquad (8)$$

where n is the total number of bigrams (trigrams, respectively) in T. In calculating the enSimTB value, we add, for each n-gram phrase $pt_i$ in T, the minimum of 1 and the spcf-value of $pt_i$. We do so in order to restrict the similarity value of each $pt_i$ in T with respect to the ones in B to 1, which is the similarity value for an exact match; otherwise, the spcf-value could be given too much weight over an exact match, which could raise the enSimTB value much higher than necessary on a "few" good (or exact) matches. The enSimTB measure should further enhance the SimTB value defined in Equation 4 and its variation for phrase-correlation factors as discussed in Section 3.4, since neither one counts the frequencies of occurrence of related phrases in T and B.
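Equations 7 and 8 can be read as a sum followed by a capped sum. The Python sketch below illustrates this with made-up numbers; `toy_pcf` and its table are hypothetical placeholders for the phrase-correlation factor of Equation 6, and phrases are written as plain strings only for brevity.

```python
def spcf(title_phrase, body_phrases, pcf):
    """Equation 7: sum of the phrase-correlation factors of one title
    phrase with respect to every phrase in the body."""
    return sum(pcf(title_phrase, b) for b in body_phrases)


def en_sim_tb(title_phrases, body_phrases, pcf):
    """Equation 8: sum over the title phrases of min(spcf, 1), so that no
    single title phrase contributes more than an exact match would."""
    return sum(min(spcf(t, body_phrases, pcf), 1.0) for t in title_phrases)


# Hypothetical phrase-correlation values, not taken from the paper.
TOY_PCF = {("advice site", "cash advance"): 0.25}


def toy_pcf(p1, p2):
    """Toy stand-in for Equation 6: 1 for an exact match, else a table lookup."""
    if p1 == p2:
        return 1.0
    return TOY_PCF.get((p1, p2), 0.0)


title = ["loan advice", "advice site"]
body = ["loan advice", "loan advice", "cash advance"]

# "loan advice" occurs twice in the body, so its spcf is 2.0 and is capped at 1.0;
# "advice site" contributes 0.25.
raw = en_sim_tb(title, body, toy_pcf)
```

The cap is what keeps a single heavily repeated title phrase from dominating the score, which is the situation described for the spam page of Example 2.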
Fig. 6. Determination of the ideal EnSimTB threshold value and a classification example using the SimTB versus EnSimTB value based on the bigram-similarity: (a) EnSimTB threshold values; (b) a sample spam Web page misclassified by the SimTB value
Table 5. Some of the bigram-similarity and spcf-values for the bigrams in the title T of the Web page in Figure 6(b) with respect to the bigrams in its body B

Bigrams in    Bigrams in the Body                                                 spcf         Min
the Title     cash advance   advance credit   card payday   report debt   ...    value        (spcf, 1)
loan advice   1.5×10^-9      2.1×10^-14       5.0×10^-16    6.5×10^-16    ...    2.14         1.00
advice site   2.7×10^-16     2.6×10^-15       4.5×10^-15    2.9×10^-15    ...    7.2×10^-12   7.2×10^-12
site loan     4.5×10^-13     1.5×10^-16       1.3×10^-15    2.3×10^-15    ...    9.7×10^-10   9.7×10^-10
...           ...            ...              ...           ...           ...    ...          ...
enSimTB                                                                                       4.0
EnSimTB                                                                                       0.5
To avoid the length bias in T, we normalize an enSimTB value as

$$EnSimTB(T, B) = \frac{enSimTB(T, B)}{n},$$

where n is the total number of bigrams (trigrams, respectively) in T. Thus, 0 ≤ EnSimTB(T, B) ≤ 1.

The EnSimTB Threshold Value. We define the appropriate threshold value for EnSimTB, which yields the cut-off value between spam and legitimate Web pages. Using the same Threshold set and different possible threshold values, we determine the number of FPs and FNs for each of the possible thresholds for EnSimTB. As shown in Figure 6(a), the bigram EnSimTB-threshold value should be 0.67, which yields the minimal sum of FPs and FNs. (Note that the trigram EnSimTB-threshold value is not computed, since bigrams outperform trigrams in similarity measure and we only consider bigrams from here on.)

Example 3. Table 5 shows how closely related (some of) the bigrams in the title of the spam Web page P in Figure 6(b) are to the ones in its body. Using the bigram SimTB value of P, which is 0.75, P is misclassified as legitimate, since SimTB($T_P$, $B_P$) ≥ 0.75, the bigram SimTB threshold value. However, when the EnSimTB value is considered instead, P is correctly classified as spam, since EnSimTB($T_P$, $B_P$) = 0.5 < 0.67, the bigram EnSimTB threshold value. □

The overall design of our spam-detection process is shown in Figure 7. Note that by considering the fraction of hidden content or the EnSimTB value of a Web page P after the SimTB value of P has been calculated, we are able to further reduce the number of FPs and FNs. (See details in Section 4.)
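The normalization and the threshold search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores, labels, and candidate thresholds below are invented, the function names are ours, and ties between candidates are resolved by taking the first one, which the paper does not specify.

```python
def normalized_en_sim_tb(en_sim_tb_value, n_title_phrases):
    """EnSimTB = enSimTB / n, where n is the number of n-grams in the title."""
    return en_sim_tb_value / n_title_phrases


def count_errors(scores, is_spam, threshold):
    """Return (FPs, FNs) when pages scoring >= threshold are called legitimate.
    FP: a legitimate page treated as spam; FN: a spam page treated as legitimate."""
    fp = sum(1 for s, spam in zip(scores, is_spam) if not spam and s < threshold)
    fn = sum(1 for s, spam in zip(scores, is_spam) if spam and s >= threshold)
    return fp, fn


def pick_threshold(scores, is_spam, candidates):
    """Choose the candidate with the smallest FP + FN total (first one on ties)."""
    return min(candidates, key=lambda v: sum(count_errors(scores, is_spam, v)))


# Invented stand-in for a labeled threshold set.
scores = [0.9, 0.8, 0.7, 0.5, 0.3, 0.6]
is_spam = [False, False, False, True, True, True]
best = pick_threshold(scores, is_spam, [0.4, 0.55, 0.65, 0.75])
```

For instance, `normalized_en_sim_tb(4.0, 8)` reproduces the 0.5 reported in Table 5; the value n = 8 is inferred from the two reported numbers rather than stated in the text.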
4 Experimental Results
In this section we discuss the dataset used for our empirical study and show the accuracy of our spam-detection approach in using n-gram (1 ≤ n ≤ 3) phrases, which verifies the effectiveness of our approach in detecting spam Web pages. In addition, we compare the performance of our spam-detection approach with other well-known, existing anti-spam methods.
Fig. 7. The overall Web spam-detection process
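To make the hidden-content check of Section 3.3 and the flow summarized in Figure 7 more concrete, here is a rough Python sketch. Several points are assumptions rather than facts from the paper: markup is approximated as everything inside `<...>` tags, the title is not excluded from the page, and the order in which the SimTB, EnSimTB, and HC checks are combined is only one possible reading of the process. The function names are ours; the default thresholds are the bigram SimTB, EnSimTB, and HC values reported in Section 3.

```python
import re

# Assumption: "markup content" is approximated by the text of HTML tags.
TAG_RE = re.compile(r"<[^>]*>")


def hidden_content_fraction(html):
    """HC(P): number of characters inside markup divided by all characters of P."""
    if not html:
        return 0.0
    markup_chars = sum(len(m.group(0)) for m in TAG_RE.finditer(html))
    return markup_chars / len(html)


def classify(html, sim_tb=None, en_sim_tb=None,
             sim_threshold=0.75, en_threshold=0.67, hc_threshold=0.75):
    """Return "spam" or "legitimate" for one page.

    sim_tb / en_sim_tb are precomputed bigram scores; sim_tb=None models a
    page without a title. The cascade below is a hypothetical combination.
    """
    hc_ok = hidden_content_fraction(html) >= hc_threshold
    if sim_tb is None:
        # No title: fall back on the hidden-content check alone.
        return "legitimate" if hc_ok else "spam"
    if sim_tb < sim_threshold:
        return "spam"
    if en_sim_tb is not None and en_sim_tb < en_threshold:
        return "spam"
    return "legitimate" if hc_ok else "spam"


hc = hidden_content_fraction("<p>hi</p>")  # 7 markup characters out of 9
```

With the scores of Example 3, `classify(page, sim_tb=0.75, en_sim_tb=0.5)` returns "spam" for any `page`, because the EnSimTB check fails before the hidden-content check is consulted.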
4.1 Web Page Dataset
To show the accuracy of our spam-detection approach, which is measured by the number of FPs and FNs, we used the Verification set,⁵ which consists of 1,040 randomly selected Web pages (370 labeled spam and 670 non-spam) from the WEBSPAM-UK2006 dataset. As stated in [5], WEBSPAM-UK2006 is appropriate and representative in establishing the accuracy of a given spam-detection approach, since the collection (i) includes a large variety of spam and non-spam Web pages, (ii) represents a uniform random sample, (iii) consists of spam Web pages created by using different spam techniques, and (iv) is freely available to be used as a benchmark measure in detecting spam Web pages.

4.2 Accuracy of Our Approach in Using SimTB with(out) the HC-Value
We first verified (i) the effectiveness of our spam-detection approach in using n-grams (1 ≤ n ≤ 3) and (ii) the most accurate n-gram phrases in determining the SimTB values between the title and the body of a Web page using

$$Accuracy = \frac{\text{Correctly identified Web pages}}{\text{Total number of Web pages}} \quad \text{and} \quad Error\ Rate = 1 - Accuracy,$$

where correctly identified Web pages is the total number of Web pages minus FPs and FNs. As shown in Figure 8(a), using bigrams and the SimTB values on the Web pages in the Verification set yields an accuracy and an error rate of 83% and 17%, respectively, which outperforms the unigram and trigram approaches. Figure 8(b) shows the number of FPs and FNs of different n-grams (1 ≤ n ≤ 3) in misclassifying Web pages. According to the experimental results, bigrams significantly reduce the number of FPs and FNs as opposed to the number of FPs and FNs generated by using unigrams or trigrams in computing the SimTB values. We have observed that bigrams outperform trigrams because (closely) related 3-word phrases between the title and the body of a Web page occur less often than (closely) related 2-word phrases. As a result, the degree of similarity between the title and the body of a Web page is lower in using trigrams than bigrams, causing a higher number of FPs and FNs.

⁵ Web pages in the Verification set are different from the ones in the Threshold set.

Fig. 8. Experimental results on using n-gram (1 ≤ n ≤ 3) phrases in determining the SimTB values of the Web pages in the Verification set: (a) the accuracy and error rates; (b) the number of FPs and FNs

Fig. 9. Experimental results computed on the Web pages in the Verification set: (a) accuracy using Method A (bigram SimTB + HC) versus using Method B (bigram SimTB only); (b) accuracy and error rates of the SimTB values with(out) the EnSimTB values

We further compared how well our spam-detection approach performs when considering both (i) bigram-similarity among the words in the title T and the body B of a given Web page P and (ii) the fraction of hidden content of P (Method A), i.e., Steps (iii) and (iv) in Figure 7, as opposed to only considering the bigram-similarity measure between T and B of P (Method B), i.e., Step (iii) only, using the SimTB values. Figure 9(a) shows that the accuracy of our approach is increased by close to 7% in applying Method A rather than Method B.

4.3 The Overall Accuracy of Our Spam Detection Approach
We have conducted further comparisons on using the bigram phrase-SimT B measure with(out) the EnSimT B values on the Web pages in the Verification set, i.e., Step (iii) with(out) Step (v) in Figure 7. As shown in Figure 9(b), we increase the accuracy by nearly 5%, yielding an accuracy ratio of 93.9% in detecting spam Web pages when using bigrams in computing the SimT B and EnSimT B values, instead of using solely the SimT B values. Even more so, by considering the HC-value as well as the EnSimT B value (i.e., Step (iv) and (v) in Figure 7), in addition to the SimT B value, we further reduce the number of F N s (as shown in Figure 10(a)) and obtain an overall
Fig. 10. Experimental results of applying the previously described spam detection approaches on the Web pages in the Verification set: (a) computed FPs and FNs; (b) computed accuracy and error rates
accuracy ratio of 94.4% (as shown in Figure 10(b)) without significantly increasing the computational complexity, since calculating the HC value requires only $O(n)$ time, where n is the number of characters in a Web page P, and computing the EnSimTB value requires $O(m^2)$ time, where m is the number of bigrams in P.

4.4
Comparing the Performance of Our Spam-Detection Approach with Other Anti-spam Methods
We further compare the performance (in terms of precision and recall) of our spam-detection approach with other well-known anti-spam methods in [6], which consider link-based [1] and content-based [16] features, and the combination of both. The features described in [6], which include the degree-related measures, PageRank, and TrustRank [10], and the features described in [16], such as the number of words and the average word length in a page, serve as inputs to the C4.5 decision tree. Furthermore, the authors of [6] enhance the spam-detection accuracy by (i) implementing a graph clustering algorithm that evaluates whether the majority of hosts in a cluster C are spam and, if so, considers all the hosts in C spam; (ii) applying the graph topology to smooth “spamicity” predictions by propagating them using random walks [20]; and (iii) using a stacked graphical learning scheme [7] to improve the quality of the original predictions. In comparing the existing anti-spam methods listed above with ours, we consider the evaluation method defined in [6], which adopts the following matrix:

                          Prediction
                          Non-Spam    Spam
    True Label  Non-Spam  a           b
                Spam      c           d

Using this matrix, [6] compute

$$\text{True Positive Rate (recall)} = \frac{d}{c+d}, \qquad \text{False Positive Rate} = \frac{b}{a+b}, \qquad F\text{-Measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}},$$

where precision is defined as $\frac{d}{b+d}$. High recall and precision translate into high F-measure, whereas low precision and
Fig. 11. The False Positive Rate, True Positive Rate, and F-Measure computed by using the WEBSPAM-UK2006 dataset applied to the approaches in [6] and ours
recall yield low F-measure. Furthermore, high recall combined with low precision, or low recall combined with high precision, also generates low F-measure. We used the Web pages in the WEBSPAM-UK2006 dataset to determine the precision and recall ratios, which dictate the F-Measure, of our spam-detection approach and the anti-spam methods in [6]. Figure 11 shows the results reported in [6] for different Web anti-spam detection methods using the classifier with the highest F-Measure, as well as the results generated by using our approach. Our spam-detection method clearly outperforms the other anti-spam methods by at least 10% (on average) in terms of F-Measure, which indicates that we obtain high precision and recall in detecting spam Web pages, i.e., correctly identifying spam Web pages while avoiding misclassifying legitimate Web pages.
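As a minimal sketch of how the evaluation measures of this section can be computed from the four cells a, b, c, and d of the matrix in [6], consider the following Python fragment. The class name ConfusionMatrix, its method names, the convention of returning 0.0 for an empty denominator, and the example counts are illustrative assumptions; they are not taken from [6] or from our spam-detection tool. Accuracy is included under the reading that the correctly identified Web pages are those that are neither FPs nor FNs, i.e., a + d.

```python
from dataclasses import dataclass


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero
    (an assumed convention, used only to keep the sketch total)."""
    return numerator / denominator if denominator else 0.0


@dataclass(frozen=True)
class ConfusionMatrix:
    """Cell counts of the matrix in [6]: rows are true labels, columns are predictions."""

    a: int  # true non-spam, predicted non-spam
    b: int  # true non-spam, predicted spam (false positives, FPs)
    c: int  # true spam, predicted non-spam (false negatives, FNs)
    d: int  # true spam, predicted spam

    def recall(self) -> float:
        """True Positive Rate: d / (c + d)."""
        return _ratio(self.d, self.c + self.d)

    def false_positive_rate(self) -> float:
        """False Positive Rate: b / (a + b)."""
        return _ratio(self.b, self.a + self.b)

    def precision(self) -> float:
        """Precision: d / (b + d)."""
        return _ratio(self.d, self.b + self.d)

    def f_measure(self) -> float:
        """F-Measure: 2 * precision * recall / (precision + recall)."""
        p, r = self.precision(), self.recall()
        return _ratio(2 * p * r, p + r)

    def accuracy(self) -> float:
        """Accuracy: (total - FPs - FNs) / total."""
        total = self.a + self.b + self.c + self.d
        return _ratio(total - self.b - self.c, total)

    def error_rate(self) -> float:
        """Error Rate: 1 - Accuracy."""
        return 1.0 - self.accuracy()


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    m = ConfusionMatrix(a=80, b=20, c=10, d=90)
    print(f"recall={m.recall():.3f}  FPR={m.false_positive_rate():.3f}")
    print(f"precision={m.precision():.3f}  F={m.f_measure():.3f}")
    print(f"accuracy={m.accuracy():.3f}  error rate={m.error_rate():.3f}")
```

With hypothetical counts such as a = 0, b = 90, c = 0, d = 10, recall is 1.0 but precision is only 0.1, and the F-measure falls to about 0.18, which illustrates why high recall alone does not produce a high F-measure.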
5
Conclusions and Future Work
In this paper, we present a spam-detection approach that can effectively identify spam Web pages to aid search engines in performing more adequate searches. Our anti-spam approach minimizes the time users spend looking through pages that are deceitful and do not contain useful information. In designing our anti-spam method, we consider (i) the (enhanced) similarity measures of phrases in the title with respect to the ones in the body of a Web page P, and (ii) the fraction of hidden content of P, if necessary, to determine whether P is spam. Experimental results show that by using our approach, we can classify spam Web pages with 94.4% accuracy. Moreover, our approach outperforms existing anti-spam approaches by close to 10% on average in F-measure. Furthermore, our approach is computationally inexpensive, since (i) the word-correlation factors used for computing the phrase-correlation factors are precomputed and (ii) the computational time to calculate the fraction of hidden content is insignificant.

We have observed that the use of bigrams significantly increases the performance of our spam-detection approach. Since the bigram-correlation values employed in our spam-detection tool are computed by using the unigram-correlation factors, we believe that constructing a phrase-correlation matrix directly from the Wikipedia documents could further enhance the performance of our approach in terms of (i) minimizing misclassified Web pages and (ii) reducing the computational time required to determine the (En)SimTB values of Web pages.
References
1. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R.: Link-Based Characterization and Detection of Web Spam. In: 2nd Intl. Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 1–8. ACM Press, New York (2006)
2. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R.: Using Rank Propagation and Probabilistic Counting for Link-Based Spam-Detection. In: Workshop on Web Mining and Web Usage Analysis, pp. 1–8 (2006)
3. Benczur, A., Csalogany, K., Sarlos, T., Uher, M.: SpamRank-Fully Automatic Link Spam-detection. In: 1st AIRWeb Workshop, pp. 25–38. ACM, New York (2005)
4. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: 7th Intl. World Wide Web Conf., ACM, New York (1998)
5. Castillo, C., Donato, D., Becchetti, L., Boldi, P., Santini, M., Vigna, S.: A Reference Collection for Web Spam. SIGIR Forum 40(2), 11–24 (2006)
6. Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F.: Know your Neighbors: Web Spam-detection Using the Web Topology. In: Intl. ACM SIGIR Conf., pp. 423–430. ACM, New York (2007)
7. Cohen, W.W., Kou, Z.: Stacked Graphical Learning: Approximating Learning Markov Random Fields Using Very Short Inhomogeneous Markov Chains. Technical Report, Machine Learning Department, Carnegie Mellon University (2006)
8. Davison, B.: Recognizing Nepotistic Links on the Web. In: Artificial Intelligence for Web Search, pp. 23–28. AAAI Press, Menlo Park (2000)
9. Fetterly, D., Manasse, M., Najork, M.: Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages. In: 7th Intl. Workshop on the Web and Databases (WebDB), pp. 1–6 (2004)
10. Gyongyi, Z., Garcia-Molina, H., Pedersen, J.: Combating Web Spam with TrustRank. In: 30th Intl. Conf. on VLDB, pp. 576–587. ACM, New York (2004)
11. Gyongyi, Z., Garcia-Molina, H.: Web Spam Taxonomy. In: 1st Intl. Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 39–47. ACM, New York (2005)
12. Gyongyi, Z., Berkin, P., Garcia-Molina, H., Pedersen, J.: Link Spam-Detection Based on Mass Estimation. In: Intl. Conf. on VLDB, pp. 439–450. ACM, New York (2006)
13. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco (1988)
14. Lam-Adesina, A., Jones, G.: Applying Summarization Techniques for Term Selection in Relevance Feedback. In: Intl. ACM SIGIR Conf., pp. 1–9. ACM, New York (2001)
15. Mishne, G., de Rijke, M.: Boosting Web Retrieval through Query Operations. In: European Conf. on Information Retrieval, pp. 501–516 (2005)
16. Ntoulas, A., Najork, M., Manasse, M., Fetterly, D.: Detecting Spam Web Pages through Content Analysis. In: Intl. Conf. on World Wide Web, pp. 83–92 (2006)
17. Pera, M.S., Ng, Y.-K.: Using Word Similarity to Eradicate Junk Emails. In: 16th ACM Conf. on Information and Knowledge Management, pp. 943–946 (2007)
18. Perkins, A.: The classification of Search Engine Spam (2001), http://www.silverdisc.co.uk/articles/spam-classification/
19. Svore, K.M., Wu, Q., Burges, J.C., Raman, A.: Improving Web Spam Classification Using Rank-Time Features. In: 3rd AIRWeb Workshop, pp. 9–16. ACM, New York (2007)
20. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Scholkopf, B.: Learning with Local and Global Consistency. Advances in Neural Info. Proc. Sys. 16, 321–328 (2004)