2012 International Conference on Communication Systems and Network Technologies

Research on Web Spam Detection Based on Support Vector Machine

Zhiyang Jia
Department of Information Science and Technology, Tourism and Literature College of Yunnan University, Lijiang, Yunnan Province, China
[email protected]

Weiwei Li
Department of Computer Science, Ningde Vocational and Technical College, Ningde, Fujian Province, China
[email protected]

Wei Gao
Department of Information, Yunnan Normal University, Kunming, Yunnan Province, China
[email protected]

Youming Xia
Department of Information, Yunnan Normal University, Kunming, Yunnan Province, China
[email protected]

Abstract—With the fast development of the Internet, web pages created by web spammers, which aim at cheating search engines and increasing their rankings in the search results, are prevailing. Web spam is a big problem for today's search engines; therefore it is necessary for search engines to be able to detect web spam during crawling. The web spam detection problem is viewed as a classification problem: classification models are created by machine learning classification algorithms which, given a web page, classify it into one of two categories, Normal or Spam. For the support vector machine classification model, a soft margin classifier based on a linear support vector machine was developed by learning the sample set, and penalty functions were defined according to the links between web pages, which tend to connect pages with similar characteristics. Not only the content features but also the link structure between web pages was used to build the classifier.

Keywords—search engine; web spam; anti-spam; web spam detection; SVM

I. INTRODUCTION

Due to the enormous number of web pages available on the web, Internet users have to locate useful web pages by querying search engines. Given a query string, a search engine finds the relevant pages on the web and presents the users with hyperlinks to such pages, typically in batches of 10 or 20 hyperlinks. Once the users see the result hyperlinks, they may click on one or more of them in order to visit the corresponding web pages. This model of getting relevant information through search engines has become pervasive, and most users rely on search engines to find the information they need. Given the large fraction of web traffic originating from searches and the high potential monetary value of this traffic, it is not surprising that some web site managers try to influence the positioning of their web pages within search results [1]. Some web site managers attempt to increase their rankings through conventional Search Engine Optimization (SEO) techniques such as improving the quality of their content [2].

However, many web site managers try to manipulate the ranking functions of search engines with immoral techniques. They may create millions of web pages that link to target pages; this technique is known as link stuffing, and it can increase the rankings of the target pages in search engines that use link-based ranking. They may also use the keyword-stuffing technique to fabricate web pages that are full of popular keywords frequently searched by users. Because those pages contain many popular keywords, the search engine regards them as highly relevant to the users' queries. Web pages created for the sole purpose of cheating the search engine and increasing rankings in the search results are called "web spam" [3]. For search engines, web spam detection is necessary, because the existence of web spam brings many problems: resources are wasted, the quality of search services is reduced, and normal web pages lose many opportunities to be visited. Therefore, creating an effective web spam detection model is a meaningful research topic. Given the size of the web, such a model has to be automated. The web spam detection problem is viewed as a classification problem; that is, classification models are created by machine learning classification algorithms which, given a web page, classify it into one of two categories: Normal or Spam [4].

II. CONSTRUCTION OF CLASSIFIER

A web crawler is a crucial component of a search engine. The web spam detection process should be carried out during the crawling of the web and completed before indexing, so in order to build the test data set we need to simulate the search engine by crawling the web. To design and evaluate our web spam detection model, we constructed the experimental data set by crawling the web based on a breadth-first crawling strategy and the "random sample" principle [5]. In April 2010 we downloaded 137,640 web pages and then manually labeled the spam pages in the data set. The data set is composed of 9,634 (7%) web spam pages and 128,006 (93%) normal web pages.

A. Feature Extraction Based on Content Analysis

Although normal web pages and spam pages differ significantly in visual effect, it is difficult to detect spam pages based on visual characteristics. Therefore, we extract features by content analysis and view the web spam detection problem as a binary classification problem, using machine learning methods to classify web pages into two classes, normal page and web spam, as Figure 1 shows.

Figure 1. Classify Process: Feature Set (x) → Classifier → Catalog Label (y)

We extract several features of web pages, all of them based on the content of the pages. Some of these features are independent of the language; others use language-dependent statistical properties. When search engines compute the rankings of web pages, the weight of a page's title is very high, and many spam pages duplicate words in the page's title, the technique described above as "keyword stuffing"; so we compute the length of a page's title as a feature. In order to discriminate normal pages from spam pages, we also extracted other features such as the compressibility of the page's HTML source code, the "" tag in the HTML source code, the Uniform Resource Locator (URL) length of the page, the content length of the page, the fraction of stop words and punctuation, the fraction of popular words, the fraction of visible content, the relevance between the topic and the content text, and the relevance between the anchor text and the content text [6]. We tested a rule-based classification model and a decision tree classification model for detecting web spam based on these features. The results show that these two classification models are not effective enough, because they do not utilize the linkage between web pages. So we constructed a support vector machine classification model that takes advantage of the linkage between web pages, which is expected to give more effective results.
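For illustration, the content features above can be computed with straightforward text processing. The sketch below is a minimal, hypothetical example: the stop-word and popular-word lists and the exact feature set are assumptions, not the code used in our experiments.

    import gzip
    import re

    STOP_WORDS = {"the", "a", "of", "and", "to", "in"}       # assumed small sample list
    POPULAR_WORDS = {"free", "download", "cheap", "online"}  # assumed sample list

    def content_features(title, html, url, visible_text):
        """Compute a few of the content-based features described above."""
        words = re.findall(r"[A-Za-z]+", visible_text.lower())
        n_words = max(len(words), 1)
        return {
            "title_length": len(title),
            "url_length": len(url),
            "content_length": len(visible_text),
            # compressibility: size of gzipped HTML divided by raw size
            "compress_ratio": len(gzip.compress(html.encode("utf-8"))) / max(len(html), 1),
            "stop_word_fraction": sum(w in STOP_WORDS for w in words) / n_words,
            "popular_word_fraction": sum(w in POPULAR_WORDS for w in words) / n_words,
            # fraction of the raw HTML that is visible text
            "visible_fraction": len(visible_text) / max(len(html), 1),
        }

Each page is then represented by such a feature vector x together with a catalog label y, matching the process in Figure 1.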

B. Classifier Based on Support Vector Machine

A Support Vector Machine (SVM) model takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, which makes the SVM a non-probabilistic binary linear classifier [7]. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall on.

Suppose we want to learn a linear classifier f(x) = w · x; w is found as the minimizer of the following objective function:

    Ω(w) = (1/l) Σ_{i=1}^{l} R(w · x_i, y_i) + λ w · w    (1)

Here λ is a parameter of the algorithm. The objective function captures the necessary trade-off between fitness and complexity: we would like to choose a w that correctly classifies our training data while maintaining a large margin. We use the hinge function R(u, y) = max(0, 1 − uy) to represent the loss on the training data, but any convex loss function may be used. The quantity w · w controls the size of the margin and is often referred to as the regularization term. The objective can equivalently be written as:

    minimize   Σ_{i=1}^{m} ξ_i + λ m (w · w)
    subject to y_i (w · x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., m    (2)
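To make the objective in (1)-(2) concrete, the following sketch minimizes the regularized hinge loss by plain subgradient descent. It only illustrates the objective; the solver, learning rate, iteration count, and toy data are assumptions rather than the setup used in the experiments.

    import numpy as np

    def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
        """Minimize (1/l) * sum(max(0, 1 - y_i * w.x_i)) + lam * (w.w) by subgradient descent."""
        l, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            margins = y * (X @ w)
            # subgradient of the hinge loss: -y_i * x_i where the margin is violated
            active = margins < 1
            grad = -(y[active, None] * X[active]).sum(axis=0) / l + 2 * lam * w
            w -= lr * grad
        return w

    # toy usage: two features, labels +1 (spam) / -1 (normal)
    X = np.array([[2.0, 0.1], [1.5, 0.3], [0.2, 1.8], [0.1, 2.2]])
    y = np.array([1, 1, -1, -1])
    w = train_linear_svm(X, y)
    print(np.sign(X @ w))   # predicted catalog labels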

For the particular case of classification tasks on the web, one has the additional advantage of the hyperlinks between web pages. The hyperlinks can be represented as a directed graph with edge set E. Hyperlinks are not placed at random, and it has been shown empirically that they imply some degree of similarity between the source and the target page of the hyperlink. Based on this observation, it is natural to add an additional regularizer to the objective function:

    Ω(w) = (1/l) Σ_{i=1}^{l} R(w · x_i, y_i) + λ w · w + γ Σ_{(i,j)∈E} α_ij Φ(w · x_i, w · x_j)    (3)

where α_ij is a weight associated with the link from web page i to web page j. The first two terms correspond to the standard linear SVM described above. The third term enforces the desired graph regularization. The function Φ represents any distortion measure and is chosen according to the problem at hand. The objective (3) was proposed in [8], where Φ was chosen to be:

    Φ(u, v) = (u − v)²    (4)

Formula (4) implicitly encodes the expectation that hyperlinked neighbors should have similar predicted values. One novelty of our proposed method is that, contrary to [8], we utilize asymmetric graph metrics tuned to the particular task of web spam classification. With spam, incorporating hyperlink direction is crucial: spam pages frequently link to normal pages, but rarely vice versa. This has been empirically confirmed in [9]. We exploit hyperlink direction through the alternative metric:

    Φ(u, v) = max(0, v − u)    (5)

In the case when the feature space is not rich enough, a simple linear classifier might not be flexible enough. We introduce a parameter z_i for every web page i and learn a classifier of the form:


    Ω(w) = (1/l) Σ_{i=1}^{l} R(w · x_i, y_i) + λ1 w · w + λ2 z · z + γ Σ_{(i,j)∈E} α_ij Φ(w · x_i, w · x_j)    (6)

We introduce two regularization parameters λ1 and λ2 for controlling the values of both w and z in Formula (6). For each pair of web pages i and j, we have calculated the spamicity of each page before applying the support vector machine algorithm (the computation is described in subsection C below). Suppose that f_i represents the spamicity of page i and f_j represents the spamicity of page j; the higher the spamicity of a page, the greater the likelihood that it is spam. We have defined two possible regularization functions:

    Φ_sqrt(f_i, f_j) = √|f_i − f_j|    (7)

    Φ_+sqrt(f_i, f_j) = max(0, f_j − f_i)    (8)

Formula (7) penalizes the square root of any deviation between the predicted values of i and j. Formula (8), on the other hand, only penalizes the predicted web spam scores when the page creating the link has a lower predicted spam value than the link's destination. The function Φ encodes our assumption that, while the spamicity value can reasonably decrease through a link, it should not increase. For the task of classification, the latter choice would appear more appropriate: in general, while spam pages may link to normal pages, normal pages typically have no incentive to link to spam pages. In practice we have found that the best choice of regularization is neither Formula (7) nor Formula (8) but rather a mixture of them. For any α ∈ [0, 1], the function Φ_α is defined as:

    Φ_α(a, b) = α Φ_sqrt(a, b) + (1 − α) Φ_+sqrt(a, b)    (9)

When α = 0, only links from normal pages to spam pages are penalized. When α = 1, any deviation in the predicted spam values between linked pages incurs a cost.
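The graph-regularized objective can be minimized in the same way as the plain SVM. The sketch below adds the graph term of Formula (3) with the asymmetric penalty Φ(u, v) = max(0, v − u) of Formula (5); the edge list, the weights α_ij, γ, and the step sizes are illustrative assumptions. The precomputed spamicities of Formulas (7)-(9) could enter the same term, but the sketch keeps the simpler form.

    import numpy as np

    def train_graph_regularized_svm(X, y, edges, lam=0.01, gamma=0.1, lr=0.05, epochs=300):
        """Subgradient descent on objective (3) with the asymmetric penalty
        Phi(u, v) = max(0, v - u) of Formula (5); edges is a list of (i, j, alpha_ij)."""
        l, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            scores = X @ w
            # hinge-loss part, as in the plain linear SVM
            active = y * scores < 1
            grad = -(y[active, None] * X[active]).sum(axis=0) / l + 2 * lam * w
            # graph part: penalize links whose target score exceeds the source score
            for i, j, a in edges:
                if scores[j] > scores[i]:
                    grad += gamma * a * (X[j] - X[i])
            w -= lr * grad
        return w

    # toy usage: +1 spam, -1 normal; normal page 2 links to spam page 0
    X = np.array([[2.0, 0.1], [1.5, 0.3], [0.2, 1.8], [0.3, 1.5]])
    y = np.array([1, 1, -1, -1])
    edges = [(2, 0, 1.0)]            # (source, target, alpha_ij)
    w = train_graph_regularized_svm(X, y, edges)
    print(X @ w)                     # predicted spam scores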

C. Iterative Algorithm of Spam Possibility Computing

The possibility that a page is web spam (not its credibility) needs to be calculated before the training of the support vector machine. The calculation is an iterative process based mainly on the link structure of the data set. Some of the pages are labeled as spam or normal, the labels of the remaining pages are unknown, and the link structure between the pages is known. The following iterative function is adopted to calculate the spamicity value of each web page in the data set [10]:

    Spamicity(A) = (1 − d) E(A) + d Σ_{Ti→A} Spamicity(Ti) / C(Ti)    (10)

where Ti is a web page that points to web page A, C(Ti) is the total count of web pages that point to web page A, and d is the damping factor. From the formula we can infer that the spamicity of a web page is not determined by the page itself but by the pages that point to it, and the spamicities of all the pages in the data set can be computed by several iterations. Before the iterations we must assign initial spamicity values to some pages according to the known class labels (1 for spam and 0 for normal). After the computation, spamicities are distributed in the interval [0, 1], and they indicate the possibility that a page is web spam.
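Formula (10) can be implemented in a few lines. The sketch below is a simplified, assumed implementation: it takes C(Ti) as defined in the text, fixes d = 0.85, uses the known labels (and 0.5 for unlabeled pages) as the prior E(A), and stops after a fixed number of iterations; none of these choices are prescribed by the text.

    def compute_spamicity(pages, links, labels, d=0.85, iterations=20):
        """Iteratively apply Formula (10).

        pages:  list of page ids
        links:  list of (source, target) hyperlinks
        labels: dict page -> 1 (known spam) or 0 (known normal); other pages unlabeled
        """
        # E(A): prior taken from the known labels, 0.5 for unlabeled pages (assumption)
        E = {p: labels.get(p, 0.5) for p in pages}
        spam = dict(E)                                # initial spamicity values
        in_nbrs = {p: [] for p in pages}
        for src, dst in links:
            in_nbrs[dst].append(src)

        for _ in range(iterations):
            new = {}
            for a in pages:
                nbrs = in_nbrs[a]
                # C(Ti) as defined in the text: the number of pages pointing to A
                incoming = sum(spam[t] for t in nbrs) / len(nbrs) if nbrs else 0.0
                new[a] = (1 - d) * E[a] + d * incoming
            spam = new
        return spam

    # toy usage
    pages = ["a", "b", "c"]
    links = [("a", "c"), ("b", "c")]
    labels = {"a": 1, "b": 0}
    print(compute_spamicity(pages, links, labels))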

III. EXPERIMENT RESULTS AND ANALYSIS

To compare different classifiers for detecting web spam on the World Wide Web, we carried out experiments on the data set described above. Table I summarizes the performance of the classifiers discussed in this paper.

TABLE I. RECALL RATE, PRECISION RATE, AND F1 VALUE OF THE EXPERIMENT

    Classifier            Catalog    Recall    Precision    F1 Value
    Rule-based            Spam       0.990     0.995        0.992
    Rule-based            Normal     0.893     0.807        0.848
    Decision tree based   Spam       0.991     0.995        0.993
    Decision tree based   Normal     0.903     0.816        0.857
    SVM                   Spam       0.993     0.995        0.994
    SVM                   Normal     0.912     0.906        0.909

The experimental data shows that the rule-based and decision tree based classifiers have a similar effect, while the SVM classifier is significantly better than the other classifiers. This is basically consistent with the idea of this paper: combining page content features with the link structure between pages can indeed obtain better results. In addition, due to the small size of the experimental data set, not all of the linked web pages were crawled as part of the data set, so a lot of link information was lost and the accuracy of detection was affected.
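As a small sanity check of Table I, each F1 value is the harmonic mean of the corresponding precision and recall; for example, for the SVM classifier on the Normal catalog:

    def f1(precision, recall):
        """F1 value as the harmonic mean of precision and recall."""
        return 2 * precision * recall / (precision + recall)

    print(round(f1(0.906, 0.912), 3))   # 0.909, matching Table I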

ACKNOWLEDGMENT

The authors would like to thank the editors and reviewers, who gave very insightful and encouraging comments.

REFERENCES

[1] Bing Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 1st ed., Springer-Verlag, Berlin Heidelberg, 2007, pp. 222-225.
[2] H. Simpson, Building Findable Websites: Web Standards, SEO, and Beyond, 1st ed., New Riders Press (Pearson Education), 2008, pp. 56-64.
[3] Z. Gyongyi, H. Garcia-Molina, "Web spam taxonomy," in Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Bethlehem, USA, 2005, pp. 39-48.
[4] A. Ntoulas, M. Najork, M. Manasse, "Detecting spam web pages through content analysis," in Proceedings of the 15th International Conference on World Wide Web, Scotland, 2006, pp. 83-92.
[5] Ziv Bar-Yossef, Maxim Gurevich, "Random sampling from a search engine's index," Journal of the ACM, vol. 55, no. 5, pp. 16-20, October 2008.
[6] Zhiyang Jia, Weiwei Li, Haiyan Zhang, "Content-based spam web page detection in search engine," Computer Application and Software, vol. 26, no. 11, pp. 165-167, November 2009.
[7] Thorsten Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Machine Learning: ECML-98, LNCS vol. 1398, pp. 137-142, 1998.
[8] T. Zhang, A. Popescul, B. Dom, "Linear prediction models with graph regularization for web-page categorization," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 2006, pp. 821-826.
[9] J. Abernethy, O. Chapelle, C. Castillo, "Web spam identification through content and hyperlinks," in Proceedings of the SIGIR Workshop on Adversarial Information Retrieval on the Web (AIRWeb'08), Beijing, China, April 2008.
[10] Bin Zhou, Jian Pei, Zhaohui Tang, "A spamicity approach to web spam detection," in Proceedings of the 2008 SIAM International Conference on Data Mining, Georgia, USA, April 2008, pp. 277-288.