Spam Detection Methods for Arabic Web Pages

Heider A. Wahsheh
CIS Department, IT & CS Faculty, Yarmouk University, Jordan
[email protected]

Mohammed N. Al-Kabi
CIS Department, IT & CS Faculty, Yarmouk University, Jordan
[email protected]

Izzat M. Alsmadi
CIS Department, IT & CS Faculty, Yarmouk University, Jordan
[email protected]

Abstract—Spammers flood the Internet with Web pages that violate SEO guidelines in order to mislead search engines and gain the highest possible rank for their pages. This paper extends an Arabic content-based spam dataset previously built by the authors. The authors applied the Decision Tree classifier, which was shown to be the best classifier for detecting content-based Arabic Web spam. A content-based Arabic Web spam detector is also developed; it extracts the content features of Web pages and compares those features against the rule base (graph structure) produced by the Decision Tree. The content-based Web spam detector presents a solution for cleaning search engine results of Arabic spam Web pages. Content-based Arabic Web spam detection achieved an accuracy of 83% on a dataset of 2,500 spam Web pages.

Keywords: Arabic Web spam; content-based; Arabic Web spam detector.

I. INTRODUCTION

The Arab world contains 22 countries, spread across two continents, Asia and Africa. Classical Arabic is the mother tongue of more than 300 million people, and for religious reasons it is considered the language of Islam. Arabic is ranked the fifth most spoken language in the world. It is based on 28 letters, which have been adopted by other languages (e.g. Urdu, Persian, and Malay), and it is written from right to left, as it is one of the Semitic languages [1].

The use of the Internet is rapidly increasing in the Arab world, particularly in the Middle East region. Internet penetration rates have reached 31%, 26%, and 20% in the UAE, Kuwait, and Lebanon respectively, the highest percentages in the region. In the Kingdom of Saudi Arabia (KSA), penetration rates are high among academic, medical, and research institutions [2]. The massive number of Internet documents, which include audio, video, text, and other interactive media, attracts more users to engage with different technologies such as playing games, shopping, emailing, chatting, and downloading media on different topics [3].

The percentage of Arabic content on the Internet, and of Arabic content on Wikipedia, is less than 1% of the total Internet content and total Wikipedia content. This 1% is lower than the 5% share of the ME region. The Arab contribution to blogs and forums, however, is 35%, which exceeds the typical worldwide percentage (10%) by more than three fold [3].

The main challenges facing Arab Internet users lie in the regulatory and political environments, which are characterized by the absence of mutual strategic cooperation to encourage use of the Internet between Arab countries, or between individuals and institutions within the same country [2].

It is known that continuous improvement of the content of a Web page leads search engines to continuously improve that page's rank, which means gaining more visitors. In many cases an increase in the number of visitors to a Web page or site translates into more money. Therefore some Web site owners attempt to use spam techniques to deceive search engines, which is considered unethical.

Some Web site owners are spammers, or try to hire spammers. Spammers use illegal techniques and abuse Search Engine Optimization (SEO) to increase the rank of the Web pages belonging to the sites they develop. They fill the Internet with Web pages that mislead search engines and obtain a higher rank than the pages really deserve.

There are three types of Web spam: content spam, link spam, and cloaking. Content spam manipulates the text content of HTML pages to make spam pages appear more relevant to popular queries. Link spam is characterized by irrelevant hyperlinks which point to meaningless destinations. Cloaking is based on a simple idea: the spammer keeps two copies of a page, one dedicated to Web crawlers and the other dedicated to Web browsers [4].

There are many studies dedicated to the detection of Web spam, but few published papers describe algorithms for Arabic Web spam detection. All previous Arabic Web spam studies classified Web pages using content features, applying different algorithms to detect content-based Web spam. This study improves upon three previous studies [5], [6], [7] related to the detection of Arabic Web spam. The first merit of this study is the collection of a larger Arabic content-based spam corpus than those used in the three previous studies; the new corpus contains 15,000 Arabic spam Web pages. The second merit is the use of more advanced content-based features. There is a consensus among the researchers of the three previous studies [5], [6], [7] that the Decision Tree classification algorithm is the best; therefore a Java system is built to identify spam Web pages using the Decision Tree classification algorithm.

The rest of this study is organized as follows: Section II presents a brief review of previous studies on Web spam detection. Section III describes our framework for developing the Arabic Web spam detector. Section IV presents the collection of the dataset and the extraction of content-based features. Section V compares our results with previous approaches, and Section VI presents the evaluation results of the tests conducted on the newly built system. Section VII presents the conclusion and future work.

II. RELATED WORK

The Term Frequency - Inverse Document Frequency (TF-IDF) is a content-based weight reflecting the importance of a word within a document, and it is used within a number of similarity measures in information retrieval (IR) and text mining [4].
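For reference, a common formulation of this weight (reference [4] does not fix a particular variant) for a term t in a document d, over a collection of N documents, is:

    \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \log\frac{N}{\mathrm{df}(t)}

where tf(t, d) is the number of occurrences of t in d, and df(t) is the number of documents that contain t. The logarithmic factor discounts terms that appear in many documents, so a rare term repeated often within a single page receives a high weight.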

Spammers want to increase the TF-IDF scores of spam content-based Web pages. They abuse TF-IDF by using a large number of unrelated and repeated words. The goal of such word repetition in different parts of an HTML page, such as within the <title> tag, anchor text, the URL, headers (the <h1>...<h6> tags), <meta> tags, and the <body> of the Web page, is to gain a high TF-IDF score for the repeated word [8]. A small numeric illustration follows.
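The following minimal Java sketch (the collection sizes are illustrative assumptions, not figures from the paper's corpus) shows why such repetition pays off under the formulation above: multiplying a term's in-page frequency multiplies its TF-IDF score by the same factor, since the IDF part is fixed for the collection.

public class TfIdfExample {

    // TF-IDF with raw term frequency and a logarithmic inverse document frequency.
    static double tfIdf(int termCountInDoc, int docsContainingTerm, int totalDocs) {
        if (termCountInDoc == 0 || docsContainingTerm == 0) return 0.0;
        return termCountInDoc * Math.log((double) totalDocs / docsContainingTerm);
    }

    public static void main(String[] args) {
        int totalDocs = 1_000_000;   // assumed collection size
        int docsWithTerm = 1_000;    // assumed document frequency of a popular query term
        // A stuffed page repeating the term 40 times outscores a normal page
        // using it 3 times by the factor 40/3, since the IDF is identical.
        System.out.println("stuffed page: " + tfIdf(40, docsWithTerm, totalDocs)); // ~276.3
        System.out.println("normal page:  " + tfIdf(3,  docsWithTerm, totalDocs)); // ~20.7
    }
}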

Castillo, Donato, Gionis, Murdock, and Silvestri [9] proposed a spam detection system based on both link-based and content-based features. The authors of [9] considered their study a pioneering one, since it combined link-based and content-based features. The best classifier presented in the study achieved an accuracy of 88.4% with 6.3% false positives.

Liang, Ru, and Zhu [10] presented a new algorithm, called R-SpamRank, to identify spam Web pages that exist in link farms. The authors manually selected a small number of spam Web pages as a seed for the proposed algorithm. Using only the 10,000 Web pages with the highest R-SpamRank values, they found the accuracy of this novel Web spam detection algorithm to be about 91%.

The study of Araujo and Martinez-Romo [11] presented a new Web spam detection system based on two techniques: the first uses Qualified-Link (QL) analysis, while the second is content-based, using an extended language model. Using this detection system helped to improve the results.

Svore, Wu, Burges, and Raman [12] presented the construction of a dataset in which the test and prediction sets are independent. They produced a novel approach, called rank-time features, which enhances the performance of a Web spam classifier through a content-based Web spam detector. It is considered the first study to use rank-time features, especially query-dependent rank-time features. In their study, Abernethy, Chapelle, and Castillo [13] identified spam Web pages through a system called Witch, which utilizes the graph structure together with the content features of Web pages. The reported results showed the effectiveness of this system in identifying spam Web pages.

Geng, Li, and Zhang [14] proposed a new algorithm called HFSSL, which treats Web pages as vertices and computes a score for each Web page depending on its similarity within a scored graph. Tests on HFSSL proved its effectiveness in detecting spam Web pages.

Chung, Toyoda, and Kitsuregawa [15] improved an online learning algorithm that identifies link-based spam and deals with many link-based features. Tests on a Japanese Web archive collected over three years revealed an accuracy of 56%.

Niu, Ma, He, Wang, and Zhang [16] used genetic programming to detect Web spam. These authors built Web spam classifiers capable of finding the best possible functions for dealing with Web spam problems. Results on the new classifier showed 26% and 11% improvements in recall and F-measure respectively.

Hayati, Potdar, Chai, and Talevski [17] proposed a new set of user-behavior features and a novel technique to identify spam bots using an automated machine learning approach. The proposed features were used as a training dataset for a Support Vector Machine (SVM) classifier, which achieved an accuracy of 96.24%.

In order to enhance the accuracy of spam detection algorithms, Dai, Qi, and Davison [18] tried to benefit from content features of historical Web pages. They used machine learning algorithms to produce a classifier, trained on the WEBSPAM-UK2007 dataset, which revealed a 30% improvement relative to a baseline classifier.

In their study, Wahsheh and Al-Kabi [5] manually built the first Arabic Web spam corpus, which consists of only 400 spam Arabic Web pages. Three classifiers were used: Decision Tree, Naïve Bayes, and K-Nearest Neighbour (K-NN). Their study showed that K-NN at K=1 is better than the other two classifiers (Decision Tree and Naïve Bayes) at detecting Arabic Web spam pages.

The study of Jaramh, Saleh, Khattab, and Farag [6] built on the study of [5] and presented new features to improve the accuracy of Web spam detection classifiers. These authors also used three classifiers (Decision Tree, Naïve Bayes, and LogitBoost), and found that the Decision Tree yielded the best results.


The study of Al-Kabi, Wahsheh, AlEroud, and Alsmadi [7] overlaps with the two predecessor studies [5] and [6]; these studies are related solely to content-based Arabic Web spam detection. Al-Kabi, Wahsheh, AlEroud, and Alsmadi [7] presented an extended content-based spam corpus with newly proposed content-based Web spam detection features, where three classification techniques (Decision Tree, LogitBoost, and SVM) were used. Test results revealed that the Decision Tree classifier is better than the other two content-based Web spam detection classifiers, yielding good results with an accuracy of 99.3462% and an error rate of 0.6538%.

III. METHODOLOGY

This paper aims to build a content-based Arabic Web spam detector. The methodology is based on the following high-level steps:

A. Building an extended corpus of Arabic content-based spam pages relative to those used in [5], [6], and [7].

B. Extracting a larger set of features from Arabic Web pages, depending on their content, using the proposed Arabic content-based detector.

C. Using the Weka software to apply the Decision Tree classification algorithm (a sketch of this step follows the list).

D. Using the Java programming language to build an Arabic Web spam detection system which is mainly based on the rule base (graph structure) of the Decision Tree results.
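As an illustration of step C, the following Weka sketch trains a J48 tree (Weka's C4.5 implementation) on the extracted feature vectors and prints the resulting tree, whose branches form the rule base used in step D. The ARFF file name is a hypothetical placeholder; the paper does not publish its training code.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainSpamTree {
    public static void main(String[] args) throws Exception {
        // Load feature vectors exported by the Web analyzer (hypothetical file name).
        Instances data = DataSource.read("arabic_spam_features.arff");
        data.setClassIndex(data.numAttributes() - 1);  // last attribute = spam / non-spam label

        J48 tree = new J48();          // C4.5 decision tree learner
        tree.buildClassifier(data);
        System.out.println(tree);      // textual form of the rule base (graph structure)
    }
}

A new page's feature vector can then be labeled with tree.classifyInstance(instance), which is the comparison step the detector performs.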

IV. SPAM CORPUS AND RELEVANT FEATURES

A. Web Spam Corpus

This research study presents an extended corpus of spam Web pages, which is larger than the corpora used in [5], [6], and [7]. This dataset contains 15,000 Arabic spam Web pages.

B. Content-Based Web Spam Detection System

As part of the content-based Arabic Web spam detection system, we developed an algorithm to extract many features from Web pages depending on their contents. The analyzer is capable of extracting the features mentioned in [5] and [7], in addition to a few new ones.

The proposed features extracted from the Web pages in this new dataset are:

1) The number of characters in the <title> elements of Web pages, since spammers usually stuff many unrelated characters into the title to increase its score in the ranking process.

2) The number of words in the <meta> element, where spammers usually use the keyword-stuffing technique to raise the rank of their Web pages.

3) The number of characters in Web page URLs. Spammers add spam keywords to URLs to increase their Web pages' rank; therefore the URLs of spammed Web pages tend to be long.

4) The number of hyperlinks in Web pages, where spammers try to increase the number of hyperlinks referring to other spam Web pages.

Fig. 1 presents the algorithm of the proposed content-based Web spam detection system; a Java sketch of the extraction loop follows the figure.

Algorithm: Content-Based Web Spam Detection System
Input: List of URLs stored in a text file (content-basedSpam.txt).
Output: Table of the counts of Web page features, stored in the database of the Web spam detection system, and the percentage behind the decision (spam / non-spam).
BEGIN
  WHILE NOT EOF (content-basedSpam.txt)
    Read the HTTP address of a Web page.
    Download the Web page.
    Count the number of words inside <title>.
    Count the number of words inside <meta>.
    Count the number of characters inside <title>.
    Count the number of characters inside <meta>.
    Count the number of words without repetition inside <body>.
    Compute the average word length inside <body>.
    Count the number of words inside <body>.
    Count the number of characters inside <body>.
    Count the number of anchor texts within the Web page under consideration.
    Compute the Web page size in kilobytes.
    Count the number of characters in the URL under consideration.
    Count the number of images within the Web page.
  END WHILE
  Compare the similarity between the extracted features and the Decision Tree rule base (graph structure).
  Compute the true percentage of the spam / non-spam decision based on the previous step.
END

Figure 1. The developed Web Analyzer Algorithm.

The content-based Web spam detection system uses the extracted features and compares them against the Decision Tree rule base (graph structure), to determine the similarity between the Web page under test and the rule base.
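As a complement to Fig. 1, here is a minimal Java sketch of one iteration of the extraction loop, written with the open-source jsoup HTML parser; the paper does not state which parser its analyzer uses, and the URL is a placeholder.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class WebAnalyzerSketch {
    public static void main(String[] args) throws Exception {
        String url = "http://example.com/page.html";  // placeholder URL read from the input list
        Document doc = Jsoup.connect(url).get();      // download the Web page

        String title = doc.title();
        String body  = doc.body() != null ? doc.body().text() : "";

        int titleWords = title.isEmpty() ? 0 : title.split("\\s+").length;
        int titleChars = title.length();
        int bodyWords  = body.isEmpty()  ? 0 : body.split("\\s+").length;
        int bodyChars  = body.length();
        int anchors    = doc.select("a[href]").size(); // hyperlink count (feature 4)
        int images     = doc.select("img").size();
        int urlChars   = url.length();                 // URL length (feature 3)

        System.out.printf("title: %d words / %d chars; body: %d words / %d chars; "
                + "links: %d; images: %d; URL: %d chars%n",
                titleWords, titleChars, bodyWords, bodyChars, anchors, images, urlChars);
    }
}

In the full system these counts would be stored in the detector's database and then matched against the Decision Tree rule base.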

V. COMPARISON OF RESULTS

The results presented in the three studies [5], [6], and [7] conclude that the Decision Tree classification algorithm is the best relative to the other algorithms used in those studies. Therefore this study uses the Decision Tree algorithm (J48) to classify Arabic spam Web pages, which yields an accuracy of 99.521% and an error rate of 0.479%.


Table 1 presents a detailed comparison of the Decision Tree results from the three previous studies [5], [6], and [7] and from this study. Detailed per-class accuracy metrics are reported: True Positive rate (TP), False Positive rate (FP), Precision (P), Recall (R), F-Measure, and the Receiver Operating Characteristic area (ROC).

TABLE 1. DECISION TREE COMPARISON RESULTS

Dataset                                                       TP      FP      P       R      F-Measure   ROC
Detecting Arabic Web Spam [5]                                 0.98    0.06    0.94    0.98   0.96        0.96
Detecting Arabic Spam Web Pages Using Content Analysis [6]    0.91    0.009   -       -      0.98        -
Combating Arabic Web Spam Using Content Analysis [7]          0.99    0.007   0.993   0.99   0.993       0.99
This Study                                                    0.998   0.003   0.997   0.99   0.997       0.99

The results in Table 1 show that the evaluated method yields better results than the methods used in [5], [6], and [7], where all studies used a Decision Tree to identify Arabic Web spam pages.

VI. EVALUATION RESULTS

Besides the 15,000 Arabic spam Web pages used to build the Decision Tree rule base of the proposed content-based Web spam detection system, we used more than 2,500 further Arabic spam Web pages with the detection system, as a test dataset, to check the effectiveness of the evaluated prediction algorithm.

The evaluated algorithm yields an 83% prediction accuracy of the true decision for the Web pages used in the test. Table 2 presents the effectiveness measures of this study: Kappa Statistic (KS), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE), and Root Relative Squared Error (RRSE).
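For reference, the two absolute-error measures follow their standard definitions over the n test pages, where p_i is the predicted value and a_i the actual one (these are the textbook formulas, not reproduced from the paper):

    \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert p_i - a_i\rvert, \qquad
    \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(p_i - a_i)^2}

RAE and RRSE normalize these two errors by the corresponding errors of a baseline predictor that always outputs the mean of the actual values, which is why they are reported as percentages.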

TABLE 2. EFFECTIVENESS RESULTS OF THE PROPOSED SYSTEM

Method             KS       MAE      RMSE     RAE      RRSE
Proposed System    0.9911   0.0053   0.0664   1.053%   13.276%

Fig. 2 presents an example of the content-based Arabic Web spam detection system's results.

Figure 2. Content-Based Arabic Web Spam Detection System Results.

VII. CONCLUSION AND FUTURE WORK

Web spamming is defined as any manipulation of the content and link structure of Web pages to gain a higher rank than these pages really deserve.

Few studies are dedicated to Arabic content-based Web spam; they concentrate on the classification of spam Web pages using content features, applying different algorithms to detect content-based Web spam. This study is an extension of some previous studies. We collected the largest Arabic content-based spam corpus, containing around 15,000 Arabic spam Web pages. Extended content-based features were then used by the Decision Tree algorithm to build a rule base (graph structure). Finally, a new pioneering Arabic content-based Web spam detection system was built using that graph structure.

We plan to increase our dataset to obtain a more accurate Decision Tree graph structure, and to increase the size of the training dataset, to enhance the efficiency of our system. We also plan to study other types of Web spam, such as link spam and cloaking, in order to build a Web spam detection system for those spamming techniques.

REFERENCES

[1] K. Ryding, "A Reference Grammar of Modern Standard Arabic," 2005. Available: http://bilder.buecher.de/zusatz/14/14749/14749960_vorw_1.pdf
[2] W. Alrawabdeh, "Internet and the Arab World: Understanding the Key Issues and Overcoming the Barriers," The International Arab Journal of Information Technology, vol. 6, no. 1, pp. 27-33, January 2009.
[3] A. Tarabaouni, "MENA Online Advertising Industry." Available: http://www.slideshare.net/aitmit/mena-online-advertising-industry (accessed 28 Oct 2011).
[4] Z. Gyongyi and H. Garcia-Molina, "Web spam taxonomy," Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005), Chiba, Japan, pp. 1-9, 2005.
[5] H. A. Wahsheh and M. N. Al-Kabi, "Detecting Arabic Web Spam," The 5th International Conference on Information Technology (ICIT 2011), Paper ID 631, Amman, Jordan, pp. 1-8, May 11-13, 2011.
[6] R. Jaramh, T. Saleh, S. Khattab, and I. Farag, "Detecting Arabic Spam Web pages using Content Analysis," International Journal of Reviews in Computing, vol. 6, pp. 1-8, July 15, 2011.
[7] M. Al-Kabi, H. Wahsheh, A. AlEroud, and I. Alsmadi, "Combating Arabic Web Spam Using Content Analysis," 2011 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), Amman, Jordan, pp. 401-404, December 2011.
[8] T. Liu, J. Xu, T. Qin, J. Xu, W. Xiong, and H. Li, "LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval," SIGIR 2007 Workshop on Learning to Rank for Information Retrieval (LR4IR 2007), pp. 1-10, 2007.
[9] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri, "Know your Neighbors: Web Spam Detection using the Web Topology," Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, pp. 423-430, 2007.
[10] C. Liang, L. Ru, and X. Zhu, "R-SpamRank: A spam detection algorithm based on link analysis," Journal of Computational Information Systems, vol. 3, no. 4, pp. 1705-1712, 2007.
[11] L. Araujo and J. Martinez-Romo, "Web Spam Detection: New Classification Features Based on Qualified Link Analysis and Language Models," IEEE Transactions on Information Forensics and Security, vol. 5, no. 3, pp. 581-590, 2010.
[12] K. M. Svore, Q. Wu, C. Burges, and A. Raman, "Improving Web Spam Classification using Rank-time Features," Proceedings of AIRWeb '07, Banff, Alberta, Canada, pp. 9-16, 2007.
[13] J. Abernethy, O. Chapelle, and C. Castillo, "Web Spam Identification Through Content and Hyperlinks," Proceedings of AIRWeb '08, pp. 1-4, 2008.
[14] G. Geng, Q. Li, and X. Zhang, "Link based small sample learning for Web spam detection," Proceedings of the World Wide Web Conference (WWW), pp. 1185-1186, 2009.
[15] Y. J. Chung, M. Toyoda, and M. Kitsuregawa, "Identifying Spam Link Generators for Monitoring Emerging Web Spam," Proceedings of the World Wide Web Conference (WWW), pp. 51-58, 2010.
[16] X. Niu, J. Ma, Q. He, S. Wang, and D. Zhang, "Learning to Detect Web Spam by Genetic Programming," Proceedings of the 11th International Conference on Web-Age Information Management (WAIM '10), pp. 18-27, 2010.
[17] P. Hayati, V. Potdar, K. Chai, and A. Talevski, "Web Spambot Detection Based on Web Navigation Behaviour," Proceedings of the 24th IEEE International Conference on Advanced Information Networking and Applications, pp. 797-803, 2010.
[18] N. Dai, B. D. Davison, and X. Qi, "Looking into the Past to Better Classify Web Spam," Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '09), pp. 1-8, 2009.
