Proceedings of the 2011 International Conference on Machine Learning and Cybernetics, Guilin, 10-13 July, 2011

MALICIOUS WEB PAGE DETECTION BASED ON ON-LINE LEARNING ALGORITHM

WEN ZHANG, YU-XIN DING, YAN TANG, BIN ZHAO
Department of Computer Sciences and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
E-MAIL: [email protected], [email protected]

Abstract: The Internet has become an indispensable tool in people's daily life, but it also brings serious computer security problems. One major security threat comes from malicious webpages. In this paper we study how to detect malicious webpages. Since new malicious webpages appear continually, we use online learning methods to detect them. To keep the client side as safe as possible, we do not download webpages or analyze their content; we use only URL information to determine whether a URL links to a malicious page. Feature selection methods for URLs are discussed, and the performances of different online learning methods are compared. To improve the performance of online learning classifiers, an improved online learning method is proposed; experiments show that this method is effective.

Keywords: Malicious webpage; Machine learning; On-line learning; Semi-supervised learning

1. Introduction

The Internet has become a necessary tool in our daily life. However, it also brings serious information security problems [1]. Even when users install antivirus software, which can prevent most malicious attacks, they are still often attacked by malicious webpages, such as webpage viruses, spam, and phishing sites [2][3][4]. A malicious webpage attack uses the page as the transmission carrier and spreads rapidly and widely over the Internet. It usually takes a passive mode of transmission, that is, attacking a user's system when the user browses the webpage. Some webpage malicious code also uses search engines to find and attack webpages with existing security vulnerabilities, thereby realizing active transmission. At present, most webpage malicious code takes an active form when attacking a Web server, and once a Web server is attacked and infected by webpage malicious code, it becomes a malicious Web server. When a user browses the

webpage of such a server, the user's computer system can be attacked by the malicious webpage. Malicious webpages of all kinds seriously threaten users' computer security. To avoid these attacks, an efficient malicious webpage detection system is required to examine a webpage before the user browses it and to stop malicious webpages from being opened. This paper uses online learning methods to detect malicious webpages. Compared with other detection methods, our methods are safer: they do not analyze the content of a webpage, and only use URL information to determine whether a webpage is malicious.

2. Related work

In order to select appropriate features of webpages, Hou et al. [5] divided webpage features into 6 categories according to the level of DHTML content used in a page. 1) n-gram model: if no knowledge about a DHTML webpage is used, an n-gram model can be used to extract features [6]. 2) HTML document-level features, such as the length of the document, the average word length, word count, distinct word count, and word count per line. 3) JavaScript functions and objects, such as the number of each native JavaScript function and object; some of them are often used by attackers, e.g., escape(), unescape(), eval(), exec(), and ubound(). 4) ActiveX objects, such as counts of commonly used ActiveX objects. 5) Relationships between features. 6) Sophisticated models: CFG, templates. Hou et al. [5] used Naive Bayes, Decision Tree, SVM and Boosted Decision Tree as classifiers to detect malicious webpages, and their experimental results show that Boosted Decision Tree is the best. Fette et al. [7] use statistical machine learning methods to classify phishing emails. Their classifiers examine the URLs contained within a message (e.g., the number of URLs, number of domains, and number of dots in a URL). Bergholz et al. [8] further improve the accuracy

978-1-4577-0308-9/11/$26.00 © 2011 IEEE


of [7] by introducing models of text classification to analyze email content. Abu-Nimeh et al. compare the performance of different classifiers over a corpus of phishing emails [9]. Kolari et al. use URLs found within a blog page to determine whether a page is spam [10], using a "bag-of-words" representation to describe URLs. Garera et al. [11] use logistic regression to classify phishing URLs. Their features include the presence of red-flag keywords in the URL, features based on Google's PageRank, and Google's webpage quality guidelines. They achieved a classification accuracy of 97.3% over a set of 2,500 URLs. McGrath and Gupta [12] performed a comparative analysis of phishing and non-phishing URLs. For example, they compare non-phishing URLs drawn from the DMOZ Open Directory Project [13] with phishing URLs from PhishTank [14]. The features they analyzed include IP addresses, WHOIS records (containing date and registrar-provided information only), geographic information, and lexical features of the URL (length, character distribution, and presence of pre-defined brand names). Provos et al. [15] performed a study of drive-by exploit URLs and used a patented machine learning algorithm as a pre-filter for VM-based analysis. They extracted content-based features from webpages, including whether IFrames are "out of place," the presence of obfuscated JavaScript, and whether IFrames point to known exploit sites. Zhang et al. classify phishing URLs by thresholding a weighted sum of 8 features (4 content-related, 3 lexical and 1 WHOIS-related) [16]. The lexical features include dots in URLs, special characters, and IP addresses contained in URLs. The machine learning algorithms used by the above papers are batch learning methods, which use all of the samples simultaneously to learn.
However, new malicious webpages appear continually, and if a new one is very different from the old training samples, we have to add it to the training set and retrain the classifiers, which is labor-intensive. To address this problem, we use online learning algorithms to train classifiers. To keep the client side as safe as possible, we do not download webpages or analyze their content; we use only URL information to determine whether a URL links to a malicious page. Feature selection methods for URLs are discussed, and the performances of different online learning methods are compared. To improve the performance of online learning classifiers, an improved online learning method is proposed; experiments show that this method is effective.

3. Online Learning for Malicious Webpage Detection

In this section, we design lightweight URL classifiers, that is, classifiers that judge the reputation of a Web site entirely from the inbound URL. The motivation is to provide inherently better coverage than blacklist-based approaches (e.g., correctly predicting the status of new sites) while avoiding the client-side overhead and risk of approaches that analyze Web content on demand.

3.1. Feature Extraction

We treat URL reputation as a binary classification problem where positive examples are malicious URLs and negative examples are benign URLs. This learning-based approach can succeed if the distribution of feature values for malicious examples differs from that for benign examples, the training set shares the same feature distribution as the testing set, and the ground-truth labels for the URLs are correct. We classify URLs using the lexical and host-based features that describe them; content information of webpages is not considered. Although content information is very useful for webpage classification, we do not use it for the following reasons: A) avoiding downloading webpages is safer for users; B) classification is fast when only URL information is used; C) the method is more generic; D) obtaining and analyzing webpage content is complex work [17]. We extract URL features according to [18]: 1) Lexical features: the textual properties of the URL itself. 2) Host-based features: IP address properties, WHOIS records, and geographic properties. In our research we choose the hostname, path and WHOIS information of webpage URLs as features. We give an example to illustrate the URL feature extraction method. Consider the URL "http://item.taobao.com/item.htm?id=8614349277&cm_cat=50010167". Its hostname is "item.taobao.com" and its path is "item.htm?id=8614349277&cm_cat=50010167". The hostname is the most basic URL feature; it includes two parts, the domain name and the channel: "taobao.com" is the domain name of this webpage, and "item" is the channel. The channel part plays a very significant role in the recognition of phishing webpages. For example, the URL



"http://www.taobao.ipx32.com/about.html" has the domain name "ipx32", not "taobao": the word "taobao" appears at the channel position instead of the domain position. This is a kind of phishing website that pretends to be Taobao to deceive careless users into visiting it. During classifier learning, the "taobao" token in the channel part will be given a positive weight, which pushes the decision toward the malicious class. A path includes a directory, a file name and a file type, all of which are effective for detecting malicious webpages. Some malicious webpages with different domain names may share the same tokens in the path. For example, the URLs "niamhs.com/second2/live.js" and "networkmaterials.com/values2/live.js" have different domain names but the same file name and type in the path; if one is a malicious webpage, the other may be malicious too. The file type of a webpage also plays an important role in classification. Static webpages are generally benign, while dynamic webpages are more likely to be malicious; if the file type of a webpage is EXE, it has a high probability of being malicious. Therefore, in the classifier learning process, webpages with different file types are assigned different weights. WHOIS information is obtained from a WHOIS server and includes the website owner's registration information, such as the registrant name, e-mail, and registrant organization. From the above analysis we can see that URL features are strings, which makes the problem similar to text classification. Therefore, we use the feature representation commonly used for document classification, "bag of words", to represent URLs. The bag-of-words method treats each extracted string as one feature; all of the strings form the feature vector of a URL, and its dimension is the number of extracted strings. Each feature is represented as a binary value: if the string appears in the URL, its feature value is 1, otherwise 0.
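As an illustration, the extraction described above might be sketched as follows. The helper names (`extract_tokens`, `to_bow`) and the exact token-splitting rules are our own assumptions for illustration, not the authors' implementation.

```python
from urllib.parse import urlparse
import re

def extract_tokens(url):
    """Split a URL into the feature strings of Section 3.1:
    domain, channel tokens, and path tokens (directory/file name/file type)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host_parts = parsed.hostname.split(".")      # e.g. ['item', 'taobao', 'com']
    domain = ".".join(host_parts[-2:])           # 'taobao.com'
    channels = host_parts[:-2]                   # ['item'] -- the channel part
    raw = parsed.path + "?" + parsed.query
    path_tokens = [t for t in re.split(r"[/?=&.]", raw) if t]
    return (["domain:" + domain]
            + ["channel:" + c for c in channels]
            + ["path:" + t for t in path_tokens])

def to_bow(tokens):
    """Binary bag of words: feature value 1 iff the string occurs in the URL."""
    return {t: 1 for t in tokens}

feats = to_bow(extract_tokens(
    "http://item.taobao.com/item.htm?id=8614349277&cm_cat=50010167"))
```

For the phishing example "http://www.taobao.ipx32.com/about.html", the same function would emit "channel:taobao" rather than "domain:taobao.com", which is exactly the distinction the classifier exploits.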
One thing to notice is that in the course of learning, the dimension of the feature vector can change. If a string never appeared in the old training samples, it is added to the feature vector as a new feature. In this case the feature vectors of the old training samples need not be modified, since for them the value of the new feature is 0. This kind of feature representation is very well suited to online learning algorithms.
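One simple way to realize this growing feature space is to keep an index that assigns each newly seen string a fresh dimension; old examples are untouched because their value in any new dimension is implicitly 0. The indexer below is a hypothetical sketch, not part of the paper.

```python
class FeatureIndexer:
    """Maps feature strings to integer dimensions, growing as new strings appear."""
    def __init__(self):
        self.index = {}

    def dim(self, token):
        # assign the next free dimension the first time a token is seen
        if token not in self.index:
            self.index[token] = len(self.index)
        return self.index[token]

    def vectorize(self, tokens):
        """Binary sparse vector as {dimension: 1}; old vectors never change."""
        return {self.dim(t): 1 for t in tokens}

idx = FeatureIndexer()
v1 = idx.vectorize(["domain:taobao.com", "channel:item"])
v2 = idx.vectorize(["domain:ipx32.com", "channel:taobao"])  # two new dimensions
```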

3.2. Online Learning

Online learning is an incremental learning method: at each step it trains on one example and adjusts the weights according to the loss function. Online methods are more suitable for this problem than batch learning methods: a) online methods can process large numbers of examples far more efficiently than batch methods; b) online learning can quickly adapt to changes in malicious URLs and their features over time. Formally, the algorithms solve an online classification problem over a sequence of pairs {(x_1, y_1), (x_2, y_2), …, (x_t, y_t)}, where each x_t is an example's feature vector and y_t ∈ {-1, +1} is its label. At each time step t during training, the algorithm makes a label prediction h_t(x_t), which for linear classifiers is h_t(x_t) = sign(w_t · x_t). After making a prediction, the algorithm receives the actual label y_t. If h_t(x_t) ≠ y_t, the prediction is wrong: the algorithm computes the loss, adjusts the weight vector w_t, and goes on to the next round. The following online learning algorithms are used in this paper.

1. Perceptron. This classical algorithm is a linear classifier that makes the following update to the weight vector whenever it makes a mistake [19]:

w_{t+1} = w_t + y_t x_t        (3-1)

The advantage of the Perceptron is its simple update rule. However, because the update rate is fixed, the Perceptron cannot account for the severity of a misclassification. As a result, the algorithm can overcompensate for mistakes in some cases and undercompensate in others.

2. Passive-Aggressive (PA) algorithm. The goal of the Passive-Aggressive algorithm is to change the model as little as possible while correcting any mistakes and low-confidence predictions it encounters [20]. Specifically, for each example PA solves the following optimization:

w_{t+1} = argmin_w (1/2) ||w_t − w||^2        (3-2)

s.t. y_t (w · x_t) ≥ 1        (3-3)

Updates occur when the inner product does not exceed a fixed confidence margin, i.e., y_t (w_t · x_t) < 1. The closed-form update for all examples is:

w_{t+1} = w_t + α_t y_t x_t        (3-4)

α_t = max{ (1 − y_t (w_t · x_t)) / ||x_t||^2 , 0 }        (3-5)
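On the sparse binary features described in Section 3.1, the update rules (3-1) and (3-4)–(3-5) can be sketched as follows. This is a minimal illustration on dict-based sparse vectors with a toy token stream, not the authors' implementation.

```python
def dot(w, x):
    """Inner product of a weight dict and a sparse {feature: value} example."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def perceptron_update(w, x, y):
    """Eq. (3-1): w <- w + y*x, applied only when the prediction is wrong."""
    if (1 if dot(w, x) >= 0 else -1) != y:
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + y * v

def pa_update(w, x, y):
    """Eqs. (3-4)-(3-5): margin-scaled update whenever y*(w.x) < 1."""
    loss = 1.0 - y * dot(w, x)
    if loss > 0:
        alpha = loss / sum(v * v for v in x.values())   # alpha_t of eq. (3-5)
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + alpha * y * v

# toy stream: 'bad' marks malicious (+1) URLs, 'good' marks benign (-1) ones
stream = [({"bad": 1, "com": 1}, 1), ({"good": 1, "com": 1}, -1)] * 3
w_pa, w_perc = {}, {}
for x, y in stream:
    pa_update(w_pa, x, y)
    perceptron_update(w_perc, x, y)
```

Note how the PA step size shrinks as the margin loss shrinks, while the Perceptron always steps by the full feature vector.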


The PA algorithm has been successfully applied in practice because its updates explicitly incorporate the classification confidence.

3. Confidence-Weighted (CW) algorithm. The idea of Confidence-Weighted classification is to maintain a different confidence measure for each feature, so that less confident weights are updated more aggressively than more confident weights [21, 22]. The update rule of CW is similar to that of PA. However, instead of describing each feature with a single coefficient, CW models per-feature confidence by treating the weight vector as a Gaussian distribution N(μ, Σ). The decision rule is h_t(x) = sign(μ_t · x). The CW update adjusts the model as little as possible so that x_t is correctly classified with probability at least η. CW minimizes the KL divergence between Gaussians subject to a confidence constraint at time t:

(μ_{t+1}, Σ_{t+1}) = argmin_{μ,Σ} D_KL( N(μ, Σ) || N(μ_t, Σ_t) )        (3-6)

s.t. y_t (μ · x_t) ≥ Φ^{-1}(η) √(x_t^T Σ x_t)        (3-7)

where Φ is the cumulative distribution function of the standard normal distribution. This optimization yields the following closed-form update:

μ_{t+1} = μ_t + α_t y_t Σ_t x_t        (3-8)

Σ_{t+1}^{-1} = Σ_t^{-1} + α_t φ u_t^{-1/2} diag²(x_t)        (3-9)

where φ = Φ^{-1}(η). We can see that if the variance of a feature is large, the update to that feature's mean is more aggressive. As for performance, the run time of the update is linear in the number of non-zero features in x.
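A diagonal-covariance sketch of one CW step might look as follows. The closed-form expressions for α_t and u_t are our transcription of the exact convex CW update of [22] and should be checked against that paper; the dict-based representation and default unit variance for unseen features are illustration choices.

```python
import math

PHI = 1.2816  # phi = inverse normal CDF at confidence eta ~ 0.9

def cw_update(mu, sigma, x, y, phi=PHI):
    """One diagonal CW step. mu and sigma are {feature: value} dicts,
    sigma holding per-feature variances (1.0 by default for unseen features)."""
    psi, zeta = 1.0 + phi ** 2 / 2.0, 1.0 + phi ** 2
    m = y * sum(mu.get(f, 0.0) * v for f, v in x.items())        # margin mean
    v = sum(sigma.get(f, 1.0) * xv * xv for f, xv in x.items())  # margin variance
    alpha = max(0.0, (-m * psi + math.sqrt(m * m * phi ** 4 / 4.0
                                           + v * phi ** 2 * zeta)) / (v * zeta))
    if alpha <= 0:
        return
    u = 0.25 * (-alpha * v * phi
                + math.sqrt(alpha ** 2 * v ** 2 * phi ** 2 + 4.0 * v)) ** 2
    for f, xv in x.items():
        s = sigma.get(f, 1.0)
        mu[f] = mu.get(f, 0.0) + alpha * y * s * xv              # eq. (3-8)
        # eq. (3-9): inverse variance grows, so confident features move less later
        sigma[f] = 1.0 / (1.0 / s + alpha * phi * xv * xv / math.sqrt(u))

mu, sigma = {}, {}
cw_update(mu, sigma, {"bad": 1, "com": 1}, 1)
```

After one update the example is on the correct side of the decision boundary and the variances of its features shrink, so later updates to those features are smaller.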

3.3. Improvement on Feature Extraction

We use bag of words to represent URL features: each string extracted from a webpage URL is a feature, and its value is Boolean, 1 when the feature appears in the URL and 0 otherwise. This representation is simple, but it does not consider the importance of features. Some features appear in only one category; such features have strong discriminative power and are important. Other features have similar frequencies in both categories; such features have almost no discriminative power and are unimportant. The importance of a feature can therefore be measured by the difference between its frequencies in the two categories: if the difference is large, the feature is important; otherwise it is not.

To make feature values reflect the importance of features, each feature is assigned a weight, determined by the difference of the feature's frequency in the malicious and benign samples: when a feature appears in only one class, its weight is 1; if its frequency in malicious samples equals its frequency in benign samples, its weight is 0; otherwise it is assigned a value between 0 and 1, computed by (3-10). Let NP be the number of malicious samples and FP_i the frequency of feature i in malicious samples, so the probability that feature i appears in the malicious samples is PP_i = FP_i / NP. Let NN be the number of benign samples and FN_i the frequency of feature i in benign samples, so the probability that feature i appears in the benign samples is PN_i = FN_i / NN. The weight of feature i, denoted C_i, is computed as:

C_i = |PP_i − PN_i| / (PP_i + PN_i)        (3-10)
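The weighting scheme can be sketched as follows; the function name is our own, and the absolute value makes the weight fall in [0, 1] as the text requires (1 when a feature occurs in only one class, 0 when its frequencies are equal).

```python
def feature_weight(fp, np_, fn, nn):
    """Weight of feature i per eq. (3-10): C_i = |PP_i - PN_i| / (PP_i + PN_i),
    where PP_i = fp/np_ and PN_i = fn/nn are the feature's occurrence
    probabilities in the malicious and benign samples respectively."""
    pp, pn = fp / np_, fn / nn
    if pp + pn == 0:
        return 0.0  # feature never seen in either class
    return abs(pp - pn) / (pp + pn)
```

For instance, a feature seen in 10% of malicious samples and never in benign ones gets weight 1, while one seen in 10% of each class gets weight 0.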

4. Experiments

4.1. Dataset

The experimental dataset includes malicious URLs and normal URLs, which are used for training and testing classifiers. We use three online learning methods to train classifiers: the Perceptron, the PA algorithm and the CW algorithm. Malicious URLs are obtained from the following three websites. 2 million malicious URLs come from "www.mwsl.org.cn", a laboratory website that lists malicious websites from China, so the top-level domain of these malicious URLs is "cn". 15 million malicious URLs come from "www.malwareurl.com", and 50,000 malicious homepage URLs come from "www.malwaredomainlist.com"; URLs from these two websites have the top-level domain "com". We collect 1 million normal URLs from the Internet. From this dataset, we select 969,963 URLs, including 833,160 normal URLs and 136,803 malicious URLs. Normal URLs are labeled "-1" and malicious URLs "+1". From the 969,963 URLs we collect 680,684 features in total: 117,259 domain features, 16,669 channel features, 476,325 path features and 70,431 WHOIS features. Experiments were run on a personal computer with a 4-core Intel Core 2 processor (2.40 GHz), 2 GB memory and a 250 GB hard disk.



4.2. Experimental Results and Analysis

The performance of the online learning algorithms is evaluated by three criteria: accuracy, the true positive rate (TPR), and the true negative rate (TNR). We define the following measurements. True positives (TP) are malicious URLs classified as malicious; true negatives (TN) are benign URLs classified as benign; false positives (FP) are benign URLs classified as malicious; false negatives (FN) are malicious URLs classified as benign.

Accuracy = (TP+TN)/(TP+TN+FP+FN)
TPR = TP/(TP+FN)
TNR = TN/(TN+FP)

In our experiments the training dataset includes 881,785 URLs and the test dataset includes 88,178 URLs. We divide the training dataset into 10 subsets; tenfold cross-validation is used to evaluate the performance of each classifier. The performance of each online learning algorithm is shown in Table 4-1.

Table 4-1 Performance of three online learning algorithms
Algorithm    Accuracy  TPR   TNR
Perceptron   0.95      0.85  0.97
PA           0.96      0.92  0.97
CW           0.96      0.82  0.99

From Table 4-1 we can see that the three online learning algorithms have almost the same performance. Among them, the PA algorithm has the highest TPR, and the CW algorithm has the highest TNR. Table 4-2 lists the quantities of TP, TN, FN and FP for each algorithm.

Table 4-2 Quantity measures for online learning algorithms
Algorithm    TP     TN     FN    FP
Perceptron   13080  70903  2246  1949
PA           14172  71087  1154  1765
CW           12568  72653  2758  199

To further analyze the experimental results, we calculate the average prediction value for each type of data. For each sample the prediction value is w_t · x_t. We define four types of average prediction values: True Positive Prediction Mean (TPPM), True Negative Prediction Mean (TNPM), False Negative Prediction Mean (FNPM), and False Positive Prediction Mean (FPPM).

Table 4-3 Average prediction values for three algorithms
Algorithm    TPPM   TNPM    FNPM   FPPM
Perceptron   15.84  -18.72  -5.67  3.46
PA           2.29   -3.75   -1.14  0.47
CW           3.12   -5.02   -1.32  0.69

From Table 4-3 we can see that the Perceptron has the largest-magnitude average prediction values. In the Perceptron, the weights are updated by w_{t+1} = w_t + y_t x_t; for each w_i the update rate is fixed, the value of w_i depends only on x_i, and the loss caused by misclassifying x_i is not considered. The PA algorithm has the smallest absolute average prediction values: it changes each weight as little as possible while keeping the classification loss minimal. The prediction values of the CW algorithm are larger in magnitude than those of the PA algorithm. The reason is that the weight update of the CW algorithm depends on the distribution of the samples: the update of a weight's mean depends on the weight's variance, and if the variance is large, the mean is changed significantly. The variances of the weight vector in turn depend on the distribution of the samples; therefore, when the samples are widely distributed, the weight updates are significant, which makes the absolute average prediction values of CW larger than those of PA.

We adopt the method proposed in Section 3.3 to improve the performance of the online learning methods. The experimental results are shown in Table 4-4.

Table 4-4 Performance of the improved algorithms
Algorithm    Accuracy  TPR   TNR
Perceptron   0.96      0.88  0.98
PA           0.97      0.95  0.98
CW           0.98      0.90  0.99

Table 4-5 Quantity measures for the improved algorithms
Algorithm    TP     TN     FN    FP
Perceptron   13597  71584  1688  1324
PA           14541  71663  744   1245
CW           13828  72719  1457  189

The experimental results in Table 4-4 show that the accuracy, TPR and TNR of the three online learning algorithms all increase by about 2 percent. The overall performance of the PA algorithm is better than that of the other two algorithms.

5. Conclusion

In this paper we use online learning algorithms to detect malicious webpages. Three online learning algorithms are used to train classifiers, and their performances are compared. To make feature values reflect the importance of features, each feature is assigned a



weight. The weight of a feature is decided by the difference of the feature's frequency in malicious and benign samples. Experimental results show that this method can improve the performance of online learning algorithms.

References

[1]. Christian Seifert, Ramon Steenson, Thorsten Holz, Yuan Bing, Michael A. Davis. Know Your Enemy: Malicious Web Servers. The Honeynet Project & Research Alliance. http://www.honeynet.org/papers/mws, 2007-08-09.
[2]. N. Provos, D. McNamee, P. Mavrommatis et al. The Ghost in the Browser: Analysis of Web-based Malware. Proc. of the 2007 HotBots. Cambridge: Usenix, 2007: 301-311.
[3]. Joel Scambray, Mike Shema. Web Application Security Secrets & Solutions. Tsinghua University Press, 2003: 55-121.
[4]. Microsoft. Microsoft Security Intelligence Report. 2009(7): 82-92.
[5]. Y.-T. Hou et al. Malicious Web Content Detection by Machine Learning. Expert Systems with Applications, 2009, 37(1): 55-60.
[6]. J. Z. Kolter, M. A. Maloof. Learning to Detect Malicious Executables in the Wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, 2004: 470-478.
[7]. I. Fette, N. Sadeh, and A. Tomasic. Learning to Detect Phishing Emails. In Proceedings of the International World Wide Web Conference (WWW), Banff, Alberta, Canada, 2007.
[8]. A. Bergholz, J.-H. Chang, G. Paaß, F. Reichartz, and S. Strobel. Improved Phishing Detection using Model-Based Features. In Proceedings of the Conference on Email and Anti-Spam (CEAS), Mountain View, CA, 2008.
[9]. S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A Comparison of Machine Learning Techniques for Phishing Detection. In Proceedings of the Anti-Phishing Working Group eCrime Researchers Summit, Pittsburgh, PA, 2007.
[10]. P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog Identification and Splog Detection. In Proceedings of the AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, Stanford, CA, 2006.
[11]. S. Garera, N. Provos, M. Chew, and A. D. Rubin. A Framework for Detection and Measurement of Phishing Attacks. In Proceedings of the ACM Workshop on Rapid Malcode (WORM), Alexandria, VA, 2007.
[12]. D. K. McGrath, M. Gupta. Behind Phishing: An Examination of Phisher Modi Operandi. In Proceedings of the USENIX Workshop on Large-Scale Exploits and Emergent Threats, San Francisco, CA, 2008.
[13]. Netscape. DMOZ Open Directory Project. http://www.dmoz.org.
[14]. OpenDNS. PhishTank. http://www.phishtank.com.
[15]. N. Provos, P. Mavrommatis, M. A. Rajab, and F. Monrose. All Your iFRAMEs Point to Us. In Proceedings of the USENIX Security Symposium, San Jose, CA, 2008.
[16]. Y. Zhang, J. Hong, and L. Cranor. CANTINA: A Content-Based Approach to Detecting Phishing Web Sites. In Proceedings of the International World Wide Web Conference (WWW), Banff, Alberta, Canada, 2007.
[17]. Y. Niu, Y.-M. Wang, H. Chen, M. Ma, and F. Hsu. A Quantitative Study of Forum Spamming Using Context-based Analysis. In Proceedings of the Symposium on Network and Distributed System Security (NDSS), San Diego, CA, 2007.
[18]. Justin Ma, Lawrence K. Saul et al. Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 2009.
[19]. F. Rosenblatt. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 1958(65): 386-408.
[20]. K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research, 2006(7): 551-585.
[21]. M. Dredze, K. Crammer, F. Pereira. Confidence-Weighted Linear Classification. Proceedings of the International Conference on Machine Learning. Helsinki, Finland: Omnipress, 2008: 264-271.
[22]. K. Crammer, M. Dredze, F. Pereira. Exact Convex Confidence-Weighted Learning. Advances in Neural Information Processing Systems, 2009(21): 345-352.
