2015 1st International Conference on Next Generation Computing Technologies (NGCT-2015) Dehradun, India, 4-5 September 2015
Investigating the Effect of Feature Selection and Dimensionality Reduction on the Phishing Website Classification Problem

Pradeep Singh, Member IEEE
Department of Computer Science, National Institute of Technology Raipur, India
[email protected]

Niti Jain
Department of Computer Science, National Institute of Technology Raipur, India
[email protected]

Ambar Maini
Department of Computer Science, National Institute of Technology Raipur, India
[email protected]
Abstract: Phishing is a term given to the method of gaining unauthorized access to a person's private information such as passwords, account or credit card details. It is a deception technique that uses social engineering and technology to convince a victim to provide personal information, usually for monetary benefit. Phishing attacks have become frequent and carry the risk of identity theft and financial loss, so detecting phishing websites has become very important for online banking and e-commerce users. We propose an effective model based on preprocessing (feature selection and dimensionality reduction) and classification data mining algorithms. These algorithms were used to characterize and identify the factors that classify a phishing website. We implemented five classification algorithms and four preprocessing techniques to classify websites as legitimate or phishy, and compared their respective performance in terms of accuracy and AUC.
Keywords: Data Mining, Classification, phishing website detection

I. INTRODUCTION
The Internet plays an increasingly significant role in 21st-century commerce and business activities. However, poor security on the Internet is a cause for concern, since large amounts of money and vital information are transacted online, which strongly encourages attackers to enter these low-risk but high-gain systems. Email messages containing valuable and sensitive information cannot be 100% protected as they move across the Internet, so efficient techniques are needed to protect confidential information from unauthorized parties. Monetary benefit, identity theft, password mining, fame and notoriety, and malware distribution are among the other motivating factors for phishers. Phishing can lead to financial loss, information loss, blacklisting of institutions, malware and viruses on a computer system, illegal use of a user's details, and misuse of a social security number. Phishing deceives users online by exploiting weaknesses in current web security, and many reports suggest an increase in phishing [1]. Recently, phishers have attacked users with new
and different phishing techniques that are very hard to crack. Phishing targets individuals or larger organizations in order to gain information. Many phishing techniques exist today, including link manipulation, forgery of authentic websites, and redirection. With growing awareness, many ways have emerged to combat phishing: using secure connections, installing anti-phishing tools, and ignoring phishing emails are some existing countermeasures. Among the emerging techniques, data mining has become one of the most efficient approaches for classifying websites as phishing or legitimate. Attributes of a website can be used as features, and these features are an effective means for website classification [2].

This paper is arranged as follows. Section II describes related work in the field of phishing website classification. Section III is a brief introduction to phishing website classification. Section IV describes the data set used. Section V describes the phishing factor indicators and their significance. Section VI deals with the experimental design. In Section VII, the results and performance of the various classifiers are discussed. Conclusions and future work are discussed in Section VIII.
II. RELATED WORK
One of the existing approaches to phishing website classification is blacklisting. In this method, a database of domains and URLs of known phishing websites is used to block URLs. Constructing a blacklist involves time-consuming human feedback, which makes blacklists ineffective at blocking new attacks. Websites such as PhishTank [3] and Netcraft [4] maintain blacklists of phishing websites, and several other approaches have been proposed for creating blacklists and checking their accuracy [5][6]. A whitelist, in contrast, maintains a database of trustworthy URLs or domains; URLs that are not present in the whitelist are considered phishy and are blocked. The weakness of both blacklisting and whitelisting is that they are unable to detect newly created phishing websites. Over the past few years, data mining has been used extensively. Data mining is a process in which relationships are found among the features extracted from a dataset [7][8]. In heuristic-based methods, features are extracted from websites, which are then classified as either phishy or legitimate; the accuracy depends on the features selected. A detailed study of feature extraction and feature selection for text classification to identify phishing mail is another approach to classification [9]. Many further approaches have been explored, including association-rule-based classification of phishing websites [10][21]. Another approach, proposed in [11], utilizes CANTINA (Carnegie Mellon Anti-phishing and Network Analysis Tool) to detect phishy websites using information retrieval measures such as content-based techniques and term frequency-inverse document frequency (TF-IDF).

III. WEBSITE PHISHING

Phishing websites are websites made by dishonest people to gain access to confidential details. Predicting such attacks and then stopping them is a crucial step in protecting online transactions and maintaining online safety. The accuracy of predicting the type of a website largely depends on the features that are extracted. Most users depend on the results generated by a phishing detection tool, which increases the responsibility on such tools to be as accurate as possible. In this paper, we first apply preprocessing algorithms to the features extracted from the websites, followed by classification algorithms, and draw a comparison between these algorithms based on their accuracy and AUC. Fig. 1 shows the steps involved in the detection of a phishing website.

Fig. 1: Process for phishing website classification
IV. DATA USED
To implement and test our approach, we used the Phishing Websites dataset from the UCI Machine Learning Repository [12]. The total number of instances is 2456, of which 1094 are phishing websites and 1362 are legitimate websites. Each instance has 30 attributes. The features are extracted from the websites based on different aspects: the address bar, HTML, JavaScript, and domain information. These features are described in Section V.
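The experiments in this paper were run in Weka; purely as an illustration of working with the same data, the sketch below loads a hypothetical CSV export of the UCI Phishing Websites dataset. The file name phishing.csv and the label column Result are assumptions for this sketch, not part of the original setup.

```python
# Hypothetical loading of the UCI Phishing Websites data for experimentation.
# Assumes the repository file has been exported to "phishing.csv" with the
# class label stored in a column named "Result" (-1 = phishy, 1 = legitimate).
import pandas as pd

data = pd.read_csv("phishing.csv")
X = data.drop(columns=["Result"])   # the 30 feature columns
y = data["Result"]                  # the class label

print(X.shape)           # expected: (2456, 30) for the sample used in this paper
print(y.value_counts())  # roughly 1094 phishing vs. 1362 legitimate instances
```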
V. PHISHING FACTOR INDICATORS
The experimental setup uses features extracted from websites and then uses them to predict the type of each website. These features have been extracted and categorized based on the effect they have on the website. In the context of phishing website identification, using many phishing indicators usually increases data acquisition time and cost, and it leads to longer design and decision-making times; too many features also increase the chance of poor decisions. Table I shows the phishing attributes and their significance.
TABLE I: PHISHING WEBSITE FEATURES AND THEIR SIGNIFICANCE
S.No | Feature | Significance
1 | Having_IP_Address | If an IP address is used in the domain name, the website can be categorised as phishy.
2 | URL_Length | Legitimate URLs have a length of nearly 75 characters; websites with a URL longer than this can be categorised as phishing sites.
3 | Shortening_Service | Phishers use link shorteners to fool people.
4 | Having_At_Symbol | Since the @ symbol causes everything before it in the URL to be ignored, websites having an @ symbol are phishy in nature.
5 | Double_slash_redirecting | If there is a '//' in the URL beyond the protocol part, the site can be categorised as a phishing website.
6 | Prefix_Suffix (Dash) | Legitimate websites don't use dashes, so a website using them can be classified as phishy.
7 | Having_Sub_Domain | Legitimate websites generally use only up to two domain levels. Websites having more than three dots, used to include more domains within a domain, are generally phishy.
8 | SSLfinal_State | Legitimate websites use the SSL socket layer every time sensitive information is transferred. Sites not doing this can be categorised as phishing websites.
9 | Domain_registeration_length | Phishing websites have suspicious domains, typically registered for only a short period.
10 | Favicon | Phishing websites use genuine favicons to fake identity.
11 | Port | Phishing websites make use of some port numbers like 82, and therefore port scanning can help in identifying websites which are phishy.
12 | HTTPS_token | The presence of HTTPS on websites during transfer is a clear measure of authenticity; phishing websites don't use HTTPS.
13 | Request_URL | In legitimate websites, objects within the same domain are linked to that domain, whereas in phishy websites objects are observed to come from different domains.
14 | URL_of_Anchor | In legitimate websites, the anchor tag is connected to the same domain as the source code; phishy websites use different domains.
15 | Links_in_tags | Links in tags lead to some fraudulent websites.
16 | SFH | The server form handler contains an about:blank page, and in phishy websites this about:blank is linked to a different page.
17 | Submitting_to_email | Genuine users are redirected to some other page on clicking the mail.
18 | Abnormal_URL | This feature is extracted from the WHOIS database [13]; a legitimate website's main identity is in the URL.
19 | Redirect | If the number of redirects is more than three, the website can be classified as a phishing website.
20 | on_mouseover | Phishing websites manipulate the onMouseOver event in the JavaScript source code to create a fake URL in the status bar.
21 | RightClick | Phishing websites disable the right click on their pages.
22 | popUpWindow | Legitimate websites rarely ask users to submit details in a pop-up window; doing so is a clear indication of a phishy website.
23 | Iframe | An iframe is used by phishing websites to embed a file in the same HTML page and show the other file as part of the original HTML page.
24 | age_of_domain | Legitimate websites generally have a domain age of at least six months; younger websites can be classified as phishy.
25 | DNSRecord | For a phishing website, the domain record would not be present in the DNS record or WHOIS database [13].
26 | web_traffic | If the website has no traffic or only limited traffic, it can be classified as a phishing website.
27 | Page_Rank | Phishing websites have a low page rank due to the lack of links pointing to them.
28 | Google_Index | Google offers a free toolbar aimed at identifying phishing websites.
29 | Links_pointing_to_page | Phishing websites have links pointing to zip files that automatically get downloaded and contain malware.
30 | Statistical_report | It provides recent analysis of scams.
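To make the address-bar indicators in Table I concrete, the sketch below shows how a few of them could be computed from a raw URL. It is an illustrative approximation written for this discussion, not the extraction code behind the UCI dataset; the thresholds simply follow the descriptions in the table.

```python
# Illustrative extraction of a few address-bar features from Table I.
# Thresholds follow the table's descriptions; this is not the original script
# used to build the UCI dataset.
import re
from urllib.parse import urlparse

def extract_url_features(url: str) -> dict:
    host = urlparse(url).netloc
    return {
        # 1: an IP address used in place of a domain name
        "Having_IP_Address": 1 if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host.split(":")[0]) else 0,
        # 2: legitimate URLs are roughly 75 characters or fewer
        "URL_Length_suspicious": 1 if len(url) > 75 else 0,
        # 4: an '@' symbol makes the browser ignore everything before it
        "Having_At_Symbol": 1 if "@" in url else 0,
        # 5: a '//' appearing after the protocol part suggests redirection
        "Double_slash_redirecting": 1 if url.rfind("//") > 7 else 0,
        # 6: dashes in the domain are rare for legitimate sites
        "Prefix_Suffix": 1 if "-" in host else 0,
        # 7: more than three dots in the host implies extra sub-domains
        "Having_Sub_Domain": 1 if host.count(".") > 3 else 0,
    }

print(extract_url_features("http://192.168.2.1/paypal.com@secure-login.example//update"))
```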
VI. EXPERIMENTAL DESIGN

We used Weka [14], which provides built-in tools for data pre-processing, classification, clustering, and visualization. Pre-processing of the data is done using feature selection and dimensionality reduction. We applied the feature selection algorithms Correlation Feature Selection (CFS), Information Gain (IG), and Consistency Subset, and Principal Component Analysis (PCA) for dimensionality reduction. After that, we used the classification algorithms J48, Naive Bayes, SVM, Random Forest, and AdaBoost, and compared their accuracy and AUC in order to find the most reliable technique for classifying phishing websites.

A. Feature Extraction

Feature extraction is used in data mining to describe the tools and techniques for reducing the input to a size that can be used for processing and analysis. Feature selection involves cardinality reduction: choosing the right attributes and discarding others based on their usefulness for analysis. Feature extraction and selection combined with rule-based classification help to increase accuracy [15][16]. The following feature selection methods were used.

1) Correlation Feature Selection (CFS): This method evaluates subsets of attributes by taking into consideration the individual predictive ability of each feature and the degree of redundancy between them; the subsets of features meeting these criteria are selected. When applied to the phishing website dataset, CFS selected 8 features: Prefix_Suffix, SSLfinal_State, Request_URL, URL_of_Anchor, Links_in_tags, DNSRecord, web_traffic, and Google_Index.
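The experiments used Weka's CFS evaluator; as a rough illustration of the underlying heuristic only, the sketch below scores a candidate subset with the CFS merit formula, using plain Pearson correlation as a stand-in for Weka's symmetrical-uncertainty measure.

```python
# Simplified illustration of the CFS merit heuristic:
#   merit(S) = k * mean(feature-class corr) / sqrt(k + k*(k-1) * mean(feature-feature corr))
# Pearson correlation is used here as a stand-in for Weka's exact measure.
import numpy as np
import pandas as pd

def cfs_merit(X: pd.DataFrame, y: pd.Series, subset: list) -> float:
    k = len(subset)
    # average absolute correlation between each selected feature and the class
    r_cf = np.mean([abs(X[f].corr(y)) for f in subset])
    if k > 1:
        corr = X[subset].corr().abs().values
        r_ff = (corr.sum() - k) / (k * (k - 1))   # average pairwise correlation, diagonal excluded
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Example, assuming X and y are loaded as in the earlier data-loading sketch:
# print(cfs_merit(X, y, ["Prefix_Suffix", "SSLfinal_State", "Request_URL",
#                        "URL_of_Anchor", "Links_in_tags", "DNSRecord",
#                        "web_traffic", "Google_Index"]))
```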
2) Information Gain (IG): Information gain is based on information theory: the attribute with the highest information gain is selected [17]. The expected information needed to classify a tuple in partition D is given in Eq. (1):

Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)        (1)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i. A log function to base 2 is used because the information is encoded in bits. Info(D) is the average amount of information needed to identify the class label of a tuple in D. When applied to the phishing website dataset, this method retained all 30 features.
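A minimal sketch of Eq. (1), and of the information gain of a single discrete attribute, follows; it illustrates the idea behind Weka's information-gain evaluation rather than reproducing it.

```python
# Minimal computation of Info(D) from Eq. (1) and of the information gain of one
# discrete attribute; an illustration, not Weka's implementation.
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    p = labels.value_counts(normalize=True).values
    return -np.sum(p * np.log2(p))          # Eq. (1): Info(D) = -sum p_i log2 p_i

def information_gain(feature: pd.Series, labels: pd.Series) -> float:
    info_d = entropy(labels)
    info_a = 0.0
    for v in feature.unique():
        mask = feature == v
        info_a += mask.mean() * entropy(labels[mask])   # weighted entropy of partition D_v
    return info_d - info_a

# Usage, assuming X and y from the earlier data-loading sketch:
# gains = {col: information_gain(X[col], y) for col in X.columns}
```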
3) Consistency Subset: This method assesses the worth of a subset of attributes by measuring the level of consistency in the class values when the training instances are projected onto the subset of attributes. When applied to the phishing website dataset, this method selected 15 features: URL_Length, Prefix_Suffix, SSLfinal_State, Domain_registeration_length, URL_of_Anchor, Port, Favicon, Request_URL, having_Sub_Domain, age_of_domain, Links_in_tags, web_traffic, DNSRecord, Google_Index, and Links_pointing_to_page.
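As a rough illustration of the consistency criterion (not Weka's evaluator itself), the sketch below computes the inconsistency rate of a candidate subset: instances that share the same values on the selected attributes but disagree on the class count against the subset.

```python
# Illustrative inconsistency rate of a feature subset: for every distinct
# combination of selected attribute values, all instances beyond the majority
# class are counted as inconsistent. A consistency-based search prefers small
# subsets whose rate stays at the level of the full feature set.
import pandas as pd

def inconsistency_rate(X: pd.DataFrame, y: pd.Series, subset: list) -> float:
    inconsistent = 0
    # project the data onto the candidate subset and inspect each value pattern
    for _, group in y.groupby([X[f] for f in subset]):
        inconsistent += len(group) - group.value_counts().max()  # instances outside the majority class
    return inconsistent / len(y)
```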
4) Principal Component Analysis (PCA): This is a procedure in which an orthogonal transformation is used to convert a set of correlated variables into a set of values of linearly uncorrelated variables. Dimensionality reduction is obtained by choosing the eigenvectors that account for most of the variance in the initial data. We used a variance threshold of 0.95 in our experiment.
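A hedged scikit-learn equivalent of the PCA setting quoted above (retain components explaining 95% of the variance) could look as follows; the actual experiments used Weka's PCA filter.

```python
# Sketch of dimensionality reduction keeping 95% of the variance, mirroring the
# variance threshold quoted in the paper (Weka was used for the actual experiments).
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pca_pipeline = make_pipeline(
    StandardScaler(),          # scale features before projecting
    PCA(n_components=0.95),    # keep enough components to explain 95% of the variance
)

# X_reduced = pca_pipeline.fit_transform(X)   # X from the data-loading sketch
# print(pca_pipeline.named_steps["pca"].n_components_)
```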
B. Classification Algorithms

In our experiment, we used five algorithms for the classification of phishing websites, based on various techniques: the probability-based Naive Bayes classifier, the decision-tree-based J48 algorithm, SVM, and the ensemble classifiers AdaBoost and Random Forest.

1) Naive Bayes Classifier: The Naive Bayes classifier applies Bayes' theorem and is a simple technique for constructing classifiers. Despite its naive design, these classifiers work well in many complex real-world situations. The benefit of Naive Bayes is that it needs only a small amount of training data to estimate the parameters necessary for classification.

2) Support Vector Machine (SVM): The SVM algorithm is based on structural risk minimization and aims at minimizing the generalization error. The Sequential Minimal Optimization (SMO) version of SVM is used in this experiment, with complexity parameter C = 1 as the tolerance degree to errors and an RBF kernel, which proves to be efficient for classification. SVM is a well-known classifier, so further details are excluded [18][19].

3) AdaBoost: This is an algorithm that can be coupled with many other algorithms to enhance their classification accuracy. Let D be a data set of tuples (X1, y1), (X2, y2), ..., (Xd, yd), where yi is the class label of tuple Xi. In its first step, the algorithm assigns every training tuple an equal weight of 1/d. To generate k classifiers for the ensemble, k rounds are required through the remainder of the AdaBoost algorithm; in round n, tuples are sampled from the data set. AdaBoost has been used for phishing detection [20].

4) Random Forest (RF): This algorithm constructs a forest of random trees. In this method of classification, a set of decision trees is built at training time, and the individual trees are generated using a random selection of attributes at each node to determine the split. While classifying, each tree votes and the most popular class is returned.

5) J48 Algorithm: J48 is an extension of ID3 (a decision tree algorithm). The additional features of J48 include handling of missing values, decision tree pruning, continuous attribute value ranges, and derivation of rules. In the Weka data mining tool, J48 is an open-source Java implementation of the C4.5 algorithm. It uses an extension of information gain known as gain ratio, which attempts to overcome the bias towards certain attributes.
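The five classifiers were run in Weka; a hypothetical scikit-learn counterpart, with the SVM configured to roughly match the SMO settings quoted above (C = 1, RBF kernel) and the other hyper-parameters left at library defaults, is sketched below.

```python
# Rough scikit-learn counterparts of the five Weka classifiers used in the paper.
# Hyper-parameters other than C=1 and the RBF kernel are assumptions, not the
# authors' exact Weka settings.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM (RBF, C=1)": SVC(C=1.0, kernel="rbf", probability=True),
    "AdaBoost": AdaBoostClassifier(),
    "Random Forest": RandomForestClassifier(),
    "J48-like tree": DecisionTreeClassifier(criterion="entropy"),  # C4.5 analogue
}
```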
VII. RESULTS

We compared the performance of each classification algorithm under each of the preprocessing algorithms (feature selection and dimensionality reduction) on the dataset and recorded the accuracy.

For reliable and stable results, a K-fold cross-validation strategy is used, which is the standard way to measure classification accuracy. In this validation, K partitions are made; one partition is used for testing and the rest are used for training. The folds can be written as follows:

T1 = X1,  P1 = X2 ∪ X3 ∪ ... ∪ Xk
T2 = X2,  P2 = X1 ∪ X3 ∪ ... ∪ Xk
...
Tk = Xk,  Pk = X1 ∪ X2 ∪ ... ∪ Xk-1        (2)

Here T1, T2, ..., Tk are the partitions used for testing and P1, P2, ..., Pk are the corresponding training sets. K is typically 10 or 30; in this experiment K = 10 has been used.

Accuracy is calculated as the number of correct classifications divided by the total number of classifications:

Accuracy = (TP + TN) / (TP + TN + FP + FN)        (3)

where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.
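Putting the evaluation protocol together, the sketch below runs 10-fold cross-validation and reports mean accuracy (Eq. 3) and AUC for each classifier. It reuses the hypothetical classifiers dictionary and CSV export from the earlier sketches and will not reproduce the Weka numbers exactly.

```python
# 10-fold cross-validated accuracy and AUC for each classifier, mirroring the
# evaluation protocol described above. Assumes the `classifiers` dictionary from
# the previous sketch and the same hypothetical CSV export of the UCI data.
import pandas as pd
from sklearn.model_selection import cross_validate

data = pd.read_csv("phishing.csv")
X, y = data.drop(columns=["Result"]), data["Result"]

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=("accuracy", "roc_auc"))
    print(f"{name}: accuracy = {scores['test_accuracy'].mean():.4f}, "
          f"AUC = {scores['test_roc_auc'].mean():.4f}")
```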
The area under the receiver operating characteristic curve (AUC) is also used to evaluate the effectiveness of the prediction, as it is the most informative indicator of prediction accuracy. We observe that Consistency Subset outperforms the other feature selection algorithms, providing high accuracy despite using fewer features, which shows that a small number of prominent features can produce the same accuracy as the full feature set. Consistency Subset produces an accuracy of 97.4756% using 15 of the 30 features. Consistency Subset feature selection is therefore a reliable algorithm: using only 50% of the features, it produces accuracy results comparable to those obtained from all 30 features and thus saves experimental time. The results are summarised in terms of accuracy in Fig. 2 and in terms of AUC in Fig. 3.
TABLE II: RESULTS IN TERMS OF ACCURACY (%) FOR DIFFERENT CLASSIFIERS TO CLASSIFY WEBSITE PHISHING

Classifier | Unprocessed Data | CFS Subset | Info Gain | PCA | Consistency Subset
J48 | 94.9919 | 94.259 | 95.0773 | 94.1368 | 95.3176
Naive Bayes | 94.0554 | 93.1189 | 94.0554 | 91.9381 | 93.6482
SVM | 94.6254 | 93.241 | 94.6254 | 94.9591 | 94.7476
Random Forest | 97.8013 | 93.9739 | 97.557 | 96.661 | 97.4756
AdaBoost | 93.7704 | 93.2003 | 93.7704 | 89.658 | 93.7704
From the results shown in Table II, we can see that Random Forest has outperformed J48, Naive Bayes, SVM, and AdaBoost.
Fig. 2: Comparison in terms of accuracy for different classifiers

We have also noted the area under the ROC curve for each algorithm. Table III shows the AUC for the experiments performed. The AUC on unprocessed data for all the algorithms lies between 0.90 and 1, which suggests that data mining algorithms can easily predict phishy websites. From the Consistency Subset column it is also evident that the same accuracy can be achieved with fewer features.
TABLE III: RESULTS IN TERMS OF AUC TO CLASSIFY WEBSITE PHISHING

Classifier | Unprocessed Data | CFS Subset | Info Gain | PCA | Consistency Subset
J48 | 0.986 | 0.984 | 0.986 | 0.952 | 0.989
Naive Bayes | 0.987 | 0.987 | 0.987 | 0.969 | 0.986
SVM | 0.946 | 0.932 | 0.946 | 0.949 | 0.947
Random Forest | 0.997 | 0.989 | 0.997 | 0.994 | 0.997
AdaBoost | 0.986 | 0.985 | 0.986 | 0.962 | 0.986

VIII. CONCLUSION AND FUTURE WORK

In this paper, we have investigated the role of classifiers and the effect of pre-processing (feature selection and dimensionality reduction) in detecting phishing websites. Five classifiers and four pre-processing algorithms have been used in this experiment: we compared the performance of five classification algorithms combined with three well-known feature selection methods (CFS, Information Gain, Consistency Subset) and one dimensionality reduction method (PCA) on the publicly available website phishing data. The experiments showed that Random Forest outperformed J48, SVM, Naive Bayes, and AdaBoost in terms of accuracy and AUC. They also showed that the 15 features selected by Consistency Subset achieve essentially the same accuracy as the full feature set and are therefore a good alternative. For future work, we plan to develop a robust, comprehensive rule generation model for the identification of phishing websites.

REFERENCES
[1] APWG Phishing Attack Trends Report. [Online]. Available: http://www.antiphishing.org/resources/apwg-reports/
[2] H. Liu, H. Motoda, R. Setiono, and Z. Zhao, "An Ever Evolving Frontier in Data Mining," Journal of Machine Learning Research (JMLR), vol. 10, pp. 4-13, 2010.
[3] PhishTank. [Online]. Available: https://www.phishtank.com/
[4] Netcraft. [Online]. Available: http://www.netcraft.com/
[5] S. Sheng, B. Wardman, G. Warner, L. F. Cranor, J. Hong, and C. Zhang, "An Empirical Analysis of Phishing Blacklists," 6th Conference on Email and Anti-Spam (CEAS), Mountain View, California, July 16-17, 2009.
[6] C. Ludl, S. McAllister, E. Kirda, and C. Kruegel, "On the Effectiveness of Techniques to Detect Phishing Sites," Proc. 4th International Conference DIMVA 2007, Lucerne, Switzerland, pp. 20-39, July 12-13, 2007.
[7] K. Siau and S. J. Lee, "A Review of Data Mining Techniques," Industrial Management & Data Systems, MCB University Press, pp. 41-46, 2001.
[8] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons, Wiley-IEEE Press, July 2011.
[9] M. Zareapoor and K. R. Seeja, "Feature Extraction or Feature Selection for Text Classification: A Case Study on Phishing Email Detection," International Journal of Information Engineering and Electronic Business, vol. 7, no. 2, pp. 60-65, March 2015. [Online]. Available: http://www.mecs-press.org/ijieeb/ijieeb-v7-n2/
[10] N. Abdelhamid, A. Ayesh, and F. Thabtah, "Associative Classification Mining for Website Phishing Classification," Journal of Information and Knowledge Management, vol. 11, June 2012.
[11] Y. Zhang, J. Hong, and L. Cranor, "CANTINA: A Content-Based Approach to Detecting Phishing Web Sites," Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, pp. 639-648, May 8-12, 2007.
[12] UCI Machine Learning Repository: Phishing Websites Data Set. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Phishing+Websites
[13] WhoIs database. [Online]. Available: http://www.who.is/
[14] Weka tool. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/
[15] R. B. Basnet, A. H. Sung, and Q. Liu, "Feature Selection for Improved Phishing Detection," 25th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, Dalian, China, June 9-12, 2012, pp. 252-261.
[16] R. M. Mohammad, F. Thabtah, and L. McCluskey, "Intelligent Rule-based Phishing Websites Classification," IET Information Security, vol. 8, pp. 153-160, May 2014. [Online]. Available: IEEE Xplore, http://www.ieee.org
[17] J. Novakovic, "Using Information Gain Attribute Evaluation to Classify Sonar Targets," 17th Telecommunications Forum TELFOR, Belgrade, November 24-26, 2009, pp. 1351-1354.
[18] S. Maldonado and G. L'Huillier, "SVM-Based Feature Selection and Classification for Email Filtering," ICPRAM, Vilamoura, Algarve, Portugal, February 6-8, 2012, pp. 135-148.
[19] M. Chandrasekaran, K. Narayanan, and S. Upadhyaya, "Phishing E-mail Detection Based on Structural Properties," NYS Cyber Security Conference, pp. 1-7, 2006.
[20] V. Ramanathan and H. Wechsler, "PhishGILLNET - Phishing Detection Methodology Using Probabilistic Latent Semantic Analysis, AdaBoost and Co-training," EURASIP Journal on Information Security, vol. 2012, March 2012. [Online]. Available: http://jis.eurasipjournals.com/content/2012/1/1
[21] Moh'd Iqbal AL Ajlouni, Wa'el Hadi, and Jaber Alwedyan, "Detecting Phishing Websites Using Associative Classification," Journal of Information Engineering and Applications, vol. 3, no. 7, 2013.