2015 1st International Conference on Next Generation Computing Technologies (NGCT-2015) Dehradun, India, 4-5 September 2015
Investigating the Effect of Feature Selection and Dimensionality Reduction on the Phishing Website Classification Problem

Pradeep Singh, Member IEEE
Department of Computer Science, National Institute of Technology Raipur, India
[email protected]

Niti Jain
Department of Computer Science, National Institute of Technology Raipur, India
[email protected]

Ambar Maini
Department of Computer Science, National Institute of Technology Raipur, India
[email protected]
Abstract: Phishing is a term given to the method of gaining unauthorized access to a person's private information such as passwords, account or credit card details. It is a deception technique that uses social engineering and technology to convince a victim to provide personal information, usually for monetary benefit. Phishing attacks have become frequent and carry the risk of identity theft and financial loss, so detecting phishing websites has become very important for online banking and e-commerce users. We propose an effective model based on preprocessing (feature selection and dimensionality reduction) and classification data mining algorithms. These algorithms were used to characterize and identify the factors that classify a phishing website. We implemented five classification algorithms and four preprocessing techniques to classify websites as legitimate or phishy, and compared their respective performance in terms of accuracy and AUC.
Keywords: Data Mining, Classification, phishing website detection

I. INTRODUCTION
The Internet plays an increasingly significant role in 21st-century commerce and business activities. However, poor security on the Internet is a cause for concern, since large amounts of money and vital information are transacted online, which strongly encourages attackers to enter these low-risk but high-gain systems. Email messages containing valuable and sensitive information cannot be 100% protected as they move across the Internet, so efficient techniques are needed to protect confidential information from unauthorized parties. Monetary benefit, identity theft, password mining, fame and notoriety, and malware distribution are among the other motivating factors for phishers. Phishing can lead to financial loss, information loss, blacklisting of institutions, malware and viruses on a computer system, illegal use of a user's details, and misuse of a social security number. Phishing deceives users online by exploiting weaknesses in current web security, and many reports suggest an increase in phishing [1]. Recently, phishers have attacked users with new
and different phishing techniques that are very hard to crack. Phishing targets individuals or larger organizations in order to gain information. Many phishing techniques exist today, including link manipulation, forgery of authentic websites, and redirection. With growing awareness, many ways have emerged to combat phishing: using secure connections, installing anti-phishing tools, and ignoring phishing emails are some existing countermeasures. Among the emerging techniques, data mining has become one of the most efficient approaches for classifying websites as phishing or legitimate. Attributes of a website can be used as features, and these features are an effective means for website classification [2].

This paper is arranged as follows. Section II describes related work in the field of phishing website classification. Section III is a brief introduction to phishing website classification. Section IV describes the data set used. Section V describes the phishing factor indicators and their significance. Section VI deals with the experimental design. In Section VII, the results and performance of the various classifiers are discussed. Conclusions and future work are discussed in Section VIII.
II. RELATED WORK
One of the existing approaches to phishing website classification is blacklisting. In this method, a database of domains and URLs of known phishing websites is used to block URLs. Constructing a blacklist involves time-consuming human feedback, which makes blacklists ineffective at blocking new attacks. Websites such as PhishTank [3] and Netcraft [4] maintain blacklists of phishing websites, and several other approaches have been proposed for creating blacklists and checking their accuracy [5][6]. A whitelist, in contrast, maintains a database of trustworthy URLs or domains; URLs that are not present in the whitelist are considered phishy and are blocked. The weakness of both blacklisting and whitelisting is that they are unable to detect newly created phishing websites. Over the past few years, data mining has been used extensively. Data mining is a process in which relationships are found among the features extracted from a dataset [7][8]. In heuristic-based methods, features are extracted from websites, which are then classified as either phishy or legitimate; the accuracy depends on the features selected. A detailed study of feature extraction and feature selection for text classification to identify phishing mail is another approach to classification [9]. Many further approaches have been explored, including association-rule-based classification of phishing websites [10][21]. Another approach, proposed in [11], utilizes CANTINA (Carnegie Mellon Anti-phishing and Network Analysis Tool) to detect phishy websites using information retrieval measures such as content-based techniques and term frequency-inverse document frequency (TF-IDF).

III. WEBSITE PHISHING

Phishing websites are websites made by dishonest people to gain access to confidential details. Predicting such attacks and then stopping them is a crucial step in protecting online transactions and maintaining online safety. The accuracy of predicting the type of a website largely depends on the features that are extracted. Most users depend on the results generated by a phishing detection tool, which increases the responsibility on such tools to be as accurate as possible. In this paper, we first apply preprocessing algorithms to the features extracted from the websites, followed by classification algorithms, and draw a comparison between these algorithms based on their accuracy and AUC. Fig. 1 shows the steps involved in the detection of a phishing website.

Fig. 1: Process for phishing website classification
IV. DATA USED
To implement and test our approach, we used the Phishing Websites dataset from the UCI Machine Learning Repository [12]. The total number of instances is 2456, of which 1094 are phishing websites and 1362 are legitimate websites. Each instance has 30 attributes. The features are extracted from the websites based on different aspects: the address bar, HTML, JavaScript, and domain information. These features are described in Section V.
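The experiments in this paper were run in Weka; purely as an illustration of working with the same data, the sketch below loads a hypothetical CSV export of the UCI Phishing Websites dataset. The file name phishing.csv and the label column Result are assumptions for this sketch, not part of the original setup.

```python
# Hypothetical loading of the UCI Phishing Websites data for experimentation.
# Assumes the repository file has been exported to "phishing.csv" with the
# class label stored in a column named "Result" (-1 = phishy, 1 = legitimate).
import pandas as pd

data = pd.read_csv("phishing.csv")
X = data.drop(columns=["Result"])   # the 30 feature columns
y = data["Result"]                  # the class label

print(X.shape)           # expected: (2456, 30) for the sample used in this paper
print(y.value_counts())  # roughly 1094 phishing vs. 1362 legitimate instances
```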
V. PHISHING FACTOR INDICATORS
The experimental setup uses features extracted from websites and then uses them to predict the type of each website. These features have been extracted and categorized based on the effect they have on the website. In the context of phishing website identification, using many phishing indicators usually increases data acquisition time and cost, and it leads to longer design and decision-making times; too many features also increase the chance of poor decisions. Table I shows the phishing attributes and their significance.
TABLE I: PHISHING WEBSITE FEATURES AND THEIR SIGNIFICANCE
S.No | Feature | Significance
1 | Having_IP_Address | If an IP address is used in the domain name, the website can be categorised as phishy.
2 | URL_Length | Legitimate URLs have a length of nearly 75 characters; websites with a URL longer than this can be categorised as phishing sites.
3 | Shortening_Service | Phishers use link shorteners to fool people.
4 | Having_At_Symbol | Since the @ symbol causes everything before it in the URL to be ignored, websites having an @ symbol are phishy in nature.
5 | Double_slash_redirecting | If there is a '//' in the URL beyond the protocol part, the site can be categorised as a phishing website.
6 | Prefix_Suffix (Dash) | Legitimate websites don't use dashes, so a website using them can be classified as phishy.
7 | Having_Sub_Domain | Legitimate websites generally use only up to two domain levels. Websites having more than three dots, used to include more domains within a domain, are generally phishy.
8 | SSLfinal_State | Legitimate websites use the SSL socket layer every time sensitive information is transferred. Sites not doing this can be categorised as phishing websites.
9 | Domain_registeration_length | Phishing websites have suspicious domains, typically registered for only a short period.
10 | Favicon | Phishing websites use genuine favicons to fake identity.
11 | Port | Phishing websites make use of some port numbers like 82, and therefore port scanning can help in identifying websites which are phishy.
12 | HTTPS_token | The presence of HTTPS on websites during transfer is a clear measure of authenticity; phishing websites don't use HTTPS.
13 | Request_URL | In legitimate websites, objects within the same domain are linked to that domain, whereas in phishy websites objects are observed to come from different domains.
14 | URL_of_Anchor | In legitimate websites, the anchor tag is connected to the same domain as the source code; phishy websites use different domains.
15 | Links_in_tags | Links in tags lead to some fraudulent websites.
16 | SFH | The server form handler contains an about:blank page, and in phishy websites this about:blank is linked to a different page.
17 | Submitting_to_email | Genuine users are redirected to some other page on clicking the mail.
18 | Abnormal_URL | This feature is extracted from the WHOIS database [13]; a legitimate website's main identity is in the URL.
19 | Redirect | If the number of redirects is more than three, the website can be classified as a phishing website.
20 | on_mouseover | Phishing websites manipulate the onMouseOver event in the JavaScript source code to create a fake URL in the status bar.
21 | RightClick | Phishing websites disable the right click on their pages.
22 | popUpWindow | Legitimate websites rarely ask users to submit details in a pop-up window; doing so is a clear indication of a phishy website.
23 | Iframe | An iframe is used by phishing websites to embed a file in the same HTML page and show the other file as part of the original HTML page.
24 | age_of_domain | Legitimate websites generally have a domain age of at least six months; younger websites can be classified as phishy.
25 | DNSRecord | For a phishing website, the domain record would not be present in the DNS record or WHOIS database [13].
26 | web_traffic | If the website has no traffic or only limited traffic, it can be classified as a phishing website.
27 | Page_Rank | Phishing websites have a low page rank due to the lack of links pointing to them.
28 | Google_Index | Google offers a free toolbar aimed at identifying phishing websites.
29 | Links_pointing_to_page | Phishing websites have links pointing to zip files that automatically get downloaded and contain malware.
30 | Statistical_report | It provides recent analysis of scams.
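To make the address-bar indicators in Table I concrete, the sketch below shows how a few of them could be computed from a raw URL. It is an illustrative approximation written for this discussion, not the extraction code behind the UCI dataset; the thresholds simply follow the descriptions in the table.

```python
# Illustrative extraction of a few address-bar features from Table I.
# Thresholds follow the table's descriptions; this is not the original script
# used to build the UCI dataset.
import re
from urllib.parse import urlparse

def extract_url_features(url: str) -> dict:
    host = urlparse(url).netloc
    return {
        # 1: an IP address used in place of a domain name
        "Having_IP_Address": 1 if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host.split(":")[0]) else 0,
        # 2: legitimate URLs are roughly 75 characters or fewer
        "URL_Length_suspicious": 1 if len(url) > 75 else 0,
        # 4: an '@' symbol makes the browser ignore everything before it
        "Having_At_Symbol": 1 if "@" in url else 0,
        # 5: a '//' appearing after the protocol part suggests redirection
        "Double_slash_redirecting": 1 if url.rfind("//") > 7 else 0,
        # 6: dashes in the domain are rare for legitimate sites
        "Prefix_Suffix": 1 if "-" in host else 0,
        # 7: more than three dots in the host implies extra sub-domains
        "Having_Sub_Domain": 1 if host.count(".") > 3 else 0,
    }

print(extract_url_features("http://192.168.2.1/paypal.com@secure-login.example//update"))
```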
VI. EXPERIMENTAL DESIGN

We used Weka [14], which provides built-in tools for data pre-processing, classification, clustering, and visualization. Pre-processing of the data is done using feature selection and dimensionality reduction. We applied the feature selection algorithms Correlation Feature Selection (CFS), Information Gain (IG), and Consistency Subset, and Principal Component Analysis (PCA) for dimensionality reduction. After that, we used the classification algorithms J48, Naive Bayes, SVM, Random Forest, and AdaBoost, and compared their accuracy and AUC in order to find the most reliable technique for classifying phishing websites.

A. Feature Extraction

Feature extraction is used in data mining to describe the tools and techniques for reducing the input to a size that can be used for processing and analysis. Feature selection involves cardinality reduction: choosing the right attributes and discarding others based on their usefulness for analysis. Feature extraction and selection combined with rule-based classification help to increase accuracy [15][16]. The following feature selection methods were used.

1) Correlation Feature Selection (CFS): This method evaluates subsets of attributes by taking into consideration the individual predictive ability of each feature and the degree of redundancy between them; the subsets of features meeting these criteria are selected. When applied to the phishing website dataset, CFS selected 8 features: Prefix_Suffix, SSLfinal_State, Request_URL, URL_of_Anchor, Links_in_tags, DNSRecord, web_traffic, and Google_Index.
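The experiments used Weka's CFS evaluator; as a rough illustration of the underlying heuristic only, the sketch below scores a candidate subset with the CFS merit formula, using plain Pearson correlation as a stand-in for Weka's symmetrical-uncertainty measure.

```python
# Simplified illustration of the CFS merit heuristic:
#   merit(S) = k * mean(feature-class corr) / sqrt(k + k*(k-1) * mean(feature-feature corr))
# Pearson correlation is used here as a stand-in for Weka's exact measure.
import numpy as np
import pandas as pd

def cfs_merit(X: pd.DataFrame, y: pd.Series, subset: list) -> float:
    k = len(subset)
    # average absolute correlation between each selected feature and the class
    r_cf = np.mean([abs(X[f].corr(y)) for f in subset])
    if k > 1:
        corr = X[subset].corr().abs().values
        r_ff = (corr.sum() - k) / (k * (k - 1))   # average pairwise correlation, diagonal excluded
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Example, assuming X and y are loaded as in the earlier data-loading sketch:
# print(cfs_merit(X, y, ["Prefix_Suffix", "SSLfinal_State", "Request_URL",
#                        "URL_of_Anchor", "Links_in_tags", "DNSRecord",
#                        "web_traffic", "Google_Index"]))
```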
2) Information Gain (IG): Information gain is based on information theory: the attribute with the highest information gain is selected [17]. The expected information needed to classify a tuple in partition D is given in Eq. (1):

Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)        (1)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i. A log function to base 2 is used because the information is encoded in bits. Info(D) is the average amount of information needed to identify the class label of a tuple in D. When applied to the phishing website dataset, this method retained all 30 features.
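A minimal sketch of Eq. (1), and of the information gain of a single discrete attribute, follows; it illustrates the idea behind Weka's information-gain evaluation rather than reproducing it.

```python
# Minimal computation of Info(D) from Eq. (1) and of the information gain of one
# discrete attribute; an illustration, not Weka's implementation.
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    p = labels.value_counts(normalize=True).values
    return -np.sum(p * np.log2(p))          # Eq. (1): Info(D) = -sum p_i log2 p_i

def information_gain(feature: pd.Series, labels: pd.Series) -> float:
    info_d = entropy(labels)
    info_a = 0.0
    for v in feature.unique():
        mask = feature == v
        info_a += mask.mean() * entropy(labels[mask])   # weighted entropy of partition D_v
    return info_d - info_a

# Usage, assuming X and y from the earlier data-loading sketch:
# gains = {col: information_gain(X[col], y) for col in X.columns}
```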
3) Consistency Subset: This method assesses the worth of a subset of attributes by measuring the level of consistency in the class values when the training instances are projected onto the subset of attributes. When applied to the phishing website dataset, this method selected 15 features: URL_Length, Prefix_Suffix, SSLfinal_State, Domain_registeration_length, URL_of_Anchor, Port, Favicon, Request_URL, having_Sub_Domain, age_of_domain, Links_in_tags, web_traffic, DNSRecord, Google_Index, and Links_pointing_to_page.
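As a rough illustration of the consistency criterion (not Weka's evaluator itself), the sketch below computes the inconsistency rate of a candidate subset: instances that share the same values on the selected attributes but disagree on the class count against the subset.

```python
# Illustrative inconsistency rate of a feature subset: for every distinct
# combination of selected attribute values, all instances beyond the majority
# class are counted as inconsistent. A consistency-based search prefers small
# subsets whose rate stays at the level of the full feature set.
import pandas as pd

def inconsistency_rate(X: pd.DataFrame, y: pd.Series, subset: list) -> float:
    inconsistent = 0
    # project the data onto the candidate subset and inspect each value pattern
    for _, group in y.groupby([X[f] for f in subset]):
        inconsistent += len(group) - group.value_counts().max()  # instances outside the majority class
    return inconsistent / len(y)
```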
4) Principal Component Analysis (PCA): This is a procedure in which an orthogonal transformation is used to convert a set of correlated variables into a set of values of linearly uncorrelated variables. Dimensionality reduction is obtained by choosing the eigenvectors that account for most of the variance in the initial data. We used a variance threshold of 0.95 in our experiment.
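A hedged scikit-learn equivalent of the PCA setting quoted above (retain components explaining 95% of the variance) could look as follows; the actual experiments used Weka's PCA filter.

```python
# Sketch of dimensionality reduction keeping 95% of the variance, mirroring the
# variance threshold quoted in the paper (Weka was used for the actual experiments).
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pca_pipeline = make_pipeline(
    StandardScaler(),          # scale features before projecting
    PCA(n_components=0.95),    # keep enough components to explain 95% of the variance
)

# X_reduced = pca_pipeline.fit_transform(X)   # X from the data-loading sketch
# print(pca_pipeline.named_steps["pca"].n_components_)
```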
B. Classification Algorithms

In our experiment, we used five algorithms for the classification of phishing websites, based on various techniques: the probability-based Naive Bayes classifier, the decision-tree-based J48 algorithm, SVM, and the ensemble classifiers AdaBoost and Random Forest.

1) Naive Bayes Classifier: The Naive Bayes classifier applies Bayes' theorem and is a simple technique for constructing classifiers. Despite its naive design, these classifiers work well in many complex real-world situations. The benefit of Naive Bayes is that it needs only a small amount of training data to estimate the parameters necessary for classification.

2) Support Vector Machine (SVM): The SVM algorithm is based on structural risk minimization and aims at minimizing the generalization error. The Sequential Minimal Optimization (SMO) version of SVM is used in this experiment, with complexity parameter C = 1 as the tolerance degree to errors and an RBF kernel, which proves to be efficient for classification. SVM is a well-known classifier, so further details are excluded [18][19].

3) AdaBoost: This is an algorithm that can be coupled with many other algorithms to enhance their classification accuracy. Let D be a data set of tuples (X1, y1), (X2, y2), ..., (Xd, yd), where yi is the class label of tuple Xi. In its first step, the algorithm assigns every training tuple an equal weight of 1/d. To generate k classifiers for the ensemble, k rounds are required through the remainder of the AdaBoost algorithm; in round n, tuples are sampled from the data set. AdaBoost has been used for phishing detection [20].

4) Random Forest (RF): This algorithm constructs a forest of random trees. In this method of classification, a set of decision trees is built at training time, and the individual trees are generated using a random selection of attributes at each node to determine the split. While classifying, each tree votes and the most popular class is returned.

5) J48 Algorithm: J48 is an extension of ID3 (a decision tree algorithm). The additional features of J48 include handling of missing values, decision tree pruning, continuous attribute value ranges, and derivation of rules. In the Weka data mining tool, J48 is an open-source Java implementation of the C4.5 algorithm. It uses an extension of information gain known as gain ratio, which attempts to overcome the bias towards certain attributes.
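The five classifiers were run in Weka; a hypothetical scikit-learn counterpart, with the SVM configured to roughly match the SMO settings quoted above (C = 1, RBF kernel) and the other hyper-parameters left at library defaults, is sketched below.

```python
# Rough scikit-learn counterparts of the five Weka classifiers used in the paper.
# Hyper-parameters other than C=1 and the RBF kernel are assumptions, not the
# authors' exact Weka settings.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM (RBF, C=1)": SVC(C=1.0, kernel="rbf", probability=True),
    "AdaBoost": AdaBoostClassifier(),
    "Random Forest": RandomForestClassifier(),
    "J48-like tree": DecisionTreeClassifier(criterion="entropy"),  # C4.5 analogue
}
```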
VII. RESULTS

We compared the performance of each classification algorithm under each of the preprocessing algorithms (feature selection and dimensionality reduction) on the dataset and recorded the accuracy.

For reliable and stable results, a K-fold cross-validation strategy is used, which is the standard way to measure classification accuracy. In this validation, K partitions are made; one partition is used for testing and the rest are used for training. The folds can be written as follows:

T1 = X1,  P1 = X2 ∪ X3 ∪ ... ∪ Xk
T2 = X2,  P2 = X1 ∪ X3 ∪ ... ∪ Xk
...
Tk = Xk,  Pk = X1 ∪ X2 ∪ ... ∪ Xk-1        (2)

Here T1, T2, ..., Tk are the partitions used for testing and P1, P2, ..., Pk are the corresponding training sets. K is typically 10 or 30; in this experiment K = 10 has been used.

Accuracy is calculated as the number of correct classifications divided by the total number of classifications:

Accuracy = (TP + TN) / (TP + TN + FP + FN)        (3)

where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.
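Putting the evaluation protocol together, the sketch below runs 10-fold cross-validation and reports mean accuracy (Eq. 3) and AUC for each classifier. It reuses the hypothetical classifiers dictionary and CSV export from the earlier sketches and will not reproduce the Weka numbers exactly.

```python
# 10-fold cross-validated accuracy and AUC for each classifier, mirroring the
# evaluation protocol described above. Assumes the `classifiers` dictionary from
# the previous sketch and the same hypothetical CSV export of the UCI data.
import pandas as pd
from sklearn.model_selection import cross_validate

data = pd.read_csv("phishing.csv")
X, y = data.drop(columns=["Result"]), data["Result"]

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=("accuracy", "roc_auc"))
    print(f"{name}: accuracy = {scores['test_accuracy'].mean():.4f}, "
          f"AUC = {scores['test_roc_auc'].mean():.4f}")
```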
The area under the receiver operating characteristic curve (AUC) is also used to evaluate the effectiveness of the prediction, as it is the most informative indicator of prediction accuracy. We observe that Consistency Subset outperforms the other feature selection algorithms, providing high accuracy despite using fewer features, which shows that a small number of prominent features can produce the same accuracy as the full feature set. Consistency Subset produces an accuracy of 97.4756% using 15 of the 30 features. Consistency Subset feature selection is therefore a reliable algorithm: using only 50% of the features, it produces accuracy results comparable to those obtained from all 30 features and thus saves experimental time. The results are summarised in terms of accuracy in Fig. 2 and in terms of AUC in Fig. 3.
TABLE II: RESULTS IN TERMS OF ACCURACY (%) FOR DIFFERENT CLASSIFIERS TO CLASSIFY WEBSITE PHISHING

Classifier | Unprocessed Data | CFS Subset | Info Gain | PCA | Consistency Subset
J48 | 94.9919 | 94.259 | 95.0773 | 94.1368 | 95.3176
Naive Bayes | 94.0554 | 93.1189 | 94.0554 | 91.9381 | 93.6482
SVM | 94.6254 | 93.241 | 94.6254 | 94.9591 | 94.7476
Random Forest | 97.8013 | 93.9739 | 97.557 | 96.661 | 97.4756
AdaBoost | 93.7704 | 93.2003 | 93.7704 | 89.658 | 93.7704
From the results shown in Table II, we can see that Random Forest has outperformed J48, Naive Bayes, SVM, and AdaBoost.
Fig. 2: Comparison in terms of accuracy for different classifiers

We have also noted the area under the ROC curve for each algorithm. Table III shows the AUC for the experiments performed. The AUC on unprocessed data for all the algorithms lies between 0.90 and 1, which suggests that data mining algorithms can easily predict phishy websites. From the Consistency Subset column it is also evident that the same accuracy can be achieved with fewer features.
TABLE III: RESULTS IN TERMS OF AUC TO CLASSIFY WEBSITE PHISHING

Classifier | Unprocessed Data | CFS Subset | Info Gain | PCA | Consistency Subset
J48 | 0.986 | 0.984 | 0.986 | 0.952 | 0.989
Naive Bayes | 0.987 | 0.987 | 0.987 | 0.969 | 0.986
SVM | 0.946 | 0.932 | 0.946 | 0.949 | 0.947
Random Forest | 0.997 | 0.989 | 0.997 | 0.994 | 0.997
AdaBoost | 0.986 | 0.985 | 0.986 | 0.962 | 0.986

VIII. CONCLUSION AND FUTURE WORK

In this paper, we have investigated the role of classifiers and the effect of pre-processing (feature selection and dimensionality reduction) in detecting phishing websites. Five classifiers and four pre-processing algorithms have been used in this experiment: we compared the performance of five classification algorithms combined with three well-known feature selection methods (CFS, Information Gain, Consistency Subset) and one dimensionality reduction method (PCA) on the publicly available website phishing data. The experiments showed that Random Forest outperformed J48, SVM, Naive Bayes, and AdaBoost in terms of accuracy and AUC. They also showed that the 15 features selected by Consistency Subset achieve essentially the same accuracy as the full feature set and are therefore a good alternative. For future work, we plan to develop a robust, comprehensive rule generation model for the identification of phishing websites.

REFERENCES
[1] APWG Phishing Attack Trends Report. [Online]. Available: http://www.antiphishing.org/resources/apwg-reports/
[2] H. Liu, H. Motoda, R. Setiono, and Z. Zhao, "An Ever Evolving Frontier in Data Mining," Journal of Machine Learning Research (JMLR), vol. 10, pp. 4-13, 2010.
[3] PhishTank. [Online]. Available: https://www.phishtank.com/
[4] Netcraft. [Online]. Available: http://www.netcraft.com/
[5] S. Sheng, B. Wardman, G. Warner, L. F. Cranor, J. Hong, and C. Zhang, "An Empirical Analysis of Phishing Blacklists," 6th Conference on Email and Anti-Spam (CEAS), Mountain View, California, July 16-17, 2009.
[6] C. Ludl, S. McAllister, E. Kirda, and C. Kruegel, "On the Effectiveness of Techniques to Detect Phishing Sites," Proc. 4th International Conference DIMVA 2007, Lucerne, Switzerland, pp. 20-39, July 12-13, 2007.
[7] K. Siau and S. J. Lee, "A Review of Data Mining Techniques," Industrial Management & Data Systems, MCB University Press, pp. 41-46, 2001.
[8] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons, Wiley-IEEE Press, July 2011.
[9] M. Zareapoor and K. R. Seeja, "Feature Extraction or Feature Selection for Text Classification: A Case Study on Phishing Email Detection," International Journal of Information Engineering and Electronic Business, vol. 7, no. 2, pp. 60-65, March 2015. [Online]. Available: http://www.mecs-press.org/ijieeb/ijieeb-v7-n2/
[10] N. Abdelhamid, A. Ayesh, and F. Thabtah, "Associative Classification Mining for Website Phishing Classification," Journal of Information and Knowledge Management, vol. 11, June 2012.
[11] Y. Zhang, J. Hong, and L. Cranor, "CANTINA: A Content-Based Approach to Detecting Phishing Web Sites," Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, pp. 639-648, May 8-12, 2007.
[12] UCI Machine Learning Repository: Phishing Websites Data Set. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Phishing+Websites
[13] WhoIs database. [Online]. Available: http://www.who.is/
[14] Weka tool. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/
[15] R. B. Basnet, A. H. Sung, and Q. Liu, "Feature Selection for Improved Phishing Detection," 25th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, Dalian, China, June 9-12, 2012, pp. 252-261.
[16] R. M. Mohammad, F. Thabtah, and L. McCluskey, "Intelligent Rule-based Phishing Websites Classification," IET Information Security, vol. 8, pp. 153-160, May 2014. [Online]. Available: IEEE Xplore, http://www.ieee.org
[17] J. Novakovic, "Using Information Gain Attribute Evaluation to Classify Sonar Targets," 17th Telecommunications Forum TELFOR, Belgrade, November 24-26, 2009, pp. 1351-1354.
[18] S. Maldonado and G. L'Huillier, "SVM-Based Feature Selection and Classification for Email Filtering," ICPRAM, Vilamoura, Algarve, Portugal, February 6-8, 2012, pp. 135-148.
[19] M. Chandrasekaran, K. Narayanan, and S. Upadhyaya, "Phishing E-mail Detection Based on Structural Properties," NYS Cyber Security Conference, pp. 1-7, 2006.
[20] V. Ramanathan and H. Wechsler, "PhishGILLNET - Phishing Detection Methodology Using Probabilistic Latent Semantic Analysis, AdaBoost and Co-training," EURASIP Journal on Information Security, vol. 2012, March 2012. [Online]. Available: http://jis.eurasipjournals.com/content/2012/1/1
[21] Moh'd Iqbal AL Ajlouni, Wa'el Hadi, and Jaber Alwedyan, "Detecting Phishing Websites Using Associative Classification," Journal of Information Engineering and Applications, vol. 3, no. 7, 2013.