An Empirical Evaluation for Feature Selection Methods in Phishing Email Classification

Mahmoud Khonji, Computer Engineering, Khalifa University, P.O. Box 573, Sharjah, UAE, [email protected]

Andrew Jones, Information Security, Khalifa University, P.O. Box 127788, Abu Dhabi, UAE; Edith Cowan University, [email protected]

Youssef Iraqi, Computer Engineering, Khalifa University, P.O. Box 573, Sharjah, UAE, [email protected]

April 23, 2012

Abstract

Phishing email detection is highly dependent on the accuracy of anti-phishing classifiers. According to the literature, classifiers that use Machine-Learning techniques achieve the highest phishing email classification accuracy. Using effective features in Machine-Learning is a critical step in raising a classifier's detection accuracy. This study aims at evaluating a number of feature subset selection methods as they relate to the phishing email classification domain. In order to perform this study, a total of 47 classification features were constructed as previously proposed in the literature. The primary outcome of this study is that the Wrapper evaluator with the Best-first: Forward searching method found the most effective features subset among all evaluated methods. This study addresses the gap that exists between fragmented literature items by evaluating them together under common evaluation metrics. Using the best performing feature selection method, an effective features subset was found among the 47 previously proposed features, which resulted in a highly accurate anti-phishing email classifier with an f1 score of 99.396%. This also shows that a highly competitive anti-phishing email classifier can still be constructed by only using existing Machine-Learning techniques and previously proposed features, if an effective features subset is found.

Keywords: Phishing E-Mail Classification, Machine Learning, Feature Subset Selection

1 Introduction and Motivation

Phishing is a semantic attack that communicates socially engineered messages to humans in order to persuade them to perform certain actions for the attacker's benefit (e.g. exposing confidential data such as login credentials of e-banking systems, or executing crime-ware such as Zeus or SpyEye). Over the past years, many solutions have been proposed to minimize the impact of phishing attacks, such as enhanced user training techniques [4, 2, 22, 16] and software classification techniques for phishing messages [28, 20, 17]. To the best of our knowledge, the most accurate publicly known phishing email classifiers are [8, 6, 25], all of which use machine learning techniques to classify phishing and legitimate email messages, which justifies our focus on evaluating various aspects of machine learning-based classification techniques.

Thanks to Buhooth for funding this work. A preliminary version of this work was published in [13].


In document classification via machine learning, each document (or instance) is processed to extract a set of features, which are then used as an input to a machine learning function that, through a learning phase, builds a classification model. The classification model is then able to predict classes for unlabeled instances. Theoretically, machine learning algorithms construct classification models that predict classes for unlabeled input instances by analyzing a number of features, while ignoring other features that do not help in increasing the prediction accuracy [29]. For example, decision trees (e.g. C5.0) make decision splits based on features that help in increasing prediction accuracy, and avoid splitting based on features that do not. However, in practice, experiments show that unnecessary features can confuse classification algorithms and degrade classification accuracy [29]. Unnecessary features also increase the complexity of the classifiers, which can result in further delays and a higher demand on resources. Thus, excluding unnecessary features is an important step in increasing the prediction accuracy of a classifier and reducing the system's complexity.

This study takes advantage of previously proposed phishing classification features by extracting a features subset that results in the highest possible prediction accuracy according to a number of feature subset selection methods. A number of previous works on phishing classification have primarily used Information Gain (IG) [25, 7] and the Wrapper evaluation criterion [6], along with Ranker and Best-first subset space searching algorithms. To the best of our knowledge, none of the previous works in the phishing domain presented a comparison of different feature selection techniques and their effect on the prediction accuracy of phishing email classifiers. In this study, an evaluation of a number of feature selection techniques was undertaken against public phishing and legitimate email corpora, which primarily aims at evaluating the performance of the various feature subset selection techniques (e.g. IG with the Ranker, or the Wrapper with the Best-first searching method).

This study extends the preliminary work in [13] by the addition of the C4.5 and SVM learning algorithms (in addition to Random Forests), and the Relief feature evaluation criterion. The addition of multiple learning algorithms is necessary to minimize the possibility that a feature selection method's performance is overestimated (or underestimated) due to a learner's ability to handle noisy feature sets. Background and related work are presented in Section 2, followed by a description of the analyzed phishing features in Section 3. Sections 4 and 5 introduce the evaluated feature subset searching and subset evaluation methods respectively, which are evaluated in Section 6.2. The conclusion is then drawn in Section 7.

2 Background and Related Work

Abu-Nimeh et al. [3] compared the performance of six machine learning techniques for phishing detection, namely: Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NNet). The authors used 43 features extracted from a dataset composed of 1,171 phishing emails from [19] and 1,718 legitimate emails from their own mailboxes. The authors used 10-fold cross-validation and their evaluation criteria were: Precision (p), Recall (r), f1 score, and Weighted Error (WErr) according to a specified weight λ, which are detailed in Equations 1, 2, 3 and 4 respectively.

p = n_{P→P} / (n_{P→P} + n_{L→P})   (1)

r = n_{P→P} / (n_{P→P} + n_{P→L})   (2)

f1 = 2pr / (p + r)   (3)

WErr = 1 − (λ · n_{L→L} + n_{P→P}) / (λ · N_L + N_P)   (4)

where n_{P→P} is the number of phishing emails that are correctly classified as phishing, n_{L→P} is the number of legitimate emails that are incorrectly classified as phishing, n_{P→L} is the number of phishing emails that are incorrectly classified as legitimate, n_{L→L} is the number of legitimate emails that are correctly classified as legitimate, N_L is the total number of legitimate emails in the testing dataset, and N_P is the total number of phishing emails in the testing dataset.

When λ = 1, WErr returns an error rate with the assumption that misclassified legitimate emails and misclassified phishing emails are equally important. However, when λ = 9, WErr returns an error rate where the penalty for the misclassification of legitimate emails is 9 times more than the penalty for the misclassification of phishing emails.

The evaluation in [3] showed that RF achieved the lowest error rate WErr when λ = 1. However, when λ = 9, RF performed the worst due to its high false positive rate. On the other hand, LR achieved the best WErr when λ = 9. Abu-Nimeh et al. concluded that the addition of other features may be useful to further reduce WErr rates.

Toolan et al. [26] then evaluated 40 features, most of which were previously proposed in other phishing/spam detection solutions. The aim of their evaluation was primarily to evaluate a number of features proposed by other researchers, as well as to show the importance of feature selection in the prediction accuracy of machine learning classifiers. The phishing emails corpus was obtained from [19] and contained 4,563 emails. The spam and legitimate email corpora were obtained from [24] and contained 1,895 and 4,202 emails respectively. Phishing, spam and legitimate emails were then used to construct three datasets (see Table 1).

Table 1: Toolan et al. data sets.
Dataset ID | Email Classes
Dataset 1  | Legitimate, SPAM
Dataset 2  | Legitimate, phishing
Dataset 3  | Legitimate, SPAM, phishing

Toolan et al. [26] used Information Gain (IG) as the criterion to rank the features. The IG analysis showed that language modeling features ranked the highest among all of the three data sets. For example, the highest ranking features were body richness and subject richness, which are measured according to Equation 5.

Richness = N_{Words} / N_{Characters}   (5)

where N_{Words} is the total number of words in the email's body or subject, and N_{Characters} is the total number of characters in an email's body or subject.

The authors in [26] then used the C5.0 [21] decision tree classification algorithm to construct a number of classification models trained with three varying feature subsets, namely: best, medium and worst, as ranked according to the IG criterion. Their empirical evaluations on the datasets presented in Table 1 showed that features with better IG ranks resulted in better predictive accuracy than features ranked as medium and worst by the IG criterion, which generally showed that feature selection is needed and IG is a reasonable criterion for ranking features. However, the study did not compare ranked IG features against other feature subset selection methods.

Previous works on phishing classification have primarily used IG [25, 7] and the Wrapper evaluation criterion [6] along with Ranker and Best-first subset space searching algorithms. To the best of our knowledge, none of the previous work presented comparisons of different feature selection techniques in the phishing domain. In this paper, we continue from there by evaluating a number of feature subset evaluators, namely: Information Gain (IG), Relief, Wrapper, the Inconsistency Feature Subset Evaluator (ConEval^1) [18], and the Correlation-Based Feature Subset Evaluator (CorEval^2) [10], as well as features subset searching algorithms, namely: Ranker, Best-first: Forward, Best-first: Backward, and Best-first: Bi-directional. The feature subset that leads to the highest classification performance is then chosen to construct a phishing email classifier.

^1 Named "ConsistencySubsetEval" according to Weka [27]; this paper refers to it as "ConEval" for brevity.
^2 Named "CfsSubsetEval" according to Weka [27]; this paper refers to it as "CorEval" for brevity.
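As a concrete illustration of Equations 1 to 4, the following Python sketch (ours, not from the cited works; the confusion counts in the example are hypothetical) computes the four metrics from a classifier's confusion counts:

```python
def precision(n_pp, n_lp):
    """Equation 1: share of emails classified as phishing that truly are."""
    return n_pp / (n_pp + n_lp)

def recall(n_pp, n_pl):
    """Equation 2: share of phishing emails that are caught."""
    return n_pp / (n_pp + n_pl)

def f1(p, r):
    """Equation 3: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def weighted_error(n_ll, n_pp, n_l, n_p, lam=1):
    """Equation 4: lam = 9 makes a misclassified legitimate email
    nine times as costly as a misclassified phishing email."""
    return 1 - (lam * n_ll + n_pp) / (lam * n_l + n_p)

# Hypothetical confusion counts, for illustration only:
p, r = precision(4100, 10), recall(4100, 16)
print(f1(p, r), weighted_error(4140, 4100, 4150, 4116, lam=9))
```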


3 Features Set

In [26], the authors collected 40 features, most of which were previously used in other literature such as [6, 8, 9]. The vast majority of the features used in our study are from [26], in addition to 2 external features from [6] and 5 additional features derived from those in [26]. This results in a total of 47 features that are extracted by analyzing data parts C and D as depicted in Figure 1. A common approach to extracting features found in data parts A and B is the use of blacklists, which were intentionally avoided as blacklists perform poorly against zero-day attacks.

[Figure 1: An overview of email data parts, labeled by data part ID. The diagram decomposes a message into: (A) the TCP/IP header (e.g. IP source 1.1.1.1, IP destination 2.2.2.2, TCP source port 55555, TCP destination port 25); (B) the SMTP envelope of the SMTP mail object, i.e. the TCP payload (EHLO mx.example.com, MAIL FROM, RCPT TO, DATA); and, within the SMTP DATA, (C) the RFC822 header (From, To, Subject: "Account Expired!") and (D) the RFC822 body (a "Dear Client" message asking the reader to log in at http://PhishSite.com/activate.php).]

The 47 features fall into the following groups:

• Email Body Features: a total of 11 features were extracted from the body part of email messages, found in data part D.

• Email Header Features: a total of 11 features were extracted from RFC822 email header fields, found in data part C (e.g. sender, reply-to, subject).

• URL Features: a total of 18 features were extracted from URLs and anchors found in data part D.

• JavaScript Features: a total of 5 JavaScript features were extracted from email bodies, found in data part D.

• External Features: similar to [6], two external features were taken from SpamAssassin [1] (SA), which in turn performs analysis on parts C and D. Similar to [6], spamc was run against emails with all network tests disabled at the spamd process. In other words, we used the "heuristic" part of SA, as all network tests and blacklists were disabled. By default, SA runs tests on emails and, if the accumulated test scores exceed a certain threshold (5 by default), the scored email is predicted to be spam.

This study uses the same phishing and legitimate email sources used in [26, 6, 3], and 47 features as described in previous literature. However, different parsing techniques can affect the performance of individual features. The aim of this study is not to evaluate the features themselves, but rather is limited to evaluating the various features subset selection techniques. Thus, simple parsing methods (development wise) were used to simplify and speed up the development process of the feature extractors.

Table 2 presents a list of the individual features.
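As a sketch of how data parts C and D can be isolated in practice (our own illustrative code, using Python's standard email package rather than the Perl extractors used in this study):

```python
import email
from email import policy

def split_parts(raw_message: bytes):
    """Split an RFC822 message into data part C (header fields) and
    data part D (the body), following the decomposition of Figure 1."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    part_c = {"from": msg["From"], "to": msg["To"],
              "subject": msg["Subject"], "reply-to": msg["Reply-To"]}
    # Prefer the HTML alternative when present, since several features
    # (forms, anchors, JavaScript) exist only in HTML bodies.
    body = msg.get_body(preferencelist=("html", "plain"))
    part_d = body.get_content() if body is not None else ""
    return part_c, part_d
```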

Table 2: Phishing email classification features.

Identifier | Type | Description
bodydearword | binary | Tests if the "dear" word is found.
bodyhtml | binary | Tests if a HTML MIME part is found [8, 26, 25].
bodyform | binary | Tests if a HTML form is found [6].
bodymultipart | binary | Tests if "Content-Type: multipart/alternative" is found.
bodynumchars | continuous | Total number of body characters [7].
bodynumwords | continuous | Total number of body words [7].
bodynumuniqwords | continuous | Total number of unique body words [7].
bodyrichness | continuous | Total number of body words divided by total number of body characters [7].
bodynumfunctionwords | continuous | Total number of function words, such as: "account", "access" and "bank".
bodysuspensionword | continuous | Tests if the "suspension" word is found [7].
bodyverifyyouraccountphrase | binary | Tests if the "verify your account" phrase is found [23, 6].
subjectbankword | binary | Tests if the "bank" word is found in the subject [23].
subjectdebitword | binary | Tests if the "debit" word is found in the subject [23].
subjectfwdword | binary | Tests if the "Fwd:" word is found in the subject [26].
subjectreplyword | binary | Tests if the "Re:" word is found in the subject [26].
subjectverifyword | binary | Tests if the "verify" word is found in the subject [23].
subjectnumchars | continuous | Total number of subject characters [26].
subjectnumwords | continuous | Total number of subject words [26].
subjectrichness | continuous | Total number of subject words divided by total number of subject characters.
sendnumwords | continuous | Total number of words in the "sender" field [26].
senddiffreplyto | continuous | Tests the difference between the sender and reply-to email addresses [26].
sendunmodaldomain | binary | Tests if the sender's domain name is the modal domain name [26].
urlatchar | binary | Tests if a URL has the "@" sign [9].
urlbaglink | binary | Tests if a URL has the "click", "here", "login", and "update" words [6].
urlip | binary | Tests if a URL has an IP address [6, 8, 25].
urlnumdomains | continuous | Total number of domains found in URLs [8].
urlnumexternallink | continuous | Total number of external links [6].
urlnuminternallink | continuous | Total number of internal links [6].
urlnumimagelink | continuous | Total number of image links [6].
urlnumip | continuous | Total number of URLs that contain an IP address [6].
urlnumlink | continuous | Total number of anchors [6, 8, 25].
urlnumperiods | continuous | Total number of periods "." in a URL [6, 8, 25].
urlnumport | continuous | Total number of URLs with port numbers [26].
urlport | binary | Tests if a URL contains a port number [23].
urltwodomains | binary | Heuristically tests if a URL has two or more domains [12].
urlunmodalbaglink | binary | Tests if a link with an un-modal domain is found that has the keywords "click", "here", and "link" [6].
urlwordclicklink | binary | Tests if the anchor text contains the word "click".
urlwordherelink | binary | Tests if the anchor text contains the word "here".
urlwordloginlink | binary | Tests if the anchor text contains the word "login".
urlwordupdatelink | binary | Tests if the anchor text contains the word "update".
scriptjavascript | binary | Tests if the email has JavaScript [8, 6].
scriptonclick | binary | Tests if the email has an "onClick" JavaScript event.
scriptpopup | binary | Tests if the email attempts to open pop-up windows via JavaScript code.
scriptstatuschange | binary | Tests if the email attempts to modify the status bar via JavaScript code [9].
scriptunmodalload | binary | Tests if the email has JavaScript code that is loaded from a non-modal domain.
externalsabinary | binary | Tests the classification label from SpamAssassin.
externalsascore | continuous | Extracts the score output from SpamAssassin.
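To make the feature definitions concrete, here is a minimal sketch of three of the extractors in Table 2 (our own simplified re-implementation; the study's actual parsers may tokenize differently):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def bodyrichness(body: str) -> float:
    """bodyrichness: number of body words divided by number of characters."""
    return len(body.split()) / len(body) if body else 0.0

def urlatchar(body: str) -> int:
    """urlatchar: 1 if any URL in the body contains an '@' sign."""
    return int(any("@" in u for u in URL_RE.findall(body)))

def bodyverifyyouraccountphrase(body: str) -> int:
    """bodyverifyyouraccountphrase: 1 if the phrase occurs in the body."""
    return int("verify your account" in body.lower())
```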

4 Feature Subset Searching Methods

The search for an ideal features subset is generally composed of two components, namely: a features subset evaluator and a features subset searching method [29]. For example, a searching technique is used to select a features subset, and then an evaluation criterion is used to assess the effectiveness of the chosen features subset. This section describes features subset searching methods. Subset evaluators are described later in Section 5.

4.1 Exhaustive Search

Theoretically, the exhaustive search guarantees exploring all possible feature subsets within a features space. Depending on the evaluation criterion (Section 5), an ideal subset (i.e. a global maximum) can be guaranteed to be found. However, the challenge is that an exhaustive search can take a very long time when dealing with a relatively large quantity of candidate features. Equation 6 presents the number of subsets that would have to be evaluated by the exhaustive search.

Tests = 2^{N_f} − 1   (6)

where N_f is the total number of candidate features (e.g. 47 in this study). This means that if each test takes 1 second to complete and the total number of candidate features is 47, then millions of years are required to complete the exhaustive search of the space of feature subsets. Obviously, an exhaustive search is impractical with the 47 features set. Thus, we exclude the exhaustive search from this study.

4.2 Ranker

The Ranker simply returns an ordered (ranked) list of features according to a criterion, such as Information Gain. Theoretically, combining features that are individually good does not guarantee that they will increase the prediction accuracy. For example, in a 3-feature space, IG may rank the features F1 and F2 as the best, and F3 as the worst, because F1 and F2 can achieve a lower entropy value in dataset S than F3. However, if F1 and F2 highly overlap with each other, then combining F1 with F2 may not help much in increasing the prediction accuracy. On the other hand, if F3 overlaps less with F1, then combining F3 with F1 may actually result in a better prediction accuracy than combining F1 with F2. Although the above example holds true in theory, the question remains whether such a scenario is commonly seen in real-life datasets. If real-life features and datasets do not often exhibit such scenarios, then the independence assumption can be efficient.

4.3 Best-first

During this research, Weka's implementation of Best-first, which performs greedy hill climbing in multiple directions with backtracking [29], was used. It can start from an empty set of features and move forward (Fwd) by evaluating features and adding the best one (according to a subset evaluator), and then move further forward to find the next best feature that, when added, increases the performance of the subset. The process is repeated until no further change leads to any improvement (or it reduces performance). Alternatively, the process can begin from a full set of features and then move backward (Bwd); that is, removing features if their removal is not associated with any performance loss. It is also possible to move both forward and backward (Bi-dir), in which case the implementation considers the addition or deletion of features at any given time.

The implementation that was used in this study also performs backtracking. This enables the algorithm to move forward for a number of steps (e.g. feature additions or deletions) even if no improvements are seen. The motive behind this is to handle scenarios where the subset space searching algorithm faces a plateau of a certain depth [29], which might in turn increase the probability of finding an ideal features subset (global maximum). It is important to note that Best-first does not guarantee finding the global maximum features subset, but rather guarantees finding a local maximum (which need not be the ideal features subset). However, the use of backtracking (which is set to 5 steps in Weka by default [29]) aims at reducing the possibility of being stuck in a plateau or a local maximum [29].
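To illustrate how a searching method and a subset evaluator interact, the following sketch (our own simplification, not Weka's implementation; it omits Weka's 5-step backtracking) pairs a greedy Best-first: Forward search with a wrapper-style evaluator built on scikit-learn cross-validation, where `clf` stands for any classifier (e.g. a Random Forest):

```python
from sklearn.model_selection import cross_val_score

def wrapper_score(clf, X, y, subset):
    """Wrapper-style evaluator: mean cross-validated accuracy of the
    classifier trained on the chosen feature columns (see Section 5.3)."""
    return cross_val_score(clf, X[:, sorted(subset)], y, cv=5).mean()

def best_first_forward(clf, X, y):
    """Greedy Best-first: Forward search without backtracking. For
    comparison, an exhaustive search would need 2**n - 1 subset
    evaluations (Equation 6), i.e. about 1.4e14 subsets for n = 47."""
    n_features = X.shape[1]
    chosen, best = set(), 0.0
    while len(chosen) < n_features:
        scores = [(wrapper_score(clf, X, y, chosen | {f}), f)
                  for f in range(n_features) if f not in chosen]
        score, f = max(scores)
        if score <= best:          # no single addition improves the subset
            break
        chosen.add(f)
        best = score
    return sorted(chosen), best
```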

5 Feature Subset Evaluators

Feature subset evaluators are the criteria used to measure the feature subsets explored within a given features space by the searching methods discussed in Section 4. Two main categories of evaluators exist, namely: wrappers and filters. Filters evaluate feature subsets by heuristically studying given datasets and features, while wrappers evaluate feature subsets by wrapping around a machine-learning algorithm (e.g. RF, SVM, or C4.5) and using it as an evaluator subroutine [29].

5.1 Information Gain

Information Gain IG(F|S) is the difference between the entropy H(S) of a dataset S and the conditional entropy H(S|F) of the dataset after a split is made using a given feature F. Entropy is a method of measuring impurity in a dataset and is at its maximum if a dataset has an equal number of instances for each class [5]. For example, in a binary classification problem, the entropy is equal to 1 (maximum) if 50% of the instances belong to class "phishing" and the remaining 50% belong to class "legitimate", and the entropy is 0 (minimum) if either one of the classes makes up 100% of the dataset. Equation 7 describes the entropy H(S) of dataset S.

H(S) = −Σ_{i=1}^{K} p_i log₂ p_i   (7)

where 0 log₂ 0 ≡ 0, K is the number of classes (e.g. 2 classes in our case: phishing and legitimate), and p_i is the estimated probability of a given class i as described in Equation 8.

p_i = n_i / N   (8)

where n_i is the total number of instances that fall under class i, and N is the total number of all instances in the dataset. Equation 9 describes the conditional entropy H(S|F) of dataset S after a split by feature F.

H(S|F) = −Σ_{j=1}^{J} (N_j / N) Σ_{i=1}^{K} p_{ij} log₂ p_{ij}   (9)

where J is the total number of branches made after the split caused by applying feature F on dataset S, and p_{ij} is the probability estimate for class i in branch j. In other words, H(S|F) is the weighted sum of the entropies H(S_j) of each branch j after the split. Finally, the Information Gain IG(F|S) of feature F in dataset S is calculated as described in Equation 10.

IG(F|S) = H(S) − H(S|F)   (10)

A feature F ranks best if it has the largest IG(F|S) among all features, which is achieved if F causes a higher reduction of entropy in the dataset than the other features. IG evaluates features individually, and thus the appropriate features subset searching method in this case is the Ranker. The advantage of IG is its simplicity; its disadvantage is that combining individually good features into a subset does not necessarily mean that the subset's members are good "team-playing" features. One of the questions this evaluation attempts to answer is how likely such scenarios are in real-life datasets, where highly predictive individual features are not good team-playing features.
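Equations 7 to 10 transcribe directly into Python (our own illustrative code), computing IG for one feature over a labeled dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Equation 7: H(S) = -sum_i p_i * log2(p_i), with 0*log2(0) = 0."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Equation 10: IG(F|S) = H(S) - H(S|F). H(S|F) is the weighted sum
    of the entropies of the branches induced by the feature (Equation 9)."""
    n = len(labels)
    branches = {}
    for v, label in zip(feature_values, labels):
        branches.setdefault(v, []).append(label)
    h_cond = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - h_cond

# A perfectly predictive binary feature on a balanced dataset: IG = 1.0
print(info_gain([0, 0, 1, 1], ["ham", "ham", "phish", "phish"]))
```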

5.2 Relief

The Relief algorithm [14] weights features individually based on how well a feature distinguishes nearest instances of different classes. The idea behind the Relief algorithm is that effective features should have identical or similar values in instances of the same class, and different values in instances of different classes.

To achieve the above objective, the Relief algorithm randomly samples an instance x from dataset S and, based on the |F|-dimensional Euclidean distance, finds the two nearest neighboring instances x_{n1} and x_{n2}, where |F| is the total number of features in an instance, x_{n1} is an instance that belongs to the same class as x, and x_{n2} is an instance that belongs to a different class than x. Once the instances x, x_{n1} and x_{n2} are determined, the value of each feature f in x is compared against those in x_{n1} and x_{n2}. If the value of f in x differs from that of x_{n1}, then the quality weight of f is reduced, as the feature f is distinguishing instances of the same class. However, if the value of f in x differs from that of x_{n2}, then the quality weight of feature f is raised, as the feature is able to separate instances of different classes.

Similar to IG, the Relief algorithm evaluates features individually, and thus should be used with the Ranker to generate a list of features ordered by their quality.
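A minimal sketch of the Relief update described above (our own simplification of [14]; it assumes numeric features scaled to [0, 1] and uses a single nearest hit and nearest miss per sampled instance):

```python
import numpy as np

def relief_weights(X, y, n_samples=100, seed=0):
    """Relief sketch: reward features that differ on the nearest miss
    (different class) and penalize features that differ on the nearest
    hit (same class)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.sqrt(((X - X[i]) ** 2).sum(axis=1))  # |F|-dim Euclid distance
        dist[i] = np.inf                 # exclude the sampled instance itself
        hit = np.where(y == y[i], dist, np.inf).argmin()   # nearest same class
        miss = np.where(y != y[i], dist, np.inf).argmin()  # nearest other class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_samples
```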

5.3 Wrapper

The Wrapper [15], by itself, does not evaluate feature subsets. Instead, it runs an arbitrary (according to the setup) classifier and uses it as the evaluation criterion. The Wrapper simply trains a learning scheme on a given features subset (as explored by a searching method, such as Best-first) and then evaluates the constructed classification model using a training set. The Wrapper then suggests the features subset that resulted in the best overall accuracy. Since the Wrapper evaluates feature subsets as a whole, it may address the problem faced by evaluators that evaluate features individually, such as IG.

The advantage of the Wrapper is that it guarantees finding the best features subset (the global maximum) for a given classification algorithm if it is combined with an exhaustive features subset searching method. However, using an exhaustive search is impractical in many scenarios. The disadvantage of the Wrapper is that it can be relatively slower than the heuristic features subset evaluators (i.e. filters). Moreover, the chosen features subset might not be good enough for classifiers other than the wrapped classifier, since the features subset was chosen for a particular classifier. In this evaluation, we used Weka's implementation of the Wrapper, which is set by default to evaluate feature subsets by 5-fold cross-validation [27].

5.4 Inconsistency Feature Subset Evaluator

The inconsistency features subset evaluator (ConEval) tests feature subsets against an "inconsistency" criterion to evaluate the effectiveness of each feature subset as a whole (i.e. not as individual features). The inconsistency criterion states that if a candidate features subset F_C has a smaller cardinality than the previous best features subset F_B, and an inconsistency rate lower than γ or lower than the rate of the previous best features subset, then F_C is considered the new best features subset, where γ is a predefined minimum acceptable inconsistency rate. The implementation used in this study sets γ to the inconsistency rate of the full features set.

The inconsistency rate of a features subset is the sum of the individual inconsistencies of the features in the subset, divided by the total number of instances. The inconsistency of each individual feature is the quantity of matching instances minus the quantity of instances of the dominant class [18]. For example, if a feature F_i matches N instances, with n_1 of the N instances labeled as class 1 and n_2 labeled as class 2, where N = n_1 + n_2, and if n_2 > n_1, then the inconsistency of the feature F_i in dataset S is as described in Equation 11.

I_c(F_i|S) = N − n_2   (11)

Thus, the inconsistency rate of a features subset F_I is as described in Equation 12.

I_r = (Σ_{i=1}^{m} I_c(F_i|S)) / N   (12)

where m is the number of elements in the set F_I = {F_1, ..., F_m}.
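Equations 11 and 12 in code form (our own reading of [18], in which instances "match" when they share the same value pattern over the subset's features):

```python
from collections import Counter, defaultdict

def inconsistency_rate(instances, labels, subset):
    """Equations 11-12: group instances by their value pattern over the
    subset; within each group, instances outside the dominant class are
    inconsistent. The rate is total inconsistencies over all instances."""
    groups = defaultdict(list)
    for inst, label in zip(instances, labels):
        groups[tuple(inst[f] for f in subset)].append(label)
    inconsistent = sum(len(ys) - max(Counter(ys).values())
                       for ys in groups.values())
    return inconsistent / len(instances)

# Two features, four instances: the pattern (1, 0) appears with both
# classes, so one of its two instances is inconsistent -> rate 0.25.
X = [(1, 0), (1, 0), (0, 1), (0, 0)]
print(inconsistency_rate(X, ["phish", "ham", "ham", "ham"], subset=[0, 1]))
```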

5.5 Correlation-Based Feature Subset Evaluator

The correlation-based feature subset evaluator (CorEval) measures the amount of correlation between features and their classes, as well as the correlation between the features themselves. The general idea behind CorEval is that good feature subsets are highly correlated with the classes, but should not correlate with each other. Correlation between features and classes indicates how accurately a feature can predict a class, while correlation between the features themselves indicates that they are redundant or highly overlapping [10].

6 Evaluation Approach

The evaluation begins by extracting all of the 47 features described in Section 3 from the email messages described in Section 6.1. In other words, the classifier sees each email as a 47-dimensional vector, where each element is the returned value of one of the 47 features. A number of feature subsets are then constructed according to the subset evaluators and subset searching methods presented in Table 3, which lists all of the combinations of feature selection methods evaluated in this study. Since IG and Relief evaluate features individually, only the Ranker is an applicable searching method for them. The exhaustive search is also excluded, as it is impractical with a 47-features space.

Table 3: Evaluated feature subset searching and subset evaluation methods.
Subset Evaluator | Ranker | Best-first: Fwd. | Best-first: Bwd. | Best-first: Bi-dir.
IG      | yes | -   | -   | -
Relief  | yes | -   | -   | -
ConEval | -   | yes | yes | yes
CorEval | -   | yes | yes | yes
Wrapper | -   | yes | yes | yes

Then a number of classification models are constructed for each of the feature subsets, in addition to the full features set. This resulted in a total of 9 classification models for each learning algorithm (i.e. RF, C4.5 and SVM), which is due to the fact that some of the 11 feature subset selection methods returned feature subsets identical to those of other subset selection methods. Each model is then evaluated via 10-fold cross-validation.

Since the Ranker searching method does not return a specific features subset (but merely an ordered list according to a criterion), we considered only the top-n IG and Relief ranked features that achieved the highest f1 score values. The number of considered IG and Relief ranked features varied depending on the learning algorithm used, namely: the best performing top-n IG ranked features are 45 features for RF, 6 for C4.5, and 21 for SVM; the Relief algorithm's best performing top-n ranked features are 44 for RF, 23 for C4.5 and 35 for SVM. Further details are presented in Section 6.3.

Based on the 10-fold cross-validation of the constructed classification models, the f1 score and WErr rates are calculated. The f1 score is calculated as described in Equation 3, and the weighted error rates WErr with multiple weights λ are calculated as described in Equation 4. Similar to Abu-Nimeh et al. [3], the λ values used in this study are λ = 1 and λ = 9. When λ = 1, WErr returns an error rate with the assumption that legitimate emails and phishing emails are equally important. However, when λ = 9, WErr returns an error rate that penalizes the misclassification of legitimate emails 9 times more than the misclassification of phishing emails, and rewards the correct classification of legitimate emails 9 times more than the correct classification of phishing emails.

This study uses Weka's implementation of the Random Forests (RF), C4.5 and Support Vector Machines (SVM) classification algorithms [27]. The reason behind extending this evaluation by incorporating multiple learning algorithms is to minimize the possibility that the effectiveness of a feature subset searching and subset evaluation method is overestimated (or underestimated) due to a learner's ability to handle noisy feature sets. Since the main focus of this study is the evaluation of feature subset selection methods, detailed comparisons between the classifiers are not presented and are considered somewhat irrelevant, especially as such a comparison is addressed by Abu-Nimeh et al. [3]. However, the source files of this study are made publicly available, and the evaluation results can be reproduced by the tools introduced in Section 6.3.
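As an analogue of this evaluation pipeline (the study itself uses Weka; this hedged sketch uses scikit-learn with a Random Forest, and assumes X holds the 47-dimensional feature vectors and y holds 1 for phishing, 0 for legitimate):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def evaluate(X, y, lam=9):
    """10-fold cross-validation followed by WErr (Equation 4)."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    pred = cross_val_predict(clf, X, y, cv=folds)
    n_ll = np.sum((y == 0) & (pred == 0))   # correctly classified ham
    n_pp = np.sum((y == 1) & (pred == 1))   # correctly classified phishing
    return 1 - (lam * n_ll + n_pp) / (lam * np.sum(y == 0) + np.sum(y == 1))
```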

6.1 Datasets

Publicly available sources of phishing and legitimate emails were used, namely [19] and [24] respectively. The same email sources were also used in a number of recent studies [26, 6, 3], and their use makes comparisons with previous works more meaningful, which justifies why this study also uses them.

The phishing dataset is composed of 4,116 emails from the following files as distributed by [19]:

• phishing0.mbox
• phishing2.mbox
• phishing3.mbox

The legitimate dataset is composed of 4,150 emails from the following files as distributed by [24]:

• 20030228_easy_ham.tar.bz2
• 20030228_hard_ham.tar.bz2
• 20030228_easy_ham_2.tar.bz2

This assembles a dataset of phishing and legitimate emails similar to "Dataset 2" in [26]. A summary is presented in Table 4.

Table 4: Phishing and ham data set summary.
Class    | Emails | Time span
Phishing | 4,116  | Nov 27, 2004 – Aug 7, 2007
Ham      | 4,150  | Jan 28, 2001 – Sep 9, 2002
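A sketch of assembling such a dataset (ours, not the study's Perl tooling; it assumes the phishing corpus ships as mbox files and that the SpamAssassin archives have been unpacked into directories holding one message per file):

```python
import mailbox
import pathlib

def load_dataset(phishing_mboxes, ham_dirs):
    """Return (raw_message_bytes, label) pairs; label 1 = phishing, 0 = ham."""
    data = []
    for path in phishing_mboxes:              # e.g. ["phishing0.mbox", ...]
        for msg in mailbox.mbox(path):
            data.append((msg.as_bytes(), 1))
    for directory in ham_dirs:                # e.g. unpacked ham directories
        for f in pathlib.Path(directory).iterdir():
            data.append((f.read_bytes(), 0))
    return data
```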

6.2 Performance Evaluation

This evaluation aims to find an effective features subset by studying the feature selection techniques presented in Table 3. We ran these feature selection techniques against the dataset described in Section 6.1 to select feature subsets from the full set of 47 features described in Section 3.

Table 5 presents overall statistics of the feature subset evaluators and their respective subset searching methods. ConEval and CorEval returned smaller feature subsets than the Wrapper in all searching directions of Best-first. The Wrapper explored a similar number of feature subsets to ConEval and CorEval; however, it needed much more time^3 than ConEval and CorEval to return a features subset in any searching direction, which is due to the fact that the Wrapper trained and evaluated another classifier (i.e. RF, C4.5 or SVM). On the other hand, CorEval returned feature subsets faster than any of the other subset evaluators in every Best-first searching direction.

The feature subsets returned by the evaluated feature selection techniques for IG, Relief, CorEval, ConEval, Wrapper with RF, Wrapper with C4.5 and Wrapper with SVM are presented in Tables 6, 7, 8, 9, 10, 11 and 12 respectively.

Table 6 shows that the bodyform feature (a binary feature that is set to 1 if the body has a HTML form) is one of the worst features; it is also never chosen by the other feature selection techniques, except the Wrapper with the Best-first: Backward searching method. This confirms one of the outcomes of Toolan et al. [26], where bodyform was evaluated as the worst feature. However, according to our implementation, IG ranked both of the richness features lower than their rank in [26], which can be due to parsing differences between the implementation in this study and the implementation in [26].

^3 The time was measured based on Weka's implementation via the time(1) POSIX command on a 2.40GHz dual-core machine with 4GB of RAM.

Table 5: Features subset selection statistics. Seconds, minutes and hours are s, m and h respectively.

Subset Evaluator | Best-First Direction | Number of Features | Evaluated Subsets | Time
ConEval      | Fwd.   | 9  | 572   | 6.332s
ConEval      | Bwd.   | 10 | 1,884 | 23.069s
ConEval      | Bi-dir | 9  | 658   | 7.352s
CorEval      | Fwd.   | 12 | 337   | 0.987s
CorEval      | Bwd.   | 12 | 1,131 | 1.059s
CorEval      | Bi-dir | 12 | 376   | 0.931s
Wrapper RF   | Fwd.   | 21 | 940   | 1h31m9s
Wrapper RF   | Bwd.   | 44 | 452   | 1h22m41s
Wrapper RF   | Bi-dir | 19 | 1,175 | 1h43m38s
Wrapper C4.5 | Fwd.   | 10 | 606   | 13m40s
Wrapper C4.5 | Bwd.   | 23 | 1,445 | 3h6m27s
Wrapper C4.5 | Bi-dir | 9  | 752   | 21m58s
Wrapper SVM  | Fwd.   | 15 | 757   | 5h22m31s
Wrapper SVM  | Bwd.   | 32 | 802   | 8h40m25s
Wrapper SVM  | Bi-dir | 12 | 940   | 7h31m41s

Table 6: Individual features evaluated according to the IG criterion and sorted by the Ranker.

IG       | Feature name
0.897882 | externalsascore
0.812893 | externalsabinary
0.726431 | urlnumlink
0.692546 | bodyhtml
0.637373 | urlnumperiods
0.456332 | sendnumwords
0.43007  | urlnumexternallink
0.366045 | bodynumfunctionwords
0.296211 | subjectreplyword
0.295241 | bodydearword
0.269922 | sendunmodaldomain
0.232089 | bodynumchars
0.231512 | bodynumwords
0.183305 | bodymultipart
0.178693 | subjectrichness
0.167949 | urlnumip
0.163674 | urlnumimagelink
0.148997 | bodynumuniqwords
0.130622 | urlip
0.115132 | urlbaglink
0.07938  | subjectbankword
0.068562 | bodyrichness
0.062817 | subjectnumchars
0.061736 | urlwordloginlink
0.054552 | urlwordherelink
0.054263 | urltwodomains
0.043313 | senddiffreplyto
0.042734 | urlwordclicklink
0.0343   | scriptonclick
0.032966 | bodysuspensionword
0.029682 | urlport
0.029682 | urlnumport
0.028526 | urlnumdomains
0.026164 | bodyverifyyouraccountphrase
0.020787 | subjectnumwords
0.015132 | scriptstatuschange
0.015008 | subjectverifyword
0.010354 | urlunmodalbaglink
0.00671  | urlwordupdatelink
0.005574 | scriptjavascript
0.005219 | urlatchar
0.004408 | scriptpopup
0.002967 | urlnuminternallink
0.002194 | subjectdebitword
0.000794 | bodyform
0        | subjectfwdword
0        | scriptunmodalload

Table 7: Individual features evaluated according to the Relief criterion and sorted by the Ranker.

Weight   | Feature name
0.305432 | externalsabinary
0.23841  | sendunmodaldomain
0.225236 | senddiffreplyto
0.177072 | sendnumwords
0.164711 | bodyhtml
0.134842 | urltwodomains
0.131051 | externalsascore
0.130426 | bodydearword
0.121268 | subjectreplyword
0.110138 | urlip
0.073518 | bodymultipart
0.060126 | subjectbankword
0.05098  | urlwordloginlink
0.04288  | subjectrichness
0.040873 | subjectnumwords
0.038515 | urlnumperiods
0.033451 | subjectnumchars
0.028061 | bodynumwords
0.02766  | bodynumuniqwords
0.027622 | urlnumlink
0.02751  | bodyrichness
0.026866 | bodynumchars
0.023699 | bodysuspensionword
0.023675 | scriptonclick
0.023324 | urlport
0.021785 | urlnumexternallink
0.018437 | bodyverifyyouraccountphrase
0.017167 | urlwordherelink
0.017118 | urlwordclicklink
0.014743 | urlnumimagelink
0.014396 | bodyform
0.014275 | scriptjavascript
0.01009  | urlunmodalbaglink
0.009738 | urlnumip
0.009025 | scriptstatuschange
0.007747 | urlbaglink
0.00698  | urlwordupdatelink
0.006738 | subjectverifyword
0.006558 | bodynumfunctionwords
0.00542  | urlatchar
0.002432 | scriptunmodalload
0.001997 | urlnuminternallink
0.001863 | scriptpopup
0.001634 | urlnumport
0.000801 | urlnumdomains
0.000544 | subjectdebitword
0.00023  | subjectfwdword

Table 8: Features subset as evaluated by the CorEval criterion and the various Best-first searching methods. All three directions (Fwd., Bwd. and Bi-dir) selected the same 12 features:
bodydearword, bodyhtml, bodysuspensionword, externalsabinary, scriptonclick, scriptstatuschange, sendnumwords, subjectbankword, subjectrichness, subjectverifyword, urlnumip, urlnumlink.

Table 9: Features subset as evaluated by the ConEval criterion and the various Best-first searching methods (Fwd.: 9 features, Bwd.: 10, Bi-dir: 9). The union of the selected features is:
bodynumchars, bodynumfunctionwords, bodyrichness, externalsascore, senddiffreplyto, sendnumwords, subjectnumchars, subjectreplyword, subjectrichness, urlnumdomains, urlnumip, urlnumlink, urlnumperiods, urltwodomains.

Table 10: Features subset as evaluated by the Wrapper with RF as its criterion (Best-first Fwd.: 21 features, Bwd.: 44, Bi-dir.: 19). The union of the selected features is:
bodydearword, bodyform, bodyhtml, bodymultipart, bodynumchars, bodynumfunctionwords, bodynumuniqwords, bodynumwords, bodyrichness, bodysuspensionword, bodyverifyyouraccountphrase, externalsabinary, externalsascore, scriptonclick, scriptstatuschange, scriptunmodalload, senddiffreplyto, sendnumwords, sendunmodaldomain, subjectbankword, subjectdebitword, subjectfwdword, subjectnumchars, subjectnumwords, subjectreplyword, subjectrichness, subjectverifyword, urlatchar, urlbaglink, urlip, urlnumdomains, urlnumexternallink, urlnumimagelink, urlnuminternallink, urlnumip, urlnumlink, urlnumperiods, urlnumport, urlport, urltwodomains, urlunmodalbaglink, urlwordclicklink, urlwordherelink, urlwordloginlink, urlwordupdatelink.

Table 11: Features subset as evaluated by the Wrapper with C4.5 as its criterion (Best-first Fwd.: 10 features, Bwd.: 23, Bi-dir.: 9). The union of the selected features is:
bodydearword, bodyhtml, bodymultipart, bodynumchars, bodynumfunctionwords, bodynumwords, bodysuspensionword, externalsascore, scriptjavascript, scriptonclick, senddiffreplyto, sendnumwords, sendunmodaldomain, subjectnumchars, subjectnumwords, subjectreplyword, subjectrichness, urlbaglink, urlnumdomains, urlnumexternallink, urlnumimagelink, urlnuminternallink, urlnumip, urlnumlink, urlnumperiods, urltwodomains, urlwordherelink, urlwordloginlink, urlwordupdatelink.

Table 12: Features subset as evaluated by the Wrapper with SVM as its criterion (Best-first Fwd.: 15 features, Bwd.: 32, Bi-dir.: 12). The union of the selected features is:
bodyhtml, bodymultipart, bodynumchars, bodynumfunctionwords, bodynumuniqwords, bodynumwords, bodyrichness, bodysuspensionword, bodyverifyyouraccountphrase, externalsabinary, externalsascore, scriptjavascript, scriptonclick, scriptpopup, scriptstatuschange, scriptunmodalload, senddiffreplyto, sendnumwords, sendunmodaldomain, subjectbankword, subjectdebitword, subjectfwdword, subjectnumchars, subjectnumwords, subjectreplyword, subjectrichness, urlatchar, urlbaglink, urlnumexternallink, urlnumimagelink, urlnuminternallink, urlnumip, urlnumlink, urlnumperiods, urlnumport, urlport, urltwodomains, urlwordloginlink, urlwordupdatelink.

According to Tables 8, 9, 10, 11 and 12, feature subsets were constructed by ConEval, CorEval, the Wrapper and the varying searching methods. However, IG with the Ranker searching method merely returned a ranked list of features (i.e. not feature subsets). In order to compare IG and Relief against the other feature subset selection methods in a fair manner, we selected the top-n ranked features according to the IG and Relief criteria that achieved the highest f1 score. In other words, we evaluated classification models with the top-n IG and Relief ranked features independently, for n = 1...46 (top-47 is excluded as it is the full features set). More details about the individual top-n IG and Relief ranked sets are presented in Section 6.3.

This means that 9 classification models were constructed per learner (i.e. RF, C4.5 and SVM), namely: 2 models for ConEval, 1 model for CorEval, 3 models for the Wrapper, 1 model for IG, 1 model for Relief and 1 model for all features. Tables 13, 14 and 15 present the weighted errors WErr and f1 scores for RF, C4.5 and SVM respectively.

By analyzing the evaluation results in Tables 13, 14 and 15, none of the feature subset evaluators and subset searching methods achieves the lowest WErr rates for both λ = 1 and 9 simultaneously. For example, the feature subset suggested by the Wrapper with RF as its evaluator and the Best-first: Forward search method achieved the lowest WErr when λ = 1. When RF and C4.5 were chosen as the classification algorithms, the top-45 and top-6 IG ranked features respectively achieved the lowest WErr when λ = 9. However, when SVM was the learning algorithm, the features subset selected by the Wrapper achieved the lowest WErr when λ = 9. On the other hand, IG, ConEval and CorEval never achieved the highest f1 score nor the lowest WErr for λ = 1 in any evaluation.

With respect to the used dataset and the RF classifier, Table 13 shows that ConEval's performance, with all of the Best-first searching directions, resulted in more efficient classification models than CorEval's. However, CorEval's performance in all Best-first searching directions exceeded ConEval's when the C4.5 and SVM learners were used. Relief consistently achieved higher WErr rates than IG for all λ values; however, it did achieve lower WErr rates than ConEval and CorEval when λ = 1 and the RF and C4.5 learners were used.

Interestingly, the Wrapper with Best-first: Forward is the only features subset selection method that consistently selected a features subset with the highest f1 score and the lowest WErr for λ = 1 among all the feature subset selection methods, with all three learning algorithms.

Most of the previous studies in the phishing domain have focused on enhancing the prediction accuracy of classifiers by enhancing classification algorithms (e.g. ensembles [25]) and proposing novel features (e.g. model-based features [6]). This study shows that highly accurate phishing email classifiers can also be constructed by using existing classification algorithms and previously proposed phishing features, when a good combination of features, or a subset of features, is found.

Table 13: WErr and f1 score for the 9 RF classification models.

Feature Evaluator | Searching Method | Features | WErr (λ=1) | WErr (λ=9) | f1 score
ConEval      | Best-first: Fwd.              | 9  | 0.865% | 0.711% | 99.130%
ConEval      | Best-first: Bwd.              | 10 | 1.063% | 0.881% | 98.930%
ConEval      | Best-first: Bi-dir.           | 9  | 0.865% | 0.711% | 99.130%
CorEval      | Best-first: Fwd./Bwd./Bi-dir. | 12 | 1.085% | 0.921% | 98.908%
Wrapper RF   | Best-first: Fwd.              | 21 | 0.601% | 0.596% | 99.396%
Wrapper RF   | Best-first: Bwd.              | 44 | 0.686% | 0.621% | 99.311%
Wrapper RF   | Best-first: Bi-dir.           | 19 | 0.927% | 0.879% | 99.069%
IG           | Ranker                        | 45 | 0.617% | 0.546% | 99.380%
Relief       | Ranker                        | 44 | 0.641% | 0.575% | 99.356%
All features | -                             | 47 | 0.676% | 0.615% | 99.320%

Table 14: WErr and f1 score for the 9 C4.5 classification models.

Feature Evaluator | Searching Method | Features | WErr (λ=1) | WErr (λ=9) | f1 score
ConEval      | Best-first: Fwd.              | 9  | 1.100% | 0.977% | 98.894%
ConEval      | Best-first: Bwd.              | 10 | 1.240% | 1.156% | 98.754%
ConEval      | Best-first: Bi-dir.           | 9  | 1.100% | 0.977% | 98.894%
CorEval      | Best-first: Fwd./Bwd./Bi-dir. | 12 | 1.039% | 0.790% | 98.953%
Wrapper C4.5 | Best-first: Fwd.              | 10 | 0.915% | 0.875% | 99.080%
Wrapper C4.5 | Best-first: Bwd.              | 23 | 0.986% | 0.964% | 99.010%
Wrapper C4.5 | Best-first: Bi-dir.           | 9  | 0.927% | 0.879% | 99.069%
IG           | Ranker                        | 6  | 0.971% | 0.784% | 99.022%
Relief       | Ranker                        | 23 | 1.002% | 1.010% | 98.994%
All features | -                             | 47 | 1.091% | 1.051% | 98.903%

Table 15: WErr and f1 score for the 9 SVM classification models.

Feature Evaluator | Searching Method | Features | WErr (λ=1) | WErr (λ=9) | f1 score
ConEval      | Best-first: Fwd.              | 9  | 2.323% | 0.810% | 97.623%
ConEval      | Best-first: Bwd.              | 10 | 1.851% | 1.488% | 98.133%
ConEval      | Best-first: Bi-dir.           | 9  | 2.323% | 0.810% | 97.623%
CorEval      | Best-first: Fwd./Bwd./Bi-dir. | 12 | 1.198% | 1.126% | 98.796%
Wrapper SVM  | Best-first: Fwd.              | 15 | 0.980% | 0.562% | 99.011%
Wrapper SVM  | Best-first: Bwd.              | 32 | 1.149% | 1.097% | 98.845%
Wrapper SVM  | Best-first: Bi-dir.           | 12 | 1.016% | 0.491% | 98.973%
IG           | Ranker                        | 21 | 1.137% | 1.017% | 98.856%
Relief       | Ranker                        | 34 | 1.234% | 0.979% | 98.757%
All features | -                             | 47 | 1.367% | 0.967% | 98.620%

6.3 Reproducibility

Since this study relies on publicly available datasets and software, all of the presented results can be reproduced. The only exception is the measured time in Table 5, as it can be difficult to replicate the exact same hardware and software configurations under which the experiments were conducted; however, relatively similar results should be achieved with similar hardware. The experiments were carried out on a 64-bit computer with an Intel Core 2 Duo 3.00GHz and 4GB of RAM, running Gentoo Linux with perl v5.12.2, SpamAssassin v3.3.1, IcedTea v1.9.7 and Weka v3.6.4, which were the latest available versions at the time of writing this paper according to Gentoo's portage^4.

A Perl script is also made available to facilitate easier reproduction of the experiments presented in this paper. All the source files, including the exact ARFF^5 phishing and legitimate datasets, lists of feature subsets and scripts to automate their evaluation, can be found in [11], which allows:

• Reproducing RF, C4.5 and SVM classification models given ARFF files and feature subsets.
• Reproducing feature subsets given ARFF files.
• Reproducing the ARFF files given raw emails.

The downloadable file in [11] contains further instructions, as well as lists of feature IDs and evaluation results for all of the top-n ranked features according to the IG and Relief criteria. The feature IDs are how Weka identifies features; thus, the IDs of the feature subsets are made available to further simplify the reproducibility process.

^4 A package management system used by Gentoo Linux.
^5 Weka's Attribute-Relation File Format.

7 Conclusion

Previous studies in the phishing email classification domain have used various feature subset evaluation and searching methods. For example, [25, 7] used IG, and [6] used the Wrapper. None of the previous studies, including evaluations in the Machine Learning domain, have studied the performance of the various feature subset evaluation and selection techniques in the phishing classification domain. This study presents a performance evaluation of various feature subset evaluation and searching methods as they apply to the phishing email classification domain.

IG and Relief with the Ranker, and the Wrapper with Best-first: Forward, are arguably the most popular feature selection techniques in the phishing classification literature. The general motive behind the use of IG and the Ranker is that combining individually good features should result in a features subset that enables learning algorithms to construct robust classification models. On the other hand, the Wrapper and Best-first: Forward evaluate feature subsets instead of evaluating the features individually. Similar to the Wrapper, the rest of the feature selection techniques used guided searching methods (i.e. Best-first), with the exception of using filters as the evaluation criterion (i.e. as opposed to wrapping the classifier as the evaluator).

Each of the evaluated feature selection techniques follows certain assumptions with regard to the distribution of the features in a given data set. A question that this study answers is which of the assumptions behind the feature selection techniques hold true with real-life phishing and ham email messages, with respect to various evaluation criteria (i.e. classification accuracy, the quantity of suggested features, and the time taken to select a features subset).

By evaluating 47 previously proposed phishing classification features against a public data set of phishing and ham email messages, this study empirically shows that:

• With real-life phishing and ham email messages, evaluating features individually (e.g. IG and Relief) is not the most effective assumption when the key objective is raising the classification accuracy. On the other hand, using the guided feature searching method (i.e. Best-first) to search feature subsets (as opposed to ranking the features individually), combined with the Wrapper evaluator (as opposed to filters), resulted in the selection of the most effective feature subsets.

• The Wrapper evaluator with the Best-first: Forward feature subset searching method (as used in [6]) consistently achieved the highest f1 scores with all of the learning algorithms (i.e. RF, C4.5, SVM). When SVM was the learning algorithm, the Wrapper with the Best-first: Bi-directional searching method achieved the lowest overall classification error rate WErr for λ = 9.

• The IG evaluator with the Ranker achieved the lowest classification error rate WErr for λ = 9 in most of the evaluations, however with a considerably larger number of suggested features than the Wrapper with Best-first: Forward in most of the evaluations.

• The Relief evaluator with the Ranker suggested feature subsets with noticeably higher classification error rates WErr, for both λ = 1 and 9, than IG and the Wrapper in most of the evaluations, as well as suggesting a larger number of features in most of the evaluations.

• The ConEval evaluator with the various Best-first searching directions achieved the smallest number of suggested features in most of the evaluations. However, its selected features suffered noticeably higher WErr for both λ = 1 and 9 compared to the Wrapper and IG.

Interestingly, the features subset returned by the Wrapper with the Best-first: Forward searching method resulted in a robust RF classification model that achieved an f1 score of 99.396%, which makes it one of the most accurate publicly known phishing email classifiers; only one classifier is publicly known to have a higher f1 score, of 99.40% [6]. This study also demonstrates that a highly accurate classifier can be constructed by using existing data mining techniques and previously proposed features, if an effective features subset is found among them.

Future work will aim at evaluating the robustness of phishing classification features from the perspective of changes in the patterns of phishing attacks. This is particularly important since phishing attacks change over time, and finding robust features (or characteristics of the features) that survive the changes can enhance the overall detection of phishing attacks.

References

[1] The Apache SpamAssassin project. http://spamassassin.apache.org/.

[2] PhishGuru. http://www.wombatsecurity.com/phishguru. Accessed March 2011.

[3] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, eCrime '07, pages 60–69, New York, NY, USA, 2007. ACM.

[4] A. Alnajim and M. Munro. An anti-phishing approach that uses training intervention for phishing websites detection. 2009.

[5] E. Alpaydin. Introduction to Machine Learning. Knowl. Eng. Rev., 20:432–433, December 2005.

[6] A. Bergholz, J. De Beer, S. Glahn, M.-F. Moens, G. Paaß, and S. Strobel. New filtering approaches for phishing email. J. Comput. Secur., 18:7–35, January 2010.

[7] M. Chandrasekaran, K. Narayanan, and S. Upadhyaya. Phishing email detection based on structural properties. In NYS Cyber Security Conference, 2006.

[8] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 649–656, New York, NY, USA, 2007. ACM.

[9] W. N. Gansterer and D. Pölz. E-mail classification for phishing defense. In Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval, ECIR '09, pages 449–460, Berlin, Heidelberg, 2009. Springer-Verlag.

[10] M. A. Hall. Correlation-based feature selection for machine learning. 1998.

[11] M. Khonji. Phishing studies. http://khonji.org/index.php/Phishing_Studies. Accessed April 2011.

[12] M. Khonji, A. Jones, and Y. Iraqi. A novel phishing classification based on URL features. In GCC Conference and Exhibition (GCC), 2011 IEEE, 2011.

[13] M. Khonji, A. Jones, and Y. Iraqi. A study of feature subset evaluators and feature subset searching methods for phishing classification. In Proceedings of the 8th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, CEAS '11, pages 135–144, New York, NY, USA, 2011. ACM.

[14] K. Kira and L. A. Rendell. A practical approach to feature selection. In D. H. Sleeman and P. Edwards, editors, ML92: Proceedings of the Ninth International Conference on Machine Learning, pages 249–256, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.

[15] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artif. Intell., 97:273–324, December 1997.

[16] P. Kumaraguru, Y. Rhee, A. Acquisti, L. F. Cranor, J. Hong, and E. Nunge. Protecting people from phishing: the design and evaluation of an embedded training email system. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '07, pages 905–914, New York, NY, USA, 2007. ACM.

[17] P. Likarish, D. Dunbar, and T. E. Hansen. Bapt: Bayesian anti-phishing toolbar. 2008.

[18] H. Liu and R. Setiono. A probabilistic approach to feature selection: a filter solution. pages 319–327. Morgan Kaufmann, 1996.

[19] J. Nazario. Phishing corpus. http://monkey.org/~jose/wiki/doku.php?id=phishingcorpus. Accessed July 2010.

[20] P. Prakash, M. Kumar, R. R. Kompella, and M. Gupta. PhishNet: predictive blacklisting to detect phishing attacks. In INFOCOM '10: Proceedings of the 29th Conference on Information Communications, pages 346–350, Piscataway, NJ, USA, 2010. IEEE Press.

[21] R. Quinlan. Data mining tools See5 and C5.0. http://www.rulequest.com/see5-info.html. Accessed April 2011.

[22] S. Sheng, B. Magnien, P. Kumaraguru, A. Acquisti, L. F. Cranor, J. Hong, and E. Nunge. Anti-Phishing Phil: the design and evaluation of a game that teaches people not to fall for phish. In Proceedings of the 3rd Symposium on Usable Privacy and Security, SOUPS '07, pages 88–99, New York, NY, USA, 2007. ACM.

[23] SonicWall. Bayesian spam classification applied to phishing e-mail. http://www.sonicwall.com/downloads/WP-ENG-025_Phishing-Bayesian-Classification.pdf, 2008. Accessed Oct 2011.

[24] SpamAssassin. Public corpus. http://spamassassin.apache.org/publiccorpus/. Accessed January 2011.

[25] F. Toolan and J. Carthy. Phishing detection using classifier ensembles. In eCrime Researchers Summit, eCRIME '09, pages 1–9, Oct. 2009.

[26] F. Toolan and J. Carthy. Feature selection for spam and phishing detection. In eCrime Researchers Summit (eCrime), 2010, eCrime '10, Dallas, TX, 2010.

[27] University of Waikato. Weka 3: Data mining software in Java. http://www.cs.waikato.ac.nz/ml/weka/. Accessed January 2011.

[28] C. Whittaker, B. Ryner, and M. Nazif. Large-scale automatic classification of phishing pages. http://research.google.com/pubs/pub35580.html. Accessed July 2010.

[29] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.