An approach for profiling phishing activities

Isredza Rahmi A. Hamid a,b,c, Jemal H. Abawajy a,c,*

a School of Information Technology, Deakin University, Geelong, Australia
b Faculty of Computer Science & Information Technology, University Tun Hussein Onn Malaysia, Johor, Malaysia
c Parallel and Distributed Computing Lab, Deakin University, Australia
Article history: Received 26 September 2013; Received in revised form 3 March 2014; Accepted 5 April 2014

Keywords: Cybersecurity; Information security; Network security; Profiling; Phishing email; Clustering

Abstract

Phishing attacks continue unabated to plague Internet users and trick them into providing personal and confidential information to phishers. In this paper, an approach for email-borne phishing detection based on profiling and clustering techniques is proposed. We formulate the profiling problem as a clustering problem, using the various features present in the phishing emails as feature vectors, and generate profiles based on clustering predictions. These predictions are further utilized to generate complete profiles of the emails. We carried out an extensive experimental analysis of the proposed approach in order to evaluate its effectiveness with respect to factors such as sensitivity to the type of data, data sizes and cluster sizes. We compared the performance of the proposed approach against the Modified Global Kmeans (MGKmeans) approach. The results show that the proposed approach is efficient as compared to the baseline approach.

© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
Email services are used daily by millions of people, businesses, governments and other organizations to communicate around the globe, and email is a mission-critical application for many businesses (Islam and Abawajy, 2013). However, email-borne phishing attacks are an emerging problem, and solving it has proven very challenging. While phishing attacks can take several forms and arrive through several channels, such as emails, websites, SMS, forum posts and comments, the phishers' main goal is always to lure people into giving up important information. In email-borne phishing attacks, for example, phishers send emails that mislead their victims into revealing credentials such as account numbers, passwords or other personal information. In some cases, the phishers implant malicious software that controls a computer so that it can participate in future phishing scams. As most phishing emails are nearly identical to normal emails, it is quite difficult for average users to distinguish phishing emails from non-phishing ones. Moreover, phishing tactics have become more and more complicated, and phishers continually change the way they perpetrate attacks in order to defeat anti-phishing techniques. The main problem addressed in this paper is how to efficiently distinguish phishing emails from non-phishing emails: stopping phishing emails from invading the email system would allow email users to regain a safe and valuable means of communication. Although various anti-phishing tools and technologies have been developed to curb phishing attacks, the number of phishing email messages continues to rise rapidly. A report by the Anti-Phishing Working Group (APWG) found that the number of unique phishing emails reported by consumers rose 20% in the first half of 2012 compared to the same period in 2011. There was also an
* Corresponding author. School of Information Technology, Deakin University, Geelong 3220, Australia. Tel.: +61 352271376.
E-mail addresses: [email protected] (I.R.A. Hamid), [email protected] (J.H. Abawajy).
http://dx.doi.org/10.1016/j.cose.2014.04.002
0167-4048/© 2014 Elsevier Ltd. All rights reserved.
increase of 73% in the number of unique phishing websites detected in the first half of 2012 compared to 2011. As phishing attacks are serious threats to security and the economy worldwide (Abawajy and Kelarev, 2012), there is a strong need for automated phishing attack detection algorithms.

In this paper, a novel algorithm based on clustering and profiling approaches to detect email-borne phishing activities is proposed. Phishing-related research has usually concentrated on the detection of phishing emails based on significant features such as hyperlinks, number of words, email subjects and others. In contrast, we focus on profiling phishing emails, which is different from detecting them. Phishers normally have their own signatures or techniques, so a phisher's profile can be expected to show a collection of different activities. In the proposed approach, we apply clustering techniques based on a modified Two-Step clustering algorithm to generate the optimal number of clusters. Moreover, we split the data into training and testing ratios as discussed in Hamid et al. (2013), Marchal et al. (2012) and Xiang et al. (2011). Our model then generates profiles from the training data to train the detection algorithm. Our major contributions are summarized as follows:

1. We propose a method for extracting features from phishing emails, based on weighting email features and selecting them according to a priority ranking.
2. We propose a phishing email profiling and filtering algorithm, and show that general profiles can improve the accuracy of phishing detection.
3. We discuss the implications and the relative importance of the number of clusters for ensuring that the generated phishing email profiles generalize.
4. We provide empirical evidence that our proposed approach reduces false positives and false negatives with regard to overfitting issues.

The rest of the paper is organized as follows. Section 2 gives the problem overview and reviews related work on profiling and classification models. Section 3 presents the proposed classification model for phishing email detection, in which cluster predictions become elements of the profiles; the constructed profiles are then used to train the classifier. Section 4 presents the evaluation methodology and experimental results. Finally, Section 5 concludes the work and highlights directions for future research.
2. Related work
Phishing attacks represent one of the most rapidly growing and persistent threats in cyber security on a global scale. Phishing relies on social engineering and technical spoofing techniques, with the aim of obtaining personal and confidential information from unsuspecting online users. The common approach is spoofed emails with false explanations that appear to come from a legitimate institution such as a bank. The online users are urged to reply to the phishing email or visit a fake web site, where they are asked to enter personal confidential information, such as passwords and bank account numbers, that will eventually be used for identity theft. Comprehensive information on phishing and on the various techniques used to scam personal identity data and financial account credentials from unsuspecting online users is provided by the Anti-Phishing Working Group.

There have been many approaches to detecting and preventing phishing attacks, such as multi-tier classifiers (Abawajy and Kelarev, 2012), anti-phishing toolbars (Fette et al., 2007) and scam website blockers (Kirda and Kruege, 2006). Machine learning approaches based on features such as hyperlinks, number of words and email subjects have also been proposed in the literature. Anti-phishing tools use heuristics such as the host name, checking the URL for common spoofing techniques, and checking against previously seen images to detect phishing sites. Another commonly used method is a blacklist of reported phishing URLs. Heuristic-based methods suffer from high false positives, while blacklist approaches generally require human intervention and verification and do not protect against zero-day phishing attacks.

The behavior-based and the content-based phishing detection methods are two favorable approaches. Content-based filtering approaches classify emails as phishing or non-phishing based on the contents of the emails (Basnet and Sung, 2010; Chandrasekaran et al., 2001; Fette et al., 2007). In contrast, behavior-based approaches (Hamid and Abawajy, 2011a; 2011b; Hamid et al., 2013; Toolan and Carthy, 2010) are capable of learning the attacker's activities in order to classify emails into phishing and normal. A hybrid approach combining content-based and behavior-based detection is proposed in Hamid and Abawajy (2011b). Unfortunately, phishing attacks have continued to escalate in number and sophistication.

Recently, the concept of profiling phishing emails with the aim of tracking, predicting and identifying them has been discussed in Kang et al. (2010), Yearwood et al. (2012) and Yearwood et al. (2009). Profiling is the process of categorizing suspects or prospects from a large dataset and gathering a set of characteristics of a particular group based on past knowledge. Profiling has been applied in many domains, such as customer buying patterns and market product profiles. Market basket analysis (Hoanca and Mock, 2011) profiles customers based on their buying patterns; companies then use this information to learn the nature of competitive markets. Profiles of mobile users' browsing behavior patterns have recently been used to help cellular service providers (CSPs) improve service performance and thereby increase user satisfaction: such profiles offer valuable insights into how to enhance the mobile user experience through dynamic content personalization and recommendation, or location-aware services (Kelarapura et al., 2010). There have also been studies profiling the behavior of Internet backbone traffic, in which significant behavior is discovered in massive traffic and anomalous events such as large-scale scanning activities, worm outbreaks and denial-of-service attacks are identified (Xu et al., 2008). The present work is motivated by the work of Kang et al. (2010), Yearwood et al. (2012) and Yearwood et al. (2009), and complements these studies in many ways. Yearwood et al.
(2012) discussed the profiling of phishing activity based on hyperlinks extracted from phishing emails. The authors used three groups of features: the text content shown to the email's reader, a characterization of the hyperlinks in the email, and the orthographic features of the email. They used several clustering techniques to assign each email to a cluster according to its clustering criteria and the three feature sets, and the resulting clusterings were then combined using clustering consensus approaches. Our work differs from Yearwood et al. (2012) in that we concentrate on identifying phishing emails, whereas Yearwood et al. (2012) focus on identifying a specific number of phishing groups. Also, we represent phishing emails using a vector space model.
Table 1 – Comparison of related work.

Yearwood et al. (2012)
  Clustering algorithm: K-means.
  Cluster size: NA.
  Features: Structural properties: 1) text content, 2) vlinks, 3) html content, 4) script, 5) table, 6) image/logo, 7) hyperlinks, 8) formtag and 9) faketags. Whois properties: 1) hacked_site, 2) hosted_site and 3) legitimate_site_addition.
  Training & testing: 2038 emails using four-fold cross-validation.

Kang et al. (2010)
  Clustering algorithm: K-means.
  Cluster size: 5, 10, 15 and 20 (user-defined clusters).
  Features: 1) number of html tags in the message, 2) number of links in the message, 3) number of mismatched links, 4) number of scripts included in the message, 5) number of tables in the message, 6) number of embedded images, 7) number of attachments to the message.
  Training & testing: 3276 emails using ten-fold cross-validation.

Yearwood et al. (2009)
  Clustering algorithm: Modified Global Kmeans.
  Cluster size: 10 clusters for fake link tokens and 13 clusters for structural orthographic features.
  Features: Fake link tokens: hyperlink contains 1) the keyword "/nabib", 2) the orthographic tokens "/.verifyacc/", "/.ver/" or "/verify/", 3) the keyword "/index2.files/", 4) the keyword "/anb2/", 5) the keyword "/r1/?/", 6) the keyword "/netbank/" or "/netbank/bankmain", 7) the keyword "cbusol", 8) the keyword "/.national.com.au", 9) the keyword "moreinfo.htm" or "wumoreinfo.htm", 10) other keywords. Structural orthographic: 1) size of the text and html body of an email, 2) whether an email has text content, 3) number of visible links in an email, 4) whether a visual link is directed to the same hyperlink in an email, 5) whether an email contains a greeting line, 6) whether an email contains a signature at the end, 7) whether an email contains HTML content, 8) whether an email contains scripts, 9) whether an email contains tables, 10) number of images in an email, 11) number of hyperlinks in an email, 12) whether an email contains a form, 13) number of fake tags in an email.
  Training & testing: 2048 emails.

Proposed approach
  Clustering algorithm: Two-Step clustering algorithm.
  Cluster size: 10 clusters.
  Features: 10 features for each dataset (see Table 3).
  Training & testing: 4786 emails using the ratio size value.
Table 2 – Notation used in the paper.

a – attributes
x_i – instance value
M – size of the feature vector
a_bj – continuous variable
V_i – distance for cluster i
N_z – number of data in cluster z
p – total number of continuous variables used in the dataset
$\hat\sigma_{zj}^2$ – estimated variance of the jth continuous variable in cluster z
$\hat\sigma_j^2$ – estimated variance of the jth continuous variable across the entire dataset
LL(k) – the log-likelihood function
C_k – kth cluster model
θ_1 – threshold value
BIC(k) – Bayesian Information Criterion used for the initial estimate of the maximum number of clusters
RBK(k) – ratio of BIC changes
dBIC(k) – change in BIC between the k-cluster and (k + 1)-cluster models
R(k) – ratio of distance measures
RS(k) – ratio size value for cluster k
g_i – centroid value
D – dataset
F – feature vector
N – total number of data
a_cj – categorical variable
d(i, u) – difference between clusters i and u
k – number of clusters
q – total number of categorical variables used in the dataset
N_zjl/N_z – the mean of the lth category variable in cluster z
N_zjl – number of records in cluster z whose jth categorical variable takes the lth category
N(k_min) – size of the smallest cluster
N(k_max) – size of the largest cluster
P(k_min) – percentage of the smallest cluster
P(k_max) – percentage of the largest cluster
P(k) – percentage of cluster size
N_Ck – size of each cluster
w – optimum number of clusters
f_k – mean value of the distance between features
t_v – tolerance value

Moreover, the technique proposed in Yearwood et al. (2012) depends only on phishing emails with embedded hyperlinks; since many phishing emails are created without hyperlinks, the technique has a gap in its classification role. Yearwood et al. (2012) also used the k-means clustering algorithm and do not automatically determine the most appropriate final number of clusters (i.e., k), whereas we use the Two-Step clustering algorithm and develop an algorithm that determines the optimal number of clusters. Furthermore, we use the profiles to train the phishing detection algorithm, which Yearwood et al. (2012) did not do.

A method for obtaining profiles from phishing emails using hyperlink information as features, and structural and whois information as classes, is discussed in Yearwood et al. (2009). Profiles are generated based on the predictions of the classifier. They employ the boosting algorithm (AdaBoost) and support vector machine (SVM) algorithms to generate multi-label class predictions on three datasets created from hyperlink information in phishing emails, using four-fold cross-validation to generate the predictions and complete profiles of the emails. Our work differs from Yearwood et al. (2009) in that we determine the optimal number of clusters based on the ratio size value, as described later. Moreover, the actual profiling is not carried out in Yearwood et al. (2009), whereas we do carry it out, since our main focus is on identifying phishing emails. We also concentrate exclusively on structural characteristics found within the phishing emails.

The k-means clustering algorithm has been used by many authors (Toolan and Carthy, 2010; Xu et al., 2009; Yearwood et al., 2012). Kang et al. (2010) presented an approach based on unsupervised consensus clustering algorithms in combination with supervised classification methods for profiling phishing emails. They used the k-means clustering algorithm to cluster the data, and several consensus functions are then used to combine the independent clusterings into a final
consensus clustering. The final consensus clustering is used to train the classification algorithm, which is finally used to classify the whole dataset; ten-fold cross-validation was used to evaluate the accuracy of these algorithms. As k-means requires a user-defined number of clusters, k is fixed in advance as an input parameter of their algorithm. Our work differs from Kang et al. (2010) in that we focus on identifying phishing emails, whereas their objective is to produce a classification of phishing emails for subsequent forensic analysis based on the resulting individual clusters. Moreover, we use the Two-Step clustering algorithm. Also, the work in Kang et al. (2010) did not address the question of how to choose the most appropriate number of clusters, whereas we do.

Table 1 summarizes the related work in comparison with ours. The existing approaches (Kang et al., 2010; Yearwood et al., 2012; Yearwood et al., 2009) profile phishing emails by combining multiple independent clusterings of the email documents produced with different clustering algorithms; these clusterings are then combined using a variety of consensus functions. Moreover, the existing classification models train the classification algorithm using known label data (training data) to generate the classification model.
Fig. 1 – Phishing email profiling and classification model.
This classification model is then applied to the testing dataset, which consists of data with unknown class labels. Similar to the existing classification models, we split the data into training and testing datasets (Hamid et al., 2013; Marchal et al., 2012; Xiang et al., 2011). Our model then generates profiles from the training data using the clustering algorithm, and these profiles are used to train the detection algorithm. Profiling the dataset first increases the efficiency and accuracy of the filtering results: although the amount of training data is reduced, the approach remains efficient and becomes less complicated. Note also that the phishing profiling classification model reduces the error rate with regard to overfitting. Moreover, the proposed approach increases accuracy with low false positive and false negative values compared to the profiling techniques proposed in Kang et al. (2010), Yearwood et al. (2012) and Yearwood et al. (2009).
3. Email-borne phishing profiling approach
In this section, we discuss the proposed phishing profiling classification model. We first introduce the architecture, then the feature extraction and selection processes. Table 2 lists the notation used in the paper.
3.1. Feature extraction and clustering process
A high-level description of the feature extraction and clustering process is shown in Fig. 1. The first step extracts features from the phishing emails. Based on the output of the feature extraction step, a feature selection step selects a subset of relevant features to be used in constructing the model; in this step, the most informative features are selected using a learning model and a classification algorithm.
Table 3 – Summary of the dataset (20 email features ranked by information gain value).

Rank 1. Externalsascore (continuous, IG = 0.863473): SpamAssassin's score for the email.
Rank 2. Externalsabinary (categorical, IG = 0.774079): binary class prediction; '1' if the email is classified as spam, '0' if normal.
Rank 3. Urlnumlink (continuous, IG = 0.707139): number of links in URLs in the email message.
Rank 4. Bodyhtml (categorical, IG = 0.669168): '1' if the email has an HTML part, '0' otherwise.
Rank 5. Urlnumperiods (continuous, IG = 0.609389): number of periods in URLs in the email message.
Rank 6. Sendnumwords (continuous, IG = 0.413923): number of words in the email sender field.
Rank 7. Urlnumexternallink (continuous, IG = 0.410465): number of external links in URLs in the email message.
Rank 8. Bodynumfunctionwords (continuous, IG = 0.388836): number of function words (e.g., "account", "suspended") in the email body.
Rank 9. Bodydearword (categorical, IG = 0.305396): '1' if the email contains the word "dear", '0' otherwise.
Rank 10. Subjectreplyword (categorical, IG = 0.26253): '1' if the word "verify" is present in the email subject, '0' otherwise.
Rank 11. Sendunmodaldomain (categorical, IG = 0.239928): '1' if the sender's domain is not the modal domain of the email, '0' otherwise.
Rank 12. Bodynumwords (continuous, IG = 0.236486): number of words in the email body.
Rank 13. Bodynumchars (continuous, IG = 0.223582): number of characters in the email body.
Rank 14. Bodymultipart (categorical, IG = 0.209478): '1' if the email is multipart, '0' otherwise.
Rank 15. Urlnumip (continuous, IG = 0.188858): number of IP addresses in URLs in the email message.
Rank 16. Subjectrichness (continuous, IG = 0.167831): 'richness' of the subject, measured as subjectrichness = N_words / N_characters, where N_words and N_characters are the total numbers of words and characters in the email subject respectively.
Rank 17. Urlip (categorical, IG = 0.152108): '1' if an IP address appears in the email message, '0' otherwise.
Rank 18. Subjectbankword (categorical, IG = 0.079708): '1' if the word "bank" is present in the email subject, '0' otherwise.
Rank 19. Urlwordloginlink (categorical, IG = 0.071806): '1' if the word "login" appears in URLs in the email message, '0' otherwise.
Rank 20. Urlwordherelink (categorical, IG = 0.062774): '1' if the word "here" appears in URLs, '0' otherwise.
Table 4 – Datasets with different types of data.

Rank   Cat10                Cont10                 Mix10
1      Externalsabinary     Externalsascore        Externalsascore
2      Bodyhtml             Urlnumlink             Externalsabinary
3      Bodydearword         Urlnumperiods          Urlnumlink
4      Subjectreplyword     Sendnumwords           Bodyhtml
5      Sendunmodaldomain    Urlnumexternallink     Urlnumperiods
6      Bodymultipart        Bodynumfunctionwords   Sendnumwords
7      Urlip                Bodynumchars           Urlnumexternallink
8      Subjectbankword      Bodynumwords           Bodynumfunctionwords
9      Urlwordloginlink     Urlnumip               Bodydearword
10     Urlwordherelink      Subjectrichness        Subjectreplyword
The next step profiles the phishing dataset based on the clustering algorithm. We propose a different classification model by formulating the profiling problem as a clustering problem, using the various features in the phishing emails as feature vectors, and we generate profiles based on the clustering predictions; a cluster thus becomes an element of the profile. This profile then becomes the training dataset used to train the detection algorithm, and the selection of the optimal number of clusters at this stage is crucial for the accuracy and efficiency of the filtering result. The final step is classification: the dataset is split into training and testing ratios, the training data is used to generate profiles via the clustering algorithm, and the profile generated in the profiling stage is then used to train the classification algorithm, which predicts the unknown class label from the input data. These steps are described in detail later; a compact sketch of the pipeline follows.
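The sketch below illustrates this profile-then-classify pipeline with scikit-learn stand-ins. KMeans is only a placeholder for the paper's Two-Step algorithm (which scikit-learn does not provide), and the synthetic data, the way cluster predictions are appended to the feature vectors, and all parameter values are our illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the profile-then-classify pipeline described above.
# Assumptions: scikit-learn is available; KMeans stands in for the paper's
# Two-Step clustering, and the feature matrix X is already extracted.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 10))       # placeholder feature vectors (10 features)
y = rng.integers(0, 2, 1000)     # placeholder labels: 1 = phishing, 0 = ham

# 70:30 split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Profiling stage: cluster the training data; each cluster is a profile element.
clusterer = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_tr)
profiles = clusterer.predict(X_tr)

# Classification stage: the cluster predictions are appended to the feature
# vectors so the classifier is trained on the generated profiles.
X_tr_prof = np.column_stack([X_tr, profiles])
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr_prof, y_tr)

X_te_prof = np.column_stack([X_te, clusterer.predict(X_te)])
print("accuracy:", clf.score(X_te_prof, y_te))
```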
3.2. Feature extraction and selection
Several email features could be used as a basis for comparing and clustering the email datasets, including the actual text content displayed to the user, the textual structure of this content, the nature of the hyperlinks embedded within the message, and the use of HTML features such as images, tables and forms. A basic approach is to represent each email message in terms of these features and then apply a clustering algorithm to them. However, this simplistic approach has two drawbacks. First, the features are heterogeneous: some are numerical while others are binary or categorical, so combining them in a single clustering algorithm is problematic. Second, clustering algorithms always produce a set of clusters, even when there is no evidence of any underlying structure in the data. In our case, there are no ground-truth labels to use for testing the clustering results, as the actual source of the emails is unknown; it is therefore important to find methods for validating the clusters produced by the system.
3.2.1. Information gain
Information gain (IG) is used to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned; it indicates how important a given attribute of the feature vectors is. In general, information gain is given as:

$$IG(F, a) = H(F) - H(F \mid a) \quad (1)$$

where H denotes the information entropy and F denotes a feature vector of the form (x, y) = (x_1, x_2, x_3, ..., x_n, y), in which x_a ∈ val(a) is the value of the ath attribute of feature vector x and y is the corresponding class label. Expanding the conditional entropy, the information gain can be written as:

$$IG(F, a) = H(F) - \sum_{t \in val(a)} \frac{|\{x \in F \mid x_a = t\}|}{|F|} \, H(\{x \in F \mid x_a = t\}) \quad (2)$$
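The following is a minimal sketch of how the information gain of Eqs. (1) and (2) can be computed to rank attributes. It assumes discretized feature values (the paper normalizes continuous features first, as described next); the arrays and helper names are ours.

```python
# Minimal sketch of the information-gain ranking in Eqs. (1)-(2).
# Assumption: features are discretized; X is a 2-D array of attribute
# values, y the corresponding class labels.
import numpy as np

def entropy(labels):
    # H(F) = -sum_c p_c * log2(p_c) over the class distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    # IG(F, a) = H(F) - sum_t |F_t|/|F| * H(F_t), F_t = {x in F | x_a = t}.
    gain = entropy(y)
    for t in np.unique(x):
        mask = (x == t)
        gain -= mask.mean() * entropy(y[mask])
    return gain

X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
ranked = sorted(range(X.shape[1]),
                key=lambda j: information_gain(X[:, j], y), reverse=True)
print(ranked)  # attribute indices, most informative first
```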
Table 3 shows the ranking of 20 email features and the corresponding information gain values for the classification scenario, extracted from Khonji's anti-phishing studies website. The underlying data are publicly available from SpamAssassin's ham corpus1 and Jose Nazario's phishing corpus2. The dataset contains both continuous and categorical values. Features with continuous values are normalized by dividing the actual value by the maximum value of that feature, so that continuous values are limited to the range [0, 1]; this splits the data more evenly and improves the results achieved with the information gain algorithm. Based on the information gain values, we grouped the 20 email features of Table 3 into three datasets: 1) the top 10 binary/categorical features (Cat10); 2) the top 10 continuous features (Cont10); and 3) the top 10 mixed (categorical and continuous) features (Mix10). Table 4 shows the three datasets derived from the email features of Table 3. The top 10 features of each dataset type are selected by information gain value for phishing email classification, drawn from different parts of the email: the structural properties used for generating profiles are selected from the email body features, the email header features, the URL features and the external features. In our approach, we therefore use 20 features that reflect the different characteristics of the data types.
1 SpamAssassin, Public corpus, http://spamassassin.apache.org/publiccorpus/.
2 Nazario, Phishing corpus, http://monkey.org/~jose/wiki/doku.php?id=phishingcorpus.
Table 5 – Range of values for each feature in the datasets ({0, 1} denotes a binary feature, [0, 1] a normalized continuous one).

F_i   Cat10                        Cont10                         Mix10
F1    Externalsabinary {0, 1}      Externalsascore [0, 1]         Externalsascore [0, 1]
F2    Bodyhtml {0, 1}              Urlnumlink [0, 1]              Externalsabinary {0, 1}
F3    Bodydearword {0, 1}          Urlnumperiods [0, 1]           Urlnumlink [0, 1]
F4    Subjectreplyword {0, 1}      Sendnumwords [0, 1]            Bodyhtml {0, 1}
F5    Sendunmodaldomain {0, 1}     Urlnumexternallink [0, 1]      Urlnumperiods [0, 1]
F6    Bodymultipart {0, 1}         Bodynumfunctionwords [0, 1]    Sendnumwords [0, 1]
F7    Urlip {0, 1}                 Bodynumchars [0, 1]            Urlnumexternallink [0, 1]
F8    Subjectbankword {0, 1}       Bodynumwords [0, 1]            Bodynumfunctionwords [0, 1]
F9    Urlwordloginlink {0, 1}      Urlnumip [0, 1]                Bodydearword {0, 1}
F10   Urlwordherelink {0, 1}       Subjectrichness [0, 1]         Subjectreplyword {0, 1}
3.3. Profiling clustering algorithm

Our work uses both the continuous and the categorical data inherent in email messages to profile the activities of phishing emails, and we developed techniques to automatically obtain the optimal number of group profiles. In the existing approaches, the k-means clustering algorithm was used to profile phishing emails (Juels et al., 2006; Kang et al., 2010; Yearwood et al., 2012; Yearwood et al., 2009). However, k-means cannot handle continuous and categorical variables together. In the proposed approach, we enhanced the Two-Step clustering algorithm (Bacher et al., 2004; Cluster (Two-Step clustering algorithms) – IBM – United States, n.d.) to handle both continuous and categorical data. The Two-Step clustering algorithm uses agglomerative hierarchical clustering in its clustering step to automatically determine the optimal number of clusters in the input data, and is selected because it is designed to handle very large datasets. Let F denote a feature vector of the training set of the form (x) = (x_1, x_2, x_3, ..., x_n), where x_i ∈ val(i) is the value of the ith attribute of feature vector x. The x_i values for the features are summarized in Table 5.
3.3.1. Constructing feature matrix

In this section, we discuss the process used to construct the feature matrix for the datasets. Let E and F denote the emails and the feature vector space respectively:

$$E = \{e_1, e_2, \ldots, e_N\} \quad (3)$$

$$F = \{f_1, f_2, \ldots, f_M\} \quad (4)$$

where N is the total number of emails and M is the size of the feature vector. Let x_ij be the value of the jth feature of the ith document. The representation of each dataset D is then given by:

$$D_d = \{x_{ij} \mid 1 \le i \le N,\; 1 \le j \le M\}, \quad d \in \{\text{Cat10}, \text{Cont10}, \text{Mix10}\} \quad (5)$$

For example, the feature matrix for the Cat10 dataset consists of 10 features F_i, 1 ≤ i ≤ 10, for all phishing and normal emails; the Cat10 dataset consists of 10 categorical/binary values in {0, 1}. The three datasets consist of:

D_cat10 = {externalsabinary, bodyhtml, bodydearword, subjectreplyword, sendunmodaldomain, bodymultipart, urlip, subjectbankword, urlwordloginlink, urlwordherelink}
D_cont10 = {externalsascore, urlnumlink, urlnumperiods, sendnumwords, urlnumexternallink, bodynumfunctionwords, bodynumchars, bodynumwords, urlnumip, subjectrichness}
D_mix10 = {externalsascore, externalsabinary, urlnumlink, bodyhtml, urlnumperiods, sendnumwords, urlnumexternallink, bodynumfunctionwords, bodydearword, subjectreplyword}
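As a small illustration of Eqs. (3)–(5), the sketch below assembles the N × M matrix D_cat10 from per-email feature dictionaries. The two example emails and their feature values are invented for illustration.

```python
# Sketch of the feature-matrix construction in Eqs. (3)-(5): rows are
# emails e_1..e_N, columns the selected features f_1..f_M.
import numpy as np

CAT10 = ["externalsabinary", "bodyhtml", "bodydearword", "subjectreplyword",
         "sendunmodaldomain", "bodymultipart", "urlip", "subjectbankword",
         "urlwordloginlink", "urlwordherelink"]

emails = [  # each dict holds one parsed email's extracted feature values
    {"externalsabinary": 1, "bodyhtml": 1, "bodydearword": 0,
     "subjectreplyword": 0, "sendunmodaldomain": 1, "bodymultipart": 0,
     "urlip": 0, "subjectbankword": 0, "urlwordloginlink": 1,
     "urlwordherelink": 1},
    {"externalsabinary": 0, "bodyhtml": 0, "bodydearword": 0,
     "subjectreplyword": 0, "sendunmodaldomain": 0, "bodymultipart": 1,
     "urlip": 0, "subjectbankword": 0, "urlwordloginlink": 0,
     "urlwordherelink": 0},
]

# D_cat10 = {x_ij | 1 <= i <= N, 1 <= j <= M}: an N x M matrix.
D_cat10 = np.array([[e[f] for f in CAT10] for e in emails])
print(D_cat10.shape)  # (N, M) = (2, 10)
```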
3.3.2. Two-Step clustering algorithm

In this section we give a brief description of the Two-Step clustering algorithm (the notation is listed in Table 2, and Fig. 2 illustrates the algorithm's components). The steps of the Two-Step clustering algorithm are as follows:

Step 1: (Distance similarity) Compute the distance similarity using the log-likelihood function.
Step 2: (Cluster membership assignment) Allocate data points to their closest centers according to the log-likelihood function LL(k), obtaining the kth cluster.
Step 3: Compute the initial number of clusters.
Step 4: (Stopping criterion) Decide the final number of clusters.

Fig. 2 – Two-Step cluster algorithm components.
3.3.2.1. Distance similarity computation. The algorithm computes the distance similarity using the log-likelihood function. The log-likelihood distance is a probability-based distance between two clusters, related to the reduction in log-likelihood when two clusters are joined into one cluster.
In computing the log-likelihood, normal distributions are assumed for the continuous variables and multinomial distributions for the categorical variables; the variables are also assumed to be independent of each other, as are the cases. Let V_i in Eq. (6) and V_u in Eq. (7) be the distances for clusters i and u respectively, where

$$V_i = -N_i \left( \sum_{j=1}^{p} \frac{1}{2} \log\big(\hat\sigma_{ij}^2 + \hat\sigma_j^2\big) - \sum_{j=1}^{q} \sum_{l=1}^{a_{cj}} \frac{N_{ijl}}{N_i} \log \frac{N_{ijl}}{N_i} \right) \quad (6)$$

$$V_u = -N_u \left( \sum_{j=1}^{p} \frac{1}{2} \log\big(\hat\sigma_{uj}^2 + \hat\sigma_j^2\big) - \sum_{j=1}^{q} \sum_{l=1}^{a_{cj}} \frac{N_{ujl}}{N_u} \log \frac{N_{ujl}}{N_u} \right) \quad (7)$$
Accordingly, the difference between clusters i and u can be defined as d(i, u) = V_i + V_u − V_⟨i,u⟩. Let V_z be the distance measure for z ∈ {i, u, ⟨i, u⟩}, where ⟨i, u⟩ denotes the cluster formed by combining clusters i and u:

$$V_{\langle i,u\rangle} = -N_{\langle i,u\rangle} \left( \sum_{j=1}^{p} \frac{1}{2} \log\big(\hat\sigma_{\langle i,u\rangle j}^2 + \hat\sigma_j^2\big) - \sum_{j=1}^{q} \sum_{l=1}^{a_{cj}} \frac{N_{\langle i,u\rangle jl}}{N_{\langle i,u\rangle}} \log \frac{N_{\langle i,u\rangle jl}}{N_{\langle i,u\rangle}} \right) \quad (8)$$
The first part in Eq. (8), involving $\sum_{j=1}^{p} \frac{1}{2}\log(\hat\sigma_{zj}^2 + \hat\sigma_j^2)$, measures the dispersion of the continuous variables a_bj (j = 1, 2, ..., p), where p is the total number of continuous variables used in the dataset and N_z is the number of data points in cluster z. Here $\hat\sigma_{zj}^2$ is the estimated variance of the jth continuous variable in cluster z, while $\hat\sigma_j^2$ is its estimated variance across the entire dataset (an estimated variance being the average of the squared differences from the mean). The term $\hat\sigma_j^2$ is added to circumvent the degenerate condition $\hat\sigma_{zj}^2 = 0$; if $\hat\sigma_{zj}^2$ alone were used, d(i, u) would keep declining in the log-likelihood function after merging clusters i and u. The second part in Eq. (8), $\sum_{j=1}^{q}\sum_{l=1}^{a_{cj}} \frac{N_{zjl}}{N_z}\log\frac{N_{zjl}}{N_z}$, measures the dispersion of the categorical variables a_cj (j = 1, 2, ..., q), where q is the total number of categorical variables used in the dataset. Here N_zjl/N_z represents the mean of the lth category of the jth variable in cluster z, N_zjl being the number of records in cluster z whose jth categorical variable takes the lth category. At each step, the pair of clusters with the smallest distance d(i, u) is merged. The log-likelihood function LL(k) for the step with k clusters is then calculated as follows:

$$LL(k) = \sum_{z=1}^{k} V_z \quad (9)$$
where LL(k) can be understood as a distribution within the clusters. Cluster membership for each object is assigned deterministically to the closest cluster according to the distance measure used to find the clusters. This deterministic assignment may result in unfair estimates of the cluster profiles if the clusters overlap.
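A sketch of this log-likelihood distance follows, based on the reconstruction of Eqs. (6)–(8) above (the sign conventions follow the standard Two-Step formulation, since the extracted equations do not preserve them); the data and the cluster split are synthetic.

```python
# Sketch of the log-likelihood distance in Eqs. (6)-(8); variable names
# follow Table 2 (V_z, N_z, sigma^2). Sign conventions are our reading.
import numpy as np

def V(cont, cat, sigma2_j):
    """V_z for one cluster: cont is (N_z, p) continuous, cat (N_z, q) categorical."""
    n = len(cont)
    cont_part = 0.5 * np.log(cont.var(axis=0) + sigma2_j).sum()
    plogp = 0.0
    for j in range(cat.shape[1]):
        _, counts = np.unique(cat[:, j], return_counts=True)
        share = counts / n
        plogp += np.sum(share * np.log(share))   # <= 0 by construction
    return -n * (cont_part - plogp)

def d(ci, cu, sigma2_j):
    """d(i, u) = V_i + V_u - V_<i,u>: the cost of merging clusters i and u."""
    merged = (np.vstack([ci[0], cu[0]]), np.vstack([ci[1], cu[1]]))
    return V(*ci, sigma2_j) + V(*cu, sigma2_j) - V(*merged, sigma2_j)

rng = np.random.default_rng(1)
cont_all = rng.random((60, 3))                   # p = 3 continuous features
cat_all = rng.integers(0, 2, (60, 2))            # q = 2 categorical features
sigma2_j = cont_all.var(axis=0)                  # dataset-wide variances
a = (cont_all[:30], cat_all[:30])
b = (cont_all[30:], cat_all[30:])
print("d(a, b) =", d(a, b, sigma2_j))
```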
3.3.2.2. Computation of the initial number of clusters. The number of clusters can be automatically determined using a two-phase estimator in the Two-Step clustering algorithm. The first phase can use either Akaike's Information Criterion (AIC) or the Bayesian Information Criterion (BIC); in this paper we use BIC as the first estimator because it results in a good initial estimate of the maximum number of clusters. The BIC is computed as follows:

$$BIC(k) = -2\,LL(k) + r(k)\log N \quad (10)$$

where $r(k) = k\left(2p + \sum_{j=1}^{q}(a_{cj} - 1)\right)$, q is the total number of categorical variables (a_cj) used in the dataset, p is the total number of continuous variables, and k is the cluster number. The ratio of the change in BIC at each consecutive merging step, relative to the change at the first merging step, then decides the initial estimate. Let dBIC(k) = BIC(k) − BIC(k + 1) be the change between the k-cluster and (k + 1)-cluster models. The ratio of BIC changes with respect to the changes at the first step is then defined as:

$$RBK(k) = \frac{dBIC(k)}{dBIC(1)} \quad (11)$$

Note that if dBIC(1) < 0, the number of clusters is set to 1 and the second phase is omitted. Otherwise, the initial estimate of the number of clusters is the smallest k for which RBK(k) is smaller than the threshold value θ_1. The second phase uses the ratio of distance measures, R(k), for k clusters, defined as:

$$R(k) = \frac{RBK(k)}{RBK(k+1)} \quad (12)$$

where RBK(k) is the ratio of BIC changes for k clusters, and the same ratio is computed for the following (k + 1)-cluster model. Finally, the two largest R ratios are compared. Assume, for example, a threshold value θ_1 = 1.15: if the largest R ratio is more than 1.15 times the second largest, the cluster count with the largest R ratio is selected as the optimal number of clusters; otherwise, of the two cluster counts with the largest R values, the larger one is selected as optimal.

We now illustrate the concept using the information given in Table 6. The two largest values of R(k) are R(6) = 2.827 and R(9) = 1.751, so the ratio is R(6)/R(9) = 2.827/1.751 = 1.615. With the threshold value θ_1 = 1.15, the ratio 1.615 exceeds the threshold, and 6 is therefore selected as the optimal number of clusters. In summary, this algorithm is sensitive to the choice of the threshold value θ_1 used to determine the initial number of clusters: a large value can result in a large number of initial clusters, while a small value can produce artificial clusters. Moreover, the number of clusters produced by this algorithm is not diverse for large amounts of data (Hamid and Abawajy, 2013).
Table 6 – Two-Step clustering.

k    BIC(k)       dBIC(k)      RBK(k)   R(k)
1    85389.657    22021.745    1.000    –
2    63367.912    13363.237    0.607    1.648
3    50004.675    9648.553     0.438    1.385
4    40356.122    11820.035    0.537    0.816
5    28536.087    8315.541     0.378    1.421
6    20220.546    2941.534     0.134    2.827
7    17279.012    3275.223     0.149    0.898
8    14003.789    3757.913     0.171    0.872
9    10245.876    2146.424     0.097    1.751
10   8099.452     2322.818     0.105    0.924
11   5776.634     –            –        1.082
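The two-phase estimate can be illustrated directly on the BIC column of Table 6. The sketch below recomputes dBIC(k), RBK(k) and R(k) and applies the θ_1 = 1.15 rule; the index offset in the last lines reflects Table 6's convention of reporting R at the merged-to cluster count, which is our reading of the table.

```python
# Sketch of the two-phase estimate in Eqs. (10)-(12), applied to the BIC
# column of Table 6 (theta_1 = 1.15, as in the paper's example).
bic = [85389.657, 63367.912, 50004.675, 40356.122, 28536.087, 20220.546,
       17279.012, 14003.789, 10245.876, 8099.452, 5776.634]   # k = 1..11

dbic = [bic[i] - bic[i + 1] for i in range(len(bic) - 1)]      # dBIC(k)
rbk = [d / dbic[0] for d in dbic]                              # RBK(k), Eq. (11)
r = [rbk[i] / rbk[i + 1] for i in range(len(rbk) - 1)]         # R ratios, Eq. (12)

theta1 = 1.15
# Compare the two largest R ratios, per the procedure above.
order = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
first, second = order[0], order[1]
# r[i] corresponds to R(k = i + 2) in Table 6's indexing.
if r[first] / r[second] > theta1:
    k_opt = first + 2
else:
    k_opt = max(first, second) + 2
print("optimal k =", k_opt)   # -> 6, matching the example
```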
3.3.3. Algorithm for profiling phishing emails

The proposed phishing profiling approach consists of two stages. The first stage profiles email-borne phishing activities: profiles are generated from the clustering algorithm's predictions, so clusters become elements of profiles, and these predictions are further utilized to generate complete profiles of the emails. The number of profiles used at this stage is crucial for the effectiveness of the model. The second stage is the classification of emails: the profiles generated in the profiling stage are used to train the classification algorithm, which must predict the unknown class label from the input data. The key objective of the profiles is therefore to train the classification algorithm for better generalization, so that it can predict the class label accurately for unknown records. Fig. 3 is a schematic representation of the proposed ProPhish algorithm, which has five steps, each described below.

Fig. 3 – Phishing profiling classification model.

Step 1: (Initialization). Let X_i be the corresponding input value for the algorithm, such that

$$X_i = (x_{i1}, x_{i2}, \ldots, x_{iN}) \quad (13)$$

where x_i is the value of the ith attribute in the email and N is the total number of attributes in the email dataset. Let the cluster number be set to k = 1 and the ratio size to RS(1) = 0. The continuous variables are represented as a_bj (j = 1, 2, ..., p), where p is the total number of continuous variables used in the dataset within cluster k, and the categorical variables are denoted a_cj (j = 1, 2, ..., q), where q is the total number of categorical variables.

Step 2: (Distance similarity). Compute the distance similarity using the log-likelihood function: we apply Eqs. (6)–(9) to find the log-likelihood function value LL(k) for the kth cluster.

Step 3: (Cluster membership assignment). Allocate the log-likelihood values LL(k) (k = 1, 2, ..., w) to their closest centers. Next, we apply the computation of the initial number of clusters (Section 3.3.2.2) to obtain the kth cluster model C_k (k = 1, 2, ..., w), where w is the optimum number of clusters, together with the size of each cluster, N_Ck. For each cluster model C_k, we calculate the percentage of cluster size, P(k), for the largest and the smallest clusters. The percentages for the largest clusters, P(k_max), and the smallest clusters, P(k_min), are computed as follows:
$$P(k_{max}) = \frac{N(k_{max})}{N}, \qquad P(k_{min}) = \frac{N(k_{min})}{N} \quad (14)$$
where N is the total size of the dataset, and N(k_max) and N(k_min) are the sizes of the largest and the smallest cluster respectively.

Step 4: (Stopping criterion). The ratio size, RS(k), between the largest and the smallest cluster percentages is computed for each cluster model as follows:
$$RS(k) = \frac{P(k_{max})}{P(k_{min})} \quad (15)$$
We compute the same ratio size for the following (k + 1) clusters, until RS(k) ≈ RS(k + 1); finally, we select the cluster model with the larger number of clusters as optimal. For example, Table 7 shows the ratio size values for up to 11 clusters, for N = 4786 emails. The two values that satisfy RS(k) ≈ RS(k + 1) occur at cluster size 10, where RS(9) = 36.75 and RS(10) = 36.75. As a result, 10 is selected as the optimum number of clusters.
Table 7 – Clustering using ratio size.

k    N(kmin)   N(kmax)   P(kmin)   P(kmax)   RS(k)
2    2059      2727      43.02     56.98     1.32
3    407       2751      8.50      57.48     6.76
4    382       1878      7.98      39.24     4.92
5    377       1878      7.88      39.24     4.98
6    179       1874      3.74      39.16     10.47
7    171       1874      3.57      39.16     10.96
8    102       1874      2.13      39.16     18.37
9    51        1874      1.07      39.16     36.75
10   51        1874      1.07      39.16     36.75
11   25        1871      0.52      39.09     74.84
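A compact sketch of the ratio-size stopping criterion of Eqs. (14) and (15), run over the cluster sizes of Table 7, follows; the tolerance used to test RS(k) ≈ RS(k + 1) is our assumption, as the paper does not state one.

```python
# Sketch of the ratio-size stopping criterion (Eqs. (14)-(15)) over the
# cluster sizes of Table 7; sizes[k] maps k to (N(kmin), N(kmax)).
sizes = {2: (2059, 2727), 3: (407, 2751), 4: (382, 1878), 5: (377, 1878),
         6: (179, 1874), 7: (171, 1874), 8: (102, 1874), 9: (51, 1874),
         10: (51, 1874), 11: (25, 1871)}

def rs(k):
    n_min, n_max = sizes[k]
    return n_max / n_min          # P(kmax)/P(kmin): the N's cancel out

k_opt = None
ks = sorted(sizes)
for k, k_next in zip(ks, ks[1:]):
    if abs(rs(k) - rs(k_next)) < 1e-9:   # RS(k) ~ RS(k+1)
        k_opt = k_next                   # prefer the larger cluster count
        break
print("optimal k =", k_opt)              # -> 10, as in Table 7
```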
Step 5: (Profile prediction result). The result generated by the clustering algorithm is used to construct the profiles. Let D = {x_ij}, with i = 1, 2, ..., w and j = 1, 2, ..., M, denote the dataset. The cluster model ensemble for this dataset is denoted C = {C^1, C^2, ..., C^w}, where each cluster model C^i = {C^i_1, C^i_2, ..., C^i_w} partitions D into a disjoint union D = C^i_1 ∪ C^i_2 ∪ ... ∪ C^i_w for all i = 1, ..., w, and w is the optimal number of clusters generated by the clustering algorithm. Table 8 shows the profile prediction for the dataset D_mix10, which has an optimal number of 10 clusters. The attributes of the mixed dataset are D_mix10 = {externalsascore, externalsabinary, urlnumlink, bodyhtml, urlnumperiods, sendnumwords, urlnumexternallink, bodynumfunctionwords, bodydearword, subjectreplyword}.
Table 8 – Profiles prediction for dataset Dmix10. Each attribute i of D_mix10 (i = 1, ..., 10) has one cluster prediction C^i_j per cluster j = 1, ..., 10:

externalsascore:       C^1_1  C^1_2  C^1_3  C^1_4  C^1_5  C^1_6  C^1_7  C^1_8  C^1_9  C^1_10
externalsabinary:      C^2_1  ...  C^2_10
urlnumlink:            C^3_1  ...  C^3_10
bodyhtml:              C^4_1  ...  C^4_10
urlnumperiods:         C^5_1  ...  C^5_10
sendnumwords:          C^6_1  ...  C^6_10
urlnumexternallink:    C^7_1  ...  C^7_10
bodynumfunctionwords:  C^8_1  ...  C^8_10
bodydearword:          C^9_1  ...  C^9_10
subjectreplyword:      C^10_1 ...  C^10_10
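As an illustration of how such a profile matrix can be assembled in practice, the sketch below clusters a synthetic D_mix10 and builds one profile row per cluster. Using per-cluster attribute means as the profile cells is our simplification, and KMeans again stands in for the Two-Step algorithm.

```python
# Sketch of assembling a Table-8-style profile matrix: one row per
# cluster, one column per attribute of D_mix10 (an illustrative layout).
import numpy as np
from sklearn.cluster import KMeans   # stand-in for the Two-Step algorithm

MIX10 = ["externalsascore", "externalsabinary", "urlnumlink", "bodyhtml",
         "urlnumperiods", "sendnumwords", "urlnumexternallink",
         "bodynumfunctionwords", "bodydearword", "subjectreplyword"]

rng = np.random.default_rng(2)
D_mix10 = rng.random((500, len(MIX10)))          # placeholder data

w = 10                                           # optimal number of clusters
labels = KMeans(n_clusters=w, n_init=10, random_state=0).fit_predict(D_mix10)

# One profile row per cluster: the per-attribute centroid, a compact
# stand-in for the cells C^i_1..C^i_10 of Table 8.
profiles = np.vstack([D_mix10[labels == c].mean(axis=0) for c in range(w)])
print(profiles.shape)   # (10 clusters, 10 attributes)
```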
4. Performance analysis
In this section, we present the analysis and performance evaluation of the proposed approach and compare it with the Modified Global Kmeans (MGKmeans) algorithm (Bagirov, 2008) to demonstrate its effectiveness in profiling phishing emails. We first present the experimental setup, and then discuss the results of the experiments.
4.1. Experimental setup
For our experiments, we used datasets from Khonji's anti-phishing studies website (Khonji et al., 2011) and used Weka (Hall et al., 2009) to test the performance of the approaches. This paper centres on supervised data, where we assume the datasets are classified into ham and phishing data; the dataset consists of 4150 ham emails and 2687 phishing emails. First, we split the data into a 70:30 (training:testing) ratio, as used in Hamid et al. (2013), Marchal et al. (2012) and Xiang et al. (2011); the training set contains 4786 emails and each testing set approximately 2051. We generated the three datasets denoted Cat10, Cont10 and Mix10. Each document in the testing data has been pre-classified into a unique class so that the clustering accuracy of each clustering algorithm can be evaluated. For the evaluation of the accuracy and effectiveness of the profiles, all datasets used the same testing and training sizes: all profiles are generated from the training data by the clustering algorithm, the profiles are used to train the classification algorithms, and the testing set is finally used to obtain the performance characteristics. In order to assess the proposed clustering algorithm, we ran the following sensitivity tests:
1) Sensitivity to data type: To evaluate the proposed algorithm thoroughly, we tested the different data types: binary/categorical, continuous and mixed. We generated a fixed number of clusters (profiles) for the Cat10, Cont10 and Mix10 datasets, using 10 clusters as training data to determine the performance of the proposed algorithm on each data type.
2) Sensitivity to data size: We used the same cluster (profile) size to test the performance of the proposed algorithm over a range of testing data sizes, from smallest to largest. To maintain equality between the two types of profiles, we generated 10 profiles from each clustering algorithm; this investigates the effect of overfitting.
3) Sensitivity to cluster size: To measure the sensitivity of the proposed algorithm to the cluster size, we tested all datasets over a diverse number of clusters generated by the MGKmeans and ProPhish algorithms.

In order to measure the effectiveness of the proposed clustering algorithm, we used the following four performance metrics:

1) Accuracy (Acc): how many email classes are correctly predicted by the algorithm?
2) False negatives (FN): how many phishing emails go undetected by the algorithm?
3) False positives (FP): how many normal emails are misclassified as phishing?
4) Error rate (Err rate): how many emails are wrongly predicted by the algorithm?

The accuracy metric measures the number of correctly classified emails: if accuracy is high, the proposed algorithm is effective at detecting phishing email. Both the FN and FP metrics are important in measuring the effectiveness of security approaches. For instance, false positives can considerably reduce the utility of a detection and protection algorithm, because examining them takes time and resources; if the FP rate is high, users may start to disregard alerts. The error rate metric is important for inspecting overfitting: a model overfits when it fits the training data too well yet has a poorer generalization error than a model with a higher training error. Once the data become too large, the error rate begins to increase.
The errors committed by a classification model are generally caused by training errors and generalization errors: training errors are the misclassification errors committed on the training records, while the generalization error is the expected error of the model on previously unseen records. A good classification model must both fit the training data well and classify unseen records accurately, and must therefore have a low error rate as well as a low generalization error.
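The four metrics can be computed directly from a confusion matrix, as in the sketch below (the counts are invented):

```python
# Sketch of the four evaluation metrics defined above, computed from a
# confusion matrix (phishing = positive class).
def metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # correctly predicted emails
        "fp_rate": fp / (fp + tn),          # ham misclassified as phishing
        "fn_rate": fn / (fn + tp),          # phishing that goes undetected
        "error_rate": (fp + fn) / total,    # all wrongly predicted emails
    }

print(metrics(tp=610, fp=12, tn=1400, fn=29))  # invented counts
```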
4.2. Classification algorithms used
We ran a series of tests on ProPhish and MGKmeans using classification algorithms. We chose AdaBoost and Sequential Minimal Optimization (SMO) because they are simple, commonly used classification methods with the capability to handle tokens and associated probabilities according to the user's classification decisions. We used the default settings of the classification algorithms in our initial preliminary testing:

AdaBoost (AB) – Boosting is a stable method for improving the performance of any particular classification algorithm: it constructs a highly accurate classification rule by combining many simple and moderately accurate hypotheses into a strong one. The method was initially presented in Freund and Schapire (1997).

Sequential Minimal Optimization (SMO) – SMO is an iterative algorithm that solves the optimization problem arising during the training of support vector machines by breaking it into a series of the smallest possible sub-problems, which are then solved analytically. It was invented by John Platt (1998) at Microsoft Research; here the optimization is carried out using the libsvm tool.
Fig. 6 – Ratio size value vs. cluster size.
Fig. 7 – False positive rate.
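For illustration, the two classifiers can be exercised with common scikit-learn stand-ins, as sketched below. Note that these are analogues rather than the Weka implementations the paper uses: SVC trains an SVM via libsvm rather than Weka's SMO, and all data here is synthetic.

```python
# Illustrative stand-ins for the two classifiers used in the experiments.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.random((300, 10))
y = rng.integers(0, 2, 300)

for name, clf in [("AdaBoost", AdaBoostClassifier(random_state=0)),
                  ("SVM (SMO-style)", SVC(kernel="linear"))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```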
4.3. Result and discussion

We carried out extensive experiments and compared the proposed algorithm with Modified Global Kmeans (MGKmeans). In this section, we discuss the results of the experiments.

4.3.1. Cluster size selection
In this experiment, we determine the best number of clusters for the proposed approach and for Modified Global Kmeans (MGKmeans), and then test the two algorithms on the different types of datasets: Cat10, Cont10 and Mix10.
4.3.1.1. Cluster size selection using MGKmeans algorithm. 4.3.
Result and discussion
We carried out extensive experiments and compare the proposed algorithm with the Modified Global Kmeans (MGKmeans). In this section, we discuss the results of the experiments.
Features are collected according to the features defined in Constructing Feature Matrix section (Section 3.3.1). We used mixed data type where all binary value features are transforming to numerical value ranging from 0 to 1. Then, the evaluation of the current feature subset is carried out using MGKmeans algorithm. The performance is evaluated using the MGKmeans objective function value (fk) on the feature
Fig. 8 e False negative rate.
c o m p u t e r s & s e c u r i t y 4 5 ( 2 0 1 4 ) 2 7 e4 1
subset. f_k is the mean value of the distance between the features and the centroid value g_i over all the data, computed as follows:

$$f_k(x_1, \ldots, x_k) = \frac{1}{N} \sum_{i=1}^{N} \min_{j=1,\ldots,k} \lVert x_j - g_i \rVert^2 \quad (16)$$
where x_j represents the centroid value for each of the k clusters. The stopping criterion used to select the optimal number of clusters is calculated as follows:

$$\frac{f_{k-1} - f_k}{f_1} < t_v, \qquad f_1 \neq 0 \quad (17)$$
where t_v is the tolerance value and f_1 is the corresponding value of the objective function in Eq. (16). Fig. 4 shows the association between the objective function values (f_k) and the tolerance values (t_v): we found that the bigger the cluster size, the smaller the f_k value, because the data are denser around the centroid value. We then calculated f_k over a range of tolerance values and found that it settles at a constant value of 0.16 when t_v ≤ 0.03. Fig. 5 shows the relationship between f_k and the number of clusters: when f_k is 0.16, the number of clusters is 3. These figures indicate that a good clustering balances the objective function value, the tolerance value and the number of clusters. As a result, the optimal number of clusters using MGKmeans for this dataset is 3.

Fig. 4 – Objective function values vs. tolerance values.

Fig. 5 – Objective function values vs. the number of clusters.

4.3.1.2. Cluster size selection using ProPhish algorithm. In this experiment, we studied the relationship between the ratio size and the number of clusters on the feature subset. We calculated the ratio size for each cluster count, where the ratio size is the quotient of the size of the largest cluster and the size of the smallest cluster. Fig. 6 shows the result of the experiment: the ratio size value is constant once the number of clusters is greater than 9, with RS(9) = 36.75 and RS(10) = 36.75. The optimal number of clusters should have a balanced ratio size between the largest and the smallest clusters, to ensure that each cluster is well diversified over the dataset. Therefore, the optimal number of clusters using ProPhish is 10.

4.3.2. Sensitivity test

4.3.2.1. Sensitivity to the false positive and the false negative.
In this section, we study the false positive and the false negative rates of the ProPhish and MGKmeans algorithms. In the experiment, we fixed the number of clusters at 10. Fig. 7 shows the false positive rates (y-axis) for ProPhish and MGKmeans under the AdaBoost and SMO classification algorithms as a function of the dataset (x-axis). The results show that the proposed algorithm achieves lower false positive rates on all datasets compared to the MGKmeans algorithm. For the Cat10 and Mix10 datasets, the false positive rates of the MGKmeans algorithm under both SMO and AdaBoost are about 90% worse than those of the proposed algorithm; for the Cont10 dataset, the false positive rates of the MGKmeans algorithm range from 60% to 80%.

Fig. 8 – False negative rate.

The trend for the false negative rates is similar. Fig. 8 shows the false negative rates (y-axis) for ProPhish and MGKmeans under the AdaBoost and SMO classification algorithms as a function of the dataset (x-axis). We observe that the proposed algorithm achieves substantially lower false negative rates on all datasets than the MGKmeans algorithm: on the Mix10 dataset, the false negative rate of the proposed algorithm is nearly zero under SMO and below 5% under AdaBoost, whereas it is about 95% under AdaBoost and about 99% under SMO for the MGKmeans algorithm. For the other datasets, the trends are the same as for the false positive rates. Overall, the results show that the proposed algorithm has substantially lower false negative and false positive rates than the MGKmeans algorithm. This is because the proposed algorithm is capable of handling large datasets whose features have categorical, continuous or mixed values. Moreover, all data types (categorical data,
Table 9 – Performance evaluation: sensitivity to cluster size (accuracy, %).

                                 AdaBoost                 Sequential minimal optimization
Dataset   Number of clusters     ProPhish   MGKmeans      ProPhish   MGKmeans
Cat10     3                      98.4       97.9          98.1       52.5
          5                      98.4       97.9          99.2       98.5
          10                     98.4       86.1          98.4       88.3
          15                     97.1       98.8          98.4       98.6
Cont10    3                      90.2       31.9          95.6       64.2
          5                      94.3       70.9          95.3       77.0
          10                     96.1       35.0          54.8       39.5
          15                     94.8       87.4          89.3       87.3
Mix10     3                      88.9       96.5          98.3       98.7
          5                      89.3       54.1          98.4       98.6
          10                     98.0       54.2          98.7       54.4
          15                     98.9       98.4          98.7       98.4
continuous data and mixed data), together with the data integrity, are collectively highly effective in profiling phishing emails.
Fig. 9 – Sensitivity to data size.

4.3.2.2. Sensitivity to error rate. In this experiment, we measured the error rate as a function of the dataset size for the ProPhish and MGKmeans algorithms on the Cat10, Cont10 and Mix10 datasets, with the number of clusters set to 10. Fig. 9 shows the results. The proposed algorithm has substantially lower error rates on all datasets under both classification algorithms, and as the amount of testing data increases, its error rate remains low. This provides some assurance that the proposed algorithm (ProPhish) can overcome overfitting while keeping a low error rate. Overfitting happens when a model begins to memorize the training data rather than learning to generalize from the trend; it is possible because the criterion used for training the model is not the same as the criterion used to judge its effectiveness. That is, a model is typically trained by optimizing its performance on some set of training data, but its effectiveness is determined not by its performance on the training data but by its ability to perform well on unseen data. Even though we trained the algorithm using 10 profiles, this did not degrade the performance of the classification algorithm, because it generalizes well to the testing data and improved the clustering accuracy. Nevertheless, we believe the choice of the optimal number of clusters is crucial for increasing the accuracy and efficiency of the filtering results.

4.3.2.3. Sensitivity to cluster size. In this section, we analyze the predictive accuracy of the ProPhish and MGKmeans algorithms as the number of clusters varies. Table 9 shows the results for the Cat10, Cont10 and Mix10 datasets under the two classification algorithms (AdaBoost and SMO), with the number of clusters varied from 3 to 15. For the Cat10 dataset, the proposed algorithm is slightly better than the MGKmeans algorithm under AdaBoost, while under SMO its advantage ranges from 1% to 47% depending on the cluster size. For the Cont10 dataset, the proposed algorithm performs better than the MGKmeans
4.3.2.3. Sensitivity to cluster size. In this section, we analyze the predictive accuracy of the ProPhish and MGKmeans algorithms as the number of clusters varies. Table 9 shows the results of the experiment for the Cat10, Cont10 and Mix10 datasets under the two classification algorithms (i.e., Adaboost and SMO). In the experiment, we varied the number of clusters from 3 to 15. For the Cat10 dataset, the proposed algorithm is slightly better than the MGKmeans algorithm under the Adaboost classification algorithm; under the SMO algorithm, the improvement of the proposed algorithm over MGKmeans ranges from 1% to 47% depending on the cluster size. For the Cont10 dataset, the proposed algorithm outperforms the MGKmeans algorithm by 9%–65% under the Adaboost algorithm and by 2.3%–28% under the SMO algorithm. For the Mix10 dataset, the improvement of the proposed algorithm over the MGKmeans algorithm is in the range of 0.5%–45% under Adaboost and 0.3%–45% under SMO. In general, the proposed algorithm tends to perform better than the MGKmeans algorithm for cluster sizes of 10 or fewer; as the number of clusters exceeds 10, the performance of the two algorithms tends to converge.
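The cluster-size sweep itself is straightforward to replicate in outline: for each candidate number of clusters, derive the profiles, then score how well a classifier predicts the profile of unseen emails. The sketch below is ours; the synthetic data stands in for the email feature vectors, and the classifier choice mirrors the Adaboost column of Table 9.

    # Sketch of the sensitivity-to-cluster-size experiment: for each k,
    # derive k cluster profiles, then measure how well a classifier
    # predicts the profile of unseen samples. Data here is synthetic.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 7))     # stand-in email feature vectors

    for k in (3, 5, 10, 15):          # cluster sizes used in Table 9
        profiles = KMeans(n_clusters=k, n_init=10,
                          random_state=0).fit_predict(X)
        acc = cross_val_score(AdaBoostClassifier(random_state=0),
                              X, profiles, cv=5).mean()
        print(f"k = {k:2d}: mean accuracy {acc:.3f}")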
5.
Conclusion
Phishing email is a persistent problem, and solving it has proven to be very challenging. In this paper, we proposed a phishing email profiling and classification model by formulating the profiling problem as a clustering problem, using various features present in the phishing emails as feature vectors. We proposed the profiling email-born phishing (ProPhish) algorithm, which selects an optimal number of clusters based on the ratio size value. The proposed algorithm was then compared with the MGKmeans clustering approach on different types of data. The experimental results showed that the classification accuracy of the ProPhish algorithm is improved by adopting the ratio size procedure for selecting the number of clusters. The sensitivity tests showed that the proposed algorithm delivers better results than the MGKmeans algorithm. We plan to investigate how Kmeans and the Two-Step clustering approach can be integrated to further improve clustering accuracy.
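The ratio size criterion is defined in the body of the paper; assuming it denotes the ratio of the largest cluster's size to the smallest's (the quantity reported by Two-Step-style tools), the selection step could be sketched as below. This is one plausible reading, not a restatement of the authors' exact procedure.

    # Hedged sketch of cluster-count selection via a size-ratio heuristic.
    # ASSUMPTION: "ratio size" is read as largest/smallest cluster size;
    # the paper's exact rule may differ.
    from collections import Counter
    import numpy as np
    from sklearn.cluster import KMeans

    def size_ratio(labels):
        counts = Counter(labels).values()
        return max(counts) / max(min(counts), 1)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 7))            # stand-in feature vectors

    best_k, best_ratio = None, float("inf")
    for k in range(3, 16):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X)
        r = size_ratio(labels)
        if r < best_ratio:                   # prefer balanced clusterings
            best_k, best_ratio = k, r
    print(best_k, round(best_ratio, 2))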
references
Abawajy JH, Kelarev A. A multi-tier ensemble construction of classifiers for phishing email detection and filtering. In: Proceedings of Cyberspace Safety and Security; 2012. pp. 48–56.

Anti-Phishing Working Group. Phishing attack trends report [Online]. Available: http://www.antiphishing.org/resources/apwg-reports/.
Bacher J, Wenzig K, Vogler M. SPSS Two-Step cluster: a first evaluation. Erlangen-Nürnberg; 2004. Retrieved February 12, 2013, from: http://www.statisticalinnovations.com/products/Two-Step.pdf.

Bagirov AM. Modified global k-means algorithm for minimum sum-of-squares clustering problems. Pattern Recognit 2008;41(10):3192–9.

Basnet RB, Sung AH. Classifying phishing emails using confidence-weighted linear classifiers. In: International Conference on Information Security and Artificial Intelligence (ISAI 2010). IEEE; 2010. pp. 108–12.

Chandrasekaran M, Shankaranarayanan V, Upadhyaya S. CUSP: customizable and usable spam filters for detecting phishing emails. In: NYS Symposium, Albany, NY; 2001. pp. 1–8.

Cluster (Two-Step clustering algorithms) – IBM – United States; (n.d.). Retrieved from: http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fidh_Two-Step_main.htm.

Fette I, Sadeh N, Tomasic A. Learning to detect phishing emails. In: Proceedings of the 16th International Conference on World Wide Web. New York, USA: ACM; 2007. pp. 649–56.

Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 1997;55(1):119–39.

Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I. The Weka data mining software: an update. SIGKDD Explor 2009;11:10–8.

Hamid IRA, Abawajy JH. Phishing email feature selection approach. In: IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). IEEE; 2011. pp. 916–21.

Hamid IRA, Abawajy JH. Hybrid feature selection for phishing email detection. In: The 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP). Berlin, Germany: Springer; 2011. pp. 266–75.

Hamid IRA, Abawajy JH. Profiling phishing email based on clustering approach. In: IEEE 12th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). IEEE; 2013.

Hamid IRA, Abawajy JH, Kim T. Using feature selection and classification scheme for automating phishing email detection. Stud Inf Control 2013;22(1):61–70.

Hoanca B, Mock K. Using market basket analysis to estimate potential revenue increases for a small university bookstore. In: Conference for Information Systems Applied Research (CONISAR 2011), Wilmington, North Carolina, USA, vol. 4(1822); 2011. pp. 1–11.

Islam R, Abawajy JH. Multi-tier phishing detection and filtering approach. J Netw Comput Appl 2013;36(1):324–35.

Juels A, Jakobsson M, Jagatic TN. Cache cookies for browser authentication. In: Proceedings of the 2006 IEEE Symposium on Security and Privacy, Washington, DC, USA. IEEE Computer Society; 2006. pp. 301–5.

Kang BH, Yearwood JL, Kelarev AV, Richards D. Consensus clustering and supervised classifications for profiling phishing emails in internet commerce security. In: Knowledge Management and Acquisition for Smart Systems and Services. Lecture Notes in Computer Science, vol. 6232. Berlin, Heidelberg: Springer; 2010. pp. 235–46.

Kelarapura R, Antonio AN, Zhang Z. Profiling users in a 3G network using hourglass co-clustering. ACM; 2010. pp. 341–52.
Khonji M, Jones A, Iraqi Y. A study of feature subset evaluators and feature subset searching methods for phishing classification. In: Proceedings of the 8th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS '11). New York, USA: ACM; 2011. pp. 135–44.

Kirda E, Kruegel C. Protecting users against phishing attacks. Comput J 2006;49(5):554–61.

Marchal S, Francois J, State R, Engel T. Proactive discovery of phishing related domain names. In: Research in Attacks, Intrusions and Defences. Lecture Notes in Computer Science, vol. 7462; 2012. pp. 190–209.

Platt JC. Sequential minimal optimization: a fast algorithm for training support vector machines. Microsoft Research Tech. Report MSR-TR-98-14, Microsoft, Redmond, Wash; 1998. pp. 1–21.

Toolan F, Carthy J. Feature selection for spam and phishing detection. In: Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit (eCrime); 2010. pp. 1–12.

Xiang G, Hong J, Rose CP, Cranor L. CANTINA+: a feature-rich machine learning framework for detecting phishing web sites. ACM Trans Inf Syst Secur (TISSEC) 2011;14(2):1–28. http://dx.doi.org/10.1145/2019599.2019606.

Xu K, Zhang Z, Bhattacharyya S. Internet traffic behavior profiling for network security monitoring. IEEE/ACM Trans Netw 2008;16(6):1241–52.

Xu KS, Kliger M, Chen Y, Woolf PJ, Hero A. Revealing social networks of spammers through spectral clustering. In: Proceedings of the IEEE International Conference on Communications, Dresden, Germany; 2009. http://dx.doi.org/10.1109/ICC.2009.5199418.

Yearwood JL, Webb D, Ma L, Vamplew P, Ofoghi B, Kelarev AV. Applying clustering and ensemble clustering approaches to phishing profiling. In: Proceedings of the 8th Australasian Data Mining Conference (AusDM '09); 2009. pp. 25–34.

Yearwood JL, Mammadov M, Webb D. Profiling phishing activity based on hyperlinks extracted from phishing email. Soc Netw Anal Min 2012;2(1):5–16.

Isredza Rahmi A. Hamid is currently attached to Deakin University, Australia, as a postgraduate student in the Faculty of Science, Engineering and Built Environment. She is a faculty member of the Information Security Department, University Tun Hussein Onn Malaysia. She received her M.Sc. degree in Information Technology from MARA University of Technology (Malaysia) and a Bachelor's degree in Information Technology from Northern University (Malaysia). She has published several papers in refereed conferences as well as journals. Her research interests include network security (phishing detection and profiling).

Jemal H. Abawajy is a Professor at Deakin University, Australia. He is a senior member of the IEEE and leads the Parallel and Distributed Computing Lab. Prof. Abawajy is actively involved in funded research and has published more than 250 refereed articles. He is currently the principal supervisor of several PhD students. Prof. Abawajy is an editorial board member of several journals and has guest-edited several journals. He has been a member of the organizing committees of over 200 international conferences, serving in various capacities including chair, general co-chair, vice-chair, best paper award chair, publication chair, session chair and program committee member.