Testing Phishing Detection Criteria and Methods
Carine G. Webber, Maria de Fátima Webber do Prado Lima
University of Caxias do Sul, Computer Science Center, Caxias do Sul, RS, Brazil
{cgwebber, mfwplima}@ucs.br
Abstract. Phishing attacks have increased in recent years despite the use of anti-phishing filters. This is mainly due to the diversity of phishers' attempts and to improved targeting of potential victims on the Internet. Phishers usually employ social engineering techniques to convince users to supply confidential data, using email as the dissemination vehicle. They disguise their attacks as messages from trustworthy organizations by cloning websites. According to international monitoring, phishing causes real harm mainly to banks and government institutions. This paper proposes a set of features for detecting phishing attacks and employs data mining techniques to evaluate and compare them. In this work we have used public corpora of phishing messages. As a main result we have defined phishing detection criteria, which were evaluated with several algorithms; the most accurate results were achieved using neural networks and decision trees.
Keywords: Security, Privacy, Phishing Detection.
1 Introduction
A phishing scam is a kind of Internet fraud in which users have their identities stolen. A phishing attack usually starts with a false email message demanding users' personal data [1,2,3]. Phishers benefit from the fact that SMTP (Simple Mail Transfer Protocol) was not originally designed to protect against fraud. As a consequence, an email consists of a header and a message body that phishers can easily manipulate. Known and trustworthy organizations are generally impersonated to disguise the fake message [4].

Lininger [5] highlights that phishers also benefit from the absence of mechanisms for authenticity checking. However, there are three initiatives for sender validation: the Sender Policy Framework (SPF), SenderID and DomainKeys. SPF lets a domain publish in the DNS (Domain Name System) which servers are authorized to send mail on its behalf, so that messages from other sources can be rejected. SenderID is a royalty-free Microsoft standard based on SPF, which checks the message's and the server's domains. DomainKeys detects a forged sender by means of a digital signature: using a specific message header, the destination server matches the sender's public key to the sender's official domain.

Nevertheless, each of the three techniques has weaknesses. First, SPF data can be registered for a domain name that a phisher has acquired or controls administratively; in that case, the email could be classified as authentic. SenderID, on the other hand, faces constraints on adoption due to a defective syntax and licensed technology. Finally, DomainKeys presents some vulnerabilities, since it could be possible to forge a key pair to legitimize an email message.

To mount successful attacks, phishers usually collect email addresses and send messages to the greatest possible number of users. Thousands (if not millions) of phishing emails are sent daily. Although some of them will not be received due to filters or invalid addresses, it is assumed that 0.01% of users will read and believe the message. This means that if phishers send one million messages, one hundred users will believe the message, click on its content and send personal information. After the click, the user is directed to a fake website. James [4] gives an example of one technique used by phishers. It is based on making a mirror copy of a commercial bank website and using CGI (Common Gateway Interface) code to send the collected data to an email account. Immediately afterward, the victim is guided to the real website through a URI (Uniform Resource Identifier) handled in such a way that the victim does not notice the difference between the two websites.

Although existing filters prevent users from receiving some phishing emails, anyone can attest from their own mailbox that these filters are imperfect. In order to propose novel approaches to filtering phishing messages, we have tested the ability of several machine learning algorithms to detect phishing emails.

This article is organized in five sections. Section 2 describes the main approaches to detecting phishing attacks in related work. Section 3 presents the detection criteria and the machine learning algorithms used in our project. Section 4 presents the results of this work and Section 5 points out some conclusions.
2 Related Work
This article focuses on the approach to phishing detection in which an email message constitutes the initial attack vector [6]. In this context, the predominant solution is the analysis of the body of the email. Body analysis covers the general structure of the email, its links and its semantics. For the general structure, it is verified whether the email is in HTML format and whether it contains scripts and fill-in forms [7, 8, 9]. Links can sometimes show one destination while disguising a fraudulent one. Semantic analysis enriches detection because the wording carries the context of the message. A few works include the analysis of the email header as an additional parameter; there, the field generally examined is the sender address.

All these analyses nevertheless present some vulnerabilities. Email servers do not always have tools for validating the sender, for instance. In some cases, forged emails are so close to real messages that they escape filtering and are delivered to mailboxes.

The regular use of HTML in emails facilitates techniques that are part of the phishers' modus operandi, since phishers need to direct users to a fake website in order to capture their personal data. To reach their goals, phishers try to drive users to fake links presented as official ones. This disguise is supported by HTML itself: a phisher creates an HTML email whose link text shows the URI of a legitimate website, while the href attribute of the link specifies the actual destination.

Scripts, especially JavaScript, are singled out as an additional item to verify in the body of an email, since they can also be used to hide the real destination of a URI shown as a link [6,7,8]. The status bar of an email client or web browser normally shows the real destination of a link when the user hovers the mouse over it. If the email visually shows the URI of a legitimate website, the suggested destination can be validated against the status bar. However, JavaScript provides functions that can modify the status bar of these applications so that it shows the URI of the legitimate website instead of the URI of the cloned page [7].

Some authors recognize that suspicious links have the following characteristics: the destination URI is inconsistent with the visible one, or the destination URI contains an IP address [6, 7, 8, 9, 10, 11]. The presence of a form in the email body is pointed out as another indication of a phishing attack, since the victim may fill in the form with confidential data and submit it [8,9].

Some researchers propose examining the links contained in the email. Links (URIs) connect the email to the forged website or cloned page and deserve special attention. Many cloned pages are hosted on suspicious computers that have no domain name, only an IP address. For this reason, the presence of links based on IP addresses inside the email constitutes a strong indication of a phishing attack [6].

Some researchers have proposed semantic analysis of the email as an additional approach to phishing email detection.
The most salient aspect of the social engineering employed is persuasion, which creates in potential victims a feeling of threat, concern or urgency. In such emails, words like risk, suspended and identify are very common [11]. For this reason, Ma [9] considers it useful to extract keywords from these sentences in order to detect the attack. The words commonly found in those analyses are: security, expired, account, login, non-authorized. In the same direction, some works have extracted the most frequent words, such as: account, access, bank, credit, click, identity, inconvenience, information, limited, log, minutes, password, recently, risk, social, security, service and suspended [11].

Finally, other techniques, such as browser toolbars, were studied by Fette [6]. Anti-phishing toolbars warn the user when navigating to a phishing website. Although they sound useful, the toolbars have limited information available for detection.
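The two link criteria cited in this section, a visible URI that disagrees with the href destination and an IP address in the destination, can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the implementation used by any of the cited works: the class AnchorCollector and the functions host_of, is_ip_host, looks_like_url and suspicious_links are hypothetical names, only Python's standard library is used, and hosts are compared by exact string match (so www.example.com and example.com would count as different hosts).

```python
import ipaddress
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url):
    """Return the lowercase host of a URL, or '' if none can be parsed."""
    if not re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", url):
        url = "http://" + url  # visible link text often omits the scheme
    try:
        return (urlparse(url).hostname or "").lower()
    except ValueError:
        return ""


def is_ip_host(host):
    """True if the host is a literal IP address rather than a domain name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def looks_like_url(text):
    """True if the visible link text itself reads like a web address."""
    return bool(re.match(r"^(https?://|www\.)\S+$", text, re.IGNORECASE))


def suspicious_links(html):
    """Report, for each link, the two structural criteria described above."""
    parser = AnchorCollector()
    parser.feed(html)
    findings = []
    for href, text in parser.links:
        target = host_of(href)
        findings.append({
            "href": href,
            "text": text,
            # Visible text names one host, href points to another.
            "host_mismatch": looks_like_url(text) and host_of(text) != target,
            # Destination is a bare IP address.
            "ip_in_href": is_ip_host(target),
        })
    return findings
```

For a message whose link text reads www.mybank.example while its href points to http://192.168.10.5/login, this sketch flags both criteria. A link whose text is plain wording such as "Click here" is never reported as a mismatch, because there is no visible address to compare against.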
3 Defining Phishing Detection Criteria
In order to analyze solutions to the problem of phishing detection, we propose the use of machine learning algorithms to recognize phishing attacks delivered by email. We have followed the knowledge discovery method proposed by Fayyad [12].

Three public corpora of messages were used in this work. The first one is provided by Nazario [13]. It contains 414 phishing messages collected from 2004 to 2005, 434 phishing messages from 2005, 1423 messages from 2005 to 2006 and 2279 messages from 2006 to 2007. The second corpus is Ling-Spam, which is composed of 2412 messages from 2000 [14]. In order to enlarge the scope of messages, we have also incorporated the SpamAssassin base (easy and hard messages) from 2003 (SpamAssassin, 2011). Messages that are hard to classify make up 15.56% of the whole dataset.

We have carried out a text preprocessing analysis to find the most frequent words in the three corpora. Due to the dynamics of phishing attacks, great attention has been devoted to this task. The most frequent words and their variations (marked with *) were: access*, account*, bank*, billing, click*, confirm*, dear, inform*, link, member, prod*, suspen*, updat*, user, verif*. In addition, we have taken four meaningful structural characteristics usually associated with phishing messages: the message is coded in HTML (HyperText Markup Language), the href attribute is present in the message, the real web link is different from the written link, and there is a numeric IP address in the real link.

After the preprocessing step, we built datasets based on 19 attributes: 15 related to the content and 4 to the message structure. For the classification algorithms we added a class attribute indicating whether the message is a phishing or a legitimate one. The datasets were built automatically by scanning the selected corpus messages. The training dataset consists of 514 instances (122 instances of phishing messages and 392 instances of legitimate messages).

We have used the Weka data mining package [15]. We have employed four categories of algorithms: decision trees (ID3 and J48), neural networks (Multilayer Perceptron), Bayesian networks (Bayes Net and Naive Bayes) and clustering (k-means and Expectation-Maximization). For testing we have used 10-fold cross-validation in order to obtain mean classification errors and confusion matrices. The next section presents the comparative results we have obtained.
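To make the 19-attribute representation more concrete, the following Python sketch turns one message into a row of 15 content attributes, 4 structural attributes and a class label, and writes the rows in Weka's ARFF text format. Several points are assumptions for illustration only: the text does not say whether the content attributes are binary or counts (binary presence is assumed here), the four structural flags are taken as already computed, and the names TERMS, STRUCTURAL, term_pattern, content_attributes, build_instance and to_arff are hypothetical.

```python
import re

# The 15 content terms listed above; a trailing "*" marks a prefix match.
TERMS = ["access*", "account*", "bank*", "billing", "click*", "confirm*",
         "dear", "inform*", "link", "member", "prod*", "suspen*", "updat*",
         "user", "verif*"]

# Hypothetical names for the four structural attributes.
STRUCTURAL = ["is_html", "has_href", "link_mismatch", "ip_in_link"]


def term_pattern(term):
    """Compile a case-insensitive pattern for one term or stem."""
    if term.endswith("*"):
        # Stem: match any word that starts with the given prefix.
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    # Plain term: match the whole word only.
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


PATTERNS = [term_pattern(t) for t in TERMS]


def content_attributes(text):
    """Return 15 binary flags, one per term (assumption: presence, not count)."""
    return [1 if p.search(text) else 0 for p in PATTERNS]


def build_instance(text, is_html, has_href, link_mismatch, ip_in_link, label):
    """Assemble the 19 attributes followed by the class label."""
    structural = [int(is_html), int(has_href), int(link_mismatch), int(ip_in_link)]
    return content_attributes(text) + structural + [label]


def to_arff(instances, relation="phishing"):
    """Serialize instances as ARFF text with nominal {0,1} attributes."""
    names = [t.rstrip("*") for t in TERMS] + STRUCTURAL
    lines = ["@relation " + relation]
    lines += ["@attribute %s {0,1}" % name for name in names]
    lines.append("@attribute class {phishing,legitimate}")
    lines.append("@data")
    lines += [",".join(str(v) for v in row) for row in instances]
    return "\n".join(lines)


example = build_instance(
    "Dear member, your account was suspended. "
    "Click the link to verify your information.",
    is_html=True, has_href=True, link_mismatch=True, ip_in_link=False,
    label="phishing",
)
arff_text = to_arff([example])
```

Keeping every attribute nominal with the values 0 and 1 is a design choice of this sketch; it suits algorithms such as ID3 that work on nominal data, but other encodings are equally plausible.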
4 Comparing Phishing Detection Accuracy
Weka is a valuable tool for evaluating classification and clustering techniques. It also implements evaluation methods through its Experimenter tool. For this study we ran several algorithms considering their different parameter settings. Table 1 presents the ten best results, in order, obtained by applying decision trees (ID3 and J48), neural networks (Multilayer Perceptron), Bayesian networks (Bayes Net and Naïve Bayes) and clustering (k-means and E-M) algorithms.
Table 1. Ordered results from phishing messages analysis
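| Algorithm | Correct (N) | Correct (%) | Incorrect (N) | Incorrect (%) | False Positive (N) | False Positive (%) | False Negative (N) | False Negative (%) |
|---|---|---|---|---|---|---|---|---|
| Multilayer Perceptron | 496 | 96.50 | 18 | 3.50 | 7 | 1.36 | 11 | 2.14 |
| J48 | 488 | 94.94 | 26 | 5.06 | 10 | 1.95 | 16 | 3.11 |
| ID3 | 485 | 94.36 | 29 | 5.64 | 16 | 3.11 | 13 | 2.53 |
| BayesNet TAN | 479 | 93.19 | 35 | 6.81 | 21 | 4.09 | 14 | 2.72 |
| BayesNet SA | 477 | 92.80 | 37 | 7.20 | 20 | 3.89 | 17 | 3.31 |
| Expectation-Maximization | 461 | 89.69 | 53 | 10.31 | 47 | 9.14 | 6 | 1.17 |
| BayesNet K2 | 459 | 89.30 | 55 | 10.70 | 41 | 7.98 | 14 | 2.72 |
| BayesNet HC | 459 | 89.30 | 55 | 10.70 | 41 | 7.98 | 14 | 2.72 |
| Naïve Bayes | 459 | 89.30 | 55 | 10.70 | 41 | 7.98 | 14 | 2.72 |
| K-means | 433 | 84.24 | 81 | 15.76 | 38 | 7.39 | 43 | 8.37 |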
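The percentages in Table 1 are all taken over the 514 instances of the dataset. The short Python sketch below reproduces them from the false positive and false negative counts for two rows of the table, and additionally relates each error type to its own class. The per-class figures are an assumption: they presume that the split of 122 phishing and 392 legitimate messages from Section 3 applies to the evaluated instances. The names pct and summarize are hypothetical.

```python
TOTAL = 514        # instances in the dataset (Section 3)
PHISHING = 122     # phishing instances (Section 3)
LEGITIMATE = 392   # legitimate instances (Section 3)


def pct(part, whole):
    """Percentage rounded to two decimals, as in Table 1."""
    return round(100.0 * part / whole, 2)


def summarize(false_pos, false_neg):
    """Derive the Table 1 columns from the two error counts."""
    incorrect = false_pos + false_neg
    correct = TOTAL - incorrect
    return {
        "correct": correct,
        "correct_pct": pct(correct, TOTAL),
        "incorrect_pct": pct(incorrect, TOTAL),
        "fp_pct_of_total": pct(false_pos, TOTAL),
        "fn_pct_of_total": pct(false_neg, TOTAL),
        # Assumption: error rates relative to the class they come from.
        "fp_pct_of_legitimate": pct(false_pos, LEGITIMATE),
        "fn_pct_of_phishing": pct(false_neg, PHISHING),
    }


# Error counts copied from Table 1.
mlp = summarize(false_pos=7, false_neg=11)
j48 = summarize(false_pos=10, false_neg=16)
```

Under these assumptions, the 11 false negatives of the Multilayer Perceptron are 2.14% of all messages but about 9% of the 122 phishing messages, which is worth keeping in mind when reading the false negative column.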
The best overall result was achieved by the Multilayer Perceptron algorithm. The neural network correctly classified 96.5% of the data, producing only 7 false positives (legitimate messages classified as phishing) and 11 false negatives (phishing messages classified as legitimate). Although the neural network produces a model that cannot be inspected, it could certainly be applied in an anti-phishing filter.

Among the decision tree algorithms, J48 achieved the best result, with an accuracy of 94.94%. Even though ID3 produced a lower accuracy than J48, its number of false negatives was lower (13 against 16). From the security point of view, this means that ID3 may be better at preventing phishing messages from reaching users' mailboxes. On the other hand, we consider that false positive classifications cause less damage to the user.

When running Bayesian probabilistic algorithms, we used the Bayes Net and Naïve Bayes classifiers. The Weka implementation of Bayes Net produces a state space of probabilistic networks, and the classifier must use an optimization algorithm to explore this state space in order to maximize its results. For this reason, we tested the Bayes Net algorithm with the following search approaches: K2, hill climbing (HC), simulated annealing (SA) and TAN (Tree Augmented Naïve Bayes). All of them were applied with the SimpleEstimator parameter. In terms of results, both Naïve Bayes and Bayes Net performed quite well, although their numbers of false positives were higher than those of the previous algorithms. The best configuration for Bayes Net was obtained using the TAN method. Their false negative percentages, however, were very close to those obtained using decision tree algorithms.

Clustering algorithms were tested in order to analyze the homogeneity of the dataset with respect to phishing and legitimate messages. We tested the simple K-means algorithm and the Expectation-Maximization (E-M) method.
E-M found two partitions. The first cluster grouped 163 email messages (31%) and the second grouped 351 email messages (69%). The K-means clustering algorithm also found two clusters: the first composed of 117 email messages (23%) and the second composed of 397 email messages (77%). With respect to the original classification, both E-M and K-means achieved high accuracy in terms of correctly classified data, as seen in Table 1. However, both produced a high percentage of false positives, which indicates a high probability that a legitimate message does not arrive in a user's inbox.
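Comparing clusters with the original classification requires some rule for deciding which class each cluster represents. The text does not state which rule was used, so the Python sketch below assumes a simple one: each cluster is assigned the class of the majority of its members, and errors are then counted as for a classifier. The function name evaluate_clusters and the toy data are hypothetical.

```python
from collections import Counter


def evaluate_clusters(cluster_ids, labels, positive="phishing"):
    """Map each cluster to its majority class and count the resulting errors."""
    # Count the true labels that fall into each cluster.
    by_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1

    # Assumption: a cluster stands for the class most frequent among its members.
    mapping = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}

    correct = false_pos = false_neg = 0
    for cluster, label in zip(cluster_ids, labels):
        predicted = mapping[cluster]
        if predicted == label:
            correct += 1
        elif predicted == positive:
            false_pos += 1   # legitimate message placed in the phishing cluster
        else:
            false_neg += 1   # phishing message placed in the legitimate cluster
    return {
        "mapping": mapping,
        "correct": correct,
        "false_positive": false_pos,
        "false_negative": false_neg,
    }


# Toy data: seven messages split into two clusters.
toy = evaluate_clusters(
    [0, 0, 0, 1, 1, 1, 1],
    ["phishing", "phishing", "legitimate",
     "legitimate", "legitimate", "legitimate", "phishing"],
)
```

In the toy data, cluster 0 is mapped to phishing and cluster 1 to legitimate, so the legitimate message in cluster 0 counts as a false positive and the phishing message in cluster 1 as a false negative.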
5 Conclusions
Phishing scams have become a threat to Internet and email users. Although anti-phishing filters try to deal with them, they still misclassify emails. This is mainly due to the diversity of phishers' attempts and to improved targeting of potential victims on the Internet. In this article we have proposed some criteria for analyzing email messages in order to build more accurate detection approaches. We have identified semantic and structural elements that can be used to classify an email as phishing or legitimate. Overall, all the tested algorithms produced good classification results, indicating the coherence of the attributes chosen to compose the dataset. This also confirms that a small set of attributes can successfully be used to detect phishing messages. The Multilayer Perceptron algorithm produced the most accurate results in our tests (96.5% correct classification).
References
1. Bradley, T. Essential Computer Security: Everyone's Guide to E-mail, Internet, and Wireless Security. [S.l.]: Syngress. (2006)
2. Gregg, M. Hack the Stack: Using Snort and Ethereal to Master the 8 Layers of an Insecure Network. [S.l.]: Syngress. (2006)
3. Cajani, F.; Costabile, G.; Mazzaraco, G. Phishing e Furto d'Identità Digitale. [S.l.]: Giuffrè. (2008)
4. James, L. Phishing Exposed. [S.l.]: Syngress. (2005)
5. Lininger, R.; Vines, R. D. Phishing: Cutting the Identity Theft Line. [S.l.]: Wiley. (2005)
6. Fette, I.; Sadeh, N.; Tomasic, A. Learning to Detect Phishing Emails. ACM, p.649-656. (2007)
7. Suriya, R.; Saravanan, K.; Thangavelu, A. An Integrated Approach to Detect Phishing Mail Attacks: A Case Study. ACM, p.193-199. (2009)
8. Yearwood, J.; Mammadov, M.; Banerjee, A. Profiling Phishing Emails Based on Hyperlink Information. IEEE, p.120-127. (2010)
9. Ma, L.; Ofoghi, B.; Watters, P.; Brown, S. Detecting Phishing Emails Using Hybrid Features. IEEE, p.493-497. (2009)
10. Yu, W. D.; Nargundkar, S.; Tiruthani, N. Phishcatch - A Phishing Detection Tool. IEEE, p.451-456. (2009)
11. Chandrasekaran, M.; Narayanan, K.; Upadhyaya, S. Phishing E-mail Detection Based on Structural Properties. 9th New York Cyber Security Conference, p.2-8. (2009)
12. Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. [S.l.]: MIT Press. (1996)
13. Nazario, J. Phishing Dataset. http://monkey.org/jose/wiki/doku.php?id=phishingcorpus
14. Androutsopoulos, I. Ling-spam. http://labs-repos.iit.demokritos.gr/skel/i-config/downloads/
15. Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. 2.ed. [S.l.]: Elsevier. (2005)