
ICCCCT’10

ANTI-PHISHING DETECTION OF PHISHING ATTACKS USING GENETIC ALGORITHM

V.Shreeram, SASTRA UNIVERSITY KUMBAKONAM

M.Suban, SASTRA UNIVERSITY KUMBAKONAM

P.Shanthi, Assistant Professor, SASTRA UNIVERSITY KUMBAKONAM

K.Manjula, Assistant Professor, SASTRA UNIVERSITY KUMBAKONAM

Abstract - An approach to detecting phishing hyperlinks using a rule-based system formed by a genetic algorithm is proposed, which can be used as part of an enterprise anti-phishing solution. A legitimate webpage owner can use this approach to search the web for suspicious hyperlinks. In this approach, a genetic algorithm is used to evolve rules that differentiate phishing links from legitimate links. Using parameters such as the evaluation function, crossover, and mutation, the GA generates a ruleset that matches only phishing links. This ruleset is stored in a database, and a link is reported as a phishing link if it matches any of the rules in the rule-based system, which keeps users safe from phishers. Preliminary experiments show that this approach detects phishing hyperlinks effectively, with minimal false negatives, at a speed adequate for online application.

Keywords: Anti Phishing, Crossover, Mutation, Rule based system.

I. INTRODUCTION

Phishing is a criminal trick for stealing victims' personal information by sending them spoofed emails urging them to visit a forged webpage that looks like the genuine page of a legitimate company and asks the recipients to enter personal information such as credit card numbers and passwords. The victims may ultimately suffer financial or other losses. According to the reports of the Anti-Phishing Working Group, the number of phishing attacks is increasing monthly, and the attacks can usually convince a share of the phishing email recipients to respond to them. Companies that provide Internet transaction services are obliged to keep those services safe. They may be expected to shoulder this responsibility and take the initiative to actively detect phishing emails and phishing websites, and then prevent potential phishing attacks.

In this paper, we propose an approach to detecting phishing webpages using a rule-based system formed by a genetic algorithm. One important feature of phishing webpages is that their hyperlinks resemble those of the genuine pages; otherwise, the victims would not believe them. Hence, a legitimate webpage owner can check suspicious hyperlinks by sending each hyperlink to the rule-based system. In the rule-based system, rules are formed such that if the hyperlink sent matches a rule, the owner will be alerted and can take immediate action to prevent potential phishing attacks.

A. Introduction to Genetic Algorithm

Genetic algorithms are a family of computational models based on principles of evolution and natural selection. These algorithms convert a problem in a specific domain into a model by using a chromosome-like data structure and evolve the chromosomes using selection, recombination, and mutation operators. The range of applications that can make use of genetic algorithms is quite broad (Sinclair, Pierce, and Matzner 1999; see also Whitley, 1994). In computer security applications, they are mainly used for finding optimal solutions to a specific problem.

The process of a genetic algorithm usually begins with a randomly selected population of chromosomes. These chromosomes are representations of the problem to be solved. According to the attributes of the problem, different positions of each chromosome are encoded as bits, characters, or numbers. These positions are sometimes referred to as genes and are changed randomly within a range during evolution. The set of chromosomes during a stage of evolution is called a population. An evaluation function is used to calculate the “goodness” of each chromosome. During evolution, two basic operators, crossover and mutation, are used to simulate the natural reproduction and mutation of species. The selection of chromosomes for survival and combination is biased towards the fittest chromosomes.

Figure 1 shows the structure of a simple genetic algorithm. It starts with a randomly generated population, evolves through selection, recombination (crossover), and mutation. Finally, the best individual (chromosome) is picked out as the final result once the optimization criterion is met [16].
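The loop in Figure 1 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the chromosome layout (eight hexadecimal genes), the toy fitness function with its hypothetical target, tournament selection, elitism, and all parameter values are assumptions chosen only to make the example runnable.

```python
import random

GENES = "0123456789abcdef"   # assumed gene alphabet (hex digits)
CHROMOSOME_LEN = 8           # assumed chromosome length
TARGET = "d10b0a01"          # hypothetical optimum, used only by the toy fitness


def fitness(chrom):
    """Toy fitness: fraction of genes that equal the hypothetical target."""
    return sum(a == b for a, b in zip(chrom, TARGET)) / CHROMOSOME_LEN


def select(population, rng):
    """Tournament selection: the fitter of two random chromosomes survives."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def crossover(p1, p2, rng):
    """One-point crossover: head of one parent joined to the tail of the other."""
    point = rng.randrange(1, CHROMOSOME_LEN)
    return p1[:point] + p2[point:]


def mutate(chrom, rate, rng):
    """Replace each gene with a random one with probability `rate`."""
    return "".join(rng.choice(GENES) if rng.random() < rate else g for g in chrom)


def run_ga(pop_size=40, generations=200, mutation_rate=0.05, seed=0):
    """Evolve a random population and return the best chromosome found."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(GENES) for _ in range(CHROMOSOME_LEN))
                  for _ in range(pop_size)]
    for _ in range(generations):
        best = max(population, key=fitness)
        if fitness(best) == 1.0:          # optimization criterion met
            break
        # Keep the best individual (elitism) and breed the rest.
        children = [best]
        while len(children) < pop_size:
            child = crossover(select(population, rng), select(population, rng), rng)
            children.append(mutate(child, mutation_rate, rng))
        population = children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = run_ga()
    print(best, fitness(best))
```

Because the best individual is always copied into the next generation, the best fitness in the population never decreases from one generation to the next. A real rule-evolving GA would replace the toy fitness with an evaluation against labeled URL data.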

978-1-4244-7770-8/10/$26.00 ©2010 IEEE

The basic genetic algorithm is straightforward, but practical implementations can be complex. For example, the crossover operation may be a one-point crossover or a multiple-point crossover. There are also parallel implementations of genetic algorithms. A series of parameters (for example, mutation rate, crossover rate, population size, chromosome size, number of generations, and how the selection is done) also needs to be considered together with the specific selection process. The final goal is to search the solution space in a relatively short period of time [16].

II. GENETIC ALGORITHM APPLIED TO PHISHING DETECTION

Applying genetic algorithms to phishing detection appears to be a promising area. We discuss the motivation and implementation details in this section.

A. Overview

Genetic algorithms can be used to evolve simple rules for preventing phishing attacks. These rules are used to differentiate normal websites from anomalous websites. Anomalous websites here are those with a probability of being involved in phishing attacks. The rules stored in the rule base are usually in the following form:

if { condition } then { act }

For the problems presented above, the condition usually refers to a match between the URL of the current website link in the e-mail and the rules in PADPS (Phishing Attack Detection and Prevention System), which indicates the probability of a phishing attack. The act field usually refers to an action defined by the security policy, such as reporting an alert to the browser through the status field. For example, a rule can be defined as:

if { The IP address of the URL in the received e-mail finds any match in the Ruleset } then { Phishing e-mail }

This rule can be explained as follows: if the URL in the e-mail contains an IP address and that address does not match the defined Rule Set for White List, then the received mail is a phishing mail, so the status is set to phishing e-mail.

The final goal of applying GA is to generate rules that match only the anomalous URLs of websites. These rules are tested on historical URLs and are used to filter new URLs to find suspected phishing attacks. In this implementation, the data used for GA is a pre-classified data set that differentiates normal URLs (websites) from anomalous ones. This data set is gathered from the APWG (Anti-Phishing Working Group) [2] and is manually classified based on experts’ knowledge. It is used for fitness evaluation during the execution of GA. By starting GA with only a small set of randomly generated rules, we can generate a larger data set that contains rules for PADPS. These rules are “good enough” solutions for GA and can be used for filtering new phishing attacks.

B. Data Representation

To fully exploit the suspicious level, we need to examine all fields related to a specific URL in a phishing e-mail. Example rule in the ruleset:

If (the IP address of the URL in the received e-mail is equal to 209.11.??.??)
Then Phishing e-mail
End if

The example chromosome structure for the above-defined rule is (d, 1, 0, b, *, *, *, *). There are eight genes in each chromosome. For simplicity, we have used hexadecimal representation for the IP address: 209 is d1 and 11 is 0b in hexadecimal, and each * is a wildcard for an unspecified digit. The actual validity of this rule is examined by matching it against the historical data set, which comprises URLs marked as either phish-mail or not. If the rule is able to find a phishing attack, a bonus is given to the current chromosome. Otherwise, a penalty is given to it.
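To illustrate the data representation, the sketch below encodes a dotted-quad IP address as eight hexadecimal genes and checks it against a rule chromosome in which * is a wildcard. The function names and the sample addresses are hypothetical; only the chromosome (d, 1, 0, b, *, *, *, *) for 209.11.??.?? and the two status labels come from the text.

```python
def ip_to_genes(ip):
    """Encode a dotted-quad IPv4 address as eight hexadecimal genes."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return tuple("".join(f"{o:02x}" for o in octets))


def rule_matches(rule, genes):
    """A rule matches when every non-wildcard gene equals the address gene."""
    return len(rule) == len(genes) and all(
        r == "*" or r == g for r, g in zip(rule, genes)
    )


def classify(ip, ruleset):
    """Report 'Phishing e-mail' if any rule in the ruleset matches the IP."""
    genes = ip_to_genes(ip)
    hit = any(rule_matches(rule, genes) for rule in ruleset)
    return "Phishing e-mail" if hit else "Not-Phishing"


RULE = ("d", "1", "0", "b", "*", "*", "*", "*")  # 209.11.??.?? from the text

print(ip_to_genes("209.11.7.42"))      # ('d', '1', '0', 'b', '0', '7', '2', 'a')
print(classify("209.11.7.42", [RULE])) # Phishing e-mail
print(classify("192.0.2.1", [RULE]))   # Not-Phishing
```

Using one gene per hexadecimal digit keeps every chromosome at a fixed length of eight, so one-point crossover and per-gene mutation can operate on rules without any parsing.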

C. Parameters in Genetic Algorithm

There are many parameters to consider for the application of GA. Each of these parameters heavily influences the effectiveness of the genetic algorithm. We will discuss the methodology and related parameters in the following section.
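As a rough preview of how the pieces of the evaluation function defined in this section fit together (outcome, Δ, penalty, fitness), here is a small worked sketch. The weight values, the suspicious level, the ranking, and the function names are illustrative assumptions; the paper does not give concrete values.

```python
def outcome(matched, weights):
    """Weighted sum over the genes; each matched flag is 0 or 1."""
    if len(matched) != len(weights):
        raise ValueError("matched and weights must have the same length")
    return sum(m * w for m, w in zip(matched, weights))


def fitness(matched, weights, suspicious_level, ranking):
    """fitness = 1 - penalty, with penalty = (delta * ranking) / 100."""
    delta = abs(outcome(matched, weights) - suspicious_level)
    penalty = (delta * ranking) / 100
    return 1 - penalty


# Assumed weights: genes in the same field of the IP address share a weight,
# and the leading fields count more. These numbers are made up for the example.
WEIGHTS = [4, 4, 3, 3, 2, 2, 1, 1]

# A chromosome whose first four genes match the pre-classified record.
matched = [1, 1, 1, 1, 0, 0, 0, 0]

print(outcome(matched, WEIGHTS))                                  # 14
print(fitness(matched, WEIGHTS, suspicious_level=16, ranking=5))  # 0.9
```

With these assumed values, Δ = |14 - 16| = 2 and the penalty is (2 × 5) / 100 = 0.1, so the fitness is 0.9. The fitness stays between 0 and 1 only while Δ × ranking does not exceed 100, so larger weights or rankings would need rescaling.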


Evaluation function

The evaluation function is one of the most important parameters in a genetic algorithm. The proposed implementation differs from the scheme used by [15] in how the outcome and the fitness are calculated. The following steps are used to calculate the evaluation function. First, the overall outcome is calculated based on whether each field of the URL matches the pre-classified data set; each match is multiplied by the weight of that field. The Matched value is set to either 1 or 0.

$$\text{outcome} = \sum_{i=1}^{8} \text{Matched}_i \times \text{Weight}_i$$

The order of the weight values is used in this function. These orders are categorized according to the different fields in the IP address of the URL. Therefore, all genes in the same sub-domain of an IP address have the same weight. The actual values can be finely tuned at execution time. This scheme is straightforward and intuitive. These are the most important pieces of information needed to capture a phish-mail. Some URLs are more probable targets for phishing attacks, for example, URLs for bank domains.

The absolute difference between the outcome of the chromosome and the actual suspicious level is then computed using the following equation. The suspicious level is a threshold that indicates the extent to which two URLs are considered a “match.” The actual value of the suspicious level reflects observations from historical data.

$$\Delta = |\text{outcome} - \text{suspicious\_level}|$$

Once a mismatch happens, the penalty value is computed using the absolute difference. The ranking in the equation indicates whether or not an attack is easy to identify.

$$\text{penalty} = \frac{\Delta \times \text{ranking}}{100}$$

The fitness of a chromosome is computed using the above penalty:

$$\text{fitness} = 1 - \text{penalty}$$

Obviously, the range of the fitness value is between 0 and 1.

Crossover and Mutation

Traditional genetic algorithms have been used to identify and converge populations of candidate hypotheses to a single global optimum. For this problem, a set of rules is needed as a basis for the PADPS. As mentioned earlier, there is no way to clearly identify whether a hyperlink (URL) in an e-mail is normal or anomalous using just one rule. Multiple rules are needed to identify unrelated anomalies, which means that several good rules are more effective than a single best rule. Another reason for finding multiple rules is that, because there are so many types of hyperlink, a small set of rules will be far from enough.

Using the genetic algorithm, we need to find local maxima (a set of “good-enough” solutions) as opposed to the global maximum (the best solution) (Sinclair, Pierce, and Matzner 1999). Niching techniques can be used to find multiple local maxima (Miller and Shaw, 1996; see also Sinclair, Pierce, and Matzner 1999)[15]. They are based on an analogy to nature: within each environment, there are different subspaces (niches) that can support different types of life. In a similar manner, a genetic algorithm can maintain the diversity of each population in a multimodal domain, that is, a domain requiring the identification of multiple optima. Two basic methods, crowding and sharing, can be used for niching [15]. The crowding method replaces the most similar member, which slows the convergence of the population towards a single point in the following generations. The sharing method reduces the fitness of individuals that have highly similar members and forces individuals to evolve towards other local maxima that may be less populated. The similarity metric used in these techniques can be phenotype similarity, such as the relation between two URLs in this problem. This is more suitable for finding rules used in PADPS. The disadvantage of this approach is that it requires more domain-specific knowledge.

Other parameters

There are also other parameters that need to be considered, such as mutation rate, crossover rate, population size, and number of generations. These parameters should be adjusted according to the application environment of the system and the organization’s security policy.

III. SYSTEM ARCHITECTURE

[Figure 2: block diagram with components labeled APWG, GA, Rule Set, and Rule Base]

Figure 2. Architecture of applying GA into PADPS.

Figure 2 shows the structure of this implementation. We need to collect enough historical data that includes both normal and anomalous URLs. The APWG data set is used for testing PADPS. This is the first part of the system architecture. This data set is analyzed and the results are fed into GA for fitness evaluation. Then the GA is executed and the rule set is generated. These rules are stored in a database to be used by the PADPS.

IV. CONCLUSION

Phishing has become a serious network security problem, causing financial losses of billions of dollars to both consumers and e-commerce companies. Perhaps more fundamentally, phishing has made e-commerce distrusted and less attractive to ordinary consumers. In this paper, we concentrated on the hyperlinks embedded in phishing emails. A ruleset is formed by the genetic algorithm, and this ruleset is used to match the hyperlink. This feature informs the user about the status of the mail before the user reads the mail and submits their information. Once a mail is declared a phishing mail, the user is advised not to read the message and, even if they read it, not to follow the hyperlink provided in it. Messages from the original domain have the status Not-Phishing, and the user can access them. Because the genetic algorithm produces a rule-based system, it is used to detect phishing attacks. The genetic algorithm is implemented for Windows XP. Hence, the genetic algorithm is not only useful for detecting phishing attacks but can also shield users from malicious or unsolicited links in Web pages. Future work includes further extending the genetic algorithm so that it can handle CSS (cross site scripting) attacks.

REFERENCES

[1] I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos. An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Encrypted Personal E-mail Message. In Proc. SIGIR 2000, 2000.
[2] The Anti-phishing working group. http://www.antiphishing.org/.
[3] Neil Chou, Robert Ledesma, Yuka Teraguchi, Dan Boneh, and John C. Mitchell. Client-side defense against web-based identity theft. In Proc. NDSS 2004, 2004.
[4] Cynthia Dwork, Andrew Goldberg, and Moni Naor. On Memory-Bound Functions for Fighting Spam. In Proc. Crypto 2003, 2003.
[5] EarthLink. ScamBlocker. http://www.earthlink.net/software/free/toolbar/
[6] David Geer. Security Technologies Go Phishing. IEEE Computer, 38(6):18–21, 2005.
[7] John Leyden. Trusted search software labels fraud site as ’safe’. http://www.theregister.co.uk/2005/09/27/untrusted search/.
[8] Microsoft. Sender ID Framework. http://www.microsoft.com/mscorp/safety/technologies/senderid/default.mspx.
[9] Netcraft. Netcraft toolbar. http://toolbar.netcraft.com/.
[10] PhishGuard.com. Protect Against Internet Phishing Scams. http://www.phishguard.com/.
[11] Jonathan B. Postel. Simple Mail Transfer Protocol. RFC821: http://www.ietf.org/rfc/rfc0821.txt.
[12] Georgina Stanley. Internet Security - Gone phishing. http://www.cyota.com/news.asp?id=114.
[13] Meng Weng Wong. Sender ID SPF www.openspf.org/whitepaper.pdf.7
