Future Generation Computer Systems 26 (2010) 381–388

Contents lists available at ScienceDirect

Future Generation Computer Systems journal homepage: www.elsevier.com/locate/fgcs

Discovering phishing target based on semantic link network

Liu Wenyin ∗, Ning Fang, Xiaojun Quan, Bite Qiu, Gang Liu

City University of Hong Kong, Hong Kong

Article info

Article history:
Received 30 January 2009
Received in revised form 4 June 2009
Accepted 24 July 2009
Available online 12 August 2009

Keywords: Phishing; Anti-phishing; Semantic Link Network; Web document analysis

Abstract

An approach to the discovery of the phishing target of a suspicious webpage is proposed, which is based on construction and reasoning of the Semantic Link Network (SLN) of the suspicious webpage. The SLN is constructed from the given suspicious webpage and its associated webpages. Since reasoning of the SLN can discover implicit relations among webpages, the true association relations between a phishing webpage and its target are acquired via reasoning. Afterwards, by analysis of the relations, the suspicious webpage can be identified as phishing or not based on the predefined rules, and its target can be discovered if it is phishing. Our test dataset consists of 1000 phishing pages selected from PhishTank, and 1000 legitimate webpages. The experimental results show that the proposed method yields a false negative rate of 16.6% on the phishing pages and a false positive rate of 13.8% on the legitimate pages.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

The World Wide Web provides a worldwide e-commerce platform that greatly facilitates trade among people in different places. At the same time, however, web-based phishing attacks have become plentiful. A phishing attack is a criminal activity that mimics a certain legitimate webpage (also referred to as the true webpage in the rest of this paper) using a fake webpage, with the intention of luring end-users to visit the fake website and stealing their personal information such as usernames, passwords and credit card details [1]. The legitimate/true webpage mimicked by the fake webpage is defined as the phishing target, and the fake webpage as the phishing page. Statistics from the Anti-Phishing Working Group (APWG) show that 363,662 unique phishing sites were reported during 2008 [2]. More than $3 billion was lost due to phishing attacks in the United States in 2007, according to a survey conducted by Gartner [3].

According to APWG's description of phishing, phishers steal consumers' personal information through social engineering and technical subterfuge. In technical-subterfuge schemes, phishers furtively plant crimeware onto users' computers to intercept their online account user names and passwords, while in social-engineering schemes they send spoofed e-mails to consumers purporting to be from legitimate businesses and agencies, and then mislead consumers to counterfeit websites [4]. In addition, according to a study by Gartner [5], 57 million US Internet users have received e-mails that linked to phishing scams and about 2 million of them claimed to have been tricked into leaking their sensitive information.

A serious problem that consistently confuses ordinary Internet users is: does the URL I have received by e-mail or other avenues link to a phishing page and, if so, which website is the phishing target it attacks? Quite a few researchers have been engaged in anti-phishing research and many solutions have been developed to detect whether a webpage is a phishing page or not. However, we have not seen any technical solution that can automatically find the phishing target. This is because it is very difficult for a machine to automatically discover the possible phishing target of an arbitrary suspicious webpage, although it is easier for a human being.

Conversely, many anti-phishing solutions need to know the phishing target in order to determine whether a suspicious webpage is a phishing page or not. For example, Liu et al. [6] require that the phishing target be registered in their system as a protected webpage for comparison with a suspicious webpage. In many cases, phishing webpages attack well-known webpages, and a system in which these well-known webpages are registered as protected webpages can detect such phishing webpages well. However, there are also a few phishing cases that attack less popular or new webpages. In these cases, it is very hard even for a system administrator to tell which webpages are phishing and what their targets are if the targets are not labeled. Therefore he/she cannot register these less popular or new webpages as protected webpages in advance, and such systems will probably fail to detect the corresponding phishing webpages.

As a result, how to effectively and efficiently discover the phishing target of a phishing webpage is a great challenge for anti-phishing, which will be addressed in this paper.

∗ Corresponding author. E-mail addresses: [email protected] (L. Wenyin), [email protected] (N. Fang), [email protected] (X. Quan), [email protected] (B. Qiu), [email protected] (G. Liu).
0167-739X/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2009.07.012


L. Wenyin et al. / Future Generation Computer Systems 26 (2010) 381–388

In this paper, we propose to identify a phishing webpage and discover its phishing target based on its Semantic Link Network (SLN), which is a self-organized semantic data model for semantically organizing web resources. Through appropriate reasoning of the SLN, the implicit semantic relations in the WWW environment can be discovered [7]. In our method, the SLN is constructed and reasoned in three major steps:

(1) We retrieve the associated webpages related to a suspicious webpage. The associated webpages are derived from two sources: the forward links contained in the suspicious webpage, and a powerful search engine, which returns candidate webpages with text content similar to that of the suspicious webpage.
(2) We construct the SLN from the suspicious webpage and its associated webpages.
(3) Reasoning is conducted on the SLN to mine the implicit association relations, which are defined as the relations among all webpages, including the suspicious webpage and its associated webpages.

With reasoning on the SLN, the suspicious webpage can be identified based on certain predefined rules, and if it is a phishing page, its target can also be discovered among its associated webpages. Generally, a suspicious webpage may have stronger association relations with its target than with other associated webpages, and it is quite possible to discover these relations automatically through reasoning of the SLN.

In our experiments, we use 1000 phishing webpages collected from PhishTank [8] as our test dataset to verify the proposed method, and the false negative rate (the rate at which the target of a phishing webpage is not discovered accurately) on this dataset is 16.6%. We also selected 1000 legitimate webpages to test the false positive rate (the rate at which a legitimate webpage is falsely identified as phishing) of the proposed method, and we obtain a relatively low rate of 13.8%.

The innovations of this paper are twofold. Firstly, we propose a new problem: discovering the phishing target of a given phishing webpage. Previous work on anti-phishing mainly focuses on how to accurately identify whether a suspicious webpage is phishing or not, and little effort has been made on how to discover the phishing target of a phishing webpage. Therefore, this work is highly significant for anti-phishing. The discovery of the phishing target not only helps verify the accuracy of phishing identification, but also alerts the owners of the mimicked legitimate websites so that they can take legal action.

Secondly, an application of the SLN theory is explored for this new problem. A phishing webpage usually contains some forward links to other related legitimate webpages but never to its target directly. Furthermore, the phishing webpage may employ pictures instead of textual content to avoid being discovered by a strong search engine. In this case, it is very difficult to discover the phishing target of the phishing webpage. However, the SLN-based method can still work, because the implicit relations between a phishing webpage and its phishing target can be reinforced by the reasoning of the SLN. This is an advantage of the SLN-based anti-phishing method for discovering the phishing target.

The rest of this paper is organized as follows. In Section 2, we review related work on anti-phishing. In Section 3, we present how to construct the Semantic Link Network of a given webpage. In Section 4, we present the approach to discovering the phishing target based on the Semantic Link Network. We conduct experiments to test the proposed method in Section 5, and then conclude the paper and present future work in Section 6.

2. Related work

Various solutions to anti-phishing have been developed during the past years. In this section, we briefly review previous anti-phishing work by summarizing it into six categories.

1. Blacklist/whitelist. This is probably the most straightforward solution for anti-phishing. A whitelist contains URLs of known legitimate sites while a blacklist contains those of known phishing sites. Many current anti-phishing technologies rely on a combination of whitelist and blacklist. Representative blacklist/whitelist based systems include PhishTank SiteChecker [8], Google Safe Browsing [9], FirePhish [10], and CallingID Link Advisor [11]. These anti-phishing solutions are usually deployed as toolbars or extensions of Web browsers to remind users whether they are browsing a safe website. A blacklist suffers from a window of vulnerability between the time a phishing site is launched and the time the site is added to the blacklist. A blacklist of phishing sites also requires frequent updating, but still cannot include new phishing sites in a timely manner. Similarly, a whitelist needs to be updated on a large scale, and it cannot include all legitimate sites.

2. Reputation scoring. Reputation scoring, e.g. WOT [12] and iTrustPage [13], is a relatively recent innovation. This technique rates the phishing possibility of a given webpage using reputation scores either reported from the anti-phishing community or computed from the given webpage. However, the reliability of the reputation scoring algorithm is a great challenge to this technique.

3. Malware detection. Malware is not phishing, but it could be used to assist phishing. With the development of anti-phishing techniques, traditional phishing methods may fail to work and more phishers could turn to malware. The representative product is Finjan [14].

4. Relevant domain name suggestion. This technique shows users the relevant domain name when they are accessing the Web. For example, SpoofStick [15] prominently displays only the most relevant domain information. This toolbar can help users recognize the actual website if they are visiting a rogue page whose domain name is similar to that of a legitimate site. However, this method cannot directly judge whether a suspicious page is phishing.

5. Visual similarity. This method measures the similarity between two given webpages by calculating the similarity between the content elements (text, image, layout, etc.) contained in the webpages. Liu et al. [6] propose a visual similarity based strategy for detection of phishing webpages. They first require users or system administrators to register with their system the true webpages (phishing targets) they want to protect. Afterwards, suspicious webpages are found in a variety of ways, including URLs in e-mails, various combinations of possible domain names, and all webpages accessed by users. Finally, they employ a few algorithms to compute visual similarity to detect the phishing pages that have high similarity to phishing targets (protected webpages). However, this approach needs to know the phishing target prior to the similarity comparison procedure.

6. Content-based approach. Zhang et al. [16] design, implement and evaluate CANTINA, a content-based approach to detecting phishing websites, which combines a Term Frequency-Inverse Document Frequency (TF-IDF) information retrieval algorithm with heuristics and determines the likelihood that a given webpage is a phishing page. CANTINA uses the five words with the highest TF-IDF weight on a given webpage as the lexical signature of that site and submits them to Google. If CANTINA finds the URL of the site in question within the top results, it classifies the site as a legitimate webpage, and otherwise as a phishing webpage. However, its efficacy depends heavily on the reliability of the search engine and on whether the selected lexical signature is really representative and precise as a query for the search engine.
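To make the lexical-signature idea in item 6 concrete, the following is a minimal Python sketch of picking the top-k TF-IDF terms of a page. It is not CANTINA's implementation: the function names, the tokenizer, the toy background corpus and the smoothed IDF formula are all assumptions chosen for illustration, and no search-engine query is issued.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus_texts, k=5):
    """Return the k terms of page_text with the highest TF-IDF weight."""
    tf = Counter(tokenize(page_text))
    docs = [set(tokenize(t)) for t in corpus_texts]
    n_docs = len(docs)

    def idf(term):
        # Smoothed IDF: one common variant, not necessarily the one CANTINA uses.
        df = sum(1 for d in docs if term in d)
        return math.log((1 + n_docs) / (1 + df)) + 1

    scored = {term: count * idf(term) for term, count in tf.items()}
    # Sort by descending weight; break ties alphabetically for determinism.
    return sorted(scored, key=lambda t: (-scored[t], t))[:k]
```

In a CANTINA-style check, the returned terms would be submitted as a query and the domain of the page compared with the domains of the top results; that step is omitted here because it depends on an external search service.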


Table 1
Comparison between anti-phishing methods.

Anti-phishing method   Phishing identification   Manual/automatic identification   Phishing target discovery
Black/whitelist        Yes                       Manual                            No
WOT                    Yes                       Manual                            No
Finjan                 No                        Automatic                         No
SpoofStick             Yes                       Automatic                         No
CANTINA                Yes                       Automatic                         No
SLN                    Yes                       Automatic                         Yes

Table 1 shows a qualitative comparison of the popular anti-phishing methods mentioned above and our SLN-based anti-phishing method. We compare them on the following aspects: (1) whether the method is capable of identifying a phishing webpage; (2) if so, whether it identifies a phishing webpage manually or automatically; and (3) whether it is capable of discovering the phishing target if a webpage is phishing.

3. Construction and reasoning of semantic link network

By construction and reasoning of an SLN, we can identify a suspicious webpage and even discover its target if it is phishing. We first present related definitions.

3.1. Definitions of semantic link network

Semantic Link Network (SLN) is defined as a self-organized semantic data model for semantically organizing resources. An SLN is composed of semantic nodes and semantic links. A semantic node in an SLN can be an atomic node (a piece of text or image) or a complex node (another SLN), while the semantic links are semantic relationships among the nodes and are the natural and smooth extension of the hyperlink in semantics [17,21]. SLN is suitable for reasoning and discovering the implicit semantic relations in a large-scale network.

An SLN schema [7] is needed for defining an SLN for each particular application. An SLN schema is a triple denoted as SLN-Schema = <ResourceTypes, LinkTypes, Rules>. ResourceTypes is a set of resource types, each of which is the type of a node in the SLN and is represented as ResourceType = [name: field] | [name: field, . . . , name: field], where name is the name of the resource type and field is the feature of the resource type. LinkTypes is a set of various types of semantic links, each of which is the type of link (relation) between a pair of nodes and is represented as LinkType = [name: (ResourceType, ResourceType)]. Rules is a set of reasoning rules on LinkTypes.

Semantic Relationship Matrix (SRM) [18,22] is used to represent an SLN, where the element $M_{ij}$ represents the semantic relation from the ith resource to the jth resource, and $M_{ji}$ is the reverse relation of $M_{ij}$. The SRM of an SLN is unique if the order of nodes in the matrix is fixed.

Closure of an SLN [18,22] is a complete SLN after multiple steps of reasoning. That is, no new semantic link can be derived from the SLN by the reasoning rules.

3.2. Building SLN model for anti-phishing

In this paper, we use an SLN to model the association relations among all the webpages, which include the suspicious webpage and its associated webpages. In the SLN, the ResourceType is webpage and the LinkType is the explicit/implicit semantic relation, which will be described in detail in the following subsection. In our method, two rules for reasoning of the SLN are defined, i.e., Rules = {Rule1, Rule2}, where Rule1 = {α · β = γ | α, β, γ ∈ LinkTypes} and Rule2 = {α + β = γ | α, β, γ ∈ LinkTypes}. In other words, a reasoning rule is defined as an operation of multiplication or addition on semantic relations, denoted as '·' and '+' respectively.

For example, Rule1:

$$n \xrightarrow{\alpha} n',\quad n' \xrightarrow{\beta} n'' \;\Rightarrow\; n \xrightarrow{\gamma} n''$$

can be represented as multiplication of two semantic relations, $\alpha \cdot \beta = \gamma$, and Rule2:

$$n \xrightarrow{\alpha} n',\quad n \xrightarrow{\beta} n' \;\Rightarrow\; n \xrightarrow{\gamma} n'$$

can be represented as addition of two semantic relations, $\alpha + \beta = \gamma$.

For our problem in this paper, the SRM is represented as $M = (M_{ij})_{n \times n}$, where n denotes the number of dimensions of matrix M. We assume $M_{ii} = 0$ and $M_{ij} \neq M_{ji}$ in this paper. $M_{ir} \times M_{rj}$ means that the ith node can reach the jth node via a semantic relation deduced (by one reasoning step) from the two relations $M_{ir}$ and $M_{rj}$, and the value of the deduced relation between the ith node and the jth node is calculated as $M_{ij} = M_{ir} \times M_{rj}$.

3.3. Calculation of association relations

Since phishers try their best to gain the consumer's trust, they usually build phishing webpages by mimicking legitimate webpages. Accordingly, a phishing webpage inevitably has intensive explicit/implicit association relations with its target. According to the theory of SLN, with construction and reasoning of an SLN, we may obtain the association relations between a phishing webpage and its target.

The association relations between a phishing webpage and its target can be reflected by the link relation and the similarity relation. Link relation means that there is a direct hyperlink from one webpage to another. Similarity relation includes search relation and text relation. The search relation from a phishing webpage to its target can be measured by the rank of the target in the results of a search engine queried with keywords extracted from the phishing webpage. The text relation can be measured by the textual similarity between the phishing webpage and its target. Accordingly, we regard an association relation as the combination of link relation, search relation, and text relation. Therefore, the value of an association relation W is defined as:

$$W = a_1 W_l + b_1 W_s + c_1 W_t, \tag{1}$$

where $W_l$, $W_s$ and $W_t$ denote the values of the link relation, search relation and text relation respectively, which are defined in the following subsections; $a_1$, $b_1$ and $c_1$ are the weights that indicate the importance of the three relations, and they are set empirically.

3.3.1. Link relation

Link relation is measured based on the hyperlinks (forward links) inside a page, which directly imply reference relationships from the page to their destinations. Phishing webpages frequently use such references so that visitors trust them when clicking the forward links leads to legitimate webpages. However, it is impossible for legitimate webpages to provide forward links back to phishing webpages. The number of forward links is used to measure the strength of the link relation between two webpages. If a suspicious webpage has many hyperlinks pointing to another particular webpage but there is no hyperlink pointing back from that page, it is a phishing webpage with very high probability. In our method, the link relation between two webpages is asymmetric; hence, the value of the link relation from $page_i$ to $page_j$ can be defined as:

$$W_l(i, j) = \frac{N_l(i, j)}{N_l(i)}. \tag{2}$$
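As a rough illustration of Eq. (2), the sketch below counts forward links with Python's standard HTML parser. It rests on assumptions that the text does not spell out: $N_l(i, j)$ is taken to be the number of forward links on $page_i$ whose host name equals the host name of $page_j$, and $N_l(i)$ the total number of forward links on $page_i$. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute targets of all <a href=...> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def link_relation(html_i, url_i, host_j):
    """W_l(i, j) = N_l(i, j) / N_l(i) under the assumptions stated above."""
    collector = LinkCollector(url_i)
    collector.feed(html_i)
    total = len(collector.links)  # assumed meaning of N_l(i)
    if total == 0:
        return 0.0  # a page without forward links has no link relation
    to_j = sum(1 for link in collector.links
               if urlparse(link).hostname == host_j)  # assumed meaning of N_l(i, j)
    return to_j / total
```

Matching by host name is only one possible way to decide that a link points to $page_j$; matching by registered domain or by exact URL would give different counts.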

3.3.2. Search relation

The search relation from $page_i$ to $page_j$ is derived from the rank of $page_j$ in the search results obtained using the content of $page_i$ as the query. If the domain name of $page_j$ matches any of the domain names in the top N search results, we say that there is a search relation from $page_i$ to $page_j$. Intuitively, the search relation between two webpages is not symmetric. In this paper, we use Google as the search engine to mine the search relation. After removing stop words, we select the five words with the highest term frequency as the query keywords, which is similar to the rule used by CANTINA [16]. The value of the search relation from $page_i$ to $page_j$ is defined as:

$$W_s(i, j) = a_2 W_{st}(i, j) + b_2 W_{sm}(i, j) + c_2 W_{sb}(i, j), \tag{3}$$

where $W_{st}(i, j)$, $W_{sm}(i, j)$ and $W_{sb}(i, j)$ denote the values of the ranks of search results when the queries are derived from the title, meta, and body of $page_i$, respectively; $a_2$, $b_2$ and $c_2$ are the corresponding weights. $W_{st}(i, j)$ can be calculated by Eq. (4):

$$W_{st}(i, j) = \frac{N_r - (R_s - 1)}{N_r}, \tag{4}$$

where $N_r$ is the number of search results, which is set to 20 in this paper, and $R_s$ is the rank of $page_j$ in the results. If $page_j$ cannot be found in the search result, its rank value is set to zero. For example, if we use the title (e.g., "Hello world!") of $page_i$ as the query and find that $page_j$ is ranked 5th ($R_s$) in the top 20 ($N_r$) results, $W_{st}(i, j)$ is 0.8. $W_{sm}(i, j)$ and $W_{sb}(i, j)$ are calculated in the same way as $W_{st}(i, j)$, but use the keywords from the meta and body of $page_i$ as queries respectively.

3.3.3. Text relation

A phishing webpage usually uses text content similar or even identical to that of its target webpage in order to lure visitors. If the text on a suspicious webpage is very similar to that on an associated well-known webpage, but the domain names of these two webpages are different, it is highly possible that this suspicious webpage is a phishing webpage. In this paper, we calculate the value of the text relation from $page_i$ to $page_j$ as:

$$W_t(i, j) = a_3 W_{tt}(i, j) + b_3 W_{tm}(i, j) + c_3 W_{tb}(i, j), \tag{5}$$

where $W_{tt}(i, j)$, $W_{tm}(i, j)$, and $W_{tb}(i, j)$ are the values of the text relations from $page_i$ to $page_j$ using the features included in the title, meta, and body of $page_i$ and $page_j$, respectively; $a_3$, $b_3$ and $c_3$ are the corresponding weights. $W_{tt}(i, j)$, $W_{tm}(i, j)$, and $W_{tb}(i, j)$ are calculated with a similarity model proposed by the psychologist Tversky [19], who measures the similarity between two objects in terms of their common and distinctive features. It is calculated as follows:

$$W_{tt}(i, j) = \frac{|T_i(k) \cap T_j(k)|}{|T_i(k)|}, \tag{6}$$

where $T_i(k)$ and $T_j(k)$ are the word sets extracted from the titles of $page_i$ and $page_j$ respectively, and $|T_i(k) \cap T_j(k)|$ is the number of common words they share. $W_{tm}(i, j)$ and $W_{tb}(i, j)$ can be calculated similarly to $W_{tt}(i, j)$.
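Eqs. (1) and (3)–(6) combine into a few lines of arithmetic, sketched below in Python. The function names are hypothetical. One interpretation is assumed and flagged in the code: when $page_j$ is not found in the search results, the rank-based value is taken to be zero. The default weights of `association` are the $a_1$, $b_1$, $c_1$ values listed in Table 2 of the paper.

```python
def rank_value(rank, n_results=20):
    """Eq. (4): (N_r - (R_s - 1)) / N_r; rank=None means page j was not found."""
    if rank is None or rank > n_results:
        return 0.0  # assumption: a missing page contributes a value of zero
    return (n_results - (rank - 1)) / n_results


def overlap(words_i, words_j):
    """Eq. (6): |T_i ∩ T_j| / |T_i| computed on sets of words."""
    set_i, set_j = set(words_i), set(words_j)
    if not set_i:
        return 0.0  # avoid division by zero for an empty title/meta/body
    return len(set_i & set_j) / len(set_i)


def weighted(title, meta, body, weights):
    """Eqs. (3) and (5): weighted sum over the title, meta and body values."""
    a, b, c = weights
    return a * title + b * meta + c * body


def association(w_link, w_search, w_text, weights=(0.5, 0.4, 0.1)):
    """Eq. (1): W = a1 * W_l + b1 * W_s + c1 * W_t."""
    a1, b1, c1 = weights
    return a1 * w_link + b1 * w_search + c1 * w_text
```

For instance, `rank_value(5)` evaluates the worked example in the text (rank 5 among 20 results), and `weighted(rank_value(5), 0.0, rank_value(None), (0.5, 0, 0.5))` would give a search relation for a page found only through the title query. Note that `overlap` is asymmetric by construction, since it divides by the size of the first word set only.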

3.4. Reasoning of SLN

Reasoning of an SLN aims to discover the implicit semantic relations between any two resources. Conducting one step of reasoning on an SLN is simply the multiplication of the SRM by itself [18]. The matrix that results from multiplying an SRM by itself a number of times shows the strength (value) of the implicit semantic relation between any two resources. This strength is actually the summation of the indirect relations over all possible paths between the two resources. In the context of this paper, through the reasoning of the SLN in terms of the multiplication of the SRM by itself, the implicit relation between a phishing webpage and its target can be discovered.

Specifically, given a suspicious webpage, which is represented as the first node in the SLN, we use a vector P (referred to as the probability vector) to represent the values of the relations from the given suspicious webpage to its associated webpages. In this vector P, the value of an element is the probability that the webpage corresponding to this element is the phishing target of the given suspicious webpage (the first node in the SLN). A value close to 1 means that the corresponding webpage is most probably the phishing target, while a value close to 0 means that the corresponding webpage is most probably NOT the phishing target.

In each step of reasoning, the vector P is multiplied by the SRM. Therefore, we denote the vector P after k steps of reasoning as $P^k = P^0 \times M^{(k)}$, where $M^{(k)}$ denotes the product of k copies of the matrix M, and $P^0$ denotes the initial vector. For example, suppose the given suspicious webpage is denoted as A, and its three associated webpages are denoted as B, C and D, respectively. The initial probability vector for webpage A, with elements ordered as A, B, C, D, can be denoted as $P^0_{(A)} = [1\ 0\ 0\ 0]$.

$P^k$ denotes the probability vector of the suspicious webpage after k steps of reasoning. Specifically, the vector is denoted as $P^k = [P^k_1, P^k_2, \ldots, P^k_j, \ldots, P^k_n]$, where $P^k_j$ denotes the value of the association relation from the suspicious webpage to the jth webpage after k steps of reasoning. According to the definition of the probability vector of a suspicious webpage, it is necessary to normalize the probability vector after each step of reasoning, as shown in Eq. (7):

$$P'^{\,k}_j = \frac{P^k_j}{\sum_{i=1}^{n} P^k_i}, \tag{7}$$

where $P'^{\,k}_j$ denotes the normalized value in the probability vector of the given suspicious webpage, and n denotes the number of nodes in the SLN, which is the number of webpages including the suspicious webpage and its associated webpages.

To identify a suspicious webpage as phishing or not, the maximal number of reasoning steps in an SLN is n − 1, according to the definition and the theoretical proof of the closure of SLN [7,18,22]; that is, no new semantic relation can be obtained after more than n − 1 reasoning steps.

4. Discovery of phishing target

The analysis and mining of implicit association relations between a suspicious webpage and its associated webpages helps us discover the phishing target of the suspicious webpage. After the relations among the suspicious webpage and its associated webpages are established in an SLN, the implicit association relations between the suspicious webpage and its target can be reinforced through reasoning. Consequently, the phishing target of a phishing webpage can be discovered based on predefined rules and strategies.
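The reasoning step of Section 3.4, including the normalization of Eq. (7), can be sketched in plain Python as follows. The function names are hypothetical, and the matrix in the usage note is a made-up four-node example rather than one taken from the paper's figures. Normalizing after every step differs from normalizing $P^0 \times M^{(k)}$ once only by a scalar factor, so both give the same normalized vector.

```python
def reason_step(p, m):
    """One reasoning step: row vector p times matrix m, then Eq. (7) normalization."""
    n = len(p)
    raw = [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]
    total = sum(raw)
    if total == 0:
        return raw  # zero vector: nothing is reachable, so leave it unnormalized
    return [value / total for value in raw]


def reason(p0, m, steps):
    """Return the normalized probability vectors after 1..steps reasoning steps."""
    vectors, p = [], list(p0)
    for _ in range(steps):
        p = reason_step(p, m)
        vectors.append(p)
    return vectors
```

A possible call, with nodes ordered A, B, C, D and hypothetical edge weights, is `reason([1, 0, 0, 0], [[0, 0, 0.2, 0], [0, 0, 0, 0.3], [0, 0.2, 0, 0], [0, 0.1, 0, 0]], 4)`; the index of the largest element of each returned vector is the potential phishing target of that step.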


Fig. 1. Semantic link network for four webpages.

Fig. 2. Semantic relation matrix for the four webpages in Fig. 1.

Fig. 3. Four normalized vectors of webpage A after each step of reasoning of the SLN in Fig. 2.

Fig. 4. The new Semantic Link Network for the four webpages.

4.1. Major steps in phishing target discovery

The procedure of identifying a suspicious page and discovering its phishing target can be summarized as the following steps: retrieve the associated webpages of a suspicious webpage; construct the SLN; reason the SLN; identify the suspicious webpage; and discover its phishing target if it is a phishing page. Each step is described in detail as follows:

1. Retrieve the associated webpages with which the suspicious webpage has a link relation or a search relation.
2. Construct an SLN for the suspicious webpage by calculating the initial values of the association relations among all these webpages.
3. Reason the SLN and identify the given suspicious webpage as phishing or not based on the inferring rules in Section 4.2.
4. Discover the phishing target based on the strategies in Section 4.3.

The reasoning mechanism of SLN can help us obtain the intrinsic relationships among the webpages [18,22]. Therefore, in each step of reasoning of the SLN, the associated webpage with which the suspicious webpage has the strongest association relation can be considered as a potential phishing target. Afterwards, according to the strategies of phishing target discovery in Section 4.3, the final phishing target of the suspicious webpage can be derived from these potential phishing targets.

4.2. Inferring rules for identification of a suspicious webpage

In the hyperlink network, the importance of a webpage is influenced by the ranks of its neighbors [20]. In an SLN, by contrast, a semantic link is influenced by other semantic links in the reasoning process [7]. Hence, the implicit relation can be discovered through reasoning, and accordingly, the association relation between a suspicious webpage and its target can be reinforced by other links.

We use the example in Fig. 1 to illustrate the reasoning procedure. Fig. 1 shows an SLN constructed with four webpages, denoted as A, B, C, and D. Fig. 2 gives the matrix M of the values of the association relations among them, which are derived by Eq. (1). Assume that webpage A is a suspicious webpage, and the webpages B, C, and D are its associated webpages. The initial probability vector of webpage A, with elements ordered as A, B, C, D, is denoted as $P^0_{(A)} = [1\ 0\ 0\ 0]$.

Fig. 5. The new semantic relation matrix for the four webpages in Fig. 4.

To discover the implicit association relations, the reasoning is performed on the SLN by multiplication of the vector $P^0_{(A)}$ and the matrix M in multiple steps. The probability vector after k steps of reasoning is denoted as $P^k_{(A)} = P^0_{(A)} \cdot M^{(k)}$. According to Section 3.1, we have k ≤ n, where n = 4. Four normalized vectors of webpage A are obtained after iterative reasoning, as shown in Fig. 3.

From Fig. 3, we can see that the maximum values in $P'^{\,1}_{(A)}$, $P'^{\,2}_{(A)}$, $P'^{\,3}_{(A)}$, and $P'^{\,4}_{(A)}$ correspond to webpages C, B, A, and A, respectively. Hence, we say that both the third and fourth steps of reasoning discover A as the potential phishing target. In other words, webpage A possibly targets itself. This is reasonable since there is a high-weighted link loop from the suspicious webpage A back to itself, i.e.,

$$A \xrightarrow{0.2} C \xrightarrow{0.2} B \xrightarrow{0.3} D \xrightarrow{0.5} A.$$

Since a webpage cannot be considered as the phishing target of itself, webpage A is identified as a legitimate webpage. According to the above example, we have the following inferring rule of legitimate webpage.

Inferring rule of legitimate webpage: if a given suspicious webpage targets itself in any step of reasoning on the SLN, it is considered as a legitimate webpage.

If we delete the link $D \xrightarrow{0.5} A$ from the SLN in Fig. 1, the new SLN is shown in Fig. 4 and the corresponding matrix $M_{new}$ is shown in Fig. 5. The four normalized probability vectors after four steps of reasoning are shown in Fig. 6.

From Fig. 6, we can see that the maximal values in $P'^{\,1}_{(A)new}$, $P'^{\,2}_{(A)new}$, $P'^{\,3}_{(A)new}$, and $P'^{\,4}_{(A)new}$ correspond to webpages C, B, D and B, respectively. Hence, no step of reasoning of the SLN discovers A as the potential phishing target. In other words, webpage A does not target itself. The reason is that there is no link from the associated webpages back to webpage A. Consequently, webpage A is regarded as a phishing webpage. According to the above example, we have the following inferring rule of phishing webpage.

386

L. Wenyin et al. / Future Generation Computer Systems 26 (2010) 381–388 Table 2 Values of nine parameters. Parameter Weight

Fig. 6. Four normalized vectors of webpage A after each step of reasoning on the new SLN in Fig. 4.

Inferring rule of phishing webpage: if a given suspicious webpage targets other associated webpages in all steps of reasoning on the SLN, it is considered a phishing webpage.

4.3. Strategies of discovering phishing target

If a given suspicious webpage is identified as a phishing webpage based on the above inferring rules, we define the webpage that has the maximal value of association relation with the suspicious webpage as a potential phishing target in each step of reasoning, since a larger value means a higher possibility of being the phishing target. For the example in Fig. 4, the potential phishing targets of webpage A are webpages C, B, D, and B, respectively, after each of the four sequential steps of reasoning. Next, we discuss several situations in which the final phishing target can be discovered.

According to Section 3.4, we identify a given suspicious webpage based on the inferring rules of legitimate/phishing webpage in at most n − 1 steps of reasoning. However, to discover the final phishing target of a phishing webpage, we need to reason on the SLN until the potential phishing target converges. The convergence is discussed in the following three situations using the example in Fig. 4.

1. If we add a new link B → C (weight 0.5) to Fig. 4, multiple steps of reasoning on the SLN result in convergence at webpage B. That is, after a few steps of reasoning, B remains the potential phishing target in every further step. This situation usually occurs when several loops pass through the same webpage; here, both loops B ⇄ D and B ⇄ C pass through B in the SLN of Fig. 4 after the new link is added. In other words, the reasoning converges at webpage B, which is considered an active ‘center’ in the SLN.

2. If the reasoning is conducted for several steps on the SLN of Fig. 4 without adding the new link, the reasoning finds B and D as the potential phishing target alternately. We refer to this case as convergence at multiple webpages: the potential phishing target changes periodically among a fixed set of webpages (e.g., B and D in this case) in the SLN. In other words, the reasoning converges at B and D, and there is an active ‘community’ consisting of B and D in the SLN.

3. If we delete the link D → B (weight 0.1) from Fig. 4 and reason on the SLN for multiple steps, the reasoning procedure stops at a zero-vector (all the elements of this vector are 0). This situation may occur when there is no loop in the SLN. In practice, it rarely occurs, since there are usually some loops in the SLN.

According to the above three situations, the reasoning of the SLN is regarded as converged when any of the following conditions is satisfied: (1) the potential phishing target does not change after a further round of reasoning on the SLN; (2) the potential phishing targets change periodically within a fixed set of potential phishing targets; (3) a zero-vector is generated during the reasoning on the SLN; or (4) the number of reasoning steps exceeds the maximal number, which is n − 1 (where n is the dimensionality of the Semantic Relation Matrix) [7,18,22].
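The four convergence conditions can be sketched as a simple loop. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that one reasoning step propagates the current association vector along the weighted links and renormalizes it (the actual reasoning operation is the one defined in Section 3.4), and the function name, the dictionary representation, the stop-reason labels and the link weights of the example SLN are all hypothetical. The step limit is left to the caller; the bound stated above is n − 1.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: sln[src][dst] is the association weight src -> dst.
Graph = Dict[str, Dict[str, float]]


def reason_until_convergence(
    sln: Graph, source: str, max_steps: int
) -> Tuple[str, List[str]]:
    """Return (stop_reason, potential target found at each reasoning step)."""
    pages = sorted({source} | set(sln) | {d for row in sln.values() for d in row})
    vector = {p: (1.0 if p == source else 0.0) for p in pages}
    targets: List[str] = []

    for _ in range(max_steps):
        # One reasoning step (assumed): propagate the vector along the links.
        nxt = {p: 0.0 for p in pages}
        for src, row in sln.items():
            for dst, weight in row.items():
                nxt[dst] += vector[src] * weight
        total = sum(nxt.values())
        if total == 0.0:
            return "zero_vector", targets              # condition (3)
        vector = {p: v / total for p, v in nxt.items()}  # normalize
        # Potential target: page with the maximal association value
        # (ties are broken by page name, an arbitrary choice).
        targets.append(max(pages, key=lambda p: vector[p]))

        if len(targets) >= 2 and targets[-1] == targets[-2]:
            return "single_page", targets              # condition (1)
        for period in range(2, len(targets) // 2 + 1):
            if targets[-period:] == targets[-2 * period:-period]:
                return "periodic", targets             # condition (2)
    return "max_steps", targets                        # condition (4)


# Hypothetical SLN with no link back to A; the weights are illustrative only.
example: Graph = {
    "A": {"B": 0.3, "C": 0.5, "D": 0.2},
    "B": {"D": 0.3},
    "C": {"B": 0.2},
    "D": {"B": 0.1},
}
reason, targets = reason_until_convergence(example, "A", max_steps=10)
is_phishing = "A" not in targets  # inferring rule: A never targets itself
```

With these hypothetical weights, webpage A never appears among its own potential targets, so the inferring rule of phishing webpage applies, and the potential target ends up alternating between B and D as in situation 2. Removing the D → B entry from `example` makes the same call stop at a zero-vector instead.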

Parameter   a1    b1    c1    a2    b2    c2    a3    b3    c3
Weight      0.5   0.4   0.1   0.5   0     0.5   0.5   0     0.5

Next, we define the following strategies for discovering the final phishing target according to the above discussions.

Strategy 1: If the reasoning of the SLN finally converges at a single webpage, which is considered the active center in the SLN, that webpage is regarded as the phishing target of the given suspicious webpage.

Strategy 2: If the reasoning of the SLN finally converges at a fixed set of webpages, the webpage in the set that has the largest value of association relation is taken as the phishing target.

Strategy 3: If the reasoning procedure stops at a zero-vector, the phishing target is the potential phishing target found in the step of reasoning just before the zero-vector is obtained.

Strategy 4: If the number of reasoning steps reaches the maximal value (n − 1), the phishing target is the potential phishing target found in the last step of reasoning.

4.4. Explanations of the inferring rules for discovering phishing targets

Since reasoning of an SLN can discover implicit relations among webpages [17], the true association relation between a phishing webpage and its target can be acquired. Therefore, if a suspicious webpage shows a strong association relation with itself, it tends not to be a phishing webpage. The reason is as follows: if this webpage shows the strongest association relation with itself after any step of reasoning, there must be some link loop from the webpage back to itself, which is impossible for a phishing webpage according to Section 3.3.1. This is the inferring rule for determining a legitimate webpage. However, if a suspicious webpage shows the strongest association relation with other webpages rather than with itself in each step, it is very probable that the suspicious webpage is targeting other associated webpages. This is the inferring rule for determining a phishing webpage and its phishing target.

5. Experiments and evaluation

We implement our method in a prototype system at http://www.sitewatcher.com.cn/SLN, which can identify suspicious URLs and find their phishing targets if they correspond to phishing webpages. The user interface of the application is shown in Figs. 7 and 8. The result shows whether the suspicious webpage is a legitimate one or a phishing one. If it is identified as a phishing webpage, the system displays the potential phishing targets found in each step of reasoning on the SLN.

The parameters in Eqs. (1), (3) and (5) are set as shown in Table 2. We set these parameters according to the following guidelines. First, although the meta information of a webpage is sometimes an important source, in practice we find that it is not a reliable representation of the webpage, because meta information is usually created by humans and its format is not unified. Therefore, we simply set b2 and b3 to 0 for our task in this paper. Second, to reasonably set the parameters a1, b1 and c1, which measure the importance of link relation, search relation, and text relation, respectively, we analyzed many phishing webpages and empirically determined the ranking of importance of the three relations as: link relation > search relation > text relation. Consequently, the corresponding parameters of the three relations are set to 0.5, 0.4 and 0.1 empirically to obtain the


Fig. 7. A legitimate webpage is identified based on SLN.

Fig. 8. A phishing webpage and its phishing target are identified based on SLN.

best performance on our test data. Note that the parameter setting is based only on our empirical experience and cannot be guaranteed to be optimal. The difficulty of obtaining the optimal parameters lies in the fact that different phishers may employ quite different strategies when making phishing pages. For example, some phishers use only keywords to mimic a legitimate webpage, while others may use both hyperlinks and keywords. Finally, although the title contains fewer words than the body of a webpage, the words in the title are usually a good description of the webpage. Hence, we give the title and body of a webpage the same importance. Based on the above analysis, the parameters a2 and c2 are both set to 0.5 empirically. Similarly, parameters a3 and c3 are also set to 0.5 in this paper.

5.1. Two examples using our system

We present an example of a suspicious webpage at http://www.sitewatcher.com.cn, and the result is shown in Fig. 7. Since the potential phishing targets found during the reasoning procedure of the SLN include the webpage itself, it is regarded as a legitimate webpage. Fig. 8 shows the experimental result of another example, a suspicious webpage whose URL is http://www.netbnk-commbnk-au.com, submitted to PhishTank [8] at http://www.phishtank.com/phish_detail.php?phish_id=713563. As shown in Fig. 8, it is identified as a phishing webpage and its phishing target discovered by our method is the website at http://www.commbank.com.au/personal/netbank/. This result was confirmed as correct by manual inspection.

5.2. Experiments on a large dataset

We selected 1000 phishing URLs from PhishTank [8] to test the performance of the proposed method. We downloaded and saved

them as our phishing dataset while they were still alive. These 1000 phishing webpages target 61 well-known webpages. We use the false negative rate to measure the accuracy of detecting phishing webpages. A false negative response is defined as a phishing webpage that is falsely identified as a legitimate webpage or assigned a wrong phishing target. The false negative rate is calculated by $\mathrm{Rate}_{fn} = \frac{N_P - N_C}{N_P}$, where $N_C$ is the number of phishing targets that are correctly identified and $N_P$ is the total number of phishing webpages tested in the experiments. A discovered phishing target is considered correct if its domain name and IP address match the ground truth. The false negative rate of the proposed method on the 1000 phishing webpages is 16.6%.

Another testing dataset is built by collecting 1000 legitimate pages, including 500 famous webpages and 500 less popular webpages. These legitimate webpages are used to test the false positive rate of our method, that is, how often a legitimate webpage is falsely identified as phishing. The false positive rate is calculated by $\mathrm{Rate}_{fp} = \frac{N_T - N_{np}}{N_T}$, where $N_{np}$ is the number of webpages identified as legitimate by our method and $N_T$ is the total number of legitimate webpages in the test. Our method's false positive rate on this testing dataset is 13.8%.

Based on an analysis of the characteristics of our inferring rules for legitimate webpages and phishing webpages, the following reasons are found for the false negative and false positive cases. The reasons for false negative cases may include: (1) The phishing target of a phishing webpage is not found in the set of its associated webpages.
This may occur because the phishing webpage contains few hyperlinks, or because the keywords extracted from the phishing webpage do not match the keywords of the target webpage; (2) Among its associated webpages, there is a certain active webpage that has a stronger association relation with the phishing page than the phishing target does. The active webpage may be a webpage of a famous news website. The reasons for false positive cases may include: (1) If a legitimate webpage is not easily discovered by a search engine, it is likely to be identified as phishing; (2) If a legitimate webpage is not frequently linked by other associated legitimate webpages, it is also likely to be identified as phishing.

6. Conclusion and future work

In this paper, our main contributions include two aspects. First, a new problem of discovering the phishing target of a given phishing webpage is proposed, which is more significant than only identifying a given suspicious webpage as phishing or not, as in previous work. Second, an application of the SLN theory is explored for this new problem. We proposed a novel approach to identifying a given suspicious webpage and discovering its phishing target by calculating and reasoning on defined association relations in its Semantic Link Network. The association relations among all webpages, that is, the suspicious webpage and its associated webpages, are measured as the combination of link relation, search relation, and text relation. After multiple steps of reasoning on the SLN, the suspicious webpage can be identified as phishing or not based on the inferring rules. If the suspicious webpage is identified as phishing, its phishing target can be discovered based on the proposed strategies. These strategies are specified in terms of four convergent situations in the reasoning procedure of the SLN. We implemented and evaluated the approach with 1000 phishing webpages and 1000 legitimate pages as the test datasets. Preliminary results show that the false negative rate of our approach on the phishing webpages is 16.6% and the false positive rate on the legitimate webpages is 13.8%.

There are still several issues worthy of further study. First, more kinds of association relations can be investigated, such as visual similarity relation, layout similarity relation, and domain similarity relation.
Second, the importance of various sub-relations in the combined association relations should also be studied. Finally, more effective inferring rules for identifying a given suspicious webpage and strategies for discovering its phishing target should be designed to further improve the overall performance of the proposed method.

Acknowledgments

The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 117907] and the National Grand Fundamental Research 973 Program of China under Grant No. 2003CB317000.

References

[1] Wikipedia. Available at http://en.wikipedia.org/wiki/Phishing.
[2] Anti-Phishing Working Group, Phishing Attack Trends Report - First Quarter 2008. Available at http://www.anti-phishing.org/reports/apwg_report_jan_2008.pdf.
[3] Gartner, Inc., Press Releases, 2007. Available at http://www.gartner.com/it/page.jsp?id=565125.
[4] K. Jaishankar, Identity related crime in the cyberspace: Examining phishing and its impact, International Journal of Cyber Criminology 2 (1) (2008) 10–15.
[5] O. Gunter, The Phishing Guide – Understanding and Preventing Phishing Attacks, White Paper, Next Generation Security Software Ltd., 2004.
[6] W. Liu, X. Deng, G. Huang, A.Y. Fu, An anti-phishing strategy based on visual similarity assessment, IEEE Internet Computing 10 (2) (2006) 58–65.
[7] H. Zhuge, Communities and emerging semantics in semantic link network: Discovery and learning, IEEE Transactions on Knowledge and Data Engineering 21 (6) (2009) 785–799.
[8] PhishTank. Available at http://www.phishtank.com/.
[9] Google Safe Browsing. Available at http://www.google.com/tools/firefox/safebrowsing/.
[10] FirePhish. Available at http://opdb.berlios.de/.
[11] CallingID Link Advisor. Available at http://www.callingid.com/DesktopSolutions/CallingIDLinkAdvisor.aspx.
[12] WOT. Available at http://www.mywot.com/.
[13] iTrustPage. Available at http://www.cs.toronto.edu/~ronda/itrustpage/.
[14] Finjan. Available at http://securebrowsing.finjan.com/.
[15] SpoofStick. Available at http://spoofstick.com/.
[16] Y. Zhang, J.I. Hong, L.F. Cranor, CANTINA: A content-based approach to detecting phishing web sites, in: The International World Wide Web Conference, WWW 2007, ACM Press, Banff, Alberta, Canada, 2007, pp. 639–648.
[17] H. Zhuge, The Knowledge Grid, World Scientific, Singapore, 2004.
[18] H. Zhuge, Y. Sun, R. Jia, J. Liu, Algebra model and experiment for semantic link network, International Journal of High Performance Computing and Networking 3 (4) (2005) 227–238.
[19] A. Tversky, Features of similarity, Psychological Review 84 (4) (1977) 327–352.
[20] L. Page, S. Brin, R. Motwani, T. Winograd, The pagerank citation ranking: Bringing order to the web, Technical Report, Stanford Digital Libraries SIDL-WP-1999-0120, 1999.
[21] H. Zhuge, J. Liu, A fuzzy collaborative assessment approach for Knowledge Grid, Future Generation Computer Systems 20 (1) (2004) 101–111.
[22] H. Zhuge, Autonomous semantic link networking model for the Knowledge Grid, Concurrency and Computation: Practice and Experience 19 (7) (2007) 1065–1085.

Liu Wenyin is an assistant professor in the Department of Computer Science at the City University of Hong Kong. Before that, he was a full-time researcher at Microsoft Research China/Asia. His research interests include question answering, anti-phishing, graphics recognition, and performance evaluation. He has a BEng and MEng in computer science from Tsinghua University, Beijing and a DSc from the Technion, Israel Institute of Technology, Haifa. In 2003, he was awarded the International Conference on Document Analysis and Recognition Outstanding Young Researcher Award by the International Association for Pattern Recognition (IAPR). He is also TC10 chair of IAPR and a guest professor of University of Science and Technology of China (USTC). He is a senior member of IEEE.

Ning Fang is currently a research associate in the Department of Computer Science, City University of Hong Kong. He received his Ph.D. from the School of Computer Science and Engineering, Shanghai University in 2009. He received his M.E. degree from Nanjing University of Post and Telecommunication, China in 2005, and his B.E. degree from Southeast University, China in 1998. His main research interests include web document analysis and the modeling, reasoning, integration and extraction of knowledge.

Xiaojun Quan received the B.E. degree in computer science from Chang’an University in 2005 and the M.E. degree in computer science from University of Science and Technology of China (USTC). He is currently a research assistant in the Department of Computer Science, City University of Hong Kong. His research interests include data mining, information retrieval, question answering and anti-phishing.

Bite Qiu received the B.E. degree in software engineering from Tongji University, Shanghai in 2007. He is currently an MPhil candidate in the Department of Computer Science, City University of Hong Kong. His research interests include anti-phishing, information retrieval and web data mining.

Gang Liu received the B.E. degree in computer science from Tsinghua University, Beijing. He is currently pursuing his Ph.D. degree in the department of computer science, City University of Hong Kong. His research interests include artificial intelligence approaches to computer security and privacy, web document analysis, information retrieval, and natural language processing.
