2017 45(1)
Enhance Detecting Phishing Websites Based on Machine Learning Techniques of Fuzzy Logic with Associative Rules

Shireen Riaty, Computer Science Dept, KASIT, The University of Jordan, Amman, Jordan
[email protected]
Ahmad Sharieh, Computer Science Dept, KASIT, The University of Jordan, Amman, Jordan
[email protected]
Hamed Al Bdour, Computer Science Dept, KASIT, The University of Jordan, Amman, Jordan
[email protected]
Riad Jabri, Computer Science Dept, KASIT, The University of Jordan, Amman, Jordan
[email protected]
Abstract— Phishing web-sites can cause the loss of thousands of dollars and can damage the brand image of organizations. Thus, automatic filtering of phishing web-sites becomes a necessity. This paper presents a phishing detection technique based on a Fuzzy Inference Process. The proposed phishing detection has rules for converting the input features of web-sites into an output that reveals the nature of the web-site. The detection is built by constructing a new set of features from the input data, transferring the features into different forms of continuous values using clustering, frequent pattern mining and a value mapping process. The newly constructed features are then used with a fuzzy system, which is learned by optimizing a set of rules and membership functions to predict phishing web-sites. The experimental results show that the proposed work outperformed other rule-based classification techniques on the same data by 2%. Moreover, the results show that the clustering and frequent pattern mining enhanced the outcome by 4%. It is recommended to apply the pre-processing and fuzzy system before detection to enhance the accuracy of phishing website detection.

Keywords- Phishing Website Detection; Feature Classification; Fuzzy Logic; Pattern Mining.
I. INTRODUCTION
Phishing is the attempt to acquire sensitive information such as usernames, passwords, and credit card details [20]. It is used to deceive users and exploit websites, and constitutes a threat in current social media. It is used by criminals to mimic the web-site and brand of a well-known organization. Most phishing web-sites concern E-banking and E-business; thus, phishing may cost thousands of dollars per attack. For example, it is reported that the annual worldwide impact of phishing could be as high as $5 billion [31]. Thus, automatic filtering of phishing web-sites becomes a necessity. To address this problem, researchers started to propose solutions for detecting and filtering phishing web-pages using data mining techniques [13]. Phishing detection has been implemented by modeling previously detected phishing websites. Each web-site (phishing or non-phishing) is composed of a set of items and properties. These items and properties can be described as variables. For a specific website, these variables can be assigned a
value based on the content of that web-site. Thus, the problem can be formulated as detecting future phishing websites by modeling previously detected websites using data mining processes [10]. Classification, one of the most important categories of data mining processes and one utilized in phishing detection, is the process of predicting the output class (family/category) of a given input with an unknown class [5]. To use classification in phishing detection, the modeled website should be described in terms of a predetermined set of properties/features [7]. Fuzzy logic (FL) is a methodology for system controllers that deals with small, simple to large problems [24]. FL takes as input a set of data that may include missing fields, noise, and imprecise and ambiguous information. FL uses such input to model experience in the form of a rule induction model {IF A THEN B} for making future decisions. The most important characteristic of the FL approach is the ability to build a relation between input variables and an outcome control character (decision) based on previous experience, formed as input data, and with the interference of human experts, who can inject their experience into the FL system in the form of rules. It allows the representation of partial membership in sets to calculate results, unlike Boolean logic, which represents membership by 0 or 1. In phishing detection, websites can be modeled as a group of variables that are the input to a system controller designed using FL to produce a correct output label for a given website. There are a number of association rule mining algorithms that find associations by analyzing the joint presence of item sets [3]. The main problem tackled in this paper is how to set rules for converting the input features of web-sites into an output that reveals the nature of the web-site as phishing or non-phishing. This paper is extracted from a thesis [23].
It proposes an intelligent system for detecting phishing websites based on machine learning techniques of fuzzy logic with associative rule mining. The proposed work discovers patterns of the phishing websites' properties using association rules that are inserted into a FL classifier.
II. RELATED WORK
An empirical study by Dhamija et al. [11] showed that general users are easily tricked by fraud and that the standard notification mechanisms, including the address bar, status bar and notification bar in the browser, are not helpful for general users, who do not give enough attention to these notifications. The study found several strategies used to trick users, and thus initiated a set of web-site features that can be used to build automatic phishing detection [11]. Jakobsson proposed an approach for online visualization of the information flow in an attack [14]. Existing phishing detection and prevention solutions can be categorized as email filtering or web-based (toolbar filtering, visual-based, and feature-based). Chandrasekaran et al. proposed an email filtering approach based on the classification of emails' structural properties, subjects and body contents [7]. The results show that there are 29 features that distinguish legitimate emails from attack emails. Abu-Nimeh et al. [2] compared six machine-learning techniques for classifying phishing emails; their dataset consisted of a total of 2889 emails with 43 features (variables). In [6], an email filtering approach is proposed that uses a novel set of features, such as the resolution of the email topic and the list of words that appear in the body of the email. The toolbar filter in Internet Explorer (IE)-8 [18, 20] and some other browsers contains a blocking mechanism that may prevent users from accessing phishing websites. The filtering mechanism is based on updated blacklists and whitelists accessed from a designated server; the browser allows users to access only those sites on the whitelist. Another example of an anti-phishing toolbar is NETCRAFT [19]. Similarly, an integrated tool by McAfee [16] and Google Chrome Safe Browsing [21] also use blacklists of URLs to determine phishing sites.
McAfee also keeps a whitelist of legitimate websites that are
not present in Google Chrome Safe Browsing, which makes McAfee more trusted in phishing alerts [23]. Chou et al. proposed a phishing detection tool that is installed on the client machine as a browser add-in toolbar [9]. Kirda and Kruegel proposed a browser extension to protect users against phishing attacks [15]. A major disadvantage of toolbar-based filtering is that it requires concentration from users on what the browser suggests when there is not enough evidence and no clear decision for a specific website. Ye et al. proposed a pathway between users and browsers in order to increase the interaction between them and ease the user's burden [29]. Another major disadvantage is that toolbars depend on previous bad experience with known phishing web-sites. Sheng et al. proved experimentally the limitation of using toolbars, which mainly depend on a blacklist, to detect and protect against phishing [22]. In [11], an approach is proposed that saves images of the logos of the legitimate websites that users register with. Then, on every future visit to a website, the visual differences and similarities in logo between this site and the saved sites are analyzed to detect phishing. Wenyin et al. proposed an approach that uses simple image processing techniques to detect phishing websites [27]: a set of legitimate web-sites is saved, and each web-site is compared with the saved ones; if the visual differences are above some threshold, the web-site is considered phishing. Medvet et al. proposed an approach for phishing detection by means of the look-and-feel properties of the web-site [17]. Afroz and Greenstadt proposed PhishZoo, a phishing detection technique built by hashing the appearances of trusted web-sites [4]. Fuzzy matching algorithms were used to match the hashes of the saved web-sites with the hash of the site being judged.
The problem of this approach, similar to toolbar filtering, is that no complete list can be found in any dataset. Zhang et al. proposed a phishing detection approach based on the content of web-pages, using a Bayesian algorithm [28]. Similar to the work proposed in [19], the two main detection properties utilized are the text and images of the web-pages. The work by Wardman et al. [25] aims at identifying the common techniques of phishing sites and helping to inform webmasters about hosted phishing web-sites, as servers' owners normally do not know that they are hosting phishing web-sites. Abbasi et al. proposed an approach that uses the lexical and domain properties of the web-site's URL and body content [1]. Gastellier-Prevost et al. (2011) proposed Phishark, an anti-phishing toolbar that uses heuristic algorithms for building a phishing model and classifying accessed web-sites into phishing and non-phishing [12]. The proposed approach uses both URL-based and HTML-based features, embedded in the common usage of certain HTML tags in phishing web-sites. Wardman et al. proposed a phishing detection approach using features of the HTML source code [26]; the algorithms used for classification are SVM, logistic regression, Random Forest and Bayesian Network. Barraclough et al. argued that a disadvantage of the classification used by existing phishing detection approaches is the parameter tuning [5]. Because the parameterization problem is efficiently handled by a Fuzzy System (FS), they proposed an FS for phishing detection. Choi et al. proposed an approach that uses the lexical and domain properties of the web-site's URL and the content features [8]; the algorithms used for classification are SVM and K-Nearest Neighbor. Some of the approaches categorized as active have taken a different direction from the common one.
For example, Afroz and Greenstadt proposed PhishZoo, a phishing detection technique built by hashing the appearances of trusted web-sites [4]; the hash is then used as a model for classifying new web-sites. In [27], the common patterns used by phishing web-sites in their URLs were extracted. Gastellier-Prevost et al. found the commonly utilized HTML tags in phishing websites [12]. Wardman et al. [26] proposed phishing detection based on file matching of the source code of the web-site being judged. Most importantly, Barraclough et al. have successfully used a Fuzzy System
(FS) to ease the parameter tuning required in the classification algorithms utilized with phishing detection approaches [5]. This research proposes adding a pre-processing step to an existing system for detecting phishing websites based on machine learning techniques of fuzzy logic with associative mining rules. The work discovers patterns of the phishing websites' properties using association rules that are fed into a FL classifier.

III. PROPOSED APPROACH

The proposed approach uses FS to construct a set of rules from an existing dataset of known phishing and non-phishing websites. First: FS is commonly applied to control variables of a continuous nature, while the features used in phishing detection are nominal (e.g., feature: cookies, values: Yes/No); it is therefore important to pre-process the input features prior to applying FS. Second: changing the shape and nature of the input features will influence the accuracy of the phishing detection technique, positively or negatively; thus, it is important to follow a systematic and well-founded way of changing the features from nominal into continuous. Third: it is necessary to reduce the number of features utilized with FS. For all these reasons, the features are first pre-processed to produce a new, reduced and continuous form of the input features. The pre-processing steps answer the following questions. First, which features should be combined? Second, how can a new set of features of continuous type be systematically constructed, and how are values assigned to the newly constructed features based on their values in the input set of features? The first question is answered using methods of frequent pattern mining. The second question is answered using a similarity calculation with reference to the frequent pattern. The original features in the input dataset that are combined together are those highly associated with each other.
Highly associated features are combined to form a new feature or multiple new features. Highly associated features are those co-present with each other in uniform values. For example, given two features F1 and F2 with their possible values F1: v11 and v12, F2: v21 and v22, these features are said to be highly associated if and only if, in the given set of samples, the value v11 for F1 is always co-present with the value v21 for F2. More association is captured if, for the same set of features, the value v12 for F1 is always co-present with v22. No association between these features is present if v11 is co-present with v21 in some samples and with v22 in others, and v12 is co-present with v21 in some samples and with v22 in others. Highly associated features are more likely to represent a single, yet coarser, feature of the website. For example, the features "Long URL" and "Using prefix and suffix in URL" are commonly used in phishing detection. It is logical that there is a frequent pattern between these features, because a sample web-site with the value "Yes" for "Long URL" will mostly have the value "Yes" for "Using prefix and suffix in URL"; URLs that include a prefix and suffix will mostly be long URLs. All the features used for phishing detection are nominal, mostly with two values, 0 and 1, with some exceptions using three values: 0, 1 and -1. The value 0 mostly refers to "False" and 1 to "True". For example, given a feature labeled "F1" that refers to "Having long URL", a web-site with the value 0 for F1 does not have a long URL, while a web-site with the value 1 for F1 has a long URL, and so on. A set of features which have multiple frequent patterns will have multiple features in the newly constructed set.
Overall, having multiple frequent patterns for the same set of features is reflected by constructing the same number of features in the new set. The features that have no association with any other features remain the same in the newly constructed set of features.
As a sample, F1 and F2 are highly associated through two frequent patterns, {0,0} and {1,1}. The first pattern {0,0} means that, for most of the samples in the dataset, when a sample has the value 0 for F1, it also has the value 0 for F2. The pattern {1,1} means that, for most of the samples, when a sample has the value 1 for F1, it also has the value 1 for F2. The frequent pattern mining in the proposed approach is based on the Apriori algorithm. The input to this process is the original dataset, where each sample is represented by a feature vector and each value in the vector corresponds to a specific feature. The outputs of the Apriori algorithm are frequent patterns of various lengths generated under a specific support value; the minimum support value considered is, for example, 0.8. The process starts by extracting, from the input dataset, all item sets of size two (i.e., containing two features). An item in an item set is represented by a pair (Feature, value). All possible item sets are initially extracted from the whole dataset. Then, the support of each item set is calculated: the number of samples in which the item set is co-present divided by the total number of samples. Next, all item sets with support less than the minimum support (e.g., 0.8) are removed, and the rest are considered frequent patterns and carried to the next stage. In the next stage, a group of larger item sets is extracted from all possible combinations of the frequent patterns generated in the previous step. After calculating the support of these item sets, frequent patterns of larger size are generated. This process is repeated until no more combinations are possible and no more frequent patterns of a given size are found. All the frequent patterns are then processed so that all item sets that are subsets of others are removed.
By doing this, only the longest frequent patterns that cover the same items are kept.
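The mining loop described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation; the function name and the representation of items as (feature index, value) pairs are assumptions.

```python
from itertools import combinations

def apriori_patterns(samples, min_support=0.8):
    """Mine maximal frequent (feature, value) patterns from nominal vectors.

    samples: list of feature vectors, e.g. [[0, 0, 1], [0, 0, 0], ...].
    Returns the longest frequent item sets (subsets of others removed).
    """
    n = len(samples)

    def support(itemset):
        # Fraction of samples in which every (feature, value) pair co-occurs.
        return sum(all(s[f] == v for f, v in itemset) for s in samples) / n

    # Start with all item sets of size two over distinct features.
    items = {(f, v) for s in samples for f, v in enumerate(s)}
    frequent = [fs for fs in (frozenset(c) for c in combinations(items, 2))
                if len({f for f, _ in fs}) == 2 and support(fs) >= min_support]
    all_frequent = list(frequent)

    # Grow item sets until no larger frequent pattern exists.
    while frequent:
        candidates = set()
        for a in frequent:
            for b in frequent:
                u = a | b
                # Keep only one-item extensions over distinct features.
                if len(u) == len(a) + 1 and len({f for f, _ in u}) == len(u):
                    candidates.add(u)
        frequent = [c for c in candidates if support(c) >= min_support]
        all_frequent.extend(frequent)

    # Drop item sets that are subsets of longer frequent patterns.
    return [p for p in all_frequent if not any(p < q for q in all_frequent)]
```

For instance, on five two-feature samples where {F1=0, F2=0} co-occurs in four of five samples, that pair is the only frequent pattern at support 0.8.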
(a) Mock Dataset

Sample | F1 F2 F3 F4 F5 F6 | Label
S1     | 0  0  0  1  0  0  | 0
S2     | 0  0  1  1  1  1  | 0
S3     | 1  1  0  0  1  1  | 1
S4     | 1  0  1  1  1  1  | 1
S5     | 0  0  1  1  0  0  | 0
S6     | 1  1  0  1  0  0  | 1
S7     | 1  1  0  0  1  1  | 1

F: Feature, such as LongURL (0: No, 1: Yes); S: Sample web-site

(b) Value Mapping in Mock Dataset

Sample | F1,2 F1,2,3 F5,6 F5,6' F4 | Label
S1     | 0    0.66   0    1     1  | 0
S2     | 0    1      1    0     1  | 0
S3     | 1    0      1    0     0  | 1
S4     | 0.5  0.66   1    0     1  | 1
S5     | 0    1      0    1     1  | 0
S6     | 1    0      0    1     1  | 1
S7     | 1    0      0    1     0  | 1

Figure 1: Six features of 7 sample web-sites (a) mapped into the value-mapped dataset (b)
The patterns are then used to construct a new set of features, even if multiple patterns belong to the same item set of features. All the features that are not present in the frequent patterns are then mapped as they are to the newly constructed feature set. The newly constructed set of features is used to build a dataset equivalent to the original one. A feature vector is created for each sample in the dataset. The values in the feature vector are calculated using a similarity function with reference to the frequent pattern corresponding to each feature. A sample with a set of values equal to the values of the frequent pattern is given a center value of 0. The other samples have a value that represents the drift from the frequent pattern. For example, given two features F1 and F2 having a frequent pattern {v11, v21}, the two features are equivalent to the feature F1,2 in the newly constructed set. All the samples that have the values {v11, v21} for F1 and F2 have the value 0 for F1,2. All the samples with the pattern {v12, v21} or {v11, v22} have a value greater than 0, say 0.5, for F1,2, because these values drift by one from the frequent pattern {v11, v21}. The samples with the pattern {v12, v22} have a value greater than 0.5, say 1.0, for F1,2, because these values drift by two from the frequent pattern. For an illustration, see Figure 1. The value mapping process assigns values to the newly constructed features based on the frequent patterns. For a given feature, if the sample has values identical to the frequent pattern, a value of 0 is given; otherwise the value is calculated based on Equation (3.1):

value = ( Σ_{i=1..len} mis(v_i, p_i) ) / len    (3.1)

where mis(v_i, p_i) is the mismatch between the feature value and its corresponding value in the frequent pattern, and len is the length of the pattern.
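Equation (3.1) translates directly into code; a minimal sketch (the function name is illustrative):

```python
def map_value(sample_values, pattern_values):
    """Drift of a sample from a frequent pattern, per Equation (3.1):
    the number of mismatching positions divided by the pattern length."""
    mismatches = sum(v != p for v, p in zip(sample_values, pattern_values))
    return mismatches / len(pattern_values)
```

On the mock data of Figure 1, assuming F1,2 is mapped against the pattern {0,0}, sample S4 (F1=1, F2=0) drifts by one position and gets 0.5, while the 0.66 entries under F1,2,3 arise as a two-of-three mismatch.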
Based on Equation (3.1), if the input feature values exactly match the associated pattern, a value of 0 is given; a value of 1 is given when there is no match at all; and a ratio between 0 and 1 is given when there is a mix of matches and mismatches. The clustering process is implemented as follows. The number of clusters and the cluster centers are determined first. Then, each sample is placed in a suitable cluster based on its similarity to the existing clusters; the sample is placed in the closest cluster. When all samples have been assigned a cluster, the cluster centers are updated: each center is the average/middle of the samples assigned to that cluster. The process of assigning clusters and updating the centers is repeated until convergence. Figure 2 illustrates the clustering outcomes on the mock dataset. For example, because samples S2, S4 and S5 are similar to each other in terms of the values in their feature vectors (see Figure 1), they are placed in the same cluster. Similarly, S3, S6 and S7 are similar to each other, yet different from S2, S4 and S5, and are placed in another cluster, and so on.

Cluster #1: 0 0 1 0 {S1}
Cluster #2: 0 1 1 1 {S2, S4, S5}
Cluster #3: 0 0 0 1 {S3, S6, S7}

Figure 2: Clustering in Mock Dataset

The clustering process is based on the K-means algorithm with a simple similarity function that counts the identical values between the inputs being compared [24]. The input to this process is the original dataset, represented by a set of samples containing values for all features. The K-means algorithm clusters these samples into k clusters. The number of clusters is chosen automatically by trying all possible numbers of clusters starting from 2. For each number of clusters, the mean square error for the data samples and centers is given in Equation (3.2).
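The K-means loop described above can be sketched as follows. This is an assumed reading, not the authors' code: the per-position majority vote stands in for the "average/middle" center over nominal 0/1 values, and the deterministic first-k initialization replaces whatever seeding the original used.

```python
def kmeans_match(samples, k, iters=20, init=None):
    """K-means over nominal feature vectors with a match-count similarity.

    Similarity between a sample and a center is the number of positions
    holding identical values. Centers default to the first k samples.
    """
    centers = [list(c) for c in (init if init is not None else samples[:k])]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each sample joins its most similar center.
        clusters = [[] for _ in range(k)]
        for s in samples:
            sims = [sum(a == b for a, b in zip(s, c)) for c in centers]
            clusters[sims.index(max(sims))].append(s)
        # Update step: per-position majority value of each cluster's members.
        new_centers = []
        for members, old in zip(clusters, centers):
            if not members:
                new_centers.append(old)  # keep the center of an empty cluster
            else:
                new_centers.append([max(sorted(set(col)), key=col.count)
                                    for col in zip(*members)])
        if new_centers == centers:       # converged
            break
        centers = new_centers
    return centers, clusters
```

On four well-separated samples the loop converges in two iterations, grouping the 0-heavy and 1-heavy vectors into separate clusters.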
E = (1/n) Σ_i D(s_i, c_i)² / min_{i≠j} D(c_i, c_j)²    (3.2)

where D is the Euclidean distance, s_i is a data sample, c_i is the center that s_i belongs to, and c_i and c_j are two different cluster centers. The clustering process is repeated while increasing the number of clusters; when the value of the error metric increases or freezes, the process stops and the clustering with the best error is produced as the output. The Fuzzy System (FS) is built in four steps: Fuzzification, Rule Evaluation, Aggregation and Defuzzification [22], as shown in Figure 3. The input of FS is the set of samples with the newly constructed features. In Fuzzification, the input values of each feature, for all samples, are converted into linguistic terms based on a pre-defined fuzzy set. Each feature (an actual FS variable) has its own fuzzy set. The membership functions utilized with all the features are of triangular shape with three linguistic terms: {match, medium, derivate}. The fuzzy set for the output variable is also triangular, with two terms, Label: {phishing, non-phishing}. An input crisp value is converted into multiple terms with different degrees based on its intersection with the boundaries of the terms match, medium, and derivate.

Figure 3: Flowchart of the Proposed Fuzzy System (create fuzzy sets for the input and output variables; create membership functions for each term in the fuzzy sets; create the rule set; a crisp input is mapped to terms for the input variables, and terms for the output variable are mapped to a crisp output)

A critical issue in FS is determining the boundary and slope of the linguistic terms in the fuzzy set. The initial values of the boundaries are given as:

match = FuzzyTriangular{0.0, 0.0, 0.0, 0.4}
medium = FuzzyTriangular{0.2, 0.5, 0.5, 0.8}
derivate = FuzzyTriangular{0.6, 1.0, 1.0, 1.0}
phishing = FuzzyTriangular{0.0, 0.0, 0.0, 1.0}
non-phishing = FuzzyTriangular{0.0, 0.2, 0.2, 0.6}

These values were obtained by first determining the number of linguistic terms, chosen to be three: the features mainly represent two categories, phishing and non-phishing, so at least two terms are required, and an intermediate term is added to capture the overlap between these labels; moreover, having more than three linguistic terms would make rule construction very complicated. The boundary values were then obtained by moving the boundaries of the triangles left and right by 0.05 at a time in a trial-and-error process.
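The four-parameter FuzzyTriangular sets above can be evaluated with a standard trapezoidal membership function (with the two middle parameters equal, it degenerates to a triangle). A sketch with illustrative names:

```python
def trapezoid(x, a, b, c, d):
    """Membership degree of x in a FuzzyTriangular{a, b, c, d} term;
    with b == c, as in the sets above, the shape is a triangle."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Fuzzy set for the input features, as initialized above.
FEATURE_TERMS = {
    "match":    (0.0, 0.0, 0.0, 0.4),
    "medium":   (0.2, 0.5, 0.5, 0.8),
    "derivate": (0.6, 1.0, 1.0, 1.0),
}

def fuzzify(x):
    """Convert a crisp feature value into its non-zero {term: degree} pairs."""
    degrees = {t: trapezoid(x, *p) for t, p in FEATURE_TERMS.items()}
    return {t: d for t, d in degrees.items() if d > 0}
```

For example, a mapped feature value of 0.25 falls in the overlap of the match and medium terms and receives a degree for each, which is exactly the partial membership that distinguishes FS from Boolean logic.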
In the Rule Evaluation step, a set of rules is established by linking the linguistic terms of all features in the sample data with the output label (phishing vs. non-phishing). The linguistic terms are linked with AND operators. Examples of the utilized rules are given in Table 1.

Table 1: Examples of the Set of Fuzzy Logic Rules
IF F1 is match THEN Label is phishing
IF F1 is derivate THEN Label is non-phishing
IF F3 is medium AND F8 is match THEN Label is phishing

In the Aggregation step, given that the membership functions and the rules have been built, aggregation is applied to a testing data sample. An input sample is processed using the membership functions to transform its crisp values into linguistic terms; the resulting terms are then used to find the set of rules matching these terms. For each matched rule, the output of the rule is considered with a probability value equal to the average probability of the terms applied in the rule. The output linguistic terms are then aggregated by taking the highest probability of each term. Finally, in the Defuzzification step, the linguistic terms of the output variable and their associated probabilities are converted into a crisp output using the center of gravity. The Fuzzy System is thus built to classify web-sites into phishing and non-phishing based on a set of rules and membership functions found in the training phase of FS. Each input variable (feature) and the output variable (label) is assigned a membership function that maps the crisp value of the input variable into a linguistic term, and the linguistic term of the output variable into a crisp value. Based on the linguistic terms of the input and output variables in the training data, a set of rules is discovered in the form "IF x is X AND y is Y THEN z is Z", where "x", "y" and "z" are the linguistic variables and "X", "Y" and "Z" are their representative linguistic terms.
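A minimal sketch of rule evaluation, aggregation and center-of-gravity defuzzification as described above. The rule representation and all names are illustrative assumptions, and the discretized centroid stands in for whatever integration the original used:

```python
def trapezoid(x, a, b, c, d):
    # Membership of x in a FuzzyTriangular{a, b, c, d} term (b == c: triangle).
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Output fuzzy set, as initialized earlier.
LABEL_TERMS = {
    "phishing":     (0.0, 0.0, 0.0, 1.0),
    "non-phishing": (0.0, 0.2, 0.2, 0.6),
}

def evaluate_rules(rules, fuzzified_sample):
    """Fire every rule whose antecedent terms are all present in the sample.

    A rule's strength is the average degree of its terms; strengths are
    aggregated per output label by taking the maximum."""
    out = {}
    for antecedent, label in rules:           # antecedent: {feature: term}
        degrees = [fuzzified_sample.get(f, {}).get(t, 0.0)
                   for f, t in antecedent.items()]
        if degrees and all(d > 0 for d in degrees):
            out[label] = max(out.get(label, 0.0), sum(degrees) / len(degrees))
    return out

def defuzzify(aggregated, steps=100):
    """Discretized center of gravity over the clipped output memberships."""
    num = den = 0.0
    for i in range(steps + 1):
        x = i / steps
        mu = max((min(s, trapezoid(x, *LABEL_TERMS[t]))
                  for t, s in aggregated.items()), default=0.0)
        num += x * mu
        den += mu
    return num / den if den else 0.0
```

For instance, a sample whose F1 fuzzifies to match with degree 0.8 fires the first rule of Table 1 with strength 0.8, and the defuzzified output then lies toward the phishing end of the label axis.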
An example rule is: IF URL-Length is Long AND URL-Sub-Domains is High THEN Phishing is High. Aggregation and defuzzification are applied in the testing phase to convert the output of the rules into a crisp value. A learning procedure is implemented in order to find optimized values for the membership functions and rules, which are critical factors of FS. The proposed phishing detection is built by constructing a new set of features from the input data, transferring the features into different forms of continuous values using clustering, frequent pattern mining and a value mapping process. The newly constructed features are then used with an FS that is learned by optimizing a set of rules and membership functions to predict phishing web-sites.
IV. EXPERIMENTS AND ANALYSIS
A set of experimental frames was conducted. The proposed technique without clustering involves all the processing steps, but the frequent patterns are extracted from the data itself rather than from the clusters, and the rest of the processes are applied. The proposed technique without frequent pattern mining involves the FS only: frequent pattern mining and value mapping are not implemented, and the data reconstruction phase is not necessary in this frame of the proposed work. In each experiment, the data were divided into training and testing sets prior to the actual runs. The division took the following forms: 95% of the data for training and 5% for testing, 90% training and 10% testing, 85% training and 15% testing, and 80% training and 20% testing.
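The four splitting modes can be produced with a simple helper; this is illustrative, since the paper does not specify details such as whether the samples are shuffled before splitting:

```python
def split(samples, train_ratio):
    """Divide samples into training and testing portions, e.g. 0.85 -> 85-15."""
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```

Applied with ratios 0.95, 0.90, 0.85 and 0.80, this yields the four train/test modes compared in the experiments.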
The utilized dataset for phishing detection is provided by the UCI Machine Learning Repository1. The dataset has a total of 11055 samples with 30 features. It contains a set of phishing and non-phishing web-sites captured in 2012. All the features and the class labels are given categorical values of -1 (Fair), 0 (No), and 1 (Yes). The dataset file is given in ARFF (Attribute Relation File Format), developed at the University of Waikato for use with the Weka tool [30]; nowadays, several other tools and APIs recognize this format as input for machine learning tasks. Prior to running the experiments, the dataset is divided into a training set, used to build a learned model, and a testing set, used to evaluate the learned model. The phishing detection is then divided into training and testing phases. The training phase starts by reading the input data, formed as a set of samples; each sample in this phase includes both the features (e.g., URL, content, etc.) and the class label (phishing, non-phishing). Then, the processing steps of clustering, frequent pattern mining, feature reconstruction and value mapping are applied. The new dataset is used to construct a set of rules for FS. The rules of FS over the original dataset are constructed in a trial-and-error manner. The initial membership functions for all the rules and the class label are initialized to the values {-1 -0.5 0, -0.5 0 0.5, 0 0.5 1} for all terms related to the original dataset and {0 0 0.5, 0.2 0.5 0.8, 0.5 1 1} for all terms related to the reconstructed dataset. The output set of rules and membership functions represents the output model of this phase. In the testing phase, the inputs are samples with unknown class labels and a set of features identical to those used in the training phase.
The features of the samples are mapped using the same value mapping process used in the training phase, generating new samples with the new set of features. These samples are then fed into the FS: the membership functions are applied, followed by the rules, to generate an output label of phishing or non-phishing. The accuracy of the proposed technique as a whole, of the variants with excluded components, and of Naive-Bayesian, Decision Tree and Barraclough [5], for the 95-5 mode, is given in Table 2. As the results show, the proposed technique is better than the compared techniques. The results are explained by the utilization of the clustering and pattern mining processes: without these components, the proposed technique using FIS would not be as good as the existing methods, Naive-Bayesian and Decision Tree. Frequent pattern mining helps FS considerably in achieving better performance; the value added by clustering is not as significant. The work proposed in [5] gives competitive results with this proposed work. It is noted also that Decision Tree gives better accuracy than Bayesian, and both of these algorithms give better accuracy than the sole FS. Using FS only leads to low accuracy because of the complicated relations among the involved features themselves and the output labels. Capturing such relationships is hard unless enormous numbers of rules are created for the underlying dataset, which also brings a high chance of inconsistency among these rules. Consequently, the association between the features and the label could not be captured accurately using FS by itself. Using frequent pattern mining, the association is captured via the frequent patterns, which eases the work of FS and enables capturing the association with a reasonable rule set. Using clustering, more association is captured, which further enhances the results.
The precision and recall of the compared experiments under the 85-15 mode reflect the accuracy results. The precision of all techniques is notably better than the recall. In a real application, the recall is the more important measure, as it reflects the ability to detect phishing web-sites. The obtained recall shows that even the proposed method still misses some phishing web-sites (false negatives), which is a critical concern in security alarm systems.

1 https://archive.ics.uci.edu/ml/datasets.html
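For clarity, precision and recall in this setting, treating phishing as the positive class, can be computed as follows. The label vectors below are illustrative only and are not drawn from the experiments:

```python
def precision_recall(y_true, y_pred, positive="phishing"):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).
    High recall means few phishing sites are missed (few false
    negatives), which is what matters most in an alarm system."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative labels: 3 phishing sites, one of which is missed.
y_true = ["phishing", "phishing", "phishing", "legit", "legit"]
y_pred = ["phishing", "phishing", "legit", "legit", "legit"]
p, r = precision_recall(y_true, y_pred)
# precision = 2/2 = 1.0, recall = 2/3: precise, but one site missed.
```

This is why a classifier can show high precision yet still be unsatisfactory for security alarms: every missed phishing site lowers recall.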
The results show that the proposed technique has outperformed the compared techniques: Decision Tree and Bayesian. The results are also justified by the utilization of the clustering and pattern mining methods. The experimental results further show the need for a sufficient portion of data in the testing phase: the 85-15 mode gave better results than the 90-10 mode, which in turn gave better results than the 95-5 mode. A summary of the results is given in Table 2.

Table 2: Summary of Results (Ac: Accuracy, Pr: Precision, Re: Recall)

Technique                | Ac 95-5 | Ac 90-10 | Ac 85-15 | Ac 80-20 | Pr 85-15 | Re 85-15
Bayesian                 |  92.2   |  92.8    |  92.8    |  92.8    |  93.0    |  92.5
Decision Tree            |  92.9   |  94.1    |  94.4    |  94.4    |  94.5    |  94.2
Barraclough [5]          |  94.9   |  96.1    |  96.1    |  96.1    |  96.4    |  95.8
Proposed                 |  94.7   |  96.0    |  96.1    |  96.1    |  96.3    |  95.8
Proposed (no clustering) |  93.8   |  95.8    |  95.9    |  95.9    |  96.0    |  95.8
Proposed (no FP mining)  |  90.1   |  90.5    |  90.5    |  90.5    |  90.6    |  90.4

V.
CONCLUSION
Phishing web pages are one of the forms used by crackers to mimic the web-sites and brands of well-known organizations. Thus, automatic filtering of phishing web-sites becomes a necessity. The existing approaches for anti-phishing are either email-based or web-based. This paper presents a technique that sets rules for converting the input features of a web-site into an output that reveals the nature of the web-site as phishing or non-phishing. The paper presents a phishing detection technique based on fuzzy logic together with a set of pre-processing phases borrowed from the data mining field. The form of the input data used with the proposed work is discussed.

A fuzzy system is commonly applied to control variables of a continuous nature. Thus, it is important to pre-process the input features prior to applying the FS. Changing the shape and nature of the input features influences the accuracy of the phishing detection technique, positively or negatively, and it is necessary to reduce the number of features used with the FS. For all these reasons, the features are first pre-processed in order to produce a new, reduced, and continuous form of the input features. The applied set of pre-processing steps, namely value mapping, frequent pattern mining, and clustering, has improved the results of the fuzzy system.

The contributions of this research are: a set of features is used in building a model for phishing web-sites, and the model is used for detecting phishing web-sites; using frequent pattern mining, highly associated features are combined to form a new feature or multiple sets of features; and the paper presents experimental results of the proposed technique in comparison with existing rule-based classifiers, all implemented in the same environment and on the same dataset. The results showed that the proposed technique outperformed the compared techniques significantly.
The results also justified the utilization of the clustering method and pattern mining method.
Future work will focus on collecting phishing and non-phishing web-sites that are currently accessible on the Web and extracting a list of features different from those commonly used in phishing detection, such as the "time-to-response". A potential area of research is comparing several rule-learner techniques to be used with the Fuzzy Inference Process. Another potential area is reducing the number of rules to enhance the system performance and studying which rules have the greatest impact on the system output.
REFERENCES
[1] Abbasi, A., Zhang, Z., Zimbra, D., Chen, H., & Nunamaker Jr, J. F., "Detecting fake websites: the contribution of statistical learning theory," MIS Quarterly, pp. 435-461, 2010.
[2] Abu-Nimeh, S., Nappa, D., Wang, X., & Nair, S., "A comparison of machine learning techniques for phishing detection," Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, pp. 60-69, ACM, 2007.
[3] Abur-rous, Maher, "Phishing Website Detection Using Intelligent Data Mining Techniques," PhD Dissertation, Department of Computing, University of Bradford, 2010.
[4] Afroz, S., & Greenstadt, R., "PhishZoo: An automated web phishing detection approach based on profiling and fuzzy matching," Proceedings of the Semantic Computing Conference (ICSC), 2009, pp. 1-10.
[5] Barraclough, P. A., Hossain, M. A., Sexton, G., & Aslam, N., "Parameter optimization for intelligent phishing detection using Adaptive Neuro-Fuzzy," International Journal of Advanced Research in Artificial Intelligence, 3(10), 2014.
[6] Bergholz, A., De Beer, J., Glahn, S., Moens, M. F., Paaß, G., & Strobel, S., "New filtering approaches for phishing email," Journal of Computer Security, 18(1), pp. 7-35, 2010.
[7] Chandrasekaran, M., Narayanan, K., & Upadhyaya, S., "Phishing email detection based on structural properties," NYS Cyber Security Conference, pp. 1-7, 2006.
[8] Choi, Seong-Muk, Yeol-Joo Ryou, Hoo-Ki Lee, Hee-Hoon Cho, and Jong-Bae Kim, "Malicious Processor Detection based on the Security Agent," International Journal of Security and Its Applications, 9(11), pp. 47-54, 2015.
[9] Chou, N., Ledesma, R., Teraguchi, Y., & Mitchell, J. C., "Client-Side Defense Against Web-Based Identity Theft," NDSS '04: Proceedings of the 11th Annual Network and Distributed System Security Symposium, San Diego, 2005.
[10] Dhamija, R., Tygar, J. D., & Hearst, M., "Why phishing works," Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 581-590, ACM, 2005.
[11] Dhamija, Rachna, and J. Doug Tygar, "The battle against phishing: Dynamic security skins," Proceedings of the 2005 Symposium on Usable Privacy and Security, ACM, 2005.
[12] Gastellier-Prevost, S., Granadillo, G. G., & Laurent, M., "Decisive heuristics to differentiate legitimate from phishing sites," Network and Information Systems Security (SAR-SSI) 2011 Conference, pp. 1-9, IEEE, 2011.
[13] Google, Google Transparency Report, https://www.google.com/transparencyreport/safebrowsing/, last visited 1/5/2015.
[14] Jakobsson, M., "Modeling and preventing phishing attacks," Financial Cryptography, Vol. 5, 2005.
[15] Kirda, E., & Kruegel, C., "Protecting users against phishing attacks with AntiPhish," Computer Software and Applications Conference, pp. 517-524, IEEE, 2005.
[16] McAfee, McAfee SiteAdvisor, http://home.mcafee.com/root/landingpage.aspx?lpname=get-itnow&affid=0&cid=170789&ctst=1, last visited 1/5/2015.
[17] Medvet, E., Kirda, E., & Kruegel, C., "Visual-similarity-based phishing detection," Proceedings of the 4th International Conference on Security and Privacy in Communication Networks, p. 22, ACM, 2008.
[18] Microsoft Corporation, "Internet Explorer 8 Beta 1," Windows IT Pro, 2007, www.microsoft.com, last visited 1/5/2015.
[19] Netcraft, Web Server Survey, http://news.netcraft.com/archives/2006/11/01/, 2006, last visited 1/5/2015.
[20] Phishing, https://en.wikipedia.org/wiki/Phishing, last visited 31/5/2016.
[21] Reis, C., Barth, A., & Pizano, C., "Browser security: lessons from Google Chrome," Queue, 7(5), p. 3, 2009.
[22] Sheng, S., Wardman, B., Warner, G., Cranor, L. F., Hong, J., & Zhang, C., "An empirical analysis of phishing blacklists," In CEAS, 2009.
[23] Shireen Riaty, Ahmad Sharieh, Hamed Al Bdour, "Enhancing the Performance of Detecting Phishing Website Rate," Master of Science Thesis, Graduate Studies, The University of Jordan, May 2016.
[24] Sugeno, M., & Yasukawa, T., "A fuzzy-logic-based approach to qualitative modeling," IEEE Transactions on Fuzzy Systems, 1(1), pp. 7-31, 1993.
[25] Wardman, B., Shukla, G., & Warner, G., "Identifying vulnerable websites by analysis of common strings in phishing URLs," In eCrime Researchers Summit (eCRIME '09), pp. 1-13, IEEE, 2009.
[26] Wardman, B., Stallings, T., Warner, G., & Skjellum, A., "High-performance content-based phishing attack detection," In eCrime Researchers Summit (eCrime '11), pp. 1-9, IEEE, 2011.
[27] Wenyin, L., Huang, G., Xiaoyue, L., Min, Z., & Deng, X., "Detection of phishing webpages based on visual similarity," In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, pp. 1060-1061, ACM, 2005.
[28] Zhang, H., Liu, G., Chow, T. W., & Liu, W., "Textual and visual content-based anti-phishing: a Bayesian approach," IEEE Transactions on Neural Networks, 22(10), pp. 1532-1546, 2011.
[29] Ye, Z., Smith, S., & Anthony, D., "Trusted paths for browsers," ACM Transactions on Information and System Security, 8(2), pp. 153-186, 2005.
[30] Weka (machine learning): ARFF file, https://en.wikipedia.org/wiki/Weka_(machine_learning)#ARFF_file, last visited March 2016.
[31] "Phishing Activity Trends Report," April-June 2016, http://docs.apwg.org/reports/apwg_trends_report_q2_2016.pdf, last visited Nov. 2016.