Using Feature Selection and Classification to Build Effective and Efficient Firewalls

Randall Wald∗, Flavio Villanustre†, Taghi M. Khoshgoftaar∗, Richard Zuech∗, Jarvis Robinson†, and Edin Muharemagic†
∗Florida Atlantic University
†LexisNexis Business Information Solutions
Email: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract—Firewalls form an essential element of modern network security, detecting and discarding malicious packets before they can cause harm to the network being protected. However, these firewalls must process a large number of packets very quickly, and so cannot always make decisions based on all of the packets’ properties (features). Thus, it is important to understand which features are most relevant in determining whether a packet is malicious, and whether a simple model built from these features can be as effective as a model which uses all information on each packet. We explore a dataset with real-world firewall data to answer these questions, ranking the features with 22 feature selection techniques and building classification models using four classifiers (learners). Our results show that the top two features are proto and dst (representing the network protocol and destination IP address, respectively), and that models built using these two features in combination with the Naïve Bayes learner are highly effective while being minimally computationally expensive. Such models have the potential to replace conventional firewalls while lowering computational needs.

Keywords—Firewall, Intrusion Detection, Classification, Feature Selection
I. INTRODUCTION

Computer systems connected to the internet can contain a wealth of potentially-valuable information, making them a target for attacks. The majority of these attacks must pass between the outside world and the internal network, however, making this interface an important region to monitor in order to observe and block attacks in progress. Of course, the majority of traffic passing in or out of the local network is not malicious, posing the challenge of identifying dangerous packets among the “normal” traffic while dealing with an extremely large quantity of traffic. Although examining network traffic is only one element of a comprehensive Defense in Depth strategy, it constitutes an important first line of defense for intrusion detection and data loss prevention.

Frequently, the task of identifying malicious packets falls to the network firewall. This piece of software (which may be installed on dedicated hardware) contains specific policies to apply to incoming packets. Depending on the needs of a network, these can be as short as “only accept incoming traffic which is a reply to previous outgoing traffic,” or more complex, making decisions based on different properties of each packet. In addition, these rules may be created manually by blacklisting or whitelisting certain behaviors (a rules-based approach), or a machine learning model may be used to classify instances as belonging either to the “normal” or “attack” class.
Model-based approaches may be more adaptive, able to identify malicious packets against which the network administrators have yet to devise specific rules. However, complexity comes with a price: the more complex and expansive the set of rules, the longer it takes to process each packet traversing the network. While extremely complex rule sets that depend on a large number of packet properties may show slightly improved performance at blocking malicious packets (and permitting normal traffic), on large-scale computer systems the burden of a slow firewall is more harmful than occasionally letting a malicious packet through, as it is generally assumed that no firewall will be 100% effective against an ever-changing range of threats. Thus, it is important to understand which packet properties are most important for identifying malicious packets, so that simpler (but nonetheless effective) rules can be created.

In the present work, we explore a dataset of firewall log records which contain packet information (independent features) as well as the action taken for each packet, specifically “accept” or “drop” (the class variable). With this, we apply 22 distinct feature ranking techniques to understand the relative importance of each feature, and we aggregate these results to see which features are most important across different feature selection techniques. This is one of the first works using feature selection on firewall data, and the only one to find aggregate features which are chosen by multiple feature selection approaches. We then build classification models using only the most important features, to understand whether these models are effective enough to potentially replace a network firewall. Overall, the goal is to determine which set of features is most important for building models, both to understand more about the nature of malicious packets and to build classification models which are both effective and efficient.

Our results show that the most important features for identifying malicious packets are proto and dst, representing the network protocol of the packet and the packet destination IP address. Following these are service (packet destination port), src (packet source IP address), orig (IP address of the network device that logged the packet), and i/f-name (name of the interface which logged the packet).
We also discovered that using the first two of these features gave us the best balance of classification performance and small feature subset size, and that the Naïve Bayes learner likewise produced the best balance of performance and efficiency. Thus, we recommend that new model-based firewalls use this combination of features and algorithm to quickly and effectively identify malicious packets.

The remainder of this paper is organized as follows: Section II presents the feature selection techniques, classifiers, and performance measurement approaches used in this paper. Section III introduces our case study data, discussing the properties of its features. In Section IV, we discuss our results for both feature selection and classification models. Finally, in Section V we conclude our work and include ideas for future research. Due to space limitations, we are unable to provide a comprehensive collection of related works on this topic, and our discussion of methods is likewise brief.

II. METHODS

In this work, we employ 22 feature ranking techniques and four classifiers to learn more about packets which are likely to be malicious. Section II-A discusses the feature ranking techniques used. Section II-B contains more detail on the specific classifiers we used. Finally, Section II-C goes into more detail about how we evaluated the performance of our classification models, as well as how we judged our feature ranking results.

A. Feature Selection Techniques

22 feature ranking techniques were used in this paper: Deviance (Dev), F-Measure (F), Geometric Mean (GM), Gini Index (GI), Kolmogorov-Smirnov statistic (KS), Mutual Information (MI), Odds Ratio (OR), Power (Pow), Probability Ratio (PR), Area Under the Receiver Operating Characteristic Curve (ROC), Area Under the Precision-Recall Curve (PRC), Fisher Score (FS), Fold Change Ratio (FCR), Fold Change Difference (FCD), Signal-to-Noise Ratio (S2N), Significance Analysis of Microarrays (SAM), Welch T-Statistic (WTS), Wilcoxon Rank Sum (WRS), Chi Squared (CS), Information Gain (IG), Gain Ratio (GR), and Symmetric Uncertainty (SU). These can be divided into three groups: threshold-based feature selection techniques (TBFS), which use the feature values as posterior probabilities to estimate classification errors (Dev, F, GM, GI, KS, MI, OR, Pow, PR, ROC, and PRC); first-order-statistics based techniques (FOS), which employ mean and standard deviation values to determine feature relevance (FS, FCR, FCD, S2N, SAM, WTS, and WRS); and techniques commonly used in other literature (CS, IG, GR, and SU).

The 11 TBFS ranking techniques, proposed and implemented recently by our research group [1], [5], operate by evaluating each attribute against the class, independent of all other features in the dataset. After normalizing each attribute to have a range between 0 and 1, simple classifiers are built for each threshold value t ∈ [0, 1] according to two different classification rules (e.g., whether instances with values above the threshold are considered positive or negative class examples).
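As a concrete (though simplified) illustration of this procedure, and not the authors’ exact implementation, the following Python sketch normalizes one numeric (or binarized) attribute, sweeps a grid of thresholds under both classification rules, and scores the attribute with a single example metric, the geometric mean (GM) of the true positive and true negative rates; the threshold grid and function name are illustrative assumptions.

```python
import numpy as np

def tbfs_score(values, labels, thresholds=np.linspace(0.0, 1.0, 101)):
    """Sketch of threshold-based feature ranking: treat one normalized
    attribute as if it were a classifier score, and report the best
    geometric mean of TPR and TNR over all thresholds and both
    classification rules (values above the threshold treated as
    positive, or as negative)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)          # 1 = malicious (positive), 0 = normal
    lo, hi = values.min(), values.max()  # normalize the attribute to [0, 1]
    norm = np.zeros_like(values) if hi == lo else (values - lo) / (hi - lo)
    best = 0.0
    for t in thresholds:
        for above_is_positive in (True, False):
            pred = (norm > t) if above_is_positive else (norm <= t)
            tp = np.sum(pred & (labels == 1))
            tn = np.sum(~pred & (labels == 0))
            fp = np.sum(pred & (labels == 0))
            fn = np.sum(~pred & (labels == 1))
            tpr = tp / (tp + fn) if (tp + fn) else 0.0
            tnr = tn / (tn + fp) if (tn + fp) else 0.0
            best = max(best, np.sqrt(tpr * tnr))  # GM of TPR and TNR
    return best
```

Ranking then amounts to computing such a score for every feature and sorting; the other TBFS metrics (KS, F-measure, and so on) simply replace the geometric mean in the inner loop.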
The normalized values are treated as posterior probabilities; however, no real classifiers are actually being built, making TBFS a form of filter-based feature ranking. Any classification metric may be used with TBFS; for further details regarding those used in this paper, see the above references.

Seven of the rankers used in this paper are univariate feature selection techniques which we have combined into a family we name First Order Statistics (FOS) based feature selection. This name was chosen because all of these techniques make use of first-order statistical measurements such as the mean and standard deviation. Although some of these techniques have been utilized in earlier papers, our research group only recently combined them into a single family and studied their similarity to one another, as well as how they perform in classification [3].

Four commonly used filter-based feature ranking techniques were also used in this work: chi-squared [6], information gain [4], gain ratio [4], and symmetric uncertainty [6]. All of these feature selection methods are available within the Weka machine learning tool [2], and Weka’s default parameter values were used unless otherwise noted. Since most of these methods are widely known, and for space considerations, we cannot elaborate on these rankers; the interested reader can consult the provided references.

B. Classification Algorithms

Four learners were chosen for our analysis: 5-Nearest Neighbor (5-NN), two forms of C4.5 Decision Trees (C4.5D and C4.5N), and Naïve Bayes (NB). These learners were all chosen due to their relative ease of computation and their dissimilarity from one another. Additional learners were explored during our preliminary investigation (Logistic Regression, Multi-Layer Perceptron, and Support Vector Machines), but these were found to take significantly more computational resources while also giving worse classification results. All models were built using the Weka machine learning toolkit [2], using default parameters except as follows: the 5-NN model set the number of neighbors to 5 and weighted each neighbor by a factor of 1/distance, and the C4.5N model (which was based on Weka’s J48 algorithm) turned on Laplace smoothing and turned off tree pruning (C4.5D uses Weka’s default parameters for J48). For all classification models, the “malicious packet” class was considered the positive class, with the “normal, healthy packet” class as the negative class.

C. Performance Evaluation

For our feature ranking, we did not begin by choosing an arbitrary feature subset size and then using this to guide our feature selection. Instead, we arranged all 22 ranked lists side-by-side and counted how often each feature appeared as the 1st-place, 2nd-place, and so on member of each list. Based on these counts, we noticed that natural cut-off points occurred where the difference in ranking scores between one feature and the next was larger than between features of relatively similar importance to one another. That is to say, when putting the features into an aggregate order based on how frequently they appeared towards the top of the various ranked feature lists, adjacent features in this ordering might nonetheless differ considerably in how often they appeared towards the beginning of those lists.
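The tallying step itself is straightforward; the short sketch below (a minimal illustration with made-up list contents, not the actual 22 rankings from this study) counts how often each feature lands in each rank position, which is the kind of count later summarized in Table II.

```python
from collections import Counter, defaultdict

def rank_position_counts(ranked_lists, top_k=6):
    """Given several ranked feature lists (best feature first), count how
    often each feature appears in position 1, 2, ..., top_k."""
    counts = defaultdict(Counter)  # feature -> {position: count}
    for ranking in ranked_lists:
        for position, feature in enumerate(ranking[:top_k], start=1):
            counts[feature][position] += 1
    return counts

# Illustrative input only -- not the real rankings from this study.
example_lists = [
    ["rule", "proto", "dst", "policy-id-tag", "service", "src"],
    ["proto", "dst", "rule", "service", "src", "orig"],
]
for feature, positions in rank_position_counts(example_lists).items():
    print(feature, dict(positions))
```

Natural cut-off points can then be read off wherever the counts drop sharply between one feature and the next in the aggregate ordering.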
Using this information, we determined that the top 8 features (determined from the collection of lists) are of the greatest interest. For simplicity, in our results we present how often these features appeared within the top 6 of each list individually, although our decisions regarding these features were based on the full tally of how often each feature appeared as the top kth feature (for k from 1 to 13). Further investigation could consider additional features, but as there are only 13 features to begin with in our study, choosing more than 8 begins to defeat the purpose of feature selection.

To evaluate the quality of our classification models, we used the Area Under the Receiver Operating Characteristic Curve (AUC) as our metric. AUC builds a graph of the True Positive Rate vs. False Positive Rate as the classifier decision threshold is varied, and then uses the area under this graph as the performance across all decision thresholds. Note that this is the same metric as is used in the ROC ranker, but it differs in one important way: when used as a ranker, ROC operates solely on the normalized values of a single attribute, treating these as if they were the output of a classifier. When AUC is used to evaluate our classification models, however, it is based on the actual output of those classification models. The different acronyms are used to highlight this distinction. Five-fold cross-validation was also used in preparing our models.

III. CASE STUDY

The data utilized in this paper was collected from live firewalls defending a large corporate supplier of cloud systems. As this data (both the attacks and the “normal” data) is directly collected from the real world, the proportion of different traffic types cannot be easily discovered: there is no “ground truth” as in the case of artificially-generated data. Nonetheless, this is an important use case, as artificial data can never capture all of the variety found in real-world data. For this study, we consider a packet to be malicious (a member of the positive class) if it had been dropped by the firewall, and normal (a member of the negative class) if it was accepted by the firewall. Packets which were handled in any other way (specifically, when the actions taken were “ctl”, “monitor”, or “reject”) were simply ignored.

While the detailed makeup of the attacks on this system cannot be determined (due to the use of naturally-occurring attacks), we can discuss the types of attacks generally faced by this corporation, those which warranted their own special firewall rules. For example, exfiltration of proprietary information was monitored by checking for packets from abnormal ports trying to leave the local network. Ports associated with known malware, or which did not make sense in context (for example, mail or DNS packets not heading towards the appropriate internal servers), were also monitored. In addition, broad-scale network probes (many pings from the same external IP address), ICMP probes, and known bad IP addresses were also blocked.
TABLE I
DESCRIPTION OF FEATURES FOUND IN DATASET

Feature Name    Meaning                    # of Values
orig            Firewall IP address             15
i/f-dir         Direction of packet              2
i/f-name*       Firewall device name            68
policy-id-tag   Policy ID tag                    8
src*            Source IP address            6,666
s-port*         Source port                 60,261
dst*            Destination IP address       4,914
service*        Destination port             3,506
proto*          Network protocol                 4
rule*           Firewall rule                  288
ICMP*           ICMP packet description          9
ICMP Type*      ICMP packet type                 7
ICMP Code*      ICMP packet code                 6
Overall, the goal of the firewall was both to prevent known attack vectors (IP addresses and ports associated with existing malware) and to reduce the attack surface for potentially unknown attack vectors (by limiting abnormal packets which are not overtly harmful but which do not conform to expectations regarding packet flows). In addition to acting on the packets, the firewall kept a log of its behavior, and it is this log dataset which was used for the present work.

The firewall log contains 13 attributes, as presented in Table I. All of the attributes are nominal, meaning that each contains some number of distinct values and that adjacent values are no more “similar” to one another than any other two values. The total number of values per feature is also listed in the table. The features with an asterisk (*) next to their name are those which potentially contain missing values; note that “missing value” is one possible value counted in the # of Values column (for example, proto has three “real” values and the missing value). Features with a large number of distinct values can be challenging to work with, as nominal attributes must typically be transformed into a collection of binary attributes, one per value (each representing whether the original feature had the given value).

The class value used for this dataset is action, which describes what action was taken with the packet in question. Although the original dataset had a number of possible actions, we considered only two values, “accept” and “drop”, as these were the most common (and most important) values. All instances with action values other than these were removed (and the values in Table I only consider the “accept” and “drop” instances). The total number of instances in this binary-class dataset is 288,390, with 12,922 of these being positive-class (malicious) instances. It should be noted that the rule and policy-id-tag attributes are a product of the firewall performing its work on each packet: rule is the rule which is matched by a given packet, while policy-id-tag is the policy applied to that packet. As these directly relate to the action taken on the packet, it would not be fair to include them when building classification models. Thus, although we include these in our feature ranking analysis, they are not used for building classification models.
TABLE II
FEATURE RANKING RESULT COUNTS

Feature          1st   2nd   3rd   4th   5th   6th
rule              7     2     0     0     4     3
proto             8     0     0     0     0     1
dst               5     3     2     1     1     2
policy-id-tag     1     7     1     0     0     1
service           0     5     1     0     8     3
src               0     3     3     2     5     1
orig              0     1     7     2     0     0
i/f-name          0     0     6     9     4     2

TABLE III
CLASSIFICATION RESULTS (AUC)

Dataset         Chosen Features      5-NN     C4.5D     C4.5N        NB
Cleansed Data   Top 6             0.99984   0.98445   0.99835   0.99854
Cleansed Data   Top 2             0.98602   0.97835   0.98354   0.98511
Full Data       Top 6             0.99981   0.98511   0.99817   0.99849
Full Data       Top 2             0.98560   0.97842   0.98300   0.98484
Two versions of the dataset were prepared, based on how instances with missing values were handled. First, it must be noted that some missing values were “expected”: instances with a proto value of “tcp” or “udp” will have missing values for ICMP, ICMP Type, and ICMP Code, as only ICMP packets have values for these. Conversely, packets with a proto of “icmp” will never have values for s-port or service, as ICMP packets do not make use of ports for either their source or destination. However, a number of instances were found to have missing values beyond these. In one version of the dataset (called the “Full Data”), we retained these instances, instructing our models to handle the missing values using their built-in mechanisms. In the second version, all such instances were removed, giving us the “Cleansed Data.” This Cleansed Data, which contains 287,913 instances (12,482 of them positive-class), was considered more realistic, and thus was used as the basis for our feature selection experiments; however, to validate our chosen features we built classification models for both datasets.
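To make the preparation of the two dataset versions concrete, the following sketch shows one plausible way to carry it out in Python with pandas; the file name, the CSV export format, and the exact missing-value handling are assumptions rather than details taken from the paper (the column names and value strings come from Table I and the text above).

```python
import pandas as pd

# The file name and CSV export are assumptions; column names follow Table I.
log = pd.read_csv("firewall_log.csv")

# Keep only the "accept" and "drop" actions; "drop" is the positive class.
log = log[log["action"].isin(["accept", "drop"])].copy()
log["malicious"] = (log["action"] == "drop").astype(int)

features = ["orig", "i/f-dir", "i/f-name", "policy-id-tag", "src", "s-port",
            "dst", "service", "proto", "rule", "ICMP", "ICMP Type", "ICMP Code"]

# Columns that are legitimately ("expectedly") missing for a given protocol.
allowed_missing = {
    "tcp": {"ICMP", "ICMP Type", "ICMP Code"},
    "udp": {"ICMP", "ICMP Type", "ICMP Code"},
    "icmp": {"s-port", "service"},
}

def has_unexpected_missing(row):
    allowed = allowed_missing.get(row["proto"], set())
    return any(pd.isna(row[col]) for col in features if col not in allowed)

full_data = log                                                  # "Full Data"
cleansed_data = log[~log.apply(has_unexpected_missing, axis=1)]  # "Cleansed Data"

# Nominal attributes are expanded into one binary column per value
# (one-hot encoding) before model building; rule and policy-id-tag are
# excluded because they directly determine the class.
model_features = [f for f in features if f not in ("rule", "policy-id-tag")]
X_cleansed = pd.get_dummies(cleansed_data[model_features], columns=model_features)
y_cleansed = cleansed_data["malicious"]
```

The Full Data keeps the rows with unexpected missing values and relies on each learner’s own missing-value handling, as described above.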
IV. RESULTS

A. Feature Selection Results

In Table II, we see the top eight features in the Cleansed Data, as well as how often each feature appeared as the 1st, 2nd, ..., 6th feature among the 22 ranked feature lists for this dataset. Note that we did not set out initially to choose exactly eight features: rather, upon examination of how often each feature appeared towards the top of the various feature lists (as discussed in Section II-C), we concluded that these eight (in the order presented) are the best features for discovering malicious packets. Based on our results, the top two features for identifying attacks are rule and proto. These make sense, because rule describes which security policy rule (if any) was triggered by a given packet (and which was, in turn, used to decide on the appropriate action for this packet), while proto contains the network protocol of the packet (and as certain attacks only occur over certain protocols, knowing the protocol can help rule out some attacks). Following these two features (which are of similar importance to one another), the next two important features are dst and policy-id-tag. These represent the IP address of the packet’s destination and the policy ID, respectively, and (as with the top two features) are of relatively equal importance to each other. The presence of rule and policy-id-tag among these top four is unsurprising, given their close association with the class value, but it is notable that these two are the top feature for only 8 of the 22 ranked feature lists, with either proto or dst being the top feature for 13 lists. This shows that in some cases, basic information about the packet (such as the packet’s protocol and where it is going) is more useful than the actual rules used to decide a packet’s fate.

Past these top four features, we find another set of four features which has somewhat less importance but which may nonetheless help to refine models. These include service, which contains the destination port number (that is, the targeted service) of the packet; src, which contains the IP address of the packet’s source; orig, which contains the IP address of the device that generated the log entry; and i/f-name, which is the name of the interface through which the packet passed. Collectively, these continue to show that understanding where a packet comes from and where it is going helps to predict whether that packet is malicious. In particular, together with the dst attribute found among the top four features, it is clear that both the source and destination IP addresses are important in identifying malicious packets, as are the IP address and name of the machine which noticed the packet passing into the network. The destination port is also important, and although the s-port feature (representing the source port) is not one of the top eight features, it was nonetheless somewhat important, showing that port information can help as well.

B. Classification Results

The classification results in terms of AUC for both the Cleansed and Full Data are presented in Table III. Because it does not make sense to build models which include the rule and policy-id-tag features (since these would produce nearly-perfect models on their own), we used both the two-feature subset of proto and dst and the six-feature subset which adds service, src, orig, and i/f-name. Note that although the feature subsets were chosen based on the Section IV-A rankings, which consider only the Cleansed Data, they were expected to be relevant to the Full Data as well (and the results presented here confirm this). Within each row of Table III (representing a combination of dataset and feature subset), the highest value across the learners is achieved by 5-NN and the lowest by C4.5D. The most important observation from this table is that all
models, on either dataset, with any learner, and using as few as two features, gave fairly high performance, exceeding an AUC value of 0.975 in all cases. From this, we can tell that using classification models to identify malicious packets is a valid strategy: it can give effective results without direct use of firewall rules or other human-derived knowledge of packet properties. Looking more closely at the two different feature subsets (Top 6 and Top 2), we note that when using the Top 6 features with any learner other than C4.5D, the AUC value exceeds 0.99 (with a value above 0.999 when using the best learner). When using the Top 2 features alone, this value can only be guaranteed to exceed 0.98. This suggests that using all six features may be worthwhile due to the improvement in classification performance. However, the number of chosen features must be compared with the total number of features in the dataset: although it starts with 13 features, three of these only apply to ICMP packets, and two more were removed due to being too closely tied to the class value. Thus, only eight features remain, and achieving good results when using six of them only demonstrates that classification per se is useful, rather than that the chosen features are particularly useful. On the other hand, decent models can be built even when using only the top 2 features. This leads us to suggest that for a quick-and-dirty approach to discovering malicious packets, using the proto and dst features alone is surprisingly effective.

Regarding the learners, it is clear that 5-NN consistently gives the best classification models, while C4.5D gives the worst. Of the remaining two, NB is consistently higher than C4.5N. However, the gap among the top three learners is not very large, never exceeding 0.003. Thus, in general, as long as the C4.5D learner is not used, classification performance is not the primary deciding factor among the learners.
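To give a sense of how experiments of this kind can be run, the sketch below evaluates several learners on the Top 2 feature subset with five-fold cross-validation and AUC; it uses scikit-learn rather than the Weka toolkit employed in this study (so the learner settings only approximate those of Section II-B and the numbers will not match Table III exactly), and it assumes the X_cleansed and y_cleansed variables from the earlier preparation sketch.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Restrict to the one-hot columns derived from the top two features.
top2_cols = [c for c in X_cleansed.columns
             if c.startswith("proto_") or c.startswith("dst_")]
X_top2 = X_cleansed[top2_cols]

learners = {
    "NB": BernoulliNB(),                        # Naive Bayes over binary indicators
    "5-NN": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "Decision tree": DecisionTreeClassifier(),  # CART stand-in; not true C4.5
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, clf in learners.items():
    auc = cross_val_score(clf, X_top2, y_cleansed, scoring="roc_auc", cv=cv)
    print(f"{name}: mean AUC over 5 folds = {auc.mean():.5f}")
```

Repeating the loop with the Top 6 columns (adding the service, src, orig, and i/f-name indicator columns) and with the Full Data reproduces the structure of Table III.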
One consideration which is especially important for network security is computational efficiency. It is not enough to have a model which can, given enough time, correctly label all of the packets on a network; it must be able to perform this labeling quickly enough that running the classifier in real time does not introduce latency. In light of this, the 5-NN model is not an appropriate choice for this domain, despite its classification performance: as a lazy learner, all of its computation is performed at evaluation time, when efficiency matters most. Given these constraints, NB becomes the natural choice for classification in this application domain: once the posterior probabilities have been computed, classifying a new instance is simply a matter of multiplying its feature values by some known constants and comparing the resulting sum with a simple threshold. Especially when using only the top 2 features, this creates a very efficient model. Thus, although it does not give the best classification results among the models presented here, we nonetheless recommend the use of NB with the top 2 features, proto and dst, to build an effective and efficient classification model.

V. CONCLUSION

In this work, we studied a firewall log dataset to understand which packet features are most important for identifying malicious packets. To accomplish this, we used 22 different feature ranking techniques and four classifiers. We aggregated the features across all of the ranked lists to create a consensus ranking, which enabled us to choose the top two and top six features, and we then built classification models using these feature subsets. We discovered that the top features (in order) are proto, dst, service, src, orig, and i/f-name, with the top two and top six forming natural cut-off points (where the difference in ranking counts between two features was great enough to suggest that the lower feature was meaningfully less important than the higher one). The models built from these feature subsets show that the top two features have sufficient power to build effective models alone, without needing any other features from the dataset. In addition, when comparing performance across the different classifiers (and recalling that some models are particularly time-intensive to train and execute at run time), we found that Naïve Bayes is the best choice of learner for this application domain. Thus, the use of this learner along with the top two features is a promising strategy for building more effective and efficient firewalls.

Future work can consider a wider range of datasets from various security vendors and application domains (such as political organizations, financial institutions, and other types of organizations) which exhibit different attack profiles and packet features. Also, more classification algorithms which (like Naïve Bayes) are suitable for use on extremely high-throughput data should be explored. In addition, firewall logs can be combined with other data sources (e.g., deep-packet inspection, application logs, etc.) to build a more holistic picture of the network’s security status.

REFERENCES
[1] D. J. Dittman, T. M. Khoshgoftaar, R. Wald, and J. Van Hulse, “Comparative analysis of DNA microarray data through the use of feature selection techniques,” in Ninth IEEE International Conference on Machine Learning and Applications (ICMLA), December 2010, pp. 147–152.
[2] M. A. Hall and G. Holmes, “Benchmarking attribute selection techniques for discrete class data mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 6, pp. 1437–1447, November–December 2003.
[3] T. M. Khoshgoftaar, D. Dittman, R. Wald, and A. Fazelpour, “First order statistics based feature selection: A diverse and powerful family of feature selection techniques,” in 11th International Conference on Machine Learning and Applications (ICMLA), vol. 2, December 2012, pp. 151–157.
[4] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
[5] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano, “A comparative evaluation of feature ranking methods for high dimensional bioinformatics data,” in 2011 IEEE International Conference on Information Reuse and Integration (IRI), August 2011, pp. 315–320.
[6] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. Burlington, MA: Morgan Kaufmann, January 2011.