A Machine Learning Based Reputation System for Defending Against Malicious Node Behavior

Rehan Akbani, Turgay Korkmaz, and G. V. S. Raju

Abstract—Reputation Systems (RS) are designed to detect malicious nodes in a network and thwart their attacks, such as the spreading of viruses or worms, or attacks against known vulnerabilities. They do this by collecting information about a node's past transactions and using it to predict the node's future behavior. Traditionally, RSs have been designed by manually devising specific models or equations that use historical data to defend against certain types of attacks. In this paper, we propose a Machine Learning based RS that automates the process of devising the RS model and defends against many patterns of attacks. We discuss the merits of this approach and propose using Support Vector Machines (SVM) as the basis of the RS. We delineate the factors associated with building the SVM-based RS, and then propose and evaluate our technique. We compare the performance of our RS with another RS found in the literature, called TrustGuard, and show that our RS significantly outperforms TrustGuard. Our RS correctly distinguishes between good and malicious nodes with high accuracy, even when the proportion of malicious nodes in the network is very high.

Index Terms—Machine Learning, Peer-to-Peer, Reputation Systems, Support Vector Machines.

I. INTRODUCTION

Reputation Systems (RS) have been proposed by researchers for use in Peer-to-Peer (P2P) networks, large scale distributed networks, wireless networks, and on the Internet [1]-[4]. Their goal is to address the challenging problem of securing the network against attacks from nodes that are malicious or have been compromised. This is very useful for applications in wireless networks, such as WLANs, MANETs, and Sensor Networks, where it is easy for nodes to join the network or impersonate another node. Reputation Systems are also useful in other application areas such as the military, disaster recovery efforts, file sharing, and, in general, any network that is susceptible to intruders.

Manuscript received March 31, 2008. Rehan Akbani is with the Dept. of Computer Science, University of Texas, San Antonio, TX 78249 USA (e-mail: rakbani@cs.utsa.edu). Turgay Korkmaz is with the Dept. of Computer Science, University of Texas, San Antonio, TX 78249 USA (e-mail: [email protected]). G. V. S. Raju is with the Dept. of Electrical Engineering, University of Texas, San Antonio, TX 78249 USA (e-mail: [email protected]).

Unfortunately, there is no litmus test that can verify whether a node in a network is malicious or benign. Reputation Systems try to predict the future behavior of a node, such as a server or a client, by analyzing its past behavior. They try to discern the intentions of a node by observing its behavior and discriminating legitimate behavior from malicious behavior.

The central concept in RS is that past records of transactions conducted by a node are stored in the network in a distributed fashion. Each node that interacts with another node stores some feedback about the interaction, classifying it as either legitimate or suspicious (or some value in between). If, for instance, the interaction consisted of downloading a file, the client could determine whether the downloaded file was indeed the one requested, or whether it was a Trojan, a virus, or spam. Another example would be a client or server that violated the rules of a network protocol and sent malformed packet headers to attempt a buffer overflow, or to launch some other attack against known vulnerabilities. We realize that the detection process may not be perfect, and we need to account for inaccuracies.

Based on the feedback of various nodes, a new node can decide whether or not to transact with a given node, even though they may never have interacted before. eBay utilizes this form of reputation system, where users leave feedback about other users they have transacted with. Alternatively, a client looking for a specific service could choose the server with the best reputation among a pool of servers.

We analyzed the approach taken by various researchers for designing Reputation Systems [1]-[3], [5] and devised a general framework for RS, as shown in Fig. 1. In general, a node that needs to decide whether or not to transact with another node must first gather historical data about that node (e.g., the proportion of good vs. bad transactions in the last x minutes). Then it applies a customized mathematical equation or statistical model to the data to produce an output score. For example, the RS in [3] is based on eigenvalues from Linear Algebra, the one in [2] is based on derivatives and integrals, whereas the one in [5] is based on Bayesian systems utilizing the Beta distribution. Depending on the output of the equation or model, the system then decides how to respond.
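As a rough illustration of this general framework, the following sketch shows the three steps of Fig. 1 as a single decision function. The function names and the 0.5 threshold are purely illustrative assumptions, not details of any particular RS.

```python
def should_transact(node_id, gather_history, score_model, threshold=0.5):
    """Generic RS decision from Fig. 1 (all names here are illustrative).

    gather_history(node_id) -> historical data about the node, e.g. the
    proportion of good vs. bad transactions per time interval.
    score_model(history)    -> the system-specific equation or model that
    maps that history to a single output score.
    """
    history = gather_history(node_id)
    return score_model(history) >= threshold  # True = transact, False = refuse
```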


[Fig. 1. General framework of a Reputation System that decides whether to transact with a given node or not: collect historical data, apply an equation to the data, apply a threshold to the output, then decide yes or no.]

The major differences between existing Reputation Systems are the type of historical data that are collected and the equation or model that is applied to it. In most cases, the equation or model is customized to detect specific types of malicious behavior only. This makes sense, since we would expect different attacks to exhibit different behavioral patterns, which may not all be discernible by a single "general purpose" equation. For instance, the algorithm in [2] is specially designed to detect malicious behavior that alternates with good behavior and varies over time.

II. OUR MACHINE LEARNING APPROACH

Based on our analysis, we redefined the problem of designing Reputation Systems (RS) into one of finding the optimal set of input features and equations for specific types of malicious behavior (steps 1 and 2 in Fig. 1). In this context, we have found Machine Learning (ML) to be of particular significance, since many ML algorithms are able to determine and approximate the optimal equation needed to classify a given set of data. In particular, we opted to use Support Vector Machines (SVM) as our ML algorithm because they have been shown to successfully approximate mathematical functions [6] and make time series predictions [7] based on given data. Given the history of transaction data, the goal is to construct ML classifiers that can utilize this data to make predictions.

We envision the RS problem as a time series prediction problem, which states: given the values of the dependent variable at times (t, t-1, t-2, ..., t-n), predict the value of the variable at time (t + 1) [8], [9]. The dependent variable in this case is the proportion of good transactions conducted by a node in a given time slot. Predicting this variable at time (t + 1) gives us the probability that the node will behave well if we choose to transact with it at time (t + 1). However, we simplify the problem by not explicitly requiring the value of this probability, but using it implicitly to make a "yes" or "no" decision. By doing so, we change the problem from one of regression to one of classification (although SVM can be used in either context [6]). Therefore, we need to construct classifiers that can predict the behavior of a node with high accuracy.

We would need to build SVM models against different types of malicious behavior offline, and then upload those models to the nodes in the network. Each time a new attack is discovered, the training data are updated to include the new attack data, and the models are retrained and updated. The nodes can then use those models to classify new nodes and predict whether a new node is malicious or not. Constructing models is computationally expensive and requires plenty of attack and normal data, so it is done offline, possibly by a third party.
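To make this offline-training / online-classification split concrete, here is a minimal sketch using the scikit-learn SVC implementation. The feature layout (15 per-slot fractions of positive feedback), the toy training data, and all variable names are our own assumptions for illustration, not the authors' actual code.

```python
import numpy as np
from sklearn.svm import SVC

# Each row is one node's history: the fraction of positive feedbacks it
# received in each of the last 15 time slots. Labels: 1 = benign, 0 = malicious.
X_train = np.array([
    [1.0] * 15,                 # a node that always behaved well
    [1.0, 0.0] * 7 + [1.0],     # a node oscillating between good and bad behavior
])
y_train = np.array([1, 0])

# Offline step (e.g., by the RS vendor): train the linear-kernel SVM model.
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# Online step (on each node): classify a newly observed node from the
# 15-slot feedback history gathered from the rest of the network.
new_node_history = np.array([[0.9, 1.0, 0.8] * 5])
label = model.predict(new_node_history)[0]  # 1 = predicted benign, 0 = predicted malicious
print("transact" if label == 1 else "refuse")
```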

The classification step, on the other hand, is not very expensive and can be done on the node in real time. This system is similar to how anti-virus systems work, where the anti-virus is developed offline and then uploaded to computer systems. Similarly, in our scheme the vendor of the RS might update its subscribers with SVM models against new attacks.

An implied assumption is that after a transaction has taken place, a node can determine with a certain high probability whether the transaction was good or bad. This is true in many cases, such as commercial transactions on eBay, file downloads (where a corrupted or virus infected file would be considered bad), or the provision of network services (where a malformed header, non-compliance with a protocol, or incomplete data could be considered bad) [10], [11]. This assumption is made by many researchers in the field [1], [2], [3], and we make the same assumption in our study. Transactions that are incorrectly labeled good or bad are considered noisy instances; SVM can handle fair amounts of noise in the dataset [6].

III. BUILDING THE MACHINE LEARNING CLASSIFIER FOR REPUTATION SYSTEMS

There are many factors to consider in building the classifier that will be used to distinguish malicious nodes from good nodes. The most important factor is feature selection, which we describe below along with the other factors. We used the following approach for building our classifier:

1. Feature Selection: This is the set of features used to train and test the classifier, and selecting them is a critical step in constructing the classifier. To collect our features, we divide time into regular intervals called time slots. The features in our experiments take into account the proportion of positive vs. negative feedbacks assigned to a test node based on the transactions it conducted during a given time slot. In an actual setting, we would query all the nodes in the network and ask them to provide any feedbacks they have about the test node for the given slot. The fraction of positive feedbacks divided by total feedbacks for that slot forms a single feature, so each time slot corresponds to one feature (a sketch of this feature construction follows this list). The feedbacks are either transmitted over a secure overlay network, or, if the overlay is insecure, each feedback is digitally signed by the sender and verified by the receiver to ensure it has not been tampered with. This idea is similar to the features used in time series prediction problems [8], [9] and in [2].

Using too few features (i.e., too few time slots) might not provide sufficient information to the classifier and may result in "under fitting," where the ML model does not represent the data well. On the other hand, using too many features might result in over fitting, where the model fits the training data too well but is unable to generalize to other datasets. It is common knowledge in ML that increasing the number of features increases the accuracy up to a certain point; thereafter, adding features degrades performance. This phenomenon is called the "Curse of Dimensionality" [6]. We can vary the number of features by varying the number of time slots used.


We use 15 time slots in our experiments, since this number yields good results.

To guard against fake transaction feedbacks, where the node giving the feedback never really transacted with the test node, we suggest using "transaction proofs," as proposed by the authors of [2]. They describe a mechanism for binding transactions to transaction proofs, using third parties, such that a proof cannot be forged and is always exchanged atomically. These proofs ensure that every feedback is tied to a transaction that actually took place. However, they do not guard against false feedback, where a node deliberately gives dishonest feedback or inadvertently detects and labels a transaction incorrectly. The ML classifier is designed to work well in the presence of dishonest or incorrect feedback, as discussed in Section IV.

2. Proportion of Malicious Nodes: This refers to the proportion of good vs. bad nodes in the dataset. It relates to the degree of imbalance in the dataset, and it need not be the same in the training and test sets. In our experiments, we fix the imbalance ratio in the training set and vary it in the test set, since in an actual setting we would not know the imbalance ratio in the network.

3. Size of Dataset: This refers to the number of instances in the training and test sets. We use 1,000 training and 1,000 test instances.

4. Evaluation Methodology: This is the method used to evaluate the performance of the classifier, such as separate training and test sets, n-fold cross validation, leave-one-out, etc. We use separate training and test sets for evaluating our classifier, because we want to fix the imbalance ratio in the training set but vary it in the test set.

5. Evaluation Metrics: We need to consider the metrics used to evaluate the classifier, such as accuracy or error rate, precision, recall, etc. In accordance with [2], we use error rate for evaluating our classifier.

6. Kernel Used: For SVM, we need to decide which kernel to use. We use the linear kernel in our experiments because it is the simplest, and therefore the most computationally efficient, kernel, and it gives good performance.

7. Using the ML Output: If the ML classifier predicts that a node is malicious, we simply do not transact with that node. However, if the ML classifier predicts that the node is benign, we can go ahead and transact with it. We do not discriminate between malicious and honest nodes when collecting feedbacks, and all feedbacks are accepted for generating features. If we knew which feedbacks were honest and which were not, or if all feedbacks were honest, the RS problem would become trivial, because a simple majority vote would suffice and we would not need Machine Learning. The job of the classifier is to produce reliable predictions in the presence of malicious nodes giving false feedbacks.
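As referenced in item 1 above, the sketch below shows one possible way to turn raw per-transaction feedbacks into the per-slot feature vector. The tuple layout, the default of 0.5 for slots with no feedback, and all names are our assumptions, not details specified in the paper.

```python
from collections import defaultdict

def build_features(feedbacks, test_node, num_slots=15):
    """One feature per time slot: fraction of positive feedbacks about test_node.

    feedbacks: iterable of (reporter_id, target_id, slot, positive) tuples
    gathered from the network. Slots with no feedback default to 0.5
    (no evidence either way) -- an assumption on our part.
    """
    pos, total = defaultdict(int), defaultdict(int)
    for reporter, target, slot, positive in feedbacks:
        if target == test_node and 0 <= slot < num_slots:
            total[slot] += 1
            if positive:
                pos[slot] += 1
    return [pos[s] / total[s] if total[s] else 0.5 for s in range(num_slots)]

# Example: two reporters describe hypothetical node "n42" in slots 0 and 1.
example = [("a", "n42", 0, True), ("b", "n42", 0, False), ("a", "n42", 1, True)]
print(build_features(example, "n42"))  # [0.5, 1.0, 0.5, 0.5, ...]
```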

We constructed and trained the SVM classifier based on these factors. We also implemented TrustGuard's Naïve and TVM algorithms [2] and compared their performance with SVM, as discussed in the next section.

IV. EXPERIMENTS AND RESULTS

A. Simulation Setup

We generated the datasets using simulations of a network consisting of 1,000 nodes. Time was divided into slots, and in each time slot several transactions were conducted between randomly chosen pairs of nodes. Each node would then label the transaction as good or bad and store that label, along with a time stamp and the ID of the other node. The label may or may not reflect the true observation of a node, i.e., a node may lie about a transaction and give dishonest feedback. Good behavior was characterized as a node conducting a normal transaction and giving honest feedback about it. Bad behavior was characterized as a node conducting a malicious transaction and lying about the feedback.

After several time slots, the simulation was stopped and data about each node was gathered. To gather data about a node x, all the other nodes in the network were queried and asked to give information about x going back a certain number of time slots. The total numbers of good and bad transactions conducted by x in a given time slot were accumulated and the proportion of positive feedback was computed. This computation was repeated for each time slot of interest. In this way a concise, aggregate historical record of x was obtained. A correct label of malicious or benign was assigned to x by us, based on its role in the simulation.

In preliminary experiments, the proportion of malicious nodes, or imbalance ratio, in the training set was varied and the results on the test set were observed. The imbalance ratio that yielded the best results was then used to construct the full training set for a given attack pattern. The following sets of experiments were conducted.

B. Experiment 1: Periodic Behavior without Collusion

In this experiment the behavior of good nodes is constant, while the behavior of bad nodes oscillates between good and bad periodically. The good nodes always behave well and try to conduct good transactions. Malicious nodes behave well for a short time in order to boost their reputation, and then use that reputation to conduct malicious transactions. The malicious nodes do not collude. We used 15 time slots of historical data. The amount of time a malicious node behaves well was kept below 15 time slots (otherwise it would be indistinguishable from legitimate good behavior). The training set had 50% malicious nodes, because this imbalance ratio was found to perform well during preliminary testing. The test set's proportion of malicious nodes was varied and the prediction error was plotted. Prediction error is defined as the proportion of nodes that were incorrectly classified as good or malicious.
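A greatly simplified version of this data generation step is sketched below. Sampling the per-slot feedback fractions directly (rather than simulating individual transactions and dishonest feedback), as well as the oscillation period, noise level, and all names, are our own simplifications and assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SLOTS = 15

def good_node_history():
    # Good nodes behave well in every slot; small noise stands in for the
    # occasional incorrect or dishonest feedback they receive.
    return np.clip(rng.normal(0.95, 0.05, NUM_SLOTS), 0.0, 1.0)

def malicious_node_history(good_period=5):
    # Malicious nodes behave well for a few slots to boost their reputation,
    # then behave badly, repeating the cycle (period kept below 15 slots).
    history = np.empty(NUM_SLOTS)
    for slot in range(NUM_SLOTS):
        behaving_well = (slot % (2 * good_period)) < good_period
        mean = 0.9 if behaving_well else 0.1
        history[slot] = min(max(rng.normal(mean, 0.05), 0.0), 1.0)
    return history

def make_dataset(n_nodes=1000, malicious_fraction=0.5):
    # Build a labeled dataset with a chosen proportion of malicious nodes
    # (label 1 = benign, 0 = malicious).
    X, y = [], []
    for _ in range(n_nodes):
        if rng.random() < malicious_fraction:
            X.append(malicious_node_history()); y.append(0)
        else:
            X.append(good_node_history()); y.append(1)
    return np.array(X), np.array(y)

X_train, y_train = make_dataset(1000, malicious_fraction=0.5)
X_test, y_test = make_dataset(1000, malicious_fraction=0.8)  # varied in the test set
```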


Fig. 2. Error in prediction accuracy vs. Proportion of malicious nodes for Periodic Behavior without Collusion.

Fig. 3. Error in prediction accuracy vs. Proportion of malicious nodes for Periodic Behavior with Collusion.

In a real-world setting, if a node is classified as good, we would transact with it, and if it is classified as malicious, we would not. Since TrustGuard does not output a classification label but a trust score between 0 and 1, we considered a node with a score ≥ 0.5 as good, and a node with a score < 0.5 as malicious. We compared the performance of our algorithm against the performance of TrustGuard's TVM and Naïve algorithms given in [2]. The results are plotted in Fig. 2.

The results show that SVM significantly outperforms TrustGuard's Naïve and TVM algorithms. SVM's maximum error is about 0.065 (6.5%), whereas TrustGuard Naïve's maximum error is 0.166 (16.6%) and TVM's maximum error is 0.192 (19.2%). It is interesting to note that the maximum error occurs at 80% malicious nodes, not at 90% as one would expect. This is probably because of the high imbalance ratio of good vs. malicious nodes: at a 90% imbalance ratio, if every node were classified as malicious, we would get 90% accuracy. If the proportion of malicious nodes in the test simulation is very large, they will tend to give false feedbacks, so that almost every node seems malicious. This causes the classifier to classify almost every node as malicious as well. Therefore, at very high imbalance ratios, the overall accuracy increases.
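For reference, the thresholding of a continuous trust score and the error metric used above could be computed as follows. This is an illustrative helper with made-up example inputs, not code from either system.

```python
import numpy as np

def prediction_error(scores, true_labels, threshold=0.5):
    """Fraction of nodes misclassified after thresholding a trust score.

    scores: continuous outputs in [0, 1] (e.g., TrustGuard trust scores);
    true_labels: 1 = good node, 0 = malicious node.
    """
    predicted = (np.asarray(scores) >= threshold).astype(int)
    return float(np.mean(predicted != np.asarray(true_labels)))

print(prediction_error([0.8, 0.3, 0.6, 0.4], [1, 0, 0, 1]))  # 0.5
```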

C. Experiment 2: Periodic Behavior with Collusion

In Experiment 2, the behavior of good nodes is once again constant, while the behavior of bad nodes oscillates between good and bad as in Experiment 1. The difference is that now, when two malicious nodes transact with each other, they recognize each other and each node gives a positive feedback about the other. In this way, they help each other boost their reputation even faster. Of course, they cannot conduct an arbitrarily large number of transactions with each other and leave extraordinary feedbacks, since that would make the good nodes suspicious and put the malicious nodes' credibility in doubt.
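One simplified way to model this collusive feedback in the simulation is sketched below. The exact rules (colluding malicious nodes always praising each other and lying about everyone else) are our reading of the setup, not a specification from the paper.

```python
def feedback_is_positive(reporter_is_malicious, target_is_malicious, transaction_was_good):
    """Feedback one node reports about a single transaction with another node.

    A simplified model of the collusive setting: colluding malicious nodes
    recognize each other and always report a positive outcome; a malicious
    node lies about everyone else; good nodes always report honestly.
    """
    if reporter_is_malicious and target_is_malicious:
        return True                       # collusion: boost each other's reputation
    if reporter_is_malicious:
        return not transaction_was_good   # dishonest feedback about non-colluders
    return transaction_was_good           # honest feedback from good nodes
```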

As before, 15 time slots of historical data were used. The training set had 60% malicious nodes, since this was found to perform well in preliminary testing. The test set's proportion of malicious nodes was varied and the prediction error was plotted, as in Experiment 1. Once again, we compared the performance of our algorithm against the performance of TrustGuard's TVM and Naïve algorithms given in [2]. The results are plotted in Fig. 3.

Again, SVM performs much better than TrustGuard's Naïve and TVM algorithms. As expected, the attack with node collusion is more severe, so the overall error rises compared to the non-collusive setting. However, even at 80% malicious nodes, the maximum error of SVM is only 0.227 (22.7%), whereas the maximum error of TrustGuard is around 0.45 (45%). As explained in Experiment 1, SVM's error drops at 90% malicious nodes due to the high imbalance ratio. These results show that the Machine Learning approach to designing Reputation Systems is viable and yields high performance.

V. CONCLUSIONS AND FUTURE WORK

This paper proposes a Machine Learning (ML) technique for designing Reputation Systems (RS). RS can be used to defend against malicious transactions attempted by compromised or malicious nodes in a network, such as the spreading of viruses or worms, or attacks against known vulnerabilities. There is no perfect way of knowing whether a future transaction will be malicious or not, but we can try to estimate the reliability of a node by observing its past behavior. The problem of designing an RS can be viewed as one of finding the optimal model and feature set that can predict the future behavior of a node based on its past behavior.

We discussed why ML is a good approach for designing RS and cited precedents for solving similar problems with Support Vector Machines (SVM). Machine Learning algorithms, especially Support Vector Machines, have been successfully applied to modeling and time series prediction problems [7].


The RS problem can also be categorized as a time series prediction problem, where we are given observations made during past time slots and need to predict the observation in the next time slot. The observation in this case is the probability that a given node conducts a good transaction.

We highlighted the factors that one needs to consider in designing an SVM-based Reputation System. We then proposed our specific method, and simulated and tested our SVM-based RS as well as another comparable RS found in the literature, called TrustGuard [2]. We chose TrustGuard's Naïve and TVM algorithms for comparison because their overheads, such as bandwidth and communication overheads, are very similar to ours. We conducted experiments showing that our model outperforms TrustGuard in both attack scenarios: non-collusive (where malicious nodes do not collude with each other) as well as collusive (where they do). We showed that our scheme achieves very high accuracy and correctly predicts good vs. malicious nodes, even when the proportion of malicious nodes in the network is very high.

In the future, we plan to enhance our classifier by taking into account the reliability of the agent providing the feedback. This is done in TrustGuard's PSM algorithm; however, the overhead involved in PSM is far greater, since we have to recursively collect further feedback about the agents that gave us the original feedback, which increases the overhead exponentially. This is why we did not compare our RS's performance against PSM. We also plan to study and try to minimize the overhead associated with collecting the features, and to perform further experiments by altering various factors involved in constructing the classifier, such as the kernel and the feature set, and observing their effects.

REFERENCES

[1] T. Jiang and J. S. Baras, "Trust evaluation in anarchy: a case study on autonomous networks," in Proc. 25th Conference on Computer Communications, April 2006.
[2] M. Srivatsa, L. Xiong, and L. Liu, "TrustGuard: countering vulnerabilities in reputation management for decentralized overlay networks," in Proc. International World Wide Web Conference (WWW), 2005.
[3] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina, "The EigenTrust algorithm for reputation management in P2P networks," in Proc. WWW, 2003.
[4] A. Josang, R. Ismail, and C. Boyd, "A survey of trust and reputation systems for online service provision," Decision Support Systems, vol. 43, no. 2, pp. 618-644, March 2007.
[5] A. Josang and R. Ismail, "The beta reputation system," in Proc. 15th Bled Electronic Commerce Conference, Slovenia, June 2002.
[6] S. Abe, Support Vector Machines for Pattern Classification (Advances in Pattern Recognition), Springer, ISBN 1852339292, chs. 6 and 11, July 2005.
[7] F. Camastra and M. Filippone, "SVM-based time series prediction with nonlinear dynamics methods," Knowledge-Based Intelligent Information and Engineering Systems, LNCS, Springer, vol. 4694, pp. 300-307, 2007.
[8] R. J. Frank, N. Davey, and S. P. Hunt, "Time series prediction and neural networks," Journal of Intelligent and Robotic Systems, pp. 91-103, 2000.
[9] C. L. Giles, S. Lawrence, and A. C. Tsoi, "Noisy time series prediction using a recurrent neural network and grammatical inference," Machine Learning, vol. 44, no. 1/2, pp. 161-183, July/August 2001.
[10] J. S. Baras and T. Jiang, "Managing trust in self-organized mobile ad hoc networks," Workshop NDSS, extended abstract, 2005.
[11] T. Jiang and J. S. Baras, "Ant-based adaptive trust evidence distribution in MANET," in Proc. 24th International Conference on Distributed Computing Systems, Tokyo, Japan, March 2004.

