A Hybrid Command Sequence Model for Anomaly Detection

Zhou Jian, Haruhiko Shirai, Isamu Takahashi, Jousuke Kuroiwa, Tomohiro Odaka, and Hisakazu Ogura

Graduate School of Engineering, University of Fukui, Fukui-shi, 910-8507, Japan
{jimzhou, shirai, takahasi, jousuke, odaka, ogura}@rook.i.his.fukui-u.ac.jp
Abstract. A new anomaly detection method based on models of user behavior at the command level is proposed as an intrusion detection technique. The hybrid command sequence (HCS) model is trained from historical session data by a genetic algorithm and is then used as the criterion for verifying observed behavior. The proposed model considers the occurrence of multiple command sequence fragments in a single session, so that it can recognize non-sequential patterns. Experimental results demonstrate an anomaly detection rate higher than 90%, comparable to other statistical methods and 10% higher than the original command sequence model.

Keywords: Computer security; IDS; Anomaly detection; User model; GA; Command sequence.
Z.-H. Zhou, H. Li, and Q. Yang (Eds.): PAKDD 2007, LNAI 4426, pp. 108–118, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Preventative methods are widely used to safeguard against unauthorized access to restricted computing resources, including techniques such as accounts, passwords, smart cards, and biometrics that provide access control and authentication [1]. However, with the growing volume and sensitivity of data processed on computer systems, data security has become a serious concern, making it necessary to implement secondary defenses such as intrusion detection [2]. Once intruders have breached the authentication level, typically by using the system under a valid account, online intrusion detection serves as a second line of defense to improve the security of computer systems. Intrusion detection systems (IDS) have been studied extensively in recent years with the aim of automatically monitoring behavior that violates the security policy of the computer system [3] [4] [5].

The present study focuses on anomaly detection at the command line level in a UNIX environment. Each user in a homogeneous working environment has specific input characteristics that depend on the task, such as familiar commands and usage habits, and the topic of work will be stable within discrete periods. Users also differ individually in terms of work content and access privileges. For example, a programmer and a secretary may exhibit very different usage behaviors. One means of intrusion detection is therefore to construct a user model by extracting characteristics of user behavior from historical usage data and to flag any deviation from this typical usage pattern as a potential intrusion.

The present authors have already conducted some research on anomaly detection at the command level using such a user model [7]. Detection was realized by a simple command sequence (SCS) model method, in which the user model was built by extracting command sequence fragments frequently used by the current user and seldom used by others. The model was trained by machine learning with a genetic algorithm (GA), and the method successfully detected more than 80% of anomalous user sessions in the experiment.

In this paper, a new hybrid command sequence (HCS) model is presented. The characteristics of user behavior are extracted by machine learning, and a list of unique and frequently used command combinations is built for each user. The trained HCS model can then be used as the criterion for detecting illegal user behavior (breach of authentication) or anomalous internal user behavior (misuse of privileges). These improvements provide a substantial increase in performance over the original SCS method, and the new method is also comparable to other statistical methods in experiments on a common command sequence data set.
2 Related Work

Intrusion detection systems broadly follow two approaches: anomaly detection and misuse detection. Anomaly detection is based on the assumption that activity on a computer system during an attack will be noticeably different from normal system activity. Misuse detection is realized by matching observed behavior against a knowledge base of known intrusion signatures [3]. Each technique has strengths and weaknesses. Anomaly detection is sensitive to behavior that deviates from historical user behavior and can thus detect some new, unknown intrusion techniques, yet it also often judges new but normal behavior as illegal. In contrast, misuse detection is not sensitive to unknown intrusion techniques but provides a low false alarm rate.

Anomaly detection using Unix shell commands has been studied extensively. In addition to providing a feasible approach to the security of Unix systems, it can also be generalized to other systems. Schonlau et al. [6] summarized six methods of anomaly detection at the command line level: “Uniqueness”, “Bayes one-step Markov”, “Hybrid multi-step Markov”, “Compression”, “IPAM” and “Sequence-match”.

The Bayes one-step Markov method is based on one-step transitions from one command to the next. The detector determines whether the observed transition probabilities are consistent with the historical transition probabilities. The hybrid multi-step Markov method is based mostly on a high-order Markov chain and occasionally on an independence model, depending on the proportion of commands in the test data that were not observed in the training data. The compression method involves constructing a reversible mapping of data to a different representation that uses fewer bytes, where new data from a given user compresses to about the same ratio as old data from the same user. The incremental probabilistic action modeling (IPAM) method is based on one-step command transition probabilities estimated from the training data with a continual exponential updating scheme. The sequence-matching method computes a similarity measure between the 10 most recent commands and a user’s profile using a command-by-command comparison of the two command sequences. The uniqueness method is based on command frequency. Commands not seen in the training data may indicate a masquerade attempt, and the more infrequently a command is used by the user, the more strongly its use indicates a masquerade attack.

These approaches perform anomaly detection statistically: deviation of the system's running state is monitored through a statistical value, and a threshold is used as the classification criterion. Characteristic-based classification is another approach to anomaly detection. That is, it should be possible to use unique command combinations specific to each user to verify user behavior and define security policies. The SCS model was proposed by the present authors [7] as such an approach, in which the characteristic user model was constructed by extracting frequently appearing command sequence fragments for each user and applying GA-based machine learning to train the model. In this paper, the HCS model is presented as an extension of the SCS model to account for additional characteristics of user behavior. A GA is again employed to learn the model from historical profile data.
3 Hybrid Command Sequence Model

3.1 Hybrid Command Sequence Model

The SCS model is constructed from the historical session profile data of individual users, where a session is defined as the command sequence entered between login and logout. A session is regarded as the basic unit of analysis for user behavior, covering the activities conducted to accomplish a certain task. By analyzing user behavior in discrete periods with similar task processing, the unique behavioral characteristics of each user can be extracted. The SCS model was thus constructed to characterize user behavior, and the trained model was used to label unknown sessions.

The model is trained by a GA from historical session profile data to search for command sequence fragments that appear frequently in the current user’s normal session data set {St} but only rarely in the data sets of other users {Sf}. For each user, the number of commands in one command sequence (CS) fragment and the number of CS fragments in the learned SCS model are both variable and determined by the training process. An unknown session is labeled as legal input if it contains any CS fragment of the SCS model, and as illegal otherwise. The matching between a CS fragment of the SCS model and the observed session data is illustrated in Fig. 1, where the CS fragment (CS3) contains three commands C31, C32 and C33. If the observed session contains these three commands consecutively and in this order, regardless of their location in the session, it is labeled as legal. Experimental results showed that the SCS model achieves up to 80% accuracy in detecting illegal sessions.

CS3 : C31 C32 C33
Session : … … C31 C32 C33 …

Fig. 1. Matching between a CS fragment of a SCS model and the observed session data
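The SCS matching rule described above can be illustrated with a minimal Python sketch. It assumes that a session and a CS fragment are both plain lists of command names and that "contains" means a contiguous, in-order occurrence; the function names `contains_fragment` and `scs_is_legal`, and the example model, are hypothetical and not taken from the original implementation.

```python
from typing import List, Sequence

Fragment = Sequence[str]   # a CS fragment: commands that must appear consecutively
Session = Sequence[str]    # one session: commands entered between login and logout


def contains_fragment(session: Session, fragment: Fragment) -> bool:
    """Return True if `fragment` occurs as a contiguous run anywhere in `session`."""
    n, m = len(session), len(fragment)
    if m == 0 or m > n:
        return False
    target = tuple(fragment)
    return any(tuple(session[i:i + m]) == target for i in range(n - m + 1))


def scs_is_legal(session: Session, model: List[Fragment]) -> bool:
    """SCS rule: a session is legal if it contains at least one fragment of the model."""
    return any(contains_fragment(session, frag) for frag in model)


# Hypothetical model and sessions, for illustration only.
model = [["cd", "ls", "vi"], ["make", "gdb"]]
print(scs_is_legal(["ls", "cd", "ls", "vi", "exit"], model))  # True
print(scs_is_legal(["cd", "vi", "ls", "exit"], model))        # False
```

The second session contains the same commands as the first fragment but not consecutively, so it is rejected; this is the limitation that motivates the HCS model.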
However, the SCS model cannot handle the situation in which frequently used command combinations (combinations of CS fragments) occur in one session but not necessarily in sequential order. Because multiple CS fragments in a single session may characterize user behavior more strongly, the hybrid command sequence (HCS) model is proposed. The HCS model, as an extension of the SCS model, can describe further characteristics such as multiple CS fragments and discrete commands in a single session.

Fig. 2 shows the structure of the HCS model. The model is composed of multiple units, and each unit contains multiple CS fragments. The number of units in one model, the number of CS fragments in one unit, and the number of commands (denoted C in Fig. 2) in one CS fragment are all variable and determined by training. An example of the HCS model is shown in Fig. 3.

Different numbers of units, CS fragments and commands in the HCS model describe different characteristics of user behavior. For example, if each unit contains only one CS fragment, the HCS model is identical to the SCS model; if each CS fragment contains only one command, the discrete commands in the unit are used as the criterion for anomaly detection without considering the sequential characteristic. If a unit contains both continuous sequences and discrete commands, the model is a composite of these two characteristics. The HCS model is thus more powerful in representing characteristics of user behavior.

When the HCS model is used to classify an unlabeled session, the session is labeled as legal if any unit of the model is found in the session, which means that all CS fragments of that unit must be contained in the session. The matching between a unit of the HCS model and a session is illustrated in Fig. 4, where a unit consists of two CS fragments (CS11, CS12), CS11 contains three commands (C111, C112 and C113), and CS12 contains two commands (C121 and C122). Thus, if the session contains both CS11 and CS12, the session is labeled as legal.

[Figure 2: tree diagram. The HCS model branches into units (UNIT1, UNIT2, UNIT3, …); each unit branches into CS fragments (CS11, CS12, CS13, …); each CS fragment is a sequence of commands (C111, C112, …).]

Fig. 2. Structure of the hybrid command sequence model
[Figure 3: example model with three units, UNIT1 (CS11, CS12, CS13), UNIT2 (CS21, CS22) and UNIT3 (CS31, CS32, CS33), whose CS fragments are built from the commands exit, le, make, ll, vi, kill and cd.]

Fig. 3. Example of an HCS model

CS11 : C111 C112 C113
CS12 : C121 C122
Session : … C111 C112 C113 … C121 C122 …

Fig. 4. Matching between a unit in the HCS model and the observed session data
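The unit-matching rule of the HCS model can be sketched in Python as follows. This is only an illustration under stated assumptions: a model is a list of units, a unit is a list of CS fragments, and a fragment is a list of command names. The text does not specify whether the fragments of a unit may overlap in the session or must appear in a particular order, so the sketch allows any order and any overlap. All names and the example model are hypothetical.

```python
from typing import List, Sequence

Fragment = Sequence[str]   # commands that must appear consecutively
Unit = List[Fragment]      # all fragments of a unit must be present in the session
HCSModel = List[Unit]      # a session is legal if any unit matches


def contains_fragment(session: Sequence[str], fragment: Fragment) -> bool:
    """True if `fragment` occurs as a contiguous run somewhere in `session`."""
    m = len(fragment)
    if m == 0:
        return False
    target = list(fragment)
    return any(list(session[i:i + m]) == target
               for i in range(len(session) - m + 1))


def unit_matches(session: Sequence[str], unit: Unit) -> bool:
    """A unit matches when every one of its CS fragments is found in the session."""
    return len(unit) > 0 and all(contains_fragment(session, f) for f in unit)


def hcs_is_legal(session: Sequence[str], model: HCSModel) -> bool:
    """HCS rule: label the session legal if at least one unit matches."""
    return any(unit_matches(session, unit) for unit in model)


# Hypothetical two-unit model, for illustration only.
model: HCSModel = [
    [["cd", "ls"], ["vi"]],          # unit 1: sequence "cd ls" plus discrete "vi"
    [["make"], ["gdb", "quit"]],     # unit 2: discrete "make" plus sequence "gdb quit"
]
print(hcs_is_legal(["vi", "cd", "ls", "exit"], model))   # True: unit 1, any order
print(hcs_is_legal(["cd", "vi", "ls", "make"], model))   # False: no unit fully present
```

With one fragment per unit this reduces to the SCS rule, and with one command per fragment it checks only the co-occurrence of discrete commands, mirroring the two special cases described in the text.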
3.2 Definition

Judging the legality of a session is a binary classification problem. For the verification of T unlabeled sessions, let TP be the number of correct acceptances, TN the number of correct alarms, FP the number of false acceptances, and FN the number of false alarms (T = TP + TN + FP + FN). The error rates are then given by

$$\mathrm{FRR} = \frac{FN}{TP + FN} \tag{1}$$

$$\mathrm{FAR} = \frac{FP}{TN + FP} \tag{2}$$

where FRR is the false rejection rate (incorrect classification of normal data as hostile) and FAR is the false acceptance rate (incorrect classification of hostile data as normal). The quality of a detector can then be judged in terms of its ability to minimize either FRR or FAR. In practice, an application often weights one of the two more heavily than the other. The overall quality of the method can therefore be evaluated by a cost function:

$$\mathrm{Cost} = \alpha \times \mathrm{FRR} + \beta \times \mathrm{FAR} \tag{3}$$

As the relative cost of a false alarm and a missed alarm varies with the application, no single choice of α and β is optimal in all cases. By convention, α and β are both set to 1, giving

$$\mathrm{Cost} = \mathrm{FRR} + \mathrm{FAR} \tag{4}$$
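As a small worked illustration of Eqs. (1) to (4), the following Python sketch computes FRR, FAR and Cost from the four counts. The class name `Counts` and the example numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # legal sessions correctly accepted
    tn: int  # illegal sessions correctly flagged (correct alarms)
    fp: int  # illegal sessions wrongly accepted (false acceptances)
    fn: int  # legal sessions wrongly flagged (false alarms)

    def frr(self) -> float:
        """Eq. (1): share of legal sessions that were rejected."""
        return self.fn / (self.tp + self.fn)

    def far(self) -> float:
        """Eq. (2): share of illegal sessions that were accepted."""
        return self.fp / (self.tn + self.fp)

    def cost(self, alpha: float = 1.0, beta: float = 1.0) -> float:
        """Eq. (3); with the default alpha = beta = 1 this is Eq. (4)."""
        return alpha * self.frr() + beta * self.far()


# Hypothetical counts: 100 legal and 100 illegal sessions.
c = Counts(tp=80, tn=95, fp=5, fn=20)
print(f"FRR={c.frr():.2f} FAR={c.far():.2f} Cost={c.cost():.2f}")
```

Note that the sketch does not guard against an empty class (for example, no legal sessions), in which case the corresponding rate is undefined.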
3.3 Machine Learning of HCS by GA

Training a model from historical user data is a search problem: the goal is to find behavioral characteristics that are specific to a particular user and distinguish that user from others. Searching for such complex command patterns of the HCS model directly in the large data space is infeasible. A GA is a relatively efficient approach for searching a large data space. In the GA learning stage, the optimization target is to find the model that occurs frequently in the target user’s normal data set {St} and seldom in the data of other users {Sf}. A GA with two-stage crossover is employed; its encoding and processing are described below.

3.3.1 GA Encoding

Training of the HCS model is performed to find the optimal combinations of commands. A command table of frequently appearing commands is constructed first, and each command is indexed by a number. The search is conducted over frequently appearing commands rather than all commands to improve search efficiency. Each chromosome in the GA has a structure similar to that of the HCS model (Fig. 2), so that each HCS model is encoded as a chromosome. Rather than binary encoding, each gene in a chromosome is encoded as an index into the command table. In the decoding stage, each chromosome is decoded into a candidate HCS model containing multi-command combinations; one such solution is shown in Fig. 3. The numbers of units, CS fragments and commands are all variable, depending on the initialization and the evolutionary operations of the GA.

[Figure 5: flowchart. Start → Initialize → Fitness calculation → Selection → CS-level crossover → Command-level crossover → Mutation → Gen > Max? (N: continue with the next generation; Y: End)]

Fig. 5. GA processing
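The index-based encoding described in Section 3.3.1 might look as follows in Python. This is a sketch under assumptions: the text does not say how "frequently appearing" is decided, so the cutoff `top_k` is a hypothetical parameter, and the chromosome is represented as nested lists (units, then CS fragments, then genes) to mirror Fig. 2.

```python
from collections import Counter
from typing import List, Sequence

Chromosome = List[List[List[int]]]   # units -> CS fragments -> command indices
Model = List[List[List[str]]]        # units -> CS fragments -> command names


def build_command_table(sessions: Sequence[Sequence[str]], top_k: int) -> List[str]:
    """Index the `top_k` most frequent commands; position in the list is the gene value."""
    counts = Counter(cmd for session in sessions for cmd in session)
    return [cmd for cmd, _ in counts.most_common(top_k)]


def decode(chromosome: Chromosome, table: List[str]) -> Model:
    """Turn a chromosome of table indices into a readable HCS model."""
    return [[[table[g] for g in frag] for frag in unit] for unit in chromosome]


def encode(model: Model, table: List[str]) -> Chromosome:
    """Inverse of `decode`; every command must already be in the table."""
    index = {cmd: i for i, cmd in enumerate(table)}
    return [[[index[c] for c in frag] for frag in unit] for unit in model]


# Hypothetical training sessions, for illustration only.
sessions = [["ls", "cd", "ls"], ["vi", "ls", "cd"]]
table = build_command_table(sessions, top_k=2)
chromosome: Chromosome = [[[0, 1], [0]]]     # one unit with two CS fragments
print(table)
print(decode(chromosome, table))
```

Restricting genes to table indices keeps every candidate model within the set of frequent commands, which is the efficiency argument made in the text.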
3.3.2 GA Processing

GA processing typically involves an initialization stage and an evolution stage, the latter comprising fitness calculation, selection, crossover and mutation, as shown in Fig. 5. When a population is initialized, the numbers of units, CS fragments and commands are all set randomly. Here, the CS fragments are initialized with randomly selected CS fragments extracted from the user’s sessions. The optimization target of the GA is to minimize the cost function (Eq. (3) or (4)). The fitness function is therefore defined as

$$\mathrm{Fitness} = 2 - (\alpha \times \mathrm{FRR} + \beta \times \mathrm{FAR}) \tag{5}$$

where α + β = 2, and α and β may vary according to the application. By convention, α and β are both set to 1 in this paper, leading to the relation

$$\mathrm{Fitness} = 2 - (\mathrm{FRR} + \mathrm{FAR}) \tag{6}$$
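The fitness of Eqs. (5) and (6) can be evaluated for a candidate model as sketched below. To keep the example independent of any particular matching code, the model is passed in as a predicate `accepts(session)`; this interface, the function name `fitness`, and the toy data are assumptions made for illustration.

```python
from typing import Callable, Iterable, Sequence

Session = Sequence[str]


def fitness(accepts: Callable[[Session], bool],
            legal_sessions: Iterable[Session],
            illegal_sessions: Iterable[Session],
            alpha: float = 1.0,
            beta: float = 1.0) -> float:
    """Eq. (5): 2 - (alpha * FRR + beta * FAR), measured on labeled sessions."""
    legal = list(legal_sessions)       # sessions of the target user, {St}
    illegal = list(illegal_sessions)   # sessions of other users, {Sf}
    if not legal or not illegal:
        raise ValueError("both session sets must be non-empty")
    frr = sum(1 for s in legal if not accepts(s)) / len(legal)
    far = sum(1 for s in illegal if accepts(s)) / len(illegal)
    return 2.0 - (alpha * frr + beta * far)


# Toy stand-in for a trained model: accept any session that contains "vi".
def toy_model(session: Session) -> bool:
    return "vi" in session


legal = [["vi", "ls"], ["ls"], ["vi"], ["cd", "vi"]]     # one of four rejected
illegal = [["ls"], ["vi"], ["cd"], ["make"]]             # one of four accepted
print(fitness(toy_model, legal, illegal))
```

Because FRR and FAR both lie in [0, 1] and α + β = 2, the fitness stays between 0 and 2, so it can be used directly as a non-negative weight in proportional selection.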
Selection follows a proportional selection strategy based on the fitness of each individual. Crossover, a key process of evolution in a GA, is performed in a special way to account for the variable numbers of units, CS fragments and commands in a chromosome. A two-stage crossover is employed: CS-level crossover and command-level crossover, where the former provides stability and ensures evolution of the population, and the latter allows the number of commands to evolve. In CS-level crossover, a randomly chosen point is assigned as the crossover point for a pair of mated individuals. In the example shown in Fig. 6, the crossover point is 4. For command-level crossover, a pair of CS fragments is chosen randomly, with a given probability (set to 0.1 here on the basis of experience), from a pair of mated individuals. Crossover points are then selected randomly and separately for the two CS fragments. In the example shown in Fig. 7, the crossover point of CS33 is 2 and that of CS33’ is 3. The result of the crossover operation is shown in Fig. 8. Mutation is realized by randomly choosing one gene in a chromosome with a given probability and setting it to a randomly selected index from the command table.

[Figure 6: two mated individuals, CS11 CS12 CS13 CS21 CS22 CS23 … and CS11’ CS12’ CS13’ CS21’ CS22’ CS23’ …, exchange all CS fragments after the fourth one.]

Fig. 6. CS-level crossover of two individuals at point 4

CS33 : C1 C2 | C3 C4
CS33’ : C1’ C2’ C3’ | C4’ C5

Fig. 7. Selection of two CS fragments for command-level crossover

CS33 : C1 C2 C4’ C5
CS33’ : C1’ C2’ C3’ C3 C4

Fig. 8. Result of command-level crossover operation in Fig. 7
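The selection, crossover and mutation operators described above can be sketched in Python as follows. Several details are assumptions rather than the authors' implementation: an individual is treated as a flat list of CS fragments for CS-level crossover (the text does not say how unit boundaries are handled), mutation picks a fragment first and then a gene within it, and all function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_parents(population: Sequence[T], fitnesses: Sequence[float],
                   k: int, rng: random.Random) -> List[T]:
    """Proportional (roulette-wheel) selection; at least one fitness must be positive."""
    return rng.choices(population, weights=fitnesses, k=k)


def cs_level_crossover(a: List[T], b: List[T], point: int) -> Tuple[List[T], List[T]]:
    """Swap all CS fragments after position `point` between two individuals."""
    return a[:point] + b[point:], b[:point] + a[point:]


def command_level_crossover(f: List[T], g: List[T],
                            cut_f: int, cut_g: int) -> Tuple[List[T], List[T]]:
    """Swap fragment tails using a separate cut point for each fragment,
    so the number of commands in each fragment can change."""
    return f[:cut_f] + g[cut_g:], g[:cut_g] + f[cut_f:]


def mutate(fragments: List[List[int]], table_size: int,
           p: float, rng: random.Random) -> List[List[int]]:
    """With probability `p`, overwrite one gene with a random command-table index."""
    out = [list(frag) for frag in fragments]
    if out and rng.random() < p:
        frag = out[rng.randrange(len(out))]
        if frag:
            frag[rng.randrange(len(frag))] = rng.randrange(table_size)
    return out


# Reproduce the examples of Fig. 6 and Figs. 7-8 with symbolic labels.
a = ["CS11", "CS12", "CS13", "CS21", "CS22", "CS23"]
b = ["CS11'", "CS12'", "CS13'", "CS21'", "CS22'", "CS23'"]
print(cs_level_crossover(a, b, 4))

cs33 = ["C1", "C2", "C3", "C4"]
cs33p = ["C1'", "C2'", "C3'", "C4'", "C5"]
print(command_level_crossover(cs33, cs33p, 2, 3))
```

Using different cut points for the two fragments is what lets fragment length evolve: in the Fig. 7 example both fragments change their command content while the total number of commands is preserved across the pair.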
4 Experiments

The HCS model method was evaluated on the same “session data” as the SCS model, where a session is regarded as the unit of analysis. In addition, to compare the HCS model with previous methods, experiments were also conducted on the common data set provided by Schonlau et al. [6]. However, the Schonlau data is simply a collection of 15000 commands for each user, without labels indicating the session to which each command belongs. Thus, to apply the HCS model to the Schonlau data, the command set of each user is divided manually into 150 sessions of 100 commands each.

4.1 The Session Data

The session data set consists of historical session profile data for 13 users collected over 3 months. The users were graduate computer science students who used the computer system for programming, document editing, email, and other regular activities. Only sessions containing more than 50 commands were recorded, and command arguments were excluded. A total of 1515 sessions were collected, comprising 83694 commands. For seven users, 529 sessions were used as the training data, and the remaining 521 sessions were used as the testing data. The 465 sessions of the other six users were used as the independent testing data [7].

The results are listed in Tables 1–3. The FRR and FAR of the HCS model for the training, testing and independent data show a 10% improvement over those achieved by the SCS method [7]. The average FRR for the testing data is higher than the average FAR for the testing and independent data, which shows that the HCS model is relatively powerful at detecting anomalies but suffers from an elevated FRR. The average FRR for the testing data (23.0%) is also much higher than that for the training data (9.2%), which shows some degradation when the trained model is applied to the test data. The FRR value for user 5 is 14.7% for the training data and 41.5% for the testing data, indicating that certain users, such as user 5, are difficult to describe uniquely by the HCS model.

Table 1. FRR and FAR for the training data (%)

Subject    FRR    FAR
User 1     3.2    6.5
User 2    10.4    0.5
User 3    14.3    3.6
User 4    16.9   19.0
User 5    14.7    0.0
User 6     1.2    1.1
User 7     3.8    1.1
Average    9.2    4.5

Table 2. FRR and FAR for the testing data (%)

Subject    FRR    FAR
User 1     5.7    6.5
User 2    26.0    0.6
User 3    31.6    3.6
User 4    19.7   25.7
User 5    41.5    3.2
User 6    18.6    3.6
User 7    17.9    1.0
Average   23.0    6.3

Table 3. FAR for the independent data (%)

Subject    FAR
User 8     6.9
User 9     0.0
User 10   19.3
User 11   12.4
User 12    0.0
User 13   17.2
Average    9.3

4.2 The Schonlau Data

Schonlau et al. collected command line data for 50 users, with 15000 commands in the set (file) for each user. The first 5000 commands do not contain any masqueraders and are intended as the training data, and the remaining 10000 commands (100 blocks of 100 commands) are seeded with masquerading users (i.e. with data of another user not among the 50 users). A masquerade starts with a probability of 1% in any given block after the initial 5000 commands, and masquerades have an 80% chance of continuing for two consecutive blocks. Approximately 5% of the testing data are associated with masquerades [6].

The time taken to collect the data in the Schonlau set differs for each user. To adopt the notion of a session, the 10000 testing commands of each user are divided manually into 100 sessions of 100 commands. Although this causes some degradation in the performance of the HCS method, which is built on the notion of a session, the results are still comparable with those of other methods. The first 5000 commands of each user are divided into 50 sessions as the training data. The 50 sessions of the current user are used as the legal training data, and 1000 sessions of 20 other randomly selected users are used as the illegal training data.

Experimental results for the previous methods and the HCS method on the Schonlau data are shown in Table 4, where Cost is calculated according to Eq. (4). The HCS method achieves the lowest Cost of all the methods, although with a relatively high FRR of 33.9%.
Table 4. Results based on the Schonlau data (%)

Method               FAR    FRR    Cost
1-step Markov       30.7    6.7    37.4
Hybrid Markov       50.7    3.2    53.9
IPAM                58.6    2.7    61.3
Uniqueness          60.6    1.4    62.0
Sequence Matching   63.2    3.7    66.9
Compression         65.8    5.0    70.8
HCS                  1.4   33.9    35.3
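The manual session division applied to the Schonlau data in Section 4.2 amounts to cutting each user's command stream into fixed-size blocks. A minimal Python sketch, assuming the commands of one user are already loaded into a list, is given below; the function name `split_into_sessions` and the synthetic command stream are placeholders, not part of the data set.

```python
from typing import List, Sequence


def split_into_sessions(commands: Sequence[str], size: int = 100) -> List[List[str]]:
    """Cut a command stream into consecutive pseudo-sessions of `size` commands."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(commands[i:i + size]) for i in range(0, len(commands), size)]


# Synthetic stand-in for one user's 15000-command file.
commands = [f"cmd{i % 7}" for i in range(15000)]
sessions = split_into_sessions(commands)

train = sessions[:50]    # first 5000 commands: masquerade-free training data
test = sessions[50:]     # remaining 10000 commands: 100 test blocks
print(len(sessions), len(train), len(test))  # 150 50 100
```

Because the cut points ignore real login and logout boundaries, a command pattern that spans two blocks is split, which is one way to see why the session-based HCS method loses some performance on this data.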
5 Discussion

The HCS model is an extension of the SCS model. In addition to the sequential characteristic captured by the SCS model, it can also describe the co-occurrence of several CS fragments in one session. Because searching for such complex command patterns directly in the large data space is infeasible, a GA, which is a relatively efficient approach for searching a large data space, is employed to learn the HCS model. As a result, the HCS model method exhibited a 10% improvement in anomaly detection over the SCS model method.

Unlike previous methods, which take all commands into account statistically, the HCS model method depends only on the use of typical command combinations specific to individual users. The anomaly detection process of the HCS model is therefore more direct and interpretable than that of statistical methods. Additionally, because anomaly detection by the HCS method requires only a matching operation between the input commands and the command patterns of the model, it requires less computation than other methods.

The performance of the HCS method on the Schonlau data is also comparable to that of the other six methods summarized by Schonlau et al. However, the performance of the HCS method is somewhat lower on the Schonlau data than on the session data. This is partly due to the manual division of the data set into sessions, which destroys the structure of the data set. The Schonlau data was collected without consideration of sessions or periods. A long period of data collection may also cause a substantial shift in user behavior, reducing the performance of the HCS method. To adapt to variation in user behavior over time, the HCS model should be retrained periodically to maintain its effectiveness.

In both experiments, the FRR is much higher than the FAR. The reason is that in the HCS model only typical command combinations used by the current user are searched for and employed to verify the normal behavior of this user. Because these typical characteristics rarely appear in anomalous sessions entered by other users, the method is highly effective at anomaly detection. At the same time, because only a limited number of typical characteristics are extracted to describe the behavior of the current user, normal sessions that do not contain these typical usages tend to be judged as abnormal, which leads to a high false alarm rate. This is why the FRR is difficult to decrease even with different values of α and β in Eq. (5).

The starting point of the HCS method differs from that of the statistical methods. In the Uniqueness method, for example, commands not seen before or used infrequently are extracted and used to label anomalous behavior. The HCS method, in contrast, extracts command patterns that are used frequently by the current user and seldom by others in order to identify normal behavior. The two methods are therefore complementary in how they use characteristics of user behavior, and combining them is a promising direction for future research.
6 Conclusions

The hybrid command sequence (HCS) model was proposed as the basis for a new intrusion detection method. The HCS model is constructed by using a GA to extract characteristics of user behavior (combinations of command sequence fragments) from historical session data. When the learned model is applied to unknown session data, the HCS model method achieves a false acceptance rate lower than 10% for anomaly detection and provides a 10% improvement over the previous SCS model. The HCS method also performs comparably to other statistical techniques, even though it approaches the problem from a different starting point. When combined with other preventative methods such as access control and authentication, the HCS method is expected to improve the security of a computer system significantly. The HCS model also has the advantage of low computation cost in anomaly detection, requiring only a matching operation between the limited set of command combinations in the HCS model and the observed session.

Because the present system still has a high false rejection rate (>20%), future work should aim to reduce it. This may include taking a wider range of characteristics of user behavior into account and combining the HCS method with other statistical techniques. Additionally, the method should be evaluated in other contexts, such as fraud detection and mobile-phone-based anomaly detection of user behavior.
References

1. Kim, H.J.: Biometrics, Is It a Viable Proposition for Identity Authentication and Access Control? Computers & Security, vol. 14 (1995), no. 3, pp. 205-214
2. Computer Security Institute: CSI/FBI Computer Crime and Security Survey Results Quantify Financial Losses. Computer Security Alert (1998), no. 181
3. Biermann, E., Colete, E., and Venter, L.M.: A Comparison of Intrusion Detection Systems. Computers & Security, vol. 20 (2001), no. 8, pp. 676-783
4. Axelsson, S.: Intrusion Detection Systems: A Survey and Taxonomy. Technical Report 9915, Dept. of Computer Engineering, Chalmers University of Technology, Sweden
5. Murali, A., Rao, M.: A Survey on Intrusion Detection Approaches. Proc. of ICICT 2005, pp. 233-240
6. Schonlau, M., DuMouchel, W., Ju, W., Karr, A., Theus, M., and Vardi, Y.: Computer Intrusion: Detecting Masquerades. Statistical Science, vol. 16 (2001), no. 1, pp. 58-74
7. Odaka, T., Shirai, H., and Ogura, H.: An Authentication Method Based on the Characteristics of the Command Sequence. IEICE, vol. J85-D-I (2002), no. 5, pp. 476-478