Weighted Intra-transactional Rule Mining for Database Intrusion Detection

Abhinav Srivastava¹, Shamik Sural¹, and A.K. Majumdar²

¹ School of Information Technology, ² Department of Computer Science & Engineering,
Indian Institute of Technology, Kharagpur, India
[email protected], [email protected], [email protected]
Abstract. Data mining is the non-trivial process of identifying novel, potentially useful and understandable patterns in data. With most organizations starting on-line operations, the threat of security breaches is increasing. Since a database stores a lot of valuable information, its security has become paramount. One mechanism to safeguard the information in these databases is to use an intrusion detection system (IDS). In every database, there are a few attributes or columns that are more important to track for malicious modifications as compared to the other attributes. In this paper, we propose an intrusion detection algorithm named weighted data dependency rule miner (WDDRM) for finding dependencies among the data items. The transactions that do not follow the extracted data dependency rules are marked as malicious. We show that WDDRM handles the modification of sensitive attributes quite accurately.

Keywords: Data dependency, Weighted rule mining, Read-Write sequence, Intrusion detection.
1 Introduction
Data mining has attracted a great deal of attention in the industry in recent years due to the wide availability of huge volumes of data and the imminent need for turning such data into useful information and knowledge [1]. Data mining generally refers to the process of extracting models or determining patterns from large observed data [2]. It involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, spatial data analysis, neural networks and others. Recently, researchers have started using data mining techniques in the emerging field of information and system security, especially in intrusion detection systems. An intrusion is defined as any set of actions that attempt to compromise the integrity, confidentiality or availability of a resource. Intrusion detection is the process of monitoring the events occurring in a computer system or network and analyzing them for signs of intrusions [3].

W.K. Ng, M. Kitsuregawa, and J. Li (Eds.): PAKDD 2006, LNAI 3918, pp. 611–620, 2006. © Springer-Verlag Berlin Heidelberg 2006
Intrusion detection has been discussed in public research since the beginning of the 1980s. In the last few years, it became an active area of research and commercial IDSs started emerging [4]. Several research efforts have also applied data mining to intrusion detection. Lee et al [5] have suggested data mining techniques for network intrusion detection. They consider several categories of data mining algorithms, namely, classification, link analysis and sequential analysis, along with their applicability in the field of intrusion detection. Barbara et al [6] have built a testbed using data mining techniques to detect network intrusions. Though intrusion detection is a well researched area, only a few studies have focused on database intrusion detection. Chung et al [7] use the idea of "working scope" to find the frequent itemsets referenced together and use this information for anomaly detection. Lee et al [8] propose an intrusion detection system for real-time databases using time signatures. Lee et al [9] have suggested a method for fingerprinting the access patterns of legitimate database transactions and using them to identify database intrusions. Barbara et al [10] use hidden Markov models (HMM) and time series to find malicious corruption of data. They use HMMs to build database behavioral models that capture the changing behavior over time, and use them to recognize malicious patterns. Zhong et al [11] have proposed an algorithm to mine user profiles based on the queries submitted by the user. Hu et al [12] have proposed an idea of determining dependencies among data items in databases; the transactions that do not follow the mined data dependencies are identified as malicious transactions. In this paper, we propose an algorithm for database intrusion detection using a data mining technique which takes the sensitivity of the attributes into consideration.
Sensitivity of an attribute signifies how important the attribute is to track against malicious modifications. Our approach mines dependencies among attributes in a database; the transactions that do not follow these dependencies are marked as malicious transactions. The rest of the paper is organized as follows. In Section 2, we describe the weighted data dependency rule mining (WDDRM) algorithm with an example. We present details of our experiments and provide results in Section 3. Finally, we conclude the paper with some discussions.
2 Weighted Data Dependency Rule Mining

2.1 Intuition
Databases are increasing in size in two ways: the number N of records, or objects, in the database, and the number d of fields, or attributes, per object. Databases containing of the order of N = 10⁹ objects are increasingly common nowadays, and the number d of attributes can easily be of the order of 10² or even 10³ in various applications [2]. With the number of attributes increasing at such a high rate, it is very difficult for administrators to keep track of whether attributes are accessed or modified correctly. By dividing the attributes into different categories based on their relative importance or sensitivity, it is comparatively
easier to track only those attributes whose unintended modification can have the largest impact on the application or the system. Practitioners as well as researchers have observed that an IDS can easily trigger thousands of alarms per day, a number of which are triggered incorrectly by benign events [13]. Categorization of attributes helps the administrator check only those alarms which are generated due to malicious modification of sensitive data, instead of checking all the attributes. Since the main objective of a database intrusion detection system is to minimize the loss suffered by the owner of the database, it is important to track highly sensitive attributes with greater accuracy. If sensitive attributes are to be tracked for malicious modifications, then we need to generate data dependency rules for these attributes. Unless there is a rule for an attribute, the attribute cannot be checked. If highly sensitive attributes are accessed less frequently, then there may not be any rule generated for them. The motivation for dividing attributes into different sensitivity groups and assigning weights to each group is to bring out the dependency rules for possibly less frequent but more important attributes. Once we have rules for these sensitive attributes, we can check them in each transaction, and if any transaction does not follow the mined rules, it is marked as malicious. We discuss the main components of the IDS in the following subsections.

2.2 Security Sensitive Sequence Mining
The problem of finding sequences among the attributes along with the operations {read, write} is similar to the problem of mining sequential patterns. Mining sequences from large sets of data is a well-known problem. Agrawal et al [14] have proposed an algorithm for finding sequential patterns from data. In that algorithm, all the data items are considered at the same level, without any weightage. We modify an existing sequential mining algorithm into a security sensitive sequential mining algorithm by introducing a weight for each attribute based on its sensitivity group. The higher the sensitivity of an attribute, the higher its weight. We have categorized the attributes into three sets: the High Sensitivity (HS) attribute set, the Medium Sensitivity (MS) attribute set and the Low Sensitivity (LS) attribute set. The sensitivity of an attribute depends on the particular database application. Also, from the point of view of integrity, modification of a sensitive attribute is more important than reading it. For the same attribute, say x, if x ∈ HS then W(xw) > W(xr), where W is a weight function, xw denotes writing or modifying attribute x and xr denotes reading attribute x. Given a schema, we categorize all the attributes into the above mentioned three sets based on their sensitivities and assign numerical weights to each set. Let w1, w2, w3 ∈ R, where R is the set of real numbers, be the weights of HS, MS and LS, respectively, with w3 ≤ w2 ≤ w1. Let d1, d2, d3 ∈ R be the additional weights of the write operations for each category, such that d3 ≤ d2 ≤ d1. Let x ∈ HS be an attribute which is accessed in a read operation. Then the weight given to x is w1. If it is accessed in a write operation, the weight given to x is w1 + d1.
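As an illustration, the weighting scheme above can be sketched as follows. The group memberships and the numerical values (3, 2, 1 for HS, MS, LS and a write delta of 0.25) are the ones used later in the experiments; in general, they are application-specific choices made by the database administrator.

```python
# Sensitivity groups of the bank database attributes (illustrative values).
HS = {7, 8, 13}                   # high-sensitivity attributes
MS = {5, 16}                      # medium-sensitivity attributes
LS = {2, 4, 11, 12, 14, 15, 17}   # low-sensitivity attributes

WEIGHTS = {"HS": 3.0, "MS": 2.0, "LS": 1.0}          # w1 >= w2 >= w3
WRITE_DELTAS = {"HS": 0.25, "MS": 0.25, "LS": 0.25}  # d1 >= d2 >= d3


def group_of(attr):
    """Return the sensitivity group of an attribute."""
    if attr in HS:
        return "HS"
    if attr in MS:
        return "MS"
    return "LS"


def weight(attr, op):
    """Weight W of one access: reads get the group weight;
    writes get the group weight plus the write delta."""
    w = WEIGHTS[group_of(attr)]
    if op == "w":
        w += WRITE_DELTAS[group_of(attr)]
    return w
```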
TID  Attribute access sequence
 1   11r, 13w, 4r, 8r, 2r, 16r, 17r, 14r
 2   7r, 2r, 7r, 2r, 14r, 15w
 3   16r, 17r, 14r, 14r, 15w, 17w, 2r, 7w
 4   11r, 12w, 2r, 4w, 16r, 17r, 14r
 5   2r, 4w, 2r, 7w, 7r, 8r, 2r
 6   11r, 13w, 4r, 8r, 2r, 2r, 4w
 7   14r, 15w, 4r, 8r, 2r, 8r, 2r
 8   7r, 8r, 2r, 2r, 2r, 8w, 5w, 2r, 4w
 9   8r, 2r, 14r, 15w, 7r, 2r
10   14r, 15w, 16r, 17r, 14r, 14r, 15w, 17w

Fig. 1. Example transactions for the Sequence Mining Algorithm
Table Name    Column Names
Customer      Name, Customer_id, Address, Phone_no
Account       Account_id, Customer_id, Status, Open_dt, Close_dt, Balance
Account_type  Account_type, Max_tran_per_month, Description

Fig. 2. Bank database schema
For security sensitive sequence mining, we assign weights to each sequence based on the sensitivity groups of the attributes present in the sequence. The weight assigned to a sequence is the same as the weight of the most sensitive attribute present in that sequence; it also depends on the operation applied to the attributes. The weights assigned to the sequences are used in the second pruning step, which calculates the support of each sequence over the transactions. If the support value of a sequence is above the minimum support, the sequence is considered a frequent sequence. Let us assume that there is a sequence s with weight ws. Let N be the total number of transactions. If s is present in n transactions out of N, then the support of sequence s is:

    Support(s) = (n × ws) / N    (1)
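A minimal sketch of the weighted support computation of equation (1), assuming transactions and candidate sequences are represented as ordered lists of (attribute, operation) pairs and `weight` is a caller-supplied function returning the weight of one access:

```python
def weighted_support(seq, transactions, weight):
    """Weighted support of equation (1): (n * w_s) / N, where n is the
    number of transactions containing seq as an order-preserving
    subsequence and w_s is the weight of its most sensitive access."""
    def contains(trans, seq):
        # Order-preserving subsequence test: each lookup consumes the
        # iterator, so matches must appear in order.
        it = iter(trans)
        return all(item in it for item in seq)

    n = sum(1 for t in transactions if contains(t, seq))
    w_s = max(weight(a, op) for a, op in seq)
    return n * w_s / len(transactions)
```

With a hypothetical weight function assigning 3.0 to attribute 7 and 1.0 to the rest, the sequence <7r, 2r> occurring in 2 of 4 transactions gets support (2 × 3.0) / 4 = 1.5.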
The effect of this weighted approach on the sequence mining algorithm is significant. With this approach, sequences containing highly sensitive attributes that are accessed less frequently in the transactions can still become frequent sequences, because each such sequence's count is enhanced by multiplying it with its weight; the weighted support can then exceed the minimum support. Consider the example transactions shown in Figure 1. There are 10 transactions, generated from the bank database schema shown in Figure 2 with attributes encoded as integers. In Figure 3, the weight of each attribute is shown. These attributes are categorized into the HS, MS and LS groups depending upon their sensitivity. First, these transactions are given as input to a sequential pattern mining algorithm [14] for extracting the sequences using the normal
Sensitivity Group  Attributes                Weight  Write Weight  Normalized Weight
HS                 7, 8, 13                  3       .25           .48
MS                 5, 16                     2       .25           .33
LS                 2, 4, 11, 12, 14, 15, 17  1       .25           .19

Fig. 3. Weight table for the attributes used in the bank database
Sequences mined using the Non-weighted Method:
<4r, 8r, 2r>, <14r, 15w, 2r>, <2r, 4w>, <2r, 7r>, <2r, 14r>, <16r, 17r, 14r>, <7r, 2r>, <11r, 2r>

Sequences mined using the Weighted Method:
<16r, 17r, 14r, 15w, 17w, 7w>, <2r, 4w, 7w, 7r, 8r>, <7r, 8r, 8w, 2r, 4w>, <7r, 8r, 2r, 8w, 4w>, <8r, 2r, 14r, 15w, 7r>, <8r, 14r, 15w, 7r, 2r>, <11r, 13w, 8r, 2r, 4w>, <11r, 13w, 8r, 2r, 16r>, <13w, 4r, 8r, 2r>, <7w, 7r, 8r, 2r>, <2r, 7r, 14r, 15w>, <7r, 2r, 14r, 15w>, <14r, 15w, 2r, 7w>, <14r, 15w, 2r, 8r>, <14r, 15w, 8r, 2r>, <4r, 2r, 8r>, <13w, 8r, 16r, 17r, 14r>, <13w, 8r, 2r, 16r, 14r>

Fig. 4. Mined sequences using a Minimum Support value of 25%
definition of support. These transactions and weights are also given as input to the proposed weighted sequential mining algorithm, where the support values of the sequences are calculated using equation (1). In both cases, the minimum support is set to 25%. The sequences generated by the two algorithms are shown in Figure 4.

2.3 Read-Write Sequence Generation
In this subsection, we first define some of the terms used in the rest of the paper.

Definition 1. A read sequence (ReadSeq) of attribute aj is a sequence of the form <a1r, a2r, a3r, ..., akr, ajw>, i.e., the sequence of attributes a1 to ak that are read before attribute aj is written. All such sequences form the read sequence set, denoted by ReadSeqSet.

Definition 2. A write sequence (WriteSeq) of attribute aj is a sequence of the form <ajw, a1w, a2w, a3w, ..., akw>, i.e., the sequence of attributes a1 to ak that are written after attribute aj is written. All such sequences form the write sequence set, denoted by WriteSeqSet.

The sequences shown in Figure 4 are next used to generate read and write sequences. As per the definitions, a ReadSeq and a WriteSeq must contain at least one write operation, so sequences that do not have any attribute with a write operation are not used for read and write sequence generation. A sequence that contains a single attribute does not contribute to the generation of dependency rules and is ignored as well. The read-write sequences are generated as follows.
Non-weighted Method
  Read Set:  <14r, 15w>, <2r, 4w>
  Write Set: (none)

Weighted Method
  Read Set:  <16r, 17r, 14r, 15w>, <16r, 17r, 14r, 17w>, <16r, 17r, 14r, 7w>, <2r, 4w>, <2r, 7w>, <7r, 8r, 8w>, <7r, 8r, 2r, 4w>, <8r, 2r, 14r, 15w>, <8r, 14r, 15w>, <11r, 13w>, <11r, 8r, 2r, 4w>, <2r, 7r, 14r, 15w>, <7r, 2r, 14r, 15w>, <14r, 15w>, <14r, 2r, 7w>, <7r, 8r, 2r, 8w>
  Write Set: <15w, 17w, 7w>, <8w, 4w>, <15w, 7w>, <4w, 7w>, <13w, 4w>

Fig. 5. Read Sequences and Write Sequences
For each write operation ajw in a sequence, add <a1r, a2r, ..., akr, ajw> to ReadSeqSet, where a1r, a2r, ..., akr are the read operations on attributes a1 to ak before the write operation on attribute aj. To generate write sequences, for each write operation ajw in a sequence, add <ajw, a1w, a2w, ..., akw> to WriteSeqSet, where a1w, a2w, ..., akw are the write operations on attributes a1 to ak after the write operation on attribute aj. The read-write sequences generated from the mined sequences of Figure 4 are shown in Figure 5.
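The generation procedure above can be sketched as follows (an illustrative implementation, assuming a mined sequence is given as a list of (attribute, operation) pairs with operation 'r' or 'w'):

```python
def generate_rw_sequences(seq):
    """Build the read sequences and write sequences of one mined
    sequence, following Definitions 1 and 2."""
    read_seqs, write_seqs = [], []
    for i, (attr, op) in enumerate(seq):
        if op != "w":
            continue
        # ReadSeq: all reads before this write, followed by the write.
        reads = [(a, o) for a, o in seq[:i] if o == "r"]
        if reads:
            read_seqs.append(reads + [(attr, "w")])
        # WriteSeq: this write, followed by all writes after it.
        writes = [(a, o) for a, o in seq[i + 1:] if o == "w"]
        if writes:
            write_seqs.append([(attr, "w")] + writes)
    return read_seqs, write_seqs
```

For example, the mined sequence <16r, 17r, 14r, 15w, 17w> yields the read sequences <16r, 17r, 14r, 15w> and <16r, 17r, 14r, 17w> and the write sequence <15w, 17w>.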
2.4 Weighted Data Dependency Rule Generation
There are two types of data dependency rules, namely, read rules and write rules. A read rule of the form ajw → a1r, a2r, ..., akr implies that attributes a1 to ak are read in order to write attribute aj. A write rule of the form ajw → a1w, a2w, ..., akw implies that after writing attribute aj, attributes a1 to ak are modified. These rules are generated from the read and write sequences. Weighted data dependency rule generation uses weighted confidence. The confidence of a read or write rule is calculated as follows. Let R be a read rule of the form ajw → a1r, a2r, ..., akr, generated from the read sequence rs ∈ ReadSeqSet. Let Count(ajw) and Count(rs) be the total counts of the attribute ajw and of rs over all the transactions. The weighted confidence of the rule R is defined as:

    Confidence(CR) = Count(rs) / Count(ajw)    (2)

Count(ajw) is defined as:

    Count(ajw) = Σ over all transactions T with ajw ∈ T and rs ∉ T of (w3 + d3)
               + Σ over all transactions T with rs ∈ T of max(W(rs))    (3)

Count(rs) is defined as:

    Count(rs) = Σ over all transactions T with rs ∈ T of max(W(rs))    (4)
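A sketch of the weighted confidence computation of equations (2)-(4) for a read rule, under the same (attribute, operation) pair representation; `w3_plus_d3` is the base weight w3 + d3 contributed by transactions that contain the target write but not the full read sequence:

```python
def weighted_confidence(rs, target_write, transactions, weight, w3_plus_d3):
    """Confidence of the read rule a_jw -> reads(rs).
    rs is the read sequence (ending with the write target_write);
    weight(attr, op) returns the weight of one access."""
    def contains(trans, seq):
        # Order-preserving subsequence test.
        it = iter(trans)
        return all(item in it for item in seq)

    w_rs = max(weight(a, op) for a, op in rs)          # max(W(rs))
    # Count(rs), equation (4): transactions containing rs contribute w_rs.
    count_rs = sum(w_rs for t in transactions if contains(t, rs))
    # Count(a_jw), equation (3): transactions containing the write but not
    # rs contribute (w3 + d3); those containing rs contribute w_rs.
    count_ajw = sum(
        w_rs if contains(t, rs) else w3_plus_d3
        for t in transactions
        if target_write in t
    )
    return count_rs / count_ajw if count_ajw else 0.0
```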
Non-weighted Method
  Read Rules:  <15w → 14r>, <4w → 2r>
  Write Rules: (none)

Weighted Method
  Read Rules:  <17w → 16r, 17r, 14r>, <4w → 2r>, <8w → 7r, 8r>, <13w → 11r>, <8w → 7r, 8r, 2r>, <15w → 14r>, <7w → 2r>
  Write Rules: <8w → 4w>

Fig. 6. Read and Write Dependency Rules with Confidence value 70%

ALGORITHM WDDRM:
  Initialize two sets ReadSeqSet = {Φ}, WriteSeqSet = {Φ} for storing read and write sequences, respectively.
  Initialize two sets ReadRuleSet = {Φ}, WriteRuleSet = {Φ} for storing read and write rules, respectively.
  Create the set of weighted data dependency rules WDDR = {ReadRuleSet, WriteRuleSet}.
  Execute the sequential mining algorithm with minimum support minSup;
    at each step, calculate the support of the sequences using equation (1).
  For each sequential pattern Pi that contains at least one write operation:
    For each write operation ajw in Pi:
      IF (a1r, a2r, ..., akr precede ajw in Pi and {a1r, ..., akr} ≠ ∅),
        where a1r to akr are all the read operations on attributes a1 to ak before ajw:
          Generate read sequence <a1r, a2r, ..., akr, ajw> and add it to ReadSeqSet
      IF (a1w, a2w, ..., akw follow ajw in Pi and {a1w, ..., akw} ≠ ∅),
        where a1w to akw are all the write operations on attributes a1 to ak after ajw:
          Generate write sequence <ajw, a1w, a2w, ..., akw> and add it to WriteSeqSet
  For each read sequence rs of the form <a1r, a2r, ..., akr, ajw> ∈ ReadSeqSet:
    Construct read rule rr of the form ajw → a1r, a2r, ..., akr
    Calculate the confidence C of rr using equation (2)
    IF (C ≥ minConf) add rr to ReadRuleSet
  For each write sequence ws of the form <ajw, a1w, a2w, ..., akw> ∈ WriteSeqSet:
    Construct write rule wr of the form ajw → a1w, a2w, ..., akw
    Calculate the confidence C of wr using equation (2)
    IF (C ≥ minConf) add wr to WriteRuleSet
  Return WDDR = {ReadRuleSet, WriteRuleSet}

Fig. 7. Weighted Data Dependency Rule Miner Algorithm
The rules generated from the read-write sequences are shown in Figure 6. After the rules are generated, they are used to verify whether the incoming transactions are malicious or not. If an incoming transaction has a write operation, it is checked whether there are any corresponding read or write rules. If
the write operation violates these rules, it is marked as malicious and an alarm is generated. Otherwise, normal operation proceeds. The complete algorithm for weighted data dependency rule mining is shown in Figure 7.
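The detection step described above can be sketched as follows (illustrative; `read_rules` and `write_rules` are assumed to map each written attribute to the lists of accesses required before and after the write, respectively):

```python
def is_malicious(transaction, read_rules, write_rules):
    """Flag a transaction if any of its writes violates a mined rule."""
    def contains(trans, seq):
        # Order-preserving subsequence test.
        it = iter(trans)
        return all(item in it for item in seq)

    for i, (attr, op) in enumerate(transaction):
        if op != "w":
            continue
        # Read rule: the required reads must precede this write.
        for required in read_rules.get(attr, []):
            if not contains(transaction[:i], required):
                return True
        # Write rule: the required writes must follow this write.
        for required in write_rules.get(attr, []):
            if not contains(transaction[i + 1:], required):
                return True
    return False
```

For instance, with the read rule <15w → 14r>, a transaction that writes attribute 15 without first reading attribute 14 is flagged as malicious.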
3 Experimental Results
We have carried out several experiments to show the efficacy of the proposed method. The system has been developed using Java as the front end and MS SQL 2000 Server as the back-end database. We have used the bank database of Figure 2 for our experiments. Volunteers from our institute were invited to interact with the system and make malicious transactions. This was beneficial because the interaction by the volunteers helped us capture realistic data of the kind expected in a normal application. The volunteers were provided the schema and the information on sensitive attributes. They tried novel ways of committing malicious transactions, since it was announced that scores would be awarded based on the total weight of the attributes they could modify. In the learning phase, we generated a number of sets of training data, each of size 10,000 transactions, with different distributions. In one experiment, we used the following distribution: Insert/Update = 90%, Select = 10%. We also chose the number of transactions containing the most sensitive attributes in the training data as a parameter, using 20% of the transactions with highly sensitive attributes. All these parameters are varied across experiments. The minimum support and confidence values are 0.25 and 0.70, respectively. Once the transactions were generated, we ran the non-weighted algorithm to generate the data dependency rules. After that, we used the WDDRM algorithm on the training data with weight ratios 1:2:3 for the LS, MS and HS groups, respectively. In the experiments, we have taken the additional weight of a write operation as 0.25 for all three categories. We used the weights of the different groups as another parameter, and dependency rules for each set of weights were generated. In order to study relative performance, we have compared our work with the non-weighted dependency rule mining approach, which we refer to as DDRM.
Figure 8(a) shows a comparison of DDRM and WDDRM. The percentage of malicious transactions detected is plotted against the sensitivity ratio. When the weights of all three groups are equal, WDDRM reduces to DDRM. However, when distinct weights are assigned to the three groups, WDDRM detects a higher percentage of malicious transactions; DDRM cannot be effectively applied in this situation. In Figure 8(b), comparative performance is shown for each sensitivity group. It is seen that WDDRM outperforms DDRM for the more sensitive attributes. Figure 9(a) shows the effect of the number of write operations on the performance of the intrusion detection system. As the number of write operations increases, the effectiveness of the system also increases. This is because write operations are required to generate the data dependency rules; if there are more write operations on attributes in the transactions, more rules are generated.
Fig. 8. (a) Comparison of DDRM and WDDRM with different sensitivity ratios; (b) Comparison of DDRM and WDDRM for different sensitivity groups

Fig. 9. (a) Performance of the WDDRM algorithm with number of write operations; (b) Comparison of DDRM and WDDRM in terms of loss suffered by the IDS
Hence, the detection rate increases if more insert/update statements are present in the transactions. Figure 9(b) shows the loss suffered by the intrusion detection system, in weight units, using both approaches. The ratio of weights used for this experiment is 3:2:1 for HS, MS and LS, respectively, with a distribution of Insert/Update = 90% and Select = 10%. Loss is computed by adding the weights of all the attributes whose malicious modifications are not detected by the IDS. It is evident from the figure that WDDRM outperforms DDRM. This is because WDDRM tracks the sensitive attributes much better than DDRM, and hence the overall loss is minimized.
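The loss metric can be sketched as follows (a hypothetical helper; `weight` stands for the access-weight function of Section 2.2, and `missed_writes` for the attributes whose malicious modification the IDS failed to flag):

```python
def total_loss(missed_writes, weight):
    """Loss in weight units, as used for Figure 9(b): the sum of the
    weights of all attributes maliciously written but not detected."""
    return sum(weight(attr, "w") for attr in missed_writes)
```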
4 Conclusions and Discussions
In this paper, we have identified some of the limitations of the existing data mining based intrusion detection systems, in particular, their incapability in
treating database attributes at different levels of sensitivity. We proposed a novel weighted data dependency rule mining algorithm that considers the sensitivity of the attributes while mining the dependency rules. Experimental results show that our proposed algorithm performs better than some of the previous work done in this area. The sensitivity levels can be syntactically captured during data modeling through the E-R diagram notations.
Acknowledgements

This work is partially supported by a research grant from the Department of Information Technology, Ministry of Communication and Information Technology, Government of India, under Grant No. 12(34)/04-IRSD dated 07/12/2004.
References

1. J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers (2001).
2. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, The KDD Process for Extracting Useful Knowledge from Volumes of Data, Communications of the ACM, pages 27-34 (1996).
3. R. Bace, P. Mell, Intrusion Detection Systems, NIST Special Publication on Intrusion Detection Systems (2001).
4. E. Lundin, E. Jonsson, Survey of Intrusion Detection Research, Technical Report, Chalmers University of Technology (2002).
5. W. Lee, S.J. Stolfo, Data Mining Approaches for Intrusion Detection, Proceedings of the USENIX Security Symposium, pages 79-94 (1998).
6. D. Barbara, J. Couto, S. Jajodia, N. Wu, ADAM: A Testbed for Exploring the Use of Data Mining in Intrusion Detection, ACM SIGMOD, pages 15-24 (2001).
7. C.Y. Chung, M. Gertz, K. Levitt, DEMIDS: A Misuse Detection System for Database Systems, IFIP TC-11 WG 11.5 Working Conference on Integrity and Internal Control in Information Systems, pages 159-178 (1999).
8. V.C.S. Lee, J.A. Stankovic, S.H. Son, Intrusion Detection in Real-time Database Systems Via Time Signatures, Proceedings of the Real Time Technology and Applications Symposium, pages 124-133 (2000).
9. S.Y. Lee, W.L. Low, P.Y. Wong, Learning Fingerprints for a Database Intrusion Detection System, Proceedings of the European Symposium on Research in Computer Security, pages 264-280 (2002).
10. D. Barbara, R. Goel, S. Jajodia, Mining Malicious Data Corruption with Hidden Markov Models, IFIP WG 11.3 Working Conference on Data and Application Security, pages 175-189 (2002).
11. Y. Zhong, X. Qin, Research on Algorithm of User Query Frequent Itemsets Mining, Proceedings of Machine Learning and Cybernetics, pages 1671-1676 (2004).
12. Y. Hu, B. Panda, A Data Mining Approach for Database Intrusion Detection, Proceedings of the ACM Symposium on Applied Computing, pages 711-716 (2004).
13. K. Julisch, M. Dacier, Mining Intrusion Detection Alarms for Actionable Knowledge, Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 366-375 (2002).
14. R. Agrawal, R. Srikant, Mining Sequential Patterns, Proceedings of the International Conference on Data Engineering, pages 3-14 (1995).