Exploiting Available Domain Knowledge to Improve Mining Aviation Safety and Network Security Data

Zohreh Nazeri and Eric Bloedorn

The MITRE Corporation, 7515 Colshire Drive, McLean, Virginia 22102, U.S.A.
{[email protected], [email protected]}

Abstract. This paper discusses a method for incorporating available domain knowledge into data mining techniques in order to improve the interestingness of the discovered rules. Existing domain knowledge is represented by a simple grammar and is used within the algorithms to reduce the search space and generate more interesting results. We implemented the proposed approach in the A-Priori and C4.5 algorithms and applied them to data from the aviation safety and intrusion detection domains. Our experiments show promising results.

1 Introduction

Data mining algorithms generate many rules that are not interesting to the user. In an effort to reduce the generation of uninteresting rules, we have modified the A-Priori association rules algorithm [1] and the C4.5 decision tree algorithm [2] to allow direct use of domain knowledge within the algorithms. To evaluate the effect of our technique, for each algorithm we compare the results of applying the original and the modified algorithms to the same set of data.

2 Discussion

To reduce the number of uninteresting discovered rules, many researchers have experimented with different methods. Generally these methods promote or eliminate rules based on the frequency of the itemsets (e.g., the Lift measure [3], J-measures [4], and [5]) or based on some type of knowledge applied either before the mining (pre-processing) or after the mining (post-processing) (e.g., [6], [7], [8], [9], [10], [11], [12], and [13]). The pre-processing approaches limit the potential for discovering 'surprising' information in the data. The post-processing approaches, on the other hand, sacrifice processing speed, since many rules are generated and then pruned. The method described in this paper encodes the knowledge within the mining algorithm itself, in order to eliminate the parts of the search space that are uninteresting to the user or to emphasize the parts that are interesting.

2.1 Knowledge Representation

In our work we focus on fragmentary domain knowledge of two types: 1) Facts: knowledge that is known and accepted by all domain experts; and 2) Beliefs/Preferences: knowledge the expert has gained by experience; while such knowledge is not a fact, the expert believes it could be true, or wants to see the result of the overall analysis under this assumption. The facts and preferences are then used in one of two ways: 1) reduction of 'uninteresting' rules, or 2) encouragement of 'interesting' rules. We have developed the grammar shown in Figure 1 for representing the knowledge fragments.

Fig. 1. Grammar for representing knowledge fragments. (The nonterminal names did not survive extraction; the surviving elements show a weight expressed as rank or percent, a sign - or +, comma-separated attribute conditions, and the comparison operators =, >, >=, <, <=.)

... {if support >= minsup} keep the itemset; generate rules
Fig. 2. A-Priori algorithm incorporating available knowledge. (Only the final steps of the figure survived extraction.)
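To make the representation concrete, the following is a minimal sketch of a parser for knowledge fragments of the shape used later in Section 3.1 (e.g., -1:ANOMALY=TAKEOFF_ABORTED;). The function name and internal structure are our own assumptions, since the paper's grammar nonterminals are not recoverable; only the fragment syntax itself comes from the paper.

```python
import re

def parse_fragment(text):
    """Parse a fragment of the form '<weight>:ATTR<op>VAL,ATTR<op>VAL;'.

    A negative weight discourages (prunes) matching itemsets; a positive
    weight encourages them.  This is a sketch based on the example
    fragments in the paper, not the authors' actual parser.
    """
    body = text.strip().rstrip(";")
    weight_str, _, conds = body.partition(":")
    conditions = {}
    for cond in conds.split(","):
        # split on the first comparison operator; the capture group keeps it
        attr, op, value = re.split(r"(>=|<=|=|>|<)", cond.strip(), maxsplit=1)
        conditions[attr.strip()] = (op, value.strip())
    return float(weight_str), conditions

weight, conds = parse_fragment("-1:MISSION=PASSENGER,LIGHTING=DAYLIGHT;")
```

A fragment parsed this way carries everything the modified algorithms below need: a sign/weight telling them whether to prune or promote, and the attribute conditions to match against.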

We implemented the A-Priori algorithm [1] with some modifications. For the internal representation of the data we used the data structure described in [16]. In order for the algorithm to use the encoded knowledge, we modified it as indicated in Figure 2; modifications are shown in bold.
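The modification in Figure 2 can be sketched as follows: at each level of the candidate-generation loop, any candidate itemset that contains a negatively weighted knowledge fragment is dropped before its support is even counted, shrinking the search space. The function and data layout here are our own illustration, not the authors' implementation (which uses the itemset-counting structure of [16]).

```python
from itertools import combinations

def apriori_with_dk(transactions, neg_fragments, minsup):
    """Frequent-itemset mining with negative knowledge fragments.

    transactions:  list of sets of (attribute, value) pairs
    neg_fragments: list of frozensets of (attribute, value) pairs carrying
                   weight -1; any candidate containing one is pruned
    minsup:        minimum support as a fraction

    Sketch of the Figure 2 modification, not the paper's exact code.
    """
    def uninteresting(candidate):
        return any(frag <= candidate for frag in neg_fragments)

    def support(candidate):
        return sum(candidate <= t for t in transactions) / len(transactions)

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = []
    level = {c for c in items if not uninteresting(c) and support(c) >= minsup}
    k = 1
    while level:
        frequent.extend(level)
        k += 1
        # join step: merge frequent (k-1)-itemsets, then apply the DK pruning
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        level = {c for c in candidates
                 if not uninteresting(c) and support(c) >= minsup}
    return frequent

txns = [{("A", "1"), ("B", "2")}, {("A", "1"), ("B", "2")}, {("A", "1"), ("C", "3")}]
freq = apriori_with_dk(txns, [frozenset({("B", "2")})], 0.5)
```

Because the pruning happens before support counting, a fragment removes not just the matching itemset but every superset the join step would otherwise have generated from it.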

2.3 Decision Tree Algorithm

In the C4.5 algorithm, we modified the attribute selection method so that lower-scoring attributes can be chosen if the domain knowledge indicates such a preference. We modified the algorithm as indicated in Figure 3; modifications are in bold.

Given a set of example vectors E described by attributes A:
1. Find the 'best' attribute X given examples E:
   - Check the DK to see if it matches the rules and attributes being considered
   - Modify the scores of attributes based on matching DK and determine which one to use
2. Split the set of examples S into subsets S1..SN such that all examples in Si have X = vi
3. For each Si:
   - If all examples belong to the same class, build a leaf node and stop
   - Else, go to step 1 with the examples in Si
Fig. 3. Top-down induction (ID3, Quinlan); modifications in bold
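The score-modification step above can be sketched as an additive bonus on the information-gain score: a positive bonus lets a DK-preferred attribute win the split even when its raw gain is lower. The example representation (dicts with a "class" key) and the additive adjustment are our own assumptions; the paper only states that scores are modified based on matching DK.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(examples, attributes, preferences):
    """Pick the splitting attribute by information gain, adjusted by DK.

    preferences maps attribute -> additive bonus (positive to encourage,
    negative to discourage).  A sketch of the Figure 3 modification that
    lets a lower-scoring attribute be chosen when the DK prefers it.
    """
    base = entropy([e["class"] for e in examples])
    n = len(examples)

    def gain(attr):
        g = base
        for v in {e[attr] for e in examples}:
            subset = [e["class"] for e in examples if e[attr] == v]
            g -= len(subset) / n * entropy(subset)
        return g

    return max(attributes, key=lambda a: gain(a) + preferences.get(a, 0.0))

ex = [{"X": "a", "Y": "p", "class": "T"}, {"X": "a", "Y": "q", "class": "T"},
      {"X": "b", "Y": "p", "class": "F"}, {"X": "b", "Y": "q", "class": "F"}]
no_dk = best_attribute(ex, ["X", "Y"], {})          # gain alone decides
with_dk = best_attribute(ex, ["X", "Y"], {"Y": 2.0})  # DK bonus overrides
```

With no preferences the perfectly splitting attribute X wins; the bonus on Y flips the choice, which is exactly the behavior the modified selection step is meant to allow.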

We used the results of each iteration of running the algorithm as the domain knowledge for the next iteration. The discovered rules were used as knowledge fragments and fed back into the process to see which rules persist over a period of time. Section 3.2 shows the results.

3 Experimentation

To examine the effectiveness of the proposed approach, we selected two problems from two different domains where both data and domain knowledge were available to us. Each of these problems required a different data mining technique; consequently, evaluation of the results varies for each technique. The knowledge for both domains was obtained from domain experts at MITRE and the collaborating airlines.

3.1 Application of Modified A-Priori to the Aviation Safety Data

The aviation safety reports we used are collections of incident reports composed of structured fields as well as unstructured narratives. For this experiment, we used the structured fields only. We applied the proposed modified A-Priori algorithm (Figure 2, above) to over 4000 reports. Tables 1 and 2 compare the results quantitatively when using low and high supports, respectively. The use of knowledge reduced the number of uninteresting rules without losing any of the interesting rules. The improvement in the elimination of uninteresting rules, shown in Tables 1 and 2, is achieved by providing the following two knowledge fragments:

-1:ANOMALY=TAKEOFF_ABORTED;
-1:MISSION=PASSENGER,LIGHTING=DAYLIGHT,FLIGHT_CONDITION=VMC;

The first knowledge fragment eliminates aborted takeoffs; the second fragment eliminates commercial flights during daylight and in Visual Meteorological Conditions (VMC).

Table 1. A-Priori results for low support (s=.05, c=.9)

                            Using knowledge    Without use of knowledge
Uninteresting/Total rules   503/1656 = 30%     5353/6003 = 89%

Table 2. A-Priori results for high support (s=.55, c=.9)

                            Using knowledge    Without use of knowledge
Uninteresting/Total rules   0/4 = 0%           17/21 = 80%

3.2 Application of Modified C4.5 to the Intrusion Detection Data

We ran C4.5 on MITRE's intrusion detection data (see [19] and [17]), initially without any domain knowledge. We then used the results of this first iteration as knowledge fragments for running C4.5 on a new set of data; those results were used in turn as domain knowledge for the third iteration, and so on. The goal was to see which rules (intrusion alarms) are persistent over a period of time in progressive data sets. We observed the following behavior for the discovered rules over 25 intrusion detection data sets collected over a progressive period of time. The following two rules, first appearing in iteration 1, made it to iterations 2 and 3:

Rule 2: connection = outbound -> class T [99.9%]
Rule 5: connection = mitre -> class T [99.2%]

Similarly, some rules first appearing in later iterations survived through several subsequent iterations, and some rules re-appeared after gaps. For example, the following rule first appeared in run 5, then in run 7, runs 10 through 13, run 18, and finally in runs 24 and 25:

Rule 5: srcIPzone = boundary -> class T [85.7%]
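The persistence bookkeeping described above amounts to recording, for each discovered rule, the set of runs in which it appears. A minimal sketch (the function name and threshold parameter are our own, not the paper's):

```python
def persistent_rules(runs, min_appearances=2):
    """Given per-iteration rule sets (a list of sets of rule strings),
    return the rules that appear in at least `min_appearances` runs,
    mapped to the 1-based run numbers in which each appeared."""
    appearances = {}
    for i, rules in enumerate(runs, start=1):
        for rule in rules:
            appearances.setdefault(rule, []).append(i)
    return {rule: idxs for rule, idxs in appearances.items()
            if len(idxs) >= min_appearances}

runs = [{"r1", "r2"}, {"r1"}, {"r1", "r3"}, {"r3"}]
survivors = persistent_rules(runs, min_appearances=2)
```

Applied to the 25 intrusion-detection runs, this kind of tally is what surfaces both the rules that survive consecutive iterations and those that re-appear after gaps.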

Table 3. Error rates, run times (sec), and rule counts for 25 iterations of C4.5, with and without domain knowledge (DK)

Run        --- With DK ---           --- No DK ----
           Error  Time   #rules      Error  Time   #rules
1          6.5    0.9    6           6.5    0.8    6
2          0.3    1      8           24.2   3.6    11
3          1.3    0.8    5           11.3   3.7    13
4          0.5    1.1    4           0.3    4.6    15
5          1.1    1.1    4           3.7    6.9    17
6          17.6   0.6    0           9.2    7.9    18
7          12.8   2      8           51.9   11.5   19
8          17.6   3      5           1.4    44.4   37
9          3.4    0.8    5           24.6   26.4   23
10         1.6    1.3    5           17.3   51.9   40
11         0.9    10     13          2.3    78.9   42
12         2.9    2.2    5           6.7    59.4   33
13         41.4   1.5    9           50.7   117.7  53
14         0      0.1    0           60     127.9  49
15         0      0      0           0.1    131.5  51
16         0      0.2    0           2.5    137.2  51
17         0.5    0.4    0           0.8    134.1  43
18         22     0.6    2           20.6   177    46
19         33.5   0.8    12          11.2   163.3  62
20         0      0.3    7           0      213.8  61
21         0.2    0.5    0           21.1   218.4  63
22         0      0.5    3           1.1    253.3  73
23         0.8    1.3    0           1.2    286.1  69
24         11.1   1.9    5           7.6    365.8  74
25         0      2.3    10          0.1    244    72
Std.Dev    11.3   1.9    3.9         17.3   104.3  21.6
Avg        7.0    1.4    4.6         6.8    114.8  41.6

Table 3 compares the results of running C4.5 without use of knowledge against the new algorithm (C4.5 with our modifications to use domain knowledge). As shown in the table, the average number of rules has improved from 41.6 to 4.6, while the error rate of the DK tree is about the same as that of the batch process, but is obtained with much less average running time (1.4 sec vs. 114.8 sec) and with improved stability over time (standard deviation of error 11.3 with DK vs. 17.3 without; maximum error 41.4% with DK vs. 60% without).

4 Conclusion

We discussed our approach to improving the quality of the rules discovered by the A-Priori (association rules) and C4.5 (decision tree) algorithms by reducing the number of uninteresting rules. Our preliminary results indicate an improvement in the quality of the discovered rules: a reduction in the number of uninteresting rules, improved stability, and decreased running time. We are doing further work in this area, applying domain knowledge to improve graph analysis.

References

1. Agrawal, R., et al., "Fast Discovery of Association Rules", Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996, pages 307-328.
2. Quinlan, J.R., "C4.5: Programs for Machine Learning", San Mateo: Morgan Kaufmann, 1993.
3. Two Crows Corporation, "Introduction to Data Mining and Knowledge Discovery", Second Edition, ISBN 892095-00-0, 1998.
4. Wang, K., et al., "Interestingness-Based Interval Merger for Numeric Association Rules", American Association for Artificial Intelligence, 1998.
5. Bayardo, R., and Agrawal, R., "Mining the Most Interesting Rules", SIGKDD 1999.
6. Tseng, Shin-Mu, "Mining Association Rules with Interestingness Constraints in Large Databases", International Journal of Fuzzy Systems, Vol. 3, No. 2, June 2001.
7. Liu, B., et al., "Discovering Conforming and Unexpected Classification Rules", IJCAI-97 Workshop on Intelligent Data Analysis in Medicine and Pharmacology (IDAMAP-97).
8. Klemettinen, M., et al., "Finding Interesting Rules from Large Sets of Discovered Association Rules", International Conference on Information and Knowledge Management, 1994.
9. Padmanabhan, B., and Tuzhilin, A., "A Belief-Driven Method for Discovering Unexpected Patterns", American Association for Artificial Intelligence, 1998.
10. Sahar, S., "Interestingness Via What Is Not Interesting", SIGKDD 1999.
11. Liu, B., et al., "Pruning and Summarizing the Discovered Associations", SIGKDD 1999.
12. Liu, B., et al., "Identifying Non-Actionable Association Rules", ACM SIGKDD 2001.
13. Zaki, M., "Generating Non-Redundant Association Rules", ACM SIGKDD 2000.
14. Srikant, R., et al., "Mining Association Rules with Item Constraints", ACM SIGKDD 1997.
15. Adomavicius, G., and Tuzhilin, A., "Discovery of Actionable Patterns in Databases: The Action Hierarchy Approach", SIGKDD 1997.
16. Brin, S., et al., "Dynamic Itemset Counting and Implication Rules for Market Basket Data", ACM SIGMOD 1997.
17. Skorupka, C., et al., "Surf the Flood: Reducing High-Volume Intrusion Detection Data by Automated Record Aggregation", SANS 2001 Technical Conference, May 2001.
18. Nazeri, Z., et al., "Experiences in Mining Aviation Safety Data", ACM SIGMOD 2001.
19. Halme, L., and Bauer, R.K., "AINT Misbehaving: A Taxonomy of Anti-Intrusion Detection Techniques", National Information Systems Security Conference, October 1995.
