
Knowledge discovery in a dairy cattle database: Automated knowledge acquisition

H.A. Abbass*, W. Bligh, M. Towsey, G. Finn
Machine Learning Research Center, Data Mining Lab, School of Computing Science, Queensland University of Technology, GPO Box 2434, Brisbane Qld 4001, Australia
[email protected]

M. Tierney
Department of Primary Industries, Queensland Animal Research Institute, 665 Fairfield Road, Yeerongpilly Qld 4105, Australia
[email protected]

ABSTRACT

Knowledge Discovery in Databases (also known as Data Mining) is a powerful paradigm for automated knowledge acquisition. In this paper, we apply two data mining (DM) techniques, C5 and RULEX, to the Australian Dairy Herd Improvement Service (ADHIS) database for the Australian State of Victoria. The objective is to obtain accurate and comprehensible rule sets for reasoning about the expected milk yield of daughters arising from a potential mating. The rule sets are to be incorporated into the knowledge base of an intelligent decision support system (IDSS) for the dairy breeding industry in Australia. The data is pre-processed and categorised in two different ways, using Autoclass (a Bayesian classifier) and a domain expert's heuristic. We present a comparison study and find that rule sets derived from Autoclass categorisation and C5 decision tree induction give the best results.

INTRODUCTION

Dairy cattle are a major agricultural resource in Australia. In 1995-96, there were 13,888 dairy herds in Australia with 1.92 million cows yielding 8.7 billion litres of milk valued at $AUD 3 billion at the wholesale level. The continuing importance of the industry in the Australian economy depends on productivity increases through breeding programs and more efficient management practices. Breeding programs are expensive long-term projects that require careful planning and design. The four main components of a breeding program are summarised by Dekkers (1998) as (1) formulation of a breeding goal, (2) evaluation of the genetic potential of animals with respect to the goal, (3) selection of animals for mating and (4) determination of optimum mating strategies. Steps (3) and (4) are collectively known as the mate allocation problem, that is, deciding which sires are to be mated with which dams so as to optimise the breeding goal. From a mathematical point of view, mate allocation is a constrained optimisation problem.

* The author to whom all correspondence should be addressed.

1 Proceedings of ISDSS'99

 1999 International Society for Decision Support Systems


Although the overall goal is an economic one, Dekkers argues that a crucial decision in any breeding program is how to compromise between the rate of genetic improvement and risks associated with inbreeding. Typically, the faster the rate of genetic improvement the greater the amount of inbreeding. These and other considerations give rise to complex decision making problems with many variables. Thus animal breeding programs would appear to be likely candidates to benefit from the Intelligent Decision Support Systems (IDSS) paradigm. Previous studies (Lacroix, 1997; Wade, 1994) have shown the potential of using neural networks in animal breeding. The Victorian Department of Agriculture (Bowman, 1996) developed a computer program ‘Selectabull’ for selecting the most suitable set of bulls for a farmer’s herd and ranking them according to the expected increase in herd profitability. The program utilises Australian Breeding Values (ABVs) which measure the genetic potential of available bulls within the artificial insemination program. Another system called ConnectiBull (Finn, 1996), developed by the Machine Learning Research Center at the Queensland University of Technology (QUT), uses a neural network to predict expected progeny performance. However, it was designed around a dedicated neural network for each available sire and therefore suffers from being bull specific.

[Figure 1 (diagram omitted in this extraction): the KDD-IDSS architecture. A central kernel connects a data base, model base, knowledge base, dialogue base and mining base, each with its own dictionary and manager (data manager, model manager, knowledge manager, dialogue manager and mining manager).]

Figure 1. The data mining component within the IDSS paradigm

The mate allocation problem requires the ability to predict the expected milk-producing performance of the offspring from any mating. Two kinds of prediction are necessary: a numerical prediction (e.g. the expected yield expressed in litres of milk in the offspring's first lactation) to support the optimisation model (Abbass et al. 1999b) and a symbolic prediction (e.g. classify expected production as low, average or high) to support reasoning about mating decisions. Knowledge Discovery in Databases (KDD) is a powerful paradigm


for discovering hidden relations and patterns within large databases. The basic idea is to incorporate KDD as a component within the IDSS. The data mining module fits well with the classical versions of the IDSS framework (Turban 1990; Abbass et al. 1998). Moreover, it enriches the modularity and generality of the IDSS framework by providing it with up-to-date knowledge and model parameters, so extending its useable lifetime. In conjunction with the Department of Primary Industries, Queensland, the authors have designed a prototype IDSS to support all four components of a dairy breeding program (Abbass et al. 1998). An ameliorated version of this prototype is presented by Abbass et al. (1999a) and is shown in Figure 1. The model base contains a prediction model and a stochastic multi-objective multi-stage optimisation model which is currently under development. The database contains both farm data and information about bulls available in the Australian artificial insemination program. The knowledge base contains a set of rules for reasoning about the rejection or acceptance of a specific mating, for evaluating a potential mating qualitatively, and for matrix generation and problem decomposition to support the models in the model base. The dialogue base contains different scenarios for interaction. The data mining component will use the database within the intelligent decision support system either to find patterns for prediction or to automatically elicit knowledge for reasoning. If the objective is function estimation, the estimated function will be used to generate a predictive model, which will be added to the model base. If the objective is to automate knowledge acquisition, this knowledge will be added to the knowledge base to be available for use by the KDD-IDSS. In this paper we are concerned only with the automated knowledge acquisition component of the KDD-IDSS.
Our task is a classification problem and the objective is to elicit knowledge in the form of a set of symbolic rules that can be added to the knowledge base to support reasoning about expected progeny performance. We compare the quality of different rule sets obtained by different data mining methodologies. In particular, we compare two different methods of attribute categorisation, Autoclass and the domain expert, and two different methods of automated knowledge acquisition, C5 and RULEX. In section 2 of the paper, we describe data and algorithm engineering, including two methods of attribute categorisation and two methods of rule induction. The first categorisation method employs Autoclass (Cheeseman 1996), an unsupervised Bayesian classifier, while the second uses our domain expert to categorise the data. The two rule induction methods are RULEX and C5 (described below), each of which builds a set of logical rules to explain the data. Our objective is to obtain rule sets which are both accurate and comprehensible to the farmer as the end user of the system. The rule induction algorithm should be reliable and able to mine the database with the least possible amount of user involvement. This is because the induction algorithm itself will be incorporated into the KDD-IDSS so that rule sets can be refined and updated on-line without the intervention of the design engineer. Section 3 is devoted to results and discussion while conclusions are drawn in section 4.

DATA AND ALGORITHM ENGINEERING

Feature selection and data preparation

The raw data used for this study was obtained from the Australian Dairy Herd Improvement Service. It covers the years 1993-95 and contains 48 plain text files with over 80 million records occupying some 6 gigabytes of disk space. We selected data only for the State of Victoria as it accounts for over 60% of the Australian dairy production. This resulted in


around 5.5 million lactation records with about 19 million test-day records for around 700,000 cows (dams and daughters)1. The data contained numerous missing values, for example cows without listed daughters and daughters without listed dams. It was decided to remove such records to increase the reliability of the discovered knowledge. The five attribute fields retained for our study were:

◊ Herd average milk volume (MV) for all non-first lactations of all cows in the same herd as a dam;

◊ Dam's second lactation MV (SLMV);

◊ Sire's ABV for milk volume;

◊ Season of dam's second calving;

◊ Daughter's first lactation MV (FLMV), which is the quantity to be predicted.

All the attributes except season were continuous valued. The season of the dam's calving was sparse-coded for reasons to be explained later. Following Schmidt and Van Vleck (1974), all dam SLMV values and all daughter FLMV values were corrected for the dam's age at calving. Since age correction factors are not commonly published, because they vary with environment and management practices, we calculated them from the database. The correction factor (CF) for any yearly age group was calculated as

CF = MVmature / MVgroup

where 'MVmature' is the average MV of all mature cows aged 5 to 7 years (when production is at a peak) and 'MVgroup' is the average MV of all cows in the age group being corrected. In Figure 2, we show the average dam SLMV in each yearly age group before and after the correction for age. It is clear from the figure that the correction has removed a non-linear effect of age on milk production. We would expect the correction to improve the performance of learning algorithms such as C5 which divide the input space with hyperplanes having axis-parallel orientation.
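As a concrete illustration, the age-correction step above can be sketched as follows (a minimal sketch; the record layout and the 60-84 month definition of 'mature' are assumptions based on the 5-7 year range given in the text):

```python
# Minimal sketch of the age-correction step: CF = MVmature / MVgroup,
# computed per yearly age group from the database itself.
from collections import defaultdict

def age_correction_factors(records, mature_lo=60, mature_hi=84):
    """records: iterable of (age_in_months, milk_volume) pairs.
    Returns {yearly_age_group: CF}, where CF = MVmature / MVgroup."""
    groups = defaultdict(list)
    mature = []
    for age, mv in records:
        groups[age // 12].append(mv)           # yearly age group
        if mature_lo <= age < mature_hi:       # 'mature' cows, 5-7 years
            mature.append(mv)
    mv_mature = sum(mature) / len(mature)
    return {g: mv_mature / (sum(vs) / len(vs)) for g, vs in groups.items()}
```

A corrected SLMV is then the recorded volume multiplied by the CF of the dam's age group.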

Figure 2. Graphs of average dam milk volume (in litres) as a function of dam age (in months) before (left) and after (right) applying the correction factor for age.

1 Production of an animal is given over a standard 300-day lactation period estimated from production values measured on a set of approximately monthly test days. We excluded lactation records estimated from fewer than seven test days.


Season of calving also affects the MV of the subsequent lactation due to marked seasonal fluctuations in pasture growth. However, because we could find no simple trend in our database, we included the four calving seasons as a four-input sparse-coded binary variable. Thus, to classify daughter FLMV, we used seven inputs for the neural network used by RULEX: one each for average herd MV, dam SLMV and sire ABV for MV, and four inputs for the four possible seasons of the dam's calving. For C5, the dam's season of calving was represented as a single input taking the values 1-4 for the four seasons.

Categorisation

Two methods were adopted for the categorisation of the continuous attributes. The first was based on a heuristic provided by the domain expert while Autoclass, a Bayesian classifier, performed the second.

Expert categorisation

The expert categorised the continuous input attributes into five categories and the output attribute into three categories, determined by occurrence frequencies rather than ranges of absolute values. This is in accord with the standard practice of describing quantitative traits relative to breed, herd and season. If category boundaries were defined in absolute units (e.g. litres of milk), they would have to be redefined frequently to adjust for seasonal changes and genetic improvement. Accordingly, the continuous input attributes were categorised into low, below average, average, above average and high categories containing 10%, 15%, 50%, 15%, and 10% of the animals respectively. The output attribute was split into three categories, low, average and high, containing 25%, 50%, and 25% of the animals respectively. The expert restricted the output attribute to three classes in order to enhance rule comprehensibility. It also allowed for direct comparison with the Autoclass method, which produced three categories for the output attribute.

Autoclass categorisation

Autoclass (Cheeseman 1996) is an unsupervised clustering algorithm based on Bayes' theorem. In unsupervised learning, the task is to discover clusters or classes in the data, rather than generate class descriptions from labelled examples as in supervised learning. Autoclass is based on the classical finite mixture distribution, which is a two-part model. In the first part, given a data set and a probability density function, we seek the posterior parameter values that give the maximum interclass mixture probability that an instance belongs to the class. In the second part, we seek the most appropriate probability density function that distinguishes this class. Gaussian density distributions are employed as they avoid placing bounds on the parameters. User interaction is required to set parameters to produce satisfactory clusters, for example by setting an upper limit on the number of clusters. Autoclass defines its clusters in terms of the mean and standard deviation of a Gaussian distribution. In order to obtain the boundary, b, between any two classes having distribution means m1 and m2 (m2 > m1) and standard deviations s1 and s2 respectively, we used the formula

b = m1 + s1(m2 - m1)/(s1 + s2)

Learning algorithms

RULEX

Rule extraction from neural networks helps provide explanations for the behaviour of neural networks. Additional advantages to rule extraction include: the rules sometimes generalise better than the networks from which they are extracted; the rules may expose previously unrecognised dependencies; rule extraction helps to integrate connectionist systems with


symbolic ones, and it is a powerful tool for knowledge acquisition. RULEX (Andrews 1996) is a two-step technique, developed at the Machine Learning Research Center at QUT, for extracting rules from neural networks. In the first step the network is trained by the 'rapid backprop' algorithm. Each node in the network is a local response unit (LRU) similar to those of radial basis function networks, except that the LRU is constructed from sigmoid functions rather than Gaussians. The LRU is then approximated with an axis-parallel hyper-rectangle. The network is trained by adjusting the centre, width and steepness of each hidden unit's non-overlapping response function so as to minimise an output error. When training is complete, the second step is to extract rules by a direct encoding of the response field of each hidden unit. In the simplest case, each hidden unit produces a single if-then rule. There is an additional rule-refinement phase which simplifies the rules so as to increase comprehensibility. The three refinement operations are negation, elimination, and absorption. If all possible values of an attribute but one occur within a rule, negation of the absent value is used instead. If all possible values of an attribute make the corresponding ridge active, that attribute is eliminated because it does not contribute to discrimination. Absorption refers to the elimination of the negation of an attribute when it is redundant. The final RULEX rule set depends on the configuration of the LRUs. As with C5, the disadvantage of locally derived rules is that global structure is difficult to represent and may even be undetectable. Furthermore, rules may not account for data in overlapping regions. Nevertheless, the network training is fast and rule extraction is accurate and concise.
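The negation and elimination operations can be illustrated with a small sketch (a hypothetical rule representation for illustration only, not the actual RULEX data structures):

```python
# Hypothetical rule representation: an antecedent maps each attribute to
# the set of category values that activate the ridge; domains gives the
# full value set of each attribute.
def refine(antecedent, domains):
    """Apply elimination (attribute active for all of its values) and
    negation (all values but one present) to a rule antecedent."""
    refined = {}
    for attr, vals in antecedent.items():
        missing = domains[attr] - vals
        if not missing:
            continue                                # elimination
        if len(missing) == 1:
            refined[attr] = ('not', missing.pop())  # negation
        else:
            refined[attr] = ('in', frozenset(vals))
    return refined
```

For example, an antecedent listing three of four seasons becomes 'season is not <the fourth>', and an attribute listing all of its values disappears from the rule.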
Andrews (1996) compared a number of rule extraction techniques and found that RULEX alone does not depend on parameter initialisation, although network training by rapid backpropagation does indeed depend on the choice of training parameters.

C5

The classifier C5 (Quinlan, 1997) is a supervised-learning decision tree classifier and an enhancement of the earlier C4.5 (Quinlan, 1993). C5 branches on the attribute which maximises the information gain ratio. Most importantly, from our point of view, the default parameter settings for C5 are extremely robust, that is, they yield good results for a wide range of problem domains and data sets. Having constructed a complete decision tree, C5 proceeds to extract rules. C5 suffers from the same disadvantage as RULEX in that it does not easily recognise global rules. For example, if the data can be separated into two classes by the relation x < y for any two attributes, x and y, complete accuracy can only be achieved with a tree or network of infinite size2. The main difference between C5 and RULEX is that C5 partitions the input space directly whereas RULEX does so through the intermediate step of a neural network. Our study consisted of comparing the rule sets obtained using different combinations of the data representations and rule induction algorithms described above. Four different data sets (labelled A, B, C, and D) were constructed, which differed in the way that the input and output attributes were prepared:

1. Data Set A: inputs continuous values, outputs discretised by expert.
2. Data Set B: inputs continuous values, outputs discretised by Autoclass.
3. Data Set C: inputs and outputs both discretised by expert.
4. Data Set D: inputs and outputs both discretised by Autoclass.
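The gain ratio criterion that C5 branches on can be sketched as follows (a toy version for discrete attributes, not RuleQuest's implementation):

```python
# Toy information gain ratio, the branching criterion of C4.5/C5 for
# discrete attributes: information gain normalised by split information.
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, labels):
    """rows: list of dicts; labels: parallel list of class labels."""
    n = len(rows)
    parts = defaultdict(list)
    for row, label in zip(rows, labels):
        parts[row[attr]].append(label)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info else 0.0
```

An attribute that perfectly separates the classes scores 1.0; an uninformative one scores 0.0.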

2 This arises since the line separating the classes x < y and x ≥ y lies at 45° to the axes of the attributes x and y, whereas the field defined by a rule's antecedents has axis-parallel sides.


In all four data sets, the number of output categories was three (labelled 'low', 'average', and 'high'), giving rise to 'three-class' classification problems. Two rule extraction techniques, RULEX and C5, were applied to each of these data sets and the eight resulting rule sets were compared. Comparisons were based on three criteria: rule set accuracy, the number of rules, and the average number of antecedents per rule. Rule set accuracy was determined by calculating the average of precision and recall for each output class. Precision is defined as the percentage of instances classified as positive that are actually positive, and recall is defined as the percentage of actual positive instances correctly classified as positive. The number of rules and the average number of antecedents per rule afford an objective measure of rule set comprehensibility, although other factors also affect comprehensibility. It is generally assumed that fewer rules and/or antecedents result in greater comprehensibility. Comparisons between C5 and RULEX are difficult in this regard because, unlike C5, RULEX combines rules using disjunctions and negations. To make our objective comparisons fairer, we reversed the rule refinement step of RULEX by decomposing disjunctive rules into simpler elemental rules. For example, a rule of the form 'if (V or W) and (X or Y) then Z' consists of four elemental rules after Boolean operator distribution. That is, (if V and X then Z), (if V and Y then Z), (if W and X then Z), and (if W and Y then Z).
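The decomposition just described is a Cartesian product over the disjuncts; a minimal sketch:

```python
# Reversing RULEX's refinement step: distributing the Boolean operators
# of a disjunctive rule yields its elemental rules.
from itertools import product

def elemental_rules(disjunctive_antecedents, consequent):
    """disjunctive_antecedents: one tuple of alternatives per conjunct.
    Returns one (antecedent_combination, consequent) pair per rule."""
    return [(combo, consequent) for combo in product(*disjunctive_antecedents)]

# 'if (V or W) and (X or Y) then Z' -> the four elemental rules in the text
rules = elemental_rules([('V', 'W'), ('X', 'Y')], 'Z')
```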

RESULTS AND DISCUSSION

In Table 1, we present a summary description of the categorisation of continuous attributes resulting from the expert's heuristic and Autoclass. Accuracy and comprehensibility results for the rule sets extracted by C5 and RULEX are shown in Tables 2 and 3. We find it convenient to discuss the results in the sequence of the three major choices to be made: the choice between continuous and discrete input representation; the choice between expert and Autoclass discretisation of the attributes; and the choice between C5 and RULEX as the rule induction algorithm.
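Per-class accuracy in the tables that follow is the average of precision and recall described earlier; a toy sketch (the label lists are hypothetical, not the actual data):

```python
def class_accuracy(y_true, y_pred, cls):
    """Rule set accuracy for one output class, computed as in the text:
    the average of precision and recall for that class (in percent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)  # classified as cls
    actual = sum(1 for t in y_true if t == cls)     # actually cls
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return 100 * (precision + recall) / 2
```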


Table 1: Categorisation summary

                            Expert                    Autoclass
Category             Upper bound  Freq.(%)     Upper bound  Freq.(%)

Herd milk MV
  Low                    4164        10            4794        24
  Below average          4820        15            5306        20
  Average                6132        50            5613        12
  Above average          6815        15            5986        14
  High                    (‡)        10            6300        10
  Very high               (†)        (†)            (‡)        20

Dam SLMV
  Low                    4155        10            4346        13
  Below average          4974        15            5039        14
  Average                6968        50            5830        20
  Above average          7870        15            6350        13
  High                    (‡)        10             (‡)        40

Sire ABV for milk
  Very low                (†)        (†)            278         2
  Low                     537         7             623        12
  Below average           731        18             735        11
  Average                1264        50             842        12
  Above average          1501        15            1108        24
  High                    (‡)        10            1475        26
  Very high               (†)        (†)            (‡)        13

Daughter FLMV
  Low                    4030        25            4632        44
  Average                5643        50            5624        30
  High                    (‡)        25             (‡)        26

(†) No such category was generated by the categorisation method. (‡) The upper limit is open-ended.

Input representation: Continuous vs Discrete

In most cases, the accuracy of the rule sets produced with continuous and discrete inputs was similar. Only in two cases (both involving RULEX's performance on the 'high' class) was there a major difference (see the next to last row of Tables 2 and 3). There were, however, quite clear differences in the numbers of rules. For discrete inputs, RULEX produced more rules than C5 (column 7 in Tables 2 and 3), whereas for continuous inputs, C5 produced more rules than RULEX (column 4 in Tables 2 and 3). Accordingly, in terms of comprehensibility, RULEX had the advantage with continuous data, whereas C5 had the advantage with discrete data.


We asked the domain expert to examine the rule sets with a view to expressing a preference between rules with continuous antecedents and rules with discrete antecedents. He clearly expressed a preference for discrete antecedents, even in cases where this resulted in a larger rule set. Where rules are to be used to support a decision, it is semantically easier to employ category labels rather than numeric intervals. For example, a rule such as 'if the herd milk average is low, then the daughter's milk production is low' is easier for the user (e.g. farmer) to appreciate than 'if the herd milk average is less than 4164 litres, then the daughter's milk production is less than 4030 litres'. In addition, the expert noted that rules with antecedents expressed in terms of discrete categories are likely to remain valid even if improvements in breeding and management increase absolute animal yields and thus change the category boundaries.

Table 2: Results for data categorised by the expert

                                 Data Set A                   Data Set C
                          (continuous valued inputs)   (discrete valued inputs)
Output class   Classifier  accuracy  # rules  # ant    accuracy  # rules  # ant
Low (25%)      RULEX          64        3       4         63        72      4
               C5             63       13       3         62         3      2
Average (50%)  RULEX          75        3       4         73        17      3
               C5             70       17       3         70        14      3
High (25%)     RULEX          63        2       3         40       148      4
               C5             64       16       3         64         4      2

Accuracy and comprehensibility (as determined by the total number of rules and the average number of antecedents per rule) of rule sets extracted from data sets A and C using both RULEX and C5.

Categorisation method: Expert versus Autoclass

If we compare the accuracy of the discrete rule sets (i.e. column 6 of Table 2 with column 6 of Table 3), it is apparent that rule sets derived via Autoclass categorisation performed better on the 'low' and 'high' classes. However, the rule sets derived via expert categorisation performed better on the 'average' class. This was true for both RULEX and C5 rule extraction. There were no clear differences in rule numbers or antecedents. We would argue that good performance on the extreme classes ('low' and 'high') is more important than good performance on the 'average' class because the end user is particularly interested in identifying matings with good potential and avoiding matings with poor potential. For the attributes listed in Table 1, Autoclass tended to produce a more even distribution of category sizes than did the expert. For the above reasons, we preferred discretisation using Autoclass. However, the question arises as to the domain expert's opinion of the categories generated by Autoclass. Whereas the expert decided on 5 categories for each input attribute and 3 categories for the output attribute, Autoclass produced 6, 5, 7, and 3 categories for herd MV, dam SLMV, sire ABV and daughter FLMV respectively. We asked the domain expert to identify his preferred


choice of categorisation for each attribute based on which produced the most meaningful category boundaries expressed in litres. Although the expert endeavoured to be objective, he preferred his own categorisation in all cases. However, there was a likely bias here because his categories were easily recognisable, being always five in number. When asked to select the best two rule sets (where neither the source of categorisation nor the classifier was obvious to him), the expert chose the rule set derived via Autoclass discretisation and C5 and the rule set derived via expert discretisation and RULEX.

Table 3: Results for data categorised by Autoclass

                                 Data Set B                   Data Set D
                          (continuous valued inputs)   (discrete valued inputs)
Output class   Classifier  accuracy  # rules  # ant    accuracy  # rules  # ant
Low (44%)      RULEX          75        2       3         76        36      3
               C5             75       28       4         76        12      2
Average (30%)  RULEX          39        4       3         48       134      4
               C5             41       43       5         36        13      3
High (26%)     RULEX          53        3       3         66        30      3
               C5             65       21       3         65         6      2

Accuracy and comprehensibility (as determined by the total number of rules and the average number of antecedents per rule) of rule sets extracted from data sets B and D using both RULEX and C5.

We conclude overall that Autoclass satisfies our objective of fully automating the categorisation process. It is also important to note that a domain expert involves human resources and is an expensive source of knowledge. Since our IDSS is being designed to serve farmers under a range of environmental and managerial conditions, we must recognise the importance of having automated tools to both categorise and classify data.

Mining algorithm: RULEX versus C5

The final choice is between C5 and RULEX, based only on their performance on data categorised using Autoclass (see the right side of Table 3). Accuracy of the two induction algorithms is similar for the 'low' and 'high' classes but RULEX performs better on the 'average' class. As noted above, however, the expert preferred the C5-derived rule sets because they contained fewer rules and antecedents. The larger number of RULEX rules is the result of reversing the rule refinement process. Rule refinement certainly increases rule comprehensibility, but it should be noted that the elemental rules are more suitable for reasoning in a knowledge base than are refined disjunctive rules. Our decision in favour of C5 over RULEX is admittedly complicated by the difficulty in comparing rule set comprehensibility. However, RULEX has an additional disadvantage in that it is restricted to learning two-class problems. For problems with more than two classes, a separate neural network must be trained for each class, adding not only to computation but


also to classification difficulties. Because each record must be classified a number of times, some sort of conflict resolution is required where there are contradictions among the intermediate classifiers. C5, by contrast, can handle classification problems involving large numbers of classes. In summary, C5 would appear to be the more suitable rule induction algorithm for this application because it produces smaller and more comprehensible rule sets and because it performs well with a default set of parameters, thus obviating the need for user interaction. In the appendix, we list the rules generated by C5 for the class 'High' using Autoclass categorisation. It is interesting to see the types of dependencies in these rules. For example, the first rule indicates that the daughter milk volume is high even though the season is summer. The domain expert interpreted this as follows: if the average herd milk volume is above average, that is, the management of the farm is good, then the daughter performance is expected to be high, given that the genetic compositions of the dam and the sire are high. Also, Rule 2 reveals the importance of the managerial aspects over the genetics in the problem domain. Rules 3-5 indicate that if the sire and any one of the two variables, management or dam, are high, the other variable is not important. From our point of view, this is one of the most interesting conclusions in these rules, since the genetic improvement program need only concentrate on two variables, such as selecting a good sire and improving the efficacy of the management, to improve the herd.

CONCLUSION This paper demonstrates that automated tools for knowledge acquisition are reliable techniques that can enhance the capabilities of IDSS. For the production of accurate and comprehensible rule sets, we have compared the use of Autoclass (a Bayesian classifier) and a domain expert for attribute categorisation where Autoclass was found to be in a competitive position with the expert. Moreover, We compared two rule induction algorithms, RULEX and C5 and found C5 to have the advantage in this application. For further research, the following points summarise potential research areas inspired by this paper. We have ordered these points according to the sequence of the paper. ◊

Automated knowledge acquisition methods can be used for both knowledge elicitation and refinement. It would be interesting to know in which domains and situations, knowledge can be elicited and refined automatically.



◊ In a decision support system, knowledge refinement can be performed using integrity constraints. Since some automated knowledge acquisition methods represent knowledge as constraint systems, how to combine the two approaches is another interesting research question.

ACKNOWLEDGEMENT This work was done as part of ARC collaborative grant number C19700273.

REFERENCES Abbass H.A., Finn G. and Towsey M. ‘A Meta-Representation for Integrating OR and AI in an Intelligent Decision Support Paradigm’, Proceedings of the 15th National Conference of the Australian Society for Operations Research, Gold Coast, Australia, July, 1999a.

11 Proceedings of ISDSS'99

 1999 International Society for Decision Support Systems

Abbass H.A., Macrossan L., Towsey M., Mengersen K. & Finn G., ‘Knowledge Discovery in a Dairy Cattle Database (Mining for predictive models)’, Technical Report FIT-TR99-01, 1999b.

Abbass H.A., Towsey M. & Finn G., ‘An intelligent decision support system for dairy cattle mate-allocation’, Proceedings of the Australian Workshop on Intelligent Decision Support and Knowledge Management, AWIDS’98, Sydney, Australia, 1998, pp. 45-58.

Andrews R. & Geva S., ‘Rules and local function networks’, in Rules and Networks, Andrews R. & Diederich J., Eds., QUT Press, 1996.

Bowman P.J., Visschert P.M. & Goddard M.E., ‘Customized selection indices for dairy bulls in Australia’, Animal Science, vol. 62, pp. 393-403, 1996.

Cheeseman P. & Stutz J., ‘Bayesian Classification (AutoClass): Theory and Results’, in Advances in Knowledge Discovery and Data Mining, Fayyad U.M., Piatetsky-Shapiro G., Smyth P. & Uthurusamy R., Eds., AAAI Press/MIT Press, 1996.

Dekkers J.C., ‘Design of breeding programs: chairman’s summary’, The Sixth World Congress on Genetics Applied to Livestock Production, vol. 23-28, pp. 405-407, Armidale, Australia, 1998.

Finn G.D., Lister R., Szabo T., Simonetta D., Mulder H. & Young R., ‘Neural networks applied to a large biological database to analyze dairy breeding patterns’, Journal of Neural Computing and Applications, vol. 4, pp. 237-253, 1996.

Lacroix R., Salehi F., Yang X.Z. & Wade K.M., ‘Effects of data preprocessing on the performance of the artificial neural networks for dairy yield prediction and cow culling classification’, Transactions of the ASAE, vol. 40, no. 3, pp. 839-846, 1997.

Quinlan J.R., ‘C4.5: Programs for Machine Learning’, Morgan Kaufmann, 1993.

Quinlan J.R., ‘C5 software’, http://www.rulequest.com/see5-info.html, 1997.

Schmidt G.H. & Van Vleck L.D., ‘Principles of Dairy Science’, W.H. Freeman and Company, 1974.

Turban E., ‘Decision Support and Expert Systems: Management Support Systems’, MacMillan Series in Information Systems, 1990.

Wade K.M. & Lacroix R., ‘The role of artificial neural networks in animal breeding’, The Fifth World Congress on Genetics Applied to Livestock Production, vol. 22, pp. 31-34, 1994.


APPENDIX: A LIST OF C5 RULES FOR CLASS ‘HIGH’ USING AUTOCLASS CATEGORISATION

Rule 1
If Dam season of calve = summer
and Herd milk average = above average
and Dam milk volume = above average
and Sire ABV milk = very high
then Daughter Milk Volume = High

Rule 2
If Herd milk average = very high
then Daughter Milk Volume = High

Rule 3
If Dam milk volume = high
and Sire ABV milk = very high
then Daughter Milk Volume = High

Rule 4
If Herd milk average = high
and Sire ABV milk = very high
then Daughter Milk Volume = High

Rule 5
If Dam milk volume = high
and Sire ABV milk = high
then Daughter Milk Volume = High

Rule 6
If Dam season of calve = autumn
and Herd milk average = above average
and Sire ABV milk = very high
then Daughter Milk Volume = High
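For readers who prefer an executable form, the six rules for the class ‘High’ can be re-encoded as a single predicate. This is an illustrative sketch only, not the C5 output format: the attribute and function names are our own, while the category labels are taken verbatim from the rules.

```python
def daughter_milk_volume_high(dam_season, herd_avg, dam_volume, sire_abv):
    """Return True when any of the six C5 rules for class 'High' fires.
    Arguments are the categorical attribute values from the appendix."""
    rules = [
        # Rule 1: summer calving, well-managed herd, good dam and sire genetics
        dam_season == "summer" and herd_avg == "above average"
            and dam_volume == "above average" and sire_abv == "very high",
        # Rule 2: management alone
        herd_avg == "very high",
        # Rule 3: dam and sire
        dam_volume == "high" and sire_abv == "very high",
        # Rule 4: management and sire
        herd_avg == "high" and sire_abv == "very high",
        # Rule 5: dam and sire
        dam_volume == "high" and sire_abv == "high",
        # Rule 6: autumn calving, management and sire
        dam_season == "autumn" and herd_avg == "above average"
            and sire_abv == "very high",
    ]
    return any(rules)
```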
