MIC’2001 - 4th Metaheuristics International Conference


Classification/Identification on Biological Databases

Beatriz de la Iglesia∗, Vic J. Rayward-Smith∗, Jan-Jaap Wesselink∗

∗ School of Information Systems, University of East Anglia, Norwich, England. Email: {bli,vjrs}@sys.uea.ac.uk

1 Introduction

Heuristic algorithms for mining large databases are being adapted to enable discriminatory analysis to be performed on biological data, accelerating progress in understanding biological diversity and its industrial implications. A range of knowledge discovery algorithms are being applied to yeast characteristics data, providing new research leads and decision-making tools. The research presented here is part of a project funded by the BBSRC¹ which involves the curation and data mining analysis of yeast species and strain data, including DNA data for 700+ yeast species. There is special industrial interest in the investigation of yeast species capable of causing food spoilage, including emerging food spoilage agents.

The initial phases of the project, which cover some of the work reported here, have focused on understanding and improving the current methods for identification and classification in the biological domain. The new methods being developed should lead to faster, cheaper and more reliable classification and identification. Yeast data is being used as an initial case study, but it is expected that the approach can be extended to other taxa in the biological domain, and possibly to other domains such as medical diagnosis.

This paper will report on the adaptation of Simulated Annealing, previously used to extract partial classification rules [5], to the problem of finding an economic model for classification of organisms. The objective of the simulated annealing algorithm is to choose the classification model which allows for maximum differentiability of organisms, while minimising the total number of characteristics (or the total cost of characteristics, if this is given) required for classification.

An alternative problem is that of “economic” identification of unnamed organisms. In this case, the objective is to choose a model which will maximise the chances of delivering an identity for any unknown organism while minimising the total cost of identification. The problem of identification will be modelled in this paper. Organisms may have a known frequency of occurrence, in which case the model chosen should ensure identification of the most frequent organisms at a minimum cost. The approach has, for example, immediate application to the determination of the tests necessary to distinguish types of yeast, such as food spoilage agents, which are of great industrial interest.

This type of database analysis has parallels with Feature Subset Selection (FSS) [6], which is one of the tasks of the pre-processing stage in the Knowledge Discovery in Databases (KDD) process [7]. The objective of FSS is to isolate a small subset of highly discriminant features. It also has parallels with another of the KDD steps, data mining, as one of the common tasks of data mining is the development and analysis of classification models.

¹ BBSRC Grant No: 83/BIO 12037

Porto, Portugal, July 16-20, 2001

2 Problem formulation

The data consist of a number of records, each containing the values of a series of characteristics for a particular organism (or group of organisms). Depending on the nature of the data, there may be one or more records per organism. Characteristics may have an associated cost, and organisms may have a frequency of occurrence. It is worth noting that with the available data it may not be possible to obtain a perfect classifier (i.e. one that is capable of distinguishing all available organisms); also, some characteristics may be redundant for the classification. In this context, the definition of an optimal model may vary but will incorporate the idea of “economy” of classification/identification.

More formally, consider a set of characteristics $P = \{t_1, t_2, \ldots, t_n\}$ which applies to each organism $s \in S$, where $S$ is the set of all known organisms to be classified. Each organism is initially defined by a value for each characteristic in $P$; hence organism $i$ is defined according to $P$ by the list of values $s_i[P] = (s_i[t_1], s_i[t_2], \ldots, s_i[t_n])$, where $s_i[t_j]$, $1 \le j \le n$, denotes the value of characteristic $j$ for organism $i$. We define an organism $s$ as differentiable if, for a given set of characteristics $Q \subseteq P$, the combination of values $s[Q]$ is unique in $S$, i.e. $s[Q] = s'[Q] \Rightarrow s = s'$ for all $s' \in S$. Associated with any set of characteristics $Q \subseteq P$ and any $s \in S$ is a measure $\delta_s[Q]$, which is equal to one if $s$ is differentiable according to the tests in $Q$ and zero otherwise. Each characteristic may have an associated cost, $c(t_i)$. Each organism may have a prior probability, $p(s) \in \,]0, 1[$, depending on the frequency with which it occurs. The classification problem is then to choose a subset of characteristics, $Q \subseteq P$, that maximises

$$\sum_{s \in S} \delta_s[Q] \, p(s),$$

subject to a bound on the total cost of the characteristics contained in $Q$, i.e. such that

$$\sum_{t \in Q} c(t) \le B,$$

for some bound, $B$.

Any set of characteristics, $Q$, partitions the organisms into equivalence classes of organisms indistinguishable by $Q$. In the above, the ideal is to produce a cheap set where the associated equivalence classes each contain at most one organism, as this would deliver a perfect classification. However, any equivalence class with more than one element may be further partitioned by an additional set of characteristics. Hence, for the problem of identification, rather than conducting a single batch containing all the experiments needed to distinguish all organisms, the researcher can choose to conduct an initial smaller batch, consider the output and then determine what further experiments to undertake. Ideally, this is repeated until a unique organism is identified.

For this purpose, we require a search tree. Each internal node is a batch of experiments and each leaf node is an equivalence class of indistinguishable organisms. Ideally, these equivalence classes have a single element. Let $T$ be the search tree induced by some experimental procedure. If $s \in S$ occurs by itself in an equivalence class at a leaf node in $T$, we set $\delta_s[T] = 1$, otherwise $\delta_s[T] = 0$. We denote by $\mathrm{cost}(s)$ the cost of all the experiments performed at each node on the path from the root to the leaf containing $s$. For identification, we seek a tree, $T$, such that

$$\sum_{s \in S} \delta_s[T] \, p(s)$$

is maximised subject to

$$\sum_{s \in S} \mathrm{cost}(s) \, p(s) \le B$$

for some bound B. Where each batch of experiments is a single experiment, the resulting tree is a decision tree familiar to researchers in KDD. In the special case where each experiment has a binary outcome, finding the optimal decision tree is NP-hard [8]. Greedy algorithms such as C5 [12, 13] can be used, or a better tree can be induced using a modern heuristic [14].

Responses    −    +    others
−            0    1    0
+            1    0    0
others       0    0    0

Table 1: Differentiability matrix for the yeast
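The two objectives formulated in this section can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four toy organisms, their priors p(s), the costs c(t), the nested-dictionary tree representation and all function names are hypothetical and are not taken from the project. Organisms are treated as indistinguishable when their value tuples are exactly equal, as in the formal definition of a differentiable organism.

```python
from collections import defaultdict

# Hypothetical toy data: organism -> response to each characteristic.
ORGANISMS = {
    "A": {"t1": "+", "t2": "+", "t3": "-"},
    "B": {"t1": "+", "t2": "-", "t3": "-"},
    "C": {"t1": "-", "t2": "-", "t3": "+"},
    "D": {"t1": "-", "t2": "-", "t3": "+"},  # same profile as C
}
PRIOR = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}  # assumed p(s)
COST = {"t1": 1.0, "t2": 2.0, "t3": 5.0}          # assumed c(t)


def equivalence_classes(organisms, q):
    """Partition organisms by their tuple of values on the characteristics in q."""
    classes = defaultdict(list)
    for name, values in organisms.items():
        classes[tuple(values[t] for t in sorted(q))].append(name)
    return list(classes.values())


def classification_objective(organisms, prior, q):
    """Sum of p(s) over organisms that are alone in their equivalence class."""
    return sum(prior[c[0]] for c in equivalence_classes(organisms, q) if len(c) == 1)


def total_cost(cost, q):
    """Total cost of the characteristics in q (to be compared with the bound B)."""
    return sum(cost[t] for t in q)


def evaluate_tree(tree, organisms, prior, cost):
    """Return (sum of p(s) over identified s, sum of cost(s) * p(s)) for a search tree.

    A node is {"batch": [...], "children": {outcome_tuple: node}};
    an outcome with no child entry is treated as a leaf.
    """
    totals = {"identified": 0.0, "expected_cost": 0.0}

    def walk(node, members, spent):
        if node is None:  # leaf: members form one equivalence class
            totals["expected_cost"] += spent * sum(prior[s] for s in members)
            if len(members) == 1:
                totals["identified"] += prior[members[0]]
            return
        spent += sum(cost[t] for t in node["batch"])
        groups = defaultdict(list)
        for s in members:
            groups[tuple(organisms[s][t] for t in node["batch"])].append(s)
        for outcome, group in groups.items():
            walk(node["children"].get(outcome), group, spent)

    walk(tree, list(organisms), 0.0)
    return totals["identified"], totals["expected_cost"]


# Run t1 first; run t2 only when t1 was positive.
TREE = {"batch": ["t1"], "children": {("+",): {"batch": ["t2"], "children": {}}}}
```

On this toy data the subset {t1, t2} separates A and B but not C and D, so the classification objective is 0.7 at a total cost of 3. The tree reaches the same 0.7 with an expected cost of about 2.4, because t2 is skipped for organisms that respond negatively to t1.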

3 A case study: yeast data

For this case study, data was donated by the Centraalbureau voor Schimmelcultures (CBS)². The CBS maintains the largest collection of living fungi in the world. Their yeast database contains data on more than 4,500 strains kept in the Netherlands, and on about 1,250 strains from the IGC collection in Oeiras, Portugal. They also hold data at the species level, for 745 species; the species-level data was used for this exercise.

The CBS data stored for each yeast, at species or at strain level, consists of what we will call the conventional characteristics used to classify yeasts. There are 96 characteristics recorded for each yeast species, and they include the microscopical appearance of the cells, the mode of sexual reproduction, certain physiological activities and certain biochemical features. The characteristics used for the classification of yeasts, and their possible values, are explained in [3].

When comparing characteristic values for yeasts, traditionally only negative and positive responses are considered to establish a difference; all other responses are considered equivocal [1, 2]. The differentiability matrix for the yeast is shown in Table 1. This matrix produces a “crisp” differentiation of species, i.e. minor differences are not considered sufficient for differentiation. Yet, with this differentiability matrix, many species cannot be differentiated using conventional tests. Results will also be described using a “fuzzy” version of this differentiability matrix, which allows minor differences (for example, the difference between a positive and a delayed response to a test) to count partially towards the differentiation of two species or strains. This fuzzy classification model results in higher differentiability between species. The work presented here can be extended to classification/identification of other taxa by providing an appropriate differentiability matrix for the problem to be solved.
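As a rough illustration of how a differentiability matrix might be applied, the sketch below encodes the crisp matrix of Table 1 and one possible fuzzy variant. The fuzzy weights, the symbol "D" for a delayed response, the ASCII "-" for a negative response, the threshold of 1.0 and the function names are all assumptions chosen for the example; the fuzzy matrix actually used in the project is not given here.

```python
# Crisp matrix from Table 1: only a positive/negative pair counts as a difference.
# Any pair not listed scores 0.
CRISP = {("+", "-"): 1.0, ("-", "+"): 1.0}

# Hypothetical fuzzy variant: a delayed response "D" differs partially
# from both a positive and a negative response (weights are assumed).
FUZZY = {
    **CRISP,
    ("+", "D"): 0.5, ("D", "+"): 0.5,
    ("-", "D"): 0.5, ("D", "-"): 0.5,
}


def difference(a, b, matrix):
    """Accumulate the matrix score over paired responses of two organisms."""
    return sum(matrix.get((x, y), 0.0) for x, y in zip(a, b))


def distinguishable(a, b, matrix, threshold=1.0):
    """Two organisms differ when their accumulated score reaches the threshold."""
    return difference(a, b, matrix) >= threshold


# Two profiles that differ only by positive versus delayed responses.
species_1 = "+D"
species_2 = "D+"
crisp_result = distinguishable(species_1, species_2, CRISP)
fuzzy_result = distinguishable(species_1, species_2, FUZZY)
```

Under these assumed weights the two profiles are not separated by the crisp matrix, while the two partial differences add up to a full difference under the fuzzy one, which is the kind of gain in differentiability the fuzzy model is meant to provide.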
Traditional classification methods for yeast use the different responses to conventional characteristic tests as the criterion for differentiability of species/strains. At present, comparisons of genomes in terms of base sequences are increasingly used for classification. The creation of a classification system based entirely on DNA sequence data is an additional deliverable of this project not reported here.

There are 10 species which have been identified by the experts as food spoilers [11]; hence yeasts associated with food spoilage represent a minority of species. Nevertheless, they may cause very significant losses to the food and beverage industry, and therefore the isolation of characteristics that distinguish them from other yeast species is an important exercise. Results showing the common characteristics of food spoilage yeasts which differentiate them from other types of yeast will be presented. They will serve to validate the approach presented for classification/identification of specific groups of organisms.

² The CBS is located in Utrecht. Details can be found at http://www.cbs.knaw.nl

4 The use of Simulated Annealing

Simulated Annealing (SA) is a modern heuristic algorithm first proposed by Kirkpatrick et al. [9] and, independently, by Cerny [4]. The technique is essentially a local search in which a move to an inferior solution is allowed with a probability which decreases, as the process progresses, according to some Boltzmann-type distribution.

A general-purpose Simulated Annealing toolkit, SAmson [10], developed in-house, was used as the platform for implementing the solution to the classification problem. SAmson is an easy-to-use environment for developing solutions to optimisation problems using SA. The toolkit provides an intuitive interface, binary, integer and floating-point representations, various cooling schedules, neighbourhood operators and other features. Implementation and parameter details for this problem will be discussed in the full paper.

Some of the results obtained for the yeast case study have established, for example, that even using all the characteristics available for yeast species (96) with the established non-fuzzy differentiability matrix, only 544 of the 745 species presented to the algorithm can be considered distinct. An important conclusion from this set of experiments is that about 20 characteristics are redundant, as 544 species can already be identified using a combination of 75 tests, and 500 species are already identified using only 45 tests. Full results, including results on the modelling of the identification problem and on the fuzzification of differentiability criteria, will be presented in the paper.
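The general idea of annealing over subsets of characteristics can be sketched as follows. This is a generic illustration and not the SAmson implementation: the bit-flip neighbourhood, the geometric cooling schedule, the penalty weight per selected test, the parameter values and the toy profiles are all assumptions, and uniqueness is judged by exact equality of value tuples rather than by a differentiability matrix.

```python
import math
import random


def n_differentiable(profiles, subset):
    """Count organisms whose value tuple on `subset` is unique."""
    keys = [tuple(values[j] for j in subset) for values in profiles]
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return sum(1 for key in keys if counts[key] == 1)


def score(profiles, mask, weight=0.01):
    """Differentiable organisms minus a small assumed penalty per selected test."""
    subset = [j for j, selected in enumerate(mask) if selected]
    return n_differentiable(profiles, subset) - weight * len(subset)


def anneal(profiles, n_chars, t0=1.0, alpha=0.995, steps=2000, seed=0):
    """Maximise `score` over bit masks; returns (best_mask, best_score)."""
    rng = random.Random(seed)
    current = [True] * n_chars  # start from all characteristics
    current_score = score(profiles, current)
    best, best_score = current[:], current_score
    temp = t0
    for _ in range(steps):
        candidate = current[:]
        j = rng.randrange(n_chars)
        candidate[j] = not candidate[j]  # neighbourhood move: flip one test
        cand_score = score(profiles, candidate)
        delta = cand_score - current_score
        # Accept improvements always, worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current[:], current_score
        temp = max(temp * alpha, 1e-6)  # geometric cooling with a floor
    return best, best_score


# Hypothetical profiles: tests 0 and 1 separate all four, tests 2 and 3 are constant.
PROFILES = ["++-+", "+--+", "-+-+", "---+"]
best_mask, best_score = anneal(PROFILES, n_chars=4)
```

Because the best solution seen so far is tracked separately from the current one, the returned score can never be worse than that of the starting mask containing all characteristics. On data of this kind, the redundant constant tests are the ones such a search would be expected to drop.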

5 Preliminary conclusions

Simulated Annealing has been used successfully to improve on current classification methods for yeasts using conventional data. The results allow for the removal of redundant tests, and can highlight the subsets of tests that are most powerful for differentiating the majority of species, or subsets of species (such as spoilage yeasts). We show how the approach can be extended to solve the problem of identification in a cost-effective way. The approach developed and tested using yeast data is general enough to be suitable for the optimisation of classification/identification models of other organisms, and even of domains outside biology.

References

[1] J. A. Barnett. Identifying yeasts. Nature, 229(578), 1971.
[2] J. A. Barnett. Selection of tests for identifying yeasts. Nature, 232:221–223, 1971.
[3] J. A. Barnett, R. W. Payne, and D. Yarrow. Yeasts: Characteristics and identification, Third Edition. Cambridge University Press, Cambridge, UK, 2000.
[4] V. Cerny. A thermodynamical approach to the travelling salesman problem: An efficient simulation algorithm. J. Optim. Theory Appl., 45, 1985.
[5] B. de la Iglesia, J. C. W. Debuse, and V. J. Rayward-Smith. Discovering knowledge in commercial databases using modern heuristic techniques. In E. Simoudis, J. W. Han, and U. M. Fayyad, editors, Proceedings of the Second Int. Conf. on Knowledge Discovery and Data Mining. AAAI Press, 1996.
[6] J. C. W. Debuse and V. J. Rayward-Smith. Feature subset selection within a simulated annealing data mining algorithm. Journal of Intelligent Information Systems, 9:57–81, 1997.
[7] J. C. W. Debuse, B. de la Iglesia, C. M. Howard, and V. J. Rayward-Smith. Building the KDD roadmap: A methodology for knowledge discovery. In R. Roy, editor, Industrial Knowledge Management, pages 179–196. Springer-Verlag, London, 2000.
[8] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5:15–17, 1976.
[9] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220, 1983.
[10] J. W. Mann. X-SAmson v1.5 developers manual. School of Information Systems Technical Report, University of East Anglia, UK, 1996.
[11] J. I. Pitt and A. D. Hocking. Fungi and food spoilage, 2nd Edition. Blackie Academic and Professional, London, 1997.
[12] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[13] J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. of the Thirteenth National Conference on A.I. AAAI Press/MIT Press, 1996.
[14] M. D. Ryan and V. J. Rayward-Smith. The evolution of decision trees. In Proc. of the Third Annual Genetic Programming Conference, University of Wisconsin, Madison, Wisconsin, 1998. Morgan Kaufmann Publishers, Inc.