ASSOCIATION-RULE-BASED PREDICTION OF OUTER MEMBRANE PROTEINS

by

Rong She
B.Eng., Shanghai Jiaotong University, 1993

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

in the School of Computing Science

© Rong She 2003 SIMON FRASER UNIVERSITY April 2003

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author.

Approval

Name: Rong She
Degree: Master of Science
Title of thesis: Association-Rule-Based Prediction of Outer Membrane Proteins

Examining Committee:
Chair: Dr. Joseph G. Peters
Dr. Ke Wang, Senior Supervisor
Dr. Martin Ester, Supervisor
Dr. Fiona S. L. Brinkman, External Examiner, Department of Molecular Biology and Biochemistry, Simon Fraser University

Date Approved:

Abstract

A class of medically important disease-causing bacteria (collectively known as Gram-negative bacteria) has been shown to have a rather distinct cell structure, in which there exists an extra "outer" membrane in addition to the "inner" membrane that is present in the cells of most other organisms. Proteins resident in this outer membrane (outer membrane proteins) are of primary research interest because such proteins are exposed on the surface of these bacterial cells and are therefore priority targets for drug development. Determination of the biological patterns that discriminate outer membrane proteins from non-outer membrane proteins could also provide insights into the biology of this important class of proteins.

To date, it remains difficult to predict outer membrane proteins with high precision. Existing protein localization prediction algorithms either do not predict outer membrane proteins at all, or they simply concentrate on overall accuracy or recall when identifying outer membrane proteins. However, as the study of a potential drug or vaccine takes a great amount of time and effort in the laboratory, it is more appropriate to give priority to achieving high precision in outer membrane protein prediction.

In this thesis, we address the problem of protein localization classification with performance measured mainly by the precision of outer membrane protein prediction. We apply the technique of association-rule-based classification and propose several important optimization techniques in order to speed up the rule-mining process. In addition, we introduce a framework for building classifiers with multiple levels, which we call the refined classifier, in order to further improve classification performance on top of the single-level classifier. Our experimental results show that our algorithms are efficient and produce high precision while maintaining the corresponding recall at a good level. Also, the idea of refined classification indeed improves the performance of the final classifier. Furthermore, our classification rules turn out to be very helpful for biologists in improving their understanding of the functions and structures of outer membrane proteins.


Dedication

To Michael


Acknowledgments

I wish to sincerely thank my senior supervisor, Dr. Ke Wang, for his endless inspiration and support. Without the tremendous guidance that he has provided, this thesis work would not have gone this far. Special thanks go to Dr. Martin Ester, my supervisor, from whom I have received instrumental advice on improving the quality of this thesis. My sincere gratitude also goes to Dr. Fiona S. L. Brinkman, without whom there might be no such research at all. I also wish to thank Ms. Jennifer L. Gardy, who took her time to manually curate all of the biological sequence data on which this entire research was based. I am fortunate to have enjoyed the great experience of collaborating with all of them.

I would also like to thank my fellow students Senqiang Zhou and Fei Chen, who have offered me great help and extensive discussions throughout my research work. And thanks to all my friends for their warm friendship and encouragement that keep me going.


Table of Contents

Approval
Abstract
Dedication
Acknowledgments
Table of Contents
List of Tables
List of Figures
1. Introduction
   1.1 Cell Structure
   1.2 Outer Membrane Proteins
   1.3 Problem Characteristics and Challenges
   1.4 Thesis Organization
2. Related Work
   2.1 Protein Subcellular Localization Prediction
      2.1.1 Prediction of Cytoplasmic, Periplasmic and Extracellular Proteins
      2.1.2 Prediction of Outer Membrane Proteins
   2.2 Related Techniques
      2.2.1 Association Mining
      2.2.2 Classification


      2.2.3 Biological Sequence Classification
3. Dataset and Evaluation Methodology
   3.1 Dataset
   3.2 Classifier Evaluation Methodology
4. Association-Rule Based Prediction
   4.1 Algorithm Motivation
   4.2 SAC Algorithm
      4.2.1 Finding Frequent Subsequences
      4.2.2 Finding Frequent Patterns
      4.2.3 Building Classifier
      4.2.4 Optimization Techniques
         4.2.4.1 Maximum Support Pruning
         4.2.4.2 Confidence Pruning
   4.3 RAC Algorithm
   4.4 Summary
5. Empirical Studies
   5.1 Classification Evaluation
      5.1.1 SAC classifier
      5.1.2 RAC classifier
      5.1.3 Martelli et al.'s HMM
      5.1.4 Support Vector Machines
      5.1.5 Decision Tree Classifier (See5)
      5.1.6 Summary
   5.2 Effect of Maximum Support Pruning
6. Conclusions and Future Work
   6.1 Contributions
   6.2 Future Work

Bibliography


List of Tables

Table 3.1 Gram-negative Bacterial Protein Dataset
Table 3.2 Confusion Matrix in Classification
Table 5.1 Number of Frequent Patterns Mined
Table 5.2 Number of Frequent Patterns Mined (Continued)
Table 5.3 Performance of SAC classifiers
Table 5.4 Performance of SAC classifiers (Continued)
Table 5.5 Performance of RAC classifiers
Table 5.6 Performance of SVMs
Table 5.7 Performance of See5
Table 5.8 Effects of Maximum Support Pruning


List of Figures

Figure 1.1 Typical cell structure
Figure 1.2 Cell structure of a Gram-negative bacterial cell
Figure 1.3 A β-barrel outer membrane protein
Figure 2.1 SVM with kernel function mapping
Figure 4.1 Algorithm SAC
Figure 4.2 The GST for three sequences: MNQIHK, MKKFK and MKKC
Figure 4.3 Pattern enumeration
Figure 4.4 Rule pruning based on pessimistic error estimation
Figure 4.5 Procedure MineSubsequence
Figure 4.6 Algorithm RAC
Figure 5.1 Comparison of 5 Outer Membrane Protein Prediction Algorithms
Figure 5.2 Effects of Maximum Support Pruning on precision and recall
Figure 5.3 Effects of Maximum Support Pruning on running time


Chapter 1  Introduction

The explosive growth in biological research, such as the high-throughput genome sequencing projects [11, 31], is generating data whose volume and complexity are unprecedented in biology. For example, the number of entries in the protein sequence data bank SWISSPROT [4] has grown by over 30 times since its establishment in 1986. It currently contains 122,564 protein sequence entries (release 41.0, 05-Mar-2003), representing 7,778 different species. In order to realize the full potential of such accomplishments, there is an urgent need to bridge the gap between the enormous amount of sequence data and the meaningful knowledge that can be derived from it. This will define biological research throughout the coming decades and require the expertise and creativity of teams of biologists, chemists, engineers and computer scientists, among others [28]. Although scientists have attempted to organize and annotate these sequences with some success, much remains to be known. In particular, one of the most critical tasks is to correctly classify biological sequences into their corresponding functional families, and one such classification problem that is of special biological significance is to classify proteins according to their subcellular localizations [9].

Protein subcellular localization plays an important role with regard to the functions of proteins. For proper functioning, a protein has to be transported to the correct intra- or extra-cellular compartment of a cell, or attached to a membrane that surrounds the cell. Knowing the location where a protein resides in a cell provides substantial insights for biologists to predict its corresponding functions and interactions with other molecules. In the rest of this chapter, we motivate the study of one particular protein subcellular localization prediction problem --- predicting outer membrane proteins --- by introducing some background information in molecular biology.

1.1 Cell Structure

All living things are composed of cells, from just one to many millions, whose details usually are visible only through a microscope. Within cells, many of the basic functions of organisms, such as extracting energy from food and getting rid of waste, are carried out. Although cells of different organisms or different tissues come in different sizes and shapes, the way in which a cell functions and the basic cell structure are for the most part similar.

The work of the cell is carried out by many different types of molecules it assembles, mostly proteins. Protein molecules are long, usually folded chains made from 20 different kinds of amino acid molecules. The function of each protein molecule depends on its specific sequence of amino acids and the shape the chain takes as a consequence of attractions between different parts of the chain. Proteins have to be localized at their final destinations to exert their physiological functions. That is, they have to be transported to the correct subcellular compartments to fulfill their tasks.

A typical cell structure found in the animal and plant kingdoms as well as most bacteria has the following common components which are illustrated in Figure 1.1:

• Cytoplasm, gel-like substance inside the cell.

• Cell membrane, which serves as a boundary between the cell and the outside environment.

Figure 1.1 Typical cell structure

Consequently, in these cells, once proteins are synthesized in the cytoplasm, they can reside in these subcellular locations (as indicated by the numbers in Figure 1.1): (1) in the cytoplasm, (2) attached to the membrane and (3) transported to outside of the cell (the extra-cellular space).

However, the cell structure of a special family of bacteria, collectively called Gram-negative bacteria1, is somewhat different. In a Gram-negative bacterial cell, there exists an extra "outer" membrane in addition to the usual membrane that is present in most other organisms, as shown in Figure 1.2.

Figure 1.2 Cell structure of a Gram-negative bacterial cell

1 Gram-negative bacteria are those that decolorize during the Gram stain procedure. Originally developed by Hans Christian Gram, the Gram stain procedure involves the application of a solution of iodine to cells previously stained with crystal violet or gentian violet. This procedure produces "purple colored iodine-dye complexes" in the cytoplasm of bacteria. The cells are then treated with a decolorizing agent such as 95% ethanol or a mixture of acetone and alcohol. While Gram-positive bacteria retain the purple iodine-dye complexes, Gram-negative bacteria do not retain the complexes when decolorized. To visualize decolorized Gram-negative bacteria, a red counterstain such as safranin is used after the decolorization treatment. Picking up safranin, decolorized Gram-negative bacteria appear pink.


The outer membrane completely surrounds the cell and provides extra strength to protect the cell from the harsh environment. The periplasm is the gelatinous material between the outer membrane and the inner membrane. It contains enzymes for nutrient breakdown as well as binding proteins to facilitate the transfer of nutrients across the inner membrane. Thus in a Gram-negative bacterial cell, proteins have five different localizations as indicated by the numbers in Figure 1.2: (1) cytoplasm, (2) inner membrane, (3) periplasm, (4) outer membrane and (5) extra-cellular.

1.2 Outer Membrane Proteins

Biological experiments indicate that the information required to direct a protein to its corresponding localization site is primarily encoded in its amino acid sequence. For example, the presence of a string of amino acid residues in a protein that forms a structure known as a transmembrane α-helix is indicative of a protein resident at the inner membrane. Such α-helix structures have a very characteristic sequence, and current algorithms for detecting them in a given protein sequence are already very accurate [20]. However, proteins that reside at the outer membrane of Gram-negative bacteria, the integral outer membrane proteins, do not contain such characteristic α-helices; instead, they contain antiparallel β-strands that form a barrel shape (β-barrels). These proteins are also referred to as β-barrel membrane proteins [29], as shown in Figure 1.3. In a β-barrel membrane protein molecule, the central barrel shape, formed by antiparallel β-strands, rests in the outer membrane of the bacteria. The aromatic amino acids shown ("rings" near both the top and the bottom of the protein) form a "girdle" to anchor the protein in the membrane. The β-strands are connected by short stretches of amino acid sequences (turns) at the inner, or periplasmic, side, and longer stretches (loops) at the outer, or extracellular, side.


Figure 1.3 A β-barrel outer membrane protein

The identification of such outer membrane proteins is of great research interest for several reasons:

• Gram-negative bacteria are medically important disease-causing pathogens. They are responsible for diseases as diverse as food poisoning, water-borne diseases, gonorrhea, plague, ulcers, dermatitis, meningitis, etc. Some of these bacteria also cause diseases that affect other animals and plants of agricultural interest. For example, Gram-negative bacterial infections are the leading cause of morbidity and mortality in dairy cattle.

• Additionally, the presence of the outer membrane excludes certain drugs and antibiotics from penetrating the cell, partially accounting for why Gram-negative bacteria are generally more resistant to antibiotics than are Gram-positive bacteria. Thus developing new drugs against these Gram-negative bacteria is especially important.

• Because outer membrane proteins are exposed on the surface of such bacterial cells, and it is usually easier to develop drugs against the surface of a disease-causing bacterial cell, these proteins are the most accessible targets for antibiotic and vaccine drug design. The ability to identify such potential targets from sequence information alone would allow researchers to quickly prioritize a list of proteins for further study.

• These surface-exposed proteins are also useful in diagnostics as a means to detect the bacteria, hence are useful for diagnosing diseases or detecting bacteria in the environment.

• Furthermore, outer membrane proteins exhibit high similarity at the three-dimensional level but little at the level of the amino acid sequence itself, and it remains difficult to characterize the factors that cause a protein to take its three-dimensional shape. Being able to predict outer membrane proteins from sequence information alone will effectively assist in genome annotation and classification.

• Determination of the relevant sequence patterns that discriminate outer membrane proteins from non-outer membrane proteins may also provide insights into the biology of this important class of proteins.

1.3 Problem Characteristics and Challenges

The distinct cell structure of the Gram-negative bacteria presents an interesting challenge for protein localization prediction. While Gram-negative bacteria have five primary sites for protein localization, most other organisms do not have the "outer" membrane, hence methods for membrane protein prediction in other organisms are not directly applicable to predicting outer membrane proteins in Gram-negative bacterial cells.

In the process of finding new methods to deal with this situation, two characteristics of the outer membrane protein prediction problem should be noted.

• First of all, the measure for evaluating the performance of outer membrane protein prediction differs from that of many other classification problems. Instead of using overall accuracy, or favoring recall over precision, we favor precision over recall.2

Because of the medical significance of outer membrane proteins, and the lengthy time it takes to further study a prioritized drug, vaccine or diagnostic target in the laboratory, biologists want to be fairly sure about the certainty of a sequence being an outer membrane protein once it is classified as such by a classifier. That is, for a classifier that discriminates outer membrane proteins from non-outer membrane proteins, the highest priority should be given to the precision of outer membrane protein prediction, since our goal is to reliably identify outer membrane proteins.

This is very different from many data mining applications, where the overall accuracy for all classes is used as the performance measure and there is no single “target” class. It is also different from typical rare-class classification problems, where an important consideration is to cover as many rare-class samples (for example, responders in direct marketing applications or intruders in network intrusion detection problems) as possible, at the expense of falsely covering some non-target class samples, i.e., favoring recall over precision.

• Secondly, in order for biologists to gain further insights into the underlying biochemical mechanism, it is important to apply techniques that can identify the most significant sequence patterns discriminating outer membrane proteins from non-outer membrane proteins, so that these patterns can be used by biologists for further analysis.

2 Formal definitions of all three commonly used classification performance measures (overall accuracy, precision and recall) are given in Chapter 3.

1.4 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 reviews some related existing work. Chapter 3 presents the outer membrane protein dataset used for our experimentation, as well as our evaluation methodology. Chapter 4 describes the details of our algorithms. In Chapter 5, we report our experimental results and performance study. Finally in Chapter 6, we conclude the thesis and present some future work.

For convenience, from this point on, we will refer to the outer membrane protein as OMP.


Chapter 2  Related Work

In this chapter, two kinds of related work are reviewed. Section 2.1 provides a brief discussion of previous work on similar protein localization problems. In Section 2.2, we review research that is technically related to our work.

2.1 Protein Subcellular Localization Prediction

The protein subcellular localization problem has been studied for some time. In particular, among the five primary localization sites in which proteins could reside in the Gram-negative bacteria, proteins integrated at the inner membrane (α-helix transmembrane proteins) can already be accurately predicted [20], as mentioned in Section 1.2. Thus the remaining question is how to predict the other four protein localizations in Gram-negative bacterial cells.

2.1.1 Prediction of Cytoplasmic, Periplasmic and Extracellular Proteins

Several predictors exist in the biological domain that predict the localizations of proteins in both prokaryotic and eukaryotic cells3 and are publicly available. Most of these prediction methods are based on the amino acid composition of protein sequences.

3 All living cells can be divided into two groups: prokaryotic and eukaryotic. Prokaryotic cells are generally much smaller and simpler than eukaryotic cells. Animals, plants, fungi, protozoans, and algae all possess eukaryotic cell types. Only bacteria have prokaryotic cell types.


Nakashima and Nishikawa have shown that intracellular and extracellular proteins differ significantly in their amino acid composition [23]. Since then, several methods based on protein amino acid composition have been developed. Reinhardt and Hubbard [25] constructed a prediction system using supervised neural networks. They dealt with prokaryotic and eukaryotic sequences separately and obtained an overall accuracy of 81% for 3 subcellular locations in prokaryotic sequences and 66% for 4 subcellular locations in eukaryotic sequences. Later on, other scientists proposed different algorithms and tested them on the same dataset that was used by Reinhardt and Hubbard. Chou and Elrod [6] used a covariant discriminant algorithm and achieved an overall accuracy of 87% on the prokaryotic sequences. Yuan [37] built Markov chain models and obtained an accuracy of 89% for the prokaryotic sequences and 73% for the eukaryotic sequences. Hua and Sun [14] used support vector machines and achieved the highest accuracy on this dataset --- 91% for prokaryotic sequences and 79% for eukaryotic sequences.

However, these methods are limited to predicting three subcellular localizations for prokaryotic cells and four subcellular localizations for eukaryotic cells. Particularly, the three localizations for prokaryotic sequences are cytoplasmic, periplasmic and extracellular. They do not predict the proteins that attach to the outer membrane in the Gram-negative bacterial cells.

2.1.2 Prediction of Outer Membrane Proteins

To date, little research has been done into the prediction of OMPs. Scientists have previously used neural network-based methods [8, 16], hydrophobicity analysis [27], and combinations of methods, including homology analysis and amino acid abundance [36, 38], to varying degrees of success. The most recent approach, reported by Martelli et al. [22], is so far the most successful attempt at OMP prediction. They used a hidden Markov model (HMM) to represent the prototypes of OMPs, as it is known that each amino acid residue in a β-barrel membrane protein can be categorized into one of three types: outer loops that stick out into the extracellular space, transmembrane β-strands, and inner turns that stretch into the periplasm. The training dataset for building the HMM was the set of all non-redundant β-barrel membrane proteins whose three-dimensional structure has been determined experimentally. However, there were only 12 such proteins available, which is not a large dataset. Once the HMM was trained, classification was performed by computing the probability of the protein sequence being emitted by such a topological model. They reported a fairly good recall of 84% for OMP prediction (called accuracy in their publication) on their testing dataset.

Unfortunately, none of the OMP prediction algorithms are publicly available. Additionally, most of them were trained on small datasets such as the small set of proteins used by Martelli et al. In many cases, these methods were not tested on some other dataset with known OMPs; instead, they were used to screen the genomes or a list of putative OMPs, and the number of OMPs found was reported. In other words, they do not report accuracy or precision on OMP prediction. Thus, there is no way to critically evaluate any of these methods except the HMM algorithm used by Martelli et al. Even the testing dataset used by Martelli et al. was assembled based on the third-party annotations extracted directly from the SWISSPROT database, annotations which are not verified in some cases and may present incorrect information.

Furthermore, all of the above protein localization research evaluated classification performance based on overall accuracy and weighted all locations equally. In the context of this research, however, the overall accuracy does not accurately measure the prediction performance on OMPs, as the number of OMPs is relatively small compared with proteins at other localization sites. In fact, as will be shown in Chapter 3, OMPs currently represent less than 30% of the entire family of Gram-negative bacterial proteins with known localization sites. Therefore, instead of using the overall accuracy, which would be determined mainly by the prediction performance on non-OMPs, our performance evaluation is focused on the precision of OMP prediction. In terms of this measure, the method used by Martelli et al. obtained only a very low precision of 46% on their testing dataset, as calculated by us based on other performance figures reported in their publication; on the other hand, the overall accuracy that they achieved is 89.4%. The dramatic difference between these two performance measures shows the importance of choosing the appropriate measure in classifier evaluation.

2.2 Related Techniques

In the context of OMP prediction, biologists are interested not only in correctly classifying proteins as OMPs or non-OMPs, but also in identifying the most significant subsequences capable of discriminating OMPs from non-OMPs, as such patterns may lead to new insights into the biology of this important class of proteins. To this end, two major topics in data mining research provide useful techniques: association mining and classification.

2.2.1 Association Mining

Association mining deals with the problem of identifying statistically significant itemsets (sets of items) that frequently occur in a given dataset [1]. Algorithms for mining frequent itemsets in transactional databases have been studied extensively [2, 13, 34]. These algorithms search for interesting frequent patterns, associations, correlations, or causal relationships among sets of items or objects. Such relationships are usually represented by association rules, the rules produced by association mining.

For example, a rule A → B indicates that data samples that contain itemset (or item) A are also likely to contain itemset (or item) B at the same time. Each association rule is associated with two parameters: support and confidence. Support is the percentage of data samples in the entire dataset that contain both A and B, and confidence is the conditional probability that a data sample contains B given that it contains A.
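As a concrete illustration (not part of the thesis algorithms), the following Python sketch computes support and confidence for a hypothetical rule A → B over a toy transactional dataset; the item names are invented for the example.

# Toy illustration of support and confidence for a rule A -> B.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

A, B = {"bread"}, {"milk"}

n_total = len(transactions)
n_A = sum(1 for t in transactions if A <= t)          # transactions containing A
n_AB = sum(1 for t in transactions if (A | B) <= t)   # transactions containing both A and B

support = n_AB / n_total          # fraction of all transactions containing A and B
confidence = n_AB / n_A           # conditional probability of B given A

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.50, confidence = 0.67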

When dealing with biological sequence data, special techniques are needed to preserve the sequential order that is implicit in the data. The bodies of the association rules produced from sequences are not itemsets; rather, they are amino acid patterns, or motifs, so the rule bodies themselves are a good starting point for further biological analysis. There is no generally agreed upon method for how patterns should be derived from biological sequences. An overview of pattern construction algorithms is given in [5], which classifies algorithms based on whether they use a bottom-up or a top-down approach. Bottom-up algorithms work by enumerating candidate patterns and counting their occurrences in the sequences. In contrast, top-down approaches look for local similarities between sequences and extract candidate patterns based on these similarities. This can be done in several ways, for example by searching for sufficiently long common subsequences, or by first aligning the sequences to minimize the mismatches and then extracting patterns from the alignment. We use a method that is a combination of both: first we extract frequent subsequences with a top-down approach, and then we count the occurrences of combinations of these subsequences in the protein sequences in a bottom-up fashion.

It is easy to see that when the right-hand side of an association rule represents a class label, in our case OMP or non-OMP, the rule can be directly used for classification. Such a rule is applicable only to those sequences that contain the amino acids in exactly the same order as in its rule body.

2.2.2 Classification

Classification tries to predict the class labels of unseen cases. In order to construct a classifier, a training dataset is usually used to extract features and build the classification model. The classification performance is then measured on another dataset that is reserved for algorithm testing.

Classification has been studied extensively in the literature, and the techniques developed for classification include a variety of learning algorithms, such as k-nearest neighbors, decision tree induction, Bayesian classification, neural networks, hidden Markov models, support vector machines, association rule-based classification, etc.

While no single technique has proven to be the best in all situations, association rule-based classifiers have been found to be very useful in many situations [3, 21, 35]. The richness and expressive power of association rules lay concrete foundations for building classification models. In addition, the easy interpretability of the rules by humans and the competitive performance exhibited in many application domains have made such rule-based models especially popular. For example, they have been shown to outperform C4.5, a classic decision tree algorithm [21, 35].

The association rule-based classification usually consists of the following steps:

1. Compute frequent itemsets that occur together in the training dataset at least as frequently as a pre-determined minimum support percentage, a threshold denoted by MinSup. The itemsets mined must also contain the class labels.

2. Generate association rules from the frequent itemsets, where the right-hand side of the rules contains only class labels. In addition to the MinSup threshold, these rules must also satisfy a minimum confidence, set by a threshold MinConf.

3. As the association rules generated in step 2 are usually huge in number, and many of them are overfitting4 and do not generalize to unseen testing cases, pruning needs to be done in order to select appropriate association rules to be used for classification. Pessimistic error estimation5 [24] is often used to prune overfitting rules. The basic principle is to prune rules that are not helpful in reducing the error rate of the final classifier.

4. Finally, classification is done by using the remaining classification rules to predict the class labels of the test data. This can be done either by using a collection of matching rules and taking the majority vote, or by using the highest-ranked matching rule.

4 Overfitting is the phenomenon that a learning algorithm adapts so well to a training set that the random disturbances in the training set are included in the model as being meaningful. Consequently (as these disturbances do not reflect the underlying distribution), the performance on the test dataset (with its own, but definitively other, disturbances) will suffer from techniques that learn too well.

However, all previous association-rule based classification studies concentrated on transactional datasets and none of them has been applied to biological sequence classification. Lesh et al. [19] applied association mining techniques to the task of feature selection in order to improve classification performance on sequential data, including biological sequences. They selected a subset of frequent sequential patterns as features for classification. However, their work is focused on efficient feature mining and not on building a classifier. When it comes to classification, the features are not directly used to construct classification rules; instead, they are used to transform the original sequential data into vectors, and then standard classifiers that take vectors as input are applied, for example the Naïve Bayes classification algorithm. Note that when mining frequent sequential patterns, they limited the patterns to be no longer than some user-defined maximum length and did not distinguish consecutive sequential patterns from non-consecutive ones.

5 The estimated error of a rule can be computed based on its observed error on training data. For example, if a rule currently covers N cases in the training data and E of them are incorrectly classified, we may consider this as observing E events in N trials of the binomial distribution; then, with a confidence factor CF (the most often used CF is 25%), the upper limit of the error rate over the entire population is given by UCF(N, E). In other words, if we randomly select N samples that are covered by such a rule, the number of incorrectly classified cases would be at most N × UCF(N, E), with confidence factor 25%. Overly specific rules tend to have larger UCF(N, E) values and are more likely to be pruned.
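As a rough illustration of the upper error limit UCF(N, E) described in this footnote, the sketch below uses the Clopper-Pearson (exact binomial) upper confidence bound from scipy; this is one common way to compute such a limit and is our own assumption, not necessarily the exact formula used in [24] or in C4.5.

from scipy.stats import beta

def pessimistic_error_upper_bound(n, e, cf=0.25):
    """Upper limit on the true error rate of a rule that misclassifies
    e of the n training cases it covers, at confidence factor cf.
    Uses the Clopper-Pearson (exact binomial) upper bound as one
    possible realization of U_CF(N, E)."""
    if e >= n:
        return 1.0
    # Upper endpoint of a one-sided (1 - cf) binomial confidence interval.
    return beta.ppf(1.0 - cf, e + 1, n - e)

# A specific rule covering few cases gets a larger pessimistic error estimate
# than a general rule with the same observed error rate.
print(pessimistic_error_upper_bound(10, 1))    # about 0.25
print(pessimistic_error_upper_bound(100, 10))  # about 0.13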


2.2.3 Biological Sequence Classification

On the other hand, the growing interest in bioinformatics has resulted in a large body of research into biological sequence classification and many techniques have been used. The most widely used approaches include k-nearest neighbors, Markov models and support vector machines. Deshpande and Karypis [7] evaluated these techniques and showed that support vector machines generally produce higher accuracies when classifying biological sequences.

Support vector machines [32], or SVMs, have a solid theoretical foundation based on statistical learning theory and are especially useful for two-class classification problems. The general idea of an SVM is to treat each sample as an input vector that is mapped into some high-dimensional feature space, so that each sample is essentially a data point in this high-dimensional space. SVMs are trained to construct an optimal separating hyperplane (OSH), which maximizes the margin (the distance between the hyperplane and the nearest data points of each class) in the feature space. The maximum margin hyperplane is found by quadratic programming techniques that determine the direction of the hyperplane through a set of support vectors (SV, a small subset of the input vectors).

SVMs can generalize well to unseen data samples even in the presence of a large number of features, as long as the training data can be separated by a sufficiently wide margin. When training data are not linearly separable, a kernel function K can be used to map the original data vectors into a much higher dimensional space where the data points are linearly separable. Typical kernel functions include:

Linear function: $K(x_i, x) = x_i \cdot x$   (1)

Polynomial kernel function: $K(x_i, x) = (x_i \cdot x + 1)^d$   (2)

Radial basis function (RBF): $K(x_i, x) = \exp(-\gamma \|x_i - x\|^2)$   (3)

The parameter d in equation (2) is the degree of the polynomial function; when d = 1, the function reverts to the linear function. In addition, a soft-margin separating hyperplane can be applied, which allows some degree of misclassification error on the training data in order to obtain a larger margin. Figure 2.1 illustrates an SVM with kernel function mapping.
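For readers who prefer code to formulas, the three kernels in equations (1)-(3) can be written directly in NumPy; this is a generic illustration, not code from the thesis, and the example vectors are arbitrary.

import numpy as np

def linear_kernel(xi, x):
    """Equation (1): dot product of the two vectors."""
    return np.dot(xi, x)

def polynomial_kernel(xi, x, d=3):
    """Equation (2): (xi . x + 1)^d; d = 1 reduces to the linear kernel."""
    return (np.dot(xi, x) + 1.0) ** d

def rbf_kernel(xi, x, gamma=0.5):
    """Equation (3): exp(-gamma * ||xi - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

xi = np.array([1.0, 0.0, 2.0])
x = np.array([0.5, 1.0, 1.0])
print(linear_kernel(xi, x), polynomial_kernel(xi, x, d=2), rbf_kernel(xi, x))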

In Figure 2.1, two classes denoted by circles (colored differently as black and white circles) are not linearly separable in the input space. SVM maps the input vectors to a high-dimensional feature space by using a kernel function K and finds the optimal separating hyperplane (OSH, the solid line) in the feature space that corresponds to a nonlinear boundary in the input space. Data points that are labeled 1, 2 and 3 are support vectors, as they are the closest data points of the two classes with regard to the OSH. Note that the margin for OSH (denoted as “margin1”) is quite small since the OSH does not allow misclassification in the training samples. Obviously this tends to produce the problem of overfitting.

On the other hand, by allowing classification errors to some extent, SVMs can generate separating hyperplanes with larger margins (Soft margin SH, the double solid line). The new margin is denoted as “margin2” in Figure 2.1. In this case, data point 3 represents the misclassification traded for the larger margin, and data points 1, 2 and 4 become the new set of support vectors.

Figure 2.1 SVM with kernel function mapping


The class label of any new sample x is determined by the output value of the following decision function:

$$ f(x) = \sum_{x_i \in SV} \lambda_i y_i K(x_i, x) + b \qquad (4) $$

where $K(x_i, x)$ is the kernel function, $y_i$ is the class label of each support vector and the coefficients $\lambda_i$ are the weights assigned to the support vectors, obtained by solving the following convex quadratic programming (QP) problem:

$$ \text{Maximize} \quad \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_i \lambda_j \, y_i y_j \, K(x_i, x_j) \qquad (5) $$

$$ \text{subject to} \quad 0 \le \lambda_i \le C, \quad i = 1, 2, \ldots, N, \qquad \sum_{i=1}^{N} \lambda_i y_i = 0 $$

Once the weights $\lambda_i$ are determined, the support vectors are just those input vectors that have non-zero weights. In equation (5), N is the number of input vectors, i.e., the number of training samples; C is a regularization parameter that controls the tradeoff between the margin and the misclassification error when the soft-margin separating hyperplane is used. The larger the value of C, the heavier the penalty placed on misclassification, and the SVM will find a separating hyperplane with a smaller margin (or may even fail to find such a hyperplane).
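To make the role of the kernel and the regularization parameter C concrete, here is a minimal scikit-learn sketch on synthetic data; it is an illustration only, not the SVM experiments reported in Chapter 5, and all parameter values are arbitrary choices.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic two-class data standing in for feature vectors derived from sequences.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", gamma=0.5, C=C)   # soft-margin SVM with an RBF kernel
    clf.fit(X, y)
    # A smaller C generally yields a wider margin and more support vectors,
    # at the cost of more misclassified training samples.
    print(f"C={C}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.2f}")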

While SVMs make robust classifiers that give good classification performance, no biological analysis can easily be done with such classifiers. SVMs do not explicitly assign weights to all features. Instead, they perform a kernel convolution between test data and support vectors for the sake of efficient classification in high-dimensional feature spaces. This holds even if a simple kernel, say the linear kernel, is used, which explicitly expresses the decision function as a linear combination of features. Therefore, it is very hard to extract the most important features from SVM classifiers. In contrast, the classification approach based on association rules explicitly expresses the implications between the class labels (OMPs or non-OMPs) and the sequence patterns that are discriminative enough to serve as classification rules, rules that are interpretable and can be easily understood by biologists.


Chapter 3  Dataset and Evaluation Methodology

In this chapter, we present a high-quality dataset that was created for the purpose of classifying protein localizations in Gram-negative bacterial cells. We also introduce the evaluation measure that we will be using in the empirical studies.

3.1 Dataset

To critically evaluate the effectiveness of OMP prediction methods, we obtained a dataset from our partners at the Department of Molecular Biology and Biochemistry at Simon Fraser University as part of our collaborative efforts [12] to tackle the outer membrane protein prediction problem. This dataset represents the largest available set of Gram-negative bacterial proteins with experimentally determined subcellular localizations and is available online at http://www.psort.org/dataset. The dataset was created by extracting all Gram-negative proteins with an annotated subcellular localization site from the SWISSPROT database (http://us.expasy.org/sprot/). The annotated localization sites were then confirmed through a manual search of the literature, and those proteins with an experimentally verified localization site were added to the dataset. Being the largest dataset of its kind, and with the subcellular localization of each protein confirmed by biological experiments, this dataset is of excellent quality and provides a reliable means to evaluate different methods.

The dataset contains protein sequences of variable lengths. Protein sequences are made up of hundreds, sometimes thousands, of amino acids, over an alphabet of 20 amino acids. Each amino acid is represented by a letter: alanine (A), cysteine (C), aspartic acid (D), glutamic acid (E), phenylalanine (F), glycine (G), histidine (H), isoleucine (I), lysine (K), leucine (L), methionine (M), asparagine (N), proline (P), glutamine (Q), arginine (R), serine (S), threonine (T), valine (V), tryptophan (W) and tyrosine (Y). The longest sequence in our dataset consists of 3705 amino acid residues and the shortest sequence has a length of only 50.

A sample outer membrane protein sequence:

(A32247 Outer membrane (Autotransporter); made up of 1102 amino acids)
MNQIHKFFCNMTQCSQGGAGELPTVKEKTCKLSFSPFVVGASLLLGGPIAFATPLS
GTQELHFSEDNYEKLLTPVDGLSPLGAGEDGMDAWYITSSNPSHASRTKLRINSDI
MISAGHGGAGDNNDGNSCGGNGGDSITGSDLSIINQGMILGGSGGSGADHNGDG
GEAVTGDNLFIINGEIISGGHGGDSYSDSDGGNGGDAVTGVNLPIINKGTISGGNG
GNNYGEGDGGNGGDAITGSSLSVINKGTFAGGNGGAAYGYGYDGYGGNAITGD
NLSVINNGAILGGNGGHWGDAINGSNMTIANSGYIISGKEDDGTQNVAGNAIHIT
GGNNSLILHEGSVITGDVQVNNSSILKIINNDYTGTTPTIEGDLCAGDCTTVSLSGN
KFTVSGDVSFGENSSLNLAGISSLEASGNMSFGNNVKVEAIINNWAQKDYKLLSA
DKGITGFSVSNISIINPLLTTGAIDYTKSYISDQNKLIYGLSWNDTDGDSHGEFNLK
ENAELTVSTILADNLSHHNINSWDGKSLTKSGEGTLILAEKNTYSGFTNINAGILK
MGTVEAMTRTAGVIVNKGATLNFSGMNQTVNTLLNSGTVLINNINAPFLPDPVIV
TGNMTLEKNGHVILNNSSSNVGQTYVQKGNWHGKGGILSLGAVLGNDNSKTDR
LEIAGHASGITYVAVTNEGGSGDKTLEGVQIISTDSSDKNAFIQKGRIVAGSYDYR
LKQGTVSGLNTNKWYLTSQMDNQESKQMSNQESTQMSSRRASSQLVSSLNLGE
GSIHTWRPEAGSYIANLIAMNTMFSPSLYDRHGSTIVDPTTGQLSETTMWIRTVGG
HNEHNLADRQLKTTANRMVYQIGGDILKTNFTDHDGLHVGIMGAYGYQDSKTH
NKYTSYSSRGTVSGYTAGLYSSWFQDEKERTGLYMDAWLQYSWFNNTVKGDG
LTGEKYSSKGITGALEAGYIYPTIRWTAHNNIDNALYLNPQVQITRHGVKANDYIE
HNGTMVTSSGGNNIQAKLGLRTSLISQSCIDKETLRKFEPFLEVNWKWSSKQYGV
IMNGMSNHQIGNRNVIELKTGVGGRLADNLSIWGNVSQQLGNNSYRDTQGILGV
KYTF

Two classes are present in this dataset: OMPs and non-OMPs. The distribution of these two classes is imbalanced, with 27% being “OMP” and 73% being “non-OMP”. The details are shown in Table 3.1.

Table 3.1 Gram-negative Bacterial Protein Dataset

Data       Number of    Percentage of   Minimum   Maximum   Average
           Sequences    Each Class      Length    Length    Length
OMP        427          27.4%           91        3705      571.1
Non-OMP    1132         72.6%           50        1034      256.8
Total      1559                                             342.9

3.2 Classifier Evaluation Methodology

The performance of a classifier is usually measured by classification accuracy, precision and recall. They are defined based on a confusion matrix, as shown in Table 3.2. Because we are primarily interested in identifying OMPs, we refer to the OMPs as "positive" samples and all non-OMPs as "negative" samples here.

Table 3.2 Confusion Matrix in Classification

                          Actual OMP            Actual non-OMP
Classified as OMP         TP (true positive)    FP (false positive)
Classified as non-OMP     FN (false negative)   TN (true negative)

$$ \text{Overall Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (6) $$

$$ \text{Precision for OMP prediction} = \frac{TP}{TP + FP} \qquad (7) $$

$$ \text{Recall for OMP prediction} = \frac{TP}{TP + FN} \qquad (8) $$

In our research, the performance of our predictive methods is not measured by the overall accuracy, because the majority of proteins belong to the non-OMP class and the overall accuracy would be influenced mainly by the non-OMP sequence prediction. Instead, since the identification of OMPs is our primary concern, our goal is high precision in OMP identification. Because of the inherent tradeoff between precision and recall in classification, our goal is to achieve a precision for OMP prediction at or above the 90% level while maintaining the recall of the same class at a reasonable level (at least 50%). Because the classification performance of the non-OMP class is not our main concern, all classification results are evaluated based on the precision and recall of the OMP class.
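A small helper (illustrative only, with hypothetical counts) makes the three measures in equations (6)-(8) explicit:

def evaluate_omp_prediction(tp, fp, fn, tn):
    """Overall accuracy, precision and recall for the OMP ("positive") class,
    following equations (6)-(8)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts: high overall accuracy can coexist with low OMP precision
# when non-OMPs dominate the dataset.
print(evaluate_omp_prediction(tp=50, fp=40, fn=20, tn=900))
# -> roughly (0.94, 0.56, 0.71)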


Chapter 4  Association-Rule Based Prediction

In this chapter, we present two classification algorithms, SAC and RAC, both based on association rules. SAC is short for Single-level Association-based Classification, whereas RAC is short for Refined Association-based Classification. We discuss both algorithms in detail in the remaining sections.

4.1 Algorithm Motivation

First we introduce the motivations behind our approaches. In searching for methods that deal with the specific characteristics of the outer membrane protein prediction problem, we noticed that techniques developed in data mining research seem promising. In particular, techniques that make use of frequent subsequences for classification are useful for several reasons.

First of all, as biologists are interested in discovering sequence patterns that discriminate outer membrane proteins from non-outer membrane proteins, one natural choice is to mine subsequences that occur frequently in outer membrane proteins and rarely in non-outer membrane proteins. Common frequent subsequences that occur in many outer membrane proteins are statistically significant, and it is probable that they represent meaningful residues that are useful for further biological analysis.

It is known that common subsequences among related proteins may perform similar functions via related biochemical mechanisms, and previous attempts at modeling proteins based on their sequence homologies have obtained reasonably good results. For example, the publicly available protein pattern library PROSITE [30] collects patterns that are expressed as regular expressions and uses them for structural and functional classification of new sequences, and it is considered to be reliable enough to be used in a system for automated functional annotation of new sequences [10].

On the other hand, the patterns in PROSITE tend to model a single conserved region, while the alignments of a protein family usually contain a number of conserved regions interspersed with regions of higher variation. It is known that outer membrane proteins have the general structure of alternating "turns", "strands" and "loops" within the β-barrel shape (see Figure 1.3). Different regions have different characteristic sequence residues. For example, the aromatic residues are located near turns, followed by particular residues that prefer to form the "strand" structures. The amino acid sequences at the loops are highly variable and share little sequence identity. Thus a model that involves patterns of more flexibility is preferred. Such a model should be able to capture the local similarities present in the OMPs by including subsequences that encode uncompromising rules for strongly conserved positions, and representing the highly variable regions by wild cards that do not have any restrictions. More specifically, as the β-strands usually contain some common amino acid sequences, these common sequences naturally form the sequential patterns that occur frequently in OMPs. We exploit the notion of frequent subsequences studied in association mining to capture such similarities present in the OMPs.

Definition 4.1: A frequent subsequence is a subsequence made of consecutive amino acids that occurs in more than a certain fraction (defined by the user-specified threshold minimum support, or MinSup) of OMPs.

On the other hand, highly variable loop regions are very unlikely to produce common frequent sequential patterns. Therefore, the general pattern that we use to model OMPs includes alternating frequent subsequences and wild cards. We search for patterns that are combinations of frequent subsequences, where the subsequences are concatenated by wild cards representing arbitrary "gaps" between each conserved region, and we use such combinations of subsequences to distinguish outer membrane proteins from non-outer membrane proteins.

Definition 4.2: A frequent pattern has the form *X*X*…, in which each 'X' is a frequent subsequence made of consecutive amino acids, and each '*' is a VLDC (variable-length-don't-care) which may substitute for one or more letters when matching the pattern against a protein sequence. Subsequences are used to capture the local similarity that may relate to important structures or functions of OMPs, and VLDCs compress the remaining irrelevant portions. In order to be frequent, such a pattern must also occur in more than the MinSup fraction of OMPs.
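To make Definition 4.2 concrete, the sketch below checks whether a pattern of the form *X1*X2*…* matches a protein sequence; it is our own illustrative reading of the definition (each '*' taken to consume at least one residue), not the thesis implementation, and the example sequences are invented.

def matches(pattern_subseqs, sequence):
    """Check whether a pattern *X1*X2*...*Xk* matches `sequence`.
    Reading of Definition 4.2 assumed here: each '*' stands for one or more
    residues, so the subsequences must occur in order with a non-empty gap
    between them (and before/after the outermost ones)."""
    pos = 1  # leading '*' consumes at least one residue
    for sub in pattern_subseqs:
        idx = sequence.find(sub, pos)
        if idx == -1:
            return False
        pos = idx + len(sub) + 1  # '+1' enforces a non-empty gap after the match
    return pos <= len(sequence)   # trailing '*' needs at least one residue left

print(matches(["MKK", "KK"], "AMKKFKKC"))   # True: 'MKK' then 'KK', gaps in between
print(matches(["KK", "MKK"], "AMKKFKKC"))   # False: 'MKK' does not occur after 'KK'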

4.2 SAC Algorithm

Since we are interested in identifying OMPs, only rules mined from the OMP sequences are used. Thus we divide the original training dataset into two subsets, with each subset containing only sequences of one class. Rules are then mined only from the OMP class subset, with the following form:

P → OMP

where P is a frequent pattern as defined in the previous section.

As the protein sequences have an alphabet of only 20 amino acids, it is likely that very short subsequences will occur in sequences of both classes and are non-discriminative with regard to classification. To remove such trivial similarities, we restrict frequent subsequences to those of some minimum length, specified by a MinLgh threshold.


The pseudocode of our SAC algorithm is given in Figure 4.1. The input of the algorithm includes the original dataset D and the user-specified thresholds MinSup, MinLgh, MinConf and MaxSup. Details of its main subroutines are described in the following subsections.

Procedure SAC(D, MinSup, MinLgh, MinConf, MaxSup)
    D_OMP = SplitData(D);
    S = MineSubsequence(D_OMP, MinSup, MinLgh, MaxSup);
    P = MinePattern(D_OMP, S, MinSup, MinConf);
    C = BuildClassifier(P);
    Return C;    // return classifier C

Procedure SplitData(D)
    D_OMP = φ;
    For each sequence s in D
        If s is an OMP sequence, then add it to D_OMP;
    Return D_OMP;

Figure 4.1 Algorithm SAC

4.2.1 Finding Frequent Subsequences

In order to get frequent subsequences from OMP sequences, we made use of an efficient implementation [33] of the generalized suffix tree (GST). A suffix tree is a compact string representation. It is a trie-like data structure that represents every suffix of a string by a path from the root to a leaf. A GST is an extension of the suffix tree, designed to represent a set of strings. Each leaf in a GST is associated with an index i. The edges are labeled with strings such that the concatenation of the edge labels on the path from the root node to the leaf with index i is a suffix of the i-th string in the set. Suffix trees have been used extensively in string matching and have been shown to be an effective data structure for finding common subsequences in linear time [15, 18]. Since each protein sequence is essentially a string of letters, generalized suffix trees can easily be applied to mine frequent subsequences among protein sequences.

The algorithm for constructing the GST works as follows. A unique symbol is appended to the end of each sequence and all sequences are concatenated into a single one. The suffixes of the sequences are then inserted into a trie. When a node has only one child, the child is collapsed with its parent and the edge going down from the parent is labeled with a substring instead of a single character. In this way, we construct a GST for all OMP sequences.


Figure 4.2 The GST for three sequences: MNQIHK, MKKFK and MKKC.

Example 4.1: Figure 4.2 shows a GST that is constructed for three strings: (1) MNQIHK; (2) MKKFK and (3) MKKC. Each leaf node is represented by a rectangle, labelled with the string index. Each non-leaf node is represented by a circle, labelled with the number of different indexes associated with the leaves in its subtree. Every leaf corresponds to a suffix that is the concatenation of the edge labels on the root-to-leaf path, as shown below the leaf. Note that the suffix 'K' appears in both string (1) and string (2), hence appears twice in the leaves.

Next, we traverse the GST to find all frequent subsequences that satisfy the minimum support (specified by the user-defined threshold MinSup) and the minimum length (specified by another user-defined threshold MinLgh).

Example 4.2: Given the GST shown in Figure 4.2, suppose MinLgh = 2 and MinSup = 2. We traverse the GST, check the count label of each node, and find two frequent subsequences: 'KK' and 'MKK'.
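To make this traversal concrete, the following Python sketch builds a naive, uncompressed generalized suffix trie and reads off the frequent subsequences of Example 4.2. It is only an illustration of the idea, not the linear-time GST implementation [33] used in this work; it omits the unique terminator symbols, and all function and variable names are our own.

class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.indexes = set()  # indexes of sequences containing this substring

def build_suffix_trie(sequences):
    # Insert every suffix of every sequence; each visited node remembers which
    # sequences pass through it, so its support is len(node.indexes).
    root = TrieNode()
    for i, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node.children.setdefault(ch, TrieNode())
                node.indexes.add(i)
    return root

def frequent_subsequences(root, min_sup_count, min_lgh):
    # Report substrings supported by at least min_sup_count sequences and of
    # length at least min_lgh.  A substring whose only extension has identical
    # support is skipped, mirroring the fact that it ends mid-edge in the
    # compressed GST rather than at a node.
    found = []
    def dfs(node, prefix):
        if len(node.indexes) < min_sup_count:
            return  # no extension of an infrequent substring can be frequent
        ends_at_node = not (len(node.children) == 1 and
                            len(next(iter(node.children.values())).indexes) == len(node.indexes))
        if len(prefix) >= min_lgh and ends_at_node:
            found.append((prefix, len(node.indexes)))
        for ch, child in node.children.items():
            dfs(child, prefix + ch)
    dfs(root, "")
    return found

# Example 4.2 with MinLgh = 2 and an absolute support count of 2:
root = build_suffix_trie(["MNQIHK", "MKKFK", "MKKC"])
print(frequent_subsequences(root, 2, 2))   # finds 'MKK' and 'KK', each with support 2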

4.2.2 Finding Frequent Patterns

Starting from frequent subsequences, we build frequent patterns by counting the support of each candidate pattern constructed by concatenating two or more frequent subsequences with VLDCs. The construction of the candidate patterns is an enumeration of all possible combinations of the frequent subsequences.

Definition 4.3: In mining frequent patterns, we define the length of a pattern as the number of frequent subsequences it contains. For example, 'MKK*KK' is a pattern of length 2.

An enumeration tree is used to represent all frequent subsequences, and the candidate patterns are enumerated in depth-first order. For each frequent subsequence X1, starting from length 2, if a length-k pattern X1*X2*…*Xk that contains X1 has support greater than or equal to the MinSup threshold, the support of a length-(k+1) pattern X1*X2*…*Xk*Xk+1 with one additional subsequence will be checked. Otherwise we stop expanding on this path.

Example 4.3: Figure 4.3 illustrates the situation where candidate patterns are enumerated based on two frequent subsequences 'KK' and 'MKK'. Each child node of the root represents one frequent subsequence. The candidate patterns are checked in depth-first order, i.e., in the order of nodes 1→2→3→4→5→6→7→8. For example, suppose the support of the length-2 pattern 'KK*KK' is greater than MinSup; the length-3 pattern 'KK*KK*KK' is then checked. When it is found that this length-3 pattern does not satisfy the MinSup threshold, another length-3 pattern 'KK*KK*MKK' is checked. If it is also not frequent, the next length-2 pattern 'KK*MKK' is checked, and so on. Note that it is possible to concatenate a frequent subsequence with itself. (A small code sketch of this enumeration is given after Figure 4.3.)

Figure 4.3 Pattern enumeration
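The following Python sketch, with our own helper names (pattern_support, enumerate_patterns), illustrates the depth-first enumeration of Figure 4.3. It assumes the VLDC '*' may match a gap of any length, including zero, and it adds a small max_len guard that is not part of the thesis algorithm, purely to bound the recursion in this toy setting.

import re

def pattern_support(parts, sequences):
    # Number of sequences containing all parts in order, with arbitrary
    # (possibly empty) gaps standing in for the VLDC '*'.
    regex = re.compile(".*".join(re.escape(p) for p in parts))
    return sum(1 for s in sequences if regex.search(s))

def enumerate_patterns(freq_subseqs, sequences, min_sup_count, max_len=4):
    # Depth-first enumeration: a pattern is extended with one more frequent
    # subsequence only if the current pattern already meets MinSup; otherwise
    # the whole branch is abandoned, as in Example 4.3.
    frequent = []
    def grow(parts):
        for sub in freq_subseqs:             # a subsequence may repeat
            cand = parts + [sub]
            if pattern_support(cand, sequences) >= min_sup_count:
                frequent.append("*".join(cand))
                if len(cand) < max_len:
                    grow(cand)
    for sub in freq_subseqs:
        grow([sub])
    return frequent

For example, enumerate_patterns(['KK', 'MKK'], omp_seqs, min_sup_count) explores candidates in the node order of Figure 4.3.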

As we mine such patterns only from the OMP class, each frequent pattern P is also an association rule P → OMP. Our goal is to predict OMP sequences with high precision, i.e., whenever we classify a sequence as OMP, we should have high confidence in that prediction. Therefore, the rules used to predict OMP sequences should be highly confident. Hence we set the minimum confidence threshold, MinConf, at a fairly high level (greater than 85%). The confidence of a rule 'P → OMP' is calculated from the support counts of the pattern P in OMP sequences and in non-OMP sequences, and must be greater than or equal to the MinConf value, i.e.,

Confidence(P → OMP) = S_omp(P) / S_all(P) ≥ MinConf        (9)

where S_omp(P) is P's support count in OMP sequences, i.e., the number of OMP sequences that contain P, and S_all(P) is P's support count in all sequences, i.e., the total number of sequences that contain the pattern P. Note that if P occurs more than once in the same sequence, it is counted only once. Any frequent pattern must also satisfy this constraint to be output.
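As a small illustration of Equation (9), the following function (our own naming, reusing pattern_support from the earlier sketch) computes the confidence of a candidate rule:

def rule_confidence(parts, omp_seqs, non_omp_seqs):
    # Confidence(P -> OMP) = S_omp(P) / S_all(P); a sequence is counted at
    # most once even if the pattern occurs in it several times.
    s_omp = pattern_support(parts, omp_seqs)
    s_all = s_omp + pattern_support(parts, non_omp_seqs)
    return s_omp / s_all if s_all else 0.0

# A pattern is kept as a rule only if rule_confidence(...) >= MinConf (e.g., 0.85).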

4.2.3 Building the Classifier

To build the final classifier from the frequent pattern rules, we made use of the algorithm described in [35]. This algorithm has two major advantages that make it applicable to our classification problem. Firstly, the rule ranking criterion introduced in [35], the MCF (most-confident-first) principle, is appropriate because we also prefer rules with high confidence in our application: future cases that match such rules are more likely to be actual OMPs. In other words, the most confident rule has the most predictive power in terms of precision, so it should have the highest priority. Secondly, the state-of-the-art rule pruning technique used in [35] is efficient and has been shown to effectively remove overfitting rules and improve classification performance.

Before introducing how the algorithm works, we define the following general-to-specific relationship between two rules in our context.

Definition 4.4: Rule r1 is more general than rule r2 if the pattern on the left hand side of rule r1 is a subsequence of the pattern on the left hand side of rule r2. This also means that rule r2 is more specific than rule r1. Note that a subsequence does not have to be strictly consecutive. For example, if we have two rules where rule 1 is 'BC*C → OMP' and rule 2 is 'ABC*B*CD → OMP', since 'BC*C' is a subsequence of 'ABC*B*CD', rule 1 is more general than rule 2. However, such a relationship does not always exist between two rules. For example, the rule 'BD → OMP' is neither more general nor more specific than the rule 'BC*C → OMP'.

Now we rank all rules using the MCF principle [35], i.e., rules are ranked in the following order:

1. Rank rules according to their confidence values;
2. If two rules have the same confidence values, rank them according to their support level;
3. If two rules have the same support level, rank the more general rule before the more specific rule, if applicable;
4. Finally, if none of the above applies, rank rules according to the lexicographical order.

When all rules are ranked in this order, if two rules r and r' have the same confidence and rule r is more specific than r', then rule r is redundant. This is because r cannot have higher support than r' and so will always be ranked after r'; any protein sequence that matches rule r must also match r' and will therefore always be classified using rule r'. We therefore discard any rule that does not rank higher than all of its more general rules. This substantially cuts down the number of rules, because many specializations of a rule do not improve the confidence over its more general form.
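A minimal sketch of the MCF ranking and the redundancy filter follows, assuming each rule is represented as a dictionary with 'pattern', 'conf' and 'sup' entries. Using pattern length as a stand-in for the generality tie-break, and applying the subsequence test of Definition 4.4 to the whole pattern string (VLDCs included), are our own simplifications.

def is_subsequence(a, b):
    # True if a is a (not necessarily consecutive) subsequence of b.
    it = iter(b)
    return all(ch in it for ch in a)

def mcf_rank(rules):
    # Confidence first, then support, then the more general (here: shorter)
    # pattern, then lexicographic order.
    return sorted(rules, key=lambda r: (-r["conf"], -r["sup"], len(r["pattern"]), r["pattern"]))

def drop_redundant(ranked):
    # A rule is redundant if an equally confident, more general rule is already
    # ranked ahead of it: the general rule would always fire first.
    kept = []
    for r in ranked:
        if not any(k["conf"] == r["conf"] and is_subsequence(k["pattern"], r["pattern"])
                   for k in kept):
            kept.append(r)
    return kept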

Next, in order to prune overfitting rules, a tree structure called the ADT (association-based decision tree) [35] is constructed, where each node represents one rule. The parent of a rule node is the node representing a rule that is more general and has the highest possible rank. As the majority of our dataset are non-OMP sequences, a default rule 'φ → non-OMP' is added and used as the root node of the entire tree, as it is the most general rule in our ruleset.

Error-based pruning based on pessimistic error estimation is used to prune rules in the ADT, in the same way as in the well-known decision tree algorithm C4.5 [24]. The general principle is that if the classifier without certain rules has the same or lower estimated error than the classifier with those rules, the rules are pruned. Pruning is done in a bottom-up fashion. At each non-leaf node n, if the estimated error of n's subtree is higher than when the entire subtree is replaced by node n alone, then the subtree is pruned and n becomes a leaf node that covers all cases previously covered by its subtree nodes. Such pruning is then repeated at higher levels until the root node is reached.

Example 4.4: Consider a node with 3 leaf nodes in its subtree, as shown in Figure 4.4.

Each node is associated with a pair of numbers (N, E), where N is the total number of training cases covered by this rule node and E is the number of those that are incorrectly covered. According to the way the ADT is constructed, rule node 1 is always more general than all of its descendant nodes, i.e., it can cover all cases that are covered by its descendants.

[Figure 4.4: rule node 1 = (16, 1), with child nodes 2 = (6, 0), 3 = (9, 1) and 4 = (1, 0).]

Figure 4.4 Rule pruning based on pessimistic error estimation

With the confidence level of 25%, the estimated error for the subtree (nodes 2, 3 and 4) is calculated as:

6×U0.25(6, 0) + 9×U0.25(9, 1) + 1×U0.25(1, 0) = 6×0.206 + 9×0.269 + 1×0.75 = 4.407

Meanwhile, the estimated error when using only node 1 is:

16×U0.25(16, 1) = 16×0.157 = 2.512

Thus nodes 2, 3 and 4 are pruned and node 1 is made a leaf node, since this pruning reduces the estimated error.
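Following Example 4.4, the sketch below computes the pessimistic error estimate and the pruning decision. The upper bound U0.25(N, E) is obtained here by bisecting the exact binomial tail rather than with C4.5's closed-form approximation, so its values may differ slightly from the ones above in the third decimal; the function names are ours.

from math import comb

def ucf(n, e, cf=0.25):
    # Upper-bound error rate: the p for which P(X <= e) = cf with X ~ Binomial(n, p),
    # found by bisection (the CDF is decreasing in p).
    def cdf(p):
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e + 1))
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if cdf(mid) > cf:
            lo = mid
        else:
            hi = mid
    return lo

def should_prune(node, leaves):
    # Prune the subtree if replacing it by the single (more general) rule node
    # does not increase the estimated number of errors.
    n, e = node
    return n * ucf(n, e) <= sum(ni * ucf(ni, ei) for ni, ei in leaves)

print(should_prune((16, 1), [(6, 0), (9, 1), (1, 0)]))   # True: the subtree is pruned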

The remaining rules are used as classification rules. Any unseen case will be matched against these rules in the MCF order. As all rules are mined from OMP sequences and only predict the OMP class, the default rule 'φ → non-OMP' is added to the end of the ranked list of rules, in order to cover cases that are not covered by any of the preceding rules. Only when all preceding rules have failed is a protein sequence predicted as non-OMP by the default rule.
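Classification of an unseen sequence then amounts to scanning the ranked rule list; a minimal sketch, assuming each rule stores its pattern as a '*'-separated string (names are ours):

import re

def classify(sequence, ranked_rules):
    # The first matching rule wins (MCF order); the default rule
    # 'phi -> non-OMP' fires only when no OMP rule matches.
    for r in ranked_rules:
        parts = r["pattern"].split("*")
        if re.search(".*".join(re.escape(p) for p in parts), sequence):
            return "OMP"
    return "non-OMP"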

4.2.4 Optimization Techniques

We notice that when constructing candidate patterns from frequent subsequences, the number of candidate patterns grows explosively with the number of subsequences, since the set of candidate patterns includes all permutations (allowing repetition) of the frequent subsequences. In order to speed up the process of frequent pattern mining, two pruning methods are used to reduce the number of subsequence combinations to be examined while maintaining classification performance. This is based on the observation that the patterns eventually serve as the input of the classifier, and we are interested in finding those patterns that provide high confidence and produce highly reliable predictions. In other words, the patterns we mine need not be the complete set of frequent patterns that appear in the training set, since there is no need to find them all.

4.2.4.1 Maximum Support Pruning

Since our goal is to predict OMP sequences with high precision, the patterns used for classification should be highly diagnostic, i.e., discriminate OMP from non-OMP sequences as cleanly as possible. This means that such a pattern should not only appear frequently in OMP sequences but also very infrequently in non-OMP sequences. Hence we introduce another constraint called maximum support, or MaxSup, which is the maximum allowed support level of a frequent subsequence in the non-OMP training sequences. If a subsequence does not satisfy this constraint, it is pruned immediately, i.e., it will not be used to construct any further candidate patterns. In this way, the number of frequent subsequences is effectively reduced, resulting in far fewer candidate patterns to be examined.

Such pruning is called maximum support pruning. It substantially reduces the number of candidate patterns by concentrating only on the patterns that are most likely to be highly confident. To this end, the MaxSup threshold serves a similar purpose to the MinConf threshold on patterns, as both aim to extract patterns that are highly confident. On the other hand, it is possible that even if the support of a subsequence in the non-OMP class exceeds the MaxSup threshold, it could still be highly confident, as shown in the following example.

Example 4.5: Suppose a dataset contains 40 OMPs and 100 non-OMPs. Let MinConf be set to 85%, and let the MinSup in the OMP class and the MaxSup in the non-OMP class both be set to 5%. If a subsequence s occurs in 40 OMP sequences and in 6 non-OMP sequences, then s does not satisfy the MaxSup constraint (5% of 100 non-OMPs is 5). However, the confidence of s is 40/46 = 87%, i.e., s still forms a highly confident rule.

In other words, maximum support pruning is an optimization heuristic intended to improve the efficiency of the algorithm. Our later experiments show that it substantially reduced the running time of our rule-mining process while having almost no effect on classification performance.

4.2.4.2 Confidence Pruning

Additionally, as we follow the MCF principle to construct the final set of classification rules, if a frequent subsequence itself has a confidence of 100%, i.e., it appears only in OMP sequences, any pattern built upon this subsequence will have the same confidence (100%) and is thus redundant. Therefore no further patterns need to be examined on top of such a subsequence; instead, the subsequence is output immediately as a classification rule in the next stage. This optimization technique is called confidence pruning.

Consequently, the optimized procedure for mining frequent subsequences is shown in Figure 4.5.

Procedure MineSubsequence(DOMP, MinSup, MinLgh, MaxSup)
    S = φ;
    BuildGST(DOMP);
    For each frequent subsequence fs found in GST
        if length(fs) < MinLgh continue;
        if confidence(fs) = 100% OutputAsRule(fs); continue;    // confidence pruning
        if support-in-non-OMP(fs) > MaxSup continue;            // maximum support pruning
        S = S ∪ {fs};
    Return S;

Figure 4.5 Optimized procedure for mining frequent subsequences
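A Python rendering of the same procedure follows, reusing build_suffix_trie and frequent_subsequences from the earlier sketch and assuming that MinSup and MaxSup are given as fractions of the respective class sizes (the names and this convention are ours).

def mine_subsequences(omp_seqs, non_omp_seqs, min_sup, min_lgh, max_sup):
    min_sup_count = min_sup * len(omp_seqs)
    max_sup_count = max_sup * len(non_omp_seqs)
    rules, seeds = [], []
    root = build_suffix_trie(omp_seqs)
    for fs, omp_count in frequent_subsequences(root, min_sup_count, min_lgh):
        non_omp_count = sum(1 for s in non_omp_seqs if fs in s)
        if non_omp_count == 0:
            rules.append(fs)          # confidence pruning: 100% confident, output directly
        elif non_omp_count <= max_sup_count:
            seeds.append(fs)          # survives maximum support pruning
        # else: pruned, never used to build candidate patterns
    return rules, seeds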
