Combining Bayesian Networks and Decision Trees to Predict Drosophila melanogaster Protein-Protein Interactions
Jingkai Yu Wayne State University
[email protected]
Farshad Fotouhi Wayne State University
[email protected]
Abstract Protein-protein interactions are important in many aspects of cellular processes. Discovery of protein interactions that take place within a cell can provide a starting point for understanding biological regulatory pathways. High-throughput experimental screens developed so far show high error rates in terms of false positives and false negatives. There is thus a great need for new computational approaches to enable the prediction of new protein-protein interactions and to enhance the reliability of experimentally derived interaction maps. Many of the computational approaches developed thus far are based on strong biological assumptions, resulting in biases towards certain types of predictions. As a first step towards a more complete and accurate interaction map, we propose to predict protein-protein interactions using existing experimental data combined with the Gene Ontology (GO) annotations of proteins. We do not use strong prior rules about GO patterns and protein-protein interactions and thus avoid biases associated with various assumptions. We show that GO annotations can be a useful predictor for protein-protein interactions and that prediction performance can be improved by combining the results from both decision trees and Bayesian networks.
1. Introduction Protein interactions play important roles in many biological processes [1]. The discovery of new protein interactions can help to elucidate functions of uncharacterized proteins by the so-called ‘guilt by association’ approach [2], which suggests that novel proteins may take part in the same processes as their interacting partners. A complete and reliable protein-protein interaction map representing the specific binary interactions within a cell would provide a significant platform for understanding many biologically relevant processes. Several high throughput experimental methods have been developed in efforts to map the interactions among
Proceedings of the 21st International Conference on Data Engineering (ICDE ’05) 1084-4627/05 $20.00 © 2005
IEEE
Russell L. Finley, Jr. Wayne State University School of Medicine
[email protected]
all of the proteins encoded by a genome (the proteome). While the data from these studies have been useful to biologists, they also have several shortcomings. In particular, the results from high throughput interaction mappings have low accuracy. Estimated error rates of high throughput interaction results range from 41% to 90% [3-7]. Experimental interaction detection is also labor intensive and costly, in part because the number of possible protein-protein interactions, or the search space, is very large. For the ~25,000 proteins encoded by the human genome, for example, there are about 312 million possible protein pairs. Many researchers have looked to computational methods to evaluate the accuracy of experimentally derived protein interaction maps, and to create more complete maps by predicting interactions missed by the high throughput experiments. All of the methods devised thus far have limitations, and new approaches are still needed to tackle the problem of protein-protein interactions from different perspectives. Gene Ontology (GO) is an effort to annotate proteins/genes using a controlled, structured vocabulary. We propose to predict protein-protein interactions by using GO annotations with machine learning algorithms. Our approach does not use strong prior assumptions and thus avoids the biases that such assumptions introduce. Many existing computational studies have targeted the model organism Saccharomyces cerevisiae (yeast), due to its simplicity and the availability of extensive experimental data. We chose to apply our methods to another model organism, Drosophila melanogaster (fruit fly), because its biochemistry is remarkably similar to that of humans and it provides a model for studying many human diseases caused by defective genes. The remainder of this paper is organized as follows. Section 2 summarizes major work in detecting and predicting protein-protein interactions.
Our method is explained in section 3; results and discussions are presented in section 4. Section 5 contains our conclusions.
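As a quick check on the search-space figure quoted in the introduction, the number of unordered pairs among n proteins is n(n-1)/2. The sketch below assumes exactly 25,000 proteins and excludes self-pairs; both are simplifying assumptions, and the function name is illustrative only.

```python
from math import comb


def possible_pairs(n_proteins: int) -> int:
    """Number of unordered pairs of distinct proteins (self-pairs excluded)."""
    return comb(n_proteins, 2)


# Assumed proteome size of 25,000, as in the approximate figure in the text.
n_pairs = possible_pairs(25_000)
print(n_pairs)  # 312487500, i.e. about 312 million
```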
2. Related Work Yeast two-hybrid (Y2H) and protein co-purification are two high-throughput methods used extensively to detect protein-protein interactions by experimental screening. Y2H [8] detects binary protein-protein interactions. The assay is conducted in a yeast cell where two hybrid proteins are experimentally synthesized. One hybrid includes the protein being tested fused to a DNA binding domain (BD), and the other hybrid consists of the second protein to be tested fused with a transcription activation factor/domain (AD). If the BD-fused protein (also called the bait) interacts with the AD-fused protein (prey), it brings the AD close enough to a reporter gene to activate its transcription, which can be detected by a color change in the growth medium. High throughput Y2H screens have been used to map interactions among yeast proteins [5, 9], C. elegans proteins [10], and Drosophila proteins [11]. Protein co-purification detects members of a possible protein complex, in which all proteins may not necessarily interact with each other. A single protein (bait) is tagged and used to extract other members of a protein complex. The proteins extracted are then separated and identified by mass spectrometric analysis. High throughput protein complex determination has been used most extensively to detect complexes among yeast proteins [3, 12, 13]. Various computational methods have also been designed to predict protein-protein interactions, which contain two types of relationships between proteins. One is physical protein-protein association and the other is functional links between proteins. Section 2.1 explains major algorithms in detecting functional links and section 2.2 presents major work in predicting physical protein-protein interactions.
2.1. Functional links Proteins associated by functional links may or may not interact with each other physically. Functional links are usually detected by comparative genomics methods, which make various biological assumptions. Major methods [14] of this type include the following. The gene neighbor [15] method infers a functional link between two proteins if they are neighbors on one chromosome in organism X and their orthologs in organism Y are also neighbors to each other. The assumption is that protein-protein interactions impose evolutionary constraints to keep genes together. The gene fusion [16, 17] approach predicts a functional link between two proteins when they are separated in one organism and fused into one protein encoded by one gene in another organism.
The phylogenetic profile approach [18] evaluates the presence or absence of a group of gene homologs across a set of completely sequenced genomes. Functional links are assumed between genes having similar phylogenetic profiles, again because functionally linked proteins should be co-conserved through evolution. For a detailed explanation of the above three algorithms, refer to [19, 20]. They have been implemented in several databases [21-23].
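The phylogenetic profile idea can be illustrated with a minimal sketch. It assumes each gene is represented by a binary presence/absence vector over a fixed list of genomes and that "similar" means a Hamming distance no greater than a chosen bound; the gene names, profiles, and the distance bound are hypothetical and not taken from [18].

```python
def hamming(profile_a, profile_b):
    """Count the genomes in which exactly one of the two genes is present."""
    return sum(x != y for x, y in zip(profile_a, profile_b))


def linked_pairs(profiles, max_distance=0):
    """Return gene pairs whose profiles differ in at most max_distance genomes."""
    genes = sorted(profiles)
    links = []
    for i, gene_1 in enumerate(genes):
        for gene_2 in genes[i + 1:]:
            if hamming(profiles[gene_1], profiles[gene_2]) <= max_distance:
                links.append((gene_1, gene_2))
    return links


# Hypothetical profiles over five sequenced genomes (1 = homolog present).
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 1, 1),
}
links = linked_pairs(profiles)
print(links)  # [('geneA', 'geneB')]
```

Published methods typically use more careful similarity measures and many more genomes; the exact-match bound here only keeps the example small.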
2.2. Physical protein-protein interactions Computational methods have also been applied to predict physical protein-protein interactions. They use different data sources and make different assumptions, according to which they can be roughly grouped into the following categories. Mapping between interactions of two different organisms [24]. The main idea is to use the interaction map of one organism as a template to predict interactions in another. Correspondence between proteins in the two organisms is established by comparing their sequences to reveal orthologs. When protein X and Y interact in organism A, their orthologs in organism B are predicted to interact [25, 26]. A major disadvantage of this approach lies in the fact that there are no highly reliable and complete or near complete protein interaction maps for any organism. The closest to a complete map would be for yeast, but this organism is evolutionarily distant from humans, thus limiting the number of orthologs that can be identified. A related approach is based on domain-domain interactions. In this type of method, protein-protein interactions are predicted based on the amino acid sequence of domains, which are sections of a protein that can be very similar from one protein to another [27-30]. A set of domain-domain interactions is learned from a training set of protein-protein interactions. A pair of proteins is predicted to interact when they contain a pair of domains found in interacting proteins in the training set. Many multiple sequence alignments are required to get a list of protein domains on a proteomic scale. Moreover, the training set of protein interactions may not be sufficiently complete to cover all of the possible domain-domain pairs that are responsible for protein-protein interactions, which can result in false negatives when used to predict new protein interactions. Structure-based methods use the physicochemical properties of amino acids to predict interactions. 
In one approach, for example, a feature vector was derived from the physicochemical properties of interacting amino acid residues, including charge, hydrophobicity,
and surface tension [31]. These vectors were then input into a support vector machine (SVM), which learned to differentiate interacting from non-interacting pairs. This approach had an accuracy of 80% for test sets of nearly equal numbers of positive and negative interacting pairs. The above methods make various assumptions, which bring different biases to their results. For example, comparative genomics based approaches are more likely to find evolutionarily conserved interactions. Mapping between different organisms finds interactions between proteins that have orthologs, while missing interactions between proteins that do not have significant orthologs in a source organism. Using domain-domain interactions to predict protein-protein interactions misses protein interactions between domain pairs that are not covered by the training interaction map. In our approach, we do not use strong prior rules about GO patterns and protein-protein interactions and thus avoid strong biases associated with various assumptions. We examined the predictive power of GO annotations in detecting protein-protein interactions. Our results show that prediction performance can be improved by combining the results from both decision trees and Bayesian networks.
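To make the domain-based category of Section 2.2 concrete, the following sketch learns a set of domain pairs from training interactions and predicts an interaction whenever two proteins share such a pair. It is a deliberately naive reading of the idea, not the method of [27-30]; protein names, domain names, and function names are hypothetical.

```python
from itertools import product


def learn_domain_pairs(interactions, domains):
    """Collect every domain pair seen across interacting proteins in training."""
    pairs = set()
    for protein_p, protein_q in interactions:
        for d1, d2 in product(domains[protein_p], domains[protein_q]):
            pairs.add(frozenset((d1, d2)))
    return pairs


def predict_interaction(protein_p, protein_q, domains, known_pairs):
    """Predict an interaction if any domain pair was seen in training."""
    return any(
        frozenset((d1, d2)) in known_pairs
        for d1, d2 in product(domains[protein_p], domains[protein_q])
    )


# Hypothetical domain assignments and a single training interaction.
domains = {
    "P1": {"D1"},
    "P2": {"D2"},
    "P3": {"D1", "D3"},
    "P4": {"D2"},
    "P5": {"D4"},
}
known_pairs = learn_domain_pairs([("P1", "P2")], domains)

hit = predict_interaction("P3", "P4", domains, known_pairs)   # shares (D1, D2)
miss = predict_interaction("P3", "P5", domains, known_pairs)  # pair never seen
```

The second call shows the bias discussed above: a pair whose domains are absent from the training map is reported as non-interacting regardless of the truth.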
3. Methods We made use of Gene Ontology (GO) annotations of Drosophila genes to predict protein interactions. GO contains three “structured, controlled vocabularies (ontologies)” [32]: biological process, cellular component, and molecular function, whose structures are directed acyclic graphs (DAGs). FlyBase is a central database of genetic and molecular data for Drosophila [33]; it includes GO annotations of genes. Genes and their GO annotations were downloaded from FlyBase and GO respectively on 2/11/04. Both positive (interaction) and negative (non-interaction) data are necessary for the purpose of training. We used two Y2H data sets [11, 34] with a total of ~22,000 interactions among ~7,500 proteins. Of these interactions, 2,726 are between proteins that are both annotated with at least one GO biological process term. These were used as our positive interaction set. Negative data are not readily available on a large scale since negative experimental results are not conclusive. We thus need to synthesize a negative set in some way. Two approaches have been reported in previous work. One was to assume that proteins more than 5 or 6 links away in an experimentally derived interaction map would not interact with each other [35]. The other was to assume that proteins in disparate subcellular components are not likely to interact [36, 37]. We think
the latter method makes more biological sense and adopt it in this work. Protein localization data were obtained from the Swiss-Prot protein knowledge database [38]. Proteins annotated with more than two different localizations were discarded for the purpose of synthesizing negative data. We obtained 29,616 non-interactions by pairing proteins that are localized in different cellular components, i.e., nucleus versus mitochondria, mitochondria versus endoplasmic reticulum, extracellular versus cytoplasm, mitochondria, nucleus, or endoplasmic reticulum. FlyBase annotates genes with the deepest (most detailed) GO terms possible. A lower level GO term automatically implies its parent terms because of the DAG structure. In order to make this implication explicit, we added all parent nodes of a protein’s original GO annotations to its list of final annotations. GO terms were sorted according to their identification number. Each gene was assigned a binary annotation vector in which 1 denotes being annotated with a specific GO term and 0 not. An annotation vector for the pair was obtained by summing the two vectors for the two genes. For instance, with attributes W, X, Y, Z, suppose genes A and B have the following annotation vectors:
        W   X   Y   Z
A:     [1,  0,  1,  0]
B:     [1,  1,  0,  0]
The annotation vector for the pair will be [2, 1, 1, 0]. In total, 2,086 GO biological process terms were annotated to Drosophila genes at FlyBase. However, most terms were annotated to very few genes. To reduce data sparsity, we chose only terms annotated to at least 10 genes, resulting in 653 terms (set C). Of these, 369 (set A) were annotated to fewer than 100 of the 2,726 positive interactions and 284 (set B) to more than 100 positive interactions. Decision trees and Bayesian networks were trained with experimental interaction data and synthesized negative data. We used C4.5 [39] decision trees and Naïve Bayesian networks [40] in our study. We based our implementation on the machine learning software Weka developed at the University of Waikato in New Zealand [41].
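The preprocessing described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: the parent relations among W, X, Y, Z are hypothetical and chosen only so that the vectors match the example above, each protein is simplified to a single localization, and all function names are invented for the sketch.

```python
from itertools import combinations


def with_ancestors(terms, parents):
    """Return the given GO terms plus all of their ancestors in the DAG."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def binary_vector(terms, vocabulary):
    """1 where the gene carries the term (directly or implied), else 0."""
    return [1 if term in terms else 0 for term in vocabulary]


def pair_vector(vec_a, vec_b):
    """Element-wise sum of two gene vectors; each entry is 0, 1, or 2."""
    return [a + b for a, b in zip(vec_a, vec_b)]


def synthesize_negatives(localization, disparate):
    """Pair proteins whose compartments form one of the disparate pairs."""
    negatives = []
    for (p, loc_p), (q, loc_q) in combinations(sorted(localization.items()), 2):
        if frozenset((loc_p, loc_q)) in disparate:
            negatives.append((p, q))
    return negatives


# Hypothetical DAG: X and Y are both children of W; Z is unrelated.
vocabulary = ["W", "X", "Y", "Z"]
parents = {"X": ["W"], "Y": ["W"]}
vec_a = binary_vector(with_ancestors({"Y"}, parents), vocabulary)  # gene A
vec_b = binary_vector(with_ancestors({"X"}, parents), vocabulary)  # gene B
pair = pair_vector(vec_a, vec_b)

# Hypothetical localizations; only one disparate pair is listed here.
disparate = {frozenset(("nucleus", "mitochondria"))}
localization = {"p1": "nucleus", "p2": "mitochondria", "p3": "nucleus"}
negatives = synthesize_negatives(localization, disparate)
```

A pair entry of 2 means both genes carry the term, 1 means exactly one does, and 0 means neither; the sum therefore keeps the vector symmetric in the two genes.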
4. Results and discussion For a pair of proteins, the true state (interaction or non-interaction) and the state predicted by a given algorithm can occur in four combinations. TP (True Positive) denotes the number of protein pairs that are predicted as interactions and are indeed true interactions. FN (False Negative) is the number of protein pairs that are reported as not interacting but are indeed true interactions. FP (False Positive) is the number of predicted interactions that are in fact not real interactions. TN (True Negative) is the number of protein pairs that are correctly predicted to not interact. Two performance criteria, precision and recall, are defined as follows.
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
Precision measures the percentage of true interactions among all of the predicted interactions; it describes how accurate the prediction is and is sometimes called positive predictive value (PPV). Recall measures what percentage of the true interactions is predicted correctly; it describes how well the method covers the true interaction map. High values for both performance criteria are desirable. However, there is generally an inverse relationship between these two; that is, when precision gets higher, recall tends to go down, and vice versa. In the task of predicting protein-protein interactions, we put more emphasis on precision since low reliability is one of the main weaknesses of the experimental methods; at the same time, we aim to obtain a reasonable value of recall. Because a reference interaction map is not available for Drosophila melanogaster, cross-validation was used to evaluate the algorithms. Available data were split equally into five sets. Four sets were used to train the algorithms and the remaining set was used to test them. This process was iterated five times and performance was averaged over the five runs to give the final result for a specific combination of training and testing sets. Our negatives were synthesized by disparate subcellular localizations. To reduce possible biases of individual negative sets, we ran each experiment using five different sets of synthesized negatives. Their results were then averaged to give the final result. We first tested which annotation terms have more predictive power. C4.5 [39] decision trees were used to run the following experiments. We used different sets of GO biological process (BP) terms, different numbers of negatives, and different amounts of training data. We found that when GO terms are annotated to more protein pairs they have a better overall prediction performance. Similar results were also observed in Naïve Bayesian network learning [40].
In our testing, decision tree learning took much more computing time than Bayesian networks, especially when using all of the attributes, i.e., attribute set C (see Methods). Set B was used in the following experiments.
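The evaluation protocol above (precision, recall, and five-fold cross-validation) can be sketched in plain Python. The sketch assumes labels of 1 for interaction and 0 for non-interaction and uses an interleaved, unshuffled split; how the folds were actually drawn is not stated in the text, and the `fit`/`predict` callables and the toy threshold classifier are hypothetical stand-ins for the Weka learners.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists; fold i tests items i, i+k, i+2k, ..."""
    indices = list(range(n))
    for i in range(k):
        test = indices[i::k]
        held_out = set(test)
        train = [j for j in indices if j not in held_out]
        yield train, test


def cross_validate(X, y, fit, predict, k=5):
    """Average precision and recall over k train/test splits."""
    scores = []
    for train, test in k_fold_indices(len(y), k):
        model = fit([X[j] for j in train], [y[j] for j in train])
        y_pred = [predict(model, X[j]) for j in test]
        scores.append(precision_recall([y[j] for j in test], y_pred))
    mean_precision = sum(p for p, _ in scores) / k
    mean_recall = sum(r for _, r in scores) / k
    return mean_precision, mean_recall


# Toy data: one feature, positive exactly when the feature is non-negative.
X = list(range(-5, 5))
y = [1 if x >= 0 else 0 for x in X]
fit = lambda X_train, y_train: 0            # "model" is a fixed threshold
predict = lambda model, x: 1 if x >= model else 0
result = cross_validate(X, y, fit, predict)
```

Repeating `cross_validate` over several synthesized negative sets and averaging the outcomes would mirror the second level of averaging described above.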
To examine the effect of the ratio of negatives to positives on prediction performance, we ran a series of cross-validation experiments using data sets with different ratios of negatives to positives. The results are shown in Table 1.

Table 1. Effect of negatives to positives

                      Bayesian Network      Decision Tree
Neg(1)    Ratio(2)    P(3)      R(4)        P         R
3000      1           0.782     0.550       0.819     0.712
6000      2           0.664     0.483       0.764     0.602
9000      3           0.603     0.354       0.680     0.486
12000     4           0.562     0.346       0.593     0.414

(1) Number of negatives
(2) Approximate ratio of the number of negatives to the number of positives
(3) Precision
(4) Recall
We can see that when the ratio of negatives to positives gets higher (percentage of positives gets lower), the learning ability of the system gets worse. One of the reasons lies in the way Bayesian classifiers classify objects. In Bayesian classifiers, the class with a probability over 50% will be reported as the predicted class. Considering the low percentage of true interactions among all possible protein pairs and the noisy nature of our input training data (Y2H data), this criterion may be too loose for an interaction to be true. Raising the probability threshold for a protein pair to be a true interaction may increase the learning performance. The effect of threshold was examined and results are shown in Table 2. A higher threshold increased precision performance, but fewer interactions were predicted as true. We obtained better results when the ratio of positives to negatives is close to 1 and when the positive probability threshold is increased to above 0.50. The decision trees and Bayesian network approaches obtained similar results when run individually. What will happen when their results are combined? There are two extremes when combining their results. One is the union of results, which leads to more predicted interactions but their precision is expected to get worse. The other is the intersection of their results, which leads to a much smaller number of predicted interactions (recall goes lower) but is expected to have a higher precision. We did not use either of these two extremes but instead combined the two methods. If pd and pb are the probability of being positive reported by a C4.5
decision tree and Naïve Bayesian network, respectively, we combined the results in the following way.

pd > 0.8 OR pb > 0.8 → positive
0.5 < pd ≤ 0.8 AND 0.5 < pb ≤ 0.8 → positive
otherwise → negative.

Table 2. Effect of probability threshold for positive class

          Bayesian Network      Decision Tree
T(1)      P(2)      R(3)        P         R
0.50      0.782     0.550       0.819     0.712
0.60      0.797     0.526       0.827     0.624
0.70      0.810     0.426       0.834     0.507
0.80      0.826     0.366       0.850     0.432
0.90      0.854     0.311       0.858     0.403
1.00      1.000     0.005       0.980     0.012

(1) Threshold of positive probability
(2) Precision
(3) Recall
That is, all predictions with probabilities higher than 0.8 are considered positives, whereas predictions with probabilities between 0.5 and 0.8 are considered positive only if they are predicted by both approaches. This combination retains high confidence interactions and gets rid of some low confidence predictions. We obtained a precision of 87% and recall of 60%, which showed a significant improvement in precision over results obtained by either method alone (see first rows of Table 1 and Table 2), and maintained a relatively high recall value compared to individual methods. For comparison, simply taking the intersection of Bayesian network and decision tree resulted in a precision of 89% and recall of 43%.
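The combination rule can be written as a small function. This is a sketch, not the code used in the study: the text does not say whether the 0.5 and 0.8 boundaries are strict, so the comparisons below are an assumption, and the union and intersection helpers are included only to show the two extremes discussed above.

```python
def combine(p_d, p_b, high=0.8, low=0.5):
    """Combined decision from decision-tree (p_d) and Bayesian (p_b) scores."""
    if p_d > high or p_b > high:
        return True   # one confident vote is enough
    if low < p_d <= high and low < p_b <= high:
        return True   # two moderate votes must agree
    return False


def union(p_d, p_b, low=0.5):
    """Positive if either classifier alone would predict an interaction."""
    return p_d > low or p_b > low


def intersection(p_d, p_b, low=0.5):
    """Positive only if both classifiers predict an interaction."""
    return p_d > low and p_b > low


# Hypothetical score pairs illustrating each branch of the rule.
examples = [(0.90, 0.10), (0.60, 0.70), (0.60, 0.40), (0.30, 0.20)]
decisions = [combine(p_d, p_b) for p_d, p_b in examples]
print(decisions)  # [True, True, False, False]
```

The first example is accepted by the combined rule and by the union but rejected by the intersection, while the third is accepted only by the union; this is how the rule sits between the two extremes.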
5. Conclusion Our results show that GO annotations can be a useful predictor for protein-protein interactions and that prediction performance can be improved by combining results from both decision trees and Bayesian networks. When combining results from decision trees and Bayesian networks, a fixed cutoff value of 0.8 was chosen. We could select an optimal probability cutoff point by adjusting the cutoff value and iterating the process. Our approach does have its own limitations. We are using approximately equal numbers of positives and negatives in our training and testing sets. In reality, the ratio of positives to negatives is very low, and how
our system would work with low ratios remains to be seen. Because many proteins do not have GO annotations, other approaches will be needed in combination with the one we used here. Moreover, to get a higher quality and more complete interaction map, more types of data have to be combined, including gene expression, phenotype, and protein domains. These will be integrated into our system in the future. The ultimate goal is to predict new interactions; we will scale our system to the proteomic scale and predict a set of new interactions for Drosophila melanogaster. These interactions should in turn be useful for predicting an interaction map for human proteins.
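The cutoff search suggested in the conclusion could take roughly the following form. The text does not define "optimal", so the criterion used here (highest precision subject to a minimum recall) is an assumption, as are the candidate cutoffs, the toy scores, and all function names; in practice the search would be run on held-out data.

```python
def combine(p_d, p_b, high, low=0.5):
    """Combination rule with an adjustable upper cutoff."""
    return (p_d > high or p_b > high) or (
        low < p_d <= high and low < p_b <= high
    )


def precision_recall(labels, preds):
    """Return (precision, recall) for boolean-like labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_cutoff(scored, cutoffs, min_recall=0.5):
    """Pick the cutoff with the highest precision among those meeting min_recall."""
    labels = [y for _, _, y in scored]
    best = None
    for cutoff in cutoffs:
        preds = [combine(p_d, p_b, cutoff) for p_d, p_b, _ in scored]
        precision, recall = precision_recall(labels, preds)
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (cutoff, precision, recall)
    return best


# Hypothetical (p_d, p_b, true label) triples.
scored = [
    (0.95, 0.90, 1),
    (0.85, 0.40, 1),
    (0.75, 0.30, 0),
    (0.65, 0.60, 1),
    (0.55, 0.20, 0),
    (0.30, 0.10, 0),
]
chosen = best_cutoff(scored, [0.6, 0.7, 0.8, 0.9])
print(chosen)  # (0.8, 1.0, 1.0)
```

Ties are resolved in favor of the lowest cutoff because later candidates must strictly improve precision; a different tie-breaking rule or objective would be equally defensible.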
References
[1] A. J. M. Walhout and M. Vidal, "Protein Interactions Maps for Model Organisms," Nature Reviews Molecular Cell Biology, 2: 55-62, 2001.
[2] J. I. Semple, C. M. Sanderson, and R. D. Campbell, "The Jury Is out on "Guilt by Association" Trials," Briefings in Functional Genomics and Proteomics, 1 (1): 40-52, 2002.
[3] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork, "Comparative Assessment of Large-Scale Data Sets of Protein-Protein Interactions," Nature, 417 (6887): 399-403, 2002.
[4] C. M. Deane, L. Salwinski, I. Xenarios, and D. Eisenberg, "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations," Molecular & Cellular Proteomics, 1: 349-356, 2002.
[5] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki, "A Comprehensive Two-Hybrid Analysis to Explore the Yeast Protein Interactome," Proceedings of the National Academy of Sciences of the United States of America, 98 (8): 4569-4574, 2001.
[6] R. Mrowka, A. Patzak, and H. Herzel, "Is There a Bias in Proteome Research?" Genome Research, 11 (12): 1971-1973, 2001.
[7] A. M. Edwards, B. Kus, R. Jansen, D. Greenbaum, J. Greenblatt, and M. Gerstein, "Bridging Structural Biology and Genomics: Assessing Protein Interaction Data with Known Complexes," TRENDS in Genetics, 18 (10): 529-536, 2002.
[8] S. Fields and O.-k. Song, "A Novel Genetic System to Detect Protein-Protein Interactions," Nature, 340 (6230): 245-6, 1989.
[9] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, et al., "A Comprehensive Analysis of Protein-Protein Interactions in Saccharomyces Cerevisiae," Nature, 403 (6770): 623-627, 2000.
[10] S. Li, C. M. Armstrong, N. Bertin, H. Ge, S. Milstein, M. Boxem, P. O. Vidalain, J. D. Han, A. Chesneau, T. Hao, et al., "A Map of the Interactome Network of the Metazoan C. elegans," Science, 303 (5657): 540-3, 2004.
[11] L. Giot et al., "A Protein Interaction Map of Drosophila melanogaster," Science, 302: 1727-1736, 2003.
[12] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S.-L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, et al., "Systematic Identification of Protein Complexes in Saccharomyces cerevisiae by Mass Spectrometry," Nature, 415 (6868): 180-183, 2002.
[13] A.-C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A.-M. Michon, C.-M. Cruciat, et al., "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes," Nature, 415 (6868): 141-147, 2002.
[14] A. Valencia and F. Pazos, "Computational Methods for the Prediction of Protein Interactions," Current Opinion in Structural Biology, 12: 368-373, 2002.
[15] T. Dandekar, B. Snel, M. Huynen, and P. Bork, "Conservation of Gene Order: a Fingerprint of Proteins that Physically Interact," Science, 23: 324-328, 1998.
[16] E. M. Marcotte, M. Pellegrini, H.-L. Ng, D. W. Rice, T. O. Yeates, and D. Eisenberg, "Detecting Protein Function and Protein-Protein Interactions from Genome Sequences," Science, 285: 751-753, 1999.
[17] A. J. Enright, I. Iliopoulos, N. C. Kyrpides, and C. A. Ouzounis, "Protein Interactions Maps for Complete Genomes Based on Gene Fusion Events," Nature, 402 (6747): 86-90, 1999.
[18] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates, "Assigning Protein Functions by Comparative Genome Analysis: Protein Phylogenetic Profiles," Proceedings of the National Academy of Sciences of the United States of America, 96 (8): 4285-4288, 1999.
[19] D. Eisenberg, E. M. Marcotte, I. Xenarios, and T. O. Yeates, "Protein Function in the Post-Genomic Era," Nature, 405 (6788): 823-826, 2000.
[20] V. Schachter, "Bioinformatics of Large-Scale Protein Interaction Networks," BioTechniques Computational Proteomics Supplement, 32: S16-S27, 2002.
[21] C. von Mering, M. Huynen, D. Jaeggi, S. Schmidt, P. Bork, and B. Snel, "STRING: a Database of Predicted Functional Associations between Proteins," Nucleic Acids Research, 13 (1): 258-261, 2003.
[22] P. M. Bowers, M. Pellegrini, M. J. Thompson, J. Fierro, T. O. Yeates, and D. Eisenberg, "Prolinks: a database of protein functional linkages derived from coevolution," Genome Biology, 5 (5): R35, 2004.
[23] J. C. Mellor, I. Yanai, K. H. Clodfelter, J. Mintseris, and C. DeLisi, "Predictome: a database of putative functional links between proteins," Nucleic Acids Research, 30 (1): 306-9, 2002.
[24] L. R. Matthews, P. Vaglio, J. Reboul, H. Ge, B. P. Davis, J. Garrels, S. Vincent, and M. Vidal, "Identification of Potential Interaction Networks Using Sequence-Based Searches for Conserved Protein-Protein Interactions or "Interologs"," Genome Research, 11 (12): 2120-2126, 2001.
[25] J. Wojcik and V. Schachter, "Protein-Protein Interaction Map Inference Using Interaction Domain Profile Pairs," Bioinformatics, 17 (Suppl. 1): S296-S305, 2001.
[26] J. R. Bock and D. A. Gough, "Whole-Proteome Interaction Mining," Bioinformatics, 19 (1): 125-34, 2003.
[27] S. M. Gomez and A. Rzhetsky, "Towards the Prediction of Complete Protein-Protein Interaction Networks," Pacific Symposium on Biocomputing: 413-424, 2002.
[28] M. Deng, S. Metha, F. Sun, and T. Chen, "Inferring Domain-Domain Interactions from Protein-Protein Interactions," in Proceedings of the 6th ACM International Conference on Research in Computational Molecular Biology (RECOMB). Washington, D.C., USA, 2002.
[29] E. Sprinzak and H. Margalit, "Correlated Sequence-Signatures as Markers of Protein-Protein Interaction," Journal of Molecular Biology, 311 (4): 681-692, 2001.
[30] S. P. Kannan, C. Huang, S. Wuchty, D. Chen, and J. A. Izaguirre, "Inferring Protein-Protein Interactions from Protein Domain Combinations," Proceedings of the Ninth Annual International Conference on Research in Computational Molecular Biology, 2005.
[31] J. R. Bock and D. A. Gough, "Predicting Protein-Protein Interactions from Primary Structure," Bioinformatics, 17 (5): 455-60, 2001.
[32] The Gene Ontology Consortium, "Gene Ontology: Tool for the Unification of Biology," Nature Genetics, 25: 25-29, 2000.
[33] The Flybase Consortium, "The FlyBase database of the Drosophila genome projects and community literature," Nucleic Acids Research, 31 (1): 172-175, 2003.
[34] C. A. Stanyon, G. Liu, B. A. Mangiola, N. Patel, L. Giot, B. Kuang, H. Zhang, J. Zhong, and R. L. Finley, Jr., "A Drosophila Protein-Interaction Map Centered on Cell-Cycle Regulators," Genome Biology, 5 (12): R96, 2004.
[35] G. D. Bader and C. W. V. Hogue, "Analyzing Yeast Protein-Protein Interaction Data Obtained from Different Sources," Nature Biotechnology, 20 (10): 991-997, 2002.
[36] R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N. J. Krogan, S. Chung, A. Emili, M. Snyder, J. F. Greenblatt, and M. Gerstein, "A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data," Science, 302: 449-453, 2003.
[37] R. Jansen and M. Gerstein, "Analyzing Protein Function on a Genomic Scale: the Importance of Gold-Standard Positives and Negatives for Network Prediction," Current Opinions in Microbiology, 7 (5): 535-45, 2004.
[38] E. Gasteiger, A. Gattiker, C. Hoogland, I. Ivanyi, R. D. Appel, and A. Bairoch, "ExPASy: the Proteomics Server for In-Depth Protein Knowledge and Analysis," Nucleic Acids Research, 31: 3784-3788, 2003.
[39] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, California: Morgan Kaufmann Publishers, 1993.
[40] D. Heckerman, "A Tutorial on Learning with Bayesian Networks," Microsoft, Technical Report MSR-TR-95-06, 1995.
[41] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations: Morgan Kaufmann, 1999.