Computational Identification of Potential Molecular ... - Plant Physiology

Bioinformatics

Computational Identification of Potential Molecular Interactions in Arabidopsis1[C][W] Mingzhi Lin2, Bin Hu2, Lijuan Chen, Peng Sun, Yi Fan, Ping Wu, and Xin Chen* Department of Bioinformatics, Zhejiang University, Hangzhou, People’s Republic of China, 310058 (M.L., B.H., L.C., Y.F., X.C.); and State Key Laboratory of Plant Physiology and Biochemistry, Zhejiang University, Hangzhou, People’s Republic of China, 310058 (P.S., P.W.)

Knowledge of the protein interaction network is useful to assist molecular mechanism studies. Several major repositories have been established to collect and organize reported protein interactions. Many interactions have been reported in several model organisms, yet a very limited number of plant interactions can thus far be found in these major databases. Computational identification of potential plant interactions, therefore, is desired to facilitate relevant research. In this work, we constructed a support vector machine model to predict potential Arabidopsis (Arabidopsis thaliana) protein interactions based on a variety of indirect evidence. In a 100-iteration bootstrap evaluation, the confidence of our predicted interactions was estimated to be 48.67%, and these interactions were expected to cover 29.02% of the entire interactome. The sensitivity of our model was validated with an independent evaluation data set consisting of newly reported interactions that did not overlap with the examples used in model training and testing. Results showed that our model successfully recognized 28.91% of the new interactions, similar to its expected sensitivity (29.02%). Applying this model to all possible Arabidopsis protein pairs resulted in 224,206 potential interactions, which is the largest and most accurate set of predicted Arabidopsis interactions at present. In order to facilitate the use of our results, we present the Predicted Arabidopsis Interactome Resource, with detailed annotations and more specific per interaction confidence measurements. This database and related documents are freely accessible at http://www.cls.zju.edu.cn/pair/.

The complex cellular functions of an organism rely on physical interactions between proteins. Deciphering the protein-protein interaction network to understand higher level phenotypes and their regulations is always a major focus of both experimental biologists and computational biologists. A number of highthroughput (HTP) assays have been developed to identify in vitro protein interactions from several model organisms (Uetz et al., 2000; Giot et al., 2003; Li et al., 2004). A number of initiatives, such as IntAct (Kerrien et al., 2006), Molecular INTeraction database (Chatr-aryamontri et al., 2007), the Database of Interacting Proteins (Salwinski et al., 2004), Biomolecular Interaction Network Database (BIND; Alfarano et al., 1 This work was supported by the National Natural Science Foundation of China (grant no. 30600039), by the Chinese Ministry of Science and Technology (maintenance grant to the State Key Laboratory of Plant Physiology and Biochemistry, Zhejiang University, for long-term running cost of the PAIR database), and by the National Basic Research Program of China (grant no. 2005CB20901 to P.W.). 2 These authors contributed equally to the article. * Corresponding author; e-mail [email protected]. The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Xin Chen ([email protected]). [C] Some figures in this article are displayed in color online but in black and white in the print edition. [W] The online version of this article contains Web-only data. www.plantphysiol.org/cgi/doi/10.1104/pp.109.141317

34

2005), and BioGRID (Stark et al., 2006), have been established to systematically collect and organize the interaction data reported by both proteome-scale HTP experiments and traditional low-throughput studies focusing on individual proteins or pathways. Arabidopsis (Arabidopsis thaliana) has long been studied as a model organism to investigate the physiology, biochemistry, growth, development, and metabolism of a flowering plant at the molecular level. The molecular mechanism studies of various phenotypes and their regulations in Arabidopsis may be facilitated by a comprehensive reference protein interaction network, based on which working hypotheses could be invented with more guidance and confidence. However, due to technological limitations, most experimentally reported protein interactions in available databases were from other organisms. A very limited number of plant interactions could be found in these databases. Therefore, an accurate prediction of the Arabidopsis interactome would be valuable to assist relevant research. Studies on the computational identification of potential interactions started along with the advent of HTP interaction-detection technologies, which often produced a large number of false positives (Deane et al., 2002). Indirect evidence of protein interaction (e.g. protein colocalization and relevance in function) were hence introduced to boost the confidence of HTP results (Jansen et al., 2003). Further investigations demonstrated that direct inference of protein interactions from such indirect evidence alone was possible

Plant PhysiologyÒ, September 2009, Vol. 151, pp. 34–46, www.plantphysiol.org Ó 2009 American Society of Plant Biologists

Predicted Arabidopsis Interactome

(Scott and Barton, 2007). The accuracy and effectiveness of using indirect evidence to predict interactions have also been thoroughly assessed (Qi et al., 2006; Suthram et al., 2006). These works offered precious insights into how protein interactions may be predicted accurately on a proteomic scale. In other organisms such as Homo sapiens, the prediction of an entire interactome has already been proven applicable and useful (Rhodes et al., 2005). On the other side, several efforts have been made to collect and organize a comprehensive map of Arabidopsis molecular interactions. For instances, around 20,000 interactions were inferred by homology to known interactions in other organisms (Geisler-Lee et al., 2007). Another work predicted 23,396 interactions based on multiple indirect data and curated 4,666 interactions from the literature and enzyme complexes (Cui et al., 2008). The Arabidopsis reactome database was established describing the functions of 2,195 proteins with 8,269 reactions in 318 superpathways (Tsesmetzis et al., 2008). And a general interaction database, IntAct (Kerrien et al., 2006), had allocated a special unit actively curating all plant protein interactions from literature and submitted data sets, which now contains 2,649 Arabidopsis interactions. However, in yeast, approximately 18,000 protein-protein interactions had been estimated for approximately 6,000 genes (Yu et al., 2008). Assuming the same rate of interaction, approximately 200,000 protein interactions would be expected for approximately 20,000 Arabidopsis genes. Therefore, the current collection of Arabidopsis interactions is still significantly limited. Moreover, most previous prediction works did not provide rigorous confidence measurements for their predicted interactions, which further limited their scope of applications. Recent advances in statistical learning presented a powerful algorithm, support vector machine (SVM), which may be used to predict interactions based on multiple indirect data. Although the basis of SVM had been laid in the 1960s, the idea of SVM was only officially proposed in the 1990s by Vapnik (1998, 2000). Then, research on its theoretical and application aspects thrived. It has been applied in a wide range of problems, including text categorization (de Vel et al., 2001; Kim et al., 2001), image classification and object detection (Ben-Yacoub et al., 1999; Karlsen et al., 2000), flood stage forecasting (Liong and Sivapragasam, 2002), microarray gene expression data analysis (Brown et al., 2000), drug design (Zhao et al., 2006a, 2006b), protein solvent accessibility prediction (Yuan et al., 2002), and protein fold prediction (Ding and Dubchak, 2001; Hua and Sun, 2001). Many studies have demonstrated that SVM was consistently superior to other supervised learning methods (Brown et al., 2000; Burbidge et al., 2001; Cai et al., 2003). In this work, with careful preparation of example data and selection of indirect evidence, we constructed an SVM model to predict potential Arabidopsis interactions. False positives were tightly controlled. With the high-confidence model, we identified altogether Plant Physiol. Vol. 151, 2009

224,206 potential interactions, which were expected to be 48.67% accurate and to cover 29.02% of the entire Arabidopsis interactome. More specific confidence measurements were also assigned on a per interaction basis. To facilitate the use of our results, we present the Predicted Arabidopsis Interactome Resource (PAIR; http://www.cls.zju.edu.cn/pair/), featuring detailed annotations and a friendly user interface.

RESULTS AND DISCUSSION Example Data Set

Known pairs of interacting proteins (positive examples) and noninteracting proteins (negative examples) were required to train an SVM prediction model. As detailed in “Materials and Methods,” we assembled a gold-standard example set consisting of 4,139 positive and 4,139 negative examples (Table I). In order to verify whether the proteins in our example interactions were a good sample of the entire Arabidopsis proteome, we used four categorizing systems to compute and compare the protein distributions in our example interactions and those in the entire Arabidopsis proteome. The Arabidopsis Information Resource (TAIR) database (Rhee et al., 2003) provided three classification systems based on Gene Ontology (GO) annotation (i.e. the molecular function-based system, the cellular component-based system, and the biological process-based system, which are collectively known as the GO Slim classification systems). In most cases, a GO Slim category in each classification system corresponds to an existing term in GO. Gene products that are annotated to the term itself, or to any of the children of this term, are included in the corresponding GO Slim category. In some cases, a GO Slim category may also encompass more than one GO term. In the GO Slim classification systems, categories were carefully chosen and were intended to be nonoverlapping. In our analysis, proteins classified into the “unknown categories” (e.g. “molecular function unknown,” “biological process unknown,” and “cellular component unknown”) were ignored, while the proteins classified into the “other categories” (e.g. “other molecular functions,” “other biological processes,” and “other cellular components”) were counted. As shown in Figure 1A, proteins in our interaction examples and proteins in the entire Arabidopsis proteome showed similar histograms according to the GO Slim Biological Process categories. The Pearson’s corTable I. Composition of gold-standard interaction examples Data Source

IntAct BIND BioGRID TAIR Total

No. of Interactions

No. of Proteins

2,649 808 736 1,177 4,139

1,138 416 317 770 1,679 35

Lin et al.

Figure 1. Protein distribution in our interaction examples and in the entire Arabidopsis proteome. Blue bars show the fraction of interacting proteins in each category, and red bars show the fraction of all Arabidopsis proteins in each category. A, Protein distribution based on the GO Slim Biological Process categories. The Pearson’s correlation coefficient between two distributions is 0.95 (P = 3.17 3 1027). B, Protein distribution based on the GO Slim Molecular Function categories. The correlation coefficient is 0.52 (P = 2.8 3 1022). C, Protein distribution based on the GO Slim Cellular Component categories. The correlation coefficient is 0.68 (P = 2.9 3 1023). ER, Endoplasmic reticulum. D, Protein distribution based on the Pfam protein family classifications. The correlation coefficient is 0.75 (P , 1 3 10 210). In this diagram, only the largest 150 families are shown for clarity. [See online article for color version of this figure.]

relation coefficient (r) between these two distributions was 0.95. Using Fisher’s z transformation, this coefficient corresponds to a statistical significance of P = 3.17 3 1027. Similarly, according to the GO Slim Molecular Function categories, we observed a correlation coefficient of 0.52 (P = 2.8 3 1022; Fig. 1B), and according to the GO Slim Cellular Component categories, we observed a correlation coefficient of 0.68 (P = 2.9 3 1023; Fig. 1C). These significant correlations indicated that proteins in our example data set were indeed a good sample of the entire proteome. Besides the annotationbased categories, sequence-based protein families also served as a good classification system to compare protein distributions. According to the Pfam database (Coggill et al., 2008), all Arabidopsis proteins were classified into 2,854 families. As shown in Figure 1D, the frequency distributions of example proteins and of all Arabidopsis proteins were roughly consistent, showing a correlation coefficient of 0.75 (P , 1 3 10210). All of these data suggested that the proteins in our interaction examples represented a good sample of the Arabidopsis proteome. 36

Strength of the Indirect Evidence

As detailed in “Materials and Methods,” potential interactions were predicted based on 14 features derived from four types of indirect evidence (Table II). Therefore, it is of interest to evaluate the extent to which these features carried information about protein interaction. One way to assess the discrimination power of a feature is to compare its value distribution in interaction examples against its value distribution in noninteraction examples. Informative features are expected to show apparent differences.

Table II. Composition of features Evidence Type

No. of Features

Coexpression Domain interaction Colocalization Shared annotation Total

1 9 1 3 14 Plant Physiol. Vol. 151, 2009


In this way, we divided the value range of each feature into 20 equal bins. The fraction of positive examples and the fraction of negative examples that fell into each bin were calculated to make the corresponding distribution histograms. We plotted these distribution histograms and the associated likelihood ratio (LR) curve in Figure 2. The LR for each bin is defined as the fraction of interaction examples above this bin divided by the fraction of noninteraction examples above this bin. Above a bin means that the feature values are expected to better indicate an interaction than all values in this bin. For features describing coexpression, colocalization, and domain interaction, this means that the feature values are greater than the biggest number in this bin. But for features describing shared annotation, above a bin means the contrary. For convenience in comparison, we wanted all LR curves to increase with the horizontal axes. Therefore, for features describing shared annotation, the horizontal axes were plotted as 1 – the feature value. Figure 2 shows several typical feature strength diagrams. A complete set of these diagrams is provided in Supplemental Text S1. It was clear that in these diagrams, no sharp differences existed between the value distributions in interaction examples and those in noninteraction examples. Pearson’s correlation between the interaction and noninteraction distributions ranged from 0.324 to 1. The Kolmogorov-Smirnov test also indicated that these interaction and noninteraction distributions were not significantly different, dis-

playing distances from 0.1 to 0.4, corresponding to a P value range of 1.0 to 0.313 (no significance to reject the hypothesis that they were the same; Supplemental Text S1). However, it was also obvious that there were always noticeable deviations between the interaction and noninteraction distributions for each feature. In the best scenarios (at the right-most point), these features always led to significant LRs (i.e. .10-fold). Hence, they shall be considered weak but relevant. A delicate scheme to exploit their combined discriminative power was necessary for accurate prediction. The above set of feature LRs also offered an interesting possibility to predict potential interactions based on the total LR of a protein pair being an interaction rather than a noninteraction, which was computed as the product of the 14 single feature LRs (Rhodes et al., 2005; Cui et al., 2008). While we regarded this approach as a useful supplement to assign per interaction prediction confidence, using it as an interactome prediction method may not be optimal in our case, for two reasons. First, the features we used were not independent. Strong correlations existed among features describing the same type of evidence (e.g. the Pearson’s correlation between the nine domain interaction features ranged from 0.0049 to 0.918). And features describing different indirect evidence may also correlate, such as the colocalization feature and the shared GO cellular component annotation feature (r = 0.0638, P , 1 3 10210). Therefore, the

Figure 2. Comparison of the feature value distribution in interaction examples and the feature value distribution in noninteraction examples. For each feature, its value range is divided into 20 equal bins. Blue curves and red curves show the fraction of interaction examples and the fraction of noninteraction examples that fall into each bin. Green curves represent the LR. A complete set of these analysis diagrams can be found in Supplemental Text S1. The horizontal axis represents the feature value (in A–C) or the 1 – feature value (in D). The left vertical axis represents the fraction of protein pairs. The right vertical axis represents the LR. A, Protein colocalization feature. B, Domain interaction feature 9 (Random Decision Forest Framework). C, Gene coexpression feature. D, Shared annotation feature 3 (cellular component). [See online article for color version of this figure.] Plant Physiol. Vol. 151, 2009

37

Lin et al.

product of LRs would be dominated by the information coded in multiple correlated features. Useful information carried by other independent features may be obscured, which may lead to significant bias in prediction. Second, as shown in Figure 2, the LR curves were not monotonically increasing, which meant that using a single cutoff value to compute LR was too simple. Finer structures of the feature value distributions could be exploited for better prediction accuracy. Consequently, prediction systems that are more complicated and tolerate correlated features were desired. Many previous machine learning studies had suggested that the SVM algorithm excelled in this scenario. Accuracy of Our Prediction Models

We constructed an SVM model to predict Arabidopsis interactions. Its parameters, C and g, were optimized by a grid search. Model performances during parameter optimization were estimated by 10-fold cross-validation with the F1 measure. Accuracy of the resulting model was evaluated by a more precise 100-iteration bootstrap experiment. In each iteration, one-tenth of the example data were randomly left out from training, and these data were used to test the model and compute the confusion matrix. After 100 iterations, the mean and SD of the confusion matrix (Table III), together with other derivative accuracy measurements, were computed. The optimal model produced 84.52% 6 0.20% sensitivity, 90.58% 6 0.25% specificity, and 87.44% 6 0.14% F1 measure. Its overall accuracy was estimated to be 87.55% 6 0.14%. In addition, a receiver operating characteristic (ROC) curve analysis was performed to verify whether the SVM algorithm was effective in interaction prediction. In the ROC curve, sensitivity was plotted against 1 2 specificity for SVM models trained with all range of parameters. Therefore, the ROC curve analysis is a global and unbiased evaluation of the effectiveness of an algorithm, since it does not depend on any specific parameter used to build a particular model (Hanley and McNeil, 1982). The bigger the area under the curve, the more effective a prediction algorithm would be. As a result, we had an area under the curve of 91.86% for the SVM approach, which was slightly better than those for several other widely used prediction methods (Supplemental Fig. S1). These

results demonstrated that the SVM approach was suitable to exploit the information carried in the weak evidence and to make accurate predictions. Although the above prediction model worked well on our example set consisting of equal numbers of positive and negative examples, its application in interactome prediction was still limited. Assuming that the ratio of interacting protein pairs out of random pairs in Arabidopsis was similar to that in yeast (i.e. approximately 1:775; Yu et al., 2008), its 90.58% specificity would translate to a large number of false positives. And the precision of its predicted interactions would drop to approximately 1.15%, worthless for expert inspection. Based on our past experiences (Cai et al., 2003; Xue et al., 2004; Xu et al., 2007), the SVM algorithm usually displayed an increased accuracy for the type of examples with either larger number or greater diversity. Thus, one way to increase specificity without significantly compromising sensitivity was to expand the negative examples in the training data. With this strategy, we added 409,761 randomly generated negative examples to the training set, raising the positive-to-negative ratio to 1:100. With the same protocol, this enlarged example set produced an optimal SVM model showing 29.02% 6 0.13% sensitivity, 99.95% 6 0.0013% specificity, 85.15% 6 0.33% precision, and 43.29% 6 0.17% F1 measure. Applying this model to all possible Arabidopsis protein pairs, 224,206 potential interactions were identified. These 224,206 predicted interactions also implied an estimation of the Arabidopsis interactome size. The equation “true positives + false positives = all predicted positives” can be rewritten as: N int 3 Sen þ ðN all 2 N int Þ 3 ð1 2 SpeÞ ¼ N pred where Nint represents the interactome size implied by our model, Nall is the number of all possible Arabidopsis protein pairs (2.29 3 108), Npred is the number of predicted interactions (224,206), and Sen and Spe are the sensitivity and specificity of our model. Solving this equation gave an estimated Arabidopsis interactome size of 3.78 3 105, corresponding to an interaction rate of 1:606, similar to that estimated in yeast (approximately 1:775; Yu et al., 2008). With this estimation, this model was expected to have a precision of 48.67% in interactome prediction. And the first model trained with an equal number of positive and negative

Table III. The confusion matrix of our prediction models The 1:1 model (high-coverage model) refers to the optimal model trained with an equal number of positive and negative examples. The 1:100 model (high-confidence model) refers to the optimal model trained with the expanded example data set in which the positive-to-negative ratio is 1:100. All measurements are provided in the format of mean 6 SD computed by 100-iteration bootstrap evaluation. Predicted

Positive Negative

38

Actual

Positive TP: 3,498.32 6 8.28 (1:1 model), 1,201.28 6 5.57 (1:100 model) FN: 640.68 6 8.28 (1:1 model), 2,937.72 6 5.57 (1:100 model)

Negative FP: 389.88 6 10.44 (1:1 model), 209.44 6 5.43 (1:100 model) TN: 3,749.12 6 10.44 (1:1 model), 413,690 6 5.43 (1:100 model) Plant Physiol. Vol. 151, 2009


examples was expected to display a precision of 1.46% in interactome prediction. Comparing the above two models, the one trained with an equal number of positive and negative examples had a high sensitivity (84.52%). Or, it would be less likely to miss a real interaction. For this reason, we call it the high-coverage model, or the 1:1 model. However, this good coverage came at a cost. The confidence of its predicted interactions (precision) was estimated to be as low as 1.46%. In contrast, the model trained with 100 times more negative examples showed a high precision (48.67%). Or, it would be less likely to wrongly predict an interaction. Thus, we call this model the high-confidence model, or the 1:100 model. Nonetheless, its predicted interactions were expected to cover only 29.02% of the entire interactome. While the interactions predicted by the highcoverage model could be more useful to reconfirm interactions identified by other low-confidence approaches (e.g. some HTP experiments that tend to report many false positives), the interactions predicted by the high-confidence model would be more interesting for direct expert inspection. In addition, it was noted that our high-coverage predictions included the entire set of high-confidence predictions, suggesting the self-consistency of our methods. Rediscovery of Newly Reported Interactions

The sensitivity of our prediction models in interactome prediction was further validated with one independent evaluation data set containing newly reported interactions that were not included in the major interaction databases at the time when our positive examples were assembled. This data set was extracted from the latest update of BioGRID (Stark et al., 2006), containing 166 novel interactions. As shown in Supplemental Table S1, 122 (73.49%) novel interactions could be successfully recognized by the high-coverage model. This rate of sensitivity was slightly lower than the expected one (84.52%). Therefore, we considered this rate (73.49%) as the likely sensitivity of our high-coverage model when general-

ized to interactome prediction. On the other hand, the high-confidence model was able to identify 48 (28.91%) novel interactions. This sensitivity was comparable to its expected one (29.02%). In contrast, by chance alone, our high-coverage model and highconfidence model would be expected to have approximately 70.27 6 8.37 and approximately 4.96 6 1.94 overlaps, respectively, with this independent evaluation data set. Consequently, the significance of recognizing 122 interactions by our high-coverage model was 3.17 3 10210. And the significance of recognizing 48 interactions by our high-confidence model was less than 1 3 10210. Comparison with Other Predicted Interactomes

Before this work, there had been several efforts to predict Arabidopsis interactions. As mentioned in the introduction, without using complex statistical models, Geisler-Lee et al. (2007) predicted interactions by homology to reported interactions in other organisms (herein referred to as the Interologs data set). Also, Cui et al. (2008) identified 23,396 potential interactions based on multiple indirect evidence with a total likelihood ratio cutoff and published as the AtPID database (the AtPID data set). In addition to these predicted interactomes, Obayashi et al. (2009) computed a coexpression network from 58 microarray experiments, 1,388 slides, which was available as the ATTED-II database (the ATTED-II data set). These data sets offered an interesting opportunity to compare with our results. Table IV shows the overlaps between our predictions and the above data sets, together with the significance values. It was shown that all data sets exhibited significant overlaps that were not likely produced by chance, indicating their consistency in general. The AtPID set showed the largest proportion of overlap with our predictions, most likely as a result of using a similar set of indirect evidence (e.g. gene coexpression and GO biological process similarity). In contrast, the ATTED-II set had relatively lower overlaps with all data sets, including the gold-standard interaction examples. This was not surprising, as

Table IV. Comparison with other predicted interactomes The data sets being compared were our high-confidence predictions (SVM), the homology-based predictions described by Geisler-Lee et al. (2007; Interologs), the interactions predicted by multiple indirect evidence as in the AtPID database (Cui et al., 2008), the coexpression network as in the ATTED-II database (Obayashi et al., 2009), and the gold-standard interaction examples compiled in this work (GS). Note that the ATTED-II set is not exactly a predicted interactome. It was included for the purpose of comparing predicted interactomes with a coexpression network. The Interologs set and the AtPID set contained all of their predicted interactions. Their gold-standard interactions not recognized by their prediction methods were not counted. The size of each data set is given after the data set name in the left column. For each overlap, its significance (probability of existence by chance alone) is provided. SVM

SVM (224,206) Interologs (69,225) AtPID (24,418) ATTED-II (48,276) GS (4,139)

– 1,266 (P , 1 3 10–10) 5,789 (P , 1 3 10–10) 847 (P , 1 3 10–10) 917 (P , 1 3 10–10)

Plant Physiol. Vol. 151, 2009

Interologs

AtPID –10

1,266 (P , 1 3 10 ) – 3,132 (P , 1 3 10–10) 253 (P , 1 3 10–10) 260 (P , 1 3 10–10)

ATTED-II –10

5,789 (P , 1 3 10 ) 3,132 (P , 1 3 10–10) – 434 (P , 1 3 10–10) 132 (P , 1 3 10–10)

GS –10

847 (P , 1 3 10 ) 253 (P , 1 3 10–10) 434 (P , 1 3 10–10) – 52 (P , 1 3 10–10)

917 (P , 1 3 10–10) 260 (P , 1 3 10–10) 132 (P , 1 3 10–10) 52 (P , 1 3 10–10) – 39

Lin et al.

coexpression does not equal interaction. Although gene coexpression and protein interaction are well correlated in prokaryotes, the correlation is not as strong in eukaryotes (Bhardwaj and Lu, 2005), probably because the layer of regulatory mechanisms is different between coexpression and protein interaction (Obayashi et al., 2009). This was the reason why, in this work, coexpression was employed as one feature. Besides it, several other indirect data were also considered to make more reliable predictions. However, although coexpression is weakly correlated with interaction and not enough to indicate an interaction by itself, we indeed observed an increased level of expression correlation in our predicted interactions. The ATTED-II database used two measurements to gauge the correlations between expression profiles: the Pearson’s correlation (PC) and mutual rank (MR). A gene pair with higher PC and lower MR was more likely to interact. Results showed that our predicted interactions had an average PC of 0.1341, higher than that of all protein pairs (0.0075; P , 1 3 10210), and the mean MR of our predicted interactions was 7,454.6, lower than that of all protein pairs (11,286.6; P , 1 3 10210). This increase of expression correlation was similarly observed in our goldstandard interactions, which displayed an average PC of 0.1224 (P , 1 3 10210) and a mean MR of 7,994.5 (P , 1 3 10210). Interestingly, the Wilcoxon rank sum test indicated that the average PC of our predicted interactions (0.1341) was slightly higher than that of the gold-standard interactions (0.1224; P = 0.002), while the mean MR of our predicted interactions (7,454.6) was largely comparable to that of the gold-standard interactions (7,994.5; P = 0.077, no significance). It was also noted that our predicted interactions had a significant overlap with the potential interactions identified by homology to known interactions in other species. Existence of homologous interactions was one of piece of indirect evidence that we did not use in our prediction scheme. Therefore, if predictions by the homology-based approach were independent of ours, the interactions predicted by both approaches should be more reliable. Indeed, for this set of interactions, we had an average SVM confidence score of 1.586 (detailed in “Data Access” below), which was higher than the average of all our predicted interactions (0.761; P , 1 3 10210 by Wilcoxon rank sum test). And on the other side, Geisler-Lee et al. (2007) computed a confidence value to indicate the confidence of each homologypredicted interaction. For the interactions predicted by both approaches, we observed an average confidence value of 18.507, higher than that of all the homologypredicted interactions (4.208; P , 1 3 10210). These results suggested the opportunity to derive a set of predicted interactions of extra high confidence. For this purpose, we assembled an updated set of known interactions consisting of 198,537 low-throughput interactions from Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and H. sapiens. These interactions were then mapped onto 275,403 40

potential Arabidopsis interactions based on the latest homologous gene group assignments as in the Inparanoid database (O’Brien et al., 2005). Among them, 7,927 interactions were also predicted by our highconfidence model. In this set, the average number of homologous interactions for one predicted interaction was approximately 1.79, larger than that in all the homology-based predictions (1.39; P , 2.2e-16). In other words, our SVM model was able to pick out those homology-based predictions that had more support from reported interactions in other species. Analysis of the Meiotic Recombination-Related Interaction Network

Usage of our predicted interactome was illustrated by an analysis of the meiotic recombination-related interactions. Meiosis is essential for sexual reproduction, in which the step of recombination is critical for normal meiosis. Understanding its underlying molecular mechanisms and regulation has attracted constant attention. The current model of meiotic recombination, known as the double-strand break repair mechanism, was mostly established through analysis of the budding yeast, S. cerevisiae. According to this model (Supplemental Fig. S2), recombination is initiated by the introduction of a double-strand break into the recipient chromatid by a double-strand endonuclease (step 1). This cut is enlarged to a gap, and 3# single-stranded termini are produced by the action of exonucleases (step 2). One of the free 3# ends then invades a homologous region of the donor duplex, displacing a small D loop (step 3). The D loop is then enlarged by repair synthesis until the other 3# end can anneal to complementary single-stranded sequences. Repair synthesis from the second 3# end completes the process of gap repair, and branch migration results in the formation of two Holliday junctions (step 4). Resolution of two junctions by cutting either inner or outer strands leads to two possible noncrossover and two possible crossover configurations (step 5; Szostak et al., 1983). The Arabidopsis proteins known to function in recombination was nicely reviewed by Wijeratne and Ma (2007). In our predicted high-confidence interaction network, 11 proteins mentioned in this review were identified as seed proteins (Table V) and extracted with their first neighbors to create a subnetwork for indepth analysis. This subnetwork contained 122 proteins and 884 interactions (Fig. 3). Among them, eight interactions had been experimentally confirmed and 40 interactions were supported by homologous interactions in other species (Fig. 3; Supplemental Tables S2 and S3). For example, BRCA2A and BRCA2B, orthologs of the human breast cancer susceptibility protein 2, had been reported to physically interact with DMC1 and RAD51 in Arabidopsis (Siaud et al., 2004; Dray et al., 2006), which were also predicted by our SVM model. Another example of predicted interactions supported by known homologue interactions could Plant Physiol. Vol. 151, 2009


Table V. Arabidopsis proteins that function in meiotic recombination These proteins, as reviewed by Wijeratne and Ma (2007), were used as seeds to create a subnetwork of meiotic recombination-related interactions in our predicted interactome. Gene Symbol

Arabidopsis Genome Initiative No.

Function in Recombination

Cluster No.

AtSPO11-1 AtSPO11-2 MRE11 RAD50 DMC1 RAD51 RAD51C XRCC3 RCK MSH4 MLH1

AT3G13170 AT1G63990 AT5G54260 AT2G31970 AT3G22880 AT5G20850 AT2G45280 AT5G57450 AT3G27730 AT4G17380 AT4G09140

Double-strand break induction (step 1) Double-strand break induction (step 1) Processing of broken ends (step 2) Processing of broken ends (step 2) Strand invasion (step 3) Strand invasion (step 3) Strand invasion (step 3) Strand invasion (step 3) Repair synthesis and branch migration (step 4) Repair synthesis and branch migration (step 4) Holliday junction resolution (step 5)

1 1 2 3 4 4 4 4 5 6 7

be found among the Structural Maintenance of Chromosomes family proteins (i.e. SMC1, SMC2, and SMC3) and the Sister Chromatid Cohesion family proteins (i.e. SYN2, SYN3, and SYN4). These proteins are part of the evolutionarily conserved cohesin complex that mediates chromosome cohesion and enables accurate chromosome segregation in both mitosis and meiosis (Revenkova and Jessberger, 2005). A clustering analysis was also performed with the Markov Cluster algorithm (Enright et al., 2002) to reveal the internal structure of our predicted subnetwork. Seven clusters were identified, with six of them connected (Fig. 3). Interestingly, the seed proteins that function in different recombination steps were distributed in nonoverlapping clusters (Fig. 3; Supplemental Table S4), suggesting that our subnetwork structure was functionally relevant. A closer look at the GO biological process annotations of these proteins indicated that proteins in the same cluster were often involved in similar or related biological processes. For example, for the first step of recombination that creates a DNA double-strand break, there were two seed proteins, meiotic recombination proteins SPO11-1 (AtSPO11-1) and SPO11-2 (AtSPO11-2; Grelon et al., 2001; Stacey et al., 2006). Both of them were partitioned into cluster 1, together with three other proteins. All of them were involved in the process of “DNA topological change.” Similarly, the four seed proteins for the third step of recombination (i.e. DNA repair protein RAD51 homolog 1 [RAD51] and homolog 3 [RAD51C], DNA repair protein XRCC3 homolog, and meiotic recombination protein DMC1 homolog) were all partitioned into cluster 4, together with seven other proteins. These proteins were all annotated with similar or related biological processes, such as “base-excision repair,” “DNA repair,” and “meiosis.” The seed proteins for step 4 made two clusters, cluster 5 and cluster 6. Both of them were seemingly homogenous. Cluster 5 had the seed protein Rock-N-Rollers (RCK), a DNA helicase required for interference-sensitive crossover events (Chen et al., 2005). It also contained 21 other helicases, including the RecQ helicase family (i.e. Plant Physiol. Vol. 151, 2009

RecQL2, RecQL3, RecQL4A, RecQL1, and RecQL4B), required for genome stability maintenance (Hartung et al., 2000). RecQL2 and RecQL4A are able to suppress homologous recombination (Hartung et al., 2007; Kobbe et al., 2008), while RecQL4B is specifically required to promote crossovers (Hartung et al., 2007). These opposing effects suggested that a delicate regulation of homologous crossover might be achieved by this cluster of proteins. The other cluster, cluster 6, was around MSH4, a meiosis-specific member of the MutS homolog family (Higgins et al., 2004). This cluster consisted of proteins involved in “mismatch repair” (e.g. MutS homolog proteins MSH2, MSH3, MSH4, and MSH7) and “regulation of transcription” (e.g. Arabidopsis trithorax-like proteins ATX1, ATX2, ATX3, ATX4, and ATX5). It was known that MSH2, a DNA mismatch repair gene MutS homolog, was able to repress recombination of mismatched heteroduplexes (Emmanuel et al., 2006). And one of the trithorax-like proteins, ATX1, was reported to be an epigenetic regulator with histone H3 Lys-4 methyltransferase activity (Alvarez-Venegas et al., 2003), which is critical in yeast for the formation of programmed DNA double-strand breaks that initiate homologous recombination during meiosis (Borde et al., 2009). These reports suggested the functional relevance of cluster 6 proteins. There were also cases of proteins in one cluster that displayed more heterogeneous annotations. For example, although one of the seed proteins for the second step of recombination, DNA repair protein RAD50, formed a relatively homogenous cluster (cluster 3; with proteins from the SMC family and the Sister Chromatid Cohesion family), the other seed protein, double-strand break repair protein MRE11, was part of cluster 2, with 32 other proteins annotated with a diversity of biological processes ranging from development to stress responses. However, it was noticed that, despite the fact that they belonged to a number of biological processes, many of the cluster 2 proteins had the capability to bind small molecules, such as calcium, FK506, and cyclosporine A. It was reported that 41

Lin et al.

Figure 3. The meiotic recombination-related proteins in our predicted interactome. Eleven seed proteins reviewed by Wijeratne and Ma (2007) and their first neighbors were extracted from our predicted interactions. Interactions between them can be grouped into seven clusters, with seed proteins involved in different recombination steps distributed in nonoverlapping clusters. Triangles represent the seed proteins and circles represent their neighbors. Proteins (nodes) are colored according to their GO molecular function annotations. Edges representing experimentally confirmed interactions are colored red. Those representing interactions homologous to known interactions in other organisms are colored blue. Other predicted interactions are colored gray. [See online article for color version of this figure.]

calmodulin antagonists and cAMP inhibited the enhancement of double-strand break repair by ionizing radiation in human cells (Wang et al., 2000). Our data implied the possibility that MRE11 might mediate this suppression. It would be interesting to experimentally demonstrate whether MRE11 actually participates in the regulation of double-strand break repair by calmodulin antagonists and cAMP as well as in the regulation by other small molecules suggested by cluster 2 proteins. Data Access

In order to facilitate the use of our predicted interactions, we present PAIR, which is freely acces42

sible at http://www.cls.zju.edu.cn/pair/. The current version of PAIR supports searches by protein, by interaction, and by homologous interaction in S. cerevisiae, D. melanogaster, C. elegans, and H. sapiens. To search a protein, users can specify its name (or alias), locus, or National Center for Biotechnology Information accession number. Interactions can be searched by specifying their component proteins or by specifying the component proteins in their homologues interactions. All queries support the use of wild characters such as “*” (representing a string of any length) and “?” (representing a single character). If multiple hits are found, a list is presented allowing the choice of a specific protein or interaction to show in detail. Plant Physiol. Vol. 151, 2009


The protein detail page provides annotations of a protein taken from other databases, including its accession numbers, description, aliases, domain composition, GO annotations, and cellular localizations. Predicted interactions involving this protein are also listed at the bottom. The interaction detail page shows concise descriptions of the two component proteins, the evidence supporting this interaction, along with its homologous interactions in other species. Two types of per interaction confidence measurements are presented for each interaction. One is the LRs for features and the total LR for each interaction (see “Strength of the Indirect Evidence” above). These LRs are provided along with their percentile rankings in interaction examples to indicate their significance. The other type of per interaction confidence measurement is a SVM-based confidence score. It had been proposed that the absolute value of decision function output was correlated with the prediction accuracy (Hua and Sun, 2001). Based on this idea, we computed this value for each predicted interaction as its confidence score. Similarly, these confidence scores are provided along with their percentile rankings in interaction examples to indicate their significance. Whenever a list of interactions is presented, filters based on the total LR and the SVM-based confidence score are available for selection of more reliable predictions. However, because these per interaction confidence measurements were not rigorously assessed and were not able to directly translate into a probability of true prediction, they shall be used with caution. Users interested in pathway analysis are able to register with PAIR. After login, the database allows users to add specific interactions into a personalized storage place called “My Collection.” Once pathway mining is done, the interactions stored in My Collection can be exported in the HUPO-PSI format (Kerrien et al., 2007), which is readily usable by a variety of mainstream pathway analysis softwares, such as Cytoscape (Shannon et al., 2003). Future Prospects

As discussed above, the accuracy and coverage of our predicted interactions may seem to reach a meaningful level to assist plant biologists. However, there are still margins for improvement. For example, our training examples were far from adequate. Our interaction examples covered approximately 1% of the total interactome. Rapid advances in plant molecular biology will certainly provide more interaction examples that are able to better characterize their shared characteristics and produce more accurate prediction models. Also, when the feature strength diagrams were compared among features describing the same type of indirect evidence, those based on experimental data or annotation tended to be more discriminative than those based on predicted data (e.g. the domain interaction features derived from Protein Data Bank complex entries versus the domain interaction features Plant Physiol. Vol. 151, 2009

derived from predicted data). Yet, unfortunately, for a significant fraction of protein pairs, many experimental data-based features were missing. In other words, these features saw no differences in a great number of protein pairs, which considerably limited their strength and hence required prediction-based features to complement. Therefore, more comprehensive indirect data may also help to increase prediction accuracy. In addition, novel approaches for feature extraction (Ramani and Marcotte, 2003), kernel computation (Xu et al., 2008), and SVM optimization (Shen et al., 2007) may offer new opportunities to build a better prediction model. For these reasons, regular updates of PAIR are necessary to take advantage of the increasing amount of interaction information and indirect evidence as well as the quickly advancing computational techniques in feature extraction and model construction. The PAIR database was established as part of the informatics infrastructure supporting the State Key Laboratory of Plant Physiology and Biochemistry, China. Two updates per year are scheduled. Its longterm running cost is covered by a maintenance grant to the State Key Laboratory of Plant Physiology and Biochemistry from the Chinese Ministry of Science and Technology.

CONCLUSION

The PAIR database was constructed with careful consideration of the lessons learned in previous interaction prediction studies. It offers a high-confidence interaction set with an expected accuracy of approximately 48.7% and a high-coverage interaction set that was estimated to include approximately 73.5% of the entire interactome. These data currently represent the most accurate and comprehensive set of predicted Arabidopsis interactions. Detailed confidence measurements for each predicted interaction are also available. PAIR can be accessed freely online. As part of the infrastructure supporting the State Key Laboratory of Plant Physiology and Biochemistry, China, PAIR will be updated and upgraded twice a year. It is our hope that this resource is able to complement the existing informatics framework for Arabidopsis research and offer further help in the study of important molecular mechanisms in this model plant.

MATERIALS AND METHODS SVM SVM is a statistical learning algorithm that learns from a set of training examples and builds a mathematical model to classify new examples. In typical settings, the training set contains two types (classes) of examples, positive examples (e.g. interactions) and negative examples (e.g. noninteractions). Each example is represented by a vector of real numbers (also known as features) describing its characteristics (e.g. indirect evidence of interaction). The SVM algorithm learns the relationship between the class labels and feature values in the training examples and encodes this relationship into a

43

Lin et al.

mathematical model. This model can then be used to predict the class label of any given example based on its feature values. A brief account of the SVM algorithm is provided in Supplemental Text S1.

Preparation of the Example Data Set Known pairs of interacting proteins (positive examples) and noninteracting proteins (negative examples) are required to train an SVM prediction model. To assemble an accurate positive example set, we retrieved and integrated experimentally determined Arabidopsis (Arabidopsis thaliana) physical interactions from IntAct (Kerrien et al., 2006), BIND (Alfarano et al., 2005), BioGRID (Stark et al., 2006), and TAIR (Rhee et al., 2003). Because HTP experiments tend to suffer from high false-positive rates (Deane et al., 2002; von Mering et al., 2002), we excluded interactions only supported by HTP experiments. As shown in Table I, this resulted in 4,139 interactions involving 1,679 proteins. Unlike interacting proteins, it is rare to find reported noninteracting proteins. Several studies have attempted to choose negative examples as pairs of proteins with different cellular localizations. However, this approach did not seem to improve model accuracy. Furthermore, it was argued that this would lead to biased estimations of accuracy, because the constraints placed on the cellular distribution of the negative examples could make the prediction task easier (Ben-Hur and Noble, 2006). Therefore, we followed the approach described by Zhang et al. (2004) to select negative examples as random protein pairs that did not overlap with the positive examples. The resulting negative examples may contain several potential interactions. Yet, given the ratio of interacting protein pairs versus noninteracting protein pairs in yeast (approximately 1:775; Yu et al., 2008), this level of contamination was likely acceptable. In this way, we generated the same number of negative examples. The gold-standard data set consisted of a total of 8,278 protein pairs, half positive, half negative.

Features Four types of informative indirect evidence were chosen for interaction prediction based on previous studies on the strength of various indirect evidence (Qi et al., 2006; Suthram et al., 2006). Fourteen features were computed by different mathematical characterizations of this evidence. Therefore, a 14-dimension feature vector was constructed to represent each protein pair. The first type of indirect evidence was gene coexpression. Interacting proteins are often coexpressed. A large set of Arabidopsis gene expression profiles were available at the TAIR database (ftp://ftp.arabidopsis.org/ home/tair/Microarrays/), which contained 1,436 experiments normalized by a robust multiarray average method (Rhee et al., 2003). Using this data set, we calculated the Pearson’s correlation coefficient for the expression profiles of each protein pair as one feature. The second type of indirect evidence was domain interaction. Protein interactions involve physical interactions between their domains. It has been proposed that novel protein interactions can be inferred by known domain interactions. Here, we annotated the domain composition of each Arabidopsis protein according to the Pfam database (Coggill et al., 2008), which assigned one or more of the 2,854 distinctive domains to one or more of the 20,183 Arabidopsis proteins. Known interacting domains were retrieved from the DOMINE database (Raghavachari et al., 2008), which contained two data sets derived from the Protein Data Bank molecular complex entries and seven data sets predicted by computational approaches. According to each data set, we counted the number of interacting domains in a pair of proteins as one feature. This resulted in nine different features. The third type of indirect evidence was shared annotation. Interacting proteins were expected to share similar molecular functions, to be involved in the same biological processes, and to localize in the same cellular components. GO (Raghavachari et al., 2008) provided a standardized way to annotate proteins from these aspects and therefore offered an excellent chance to compare their annotations. In the GO term tree, strongly related terms were expected to have a shared parent term close to the leaf terms and narrow in concept. The fraction of proteins annotated to this shared parent term and all of its child terms reflects the unexpectedness that two proteins share this particular parent annotation term. Consequently, this fraction could be a suitable score to measure the similarity between two annotation terms. The smaller this score is, the more similar the two terms are. Extending the concept of term similarity, we defined the protein annotation similarity score as the

44

smallest term similarity score between their annotation terms. In TAIR, there were 112,347 annotations assigned to Arabidopsis gene products. By focusing specifically on the terms in one of the three categories (i.e. molecular function, biological process, and cellular component), we computed three protein annotation similarity scores as three features. The last type of indirect evidence was colocalization. Interacting proteins are usually colocalized. We obtained protein subcellular localization information from the Arabidopsis Subcellular Database (Heazlewood et al., 2007). The colocalization feature was then calculated as described by Qiu and Noble (2008). Let f(l) be the fraction of all proteins presented in location l, and let I(i,l) be an indicator function indicating that protein i is observed to localize to l. The colocalization feature value is then calculated as follows: F(A,B) = maxl {2 log(f(l))I(A,l)I(B,l)}. In other words, for protein pair (A,B), this feature was computed as the negative logarithm of the fraction of all proteins presented in the most specific location where both proteins A and B were observed to localize. As shown in Table II, 14 features were computed for each pair of proteins: one feature based on coexpression, one feature based on colocalization, nine features based on domain interaction, and three features based on shared annotation.

Construction of a Prediction Model An SVM model was constructed to predict Arabidopsis protein interactions. The SVM algorithm is well documented (Winters-Hilt et al., 2006). A brief introduction is also provided in Supplemental Text S1. For implementation, we used the public software package LIBSVM version 2.88 (http://www. csie.ntu.edu.tw/~cjlin/libsvm/). After comprehensive evaluation, we chose to use the radial basis function kernel for its best performance (data not shown). In this setup, two parameters, the regularization parameter C and the kernel width parameter g, were optimized by a grid search in an empirical range (0.001 to 1,000 for C and 3 3 1025 to 4 3 104 for g), maximizing the F1 measure as defined below. Model performance with each parameter set was estimated by 10-fold cross-validation. Based on whether a real positive or negative example in the testing set was predicted as positive or negative, the prediction result for each protein pair can be divided into four categories: true positive (TP), when a real positive example was predicted correctly; true negative (TN), when a real negative example was predicted correctly; false positive (FP), when a real negative example was predicted as positive; and false negative (FN), when a real positive example was predicted as negative. These four indicators are collectively called the confusion matrix (Table III). Many accuracy measurements are derived from the confusion matrix, such as the sensitivity [or recall; TP/ (TP + FN)], specificity [TN/(TN + FP)], and precision [TP/(TP + FP)]. In the application of predicting interactions, the ability to accurately and comprehensively recognize potential interactions is critical. Therefore, the measurements of precision and recall are most relevant. Precision conveys the idea of how confident we are when the model predicts an interaction. Recall measures how much proportion of the real interactions can be recognized by the model. Giving equal weights to precision (p) and recall (r), Rijsbergen (1979) introduced the “F1 measure” as the geometric mean [2rp/(r + p)]. This measurement favors a balanced accuracy on both interacting and noninteracting protein pairs. For example, if a model predicted all examples as interactions, the overall accuracy would be the fraction of interactions in the test data set. However, the F1 measure would be only zero, reflecting the worst accuracy on all types of examples.

Comparison of Our Predictions with Other Predicted Interactomes Our predicted interactions were compared with three published data sets. The Interologs data set was obtained as the supplemental data of Geisler-Lee et al. (2007). The AtPID data set was retrieved from the AtPID database (version 3; Cui et al., 2008). And the ATTED-II data set was downloaded from the ATTED-II database (version 4.1; Obayashi et al., 2009). The ATTED-II database did not explicitly suggest a criterion by which potential interactions shall be identified based on the expression correlations they provided. However, this database did provide a set of 48,276 highly coexpressed gene pairs, consisting of each gene paired with three other genes that displayed the top three expression similarities. It is clear that coexpression does not equal to interaction. Yet for the purpose of comparing predicted interactomes with a coexpression network, we took these highly coexpressed gene pairs as the



potential interactions identified by the coexpression method. It shall also be noted that in our comparison, the Interologs set and the AtPID set included all of their predicted interactions. Their gold-standard examples not recognized by their respective prediction methods were not counted. The overlaps between data sets were counted, and the probabilities of these overlaps appearing by chance were calculated as follows. For any two data sets, with 10,000 time iterations, we first estimated the distribution of overlaps between two random set of protein pairs having the same sizes and protein distributions as the two data sets being compared. Assuming a normal distribution, the probability of the observed overlap between these two data sets was then computed against the distribution of random overlaps. This probability was taken as the significance value for these two data sets being correlated.

Supplemental Data The following materials are available in the online version of this article. Supplemental Figure S1. ROC curve analysis of four prediction methods. Supplemental Figure S2. Double-strand break repair model for meiotic recombination. Supplemental Table S1. Rediscovery of newly reported interactions by our prediction models. Supplemental Table S2. Eight experimentally confirmed interactions in our predicted meiotic recombination subnetwork. Supplemental Table S3. Forty predicted interactions with known homologous interactions in our predicted meiotic recombination subnetwork. Supplemental Table S4. The predicted meiotic recombination subnetwork. Supplemental Text S1. A brief introduction to the SVM algorithm and the complete set of feature analysis diagrams.

ACKNOWLEDGMENTS We thank Lu Zhang for helpful discussions on feature extraction and Li Chen for help in Web site construction. Received May 11, 2009; accepted July 6, 2009; published July 10, 2009.

LITERATURE CITED Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, et al (2005) The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 33: D418–D424 Alvarez-Venegas R, Pien S, Sadder M, Witmer X, Grossniklaus U, Avramova Z (2003) ATX-1, an Arabidopsis homolog of trithorax, activates flower homeotic genes. Curr Biol 13: 627–637 Ben-Hur A, Noble WS (2006) Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinformatics (Suppl 1) 7: S2 Ben-Yacoub S, Abdeljaoued Y, Mayoraz E (1999) Fusion of face and speech data for person identity verification. IEEE Trans Neural Netw 10: 1065–1074 Bhardwaj N, Lu H (2005) Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics 21: 2730–2738 Borde V, Robine N, Lin W, Bonfils S, Geli V, Nicolas A (2009) Histone H3 lysine 4 trimethylation marks meiotic recombination initiation sites. EMBO J 28: 99–111 Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97: 262–267 Burbidge R, Trotter M, Buxton B, Holden S (2001) Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput Chem 26: 5–14 Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ (2003) SVM-Prot: Web-based


support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 31: 3692–3697 Chatr-aryamontri ACA, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G (2007) MINT: the Molecular INTeraction database. Nucleic Acids Res 35: D572–D574 Chen C, Zhang W, Timofejeva L, Gerardin Y, Ma H (2005) The Arabidopsis ROCK-N-ROLLERS gene encodes a homolog of the yeast ATPdependent DNA helicase MER3 and is required for normal meiotic crossover formation. Plant J 43: 321–334 Coggill P, Finn RD, Bateman A (2008) Identifying protein domains with the Pfam database. Curr Protoc Bioinformatics 23: 2.5.1–2.5.17 Cui J, Li P, Li G, Xu F, Zhao C, Li Y, Yang Z, Wang G, Yu Q, Shi T (2008) AtPID: Arabidopsis thaliana Protein Interactome Database. An integrative platform for plant systems biology. Nucleic Acids Res 36: D999– D1008 Deane CM, Salwinski L, Xenarios I, Eisenberg D (2002) Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 1: 349–356 de Vel O, Anderson A, Corney M, Mohay G (2001) Mining e-mail content for author identification forensics. Sigmod Record 30: 55–64 Ding CH, Dubchak I (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17: 349–358 Dray E, Siaud N, Dubois E, Doutriaux MP (2006) Interaction between Arabidopsis Brca2 and its partners Rad51, Dmc1, and Dss1. Plant Physiol 140: 1059–1069 Emmanuel E, Yehuda E, Melamed-Bessudo C, Avivi-Ragolsky N, Levy AA (2006) The role of AtMSH2 in homologous recombination in Arabidopsis thaliana. EMBO Rep 7: 100–105 Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30: 1575–1584 Geisler-Lee J, O’Toole N, Ammar R, Provart NJ, Millar AH, Geisler M (2007) A predicted interactome for Arabidopsis. Plant Physiol 145: 317–329 Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, et al (2003) A protein interaction map of Drosophila melanogaster. Science 302: 1727–1736 Grelon M, Vezon D, Gendrot G, Pelletier G (2001) AtSPO11-1 is necessary for efficient meiotic recombination in plants. EMBO J 20: 589–600 Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29–36 Hartung F, Plchova H, Puchta H (2000) Molecular characterisation of RecQ homologues in Arabidopsis thaliana. Nucleic Acids Res 28: 4275–4282 Hartung F, Suer S, Puchta H (2007) Two closely related RecQ helicases have antagonistic roles in homologous recombination and DNA repair in Arabidopsis thaliana. Proc Natl Acad Sci USA 104: 18836–18841 Heazlewood JL, Verboom RE, Tonti-Filippini J, Small I, Millar AH (2007) SUBA: the Arabidopsis Subcellular Database. Nucleic Acids Res 35: D213–D218 Higgins JD, Armstrong SJ, Franklin FC, Jones GH (2004) The Arabidopsis MutS homolog AtMSH4 functions at an early step in recombination: evidence for two classes of recombination in Arabidopsis. Genes Dev 18: 2557–2570 Hua S, Sun Z (2001) A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol 308: 397–407 Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302: 449–453 Karlsen RE, Gorsich DJ, Gerhart GR (2000) Target classification via support vector machines. Opt Eng 39: 704–711 Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, et al (2006) IntAct: open source resource for molecular interaction data. Nucleic Acids Res 35: D561–D565 Kerrien S, Orchard S, Montecchi-Palazzi L, Aranda B, Quinn AF, Vinod N, Bader GD, Xenarios I, Wojcik J, Sherman D, et al (2007) Broadening the horizon: level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol 5: 44 Kim KI, Jung K, Park SH, Kim HJ (2001) Support vector machine-based text detection in digital video. Pattern Recognit 34: 527–529

45

Lin et al.

Kobbe D, Blanck S, Demand K, Focke M, Puchta H (2008) AtRECQ2, a RecQ helicase homologue from Arabidopsis thaliana, is able to disrupt various recombinogenic DNA structures in vitro. Plant J 55: 397–405 Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al (2004) A map of the interactome network of the metazoan C. elegans. Science 303: 540–543 Liong SY, Sivapragasam C (2002) Flood stage forecasting with support vector machines. J Am Water Resour Assoc 38: 173–186 Obayashi T, Hayashi S, Saeki M, Ohta H, Kinoshita K (2009) ATTED-II provides coexpressed gene networks for Arabidopsis. Nucleic Acids Res 37: D987–D991 O’Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33: D476–D480 Qi Y, Bar-Joseph Z, Klein-Seetharaman J (2006) Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins 63: 490–500 Qiu J, Noble WS (2008) Predicting co-complexed protein pairs from heterogeneous data. PLOS Comput Biol 4: e1000054 Raghavachari B, Tasneem A, Przytycka TM, Jothi R (2008) DOMINE: a database of protein domain interactions. Nucleic Acids Res 36: D656–D661 Ramani AK, Marcotte EM (2003) Exploiting the co-evolution of interacting proteins to discover interaction specificity. J Mol Biol 327: 273–284 Revenkova E, Jessberger R (2005) Keeping sister chromatids together: cohesins in meiosis. Reproduction 130: 783–790 Rhee SYBW, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, et al (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 31: 224–228 Rhodes DRTS, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM (2005) Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23: 951–959 Rijsbergen CJ (1979) Information Retrieval. Butterworths, London Salwinski LMC, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32: D449–D451 Scott MS, Barton GJ (2007) Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8: 239 Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13: 2498–2504 Shen Q, Shi WM, Kong W, Ye BX (2007) A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification. Talanta 71: 1679–1683 Siaud N, Dray E, Gy I, Gerard E, Takvorian N, Doutriaux MP (2004) Brca2 is involved in meiosis in Arabidopsis thaliana as suggested by its interaction with Dmc1. EMBO J 23: 1392–1401 Stacey NJ, Kuromori T, Azumi Y, Roberts G, Breuer C, Wada T, Maxwell A, Roberts K, Sugimoto-Shirasu K (2006) Arabidopsis SPO11-2 functions with SPO11-1 in meiotic recombination. Plant J 48: 206–216 Stark CBB, Requly T, Boucher L, Breitkreutz A, Tyers M (2006) BioGRID: a

46

general repository for interaction datasets. Nucleic Acids Res 34: D535–D539 Suthram SST, Ruppin E, Sharan R, Ideker T (2006) A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics 7: 360 Szostak JW, Orr-Weaver TL, Rothstein RJ, Stahl FW (1983) The doublestrand-break repair model for recombination. Cell 33: 25–35 Tsesmetzis N, Couchman M, Higgins J, Smith A, Doonan JH, Seifert GJ, Schmidt EE, Vastrik I, Birney E, Wu G, et al (2008) Arabidopsis reactome: a foundation knowledgebase for plant systems biology. Plant Cell 20: 1426–1436 Uetz PGL, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, et al (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 503: 623–627 Vapnik VN (1998) Statistical Learning Theory. Wiley, New York Vapnik VN (2000) The Nature of Statistical Learning Theory, Ed 2. Springer, New York von Mering CKR, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417: 399–403 Wang Y, Mallya SM, Sikpi MO (2000) Calmodulin antagonists and cAMP inhibit ionizing-radiation-enhancement of double-strand-break repair in human cells. Mutat Res 460: 29–39 Wijeratne, Asela J, Ma, H (2007) Genetic analyses of meiotic recombination in Arabidopsis. Journal of Integrative Plant Biology 49: 1199–1207 Winters-Hilt S, Yelundur A, McChesney C, Landry M (2006) Support vector machine implementations for classification and clustering. BMC Bioinformatics (Suppl 2) 7: S4 Xu H, Lin M, Wang W, Li Z, Huang J, Chen Y, Chen X (2007) Learning the drug target-likeness of a protein. Proteomics 7: 4255–4263 Xu JH, Li F, Sun QF (2008) Identification of microRNA precursors with support vector machine and string kernel. Genomics Proteomics Bioinformatics 6: 121–128 Xue Y, Li ZR, Yap CW, Sun LZ, Chen X, Chen YZ (2004) Effect of molecular descriptor feature selection in support vector machine classification of pharmacokinetic and toxicological properties of chemical agents. J Chem Inf Comput Sci 44: 1630–1638 Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, et al (2008) Highquality binary protein interaction map of the yeast interactome network. Science 322: 104–110 Yuan Z, Burrage K, Mattick JS (2002) Prediction of protein solvent accessibility using support vector machines. Proteins 48: 566–570 Zhang LV, Wong SL, King OD, Roth FP (2004) Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5: 38 Zhao C, Zhang H, Zhang X, Zhang R, Luan F, Liu M, Hu Z, Fan B (2006a) Prediction of milk/plasma drug concentration (M/P) ratio using support vector machine (SVM) method. Pharm Res 23: 41–48 Zhao CY, Zhang HX, Zhang XY, Liu MC, Hu ZD, Fan BT (2006b) Application of support vector machine (SVM) for prediction toxic activity of different data sets. Toxicology 217: 105–119