The Plant Journal (2012) 71, 1038–1050
doi: 10.1111/j.1365-313X.2012.05055.x
TECHNICAL ADVANCE
BAR expressolog identification: expression profile similarity ranking of homologous genes in plant species
Rohan V. Patel1,2, Hardeep K. Nahal1,2, Robert Breit1,2 and Nicholas J. Provart1,2,*
1 Department of Cell & Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada, and
2 Centre for the Analysis of Genome Evolution and Function, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada
Received 7 June 2011; revised 27 April 2012; accepted 15 May 2012; published online 12 July 2012.
*For correspondence (e-mail [email protected]).
SUMMARY
Large numbers of sequences are now readily available for many plant species, allowing easy identification of homologous genes. However, orthologous gene identification across multiple species is made difficult by evolutionary events such as whole-genome or segmental duplications. Several developmental atlases of gene expression have been produced in the past couple of years, and it may be possible to use these transcript abundance data to refine ortholog predictions. In this study, clusters of homologous genes between seven plant species – Arabidopsis, soybean, Medicago truncatula, poplar, barley, maize and rice – were identified. Following this, a pipeline to rank homologs within gene clusters by both sequence and expression profile similarity was devised by determining equivalent tissues between species, with the best expression profile match being termed the ‘expressolog’. Five electronic fluorescent pictograph (eFP) browsers were produced as part of this effort, to aid in visualization of gene expression data and to complement existing eFP browsers at the Bio-Array Resource (BAR). Within the eFP browser framework, these expression profile similarity rankings were incorporated into an Expressolog Tree Viewer to allow cross-species homolog browsing by both sequence and expression pattern similarity. Global analyses showed that orthologs with the highest sequence similarity do not necessarily exhibit the highest expression pattern similarity. Other orthologs may show different expression patterns, indicating that such genes may require re-annotation or more specific annotation. Ultimately, it is envisaged that this pipeline will aid in improvement of the functional annotation of genes and translational plant research.
Keywords: orthologs, expression pattern similarity, poplar, Arabidopsis, rice, maize, Medicago truncatula, technical advance.
INTRODUCTION
The increasing availability of sequence data for genomes of multiple species allows us to compute evolutionary trajectories on a genome-wide basis. Such data may also be of value when attempting to trace the ancestry of genes across multiple species in order to assign orthology. The concept of orthology is one that is central to comparative genomics. The original phylogenetic definition of orthology stated that orthologous genes arise due to speciation events (Fitch, 1970). A widely recognized assumption related to this definition is that orthologous genes exhibit conserved functionality as dictated by the final polypeptides encoded by the corresponding genes, because the sequences are similar.
© 2012 The Authors The Plant Journal © 2012 Blackwell Publishing Ltd
Considering the tissue-specific expression patterns of related genes could contribute to a better understanding of gene function, especially given the high number of incidences of whole genome and segmental duplications in plants (Arabidopsis Genome Initiative, 2000; Jiao et al., 2011). Orthologous genes across multiple species can be identified through their level of sequence similarity. A variety of tools exist for computational detection of orthologs, and these can be split into two major categories: graph-based methods and tree-based methods. Tree-based methods (e.g. Orthostrapper and RIO) use explicit evolutionary models to infer orthology between genes of multiple species, while graph-based methods (e.g. OrthoMCL and InPARANOID) use sequence similarity alone to infer orthology (Kuzniar et al., 2008).
Databases of orthologous genes across a variety of species computed using various methods already exist. These include the National Center for Biotechnology Information Clusters of Orthologous Groups (COG) database (Tatusov et al., 1997), as well as the National Center for Biotechnology Information euKaryotic Orthologous Group (KOG) database (Tatusov et al., 2003). Additionally, online databases of orthologs between a variety of species are provided by OrthoMCL (Li et al., 2003), as well as InPARANOID (Remm et al., 2001), which contain orthologs that were determined using the respective algorithms. Ortholog databases that include information more specific to plants include Phytozome (http://www.phytozome.net) and Gramene (Liang et al., 2008). The former database uses sequence similarity to identify orthologs, while the latter makes use of phylogenetic trees to infer orthology. Another platform for viewing information concerning orthologous clusters of genes is PLAZA (Proost et al., 2009), an online resource for plant comparative genomics. Pre-computed orthologous groups in this resource were identified through sequence similarity, and were then fed into a phylogenetic pipeline to improve confidence in assignments of homology. CoGe, a platform for visualization of homologous relationships between genes, differs from others by using synteny in order to identify putative orthologous relationships (Lyons and Freeling, 2008).
Tools have been developed that permit the exploration of both homologous sequences and co-expression neighbors. GeneCAT (Mutwil et al., 2008) enables users to combine BLAST and condition-independent (Usadel et al., 2009) co-expression analyses to infer functional equivalency between genes.
In this way, clusters of orthologs can be identified that share functional equivalency. PlaNet (Mutwil et al., 2011) is a more recent implementation of this idea, extended to cover seven plant species.
Ortholog identification may be made difficult by various events in a species’ evolutionary history. Plant species such as Arabidopsis thaliana, as well as the six other plant species that are the focus of this project, are known to have undergone multiple whole-genome duplication and segmental duplication events (Arabidopsis Genome Initiative, 2000; Jiao et al., 2011). These types of events can create one-to-many or many-to-many orthologous relationships, with duplicated genes within a species becoming in-paralogs of one another. This further complicates the process of identifying orthologs, which is also made difficult by the different evolutionary trajectories that duplicated genes may follow, namely neo-functionalization, sub-functionalization, non-functionalization or retention (Ohno, 1970).
Although it is possible, through a number of tools such as those described above, to detect orthologs through their level of sequence similarity, it may also be possible to use gene expression data to assist in ortholog identification. It is widely assumed that orthologs in different species perform similar functions, and for this assumption to hold true they should have similar gene expression profiles, i.e. they should be expressed in similar tissues. If this is not the case, such data may also be used to assess the degree of sub-functionalization of duplicated genes.
The species considered in this study were Arabidopsis thaliana (Schmid et al., 2005), Medicago truncatula (Benedito et al., 2008), Populus trichocarpa (Wilkins et al., 2009a), Glycine max (Libault et al., 2010), Oryza sativa (Jain et al., 2007; Li et al., 2007), Hordeum vulgare (Druka et al., 2006) and Zea mays (Sekhon et al., 2011), as developmental atlases of gene expression are available for these species, generated using microarrays or RNA-seq. Given the unknown nature of the exact evolutionary relationships between the genes in each gene cluster, we use the term ‘homolog’ rather than the strict definition of ‘ortholog’ to describe related genes from different species in gene clusters. Here we created a pipeline to identify homologous gene clusters in these seven species by sequence, and to rank expression profile similarity based on expression patterns in equivalent tissues between species. The top-ranked homolog by expression profile similarity is termed the ‘expressolog’, but the expression pattern similarity scores for all homologs can also be used to help identify potential redundancies in gene function due to duplication events.
The results of this pipeline are visualized using a web-based tool, the Expressolog Tree Viewer, which displays the relationships between sequence and expression pattern similarity for a given gene from any of the seven species, highlighting the expressologs for each species.
RESULTS
Identification of homologous clusters and creation of homologous pair sets
In order to compute gene expression divergence within gene families, it was first necessary to determine clusters of homologs. We used the tool OrthoMCL (Li et al., 2003; Chen et al., 2007) to compute these homologous clusters of genes, by using protein sequence files in FASTA format for all seven species of interest. We identified 49 495 clusters of homologous genes across the seven species studied. Of these clusters, 26 016 contained genes from just one species, i.e. contained only paralogs. Subsequently, 9907 clusters were found to contain genes from two species, 5130 clusters contained genes from three species, 2267 clusters contained genes from four species, 992 clusters contained genes from five species, 2340 clusters contained genes from six species, and 2843
clusters contained genes from all seven species. Of the 26 016 clusters containing genes from one taxon, 8155 clusters contained Z. mays paralogs, 2817 clusters contained O. sativa paralogs, 1177 clusters contained H. vulgare paralogs, 3771 clusters contained M. truncatula paralogs, 3594 clusters contained P. trichocarpa paralogs, 3291 clusters contained G. max paralogs, and 3211 clusters contained A. thaliana paralogs. The top right half of Table 1 provides a summary of the number of one-to-one homologous sequence pairs between each species pair.
Duplication events such as whole-genome duplications, segmental duplications and tandem duplications can result in difficulties when inferring orthology. OrthoMCL identifies probable orthologs as well as in-paralogs (genes that arose from duplication events following a speciation event, i.e. recent paralogs) across the species of interest. These in-paralogs are found to be more similar to each other than to any sequence from any other genome. For tests involving computation of tissue equivalencies prior to ranking expression profile similarity, we also generated two other sets of data for each species pair to address the issue of uneven vector lengths caused by such duplication events. The first set consisted of a collection of all one-to-one homologs plus homologous pairs generated by randomly picking one
sequence from the ‘many’ side of one-to-many homologous clusters and pairing it with its ‘one’ partner. The bottom left half of Table 1 shows the number of sequences in each set for each species pair. The second set consisted of the first set plus homologous pairs generated by randomly picking one sequence from each side of the many-to-many homologous clusters (Table S1).
Identification of tissue equivalencies: effect of homolog pair set size
Prior to ranking the homologs from multiple species by expression profile similarity, equivalent tissues between each pair of species being compared were deduced. Comparisons between expression profiles of genes in multiple species are made more difficult by the differences in physiology and anatomy of the species of interest. One way to overcome this problem is to use the available gene expression data to compute correlations between plant tissues (Cho et al., 2002). This can provide a means to quantify the relationship between tissues in multiple species, in the absence of accurate sample annotation by Plant Ontology terms (Avraham et al., 2008) for tissue expression data. An example of how tissue equivalencies may be calculated between species is shown in Figure 1. For correlations
Table 1 Summary of results and datasets
| | Arabidopsis thaliana | Populus trichocarpa | Medicago truncatula | Glycine max | Hordeum vulgare | Oryza sativa | Zea mays |
| Number of isoforms [a] | 35 386 | 45 033 | 53 423 | 55 787 | 22 634 | 51 258 | 106 046 |
| Number of genes [b] | 27 415 | 40 668 | 50 962 | 46 367 | 22 634 | 40 655 | 77 355 |
| Expression platform [c] | GPL198 | GPL4359 | GPL4652 | RNA-seq | GPL1340 | GPL2025 | GPL12620 |
| Number of probe sets | 22 810 | 61 413 | 50 902 [e] | 66 210 [f] | 22 840 | 38 548 [g] | 70 062 [h] |
| Number of genes mapping to probe sets [d] | 23 861 | 32 533 | 21 654 | 66 210 | 21 071 | 33 982 | 45 826 |
| A. thaliana | 19 485 | 3380 (2491) | 3821 (2986) | 1606 (1606) | 2292 (2144) | 3669 (3366) | 2998 (2998) |
| P. trichocarpa | 7094 (5610) | 29 914 | 2935 (2690) | 1604 (1604) | 2269 (1550) | 2195 (1871) | 2209 (2209) |
| M. truncatula | 7842 (4653) | 6898 (5378) | 26 127 | 1898 (1898) | 1798 (1733) | 3023 (2498) | 2314 (2118) |
| G. max | 6585 (6446) | 5971 (5971) | 8751 (8333) | 38 784 | 706 (687) | 1237 (1198) | 1155 (1049) |
| H. vulgare | 3569 (3480) | 3991 (3270) | 2734 (2692) | 3317 (3293) | 13 699 | 5020 (4768) | 2290 (2287) |
| O. sativa | 5262 (4899) | 5145 (4407) | 4187 (3608) | 5013 (4974) | 6888 (6636) | 23 367 | 8524 (8524) |
| Z. mays | 4392 (4159) | 3605 (3605) | 3942 (3728) | 4098 (3731) | 5696 (5688) | 12 428 (11 878) | 38 221 |
Top right of diagonal: total number of one-to-one homolog pairs. Bottom left of diagonal: sum of the number of one-to-one and randomly selected one-to-many homolog pairs, used for tissue equivalency calculations. Rows 1–5 show the number of protein isoform sequences covering the denoted number of genes in each BLAST dataset, the expression platform, number of probe sets on it, and the number of genes mapping to the probe sets. Diagonal: number of sequences not in singleton groups within a given species. The numbers in parentheses denote the number of homolog pairs with gene expression information for both members of the pair. Only primary transcript sequence and expression data were used for analyses.
[a] Number of protein sequences in the BLAST dataset, including isoforms.
[b] Number of genes represented in the BLAST dataset.
[c] GEO microarray platform identifier for expression analyses, unless RNA-seq was used.
[d] Number of different genes/primary transcripts present in mapping files (see Experimental procedures).
[e] 61 278 probe sets in total for M. truncatula, M. sativa and S. meliloti genes.
[f] Implied number of genes detected by RNA-seq from http://soybase.org/soyseq/; 69 145 is the number from http://digbio.missouri.edu/soybean_atlas/. Note that 69 145 genes were predicted by Schmutz et al. (2010), 46 403 with high confidence.
[g] 57 381 probe sets in total for rice indica and japonica sub-strains, the indica variety was used to generate the expression atlas.
[h] Number for expression measurements.
Figure 1. Method for computing tissue equivalencies, using a hypothetical dataset. Expression profiles are found for tissue 1A from species A, and tissues 1B–4B from species B using a set of homolog pairs. The similarity between these expression profiles is correlated using a correlation metric such as Spearman’s correlation coefficient (SCC). The tissue from species B most correlated with that from species A is considered the most equivalent tissue.
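The calculation outlined in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's own code: the expression values are invented, the function names (`average_ranks`, `spearman`, `most_equivalent_tissue`) are hypothetical, and every vector is assumed to list expression values for the same homolog pairs in the same row order.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's correlation coefficient (SCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def most_equivalent_tissue(query_vector, candidate_tissues):
    """Return (best tissue name, {tissue name: SCC}) for one query tissue."""
    scores = {name: spearman(query_vector, vector)
              for name, vector in candidate_tissues.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical data: one row per homolog pair, same row order throughout.
tissue_1a = [10, 200, 35, 80, 5]  # species A, tissue 1A
species_b = {
    "1B": [12, 180, 40, 90, 3],
    "2B": [150, 20, 60, 10, 300],
    "3B": [50, 55, 45, 60, 52],
    "4B": [7, 9, 300, 8, 6],
}

best, scores = most_equivalent_tissue(tissue_1a, species_b)
print(best, {name: round(score, 2) for name, score in scores.items()})
```

Because both vectors must have one entry per homolog pair, the pair set has to be fixed before any tissue is compared; this is why uneven cluster sizes need to be resolved first.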
to be calculated, it is necessary to have the same number of gene pairs (i.e. rows in Figure 1) to generate the corresponding tissue expression vectors. We tested four sets of gene pairs: the most sequence-similar MYB homologs as an example of a small dataset, and then the three datasets described above: all one-to-one homologs, all one-to-one homologs plus random pairs picked from one-to-many homologous clusters, and finally all one-to-one homologs plus random pairs picked from both the one-to-many and many-to-many homologous clusters. In order to see which set performed the best for the purposes of tissue equivalency calculations, Spearman’s rank correlations (SCC scores) were first calculated between 28 tissues of G. max (Libault et al., 2010) and all tissues in the A. thaliana AtGenExpress Development Atlas (Schmid et al., 2005). We used a semi-manual scoring system to assess whether the best tissue as assessed by the SCC score matched that in Arabidopsis, by assigning a value of 1 if the Plant Ontology term (Avraham et al., 2008) matched, a value of 0.5 if we were unsure whether the tissues looked similar based on their description, and zero if they were clearly not similar. The results are shown in Table 2 and Table S2. The set of homolog pairs comprising all one-to-one homologs plus those generated by randomly picking one of the ‘many’ in one-to-many homolog clusters gave the best tissue equivalency identification performance, with 22.5 of 28 tissues positively identified.
Table 2 Scoring tissue equivalencies between Arabidopsis thaliana and Glycine max using different homolog pair set sizes
| | Number of pairs | Score |
| Number of tissues examined | | 28 |
| Best sequence matches, one gene family (MYBs) | 55 | 15.5 |
| One-to-one homolog pairs | 1606 | 17.5 |
| One-to-one homolog pairs plus random one-to-many pairs | 6446 | 22.5 |
| One-to-one homolog pairs plus random one-to-many pairs plus random many-to-many pairs | 11 413 | 21.5 |
In order to assess the effect of random sampling for generating the one-to-many homolog pairs, we generated 100 sets of randomly sampled one-to-many pairs and many-to-many pairs and performed the tissue equivalency calculation for each set. The results of this experiment are shown in Table 3 for the poplar–Arabidopsis tissue equivalency analysis. Little variation is introduced by the procedure used to randomly create homolog pairs from the one-to-many homolog cluster space, with the standard deviation for the SCC values being approximately two orders of magnitude less than the SCC values themselves. Thus we used just a single set of randomly created homolog pairs from the one-to-many homolog cluster space for further analyses.
Identification of tissue equivalencies: effect of correlation metrics
Three correlation metrics were examined, the Pearson correlation coefficient (PCC), the uncentred Pearson correlation coefficient (UPCC) and Spearman’s correlation coefficient (SCC), using two datasets, in order to decide which metric would be best suited to the analysis (see Usadel et al., 2009 for a discussion of the advantages and disadvantages of each). A larger dataset incorporating all 7094 pairs of one-to-one homologs plus randomly selected one-to-many pairs from homologous clusters between A. thaliana and P. trichocarpa was used for this experiment. All three correlation metrics mentioned above were used to compute tissue equivalencies. A summary of the results from these analyses is given in Figure 2 (for use of the SCC metric) and Data S1, Figures S1 and S2 (for use of the PCC and UPCC metrics, respectively).
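To make the difference between the three metrics concrete, the following sketch computes PCC, UPCC and SCC for one pair of tissue vectors. It is an illustration only: the function names and the data are hypothetical, UPCC is taken in its usual form (the Pearson formula with both means fixed at zero), and the ranking helper assumes there are no tied values.

```python
from math import sqrt


def pcc(x, y):
    """Pearson correlation coefficient (values centred on their means)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def upcc(x, y):
    """Uncentred Pearson correlation: same formula with the means set to zero."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def simple_ranks(values):
    """1-based ranks; assumes no tied values, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def scc(x, y):
    """Spearman's correlation coefficient: PCC computed on the ranks."""
    return pcc(simple_ranks(x), simple_ranks(y))


# Hypothetical tissue vectors; the last gene pair is an extreme value in tissue_b.
tissue_a = [1.0, 2.0, 3.0, 4.0, 5.0]
tissue_b = [1.5, 2.1, 2.9, 4.2, 500.0]

results = {
    "PCC": pcc(tissue_a, tissue_b),
    "UPCC": upcc(tissue_a, tissue_b),
    "SCC": scc(tissue_a, tissue_b),
}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the two vectors rise together, so SCC is at its maximum, while the single extreme value pulls PCC and UPCC well below 1. SCC depends only on rank order, which makes it less sensitive to a few very highly expressed genes.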
Based on these analyses, it was decided that the output from the SCC analysis gave the most coherent results, using a semi-qualitative scoring method similar to that described to generate Table 2 (data not shown). Similar results were seen for other test species pairs (see Table S2; UPCC and PCC data not shown). Therefore, tissue equivalencies were computed between all species using the dataset of all one-to-one homologs plus the randomly generated homolog pairs from one-to-many homologous clusters and the SCC metric. The results of these analyses can be found in Data S1, Figures S3–S22. Between six and 47 tissue samples were available for use for these analyses. Examples of these analyses show that rice anthers are equivalent to maize anthers (Data S1, Figure S21), 16-day-old M. truncatula seeds are equivalent to 28-day-old soybean seeds (Data S1, Figure S14), and carpels from stage 12 flowers of Arabidopsis are equivalent to rice ovaries (Data S1, Figure S4). The complete results of these analyses are also available at http://bar.utoronto.ca/expressolog_treeviewer/cgi-bin/tissueequivalency.cgi.
Table 3 Summary of tissue equivalency calculations
Tissue equivalency calculated with: (A) 2491 one-to-one homolog pairs; (B) 5610 one-to-one homolog pairs plus random one-to-many pairs (× 100); (C) 9153 one-to-one homolog pairs plus random one-to-many pairs plus random many-to-many pairs (× 100)
| Poplar tissue | (A) Arabidopsis tissue | (A) SCC value | (B) Arabidopsis tissue | (B) SCC value | (B) SD | (C) Arabidopsis tissue | (C) SCC value | (C) SD |
| Male catkin | Flowers stage 12, stamens | 0.49 | Flower stage 12, stamens | 0.46 | 0.0057 | Flower stage 12, stamens | 0.45 | 0.0068 |
| Female catkin | Flowers stage 12, stamens | 0.52 | Flower stage 12, stamens | 0.48 | 0.0051 | Flower stage 12, stamens | 0.47 | 0.0059 |
| Young leaf | Rosette leaf 12 | 0.56 | Vegetative rosette | 0.52 | 0.0058 | Vegetative rosette | 0.51 | 0.0068 |
| Mature leaf | Cotyledon | 0.62 | Rosette leaf 8 | 0.54 | 0.0052 | Rosette leaf 8 | 0.52 | 0.0054 |
| Xylem | Root 1.04 | 0.47 | Root 1.04 | 0.44 | 0.0057 | Root 1.04 | 0.42 | 0.0051 |
| Root | Root 1.09 | 0.55 | Root 1.04 | 0.50 | 0.0053 | Root 1.04 | 0.48 | 0.0066 |
Table shows equivalent tissues between poplar and Arabidopsis thaliana computed using all one-to-one homolog pairs, homolog pair sets comprising all one-to-one homolog pairs plus pairs picked randomly from one-to-many homolog groups, and homolog pair sets comprising all one-to-one homolog pairs plus pairs picked randomly from one-to-many and many-to-many homology groups. In the latter two instances, 100 homolog pair sets were generated, the SCC values for all tissues were computed 100 times, and the mean and standard deviation are shown for the best ranked tissue considered equivalent. A complete list of data for computing equivalencies is presented in Data S1, Figures S3–S22, and the equivalencies may be viewed at http://bar.utoronto.ca/expressolog_treeviewer/cgi-bin/tissueequivalency.cgi. Root 1.04 and 1.09 refer to growth stages, see Schmid et al. (2005).
Expression profile similarity ranking of homologs
Having computed tissue equivalency analyses for all species being studied, it was possible to rank homologs within each cluster of genes by expression profile similarity, as per the schematic shown in Figure 3. The tissue equivalencies were used as comparable data points for expression profiles between species. The SCC metric was used to correlate expression profiles of homologs across all equivalent tissues for sequences in each of the 49 495 clusters if expression information was available for them. We used the SCC metric to minimize the effect of outliers (Usadel et al., 2009).
Expressolog tree viewer
In order to be able to visualize the combined sequence similarity and expression similarity data, we implemented an Expressolog Tree Viewer to display a phylogenetic tree of sequence relationships and corresponding expression pattern similarities. Circles of differing sizes and shades of grey beside the names of the sequences denote the degree of sequence similarity, while circles of differing sizes in shades of red, yellow and blue denote expression pattern similarity or dissimilarity. Hovering over these circles with the mouse pointer provides additional information, such as the numerical value for the degree of similarity. For cases where no expression information is available, a question mark is displayed instead of the expression similarity score. An example Expressolog Tree Viewer output is shown for P. trichocarpa homologs of the A. thaliana ATELF5A-1 gene
(At1g13950) in Figure 4, together with eFP browser outputs of these genes (Winter et al., 2007; Wilkins et al., 2009a). ATELF5A-1 is expressed most highly in the stamens and pollen, and to a lesser extent in the leaves of A. thaliana. There are four homologs of this gene in poplar, and our pipeline was able to detect a homolog with strong expression in catkins, which was flagged as the expressolog of ATELF5A-1, highlighted with a yellow background on the Expressolog Tree Viewer. An analysis of the other P. trichocarpa genes in this cluster showed varying expression patterns. Additionally, it was found that variation in sequence similarity does not necessarily correlate with variation in expression pattern similarity between P. trichocarpa homologs when compared to the query gene. The levels of sequence similarity and expression profile similarity of poplar homologs to ATELF5A-1 in Figure 4 and Table 4 provide an example of this. In Table 4, the poplar homologs of ATELF5A-1 are ranked by the SCC value of expression profile similarity, and one can clearly see that the level of divergence of expression profiles does not correlate with the level of divergence of sequence similarity for each of the ATELF5A-1 homologs.
Global analyses of sequence and expression pattern similarity across species
For all our data from homologous clusters from two or more species, we determined how often the expressolog, i.e. the homolog with the most similar pattern of expression to the query gene, is also the most similar at the level of sequence. The results are summarized in Table 5; further details are given in Table S3. The number of times for which this is not
Figure 3. Method for ranking of homologs by expression profile similarity using a hypothetical dataset. Expression profiles for all genes within a given homologous cluster are retrieved across all equivalent tissues. The expression profile of gene A from species A is compared to the profiles of its homologous genes A’, A’’ and A’’’ from species B in equivalent tissues using a correlation metric such as Spearman’s correlation coefficient (SCC). The top-most ranked homolog in terms of its correlation coefficient score to gene A is termed the ‘expressolog’.
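The ranking step in Figure 3 can be sketched as follows. This is a hedged illustration rather than the published implementation: the profiles are invented, the names `scc` and `rank_homologs` are hypothetical, each profile is assumed to be ordered by the same list of equivalent tissues, and the ranking helper assumes no tied expression values. A fourth homolog without expression data is included to show how unscored genes can be set aside.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; assumes no tied values, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        out[index] = rank
    return out


def scc(x, y):
    """Spearman's correlation coefficient as Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    mean = (len(rx) + 1) / 2  # mean of the ranks 1..n when there are no ties
    num = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mean) ** 2 for a in rx) * sum((b - mean) ** 2 for b in ry))
    return num / den


def rank_homologs(query_profile, homolog_profiles):
    """Sort homologs by SCC to the query, highest first.

    Homologs without expression data (profile is None) are returned
    separately, since no similarity score can be computed for them.
    """
    scored = [(name, scc(query_profile, profile))
              for name, profile in homolog_profiles.items()
              if profile is not None]
    scored.sort(key=lambda item: item[1], reverse=True)
    unscored = [name for name, profile in homolog_profiles.items()
                if profile is None]
    return scored, unscored


# Hypothetical expression profiles across four equivalent tissues.
gene_a = [5, 50, 20, 300]
homologs = {
    "A'": [8, 40, 25, 280],
    "A''": [100, 30, 60, 10],
    "A'''": [10, 20, 30, 40],
    "A_no_data": None,
}

ranking, no_data = rank_homologs(gene_a, homologs)
expressolog = ranking[0][0]  # top-ranked homolog by expression similarity
print(expressolog, ranking, no_data)
```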
Figure 2. Poplar–Arabidopsis tissue equivalencies. Heatmap showing values of correlations between tissues of Arabidopsis thaliana (rows) and P. trichocarpa (columns), computed using Spearman’s correlation coefficient and 7094 homolog pairs from one-to-one and one-to-many homologous clusters. Tissue equivalencies (best SCC scores) are highlighted by boxes outlined in black.
the case is often surprisingly high. For instance, between poplar and Arabidopsis, there are 4231 cases (39.1%) in which the homolog with the best sequence match is not that with the best expression profile match, i.e. not the expressolog.
The number of cases where the homolog with the best sequence match is the expressolog is 6589. The number of cases in which the expressologs are not the best sequence matches ranges from a low of 15.4% between poplar and M. truncatula, to a high of 50.7% between soybean and barley, as shown in Table 5. Further information on these comparisons is given in Table S3.
To view these data in a slightly different way, we created two sets of pairs of sequences for three species combinations. The first set of sequences consisted of those with the best sequence similarity scores from homologous clusters. The second set consisted of those with the best expression similarity scores, i.e. the expressologs. For each set, we extracted the corresponding expression similarity score for the first set, and the corresponding sequence similarity score for the second set. The results for comparison of the corresponding Arabidopsis versus poplar sets are shown in Figure 5. The most similar sequences (mean sequence similarity of 70.7%) have just over half the mean expression similarity score compared with those with the best expression similarity score (0.29 versus 0.49), and the mean sequence similarity score for the latter set is just a few per cent lower (68.3%). The same is true for comparison of Arabidopsis with maize and maize with rice (see Data S1, Figure S23), and for other species comparisons (see Table S3). Figure 6 shows an outline of the pipeline used to rank the homologs by expression profile similarity, summarizing the major steps involved in ranking homologs using this method.
eFP browsers
In addition to the existing eFP browsers already available on the Bio-Array Resource (BAR) site (Toufighi et al., 2005) for Arabidopsis (Winter et al., 2007) and poplar (Wilkins et al.,
Figure 4. Expressolog Tree Viewer output. The Arabidopsis ATELF5A-1 gene (At1g13950) shows highest expression in the stamens and pollen, and to a lesser extent in leaves of A. thaliana. Red denotes higher expression. Additionally, a phylogenetic analysis from the Expressolog Tree Viewer for the Arabidopsis ATELF5A-1 gene (At1g13950) and its poplar homologs is shown. Some eFP browser (Wilkins et al., 2009a) outputs of expression levels from poplar homologs of the Arabidopsis ATELF5A-1 gene are shown, together with their SCC values and sequence similarity scores. Rankings are based on similarity of expression profile. The catkin-specific expressolog (POPTR_0018s11660.1, PtpAffx.2128.1.A1_x_at) was identified by this pipeline. Remaining homologs were more highly expressed in other tissues, and at different expression levels.
Table 4 Poplar homologs of ATELF5A-1 with SCC expression similarity and sequence similarity values
| Expression similarity rank | Homolog | Probe set | SCC value | Sequence similarity (%) |
| 1 | POPTR_0018s11660.1 | PtpAffx.2128.1.A1_x_at | 0.765 | 89 |
| 2 | POPTR_0008s09150.1 | Ptp.2026.1.S1_s_at | 0.677 | 89 |
| 3 | POPTR_0010s17020.1 | PtpAffx.249.101.A1_s_at | 0.441 | 87 |
| 4 | POPTR_0006s19870.1 | Ptp.1583.2.S1_x_at | −0.088 | 89 |
Table 5 Global summary of expressologs and best sequence similarity matches
| | A. thaliana | Poplar | Medicago truncatula | Soybean | Rice | Barley | Maize |
| Arabidopsis thaliana | – | 6589 | 6423 | 5717 | 6590 | 4758 | 4673 |
| Poplar | 4231 (39.1%) | – | 9976 | 8954 | 8705 | 6427 | 6532 |
| Medicago truncatula | 1457 (18.5%) | 1810 (15.4%) | – | 7110 | 4579 | 3648 | 4063 |
| Soybean | 5696 (49.5%) | 8614 (49.0%) | 5327 (42.8%) | – | 4069 | 2891 | 8433 |
| Rice | 1482 (18.4%) | 1913 (18.0%) | 1053 (18.7%) | 3980 (49.4%) | – | 7792 | 10 541 |
| Barley | 1122 (19.1%) | 1324 (17.1%) | 866 (19.2%) | 2972 (50.7%) | 1450 (15.7%) | – | 5924 |
| Maize | 2964 (38.8%) | 3666 (35.9%) | 2328 (36.4%) | 5999 (41.6%) | 4977 (32.1%) | 2918 (33.0%) | – |
Data to the top right of the diagonal indicate the number of times the top sequence homolog is the expressolog; data to the bottom left of the diagonal indicate the number of times the top sequence homolog is not the expressolog.
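As a worked check of how the percentages in Table 5 relate to the counts, the short sketch below assumes that each percentage is the number of "not the expressolog" cases divided by the sum of both counts for that species pair. The function name is hypothetical; the counts are those given in Table 5 for the three species pairs quoted in the text, and under this assumption they give the quoted values of 39.1%, 15.4% and 50.7%.

```python
def percent_not_expressolog(n_not, n_is):
    """Share (%) of cases in which the best sequence match is not the expressolog."""
    return 100.0 * n_not / (n_not + n_is)


# (not the expressolog, is the expressolog) counts taken from Table 5
pairs = {
    "poplar-Arabidopsis": (4231, 6589),
    "poplar-M. truncatula": (1810, 9976),
    "soybean-barley": (2972, 2891),
}

percentages = {name: percent_not_expressolog(n_not, n_is)
               for name, (n_not, n_is) in pairs.items()}
for name, value in percentages.items():
    print(f"{name}: {value:.1f}%")
```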
2009a), five new eFP browsers were created to enable cross-species visualization of expression patterns within homologous clusters of genes. Although the Expressolog Tree
Viewer provides an indication of the degree of expression similarity of homologous genes, it is also useful to be able to view actual expression values for the species in question, in
Figure 5. Analysis of sequence similarity versus expression pattern similarity for two sets of gene pairs from A. thaliana and P. trichocarpa. The left panel shows the results of an analysis performed using the most sequence-similar homologs. The corresponding expression pattern similarity scores, as measured by Spearman’s correlation coefficient (SCC), were retrieved from our database. Each pair was plotted according to its sequence and expression pattern similarity score. In order to simplify this scatter plot, a hexagonal binning function was used. Each hexagonal bin contains a certain number of points, denoted by the grey shading. If a bin is black, then there are 425 points in it (each point is one pair). The right panel shows the sequence similarity against expression similarity for all homologs with top-ranked SCC values, termed the ‘expressologs’. The same shading scale and binning function were used. The mean sequence similarity and expression pattern similarity scores across all pairs in each graph are shown by dotted blue lines. The red lines are the lines of best fit through all points in the two graphs, the R-squared values of which are shown.
either a pictographic, tabular or graphical format, as is possible within the eFP browser framework. eFP browsers have been created for the developmental atlases of gene expression for M. truncatula (Benedito et al., 2008), G. max (Libault et al., 2010), O. sativa (Jain et al., 2007; Li et al., 2007), H. vulgare (Druka et al., 2006) and Z. mays (Sekhon et al., 2011). Within the Expressolog Tree Viewer output, the circles denoting sequence similarity link to a multiple sequence alignment, while the circles denoting expression similarity link to the corresponding eFP browser outputs for the relevant homolog, to enable more detailed examination of expression patterns of a given homolog. Example outputs for the five new eFP browsers are given in Figure 7 for the expressologs of the A. thaliana IRREGULAR XYLEM 3 (IRX3) gene. IRX3 is expressed most highly in the stem, specifically the xylem, of A. thaliana (Taylor et al., 1999).

DISCUSSION

In this study, we have devised a computational pipeline for ranking of genes within homologous clusters based on expression profile similarity. The similarity of spatiotemporal expression patterns may be thought of as an additional piece of information regarding functional equivalency between homologous genes. Thus, using this pipeline, we are able to identify, within each cluster, which homolog exhibits both the highest sequence similarity and expression pattern similarity. This is a complementary approach to the co-expression neighborhood approach implemented by PlaNet (Mutwil et al., 2011) or by Chikina and Troyanskaya (2011); with both of these methods, it is first necessary to manually identify genes in both species
showing similar patterns of expression before exploring the corresponding co-expression neighborhoods: our expressolog method automates this process. Currently, it is possible to view this information for the seven species in this study. It is also possible to view the expression patterns of the ranked homologs using a suite of seven eFP browsers, five of which are new for this study. Despite the fact that the monocots and the dicots diverged roughly 200 million years ago (Wolfe et al., 1989), our pipeline was still able to broadly identify equivalent tissues, as described in the Results. There are some oddities in terms of our equivalency analyses, especially between the monocots and dicots. For instance, our pipeline identified 1 cm pods from soybean as being equivalent to the base of stage 2 leaves from V5 maize plants. Perhaps this is unsurprising, as the divergence between monocots and dicots has resulted in several differences in morphology between the two groups of flowering plants, and as a result there may not be clear equivalent tissues between these groups. We have added small ‘alert’ icons for comparisons between these groups, but the Expressolog Tree Viewer still provides an easy access point for viewing the actual expression profiles of homologs across the monocot–dicot divide, for manual investigation. Movahedi et al. (2011) point out that rice and Arabidopsis genes may exhibit similar co-expression networks despite having differing tissue-specific patterns of expression, highlighting a more general phenomenon of concerted network evolution. Statistical analyses of the expression profile similarity versus sequence similarity correlation outputs between
Figure 6. Schematic of the pipeline used to identify expressologs between species.
A. thaliana and P. trichocarpa showed interesting results. Two different sets of homologs were investigated: those exhibiting maximum sequence similarities, and those showing maximum expression profile similarities (expressologs) within each homologous cluster. Both sets of homologs exhibited a similar mean sequence similarity value. However, major differences could be seen in the expression similarity values between different sets of homologs. For instance, there is a large difference in mean expression profile similarity values for the most sequence similar homologs and expressologs. The homolog pairs with the highest sequence similarity show almost a 50% lower expression profile similarity value than the expressologs (Figure 5). Expression profile similarity can be seen as a piece of information additional to sequence homology for in planta functional equivalence. An extension of this idea would be to use expression similarity and sequence similarity to more accurately annotate homologs. Although we have used only developmental gene expression atlases for the majority of analyses described here, we also examined use of a ‘response atlas’ to compute expressologs. In this case, we used experiments from Arabidopsis and poplar that were designed to subject
the plants to drought stress in similar ways, and that sampled similar tissues at similar time points (Wilkins et al., 2009b, 2010). Data S1, Figure S24 shows an analysis of the drought response in A. thaliana compared with P. trichocarpa. Here expressologs of differentially expressed genes from Wilkins et al. (2010) were determined using both developmental atlas data and the abiotic stress expression data. A more positive correlation for expression responses was found for the analysis performed in the latter case. This may suggest the need to use a specific expression compendium when comparing expression data under certain conditions (see Usadel et al., 2009, for a discussion of the difference between using condition-independent and condition-dependent expression compendia for co-expression analyses). For the example of ATELF5A-1 in A. thaliana, this gene shows highest expression in the stamens, pollen and leaves. The computational pipeline we devised was able to identify an expressolog with strong expression in catkins in P. trichocarpa. However, the other ranked P. trichocarpa homologs of ATELF5A-1 showed expression patterns that differed from both ATELF5A-1 and its P. trichocarpa expressolog. This may suggest that these other homologs
Figure 7. Example views of five new eFP browsers that have been created to enable cross-species expression browsing within the framework described in this paper. Views are for expressologs of the Arabidopsis thaliana IRX3 gene in the respective species. The eFP browser views are for Medicago truncatula (top left), Glycine max (top right), Oryza sativa (bottom left), Hordeum vulgare (bottom center) and Zea mays (bottom right). Red indicates higher expression in the depicted tissue.
require re-annotation or a more specific annotation. Additional examples, which can be found in Data S2, include the AtG8F gene, which shows highest expression in the internode, stamens, seeds and leaves of A. thaliana. In P. trichocarpa, the expressolog shows highest expression in the xylem and catkins. However, the other homolog in the same gene cluster shows a different pattern of expression. Other sample results are given in Data S2. Currently, many functional annotations are derived through sequence homology. It is envisaged that the results from this pipeline will ultimately aid in the improvement of functional annotations of genes, and we are planning to provide our services to a new International Arabidopsis Informatics Consortium information portal and other bioinformatics resources (International Arabidopsis Informatics Consortium, 2010). Further, this pipeline may be useful in automatically deriving Plant Ontology terms for datasets submitted to gene expression databases. RNA-seq and whole-genome tiling arrays are powerful methods for transcript-specific gene expression profiling. Our method will continue to prove useful as these data become more common. The gene expression profiling datasets we employed for this study in most cases did not contain splice variant information: only 268 genes from poplar, 623 genes from rice, 550 genes from M. truncatula and 13 683 genes from maize had probe sets that mapped to alternately spliced transcripts. Further, the one RNA-seq dataset that we used (from soybean) did not report expression level differences for different transcripts. Thus, for the tissue equivalency analyses we present here, we only used the expression information for the ‘.1’ or ‘T01’ primary transcript, as in most cases these were the only expression data available.
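As a rough illustration of the primary-transcript restriction described above, the following Python sketch keeps only identifiers that end in ‘.1’ or ‘T01’. The function names and the example identifiers are hypothetical, and the actual pipeline scripts may select primary transcripts differently.

```python
# Suffixes named in the text as marking the primary transcript.
PRIMARY_SUFFIXES = (".1", "T01")


def is_primary_transcript(transcript_id: str) -> bool:
    """Return True if the identifier ends with a primary-transcript suffix."""
    return transcript_id.endswith(PRIMARY_SUFFIXES)


def primary_only(expression: dict) -> dict:
    """Keep only the expression vectors that belong to primary transcripts."""
    return {tid: values for tid, values in expression.items()
            if is_primary_transcript(tid)}


# Hypothetical identifiers and expression values, for illustration only.
example = {
    "GENE_A.1": [10.0, 52.0, 7.5],
    "GENE_A.2": [3.0, 4.0, 2.5],
    "GENE_B_T01": [88.0, 12.0, 40.0],
    "GENE_B_T02": [1.0, 0.5, 2.0],
}
print(sorted(primary_only(example)))
```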
However, when we repeated our tissue equivalency analysis for poplar–Arabidopsis using the expression information for the 268 genes with probe sets for alternately spliced transcripts, we found no difference using the set of one-to-one homologs plus randomly selected pairs of one-to-many homologs. The results were similar for maize–rice when we included the alternate transcripts for the 13 683 genes (see Data S3). As more than 50% of expressed genes with introns in maize exhibit alternate splicing along a leaf developmental gradient (Li et al., 2010), including expression information for large numbers of alternate transcripts would shift the distributions of the one-to-one, one-to-many and many-to-many homolog clusters away from membership in the one-to-one homolog clusters and toward membership in the many-to-many homolog clusters. Thus, with ‘complete’ RNA-seq datasets (i.e. those providing expression information for all transcript variants across many tissues), it may be necessary to use the set of ‘one-to-one homolog pairs plus random one-to-many pairs plus random many-to-many pairs’ for tissue equivalency calculations. We do not anticipate this being an issue, as we show in Table 2 that this set performs
almost as well as the set we chose for use in this paper for tissue equivalency calculations.

In summary, using the eFP browser framework and Expressolog Tree Viewer, it is now possible to readily view expression patterns of homologs across different species. The ranking of homologs by both sequence similarity and expression profile similarity allows the user to assess the relationship between a given gene and its homologs in terms of expression profile similarities, providing further information regarding functional equivalency and improving the functional annotation of homologous genes. In addition, our pipeline will become more useful as more plant gene expression atlases are generated, perhaps by such efforts as the 1000 Plant Transcriptomes Project (Stewart et al., 2010). We will re-run the pipeline as more gene expression atlases are generated.

EXPERIMENTAL PROCEDURES

Sequence files

The protein sequence files used were as follows: A. thaliana, TAIR10_pep_20101214.fa, downloaded from ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_blastsets/; M. truncatula, Mt3.0_proteins_20090702_NAMED.fa, downloaded from ftp://ftpmips.gsf.de/plants/medicago/MT_3_0/; P. trichocarpa, Ptrichocarpa_156_peptide.fa.gz, downloaded from ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Ptrichocarpa/annotation/; G. max, Glyma1.pep.fa.gz, downloaded from ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v5.0/Gmax/annotation/initialRelease/; O. sativa, TIGR 6.1 all.pep, downloaded from ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.1/all.dir/; H. vulgare, translated using the OrfPredictor (Min et al., 2005) tool using a sequence file provided by Federico Giorgi (Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany); Z. mays, Zmays_166_peptide.fa, downloaded from ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Zmays/annotation/.
Orthologs were identified using OrthoMCL version 1.4 (Li et al., 2003), which was run using the following parameters set in the scripts: Mode = 1; P-Value Cutoff = 1e-10; Percent Identity Cutoff = 60; Percent Match Cutoff = 60; MCL Algorithm Inflation value = 2.2. These values were chosen based on manual investigation and comparison with other online databases.
Sequence similarity calculations

For the analyses performed in this paper and the values presented in our Expressolog Tree Viewer, we computed the sequence similarity scores using the command line version of CLUSTAL W (Thompson et al., 1994) with the following command: ‘clustalw -infile=[INPUT FILE] -outfile=[OUTPUT FILE] -pwmatrix=gonnet -pwgapopen=10 -pwgapext=0.1 > [STDOUT FILE]’, where the [INPUT FILE], [OUTPUT FILE] and [STDOUT FILE] filenames were specified by the scripts that we designed and wrote for our pipeline. The settings are default settings. Alignments presented in the output of the Expressolog Tree Viewer are generated ‘on the fly’ using MAFFT (Katoh et al., 2002) from sequences stored in our Expressolog database. We use the command ‘mafft --globalpair --maxiterate 500 --op 1.53 --ep 0.123 --quiet [INPUT FILE] > [OUTPUT FILE]’ to generate our alignments ‘on the fly’, where ‘op’ is the gap opening penalty, ‘ep’ is the gap extension penalty, ‘globalpair’ denotes an accurate global alignment, and the ‘maxiterate’ option tells MAFFT to iterate over the alignment up to 500 times to improve it. The ‘op’ and ‘ep’ values are default values, and the [INPUT FILE] and [OUTPUT FILE] are file names specified by the Expressolog Tree Viewer script.
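The two alignment commands quoted above can be wrapped in a small script. The Python sketch below only assembles the argument lists, using the flags given in the text (written with the leading hyphens that the tools expect), and runs a command only if the executable is found on the PATH. The function names and file names are hypothetical and this is not the authors' pipeline code.

```python
import shutil
import subprocess


def clustalw_command(infile: str, outfile: str) -> list:
    """Argument list for the pairwise CLUSTAL W run described in the text."""
    return ["clustalw", f"-infile={infile}", f"-outfile={outfile}",
            "-pwmatrix=gonnet", "-pwgapopen=10", "-pwgapext=0.1"]


def mafft_command(infile: str) -> list:
    """Argument list for the MAFFT run; the alignment is written to stdout."""
    return ["mafft", "--globalpair", "--maxiterate", "500",
            "--op", "1.53", "--ep", "0.123", "--quiet", infile]


def run_to_file(command: list, stdout_path: str) -> bool:
    """Run the command with stdout redirected; skip if the tool is missing."""
    if shutil.which(command[0]) is None:
        return False
    with open(stdout_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)
    return True


# Hypothetical file names, shown only to illustrate the assembled commands.
print(" ".join(clustalw_command("cluster.fa", "cluster.aln")))
print(" ".join(mafft_command("cluster.fa")))
```

Passing the arguments as a list avoids shell quoting problems when file names are generated by a script.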
Expression datasets, mapping and normalization

Expression datasets, platforms with GEO platform identifiers and mapping files used for each species for use in ranking of orthologs and subsequent production of eFP browsers were as follows: A. thaliana, AtGenExpress data series of Schmid et al. (2005), Affymetrix ATH1 platform GPL198, mapping to gene models performed using ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements-2010-12-20.txt; P. trichocarpa, GEO accession number GSE13990, Affymetrix poplar genome array GPL4359, mapping to gene models performed using http://www.affymetrix.com/Auth/analysis/downloads/na32/ivt/Poplar.na32.annot.csv.zip; M. truncatula, ArrayExpress experiment name E-MEXP-1097, Affymetrix Medicago genome array GPL4652, mapping file IMGAGv3MAPPINGS.txt for mapping to IMGAG version 3 gene models provided by Jeremy Murray (Samuel Roberts Noble Foundation, http://www.noble.org); G. max, http://digbio.missouri.edu/soybean_atlas/Soybean_DB/stacey.atlas+roothair.unique_per_million.gz, data are RNA-seq data so no mapping file was necessary; O. sativa, GEO accession numbers GSE7951 and GSE6893, Affymetrix rice array GPL2025, mapping to gene models performed using http://www.affymetrix.com/Auth/analysis/downloads/na30/ivt/Rice.na30.annot.csv.zip; H. vulgare, GEO accession number GSE16754, ArrayExpress experiment name E-AFMX-3, Affymetrix barley genome array GPL1340, mapping to gene models performed using http://www.affymetrix.com/Auth/analysis/downloads/na30/ivt/Barley1.na30.annot.csv.zip; Z. mays, PlexDB experiment number ZM37, Nimblegen maize whole-genome microarray 385K (version V1_4a.53), mapping of Sekhon et al. (2011) expression data based on probes to maize gene models performed by Ethalinda Cannon (Iowa State University, Ames) for PlexDB. The number of genes mapping to each platform is shown in Table 1.

For all Affymetrix platforms, .CEL files were normalized using R/Bioconductor (Gentleman et al., 2004) using the MAS5 algorithm, with a TGT value of 100. For soybean RNA-seq expression data, FPKM-normalized data were obtained from the URL indicated above. For maize gene expression data obtained from PlexDB as described above, RMA-normalized expression values were linearized for use in this work using the equation $y = 2^{x}$, where x represents the RMA-normalized expression value, and y is the value used for this work. Only primary transcripts and protein isoforms were considered for tissue equivalency calculations. However, for expressolog computation, alternative protein isoforms and their corresponding transcripts were used where available.

Correlation metrics

Three correlation metrics were compared in the process of this study. The Pearson correlation coefficient

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^{2}\right]\left[\sum (y - \bar{y})^{2}\right]}}$$

measures correlation between two vectors X and Y standardized by the standard deviation of the vectors. The formula for the uncentred Pearson correlation coefficient is similar to the one above for the Pearson correlation coefficient, but assumes that the mean is always equal to zero. The difference is seen if there are two vectors with identical shapes but a constant offset relative to each other. In this case, the Pearson correlation coefficient would give a value of 1 but the uncentred Pearson correlation coefficient would not. Spearman’s correlation coefficient

$$r_s = 1 - \frac{6\sum d^{2}}{n(n^{2} - 1)}$$

where d is the difference between the ranks of paired values and n is the number of paired values, takes the ranks of the expression values into account, rather than the absolute values themselves.
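To make the difference between the three metrics concrete, the following Python sketch implements each one directly from its textbook definition and applies them to two vectors with identical shapes but a constant offset. The function names and the example values are illustrative only, and the rank function assumes there are no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: deviations are taken from each vector's mean."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def uncentred_pearson(x, y):
    """Same form as Pearson, but with both means assumed to be zero."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def ranks(values):
    """1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


profile = [1.0, 5.0, 2.0, 8.0]            # illustrative expression values
shifted = [v + 10.0 for v in profile]     # same shape, constant offset

print("Pearson:", pearson(profile, shifted))
print("Uncentred Pearson:", uncentred_pearson(profile, shifted))
print("Spearman:", spearman(profile, shifted))
```

For these values the Pearson and Spearman coefficients both equal 1, while the uncentred Pearson coefficient falls below 1 (about 0.92), which is the behaviour described in the text.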
Software and webservices

The Expressolog Tree Viewer may be accessed on the BAR webserver at http://bar.utoronto.ca/expressolog_treeviewer/cgi-bin/expressolog_treeviewer.cgi or by clicking on the ‘Expressolog’ icon at the top of eFP browser output pages, which appears if a given query gene has homologs in other species. Our eFP browser code has been adapted to permit easy implementation for any species of interest. It is available on http://sourceforge.net/projects/efpbrowser/ (version 1.5). Additionally, information can be retrieved through the use of JSON-based web services at the following URL: http://bar.utoronto.ca/webservices/get_expressologs.php?request=[{"gene":"GENE OF INTEREST"},{"gene":"GENE OF INTEREST"},…], where the ‘GENE OF INTEREST’ may be any gene from the seven species of interest to this study. Any number of genes may be submitted in the format shown above, and a JSON data structure is retrieved giving information in the following order: {"probeset_A":"Probeset of original gene","gene_B":"Homologous gene","probeset_B":"Probeset of homologous gene","correlation_coefficient":"SCC value of expression profile correlation","efp_link":"Link to eFP output for species-specific browser of the homologous gene"}. A separate data structure such as the one above is given for each homolog, separated by a comma. Such a structure allows the user to parse the relevant information, which facilitates bulk data retrieval.
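A client for the web service described above could be sketched as follows in Python. The code builds the request URL from a list of gene identifiers and picks the record with the highest correlation coefficient from a parsed response. It makes no network call: the sample response is a hypothetical placeholder that only mirrors the field names listed in the text, and the assumption that the records arrive as a flat JSON list may not match the live service.

```python
import json
from urllib.parse import quote

BASE_URL = "http://bar.utoronto.ca/webservices/get_expressologs.php"


def build_request_url(genes):
    """Encode the gene list as the JSON 'request' parameter of the URL."""
    payload = json.dumps([{"gene": g} for g in genes], separators=(",", ":"))
    return f"{BASE_URL}?request={quote(payload)}"


def parse_records(response_text):
    """Parse the response body; assumes a JSON list of homolog records."""
    return json.loads(response_text)


def best_match(records):
    """Return the record with the highest expression profile correlation."""
    return max(records, key=lambda r: float(r["correlation_coefficient"]))


# Hypothetical response with placeholder values, for illustration only.
SAMPLE_RESPONSE = """[
  {"probeset_A": "PROBESET_QUERY", "gene_B": "HOMOLOG_1",
   "probeset_B": "PROBESET_1", "correlation_coefficient": "0.91",
   "efp_link": "LINK_1"},
  {"probeset_A": "PROBESET_QUERY", "gene_B": "HOMOLOG_2",
   "probeset_B": "PROBESET_2", "correlation_coefficient": "0.42",
   "efp_link": "LINK_2"}
]"""

print(build_request_url(["At5g17420"]))   # an example AGI-style identifier
print(best_match(parse_records(SAMPLE_RESPONSE))["gene_B"])
```

Percent-encoding the JSON payload keeps the brackets and quotation marks from being misread as URL syntax.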
ACKNOWLEDGMENTS

We would like to acknowledge funding provided by the Natural Sciences and Engineering Research Council of Canada to N.J.P. for this study, and from the Agricultural BioProducts Innovation Program of Agriculture and Agri-Food Canada to R.B. We are also grateful to Ethalinda Cannon, Wimalanathan Kokulapalan, and Carolyn Lawrence (Department of Genetics, Development and Cell Biology) of MaizeGDB at Iowa State University, Ames, IA, U.S.A., and Shawn Kaeppler and Rajandeep Sekhon (Department of Agronomy) of the University of Wisconsin, Madison, WI, U.S.A., for discussions and images helpful in creating the Maize eFP browser. We are also grateful to Darshan Brar from the International Rice Research Institute for providing us with images for the rice eFP browser. Federico Giorgi at the Max Planck Institute for Molecular Plant Physiology kindly provided barley sequences. Jeremy Murray from the Samuel Roberts Noble Foundation kindly provided M. truncatula mapping files.
SUPPORTING INFORMATION

Additional Supporting Information may be found in the online version of this article:

Data S1. Figures S1–S24.
Data S2. Other expressolog results for ATG8F, DiT1 and AtBAG3.
Data S3. Tissue equivalency results between Arabidopsis and poplar for one-to-one homologs plus randomly picked pairs from one-to-many homolog clusters using probe sets for just the first splice variant versus probe sets for all splice variants where available.
Table S1. Numbers of one-to-one, random one-to-many, and random many-to-many homolog pairs on a species–species basis.
Table S2. Scoring tissue equivalencies between Arabidopsis thaliana and Glycine max using different homolog pair set sizes.
Table S3. Additional information about the sequence similarity and expression pattern similarity scores (SCC values) for cases where the most sequence-similar homolog is or is not the expressolog.

Please note: As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials are peer-reviewed and may be re-organized for online delivery, but are not copy-edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.
REFERENCES

Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.
Avraham, S., Tung, C.W., Ilic, K. et al. (2008) The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations. Nucleic Acids Res. 36, D449–D454.
Benedito, V., Torres-Jerez, I., Murray, J. et al. (2008) A gene expression atlas of the model legume Medicago truncatula. Plant Journal, 55, 504–513.
Chen, F., Mackey, A.J., Vermunt, J.K. and Roos, D.S. (2007) Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One, 2, e383.
Chikina, M.D. and Troyanskaya, O.G. (2011) Accurate quantification of functional analogy among close homologs. PLoS Comput. Biol. 7, e1001074.
Cho, Y., Fernandes, J., Kim, S.-H. and Walbot, V. (2002) Gene-expression profile comparisons distinguish seven organs of maize. Genome Biol. 3, research0045.
Druka, A., Muehlbauer, G., Druka, I. et al. (2006) An atlas of gene expression from seed to seed through barley development. Funct. Integr. Genomics, 6, 202–211.
Fitch, W. (1970) Distinguishing homologous from analogous proteins. Syst. Zool. 19, 99–113.
Gentleman, R.C., Carey, V.J., Bates, D.M. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80.1–R80.16.
International Arabidopsis Informatics Consortium (2010) An international bioinformatics infrastructure to underpin the Arabidopsis community. Plant Cell, 22, 2530–2536.
Jain, M., Nijhawan, A., Arora, R., Agarwal, P., Ray, S., Sharma, P., Kapoor, S., Tyagi, A.K. and Khurana, J.P. (2007) F-box proteins in rice. Genome-wide analysis, classification, temporal and spatial gene expression during panicle and seed development, and regulation by light and abiotic stress. Plant Physiol. 143, 1467–1483.
Jiao, Y., Wickett, N.J., Ayyampalayam, S. et al. (2011) Ancestral polyploidy in seed plants and angiosperms. Nature, 473, 97–100.
Katoh, K., Misawa, K., Kuma, K. and Miyata, T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066.
Kuzniar, A., van Ham, R., Pongor, S. and Leunissen, J. (2008) The quest for orthologs: finding the gene across genomes. Trends Genet. 24, 539–551.
Li, L., Stoeckert, C.R. Jr and Roos, D.S. (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189.
Li, M., Xu, W., Yang, W., Kong, Z. and Xue, Y. (2007) Genome-wide gene expression profiling reveals conserved and novel molecular functions of the stigma in rice (Oryza sativa L.). Plant Physiol. 144, 1797–1812.
Li, P., Ponnala, L., Gandotra, N. et al. (2010) The developmental dynamics of the maize leaf transcriptome. Nat. Genet. 42, 1060–1067.
Liang, C., Jaiswal, P., Hebbard, C. et al. (2008) Gramene: a growing plant comparative genomics resource. Nucleic Acids Res. 36, 947–953.
Libault, M., Farmer, A., Brechenmacher, L. et al. (2010) Complete transcriptome of the soybean root hair cell, a single-cell model, and its alteration in response to Bradyrhizobium japonicum infection. Plant Physiol. 152, 541–552.
Lyons, E. and Freeling, M. (2008) How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant Journal, 53, 661–673.
Min, X.J., Butler, G., Storms, R. and Tsang, A. (2005) OrfPredictor: predicting protein-coding regions in EST-derived sequences. Nucleic Acids Res. 33, W677–W680.
Movahedi, S., Van de Peer, Y. and Vandepoele, K. (2011) Comparative network analysis reveals that tissue specificity and gene function are important factors influencing the mode of expression evolution in Arabidopsis and rice. Plant Physiol. 156, 1316–1330.
Mutwil, M., Obro, J., Willats, W.G. and Persson, S. (2008) GeneCAT-novel webtools that combine BLAST and co-expression analyses. Nucleic Acids Res. 36, W320–W326.
Mutwil, M., Klie, S., Tohge, T., Giorgi, F.M., Wilkins, O., Campbell, M.M., Fernie, A.R., Usadel, B., Nikoloski, Z. and Persson, S. (2011) PlaNet: combined sequence and expression comparisons across plant networks derived from seven species. Plant Cell, 23, 895–910.
Ohno, S. (1970) Evolution by Gene Duplication. New York: Springer.
Proost, S., Van Bel, M., Sterck, L., Billau, K., Van Parys, T., Van de Peer, Y. and Vandepoele, K. (2009) PLAZA: a comparative genomics resource to study gene and genome evolution. Plant Cell, 21, 3718–3731.
Remm, M., Storm, C.E. and Sonnhammer, E.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–1052.
Schmid, M., Davison, T.S., Henz, S.R., Pape, U.J., Demar, M., Vingron, M., Schölkopf, B., Weigel, D. and Lohmann, J. (2005) A gene expression map of Arabidopsis development. Nat. Genet. 37, 501–506.
Schmutz, J., Cannon, S.B., Schlueter, J. et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature, 463, 178–183.
Sekhon, R.S., Lin, H., Childs, K.L., Hansey, C.N., Buell, C.R., de Leon, N. and Kaeppler, S.M. (2011) Genome-wide atlas of transcription during maize development. Plant J. 66, 553–563.
Stewart, C.N. Jr, Burris, J.N., Peng, Y., Wong, G.K.S. and the 1KP Consortium (2010) One Thousand Plant Transcriptome (1KP) Project: a first look at extremophyte and weedy plant transcriptomes. Presented at the Plant & Animal Genomes XVIII Conference, 9–13 January 2010, San Diego, CA.
Tatusov, R.L., Koonin, E.V. and Lipman, D.J. (1997) A genomic perspective on protein families. Science, 278, 631–637.
Tatusov, R.L., Fedorova, N.D., Jackson, J.D. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
Taylor, N.G., Scheible, W.R., Cutler, S., Somerville, C.R. and Turner, S.R. (1999) The irregular xylem3 locus of Arabidopsis encodes a cellulose synthase required for secondary cell wall synthesis. Plant Cell, 11, 769–780.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.
Toufighi, K., Brady, S.M., Austin, R., Ly, E. and Provart, N.J. (2005) The Botany Array Resource: e-Northerns, Expression Angling, and promoter analyses. Plant Journal, 43, 153–163.
Usadel, B., Obayashi, T., Mutwil, M., Giorgi, F.M., Bassel, G.W., Tanimoto, M., Chow, A., Steinhauser, D., Persson, S. and Provart, N.J. (2009) Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. Plant, Cell Environ. 32, 1633–1651.
Wilkins, O., Nahal, H., Foong, J., Provart, N.J. and Campbell, M.M. (2009a) Expansion and diversification of the Populus R2R3-MYB family of transcription factors. Plant Physiol. 149, 981–993.
Wilkins, O., Waldron, L., Nahal, H., Provart, N.J. and Campbell, M.M. (2009b) Genotype and time of day shape the Populus drought response. Plant Journal, 60, 703–715.
Wilkins, O., Bräutigam, K. and Campbell, M.M. (2010) Time of day shapes Arabidopsis drought transcriptomes. Plant Journal, 63, 715–727.
Winter, D., Vinegar, B., Nahal, H., Ammar, R., Wilson, G.V. and Provart, N.J. (2007) An ‘electronic fluorescent pictograph’ browser for exploring and analyzing large-scale biological data sets. PLoS One, 2, e718.
Wolfe, K.H., Gouy, M., Yang, Y.W., Sharp, P.M. and Li, W.H. (1989) Date of the monocot–dicot divergence estimated from chloroplast DNA sequence data. Proc. Natl Acad. Sci. USA, 86, 6201–6205.