The Plant Journal (2006) 46, 336–348
doi: 10.1111/j.1365-313X.2006.02681.x
TECHNICAL ADVANCE
The Arabidopsis co-expression tool (ACT): a WWW-based tool and database for microarray-based gene expression analysis

Chih-Hung Jen1,†, Iain W. Manfield2,†, Ioannis Michalopoulos1, John W. Pinney1, William G.T. Willats3, Philip M. Gilmartin2 and David R. Westhead1,*

1 School of Biochemistry and Microbiology, University of Leeds, Leeds, West Yorkshire, LS2 9JT, UK,
2 Centre for Plant Sciences, Faculty of Biological Sciences, University of Leeds, West Yorkshire, LS2 9JT, UK, and
3 Institute for Molecular Biology and Physiology, University of Copenhagen, Copenhagen, DK-1353, Denmark

Received 11 September 2005; revised 12 December 2005; accepted 21 December 2005.
* For correspondence (fax 44 0 113 343 3167; e-mail [email protected]).
† These authors contributed equally to this article.
Summary

We present a new WWW-based tool for plant gene analysis, the Arabidopsis Co-Expression Tool (ACT), based on a large Arabidopsis thaliana microarray data set obtained from the Nottingham Arabidopsis Stock Centre. The co-expression analysis tool allows users to identify genes whose expression patterns are correlated across selected experiments or the complete data set. Results are accompanied by estimates of the statistical significance of the correlation relationships, expressed as probability (P) and expectation (E) values. Additionally, highly ranked genes on a correlation list can be examined using the novel CLIQUE FINDER tool to determine the sets of genes most likely to be regulated in a similar manner. In combination, these tools offer three levels of analysis: creation of correlation lists of co-expressed genes, refinement of these lists using two-dimensional scatter plots, and dissection into cliques of co-regulated genes. We illustrate the applications of the software by analysing genes encoding functionally related proteins, as well as pathways involved in plant responses to environmental stimuli. These analyses demonstrate novel biological relationships underlying the observed gene co-expression patterns. To demonstrate the ability of the software to develop testable hypotheses on gene function within a defined biological process we have used the example of cell wall biosynthesis genes. The resource is freely available at http://www.arabidopsis.leeds.ac.uk/ACT/

Keywords: bioinformatics, co-expression, plants, gene networks, regulation.
© 2006 The Authors. Journal compilation © 2006 Blackwell Publishing Ltd

Introduction

Recent years have seen a dramatic increase in the use of DNA microarray experiments, which simultaneously measure the relative mRNA expression of thousands of genes. Such investigations have many possible purposes; for instance, in plants these have been used to identify genes associated with particular environmental stimuli, including biotic and abiotic stresses (Chen et al., 2002); to find genes that are differentially expressed in disease states (Schenk et al., 2000); to map the changes in expression associated with biological processes such as the cell cycle (Breyne et al., 2002) and circadian clock (Harmer et al., 2000); or to identify downstream targets of particular regulatory factors (Hoth et al., 2002). An enormous volume of microarray data has been produced world-wide, and much of it is now being submitted to the fast-growing public databases (Brazma et al., 2003; Craigon et al., 2004; Gollub et al., 2003; Wheeler et al., 2003). For example, at the time of writing, Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) contains data from more than 30 000 arrays and both The Arabidopsis Information Resource (Rhee et al., 2003; http://www.arabidopsis.org/) and Nottingham Arabidopsis Stock Centre (NASC) (Craigon et al., 2004; http://arabidopsis.info/) websites represent valuable repositories of Arabidopsis microarray data sets. This considerable body of data contains information about the coordinated steady-state transcript levels of all genes at the time of tissue harvesting, but the integration of regulatory signals controlling the transcriptome remains poorly characterized and a key challenge is the elucidation of the transcriptional control networks underlying the observed patterns of gene expression.

Most microarray experiments are conceived for a particular investigative purpose, but there is a growing realization that the data gathered have use and significance beyond the confines of the original investigation. A study might identify a number of genes important in some particular circumstances, but each array hybridization performed contains potentially useful data on the relative expression of many more genes. It was realized very early that clusters of genes showing similar expression patterns (co-expression) across several experiments are often functionally related (Eisen et al., 1998; Hughes et al., 2000). Such functional relationships tend to exist, using gene ontology terminology (Ashburner et al., 2000; Berardini et al., 2004), at the level of biological processes rather than molecular functions. Co-expression data can be complementary to the information on possible gene function provided by sequence analysis. For example, sequence analysis might suggest the molecular function of a protein to be kinase activity, but co-expression data might be able to associate the gene with other genes, perhaps of different molecular function, but involved in the same pathway or biological process. The co-expression of genes revealed in large databases of related and unrelated microarray experiments can contain information far beyond the original purposes for which the constituent experiments were performed, and can be a valuable predictive tool for gene function and pathway membership.
Tools are available for identifying tissues or experiments where genes of interest show similar expression levels, for example the NASC two-gene scatter plot (Craigon et al., 2004; http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl) and GENEVESTIGATOR (Zimmermann et al., 2004). Complementing these tools are others reporting correlation of expression of genes across many experiments, with the implication that these genes are most likely to be regulated in a similar manner and may encode proteins that are involved in the same biological process; specific examples include EPCLUST (http://ep.ebi.ac.uk/EP/EPCLUST/), the GENE RECOMMENDER algorithm (Owen et al., 2003), ATTED (Obayashi et al., 2004),
[email protected] (Steinhauser et al., 2004), GENEVESTIGATOR (Zimmermann et al., 2004) and EXPRESSION ANGLER (Toufighi et al., 2005). However, tools are also required that will be able to discriminate between genes showing similar expression patterns but that may be involved in different, but perhaps related, processes; such analyses require methods allowing determination of cut-offs to define sets of genes with discrete functions. Multigene families provide an additional level of complexity. Some family members may have similar, possibly redundant, activities with similar regulation while others may have different functions and be regulated
differently. Discriminating between these possibilities is important to enable predictions of different biological function for genes with similar sequences. Furthermore, with a large proportion of genes identified within genome sequences having no similarity to genes of known function, there is also a need for bioinformatic tools that will help to uncover their roles. To facilitate the interpretation of the extensive and publicly available array data, we have developed two new bioinformatic tools, the Arabidopsis Co-Expression Tool (ACT), which reports gene co-expression patterns across user-selected single or multiple arrays, and CLIQUE FINDER, which provides a quantitative method for determination of correlation cut-offs to generate exclusive groups of genes which may share a common purpose. We demonstrate the different features of our software by initially analysing genes encoding functionally related and well-characterized ribosomal proteins, followed by an analysis of heat-shock and cold-responsive genes to predict the involvement of uncharacterized genes in well-defined responses. We also demonstrate the integration of these analyses with promoter element detection software to identify conserved DNA upstream elements potentially involved in the co-regulated expression of clustered genes. Finally, we demonstrate the ability of the software to predict aspects of regulation and differential function within a group of genes involved in the less well-defined process of cell wall biosynthesis. ACT and CLIQUE FINDER are freely available on the WWW (http://www.arabidopsis.leeds.ac.uk/ACT/) and should be useful to anyone wishing to evaluate regulatory gene expression networks for any aspect of Arabidopsis biology. These new tools could also be used for similarly controlled data sets from other model organisms. 
Results

Description of software capabilities

The Arabidopsis Co-Expression Tool (ACT) is based on the Affymetrix arrays within the NASC dataset (Craigon et al., 2004; Affymetrix Inc., Santa Clara, CA, USA). This large dataset, comprising 322 ATH1 array hybridizations, is unique in that each hybridization has been performed according to the same experimental protocol and the final data have been produced according to the same normalization scheme. The array data are divided into experiments (or series) comprising one or more hybridizations, and covering a wide range of biological processes and conditions.

In combination, ACT and CLIQUE FINDER permit three levels of analysis. As a first step, ACT generates rank order lists of positively and negatively co-expressed genes. These lists are then used to generate co-correlation scatter plots for any two genes of interest to provide a visual and user-interactive representation of co-correlation data for all genes represented on the array. The third functionality, provided by CLIQUE FINDER, enables the identification of groups of genes, or cliques, within clusters that share statistically significant co-expression patterns. Importantly, this tool also enables the user to identify genes that are co-expressed but statistically excluded from a clique. As such, ACT and CLIQUE FINDER offer a range of analysis features that distinguish them from other currently available microarray analysis tools and these are demonstrated here using a range of examples.

The Arabidopsis co-expression tool (ACT)

For a given gene of interest, which we term the driver, the ACT software calculates the similarity in its expression profile to those of all other genes, using all array experiments, or a user-selected subset. Absolute values based on hybridization signal intensities from microarray experiments are used. The output is the Pearson correlation coefficient, or r-value, a scale-invariant measure of expression similarity, accompanied by probability (P) and expectation (E) values reflecting statistical significance against a background of random chance correlations. The E-value is calculated as the product of the number of genes on the array and the P-value. The correlation coefficient (r) is used to rank the genes in descending order of correlation with the driver. In addition to r-, P- and E-values, the output includes Affymetrix probe ID, Arabidopsis Genome Initiative (AGI) code and current annotation for each gene. We use the Affymetrix annotation as supplied by NASC but the AGI code from the correlation output list provides a direct link to TAIR for additional information. (Further details can be found in Experimental procedures below.) Genes with strongly correlated expression patterns are likely to be under similar transcription regulatory mechanisms, or involved in related biological processes.
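The ranking step described above can be illustrated with a minimal Python sketch. It is not the ACT implementation: the gene names and expression values are invented toy data, and the P-value is approximated here with the Fisher z-transform, which is an assumption made so that the example needs only the standard library (the section above does not state how ACT derives P). The E-value follows the definition in the text, the number of genes multiplied by P.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n_arrays):
    """Two-sided P-value from the Fisher z-transform (an assumed approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite at r = +/-1
    z = math.atanh(r) * math.sqrt(n_arrays - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def rank_against_driver(driver, profiles):
    """Return (gene, r, P, E) rows sorted by descending r with the driver."""
    n_genes = len(profiles)
    rows = []
    for gene, profile in profiles.items():
        r = pearson_r(profiles[driver], profile)
        p = approx_p_value(r, len(profile))
        rows.append((gene, r, p, n_genes * p))  # E = number of genes x P
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical signal intensities for four genes over five arrays.
profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 11],
    "geneC": [5, 4, 3, 2, 1],
    "geneD": [3, 1, 4, 1, 5],
}
ranked = rank_against_driver("geneA", profiles)
for gene, r, p, e in ranked:
    print(f"{gene}: r={r:+.3f}  P={p:.3g}  E={e:.3g}")
```

The driver itself heads the list with r = 1, as in Table 1, and anti-correlated genes fall to the bottom. With only five arrays the approximate P-values are crude; the sketch is meant to show the data flow rather than to give usable significance estimates.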
However, with a list of many genes, we require an assessment of which genes are likely to be involved in a regulon and which genes are not. This is a difficult issue, because the operation of different biological processes can be correlated for reasons other than directly shared regulation, and these correlations can be statistically significant. Microarray data are not always sufficient to separate these correlations of different origin. However, the signal associated with common regulation can be strengthened by considering expression correlation with more than one driver gene, selected from the putative regulon. A significant feature of this correlation tool is the ability to integrate patterns of gene expression over the whole transcriptome. Comparing the best correlated genes at the top of each correlation list allows visualization of only a fraction of the data. Where the same genes are seen at the top of two correlation lists, it would be useful to know how many other genes within the two lists also show similar correlation patterns. One way to observe such patterns is to prepare a
scatter plot of the r-values of gene A with the transcriptome versus the r-values of gene B with the transcriptome. Such a scatter plot provides a graphical representation of the correlation of individual genes with both drivers. Any gene that co-correlates with both drivers will show positive r-values on both the x and y axes. Anti-correlated expression will be represented by a negative r-value with one or both of the drivers, and a zero correlation would place the data point close to the x or y axis. The pre-calculated correlation data sets for all 21 891 sequences showing evidence of expression are available at http://www.arabidopsis.leeds.ac.uk/ACT.
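As a rough illustration of how the coordinates of such a co-correlation plot could be assembled, the following sketch computes, for every gene, its r-value with each of two drivers and then selects the genes lying in the upper-right region of the plot. The profiles, gene names and the 0.8 threshold are hypothetical; the sketch produces the point coordinates only and leaves the drawing to any plotting library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def co_correlation_points(driver_a, driver_b, profiles):
    """Map each gene to (r with driver A, r with driver B): one point per gene."""
    return {
        gene: (pearson_r(profiles[driver_a], profile),
               pearson_r(profiles[driver_b], profile))
        for gene, profile in profiles.items()
    }

def co_correlated(points, threshold):
    """Genes whose r-value with both drivers is at least `threshold`."""
    return sorted(gene for gene, (r_a, r_b) in points.items()
                  if r_a >= threshold and r_b >= threshold)

# Hypothetical expression profiles over six arrays.
profiles = {
    "driverA": [1, 2, 3, 4, 5, 6],
    "driverB": [2, 3, 5, 6, 8, 9],
    "geneX": [1, 3, 4, 6, 7, 9],   # rises with both drivers
    "geneY": [6, 5, 4, 3, 2, 1],   # anti-correlated with both drivers
    "geneZ": [2, 5, 1, 4, 2, 3],   # essentially uncorrelated
}
points = co_correlation_points("driverA", "driverB", profiles)
hits = co_correlated(points, threshold=0.8)
print(hits)
```

In this toy data set geneX sits close to both drivers in the upper-right corner, geneY in the lower-left, and geneZ near the origin, which mirrors the reading of the plot given in the text.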
CLIQUE FINDER
The ACT database of pre-calculated correlation coefficients for all genes on the array versus all other genes provides the means to rapidly determine sets of genes showing consistent co-expression with each other. The CLIQUE FINDER algorithm enables the quantitative determination of sets of genes that are consistently co-expressed with each other. The algorithm takes the list of genes most strongly correlated with a driver of interest, for example the top 100 genes, and then retrieves from the database the correlation coefficients between all possible pairs of these genes. In mathematical graph representation, these correlation coefficients represent ‘edges’ connecting each pair of genes. The analysis retains only the strongest of these edges according to a cut-off value set by the user; this percentage cut-off can be increased or lowered by the user to explore how this affects the size and composition of the clusters produced by the algorithm. The algorithm then identifies the subsets of genes, termed ‘cliques’, where all members are connected to each other by r-values above the threshold value. The members of each clique will often share some biological theme. However, there is often overlap between cliques, and this information is used to combine them into clusters. Any cliques sharing at least 50% of their genes are combined and this procedure is repeated until no cluster overlaps another by 50% or more. The clusters of genes and the unclustered singletons are reported, defining the sets of genes most likely to be regulated or acting together versus the genes likely to be excluded from the cliques. This tool enables the user to identify groups of genes within common regulatory networks, and also to exclude genes from correlation lists that do not show statistically significant membership of the clique.
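The steps just described (edge filtering, clique identification, merging of overlapping cliques) can be sketched in Python as follows. This is a sketch under several assumptions rather than the CLIQUE FINDER code: the text does not name the clique enumeration method, so a basic Bron-Kerbosch recursion is swapped in; the percentage cut-off is read as "keep the top fraction of gene pairs ranked by r"; and the 50% overlap is measured against the smaller of the two sets. All gene names and r-values are invented.

```python
from itertools import combinations

def strongest_edges(r_values, keep_fraction):
    """Keep the top `keep_fraction` of gene pairs by r (assumed reading of the cut-off)."""
    ranked = sorted(r_values.items(), key=lambda item: item[1], reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return {pair for pair, _ in ranked[:n_keep]}

def maximal_cliques(genes, edges):
    """Enumerate maximal cliques with a basic Bron-Kerbosch recursion."""
    neighbours = {gene: set() for gene in genes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(clique))  # nothing can extend this clique
            return
        for gene in list(candidates):
            expand(clique | {gene},
                   candidates & neighbours[gene],
                   excluded & neighbours[gene])
            candidates = candidates - {gene}
            excluded = excluded | {gene}

    expand(set(), set(genes), set())
    return found

def merge_cliques(cliques, min_overlap=0.5):
    """Merge sets sharing >= min_overlap of the smaller set until none remain."""
    clusters = [set(c) for c in cliques if len(c) > 1]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            shared = len(clusters[i] & clusters[j])
            if shared / min(len(clusters[i]), len(clusters[j])) >= min_overlap:
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
                break  # restart the scan with the updated cluster list
    return clusters

# Hypothetical pairwise r-values among five top-ranked genes.
r_values = {
    ("g1", "g2"): 0.90, ("g2", "g3"): 0.88, ("g1", "g3"): 0.85,
    ("g3", "g4"): 0.82, ("g2", "g4"): 0.80, ("g1", "g4"): 0.30,
    ("g4", "g5"): 0.20, ("g2", "g5"): 0.15, ("g1", "g5"): 0.10,
    ("g3", "g5"): 0.05,
}
genes = sorted({gene for pair in r_values for gene in pair})
edges = strongest_edges(r_values, keep_fraction=0.5)
cliques = maximal_cliques(genes, edges)
clusters = merge_cliques(cliques)
clustered = set().union(*clusters) if clusters else set()
singletons = set(genes) - clustered
print(clusters, singletons)
```

In the toy data the two three-gene cliques share two members and are therefore combined into one cluster, while g5 has no retained edge and is reported as a singleton. Changing `keep_fraction` plays the role of the user-adjustable cut-off: a stricter value breaks the cluster apart, a looser one pulls in more genes.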
Ribosomal proteins

The eukaryotic ribosome consists of approximately 80 ribosomal proteins (Barakat et al., 2001) required in stoichiometric ratios, suggesting a coordinated mechanism of regulation. The genes encoding these proteins are well characterized and annotated, and well represented on the Arabidopsis Affymetrix microarray chips; we have therefore used them as an example data set with which to observe correlation in expression patterns and their relationship to shared biological function. Although the ribosomal protein genes are anticipated to show constitutive expression, an emerging body of evidence suggests that many such genes show controlled transcriptional regulation (e.g. Bae et al., 2003). In a microarray analysis of Arabidopsis cell culture responses to a herbicide treatment, we observed increased expression of a large number of ribosomal protein genes (Manfield et al., 2004). A selection of these ribosomal protein genes was used here as candidates for correlation analysis. None of the experiments from the NASC database used in this analysis specifically addressed the issue of ribosomal protein gene expression, although our cell culture data were included in the database.

In the example shown in Table 1, we ran the algorithm using ribosomal protein L7Ae (At4g12600) as the driver; the top 15 best-correlated genes are shown. The entire correlation data set for all 21 891 genes is available at our website (http://www.arabidopsis.leeds.ac.uk/ACT). The gene selected as the driver for this analysis is shown at the top of the list with an r-value of 1.0 as its expression is perfectly correlated with itself. Correlated genes are ranked with descending r-values; we have presented the top 15 genes which show highly correlated expression with the driver. The correlation list includes and ranks all 21 891 sequences present on the microarray from the most highly correlated sequences shown in Table 1, to anti-correlated genes, the most extreme of which in this case shows an r-value of −0.54 (http://www.arabidopsis.leeds.ac.uk/ACT).

Table 1 Co-expression analysis result for a ribosomal protein gene

r-value  Gene ID    Annotation
1        At4g12600  Ribosomal protein L7Ae family
0.90     At1g26880  60S ribosomal protein L34 (RPL34A)
0.88     At3g13230  Expressed protein
0.87     At1g04480  60S ribosomal protein L23 (RPL23A)
0.87     At2g32060  40S ribosomal protein S12 (RPS12C)
0.87     At2g19730  60S ribosomal protein L28 (RPL28A)
0.87     At1g27400  60S ribosomal protein L17 (RPL17A)
0.86     At4g10480  Alpha NAC–related
0.86     At4g29410  60S ribosomal protein L28 (RPL28C)
0.85     At4g17390  60S ribosomal protein L15 (RPL15B)
0.85     At2g36160  40S ribosomal protein S14 (RPS14A)
0.85     At3g24830  60S ribosomal protein L13A (RPL13aB)
0.84     At3g25520  60S ribosomal protein L5 (RPL5A)
0.84     At3g07110  60S ribosomal protein L13A (RPL13aA)
0.84     At1g08360  60S ribosomal protein L10A (RPL10aA)
0.84     At4g25740  40S ribosomal protein S10 (RPS10A)

Only the top 15 genes showing correlation of expression with the target gene (in bold type) are shown. Genes with ‘unrelated’ annotations are indicated in grey. NAC, Nascent polypeptide associated complex.

From the annotations presented, it is clear that expression of many ribosomal protein genes is correlated with expression of the selected ribosomal protein gene, as expected if stoichiometric ratios of subunits are required. The selection presented in Table 1 contains a mixture of 60S and 40S subunits, suggesting that genes encoding large and small subunit proteins are apparently not under separate coordinate regulation. The over-representation of ribosomal protein genes continues beyond the presented list, with 77 ribosomal protein genes in the 100 best-correlated genes and further representations at lower r-values. It has been determined that there are 249 cytoplasmic ribosomal protein genes in Arabidopsis (Barakat et al., 2001), with probes for 200 of those genes present on the Affymetrix ATH1 microarray. This representation indicates that the enrichment for ribosomal protein genes in the presented correlation list is highly statistically significant, with a P-value of 10^-142.

In order to demonstrate the features of the ACT co-correlation scatter plot, we chose a representative large ribosomal subunit protein gene (At3g55280) and a representative small ribosomal subunit protein gene (At5g20290) that, in correlation analyses similar to that shown in Table 1, had revealed co-expression with many other ribosomal protein genes (data not shown). These two genes were used as drivers in a co-correlation analysis to generate a scatter plot. The results of this analysis are presented in Figure 1. Each gene is represented by a grey dot with those annotated as ribosomal protein genes highlighted as open symbols. 60S and 40S ribosomal protein genes are represented by triangles and squares, respectively. The Web-based tool contains a mouse-over facility that identifies each symbol with a direct link to the corresponding TAIR database entry.

This figure shows that there are many genes whose expression is positively correlated with the two drivers, although most genes are poorly correlated or uncorrelated; 78% and
[Figure 1: co-correlation scatter plot. x axis: r-value with Ribosomal protein S8 (At5g20290); y axis: r-value with Ribosomal L23a-like protein (At3g55280); both axes run from −0.6 to 1.]

Figure 1. Analysis of ribosomal protein gene expression. A co-correlation plot for 60S (At3g55280) and 40S (At5g20290) ribosomal protein genes is shown. The correlation r-values of each of 21 891 genes with each of the two driver genes are plotted as grey diamonds. The correlation r-values of 60S and 40S ribosomal protein genes are highlighted by open triangles and squares, respectively.
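The enrichment reported for the ribosomal protein correlation list (77 ribosomal protein genes among the 100 best-correlated genes, with 200 such genes among the 21 891 sequences on the array) can be examined with a one-sided hypergeometric test. The text does not say which test the authors used, so the choice of the hypergeometric distribution, and the use of 21 891 as the population size, are assumptions of this sketch. Exact integer arithmetic is used because the probability is far too small for ordinary floating-point evaluation of the individual terms.

```python
import math

def hypergeom_tail_log10(population, successes, draws, observed):
    """log10 of P(X >= observed) when `draws` genes are sampled without
    replacement from `population` genes, `successes` of which are annotated."""
    upper = min(successes, draws)
    # Sum the exact counts of favourable samples as Python integers.
    numerator = sum(
        math.comb(successes, k) * math.comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    # math.log10 accepts arbitrarily large integers, so no overflow occurs.
    return math.log10(numerator) - math.log10(math.comb(population, draws))

# Counts taken from the text; the test itself is an assumption.
log10_p = hypergeom_tail_log10(population=21891, successes=200,
                               draws=100, observed=77)
print(f"log10(P) = {log10_p:.1f}")
```

The printed exponent can be compared directly with the P-value of 10^-142 quoted in the text. Working in log10 keeps the result readable and avoids underflow; a statistics library with a hypergeometric survival function would serve equally well where one is available.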
Statistical validation of the Arabidopsis co-expression tool

The example analysis of ribosomal genes shows that ACT is an effective way to discover functional and regulatory links between plant genes. With our currently highly incomplete knowledge of such links, we lack the benchmarks necessary to provide a systematic analysis of the accuracy of the procedure; it is difficult to determine what the ‘correct answer’ is or which analysis is better than another. However, it is possible to study the robustness of the method to changes in size of the array set. Figure 2 shows a set of receiver–operator characteristics (ROC curves) plotting [sensitivity] (true positives as a fraction of total positives) against [1 − specificity] (false positives as a fraction of total negatives); this provides a measure of error rate. These data were created by randomizing the order of the data set and then dividing it into two disjoint parts, comprising 40 arrays and 254 arrays, respectively. Curves are shown for calculations over 10 (chosen randomly) to 40 arrays from the 40-array set, and true and false positives were defined by reference to the same calculations over the 254-array set as a standard of truth. These curves show that performance improves as the number of arrays increases (a better performing method has a larger area under the ROC curve). Fixing a false positive rate (x axis) at 5% would yield very poor performance if only 10 arrays were used (sensitivity