TICL – a web tool for network-based interpretation of compound lists inferred by high-throughput metabolomics

Alexey V. Antonov1, Sabine Dietmann1, Philip Wong1 and Hans W. Mewes1,2

1 Helmholtz Zentrum München, Institute for Bioinformatics and Systems Biology, Neuherberg, Germany
2 Department of Genome-Oriented Bioinformatics, Technische Universität München, Freising, Germany

Keywords
bioinformatics tools for high-throughput metabolomics; metabolomics; statistical analysis and data mining; statistical and bioinformatics tools; web tools for metabolomics

Correspondence
A. V. Antonov, Helmholtz Zentrum München – German Research Center for Environmental Health (GmbH), Institute for Bioinformatics and Systems Biology, Ingolstädter Landstraße 1, D-85764 Neuherberg, Germany
Fax: +49 89 3187 3585
Tel: +49 89 3187 2788
E-mail: [email protected]

High-throughput metabolomics is a dynamically developing technology that enables the mass separation of complex mixtures at very high resolution. Metabolic profiling has begun to be widely used in clinical research to study the molecular mechanisms of complex cell disorders. Similar to transcriptomics, which is capable of detecting genes at differential states, metabolomics is able to deliver a list of compounds differentially present between explored cell physiological conditions. The bioinformatics challenge lies in a statistically valid interpretation of the functional context for identified sets of metabolites. Here, we present TICL, a web tool for the automatic interpretation of lists of compounds. The major advance of TICL is that it not only provides a model of possible compound transformations related to the input list, but also implements a robust statistical framework to estimate the significance of the inferred model. The TICL web tool is freely accessible at http://mips.helmholtz-muenchen.de/proj/cmp.

(Received 12 November 2008, revised 28 January 2009, accepted 2 February 2009) doi:10.1111/j.1742-4658.2009.06943.x

Knowledge of the molecular basis of metabolism is crucial for our understanding of most cellular processes [1–3]. In recent years, technologies have been developed that allow the systematic investigation of large numbers of different metabolites [1,4–6]. This has led to metabolomics becoming an attractive technology for exploring the molecular basis of complex cell disorders [7–10]. In most genomics and proteomics studies aimed at deciphering the molecular mechanisms of complex biological phenomena, the output is usually a list of genes/proteins [11–13]. The next common step is the application of bioinformatics and statistical methods to obtain a statistically valid interpretation of the derived gene list. There are dozens of bioinformatics

tools available for the interpretation of gene lists. A standard solution is the inference of over-/under-represented gene ontology terms [14–22]. The significance of the results is usually reported as a P-value, which represents the probability of inferring a similar or greater enrichment (for any gene ontology term) for a randomly sampled gene list [19]. More complex methods have been proposed to exploit the database information currently available for metabolic and signaling pathways, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [23] or BioCarta (http://www.biocarta.com). In these methods, pathway topology is taken into account by specialized scoring functions. The method developed by Rahnenfuhrer et al. [24] includes, in

Abbreviations KEGG, Kyoto Encyclopedia of Genes and Genomes; SHR, spontaneously hypertensive rat; WKY, Wistar Kyoto rat.


FEBS Journal 276 (2009) 2084–2094 © 2009 Helmholtz Zentrum München – German Research Center for Environmental Health (GmbH). Journal compilation © 2009 FEBS

A. V. Antonov et al.

addition, the distance between genes within the metabolic pathway: the impact of a pair of genes is weighted with respect to this distance. Another procedure (impact analysis), proposed recently by Draghici et al. [25,26], goes beyond gene pairs and fully captures the topology of signaling pathways by propagating the perturbations measured at gene levels through the entire pathway. This technique can capture information about the position of the genes on the pathway, because perturbation of the genes at the top of the signaling cascade will propagate through the entire pathway, unlike perturbation of the downstream genes.

Metabolomics is a relatively new ‘omics’ technology. Experimental studies of complex cell disorders that employ high-throughput metabolomics as a basic instrument have just started to appear. Several studies of different diseases have demonstrated the successful application of metabolomics in clinical research [7–9]. There is no doubt that the number of such clinical studies will grow exponentially in the near future. Similar to transcriptomics and proteomics, metabolomics allows for the detection of a list of markers present at different concentrations under various explored cell physiological conditions. In the case of metabolomics, the markers are compounds (not genes or gene products). There is a great demand for bioinformatics to provide a statistically valid interpretation of compound lists produced experimentally. Currently, several bioinformatics approaches are available for metabolomics. Each approach was developed to solve different practical problems related to the analysis of metabolomics data [5,27–30]. Most of the proposed tools for metabolomics deal with the mass peak annotation problem [31].
The recently presented MassTrix web server [30] allows users to upload a high-precision mass spectrum, automatically annotates mass peaks and maps identified compounds onto KEGG metabolic pathways. Most of the available tools aim to interpret whole mass spectra rather than a sparse list of compounds differentially present between samples. Other tools are available that provide visualizations of a compound list in the context of metabolic networks [32,33]. The KEGG atlas accepts a list of compounds as input and returns a graphical visualization of the compounds in the context of the global metabolic reaction network. The KEGG atlas, however, does not provide quantitative and statistical analyses. It is important to know whether experimentally selected compounds are related, for example, whether they belong to a chain or network of metabolic reactions. A partial answer to this question can be obtained from the KEGG atlas. However, without quantitative analysis, there are no clues about the quality of these relations.

To fill the gap, we propose an analytical framework for the interpretation of molecular mechanisms that unite a list of compounds. This analytical framework is implemented as the freely accessible web tool TICL. As we demonstrate using data from recently published metabolomics studies, TICL translates compounds into a set of linked metabolic reactions and provides quantitative estimates of the significance of the inferred models.

TICL – a tool for interpretation of compound lists
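To make the question of whether two compounds belong to a chain of metabolic reactions concrete, the following minimal sketch measures the smallest number of reaction steps between two compounds by breadth-first search. The toy network, the compound names and the function name `reaction_distance` are illustrative assumptions; they are not taken from KEGG or from the TICL implementation.

```python
from collections import deque

# Hypothetical toy compound network: an edge means that a single reaction
# links the two compounds (this is not real KEGG data).
TOY_NETWORK = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": set(),  # isolated compound, not linked to any other
}


def reaction_distance(network, source, target):
    """Return the minimum number of reaction steps between two compounds,
    or None if no chain of reactions links them."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        compound, steps = queue.popleft()
        for neighbour in network.get(compound, ()):
            if neighbour == target:
                return steps + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + 1))
    return None


print(reaction_distance(TOY_NETWORK, "A", "D"))  # 3
print(reaction_distance(TOY_NETWORK, "A", "E"))  # None
```

A distance of this kind gives a quantitative handle on how closely two compounds are related, which a purely graphical view does not provide.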

Results

We consider several recently published experimental studies that report lists of compounds found to be differentially present under diverse physiological conditions. We demonstrate that the proposed statistical framework can be helpful in understanding the biological context of the reported compound lists.

We start with the study by Lu et al. [9], which reports metabolic variation related to hypertension and age-related conditions. To characterize the development of hypertension, the spontaneously hypertensive rat (SHR) and its normotensive control, the Wistar Kyoto (WKY) rat, were investigated, and their blood plasma was analyzed using GC/time-of-flight MS. In total, 187 peaks were quantitatively determined after deconvolution, and 78 of them were identified. Plasma compositional differences for many identified compounds showed significant age-related variations for both SHR and WKY. Also, many identified compounds showed significant variations between hypertension-related SHR and control WKY rats.

Table 2 in Lu et al. [9] reports 20 compounds that show significantly increased or decreased levels from 10 to 18 weeks of age in both SHR and WKY rats. In total, 16 compounds can be mapped to the global compound network inferred from the KEGG. Submission of this list to the KEGG atlas gives the graphical visualization presented in Fig. 1. At first glance, these compounds have nothing in common; they do not represent any specific canonical metabolic pathway. In this case, visual analysis of Fig. 1 cannot give a clear answer as to whether and how the compounds are related. By contrast, submission of this list to TICL gives quantitative values that describe the quality of the relations between the input compounds and provides a confidence score for such relations in the form of a P-value (the probability that randomly generated compound lists are involved in relations of similar quality). The report for the analyzed list is given in Table 1.
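A P-value of the kind described here, the probability that randomly generated compound lists are involved in relations of similar quality, can be estimated by random sampling. The sketch below is a minimal illustration under stated assumptions: two input compounds are treated as linked when they are at most `max_dist` reaction steps apart, a model is scored by the size of the largest group of linked input compounds, and random lists of the same size are drawn uniformly from the network. The toy network, the scoring rule and all function names are hypothetical; the text above does not specify the algorithm that TICL actually uses.

```python
import random
from collections import deque


def neighbours_within(network, start, max_dist):
    """Compounds reachable from `start` in at most `max_dist` reaction steps."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not expand beyond the distance limit
        for nxt in network[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)


def largest_linked_group(network, compounds, max_dist):
    """Size of the largest group of listed compounds chained together by
    links of at most `max_dist` steps (assumed scoring rule)."""
    unvisited = set(compounds)
    best = 0
    while unvisited:
        seed = unvisited.pop()
        group, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            linked = neighbours_within(network, current, max_dist) & unvisited
            unvisited -= linked
            group |= linked
            frontier.extend(linked)
        best = max(best, len(group))
    return best


def empirical_p_value(network, compounds, max_dist, n_random=1000, seed=0):
    """Fraction of random lists that score at least as well as the input."""
    rng = random.Random(seed)
    observed = largest_linked_group(network, compounds, max_dist)
    universe = sorted(network)
    hits = 0
    for _ in range(n_random):
        random_list = rng.sample(universe, len(compounds))
        if largest_linked_group(network, random_list, max_dist) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_random + 1)


# Hypothetical toy network: a linear chain of 30 compounds c0 - c1 - ... - c29.
chain = {f"c{i}": set() for i in range(30)}
for i in range(29):
    chain[f"c{i}"].add(f"c{i + 1}")
    chain[f"c{i + 1}"].add(f"c{i}")

size, p = empirical_p_value(chain, ["c3", "c4", "c5", "c6"], max_dist=1)
print(size, p)
```

Repeating the calculation for increasing values of `max_dist` would give one row per model, in the spirit of a report that lists a maximum distance, the number of input compounds in the subnetwork and a P-value.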


Fig. 1. Output returned by the KEGG atlas after submission of 20 compounds that have significantly increased or decreased levels from 10 to 18 weeks of age in both SHR and WKY rats. Red points correspond to submitted compounds.


Table 1. The quantitative report ‘Enriched subnetworks’ returned by TICL after the submission of 20 compounds with significantly increased or decreased levels from 10 to 18 weeks of age in both SHR and WKY rats.

Model  Maximum distance between compounds  No. input compounds in the subnetwork
1      1                                   2
2      2                                   4
3      3                                   5
4      4                                   7
5      5                                   11
6      6                                   12

P-value  < < < < <