
Send Orders for Reprints to [email protected] Current Metabolomics, 2013, 1, 000-000


Current Experimental, Bioinformatic and Statistical Methods used in NMR Based Metabolomics

Helena U. Zacharias*, Jochen Hochrein*, Matthias S. Klein, Claudia Samol, Peter J. Oefner and Wolfram Gronwald#

Institute of Functional Genomics, University of Regensburg, Josef-Engert-Str. 9, 93053 Regensburg, Germany

Abstract: The aim of this contribution is to familiarize the reader with experimental, bioinformatic and statistical strategies currently used in the field of solution NMR based metabolomics. Special emphasis is given to methods that have worked well in our hands. Methods covered include sample preparation, acquisition and processing of NMR spectra, and identification and quantification of metabolites. Further consideration is given to data normalization and scaling, unsupervised and supervised statistical data analysis, the biomedical interpretation of results, and the centralized community-wide storage and retrieval of NMR data.

Keywords: Fingerprinting, metabolomics, NMR, profiling, quantification, statistical data analysis.

INTRODUCTION

Metabolomics aims primarily at the comprehensive study of the flow of small organic compounds through bioenergetic and biosynthetic pathways by means of their quantitative analysis in cells, tissues, organs, organisms and body fluids. Typical intermediates and products include amino acids, sugars, organic acids, bases, lipids, vitamins, and various conjugates of absorbed substances of exogenous origin. Metabolomics finds widespread application in fields as diverse as the screening of milk of dairy cows [1] and clinical organ transplant monitoring [2].

Metabolomic investigations are mainly conducted by employing hyphenated mass spectrometry or nuclear magnetic resonance (NMR) spectroscopy. Here, we will focus on the application of solution NMR spectroscopy to biological fluids and tissue extracts in an academic setting. NMR is a powerful tool for metabolite identification and quantification, as it allows the simultaneous detection of all proton-containing metabolites present at sufficient concentrations in a given biological specimen. NMR signal volumes scale linearly with concentration. NMR requires very little sample pretreatment and, typically, no prior chemical derivatization of analytes. One disadvantage of NMR spectroscopy is its comparatively poor sensitivity, with limits of detection in the lower micromolar range. Novel techniques such as dynamic nuclear polarization [3, 4] allow a substantial increase in sensitivity of NMR measurements of selected compounds. Coverage of the metabolome may be increased by combining NMR with liquid chromatography or gas chromatography coupled to mass spectrometry (LC-MS or GC-MS), which yields lower limits of quantification in the low nanomolar to picomolar range depending on the type of mass analyzer used.

#Address correspondence to this author at the Institute of Functional Genomics, University of Regensburg, Josef-Engert-Str. 9, 93053 Regensburg, Germany; Tel: ++49-941-943-5015; Fax: ++49-941-943-5020; E-mail: [email protected]
*These authors contributed equally.

In the following paragraphs we will discuss a wide range of issues, from study design to the biomedical interpretation of data, some of which also hold true for metabolomics studies involving analytical methods other than solution NMR spectroscopy. Due to space limitations we chose to describe some of the available techniques in more detail than others. This choice was driven in part by our own experience and does not imply superiority of the methods covered.

STUDY DESIGN

First, one has to distinguish between pilot experiments that aim at the identification of trends between groups and studies that aim at validating discriminatory metabolic features. For pilot studies, a relatively small number of specimens is often sufficient, whereas validation studies usually require larger sample numbers to obtain statistically significant results. If a pilot study allows estimation of effect size, one can calculate the minimum required sizes of the control and affected groups for a given significance level and desired minimal statistical power. One freely available tool for this purpose is the program GPower [5]. For biofluids, a high reproducibility of NMR measurements is generally observed. Therefore, in most cases, technical replicates are not required [6].

In cases where, for instance, groups of healthy and diseased patients are compared, it is important to ensure that the distributions of clinical characteristics such as age, gender, and other known confounders are similar for all groups. A typical confounding factor is medication. Medication may vary considerably and drive the discrimination of groups in multivariate data analysis, as drugs and/or their metabolites are commonly found in physiological fluids or tissue extracts. Therefore, it is important that all drugs and their doses administered are recorded diligently, including trade names, as galenic formulations for an active ingredient may vary considerably and give rise to additional signals in spectra.
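The group-size calculation mentioned above for pilot-derived effect sizes can be illustrated with a minimal sketch. It is not the algorithm of GPower; it assumes a two-sided comparison of two group means, a standardized effect size (Cohen's d), and the textbook normal approximation, which tends to give slightly smaller numbers than exact t-based calculations. The function name and default values are our own illustrative choices.

```python
import math
from statistics import NormalDist


def group_size_two_sample(effect_size: float,
                          alpha: float = 0.05,
                          power: float = 0.80) -> int:
    """Approximate per-group sample size for comparing two group means.

    Normal approximation (an assumption of this sketch):
        n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2
    where d is the standardized effect size estimated from pilot data.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical effect sizes, as they might be estimated from a pilot study
    for d in (0.5, 0.8, 1.0):
        print(d, group_size_two_sample(d))
```

Smaller effect sizes drive the required group sizes up quadratically, which is why validation studies generally need far more specimens than pilot experiments.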
© 2013 Bentham Science Publishers

When performing the actual study, the subsequent experimental steps such as sample preparation, NMR measurements, etc., should be performed in a randomized fashion to ensure that all groups are treated similarly and batch effects are minimized. For additional information on the design of NMR-based metabolomic studies readers are referred to Beckonert et al. [7].

SAMPLE PREPARATION AND ORGANIZATION

Sample Organization and Storage

Assuming that samples are obtained from an external cooperation partner, as is often the case, the first steps usually involve checking that the sample cohort is complete and that all required additional information describing the whole study and each sample in detail is provided. This information includes, for example, age and gender of patients, confounders such as additional diseases and medication given, and, most importantly, the group membership of each sample. Ideally, all relevant information should be stored in appropriate databases and encrypted. Stored information includes a uniquely chosen project number, project name, date of sample receipt, name and contact information of the cooperation partner, number of samples, sample type, storage place, and additional study and sample information. After measurements, the date of measurement and the results obtained, such as quantitative values, are stored under the same project number as well. As the repeated thawing and freezing of samples may affect their composition, preparation of aliquots in tubes specified for storage at -80 °C or even in liquid nitrogen is advised, in particular if multiple measurements by different methods are planned.

Sample Preparation

In most academic laboratories, NMR sample preparation is still performed manually and is thus prone to pipetting errors, sample contamination and degradation of metabolites that are not stable at ambient temperature.
This may severely affect the outcome of otherwise well-performed studies. Therefore, the implementation of standardized and detailed protocols as well as the use of standards added at various steps during sample pretreatment is highly recommended to ensure data quality [7].

The preparation of biological samples for NMR measurements involves various pretreatment steps depending on sample composition. Specimens such as plasma, milk and cerebrospinal fluid (CSF) usually contain large amounts of proteins and other macromolecules, which produce broad, unspecific signals in the acquired NMR spectra. Moreover, these macromolecules may form complexes with the employed NMR reference compound, thus resulting in a considerably diminished and shifted reference signal, which in turn impacts data analysis. Hence, it is recommended to deplete unwanted macromolecules from plasma, serum, CSF and milk specimens by ultrafiltration using, for example, the Millipore Amicon Ultra-4 (Millipore, Billerica, MA, USA) cellulose filter device with a molecular weight cutoff of 10 kDa [1, 8]. Filters are prewashed with 3 mL of distilled water and centrifuged at 4000 x g in a swing-bucket rotor at 22 °C for 30 min in order to remove filter-preserving substances like glycerol and triethylene glycol. Spectra of blank samples of rinsing water should always be acquired and compared to spectra of filtered biofluid to detect and exclude signals from filter residues in subsequent data analysis. Typically, 1000 µL of biofluid is placed into the filter device and centrifuged at 4000 x g in a preferably cooled centrifuge for 60 min. An alternative to the removal of macromolecules by filtering is the suppression of broad lines in the NMR spectra by employing a T2-relaxation filter such as the Carr-Purcell-Meiboom-Gill (CPMG) pulse sequence described later.

To analyze the metabolite composition of tissues by NMR, tissues have to be homogenized prior to extraction of hydrophilic and lipophilic compounds with methanol and chloroform, respectively. A comparison of 80% (v/v) aqueous methanol extracts of murine tumor tissues obtained by automated homogenization in Precellys® lysing kit tubes (Bertin Technologies, Montigny le Bretonneux, France) with extracts obtained by manual homogenization by pestling showed consistently higher metabolite yields for the automated approach (unpublished data). For normalization purposes, it is important to determine and record the exact weight of each tissue specimen before homogenization in 1 mL of an appropriate extraction solvent. The settings of the Precellys®24 homogenizer and the choice of lysing kit tube depend on the nature of the tissue. Normal brain and glioma tissues, for instance, are homogenized twice for 20 s by multidimensional motions at a rate of 6,500 rpm using soft tissue homogenizing beads 91-PCS-CK14 before the homogenates are centrifuged at 9,000 x g for 6 min at 4 °C for separation into supernatant and pellet [9]. To maximize metabolite yield, the pellet is washed twice with 1 mL of 80% aqueous methanol each time. The supernatants are combined and dried and the residue is dissolved in 400 µL of pure deionized water.
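To illustrate why the tissue weight has to be recorded, the following sketch converts a concentration measured in the redissolved extract into an amount per milligram of tissue. It assumes that the concentration refers to the 400 µL of water in which the dried residue was dissolved and that extraction losses are ignored; the function name and the numerical values are hypothetical.

```python
def amount_per_mg_tissue(conc_mmol_per_l: float,
                         dissolved_volume_ul: float,
                         tissue_weight_mg: float) -> float:
    """Return the metabolite amount in nmol per mg of tissue.

    1 mmol/L equals 1 nmol/µL, so concentration (mmol/L) times
    volume (µL) gives the absolute amount in nmol.
    """
    if tissue_weight_mg <= 0:
        raise ValueError("tissue_weight_mg must be positive")
    if dissolved_volume_ul <= 0:
        raise ValueError("dissolved_volume_ul must be positive")
    amount_nmol = conc_mmol_per_l * dissolved_volume_ul
    return amount_nmol / tissue_weight_mg


if __name__ == "__main__":
    # Hypothetical example: 0.25 mmol/L measured in 400 µL, from 20 mg of tissue
    print(amount_per_mg_tissue(0.25, 400.0, 20.0))
```

Expressing results per unit tissue weight makes specimens of different size comparable before any further statistical normalization.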
As an alternative to metabolite extraction, high-resolution 1H magic angle spinning (MAS) NMR spectroscopy enables the acquisition of high-resolution NMR spectra of small intact tissue samples without any pretreatment [7, 10]. However, the method requires complex additional equipment and, ideally, a spectrometer solely dedicated to this type of NMR experiment.

Pretreatment of urine specimens is typically limited to a short centrifugation step (10,000 to 12,000 x g for 5 min) to spin down solid residues.

The following steps of NMR sample preparation are valid for standard 5-mm NMR tubes and should be adjusted accordingly for other tubes. For aqueous samples, we mix 400 µL of biofluid with 200 µL of 0.1 mol/L phosphate buffer, pH 7.4, which adjusts the pH to the same value for every sample in order to minimize chemical shift variations, and 50 µL of 29.02 mmol/L 3-trimethylsilyl-2,2,3,3-tetradeuteropropionate (TSP) dissolved in deuterium oxide; the deuterium oxide is required for the internal lock signal of the spectrometer. Another commonly used reference substance for aqueous biofluids is 4,4-dimethyl-4-silapentane-1-sulfonate (DSS). Reference compounds such as TSP are often prone to pH-dependent peak shifting. As the pH of metabolomic samples is generally adjusted by adding appropriate amounts of buffer (to avoid peak shifting of metabolite signals), this normally does not pose a problem. A large variety of NMR buffers may be used as long as they are low in salt concentration, so as to avoid problems in the tuning and matching procedure at the spectrometer and a decrease in the signal-to-noise ratio. Furthermore, buffers should be proton-free, as protons would give rise to additional NMR signals in the spectra. In addition, buffers should contain small amounts (~2 mmol/L) of sodium azide or boric acid to prevent bacterial growth. Though boric acid is commonly preferred, being less toxic and reactive than sodium azide, it introduces alterations in the 1H NMR spectra including line-broadening, variations in T1 and T2 relaxation and chemical shift changes due to complex formation and chemical exchange processes. Nonetheless, these effects of boric acid on 1H NMR data were shown to be negligible in comparison to biological variations in molecular fingerprinting approaches [11]. In any case, one particular preservative should be used consistently throughout a study.

An improved sample preparation method for eliminating inter-sample chemical shift variation is offered by Jiang et al. [12]. They propose to remove differences in NMR chemical shifts due to pH variations and differences in divalent cation content by employing potassium fluoride (KF) precipitation of Ca2+ and Mg2+ ions in combination with the addition of K3EDTA as chelating agent prior to phosphate buffering, thereby reducing the standard deviation of chemical shifts to less than 1.5 Hz.

Lipophilic compounds are generally dissolved in 650 µL of deuterated chloroform with tetramethylsilane (TMS) as internal standard. One potential disadvantage of TMS is its high vapor pressure, which hampers adding exact quantities. However, as long as the TMS signal is only used for chemical shift referencing and not for quantification purposes, these effects are negligible. In deuterated chloroform extracts, the signal of residual non-deuterated chloroform may be used as reference [13]. However, this carries the disadvantage that the concentration of non-deuterated chloroform is batch-dependent.
An alternative reference compound that worked well in our hands for the quantification of lipophilic compounds is octamethylcyclotetrasiloxane (OMS, Sigma-Aldrich, Taufkirchen, Germany) [14, 15]. A different approach is the use of a synthetic electronic reference signal, which circumvents the need for internal reference compounds [16].

NMR MEASUREMENTS

The use of automated data acquisition schemes is recommended to ensure high reproducibility of measurements by avoiding biases from human interference. Prior to measurement, the temperature unit of the spectrometer should be carefully calibrated employing, for example, a deuterated methanol sample. Further, each sample should be allowed to equilibrate for at least 300 s in the magnet. Typically, a sample temperature between 298 and 300 K is used. When possible, samples should be automatically locked, tuned, matched and shimmed. Automatic shimming procedures should start from a standard shim file optimized for the current sample matrix. Pulse lengths should also be calibrated in an automated fashion, followed by automated data acquisition.

The most widely used experiments in NMR based metabolomics are 1D 1H measurements [17]. They are easy to set up and offer high sensitivity and good spectral resolution. Mostly, 1D NOESY pulse sequences with presaturation during relaxation and mixing time together with additional spoil gradients for water suppression are used.


Ideally, acquisition of the free induction decay (FID) should start during the last pulse of the pulse sequence to achieve a flat baseline. In theory, this obviates the need for first-order phase correction. In practice, however, such early data acquisition is not feasible. A practical solution is to optimize the delay between pulse and start of the FID and to correct the first points of the FID. This is done, for example, by the so-called “baseopt” option on Bruker spectrometers. Typically, we perform four dummy scans for equilibration followed by collection of 128 scans into 64k data points over a 20-ppm spectral width using a relaxation delay of 4 s, an acquisition time of 2.66 s, and a mixing time of 0.01 s. Generally, presaturation is used for water suppression in metabolomic applications, since it causes only minimal disturbance of signals neighboring the water peak.

Pulse sequences such as the Carr-Purcell-Meiboom-Gill (CPMG) spin-echo sequence [18], which exploit the different relaxation properties of macromolecules and low-molecular weight metabolites, may be used to attenuate macromolecular signals in NMR spectra of specimens containing large amounts of macromolecules such as blood plasma or serum.

Further, the presence of J-couplings often gives rise to complex multiplet structures in NMR spectra of biological specimens. To delineate these coupling patterns, J-resolved 2D spectra may be used that disperse the J-couplings in a second dimension [19]. Spreading of resonances across two dimensions as done in 2D heteronuclear single quantum coherence (HSQC) spectra will also reduce the common occurrence of signal overlap [20]. HSQC spectra are especially well suited because 1H-13C HSQC spectra show a signal dispersion of around 140 ppm in the indirect carbon dimension, while the number of signals is not increased in comparison to the corresponding 1D spectra.
However, the low natural abundance of 13C of about 1.1% limits detection sensitivity and requires longer acquisition times. These drawbacks may be compensated in part by the use of a cryogenically cooled probe. In our lab, we routinely collect for each spectrum 2048 x 128 data points using 8 scans per increment and a relaxation delay of 3 s. The total acquisition time per spectrum consequently amounts to slightly less than 1 hour. 2D HSQC spectra are also well suited for assignment purposes. For each investigated group of samples, usually one high-resolution HSQC spectrum is acquired. (Fig. 1) shows a typical 2D HSQC spectrum of a hydrophilic extract prepared from a murine breast tumor. Acquisition time for high-resolution 2D spectra may be considerably shortened by use of non-linear sampling schemes [21] in conjunction with multi-dimensional decomposition techniques [22].

Fig. (1). 2D 1H-13C HSQC spectrum acquired from a hydrophilic tissue extract of a murine breast cancer model.

2D HSQC spectra provide information about protons and their directly bound carbons. To facilitate assignment of signals, it helps to identify connectivities between individual signals by performing a heteronuclear multiple bond correlation (HMBC) experiment, which connects protons with heteronuclei that are separated by no more than three bonds.

2D HSQC spectra are also well suited for the analysis of metabolic flux, i.e., the rate of turnover of molecules through a metabolic pathway, making use of 13C-labeled substrates. However, in order to resolve 13C-coupling patterns, long measurement times on the order of several hours are needed to generate a sufficiently high resolution of 4-6 Hz in the indirect carbon dimension to reliably resolve 1JC-C couplings. Alternatively, a non-linear sampling scheme may be used, which reduces the number of required data points in the indirect dimensions approximately 4-fold [23]. The missing data points may then be reconstructed during data processing by means of multidimensional decomposition [22].

If sensitivity is an issue, homonuclear 2D spectra such as 2D 1H-1H correlation spectroscopy (COSY) [24] and 2D 1H-1H total correlation spectroscopy (TOCSY) [25] spectra may be acquired instead. In principle, they are also ideally suited for signal assignment, as complete (TOCSY) or partial (COSY) patterns of protons connected through J-couplings may be mapped. (Fig. 2) shows a typical example for the assignment of metabolite signals based on TOCSY spectra. However, for complex biofluids such as human urine, a large amount of signal overlap will be seen, especially in TOCSY spectra. Two-dimensional 1H-13C HSQC-TOCSY spectra offer higher resolution for signal assignment, albeit at the expense of reduced sensitivity as compared to a standard TOCSY experiment [26]. In metabolic flux analysis, TOCSY spectra facilitate the investigation of the incorporation of 13C labels into specific metabolites. A good description of this approach has been given recently by Moseley et al. [26]. In addition, flux analyses are described in more detail in a separate chapter of this review.

Fig. (2). Example for signal assignment in a low abundant compound based on 1D (left) and 2D TOCSY (right) spectra. Human urine spectra are displayed in green or white and overlaid with reference spectra of 1-methylnicotinamide shown in red or grey. Assignment of the H2 proton is indicated by white lines.

A general shortcoming of NMR spectroscopy is its limited sensitivity due to the relatively low degree of nuclear polarization achievable with commercially available magnets. A means to increase nuclear polarization is dynamic nuclear polarization. It has been used, for example, for metabolomic pathway mapping in living yeast [27].
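As a rough cross-check of the measurement times quoted in this section, the duration of an experiment can be estimated from the number of scans and the delays per scan. The sketch below is a simplification of our own: it ignores pulse durations, gradients and spectrometer overhead, and it assumes that the 128 points in the indirect HSQC dimension correspond to 128 increments. Since no acquisition time per scan is stated for the HSQC, it is set to zero here, so that value is a lower bound.

```python
def experiment_time_s(scans: int,
                      dummy_scans: int,
                      relaxation_delay_s: float,
                      acquisition_time_s: float,
                      mixing_time_s: float = 0.0,
                      increments: int = 1) -> float:
    """Estimate the duration of an NMR experiment in seconds.

    Dummy scans are counted once; `scans` are repeated for every increment
    of the indirect dimension (increments=1 for a 1D experiment).
    """
    per_scan = relaxation_delay_s + acquisition_time_s + mixing_time_s
    return (dummy_scans + increments * scans) * per_scan


if __name__ == "__main__":
    # 1D NOESY settings quoted above: 4 dummy scans, 128 scans,
    # 4 s relaxation delay, 2.66 s acquisition time, 0.01 s mixing time
    t_1d = experiment_time_s(128, 4, 4.0, 2.66, 0.01)
    print(f"1D 1H: about {t_1d / 60:.1f} min")

    # HSQC settings quoted above: 128 increments, 8 scans each, 3 s relaxation delay;
    # the acquisition time per scan is not given and is neglected (lower bound)
    t_hsqc = experiment_time_s(8, 0, 3.0, 0.0, increments=128)
    print(f"2D HSQC: at least {t_hsqc / 60:.1f} min")
```

Under these assumptions the 1D experiment takes roughly a quarter of an hour per sample, and the HSQC lower bound of about 51 min is in line with the stated total of slightly less than 1 hour.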

Thus far, the use of nuclei such as 1H and 13C has been described. A considerable number of metabolites such as phosphocreatinine, phosphoethanolamine, adenosine phosphates (AMP, ADP, ATP) and many others naturally contain 31P and are therefore amenable to 31P NMR spectroscopy. 31P offers the advantages of 100% natural abundance and a large chemical shift range. In addition, signal overlap is much less severe than in 1D 1H spectra.

SPECTRA PROCESSING

The use of automated routines for Fourier transform and phase correction in 1D spectra is again highly recommended. If the first points of the FID were appropriately corrected, only a zero-order phase correction, which can be performed automatically, is necessary, and no baseline correction is needed. Prior to Fourier transform, an exponential filter function with a line broadening of 0.3 Hz is usually applied.

METABOLITE IDENTIFICATION

Assignment of features in NMR spectra of biofluids to specific metabolites is a laborious task that is often complicated by massive signal overlap present in 1D 1H spectra. This is especially true for the typically crowded region between 3.0 and 4.0 ppm. As described above, overlapping 1D signals may be resolved in a second dimension. Consequently, NMR signal assignments in 1D spectra should be verified by corresponding 2D spectra. In the case of blood or milk samples, broad NMR peaks arising from proteins or other macromolecules may affect peak discrimination. Therefore, removal of proteins prior to data acquisition is recommended for blood and milk samples to obtain well-resolved 1D and 2D spectra.

In our laboratory, initial assignment of distinct NMR peaks to metabolites is performed on representative 1D 1H and corresponding high-resolution 2D 1H-13C HSQC and 1H-13C HMBC spectra. Signals are manually identified by comparison with reference spectra of pure compounds measured ideally under the same experimental conditions. A reliable repository of reference spectra is provided by the commercially available Bruker Biofluid Reference Compound Database BBIOREFCODE, which includes a large number of reference spectra of currently almost 600 mostly naturally occurring metabolites acquired under various experimental conditions (e.g. different pH values, solvents, etc.). For most metabolites, 1D 1H, 2D 1H-13C HSQC, 1H-13C HMBC, 1H-1H TOCSY, COSY and JRES NMR spectra are available. The NMR analysis software suite AMIX-Viewer (Bruker BioSpin GmbH, Rheinstetten, Germany) provides an interface for directly comparing acquired spectra with reference spectra from the BBIOREFCODE database. An example of the assignment process is given in (Figs. 2 and 3). By manually overlaying reference spectra from this database with actual NMR spectra, a considerable number of resonances can be assigned. Nevertheless, slight chemical shift differences due to small variations in salt concentration, pH value, temperature, etc. may complicate assignment, especially in crowded spectral regions. In addition, no database is complete. The BBIOREFCODE database, for instance, lacks a comprehensive collection of reference spectra for pharmaceuticals. Hence, additional NMR databases may have to be used for metabolite identification.

Fig. (3). Exemplary NMR peak assignment to lactic acid for a human plasma sample. A) An exemplary 1D 1H NMR spectrum of a human plasma sample (plotted in black) and the reference spectrum of lactic acid obtained from BBIOREFCODE (plotted in green or grey) are shown. B) The corresponding 2D 1H-13C HMBC (plotted in black) and the 2D 1H-13C HSQC (plotted in blue or grey) spectra of the human plasma sample as well as the reference 2D 1H-13C HSQC spectrum of lactic acid obtained from BBIOREFCODE (plotted in green or grey) are shown. The NMR peaks assigned to lactic acid are highlighted in red or grey and marked by boxes.

Commercially available software modules partially suitable for metabolite identification include the Chenomx NMR Suite (http://www.chenomx.com), the Aldrich FT-NMR Library (http://www.sigmaaldrich.com/labware/learning-center/spectral-viewer/ft-nmr-library.html), and the Bio-Rad Laboratories KnowItAll NMR database HaveItAll NMR (http://www.knowitall.com). Chenomx NMR Suite provides 1D 1H reference spectra of about 500 different compounds, whereas HaveItAll and Aldrich FT-NMR Library support 1D 1H and 13C spectra. HaveItAll comprises around 573,000 spectra from compounds of both synthetic and natural origin, while the Aldrich library contains spectra of about 15,000 different compounds. However, they do not provide spectra acquired under different measurement conditions typical for metabolic investigations and contain mostly compounds of synthetic origin. The Biological Magnetic Resonance Data Bank (BMRB, http://www.bmrb.wisc.edu/metabolomics) and the Human Metabolome Database (HMDB, http://www.hmdb.ca) are two freely available databases that provide 1D 1H, 2D 1H-13C HSQC, 1H-13C HMBC and 1H-1H TOCSY spectra of pure compounds. Comparing the reference spectra stored in the BMRB and HMDB with the investigated spectra can facilitate the identification of uncommon metabolites not included in the commercially available NMR databases. In our opinion, one key advantage of the BBIOREFCODE database that distinguishes it from the other aforementioned databases is that it provides for each compound reference spectra obtained under numerous measurement conditions, which greatly simplifies compound identification, especially in crowded spectral regions.

A semi-automated approach for metabolite identification is offered by the freely available metabolomics tool MetaboMiner [28]. MetaboMiner includes distinct libraries for various aqueous mixtures (e.g. plasma, urine, CSF) providing reference spectra of about 500 different pure compounds. Included libraries are based on the HMDB.
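The general idea behind library-based, semi-automated identification can be sketched as a tolerance-based comparison of observed 2D peak positions with reference peak lists. This is a deliberately simplified illustration and not the algorithm of MetaboMiner or of any database mentioned here. The tolerances, function names and the two-compound library are hypothetical, and the listed chemical shifts are approximate placeholder values rather than entries taken from a reference database.

```python
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (1H shift in ppm, 13C shift in ppm)


def match_fraction(observed: List[Peak], reference: List[Peak],
                   tol_h: float = 0.03, tol_c: float = 0.3) -> float:
    """Fraction of reference peaks with an observed peak within tolerance."""
    if not reference:
        return 0.0
    hits = 0
    for ref_h, ref_c in reference:
        if any(abs(h - ref_h) <= tol_h and abs(c - ref_c) <= tol_c
               for h, c in observed):
            hits += 1
    return hits / len(reference)


def rank_candidates(observed: List[Peak],
                    library: Dict[str, List[Peak]]) -> List[Tuple[str, float]]:
    """Rank library compounds by the fraction of their peaks found."""
    scores = {name: match_fraction(observed, peaks)
              for name, peaks in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder HSQC peak lists (approximate, for illustration only)
    library = {
        "lactate": [(1.33, 22.9), (4.11, 71.3)],
        "alanine": [(1.48, 19.0), (3.78, 53.3)],
    }
    observed = [(1.32, 22.8), (4.10, 71.2), (3.04, 32.0)]
    for name, score in rank_candidates(observed, library):
        print(name, score)
```

A candidate for which all reference peaks are found still needs manual verification, since the small shift differences caused by pH, salt concentration or temperature discussed above can produce both missed and spurious matches.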
Both 1H-13C HSQC and 1H-1H TOCSY spectra can be used for metabolite identification with MetaboMiner. Additional freely accessible tools for metabolite identification are NMRShiftDB (http://nmrshiftdb.nmr.uni-koeln.de/) and SDBS (http://riodb01.ibase.aist.go.jp/sdbs/cgi-bin/cre_index.cgi?lang=eng). The common NMR databases listed here enable either the search for specific compounds by name, registration number, chemical formula and structure, or an NMR peak query in a distinct spectral region for 1D or 2D spectra. If no matching reference NMR spectrum can be found in the various databases, one may compute a predicted NMR spectrum from the chemical structure or formula of a compound likely to be present in the sample of interest using, for example, NMRShiftDB.

Despite the large number of reference spectra of metabolites and pharmaceuticals stored in the NMR databases mentioned above, coverage under different measurement conditions is far from complete. For instance, in one project we had to identify the antifibrinolytic drug tranexamic acid, which had been administered to a group of patients undergoing cardiac surgery [29], by acquiring a reference spectrum of the pure compound and manually overlaying it with the NMR spectra of the investigated urine samples as seen in (Fig. 4). Spike-in experiments with pure substances can further support the identification of metabolites. It is advantageous to further validate metabolite assignments based on long-range 1H-13C couplings obtained from high-resolution HMBC spectra (Fig. 3). The concomitant use of other analytical methods such as solid-phase extraction, high-performance liquid chromatography (HPLC) or mass spectrometry (MS) can support NMR metabolite assignments as well as facilitate identification of unknown compounds [7].

QUANTIFICATION OF METABOLITES

One straightforward way of interpreting and analyzing metabolic NMR data is to derive metabolite concentrations from the spectra and continue analyses solely on these concentration values. This approach is often referred to as targeted profiling [30]. This method may be extended by inclusion of unknown signals [31]. Statistical analysis of such concentration data is the same as for any quantitative measurement technique, with the exception that the linear concentration range of over five orders of magnitude provided by NMR is broader than that of many other methods [1].

Two issues often hamper quantitative analysis of NMR data. The first one concerns the fact that biological specimens such as human urine are very complex in composition, containing hundreds to thousands of different compounds. This results in an even larger number of signals in the corresponding NMR spectra, causing severe signal overlap, especially in 1D 1H spectra. If not taken into account, signal overlap might result in over-quantification of the compound in question. A possible way to circumvent this issue is to manually inspect all spectra and use only signals for quantification that show no signs of overlap. In addition, routines that automatically check for outlier signals may be used [32]. Another possibility is to fit the lineshape of the target compound to the observed spectrum [33]. Instead of resolving overlapped peaks by such mathematical methods, one might also reduce the amount of overlap by using multidimensional NMR experiments.
In such spectra, the signals are spread over more than one dimension, thus making it more unlikely to observe signals at the same spectral position by chance. 2D HSQC spectra have proven useful for this purpose [34, 35]. Another concern in NMR quantification regards the fact that two signals of equal intensity in a given spectrum do not always correspond to equal concentration values. The reasons are numerous, including buildup of NOE’s, uneven frequency profiles of the excitation pulses and different efficacies of the INEPT steps in HSQC experiments. One way of dealing with this issue is to adapt the used pulse sequences. In 1D 13C experiments, inverse-gated proton decoupling prevents the buildup of NOE’s. For HSQC spectra, the extrapolation of so-called time-zero HSQC spectra has been proposed to circumvent such effects [36]. Another approach is to multiply peak integrals by compound-specific calibration factors. These factors may be experimentally determined [34,35] or derived from theoretical calculations [37]. In our hands, quantification from NMR spectra using individual calibration factors has shown high reproducibility and good concordance with orthogonal quantification techniques such as GC-MS and liquid chromatography-tandem mass spectrometry (LC-MS/MS) [35]. Finally, optimal control theory may be employed for the design of pulses that render the


Fig. (4). Identification of NMR peaks corresponding to the antifibrinolytic drug tranexamic acid. An exemplary 1D 1H NMR spectrum of urine from a human who had undergone cardiac surgery (drawn in black) is manually overlaid with the measured 1D 1H NMR reference spectrum of tranexamic acid (plotted in blue or grey). The molecular structure of tranexamic acid, according to [http://www.drugbank.ca/drugs/DB00302], is plotted in the upper right corner.

efficiency of the INEPT transfer steps independent of the molecule in question [38]. Based on our own, still limited, experience with this approach in cooperation with the group of Burkhard Luy from the Karlsruhe Institute of Technology, this method appears to hold considerable promise.

Generally, compounds are quantified relative to the known amount of an internal reference substance (RS). As described above, these relative values often have to be corrected by compound- and signal-specific calibration factors. RSs are used to compensate for time-dependent changes in the NMR spectral intensities that might be caused by differences in pulse calibration and receiver gain adjustments, aging of the equipment, and other factors. A detailed description of commonly used RSs is given in the section on sample preparation.

METABOLIC FLUX ANALYSIS

Healthy and diseased organisms usually show metabolic differences, e.g. in the activity of disease-associated metabolic pathways, that result in different concentrations of the metabolites involved. These differential metabolites may be discovered by NMR measurements and serve as diagnostic or prognostic biomarkers. However, it is virtually impossible to reveal the underlying differential metabolic pathways

solely on the basis of metabolite abundances, as the concentration of a given metabolite may be altered by a plethora of metabolic pathways. Conversely, metabolic differences do not always lead to altered metabolite concentrations, as counteracting effects may mask the changes. One way to determine the currently active metabolic pathways and their actual turnover rates is metabolic flux analysis. In this approach, a stable-isotope-labeled substrate such as 13C-labeled glucose is administered [39]. Molecules derived from this substrate will also carry the 13C label and can thus be distinguished from the “natural” molecule due to the NMR activity of 13C. A given molecule can often be synthesized in an organism via different metabolic pathways. The synthesized molecules will differ, though, in the distribution of labeled atoms, depending on which pathway was used. The pathways active in an organism can thus be worked out exactly, provided that the labeling of the substrate is chosen appropriately to distinguish between the different possible pathways.

A collection of different pulse programs is used for flux analysis, some prominent examples of which will be discussed briefly. 1D 13C spectra directly show the signals of 13C-labeled metabolites. To achieve signal intensities that are directly proportional to metabolite concentrations, inverse-gated 1H-decoupling should be used [40]. 13C spectra have


the disadvantage of reduced signal-to-noise ratios because the gyromagnetic ratio of 13C is only roughly a quarter of that of 1H. Spectra that involve measurements of 1H deliver higher signal intensities even at shorter measurement times. When comparing 1D 1H spectra of natural and 13C-labeled substances, the observed signals will differ not only in chemical shift (due to the isotope shift effect) but also in multiplicity, as the 13C spins cause numerous signal splittings when no carbon decoupling is used. This makes such spectra more difficult to interpret and requires the complementary acquisition of multidimensional spectra such as 2D 1H-TOCSY spectra [26, 41]. Filtering and editing have been proposed to enhance the quality of TOCSY spectra for flux analysis [42]. Other pulse sequences commonly used in NMR flux analysis are the 1H-13C HSQC experiment and the J-resolved HSQC [43]. To achieve sufficient resolution in the indirect 13C dimension, a high number of increments is required. As described above, non-linear sampling schemes might be used to limit the required measurement time. NMR flux analyses are used widely, one example being the proof of reverse tricarboxylic acid (TCA) cycle flux in cancer cells [44].

STATISTICAL DATA ANALYSIS

Preprocessing

Statistical data analysis is often performed on NMR-derived fingerprints, which have to be corrected beforehand for variation in signal position due to differences in pH, salt concentration and/or temperature. A widely used and robust method to compensate for these effects is binning or bucketing, whereby a spectrum is split into a number of segments called bins, buckets, or features. Equal-sized buckets are mostly used, although other schemes such as adaptive binning have been proposed. The data points inside each bucket are summed up or integrated. The whole spectrum is then represented as a vector of bucket integrals.
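The equal-width bucketing described above can be illustrated with a minimal Python sketch. The function name `bin_spectrum`, the spectral window of 0 to 10 ppm, the bucket width of 0.01 ppm and the two toy spectra are assumptions chosen for illustration only; they are not taken from the procedures used by the authors.

```python
def bin_spectrum(ppm, intensity, lo=0.0, hi=10.0, width=0.01):
    """Sum the intensities of all data points falling into equal-width buckets.

    ppm and intensity are parallel lists; the buckets cover [lo, hi) in steps
    of `width` (all three values are illustrative assumptions).
    Returns one summed value per bucket.
    """
    n_buckets = int(round((hi - lo) / width))
    buckets = [0.0] * n_buckets
    for shift, value in zip(ppm, intensity):
        if lo <= shift < hi:
            # The clamp guards against floating-point rounding at the upper edge.
            index = min(int((shift - lo) / width), n_buckets - 1)
            buckets[index] += value
    return buckets


# Two hypothetical spectra in which the first peak is shifted by 0.004 ppm.
spectrum_a = bin_spectrum([1.332, 2.054], [5.0, 2.0])
spectrum_b = bin_spectrum([1.336, 2.054], [5.0, 2.0])
```

In this toy case both spectra yield the same bucket vector, because the small shift stays within one bucket. A peak that drifts across a bucket boundary would still end up in a different bucket, which is one motivation for the adaptive schemes mentioned above.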
Alternative approaches are peak alignment [45, 46], methods working at full resolution employing statistical total correlation spectroscopy [47], and orthogonal projection to latent structures [48]. Data are usually analyzed employing multivariate statistics [49] that exploit the joint distribution of the data including the variance of individual features and their joint covariance structure. However, metabolomic datasets are prone to unwanted experimental and/or biological variances and biases. To minimize these contributions, data scaling and data normalization approaches may be used. Simple scaling approaches involve, for instance, scaling relative to the signal of creatinine for urinary data. For other sample matrices, every spectrum may be scaled to a total sum of one. We systematically compared a variety of different, sophisticated scaling and normalization methods on metabolomic NMR datasets [50]. The different strategies can be grouped into methods that either adjust the variance of metabolites by variance stabilization and variable scaling strategies, or remove unwanted variation from sample to sample. The influence of data normalization on sample classification was evaluated by using a recently published NMR metabolite fingerprinting dataset of human urine specimens [51]. Further, the impact of the tested methods on the identification of


differential metabolites was assessed on a Latin-square design dataset. After normalization, the structure of the datasets was additionally examined in detail. In conclusion, Quantile [52], Variance Stabilization [53] and Cubic-Spline [54] normalization showed the best overall results in classifying samples, reducing biases and correctly detecting fold changes. In our lab we mostly apply Quantile Normalization, which aims at achieving an identical distribution of feature intensities across all spectra. Following normalization, the similarity of distributions may be assessed by a quantile-quantile plot [52]. If two spectra share the same distribution of feature intensities, their quantiles will be similar and align along the diagonal of the plot.

A convenient way of performing the above data normalization and scaling approaches, as well as the subsequent multivariate data analysis, is provided by the statistical programming environment R [55]. Other common tools for these tasks or parts thereof include, for example, the numerical programming environment MATLAB (The MathWorks Inc., Natick, USA), the online server MetaboAnalyst [56], and the data analysis software SIMCA (Umetrics, Umeå, Sweden).

Unsupervised Data Analysis

Typically, several hundred features are extracted from a single NMR experiment. In contrast, often only a relatively small number of experiments are available to analyze this high-dimensional data space, which renders proper statistical data analysis and visualization difficult. Bioinformatic techniques for the clustering and classification of data can generally be divided into unsupervised and supervised methods. In unsupervised algorithms, no information about underlying groups is used. Therefore, any group separations observed are purely data-driven. This renders these approaches, in contrast to supervised classification algorithms, insensitive to overfitting in the case of small sample numbers.
Unsupervised algorithms are often employed initially to check for group separation prior to classification of data, or in cases where too few samples are available for classification with rigorous cross-validation. In the following, we will discuss some of the most commonly used approaches.

Clustering

The general goal of clustering algorithms is to combine observations into groups or clusters based on a distance measure, by minimizing the distances within a cluster as compared to the distances between clusters. For this, both hierarchical and non-hierarchical algorithms are used.

Hierarchical Clustering

Hierarchical clustering is an intriguingly simple method for finding similarities between spectra. All spectra are arranged in groups called clusters. At the beginning, each cluster contains exactly one spectrum. Bucket values serve as coordinates in a multidimensional space, and a matrix of pairwise distances between clusters is computed using a measure such as Euclidean distance, Manhattan distance, Pearson’s correlation coefficient, or Spearman’s correlation coefficient. The most similar clusters are then merged to form a new, larger cluster. This procedure is repeated iteratively, i.e. a new distance matrix is calculated, the closest clusters are joined, and so on. If a cluster contains more than one sample, an overall coordinate for the cluster has to be defined. In average linkage, the average of all spectra of a cluster is used. In single linkage, the spectra yielding the minimum cluster distance are chosen. In complete linkage, the spectra yielding the largest distance are chosen. The choice of distance measure and linkage type exerts a decisive effect on the final clustering result. All of these distance measures and linkage types are commonly used in metabolomics analyses [57-61]. In the end, all spectra are contained in one cluster. Taking all intermediate steps into account, a hierarchy of clusters has been created that can be visualized as a cluster dendrogram. This tree will reveal groups of similar spectra. Ideally, spectra from predefined groups (e.g. healthy and diseased groups) should end up in different clusters. This will only work if the inter-spectra differences are dominated by group differences rather than noise or other disturbing factors. An example of a well-defined group separation is shown in (Fig. 5).

Fig. (5). Hierarchical clustering tree of 1D 1H spectra of murine lipophilic liver extracts. Distinct clustering into controls and extracts of fatty liver without (steatosis) and with inflammation (non-alcoholic steatohepatitis, NASH), respectively, is observed. [Figure based on data published in Ref. [13]].

Non-Hierarchical Clustering

A typical example of a non-hierarchical clustering algorithm is k-means clustering [62]. In contrast to hierarchical clustering, the number of clusters is predefined by the user based on previous knowledge about the data, for example the known number of analyzed groups of patients. Therefore, it cannot be considered a fully unsupervised method. As in hierarchical clustering, samples are grouped such that, for a given distance function, distances within a cluster are minimal compared to distances between clusters. In the context of NMR-based metabolomics, it has been used, for example, in the analysis of Huntington’s disease in a mouse model [63].

Affinity Propagation [64] is a promising novel approach for clustering various types of data.
Clusters and their respective members are calculated by passing real-valued messages between the points of a dataset. The messages describe the affinity that one data point has for selecting another as its cluster center. In contrast to other clustering algorithms, affinity propagation does not require a random initialization of cluster centers at the start of the algorithm. Therefore, reliable results can be obtained in a single run of the clustering procedure. Frey and Dueck successfully applied affinity propagation to very different kinds of datasets, which also makes the algorithm interesting for application to NMR datasets. Recently, affinity propagation was implemented in the R package APCluster by Bodenhofer and coworkers [65].

Principal Component Analysis

Principal component analysis (PCA) is a widely used unsupervised approach for easy visualization of experimental data. In the case of binned spectra, the data of each spectrum can be considered as one point in a multidimensional space, with each bucket representing one dimension and the bucket intensity representing the value in that dimension. PCA performs a data transformation by defining a new coordinate system within this space. The newly defined dimensions are referred to as principal components (PCs). The first PC is aligned along the direction of maximum variance in the data. The second PC is chosen to be orthogonal to PC1 and, again, to have maximum variance. Following this scheme, PCs are defined either until a fixed number of PCs is reached or until the cumulative variance of the PCs exceeds a certain fraction of the total variance of all original dimensions, e.g. 95%. Mathematically speaking, the whole procedure is based on matrix diagonalization. Plotting the spectra in the reduced PC space, e.g. PC1 versus PC2, visualizes a considerable amount of the variance present in the dataset and allows easy inspection of the data, such as the identification of distinct groups of samples or the detection of batch effects.

Independent Component Analysis

Independent component analysis (ICA) is a method closely related to PCA. In contrast to PCA, which considers only the variance in the data, ICA uses an independence criterion to maximize the statistical independence of the estimated components remaining after whitening by singular value decomposition (comparable to PCA).
It has been previously shown that often ICA results are superior to PCA results and that it is well suited for metabolomic analysis [66, 13]. (Fig. 6) shows an application of ICA (6A) and PCA (6B) to murine urinary NMR fingerprints. Urine had been collected from mice that had been subjected to either a control diet or chows causing a fatty liver without (steatosis) or with inflammation of the liver (NASH) [13]. Data clearly show the improved group separation obtained by ICA.


Fig. (6). Comparison of independent component analysis (A) and principal component analysis (B) of urinary murine NMR spectra. Urine was obtained from BALB/c mice fed either a standard chow (yellow, respectively light grey circles, n=9), or diets inducing simple steatosis (blue, respectively black circles, n=4) and non-alcoholic steatohepatitis (red, respectively dark grey circles, n=5), respectively. [Corresponding data were previously published by Klein et al., 2011].
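To make the PCA procedure described above more concrete, the following Python sketch computes the first principal components of a small set of bucket vectors. It is only an illustration under stated assumptions: the covariance matrix is diagonalized by power iteration with deflation (a simple stand-in for a full eigendecomposition, chosen to keep the example free of external libraries), a fixed number of components is extracted, and all function names are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-bucket mean from every spectrum."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector of a symmetric matrix."""
    d = len(matrix)
    vec = [1.0 + 0.1 * i for i in range(d)]  # deterministic start vector
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variance left
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(d) for j in range(d))
    return value, vec


def pca(rows, n_components=2):
    """Return components, explained variance fractions and sample scores."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    components, explained = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        explained.append(value / total_variance)
        # Deflation: remove the variance already captured by this component.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(row[j] * comp[j] for j in range(d)) for comp in components]
              for row in centered]
    return components, explained, scores
```

Each entry of `scores` holds the coordinates of one spectrum in the reduced PC space, which is what a PC1 versus PC2 plot displays. For real datasets with several hundred buckets, an established numerical library would be the more sensible choice.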

Self Organizing Maps

Self organizing maps (SOMs) are a widely used method for two-dimensional data visualization [67]. SOMs are neural-network-based methods inspired by the topological structure of the brain where, for example, inputs from adjacent fingers are mapped to adjacent areas in the brain. In SOMs, artificial neurons are arranged in a two-dimensional grid, where each neuron competes with all others to best represent the data. The winning neuron is then optimized to represent the input pattern. By means of a neighborhood function, neighboring neurons are optimized to a lesser degree, so that nearby neurons have similar profiles. In NMR-based metabolomics, samples that show similar NMR patterns are thus arranged in neighboring areas of the map. SOMs were used, for example, for the visualization of metabolic changes in breast cancer tissue samples [68].

Supervised Data Analysis

In supervised data analysis, information about the class labels of the individual samples is included. In many cases, the general goal is to test whether a given hypothesis is true or not. For example, one starts the analysis by testing, for each feature of the dataset, the null hypothesis that the samples are members of the same basic population. In the case of two classes, and under the assumption that the data are normally distributed, a Student’s t-test can be conducted. Normality of the data can be checked by means of the Kolmogorov-Smirnov test [69]. The distribution-free counterpart to the parametric Student’s t-test is the Wilcoxon, Mann and Whitney U-test [69]. For the comparison of two or more groups, a one-way analysis of variance (ANOVA) can be conducted for normally distributed data [70]. Note that for two groups Student’s t-test and ANOVA yield the same results. A distribution-free alternative to

ANOVA for non-normally distributed data is the Kruskal-Wallis analysis of variance [71]. Neither the Wilcoxon, Mann and Whitney U-test nor the Kruskal-Wallis test requires normally distributed data; however, both assume a similarly shaped and scaled distribution for each group.

When testing metabolomic data for significant differences with any of the above-mentioned tests, one usually performs a large number of separate tests, one for each feature of the dataset. As a consequence, one might falsely reject the null hypothesis (type I error) for one or more of these tests just by chance. There are different approaches to tackle this multiple testing problem, two of which we briefly discuss in the following. The method of Bonferroni [70] controls the family-wise error rate, which is the probability of making at least one type I error among the features selected as significant. For example, for a significance level α = 0.05 and a total of 2000 features, only features with a p-value ≤ 0.05/2000 would be selected as significantly different. As a consequence, the Bonferroni correction is very conservative and often leads to a very small number of selected features. An alternative is to control the false discovery rate (FDR) at a given level according to the method of Benjamini and Hochberg [72]. For example, when one controls the FDR at the 10% level, one can expect on average at most 10% false discoveries among the set of selected significant features. The control of the FDR is less conservative than the Bonferroni correction and is recommended in cases where a certain proportion of false positives is tolerable.

A straightforward way of analyzing complex experiments with different groups, such as groups subjected to treatments with different drugs or combinations thereof, is the use of the R package Limma [73], which fits a linear model to the data to delineate the influence of each drug on the data.
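The difference between the two multiple testing corrections discussed above can be sketched in a few lines of Python. The function names and the default levels (α = 0.05 for Bonferroni, 10% for the FDR) are illustrative assumptions that mirror the examples in the text; in practice one would rely on an established statistics package.

```python
def bonferroni(p_values, alpha=0.05):
    """Select features whose p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, fdr=0.10):
    """Step-up procedure of Benjamini and Hochberg at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= k / m * fdr.
    k_max = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            k_max = rank
    # All features up to that rank are selected, even if an individual
    # p-value exceeds its own rank-specific threshold.
    selected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= k_max:
            selected[index] = True
    return selected
```

Applied to the same list of p-values, `benjamini_hochberg` never selects fewer features than `bonferroni` when both are run at the same level, which reflects the less conservative nature of FDR control.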
Following the identification of discriminating metabolites, a heatmap representation gives a good overview of the up- or down-regulation of individual features. An example of a heatmap representation generated from urinary NMR fingerprints of patients suffering from autosomal dominant polycystic kidney disease (ADPKD) is given in (Fig. 7) [51].

Analysis of Metabolomic Networks

An important aspect of metabolomic data analysis is to infer associations between individual metabolites. For inferring the correct underlying network topology, it is of prime importance to discriminate between direct and indirect interactions. By employing partial correlation analysis in an iterative fashion, the construction of approximate undirected dependency graphs becomes feasible [74]. The results obtained should be carefully compared to existing knowledge from the literature, including metabolic pathway databases as described below.

Sample Classification

Classification of an unknown sample into known classes (e.g. healthy and diseased) is a typical application of


metabolomics and may be performed using supervised techniques from machine learning. Generally, classifiers or classification algorithms are trained on a training dataset in which the class label of each sample is known, followed by application of the trained algorithm to new test data. For performance evaluation, the class labels of the test data also have to be known. When additional independent test data are difficult to obtain, performance evaluation is often carried out within a cross-validation setting, in which the complete dataset is iteratively split into training and test data. It is strongly recommended to use nested cross-validation schemes in which parameters relevant for feature selection and the classification algorithm are optimized within the inner loops. This ensures that validation is not biased by training or parameter optimization. A schematic representation of a 3-fold nested cross-validation is given in (Fig. 8). Here, the upper bar represents the complete dataset; it is split iteratively into training and test data (indicated by the dark and light green bars, respectively). The training data of this loop are passed on to the middle loop, in which they are again split into training and test data. The training data of the middle loop are transferred to the innermost loop of the nested cross-validation scheme, where they are again split as described above. Here, parameters inherent to the classifier are tuned. In the middle loop, the sparsity of the classifier, e.g. the number of variables used, is optimized, while validation is performed in the outermost loop. It should be ensured that all data of a specific loop are used once for testing. It has been shown previously that a nested cross-validation approach yields an almost unbiased assessment of the true classification error [75]. Without a rigorous validation step, further interpretation of results should not be carried out.

Fig. (7). Example of a heatmap representation of discriminative features. The up- and down-regulation of individual features with respect to the mean over all specimens for a given feature is color coded in yellow and blue, or light and dark grey, respectively. The data were generated in the context of an analysis of urinary samples of patients suffering from autosomal dominant polycystic kidney disease (ADPKD) [51]. In total, data for five groups of patients are displayed, where the group of patients suffering from ADPKD is further subdivided into groups 1A and 1B based on medication. Group 2 represents apparently healthy volunteers, group 3 renal transplant recipients without acute organ rejection, group 4 diabetes mellitus type 2 patients with reduced eGFR and microalbuminuria, and group 5 diabetes mellitus type 2 patients with severely reduced eGFR but no microalbuminuria. Ala, alanine; Carb, carbohydrates; D-Sac, D-saccharic acid; MeOh, methanol; Suc, sucrose; Tar, tartaric acid; Thr, threonine; Tyr, tyrosine; 3-OH-IVA, 3-hydroxyisovaleric acid; 6-OH-NA, 6-hydroxynicotinic acid.

Fig. (8). Schematic representation of a three-fold nested cross-validation. Each loop is represented by a bar of different length. The uppermost bar corresponds to the outer loop that represents the complete data set.

Thus far, approaches based on Partial Least Squares-Discriminant Analysis (PLS-DA) [76], often in combination with orthogonal projection to latent structures (OPLS-DA) [77], have dominated the classification of NMR metabolomic data. Recently, we compared the performance of PLS-DA to that of other classification approaches commonly used in genomics on various metabolite fingerprinting datasets [78]. Some classifiers performed consistently well independent of the dataset in question. These included Random Forests [78, 79] and Support Vector Machines [78, 80, 81]. The former are particularly suited for the analysis of high-dimensional NMR data. A Random Forest classifier consists of a set of tree predictors, where each tree is constructed from a different bootstrap sample of the training data. At each node of a tree, the split into branches is based on a random selection of the input features. The final class label given to a new sample is the result of a majority vote over all trees. Another advantage of Random Forests is the provision of different measures of variable importance, which can be used for the identification of predictive subsets of spectral features [82, 83]. A schematic representation of a Random Forest classifier is given in (Fig. 9).
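Returning to the nested cross-validation scheme recommended above, the following Python sketch shows the basic idea in a simplified form. It is not the three-level scheme of Fig. 8: for brevity only two levels are used, an inner loop that tunes one classifier parameter and an outer loop that validates. The k-nearest-neighbor classifier, the candidate values of k, the interleaved fold assignment and all function names are assumptions made purely for illustration.

```python
def k_folds(indices, n_folds):
    """Split indices into n_folds disjoint test sets (interleaved, deterministic)."""
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_x, train_y, sample, k):
    """Majority vote among the k training samples closest to `sample`."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, sample)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = [label for _, label in distances[:k]]
    return max(sorted(set(votes)), key=votes.count)


def accuracy(x, y, train_idx, test_idx, k):
    """Fraction of test samples classified correctly."""
    train_x = [x[i] for i in train_idx]
    train_y = [y[i] for i in train_idx]
    hits = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in test_idx)
    return hits / len(test_idx)


def nested_cv(x, y, candidate_ks=(1, 3, 5), outer_folds=3, inner_folds=3):
    """Mean outer-loop accuracy with k tuned on the training data only."""
    all_idx = list(range(len(x)))
    outer_scores = []
    for test_idx in k_folds(all_idx, outer_folds):
        train_idx = [i for i in all_idx if i not in test_idx]
        # Inner loop: the outer test data are never seen during tuning.
        best_k, best_score = None, -1.0
        for k in candidate_ks:
            scores = []
            for inner_test in k_folds(train_idx, inner_folds):
                inner_train = [i for i in train_idx if i not in inner_test]
                scores.append(accuracy(x, y, inner_train, inner_test, k))
            mean_score = sum(scores) / len(scores)
            if mean_score > best_score:
                best_k, best_score = k, mean_score
        # Outer loop: evaluate the tuned classifier on the held-out samples.
        outer_scores.append(accuracy(x, y, train_idx, test_idx, best_k))
    return sum(outer_scores) / len(outer_scores)
```

The essential design point is that parameter tuning only ever sees the training portion of each outer split, so the outer accuracy is not biased by the optimization. A third level for feature selection, as described in the text, would be added as a further loop between the two shown here.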
Support vector machines showed good performance on high- as well as low-dimensional datasets. Support vector machine classifiers are so-called large margin classifiers, in which a separating hyperplane is determined such that the distance between the individual classes of the training data is maximized. This hyperplane is constructed in a high-dimensional vector space defined by the individual feature levels. Both classifiers are also included in the MetaboAnalyst software package [56]. Machine-learning algorithms have also been compared for an NMR dataset of 63 preselected metabolites [84]. The considerably lower dimensionality of that dataset, as well as differences in the joint distribution of features in each dataset, makes a direct comparison difficult. Nevertheless, Support Vector Machines also worked well in this setting. It is not to be expected that one particular algorithm will work best for the classification of all kinds of NMR datasets, but in our experience Random Forests and Support Vector Machines generally perform reasonably well.

In other -omics sciences, the performance of algorithms could often be improved by combining them with data-driven variable selection approaches. This strategy differs from pre-selecting metabolite measurements a priori, as done by Eisner et al. 2011. Feature selection approaches are normally grouped into three classes: wrapper, filter, and embedded methods [85]. The feature selection of the Elastic Net [86] classification algorithm is an example of the class of embedded feature selection methods, whereas the feature selection used by the Nearest Shrunken Centroids classification approach [87] is representative of the wrapper methods. Among the large variety of available algorithms, t-score-based feature filtering performed best for gene-expression data [85] as well as for high-dimensional metabolic NMR data [78]. In another study, filtering by Gini importance in combination with PLS-DA classification also worked well for NMR-based chemotaxonomic datasets [82].

Classification Probabilities

The outcome of the sample classification process mainly depends on two factors: (i) the choice of the classification algorithm, and (ii) the classification problem itself. After finding an appropriate classifier, the question of classification reliability also needs to be addressed. When analyzing metabolomic datasets, classification algorithms are evaluated by the prediction accuracy calculated in a cross-validation approach.
Even if the averaged prediction accuracy is high, only a few algorithms provide a measure of reliability for individual samples. For those that do, the output of the classifier is a set of class probability values 0 ≤ p_j ≤ 1, one for each class j. In clinical practice, however, one is not only interested in classification accuracies averaged over many cases; the reliability of each individual classification is crucial. In this context, Appel and coworkers compared six class probability estimators

Fig. (9). Schematic representation of a Random Forest classifier. Each tree is constructed from a different bootstrap sample of the training data as indicated by the different colors of the trees.


including Naïve Bayes estimators or binary regression employing logistic models and adapted them for metabolomics analyses [88]. They showed that an approach based on local error rates works well for the estimation of individual classification probabilities.

BIOLOGICAL INTERPRETATION OF RESULTS

Following statistical data analysis and metabolite identification, the list of differential metabolites is analyzed to identify biologically meaningful patterns. This generally involves careful study of the relevant literature. Given the enormous amount of information gathered over decades on the biochemistry of enzymes and related metabolic pathways, as well as on the natural abundances of metabolites in cells, tissues and physiological fluids under both physiological and pathophysiological conditions, the use of databases that summarize and organize such information is highly recommended. One such database is the Human Metabolome Database (HMDB), which contains extensive information on individual metabolites, including, among others, corresponding MS and NMR spectra and commonly observed concentration ranges [89]. Valuable databases for placing individual metabolites into the context of metabolic pathways are the Kyoto Encyclopedia of Genes and Genomes (http://www.genome.jp/kegg/), the highly interactive Reactome database (http://www.reactome.org) [90], and the HumanCyc database (http://humancyc.org) [91]. The latter is based on the prediction of metabolic pathways from human genome data and contains detailed biochemical information on the corresponding enzymes. However, mapping of regulated metabolites to biological pathways often fails to give a full and accurate picture of the mechanisms underlying changes in metabolite abundance. Indeed, incorrect interpretation of the biological or medical significance of metabolite data is often due to an inappropriate choice of thresholds (significance levels) for metabolite selection.
Further, statistical methods commonly used for metabolite selection, such as t-tests, assume independent sampling of metabolites, which is often not the case. One avenue to overcome these problems, at least partially, is metabolite set enrichment analysis (MSEA), which searches for the enrichment of pre-defined groups of metabolites rather than single metabolites [92]. These metabolite groups have been built from previous pathway knowledge. Finally, to understand the dynamics of metabolic networks and to predict, for example, the effects of gene knockouts, metabolic modeling based on a system of time-dependent ordinary differential equations may be used.

DATA STORAGE

Providing public access to research data is not only recommended for the growing field of metabolomics, but should, as far as possible, be promoted in every area of research. It not only enables the validation of results and conclusions in publications by external investigators, but also supplies additional data for comparison or further investigation by other research groups. In 2012, the European Bioinformatics Institute (EMBL-EBI) established an open-source database for metabolomics experiments and the corresponding metadata, called MetaboLights [http://www.ebi.ac.uk/metabolights] [93]. Researchers are able to store


their raw NMR spectra as well as complementary data in that database and describe their experimental procedures so that other scientists can reconstruct how the data were acquired.

SUMMARY

The last few years have seen rapid progress in the field of NMR-based metabolomics. Successful analysis of metabolite abundance and flux requires optimization of every step involved, from appropriate study design through sample preparation, NMR measurements, spectral processing, and compound identification and quantification to both unsupervised and supervised classification of spectra for the identification of spectral features or patterns that may serve as diagnostic, therapeutic or prognostic biomarkers. Progress will be further facilitated by the submission of well-documented raw and processed spectra to databases such as MetaboLights, which will allow the direct comparative assessment of old and novel data processing and classification routines.

Despite the use of higher field magnets and cryogenically cooled probes, NMR spectroscopy is still limited for the most part to the low micromolar range. A possible avenue to significantly increase the number of detected low-abundance metabolites is the combination of NMR with other analytical methods such as GC-MS and LC-MS [8]. In this context, the detailed investigation of specific metabolic pathways containing numerous low-abundance metabolites not amenable to NMR is one key advantage of mass-spectrometry-based approaches. Both NMR and mass spectrometry allow metabolic fingerprinting, in which a broad screening is performed. One advantage of NMR spectroscopy in that regard is its high reproducibility over extended periods of time, which renders it a very robust tool for subsequent multivariate data analysis. In addition, NMR sensitivity does not depend on chemical properties of the metabolites in question such as hydrophobicity or pKa [94].
Therefore, ideally a combination of NMR and mass spectrometry techniques should be used. This is an area expected to see considerable progress in the coming years, in particular with regard to the consolidation of data from different analytical platforms in a single data matrix for subsequent joint statistical data analysis.

CONFLICT OF INTEREST

The authors confirm that this article content has no conflict of interest.

ACKNOWLEDGEMENTS

This report was supported in part by the Bavarian Genomic Network (BayGene), the German Federal Ministry of Education and Research (BMBF Grant no. 01 ER 0821), and the German Research Foundation (KFO 262).

REFERENCES

[1] Klein, M. S.; Buttchereit, N.; Miemczyk, S. P.; Immervoll, A. K.; Louis, C.; Wiedemann, S.; Junge, W.; Thaller, G.; Oefner, P. J.; Gronwald, W. NMR metabolomic analysis of dairy cows reveals milk glycerophosphocholine to phosphocholine ratio as prognostic biomarker for risk of ketosis. J. Proteome Res., 2012, 11, 1373-1381.

[2] Wishart, D. S. Metabolomics: The Principles and Potential Applications to Transplantation. Am. J. Transplant., 2005, 5, 2814-2820.
[3] Lerche, M. H.; Meier, S.; Jensen, P. R.; Hustvedt, S.-O.; Karlsson, M.; Duus, J. O.; Ardenkjaer-Larsen, J. H. Quantitative Dynamic Nuclear Polarization-NMR on Blood Plasma for Assays of Drug Metabolism. NMR Biomed., 2011, 24, 96-103.
[4] Günther, U. L. Dynamic Nuclear Hyperpolarization in Liquids. Top. Curr. Chem., 2011, 335, 23-69.
[5] Erdfelder, E.; Faul, F.; Buchner, A. GPOWER: A general power analysis program. Behav. Res. Meth. Ins. C., 1996, 28, 1-11.
[6] Keun, H. C.; Ebbels, T. M. D.; Antti, H.; Bollard, M. E.; Beckonert, O.; Schlotterbeck, G.; Senn, H.; Niederhauser, U.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Analytical Reproducibility in 1H NMR-Based Metabonomic Urinalysis. Chem. Res. Toxicol., 2002, 15, 1380-1386.
[7] Beckonert, O.; Keun, H. C.; Ebbels, T. M. D.; Bundy, J.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Metabolic Profiling, Metabolomic and Metabonomic Procedures for NMR Spectroscopy of Urine, Plasma, Serum and Tissue Extracts. Nat. Protocols, 2007, 2, 2692-2702.
[8] Klein, M. S.; Almstetter, M.; Schlamberger, G.; Nürnberger, N.; Dettmer, K.; Oefner, P. J.; Meyer, H. H. D.; Wiedemann, S.; Gronwald, W. Nuclear Magnetic and Mass Spectrometry-based Milk Metabolomics in Dairy Cows During Early and Late Lactation. J. Dairy Sci., 2010, 93, 1539-1550.
[9] Chirasani, S. R.; Leukel, P.; Gottfried, E.; Hochrein, J.; Stadler, K.; Neumann, B.; Oefner, P. J.; Gronwald, W.; Bogdahn, U.; Hau, P.; Kreutz, M.; Grauer, O. M. Diclofenac Inhibits Lactate Formation and Efficiently Counteracts Local Immune Suppression in a Murine Glioma Model. Int. J. Cancer, 2013, 132, 843-853.
[10] Waters, N. J.; Garrod, S.; Farrant, R. D.; Haselden, J. N.; Connor, S. C.; Connelly, J.; Lindon, J. C.; Holmes, E.; Nicholson, J. K. High-resolution magic angle spinning 1H NMR spectroscopy of intact liver and kidney: optimization of sample preparation procedures and biochemical stability of tissue during spectral acquisition. Anal. Biochem., 2000, 282, 16-23.
[11] Smith, L. M.; Maher, A. D.; Want, E. J.; Elliott, P.; Stamler, J.; Hawkes, G. E.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Large-scale human metabolic phenotyping and molecular epidemiological studies via 1H NMR spectroscopy of urine: investigation of borate preservation. Anal. Chem., 2009, 81, 4847-4856.
[12] Jiang, L.; Huang, J.; Wang, Y.; Tang, H. Eliminating the dication-induced intersample chemical-shift variations for NMR-based biofluid metabonomic analysis. Analyst, 2012, 137, 4209-4219.
[13] Klein, M. S.; Dorn, C.; Saugspier, M.; Hellerbrand, C.; Oefner, P. J.; Gronwald, W. Discrimination of Steatosis and NASH in Mice Using Nuclear Magnetic Resonance Spectroscopy. Metabolomics, 2011, 7, 237-246.
[14] Oostendorp, M.; Engelke, U. F.; Willemsen, M. A.; Wevers, R. A. Diagnosing Inborn Errors of Lipid Metabolism with Proton Nuclear Magnetic Resonance Spectroscopy. Clin. Chem., 2006, 52, 1395-1405.
[15] Thomas, A.; Stevens, A. P.; Klein, M. S.; Hellerbrand, C.; Dettmer, K.; Gronwald, W.; Oefner, P. J.; Reinders, J. Early Changes in the Liver-Soluble Proteome from Mice Fed a Non-Alcoholic Steatohepatitis Inducing Diet. Proteomics, 2012, 12, 1437-1451.
[16] Akoka, S.; Barantin, L.; Trierweiler, M. Concentration Measurement by Proton NMR Using the ERETIC Method. Anal. Chem., 1999, 71, 2554-2557.
[17] Friebolin, H. One- and Two-Dimensional NMR Spectroscopy, 3rd ed.; Wiley-VCH: Weinheim, 1998.
[18] Meiboom, S.; Gill, D. Modified Spin Echo Method for Measuring Nuclear Relaxation Times. Rev. Sci. Instr., 1958, 29, 688-691.
[19] Aue, W. P.; Karhan, J.; Ernst, R. R. Homonuclear Broad Band Decoupling and Two-dimensional J-Resolved NMR Spectroscopy. J. Chem. Phys., 1976, 64, 4226-4227.
[20] Bodenhausen, G.; Ruben, D. J. Natural Abundance Nitrogen-15 NMR by Enhanced Heteronuclear Spectroscopy. Chem. Phys. Lett., 1980, 69, 185-189.
[21] Maciejewski, M. W.; Qui, H. Z.; Rujan, I.; Mobli, M.; Hoch, J. C. Nonuniform Sampling and Spectral Aliasing. J. Magn. Reson., 2009, 199, 88-93.
[22] Orekhov, V. Y.; Jaravine, V. A. Analysis of Non-Uniformly Sampled Spectra with Multidimensional-Decomposition. Prog. NMR Spectrosc., 2011, 59, 271-292.

[23] Barna, J. C. J.; Laue, E. D.; Mayger, M. R.; Skilling, J.; Worrall, S. J. P. Exponential Sampling, an Alternative Method for Sampling in Two-Dimensional NMR Experiments. J. Magn. Reson., 1987, 73, 69-77.
[24] Aue, W. P.; Bartholdi, E.; Ernst, R. R. Two-dimensional Spectroscopy. Application to Nuclear Magnetic Resonance. J. Chem. Phys., 1976, 64, 2229-2246.
[25] Braunschweiler, L.; Ernst, R. R. Coherence Transfer by Isotropic Mixing: Application to Proton Correlation Spectroscopy. J. Magn. Reson., 1983, 53, 521-528.
[26] Moseley, H. N.; Lane, A. N.; Belshoff, A. C.; Higashi, R. M.; Fan, T. W. A novel deconvolution method for modeling UDP-N-acetyl-D-glucosamine biosynthetic pathways based on 13C mass isotopologue profiles under non-steady-state conditions. BMC Biol., 2011, 9, 37.
[27] Meier, S.; Karlsson, M.; Jensen, P. R.; Lerche, M. H.; Duus, J. O. Metabolic Pathway Visualization in Living Yeast by DNP-NMR. Mol. BioSyst., 2011, 7, 2834-2836.
[28] Xia, J.; Bjorndahl, T. C.; Tang, P.; Wishart, D. S. MetaboMiner--semi-automated identification of metabolites from 2D NMR spectra of complex biofluids. BMC Bioinformatics, 2008, 9, 507.
[29] Zacharias, H. U.; Schley, G.; Hochrein, J.; Klein, M. S.; Köberle, C.; Eckardt, K.-U.; Willam, C.; Oefner, P. J.; Gronwald, W. Analysis of Human Urine Reveals Metabolic Changes Related to the Development of Acute Kidney Injury Following Cardiac Surgery. Metabolomics, 2013, 9, 697-707.
[30] Weljie, A. M.; Newton, J.; Mercier, P.; Carlson, E.; Slupsky, C. M. Targeted Profiling: Quantitative Analysis of 1H NMR Metabolomics Data. Anal. Chem., 2006, 78, 4430-4442.
[31] Weljie, A. M.; Newton, J.; Jirik, F. R.; Vogel, H. J. Evaluating Low-Intensity unknown Signals in Quantitative Proton NMR Mixture Analysis. Anal. Chem., 2008, 80, 8956-8965.
[32] Klein, M. S.; Oefner, P. J.; Gronwald, W. MetaboQuant: A Tool Combining Individual Peak Calibration and Outlier Detection for Accurate Metabolite Quantification in 1D 1H and 1H-13C HSQC NMR Spectra. Biotechniques, 2013, 54, 251-256.
[33] Mierisova, S.; Ala-Korpela, M. MR spectroscopy quantitation: a review of frequency domain methods. NMR Biomed., 2001, 14, 247-259.
[34] Lewis, I. A.; Schommer, S. C.; Hodis, B.; Robb, K. A.; Tonelli, M.; Westler, W. M.; Sussman, M. R.; Markley, J. L. Method for Determining Molar Concentrations of Metabolites in Complex Solutions from Two-Dimensional 1H-13C NMR Spectra. Anal. Chem., 2007, 79, 9385-9390.
[35] Gronwald, W.; Klein, M. S.; Kaspar, H.; Fagerer, S.; Nürnberger, N.; Dettmer, K.; Bertsch, T.; Oefner, P. J. Urinary Metabolite Quantification Employing 2D NMR Spectroscopy. Anal. Chem., 2008, 80, 9288-9297.
[36] Hu, K.; Westler, W. M.; Markley, J. L. Simultaneous quantification and identification of individual chemicals in metabolite mixtures by two-dimensional extrapolated time-zero 1H-13C HSQC (HSQC0). J. Am. Chem. Soc., 2011, 133, 1662-1665.
[37] Rai, R. K.; Tripathi, P.; Sinha, N. Quantification of metabolites from two-dimensional nuclear magnetic resonance spectroscopy: application to human urine samples. Anal. Chem., 2009, 81, 10232-10238.
[38] Skinner, T. E.; Gershenzon, N. I.; Nimbalkar, M.; Bermel, W.; Luy, B.; Glaser, S. J. New strategies for designing robust universal rotation pulses: application to broadband refocusing at low power. J. Magn. Reson., 2012, 216, 78-87.
[39] Lane, A. N.; Fan, T. W.; Higashi, R. M. Stable isotope-assisted metabolomics in cancer research. IUBMB Life, 2008, 60, 124-129.
[40] Miccheli, A.; Tomassini, A.; Puccetti, C.; Valerio, M.; Peluso, G.; Tuccillo, F.; Calvani, M.; Manetti, C.; Conti, F. Metabolic profiling by 13C-NMR spectroscopy: [1,2-13C2]glucose reveals a heterogeneous metabolism in human leukemia T cells. Biochimie, 2006, 88, 437-448.
[41] Lane, A. N.; Fan, T. W. Quantification and Identification of Isotopomer Distributions of Metabolite in Crude Cell Extract Using 1H TOCSY. Metabolomics, 2007, 3, 79-86.
[42] Howe, P. W.; Ament, Z.; Knowles, K.; Griffin, J. L.; Wright, J. Combined use of filtered and edited 1H NMR spectroscopy to detect 13C-enriched compounds in complex mixtures. NMR Biomed., 2012, 25, 1217-1223.
[43] Merritt, M. E.; Burgess, S. C.; Spitzer, T. D. Adiabatic JHSQC for 13C isotopomer analysis. Magn. Reson. Chem., 2006, 44, 463-466.

[44] Filipp, F. V.; Scott, D. A.; Ronai, Z. A.; Osterman, A. L.; Smith, J. W. Reverse TCA cycle flux through isocitrate dehydrogenases 1 and 2 is required for lipogenesis in hypoxic melanoma cells. Pigment Cell Melanoma R., 2012, 25, 375-383.
[45] Forshed, J.; Schuppe-Koistinen, I.; Jacobsson, S. P. Peak Alignment of NMR Signals by Means of a Genetic Algorithm. Anal. Chim. Acta, 2003, 487, 189-199.
[46] Stoyanova, R.; Nicholls, A. W.; Nicholson, J. K.; Lindon, J. C.; Brown, T. R. Automatic alignment of individual peaks in large high-resolution spectral data sets. J. Magn. Reson., 2004, 170, 329-335.
[47] Cloarec, O.; Dumas, M.-E.; Craig, A.; Barton, R.; Trygg, J.; Hudson, J.; Blancher, C.; Gauguier, D.; Lindon, J. C.; Holmes, E.; Nicholson, J. K. Statistical Total Correlation Spectroscopy: An Exploratory Approach for Latent Biomarker Identification from Metabolomic 1H NMR Data Sets. Anal. Chem., 2005, 77, 1282-1289.
[48] Cloarec, O.; Dumas, M. E.; Trygg, J.; Craig, A.; Barton, R. H.; Lindon, J. C.; Nicholson, J. K.; Holmes, E. Evaluation of the orthogonal projection on latent structure model limitations caused by chemical shift variability and improved visualization of biomarker changes in 1H NMR spectroscopic metabonomic studies. Anal. Chem., 2005, 77, 517-526.
[49] Wishart, D. S. Computational Approaches to Metabolomics. Methods Mol. Biol., 2010, 593, 283-313.
[50] Kohl, S. M.; Klein, M. S.; Hochrein, J.; Oefner, P. J.; Spang, R.; Gronwald, W. State-of-the Art Data Normalization Methods Improve NMR-Based Metabolomic Analysis. Metabolomics, 2012, 8, 146-160.
[51] Gronwald, W.; Klein, M. S.; Zeltner, R.; Schulze, B.-D.; Reinhold, S. W.; Deutschmann, M.; Immervoll, A.-K.; Böger, C. A.; Banas, B.; Eckardt, K.-U.; Oefner, P. J. Detection of Autosomal Polycystic Kidney Disease Using NMR Spectroscopic Fingerprints of Urine. Kidney Int., 2011, 79, 1244-1253.
[52] Bolstad, B. M.; Irizarry, R. A.; Astrand, M.; Speed, T. P. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias. Bioinformatics, 2003, 19, 185-193.
[53] Huber, W.; Heydebreck, A. V.; Sültmann, H.; Poustka, A.; Vingron, M. Variance Stabilisation Applied to Microarray Data Calibration and to the Quantification of Differential Expression. Bioinformatics, 2002, 18, S96-S104.
[54] Workman, C.; Jensen, L. J.; Jarmer, H.; Berka, R.; Gautier, L.; Nielser, H. B.; Saxild, H. H.; Nielsen, C.; Brunak, S.; Knudsen, S. A New Non-Linear Normalization Method for Reducing Variability in DNA Microarray Experiments. Genome Biol., 2002, 3, research0048.
[55] R Development Core Team. R: A Language and Environment for Statistical Computing. 2009.
[56] Xia, J.; Psychogios, N.; Young, N.; Wishart, D. S. MetaboAnalyst: a Web Server for Metabolomic Data Analysis and Interpretation. Nucl. Acids Res., 2009, 37, W652-W660.
[57] Houtkooper, R. H.; Argmann, C.; Houten, S. M.; Canto, C.; Jeninga, E. H.; Andreux, P. A.; Thomas, C.; Doenlen, R.; Schoonjans, K.; Auwerx, J. The metabolic footprint of aging in mice. Sci. Rep., 2011, 1, 134.
[58] Borgan, E.; Sitter, B.; Lingjaerde, O. C.; Johnsen, H.; Lundgren, S.; Bathen, T. F.; Sorlie, T.; Borresen-Dale, A. L.; Gribbestad, I. S. Merging transcriptomics and metabolomics--advances in breast cancer profiling. BMC Cancer, 2010, 10, 628.
[59] Roessner, U.; Luedemann, A.; Brust, D.; Fiehn, O.; Linke, T.; Willmitzer, L.; Fernie, A. Metabolic profiling allows comprehensive phenotyping of genetically or environmentally modified plant systems. Plant Cell, 2001, 13, 11-29.
[60] Draisma, H. H.; Reijmers, T. H.; Meulman, J. J.; van der Greef, J.; Hankemeier, T.; Boomsma, D. I. Hierarchical clustering analysis of blood plasma lipidomics profiles from mono- and dizygotic twin families. Eur. J. Hum. Genet., 2013, 21, 95-101.
[61] Chinnaiyan, P.; Kensicki, E.; Bloom, G.; Prabhu, A.; Sarcar, B.; Kahali, S.; Eschrich, S.; Qu, X.; Forsyth, P.; Gillies, R. The metabolomic signature of malignant glioma reflects accelerated anabolic metabolism. Cancer Res., 2012, 72, 5878-5888.
[62] Hartigan, J. Clustering Algorithms; John Wiley: New York, 1975.
[63] Nikas, J. B.; Low, W. C. Application of clustering analyses to the diagnosis of Huntington disease in mice and other diseases with well-defined group boundaries. Comput. Methods Programs Biomed., 2011, 104, e133-e147.

[64] Frey, B. J.; Dueck, D. Clustering by passing messages between data points. Science, 2007, 315, 972-976.
[65] Bodenhofer, U.; Kothmeier, A.; Hochreiter, S. APCluster: an R package for affinity propagation clustering. Bioinformatics, 2011, 27, 2463-2464.
[66] Scholz, M.; Gatzek, S.; Sterling, A.; Fiehn, O.; Selbig, J. Metabolite Fingerprinting: Detecting Biological Features by Independent Component Analysis. Bioinformatics, 2004, 20, 2447-2454.
[67] Dow, L. K.; Sandeep, K.; Dow, E. R. Self-organizing Maps for the Analysis of NMR Spectra. Biosilico, 2004, 2, 157-163.
[68] Beckonert, O.; Monnerjahn, J.; Bonk, U.; Leibfritz, D. Visualizing metabolic changes in breast-cancer tissue using 1H-NMR spectroscopy and self-organizing maps. NMR Biomed., 2003, 16, 1-11.
[69] Sachs, L. Angewandte Statistik; Springer Verlag: Berlin, 1997.
[70] Rudolf, M.; Kuhlisch, W. Biostatistik, Eine Einführung für Biowissenschaftler; Pearson Education: München, 2008.
[71] Kruskal, W. H.; Wallis, W. A. Use of Ranks in One-Criterion Variance Analysis. J. Am. Stat. Assoc., 1952, 47, 583-621.
[72] Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J. Roy. Stat. Soc. B, 1995, 57, 289-300.
[73] Smyth, G. C. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor; Gentleman, R.; Carey, V.; Irizarry, R. A.; Huber, W., Eds.; Springer: New York, 2005; pp. 397-420.
[74] de la Fuente, A.; Bing, N.; Hoeschle, I.; Mendes, P. Discovery of Meaningful Associations in Genomic Data Using Partial Correlation Coefficients. Bioinformatics, 2004, 20, 3565-3574.
[75] Varma, S.; Simon, R. Bias in Error Estimation when Using Cross-Validation for Model Selection. BMC Bioinformatics, 2006, 7, 91.
[76] Barker, M.; Rayens, W. Partial Least Squares for Discrimination. J. Chemometrics, 2003, 17, 166-173.
[77] Trygg, J.; Wold, S. Orthogonal Projections to Latent Structures. J. Chemometrics, 2002, 16, 119-128.
[78] Hochrein, J.; Klein, M. S.; Zacharias, H. U.; Li, J.; Wijffels, G.; Schirra, H. J.; Spang, R.; Oefner, P. J.; Gronwald, W. Performance Evaluation of Algorithms for the Classification of Metabolic 1H NMR Fingerprints. J. Proteome Res., 2012, 11, 6242-6251.
[79] Breiman, L. Random Forests. Mach. Learn., 2001, 45, 5-32.
[80] Dudoit, S.; Fridlyand, J.; Speed, T. P. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. J. Am. Stat. Assoc., 2002, 97, 77-87.
[81] Burges, C. J. C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Disc., 1998, 2, 121-167.
[82] Menze, B. H.; Kelm, B. M.; Masuch, R.; Himmelreich, U.; Bachert, P.; Petrich, W.; Hamprecht, F. A. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics, 2009, 10, 213.
[83] Bryan, K.; Brennan, L.; Cunningham, P. MetaFIND: a feature analysis tool for metabolomics data. BMC Bioinformatics, 2008, 9, 470.
[84] Eisner, R.; Stretch, C.; Eastman, T.; Xia, J.; Hau, D.; Damaraju, S.; Greiner, R.; Wishart, D. S.; Baracos, V. E. Learning to Predict Cancer-Associated Skeletal Muscle Wasting from 1H-NMR Profiles of Urinary Metabolites. Metabolomics, 2011, 7, 25-34.
[85] Haury, A.-C.; Gestraud, P.; Vert, J.-P. The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures. PLoS ONE, 2011, 6, e28210.
[86] Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. B, 2005, 67, 301-320.
[87] Tibshirani, R.; Hastie, T.; Narasimhan, B.; Chu, G. Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression. Proc. Natl. Acad. Sci. U.S.A., 2002, 99, 6567-6572.
[88] Appel, I. J.; Gronwald, W.; Spang, R. Estimation of Classification Probabilities in High-Dimensional Diagnostic Studies. Bioinformatics, 2011, 27, 2563-2570.
[89] Wishart, D. S.; Tzur, D.; Knox, C.; Querengesser, L. HMDB: The Human Metabolome Database. Nucl. Acids Res., 2007, 35, D521-D526.
[90] Vastrik, I.; D'Eustachio, P.; Schmidt, E.; Gopinath, G.; Croft, D.; de Bono, B.; Gillespie, M.; Jassal, B.; Lewis, S.; Matthews, L.; Wu, G.; Birney, E.; Stein, L. Reactome: a knowledge base of biologic pathways and processes. Genome Biol., 2007, 8, R39.
[91] Romero, P.; Wagg, J.; Green, M. L.; Kaiser, D.; Krummenacker, M.; Karp, P. D. Computational prediction of human metabolic pathways from the complete human genome. Genome Biol., 2005, 6, R2.
[92] Xia, J.; Wishart, D. S. MSEA: A Web-based Tool to Identify Biologically Meaningful Patterns in Quantitative Metabolomic Data. Nucl. Acids Res., 2010, 38, W71-W77.
[93] Steinbeck, C.; Conesa, P.; Haug, K.; Mahendraker, T.; Williams, M.; Maguire, E.; Rocca-Serra, P.; Sansone, S. A.; Salek, R. M.; Griffin, J. L. MetaboLights: towards a new COSMOS of metabolomics data management. Metabolomics, 2012, 8, 757-760.
[94] Almstetter, M.; Appel, I. J.; Gruber, M. A.; Lottaz, C.; Timischl, B.; Spang, R.; Dettmer, K.; Oefner, P. J. Integrative Normalization and Comparative Analysis for Metabolic Fingerprinting by Comprehensive Two-Dimensional Gas Chromatography-Time-of-Flight Mass Spectrometry. Anal. Chem., 2009, 81, 5731-5739.

Received: May 29, 2013
Revised: June 25, 2013
Accepted: June 26, 2013