
Metabolomics (2013) 9:1301–1310 DOI 10.1007/s11306-013-0539-4

ORIGINAL ARTICLE

Precursor mass prediction by clustering ionization products in LC-MS-based metabolomics

Terk Shuen Lee • Ying Swan Ho • Hock Chuan Yeo • Joyce Pei Yu Lin • Dong-Yup Lee

Received: 28 December 2012 / Accepted: 12 April 2013 / Published online: 21 April 2013
© Springer Science+Business Media New York 2013

Abstract Liquid chromatography-mass spectrometry (LC-MS) is becoming the dominant technology in metabolomics, which involves the comprehensive analysis of small molecules in biological systems. However, its use is still limited mainly by challenges in global high-throughput identification of metabolites: LC-MS data is highly complex, particularly due to the formation of multiple ionization products from individual metabolites. To address this limitation in metabolite identification, we developed a principled approach designed to exploit the multi-dimensional information hidden in the data. The workflow first clusters candidate ionization products of the same metabolite, which typically have similar retention times, and then searches for mass relationships among them in order to determine their ion types and the metabolite identity. The robustness of our approach was demonstrated by its application to the LC-MS profiles of cell culture supernatant, in which it accurately predicted most of the known media components in the samples. Compared to conventional methods, our approach generated significantly fewer candidate metabolites without missing valid ones, thus reducing false-positive matches. Additionally, improved confidence in identification is achieved since each prediction comes with a probable combination of known ion types. Hence, our integrative workflow provides precursor mass predictions with high confidence by identifying various ionization products which account for a large proportion of detected peaks, thus minimizing false positives.

Keywords LC-MS • Precursor mass prediction • Ionization products • Untargeted metabolomics

Electronic supplementary material The online version of this article (doi:10.1007/s11306-013-0539-4) contains supplementary material, which is available to authorized users.

T. S. Lee • Y. S. Ho • H. C. Yeo • J. P. Y. Lin • D.-Y. Lee
Bioprocessing Technology Institute, A*STAR (Agency for Science, Technology and Research), 20 Biopolis Way, #06-01, Singapore 138668, Singapore

D.-Y. Lee (corresponding author)
Department of Chemical and Biomolecular Engineering, National University of Singapore, 4 Engineering Drive 4, Singapore 117576, Singapore
e-mail: [email protected]

1 Introduction

Metabolomics is a rapidly emerging field involving the measurement and study of small molecules in biological systems. These small molecules, known as metabolites, are the end products of cellular processes, and thus their levels can reflect the phenotypic state of a biological system (Fiehn 2002). This makes metabolomics a valuable tool within the systems biology framework for investigating cellular responses to perturbations, with the ultimate aim of understanding complex systems which include microbes (Reaves and Rabinowitz 2011), plants (Saito and Matsuda 2010), human tissues and cells (Yin et al. 2009; Jansson et al. 2009; Roux et al. 2011), as well as other mammalian systems (Selvarasu et al. 2012).

Metabolomic approaches can be either targeted or untargeted. The former focuses on identifying and quantifying metabolites from a certain metabolic pathway or class of compounds (Lu et al. 2008), while untargeted metabolic profiling involves the global analysis of metabolite signals measured by one or more analytical platforms (Dettmer et al. 2007). Liquid chromatography-mass spectrometry (LC-MS) is one of the most commonly used analytical platforms for


untargeted metabolomics studies (Roux et al. 2011). Metabolites in a complex sample are first separated via chromatography-based methods, most often as a function of their polarities and interactions. This allows their elution at different retention times (RT). The eluting analytes are ionized, typically by electrospray ionization (ESI), and further separated in the mass spectrometer according to their mass-to-charge ratio (m/z), generating a mass spectrum which depicts the m/z of each eluting compound and its corresponding intensity. Hence, the resulting data for the entire chromatographic run can be visualized as a three-dimensional plot, where each peak represents a detected ion and is characterized by its m/z, RT and intensity (supplementary Fig. S1). Advances in technology, such as ultra-performance liquid chromatography (UPLC) (Want et al. 2010; Wang et al. 2011), the Orbitrap mass analyzer (Hu et al. 2005) and the Fourier transform ion cyclotron resonance (FT-ICR) mass analyzer (Junot et al. 2010), have increased throughput, sensitivity and mass resolution, thus making LC-MS a powerful tool for metabolic profiling.

Importantly, meaningful biological insights can only be gained when the metabolite identities of interesting features are correctly determined globally. Metabolite identification broadly falls under two categories: definitive and putative (Sumner et al. 2007). Definitive identification, which has a higher level of confidence, requires at least two orthogonal properties to be matched to those of a spectral standard. These are typically m/z coupled with either RT or tandem mass spectrometry (MS/MS) fragmentation pattern (Chen et al. 2008; Pluskal et al. 2010). A number of tools (Benton et al. 2008; Bonn et al. 2010; Creek et al. 2011) and spectral libraries (Smith et al. 2005; Horai et al. 2010) are available to aid definitive identification. However, this approach requires the availability of metabolite standards under the same experimental conditions. Thus, definitive identification requires prior laborious experiments and is not always achievable. Instead, putative metabolite identification is often used, especially in the early stages of analysis (Werner et al. 2008). This approach employs one or more properties to determine metabolite identity without rigorous comparison to metabolite standards. Typically, m/z is the main property used, but orthogonal information such as RT can be employed (Creek et al. 2011) to differentiate isomers. Candidate molecular formulae (elemental compositions) are first computed for each peak based on m/z (Kind and Fiehn 2006, 2007; Stoll et al. 2006; Rogers et al. 2009), followed by formula matching to databases to determine putative identity (Iijima et al. 2008; Brown et al. 2011). Freely available databases include HMDB (Wishart et al. 2009), MMMDB (Sugimoto et al. 2012), MMCD (Cui et al. 2008), KEGG (Kanehisa et al. 2012), MMD (Brown et al. 2009), MZedDB (Draper et al. 2009) and PubChem


(Sayers et al. 2012). Putative identification can also be obtained by directly matching m/z values to records in these resources without generating molecular formulae.

Despite the rich information generated by LC-MS instruments, there is currently a bottleneck for high-throughput, automatic and untargeted identification of metabolites (Scalbert et al. 2009; Roux et al. 2011). This limitation is mainly attributable to two factors. Firstly, current knowledge and database resources of naturally occurring metabolites are still limited. Even the human metabolome has not yet been fully characterized (Dunn et al. 2011). Furthermore, it is difficult to standardize experimental data for sharing because measurable properties such as fragmentation patterns and RT are highly dependent on analytical conditions (Scalbert et al. 2009; Creek et al. 2011). Secondly, LC-MS data is highly complex and multidimensional. While the number of detected features in an LC-MS run is typically in the thousands, they actually correspond to a much smaller set of metabolites. This is because, during ionization, each metabolite can form multiple ionization products (IPs) such as naturally-charged species, adducts, fragments, complexes and isotopic variants, which are detected as distinct peaks (Thurman et al. 2001; Brown et al. 2009; Draper et al. 2009). Which of these products form depends largely on the experimental setup as well as the metabolite itself. Additionally, some peaks may be the result of noise and instrument artifacts. Thus, if not accounted for appropriately, the large number of peaks will dramatically increase the likelihood of false identifications (Werner et al. 2008). This clearly over-burdens researchers in subsequent manual examination and experimental verification. A few recent studies have suggested computational solutions to address false-positive issues (Tautenhahn et al. 2007; Sana et al. 2008; Brown et al. 2009, 2011; Draper et al. 2009; Scheltema et al. 2009; Pluskal et al. 2010), but none prescribes how to exploit the rich information on IPs to infer metabolite identities more confidently. To address this limitation, we developed a fully integrated workflow for analyzing LC-MS data. Our approach improves prediction confidence, saving time, resources and manual work.

2 Methods

2.1 Overview

Our overall workflow for prediction of metabolite identity is illustrated in Fig. 1. Initially, it requires a list of detected mass peaks as input, which can be generated by any peak detection (deconvolution) program. The input peak list contains information on the m/z, retention time (RT), integrated intensity (area under the peak), signal-to-noise ratio (s/n), and run number. In the first step of m/z matching and RT alignment, those detected peaks representing the same IP among all the runs are grouped together and uniquely identified by their highly similar m/z values together with RT. Again, IPs can be attributed to various naturally-charged species, adducts, fragments, complexes, and even isotopic variants originating from the same detected metabolites. However, the ions [M+H]+ and [M-H]- (where M represents the metabolite) are assumed to be the most common IPs detected. As RTs of the same IP can drift among runs due to changes in chromatographic performance, they need to be aligned after m/z matching. By an iterative process of m/z matching and subsequent RT correction, alignment can be incrementally improved. In the second step, putative IPs from the same metabolites are clustered. If the IPs of a particular metabolite are correctly grouped, the confidence of its identity is enhanced. In the final step of mass and metabolite identity prediction, the mass relationships among IPs of the same cluster are evaluated after the computation of the monoisotopic masses (from the most abundant isotopes). The resulting mass predictions are searched against a molecular formulae database to find matches within a mass error tolerance. The final output consists of a list of mass predictions, their constituent IPs, and their putative metabolite identities based on database matches.

Fig. 1 Overall workflow for high-confidence precursor mass prediction. Step 1: raw data is processed to provide the input list of detected peaks; m/z matching is then done to group together peaks representing the same IP across different runs. Step 2: putative IPs from the same metabolites are clustered together using retention time and peak intensity information. Step 3: by examining mass relationships between IPs in a cluster, the underlying metabolite mass is determined, and then matched to databases to predict its identity

2.2 Step 1: Robust m/z matching and RT alignment across multiple runs

Instead of peak-matching by RT values (Smith et al. 2006), the first step involves stringent m/z matching of peaks across multiple chromatographic runs originating from the same IP. The highly accurate and precise m/z values that are commonly obtainable in LC-MS measurements (<3 ppm in our data) allow for very robust m/z matching across runs, which is less sensitive to analytical conditions and batch drift. In contrast, significant RT variation among runs (RT drift) can affect the quality of grouping by RT values even with intermittent corrections. As such, instead of creating slices in the m/z dimension and matching peaks with similar RTs together (Smith et al. 2006), we run a sliding window in the RT dimension first and match the m/z of peaks among runs (Fig. 2a). The window moves such that it starts at the first unmatched peak in the RT dimension, which is the target peak (TP) to be matched and grouped with others within the window. Such a grouping containing the TP forms a new peak-group. An effective m/z matching method is to look for a significant "jump" in the m/z value between adjacent peaks, well above the m/z error (ppm) calibrated during generation of the LC-MS data (Fig. 2b). Peaks before such m/z "jumps" are grouped together with the TP to form a peak-group. Although the "jump" method is a highly effective way of correctly determining the peak-group for the TP, the peak-group may contain more than one peak from the same run. This can arise from peaks with very similar m/z values that represent different isomers. In such cases, an additional K-means clustering step in both the m/z and RT dimensions is performed to separate the peaks of the run with the most peaks (Fig. 2c); the maximum number of clusters to partition such a run into (i.e. K) is its number of peaks. After clustering, runs with single peaks in the original peak-group are individually matched to the nearest of the K


Fig. 2 Example of m/z matching for three hypothetical features with very similar m/z and RT. a Graph (m/z vs. RT) showing the locations of neighboring peaks from four runs. Ungrouped peaks are partitioned according to a fixed slice width in the RT dimension. Moving across the RT axis, each slice starts from the first peak (known as the target peak) that is not yet in a peak-group. In the first iteration, starting from p1, eight peaks are incorporated into slice1 (including p2, p3 and p4). b Graph showing the peaks of slice1 along the m/z axis. The algorithm detects a large enough m/z jump (~0.04) as it scans down the m/z axis, thus it ignores those peaks beyond the jump and groups the target peak p1, along with three others, into peak-group1. c Graph showing the peaks of slice2 along the m/z axis. The target peak is p3, the next ungrouped peak with the smallest RT. Along the m/z axis, the peaks are not separated by an m/z jump, thus they are initially grouped together. However, because there are extra peaks from the same sample (e.g. p3 and p5 both from Run1), the algorithm proceeds to separate them by k-means clustering in two dimensions (m/z and RT), shown in (d). The value of k is two since there are up to two peaks of the same run. After clustering, the cluster containing the target peak (p3) forms peak-group2, while the other cluster is ignored. Returning to the full dataset, the process is repeated, this time with slice3 starting from p5

clusters. Peaks unmatched to the newly-formed peak-group (with TP) are left for subsequent sliding-window m/z matching. After m/z matching, runs are corrected for their RT deviation by curve-fitting over the full dataset using a LOESS technique similar to that employed by XCMS. Representative peak-groups for curve-fitting are selected based on sufficiently low variability of their m/z values. A few rounds of m/z matching and RT correction can be repeated with smaller RT windows to ensure good RT alignment.
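The m/z "jump" rule described in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `group_by_mz_jump`, the 10 ppm jump threshold and the example m/z values are hypothetical, and the sketch extends outward from the target peak in both m/z directions, which the text does not specify.

```python
def group_by_mz_jump(mz_values, target_index, jump_ppm=10.0):
    """Return indices of peaks grouped with the target peak.

    mz_values    : m/z of all peaks inside one RT window (hypothetical input)
    target_index : index of the target peak (TP) in mz_values
    jump_ppm     : gap between adjacent peaks (in ppm) treated as a "jump";
                   an assumed value, to be set well above the m/z error
    """
    # Sort peak indices along the m/z axis and locate the target peak.
    order = sorted(range(len(mz_values)), key=lambda i: mz_values[i])
    pos = order.index(target_index)

    def is_jump(lower, upper):
        return (upper - lower) / lower * 1e6 > jump_ppm

    # Extend the group to lower m/z until a jump is met.
    left = pos
    while left > 0 and not is_jump(mz_values[order[left - 1]],
                                   mz_values[order[left]]):
        left -= 1

    # Extend the group to higher m/z until a jump is met.
    right = pos
    while right < len(order) - 1 and not is_jump(mz_values[order[right]],
                                                 mz_values[order[right + 1]]):
        right += 1

    return sorted(order[left:right + 1])


# Six hypothetical peaks in one RT window; two features about 0.04 m/z apart.
window = [181.0707, 181.0709, 181.0708, 181.1110, 181.0706, 181.1112]
print(group_by_mz_jump(window, target_index=0))  # -> [0, 1, 2, 4]
```

A peak-group returned this way may still hold several peaks from one run, which is the case the K-means step is meant to resolve.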



The final set of aligned peak-groups can be additionally filtered using a number of criteria, including the average signal-to-noise ratio within a peak-group.

2.3 Step 2: Determining ionization product clusters

The second step aims to accurately cluster a metabolite's IPs together from the list of global peak-groups. Two key assumptions were made during the clustering process, as outlined by the example in supplementary Fig. S2. Firstly, IPs from the same metabolite go through similar chromatographic elution, thus their peaks should have the same shape and location along the RT dimension (Vaclavik et al. 2012). To evaluate the similarity of the RT profiles of candidate IPs (peak-groups) for a metabolite, the average Pearson's correlation coefficient over all run pairs of any two peak-groups with similar RT is computed. Given the similarity matrix between such peak-groups, a variant of quality threshold (QT) clustering (Heyer et al. 1999) is applied to produce clusters of similar peak-groups, which may overlap. To this end, the minimum correlation coefficient between member peak-groups is used to score each candidate IP-cluster.

Based on these preliminary IP-clusters, the next task is to refine the results. To do this, we impose a second requirement: the intensity ratio between two IPs of the same metabolite should remain stable across all runs if instrument conditions, other than metabolite concentration, are unchanged (Brown et al. 2009). Starting with the peak-groups with the highest signal-to-noise ratio, a peak-group whose intensities significantly increase the mean coefficient of variation (CV) of the intensity ratios in a nascent IP-cluster is removed; otherwise, it is added to the newly formed IP-cluster. The whole process is repeated on the removed peak-groups to generate further IP-clusters. This essentially splits the original IP-cluster into smaller clusters with relatively stable intensity ratios across runs.
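The two similarity measures used in this step, Pearson's correlation between elution profiles and the CV of the intensity ratio across runs, can be illustrated as follows. This is a minimal sketch with hypothetical inputs: the function names, the way profiles are sampled, and the example numbers are assumptions, and the QT clustering itself is not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def intensity_ratio_cv(intensities_a, intensities_b):
    """CV of the per-run intensity ratio between two peak-groups."""
    ratios = [a / b for a, b in zip(intensities_a, intensities_b)]
    return stdev(ratios) / mean(ratios)


# Hypothetical elution profiles (intensity sampled along RT) in one run:
# same shape and location, different scale, so the correlation is close to 1.
profile_a = [1.0, 4.0, 9.0, 4.0, 1.0]
profile_b = [2.0, 8.0, 18.0, 8.0, 2.0]
print(pearson(profile_a, profile_b))

# Hypothetical integrated intensities of three peak-groups across three runs.
group_a = [100.0, 200.0, 400.0]
group_b = [10.0, 20.0, 40.0]   # constant ratio to group_a: CV is 0
group_c = [10.0, 40.0, 20.0]   # unstable ratio to group_a: large CV
print(intensity_ratio_cv(group_a, group_b))
print(intensity_ratio_cv(group_a, group_c))
```

Under the second assumption of this step, `group_b` would stay in the same IP-cluster as `group_a`, whereas `group_c` would be a candidate for removal.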
2.4 Step 3: Precursor mass prediction and database matching

As a filtering step before metabolite mass prediction for each IP-cluster, isotopic member peak-groups with m/z differences close to +1 for singly-charged ions (or +0.5 for doubly-charged ions) relative to monoisotopic peak-groups are removed from the cluster. The charge inferred from this spacing also indicates which IP types are plausible. Next, we generate a list of candidate metabolite masses for an IP-cluster, based on a standard set of known IP types, and subsequently report the common candidate mass(es). To illustrate, given the m/z value of a peak-group (P), the adduct mass (A), the charge (C), and the number of metabolite molecules in the IP type (N), the metabolite mass is given by M = ((C × P) − A)/N. For example, given the IP formula [2M+H]+, the mass of the adduct is that of a proton (A = 1.007), the charge C is 1, and N is 2. If a peak-group has an m/z of P = 883.287, then the candidate mass will be M = ((1 × 883.287) − 1.007)/2 = 441.14. Valid metabolite mass predictions are given by two or more highly similar mass values, while the rest of the candidates are removed from consideration. For those peak-groups in the IP-cluster that are not associated with any valid prediction, the default IP type ([M+H]+ or [M-H]-) is used to derive their metabolite mass. Each valid IP type for an IP-cluster can be scored according to its confidence level, depending on analytical conditions. By default, an equal score of one is given to each IP type, and the scores are summed, so that the prediction score for the metabolite mass equals the total number of detected IPs. Fig. 3 illustrates a simplified example of our IP-clustering step.
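The candidate-mass formula M = ((C × P) − A)/N and the search for a common candidate can be sketched as below. This is an illustrative sketch only: the short `IP_TYPES` list, the function names and the 10 ppm agreement tolerance are assumptions, and the two example m/z values are the folic acid IPs reported in Table 1.

```python
PROTON = 1.007276       # mass of H+ in Da
SODIUM_ION = 22.989218  # mass of Na+ in Da

# Hypothetical, deliberately short list of IP types:
# name -> (adduct mass A, charge C, number of metabolite molecules N)
IP_TYPES = {
    "[M+H]+": (PROTON, 1, 1),
    "[M+Na]+": (SODIUM_ION, 1, 1),
    "[2M+H]+": (PROTON, 1, 2),
    "[M+2H]2+": (2 * PROTON, 2, 1),
}


def metabolite_mass(mz, adduct, charge, n_molecules):
    """M = ((C x P) - A) / N for a peak-group with m/z P."""
    return (charge * mz - adduct) / n_molecules


def candidate_masses(mz):
    """Candidate metabolite mass of one peak-group under every IP type."""
    return {name: metabolite_mass(mz, a, c, n)
            for name, (a, c, n) in IP_TYPES.items()}


def shared_candidates(mz_a, mz_b, tol_ppm=10.0):
    """Pairs of IP types whose candidate masses agree within tol_ppm."""
    matches = []
    for type_a, mass_a in candidate_masses(mz_a).items():
        for type_b, mass_b in candidate_masses(mz_b).items():
            if abs(mass_a - mass_b) / mass_a * 1e6 <= tol_ppm:
                matches.append((type_a, type_b, (mass_a + mass_b) / 2))
    return matches


# Worked example from the text: [2M+H]+ at m/z 883.287 with A = 1.007.
print(round(metabolite_mass(883.287, 1.007, 1, 2), 2))  # -> 441.14

# Two peak-groups of one IP-cluster (folic acid IPs from Table 1).
for type_a, type_b, mass in shared_candidates(883.2865, 464.1289):
    print(type_a, type_b, round(mass, 4))
```

With these inputs, the only pair of candidates that agrees is [2M+H]+ with [M+Na]+, at a mass of about 441.1396, so the prediction is supported by two IPs.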

3 Results and discussion

The performance of our workflow was evaluated on a UPLC-MS dataset generated from the supernatant samples of a Chinese hamster ovary cell culture (see supplementary material for details). For both positive and negative modes, a total of 119 runs from a single batch were included in our analysis. For control comparison, 18 replicate runs of the protein-free chemically-defined medium used for our cell culture experiments were obtained. Another 12 runs from pooling all cell culture samples, as well as one blank run of pure water, were also included for quality control.

3.1 Physically-meaningful peak-group identification

Two noteworthy features of our peak-grouping strategy are (1) anchoring the grouping process on the more reliable m/z values of detected peaks, and (2) K-means clustering of nascent groups with more than one peak in a single run. Matching by m/z avoids the problem of peak-grouping by raw RT values before their correction, which is likely to affect grouping quality. Subsequent RT correction requires initial grouping information and hence can only propagate any grouping errors. Our application of K-means clustering is in recognition of the possibility that K peaks in a single run may result from K originating features. (Any lesser K value requires further clustering.) After our routine procedure to minimize such cases by trial-and-error optimization of grouping parameters, we find it useful to refine their grouping via K-means clustering (albeit sub-optimally at times), as all resulting peak-groups will be checked for quality prior to the subsequent IP-clustering step. This includes tighter peak clusters in


Fig. 3 Example of predicting metabolite mass from the m/z list of a cluster of IPs. Monoisotopic peak-groups are first derived putatively from their isotopic counterparts. Their resulting m/z are used to generate mass candidates of the underlying metabolite based on a list of possible IPs (not all shown in figure). Finally, its metabolite identity is searched against databases within a mass error tolerance. In this case, three candidates match, resulting in a prediction with mass ~181.07 (grey boxes). The prediction score is calculated by summing the scores associated with the IP types of matching candidates. For the peak-group at m/z 147.04, since its candidates do not match any other, the one derived from the default [M+H]+ IP type is used to give a predicted metabolite mass of 146.03

m/z-RT space (smaller m/z error, etc.) as well as missing or extra peaks and/or good signal-to-noise ratio in replicates for some number of samples. Typically, this will still remove peak-groups with extra peaks in one run if clustering is not done. To evaluate the overall appropriateness of our grouping strategy, we grouped a positive mode dataset of 574,816 peaks (4,830 peaks per run on average) that were detected by the matched filter algorithm (Smith et al. 2006) (see supplementary material for peak detection parameters). We conducted one round of RT alignment in between two rounds of peak-grouping, before a final filtering of peak-groups (removing those which do not have 100 % presence in replicates of at least one sample, and an average signal-to-noise ratio >3 for replicates, also in at least one sample). With these criteria, we generated a total of 5,895 peak-groups with our approach, while RT-matching produced a total of 5,195 peak-groups (see supplementary data for our


workflow parameters, and supplementary materials for XCMS grouping parameters). Peak-groups from the two approaches were then compared with an m/z tolerance of ±10 parts-per-million (ppm) and an RT tolerance of ±5 s. The comparison revealed that 91 % (4,745) of the RT-matched peak-groups are similar to ours. For the 9 % (450) of RT-matched peak-groups that did not match ours, most (71 %) have weak signals, with signal-to-noise ratio <5 and a peak detection frequency of only 37 %, compared to 73 % for global peak-groups, suggesting that the critical differences concern contentious peak-groups. Remarkably, we were still able to infer robustly the identities of known metabolites in a cell culture during subsequent evaluation. As expected, RT-matched peak-groups without K-means clustering contain more peaks than those generated by our approach (86 vs. 76 on average), due to the presence of extra peaks in the same runs. For instance, there are 8 peak-groups matched only by RT that each contain 236 peaks, whereas we split them into tight groups of 118 peaks with a notable RT difference of ~4 s using our inbuilt K-means clustering step. Such an RT difference may be significant enough to indicate differences in structure (see example in supplementary Fig. S3a). In other cases, the presence of extra peaks appeared to have skewed mass predictions well above the reported m/z error of typical analytical platforms [157 ppm (supplementary Fig. S3b) vs. 10–30 ppm]. This large m/z error of peak-groups can result in unreliable identification of putative metabolites, in contrast to our approach with subsequent K-means clustering.

3.2 Robust precursor mass prediction of media metabolites

We assessed the ability of our approach to predict the 32 known components of a cell culture medium (supplementary data and supplementary Table S1) (Chong et al. 2009). To minimize instrument artifacts and other noise, peak-groups were further removed if (1) they did not contain any peak with intensity >1.5 times that of blank run peaks; (2) they did not have consistently reproducible intensities among pooled sample replicates and a CV < 0.15 for these replicate intensities; or (3) their RT was >9 min (so as to omit un-separated metabolites during chromatographic elution) (see supplementary material). These quality-enhancing steps significantly reduced the number of peak-groups, by 70 % (from 5,895 to 1,748) for the positive ion mode and 53 % (from 2,752 to 1,291) for the negative mode. The resulting peak-groups were then processed by our IP clustering and metabolite mass prediction steps. Notably, we correctly identified 28/32 (87.5 %) media metabolites (supplementary Table S1) by matching



Table 1 Overall media metabolites identified from multiple IPs without [M+H]1+

Metabolite        Mass        # IPs(a)  IP type         IP m/z     Intensity (×10^6)
L-Isoleucine      131.0946    4         [M+H-FA]1+      86.0958    3.20
                                        [2M+Fe-H]1+     317.1153   0.23
L-Leucine         131.0946    4         [M+H-FA]1+      86.0959    2.61
                                        [M+H]1+         132.1014   0.30
L-Methionine      149.0510    5         [M+H-NH3]1+     133.0315   1.27
                                        [M+H-FA]1+      104.0525   0.40
L-Phenylalanine   165.0790    10        [M+H-FA]1+      120.0801   17.83
                                        [M+H]1+         166.0858   0.57
L-Tyrosine        181.0739    8         [M+H-FA]1+      136.0751   5.87
                                        [M+H-NH3]1+     165.0541   1.40
L-Tryptophan      204.0899    12        [M+H-NH3]1+     188.0702   33.75
                                        [M+H-FA]1+      159.0912   5.16
Pantothenate      219.1107    11        [M+Na]1+        242.0998   0.52
                                        [M+H-H2O]1+     202.1073   0.36
Folic acid        441.1397    13        [2M+H]1+        883.2865   0.37
                                        [M+Na]1+        464.1289   0.27
Vitamin B12       1354.5674   8         [M+2H]2+        678.2903   1.27
                                        [M+H+Na]2+      689.2811   0.07

For each metabolite, the top two IPs with the highest intensities are shown. The intensity of each IP shown here is the maximum peak intensity within its peak-group
(a) # IPs denotes the total number of IPs (including isotopic peaks) from the same metabolite

predicted masses (to ±10 ppm mass error) to a combined database of KEGG and HMDB entries. Interestingly, the remaining four unidentified metabolites were also not found by conventionally searching for [M+H]+ and [M-H]- species. Instead, we manually found only one monoisotopic IP each for glucose ([M+Na]+) and choline ([M]+) among the global peak-groups. As such, without a second detected IP, our approach did not generate affirmative mass predictions for them. For thymidine and thiamine, which also went undetected, their respective IPs ([M+Na]+ and [M]+) were even more weakly detected. To understand the reason behind their low detection, we further compared with an in-house mass spectral library, which revealed the presence of two thiamine fragments with m/z values of 122.0708 and 144.0478 at the same RT as the [M]+ ion. (Similarly, we also found a potential glucose fragment with m/z = 85.0315 from its IP-cluster.) This suggested that fragmentation occurs commonly, which is beyond the scope of our approach.

A significant proportion of metabolites form dominant and informative non-[M+H]+ features, as 9 out of 19 media metabolites identified in the positive mode have ions other than [M+H]+ as their most abundant IPs (Table 1). In particular, three metabolites, L-Isoleucine, L-Methionine, and Vitamin B12, did not form detectable [M+H]+ ions, and would have remained unidentified without an approach similar to ours. Importantly, the most abundant IP accounts for the vast proportion of detected intensity, which is required for accurate quantification in differential profiling studies (Table 1). Hence, our robust metabolite identification approach has profound consequences for subsequent analysis of biological systems.
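The ±10 ppm tolerance used when matching predicted masses to database entries can be illustrated with a short sketch. The function names and the tiny in-memory `database` are hypothetical stand-ins for the combined KEGG and HMDB lookup; the masses are those listed in Table 1.

```python
def ppm_error(observed, theoretical):
    """Signed mass error of an observed mass in parts-per-million."""
    return (observed - theoretical) / theoretical * 1e6


def match_database(predicted_mass, database, tol_ppm=10.0):
    """Names of database entries within tol_ppm of the predicted mass."""
    return [name for name, mass in database.items()
            if abs(ppm_error(predicted_mass, mass)) <= tol_ppm]


# Hypothetical miniature database (monoisotopic masses from Table 1).
database = {
    "L-Isoleucine": 131.0946,
    "L-Leucine": 131.0946,
    "L-Tyrosine": 181.0739,
    "Folic acid": 441.1397,
}

print(match_database(441.1396, database))  # -> ['Folic acid']
print(match_database(131.0941, database))  # -> ['L-Isoleucine', 'L-Leucine']
print(match_database(181.0800, database))  # -> []
```

The second query shows why mass alone cannot separate isomers such as leucine and isoleucine; an orthogonal property such as RT is needed to tell them apart.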

3.3 Effective reduction in false predictions

From our study, we found that metabolites with multiple IPs are commonly detected, as >35 % of mass predictions in the positive mode and 20 % in the negative mode contain more than one peak-group (cluster size >1) (Fig. 4). The presence of multiple IPs for many detected metabolites, and their inference, lends higher confidence to their predicted identities. Specifically, by allowing for the formation of various IPs in the IP-clustering of step 2, we effectively minimize false-positive predictions of the underlying metabolite mass by collapsing its multiple IPs into a single prediction. Indeed, we found a 48 % (from 1,748 to 904) and 29 % (from 1,291 to 914) reduction in mass predictions for the positive and negative mode respectively. We further illustrate the phenomenon of multiple IPs causing an over-estimation of the global metabolite number in the context of known metabolites. By inspecting only the


Fig. 4 Size distributions of IP-clusters (blue columns) and metabolite mass predictions, in both the positive (a) and negative (b) ion modes. Predictions are further broken into: all predictions (red columns), those that have a database match (green columns), and those that correctly match to media metabolites (purple). Each column represents the proportion of IP-clusters or predictions containing the particular number of peak-groups. Total numbers of clusters or predictions are shown in parentheses in the legend (Color figure online)

positive mode, there were 60 peak-groups that allowed for correct identification of 19 out of 38 media components with our approach. Of the 19 correctly detected components, 16 can also be detected by simply assuming the formation of [M+H]+ ions. However, with the latter approach, an additional six metabolites were unnecessarily proposed to account for the 60 peak-groups. Consistent with the principle of Occam's razor, we believe these six additional predictions from the conventional approach are false positives. Hence, our approach prevented these false-positive matches from the original 60 peak-groups, which account for 27 % (6/[16 + 6]) of all [M+H]+-based database matches.


4 Concluding remarks

In this study, we have demonstrated a principled and systematic approach for interpreting complex, multidimensional LC-MS data. We benchmarked our approach by the robust identification of known media components in a mammalian cell culture, while generating a significantly smaller number of candidates. Thus, our workflow has demonstrated its potential to accurately provide untargeted leads from high-throughput metabolomics work for subsequent studies. Future work includes its validation on different sample types and better methods for identifying optimal workflow parameters.

Acknowledgments This study was supported by Biomedical Research Council, A*STAR (Agency for Science, Technology and Research) Singapore, National University of Singapore, Next-Generation BioGreen 21 Program (SSAC, No. PJ009520), Rural Development Administration, Republic of Korea.

References

Benton, H. P., Wong, D. M., Trauger, S. A., & Siuzdak, G. (2008). XCMS2: Processing tandem mass spectrometry data for metabolite identification and structural characterization. Analytical Chemistry, 80, 6382–6389.
Bonn, B., Leandersson, C., Fontaine, F., & Zamora, I. (2010). Enhanced metabolite identification with MS(E) and a semiautomated software for structural elucidation. Rapid Communications in Mass Spectrometry: RCM, 24, 3127–3138.
Brown, M., Dunn, W. B., Dobson, P., et al. (2009). Mass spectrometry tools and metabolite-specific databases for molecular identification in metabolomics. The Analyst, 134, 1322–1332.
Brown, M., Wedge, D. C., Goodacre, R., et al. (2011). Automated workflows for accurate mass-based putative metabolite identification in LC/MS-derived metabolomic datasets. Bioinformatics, 27, 1108–1112.
Chen, J., Zhao, X., Fritsche, J., et al. (2008). Practical approach for the identification and isomer elucidation of biomarkers detected in a metabonomic study for the discovery of individuals at risk for diabetes by integrating the chromatographic and mass spectrometric information. Analytical Chemistry, 80, 1280–1289.
Chong, W. P., Goh, L. T., Reddy, S. G., et al. (2009). Metabolomics profiling of extracellular metabolites in recombinant Chinese Hamster Ovary fed-batch culture. Rapid Communications in Mass Spectrometry: RCM, 23, 3763–3771.
Creek, D. J., Jankevics, A., Breitling, R., Watson, D. G., Barrett, M. P., & Burgess, K. E. (2011). Toward global metabolomics analysis with hydrophilic interaction liquid chromatography-mass spectrometry: Improved metabolite identification by retention time prediction. Analytical Chemistry, 83, 8703–8710.
Cui, Q., Lewis, I. A., Hegeman, A. D., et al. (2008). Metabolite identification via the Madison metabolomics consortium database. Nature Biotechnology, 26, 162–164.
Dettmer, K., Aronov, P. A., & Hammock, B. D. (2007). Mass spectrometry-based metabolomics. Mass Spectrometry Reviews, 26, 51–78.
Draper, J., Enot, D. P., Parker, D., et al. (2009). Metabolite signal identification in accurate mass metabolomics data with MZedDB, an interactive m/z annotation tool utilising predicted ionisation behaviour 'rules'. BMC Bioinformatics, 10, 227.
Dunn, W. B., Broadhurst, D., Begley, P., et al. (2011). Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nature Protocols, 6, 1060–1083.
Fiehn, O. (2002). Metabolomics–the link between genotypes and phenotypes. Plant Molecular Biology, 48, 155–171.
Heyer, L. J., Kruglyak, S., & Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 9, 1106–1115.
Horai, H., Arita, M., Kanaya, S., et al. (2010). MassBank: A public repository for sharing mass spectral data for life sciences. Journal of Mass Spectrometry: JMS, 45, 703–714.
Hu, Q., Noll, R. J., Li, H., Makarov, A., Hardman, M., & Graham Cooks, R. (2005). The Orbitrap: A new mass spectrometer. Journal of Mass Spectrometry: JMS, 40, 430–443.
Iijima, Y., Nakamura, Y., Ogata, Y., et al. (2008). Metabolite annotations based on the integration of mass spectral information. The Plant Journal: For Cell and Molecular Biology, 54, 949–962.
Jansson, J., Willing, B., Lucio, M., et al. (2009). Metabolomics reveals metabolic biomarkers of Crohn's disease. PLoS ONE, 4, e6386.
Junot, C., Madalinski, G., Tabet, J. C., & Ezan, E. (2010). Fourier transform mass spectrometry for metabolome analysis. The Analyst, 135, 2203–2219.
Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., & Tanabe, M. (2012). KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research, 40, D109–D114.
Kind, T., & Fiehn, O. (2006). Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm. BMC Bioinformatics, 7, 234.
Kind, T., & Fiehn, O. (2007). Seven golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics, 8, 105.
Lu, W., Bennett, B. D., & Rabinowitz, J. D. (2008). Analytical strategies for LC-MS-based targeted metabolomics. Journal of Chromatography B, Analytical Technologies in the Biomedical and Life Sciences, 871, 236–242.
Pluskal, T., Nakamura, T., Villar-Briones, A., & Yanagida, M. (2010). Metabolic profiling of the fission yeast S. pombe: Quantification of compounds under different temperatures and genetic perturbation. Molecular BioSystems, 6, 182–198.
Reaves, M. L., & Rabinowitz, J. D. (2011). Metabolomics in systems microbiology. Current Opinion in Biotechnology, 22, 17–25.
Rogers, S., Scheltema, R. A., Girolami, M., & Breitling, R. (2009). Probabilistic assignment of formulas to mass peaks in metabolomics experiments. Bioinformatics, 25, 512–518.
Roux, A., Lison, D., Junot, C., & Heilier, J. F. (2011). Applications of liquid chromatography coupled to mass spectrometry-based metabolomics in clinical chemistry and toxicology: A review. Clinical Biochemistry, 44, 119–135.
Saito, K., & Matsuda, F. (2010). Metabolomics for functional genomics, systems biology, and biotechnology.
Annual Review of Plant Biology, 61, 463–489. Sana, T. R., Roark, J. C., Li, X., Waddell, K., & Fischer, S. M. (2008). Molecular formula and METLIN personal metabolite database matching applied to the identification of compounds generated by LC/TOF-MS. Journal of Biomolecular Techniques: JBT, 19, 258–266. Sayers, E. W., Barrett, T., Benson, D. A., et al. (2012). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 40, D13–D25. Scalbert, A., Brennan, L., Fiehn, O., et al. (2009). Mass-spectrometrybased metabolomics: Limitations and recommendations for future progress with particular focus on nutrition research. Metabolomics: Official Journal of the Metabolomic Society, 5, 435–458. Scheltema, R., Decuypere, S., Dujardin, J., Watson, D., Jansen, R., & Breitling, R. (2009). Simple data-reduction method for highresolution LC-MS data in metabolomics. Bioanalysis, 1, 1551–1557. Selvarasu, S., Ho, Y. S., Chong, W. P., et al. (2012). Combined in silico modeling and metabolomics analysis to characterize fedbatch CHO cell culture. Biotechnology and Bioengineering, 109, 1415–1429. Smith, C. A., O’Maille, G., Want, E. J., et al. (2005). METLIN: A metabolite mass spectral database. Therapeutic Drug Monitoring, 27, 747–751. Smith, C. A., Want, E. J., O’Maille, G., Abagyan, R., & Siuzdak, G. (2006). XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry, 78, 779–787. Stoll, N., Schmidt, E., & Thurow, K. (2006). Isotope pattern evaluation for the reduction of elemental compositions assigned

123

1310 to high-resolution mass spectral data from electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry. Journal of the American Society for Mass Spectrometry, 17, 1692–1699. Sugimoto, M., Ikeda, S., Niigata, K., Tomita, M., Sato, H., & Soga, T. (2012). MMMDB: Mouse multiple tissue metabolome database. Nucleic Acids Research, 40, D809–D814. Sumner, L. W., Urbanczyk-Wochniak, E., & Broeckling, C. D. (2007). Metabolomics data analysis, visualization, and integration. Methods in Molecular Biology, 406, 409–436. Tautenhahn, R., Bo¨ttcher, C., & Neumann, S. (2007). Annotation of LC/ESI-MS Mass Signals. In S. Hochreiter & R. Wagner (Eds.), Bioinformatics research and development. Lecture notes in computer science vol. 4414 (pp. 371–380). Berlin, Heidelberg: Springer. Thurman, E. M., Ferrer, I., & Barcelo, D. (2001). Choosing between atmospheric pressure chemical ionization and electrospray ionization interfaces for the HPLC/MS analysis of pesticides. Analytical Chemistry, 73, 5441–5449. Vaclavik, L., Schreiber, A., Lacina, O., Cajka, T., & Hajslova, J. (2012). Liquid chromatography-mass spectrometry-based metabolomics

123

T. S. Lee et al. for authenticity assessment of fruit juices. Metabolomics: Official Journal of the Metabolomic Society, 8, 793–803. Wang, X., Sun, H., Zhang, A., Wang, P., & Han, Y. (2011). Ultraperformance liquid chromatography coupled to mass spectrometry as a sensitive and powerful technology for metabolomic studies. Journal of Separation Science, 34, 3451–3459. Want, E. J., Wilson, I. D., Gika, H., et al. (2010). Global metabolic profiling procedures for urine using UPLC-MS. Nature Protocols, 5, 1005–1018. Werner, E., Croixmarie, V., Umbdenstock, T., et al. (2008). Mass spectrometry-based metabolomics: Accelerating the characterization of discriminating signals by combining statistical correlations and ultrahigh resolution. Analytical Chemistry, 80, 4918–4932. Wishart, D. S., Knox, C., Guo, A. C., et al. (2009). HMDB: A knowledgebase for the human metabolome. Nucleic Acids Research, 37, D603–D610. Yin, P., Wan, D., Zhao, C., et al. (2009). A metabonomic study of hepatitis B-induced liver cirrhosis and hepatocellular carcinoma by using RP-LC and HILIC coupled with mass spectrometry. Molecular BioSystems, 5, 868–876.