Metabolomics Technologies applied to the Identification of Compounds in Plants

Sofia Moço

Supervisors
Prof. Dr. R. J. Bino, Professor of Plant Metabolomics, Wageningen University
Prof. Dr. S. C. de Vries, Professor of Biochemistry, Wageningen University

Co-supervisors
Dr.ir. J.J.M. Vervoort, Associate Professor, Laboratory of Biochemistry, Wageningen University
Dr. R. C. H. de Vos, Senior Researcher, Plant Research International, Wageningen

Doctoral committee
Prof. Dr. Ton Bisseling, Wageningen University, the Netherlands
Dr. Robert D. Hall, Plant Research International, Wageningen, the Netherlands
Dr. Joachim Kopka, Max Planck Institute for Molecular Plant Physiology, Potsdam, Germany
Dr. John P. M. van Duynhoven, Unilever, Vlaardingen, the Netherlands

This research was conducted within the research school Voeding, Levensmiddelentechnologie, Agrobiotechnologie en Gezondheid (Nutrition, Food Technology, Agrobiotechnology and Health).

Metabolomics Technologies applied to the Identification of Compounds in Plants A liquid chromatography-mass spectrometry / nuclear magnetic resonance perspective over the tomato fruit Sofia Isabel Abraúl Viana Moço

Thesis submitted in fulfilment of the requirements for the degree of doctor, on the authority of the Rector Magnificus of Wageningen University, Prof. Dr. M. J. Kropff, to be defended in public on Monday 15 October 2007 at four o'clock in the afternoon in the Aula

Metabolomics Technologies applied to the Identification of Compounds in Plants Sofia Isabel Abraúl Viana Moço 2007 PhD thesis, Wageningen University, The Netherlands with references - with summaries in English and Dutch ISBN 978-90-8504-742-1

Contents

Preface

CHAPTER 1: Metabolomics technologies and metabolomics identification

CHAPTER 2: Untargeted large scale plant metabolomics using liquid chromatography coupled to mass spectrometry

CHAPTER 3: A liquid chromatography mass spectrometry based metabolome database for tomato

CHAPTER 4: Tissue specialization at the metabolite level is perceived during the development of tomato fruit

CHAPTER 5: Building up a comprehensive database of flavonoids based on nuclear magnetic resonance data

CHAPTER 6: Push-button flavonoid identification: a NMR database integrated with a 1H NMR predictive model

CHAPTER 7: Metabolite correlations in tomato obtained by fusion of liquid chromatography-mass spectrometry and nuclear magnetic resonance data

Summarizing discussion and conclusions

Samenvatting
References
Curriculum vitae
List of publications
Acknowledgements / Agradecimentos
Training and supervision plan

Preface

A new era of plant biochemistry at the systems level is emerging, in which the detailed description of biochemical phenomena at the cellular level is important for a better understanding of physiological, developmental and biomolecular processes in plants. This emerging field is oriented towards the characterisation of small molecules (metabolites) that act as substrates, products, ligands or signalling entities in cells. This thesis concerns the development and establishment of metabolomics strategies for screening and identifying metabolites in biological systems. Most technological strategies were applied to the assignment of metabolites from tomato (Solanum lycopersicum) fruit. Tomato was chosen because it is a widely consumed crop with nutritional attributes and a model for the Solanaceae family. In order to achieve both high coverage of detected metabolites and valuable information for identification purposes, liquid chromatography coupled to mass spectrometry (LC-MS) and nuclear magnetic resonance (NMR) technologies were used. In addition, metabolite databases based on experimental data (mass-based in the case of LC-MS and chemical shift-based in the case of NMR) were initiated in order to systematize the extensive metabolite information.

The chapters in this thesis describe method developments and their applications in plant metabolomics, which can also be implemented in other biological systems. A review of the technologies used for metabolomics, with a perspective on compound identification, is presented in Chapter 1. In Chapter 2, a robust large scale LC-MS method for the analysis of metabolites in plants is described in detail. It presents a step-by-step protocol with thorough information about the reagents used, sample preparation, instrument setup, methods of analysis and data processing strategies. The described analytical method combines LC with photo diode array (PDA) and MS detection, and allows the analysis of mostly semi-polar secondary metabolites present in plants, such as phenolic acids, flavonoids, glucosinolates, saponins, alkaloids and derivatives thereof. Chapter 3 presents an application of the LC-PDA-MS method for the profiling of metabolites present in tomato fruit. The metabolites putatively identified in this fruit were included in a tomato-dedicated database (the MoTo DB) that is available for public search on the web (see: http://appliedbioinformatics.wur.nl). A comparison between two tomato fruit tissues, peel and flesh, for their metabolite content was made using this MoTo DB. Using the same LC-PDA-MS setup, several different tomato fruit tissues were compared in more detail, along the fruit ripening timeline, in Chapter 4. The presence of tissue-specific metabolites at determined ripening stages suggests developmental control of metabolite biosynthesis. Such a tissue-specific metabolomics approach may give rise to a biological view of metabolite compartmentalisation.

Chapters 5 and 6 describe the implementation of an NMR database for secondary metabolites, mostly flavonoids: the Flavonoid Database (see: Flavonoid Database under http://www.wnmrc.nl). The acquisition of a large data set of related standard compounds allowed the analysis of shifts in NMR characteristics caused by the presence of certain functional groups or substituents in the flavonoid backbone. In addition, a 1H NMR-based prediction model was iteratively trained on the acquired experimental data and can be used for the prediction of unknown related molecules. This approach greatly increases the efficiency of the identification of (flavonoid) metabolites. Chapter 7 describes correlations of metabolomics data derived from LC-MS and NMR analyses of a large number of different tomato cultivars. Metabolites were identified using, among other available sources, the MoTo DB and the Flavonoid Database. This approach illustrates the complementarity and overlap of NMR and MS as analytical techniques applied to the detection of metabolites in tomato fruit. The summarizing discussion and conclusions set the work presented in this thesis into a biochemical perspective and offer suggestions for the future.



Chapter 1

Metabolomics Technologies and Metabolite Identification

Sofia Moco, Raoul J. Bino, Ric C.H. De Vos, Jacques Vervoort

Trends in Analytical Chemistry (2007) in press

Metabolomics studies rely on the analysis of the multitude of small molecules (metabolites) present in a biological system. Most commonly, metabolomics is heavily supported by mass spectrometry (MS) and nuclear magnetic resonance (NMR) as parallel techniques. These two technologies provide an overview of the metabolome and have a high compound-elucidation power. Beyond the capacity for large scale analysis, a major effort should be directed at the unequivocal identification of metabolites. The combination of liquid chromatography (LC)-MS and NMR is a powerful methodology to achieve metabolite identification. A better chemical characterization of the metabolome will undoubtedly enlarge the knowledge of any biological system.


INTRODUCTION

Biological systems are under constant challenges from the environment. Adaptation to environmental stimuli is reflected in alterations in the genome, the transcription of genes, the expression level and post-translational modifications of proteins, and in the primary and secondary metabolism. The phenotype of the organism is the product of its genotype within its environment. The metabolic composition is reflected in the phenotype, and hence a detailed analysis of the metabolome is a representation of the phenotype under study. Phenotypic changes are therefore most adequately monitored by reliable metabolomics studies. Metabolomics stands out from any other organic compound analysis in scale and in chemical diversity, i.e., the aim is to describe all metabolites, both primary and secondary, present in an organism or biological system. Perhaps the most striking feature of metabolomics lies in its integrative capacity, as part of the “omics” disciplines, which has resulted in a shift from mainly pure (organic) chemistry-based characterization (as in phytochemistry) into a biochemical context. In plants, the characterization of endogenous primary and secondary metabolites is of interest for the quality and improvement of crops, as well as for the study of, e.g., physiology, ecology and developmental phenomena in plant biochemistry. Metabolomics can thus provide valuable tools, relevant in a wide range of applications (Table 1.1), including the perception of cellular phenomena through systems biology approaches (Bino et al., 2004; Hall, 2006).

Table 1.1. Fields of metabolomics applications.

Area of application
Plant breeding and crop quality assessment
Food assessment and safety
Toxicity assessment
Nutrition assessment
Medical diagnosis and assessment of disease status
Pharmaceutical/drug developments
Yield improvement in crops and fermentations
Biomarker discovery
Technological advances in analytical chemistry
Genotyping
Environmental adaptations
Gene function elucidation
Integrated in systems biology

Systems biology is considered to be the latest strategy to describe cellular mechanisms on a global scale, making use of transcriptomic, proteomic and metabolomic information. This discipline is expected to provide a better understanding of cell biology by enabling the study of the function and behaviour of molecular interactions in complex networks (Galperin and Ellison, 2006). The ability to address biological questions using a systems biology approach depends on the information that can be obtained from the system under study. However, the quality of the conclusions extracted from such a study also depends on the information that is fed into the system. Hence, a better understanding of a given biological system by a metabolomics approach relies on the number of participating metabolites whose identity is known.

[Figure 1.1: schematic of the metabolome, with the fractions covered by LC-MS, PDA, NMR, CE-MS and GC-MS.]

Figure 1.1. Heuristic representation of the metabolome, indicating that only a small fraction of metabolites has been identified up to now, the majority of naturally occurring metabolites being still unknown. The analytical techniques most commonly used for metabolite identification are LC-MS, NMR, GC-MS, CE-MS and PDA.

In metabolomics, the need to enlarge the list of identified metabolites is increasingly becoming a main constraint (Fig. 1.1). The extensive data sets nowadays obtained from analytical platforms, such as the most commonly used MS- and NMR-based systems, create a gap between “signal x (or at most, detected metabolite X)” and “metabolite with IUPAC name 2-(3,4-dihydroxyphenyl)-5,7-dihydroxy-3-[(2S,3R,4S,5R,6R)-3,4,5-trihydroxy-6-[[(2R,3R,4R,5S,6S)-3,4,5-trihydroxy-6-methyloxan-2-yl]oxymethyl]oxan-2-yl]oxy-chromen-4-one, also commonly known as rutin with CAS registry number 153-18-4, described using a unique InChI identifier” (Fig. 1.2). The construction of (experimental) spectrometric and spectroscopic-based metabolite databases and the accessibility of searchable chemical databases are some of the initiatives that can help to narrow this gap. This challenge resides not only in obtaining high quality data suitable for identification from the available analytical technologies, but also in the integration and development of bio-computational tools for automation of the data analysis. The identification of metabolites is a necessity for understanding the molecular nature of the biochemical processes in which they participate, as substrates or products, in reactions at the (sub)cellular level. The depth of the characterization of compounds can extend to their stereochemistry, which accounts for their three-dimensional (3D) structure. In general, chiral structures profoundly influence chemical and biological mechanisms, being essential for structure-activity relationships in catalysis, drug development applications and medicinal chemistry.

[Figure 1.2: schematic of the information contributed by each resource. LC: composition of extract, hydrophobicity, retention time. MS: accurate mass/signal, elemental composition (CaHbNcOdPeSf), isotopic pattern. MS/MS: fragmentation pattern, structural information. UV: spectra (200-600 nm)/signal, chromophores (λmax). NMR: chemical shifts, coupling constants. Further resources: standard compounds, literature, (bio)chemical databases.]

Figure 1.2. Pieces of information given by analytical technologies and knowledge resources that lead to the identification of a metabolite, here exemplified for the metabolite rutin (IUPAC name: 2-(3,4-dihydroxyphenyl)-5,7-dihydroxy-3-[(2S,3R,4S,5R,6R)-3,4,5-trihydroxy-6-[[(2R,3R,4R,5S,6S)-3,4,5-trihydroxy-6-methyl-oxan-2-yl]oxymethyl]oxan-2-yl]oxy-chromen-4-one; CAS registry number 153-18-4; InChI identifier: InChI=1/C27H30O16/c1-8-17(32)20(35)22(37)26(40-8)39-7-15-18(33)21(36)23(38)27(42-15)43-25-19(34)16-13(31)5-10(28)6-14(16)41-24(25)9-2-3-11(29)12(30)4-9/h2-6,8,15,17-18,20-23,26-33,35-38H,7H2,1H3/t8-,15+,17-,18+,20+,21-,22+,23+,26+,27-/m0/s1): liquid chromatography (LC), mass spectrometry (MS), fragmentation pattern analysis (MS/MS), ultraviolet/visible range spectroscopy (UV/Vis) obtained by the photo diode array (PDA), nuclear magnetic resonance (NMR), experimental validation by standard compounds and information present in literature and databases.

The technologies used in metabolomics allow the characterization of molecules by providing pieces of information that can lead to their annotation and ultimately to their identification. In this study, we pinpoint several major considerations to be taken into account in any metabolomics approach: sample preparation, analytical technique, data analyses, identification tools and databases, and finally hypothesis testing and conclusions. Special attention is given to the identification of metabolites in plants by means of LC-MS and NMR strategies.

SAMPLE PREPARATION

Sample preparation is perhaps the most underestimated part of metabolomics analyses. In any biological system, metabolites of a wide chemical diversity are present in a dynamic range of concentrations that can exceed 10^6 (e.g. the ratio in concentrations between sucrose and brassinolide in Arabidopsis). In plants, a major part of the large diversity in the metabolome is due to the presence of a wide range of secondary metabolites, whose number generally far exceeds that of the primary metabolites. The composition and quantity of detected metabolites depend to a large extent on the sample preparation chosen. The estimated hundreds of thousands of metabolites that can exist in the plant kingdom show an impressive chemical variation. This large chemical variation exists not only between different plant species but also between different tissues of a single plant. According to Krishnan et al. (2005), a typical cell may contain 5,000 metabolites (expectedly in diverse concentrations and with diverse chemical properties), which challenges the ability of a sample preparation method to capture as many of these metabolites as possible. The extent of the detected metabolome is therefore dependent on the contents of the (prepared) biological sample. The more steps introduced in the sample preparation, such as sequential extractions and concentrations (for example to favour a particular class of compounds), the narrower the chemical diversity of compounds finally present in the extract. One should be aware that the further into the analysis pipeline, the narrower the overall knowledge of the metabolites present in the sample becomes. On the other hand, the knowledge of the (narrow) set of metabolites that endures the complete analytical pipeline, i.e. from sample preparation to identified metabolite, is progressively richer (Fig. 1.3).
In order to obtain reproducible measurements, the biological material should be kept as homogeneous as possible in terms of environmental conditions (light, temperature, humidity, nutrients, time of sampling, etc.), ideally leaving biological variation as the only remaining source of variation. For metabolomics applications, a fast, reproducible and unselective extraction method is preferred for detection of a wide range of metabolites that occur in the plant, avoiding unforeseen chemical modifications. There are various methodologies for extracting compounds from biological materials: liquid extraction (temperature- or pressure-assisted), solid phase extraction (SPE), solid phase microextraction, and microwave-assisted extraction. In general, metabolites of interest are extracted by liquid extraction with one solvent, aqueous or organic, or with a combination of solvents (liquid-liquid extraction), implying that the type of metabolites extracted is dependent on the chemical properties of the solvent used. For a certain class of metabolites, a particular solvent can be more adequate, yet not unique, for its extraction. For plants, semi-polar compounds such as phenolic acids, flavonoids, alkaloids and glycosylated sterols are successfully extracted in solutions of methanol/water, while the apolar carotenoids are better extracted in chloroform. The choice of solvent should also be compatible with the analytical instruments used. For reversed phase LC-MS analyses, solvents such as ethyl acetate or chloroform are not advisable, as these do not dissolve in the mobile phase used for the chromatography, nor do they produce an efficient spray in the case of direct flow injection analysis. On the other hand, in NMR analyses any solvent can be used, preferably deuterated in the case of 1H NMR measurements. More important than the choice of sample preparation protocol are the reproducibility of the extraction and the ability to distinguish naturally occurring compounds.

[Figure 1.3: flow diagram of the metabolomics pipeline: biological system, biological tissue, biological extract (thousands of metabolites), separation (CE, LC, GC, ...) and detection (PDA, MS, NMR, ...), chromatogram + spectrum, list of peaks, visualization, list of differential metabolites, list of putative metabolites, identified metabolites. Resources feeding the pipeline: spectral DBs (AMDIS/NIST, ACD Labs, PERCH, AMIX), chemical DBs (PubChem, SciFinder, Beilstein, Merck Index), species DBs (DNP, HMDB, KNApSAcK) and publications. Related fields: genomics, transcriptomics, proteomics, interactomics, fluxomics, metabolic pathways, metabolic networks, systems biology.]

Figure 1.3. Metabolomics pipeline towards a systems biology approach: from the whole metabolome to identified metabolites. The large number of metabolites present in a biological system (e.g. a plant) undergoes a strategy of combined experimental design and data interpretation to achieve the identification of only a few metabolites naturally occurring in the system. This procedure includes: sampling, sample preparation and extraction, analysis by typically CE-, LC-, GC-MS or NMR, interpretation of the chromatograms and spectra obtained, a list of statistically relevant candidate peaks, visualisation of the multivariate data, extraction of differential peaks, construction of a list of putative metabolites, and identification of a metabolite. Along this procedure, resources such as species databases, literature, spectral databases and chemical databases can be fed into the analytical pipeline, narrowing the number of ambiguities for candidate metabolites. Metabolite information can aid the interpretation of metabolic pathways and networks in combination with dynamic and transient measurements made by flux analyses. The integration of genomics, transcriptomics, proteomics and interactomics with metabolomics contributes to a systems biology overview of the system (for abbreviations see Table 1.2).


ANALYTICAL TECHNOLOGIES

LC-MS

MS is a spectrometric method that detects ions according to their mass-to-charge ratio, which points to the molecular mass (MM) of the detected metabolite. As MS is a developing technology in metabolomics applications, there are various configurations of mass spectrometers in terms of ion acceleration and mass detection, ion production interfaces and ion fragmentation capabilities (Fig. 1.4). Moreover, the hardware and software of mass spectrometers have been adjusted constantly over the years to improve the robustness, practicality, applicability and efficiency of the analyses.

[Figure 1.4: scheme from analyte A to ion A in three stages. Ion production: EI, ESI, APCI, MALDI, DESI, APPI. Ion fragmentation: CID, SID, IRMPD, ECD. Ion acceleration and detection: Q-MS, TripleQ-MS, Q-Ion trap-MS, TOF-MS, Q-TOF-MS, FT-ICR-MS, FT-Orbitrap-MS.]

Figure 1.4. Configuration possibilities of mass spectrometers. There are different configurations of mass spectrometers, according to the ion acceleration and detection: quadrupole-MS (Q-MS), triple quadrupole-MS (tripleQ-MS), quadrupole-ion trap-MS (Q-ion trap-MS), time-of-flight-MS (TOF-MS), Fourier transform-ion cyclotron resonance-MS (FT-ICR-MS) and FT-Orbitrap-MS. There are different interfaces for the production of ions: electron impact (EI), electrospray (ESI), atmospheric pressure chemical ionisation (APCI), matrix assisted laser desorption ionisation (MALDI), desorption electrospray ionisation (DESI) and atmospheric pressure photoionisation (APPI). In terms of ion fragmentation techniques, collision-induced dissociation (CID) is the most conventional method. Other fragmentation techniques include surface-induced dissociation (SID), infrared multiphoton dissociation (IRMPD) and electron-capture dissociation (ECD), especially for the fragmentation of multiple-charged polypeptides.

The performance of soft-ionisation mass spectrometers, as used in LC-MS applications, can be described (and compared) by means of several intrinsic parameters (Fig. 1.5): the mass resolving power (or resolution), the mass accuracy, the linear dynamic range and the sensitivity (McLuckey and Wells, 2001). The improvement of these parameters enables a more effective identification of the MM of the analyte injected into the MS instrument. In general, the widely used quadrupole (Q)-MS instruments have a mass resolving power that is four times lower than that of a time-of-flight (TOF)-MS, while a Fourier transform (FT)-ion cyclotron resonance (ICR)-MS can reach a resolving power higher than 1,000,000 (or 400 times higher than a Q-MS) (Balogh, 2004). A higher mass accuracy facilitates a finer distinction between closely related mass-to-charge signals. Consequently, the quality and the quantity of the assignments of mass signals to metabolites can be much improved by using high and ultra-high resolution accurate mass spectrometers. Hybrid TOF-MS instruments such as QTOF-MS instruments are widely used in metabolomics due to their high sensitivity, mass resolving power (about 10,000), mass accuracy and semi-automated instrument control. However, in terms of linear dynamic range, (Q)TOF-MS instruments are limited by the properties of the time-to-digital converter detector, which is only able to record one ion per dead time. Intense mass signals become saturated, masking their real intensity and leading to distortions of the mass peak shape that produce deviations in the mass accuracy, typically towards lower mass-to-charge values (Chernushevich et al., 2001; Verhoeven et al., 2006). Recently, some improvements have been implemented in (Q)TOF-MS instruments, extending their dynamic range. The use of an online lock mass spray, acting as an internal standard, can help to correct for deviations in the mass-to-charge axis, and can dictate the ion intensity interval for which the mass accuracy is highest and adequate for elemental composition calculation (Moco et al., 2006a).
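The lock mass correction mentioned above can be illustrated with a minimal sketch. It assumes the simplest possible scheme, a single-point multiplicative rescaling of the m/z axis so that the lock mass ion falls on its theoretical value; instrument software may apply more elaborate corrections. The function name and all numerical values are hypothetical.

```python
def lock_mass_correct(mz_values, lock_measured, lock_theoretical):
    """Rescale m/z values so that the lock mass ion lands on its theoretical m/z.

    Assumption: the deviation of the m/z axis is proportional to m/z, so one
    multiplicative factor derived from the lock mass corrects every signal.
    """
    factor = lock_theoretical / lock_measured
    return [mz * factor for mz in mz_values]


if __name__ == "__main__":
    # Hypothetical scan in which the lock mass ion reads 5 ppm too high.
    measured = [500.0025, 609.1491]
    corrected = lock_mass_correct(measured, lock_measured=500.0025,
                                  lock_theoretical=500.0000)
    for before, after in zip(measured, corrected):
        print(f"{before:.4f} -> {after:.4f}")
```

A correction of this kind is only meaningful for signals within the intensity interval in which the detector is not saturated, which is the point made in the text about the ion intensity interval.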

[Figure 1.5: mass signal plotted as % relative abundance against m/z, annotated with the peak width Δm at 50% of the maximum (FWHM) and with the definitions of mass resolving power (R = m/Δm), mass accuracy, sensitivity (SNR = intensity of m / noise intensity) and linear dynamic range (range over which the ion signal is linear with the analyte concentration).]

Figure 1.5. Parameters used to describe the performance of mass spectrometers. The mass resolving power (or resolution), m/Δm_x, can be described in two ways: i. m being the averaged mass-to-charge ratio associated with two adjacent mass signals of equal size and shape that overlap by x% (50% is commonly used nowadays) and Δm_x being the difference in mass-to-charge between the two adjacent mass signals, or ii. m being the mass at the apex of the mass signal and Δm_x being the width at x% height (typically 50%) of this mass signal, designated FWHM (full width at half maximum). The mass accuracy is described by the ratio between the mass error (difference between measured and real mass) and the theoretical mass, often represented as parts per million (ppm). The sensitivity is described by the ratio between the intensity level of the mass signal and the intensity level of the noise. The linear dynamic range is described as the range of linearity of the ion signal measured as a function of the analyte concentration.
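As a worked illustration of the definitions in Figure 1.5, the following sketch computes the mass accuracy in ppm, the resolving power from the FWHM, and the signal-to-noise ratio. The function names and the example numbers are illustrative assumptions, not values taken from a measurement.

```python
def mass_error_ppm(measured_mz, theoretical_mz):
    """Mass accuracy: mass error relative to the theoretical mass, in ppm."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def resolving_power(mz, fwhm):
    """Mass resolving power m/Δm, with Δm taken as the peak width at half height."""
    return mz / fwhm


def signal_to_noise(signal_intensity, noise_intensity):
    """Sensitivity expressed as the ratio of signal intensity to noise intensity."""
    return signal_intensity / noise_intensity


if __name__ == "__main__":
    # Hypothetical peak: theoretical m/z 1000.000, measured at 1000.002,
    # 0.02 wide at half height, intensity 5000 over a noise level of 25.
    print(f"mass error: {mass_error_ppm(1000.002, 1000.000):.2f} ppm")
    print(f"resolving power: {resolving_power(1000.0, 0.02):.0f}")
    print(f"signal-to-noise: {signal_to_noise(5000, 25):.0f}")
```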


FT-MS instruments, both the cyclotron (FT-ICR-MS) and the Orbitrap type (FT-Orbitrap-MS), enable measurements at a higher mass accuracy over a wider dynamic range. The FT-ICR-MS has the highest mass resolving power so far reported for any mass spectrometer (>1,000,000) and a mass accuracy generally within 1 ppm (Brown et al., 2005). The recently developed FT-Orbitrap-MS has a more modest performance compared to the FT-ICR-MS (maximum resolving power > 100,000 and 2 ppm of mass accuracy with internal standard), but is a high speed and high ion transmission instrument due to shorter accumulation times. This is a very advantageous characteristic, especially when hyphenated to a separation technique such as LC, and also when carrying out MS/MS experiments (Makarov et al., 2006). The appearance of instruments with high mass accuracy over a wide dynamic range can immensely improve the identification capabilities of online methods applied to complex mixtures. The mass detection of a molecule in soft-ionisation MS depends on the capacity of the analyte to ionise while being part of a complex mixture. Because only ions, either anions or cations, can be measured by MS, metabolites unable to ionise cannot be detected. Apart from the chemical properties of the molecule itself, the eluent flow and composition, the sample matrix and the ionisation source all influence the ionisation. Ion suppression and matrix effects can become a main issue, in particular in semi-quantitative measurements. The use of ionisation enhancers, sample cleanup methods and different ionisation sources are some of the possibilities that can improve the ionisation of the analytes under study (Mallet et al., 2004). Most MS applications in metabolomics make use of a separation method before mass detection, typically LC, gas chromatography (GC) or capillary electrophoresis (CE).
Such a separation step introduces an extra dimension for identification (retention time) to the data, and reduces the complexity of the data analysis by avoiding ion suppression at the source (especially relevant when soft ionisation is used). A separation method, however, diminishes the throughput of analyses compared to a direct flow injection method. Different classes of compounds can be measured according to the separation technique used. LC is probably the most versatile separation method, as it allows the separation of compounds of a wide range of polarity. Using reversed-phase columns, semi-polar compounds (phenolic acids, flavonoids, glycosylated steroids, alkaloids and other glycosylated species) can be separated, and by using hydrophilic columns, polar compounds can also be measured by LC-MS (sugars, amino sugars, amino acids, vitamins, carboxylic acids and nucleotides) (Tolstikov and Fiehn, 2002). The appearance of ultra performance liquid chromatography (UPLC) can improve the speed of analysis but, more importantly, provides a better chromatographic resolution. The hyphenation of UPLC to MS can be advantageous for a better assignment of metabolites from chromatographic mass signals. Regardless of the configuration of the (UP)LC-MS system, the robustness and reproducibility of the analyses (in retention time and mass accuracy), as well as efficient ionisation, are essential for obtaining consistent data (De Vos et al., 2007). The chromatographic parameters (temperature, pH, column, flow rate, eluents, gradient), injection parameters, sample properties, MS and MS/MS parameters (calibration, instrumental parameters: capillary voltage, orientation of lens, etc.) and all other parameters related to the configuration of the LC-MS system (presence of other detectors such as photo diode array (PDA), tube widths, etc.) may all influence the performance of the metabolomics analyses. An adequate configuration should be adopted, fitting the aim of the analyses and the limitations of the instruments.

Using LC-MS in the identification of metabolites

Metabolite assignments using LC-MS as a tool for compound identification are usually obtained by combining accurate mass, isotopic distribution, fragmentation patterns and any other mass spectrometric information available. The calculation of the chemical combinations that fit a certain accurate mass is generally one of the first steps to obtain a set of alternatives that can lead to the identity of the metabolite detected. This set of alternatives becomes less extensive if the mass spectrometer can provide a more accurate MM value (Kind and Fiehn, 2006). Using an instrument that can provide very high mass accuracies, the range of possible molecular formulae (MFs) is limited and can, especially for lower m/z values, lead to the correct MF. The number of possible MFs increases with increasing MM values. Furthermore, in most cases a pre-selection of chemical elements can be made, avoiding the generation of excessive false alternatives that would result from including all elements of the periodic table. For general applications in plant or animal metabolomics, most metals can be excluded (except perhaps for Na or K, which are common adducts in mass spectra), the core elements being C, H, O, N, P and S. Logically, any other element for which there is the slightest evidence of being present in the analysed sample should be included in the elemental composition calculation. Another point to take into account when MFs are calculated from MMs is the algorithm used for the calculation. There are more mathematically possible combinations of elements that fit a certain MM than there are chemically possible MFs. This is related to chemical rules, such as the octet rule, that dictate certain limitations on chemical bonding derived from the electronic distribution of the participating atoms present in molecules. The widely applied nitrogen rule is used for the assessment of the presence or absence of N-atoms in a molecule or ion. Another useful item is the number of rings and double bonds. As described by Bristow (2006), the number of rings and double bonds can be calculated from the number of C, H and N atoms that a molecule contains (assuming a C, H, N and O containing molecule). One of the most powerful methods for narrowing the number of MFs is to make use of the isotopic pattern of a mass signal. For most small organic molecules M, the intensity of the second isotopic signal, corresponding to the 13C signal, can reveal the number of carbons that the molecular ion contains, knowing that the natural abundance of 13C is 1.11%. This is therefore of major assistance in the assignment of MFs from MMs. According to Kind and Fiehn (2006), this strategy can remove more than 95% of the false positives and can even outperform an analysis of solely accurate mass using a (not yet existing) mass spectrometer capable of 0.1 ppm mass accuracy. With the appearance of large dynamic range MS instruments (with good isotopic intensity measurements), this is certainly an efficient strategy when combined with the MS spectra analysis tools described below. The fragmentation pattern of a mass signal can provide structural information about the fragmented ion. From the fragments obtained, the structure of the molecule can be deduced, knowing that the breakages will occur at the weakest points of the ion. For example, an O-glycosylated flavonoid will first fragment at the glycosidic linkage and only afterwards in the aglycone backbone, if sufficient energy is provided.
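The combination of accurate mass, the rings-and-double-bonds check and the 13C isotope signal can be sketched as follows. This is a minimal brute-force illustration restricted to C, H, N and O, not the algorithm of any particular software package; the function names, element limits, tolerances and the example intensity ratio are assumptions. The carbon estimate ignores the small contributions of 2H, 15N and 17O to the second isotopic signal.

```python
from itertools import product

# Monoisotopic masses (u) of the most abundant isotopes.
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
C13_ABUNDANCE = 0.0111  # natural abundance of 13C quoted in the text


def rdbe(c, h, n):
    """Rings plus double bonds of a neutral C/H/N/O molecule."""
    return c - h / 2 + n / 2 + 1


def candidate_formulae(mass, ppm=5.0, max_c=50, max_n=5, max_o=25):
    """Enumerate (C, H, N, O) counts whose monoisotopic mass is within `ppm`."""
    tol = mass * ppm * 1e-6
    hits = []
    for c, n, o in product(range(1, max_c + 1), range(max_n + 1), range(max_o + 1)):
        rest = mass - (c * MASS["C"] + n * MASS["N"] + o * MASS["O"])
        if rest < -tol:
            continue  # heavy atoms alone already exceed the target mass
        h = round(rest / MASS["H"])  # only one H count can fit within tol
        if abs(rest - h * MASS["H"]) > tol:
            continue
        r = rdbe(c, h, n)
        # A neutral molecule needs a non-negative, integer RDBE; for C/H/N/O
        # the integer condition is equivalent to the nitrogen rule.
        if r < 0 or r != int(r):
            continue
        hits.append((c, h, n, o))
    return hits


def estimate_carbons(m1_over_m0):
    """Carbon count implied by the intensity ratio of the 13C signal to M."""
    return m1_over_m0 * (1 - C13_ABUNDANCE) / C13_ABUNDANCE


def filter_by_carbons(formulae, m1_over_m0, slack=2):
    """Keep formulae whose carbon count lies within `slack` of the estimate."""
    n_est = estimate_carbons(m1_over_m0)
    return [f for f in formulae if abs(f[0] - n_est) <= slack]


if __name__ == "__main__":
    # Neutral monoisotopic mass of rutin (C27H30O16) and an assumed
    # intensity ratio of about 0.30 between the 13C signal and M.
    hits = candidate_formulae(610.1534)
    kept = filter_by_carbons(hits, 0.303)
    print(len(hits), "candidates by accurate mass,", len(kept), "after the 13C filter")
```

The isotope filter is applied after the mass filter because it needs no further enumeration: it simply discards candidates whose carbon count disagrees with the measured intensity ratio.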
The possibility of isolating one ion and performing tandem MS on the successively obtained fragments can be highly informative for tracking functional groups and the connectivity of fragments in the structure elucidation of metabolites. In addition, obtaining accurate masses for the fragments is an advantage when there is little knowledge about the possible atomic arrangements of the molecular ion. Moreover, there is a series of MS experimental procedures that can enhance our knowledge about the metabolites of interest. These include comparison of analyses obtained in positive and negative ion modes (either by online switching or offline) and neutral mass loss experiments, which can aid the identification of certain functional groups or substituents, such as hydroxyls, carbonyls or glycosides (Fabre et al., 2001). The use of 13C-labelled material as internal standard is also an elegant method of obtaining metabolite information (Mashego et al., 2004).
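As an illustration of how neutral losses point to substituents, the sketch below compares precursor-to-fragment mass differences with a small table of common losses. The table is an illustrative subset, the tolerance is an arbitrary assumption, and the function name is hypothetical.

```python
# Monoisotopic masses (u) of a few common neutral losses; illustrative subset.
NEUTRAL_LOSSES = {
    "H2O": 18.010565,
    "CO": 27.994915,
    "CO2": 43.989829,
    "deoxyhexose (C6H10O4)": 146.057909,
    "hexose (C6H10O5)": 162.052824,
}


def annotate_losses(precursor_mz, fragment_mzs, tol=0.005):
    """Label fragments whose distance to the precursor matches a known loss."""
    found = []
    for frag in fragment_mzs:
        delta = precursor_mz - frag
        for name, mass in NEUTRAL_LOSSES.items():
            if abs(delta - mass) <= tol:
                found.append((frag, name))
    return found
```

With a precursor at m/z 463.0882 and a fragment at m/z 301.0354 (values calculated here for a deprotonated flavonol hexoside and its aglycone, not measured data), the difference of 162.0528 is labelled as a hexose loss, consistent with cleavage at the glycosidic linkage.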

Additionally, when a separation method is coupled to the mass spectrometer, the retention time is a parameter that can give information about the polarity of the metabolite. Nowadays, in stabilized (LC or GC)-MS setups, the retention time variation can be relatively low, allowing direct comparison of chromatograms and the construction of metabolite databases (Lisec et al., 2006; Moco et al., 2006a; De Vos et al., 2007). Data obtained from additional detectors can also be a complementary source of structural information on a metabolite. Typically, for a well-separated chromatographic signal with sufficient intensity, a full absorbance spectrum can be obtained in the ultraviolet/visible (UV/Vis) range using a PDA detector. For many secondary metabolites, the light-absorbance spectrum can indicate at least the class of compounds to which they belong, as the type of chromophore can be inferred from the absorbance maxima and the shape of the spectrum. Absorbance maxima can undergo slight shifts with the introduction of conjugations in the polyaromatic system. Possibly the most straightforward approach for confirming the identity of metabolites in a biological sample is to test commercially available standard compounds on the same analytical system. However, this approach depends on the (commercial) availability of such standard compounds, which is limited, especially for secondary metabolites. When standard compounds are available, they are useful not only for confirming the identity of compounds but also for (semi-)quantitative analyses and, most importantly, for the construction of metabolite databases containing experimental data of tested compounds on a fully characterized system.
In summary, the ability to assign metabolites using MS resides in the possibility of combining different features of the MS analysis (accurate mass, fragmentation pattern and isotopic pattern) with additional experimental parameters such as retention time, UV/Vis spectra and confirmation with standard compounds. Biochemical, literature and species information, as well as any other relevant information, is also valuable for the assignment of the metabolite under study (Fig. 1.2).

NMR

NMR is a spectroscopic technique that takes advantage of the spin properties of atomic nuclei. The nuclear spin is the total angular momentum of the nucleus. Only nuclei with a non-zero nuclear spin exhibit nuclear magnetic resonance. Among these are 1H, 13C, 15N and 31P, which are isotopes of elements present in bio-organic molecules. Depending on the nucleus, the nuclear spin can assume at least 2 different spin states. When exposed to an external magnetic field, the spins of the nuclei (re)orient along the magnetic field axis, making it possible to change from one spin state to the other by absorbing energy. This is the basis of nuclear magnetic resonance. The energy difference between the nuclear spin states depends on the (external) magnetic field strength and the gyromagnetic ratio of the nucleus; the populations of the spin states follow a Boltzmann distribution, which additionally depends on the temperature of the sample. Because the nuclear transition energy is much lower (typically by a factor in the order of 10^4) than that of an electronic transition, NMR is not as sensitive as other techniques such as infrared (IR) or UV/Vis spectroscopy (Claridge, 1999; Nave, 2005). Nevertheless, NMR is perhaps the most selective analytical technique available, being able to provide unambiguous information from the magnetic signatures of the atoms that make up a molecule. One of the many applications of NMR is the elucidation of chemical structures, as it can provide highly specific evidence for the identification of a molecule. Furthermore, NMR is a quantitative technique, because the number of nuclear spins is directly related to the intensity of the signal (Pauli, 2001). Different metabolomics approaches can be applied when using NMR (Ratcliffe and Shachar-Hill, 2005). The first is related to the capacity for molecule identification. Because 1H is part of almost all bio-organic molecules and has a very high natural abundance (99.9816-99.9974% (de Laeter, 2003)) and good NMR properties, it is the most used nucleus for NMR measurements.
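To make the sensitivity argument concrete, the following sketch evaluates the resonance frequency and the Boltzmann population excess for a spin-1/2 nucleus. It is a back-of-the-envelope calculation under the assumption of an isolated two-level system; the constants are standard values and the function names are hypothetical.

```python
import math

H_BAR = 1.054571817e-34      # reduced Planck constant, J s
K_B = 1.380649e-23           # Boltzmann constant, J/K
GAMMA_1H = 267.522e6         # 1H gyromagnetic ratio, rad s^-1 T^-1


def larmor_frequency_mhz(b0, gamma=GAMMA_1H):
    """Resonance frequency (MHz) at field b0 (tesla)."""
    return gamma * b0 / (2 * math.pi) / 1e6


def population_excess(b0, temperature=298.0, gamma=GAMMA_1H):
    """Fractional excess of spins in the lower state of a spin-1/2 nucleus."""
    delta_e = gamma * H_BAR * b0                        # energy gap, J
    ratio = math.exp(-delta_e / (K_B * temperature))    # N_upper / N_lower
    return (1 - ratio) / (1 + ratio)
```

At 14.1 T and room temperature the 1H frequency comes out at about 600 MHz and the population excess is only a few spins per hundred thousand, which is why NMR signals are intrinsically weak and why higher fields help.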
In general, the compounds of interest are isolated from their tissues, often through laborious analytical procedures, and dissolved in a (when possible deuterated) solvent for the acquisition of a 1H NMR spectrum and, where appropriate, two-dimensional (2D) NMR spectra. For most bio-organic compounds the acquisition of a 1D 1H NMR spectrum is not sufficient for a full structural elucidation. Homonuclear 1H 2D spectra such as COSY (correlated spectroscopy), TOCSY (total correlation spectroscopy) and NOESY (nuclear Overhauser enhancement spectroscopy) are very informative about the connectivity and 3D position of the protons in a molecule. To capture connectivities between different nuclei, such as between 1H and 13C, heteronuclear 2D spectra can be acquired to detect direct 1H-13C bonds by HMQC (heteronuclear multiple quantum coherence) or longer-range couplings by HMBC (heteronuclear multiple bond correlation). There is a wide diversity of NMR measurements, according to the interest of the user in particular chemical features.

Another metabolomics application is in vivo NMR. Because NMR is a non-destructive technique, performing measurements without sample loss is feasible, which allows monitoring time series of changing biological materials or performing analyses without sample extraction, i.e. on solid tissues. In vivo NMR can be advantageous for measuring metabolites in specific cellular compartments, which after an extraction step could never be attributed to specific organelles or tissues (Aubert et al., 1999). Furthermore, the use of HR-MAS (high resolution magic angle spinning) is of high importance in the medical field in the analysis of biopsies for clinical judgements (Sitter et al., 2006), in the food processing industry (Shintu et al., 2004) and in any other case where sampling proves to be difficult.

A fast-growing approach, in particular in the animal/human research area, is NMR fingerprinting. This approach involves the acquisition of NMR spectra of complex mixtures, such as biofluids or plant extracts, for pinpointing differences between samples, with the intention of biomarker discovery (Ratcliffe and Shachar-Hill, 2005; Kochhar et al., 2006). This strategy pairs with MS fingerprinting for obtaining a global overview of the metabolome. Tomato fruits and Arabidopsis leaves have been profiled by NMR (Le Gall et al., 2003a; Ward et al., 2003). Most studies so far use 1H NMR, as it is the least selective for the type of molecule and provides the highest sensitivity. However, 13C NMR (Vlahov, 2006) and 2D measurements such as J-resolved (JRes) (Viant, 2003), COSY (Xi et al., 2006) and HMBC (Masoum et al., 2006) have also been used. In NMR profiling, the need for spectral comparisons demands that spectrum acquisition and the control of conditions be extremely rigorous. Small changes in temperature or pH, the presence of impurities or degradation of the sample material can lead to the detection of false metabolic alterations and therefore to the indication of incorrect differential metabolites.
Nowadays, in a 14.1 Tesla (600 MHz for 1H NMR) instrument, the limit of detection is in the microgram (1H and 1H-13C NMR) or even sub-microgram region (1H NMR). The sensitivity of NMR has been improving over the years, increasing the suitability of this technique for analytical applications. The detection of less sensitive nuclei such as 13C or 15N through magnetization of 1H, using probe heads with pulsed gradients for the acquisition of 2D-NMR spectra, increased the spectral resolution and sensitivity compared to 1D NMR of 13C or 15N. The nuclear properties as well as the natural abundance of the nuclei chosen for NMR acquisition also condition the NMR signal: 1H is naturally far more abundant than 13C, so the amount of 13C available for detection in a sample is correspondingly limited. The resolution and the signal-to-noise of the measurement can be improved by using instruments with higher magnetic field strengths. The number of nuclei, or the number of moles of the analyte, in the detection volume used for the measurement also influences the sensitivity of the NMR measurement. Thus, for high MM compounds, larger amounts (in mass) are needed to achieve sufficient sensitivities (Fig. 1.6). The labelling of low abundant metabolites with stable isotopes can also be applied and can be a strategy for performing 2D-NMR analysis on low amounts of material. In flux analysis, labelling compounds to follow the propagation of the isotope label in pathway analysis and kinetics measurements is a known application (Ratcliffe and Shachar-Hill, 2006).

In NMR spectroscopy, the signal-to-noise ratio (Equation 1.1) is dependent on the T2* of the signals measured (Claridge, 1999). The T2* is inversely related to the line width of the signals obtained (πΔν½ = 1/T2*) and is influenced by magnetic field inhomogeneities. These magnetic field inhomogeneities can be caused by magnetic susceptibility fluctuations in the sample (for instance the presence of large particles, paramagnetic ions or inferior NMR tubes) or by poor shimming. Automated shimming procedures available on the most recent NMR instruments largely alleviate the latter, leaving sample preparation as the major cause of inferior NMR spectra.

\[
\frac{S}{N} \propto N A\, T^{-1} B_0^{3/2}\, \gamma_{exc}\, \gamma_{obs}^{3/2}\, T_2^{*}\, (NS)^{1/2} \qquad (1.1)
\]

S/N = signal-to-noise ratio
N = number of molecules in the observed sample volume
A = abundance of the NMR active spins involved in the experiment
T = temperature
B0 = static magnetic field
γexc = magnetogyric ratio of the initially excited spins
γobs = magnetogyric ratio of the observed spins
T2* = effective transverse relaxation time
NS = total number of accumulated scans
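The practical consequences of Equation 1.1 can be explored numerically. The sketch below reads the equation as S/N ∝ N A T^-1 B0^(3/2) γexc γobs^(3/2) T2* NS^(1/2), which is the usual textbook form and is treated here as an assumption; only ratios between two settings are meaningful, since the proportionality constant is omitted. The function names and default values are hypothetical.

```python
import math


def relative_snr(n=1.0, a=1.0, temp=298.0, b0=14.1,
                 gamma_exc=1.0, gamma_obs=1.0, t2_star=1.0, scans=1):
    """Right-hand side of Equation 1.1, up to a constant factor."""
    return (n * a / temp * b0 ** 1.5 * gamma_exc * gamma_obs ** 1.5
            * t2_star * math.sqrt(scans))


def linewidth_hz(t2_star):
    """Line width at half height (Hz) from T2* (s), using pi * dv = 1 / T2*."""
    return 1.0 / (math.pi * t2_star)
```

For example, `relative_snr(scans=4) / relative_snr(scans=1)` evaluates to 2, so quadrupling the measurement time only doubles the signal-to-noise, whereas doubling the field strength gives a gain of 2^(3/2), roughly 2.8.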

The appearance of cryogenic probeheads brought important improvements in NMR sensitivity (Kovacs et al., 2005). By reducing thermal noise through the use of low-temperature detection coils, a signal-to-noise ratio up to 5-fold higher than with conventional probes can be obtained. In addition, the possibility of miniaturizing the active volume of the detection cell enabled the appearance of microprobes. Moreover, the signal-to-noise of the detection coil is inversely related to its diameter. These miniaturized NMR probes are available with active volumes as low as 1.5 µL, providing new possibilities for analysing molecules in smaller detection volumes, which increases the concentration of the analyte at no expense to the signal-to-noise. This low active volume is compatible with chromatographic elution volumes in capillary chromatography, making the use of capillary microcoil NMR (CapNMR) feasible (Schroeder and Gronquist, 2006).

LC-(SPE)-NMR

The coupling of LC with NMR is becoming increasingly useful as NMR sensitivity improves, reducing the analytical effort needed to obtain enough material for NMR measurements. In practical terms, the hyphenation of LC with NMR is still not as clear-cut as LC-MS, but it is establishing itself as a powerful system for identifying related metabolites in complex mixtures such as natural extracts from plants. There are different configurations that can be used when coupling LC to NMR (Exarchou et al., 2005). More recently the online coupling of LC to SPE and subsequent NMR became available and overcame some of the analytical barriers of the previous modes. In this configuration, the chromatographic peaks are trapped on SPE cartridges and can be concentrated several-fold by multi-trapping on the same cartridge. The chromatography itself can be done with (less expensive) protonated solvents because the analytes within the cartridges are dried and then eluted with fully deuterated solvents. The separation of flavonoids and phenolic acids present in Greek oregano extract was accomplished by this method (Exarchou et al., 2003). This method is suitable for the analysis of less abundant compounds in complex mixtures, since it allows the separation, concentration and NMR acquisition of metabolites within a single system, avoiding the often tedious analytical preparations before NMR analysis.

Using NMR in the identification of metabolites

The magnetic resonance of the nuclei present in a molecule is displayed as signals with a defined frequency, represented by chemical shift values, δ, in the NMR spectrum. The analysis of an NMR spectrum can be extremely puzzling due to overlapping signals and multiplicities within the signals. The NMR spectrum of a particular molecule is unique, and for this reason NMR is considered one of the most selective techniques (and perhaps even the most selective) for compound elucidation. In the analysis of NMR spectra, the number, position and area of the signals in the spectrum, as well as their multiplicity, are some of the aspects used to attempt the assignment of a molecule. An aspect that can be both highly informative and difficult to interpret is the multiplicity of signals. The signal splitting, or multiplicity, is caused by the spin-spin coupling between the proton and nearby nuclei. The coupling constants, J, carry structural information necessary for the elucidation of most molecules. The interpretation of NMR spectra can be quite demanding, especially for highly related structures or higher MM molecules. There are several software tools (ACD/Labs, ChemOffice, and PERCH Solutions) that can help in 1H NMR spectral analysis by providing NMR spectral predictions. The aim of these prediction tools is to help analysts assign spectral δ’s and J’s to the analysed molecule. Strictly theoretical calculations of NMR spectra from molecular properties are an option, yet experimental spectra often show unaccounted effects that are difficult to incorporate in the theoretical prediction routines. In particular, the prediction of 1H NMR spectra proves to be more difficult to implement due to the effect of 3D conformational structures on the 1H NMR chemical shifts of the protons. The construction of prediction models based on experimental data can be a successful alternative for describing chemical phenomena at a detailed molecular level (Moco et al., 2006b).
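The relation between δ, J and multiplicity can be shown with a small first-order example. The sketch assumes coupling to n equivalent spin-1/2 neighbours with a single coupling constant (the n + 1 rule), which ignores second-order effects and is far simpler than the prediction software named above; the function names are hypothetical.

```python
from math import comb


def multiplet_intensities(n_equivalent_neighbours):
    """Relative line intensities of a first-order multiplet (n + 1 rule)."""
    n = n_equivalent_neighbours
    return [comb(n, k) for k in range(n + 1)]


def multiplet_positions_ppm(delta_ppm, j_hz, n_equivalent_neighbours, spectrometer_mhz):
    """Line positions (ppm) of a multiplet centred at delta_ppm."""
    n = n_equivalent_neighbours
    offsets_hz = [(k - n / 2) * j_hz for k in range(n + 1)]
    # 1 ppm corresponds to spectrometer_mhz Hz, so Hz / MHz gives ppm.
    return [delta_ppm + off / spectrometer_mhz for off in offsets_hz]
```

A proton with two equivalent neighbours gives a 1:2:1 triplet; at 600 MHz and J = 7 Hz the outer lines lie about 0.012 ppm on either side of the chemical shift, which shows how J (in Hz) and δ (in ppm) are read from the same signal.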

LC-MS-NMR

The identification of metabolites can be aided by metabolite profiling methods such as MS or NMR, but often the full chemical description of a molecule is only achieved by integrating metabolite information taken from different sources. The combination of MS with NMR for unravelling the identity of a molecule is one of the most powerful strategies (Fig. 1.2). On the one hand, MS can indicate not only the MM of a compound, and therefore the possible MF’s, but also the presence of certain functional groups or substitution patterns. However, ambiguity remains in the absence of standard compounds whose mass values, fragment ions and fragmentation energies can be compared with those of the unknown molecule. On the other hand, NMR allows the structural elucidation of molecules up to the isomer level. The most efficient way to seize the advantages of both technologies is to use them in parallel or, if possible, online. The coupling of LC with both MS and NMR has been described and is an elegant and efficient way of obtaining useful data for the identification of compounds (Exarchou et al., 2003). Performing the same separation for both MS and NMR makes the correspondence of the chromatographic signals between these two instruments clear. However, due to the complex analytical setup, separate analyses by LC-MS and LC-(SPE)-NMR are still the most common.

Developments in chemometric methods can assist in the rapid identification of molecules present in complex mixtures. The method depends on data obtained from a large number of samples which are measured by both LC-MS and NMR. The different data matrices obtained from these fingerprints can be fused using concatenation or other data fusion methods. In theory, fluctuations in the LC-MS matrix should be reflected by similar changes in the NMR matrix. When the sample preparation and analysis are done in a coherent manner, this method might enable high throughput identification of molecules. This approach has been tested for biofluid analysis, by coupling LC-MS and NMR data of urine samples (Crockford et al., 2006; Forshed et al., 2007), and can be a promising strategy in biomarker discovery.
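The idea of relating fluctuations in the two data matrices can be sketched as follows. This is a simplified illustration, not the procedure of the cited studies: it assumes two small samples-by-variables matrices with identical sample order, joins them by plain concatenation and uses Pearson correlation to find the NMR variable that best tracks one LC-MS feature. All names and numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant intensity vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def concatenate(lcms, nmr):
    """Low-level fusion: join two samples-by-variables matrices row by row."""
    return [list(a) + list(b) for a, b in zip(lcms, nmr)]


def best_nmr_partner(lcms, nmr, feature):
    """Index and correlation of the NMR variable that best tracks one LC-MS feature."""
    x = [row[feature] for row in lcms]
    scores = [pearson(x, [row[j] for row in nmr]) for j in range(len(nmr[0]))]
    best = max(range(len(scores)), key=lambda j: scores[j])
    return best, scores[best]


# Four samples; made-up intensities for two LC-MS features and three NMR bins.
lcms = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]]
nmr = [[10.0, 2.0, 7.0], [20.0, 1.0, 7.5], [30.0, 3.0, 6.0], [40.0, 2.5, 9.0]]
```

Here the first LC-MS feature and the first NMR bin rise together across the four samples, so `best_nmr_partner(lcms, nmr, 0)` points to bin 0 with a correlation close to 1. In real data, scaling of each block before fusion and many more samples would be needed.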

DATA ANALYSES

The extraction of valuable conclusions from metabolomics data is as important as performing the analytical measurements themselves. There is a variety of methods for transforming raw data, taken directly from the instrument, through different treatments into, ultimately, a list of metabolites. Prior to any data analysis, it is important to be aware of the possible sources of variation present in the samples, which can influence the final conclusions if they are overlooked. Parameters such as the biological variation among individuals, sampling, sample preparation and the analytical measurement influence the reproducibility of the results and should be monitored as much as possible by measuring replicates, both analytical and biological. In principle, the biological variance should surpass all analytical variance. Signal irreproducibility is an obstacle to the reliable comparison of chromatograms and spectra. Retention time shifts in GC and, more severely, in LC are common, as are occasional shifts in NMR spectra. In the latter, non-reproducibilities seem to be strictly related to sample preparation and hardly to instrumental instability. Nevertheless, even under strictly controlled conditions signal shifts may persist. For this reason, the use of signal alignment software has become a routine procedure for the comparison of chromatograms or spectra. MetAlign (De Vos et al., 2007), XCMS (Smith et al., 2006) and MZmine (Katajamaa et al., 2006) are some of the available alignment toolboxes for MS applications, and HiRes for NMR applications (Zhao et al., 2006). These are relevant tools for the reduction of raw data into a still informative but workably sized data set. For masking or emphasising variable and sample deviations, scaling and standardisation tools can be applied, as long as these do not lead to artificial distortions of the original data.
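Two commonly used scaling options illustrate how such tools mask or emphasise deviations. The sketch below is a generic illustration and not the method of any software named above; it operates on the intensities of one variable across samples, and the function names are hypothetical.

```python
import math


def _mean_sd(values):
    """Mean and sample standard deviation of one variable across samples."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return mean, sd


def autoscale(values):
    """Unit-variance scaling: every variable gets equal weight."""
    mean, sd = _mean_sd(values)
    return [(v - mean) / sd for v in values]


def pareto_scale(values):
    """Pareto scaling: divide by sqrt(sd), damping but keeping intensity differences."""
    mean, sd = _mean_sd(values)
    return [(v - mean) / math.sqrt(sd) for v in values]
```

Autoscaling gives low-abundance signals the same influence as intense ones, which can also inflate noise, whereas Pareto scaling is a compromise that keeps part of the original intensity structure. Which choice avoids artificial distortion depends on the data set.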
As for all the “omics” technologies, multidimensionality is one of the characteristics of metabolomics data and makes the data set inherently complex. Supervised and unsupervised methods such as principal component analysis (PCA), hierarchical cluster analysis (HCA), partial least squares (PLS) and discriminant analysis (DA), among others, are widely applied in metabolomics (Scholz et al., 2005; Masoum et al., 2006). These methods not only simplify the data by reduction of dimensionality but can also provide a visual representation of the data. More sophisticated methods such as correlation matrices and metabolic correlation networks can help to establish relationships between different metabolites and even between metabolites and transcripts, genes or proteins. In this way, a systems level overview is envisioned (Joyce and Palsson, 2006). There are different visualisation tools and databases that can be used to display the coupling of different “omics” data: KEGG (www.genome.jp/kegg), MetaCyc (http://metacyc.org), MAPMAN (gabi.rzpd.de/projects/MapMan) and KappaView (kpv.kazusa.or.jp/kappa-view), for example.

IDENTIFICATION TOOLS AND DATABASES

There are still only a few tools that can automatically produce a list of possible metabolites from the mass signals at a particular retention time (MS) or from δ’s (NMR). The analysis of spectrometric or spectroscopic data implies an intensive manual effort, hindering the throughput of the analysis setup. In fact, the bridge between experimental data (MS and NMR spectra, retention time, fragmentation pattern, chemical shift, coupling constant) and the available chemical databases (Table 1.2) is still weak, let alone automatic. Some identification tools, such as elemental composition calculation or MM calculation, exist in the software of the different instruments, but these seldom offer a spectral matching tool linked to a public database, as in proteomics applications. Among the few examples is AMDIS (Automated mass spectral deconvolution and identification system) (www.amdis.net), which is used mostly for the identification of GC-MS signals. Advanced Chemistry Development Labs (ACDLabs) also provides commercial spectral matching with databases for MS and NMR, as well as predictor tools (Advanced Chemistry Development, Inc.). Nevertheless, many plant metabolites, such as secondary metabolites, are not present in these databases.
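What an automatic link between a mass signal and a database could look like is sketched below. The three-entry library is a hypothetical in-memory stand-in for a real database, the adduct list is an illustrative subset, and the function names are assumptions; the formulae are those of three well-known plant phenolics.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
PROTON = 1.00727646          # mass of H+ (u)
ADDUCT_SHIFT = {"[M+H]+": PROTON, "[M-H]-": -PROTON, "[M+Na]+": 22.98922070}

# Hypothetical in-memory "database": name -> molecular formula.
LIBRARY = {
    "naringenin": "C15H12O5",
    "chlorogenic acid": "C16H18O9",
    "rutin": "C27H30O16",
}


def monoisotopic_mass(formula):
    """Neutral monoisotopic mass from a simple formula string such as C15H12O5."""
    return sum(MONO[el] * int(count or 1)
               for el, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula))


def match_mz(mz, adduct, ppm=5.0):
    """Library entries whose ion m/z for `adduct` lies within `ppm` of `mz`."""
    hits = []
    for name, formula in LIBRARY.items():
        ion = monoisotopic_mass(formula) + ADDUCT_SHIFT[adduct]
        error = (mz - ion) / ion * 1e6
        if abs(error) <= ppm:
            hits.append((name, round(error, 1)))
    return hits
```

Calling `match_mz(271.0612, "[M-H]-")` returns naringenin as the only hit in this toy library. A real tool would also have to deal with isomers sharing one formula, which is where retention time, fragmentation and NMR data come in.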

Table 1.2. Number of metabolite records present in MS and NMR, pathway and chemical databases. DB

Source

No. Records (ca.)

MS-based DBs

NIST/EPA/NIH Mass Spectral Library (NIST 05)

National Institute of Standards and Technology (NIST)

163,000

SpecInfo

Daresbury Laboratory

139,000

Spectral Database for Organic Compounds, SDBS

National Institute of Advanced Industrial Science and Technology (AIST)

23,500

KNApSAcK (Comprehensive Species-Metabolite Relationship Database)

Nara Institute of Science and Technology (NAIST)

15,500

Metlin

The Scripps Research Institute

15,000

Human Metabolome Database (HMDB)

Genome Alberta and Genome Canada

Golm Metabolome Database ([email protected])

Max Planck Institute of Molecular Plant Physiology

Metabolome of Tomato Database(MoTo DB)

Plant Research International

2,300

100

NMR-based DBs Flavonoid Database

Wageningen University

Human Metabolome Database (HMDB)

Genome Alberta and Genome Canada

ACD Databases

Advanced Chemistry Development, Inc.

Spectral Database for Organic Compounds, SDBS

National Institute of Advanced Industrial Science and Technology (AIST)

SpecInfo

Daresbury Laboratory

Standard Compounds on Biological Magnetic Resonance Bank (BMRB)

University of Wisconsin

NMRShiftDB

University of Koeln

250 (13C and 1H) 400 (13C) 350 (1H) 15,000 (13C and 1H) 8,800 (15N) 26,100 (31P) 12,500 (13C) 14,300 (1H) 102,000 (13C) 117,000 (1H) 1,000 (15N) 1,000 (17O) 17,000 (31P) 25,000 (19F) 275 (13C and 1H) 19,500 (13C) 3,000 (1H)

Pathways DBs Kyoto Encyclopedia of Genes and Genomes (KEGG)

Kyoto University / Tokyo University

14,000

Chemical DBs SciFinder

Chemical Abstracts Service (CAS)

30,500,000

PubChem

National Institutes of Health (NIH)

10,100,000

Beilstein Database

MDL

eMolecules

eMolecules

Available Chemicals Directory

Elsevier MDL

Combined Chemical Dictionary (CCD)

Chapman & Hall/CRC Press

9,400,000 > 5,600,000 >> 200,000

Dictionary of Organic Compounds

265,000

Dictionary of Natural Products

170,000

Dictionary of Inorganic and Organometallic Compounds

103,000

Dictionary of Drugs

44,000

Dictionary of Analytical Reagents

14,000

ChemIDplus

National Institute of Health (NIH)

380,000

Substance Registry System (SRS)

Environmental Protection Agency (EPA)

98,000

ChemFinder

CambridgeSoft Corporation

72,000

Merck Index

John Wiley & Sons, Inc.

10,200

Chemical Entities of Biological Interest (ChEBI)

European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI)

10,000

Laboratories within the community are starting to build public metabolite databases (Table 1.2). One of the largest initiatives for the identification of metabolites is the Human Metabolome Project, in which MS and NMR data are combined with molecule information (Wishart et al., 2007). A detailed description of the methods of sample preparation and analysis, the conditions of the analytical experiment, chemical information about the metabolites (name, IUPAC name, chemical descriptors such as CAS registry numbers and InChI and/or structural information, links to chemical databases), experimental spectra and biological source are some of the features to include in metabolite databases. A troublesome issue already lies in the nomenclature of molecules: the list of common names for a given molecule can be quite extensive, and the same common name can be attributed to distinct molecules. This is a real impediment to a reliable and unambiguous classification and creates false interpretations, in particular in the organisation of databases and searching tools. Only with an accurate description of the experimental conditions and chemical identity of the metabolites is the comparison and exchange of data meaningful. Perhaps at this stage the rigorous identification of metabolites will have to become a priority, as dealing with unknowns and not fully identified metabolites creates many incongruent hits in the databases. Ideally the separate metabolite databases will be accessible through a common search engine as an open source web service, as in BioMOBY (Wilkinson et al., 2005).

CONCLUSIONS

The description of the metabolome can be achieved by different methods, either in parallel or in combination. MS and NMR profiling techniques in particular are powerful methods to detect the metabolome as a whole. Comparison of metabolic profiles can elucidate differences between organisms and pinpoint the responsible metabolites. However, if we can identify differences but not describe them chemically, very little can be said about the underlying nature of the metabolic phenomena. There is still a long way to go to completely describe the metabolome of an organism, which makes the elucidation of unknowns a priority. As yet, no single analytical method can capture the whole metabolome, and the analytical method chosen restricts the set of metabolites that can subsequently be identified. Currently, the integration of high resolution LC-MS and NMR approaches provides the necessary information for the elucidation of compounds. The development of bioinformatic tools will facilitate the management of large amounts of data and help in the integration of different data sets by sieving the metabolite information from the instrumental chromatograms and spectra. The expansion of our view of the metabolome of organisms will improve the description of metabolic networks and cellular phenomena in general.

Chapter 2

Untargeted Large Scale Plant Metabolomics using Liquid Chromatography coupled to Mass Spectrometry Ric C.H. De Vos*, Sofia Moco*, Arjen Lommen*, Joost J.B. Keurentjes, Raoul J. Bino and Robert D. Hall Nature Protocols 2: 778-791 (2007) *equally contributing authors

Untargeted metabolomics aims to gather information on as many metabolites as possible in biological systems by taking into account all information present in the data sets. Here we describe a detailed protocol for large scale untargeted metabolomics of plant tissues, based on reversed phase liquid chromatography coupled to high resolution mass spectrometry (LC-QTOF-MS) of aqueous-methanol extracts. Dedicated software, metAlign, is used for automated baseline correction and alignment of all extracted mass peaks across all samples, producing detailed information on the relative abundance of thousands of mass signals representing hundreds of metabolites. Subsequent statistics and bioinformatics tools can be used to provide a detailed view on the differences and similarities between (groups of) samples or to link metabolomics data to other systems biology information, genetic markers and/or specific quality parameters. The complete procedure from metabolite extraction towards a data matrix with aligned mass signal intensities takes about 6 days for 50 samples.

INTRODUCTION

Metabolomics has emerged as a valuable technology for the comprehensive profiling and comparison of metabolites in biological systems and a multitude of applications for human, microbial and plant systems have already been reported or predicted (Sumner et al., 2003; Bino et al., 2004; Jenkins et al., 2004; Trethewey, 2004; van der Greef et al., 2004; Vaidyanathan et al., 2005; Dixon et al., 2006; Hall, 2006; Saito et al., 2006). Plants are especially rich in chemically diverse metabolites, which are usually present in a large range of concentrations, and no single analytical method is currently capable of extracting and detecting all metabolites. Over the past decade, several methods suitable for large scale analysis and comparison of metabolites in plant extracts have been established (Dixon et al., 2006; Hall, 2006), including gas chromatography coupled to mass spectrometry (GC-MS) (Fiehn et al., 2000; Roessner et al., 2001; Roessner et al., 2002; Fernie, 2003; Schauer et al., 2005a; Lisec et al., 2006), direct flow injection-mass spectrometry (DFI-MS) (Aharoni et al., 2002; Goodacre et al., 2003; Hirai et al., 2005; Overy et al., 2005), liquid chromatography-mass spectrometry (LC-MS) (Tolstikov et al., 2003; Jander et al., 2004; von Roepenack-Lahaye et al., 2004; Vorst et al., 2005; Moco et al., 2006a; Rischer et al., 2006), capillary electrophoresis-mass spectrometry (CE-MS) (Sato et al., 2004), and nuclear magnetic resonance (NMR) technologies (Le Gall et al., 2003a; Ward et al., 2003). LC-MS based approaches are expected to be of particular importance in plants, due to the highly rich biochemistry of plants which covers many semi-polar compounds, including key secondary metabolite groups, which can best be separated and detected by LC-MS approaches (Huhman and Sumner, 2002; Tolstikov et al., 2003; von Roepenack-Lahaye et al., 2004; Breitling et al., 2006; Dixon et al., 2006; Hall, 2006; Moco et al., 2006a; Saito et al., 2006).
Of the many semi-polar compounds not involved in primary metabolism, quite a number have already been shown to have phenotypic / physiological importance. It is also mainly secondary metabolites that are attracting much attention from health, food and nutrition groups (Beekwilder et al., 2005; Vaidyanathan et al., 2005; Dixon et al., 2006; Rischer et al., 2006) owing to, for example, their resistance effects, antioxidant properties, and colour and flavour characteristics. These and other so-called quality aspects of plant materials are generally not centred on individual metabolites but rather are related to a particular mixture of compounds from diverse, biochemically-related and unrelated groups. As such, a metabolomics approach that helps to better understand how complex these mixtures are, which components play the most important role, and how their biosynthesis is controlled, is likely to be of great future value and importance.

Large Scale Metabolomics using LC-MS

Commonly used plant metabolomics approaches and their advantages and limitations

Although NMR is in principle the most uniform detection technique and is essential for the unequivocal identification of unknown compounds, NMR-based metabolomics approaches still suffer from a relatively low sensitivity compared to MS. As yet, MS-based platforms are most widely used in plant metabolomics (Hall, 2006). GC coupled to electron-impact time-of-flight (TOF)-MS was the first approach used in large scale plant metabolomics (Fiehn et al., 2000), and a detailed protocol for sample extraction, derivatization and subsequent data analyses has recently been described (Lisec et al., 2006). This approach covers a large variety of nonvolatile metabolites, mainly those involved in primary metabolism, including organic and amino acids, sugars, sugar alcohols, phosphorylated intermediates (in the polar fraction of extracts), as well as lipophilic compounds such as fatty acids and sterols (in the apolar fraction). GC-(TOF)MS produces highly reproducible separation and fragmentation patterns of metabolites, which enables the development of common GC-TOF-MS based metabolite libraries (Kopka et al., 2005; Schauer et al., 2005a). Although CE-MS also enables good separation and detection of many polar primary metabolites (Sato et al., 2004), it is seldom used compared to GC-TOF-MS. As most primary metabolites have commercially available standard compounds, both GC-TOF-MS and CE-MS can produce quantitative data for hundreds of compounds involved in central metabolism. The preferred method for analysing semi-polar metabolites is LC-MS with a soft-ionisation technique, such as electrospray ionisation (ESI) or atmospheric pressure chemical ionisation (APCI), resulting in protonated (in positive mode) or deprotonated (in negative mode) molecular ions.
Compounds detectable by LC-MS include the large and often economically important group of plant secondary metabolites such as alkaloids, saponins, phenolic acids, phenylpropanoids, flavonoids, glucosinolates, polyamines, and all kinds of derivatives thereof (Huhman and Sumner, 2002; Tolstikov et al., 2003; Moco et al., 2006a; Rischer et al., 2006). These compounds can be effectively extracted with aqueous alcohol solutions and directly analysed without derivatization. Depending on the type of column used, various primary metabolites including several polar organic acids and amino acids can be reliably analysed using LC-MS (Tolstikov and Fiehn, 2002). Based on the high mass resolution of time-of-flight (TOF)-MS and Fourier Transform-Ion Cyclotron Resonance-MS (FT-ICR-MS) instruments, enabling elemental formulae calculations of detected ions, rapid DFI-MS approaches without any prior compound separation have been developed to compare metabolite fingerprints of crude plant extracts (Aharoni et al., 2002; Goodacre et al., 2003; Hirai et al., 2005; Overy et al., 2005).

CHAPTER 2

However, such direct injection approaches, irrespective of the resolution and accuracy of the mass spectrometer, may suffer from significant adduct formation and ion suppression phenomena upon ionisation of complete crude extracts. Moreover, by definition, direct injection methods cannot discriminate between the many molecular isomers. Therefore, most MS-based platforms in plant metabolomics perform at least some separation. LC preceding MS not only allows isomeric compounds, which are often abundantly present in plants, to be detected separately, but also enables valuable structural information to be collected online, for example, MS/MS fragmentation patterns and UV/Vis absorbance spectra using photodiode array (PDA) detection (Huhman and Sumner, 2002; Tolstikov and Fiehn, 2002; Tolstikov et al., 2003; von Roepenack-Lahaye et al., 2004; Moco et al., 2006a; Rischer et al., 2006; Saito et al., 2006). It has been estimated that extensive LC in combination with high resolution MS (e.g. TOF-MS) enables the detection of several hundred compounds in a single crude plant extract (von Roepenack-Lahaye et al., 2004; Vorst et al., 2005; Moco et al., 2006a). With continually improving tools for data acquisition, processing and mining, LC-MS will certainly grow in value for biochemical profiling and metabolite identification. Combining LC with ultra-high resolution mass spectrometry such as FTMS (Breitling et al., 2006; Peterman et al., 2006) and other identification tools like LC-NMR-MS (Exarchou et al., 2003; Wilson and Brinkman, 2003; Wolfender et al., 2003), as well as making use of improved separation technologies such as ultra-performance LC (UPLC) coupled to MS (Laaksonen et al., 2006; Nordström et al., 2006), will further improve our potential to identify metabolites and to provide an even more detailed metabolite profile of plant extracts.

Untargeted LC-MS for plant metabolomics

Compared with primary metabolites, the number of commercially available standards for secondary metabolites per plant species or tissue is still very limited. Consequently, metabolomics approaches based on analyses of compounds for which standards are available, which is common practice in GC-(TOF)MS based metabolomics studying primary metabolism, would very much limit the great potential of LC-MS in plant research. Recent developments in processing software for unbiased mass peak extraction and alignment of LC-MS data, such as metAlign (Bino et al., 2005; Vorst et al., 2005; Keurentjes et al., 2006; Moco et al., 2006a), XCMS (Nordström et al., 2006; Smith et al., 2006), MZmine (Katajamaa et al., 2006) and Markerlynx (Idborg et al., 2005) now offer possibilities for more holistic untargeted metabolomics approaches that aim to gather information on as many metabolites as possible present in the extracts analysed. In such untargeted approaches, mass peak identification using standards is not the primary step in data processing. Instead, all analytical information present in the profiles is first transformed into coordinates on the basis of mass, retention time and signal amplitude. These coordinates are then aligned across all samples. By applying appropriate statistical and multivariate analysis tools, differential mass peaks or mass peaks correlating with a specific trait can be selected, identified to some degree using accurate mass and MS/MS fragmentation, and then confirmed with standards when available. Examples of such untargeted approaches in plant research are the comparison of secondary metabolites in roots and leaves of wild-type and mutant Arabidopsis (Arabidopsis thaliana) plants (von Roepenack-Lahaye et al., 2004), studying metabolic alterations in fruits of a light-hypersensitive mutant of tomato (Solanum lycopersicum) (Bino et al., 2005), comparing tubers of potato (Solanum tuberosum) of different genetic origin and developmental stages (Vorst et al., 2005), determining tissue-specificity of metabolic pathways in tomato fruit (Moco et al., 2006a), establishing gene-to-metabolite networks in Catharanthus roseus (Rischer et al., 2006), and identifying quantitative trait loci (QTLs) controlling metabolite composition in Arabidopsis (Keurentjes et al., 2006; Fu et al., 2007).
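The coordinate-and-align idea described above can be illustrated with a minimal Python sketch. It is not the algorithm used by metAlign or any of the other packages named; it only groups peaks from different samples that share a nominal mass and lie within an assumed retention-time tolerance. The function names, the 0.1 min tolerance and the example peaks are hypothetical.

```python
from collections import defaultdict

def align_peaks(samples, rt_tol=0.1):
    """Group (mz, rt, intensity) peaks across samples into aligned rows.

    samples: dict mapping sample name -> list of (mz, rt_min, intensity).
    Returns a list of rows: (nominal_mass, mean_rt, {sample: intensity}).
    """
    # Pool all peaks per nominal mass, remembering the sample of origin.
    by_mass = defaultdict(list)
    for name, peaks in samples.items():
        for mz, rt, inten in peaks:
            by_mass[round(mz)].append((rt, name, inten))

    rows = []
    for mass in sorted(by_mass):
        group = []
        for rt, name, inten in sorted(by_mass[mass]):
            # Start a new aligned peak when the retention-time gap exceeds the tolerance.
            if group and rt - group[-1][0] > rt_tol:
                rows.append(_finish(mass, group))
                group = []
            group.append((rt, name, inten))
        rows.append(_finish(mass, group))
    return rows

def _finish(mass, group):
    """Collapse one group of co-eluting peaks into a single aligned row."""
    mean_rt = sum(rt for rt, _, _ in group) / len(group)
    intensities = {}
    for _, name, inten in group:
        # Keep the largest signal if a sample contributes more than one peak.
        intensities[name] = max(inten, intensities.get(name, 0.0))
    return (mass, round(mean_rt, 3), intensities)

example = {
    "sample_A": [(609.1366, 24.42, 1200.0), (385.1505, 47.73, 300.0), (385.1494, 40.26, 250.0)],
    "sample_B": [(609.1445, 24.45, 900.0)],
}
for row in align_peaks(example):
    print(row)
```

In this sketch the two signals at nominal mass 385 stay separate rows because their retention times differ by several minutes, which is the kind of isomer discrimination that direct injection cannot provide.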
For our metabolomics approaches we prefer to use the freeware metAlign (www.metalign.nl and www.rikilt.wur.nl/UK/services/MetAlign+download) to process large LC-MS (Bino et al., 2005; Vorst et al., 2005; Keurentjes et al., 2006; Moco et al., 2006a) as well as GC-MS (Tikunov et al., 2005) data sets, based on a number of features:
• compatibility with most mass spectrometry software such as MassLynx, Xcalibur, ChemStation, Agilent, Bruker and ANDI/netCDF formats and output in any of these formats as well as in Excel;
• compatibility with both LC and GC, and independent of mass spectrometer type (e.g. quadrupole-MS, TOF-MS, FTMS) or instrument maker;
• an easy interface for user-defined parameter settings;
• automated local noise calculation and mass-specific baseline corrections;
• capability to align up to hundreds of data sets.

Examples of using metAlign for the comparison of ten to hundreds of LC-MS data files are available (Bino et al., 2005; Vorst et al., 2005; Keurentjes et al., 2006; Moco et al., 2006a). Though metAlign converts accurate mass data into nominal masses, mainly for reasons of faster data processing, the masses of aligned signals can automatically be recovered using a script called MetAccure (Vorst et al., 2005; Moco et al., 2006a).

Considerations for tissue sampling and handling

Although no limitations regarding sample type are foreseen, except from a technical point of view, care must be taken in acquiring reproducible data. Sources of variation contributing to the total ‘noise’ in subsequent statistical analyses are biological variation (e.g. variation in plant growth conditions, development, etc.), perturbations during and after tissue collection, and variation in tissue sampling for metabolite extraction including weighing errors. Metabolic conversions in tissues can be halted by flash-freezing samples in liquid nitrogen immediately after harvest. Frozen samples should be fully homogenized into a fine powder in order to facilitate and standardise metabolite extraction. Nevertheless, each analysis provides only a single snapshot of the metabolic state of that sample without further information on biological variation or measurement errors. To estimate biological variation, sufficient biological replicates need to be prepared and analysed; to estimate measurement error, sufficient technical replicates from the same batch of tissue powder are needed.
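One common way to express the variation among replicates is the coefficient of variation (CV) per mass peak. The sketch below is only an illustration under that assumption; the text does not prescribe a particular statistic, and the function names and example intensities are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return the CV in percent: sample standard deviation / mean * 100."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean intensity is zero; CV is undefined")
    return stdev(values) / m * 100.0

def replicate_cv(peak_table):
    """peak_table: dict mapping a peak id -> list of intensities over replicates."""
    return {peak: coefficient_of_variation(vals) for peak, vals in peak_table.items()}

# Hypothetical intensities of two mass peaks in three technical replicates
technical = {"609@24.4": [90.0, 100.0, 110.0], "271@42.7": [400.0, 410.0, 390.0]}
print(replicate_cv(technical))
```

Running the same calculation on biological replicates and on technical replicates of one tissue powder gives a rough indication of how much of the total variation is biological rather than analytical.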

Considerations for metabolite extraction and LC-PDA-MS analyses

The extraction procedure is crucial for the detection of metabolites naturally occurring in the extracted tissues. Therefore, the extraction protocol should be reproducible and should give high recovery and stability of most compounds, at least those of prime interest. We have tested a number of different solvents, such as methanol, ethanol and acetone, at different ratios of water versus organic solvent, for extraction efficiency, chromatographic behaviour and extract stability. Acidified aqueous methanol at a final concentration of 75% methanol (v/v) and 0.1% formic acid (v/v) was the most suitable solvent for efficient extraction of a wide range of compounds of our prime interest, mostly secondary metabolites, from different plant species and tissues (Bino et al., 2005; Vorst et al., 2005; Keurentjes et al., 2006; Moco et al., 2006a). Enzymes present in the sample should be inactivated by directly adding the solvent to frozen plant powder and mixing immediately. Extraction efficiency was tested using several (poly)phenolic compounds added to the frozen powder before extraction. At a solvent/sample ratio of 3 and a sonication time of 15 min, the recovery of all standards tested was higher than 90%. Sonication for up to 2 h did not significantly change the metabolite profile as compared to 15-min sonication. However, we advise checking the extraction efficiency when analysing a completely different plant matrix or when specific key compounds are of main interest.

[Figure 2.1: panels A-C are base peak intensity chromatograms (TOF MS, ESI negative mode; x-axis: time in min, y-axis: relative intensity in %), with peaks labelled by retention time and m/z. Panels D-F are accurate-mass and MS/MS spectra (x-axis: m/z) with chemical structures. Annotated elemental formulae and mass deviations: (D) C51H84NO23, -1.8 ppm; (E) C11H20NO10S3, +0.2 ppm; (F) C27H29O16, -1.7 ppm.]

Figure 2.1. LC-QTOF MS profiling of crude extracts from three different plant species. Upper panel shows typical ion chromatograms, obtained in ESI negative mode, of (A) tomato fruit, (B) Arabidopsis leaf, and (C) strawberry fruit. Lower panels show detected accurate masses of [M-H]- ions and LC-MS/MS spectra of three compounds from different classes of secondary metabolites: (D) α-tomatine, an alkaloid, detected as formic acid adduct; (E) glucoiberin, a glucosinolate; (F) rutin, a flavonoid.

The chromatographic conditions applied are always a compromise between metabolite resolution, retention time stability and sample throughput. In the standard protocol we use a C18-reversed phase microbore column with a relatively small particle size. This column was selected after testing different types of columns for their ability to retain and separate semi-polar compounds of our prime interest, including flavonoids and phenolic acids (Bino et al., 2005; Keurentjes et al., 2006; Moco et al., 2006a), alkaloids (Bino et al., 2005; Vorst et al., 2005; Moco et al., 2006a) and glucosinolates (Keurentjes et al., 2006). A gentle and continuous acetonitrile gradient of 45 min, followed by 15 min column washing and stabilization, resulted in adequate separation of many semi-polar compounds including isomeric forms (Fig. 2.1). We tend to use the same chromatographic conditions in our untargeted metabolomics work, in order to compare mass signals from different samples and to enable compound identification using LC-MS databases (Moco et al., 2006a). In most of our experiments, the LC-MS run itself is not the limiting factor in sample throughput. Instead, sample harvest, grinding, weighing and extraction, and finally data analyses usually take much more time. For large series of samples, e.g. more than 300 extracts, steeper gradients with shorter run times may be useful in order to decrease the total analysis time and therefore the chance of perturbations during a long series. Such perturbations might occur due to (pre-)column deterioration or disturbances in the MS-electronics or LC pump, and introduce extra variation in the final data set. Thus, during analyses of an Arabidopsis recombinant inbred line (RIL) population consisting of 409 extracts including controls, we doubled the sample throughput by using a total run time of 30 min per extract (Keurentjes et al., 2006). However, speeding up the LC-run time, with the same type of column, unavoidably results in an increased number of co-eluting compounds and thus may lead to a loss of resolution of isomers and increased ion suppression and adduct formation at the ionisation source.
We advise starting with the standard 60 min protocol as outlined below and, if needed, modifying the chromatographic conditions (gradient, column type, etc.) in such a way that at least the compounds of key interest are adequately separated and detected. Upon starting up a new series of analyses, the chromatography is relatively unstable due to (pre-)column conditioning by the crude extracts themselves. To avoid suboptimal alignment resulting from this early-stage system instability, several “dummy” runs of extracts should be performed before collecting the actual data. We routinely program the LC-MS software to inject and analyse the first sample extract repeatedly, at least 4 times. Standard solutions should not be injected between crude extracts, as during analysis of these relatively clean samples the column can partly be re-conditioned, resulting in small retention shifts. To ensure constant and reproducible ionisation, regularly check the actual pressure and supply of the nitrogen and argon gases. In our system we can check this pressure by comparing the intensity of the reference mass (lock mass; see below) over the samples. If the intensity of this mass signal is markedly changed in one or more samples, these samples should be reanalysed within the same series. Analyse extracts in a randomized order to avoid possible variation from time-dependent changes, e.g. due to slow deterioration of the (pre-)column or ionisation source. Due to the high variability of metabolites present in crude extracts with respect to their chemical characteristics and intrinsic behaviours upon sample preparation, the use of a single internal standard to correct for variation in extraction and detection of all mass signals over the samples is of dubious value. Adding a series of internal standards, e.g. each representing a different class of plant metabolites, may be a better option but may introduce ion suppression effects in the case of co-eluting compounds. Consequently, we recommend preparing a statistically relevant number of replicates from a homogeneous (pooled) batch of material and analysing these throughout the entire sample series, in order to estimate technical reproducibility and, if needed, to correct for this type of variation.
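The recommendations on conditioning runs, randomization and pooled replicates can be combined into a single injection list. The following Python sketch shows one possible way to build such a list. The interval of one pooled replicate per ten extracts, the fixed seed and all names are assumptions made for illustration; only the four conditioning injections of the first extract come from the text.

```python
import random

def build_run_order(extracts, n_dummy=4, qc_every=10, seed=2007):
    """Return an injection list: conditioning runs, then randomized extracts
    interspersed with replicates of a pooled reference extract."""
    order = list(extracts)
    if not order:
        return []
    rng = random.Random(seed)  # fixed seed so that the sequence can be reproduced
    rng.shuffle(order)

    # Repeat the first extract of the randomized series to condition the (pre-)column.
    sequence = [("dummy", order[0])] * n_dummy
    for i, name in enumerate(order):
        if i % qc_every == 0:
            sequence.append(("qc", "pooled_reference"))
        sequence.append(("sample", name))
    sequence.append(("qc", "pooled_reference"))  # close the series with a pooled replicate
    return sequence

for kind, name in build_run_order([f"extract_{i:03d}" for i in range(1, 26)]):
    print(kind, name)
```

The dummy injections are labelled separately so that they can be excluded from the data set before alignment.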

[Figure 2.2: flow diagram. Growth and harvest of plant material; freezing and grinding (Steps 1-2); extraction, centrifugation and filtration (Steps 3-7); transfer of vials to autosampler and LC-PDA-QTOF MS (Steps 8-9); LC-MS profiles; metAlign for mass peak extraction and alignment over samples (Steps 10-14); data processing with CSV file output, and data analyses such as t-tests, multivariate analysis tools and correlation analyses (Step 15); identification of relevant mass peaks (Steps 16-19).]

Figure 2.2. Schematic overview of experimental set-up and data flow for untargeted LC-QTOF MS based metabolomics of plant materials. A detailed description of each step is given in the PROCEDURE.

With our LC-QTOF MS system we normally acquire data in centroid mode. In contrast to the continuum mode, in which the mass signal is represented by a Gaussian curve, the centroid mode projects each mass signal as an accurate m/z value by on-the-fly mathematical transformation. Although relevant information on mass peak shape and purity may be lost upon centroiding, the raw data files are markedly reduced from about 500 MB to a more useful size of about 10 MB per sample (at a run time of 1 h and a sampling rate of 1 scan per second). Especially upon analysing and processing large series of extracts, and for storing and databasing thousands of raw data files gathered over years of analyses, acquiring data in centroid mode is the most practical option. In addition, by using a separate lock mass spray as reference and by continuously switching between sample and reference, the MassLynx software can automatically correct the centroid mass values in the sample for small deviations from the exact mass measurement (Wolff et al., 2001), generally resulting in a mass accuracy of better than 5 ppm.

This paper describes a detailed protocol for untargeted LC-MS based metabolomics of large numbers of extracts. The standard procedure is schematized in Fig. 2.2 and consists of: tissue sampling and extract preparation; LC-QTOF MS analysis using an ESI source; metAlign-assisted mass peak extraction and alignment across samples; and the identification of mass peaks selected by means of appropriate statistical filtering. In principle, the methodology described below is applicable to a wide range of plant species, tissues or products derived from them.
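To make the 5 ppm figure concrete, the mass deviation of an observed ion can be computed from its elemental formula. The short Python sketch below does this for the [M-H]- ion of rutin (C27H29O16). The helper names are hypothetical, and the monoisotopic masses are standard reference values rounded to seven decimals.

```python
# Monoisotopic masses (u) of the most abundant isotopes, and the electron mass.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146, "S": 31.9720707}
ELECTRON = 0.00054858

def anion_mz(formula):
    """m/z of a singly charged anion; formula is a dict of element -> atom count."""
    return sum(MONOISOTOPIC[el] * n for el, n in formula.items()) + ELECTRON

def ppm_error(observed, theoretical):
    """Relative mass deviation in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

rutin_anion = {"C": 27, "H": 29, "O": 16}  # [M-H]- of rutin, C27H30O16
theoretical = anion_mz(rutin_anion)
observed = 609.1450                        # hypothetical measured value
print(f"theoretical m/z {theoretical:.4f}, deviation {ppm_error(observed, theoretical):+.1f} ppm")
```

At m/z 609, a tolerance of 5 ppm corresponds to a window of about 0.003 on either side of the theoretical value.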

MATERIALS

Reagents
• Acetonitrile, HPLC supra-gradient grade (Biosolve, cat. no. 01203502, CAS [75-05-8]). CAUTION: Acetonitrile is harmful and highly flammable and should be handled in a fume hood.
• Methanol absolute, HPLC supra-gradient grade (Biosolve, cat. no. 13683502, CAS [67-56-1]). CAUTION: Methanol is toxic and highly flammable and should be handled in a fume hood.
• Formic acid for analysis 98-100% (Merck-KGaA, cat. no. 1.00264.1000, CAS [64-18-6]). CAUTION: Formic acid is corrosive and volatile, and should be handled in a fume hood.
• Leucine enkephalin, ≥ 95% pure, isolated by HPLC (Sigma, cat. no. L9133, CAS [81678-16-2]).
• Phosphoric acid p.a. 85% in water solution (m/v) (Acros, cat. no. 20114-0010, CAS [7664-38-2]). CAUTION: Phosphoric acid is corrosive and should be handled in a fume hood.
• Ultrapure water (Elga Maxima, Bucks)
• Liquid nitrogen for freezing samples. CAUTION: Liquid nitrogen is a low temperature refrigerant and should be handled with protective glasses and protective gloves.
• Liquid nitrogen for supplying gas to the mass spectrometer ionisation source.
• Argon 5.0, at least 99.999% pure, for supplying gas to the mass spectrometer collision cell.
• Sample extraction solution (see REAGENT SETUP)
• HPLC mobile phase (see REAGENT SETUP)
• MS calibration solution (see REAGENT SETUP)
• Lock mass solution (see REAGENT SETUP)

Equipment
• Storage tubes or plastic bags resistant to liquid nitrogen, e.g. polypropylene 50-mL tubes with screw cap (Greiner, cat. no. 210261), Eppendorf micro-test tubes, 12 mL glass tubes with screw caps (Omnilabo)
• IKA A11 basic grinder
• Pipettes and tips suitable for handling organic solvents (Microman, Gilson)
• Ultrasonic bath (Branson 3510)
• Single-use sterile and non-pyrogenic latex-free syringes, 0.01-1 mL Tuberkulin Omnifix-F (B.Braun Melsungen AG, cat. no. 9161406V)
• Single-use syringe filters free of polymers, such as Anotop 10 (diameter 10 mm, pore size 0.2 µm; Whatman, cat. no. 6809-1022) or Minisart RC4 (diameter 4 mm, pore size 0.2 µm; Sartorius, cat. no. 17821). CRITICAL: Filters for MS analyses should be resistant to extraction solution (75% methanol + 0.1% FA) and free of polyethylene glycol or any other soluble polymer
• Crimp cap autosampler vials of 1-2 mL with aluminum crimp caps containing natural rubber/polytetrafluoroethylene septum
• Tecan Genesis Workstation with TeVacs vacuum filtration unit
• Protein filtration plates in 96-well format (Captiva 0.45 µm, Ansys Technologies)
• Ninety-six-well plates with 700 µL glass inserts (Waters) and 96-square-well polytetrafluoroethylene-coated seal (Waters)
• Analytical column Luna C18(2), 2.0 mm diameter, 150 mm length, 100 Å pore size, spherical particles of 3 µm (Phenomenex)
• Pre-columns Luna C18(2), 2.0 mm diameter, 4 mm length (Security Guard, Phenomenex)
• PEEK in-line filter holder with PEEK frit 0.5 µm pore size (UpChurch Scientific)
• Alliance 2795 HT liquid chromatography system equipped with an internal degasser, sample cooler and column heater (Waters)
• Photodiode array detector 2996 (Waters)
• Quadrupole-time-of-flight Ultima V4.00.00 mass spectrometer equipped with an electrospray ionisation (ESI) source (Waters) and separate lock mass spray inlet
• Separate HPLC pump (e.g. Bromma 2150; LKB) for continuously pumping the lock mass solution at 10 µL min⁻¹
• PEEK tubings (Upchurch Scientific) for connecting the LC-PDA (125 µm inner diameter) and the lock mass pump (250 µm inner diameter) to the mass spectrometer
• PHD 4400 syringe pump (Harvard)
• Gastight glass syringe 0.1-1.0 mL (Hamilton-Bonaduz Schweiz, cat. no. 1001)
• Software: MassLynx data management software 4.0 (Waters), metAlign (www.metalign.nl or www.rikilt.wur.nl/UK/services/MetAlign+download), Microsoft Office Excel 2003. Optional: multivariate analyses software such as GeneMaths 2.01 (Applied Maths, Belgium).

Reagent Setup

Plant growth and sampling conditions: Samples to be prepared for metabolomics studies should be as representative as possible for the genotype or tissues to be analysed. For small plants like Arabidopsis seedlings, a combinatorial approach of controlled plant growth, pooling and replicate analyses can be used to minimize biological and experimental variation. For instance, in the large scale metabolomics study in Arabidopsis RILs (Keurentjes et al., 2006), seeds were sown on 10 mL ½ MS (Murashige and Skoog) agar (2%) in Ø 6 cm Petri-dishes with a density of a few hundred seeds per dish. Dishes were placed in a cold room at 4°C for 7 days in the dark to promote uniform germination and were then randomly placed in five blocks in a climate chamber where each block contained one replicate dish of each line. Growth conditions were 16 h light (30 W m⁻²) at 20°C and 8 h dark at 15°C, at 75% relative humidity. After 6 days the lids of the Petri-dishes were removed to ensure that seedlings were free of condensed water on the day of harvest. On day 7, at 7 h into the light period, all seedlings were harvested within 2 h by submerging the complete Petri-dish briefly in liquid nitrogen and scraping off the aerial parts with a razor blade. Finally, per line, material from 2 dishes was pooled to make one of the replicate samples and material from the other 3 dishes to make the second. To obtain representative material from large plant tissues, such as fruits of tomato or apple, or tubers of potato, representative “pie” segments are taken from at least 5 fruits or tubers per plant using a sharp knife. Segments are snap-frozen in liquid nitrogen and pooled per plant. Once harvested, plant material can be stored at -80°C until further processing.

Sample extraction solution: Prepare 99.875% methanol solution acidified with 0.125% (v/v) formic acid (FA). CAUTION: Methanol is toxic and highly flammable, while formic acid is corrosive. Both solvents should be handled in a fume hood.

HPLC mobile phase: Two eluents are used as mobile phase; eluent A is 0.1% FA (v/v) in ultrapure water, and eluent B is 0.1% FA (v/v) in acetonitrile. CAUTION: Both methanol and acetonitrile are toxic and highly flammable, while FA is corrosive; all solutions should be handled in a fume hood. CRITICAL: As the retention of some metabolites, especially alkaloids, is very sensitive to slight variations in the acidity of the mobile phase, always precisely add 0.1% (v/v) FA to both eluents and prepare sufficient eluents to analyse the entire sample series.

MS calibration solution: To calibrate the mass spectrometer, freshly prepare about 1 mL of a solution of phosphoric acid at a concentration of 0.05% (v/v) in 50% acetonitrile / ultrapure water and load it into the gastight glass syringe. CAUTION: Handle solvents in fume hood.

Lock mass solution: Prepare a solution of leucine enkephalin in 50% (v/v) acetonitrile / ultrapure water to obtain a final concentration of 0.1 µg mL⁻¹.
Prepare sufficient solution for analysis of the complete series of samples. CAUTION: Handle solvent in fume hood.

Equipment Setup

LC-PDA-QTOF MS setup: see Boxes 2.1 and 2.2. CRITICAL: The LC-PDA system needs to be conditioned for a minimum of 1 h before use; the QTOF MS should be conditioned for a minimum of 2 h.

Data pre-processing and alignment: We routinely program the metAlign software to extract and align all mass signals having a signal-to-noise ratio of at least 3 (normally used as a threshold in analytical chemistry). The software performs the following processing steps: (i) mass data smoothing using a digital filter related to average peak width; (ii) local noise calculation as a function of retention time and ion trace; (iii) baseline correction of all ion traces and introduction of a threshold to obtain noise reduction; (iv) scaling and calculation and storage of peak maximum amplitudes; (v) between-chromatogram alignment using high signal/noise peaks common to all chromatograms; (vi) iterative fine alignment by including an increasing number of low signal peaks; (vii) output of aligned data into a csv-file compatible with Microsoft Excel and most multivariate programs; and, finally and optionally, (viii) significant difference filtering at user-defined thresholds and output of selected data back to the MS-software platforms for visualisation of differential chromatographic mass peaks. A picture of the metAlign interface is given in Fig. 2.3. The parameters used for processing the 30-min LC-MS runs are shown in the figure itself; for the 60-min runs the differing parameters are given in the legend. The software, examples, manual, etc., can be downloaded for free from www.metalign.nl or www.rikilt.wur.nl/UK/services/MetAlign+download/. It is recommended to read the manual carefully to become acquainted with the effect of the different parameters and how to optimize the settings. Box 2.3 gives a summarized account of this information. Default parameters for some other MS systems can be found in the metAlign manual.
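The role of the signal-to-noise threshold can be illustrated with a small Python sketch for a single ion trace. This is not the metAlign implementation: the baseline is taken as the median of the trace and the noise as the median absolute deviation, both of which are assumptions of this example, and the function name and trace values are hypothetical.

```python
from statistics import median

def pick_peaks(trace, min_sn=3.0):
    """Return (scan index, amplitude, s/n) for local maxima of one ion trace
    whose baseline-corrected amplitude is at least min_sn times the noise."""
    baseline = median(trace)
    # Median absolute deviation as a robust noise estimate (assumption of this sketch);
    # fall back to 1.0 to avoid division by zero on a perfectly flat trace.
    noise = median(abs(x - baseline) for x in trace) or 1.0
    peaks = []
    for i in range(1, len(trace) - 1):
        amplitude = trace[i] - baseline
        is_local_max = trace[i] > trace[i - 1] and trace[i] >= trace[i + 1]
        if is_local_max and amplitude / noise >= min_sn:
            peaks.append((i, amplitude, amplitude / noise))
    return peaks

# Hypothetical ion trace: detector counts per scan for one nominal mass
trace = [10, 12, 9, 11, 10, 60, 120, 55, 11, 9, 10, 12, 10]
print(pick_peaks(trace))
```

Only the maximum at scan 6 passes the threshold here; the small fluctuations at scans 1, 3 and 11 are local maxima as well but remain below a signal-to-noise ratio of 3.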

Figure 2.3. Interface of metAlign software used for untargeted processing of LC-QTOF MS data files. The program is divided into three parts: part A deals with program configuration, data selection, peak extraction and baseline correction; part B covers the actual alignment of extracted mass peaks and output of [mass peak intensity x samples]-data matrix; part C is used to identify and visualise chromatographic peaks that are statistically different between two groups of samples (optional). Parameter settings given in this figure correspond to the default values for processing of 30 min LC-MS runs. For 60 min LC-MS runs the following default parameter settings are recommended: 4=70; 5=2450; 8=3; 9=25; 13=69, 35 and 2450, 35; 16=10, 5. A short description of buttons and parameters is given in Box 2.3.


LC-PDA-MS/MS setup: If needed, mass signals can be further identified using LC-MS/MS. For this purpose, masses of interest are incorporated into a mass inclusion list (data-directed MS/MS). We perform LC-MS/MS on the QTOF Ultima with a scan time of 0.4 s and an interscan delay of 0.1 s. The collision energy profile is programmed to step sequentially through 5, 10, 20 and 30 eV (ESI positive mode) or 10, 15, 30 and 50 eV (ESI negative mode). If these settings do not yield informative MS/MS spectra for the masses of interest, the collision energy profile can be adjusted. CRITICAL: In case of random LC-MS/MS experiments, in which up to the eight highest intensity ions per survey scan can be automatically selected for MS/MS, use a mass exclusion list containing abundant eluent mass signals in order to prevent switching to MS/MS mode for these impurities.

PROCEDURE

Tissue sampling and extraction

1. Harvest a reproducible amount of tissue (leaf, roots, fruit, etc.) by rapid freezing in liquid nitrogen. Large plant parts such as tomato fruits or potato tubers should first be cut rapidly into representative smaller parts with a sharp knife before freezing. In the case of seeds or small seedlings (e.g. Arabidopsis) use 1.5- or 2.2-mL Eppendorf tubes; in case of larger tissues use 50-mL Greiner tubes or plastic bags that are resistant to liquid nitrogen. CAUTION: To prevent storage tubes or bags from exploding, remove all liquid nitrogen by gently pouring off before closing and do not screw tube lids firmly! PAUSE POINT: Frozen tissue can be stored at -80°C for at least 1 year.

2.

Homogenize the frozen tissue in liquid nitrogen into a fine powder using a pestle and mortar, but preferably use a ball mill (Retsch Mixer Mill MM 301 for Arabidopsis) or analytical mill (IKA A11 for larger tissues) which have been thoroughly pre-cooled with liquid nitrogen. Transfer homogenized powder into precooled storage containers resistant to liquid nitrogen. CRITICAL STEP: Take care that tissues stay fully frozen during homogenization; discard any samples that start to thaw. If needed carefully pour a small volume of liquid nitrogen onto the sample, let the nitrogen evaporate and continue homogenization. PAUSE POINT: Frozen powder can be stored at -80°C for at least 1 year.

3.

Weigh 100 mg frozen powder of Arabidopsis with an accuracy better than 5% in a pre-cooled Eppendorf tube, or 500 mg in the case of larger amounts of tissue (e.g. tomato fruit or potato tuber) in a 10-mL glass tube with screw cap. Lower amounts can be used as well, but this is not advisable in view of the inherently higher relative weighing error with frozen material. CRITICAL STEP: Take care that tissues stay fully frozen; discard any samples that start to thaw. Unless there are specific practical reasons, lyophilization of tissue is not recommended without knowing the effect of the lyophilization procedure on the metabolite profile. PAUSE POINT: Frozen powder in tubes can be stored at -80°C for at least 1 month.

4.

Prepare extracts freshly at the beginning of a series of analyses. Add ice-cold sample extraction solution (99.875% methanol acidified with 0.125% FA) in a volume/fresh weight ratio of 3 to the tube containing the weighed frozen powder, close the lid and immediately vortex for 10 s. Assuming a tissue-water content of about 95%, this will result in a final concentration of about 75% methanol and 0.1% FA. In the case of samples with highly variable water contents or lyophilized material, pure water can be added to adjust each sample to a final solvent concentration of 75% methanol and 0.1% FA. Store extracts on ice until all samples are ready.

5.

Sonicate 15 min at maximum frequency (40 kHz) in a water bath at room temperature (20 °C).

6.

Centrifuge 10 min at maximum speed (20,000 x g for Eppendorf tubes; 3,000 x g for glass tubes) at room temperature.

7.

Filter the supernatant over a 0.2 µm PTFE filter using a disposable syringe into a 1.8-mL glass vial and close the vial with a cap. In the case of large numbers of samples, use suitable filtration plates in 96-well format and a vacuum filtration unit. We use a TECAN Genesis Workstation 150 equipped with a 4-channel pipetting robot and a TeVacS 96-well filtration unit. Pre-wash filtration plates (Captiva 0.45 µm, Ansys Technologies) at least three times with 700 µL of 75% methanol containing 0.1% FA. Dry the bottom tips of the filters by blotting on filter paper. Place a 96-well plate with 700 µL glass inserts in the filtration unit under the pre-washed filtration plate. Load each well with 700 µL of extract and vacuum-filtrate 2 times 20 s until dry. Carefully remove air bubbles trapped at the bottom of the inserts. Cover the plate with a 96-square well PTFE-coated seal. CRITICAL STEP: All filters used should be free of aqueous-methanol soluble polymers, such as polyethylene glycol.


LC-PDA-QTOF MS analysis

8.

Place vials or 96-well plates in the autosampler, conditioned at 20°C.

9.

Check for the presence of sufficient eluents, lock mass solution and nitrogen gas, and start the sample series with at least 4 “dummy” injections to stabilize the LC-PDA-MS system, using the setup detailed in Boxes 2.1 and 2.2. Check system performance and mass accuracy during these first runs. Deviations of observed known parent masses from their calculated masses should be less than 5 ppm (at signal intensities similar to that of the local lock mass); otherwise recalibrate the system. ? TROUBLESHOOTING

PAUSE POINT: Raw data can be stored on hard disks, tapes, DVDs or other digital storage devices until further processing.
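The 5 ppm criterion in step 9 is a relative mass deviation. As a minimal sketch (the function names are hypothetical, and the observed value below is invented for illustration), it can be computed as follows, here using the negative-mode lock mass m/z 554.2619 as the calculated mass:

```python
def ppm_deviation(observed_mz, calculated_mz):
    """Signed mass deviation in parts per million."""
    return (observed_mz - calculated_mz) / calculated_mz * 1e6


def mass_accuracy_ok(observed_mz, calculated_mz, tolerance_ppm=5.0):
    """True when the absolute deviation is below the tolerance."""
    return abs(ppm_deviation(observed_mz, calculated_mz)) < tolerance_ppm


observed = 554.2640            # hypothetical reading, for illustration only
calculated = 554.2619          # leucine enkephalin, negative mode
print(f"{ppm_deviation(observed, calculated):.1f} ppm")
print(mass_accuracy_ok(observed, calculated))
```

At m/z 554, a tolerance of 5 ppm corresponds to an absolute window of roughly 0.003 mass units.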

Pre-processing and alignment of LC-MS data

10.

Configure metAlign (see Equipment Setup) and select the data to be processed (buttons 1-3, see Box 2.3 for more details). The first sample selected in button 2B is used as the reference file in the actual alignment (part B, see Fig. 2.3). We recommend selecting the sample that has been analysed just in the middle of the entire LC-MS series as this reference file, to minimize the extent of retention profile correction between first and last samples analysed.

11.

Perform a test baseline correction (part A, Fig. 2.3) and alignment (part B, Fig. 2.3) on only a few variable samples to check whether the default settings are adequate to extract and align the mass peaks that are of specific interest (if any). Define parameters for peak extraction and noise (buttons 4-9, see Box 2.3 for more details) and run baseline correction (button 11, see Box 2.3 for more details). Manually inspect corresponding mass peaks at the beginning, middle and end of the baseline-corrected chromatograms and compare them with the original raw data. If it is obvious that some mass signals from relatively broad chromatographic peaks are missing in the baseline-corrected data, set parameter 9 (see Box 2.3 for more details) to a slightly higher value and re-run baseline correction. On the other hand, if closely eluting peaks of compounds with similar (nominal) mass have been extracted as single peaks, lower the value of parameter 9.

12.

Once peak extraction and baseline correction settings are satisfactory, run baseline correction for all samples. Note that baseline correction is the most time-consuming part of metAlign and can take a few hours for 100 samples (depending on configuration of the computer).

13.

After baseline correction of the entire series, inspect retention shifts in the baseline-corrected data files of the reference sample and of the first and last sample of the entire data set. Set maximum shift at initial peak searching criteria (parameter 13, see Box 2.3 for more details) according to default settings, or to a value at least a factor of 2 higher than visually observed retention shifts and higher than that set in parameter 9. In most experiments on related samples we use the iterative alignment with parameters indicated in Fig. 2.3 and its legend (see also examples in the metAlign manual). ? TROUBLESHOOTING.

14.

To prevent metAlign outputting mass peaks that are detected in only one or a few samples, e.g. due to impurities present in one extract, it is recommended to increase parameter 18 (see Box 2.3 for more details) to a value corresponding to the number of replicates or to relevant statistical units.

15.

After running the alignment (button 20), create the data output file (button 21, see Box 2.3 for more details).

Identification of relevant metabolites

16.

Retrieve accurate masses of filtered mass peaks in the raw data file manually. Inspect absorbance spectra, recorded by the PDA detector, of compounds of interest. ? TROUBLESHOOTING

17.

Perform additional LC-QTOF MS/MS fragmentation experiments for further identification. Enter selected masses into a mass inclusion list to ensure isolation in the quadrupole (data-directed MS/MS).

18.

Predict the elemental composition of the mass peaks of interest from the accurate mass calculation, together with MS/MS fragmentation, isotopic patterns and, if possible, specific absorbance spectra.

19.

Use the elemental formulae obtained to search the internet or commercially available compound databases (e.g. Database of Natural Products on CD-ROM) for possible candidates. As a first step to facilitate the query of LC-MS-based plant metabolomics data, an open access database for identified semipolar metabolites, currently mainly (poly)phenolic compounds, detected in tomato fruit has recently been developed (Moco et al., 2006a) and can be searched at http://appliedbioinformatics.wur.nl/moto. This database is derived using the exact protocol described in the present paper. However, in untargeted LC-MS most of the elemental compositions detected in plant extracts are still unknowns, or reference compounds are not commercially available (von Roepenack-Lahaye et al., 2004; Vorst et al., 2005; Moco et al., 2006a). Therefore, many of the putatively annotated structures cannot yet be unambiguously identified without using NMR or other tools.
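Steps 18 and 19 rely on comparing an observed accurate mass with the calculated mass of candidate elemental formulae. The sketch below illustrates that comparison only; it is not the software used in this protocol. The helper names, the small element table and the observed value are assumptions for illustration, and only singly charged [M+H]+ and [M-H]- ions are considered. Rutin (C27H30O16), the compound followed in Figs. 2.5 and 2.6, serves as the example, with hesperidin (C28H34O15) as a candidate of the same nominal mass.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes; small illustrative table.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048,
                "O": 15.99491461956, "P": 30.973762, "S": 31.972071}
PROTON = 1.00727646688


def monoisotopic_mass(formula):
    """Neutral monoisotopic mass of a simple formula such as 'C27H30O16'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def ion_mz(formula, mode):
    """m/z of [M+H]+ ('positive') or [M-H]- ('negative')."""
    neutral = monoisotopic_mass(formula)
    return neutral + PROTON if mode == "positive" else neutral - PROTON


def candidates_within(observed_mz, formulas, mode, tolerance_ppm=5.0):
    """Return (formula, ppm deviation) pairs inside the tolerance, best match first."""
    hits = []
    for formula in formulas:
        calculated = ion_mz(formula, mode)
        ppm = (observed_mz - calculated) / calculated * 1e6
        if abs(ppm) < tolerance_ppm:
            hits.append((formula, ppm))
    return sorted(hits, key=lambda hit: abs(hit[1]))


observed = 609.1470   # hypothetical observed m/z in ESI negative mode
for formula, ppm in candidates_within(observed, ["C27H30O16", "C28H34O15"], "negative"):
    print(formula, f"{ppm:+.1f} ppm")
```

Both candidates share the nominal mass 609 in negative mode, but only the rutin formula falls within 5 ppm of this observed value, which is the reason accurate mass narrows the candidate list. Isotopic patterns, MS/MS fragments and absorbance spectra are still needed to distinguish isomers.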

TIMING

[Figure 2.4: timeline graphic spanning day 1 to day 6, with the stages harvest, sampling, extraction, LC-MS analyses, data pre-processing and data output.]

Figure 2.4. Timeline of standard procedure of untargeted LC-MS analyses, based on 50 Arabidopsis seedling samples and LC-MS analysis time of 1 h. For large plant tissues such as tomato fruits, the sampling step (including grinding and weighing) can take 4 days, resulting in a total time of 8 days for 50 samples.

The timeline of the procedure from tissue handling up to the final output for subsequent statistical analyses (matrix of intensity of aligned mass peaks versus samples) is shown schematically in Fig. 2.4. For about 50 Arabidopsis samples, the sampling step, which includes grinding in liquid nitrogen using a ball mill and weighing of frozen tissues, can be done in 2 days. However, for the same number of samples from larger plant tissues such as tomato fruit and potato tubers, these activities usually take more time: about 4 days. Subsequent sample extraction, conditioning of the LC-MS, extract analysis and mass peak alignment by metAlign will take about 4 days for 50 samples, irrespective of the type and origin of tissue. Depending upon the research question, much more time may be needed for further interpretation of the comprehensive metabolomics data set, including statistical filtering and identification of relevant mass peaks.


CHAPTER 2

BOX 2.1: LC-PDA-QTOF MS setup; Conditioning the HPLC-PDA system

1. Prepare the mobile phase solvents, prime HPLC pump and tubing, and degas both solvents for at least 10 min using the in-line degasser of the Alliance 2795 HT.

2. Install one PEEK in-line solvent filter between injection system and pre-column cartridge. Place two pre-columns in tandem in the cartridge, fix in front of the analytical column and place both columns in the column oven conditioned at 40°C.

3. Precondition column system by increasing the percentage of eluent A step-wise (starting at 100% eluent B) until the initial gradient conditions are reached.

4. Program the inlet file according to the gradient settings given below (Tables 2.1 and 2.2). In the standard set up we use relatively long chromatographic runs of 1 h, including column washing and re-conditioning, with a mobile phase flow of 0.19 mL min⁻¹ into the analytical column (diameter of 2.0 mm). This flow rate corresponds to 1 mL min⁻¹ on a 4.6 mm column, which is standard in most HPLC-UV/Vis applications. In the case of a large sample series, e.g. more than 300 extracts, we consider the use of a 30-min run at a slightly higher flow rate, to lower the chance of possible perturbations.

Table 2.1. Gradient settings for a 60 min run; flow rate 0.19 mL min⁻¹.

Time (min)    %A    %B
 0            95     5
45            65    35
47            25    75
52            25    75
54            95     5
60            95     5

Table 2.2. Gradient settings for a 30 min run; flow rate 0.20 mL min⁻¹.

Time (min)    %A    %B
 0            95     5
20            25    75
25            25    75
26            95     5
30            95     5

5. The PDA detector is placed between the analytical column and the QTOF-MS. Connect the column outlet to the flow cell of the PDA detector and switch on the detector. Program the PDA to acquire data every second from 210 nm to 600 nm with a resolution of 4.8 nm. Wavelength range, scan rate and resolution can be adjusted according to LC run times and research aims.

CRITICAL: Check HPLC pump for air bubbles and connections for leakage by verifying pressure stability.

CRITICAL: Precondition PDA-lamp, column oven temperature and analytical column for at least 1 hour before starting sample analyses. Meanwhile, the mass spectrometer can be calibrated and checked for performance as described in Box 2.2.

6. Place the aqueous-methanol extracts in trays inside the autosampler (20 °C) during the analysis series. Program the injection system to operate in sequential mode and to load the syringe with 5 µL of sample with 5 µL of air both before and after the sample. The injection needle is washed with 50% (v/v) methanol/water between injections.
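The statement in item 4 that 0.19 mL min⁻¹ on a 2.0 mm column corresponds to 1 mL min⁻¹ on a 4.6 mm column follows from keeping the linear velocity constant, so that flow scales with the square of the column diameter. The sketch below shows that calculation and, as a second illustration, reads the eluent composition off Table 2.1. The function names are hypothetical, and the gradient is assumed to run linearly between the programmed time points.

```python
def scaled_flow(flow_ml_min, from_diameter_mm, to_diameter_mm):
    """Scale a flow rate between columns at constant linear velocity."""
    return flow_ml_min * (to_diameter_mm / from_diameter_mm) ** 2


# (time in min, %B) pairs copied from Table 2.1 (60 min run)
TABLE_2_1 = [(0, 5), (45, 35), (47, 75), (52, 75), (54, 5), (60, 5)]


def percent_b(time_min, gradient):
    """%B at a given time, assuming linear segments between programmed points."""
    for (t0, b0), (t1, b1) in zip(gradient, gradient[1:]):
        if t0 <= time_min <= t1:
            return b0 + (b1 - b0) * (time_min - t0) / (t1 - t0)
    raise ValueError("time outside gradient programme")


print(f"{scaled_flow(1.0, 4.6, 2.0):.3f} mL/min")      # flow on the 2.0 mm column
print(f"{percent_b(22.5, TABLE_2_1):.1f} %B at 22.5 min")
```

The scaled value is about 0.189 mL min⁻¹, which the protocol rounds to 0.19 mL min⁻¹.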

BOX 2.2: LC-PDA-QTOF MS setup; Conditioning the MS system

Before each series of sample analyses, the mass spectrometer should be conditioned and calibrated to obtain good performance in terms of mass accuracy and resolution. In contrast to electron impact ionisation, as used in most GC-(TOF)MS applications, detection sensitivity and mass spectra obtained by soft-ionisation LC-MS are completely dependent on the type of mass spectrometer, ionisation source and chromatographic system used. The procedure and settings described here are for a QTOF Ultima with ESI source and the TOF-tube in V-mode, in combination with the HPLC conditions described above.

1. Connect the outlet of the PDA, with eluent flow of 0.19 mL min⁻¹, to the inlet of the mass spectrometer and set the capillary voltage at 2.75 kV, cone voltage at 35 V, source temperature at 120 °C and desolvation temperature at 250 °C. Use a cone gas flow of 50 L h⁻¹ and desolvation gas flow of 600 L h⁻¹. CRITICAL: pre-condition MS for at least 2 h at these standard settings.

2. Disconnect the eluent tubing from the MS and use the syringe pump to inject the phosphoric acid calibration solution directly into the ESI source, at an initial flow of 5 µL min⁻¹.

3. Acquire data from m/z 80-1,500 at a scan rate of 0.9 s and an interscan delay of 0.1 s. A series of phosphoric acid cluster peaks should appear throughout the entire range of the mass spectrum. CRITICAL: to obtain proper calibration and accurate mass calculations, none of the mass calibration peaks should exceed an intensity of 250 counts s⁻¹ (in continuum mode) and the intensity of the clusters over the mass range should be as uniform as possible. Adjust pump flow, capillary voltage, cone voltage, desolvation gas flow and/or collision energy until criteria are fulfilled.

4. Combine spectra of about 50 scans during acquisition mode at optimal settings in continuum mode, centre the mass signals and check mass resolution of the machine for m/z 488.8772 (negative ionisation mode) or 490.8918 (positive ionisation mode). Mass resolution is calculated by dividing the m/z value of the centred mass signal by the mass difference at half height of the Gaussian-shaped mass peak in continuum mode, and should be better than 8,500 (with QTOF Ultima in V-mode); otherwise re-tune instrument and repeat procedure.

5. Use the centred mass data for calibration of the instrument using a polynomial-5 fit. CRITICAL: mean residual mass deviation should be less than 1.5 ppm, otherwise adjust calibration settings.

6. Check calibration using leucine enkephalin as a standard. Inject the leucine enkephalin solution through the separate lock mass inlet into the ESI source and acquire data under MS conditions as used during sample analyses, but in continuum mode. Adjust flow to obtain a specific mass intensity of 250 counts s⁻¹. Collect and combine about 50 spectra and centre the mass peak. CRITICAL: the observed mass should be within 20 ppm deviation of m/z 556.2767 in positive mode and 554.2619 in negative mode, otherwise recalibrate instrument.

7. Reconnect the outlet of the PDA to the inlet of the mass spectrometer. Check the effluent from the LC system, including mobile phase, tubing, columns and PDA flow cell, by acquiring centroid data from m/z 80-1,500 under the exact conditions of sample analysis. Individual mass signals at initial gradient conditions should preferably be less than 200 counts per scan in negative mode or less than 500 counts per scan in positive mode, to prevent excessive ion suppression of sample compounds.

8. Prepare MS method file to acquire mass data from m/z 80-1,500, at a scan rate of 0.9 s and an interscan delay of 0.1 s and in centroid mode. CRITICAL: the range of masses to be detected in sample extracts should fall within the range of calibration masses. During sample analyses, the standard setting of collision energy is 10 eV in negative ion mode and 5 eV in positive ion mode. If needed for optimal ionisation of key compounds, the collision energy may be adapted. The MS is programmed to switch from sample to lock spray every 10 s and to average two scans for lock mass correction (m/z 556.2767 in positive mode and 554.2619 in negative mode). The lock mass solution is used for online calibration of the mass accuracy during sample analysis (Wolff et al., 2001; Moco et al., 2006a). CRITICAL: adjust flow rate or concentration of the lock mass solution to obtain an intensity of about 500 counts per scan (in centroid mode) during LC-MS runs, to enable accurate mass calculation of as many compounds in the extracts as possible.
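The resolution check in item 4 of Box 2.2 divides the m/z of the centred signal by the peak width at half height. A minimal sketch of that arithmetic follows; the function names and the peak width used in the example are hypothetical.

```python
def mass_resolution(mz, width_at_half_height):
    """Resolution as m/z divided by the peak width at half height (FWHM)."""
    return mz / width_at_half_height


def max_width_for(mz, min_resolution=8500):
    """Widest peak (in m/z units) that still reaches the required resolution."""
    return mz / min_resolution


calibrant_mz = 488.8772          # phosphoric acid cluster, negative mode
measured_width = 0.055           # hypothetical FWHM read from the continuum spectrum
print(round(mass_resolution(calibrant_mz, measured_width)))
print(f"{max_width_for(calibrant_mz):.4f}")
```

For the negative-mode calibrant this means the peak must be narrower than about 0.0575 m/z units at half height to reach a resolution of 8,500.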

BOX 2.3: Description of metAlign buttons and parameters

A more detailed description can be found in the manual, which can be downloaded from www.metalign.nl or www.rikilt.wur.nl/UK/services/MetAlign+download/.

PART A: Program configuration, data set selection and baseline correction

• Buttons 1-3 are used to define the data sets as well as define folders and formats for input and output.

• Parameters 4 and 5 (value in scans) refer to the region in the chromatogram which should be processed. In particular, parameter 5 should be taken in an empty region of the chromatogram at the highest concentration of organic modifier in the gradient or at an earlier time point. This enables metAlign to calculate a matrix of noise vs. retention time vs. mass. This noise matrix, together with parameters 7 and 8, is then used as a basis to find real mass peaks.

• Parameter 6 (value in ion counts of a single mass) is machine dependent and should be set at about 70% of the maximum value a detector can record, to be able to deal with artefacts due to detector saturation. MetAlign creates artificial maxima at this value for all peaks above this value.

• Parameters 7 and 8 (factor times local noise) are peak slope and threshold factors used to filter out peaks from noise.

• Parameter 9 (value in scans) should be the average mass peak width at half height of non-saturated compounds. This parameter is used in determining the data smoothing (digital filter) as well as for a window in the alignment (see “14. Tuning Alignment Options and Criteria”).

• Parameter 10 is “de-clicked” to indicate that the peak shapes should not be saved; only this mode is compatible with alignment. “Clicked” keeps peak shapes and renders the output incompatible with alignment, but compatible with deconvolution algorithms from third party software.

• Button 11 consecutively processes all data sets defined by buttons 1 to 3. It starts the noise estimation as a function of time and mass, the smoothing, maximum amplitude correction (if needed), baseline correction, noise elimination, peak picking and exporting of baseline corrected peaks.

PART B: Scaling and aligning data sets

• Button 12 provides different modes of scaling data sets. Options are: (a) no scaling, (b) scaling on the basis of the sum of all the amplitudes of the peaks picked, (c) scaling using a specific mass.

• The parameters in “13. Initial Peak Search Criteria” provide the window (in ± the indicated scans) at a position (in scans) in the chromatogram in which a search for identical masses is done over all chromatograms. This window may vary with retention time; the parameters in 13 provide coordinates used for linear interpolation of the window size for the whole chromatogram.

• The options in “14. Tuning Alignment Options and Criteria” determine whether the rough or iterative alignment should be performed. In brief, the alignment is described as follows. In both modes of alignment the window determined by “13. Initial Peak Search Criteria” is used to restrict searches for identical masses in different data sets. For the rough mode the alignment finishes here. For the iterative alignment this is the starting point for the first estimation of a retention shift profile for all data sets with regard to the first data set. For each time point in a retention shift profile, differences in retention time between files are calculated according to criteria (parameters 16 and 17) that require a minimum number of aligned masses present in all data sets, which are above a minimum amplitude (factor times noise) and occur in a chromatogram sub-window (of two times parameter 9). The next iteration will start from here. Using this first retention shift profile, the alignment is refined by keeping track of the differences in retention and automatically decreasing the parameters in “13. Initial Peak Search Criteria” to obtain a smaller search window throughout the chromatogram. The second alignment is then done as described, for the smaller retention-corrected search window (13). Parameters 16 (number of masses) and 17 (factor times noise) are also automatically reduced and a new and better retention shift profile is calculated analogous to the first iteration. Iterations continue until the final values in parameters 16 and 17 are reached and the search window is 2 times the value of parameter 9 (average peak width). After finalizing the last iteration, incomplete mass peak sets spread over neighbouring scans are combined in a fine-alignment process.

• Parameter 15 restricts changes in retention time shifts between calculated points in a retention shift profile to a maximum value (in scans per 100 scans). This restriction is used after calculation of a retention shift profile and serves to filter out possible anomalies.

• Parameters 18 and 19 are filters for aligned mass peaks, which indicate minimum completeness of aligned mass peak sets.

• Button 20 starts the scaling and alignment of data obtained in PART A.

• Button 21 is used to obtain information on the alignment of masses. There are 3 options: (i) a normal ASCII output, (ii) an Excel-compatible CSV-file output, (iii) a graphical display of the retention shift profiles of individual data sets with regard to the first reference file.

• Button 29 executes the calculations under Button 11, Button 20 and Button 28.

• Button 30 exits the program, saving the parameters set.

PART C: Peak selection and export to MS software format for visualisation (only applicable when comparing 2 groups of data)

• Parameter 22 is the significance percentage restriction when selecting differences between data in group 1 vs. group 2.

• Parameter 23 restricts selection of differences between groups on the basis of the ratio in the means of individual aligned masses.

• Parameter 24 restricts selection of differences between groups on the basis of the minimum amplitudes defined as a factor times noise, i.e. it determines what is defined as present.

• Parameter 25 is used to filter on peaks which are only present in one group. The extra edit box is a filter for this option. It determines the minimum number of masses which should be present for a “compound” which is only present in one group.

• Parameter 26 is a condition. With this condition you select whether peaks present in group 2 are larger than in group 1 or vice versa.

• Button 27 executes PART C and creates a selection of peaks on the basis of the parameters set (22-26).

• Button 28 gives similar output as described at Button 21.
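Two ideas from Box 2.3 lend themselves to a small illustration: the linear interpolation of the search window between the coordinates given in “13. Initial Peak Search Criteria”, and a completeness filter in the spirit of parameter 18. The sketch below is not metAlign code and does not reproduce its internal logic. The function names, the clamping outside the two coordinates, the use of zero to mean "not detected" and all numbers are assumptions made for illustration.

```python
def search_window(scan, coordinates):
    """Window half-width (in scans) at a scan position, interpolated linearly
    between two (scan position, window) coordinates; clamped outside them."""
    (s0, w0), (s1, w1) = sorted(coordinates)
    if scan <= s0:
        return float(w0)
    if scan >= s1:
        return float(w1)
    return w0 + (w1 - w0) * (scan - s0) / (s1 - s0)


def filter_by_completeness(peak_table, min_present):
    """Keep aligned mass peaks detected (intensity > 0) in at least min_present samples."""
    return {peak: values for peak, values in peak_table.items()
            if sum(value > 0 for value in values) >= min_present}


# Hypothetical coordinates: window of 20 scans at scan 70, widening to 40 at scan 2450
print(search_window(1260, [(70, 20), (2450, 40)]))

# Hypothetical aligned peaks (label -> intensities in three replicate samples)
peaks = {"609.146 @ 23.2 min": [1700, 1650, 1800],
         "301.035 @ 30.1 min": [0, 0, 900]}
print(list(filter_by_completeness(peaks, min_present=3)))
```

With the 60 min defaults quoted in the legend of Fig. 2.3 (13=69, 35 and 2450, 35), both coordinates carry the same window, so the interpolated window would be constant over the run.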


TROUBLESHOOTING

Major problems are not expected when applying this protocol and keeping in mind the indicated critical steps. If by accident the LC flow stops, or for some reason has to be stopped even for only a short time, or upon running out of nitrogen gas, at least 4 samples should be analysed as “dummy” injections to re-stabilize the system. Upon malfunction of the MS system, e.g. a sudden decrease in detector sensitivity, the extracts can be stored at 4-10°C for at least 1 week. After storage, always sonicate vials or inserts to re-dissolve possible precipitates in the extracts, or filter extracts once again.

If, upon metAlign processing, there seem to be insufficient landmark peaks (i.e. mass signals common to all samples) for proper iterative alignment, a message will automatically be displayed. This can be the case when comparing highly unrelated samples (“apples and pears”). If such a comparison is still essential for the research question, we recommend lowering parameters 16 and/or 17 or, alternatively, using the rough alignment tool at button 14 (see also Box 2.3).

With regard to accurate mass calculation, the mass accuracy of an ion detected by the QTOF-Ultima MS is in principle highest at signal intensities that are comparable to that of the local lock mass (Moco et al., 2006a). Thus, if in all samples the mass signal of interest is lower than about half the intensity of the lock mass, it is impossible to calculate its exact mass using this type of mass spectrometer. Lowering the lock mass intensity during analysis is not recommended, as this will prevent an accurate estimation of the lock mass itself. At low mass signals, it is difficult to obtain informative MS/MS fragmentation as well. Strategies to increase the mass signal, such as injecting higher sample volumes, analyzing in the opposite ionisation mode, using a different ionisation source (e.g. APCI) or post-column addition of an ionisation promoter (e.g. ammonium acetate), may be tested. Alternatively, the compound of interest can be concentrated or the sample can be re-analysed by other instruments with higher mass accuracy and/or MS/MS capabilities at a low mass intensity range.

ANTICIPATED RESULTS

As this untargeted metabolomics protocol makes use of crude 75% aqueous-methanol extracts of plants coupled to C18-reversed phase LC and ESI-MS, the technique described is slightly biased towards semi-polar secondary metabolites. Nevertheless, within the same extracts a number of primary metabolites, e.g. several organic acids, nucleotides, amino acids, sugars and their phosphorylated forms, can be detected by this technique as well. However, as most of these primary metabolites are highly polar and usually co-elute with other compounds in the injection peak when using this type of column, one should be aware that differences detected in the intensity of polar mass signals may result from differential degrees of ion suppression. Results on polar compounds obtained with this protocol should be checked with alternative LC systems (Tolstikov and Fiehn, 2002; Jander et al., 2004) or other metabolomics techniques (e.g. GC-TOF MS, CE-MS).

[Figure 2.5, panels A-C: (A) retention time variation (s), (B) intensity variation (%) and (C) mass accuracy (ppm), each plotted against time of LC-MS analyses (0 to 240 h).]

Figure 2.5. Stability of the LC-QTOF MS system during 240 h continuous analyses of crude plant extracts (ESI negative mode). From a homogeneous batch of Brassica nigra leaf tissue, 16 replicate extracts were prepared and analysed throughout a series of 240 samples, using a run time of 1 h per sample. Variation between replicates in the detection of rutin (for identification see Fig. 2.2F) is indicated. (A) Retention drift during analyses, expressed in seconds deviation from the mean retention time (23.195 min ± 1.3 s; n=16). (B) Variation in mass signal intensity (peak height calculated by metAlign), expressed as percentage deviation from the mean intensity (1721 ± 355 counts scan⁻¹, coefficient of variation=21%; n=16) versus time of analysis. Variation is the sum of all technical variation including weighing, extraction, LC-MS analysis and data-processing. (C) Variation in accurate mass measurement, in ppm deviation from the mean of accurate masses calculated on the top of chromatographic peaks. Scale of y-axis: -10.0 to +10.0 ppm.

As shown in Fig. 2.5, the protocol described here enables highly stable chromatography and mass signal detection throughout analysis of large sample series. As the quality of metAlign-assisted data alignment and untargeted sample comparison is higher with increasing reproducibility of chromatography, the maximum drift in retention time of (known) compounds over the sample series analysed should be as small as possible and preferably less than 10 s (Fig. 2.5A). Larger retention shifts usually indicate column deterioration, trapped air bubbles or changes in eluent pH. Technical variation in relative quantification of mass signals between samples, which can be introduced at each step from 1 to 16 of the PROCEDURE, can be calculated from the intensities of (known) mass peaks (Fig. 2.5B). The coefficient of variation in intensities between replicate samples should be less than 25% overall, and is usually less than 10% for the higher abundant signals (Moco et al., 2006a). In addition, technical reproducibility can be estimated by creating scatter plots of all mass peaks from replicate samples (Vorst et al., 2005). Upon adequate mass calibration and by using lock mass correction on-line, the accurate masses of ions detected are usually stable throughout large sample series (Fig. 2.5C). With the TOF resolution used and at a signal intensity that is comparable to that of the lock mass, the observed accurate mass of a compound of interest should be within 5 ppm deviation from the calculated mass. In our laboratory, we use a script called MetAccure (Vorst et al., 2005; Moco et al., 2006a) to select scans within a user-defined intensity ratio of sample versus lock mass, to enable automated and correct accurate mass calculations. By calculating the mean values of observed accurate masses of compounds across all samples analysed, mass accuracies of 2 ppm or better can be obtained (Moco et al., 2006a).
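The coefficient of variation used as the acceptance criterion above is the standard deviation expressed as a percentage of the mean. A minimal sketch follows; the function names are hypothetical, the replicate values are invented, and the sample standard deviation is assumed, since the text does not state which estimator was used. The summary values are those given for rutin in the legend of Fig. 2.5B.

```python
import statistics


def cv_from_summary(mean, sd):
    """Coefficient of variation (%) from a reported mean and standard deviation."""
    return 100.0 * sd / mean


def cv_percent(intensities):
    """Coefficient of variation (%) of replicate intensities (sample SD assumed)."""
    return 100.0 * statistics.stdev(intensities) / statistics.mean(intensities)


# Rutin in Fig. 2.5B: 1721 ± 355 counts per scan
print(round(cv_from_summary(1721, 355)))      # 21

# Hypothetical replicate peak heights for another mass signal
replicates = [90.0, 100.0, 110.0]
print(f"{cv_percent(replicates):.1f}%")
```

The first value reproduces the 21% quoted in the figure legend, which lies within the overall 25% limit.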

[Figure 2.6: scatter plot of MS signal, Ln(mass peak intensity), on the y-axis against PDA signal, Ln(area at 360 nm), on the x-axis.]

Figure 2.6. Correlation between conventional LC-PDA analysis and untargeted LC-MS based metabolomics with regard to detection of the flavonoid rutin (for identification see Fig. 2.2F). Ripe fruits of 114 different tomato cultivars were analysed by LC-PDA-QTOF MS in ESI negative mode, as described in this protocol. LC-PDA signals (peak areas at 360 nm) were subsequently extracted in a targeted manner using the QuanLynx tool of MassLynx, while LC-MS parent ion signals were retrieved in an untargeted manner using metAlign. Ln-transformed data show high linear correlation (y = 1.0937 x with r² = 0.972; p < 2.5 × 10⁻⁷), indicating that the untargeted approach is equivalent to the targeted (conventional) LC-PDA approach.

Reversed phase LC with PDA detection has been used for decades for quantitative analysis of many secondary metabolites in plants. As the analytical system described in this protocol consists of reversed phase LC coupled to both PDA and MS, the quality of the untargeted LC-MS data can be checked by comparing with LC-PDA data of the same samples (Fig. 2.6). After log-transformation of both data sets, a significant and linear correlation should be achieved between a mass peak signal obtained by untargeted metabolomics and the peak area obtained by conventional LC-PDA analysis. A low correlation may indicate significant ion suppression, MS detector saturation or marked misalignments. However, correlations can only be established for compounds that show clearly separated PDA peaks in the chromatograms.
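The comparison described above can be sketched in a few lines: ln-transform both signals, fit a line through the origin (the form y = a x reported in the legend of Fig. 2.6) and compute the squared correlation. This is an illustration only. The function name is hypothetical, the data are synthetic, and the squared Pearson correlation is an assumption, since the text does not state how r² was obtained.

```python
import math


def ln_fit_through_origin(pda_areas, ms_intensities):
    """Return (slope, r2) for ln(MS) versus ln(PDA).

    slope: least-squares line forced through the origin.
    r2:    squared Pearson correlation of the ln-transformed values.
    """
    x = [math.log(v) for v in pda_areas]
    y = [math.log(v) for v in ms_intensities]
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return slope, cov * cov / (var_x * var_y)


# Synthetic example: MS signal constructed so that ln(MS) = 1.1 * ln(PDA)
pda = [math.exp(v) for v in (6.0, 7.0, 8.0, 9.0)]
ms = [math.exp(1.1 * v) for v in (6.0, 7.0, 8.0, 9.0)]
slope, r2 = ln_fit_through_origin(pda, ms)
print(f"slope = {slope:.3f}, r2 = {r2:.3f}")
```

With real data, a slope far from 1 or a low r² would point to the problems named above, such as ion suppression, detector saturation or misalignment.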

Figure 2.7. Hierarchical clustering (Pearson correlation) of 180 A. thaliana genotypes consisting of a recombinant inbred line (RIL) population and their parents, based on untargeted metabolomics data. Samples were analysed by LC-QTOF MS (30 min run) and 5783 mass peaks, extracted and aligned by metAlign, were loaded into GeneMaths software for multivariate analyses. Mass signal intensities (y-axis) were ln-transformed and standardised per row average (each row representing a single mass peak), with the colour scale given in the lower panel (green means relatively low, red means relatively high intensity). Replicate samples are indicated with the same colour on the sample key (x-axis): yellow- and blue-coloured samples are replicate analyses of two different samples, each composed of a mixture of RILs, to check for LC-MS reproducibility and alignment; green- and red-coloured samples represent 5 biological replicates of the Ler and Cvi parents, respectively.

The aligned data sets can also be imported into software packages for large scale multivariate or statistical analyses, such as GeneMaths (Vorst et al., 2005) and MetaNetwork (Fu et al., 2007). We recommend loading mass peak data as ln-transformed values. We routinely use GeneMaths software to check the quality of the mass signal output from large scale experiments, by applying principal component analysis and hierarchical clustering. In these multivariate approaches, replicate samples should cluster relatively close to each other, as compared to e.g. different genotypes (Fig. 2.7), plant treatments or tissues, and the segregation of the scores should be according to the expected data structure, if applicable (Vorst et al., 2005).
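The pre-processing and the replicate check can be sketched without dedicated software. The example below is not the GeneMaths procedure: it assumes that "standardised per row average" means subtracting the row mean of the ln-transformed values, and it uses a small hypothetical peak table with two replicates of each of two genotypes.

```python
import math


def ln_row_centred(matrix):
    """Ln-transform each value and subtract the mean of its row (mass peak)."""
    centred = []
    for row in matrix:
        logs = [math.log(v) for v in row]
        avg = sum(logs) / len(logs)
        centred.append([v - avg for v in logs])
    return centred


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def column(matrix, j):
    """Return sample j as a list of values over all mass peaks."""
    return [row[j] for row in matrix]


# Hypothetical peak heights: rows are mass peaks, columns are samples.
samples = ["A_rep1", "A_rep2", "B_rep1", "B_rep2"]
peaks = [
    [1200.0, 1150.0, 300.0, 320.0],
    [450.0, 480.0, 2100.0, 1950.0],
    [8000.0, 7600.0, 7900.0, 8200.0],
    [150.0, 160.0, 900.0, 870.0],
    [3000.0, 3200.0, 650.0, 700.0],
]

z = ln_row_centred(peaks)
r_replicates = pearson(column(z, 0), column(z, 1))  # A_rep1 vs A_rep2
r_genotypes = pearson(column(z, 0), column(z, 2))   # A_rep1 vs B_rep1
print(round(r_replicates, 3), round(r_genotypes, 3))
```

Replicates of the same genotype should give a clearly higher correlation than samples of different genotypes; a pairwise correlation matrix of this kind is the usual input for hierarchical clustering.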


Chapter 3

A Liquid Chromatography Mass Spectrometry based Metabolome Database for Tomato Sofia Moco, Raoul J. Bino, Oscar Vorst, Harrie A. Verhoeven, Joost de Groot, Teris A. van Beek, Jacques Vervoort and Ric C.H. De Vos Plant Physiology 141: 1205-1218 (2006)

For the description of the metabolome of an organism, the development of common metabolite databases is of utmost importance. Here we present the MoTo DB (Metabolome Tomato Database), a metabolite database dedicated to liquid chromatography-mass spectrometry (LC-MS)-based metabolomics of tomato fruit (Solanum lycopersicum). A reproducible analytical approach consisting of reversed phase LC coupled to quadrupole time-of-flight MS and photodiode array detection (PDA) was developed for large scale detection and identification of mainly semipolar metabolites in plants and for the incorporation of the tomato fruit metabolite data into the MoTo DB. Chromatograms were processed using software tools for mass signal extraction and alignment, and intensity dependent accurate mass calculation. The detected masses were assigned by matching their accurate mass signals with tomato compounds reported in literature and complemented, as much as possible, by PDA and MS/MS information, as well as by using reference compounds. Several novel compounds not previously reported for tomato fruit were identified in this manner and added to the database. The MoTo DB is available at http://appliedbioinformatics.wur.nl and contains all information so far assembled using this LC-PDA-QTOF MS platform, including retention times, calculated accurate masses, PDA spectra, MS/MS fragments and literature references. Unbiased metabolic profiling and comparison of peel and flesh tissues from tomato fruits validated the applicability of the MoTo DB, revealing that all flavonoids and α-tomatine were specifically present in the peel, while several other alkaloids and some particular phenylpropanoids were mainly present in the flesh tissue.


INTRODUCTION

For understanding the dynamic behaviour of a complex biological system, it is essential to follow, as unbiased as possible, its response to a conditional perturbation at the transcriptome, proteome and metabolome levels. To study the dynamics of the metabolome, to analyse fluxes in metabolic pathways and to decipher the biological roles of metabolites, the identification of the participating metabolites should be as unambiguous as possible. Metabolomics is defined as the analysis of all metabolites in an organism and concerns the simultaneous (‘multiparallel’) measurement of all metabolites in a given biological system (Dixon and Strack, 2003). However, this is a technically challenging task, as no single analytical method is capable of extracting and detecting all metabolites at once due to the enormous chemical variety of metabolites and the large range of concentrations at which metabolites can be present. Therefore the characterization of a complete metabolome requires different complementary analytical technologies. Currently, mass spectrometry (MS) is the most sensitive method enabling the detection of hundreds of compounds within single extracts. Ideally, metabolome data should be incorporated into open access databases where information can be viewed, sorted and matched. Different pathway resources are available that combine information from the “omics” technologies such as Kyoto Encyclopedia of Genes and Genomes (http://www.genome.jp/kegg), MetaCyc (http://metacyc.org) or The Arabidopsis Information Resource (http://www.arabidopsis.org). Hitherto, research on plant metabolic profiling using chromatographic techniques coupled to MS technologies for database purposes has been accomplished by gas chromatography-mass spectrometry (GC-MS) analysis of extracts (Schauer et al., 2005a; Tikunov et al., 2005). GC-MS entails high reproducibility in both chromatography and mass fragmentation patterns.
This reproducibility enabled the development of common metabolite databases, e.g. GMD@CSB.DB (http://csbdb.mpimp-golm.mpg.de/csbdb/gmd/gmd.html) and the Fiehn-Library (http://fiehnlab.ucdavis.edu/compounds), that gather information mainly on primary metabolites. Liquid chromatography (LC)-MS is the preferred technique for the separation and detection of the large and often unique group of semipolar secondary metabolites in plants. Specifically, high resolution accurate mass MS enables the detection of large numbers of parent ions present in a single extract and can provide valuable information on the chemical composition and thus the putative identity of large numbers of metabolites. Recently, accurate mass LC-MS was performed to detect secondary metabolites present in roots and leaves of Arabidopsis (Arabidopsis thaliana) (von Roepenack-Lahaye et al., 2004), to study metabolic alterations in a light hypersensitive mutant of tomato (Solanum lycopersicum) (Bino et al., 2005) and to compare tubers of potato (Solanum tuberosum) of different genetic origin and developmental stages (Vorst et al., 2005). The variety of LC-MS systems, and the generally poorer retention time reproducibility of LC compared to GC, limits the establishment of a single optimised analytical procedure and hampers the comparison of LC-MS chromatograms between laboratories. Moreover, software tools able to automatically transform mass spectrometry data into a list of (putative) plant metabolites, in particular for LC-MS, are not yet available. This implies that analyses of mass signal data sets are left to manual searches in the available chemical databases such as SciFinder, PubChem or Dictionary of Natural Products (DNP). To extend the applicability of LC-MS in plant metabolomics, efforts should be made in (i) the establishment of a routine and reproducible LC-MS method, (ii) the annotation of the large numbers of mass signals detected, (iii) the unambiguous identification of compounds, and (iv) the development of a common reference database and searching tools for secondary metabolites in plants. In this article we present an open access metabolite database for LC-MS, called MoTo DB, dedicated to tomato fruit. This database is based on literature information combined with experimental data derived from LC-MS based metabolomics experiments. A reproducible and robust C18-based reversed phase LC-PDA-electrospray ionisation (ESI)-QTOF-MS method was developed for the detection and putative identification of predominantly secondary metabolites of semi-polar nature.
The assignment of mass signals detected relies on the combination of the parameters: i) accurate mass, ii) retention time, iii) UV/Vis spectral information, and iv) MS/MS fragmentation data. To demonstrate the applicability of the established LC-MS metabolomics platform including database searching, peel and flesh tissues from ripe tomato fruit were compared for differences in metabolic composition. Statistically significant differences in LC-QTOF MS profiles between the tissues were identified in an unbiased manner, and differential mass peaks were annotated by searching in the MoTo DB. Several compounds not previously reported in tomato were also identified and have been incorporated into the database. All available information in the MoTo DB can be searched at http://appliedbioinformatics.wur.nl.


MATERIALS AND METHODS

Plant material: A large pool of tomato (Lycopersicon esculentum, now Solanum lycopersicum) fruit material was prepared by combining fruits from turning, pink and red ripe stages of development of 96 different tomato cultivars representing the three major types of tomato fruits (i.e. cherry, Dutch beef and normal round tomatoes). These plants were grown in an environmentally controlled greenhouse located in Wageningen, The Netherlands, during the summer and autumn of 2003. Plants were grown in rock wool plugs connected to an automatic irrigation system, comparable to standard commercial cultivation conditions. For analysis of anthocyanins, purple-coloured fruits from offspring of a cross between two natural mutants, Af × hp-2j (van Tuinen et al., 2005), were harvested at the ripe stage of development. Peel (about 2 mm thickness) was removed from fruits, ground into a fine powder in liquid nitrogen and stored at –80 °C until further analysis. For metabolite profile comparison of peel and flesh, red ripe fruits of cultivar Money Maker were used, of which peel (2 mm thickness) and flesh (rest of fruit) were separated and used as described.

Extraction: Of the frozen tomato powder, 0.5 g FW was weighed and extracted with 1.5 mL pure methanol (final methanol concentration in the extract ~ 75%). Hydrolysed extracts were prepared by sequentially adding 1 mL of 0.1% TBHQ in methanol solution and 0.4 mL of 6 M HCl to 0.6 g FW tomato material, shaking in a water bath at 90-95 °C for 1 h, and adding 2 mL of methanol (Bovy et al., 2002). All samples were sonicated for 15 min, filtered through a 0.2 μm inorganic membrane filter (Anotop 10, Whatman, Maidstone, England) and analysed.
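The stated final methanol concentration of about 75% follows from the water contributed by the tissue itself. The short calculation below is a sketch under two assumptions that are not given in the text: a water content of the fruit tissue of 95% (w/w) and simple additivity of volumes upon mixing.

```python
def final_methanol_fraction(tissue_g: float, methanol_ml: float,
                            water_fraction: float = 0.95,
                            water_density: float = 1.0) -> float:
    """Volume fraction of methanol after mixing with the water of the tissue.

    water_fraction (w/w) and volume additivity are assumptions made for
    this illustration; they are not taken from the protocol.
    """
    water_ml = tissue_g * water_fraction / water_density
    return methanol_ml / (methanol_ml + water_ml)


# 0.5 g FW of tomato powder extracted with 1.5 mL of pure methanol.
fraction = final_methanol_fraction(0.5, 1.5)
print(f"{fraction:.1%}")
```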
Chemicals: Standard compounds p-coumaric acid, protocatechuic acid, salicylic acid, caffeic acid, ferulic acid, cinnamic acid, myricetin and naringenin were purchased from ICN (Ohio, USA); p-hydroxybenzoic acid, chlorogenic acid, quercetin, phenylalanine, sinapic acid and α-tomatine from Sigma (St. Louis, USA); vanillic acid and rutin (quercetin-3-O-rutinoside) from Acros (New Jersey, USA); naringenin chalcone from Apin Chemicals (Abingdon, UK); kaempferol and kaempferol-3-O-rutinoside from Extrasynthese (Genay, France); and tert-butylhydroquinone (TBHQ) from Aldrich (Steinheim, Germany). Acetonitrile HPLC supra gradient and methanol absolute HPLC supra gradient were obtained from Biosolve (Valkenswaard, The Netherlands). Formic acid for synthesis 98-100% was from Merck-Schuchardt (Hohenbrunn, Germany), hydrochloric acid (HCl) 37% p.a. from Acros (New Jersey, USA) and ultra pure water was obtained from an Elga Maxima purification unit (Bucks, UK). Leucine enkephalin was purchased from Sigma (St. Louis, USA).


Chromatographic conditions: HPLC was carried out using a Waters Alliance 2795 HT system with a column oven. For chromatographic separation, a Luna C18(2) pre-column (2.0 × 4 mm) and analytical column (2.0 × 150 mm, 100 Å, particle size 3 μm) from Phenomenex (Torrance, CA, USA) were used. Five μL of sample was injected into the system for LC-PDA-MS analysis. Degassed solutions of formic acid:ultra pure water (1:10³, v/v) (eluent A) and formic acid:acetonitrile (1:10³, v/v) (eluent B) were pumped at 0.19 mL min⁻¹ into the HPLC system. The gradient applied started at 5% B and increased linearly to 35% B in 45 min. Then, for 15 min, the column was washed and equilibrated before the next injection. The column temperature was kept at 40 °C and the samples at 20 °C. The room temperature was maintained at 20 °C.

Detection of metabolites by PDA and MS: The HPLC system was connected online to a Waters 2996 PDA detector, set to acquire data every second from 240 to 600 nm with a resolution of 4.8 nm, and subsequently to a QTOF Ultima V4.00.00 mass spectrometer (Waters Corporation, MS technologies, Manchester, UK). An ESI source working either in positive or negative ion mode was used for all MS analyses. Before each series of analyses, the mass spectrometer was calibrated using a phosphoric acid:acetonitrile:water (1:10³:10³, v/v) solution. Capillary voltage, collision energy and desolvation temperature were optimised to obtain a series of phosphoric acid clusters suitable for calibration between m/z 80 and 1500. During sample analyses, the capillary voltage was set to 2.75 kV and the cone voltage to 35 V. Source and desolvation temperatures were set to 120 °C and 250 °C, respectively. Cone gas and desolvation gas flows were 50 and 500 L h⁻¹, respectively. In the positive ion mode, the collision energy was 5 eV, while in the negative ion mode it was 10 eV. Resolution was set at 10,000 and during calibration the MS parameters were adjusted to achieve such a resolution.
TOF MS data were acquired in centroid mode. During LC-MS analyses, a scan duration of 0.9 s and an interscan time of 0.1 s were used. For LC-MS/MS measurements, 10 μL of sample was injected into the system and MS/MS measurements were made with a scan duration of 0.40 s and an interscan delay of 0.10 s, with increasing collision energies according to the following program: 5 (ESI-positive) or 10 (ESI-negative), 15, 30 and 50 eV. The mass spectrometer was equipped with a lock spray source, allowing on-line mass correction to obtain high mass accuracy of analytes. Leucine enkephalin, [M+H]+ = 556.2766 and [M-H]- = 554.2620, was used as a lock mass, being continuously sprayed into a second ESI source using an LKB Bromma 2150 HPLC pump, and sampled every 10 s, producing an average intensity of 500 counts scan⁻¹ in centroid mode (~100 counts scan⁻¹ in continuum mode).

Data analysis and alignment: Acquisition of LC-PDA-MS data was performed under MassLynx 4.0 (Waters). MassLynx was used for visualisation and manual processing of LC-PDA-MS/MS data. Mass data were automatically processed by metAlign version 1.0 (www.metalign.nl). MetAlign transforms accurate masses into nominal masses to shorten the calculation time and minimize the number of mass bins. Baseline and noise calculations were performed from scan number 225 to 2,475, corresponding to retention times 4.0 min to 49.3 min. The maximum amplitude was set to 15,000 and peaks below three times the local noise were discarded. The .csv file output containing nominal mass peak intensity data (peak heights, i.e. ion counts scan⁻¹ at the centre of the peak) at aligned retention times (scans) over all samples processed was used for further data processing. A script called metAccure was used for the calculation of accurate masses from the metAlign-extracted peaks. MetAccure calculates the accurate mass of each mass signal using only those scans in which signal intensities are within a user-defined window relative to the lock mass intensity. It combines the .csv files containing retention time alignments, originating from metAlign analysis, with the original data in NetCDF format, created from MassLynx .raw files by Dbridge (Vorst et al., unpublished data). Comparison of extracts from peel and flesh tissues for significant differences in intensity of each aligned mass signal was made using the Student's t-test tool within metAlign (level of significance set at 0.05). The settings for baseline corrections and signal alignment were analogous to those described above.

Annotation of metabolites: Data sets obtained after metAlign and metAccure treatment were analysed as [retention time × accurate mass × peak intensity] matrices for metabolite identification.
[M+H]+ and [M-H]- values were calculated for metabolites present in Table 3.1 and used for sorting within the matrices. Data collected during the first 4.0 min of chromatography were discarded. Novel metabolites were identified by calculating the elemental composition from accurate mass measurements using the MassLynx software. The tolerance was set at 5 ppm, taking into account the correct analyte-lock mass signal ratio. For an observed accurate mass, a list of possible molecular formulae was obtained, selected for the presence of C, H, O and N. In addition, raw data sets were checked manually in MassLynx for retention time, UV/Vis spectra and QTOF-MS/MS fragmentation patterns for chromatographically separated peaks, complementing the accurate mass-based elemental formulae. The combination of accurate mass data, retention time (as an indication of polarity), UV/Vis spectra and MS/MS data allowed a putative identification of metabolites. Best matches were searched in the Dictionary of Natural Products (DNP) and SciFinder databases for possible structures. The putative identifications were confirmed by published data and with standard compounds, if commercially available.

MoTo DB build-up: Based on available literature information about compounds identified in tomato, information acquired from LC-PDA-MS analysis of tomato fruit was used to validate each metabolite: i) retention time; ii) accurate mass in the form of monoisotopic mass (neutral) and in the ion forms [M+H]+ and [M-H]-; iii) elemental composition; iv) MS/MS fragments; v) maximum absorbance peaks in UV/Vis. Given a found mass and a ∆ppm (or ∆mDa) that is set by the user, the database can find possible matches. Formic acid, if detected, was also included in the database. The database is implemented in MySQL and runs on a Linux cluster.
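The mass search described for the database can be sketched as follows. This is not the MoTo DB implementation (which runs on MySQL): an in-memory dictionary with a few entries from Table 3.1 stands in for the database, and the names `ENTRIES`, `ion_mz` and `search` are assumptions made for this illustration.

```python
PROTON = 1.007276  # mass of a proton in Da

# Neutral monoisotopic masses of a few compounds, taken from Table 3.1.
ENTRIES = {
    "Chlorogenic acid": 354.0951,
    "Naringenin": 272.0685,
    "Naringenin chalcone": 272.0685,
    "Rutin": 610.1534,
    "Kaempferol-3-7-di-O-glucoside": 610.1534,
    "alpha-Tomatine": 1033.5458,
}


def ion_mz(neutral_mass: float, mode: str) -> float:
    """Calculated m/z of [M+H]+ (mode 'pos') or [M-H]- (mode 'neg')."""
    if mode == "pos":
        return neutral_mass + PROTON
    if mode == "neg":
        return neutral_mass - PROTON
    raise ValueError("mode must be 'pos' or 'neg'")


def search(found_mz: float, mode: str, tolerance: float, unit: str = "ppm"):
    """Return (name, deviation) pairs whose ion mass lies within the tolerance."""
    if unit not in ("ppm", "mDa"):
        raise ValueError("unit must be 'ppm' or 'mDa'")
    hits = []
    for name, neutral in ENTRIES.items():
        calculated = ion_mz(neutral, mode)
        delta = found_mz - calculated
        deviation = delta / calculated * 1e6 if unit == "ppm" else delta * 1e3
        if abs(deviation) <= tolerance:
            hits.append((name, round(deviation, 2)))
    return sorted(hits)


# A mass found in ESI-negative mode, searched with a 5 ppm window.
print(search(609.1470, "neg", 5.0))
```

In this example the search returns two isomers with identical elemental composition, which illustrates why retention time, UV/Vis spectra and MS/MS fragments are stored alongside the accurate mass.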

RESULTS

Metabolites present in tomato fruit according to literature

First, a database was constructed based on literature research to include metabolites reported to be present in tomato fruit from both wild and cultivated varieties as well as transgenic tomato plants. Though some tomato varieties are known to contain anthocyanins in their fruit (Jones et al., 2003), so far, to our knowledge, there are no reports on the identification of this class of compounds in fruit tissue. Therefore, in our literature search we included reports on anthocyanin identification in seedlings of tomato. Names (common and International Union of Pure and Applied Chemistry (IUPAC)), Chemical Abstracts Service (CAS) registry number, molecular formula, monoisotopic accurate mass, published references and other properties of each metabolite are systematized in this database. The database includes polar, semi-polar and apolar compounds. Because the procedure used by us for extraction, separation and detection (see below) is biased towards compounds of semi-polar nature, we expected mostly secondary metabolites like (poly)phenols, alkaloids and derivatives thereof to be detected. Table 3.1 summarizes all (poly)phenolic compounds (48) and alkaloids (15) so far reported to be present in tomato fruit extracts, including compounds that have been identified only in fruits of transgenic tomato plants. Many compounds were assigned before mass spectrometry technologies became available. The number of compounds identified by NMR is very limited.
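The monoisotopic masses stored in such a database follow directly from the molecular formula. A minimal sketch, restricted to the elements C, H, N and O and to neutral formulae, is given below; the function name `monoisotopic_mass` is an assumption made for this illustration.

```python
import re

# Masses (Da) of the most abundant isotope of each element considered here.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.99491462,
}


def monoisotopic_mass(formula: str) -> float:
    """Monoisotopic mass of a neutral formula such as 'C27H30O16'.

    Only C, H, N and O are supported; other element symbols raise KeyError.
    """
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


for name, formula in [("Rutin", "C27H30O16"),
                      ("Tomatidine", "C27H45NO2"),
                      ("alpha-Tomatine", "C50H83NO21")]:
    print(name, formula, round(monoisotopic_mass(formula), 4))
```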


Table 3.1. List of secondary metabolites identified in tomato fruit extracts according to literature. a – identified after hydrolysis, b – identified in transgenic tomato plants, c – identified using NMR data, d – identified in seedlings, Mol Form – molecular formula, MM – monoisotopic molecular mass.

| Compound | Mol Form | MM | Reference |
|---|---|---|---|
| p-Hydroxybenzoic acid | C7H6O3 | 138.0317 | (Mattila and Kumpulainen, 2002) |
| Salicylic acid | C7H6O3 | 138.0317 | (Schmidtlein and Herrmann, 1975), (Petró-Turza, 1987) |
| Cinnamic acid | C9H8O2 | 148.0524 | (Petró-Turza, 1987) |
| Protocatechuic acid | C7H6O4 | 154.0266 | (Mattila and Kumpulainen, 2002)a |
| m-Coumaric acid | C9H8O3 | 164.0474 | (Hunt and Baker, 1980)a |
| p-Coumaric acid | C9H8O3 | 164.0473 | (Schmidtlein and Herrmann, 1975)a, (Hunt and Baker, 1980)a, (Petró-Turza, 1987), (Martinez-Valverde et al., 2002), (Mattila and Kumpulainen, 2002), (Raffo et al., 2002), (Le Gall et al., 2003a)bc |
| Vanillic acid | C8H8O4 | 168.0423 | (Schmidtlein and Herrmann, 1975), (Mattila and Kumpulainen, 2002) |
| Caffeic acid | C9H8O4 | 180.0423 | (Schmidtlein and Herrmann, 1975)a, (Hunt and Baker, 1980)a, (Martinez-Valverde et al., 2002), (Mattila and Kumpulainen, 2002), (Raffo et al., 2002), (Sakakibara et al., 2003), (Minoggio et al., 2003), (Le Gall et al., 2003a)bc |
| Ferulic acid | C10H10O4 | 194.0579 | (Schmidtlein and Herrmann, 1975)a, (Hunt and Baker, 1980)a, (Martinez-Valverde et al., 2002), (Mattila and Kumpulainen, 2002), (Raffo et al., 2002), (Minoggio et al., 2003) |
| Sinapic acid | C11H12O5 | 224.0685 | (Schmidtlein and Herrmann, 1975)a |
| Naringenin | C15H12O5 | 272.0685 | (Hunt and Baker, 1980)a, (Justesen et al., 1998)a, (Martinez-Valverde et al., 2002)a, (Raffo et al., 2002), (Minoggio et al., 2003) |
| Naringenin chalcone | C15H12O5 | 272.0685 | (Hunt and Baker, 1980)a, (Krause and Galensa, 1992), (Muir et al., 2001), (Le Gall et al., 2003b)b, (Minoggio et al., 2003) |
| Kaempferol | C15H10O6 | 286.0477 | (Stewart et al., 2000), (Martinez-Valverde et al., 2002)a, (Tokusoglu et al., 2003)a |
| Quercetin | C15H10O7 | 302.0427 | (Hertog et al., 1992), (Crozier et al., 1997)a, (Justesen et al., 1998)a, (Stewart et al., 2000), (Martinez-Valverde et al., 2002)a, (Raffo et al., 2002), (Sakakibara et al., 2003), (Tokusoglu et al., 2003)a |
| Myricetin | C15H10O8 | 318.0376 | (Raffo et al., 2002), (Sakakibara et al., 2003), (Tokusoglu et al., 2003)a |
| p-Coumaric acid-O-β-D-glucoside | C15H18O8 | 326.1002 | (Fleuriet and Macheix, 1977), (Reschke and Herrmann, 1982)a, (Winter and Herrmann, 1986)c, (Buta and Spaulding, 1997) |
| p-Coumaroylquinic acid | C16H18O8 | 338.1002 | (Fleuriet and Macheix, 1977) |
| Caffeic acid-4-O-β-D-glucoside | C15H18O9 | 342.0951 | (Fleuriet and Macheix, 1977), (Winter and Herrmann, 1986) |
| Chlorogenic acid (5-O-caffeoylquinic acid) | C16H18O9 | 354.0951 | (Fleuriet and Macheix, 1977), (Fleuriet and Macheix, 1981), (Winter and Herrmann, 1986), (Buta and Spaulding, 1997), (Martinez-Valverde et al., 2002), (Mattila and Kumpulainen, 2002), (Raffo et al., 2002), (Sakakibara et al., 2003), (Minoggio et al., 2003), (Le Gall et al., 2003a; Le Gall et al., 2003b)bc |
| 4-O-Caffeoylquinic acid | C16H18O9 | 354.0951 | (Winter and Herrmann, 1986), (Mattila and Kumpulainen, 2002) |
| 5-O-Caffeoylquinic acid | C16H18O9 | 354.0951 | (Winter and Herrmann, 1986) |
| Ferulic acid-O-β-D-glucoside | C16H20O9 | 356.1107 | (Fleuriet and Macheix, 1977), (Reschke and Herrmann, 1982), (Winter and Herrmann, 1986) |
| Feruloylquinic acid | C17H20O9 | 368.1107 | (Fleuriet and Macheix, 1977) |
| Tomatidine | C27H45NO2 | 415.3450 | (Juvik et al., 1982)a, (Friedman et al., 1998)a |
| Tomatidenol | C27H43NO2 | 413.3294 | (Juvik et al., 1982)a, (Friedman et al., 1994)a, (Friedman et al., 1997)a, (Friedman, 2002)a |
| Naringenin-7-O-glucoside | C21H22O10 | 434.1213 | (Hunt and Baker, 1980), (Le Gall et al., 2003a)bc, (Le Gall et al., 2003b)bc |
| Naringenin chalcone-glucoside | C21H22O10 | 434.1213 | (Bino et al., 2005) |
| Astragalin | C21H20O11 | 448.1006 | (Le Gall et al., 2003a)bc, (Le Gall et al., 2003b)bc |
| Dihydrokaempferol-7-O-hexoside and Dihydrokaempferol-?-O-hexoside | C21H22O11 | 450.1162 | (Le Gall et al., 2003a)bc, (Le Gall et al., 2003b)bc |
| Isoquercitrin | C21H20O12 | 464.0955 | (Muir et al., 2001)b, (Le Gall et al., 2003a)b, (Le Gall et al., 2003b)b |
| Myricitrin | C21H20O12 | 464.0955 | (Sakakibara et al., 2003) |
| Naringin | C27H32O14 | 580.1792 | (Bovy et al., 2002)abd |
| Kaempferol-3-O-rutinoside | C27H30O15 | 594.1585 | (Bovy et al., 2002)bd, (Le Gall et al., 2003b)bc |
| Kaempferol-3-7-di-O-glucoside | C27H30O16 | 610.1534 | (Le Gall et al., 2003a)bc, (Le Gall et al., 2003b)bc |
| Rutin | C27H30O16 | 610.1534 | (Fleuriet and Macheix, 1977), (Buta and Spaulding, 1997), (Stewart et al., 2000), (Muir et al., 2001), (Raffo et al., 2002), (Le Gall et al., 2003a)bc, (Le Gall et al., 2003b)bc, (Minoggio et al., 2003) |
| Quercetin-3-O-trisaccharide | C32H38O20 | 742.1956 | (Muir et al., 2001), (Minoggio et al., 2003) |
| p-Coumaric acid-rutin conjugate | C36H36O18 | 756.1902 | (Buta and Spaulding, 1997) |
| Kaempferol-3-O-rutinoside-7-O-glucoside | C33H40O20 | 756.2113 | (Le Gall et al., 2003a)bc, (Le Gall et al., 2003b)bc |
| Delphinidin-3-O-rutinoside-5-O-glucoside | C33H41O21+ | 773.2135 | (Mathews et al., 2003)bd |
| Petunidin-3-O-rutinoside-5-O-glucoside | C34H43O21+ | 787.2291 | (Mathews et al., 2003)bd |
| Malvidin-3-O-rutinoside-5-O-glucoside | C35H45O21+ | 801.2448 | (Mathews et al., 2003)bd |
| Delphinidin-3-O-(p-coumaroyl)rutinoside-5-O-glucoside | C42H47O23+ | 919.2503 | (Mathews et al., 2003)bd |
| Petunidin-3-O-(p-coumaroyl)rutinoside-5-O-glucoside | C43H49O23+ | 933.2659 | (Bovy et al., 2002)bd, (Mathews et al., 2003)bd |
| Delphinidin-3-O-(caffeoyl)rutinoside-5-O-glucoside | C42H47O24+ | 935.2452 | (Mathews et al., 2003)bd |
| Malvidin-3-O-(p-coumaroyl)rutinoside-5-O-glucoside | C44H51O23+ | 947.2816 | (Bovy et al., 2002)bd, (Mathews et al., 2003)bd |
| Petunidin-3-(caffeoyl)rutinoside-5-O-glucoside | C43H49O24+ | 949.2608 | (Bovy et al., 2002)bd, (Mathews et al., 2003)bd |
| Malvidin-3-(caffeoyl)rutinoside-5-O-glucoside | C44H51O24+ | 963.2765 | (Mathews et al., 2003)bd |
| δ-Tomatine | C33H55NO7 | 577.3979 | (Friedman et al., 1998)a |
| γ-Tomatine | C39H65NO12 | 739.4507 | (Friedman et al., 1998)a |
| β-Tomatine | C45H75NO17 | 901.5035 | (Friedman et al., 1998)a |
| Dehydrotomatine | C50H81NO21 | 1031.5301 | (Friedman et al., 1994), (Kozukue and Friedman, 2003) |
| α-Tomatine | C50H83NO21 | 1033.5458 | (Juvik et al., 1982), (Willker and Leibfritz, 1992)c, (Friedman et al., 1994), (Yahara et al., 1996), (Friedman et al., 1997), (Friedman et al., 1998), (Friedman, 2002), (Bianco et al., 2002), (Kozukue and Friedman, 2003) |
| Lycoperoside H | C50H83NO22 | 1049.5407 | (Yahara et al., 1996)c, (Yahara et al., 2004)c |
| Lycoperoside A | C52H85NO23 | 1091.5512 | (Yahara et al., 1996)c, (Yahara et al., 2004)c |
| Lycoperoside B | C52H85NO23 | 1091.5512 | (Yahara et al., 1996)c, (Yahara et al., 2004)c |
| Lycoperoside C | C52H85NO23 | 1091.5512 | (Yahara et al., 1996)c, (Yahara et al., 2004)c |
| Esculeoside B | C56H93NO28 | 1227.5884 | (Fujiwara et al., 2004)c, (Yahara et al., 2004)c |
| Esculeoside A | C58H95NO29 | 1269.5990 | (Fujiwara et al., 2003)c, (Fujiwara et al., 2004)c, (Yahara et al., 2004)c, (Yoshizaki et al., 2005)c |
| Lycoperoside F | C58H95NO29 | 1269.5990 | (Yahara et al., 2004)c |
| Lycoperoside G | C58H95NO29 | 1269.5990 | (Yahara et al., 2004)c |

Metabolite Extraction and LC-PDA-MS analysis

A representative tomato fruit sample was obtained by combining fruits of 96 different tomato cultivars producing ripe red or orange coloured beef, round or cherry-type fruits at different stages of ripening (Tikunov et al., 2005). In addition, some purple-skinned fruits were selected for analyses of anthocyanins, which is a class of tomato fruit compounds only occurring in specific varieties (Jones et al., 2003) or in transgenic plants (Mathews et al., 2003). Peel material was chosen as the starting material, as this tissue contains the highest levels of flavonoids (Muir et al., 2001), which represent an important class of secondary metabolites. The 75% methanol/water extract enabled separation by C18-reversed phase LC and detection by both PDA and MS of semi-polar metabolites. Fig. 3.1 shows an example of a chromatogram obtained upon LC-PDA-QTOF-MS analysis of 75% methanol/water extracts from tomato peel. These extracts were stable for several months at –20 °C, as determined by comparing LC-PDA chromatograms. Only naringenin chalcone was observed to decay slowly into naringenin while standing in the autosampler (20 °C) during a series of analyses (about 1.4 μg g⁻¹ FW h⁻¹). In order to test the reproducibility of the LC system, chromatograms of the tomato fruit material that had been analysed over a period of 2 years (>100 samples) were manually compared for retention time shifts using some typical tomato compounds (Table 3.2). Within a single series of analyses, the standard deviation was very small (about 2 s) for all compounds tested. Between series of analyses over this time period, the maximum standard deviation was 30 s, with a maximum retention time window of 1.1 min for naringenin chalcone. During this prolonged period, LC columns of different batches were used.

Figure 3.1. Typical chromatograms obtained from reversed phase LC-PDA-ESI+-QTOF-MS analysis of tomato peel extract: A, total ion signal (QTOF MS); B, absorbance signal (PDA). Retention times (in min) are indicated for the most intense peaks (difference between the two detectors is 0.15 min). Inserts in A show accurate mass (I) and MS/MS spectrum (II) and in B absorbance spectrum (III) obtained for the compound rutin eluting at 23.3 min.


Table 3.2. Retention time shifts observed during LC-QTOF-MS analysis of tomato fruit. Ret (min) = retention time, in minutes; Av = average; StDev = standard deviation; Wd = retention time window.

| Ret (min) | Chlorogenic acid Av | Chlorogenic acid StDev | Chlorogenic acid Wd | Rutin Av | Rutin StDev | Rutin Wd | Naringenin chalcone Av | Naringenin chalcone StDev | Naringenin chalcone Wd |
|---|---|---|---|---|---|---|---|---|---|
| Within series (n=13) | 14.42 | 0.03 | 0.09 | 23.40 | 0.04 | 0.13 | 41.81 | 0.03 | 0.11 |
| In-between series (n=6) | 14.92 | 0.33 | 0.79 | 23.85 | 0.50 | 0.99 | 42.26 | 0.50 | 1.12 |

Comparison of ionisation modes

Since compounds may preferentially ionise in either positive or negative mode in our LC system, which is based on a gradient of acetonitrile acidified with formic acid, we analysed tomato extracts sequentially in both modes and compared the absolute mass signal intensities, expressed in peak heights, of the monoisotopic parent ions of some identified compounds. Phenolic acids and their carboxylic acid derivatives ionised better in negative ionisation mode, while flavonoids generated higher signal intensities in positive ionisation mode (Fig. 3.2). Nitrogen-containing compounds such as phenylalanine and some alkaloids ionised better in positive mode, and were mainly detected as formic acid adducts in negative mode. These adducts were formed in the ionisation source and were readily recognized in MS/MS mode from the loss of 46 Da (formic acid). A loss of 18 Da corresponding to a loss of H2O was also regularly observed in negative ionisation mode.
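The formic acid adducts observed in negative ionisation mode can be predicted from the neutral monoisotopic mass. The sketch below computes the [M-H]- and [M+HCOOH-H]- ions and tests whether two observed ions differ by the mass of formic acid; the function names and the 5 mDa tolerance are assumptions chosen for this illustration.

```python
PROTON = 1.007276        # mass of a proton in Da
FORMIC_ACID = 46.005479  # monoisotopic mass of HCOOH (CH2O2) in Da


def deprotonated(neutral_mass: float) -> float:
    """m/z of the [M-H]- ion."""
    return neutral_mass - PROTON


def formate_adduct(neutral_mass: float) -> float:
    """m/z of the [M+HCOOH-H]- ion formed in the ionisation source."""
    return neutral_mass + FORMIC_ACID - PROTON


def is_formic_acid_loss(precursor_mz: float, fragment_mz: float,
                        tol_da: float = 0.005) -> bool:
    """True if the fragment differs from the precursor by one formic acid."""
    return abs((precursor_mz - fragment_mz) - FORMIC_ACID) <= tol_da


# alpha-Tomatine (C50H83NO21), neutral monoisotopic mass from Table 3.1.
ALPHA_TOMATINE = 1033.5458

print(round(deprotonated(ALPHA_TOMATINE), 4))
print(round(formate_adduct(ALPHA_TOMATINE), 4))
print(is_formic_acid_loss(formate_adduct(ALPHA_TOMATINE),
                          deprotonated(ALPHA_TOMATINE)))
```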

Automatic mass alignment and exact mass calculation

Firstly, the reproducibility of sample preparation and of the subsequent automated extraction and comparison of mass signal intensities (expressed as peak height) using metAlign software (Bino et al., 2005; Vorst et al., 2005) was tested on a data set obtained from LC-MS analysis of 8 replicate extractions of tomato peel. The retention time correction used by the software to align all mass signals was, on average, 2.5 s, which is in accordance with the retention shift observed on manual inspection of the chromatograms (Table 3.2). The overall variation in mass signal intensities between these replicate samples was < 15%.

Automation of the calculation of the accurate mass of detected LC-MS signals was tested using a data set of 44 tomato extracts obtained from both peel and flesh tissues analysed in negative ionisation mode. Upon metAlign-assisted data processing, 4,958 mass signals with signal-to-noise ratios > 3 were extracted. It is known that exact mass measurements on QTOF instruments using lock mass correction provide the highest accuracy at analyte signal intensities that are similar to the lock mass signal (Colombo et al., 2004). To establish the dynamic range in signal intensity for producing high mass accuracy in our TOF MS, the deviation of the manually measured mass (i.e. the mean of the 3 top scans of the extracted mass peak) from the theoretical mass was plotted against the parent mass signal intensity (ion counts at top scan) for some known tomato metabolites (Fig. 3.3). Typically, accurate mass measurements derived from peak intensities lower than the lock mass intensity resulted in a positive deviation from the real mass, while mass measurements from peak intensities higher than the lock mass intensity resulted in a negative deviation. High mass accuracies (i.e. mass deviation less than 5 ppm) were observed within an analyte signal intensity window of 0.25-2.0 times the lock mass. Thus, to automatically calculate correct accurate masses for signals extracted and aligned by metAlign, a script called metAccure (Vorst et al., unpublished data) was programmed to use only those scans with mass signal intensities within this intensity window. In this way, appropriate accurate masses were automatically obtained for 479 (about 10%) of the total mass signals detected in ESI-negative mode, in which isotopes, adducts and fragments are included. This number indicates that for the majority of extracted mass signals, though having a chromatographically relevant signal-to-noise ratio of at least 3, the intensities in the samples analysed were too low to estimate their accurate mass properly, either by automated calculation through metAccure or by manual calculation.
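The intensity-window rule described above can be sketched in a few lines of Python. This is not the metAccure script itself, which is unpublished; the function names, the plain averaging of the retained scans and the example scan values are assumptions made for illustration. Only the 0.25-2.0 window, the 5 ppm criterion and the theoretical [M-H]- mass of rutin (609.1461, Table 3.3) are taken from the text.

```python
from statistics import mean


def ppm_deviation(observed_mz, theoretical_mz):
    """Mass deviation of an observed m/z from its theoretical value, in ppm."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def scans_in_window(scans, lock_intensity, low=0.25, high=2.0):
    """Keep (m/z, intensity) scans whose intensity lies between
    low and high times the lock mass intensity."""
    lo, hi = low * lock_intensity, high * lock_intensity
    return [(mz, inten) for mz, inten in scans if lo <= inten <= hi]


def accurate_mass(scans, lock_intensity):
    """Average the m/z of the scans inside the intensity window.
    Plain averaging is an assumption; returns None if no scan qualifies."""
    kept = scans_in_window(scans, lock_intensity)
    if not kept:
        return None
    return mean(mz for mz, _ in kept)


# Hypothetical scans across one chromatographic peak (m/z, ion counts)
# and a hypothetical lock mass intensity of 400 counts.
EXAMPLE_SCANS = [(609.1440, 1500.0), (609.1455, 700.0),
                 (609.1463, 300.0), (609.1482, 60.0)]
LOCK_INTENSITY = 400.0

mass = accurate_mass(EXAMPLE_SCANS, LOCK_INTENSITY)
print(mass, ppm_deviation(mass, 609.1461))
```

In this made-up example the most intense and the weakest scans fall outside the window and are discarded, mirroring the negative and positive deviations reported for signals above and below the lock mass intensity.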

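[Figure 3.2 (image not reproduced). y-axis: log [mass signal intensity (pos) / mass signal intensity (neg)], scale -1.0 to 3.0; x-axis: metabolite (naringenin, α-tomatine, naringenin chalcone, rutin, 4-caffeoylquinic acid, 5-caffeoylquinic acid, 3-caffeoylquinic acid, quercetin-trisaccharide, phenylalanine).]

Figure 3.2. Peak intensity ratios, on a logarithmic scale, of mass signals (peak height) obtained in positive and negative ionisation modes for some metabolites found in tomato peel extracts.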

Table 3.3. Metabolites that have previously been reported in literature, identified by LC-PDA-ESI-QTOF-MS/MS (negative ionisation mode) in tomato peel extracts. Ret (min) = retention time, in minutes; Av = average; StDev = standard deviation; Av m/z = average found mass signal; UV/Vis = absorbance maxima in the UV/Vis range; Mol Form = molecular formula of the metabolite; Theo. Mass = theoretical monoisotopic mass calculated for the ion [M-H]-; Mean ∆ (ppm) = deviation between the averages of found accurate mass and real accurate mass, in ppm; Putative ID = putative identification of metabolite; ()FA = formic acid adduct; - = data not found; (S) = identification confirmed by the standard compound; I, II, III, IV, V, VI = different isomers (only one reported in literature).

| Ret (min), Av | Ret (min), StDev | Av m/z | UV/Vis | MS/MS fragments | Mol Form | Theo. Mass | Mean ∆ (ppm) | Putative ID |
|---|---|---|---|---|---|---|---|---|
| 9.45 | 0.09 | 341.0883 | - | 179, 135 | C15H18O9 | 341.0878 | 1.52 | Caffeic acid-hexose I |
| 9.75 | 0.08 | 325.0930 | 294sh, 313 | 163 | C15H18O8 | 325.0929 | 0.25 | Coumaric acid-hexose I |
| 10.32 | 0.08 | 341.0883 | 310 | 179, 161, 135 | C15H18O9 | 341.0878 | 1.58 | Caffeic acid-hexose II |
| 11.35 | 0.08 | 341.0883 | 302sh, 318 | 281, 251, 233, 221, 179, 161, 135 | C15H18O9 | 341.0878 | 1.53 | Caffeic acid-hexose III |
| 12.08 | 0.06 | 355.1036 | 290sh, 313 | 193, 177, 145 | C16H20O9 | 355.1035 | 0.31 | Ferulic acid-hexose I |
| 12.58 | 0.07 | 341.0883 | - | 181, 179, 137, 135 | C15H18O9 | 341.0878 | 1.49 | Caffeic acid-hexose IV |
| 13.32 | 0.05 | 341.0883 | - | 281, 221, 181, 179, 161, 137, 135 | C15H18O9 | 341.0878 | 1.39 | Caffeic acid-hexose V |
| 13.43 | 0.07 | 353.0878 | 300sh, 327 | 191, 173, 127 | C16H18O9 | 353.0878 | 0.01 | 3-Caffeoylquinic acid |
| 13.71 | 0.07 | 325.0929 | 285 | 163, 119 | C15H18O8 | 325.0929 | 0.05 | Coumaric acid-hexose II |
| 14.41 | 0.10 | 353.0878 | 295sh, 327 | 179, 173 | C16H18O9 | 353.0878 | -0.08 | 5-Caffeoylquinic acid (S) |
| 15.90 | 0.05 | 355.1036 | - | 193, 175, 160 | C16H20O9 | 355.1035 | 0.42 | Ferulic acid-hexose II |
| 15.98 | 0.06 | 341.0886 | - | 179 | C15H18O9 | 341.0878 | 2.26 | Caffeic acid-hexose VI |
| 16.76 | 0.07 | 353.0880 | 323 | 191, 173, 161, 127 | C16H18O9 | 353.0878 | 0.49 | 4-Caffeoylquinic acid |
| 19.53 | 0.25 | 1272.5901 | - | 1227, 1095, 1065, 933, 866, 770 | C57H95NO30 | 1272.5866 | 2.75 | (Esculeoside B)FA |
| 21.42 | 0.04 | 741.1870 | 256, 299sh, 351 | 301, 271, 255 | C32H38O20 | 741.1884 | -1.82 | Quercetin-hexose-deoxyhexose-pentose |
| 22.83 | 0.06 | 1314.6001 | - | 1269, 1137, 1107, 974, 770, 752 | C59H97NO31 | 1314.5972 | 2.21 | (Lycoperoside G)FA or (Lycoperoside F)FA or (Esculeoside A)FA I |
| 23.43 | 0.04 | 609.1451 | 256, 299sh, 355 | 301, 271, 255 | C27H30O16 | 609.1461 | -1.59 | Quercetin-glucose-rhamnose (S) |
| 25.48 | 0.16 | 1314.6005 | - | 1269, 1137, 1107, 975, 908, 866, 812, 770, 752, 275, 179, 161, 149, 143, 125, 113 | C59H97NO31 | 1314.5972 | 2.54 | (Lycoperoside G)FA or (Lycoperoside F)FA or (Esculeoside A)FA II |
| 26.37 | 0.21 | 1314.6021 | - | 1270, 1138, 1108, 976, 909, 813, 753, 179, 161, 143, 125, 113 | C59H97NO31 | 1314.5972 | 3.74 | (Lycoperoside G)FA or (Lycoperoside F)FA or (Esculeoside A)FA III |
| 26.41 | 0.03 | 593.1505 | 368 | 285 | C27H30O15 | 593.1512 | -1.09 | Kaempferol-glucose-rhamnose (S) |
| 26.44 | 0.39 | 1094.5382 | - | 1049 | C51H85NO24 | 1094.5389 | -0.59 | (Lycoperoside H)FA |
| 32.46 | 0.37 | 1078.5463 | - | 1033, 871, 738, 576, 161, 143 | C51H85NO23 | 1078.5440 | 2.14 | (α-Tomatine)FA (S) |
| 32.59 | 0.22 | 1136.5539 | - | 1091, 958, 928, 796, 635, 149, 143, 113 | C53H87NO25 | 1136.5494 | 3.91 | (Lycoperoside C)FA or (Lycoperoside B)FA or (Lycoperoside A)FA |
| 32.65 | 0.02 | 433.1135 | 315sh, 368 | 271, 151 | C21H22O10 | 433.1140 | -1.21 | Naringenin chalcone-hexose I |
| 41.43 | 0.05 | 271.0617 | 288, 303sh | 151, 119, 107 | C15H12O5 | 271.0612 | 1.84 | Naringenin (S) |
| 41.86 | 0.05 | 271.0615 | 365 | 151, 119, 107 | C15H12O5 | 271.0612 | 1.15 | Naringenin chalcone (S) |

Identification of tomato metabolites

The identification of compounds reported to be present in tomato fruit was carried out using two approaches. Firstly, 19 available standard compounds (see Materials and Methods section) were injected and compared for retention time, accurate mass and UV/Vis spectra with LC peaks detected in the extracts from the pooled peel material of the 96 tomato cultivars. In this way, chlorogenic acid (i.e. 3-caffeoylquinic acid), rutin, kaempferol-rutinoside, naringenin, naringenin chalcone and α-tomatine were identified. Secondly, the chromatograms from the 44 LC-MS data sets were checked for the presence of accurate masses, as calculated by metAccure, corresponding to metabolites that were expected to be detected with our system (Table 3.1). The accurate mass hits were subsequently combined with PDA and MS/MS fragmentation data for further identification and confirmation of metabolites. As an example, data of known tomato metabolites observed in extracts of the pooled peel material of the 96 tomato cultivars, derived by LC-PDA-MS and MS/MS analyses in negative mode, are listed in Table 3.3. In an analogous way, the presence of anthocyanins was confirmed by LC-PDA-QTOF-MS/MS analysis (positive mode) in peel extracts from purple-skin tomato fruits (data not shown). Using this primarily accurate mass-directed targeted approach, about 41% (25 compounds) of the metabolites cited in Table 3.1 were identified in both tomato peel samples. In addition, caffeic acid, ferulic acid, p-coumaric acid, quercetin and kaempferol aglycones could be detected, but only after acid hydrolysis of the extract. All experimental LC-MS information gathered for these metabolites, including retention time window, accurate mass, PDA spectral information and MS/MS data generated at different collision energies, was added to the MoTo DB.

[Figure 3.3 (image not reproduced). y-axis: ∆ ppm, scale -60 to 60; x-axis: log (mass signal intensity), scale 10 to 10,000. Dotted lines mark +5.0 and -5.0 ppm and analyte/lock mass intensity ratios of 0.25 and 2.0.]

Figure 3.3. Difference between observed and theoretical monoisotopic masses, calculated as ∆ppm (y-axis), as a function of the parent ion signal intensity, expressed as ion counts per scan at centre of peak (x-axis, log10-transformed data) for some identified compounds in tomato peel extracts. Threshold levels for mass accuracies between +5 and -5 ppm, and for analyte mass signal intensities between 0.25 and 2.0 times the lock mass signal intensity, are indicated with dotted lines.

Database building

The data from Table 3.1 were used as a foundation upon which to initiate the tomato fruit LC-MS database. From the molecular formula, the accurate mass of each component was calculated using the “Isotopic compositions of the elements 1997” list (Rosman and Taylor, 1998) for accurate mass assignments. The observed mass, together with a mass accuracy setting, is the main search entry for this database (Fig. 3.4). A choice on the entry form is provided to enable ionisation-specific correction of mass spectrometer data, in order to submit the proper mass value of the uncharged molecule to the database. Mass accuracy can be set from 1 to 1,000 ppm, thus enabling the matching of data from detectors generating masses with either low or high accuracy. All other properties of the compounds are stored in a table, which can be accessed from the hit list after mass searching. Each hit suggests either a metabolite previously found in literature and validated by experimental data (Table 3.3) or a novel compound (Table 3.4). Links with the PubChem and MedLine databases are available for extended, external searches on particular or related components. The information for each compound includes molecular formula, molecular mass, CAS number, IUPAC name and analytical properties such as retention time, MS/MS fragments and UV/Vis absorbance maxima, when available. Literature references related to the occurrence in tomato fruit are also listed. Since our aim is to provide a compound database with data from literature and/or experimental MS/MS data, we did not include unknown or novel compounds that have not been validated.
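The two calculations behind the mass search, deriving a monoisotopic mass from a molecular formula and matching an observed ion against the library at a chosen ppm tolerance, can be sketched as follows. This is an illustration under stated assumptions, not the MoTo DB implementation: the function names and the three-entry library are hypothetical, only [M-H]- and [M+H]+ ions are handled, and the atomic masses are standard monoisotopic values rounded to the digits shown.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
# Mass of a proton: hydrogen atom minus one electron.
PROTON = 1.00782503 - 0.00054858

# Hypothetical mini-library; formulas as listed in Table 3.3.
LIBRARY = {
    "rutin": "C27H30O16",
    "kaempferol-rutinoside": "C27H30O15",
    "naringenin": "C15H12O5",
}


def neutral_mass(formula):
    """Monoisotopic mass of an uncharged molecule from its formula."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def to_neutral(mz, mode):
    """Ionisation-specific correction: [M-H]- in 'neg', [M+H]+ in 'pos'."""
    if mode == "neg":
        return mz + PROTON
    if mode == "pos":
        return mz - PROTON
    raise ValueError("mode must be 'neg' or 'pos'")


def search(mz, mode, ppm, library):
    """Return (name, formula, deviation in ppm) for every library entry
    whose neutral mass matches the corrected query within ppm."""
    target = to_neutral(mz, mode)
    hits = []
    for name, formula in library.items():
        mass = neutral_mass(formula)
        deviation = (target - mass) / mass * 1e6
        if abs(deviation) <= ppm:
            hits.append((name, formula, round(deviation, 2)))
    return hits


# Query with the observed rutin signal from Table 3.3 (negative mode).
print(search(609.1451, "neg", 5.0, LIBRARY))
```

Widening the tolerance towards the upper end of the 1 to 1,000 ppm range makes the same search usable for nominal-mass detectors, at the cost of more candidate hits per query.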

[Figure 3.4 (image not reproduced). Panel A: flow diagram with the elements accurate mass, Ret, PDA, MS/MS, MoTo, putative identification, standards, NMR and identification. Panel B: screenshot of the database query frame.]

Figure 3.4. A. Strategy applied for data analysis and identification of metabolites in tomato fruit, using LC-PDA-QTOF MS. Key entry into the database is the (intensity-corrected) accurate mass. B. Screenshot from the MoTo database query frame. Detected masses can be filled in (in this example m/z 609 in negative ionisation mode) and searched against the database at user-defined mass accuracy (first frame). If at least one mass hit is found in the database, the elemental compositions, deviations from accurate masses and IUPAC names of the corresponding metabolites are indicated, as well as links to PubChem, if applicable, and our own experimental data (second frame). The last frame shows the experimental and literature information available for the selected compound.

Comparison of metabolic profiles of peel and flesh tissues

The applicability of the LC-MS platform and metabolite database to automatically extract and annotate (differentially accumulating) mass signals was tested with red ripe fruits of tomato cultivar Money Maker. Since we are interested in the differential distribution of metabolites and their biochemical pathways between tomato fruit tissues, peel and flesh material was separated from whole ripe fruits and analysed by LC-PDA-ESI-QTOF-MS in both positive and negative ion modes. After automatic peak extraction and alignment of samples per ionisation mode using metAlign, 2,944 mass signals (signal-to-noise ratio > 3) were obtained in negative mode and 4,059 in positive mode. Since both tissues had similar water content (i.e. flesh: 94%, peel: 93%; n=8; determined by freeze-drying), the intensities of their mass signals were directly comparable. For each aligned mass peak, the extracts from both tissues were compared for significant differences in signal intensity (based on 8 extraction repetitions) using the Student's t-test tool within metAlign. As expected, the mass profiles of these fruit tissues were markedly different. About 38% of the total number of mass signals detected were significantly (≥ 1.5-fold) higher in the peel extracts than in the flesh extracts (1,095 signals for negative mode and 1,566 for positive mode), and about 25% were higher in flesh than in peel (794 for negative mode and 880 for positive mode). Chromatographic mass peaks detected in negative ionisation mode that were significantly different between the extracts from both tissues are visualised in Fig. 3.5. Subsequent metAccure-assisted accurate mass calculation of the differential mass peaks and searching for analogous masses in the MoTo database indicated that flavonoids and derivatives thereof, and α-tomatine, occurred mainly in the peel extracts. On the other hand, some phenylpropanoids (h, 52-fold; i, 2-fold) as well as glycosylated steroids such as glycosylated spirosolanols (j, 130-fold) were significantly higher in the flesh extracts. An intense mass signal, k, was solely detected in the extracts from flesh tissue and could be identified as the parent ion of a hydroxyfurostanol-tetrahexose (e.g.
tomatoside A) from the accurate mass observed ([M-H]- = 1081.5442, C51H85O24-, 1.0 ppm difference from theoretical mass) and its MS/MS fragmentation pattern.

Table 3.4. Novel metabolites identified or putatively assigned by LC-PDA-ESI-QTOF-MS/MS in tomato fruit extracts (abbreviations as in Table 3.3).

| Ret (min), Av | Ret (min), StDev | Av m/z | UV/Vis | MS/MS fragments | Mol Form | Theo. Mass | Mean ∆ (ppm) | Putative ID |
|---|---|---|---|---|---|---|---|---|
| 4.74 | 0.05 | 299.0771 | 251 | 137 | C13H16O8 | 299.0772 | -0.48 | Hydroxybenzoic acid-hexose |
| 7.42 | 0.07 | 380.1558 | - | 146 | C15H27NO10 | 380.1562 | -1.11 | Pantothenic acid-hexose |
| 12.99 | 0.05 | 431.1557 | - | 269, 161, 143, 125, 119, 113, 101 | C19H28O11 | 431.1559 | -0.43 | Benzyl alcohol-dihexose |
| 14.76 | 0.05 | 771.1989 | 263sh, 351 | 609, 463, 301 | C33H40O21 | 771.1989 | -0.01 | Quercetin-dihexose-deoxyhexose |
| 15.47 | 0.06 | 595.1665 | - | 475, 385, 355 | C27H32O15 | 595.1668 | -0.51 | Naringenin chalcone-dihexose or Naringenin-dihexose |
| 15.82 | 0.04 | 401.1452 | - | 293, 269, 233, 191, 161, 149, 131, 125, 101 | C18H26O10 | 401.1453 | -0.37 | Benzyl alcohol-hexose-pentose |
| 24.77 | 0.15 | 1312.5872 | - | 1266, 1135, 1105 | C59H95NO31 | 1312.5815 | 4.33 | (Dehydrolycoperoside G)FA or (Dehydrolycoperoside F)FA or (Dehydroesculeoside A)FA |
| 27.05 | 0.12 | 515.1193 | 301sh, 323 | 353, 335, 191, 179, 173 | C25H24O12 | 515.1195 | -0.45 | Dicaffeoylquinic acid I |
| 27.60 | 0.07 | 515.1191 | 301sh, 323 | 353, 191, 179 | C25H24O12 | 515.1195 | -0.72 | Dicaffeoylquinic acid II |
| 29.71 | 0.07 | 515.1188 | 301sh, 327 | 353, 299, 203, 191, 179, 173, 135 | C25H24O12 | 515.1195 | -1.40 | Dicaffeoylquinic acid III |
| 30.11 | 0.04 | 887.2246 | 256, 301sh, 323 | 741, 723, 301, 271, 255, 179 | C41H44O22 | 887.2251 | -0.57 | Quercetin-hexose-deoxyhexose-pentose-p-coumaric acid |
| 32.16 | 0.03 | 433.1137 | 307sh, 360 | 271, 151 | C21H22O10 | 433.1140 | -0.84 | Naringenin chalcone-hexose II |
| 38.40 | 0.08 | 677.1503 | 301sh, 327 | 515 | C34H30O15 | 677.1512 | -1.29 | Tricaffeoylquinic acid I |
| 39.78 | 0.11 | 677.1493 | 292sh, 325 | 515, 353, 335, 179, 173 | C34H30O15 | 677.1512 | -2.82 | Tricaffeoylquinic acid II |
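The per-signal comparison of peel and flesh extracts described in the section above (a t-test over 8 replicate extractions combined with a 1.5-fold threshold) can be sketched as follows. This does not reproduce the metAlign tool: the pooled-variance form of the test, the 5% two-sided significance level, the function names and the example peak heights are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided 5 % critical value of Student's t for 14 degrees of freedom
# (8 + 8 replicates). The significance level is an assumption.
T_CRIT_DF14 = 2.145


def student_t(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def classify_signal(peel, flesh, min_fold=1.5):
    """Label one aligned mass signal from its replicate peak heights."""
    if len(peel) + len(flesh) - 2 != 14:
        raise ValueError("critical value is tabulated for 8 + 8 replicates only")
    if abs(student_t(peel, flesh)) < T_CRIT_DF14:
        return "not significant"
    fold = mean(peel) / mean(flesh)
    if fold >= min_fold:
        return "higher in peel"
    if fold <= 1 / min_fold:
        return "higher in flesh"
    return "significant, below fold threshold"


# Hypothetical peak heights for one mass signal in 8 replicate extracts.
PEEL = [980, 1010, 1005, 995, 1020, 990, 1000, 1015]
FLESH = [510, 495, 505, 500, 490, 515, 498, 502]
print(classify_signal(PEEL, FLESH))
```

Applying such a rule to every aligned signal in both ionisation modes yields the kind of counts reported in the text; with several thousand signals tested, a correction for multiple testing would normally also be considered.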

DISCUSSION

Metabolomics is developing as an important functional genomics tool. Technical improvements in the large-scale determination of metabolites in complex plant tissues and dissemination of metabolomics research data are essential (Sumner et al., 2003; Bino et al., 2004). A major challenge is to construct consolidated metabolite libraries, and to develop metabolite-specific data management systems. Here we set out to establish a reproducible LC-PDA-MS-based metabolomics platform including an LC-MS metabolite database and mass-directed searching tools for a commonly used plant material, i.e. tomato fruit. An in-depth literature study was performed to obtain as much information as possible on metabolites previously detected in tomato fruits. Because tomato is an important crop, numerous analytical studies aimed at identifying its constituents have been performed. However, a number of problems arise when building such a database from the literature. Firstly, finding the exact identity of a specific natural compound can be troublesome since common names or non-IUPAC nomenclatures are often used. Secondly, the validity of at least some of the compounds assigned in studies performed without MS or NMR technologies may be questioned. Thirdly, it is known that harsh conditions during sample preparation may produce artefacts, which can result in the correct identification of a compound that does not occur in the original biological sample. For instance, it has long been thought that the flavanone

[Figure 3.5 (image not reproduced). Panels A, B and C, each scaled to 2.54 × 10^4; peaks labelled a to k.]

Figure 3.5. Unbiased LC-QTOF MS based comparative profiling of aqueous-methanol extracts from peel and flesh tissues from ripe tomato fruit (var. Money Maker). Mass chromatograms (m/z 100-1,500) were acquired in ESI negative mode. Retention times (in min) and nominal masses of the most intense signals are indicated in the chromatograms (plotted as BPI, base peak intensities, from 4 to 50 min). A, representative original chromatogram of peel tissue; B, representative original chromatogram of flesh tissue; C, differential chromatogram for metabolites that are significantly (p