Trends in Analytical Chemistry 60 (2014) 71–79
Chemometrics in foodomics: Handling data structures from multiple analytical platforms

Thomas Skov a,*, Anders H. Honoré a,b, Henrik Max Jensen b, Tormod Næs a,c, Søren B. Engelsen a

a Spectroscopy and Chemometrics, Department of Food Science, Faculty of Science, University of Copenhagen, Denmark
b DuPont Nutrition Biosciences ApS, Advanced Analysis, Edwin Rahrs vej 38, 8220 Brabrand, Denmark
c Nofima, Oslovegen 1, 1430 Ås, Norway
ARTICLE INFO

Keywords: Correlation studies; XC-MS; NMR; Foodomics; Data processing; Data validity; Metabolomics; Multi-block chemometrics; Multivariate data analysis; Pearson correlation

ABSTRACT
Foodomics studies are normally concerned with multifactorial problems, and it makes good sense to explore and to measure the same samples on complementary, synergistic analytical platforms that comprise multifactorial sensors and separation methods. However, the challenge of exploring, extracting and describing the data increases exponentially. Moreover, the risk of becoming flooded with non-informative data increases concomitantly. Acquisition of data from different analytical platforms provides opportunities for checking the validity of the data, comparing analytical platforms and ensuring proper data (pre)processing – all in the context of correlation studies. We provide practical and pragmatic tools to validate and to deal advantageously with data from more than one analytical platform. We emphasize the need for complementary correlation studies within and between blocks of data to ensure proper data handling, interpretation and dissemination. Correlation studies serve as a preliminary step prior to multivariate data analysis or as an introduction to more advanced multi-block methods. © 2014 Elsevier B.V. All rights reserved.
Contents

1. Introduction
2. Correlation studies – possibilities and pitfalls
3. Multivariate challenges and opportunities
4. Improved pre-processing through intra-block and inter-block correlation studies
5. Simple methods – powerful correlations!
6. Multi-block methods and their potential in foodomics
7. Outreach
References
1. Introduction

Foodomics, as defined here, is more than classical quality control (QC) and bulk chemical analysis of food and/or body fluids in relation to food control and nutritional research, and it is more than just the use of advanced analytical platforms for analyzing sample composition. In the following, we use the term foodomics to mean the simultaneous analysis of many samples using a high-throughput analytical platform, such as gas/liquid
chromatography (GC/LC = XC) followed by detection by mass spectrometry (MS) and/or nuclear magnetic resonance (NMR), which can simultaneously detect hundreds or thousands of metabolites in biological samples [1].

Seen from a physical, chemical and biological perspective, food systems (and biofluids) are complex multifactorial systems containing mixtures of heterogeneous classes of molecules and complex physical structures, such as amorphous solids, aqueous solutions, gels, macromolecules, macroorganelles, cells, crystals, pores and cavities. The complex nature of food samples thus makes extraction and separation on a specific column (GC or LC), combined with a mass separator, the preferred choice of analytical method. Other, less destructive techniques, such as NMR and vibrational spectroscopy
(near-infrared, infrared and Raman) can also be applied, but they cannot match the sensitivity of coupled separation and MS systems. With XC-MS, the data output becomes two-dimensional and very large, and it is more challenging and time consuming to extract proper information directly from the raw data. Exploration and processing of XC-MS data are further challenged by the presence of artifacts, such as irrelevant signals from column components, skewness in the data due to mass uncertainties and ion suppression, shifted peaks across samples, column deterioration over time, baseline drift and injection variations across samples. NMR data contain fewer spectral artifacts, but are challenged by pH and ionic effects and water-signal contributions. Pre-processing of the data is therefore a significant part of the whole foodomics data-analysis chain.

Applications of foodomics include studies of foods for control, safety, functionality, authenticity, traceability, freshness, contaminants, toxicity, bioactivity and effects on human health. Foodomics studies are thus often designed to investigate well-defined scientific questions related to food and its consumption. For example, the problem under investigation can be: “What causes a diet to be healthier than the control standardized diet?” The beneficial health effect may be known, but the mechanism and the metabolites responsible remain elusive. When analyzing data from such studies, the target is clear: find the differences in the metabolite profiles that can be related to the experimental design (i.e., the two diets). Even though the question is simple, the complexity and the size of foodomics data from several analytical platforms, the information from hundreds or even thousands of metabolites and the natural biological variation between individuals make exploration of the data non-trivial.

Foodomics thus cannot be defined without taking into account the current methodologies for exploring and exploiting patterns and relations in the mega/multivariate data structures that are generated. Classical chemometric methods and methods originating from different or related research fields are available and can be transferred directly to foodomics. Chemometrics, or multivariate data analysis, is a powerful approach to explore correlations or co-variations in such datasets using only minimal a priori assumptions. This can be done with (supervised) or without (unsupervised) a priori knowledge about the experimental design. The nature of many foodomics studies makes the investigation of correlations between data blocks an important aspect, not only of the exploration of the data, but also of the validation of the data and the necessary pre-processing of the data.

This review aims to provide a toolbox of simple methods and a guideline for data exploration in foodomics studies that include data from several analytical platforms. We advocate a systematic combination and fusion of data blocks to assess the analytical performance, to improve the interpretability and to prevent reporting bias and unreliable results. We include “eye-opener” examples of how to use well-established chemometric techniques in novel ways in order to augment the information content and to enhance the understanding of foodomics data.
Fig. 1. Venn diagram. Metabolite information found in data obtained from analytical platforms is at best both qualitative and quantitative. The identification of metabolites obtained from two platforms might show that some metabolites are shared (qualitatively similar), but, often, a Venn diagram does not state whether the metabolites are quantitatively comparable.
2. Correlation studies – possibilities and pitfalls

Correlation studies range from simple univariate comparisons [2,3] to high-dimensional patterns linking many metabolites together through various chemometric methods. Linking two analytical platforms is often done through a Venn diagram, as shown in Fig. 1. The identification of metabolites obtained from two platforms might show that some metabolites are shared (qualitatively similar), but, often, a Venn diagram does not state whether the metabolites are quantitatively similar. For some analytical platforms, this is not expected, due to, e.g., “being out of linear range” or “major differences in selectivity”, whereas, for other metabolites, the quantitative nature should be comparable. In-depth knowledge about the analytical behavior and the capabilities of the instrumental techniques greatly helps in avoiding incorrect interpretation of shared metabolites, but, without correlation studies, the risk of examining non-validated (quantitatively) metabolites is large and can hamper subsequent chemometric models and hence the interpretation of foodomics studies.

In the simplest case, a correlation (or covariation) spectrum against a given reference constitutes a very powerful correlation study {e.g., the correlation between visible (VIS) and near-infrared (NIR) spectra of pectin and the degree of esterification (%DE) (see Fig. 2) [4]}. Fig. 2 highlights the impact of correlation studies for finding important regions in complex spectra without a model or with only simple models. This simple case illustrates the usefulness of the approach and the danger of misinterpretation. It is well known that the methyl ester stretch is found at 2244 nm in the NIR spectrum (circle), and indeed a highly positive correlation is found
Fig. 2. A typical covarygram showing the correlation (and covariation) between Standard Normal Variate (SNV)-corrected blocks of data (visible and near-infrared spectroscopy) and a univariate reference value: the degree of esterification (%DE) [4]. The black line is the average spectrum, the blue line is the covariation spectrum (scaled to an absolute maximum intensity of 1) and the red line is the Pearson correlation spectrum. {Modified from [4]}.
at this wavelength. However, the negative correlations found at 1990 nm and 2202 nm are stronger; this is basically a simple case of the biological cage of covariance: when one correlation goes up, others go down (i.e., when pectins are esterified, the amount of galacturonic acid goes down). This approach can be extrapolated directly to correlation analysis between two blocks of data.

Direct correlation analysis between blocks of data was first explored by Noda in the two-dimensional (2-D) correlation method between data from different vibrational spectroscopies [5]. The results of this approach comprise a number of features that are highly correlated in the two blocks, and it has been exploited particularly in NMR metabolomics in order to correlate different NMR methods [6], to correlate NMR data to disease status for latent biomarker identification [7] and to correlate NMR data to UPLC-MS data in toxicology studies [8]. Such simple correlation analyses, performed even before introducing any chemometric model, are extremely powerful. Correlation analyses prior to subsequent data modeling can enhance data understanding, create awareness of potential problems in data models, guide the choice of pre-processing method and expose incoherence in data obtained from different analytical platforms. Correlation studies can be seen as a simple first step towards the chemometric multi-block approaches mentioned at the end of this review.
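As a minimal sketch of such a correlation/covariation spectrum (the matrix X of pre-processed spectra and the reference vector y, e.g., %DE, are hypothetical names of our choosing, not objects from the original study), a covarygram of the kind shown in Fig. 2 could be computed along these lines:

```python
import numpy as np

def correlation_spectrum(X, y):
    """Pearson correlation and covariation of each spectral variable with a reference.

    X : (n_samples, n_variables) matrix of pre-processed spectra (e.g., SNV corrected)
    y : (n_samples,) reference values (e.g., degree of esterification, %DE)
    Returns (corr, cov_scaled): the correlation spectrum and the covariation
    spectrum scaled to an absolute maximum of 1, as in the covarygram of Fig. 2.
    """
    Xc = X - X.mean(axis=0)            # center each spectral variable
    yc = y - y.mean()                  # center the reference
    cov = Xc.T @ yc / (len(y) - 1)     # covariation of every variable with y
    corr = cov / (Xc.std(axis=0, ddof=1) * yc.std(ddof=1))
    return corr, cov / np.abs(cov).max()

# Example with simulated data: 30 samples, 500 spectral variables
rng = np.random.default_rng(0)
y = rng.uniform(20, 80, size=30)                        # hypothetical %DE values
X = np.outer(y, rng.normal(size=500)) + rng.normal(size=(30, 500))
corr, cov_scaled = correlation_spectrum(X, y)
```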
3. Multivariate challenges and opportunities

In classical empirical research, a model requires that the number of variables be smaller than the number of observations, but developments in modern analytical platforms have pushed scientists far beyond this classical setting. Typical foodomics studies include 100–500 samples. If these samples were measured by NIR spectroscopy, as in a classical QC analysis setup, typically 1000 spectral variables would be recorded. Such data sets, of the order of 100 × 1000, can advantageously be handled with chemometrics, which efficiently models collinear data sets with many more variables than samples. However, nowadays it is common for the analytical platform to record more than 100,000 variables for each sample, which pushes the chemometric tools to the limit, as this increase also raises the chance of spurious correlations and introduces more interferences. Several techniques and strategies are available to reduce the number of variables, both before and during modeling. When measuring many variables, a large proportion of them will be irrelevant, and it is advantageous to remove such variables before data analysis. In foodomics studies, irrelevant variables are often unsystematic variables, noisy variables or variables containing information different from that being explored (e.g., the difference between two diets). As variables in foodomics studies are often directly linked to metabolites, the term metabolites will also be used in the following.

The study of correlations between information found in a single metabolite or multiple metabolites from multiple analytical platforms provides a novel way of validating the extracted information. A metabolic fingerprint would ideally represent the detection and quantification of all small molecules present in a biological system at a given time. However, the analytical platforms used for separation and detection of metabolites do not enable the identification of all in-vivo metabolites, mainly due to inadequate compound extraction and the dynamic range of the metabolome. Another issue is that different analytical platforms might not respond identically to the same metabolite [9]. Moreover, Venn diagrams, which show shared and unique metabolites from different analytical platforms, do not take into consideration the quantitative nature of the shared metabolites. Quantitative discrepancy might lead to too many metabolites being reported as important. A study of correlations
between analytical platforms can therefore provide valuable information with regard to the removal of irrelevant metabolites.

Access to several analytical platforms in many foodomics laboratories makes the classical approach of using only one analytical platform {e.g., NMR or LC-MS [10–12]} sub-optimal. The need for comprehensive coverage of the foodome in untargeted studies calls for the use of complementary analytical platforms [13–18]. The use of more than one analytical platform will provide several sources of information, as indicated in Table 1. Very often, the structure(s) of the data used for chemometric models depends on the routines applied in the foodomics laboratory. The data structures presented in Table 1 can also be analyzed with more complex correlation studies (see the references included in the different sections), which might be too involved for a standard foodomics set-up. However, in an advanced set-up, it is sometimes rewarding to investigate correlations not only within blocks of data but also between blocks of data. How to approach the multiway structure of, e.g., LC-MS data in foodomics studies is not dealt with here but has been discussed elsewhere [21].

In order to highlight the advantage of multivariate correlation studies, one can compare simple univariate and multivariate relationships on simple data, as visualized in Fig. 3. Univariate correlation studies are often too simple, and rather simple multivariate approaches based on well-selected variables can reveal much more of the underlying information in the data. Increasing the number of variables also increases the chance of finding even more (co)relations describing a similar difference between samples, thereby confirming the findings presented in Fig. 3.

One area within foodomics where such observations can be particularly useful is the validation of extracted metabolites from CE-MS, GC-MS or LC-MS studies. This extraction is normally performed using dedicated software packages that use different principles for obtaining relative concentrations of the metabolites [22,23]. Examples of software packages for high-resolution MS data are freeware, such as MZmine [24,25], XCMS [26,27] and MetAlign [28,29], and vendor software, such as MarkerLynx (Waters), Sieve (Thermo) or ProfileAnalysis (Bruker), but advanced peak-deconvolution methods can also be applied [21]. Which software to use depends greatly on the research environment and the chosen analytical platform. The main issue is that these software packages do not provide the same results – neither qualitatively (the metabolites identified by a retention time-m/z pair) nor quantitatively (the relative concentrations found in the different samples) [30]. In a recent metabolomics study, several software packages were compared and the shared and unique metabolites were reported [31]. Adding correlation studies on top of such data would further validate the shared metabolites and how well the different software packages extract the same information, as visualized in Fig. 4. In the study by Gürdeniz et al., some metabolites were not described in the same way by all software packages, and this could not be seen in the reported Venn diagrams; it was only observed through subsequent (quantitative) correlation studies. Selecting just one software package will thus bias the results for some metabolites (blue boxes), whereas, for other metabolites, the outcome will be the same (red boxes).
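A minimal sketch of such a quantitative check of shared metabolites, assuming two hypothetical peak tables from different software packages that have already been matched by retention time-m/z pair (the function and variable names are ours, not from [31]):

```python
import numpy as np
from scipy.stats import pearsonr

def validate_shared_metabolites(table_a, table_b, names, r_threshold=0.8):
    """Flag shared metabolites whose quantitative profiles disagree between packages.

    table_a, table_b : (n_samples, n_shared) arrays of relative intensities for the
    same samples and the same (matched) metabolites, from two software packages.
    """
    flagged = []
    for j, name in enumerate(names):
        r, _ = pearsonr(table_a[:, j], table_b[:, j])
        if r < r_threshold:   # qualitatively shared but quantitatively inconsistent
            flagged.append((name, round(r, 2)))
    return flagged
```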
This simple comparison demonstrates the need for correlation studies, even within the same analytical platform; quantitative consensus between software packages will most likely lead to even better foodomics studies. In a similar manner, correlation studies will provide additional validation in foodomics studies where the extraction of metabolites can be done in different ways.

In general, the inspection of loading plots (as done in Fig. 4) is essential in correlation studies, as loadings consistently present all metabolite correlations found in the data. Depending on the location of the metabolites in the loading plots, it is possible to extract
Table 1
Possible correlation studies with data from more than one analytical platform: A) Principal Component Analysis (PCA) on augmented data blocks; B) Univariate regression; C) Multivariate regression (e.g., Partial Least Squares regression – PLS); D) Multiway regression (e.g., N-PLS) [19]; E) ANOVA PLS [20]; F) PLS2 regression; G) Multiway data blocks are normally transformed to be able to use cases A–F. Not all cases are illustrated in this review and the reader is referred to the references given for more elaborate examples. For each case, the correlation method and the suggested foodomics approach (new ways of using the chemometric tools) are listed.

Case A – PCA: Augment data blocks and look for correlations between similar and/or different features found in the different blocks; use similar features to assess block consistencies.
Case B – Univariate regression: How related are two selected features?
Case C – Multivariate regression (e.g., PLS): Optimize the pre-processing of foodomics data.
Case D – Multiway regression (N-PLS): Exploit the multiway nature of data when predicting.
Case E – ANOVA PLS: Evaluate how the experimental design affects the multivariate block information.
Case F – PLS2 regression: Assess the quantitative similarities between two analytical platforms (e.g., NMR/LC-MS or hydrophilic interaction/reversed-phase LC).
Case G – Multiway methods (PARAFAC, PARAFAC2, etc.): Transfer multiway data into discrete features (e.g., metabolite intensities) that can be used in cases A–F.
valuable information pertaining to metabolite correlations, and it can be evaluated which blocks can be combined. However, one of the difficulties with loading plots is the scaling of the data beforehand. Very often, the visual appearance of the loading plot will reflect the chosen scaling technique (e.g., no scaling, unit variance scaling or Pareto scaling), which has the potential to degrade the real knowledge gained from this plot. As an example, unit variance scaling of data with thousands of metabolites will provide a loading plot with a crowded, centered circle in which far too many metabolites are regarded as important (Fig. 5). How to evaluate such a loading plot
Fig. 3. Example of the inadequate use of single metabolites to differentiate between two classes in the experimental design [e.g., a bioactive food product (blue bars and dots) and a reference standard diet (yellow bars and dots)].
Fig. 4. Correlation studies by PCA in an LC-MS study. The loading plot is a simulated plot of shared and unique metabolites extracted with XCMS, MarkerLynx and MZmine, respectively (metabolite names are not included). The highlighted metabolites are shared, but they show different correlation patterns, which are not represented in, e.g., a Venn diagram. In the univariate correlation plots, the real data for two selected metabolites are shown. This highlights the importance of combining Venn diagrams with multivariate and univariate correlation studies to discover, e.g., a problem often seen for MarkerLynx with too many zeros in the data. More information about the study design can be found in [31].
is a tricky exercise. One option is to reduce the importance of low-intensity metabolites, e.g., by Pareto scaling, which has the advantage that fewer metabolites need to be examined, but also carries the risk of overlooking important low-intensity metabolites.
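For illustration, a minimal sketch of how the three scaling options compared in Fig. 5 could be applied to a hypothetical peak table before extracting PCA loadings; scikit-learn and numpy are simply our tools of choice here, not tools prescribed by the text:

```python
import numpy as np
from sklearn.decomposition import PCA

def scale_and_load(X, scaling="uv", n_components=2):
    """PCA loadings of a mean-centered peak table under different column scalings.

    scaling: "none" (mean centering only), "pareto" (divide by the square root of
    the standard deviation) or "uv" (divide by the standard deviation).
    """
    Xc = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    if scaling == "pareto":
        Xc = Xc / np.sqrt(sd)
    elif scaling == "uv":
        Xc = Xc / sd
    pca = PCA(n_components=n_components).fit(Xc)
    return pca.components_.T            # one row of loadings per metabolite

# Loadings for the same (simulated) peak table under the three scalings
rng = np.random.default_rng(4)
X = np.abs(rng.normal(size=(30, 500))) * rng.gamma(2.0, 5.0, size=500)
loadings = {s: scale_and_load(X, s) for s in ("none", "pareto", "uv")}
```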
One approach to avoid ending up with too few or too many metabolites is to use the loading plots in a combined manner. For example, we recommend an approach where all three (or more) scaling techniques are evaluated simultaneously or
Fig. 5. The use of different scaling techniques. Left: No scaling, Middle: Pareto scaling, and Right: Unit variance scaling.
[Fig. 6, panels A–I: predicted versus reference plots of creatinine for the nine normalization/scaling combinations described in the caption below; the panels report Pearson correlations (r) from 0.77 to 0.98 and RMSECV values from 0.63 to 2.14, each for a PLS model with the indicated number of components.]
Fig. 6. Predicted versus measured plots from PLS models of creatinine levels, each using a number of principal components (PCs) (e.g., PC#4), for different normalization and scaling techniques. (A–C) Perdeuterated 3-(trimethylsilyl) propionate sodium salt (TSP)-normalized spectra predicting creatinine concentration using three different scaling techniques; (D–F) 2-norm-normalized spectra predicting creatinine concentration using three different scaling techniques; and (G–I) volume-normalized spectra predicting total creatinine excretion. The points have been colored according to increasing creatinine concentration/excretion; cyan corresponds to low and magenta to high creatinine concentration/excretion.
using so-called correlation loadings [32]. In the latter approach, a certain distribution of samples along the different principal components (e.g., in a PCA score plot) can be correlated to all metabolites (scores are correlated to metabolite intensities across samples). In this manner, a distribution of samples found in an intervention study (as given in the score plot) can be investigated in all metabolite profiles, regardless of intensity level in the raw data. This also ensures that low-intensity metabolites are evaluated in a given model space.

4. Improved pre-processing through intra-block and inter-block correlation studies

Data pretreatment, scaling and normalization are often essential for a successful foodomics study. Recorded metabolite profiles, measured from food or biofluid samples, are often subject to systematic variations in intensity across all measured variables, which can mask the biologically relevant differences between profiles. The sources of obscuring variation include inhomogeneity of samples, minor differences in sample preparation and instrumental perturbations, and the data-extraction steps may introduce additional error [33]. Data normalization (vertical scaling between samples) and data scaling (horizontal scaling between variables) are typically applied to remove unwanted systematic bias in measurements of ion or signal intensity while retaining the interesting biological information. While the normalization of metabolite profiles is intended to remove this spurious between-sample variation from the data, paradoxically, the application of any normalization
procedure can also lead to degradation of the biological information and introduction of spurious correlations. The chosen normalization method can thus have a decisive effect on the outcome of subsequent pattern-recognition methods. Several established methods exist for normalizing metabolite profiles, and each method is based on certain assumptions about the nature of the data. Some of the most common normalization methods are:

(1) unit sample intensity sum;
(2) unit sample vector length (Euclidean norm);
(3) probabilistic quotient/median fold-change [34];
(4) a reference feature present in all samples at a constant level;
(5) histogram matching [35], which seeks to match the histogram of intensities in a spectrum as closely as possible to that of a reference spectrum; and,
(6) entropy-related methods [21].
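As an illustration of method (3), a minimal numpy sketch of probabilistic quotient (median fold-change) normalization, assuming a peak table with samples in rows; this is one simple reading of the approach in [34], not a reference implementation:

```python
import numpy as np

def pqn_normalize(X, reference=None):
    """Probabilistic quotient normalization (median fold-change) of a peak table.

    X : (n_samples, n_features) array of intensities (samples in rows).
    reference : optional reference spectrum; the median spectrum is used by default.
    Each sample is divided by the median of its feature-wise quotients to the
    reference, so dilution differences (e.g., urine hydration status) are removed.
    """
    X = np.asarray(X, dtype=float)
    X = X / X.sum(axis=1, keepdims=True)          # integral normalization first
    if reference is None:
        reference = np.median(X, axis=0)
    ref = np.where(reference > 0, reference, np.nan)   # ignore zero-intensity features
    quotients = X / ref                            # feature-wise fold changes
    dilution = np.nanmedian(quotients, axis=1)     # per-sample dilution factor
    return X / dilution[:, None]
```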
Since any normalization method may be destructive to the quantitative information buried in the metabolomics data, it is always recommended to validate the chosen normalization method externally. This can optimally be done if, for example, external quantitative knowledge from chemical reference parameters is available. A simple multivariate regression to a given metabolite can then evaluate whether the normalization degrades or improves the correlation. The use of external validation is demonstrated in a simple case study described below and shown in Fig. 6, which concerns an NMR study of human urine as part of an intervention study
concerning protein intake [36]. Urine samples have an inherent problem with normalization, as the urine may contain different amounts of dry matter related to the hydration status of the study subject and the time since the last urination. Any measurement of urine thus requires normalization. In this case study, both NMR spectra and a chemical reference measurement of creatinine were obtained for all urine samples. It was assumed that the best normalization method could then be selected as the one that best conserved the creatinine information [37], and this was studied through correlation analysis in the form of PLS models. Fig. 6 shows how the regression performance between the NMR spectra and the externally measured creatinine concentration/excretion is influenced by different normalization methods. The normalization providing the best model (here mean centering and TSP normalization) then serves as a good choice for how to pre-process the NMR data. If QC samples had been available, the best pre-processing method could also have been studied with simple PCA models, using the proximity of the QC samples in the score plot as an optimization criterion. Other scaling and normalization methods have been suggested for foodomics data and, in combination, they serve as an important pre-step [38]. Such “supervised guidance” of pre-processing methods can also be applied to other foodomics data blocks where external knowledge is available.

Together with the single-block scaling and normalization issues mentioned above, additional considerations are required when combining several blocks of data. When combining different blocks of data, scaling of the individual blocks becomes important. Often, there are orders-of-magnitude differences between the variations represented in the different blocks, which will, in turn, bias unscaled modeling. For example, one chemical compound may be represented in an NMR or LC-MS peak with 100 data points, while another metabolite may be represented by one or a few variables. Such a mismatch in magnitude of variation will lead to biased models, especially as most multivariate models favor high variation. Combining blocks of different nature (e.g., continuous, discrete or univariate) is a challenging task, and several methods have been proposed for achieving this [39–42]. Data blocks that originate from multiple analytical platforms will have very different ranges and noise characteristics. Scaling must be done carefully and, if possible, separately for each data block. However, not only is scaling within each block important; scaling between blocks also matters, because larger blocks otherwise become more dominant [43,44].

The simplest solution to the multi-block scaling problem is to convert all blocks of data into discrete variables, which can then be concatenated and scaled to unit variance (UV scaling). This ensures that all variables have an equal chance of affecting the model and that local and minor types of information/variation are more easily found, due to the noise reduction obtained when turning continuous data into discrete variables. However, it is important to remember that one block of data with many correlated discrete variables could still dominate the variation explained by the model; as such, UV scaling does not ensure that each block is equally emphasized by the model.
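A minimal sketch of the concatenate-and-scale strategy described above, assuming two hypothetical metabolite tables measured on the same samples; dividing each autoscaled block by the square root of its number of variables is one simple block-weighting choice among those discussed, not the only one:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fuse_blocks(blocks):
    """Concatenate several metabolite tables (samples in rows) for a joint PCA.

    Each block is autoscaled (unit variance per variable) and then divided by the
    square root of its number of variables, so that every block contributes the
    same total variance regardless of how many variables it holds.
    """
    scaled = []
    for X in blocks:
        Xs = StandardScaler().fit_transform(X)        # UV scaling within the block
        scaled.append(Xs / np.sqrt(X.shape[1]))       # down-weight large blocks
    return np.hstack(scaled)

# Hypothetical example: an NMR peak table and an LC-MS peak table on the same samples
rng = np.random.default_rng(1)
nmr = rng.normal(size=(40, 60))       # 40 samples x 60 NMR-derived metabolites
lcms = rng.normal(size=(40, 800))     # 40 samples x 800 LC-MS features
scores = PCA(n_components=2).fit_transform(fuse_blocks([nmr, lcms]))
```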
Besides UV scaling, other scaling techniques have been applied, depending on the aim of the study, the noise characteristics and the block structure, such as scaling to unit block variance [41] or other variance measures [16,38,45,46]. When data blocks are augmented, we recommend using metabolic profiling (i.e., turning all data blocks into discrete data). For NMR data, this implies integrating and extracting peak areas, and, for LC-MS data, this will normally imply using software that generates peak tables consisting of metabolites with a unique retention time and mass pair [22,23]. A PCA model with UV scaling will often be the first method of choice for indicating similarities and differences between variables within and between blocks. If the blocks remain too different in complexity and co-variation, we recommend investigating different block-scaling methods, as
mentioned above. However, we should emphasize that the final foodomics result will be biased by the technique chosen, and, as such, this approach might not be trivial. This is why metadata from the multivariate data modeling must always be reported in the scientific literature.

5. Simple methods – powerful correlations!

For correlation analyses, there are basically two ways to proceed. The simplest is to examine univariate correlations between design parameters and different metabolites, but this is rarely appropriate due to the multicomponent nature of most foodomics studies. The other approach is to use multivariate methods to correlate many features at the same time, thereby allowing interactions to take part in the correlations found. For the latter, several correlation methods have been proposed in the literature, such as matrix correlations, Procrustes analysis [46], PLS2 and multi-block methods (see section 4).

5.1. How to check complementary information in different analytical platforms?

Being able to validate information from one analytical platform with information from another platform is an efficient way of proving the validity of an experimental set-up and the results [8]. With multiple analytical platforms, such as NMR and LC-MS, all combinations shown in Table 1 can be investigated, depending on how the raw data are treated (not taking into consideration cleaning, initial variable selection and pre-processing). The simplest two-block case involves scatter plots where (quantitative) correlations between metabolites can be visualized. The two-vector scatter plot is the backbone of science and can be extremely powerful when linking metabolites from different data blocks (e.g., glucose found by NMR plotted against glucose found by LC-MS). However, in metabolomics studies, the correlations are often not univariate but rather a combination of several metabolites acting together (a pattern) in response to a certain perturbation (e.g., different diets). These multivariate patterns or correlations can be visualized through latent variable methods, such as PCA or PLS, where the dimensionality of the block(s) can change from multivariate in NMR studies to multiway in LC-MS or GC-MS studies.

Studying such correlations can be used to highlight and validate whether the performances of the two analytical platforms are similar and whether they measure the shared metabolites in the same way. One reason why this might not be the case is that the analytical platforms cannot always be expected to extract a shared metabolite in the same way. For example, if one method is less sensitive, this platform (e.g., NMR) might not capture the same trend as observed on a high-resolution, sensitive platform, such as LC-MS. By contrast, if the concentration is too high, it might affect the correlation due to detector issues, such as ion suppression and saturation in MS.

Running the same samples on the same analytical platform (LC-MS), but in two different set-ups (e.g., polarity and column type), can also enhance the understanding and the validity of the acquired data. An example could be hydrophilic interaction chromatography (HILIC) and reversed-phase liquid chromatography (RPLC) data measured on the same samples [47]. For such data, a preliminary PCA model on augmented data, a PLS2 model linking the two blocks or correlation-based methods will indicate whether the known similar information (both qualitative and quantitative) can be found on both analytical platforms.
If this can be validated, complementary information (not expected to arise from both methods) will most probably also be valid and can then be further processed.
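As a minimal sketch of such a two-block comparison, a PLS2-type model can be fitted with scikit-learn's PLSRegression (which accepts a multivariate Y block); the HILIC/RPLC tables below are simulated stand-ins, not data from [47]:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Hypothetical matched peak tables measured on the same samples,
# e.g., X = HILIC LC-MS features, Y = RPLC LC-MS features (samples in rows).
rng = np.random.default_rng(2)
common = rng.normal(size=(50, 3))                     # shared latent variation
X = common @ rng.normal(size=(3, 120)) + 0.3 * rng.normal(size=(50, 120))
Y = common @ rng.normal(size=(3, 200)) + 0.3 * rng.normal(size=(50, 200))

# PLS2-type model: several latent variables relating the two blocks at once
pls2 = PLSRegression(n_components=3).fit(X, Y)
x_scores, y_scores = pls2.transform(X, Y)

# The correlation between X- and Y-scores per component indicates how much
# quantitative information the two set-ups share.
for a in range(3):
    r = np.corrcoef(x_scores[:, a], y_scores[:, a])[0, 1]
    print(f"component {a + 1}: score correlation r = {r:.2f}")
```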
6. Multi-block methods and their potential in foodomics

Above, we discussed a number of simple, rather well-established methods for relating data to each other. The focus was on univariate methods for relating two variables to each other and on multivariate interpretation of one data block/set and of the relation between two data blocks. Below, we give a brief overview of some natural, quite recently developed extensions of these methods and also indicate how they can be potentially relevant in the present foodomics context {see [46,48]}. These methods are known under different names, the most used being multi-block methods, multi-set methods and data fusion methods.

The area of multi-block methodology [49,50] is a very natural extension of the methodologies mentioned above, such as PLS regression, but, instead of looking at the relation between two blocks of variables, multi-block methods can be used for relating several data blocks to each other for the purposes of prediction or interpretation. The development has been fueled by an increasingly strong need in several modern scientific disciplines, with the omics area as one of the most important. Some of the methodologies developed fit perfectly into the area of foodomics, because of not only the data structures available but also the need to interpret complex relations between several large data sets (e.g., NMR, XC-MS, and GC/LC). The concepts involved are strongly related to standard multivariate analysis {see, e.g., [51]}, which implies that very similar ways of visualizing and interpreting the results can be used. The area of multi-block modelling can broadly be split into two closely related disciplines based on the purpose of the analysis:

(1) If the aim is to predict and understand how a number of different blocks can be combined for predicting one or several other blocks, the area is called multi-block regression (a direct extension of PLS2 regression, as discussed above). A typical example is when several blocks are used for gaining information (e.g., about a number of chemical constituents or about which group of samples a specific sample belongs to) [42].

(2) In other cases, no specific order of the blocks is of interest, only how the blocks are related [46].

The most natural way of approaching the two situations is by the use of so-called data-compression methods inspired by the original PCA. These methods are based on stable projections of the data and on visual inspection and empirical validation of the extracted components. They were developed for solving the collinearity problem (i.e., many more highly correlated variables than samples), and a number of efficient calculation and estimation methods already exist [52].

In multi-block modelling, there are a number of concepts that have turned out to be very fruitful and that are also relevant in foodomics studies. Among these concepts are common variability, unique variability and additional variability of the blocks (see, e.g., [42]), the last mainly relevant for regression situations. The first of these has to do with the dimensions defining the joint variability that is common to the blocks [53] (i.e., the linear combinations in one of the blocks that are highly correlated with linear combinations in the other). One can also think of this as the intersection between the spaces defined by the blocks.
The unique dimensions of the blocks are those directions in the block spaces that are uncorrelated with the common variability, in this sense representing information that is not shared by the other blocks. Additional information is related to prediction and defines how and how much extra information each of the blocks contributes to the prediction of output variables. All these concepts are quite general and can be implemented in various ways. Their future within foodomics
is quite obvious. For example, one is often interested in what is common and what is unique in different information sources (as was also touched upon in the correlation studies above), and sometimes the methodologies are used for classification, which clearly falls under the prediction concept above. Some operationalizations of these concepts are also invariant (see, e.g., [42]) to the relative scaling of the blocks, which can make them particularly important when data blocks of different types are involved in one single analysis.
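As a rough illustration of the common-variability idea (one possible option, not the specific methods of [42,46,49]), each block can be compressed by PCA and canonical correlation analysis then used to find maximally correlated directions in the two compressed blocks:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

def common_components(X, Y, n_compress=10, n_common=2):
    """Estimate 'common variability' between two data blocks.

    Each block (samples in rows) is first compressed by PCA to a few score vectors,
    to handle the many-variables/few-samples situation, and canonical correlation
    analysis then finds directions in the two compressed blocks that are maximally
    correlated, a simple stand-in for the common-variability concept.
    """
    Tx = PCA(n_components=n_compress).fit_transform(X)
    Ty = PCA(n_components=n_compress).fit_transform(Y)
    cca = CCA(n_components=n_common).fit(Tx, Ty)
    Ux, Uy = cca.transform(Tx, Ty)
    canonical_r = [np.corrcoef(Ux[:, a], Uy[:, a])[0, 1] for a in range(n_common)]
    return Ux, Uy, canonical_r

# Hypothetical NMR and LC-MS tables on the same 40 samples
rng = np.random.default_rng(3)
shared = rng.normal(size=(40, 2))
X = shared @ rng.normal(size=(2, 150)) + rng.normal(size=(40, 150))
Y = shared @ rng.normal(size=(2, 300)) + rng.normal(size=(40, 300))
_, _, r = common_components(X, Y)
print("canonical correlations:", np.round(r, 2))
```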
7. Outreach

Foodomics studies are complex: they are concerned with multifactorial problems and, as these are analyzed with multifactorial sensors and separation methods, multivariate data-handling methods are required to extract and describe the data. Modern analytical platforms generate vast amounts of data in a very short time, and the analyst risks being flooded with non-informative data. There are multiple ways to analyze foodomics data and many choices have to be made. Data recording, cleaning, pre-processing, modeling and validation are all essential parts of the informatics flow in any foodomics study. However, multivariate data analysis typically remains the bottleneck for most systems biology studies, including foodomics, and new developments in chemometrics are required for the challenging data fusion or meta-analysis often demanded in systems biology studies. Obligatory deposition of large foodomics data sets published in the literature should serve as an inspiration for innovative chemometricians.

However, simple correlation studies (using well-known chemometric principles from the foodomics world) have proved to be very powerful methods for a first glance at the relationship between multiple blocks of data. We advocate that such studies be part of the generic data-handling chain in all future foodomics studies. Finally, we cannot state too strongly that any result from a properly designed foodomics study will rely on the quality of the internal and external validation, and, last but not least, that it will ultimately be necessary to confirm the correlations found with evidence from other sources, because statistical methods cannot discriminate between causal effects and indirect correlations.
References [1] A. Cifuentes, Food analysis and foodomics foreword, J. Chromatogr. A 1216 (2009) 7109. [2] D. Camacho, A. de la Fuente, P. Mendes, The origin of correlations in metabolomics data, Metabolomics 1 (2005) 53–63. [3] R. Steuer, On the analysis and interpretation of correlations in metabolomic data, Brief. Bioinform. 7 (2006) 151–158. [4] S.B. Engelsen, L. Norgaard, Comparative vibrational spectroscopy for determination of quality parameters in amidated pectins as evaluated by chemometrics, Carbohyd. Polym. 30 (1996) 9–24. [5] I. Noda, Generalized 2-dimensional correlation method applicable to infrared, raman, and other types of spectroscopy, Appl. Spectrosc. 47 (1993) 1329–1336. [6] C.D. Eads, I. Noda, Generalized correlation NMR spectroscopy, J. Am. Chem. Soc. 124 (2002) 1111–1118. [7] O. Cloarec, M.E. Dumas, A. Craig, R.H. Barton, J. Trygg, J. Hudson, et al., Statistical total correlation spectroscopy: an exploratory approach for latent biomarker identification from metabolic H-1 NMR data sets, Anal. Chem. 77 (2005) 1282–1289. [8] D.J. Crockford, E. Holmes, J.C. Lindon, R.S. Plumb, S. Zirah, S.J. Bruce, et al., Statistical Heterospectroscopy, an approach to the integrated analysis of NMR and UPLC-MS data sets: application in metabonomic toxicology studies, Anal. Chem. 78 (2005) 363–371. [9] F.J. Blanco, C. Ruiz-Romero, OSTEOARTHRITIS Metabolomic characterization of metabolic phenotypes in OA, Nat. Rev. Rheumatol. 8 (2012) 130–132. [10] E. Want, Challenges in applying chemometrics to LC-MS-based global metabolite profile data, Bioanalysis 1 (2009) 805–819. [11] G.A. Theodoridis, H.G. Gika, E.J. Want, I.D. Wilson, Liquid chromatography-mass spectrometry based global metabolite profiling: a review, Anal. Chim. Acta 711 (2012) 7–16.
[12] W.B. Dunn, D.I. Ellis, Metabolomics: current analytical platforms and methodologies TRAC-trends in analytical, Chemistry (Easton) 24 (2005) 285–294. [13] S.E. Richards, M.E. Dumas, J.M. Fonville, T.M.D. Ebbels, E. Holmes, J.K. Nicholson, Intra- and inter-omic fusion of metabolic profiling data in a systems biology framework, Chemometr. Intell. Lab. 104 (2010) 121–131. [14] C. Socaciu, F. Ranga, F. Fetea, L. Leopold, F. Dulf, R. Parlog, Complementary advanced techniques applied for plant and food authentication, Czech. J. Food Sci. 27 (2009) S70–S75. [15] B. Biais, J.W. Allwood, C. Deborde, Y. Xu, M. Maucourt, B. Beauvoit, et al., H-1 NMR, GC-EI-TOFMS, and data set correlation for fruit metabolomics: application to spatial metabolite analysis in melon, Anal. Chem. 81 (2009) 2884–2894. [16] M.E. Dumas, C. Canlet, L. Debrauwer, P. Martin, A. Paris, Selection of biomarkers by a multivariate statistical processing of composite metabonomic data sets using multiple factor analysis, J. Proteome Res. 4 (2005) 1485–1492. [17] A.K. Smilde, M.J. van der Werf, S. Bijlsma, B.J.C. van der Werff-van-der Vat, R.H. Jellema, Fusion of mass spectrometry-based metabolomics data, Anal. Chem. 77 (2005) 6729–6736. [18] V. Garcia-Canas, C. Simo, M. Herrero, E. Ibanez, A. Cifuentes, Present and future challenges in food analysis: foodomics, Anal. Chem. 84 (2012) 10150–10159. [19] R. Bro, Multiway calibration, Multilinear PLS J. Chemometr. 10 (1996) 47–61. [20] H. Martens, M. Hoy, F. Westad, D. Folkenberg, M. Martens, Analysis of designed experiments by stabilised PLS Regression and jack-knifing, Chemometr. Intell. Lab. 58 (2001) 151–170. [21] T. Skov, S.B. Engelsen, Chemometrics, Mass Spectrometry, and Foodomics, John Wiley & Sons, Inc., 2013, pp. 507–538. [22] M. Katajamaa, M. Orešic, Data processing for mass spectrometry-based metabolomics, J. Chromatogr. A 1158 (2007) 318–328. [23] S. Castillo, P. Gopalacharyulu, L. Yetukuri, M. Orešic, Algorithms and tools for the preprocessing of LC-MS metabolomics data, Chemometr. Intell. Lab. 108 (2011) 23–32. [24] M. Katajamaa, J. Miettinen, M. Oresic, MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data, Bioinformatics 22 (2006) 634–636. [25] T. Pluskal, S. Castillo, A. Villar-Briones, M. Oresic, MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data, BMC Bioinformatics 11 (2010). [26] C.A. Smith, E.J. Want, G.O. Maille, R. Abagyan, G. Siuzdak, XCMS: processing mass spectrometry data for metabolite profiling using Nonlinear peak alignment, matching, and identification, Anal. Chem. 78 (2006) 779–787. [27] R. Tautenhahn, G.J. Patti, D. Rinehart, G. Siuzdak, XCMS online: a web-based platform to process untargeted metabolomic data, Anal. Chem. 84 (2012) 5035–5039. [28] A. Lommen, MetAlign: interface-driven, versatile metabolomics tool for hyphenated full-scan mass spectrometry data preprocessing, Anal. Chem. 81 (2009) 3079–3086. [29] A. Lommen, H.J. Kools, MetAlign 3.0: performance enhancement by efficient use of advances in computer hardware, Metabolomics 8 (2012) 719–726. [30] E. Lange, R. Tautenhahn, S. Neumann, C. Gröpl, Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements, BMC Bioinformatics 9 (2008) 1–19. [31] G. Gürdeniz, M. Kristensen, T. Skov, L.O. Dragsted, The effect of LC-MS data preprocessing methods on the selection of plasma biomarkers in fed vs. fasted rats, Metabolites 2 (2012) 77–99. [32] G. Lorho, F. 
Westad, R. Bro, Generalized correlation loadings: extending correlation loadings to congruence and to multi-way models, Chemometr. Intell. Lab. 84 (2006) 119–125. [33] M. Sysi-Aho, M. Katajamaa, L. Yetukuri, M. Oresic, Normalization method for metabolomics data using optimal selection of multiple internal standards, BMC Bioinformatics 8 (2007).
[34] F. Dieterle, A. Ross, G. Schlotterbeck, H. Senn, Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in H-1 NMR metabonomics, Anal. Chem. 78 (2006) 4281–4290. [35] R.J.O. Torgrip, K.M. Aberg, E. Alm, I. Schuppe-Koistinen, J. Lindberg, A note on normalization of biofluid 1 D H-1-NMR data, Metabolomics 4 (2008) 114–121. [36] L.G. Rasmussen, H. Winning, F. Savorani, H. Toft, T.M. Larsen, L.O. Dragsted, et al., Assessment of the effect of high or low protein diet on the human urine metabolome as measured by NMR, Nutrients 4 (2012) 112–131. [37] F. Savorani, M.A. Rasmussen, M.S. Mikkelsen, S.B. Engelsen, A primer to nutritional metabolomics by NMR spectroscopy and chemometrics, Food Res. Int. (2013). [38] R. Berg, H. Hoefsloot, J. Westerhuis, A. Smilde, M. Werf, Centering, scaling, and transformations: improving the biological information content of metabolomics data, BMC Genomics 7 (2006) 1–15. [39] T.G. Doeswijk, J.A. Hageman, J.A. Westerhuis, Y. Tikunov, A. Bovy, F.A. van Eeuwijk, Canonical correlation analysis of multiple sensory directed metabolomics data blocks reveals corresponding parts between data blocks, Chemometr. Intell. Lab. 107 (2011) 371–376. [40] T. Skov, D. Ballabio, R. Bro, Multiblock variance partitioning: a new approach for comparing variation in multiple data blocks, Anal. Chim. Acta 615 (2008) 18–29. [41] J. Forshed, H. Idborg, S.P. Jacobsson, Evaluation of different techniques for data fusion of LC/MS and H-1-NMR, Chemometr. Intell. Lab. 85 (2007) 102–109. [42] T. Næs, O. Tomic, N.K. Afseth, V. Segtnan, I. Måge, Multi-block regression based on combinations of orthogonalisation, PLS-regression and canonical correlation analysis, Chemometr. Intell. Lab. 124 (2013) 32–42. [43] J.A. Westerhuis, A.K. Smilde, Deflation in multiblock PLS, J. Chemometr. 15 (2001) 485–493. [44] J. Saurina, Characterization of wines using compositional profiles and chemometrics Trac-Trends in Analytical, Chemistry (Easton) 29 (2010) 234–245. [45] M. de Tayrac, S. Le, M. Aubry, J. Mosser, F. Husson, Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: multiple factor analysis approach, BMC Genomics 10 (2009). [46] S. Hassani, H. Martens, E.M. Qannari, M. Hanafi, G.I. Borge, A. Kohler, Analysis of -omics data: graphical interpretation- and validation tools in multi-block methods, Chemometr. Intell. Lab. 104 (2010) 140–153. [47] G. Paglia, M. Magnusdottir, S. Thorlacius, O.E. Sigurjonsson, S. Gudmundsson, B.O. Palsson, et al., Intracellular metabolite profiling of platelets: evaluation of extraction processes and chromatographic strategies, J. Chromatogr. B 898 (2012) 111–120. [48] K. Van Deun, I. Van Mechelen, L. Thorrez, M. Schouteden, B. De Moor, M.J. van der Werf, et al., DISCO-SCA and properly applied GSVD as swinging methods to find common and distinctive processes, PLoS ONE 7 (2012) e37840. [49] J.A. Westerhuis, T. Kourti, J.F. MacGregor, Analysis of multiblock and hierarchical PCA and PLS models, J. Chemometr. 12 (1998) 301–321. [50] S. Bougeard, E.M. Qannari, N. Rose, Multiblock redundancy analysis: interpretation tools and application in epidemiology, J. Chemometr. 25 (2011) 467–475. [51] J. Sanchez, Martens, Harald; Naes, Tormod: Multivariate Calibration, John Wiley & Sons, Chichester, 1989. 419 + xvii pp. ISBN 0471 90979 3. [52] S. Hassani, H. Martens, E.M. Qannari, M. Hanafi, A. 
Kohler, Model validation and error estimation in multi-block partial least squares regression, Chemometr. Intell. Lab. 117 (2012) 42–53. [53] I. Måge, B.-H. Mevik, T. Næs, Regression models with process variables and parallel blocks of raw material measurements, J. Chemometr. 22 (2008) 443–456.