JOURNAL OF CHEMOMETRICS J. Chemometrics 2006; 20: 325–340 Published online 18 January 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/cem.1005
An automated method for peak detection and matching in large gas chromatography-mass spectrometry data sets
Sarah J. Dixon1, Richard G. Brereton1*, Helena A. Soini2, Milos V. Novotny2 and Dustin J. Penn3
1 Centre for Chemometrics, School of Chemistry, University of Bristol, Cantocks Close, Bristol BS8 1TS, UK
2 Department of Chemistry and Institute for Pheromone Research, Indiana University, 800 E Kirkwood Ave, Bloomington, IN 47405, USA
3 Konrad Lorenz Institute for Ethology, Austrian Academy of Sciences, Savoyenstr. 1a, A-1160 Vienna, Austria
Received 12 March 2006; Revised 29 June 2006; Accepted 28 July 2006
A new approach for peak detection and matching has been developed and applied to two data sets. The first consisted of the Gas Chromatography-Mass Spectrometry (GC-MS) chromatograms of 965 human sweat samples obtained from a population of 197 individuals. The second data set contained 500 synthetic chromatograms, and was generated to validate the peak detection and matching methods. The size of both data sets (around 500 000 detectable peaks over all chromatograms in data set 1, and around 100 000 in data set 2) makes it infeasible to check manually whether peaks are matched. The method consists of three procedures: pre-processing the data, peak detection, and peak matching. Peak matching consists of three stages: (a) finding potential target peaks in the full data set over all chromatograms; (b) matching peaks in the chromatograms to these targets to form clusters of spectra associated with each target; (c) merging targets where appropriate. Peak detection and matching were applied to both data sets, and the importance of stage (c) of peak matching is described. In addition to the analysis of the synthetic chromatograms, the method was also validated by shuffling the original order of the sweat chromatograms and performing the methods independently on the newly shuffled data. Copyright © 2007 John Wiley & Sons, Ltd.
KEYWORDS: GC-MS; peak detection; peak matching; metabolomics
1. INTRODUCTION Gas Chromatography - Mass Spectrometry (GC-MS) is widely used for analysing volatile and semi-volatile compounds in complex mixtures [1–5]. This technique is being increasingly employed in the area of metabolomics. However, most existing studies are performed on fairly limited sample sizes, for example 20–100 samples, and often contain a substantial portion of compounds that are common to most of these samples. Furthermore several internal standards can be added to allow easy alignment. There are very few published animal metabolic studies on large data sets consisting of several hundred samples. These pose particular challenges, especially when studying emanations such as sweat where it is difficult to add internal standards because of the nature of the extraction procedure.
*Correspondence to: R. G. Brereton, Centre for Chemometrics, School of Chemistry, University of Bristol, Cantocks Close, Bristol BS8 1TS, UK. E-mail:
[email protected]
In most areas of metabolomic data analysis, data preparation is the critical process: later stages such as principal components analysis and partial least squares depend critically on how the data have been processed prior to pattern recognition. In the study reported here, two data sets are analysed. The first comes from human sweat samples, acquired by a novel rolling stir bar extraction method [6] (for background on stir bar methodology, see Baltussen et al. [7]). After sample collection at a remote site, the preserved stir bars were shipped for analysis and subjected to thermal desorption/GC-MS analysis. The second is a synthetic data set, used to validate the method. This was created using a library of 1000 mass spectra extracted from the sweat chromatograms, and characteristics such as number of peaks per chromatogram, typical peak intensity and degrees of overlap were modelled on those of the sweat data set. Prior to pattern recognition, there are two principal groups of methods for pre-processing GC-MS data. The first technique is to compare the Total Ion Chromatograms (TICs) and observe how the intensity varies at given elution
times, or more commonly, elution time windows [8]. To enable comparison between a series of chromatograms, they need to be aligned, so that a given compound has the same retention time in all chromatograms. There are many examples in the literature of aligning data sets, including Correlation Optimised Warping (COW) [9], Dynamic Time Warping (DTW) [10], reduced set mapping [11], minimising the data set’s pseudo rank [12] or latent variable projections [13]. These normally take one chromatogram as a target, and warp subsequent chromatograms to match this target, either by using the shape of the TIC or by using the full spectral information. However, if the sample matrix is complex, and there are many compounds whose chromatographic signals overlap, information about the position of individual compounds may be lost in the TIC. Most applications described in the literature originate from a series of samples often consisting of similar peaks where it is expected that there are several known compounds in common among all chromatograms [14]. These methods are very valuable, for example, when comparing HPLCs taken over a period of time, with slightly different chromatographic conditions. In metabolomics, samples often differ substantially in nature, and it is harder to define universally present landmark compounds or to obtain target chromatograms that contain a sufficiently representative and well-resolved set of peaks. More information can be extracted by only looking at Single Ion Chromatograms (SICs), or a reduced TIC built from selected mass channels, but this requires prior knowledge of which mass channels should be studied. Alternative methods depend on developing a peak table consisting of a matrix of the intensity of peaks above the detection limit across all chromatograms. The first stage is to determine how many peaks are present in each sample and where they elute: there are numerous reports in the literature about peak picking algorithms [15–19]. 
Then, by extracting a mass spectrum and retention time for each peak, the relative area of each compound can be compared from sample to sample. A major problem involves matching peaks from different chromatograms using spectral similarity measures [20–22]. Spectra may be distorted if the signal to noise level is low or if there is overlap between neighbouring peaks, and errors in matching (due to poor spectral similarity) can have serious consequences in pattern recognition. Therefore, a relatively robust but automated approach to peak recognition and matching is necessary. An advantage of peak table methods over TIC based methods is that there is more diagnostic information in the data and that it is not necessary, for example, to warp or baseline correct entire chromatograms. However, these approaches are critically dependent on determining which peaks in different chromatograms correspond to the same compound. Human sweat is a very complex matrix, with several hundred compounds detected in each chromatogram (which in this study was recorded over 52.33 min). The peaks are often of low intensity, and are difficult to discriminate using only the TIC. A large number of samples (965) were obtained, and due to the large number of compounds in each sample, it is impractical to detect and integrate each peak manually: with up to 500 identifiable peaks per chromatogram, this results in around half a million possible chromatographic peaks.
Manual alignment and integration of each peak at 5 min per peak (checking mass spectra, elution times, baseline, etc.) would require 41 666 h work, or 18.5 years working at the rate of 45 weeks a year and 50 h per week. A contrasting data set of 50 chromatograms consisting of 30 peaks would take only 125 h or around two and a half weeks’ work to check manually, and is typical of the size of many published chemometrics studies. It was therefore necessary to devise an automated method for detecting and integrating the compound peaks in each sample, and then comparing the relative amount of each compound between samples. This method has to be computationally efficient, and minimise the problem of erroneous results. No attempt is made as yet to elucidate the structures of the compounds in this study. Several main procedures are necessary. The first requires improving the signal to noise ratio, and to enable more effective peak detection. Methods can be applied to either reduce noise in the entire chromatogram by identifying and removing mass channels with limited/no information [15], or by smoothing each SIC [16]. The second procedure involves determining how many peaks are in the chromatogram, where they elute, where they begin and end and their intensity. Usually closely overlapping peaks correspond to compounds with different mass spectra, and so multivariate methods for resolution are not necessary in contrast to DAD-HPLC where they are often mandatory. The most common (and simplest) method for detection of peaks searches for changes in the gradient of either the TIC or individual SICs to indicate the position of compounds. The first (and sometimes higher) derivatives are calculated, and a change in sign suggests the start of a peak [17]. An alternate method (Intensity Weighted Variance [18]) splits the SICs into small windows. If there is no peak in the window, then the intensity profile will be relatively constant, and resemble a discrete uniform distribution. 
If a peak is present in the window the intensity profile will be concentrated in the centre of the window. This method is quite robust to noise, but it is computationally intensive. The freely available software tool AMDIS can also be used for peak detection [19]. This works along each mass channel sequentially. First each mass channel is subjected to pre-processing, and then all the possible maxima are highlighted. Each maximum point is then subjected to a series of peak validation tests to ensure that it is not from noise. Mass channels which exhibit a maximum within a pre-defined time window are assigned as being from the same compound. The AMDIS software itself is only used for peak detection and compound identification via library searching, and does not attempt to integrate each peak, nor does it analyse multiple samples simultaneously. There are also methods which attempt to deconvolute overlapping peaks [23–26] to determine the peak area of all underlying compounds; however, these often require a prior estimate of the number of compounds present, or an example of the raw spectra of each compound. An extension to this has been to obtain a typical peak shape from the chromatogram, and use this model shape for more accurate deconvolution [27]. If the overlapping compounds appear in several samples, it is also possible to use the difference in variations between mass channel intensities between
samples to extract the mass spectra of each compound, and therefore the intensity in each sample [28]. In this paper, a method was developed which first identified and smoothed each informative mass channel, before using the first derivative to locate possible peaks. The method is completely automated. It is based on an existing approach developed for handling mass spectral data from contaminated banknotes [29]. Tuneable parameters are fixed at the start of analysis, and are determined by the chromatographic characteristics such as scan rate and typical peak width. The final and hardest procedure involves determining which peaks in a series of chromatograms correspond to the same compound, by matching peaks between the samples. For samples with high signal to noise ratios, and few overlapping compounds, peak matching can usually be performed on TICs using methods such as COW as described above. However, as sweat is so complex, and as the sampling and analytical procedure is optimised to detect volatile and semi-volatile compounds, which will be strongly related to effects such as dietary habits, personal behaviour, and even genetic influences, there are very few compounds common to all samples, and it is difficult to choose a single target chromatogram out of the possible chromatograms. Many peak matching methods assume prior knowledge of key compounds present in all samples [30,31]. For our purposes, it was not known how many compounds would be present, and in which samples. It was also not possible to add multiple marker compounds, as the samples were very complex, so any added internal standards would overlap with analytes from the samples. Also, the standards would have had to be adsorbed onto a stir bar which is a complex process in contrast to urine, for example, which is an easy matrix for the addition of standards. Two standards were added in this study [6]. 
In contrast, if the matrix is simple, and there is expected to be no overlap between compounds, it is possible to align components simply by their retention time, with no use of extracted mass spectra [32]. In the method described in this paper, all mass channels are examined individually for peaks. Where several different
mass channels contain maxima within a given retention time window, these are grouped together and assumed to originate from the same compound. For matching, a method was developed which first searched through the detected peaks from all samples looking for all unique peaks over the entire chromatographic data set. These peaks were retained as target peaks. In the second stage each sample was then analysed again and peaks matched to the targets. Finally a third stage views all peaks assigned to a target as a cluster, and tries to determine whether peaks in different clusters should be grouped together. This third stage is an important one. For example, consider finding a peak in chromatogram 1 that has sufficiently different characteristics to a peak in chromatogram 2 that these are considered different (this may be due to poor signal to noise ratio in one chromatogram). Both peaks are seeded as targets, and peaks arising from identical compounds may cluster to either of these targets; the targets could be regarded as extreme measurements from a single group. After all chromatograms have been analysed, it is then necessary to determine whether these clusters should be merged. In practice we find that for relatively small data sets (100–200 chromatograms) there are few problems with using a simpler approach, but for large data sets (500–1000 chromatograms), where there are few common peaks, there can be significant difficulties choosing target peaks and chromatograms, and the methods reported in this paper are robust to this problem. With the recent growth of large metabolomics and proteomics data sets, it is necessary to develop new, automated approaches for peak matching, which we describe in this paper.
2. DATA SETS AND DEFINITIONS
2.1. Definitions
Figure 1 provides an overview of key definitions used in this paper. A table of notation including these variables, and others from the text, is given in Table I. For the experimental data, sweat samples are collected from P subjects. In total B sweat samples are obtained, involving $Y_p$ repeats for individual p, where $B = \sum_{p=1}^{P} Y_p$. For every sample in both data sets, there is a corresponding GC-MS chromatogram, denoted by b. Each chromatogram is characterised by a matrix (X) of dimensions N × M, with N scans recorded over M mass channels. A single element of this chromatogram ($x_{nm}$) represents the intensity of mass channel m in scan number n. A single row of a chromatogram ($x_n$) is a mass spectrum. A single column ($x_m$) shows the intensity of a single mass channel over time as a SIC, with the sum of all SICs being the TIC. Intensity, although often presented in units of ion counts per scan, will be presented in this paper as a unitless number whose value is on the same scale in all chromatograms. Chromatogram b consists of $K_b$ detectable compounds, each having a retention time of $r_k$ and a mass spectrum $\beta_k$.

Figure 1. Graphical example of some of the definitions used.
Table I. List of notations
Symbol          Description
a               Peak area
B, b            Number of chromatograms
C               Scan number of peak maximum
d               First derivative of $\upsilon$
E               Scan number of peak end
F               Noise factor
G               Sparse N × W matrix containing peak area information for a chromatogram
h               Raw height of peak
I, i            Peaks detected in a mass channel
J, j            Peaks detected in a chromatogram
K, k            True number of peaks in a chromatogram
L, l            Peak clusters over all chromatograms in a data set
M, m            Number of mass channels in a chromatogram
N, n            Number of scans in a chromatogram
P               Number of people sampled
Q, q            Frequency of occurrence of peaks in data set 2
r               Peak retention time
S               Scan number of peak start
T, t            Number of target peaks in a given time window
U, u            Windows of a chromatogram for determining the noise factor
V1–7            Tuneable parameters, as detailed in Table II
W, w            Number of informative mass channels in a chromatogram
X               Matrix representation of a chromatogram
Yp              Number of repeat samples from subject p
Z, z            Mean area of peaks in data set 2
$\alpha$        Synthetic peak shape
$\beta$         Mass spectrum used to generate synthetic data set
$\gamma$        Extracted mass spectrum of a detected chromatographic peak
$\delta$        Time shift added to a synthetic chromatogram
$\varepsilon$   Cosine matching function
$\omega$        Number of mass channels with peak maxima at the same value of n
$\kappa$        Threshold which the baseline corrected peak height must be greater than
$\upsilon$      Wavelet de-noised SIC
$\rho$          Random number from standard normal distribution
$\sigma$        Standard deviation of gaussian used for synthetic peak shape

2.2. Data set 1—human sweat
Data set 1 consisted of human sweat samples obtained from an isolated population in Carinthia, Southern Austria. Most individuals provided five repeats, sampled once a fortnight for five fortnights between June and August 2005. One hundred and ninety-seven individuals, whose family histories were all known, took part in the study. However, a few individuals were not able to participate each fortnight, and this reduced the data set size from a possible 985 samples to B = 965.

2.2.1. Reagents and materials
Standard compounds were purchased from Aldrich (Milwaukee, WI). Stir bars (Twister™, 10 mm, 0.5 mm film thickness, 24-μL polydimethylsiloxane (PDMS) volume) used for sample collection were purchased from Gerstel GmbH (Mülheim an der Ruhr, Germany). Stir bars were conditioned prior to and between each use in the TC 2 tube conditioner (Gerstel GmbH) at 300 °C under helium flow. Volatile and semi-volatile compounds were collected from skin using the high-reproducibility rolling stir bar methodology described elsewhere [6].

2.2.2. Instrumentation
GC equipment for quantitative analysis consisted of an Agilent 6890N Gas Chromatograph connected to a 5973i MSD mass spectrometer (Agilent Technologies) with a Thermal Desorption Autosampler (TDSA, Gerstel). A positive electron ionisation (EI) mode at 70 eV was used with a scanning rate of 4.51 scans/s over the mass range of m/z 35–350. The ion source and quadrupole temperatures were set at 230 and 150 °C, respectively. The separation capillary was DB-5MS (20 m × 0.18 mm i.d., 0.18 μm film thickness) from Agilent. Samples were thermally desorbed in a TDSA automated system, followed by injection into the column with a Cooled Injection System, CIS-4. The TDSA operated in a splitless mode. The temperature programme for desorption was 20 °C (0.5 min), then 60 °C/min to 250 °C (3 min). The temperature of the transfer line was set at 280 °C. The CIS was cooled with liquid nitrogen to −80 °C. After desorption and cryotrapping, the CIS was heated at 12 °C/s to 280 °C with a hold time of 10 min. The CIS inlet was operated in the solvent vent mode, with a vent pressure of 14 psi, a vent flow of 50 ml/min, and a purge flow of 50 ml/min. The temperature programme in the GC operation was 50 °C for 1 min, then increasing to 160 °C at the rate of 5 °C/min, followed by a second ramp at the rate of 3 °C/min to 200 °C (hold time 16 min). The carrier gas head pressure was 14 psi (flow rate 0.7 ml/min in constant flow mode). The GC temperature programme lasted for 52.33 min, with mass spectrometric detection commencing after a dead time of 1.88 min. To increase throughput, two instruments of identical specifications were used to analyse the samples. The configuration of both instruments was the same, and tests had been done to ensure reference samples analysed on each instrument were of acceptable similarity. There was a slight difference in scan rates between the two instruments, with instrument 1 sampling 13 481 scans over the analysis period, and instrument 2 sampling 13 460. In the mass spectrometric software (ChemStation) a parameter is set, below which a scan will be recorded as having zero intensity. This is the detection limit of the instrument, and can be set by the instrument operator. The mass resolution was reduced to unity before analysis. The repeats from one individual were distributed as evenly as possible between the two analytical instruments; in most cases three samples were analysed on one instrument, and the remaining two on the other instrument. For the purpose of this paper the split of samples between instruments is not of importance. Chromatograms from instrument 1 were of dimensions 13 481 × 316, whereas those from instrument 2 were of dimensions 13 460 × 316. In both cases M = 316, analysed over the range m/z = 35–350.
2.3. Data set 2—synthetic data
A synthetic data set was created to test the peak detection and matching methods described. This consisted of B = 500 synthetic GC-MS chromatograms, of dimensions 14 000 × 316, containing between $K_b$ = 196 and 253 detectable peaks per chromatogram (mean = 221.23, SD = 10.03). To create the data set, several aspects must be modelled: the retention time and mass spectrum of each peak, the average intensity of each peak, and the frequency of occurrence of each peak over all chromatograms.
2.3.1. Mass spectra and retention times
K = 1000 peaks, with their corresponding mass spectra ($\beta_k$), were selected from the results of peak detection on data set 1. The mass spectra had been recorded in the m/z range 35–350, with m taking values of 1–316. All mass spectra were normalised as follows:
$${}^{\mathrm{norm}}\beta_{km} = \frac{\beta_{km}}{\sum_{m=1}^{316} \beta_{km}}$$
The retention times (r) of the peaks were randomly distributed over the range 100 to 13 900, with the criterion that the minimum gap between successive peaks was 5 scans. This was done by randomly choosing 1000 values from the numbers 100–13 900, without replacement. If, when placed in order, there are two values which are less than 5 scans apart, the highest value is discarded, and another random number chosen. This is repeated until 1000 values have been drawn. These scan numbers are the initial synthesised retention times for the peaks, and all chromatograms are then subjected to the addition of a time shift, as described in Section 2.3.6.
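The retention-time draw described above can be sketched as follows. This is a minimal illustration rather than the authors' MATLAB implementation: the function name `draw_retention_times`, its parameters and the fixed seed are assumptions introduced here, and the loop simply discards the higher of any two values closer than 5 scans and redraws until 1000 values remain.

```python
import random

def draw_retention_times(count=1000, low=100, high=13900, min_gap=5, seed=0):
    """Draw `count` scan numbers in [low, high] with successive gaps >= min_gap.

    Hypothetical helper: names, defaults and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(low, high + 1), count))  # without replacement
    while True:
        ordered = sorted(chosen)
        kept = [ordered[0]]
        for value in ordered[1:]:
            # When two values are too close, the higher one is discarded.
            if value - kept[-1] >= min_gap:
                kept.append(value)
        if len(kept) == count:
            return kept
        # Top the set back up with fresh random scan numbers and check again.
        chosen = set(kept)
        while len(chosen) < count:
            chosen.add(rng.randint(low, high))

retention_times = draw_retention_times()
```

The returned list is sorted, so consecutive differences give the gaps between neighbouring synthetic peaks directly.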
2.3.2. Frequency of occurrence of each peak
In data set 1, there were few peaks which appeared in a large number of chromatograms, and a few peaks which only appeared in a few chromatograms, with the others distributed evenly in between. A vector q, of length 1000, represents the number of times a peak is present in the samples, created as follows:
$$q_k = \frac{\left(\frac{2(k-1)}{999} + 1\right)^{6} - 1}{728/495} + 5$$
scaled so that the minimum was equal to 5 and the maximum 500, where $q_k$ is the number of chromatograms (size of peak cluster) in which peak k is detected. The distribution is shown in Figure 2. The 1000 elements of q were randomly reordered to distribute the frequencies of occurrence over all 1000 peaks. Matrix Q was created, of dimensions 500 × 1000, with all elements initialised with the value zero. For each peak k, $q_k$ elements of the kth column of Q were randomly assigned a value of 1. This gives the presence/absence distribution of all peaks over all 500 samples.

Figure 2. Vector q, used to model the frequency of occurrence of K = 1000 peaks in data set 2.

2.3.3. Peak intensity
The results from data set 1 show that the total area intensity of all peaks (as found by taking the sum of all mass channels contributing to the peak, over the range of n in which the peak elutes, minus a simple baseline) approximately follows a log normal distribution, so a vector (z) was created of length 1000, with the elements taken as the exponentials of a normal distribution of mean 14 and SD 0.5, as shown in Figure 3. Vector z was randomly reordered to distribute the areas over all K peaks. These are the average areas of the 1000 synthetic peaks. The distribution of areas of peak k across all $q_k$ chromatograms in which that peak appears was modelled by a normal distribution, of mean $z_k$ and SD $0.1\,z_k$. A matrix Z was created, of dimensions 500 × 1000, with the kth column of Z containing the $q_k$ areas calculated for peak k, positioned in the elements indicated by the kth column of Q.

Figure 3. Vector z, the K = 1000 generated peak areas in data set 2.
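A short sketch of how the occurrence counts and the area matrix Z might be generated is given below. It follows the formula for q and the stated distributions, but several details are assumptions made here for illustration: the function names, the rounding of $q_k$ to an integer, the random seed, and the use of zero entries in Z to mark absent peaks (so that Q is implicit).

```python
import math
import random

def occurrence_counts():
    """q_k for k = 1..1000; rounding to an integer is an assumption."""
    counts = []
    for k in range(1, 1001):
        value = ((((k - 1) * 2 / 999) + 1) ** 6 - 1) / (728 / 495) + 5
        counts.append(round(value))
    return counts

def synthetic_areas(n_chrom=500, seed=0):
    """Return (q, z, Z): shuffled counts, mean areas, and the area matrix."""
    rng = random.Random(seed)
    q = occurrence_counts()
    rng.shuffle(q)  # spread the frequencies over all peaks
    # Mean areas: exponentials of N(14, 0.5) draws, i.e. log normal.
    z = [math.exp(rng.gauss(14, 0.5)) for _ in q]
    Z = [[0.0] * len(q) for _ in range(n_chrom)]
    for k, (q_k, z_k) in enumerate(zip(q, z)):
        # Peak k is present in q_k randomly chosen chromatograms.
        for b in rng.sample(range(n_chrom), q_k):
            Z[b][k] = rng.gauss(z_k, 0.1 * z_k)
    return q, z, Z

q_sorted = occurrence_counts()
q, z, Z = synthetic_areas()
```

Because of the sixth power in the formula, most of the 1000 peaks receive low counts and only a small number approach the maximum of 500.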
2.3.4. Peak shape
For each peak k, the shape was modelled by a gaussian distribution and then normalised, as shown below:
$${}^{\mathrm{norm}}\alpha_{kn} = \exp\left(-\frac{(n - r_k)^2}{2\sigma^2}\right) \Bigg/ \sum_{n=r_k-14}^{r_k+14} \alpha_{kn}$$
where $\alpha_{kn}$ is the un-normalised gaussian value at scan n and $\sigma$ is the standard deviation of the peak, which for this application always takes the value of 5 scans. Values are calculated for n in the range ($r_k - 14$) up to ($r_k + 14$), with points outside this range taking the value of zero.
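The normalised peak shape can be illustrated with a few lines of code. The function name `normalised_peak_shape` and the dictionary return type are choices made for this sketch only; the standard deviation of 5 scans and the window of 14 scans either side of the retention time are taken from the text.

```python
import math

def normalised_peak_shape(r_k, sigma=5.0, half_width=14):
    """Gaussian profile centred on scan r_k, scaled so its points sum to 1."""
    scans = range(r_k - half_width, r_k + half_width + 1)
    raw = [math.exp(-((n - r_k) ** 2) / (2 * sigma ** 2)) for n in scans]
    total = sum(raw)
    # Scans outside the window are simply absent (treated as zero).
    return {n: value / total for n, value in zip(scans, raw)}

shape = normalised_peak_shape(200)
```

Normalising to unit sum means that multiplying the profile by a peak area distributes exactly that area over the 29 scans of the window.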
2.3.5. Synthesising the chromatogram
For chromatogram b, $X_b$ is initialised as a matrix of zeros, of dimensions 14 000 × 316. For all $K_b$ peaks, the contribution towards the synthetic chromatogram was calculated as:
$$x_{bnm} = x_{bnm} + \sum_{k=1}^{K_b} \left({}^{\mathrm{norm}}\alpha_{kn}\, z_{bk}\, {}^{\mathrm{norm}}\beta_{km}\right) + \varepsilon_{bnm}$$
where $\varepsilon$ represents random noise, taken from a normal distribution with mean and standard deviation of $\sum_{n=1}^{14000} x_{nm} / 14\,000$ for mass channel m; a different noise distribution is obtained for each mass channel, as is observed experimentally.
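The synthesis step can be sketched on a toy scale as below. This is an illustration under stated assumptions, not the authors' code: the function name, the `(retention scan, area, normalised spectrum)` tuple layout, the seed, and the small dimensions are all invented for the example, and the noise term is interpreted as a normal draw whose mean and standard deviation both equal the mean noise-free intensity of that mass channel.

```python
import math
import random

def synthesise_chromatogram(n_scans, peaks, sigma=5.0, half_width=14, seed=0):
    """peaks: list of (retention_scan, area, spectrum), spectrum summing to 1.

    Returns (clean, noisy) matrices indexed as [scan][mass channel].
    """
    rng = random.Random(seed)
    n_channels = len(peaks[0][2])
    clean = [[0.0] * n_channels for _ in range(n_scans)]
    for r_k, area, spectrum in peaks:
        window = range(r_k - half_width, r_k + half_width + 1)
        raw = [math.exp(-((n - r_k) ** 2) / (2 * sigma ** 2)) for n in window]
        total = sum(raw)
        for n, value in zip(window, raw):
            if 0 <= n < n_scans:
                for m, fraction in enumerate(spectrum):
                    # normalised shape x area x normalised spectrum
                    clean[n][m] += (value / total) * area * fraction
    noisy = [row[:] for row in clean]
    for m in range(n_channels):
        # Assumed reading: noise mean and SD equal the channel's mean intensity.
        level = sum(row[m] for row in clean) / n_scans
        for n in range(n_scans):
            noisy[n][m] += rng.gauss(level, level)
    return clean, noisy

toy_peaks = [(50, 1000.0, [0.5, 0.5, 0.0]), (120, 400.0, [0.0, 1.0, 0.0])]
clean, noisy = synthesise_chromatogram(200, toy_peaks)
```

In this toy case the third mass channel carries no signal, so its noise level is zero and it stays empty, mirroring the channel-specific noise described in the text.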
2.3.6. Time shift
As an approximation of the retention time shift which will occur due to slight changes in analytical conditions over 3 months, as well as slight variations in retention time on a day to day basis, a time shift was modelled as follows:
$$\delta_b = \rho + 0.1\,b$$
where $\rho$ is a random number from the standard normal distribution (mean = 0, SD = 1), and $\delta_b$ indicates the time shift, in scan numbers. The value of $\delta_b$ is added to the retention time of all peaks in chromatogram b.
2.4. Software The GC-MS instrument was controlled using Agilent ChemStation software. Instrument 1 was running version D.01.02, and system 2 was running D.01.02.16. GC-MS data were exported in .cdf format. The files were imported into MATLAB (The Mathworks, Inc., Natick, MA) using a freely available conversion tool [34]. All data processing was performed using MATLAB version 7.0.4.365, Release 14, Service Pack 2.
3. DATA ANALYSIS
Each data set is subjected to three procedures: data pre-processing, peak detection, and peak matching. Each of these procedures can be divided into smaller stages. The first procedure is data pre-processing, which consists of three stages, namely P1 (selection of mass channels), P2 (calculation of noise factor) and P3 (smoothing). This is followed by three peak detection stages, namely D1 (detection of I peaks in each mass channel), D2 (validation of each of the I peaks) and D3 (grouping peaks from different mass channels to find J peaks in the whole chromatogram). Finally three peak matching stages are performed, namely A1 (searching for unique peaks over all samples), A2 (assigning peaks from each chromatogram to a target peak cluster) and A3 (merging clusters which contain peaks which truly originate from the same compound).
3.1. Data preparation
3.1.1. Stage P1—selection of mass channels
The first stage is to determine which mass channels contain potential information: all mass channels are examined. Any mass channel which has no more than 10 consecutive non-zero scans over the entire chromatogram is considered uninformative and removed. This step can save considerable computation time, as usually only 50–60% of mass channels pass this test, and the number of mass channels retained is reduced from M to $W_b$ (which varies from sample to sample). This method primarily aims to remove mass channels with no diagnostic signal. Other mass channel removal methods such as CODA [35,36] remove mass channels which have high noise levels or high background, with the aim of producing a cleaner TIC. This is not appropriate in the current study, as there is a risk of removing mass channels which are noisy in one region but contain useful information in another. The main objective of this first step is to reduce the size of the original chromatogram by removing information that is clearly redundant, primarily to improve computational speed.
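The selection rule of stage P1 can be sketched as follows, assuming the chromatogram is held as a list of scans, each a list of mass channel intensities. The function names are hypothetical; the threshold of 10 consecutive non-zero scans is the one quoted in the text.

```python
def longest_nonzero_run(sic):
    """Length of the longest run of consecutive non-zero scans in one SIC."""
    longest = current = 0
    for value in sic:
        current = current + 1 if value != 0 else 0
        longest = max(longest, current)
    return longest

def informative_channels(X, max_uninformative_run=10):
    """Indices of mass channels with MORE than 10 consecutive non-zero scans."""
    keep = []
    for m in range(len(X[0])):
        sic = [row[m] for row in X]  # column m of the chromatogram
        if longest_nonzero_run(sic) > max_uninformative_run:
            keep.append(m)
    return keep

# Toy chromatogram: channel 0 empty, channel 1 with a run of 10, channel 2 with 11.
X_demo = [[0, 0, 0] for _ in range(30)]
for n in range(5, 15):
    X_demo[n][1] = 7
for n in range(5, 16):
    X_demo[n][2] = 7
kept_channels = informative_channels(X_demo)
```

Only the channel with a run of 11 non-zero scans is kept, since a run of exactly 10 still counts as uninformative under the stated rule.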
3.1.2. Stage P2—noise factor
The second stage is to compute the noise factor of each of the $W_b$ remaining mass channels. This method is based on that originally proposed by Stein [19], and is summarised below:
1. Each SIC, w, is split into windows of 13 sequential scans.
2. U windows are retained where the number of times the signal crosses the mean of that window is more than 6. These windows have a relatively constant signal intensity (i.e. they do not contain a peak), and so can be taken as regions of baseline. This step rejects regions of the chromatogram where there are likely to be peaks, and retains only areas of baseline and noise.
3. In each of these U windows, the Median Absolute Deviation (MAD) from the mean is calculated, and then divided by the mean intensity of the window to give the Noise Factor ($F_{wu}$ for window u and mass channel w).
4. The noise factor for the entire mass channel, $F_w$, is taken as the median value of $F_{wu}$ over all U windows.
The noise factor value is used at a later stage to validate whether a peak is true, or simply an artefact.
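The four steps above can be sketched as below. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the paper: windows are taken as non-overlapping, and windows with a non-positive mean are skipped to avoid dividing by zero.

```python
from statistics import mean, median

def mean_crossings(window):
    """Number of times the signal crosses the window mean."""
    centre = mean(window)
    above = [value > centre for value in window]
    return sum(1 for a, b in zip(above, above[1:]) if a != b)

def noise_factor(sic, window_size=13, min_crossings=6):
    """Median over baseline windows of MAD-from-mean divided by the mean."""
    factors = []
    for start in range(0, len(sic) - window_size + 1, window_size):
        window = sic[start:start + window_size]
        centre = mean(window)
        # Keep only windows that cross their mean more than 6 times.
        if centre <= 0 or mean_crossings(window) <= min_crossings:
            continue
        mad = median(abs(value - centre) for value in window)
        factors.append(mad / centre)
    return median(factors) if factors else None

flat_noise = [9, 11] * 26        # rapidly alternating baseline
rising_flank = list(range(1, 27))  # monotonic rise, as on a peak flank
f_flat = noise_factor(flat_noise)
f_flank = noise_factor(rising_flank)
```

The alternating signal crosses its mean at almost every scan, so every window counts as baseline, whereas the rising flank crosses its mean only once per window and yields no estimate.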
3.1.3. Stage P3—smoothing
Each of the $W_b$ mass chromatograms is then de-noised using a wavelet filter. In this study the soft Daubechies 5 filter, at level 3 [37,38], was used to give a vector $\upsilon_w$ of the smoothed SIC.
3.2. Peak detection
3.2.1. Stage D1—detection of peaks in each informative mass channel
The next stage is to determine the number and position of all $I_w$ peaks in each of the $W_b$ mass channels for each chromatogram. Each of the $W_b$ mass channels (w) is processed in turn, using the following steps:
1. The first derivative (d) of the de-noised SIC, $\upsilon_w$, is computed using a 5-point quadratic Savitzky–Golay filter [39,40].
2. Starting at n = 3, n is increased until the value of $d_n$ is positive. This is taken as the start of the first peak ($S_1$).
3. The value of n is further increased until both $d_n$ and $d_{n+1}$ are negative. This is taken as an estimate of the peak maximum.
4. In order to determine whether a true peak has been detected in the SIC, two criteria are assessed: either (i) $n - S_1 > V_1$ (in this application, $V_1$ is set to 7, so the peak maximum must be at least 8 scans after $S_1$) or (ii) both $n - S_1 > V_2$ (in this application, $V_2$ is set to 2) and $\upsilon_{n-2}$ is at least four times the detection limit of the instrument (which would happen if the peak is on the shoulder of a previous one, and the peak maximum lies close after $S_1$). If neither of the two criteria is met, n is not taken as the true maximum of the peak, and the algorithm returns to step 3 to look for a new peak maximum. If at least one of the criteria is met, this point is taken as the peak maximum ($C_1$). These distances ($V_1$ and $V_2$) are tuneable parameters and can be set according to the nature of the chromatographic data.
5. The value of n is further increased until both $d_n$ and $d_{n+1}$ are positive, which is taken as the end point of the peak ($E_1$).
Steps 2–5 are repeated until the end of the SIC is reached to find all I potential peaks (denoted by i) in the mass channel. In many cases, n will define both the end of one peak and the start of the following peak, with $E_i = S_{i+1}$. Figure 4 illustrates steps 2–5 over a section of SIC where two peaks are present.
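Steps 1 to 5 can be sketched as below on a single smoothed SIC. This is a literal reading of the steps rather than the authors' implementation. The derivative uses the standard 5-point quadratic Savitzky–Golay first-derivative weights (−2, −1, 0, 1, 2)/10. The function names, the zero-based indexing, the use of $\upsilon_{n-2}$ in criterion (ii), and closing the last peak at the final scan with a valid derivative are assumptions of this sketch.

```python
import math

def sg_first_derivative(u):
    """5-point quadratic Savitzky-Golay first derivative (edges left at 0)."""
    d = [0.0] * len(u)
    for n in range(2, len(u) - 2):
        d[n] = (-2 * u[n - 2] - u[n - 1] + u[n + 1] + 2 * u[n + 2]) / 10.0
    return d

def detect_peaks(u, detection_limit, v1=7, v2=2):
    """Return a list of (start, maximum, end) scan indices for one SIC."""
    d = sg_first_derivative(u)
    last = len(u) - 3  # last index with a valid derivative
    peaks = []
    n = 2
    while n < last:
        # Step 2: first scan with a positive derivative is the peak start.
        while n < last and d[n] <= 0:
            n += 1
        if n >= last:
            break
        start = n
        # Steps 3-4: two consecutive negative derivatives mark a candidate
        # maximum, accepted only if one of the two distance criteria holds.
        apex = None
        while n < last:
            if d[n] < 0 and d[n + 1] < 0:
                far_enough = n - start > v1
                shoulder = n - start > v2 and u[n - 2] >= 4 * detection_limit
                if far_enough or shoulder:
                    apex = n
                    break
            n += 1
        if apex is None:
            break
        # Step 5: two consecutive positive derivatives mark the peak end.
        while n < last and not (d[n] > 0 and d[n + 1] > 0):
            n += 1
        peaks.append((start, apex, n))
    return peaks

# Two gaussian peaks (SD 5 scans) centred on scans 40 and 90.
signal = [100 * (math.exp(-((n - 40) ** 2) / 50) + math.exp(-((n - 90) ** 2) / 50))
          for n in range(130)]
peaks = detect_peaks(signal, detection_limit=1.0)
```

On this noise-free example the end of the first peak coincides with the start of the second, which is the situation $E_i = S_{i+1}$ described in the text.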
3.2.2. Stage D2: peak validation

To protect against false peaks obtained using the criteria above, a series of peak validation steps is applied, as described in reference [19]. A candidate is retained as a peak only if it passes all three of the following tests:

1. The number of times the signal between the start and end crosses the mean intensity is counted. If this number is more than 30% of the peak width ($E_i - S_i$), the peak is rejected.
2. If the peak is too narrow (for this application, $E_i - S_i < 10$ scans), it is rejected.
3. A threshold $k_i$ is calculated as $k_i = 2\sqrt{h_i}\,F_w$, where $h_i$ is the height at the centre $C_i$. The following method was employed to remove small peaks: a line was fitted between $S_i$ and $E_i$; the intensity of the proposed peak above this line was computed at all points in time and the smallest half were retained; a baseline is fitted to these values. If the baseline-corrected peak maximum is less than $k_i$, the peak is rejected. Further details are described in reference [19].

If a peak is retained, a linear baseline is computed between $S_i$ and $E_i$. The area, $a_i$, is calculated as the sum of all points from start to finish above this baseline. The height of the peak at $C_i$ (after baseline removal) is also computed. Once all peaks have been validated, a sparse matrix $G$ of dimensions $N \times W_b$ is obtained, in which, for mass channel $w$, the values are zero except at the detected peak maxima, which are given the values of the corresponding peak areas.

Figure 4. Graphical illustration of the peak detection stage D1. Five key positions are marked.
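Criteria 1 and 2 of stage D2 and the baseline-corrected area can be sketched as below. This is an illustrative outline only: it assumes the SIC is a Python list and that the start and end indices come from stage D1, all function names are hypothetical, and the noise-threshold test of criterion 3 is left out because it relies on quantities defined elsewhere in the paper and in reference [19].

```python
def mean_crossings(segment):
    """Count how often the signal crosses its own mean intensity."""
    mean = sum(segment) / len(segment)
    above = [value > mean for value in segment]
    return sum(1 for a, b in zip(above, above[1:]) if a != b)


def passes_width_and_crossing_checks(u, start, end, min_width=10, max_fraction=0.3):
    """Apply criterion 2 (minimum width) and criterion 1 (mean crossings)."""
    width = end - start
    if width < min_width:
        return False
    return mean_crossings(u[start:end + 1]) <= max_fraction * width


def baseline_corrected_area(u, start, end):
    """Sum the intensity lying above a straight line joining u[start] and u[end]."""
    width = end - start
    area = 0.0
    for n in range(start, end + 1):
        baseline = u[start] + (u[end] - u[start]) * (n - start) / width
        area += max(u[n] - baseline, 0.0)
    return area
```

A smooth peak crosses its mean only twice (once on the way up and once on the way down), whereas a noisy stretch of baseline crosses it many times, which is why the crossing count is a cheap way to reject noise.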
3.2.3. Stage D3: grouping mass channel peaks from the same compound

The final stage of peak detection is to group peaks detected in different mass channels that have similar retention times. Each mass channel will have $I_w$ peaks, and there will be peaks in different mass channels which can be assumed to have come from the same origin: chromatographic peak $j$. These mass channel peaks are grouped as follows:

1. $w$ is a vector of length $N$, where element $w_n$ is the number of non-zero elements in row $n$ of matrix $G$. For most scans this will be zero.
2. Starting at $j = 1$, $r_j$ is taken as the scan where $w$ is a maximum.
3. A window of width $2V_3 + 1$ scans is set around $r_j$ (in this application, $V_3 = 3$). This accounts for the fact that, due to noise, peaks from different mass channels can have slightly different shapes, with small differences in peak maxima.
4. All peaks (over all mass channels) with peak maxima within this window are regarded as part of the same compound peak. In each mass channel of $G$, any values present in this window ($n = r_j - V_3$ to $r_j + V_3$) are shifted to time point $n = r_j$. The area of peak $j$ is then determined by $a_j = \sum_{w=1}^{W} g_{nw}$, where $n = r_j$. The mass spectrum for peak $j$ ($g_j$) is created as a vector of length 316. The heights of all mass channels that show a non-zero value in matrix $G$ at scan $n = r_j$ are added to the mass spectrum (after the removal of the baseline), while all remaining mass channels take a value of zero.
5. The entries of $G$ at scan $n = r_j$ are set to zero and the next largest value of $w$ is searched for, to determine peak $j + 1$.
6. Steps 3–5 are repeated until all values of $w$ are zero, and $J_b$ chromatogram peaks have been found.
7. Peaks with $a_j$ below a given threshold ($V_4$) are removed.
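The grouping steps above can be sketched as follows, assuming $G$ is stored as a list of rows (one per scan) with one entry per mass channel. As a simplification that is not in the paper, the sketch records peak areas rather than baseline-corrected heights in the spectrum; the function name and the return format are hypothetical.

```python
def group_channel_peaks(G, v3=3, v4=5000):
    """Group per-channel peak maxima into compound peaks.

    G[n][w] holds the area of a peak whose maximum lies at scan n in mass
    channel w, and 0 elsewhere. Returns (scan, total_area, spectrum) tuples.
    """
    G = [row[:] for row in G]  # work on a copy so the caller's matrix is untouched
    n_scans = len(G)
    compounds = []
    while True:
        # Step 1: number of non-zero channels at every scan.
        counts = [sum(1 for g in row if g != 0) for row in G]
        # Steps 2 and 5: the scan with the most coincident maxima comes next.
        r = max(range(n_scans), key=counts.__getitem__)
        if counts[r] == 0:
            break
        # Steps 3 and 4: collect everything within +/- v3 scans of r.
        spectrum = {}
        for n in range(max(0, r - v3), min(n_scans - 1, r + v3) + 1):
            for w, g in enumerate(G[n]):
                if g != 0:
                    spectrum[w] = spectrum.get(w, 0) + g
                    G[n][w] = 0
        compounds.append((r, sum(spectrum.values()), spectrum))
    # Step 7: discard compound peaks whose total area is below the threshold.
    return [c for c in compounds if c[1] >= v4]
```

Zeroing the whole window at once has the same effect as first shifting the windowed values to the central scan and then clearing that scan.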
3.3. Peak matching

The next procedure is to determine which chromatographic peaks, found in different samples, have a common origin. In this application, as in many applications in metabolomics, it is not possible to have prior knowledge about the number of compounds in the sample, or their retention times, and it is not possible to generate, in advance, a list of marker peaks that will be present in every sample. As it is probable that there will be a large difference between all samples, it is not possible to set one sample as a target chromatogram and align further samples to the target, so a method was devised that overcomes these limitations.

In the first stage, all chromatograms are analysed, and all target peaks extracted. All target peaks will have retention times and mass spectra sufficiently different from one another that they are deemed to have originated from different compounds. The second stage is to determine which samples contain which target peaks. It may be the case that two target peaks actually originate from the same compound, but due to noisy mass spectra, they were identified as arising from two different compounds. Therefore, the third stage is to check whether any target peaks should be merged together, leaving a list of $L$ unique compounds over the entire data set. The following three stages are carried out to identify all $L$ unique compounds in the data set.
3.3.1. Stage A1: defining candidate target peaks

1. The chromatograms are ordered in a sequence, numbered 1 to $B$, and processed in this sequence.
2. All peaks detected in chromatogram 1 are set as target peaks.
3. The first peak $j$ detected in chromatogram 2 is selected. It is necessary to determine whether it has a common origin with a target peak in chromatogram 1 or whether it is a unique new compound in its own right and should be added to the list of target peaks.
4. The mass spectrum of peak $j$ ($g_j$) is normalised as per Section 2.5.2 step 1, and a user-defined time window of size $2V_5 + 1$ (in this application $V_5 = 100$ scans) is set from $r_j - V_5$ to $r_j + V_5$.
5. The mass spectra $g_t$ of each of the $T$ target peaks whose retention time is within the window (in the first instance these will simply be all peaks that elute within this window in chromatogram 1) are also normalised.
6. The number of masses with non-zero intensity that are common to $g_j$ and each of the $T$ target spectra is computed. If a target spectrum has fewer than 2 masses in common with the sample spectrum, it is rejected as a possible match. No intensity information is employed in this step.
7. The cosine matching function between $g_j$ and each remaining target spectrum $g_t$ is calculated as

$$e_{jt} = \frac{g_j\, g_t'}{\lVert g_j \rVert \, \lVert g_t \rVert}$$

with $e_{jt} = 1$ being a perfect match.
8. If there is a value of $e_{jt}$ greater than $V_6$ (in this application, a value of 0.9 was set), then peak $j$ is matched to the target peak $t$ (in the first case this will be in sample 1) with the highest value. If no target peak is found (within the window specified in step 4) with a cosine matching factor of at least 0.9, the peak in sample 2 is itself regarded as a new target, with its associated retention time and mass spectrum.
9. This procedure is applied to all peaks detected in sample 2, and all peaks are either matched to existing target peaks or defined as targets in their own right. In certain situations two or more peaks from sample 2 may be matched to the same target peak (from sample 1). The values of $e$ are compared, and the highest score is taken as the true match. The rejected sample peak is then re-checked for matches against other target peaks, to see whether it is a target in its own right.
10. These steps are then applied to subsequent chromatograms, and once all $B$ chromatograms have been analysed, a list of all possible target peaks has been formed.

The aim of this preliminary phase is to obtain a list of target peaks. A flow chart summarising the steps is presented in Figure 5.

Figure 5. Flow diagram showing the first stage of peak matching (A1).
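A minimal sketch of the target-building loop of stage A1 is given below, assuming each peak is a `(scan, spectrum)` pair with the spectrum stored as a `{mass: intensity}` dictionary. The function names are hypothetical, the normalisation of Section 2.5.2 is omitted (the cosine value does not depend on the overall scale of either spectrum), and the tie-breaking of step 9, in which a rejected sample peak is re-checked against other targets, is not implemented. New targets from a chromatogram are only added once all of its peaks have been examined, which is an assumption of this sketch.

```python
import math


def cosine_match(a, b):
    """Cosine match factor between two spectra stored as {mass: intensity}."""
    dot = sum(v * b.get(m, 0.0) for m, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_target(peak, targets, v5=100, v6=0.9, min_common=2):
    """Index of the best-matching target for peak = (scan, spectrum), or None."""
    scan, spectrum = peak
    masses = {m for m, v in spectrum.items() if v}
    best, best_score = None, v6
    for index, (t_scan, t_spectrum) in enumerate(targets):
        if abs(t_scan - scan) > v5:
            continue  # outside the retention time window (step 4)
        t_masses = {m for m, v in t_spectrum.items() if v}
        if len(masses & t_masses) < min_common:
            continue  # fewer than two masses in common (step 6)
        score = cosine_match(spectrum, t_spectrum)
        if score > best_score:  # steps 7 and 8
            best, best_score = index, score
    return best


def build_targets(chromatograms):
    """Process chromatograms in order; unmatched peaks become new targets."""
    targets = []
    for peaks in chromatograms:
        unmatched = [p for p in peaks if best_target(p, targets) is None]
        targets.extend(unmatched)
    return targets
```

With an empty target list every peak of the first chromatogram is unmatched, so all of its peaks become targets, as in step 2.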
3.3.2. Stage A2: obtaining groups of peaks associated with a target

The next stage involves returning to all the chromatograms and assigning all non-target peaks to the best target using the matching factor and window as described above (with a restriction put in place to ensure that two peaks from the same chromatogram cannot be matched together). The same thresholds and window sizes are used as in stage A1 (Section 3.3.1). This stage is distinct from the first stage, which is used only to determine targets, even though there is some peak matching performed.

This stage is a key one, especially when data sets are very large. A typical problem is that a target peak may be determined, for example in chromatogram 1, and a peak in chromatogram 2 matched to this target. However, a better match may be found from a target in chromatogram 400, for example, which was not available when the second chromatogram was analysed. It is important to recognise that there may be a range of mass spectra and elution times associated with a specific compound, and target peaks could arise from the extremes of this distribution, so that peaks with intermediate characteristics could be assigned to either of these targets. The first stage simply identifies proposed targets without assigning any peaks to them, whereas the second stage tries to form a group of peaks associated with each target. A flow chart showing these steps is shown in Figure 6.

Figure 6. Flow chart showing the second stage of peak matching (A2).
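One way to honour the restriction that two peaks from the same chromatogram cannot be matched to the same target is a greedy assignment in order of decreasing match factor. The sketch below assumes the match factors within the retention time window have already been computed; the greedy strategy, the function name and the tuple layout are assumptions for illustration, and the paper does not state how conflicts are resolved at this stage.

```python
def assign_to_targets(candidates, v6=0.9):
    """Assign peaks to targets, best scores first.

    candidates: list of (chromatogram, peak, target, score) tuples.
    Returns {(chromatogram, peak): target}, with at most one peak per
    chromatogram assigned to any given target.
    """
    assignment = {}
    taken = set()  # (target, chromatogram) pairs that already hold a peak
    for chrom, peak, target, score in sorted(candidates, key=lambda c: -c[3]):
        if score <= v6:
            break  # every remaining candidate is below the matching threshold
        if (chrom, peak) in assignment or (target, chrom) in taken:
            continue
        assignment[(chrom, peak)] = target
        taken.add((target, chrom))
    return assignment
```

A peak that loses its preferred target to a better-scoring peak from the same chromatogram falls through to its next best candidate, if that candidate is still above the threshold.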
3.3.3. Stage A3: merging

In some cases two targets actually correspond to the same compound, but due to noisy mass spectra, co-elution or retention time shifts they may not be matched. Characteristic peaks arising from one specific compound can be regarded as forming a population over all the chromatograms, with some on the extremes. Whereas the spectral characteristics of the extremes may not match well, peaks with intermediate characteristics could be associated with either group. Hence, there may be several groups of peaks assigned to different targets that are in reality part of the same population. An analogy is a normal distribution where values at the tails lie so far from each other that they may not be recognised as part of the same distribution. If these extreme values are used to seed new populations, intermediate values will be assigned to one or other of these distributions, and it is necessary to determine whether these distributions should be merged.

When analysing smaller data sets, this problem is commonly overcome by manual inspection of the mass spectra, which can then be employed to modify the peak table if necessary, and so it is not so serious. With very large data sets, it is not possible to check manually every cluster of peaks associated with a target. However, for subsequent pattern recognition it is crucial that peaks are correctly assigned in a peak table. Therefore, the third and final peak matching stage is to determine whether two clusters of peaks (identified over all the chromatograms), associated with two targets, should be merged. Criteria such as mutual exclusivity could be employed, but only in situations where most compounds are expected to be present in nearly all samples. If, for example, a peak is present in only a few per cent of samples, then by looking just at mutual exclusivity, neighbouring groups could be erroneously merged.
The steps employed to determine which targets should be merged are as follows:

1. Starting from the target peak with the smallest retention time, all clusters of peaks associated with targets found within the time window $r_t$ to $r_t + V_5$ (100 scans in this application) are assessed to determine whether the groups should be merged. It is important to note that the retention time associated with a target is the time in the first chromatogram in which the target is detected. As the step starts by taking the smallest value of $r_t$, the window need only extend in one direction, looking for peaks with a greater retention time.
2. The average normalised spectrum from all chromatographic peaks over the entire data set associated with each target is computed, and $e$ is calculated between all pairs of average spectra. By using the average spectra rather than the spectra in the first chromatogram in which the target peak is detected, we obtain an average for the group of peaks associated with that target. If there are no values of $e$ greater than a pre-defined value $V_7$ within this window, it is assumed that all the target clusters come from independent origins. If the number of peaks associated with a target is greater than 1, the value of $V_7$ is set at 0.9; otherwise it is set at 0.95.
3. If there are values of $e$ greater than $V_7$, the two targets associated with the highest value of $e$ are further assessed to determine whether they should be merged. Provided there is no chromatogram which contains both peaks, these targets are merged.
4. If there is a merger between two targets, the match factor $e$ between the merged cluster and all other target clusters within the window is recalculated, and the procedure is repeated until there are no further possible mergers within the window.
5. The window is then moved by half the value of $V_5$, and the procedure repeated, until the entire chromatographic elution range is scanned and $L$ clusters remain, with each cluster containing a peak from a unique origin, a retention time $r_l$ and an average mass spectrum $b_l$.

These stages are shown as a flow chart in Figure 7.

Figure 7. Flow chart showing third stage of peak matching (A3).

3.4. Tuneable parameters

The tuneable parameters used in this application are listed in Table II. It is not the purpose of this paper to investigate these tuneable parameters in depth; however, our experience of the two data sets suggests that these parameters were fairly robust. It is not necessary to change all of the parameters for each data set. For example, the wavelet filter described in Section 3.1.3 was found to give good results when used on a data set sampled on a different instrument (not discussed in this paper), and can be used as a good first choice of parameter. Tuneable parameters that do need adjusting for specific data sets, and cannot easily take default values, are those which take into consideration physical sampling rates, signal intensity and average retention time shifts. These are marked with an asterisk (*) in the table.

Table II. Tuneable parameters; parameters marked with an asterisk (*) are those that are recommended as adjustable

Section | Parameter | Value in this paper
3.1.1 | Minimum width of non-zero points | 10
3.1.3 | Type of wavelet filter | Soft Daubechies 5 filter, level 3
3.2 | Calculation of 1st derivative | 5 point quadratic Savitzky–Golay filter
 | Minimum distance in scans between n and Si | V1 = 7
 | Minimum distance in scans between n and Si for a peak on the shoulder of the previous peak | V2 = 2
 | Size of window within which peaks from different mass channels are accepted as arising from the same peak | V3 = 3
 | Total area of peak | V4 = 5000
3.3.1, 3.3.2 and 3.3.3 | Parameter relating to window size for matching | V5 = 100
 | Matching threshold for spectral similarity in peak aligning | V6 = 0.9
 | Matching threshold for spectral similarity in peak cluster merging | V7 = 0.9 or 0.95

3.5. Validation

3.5.1. Data set 1: shuffle test

There is no obvious method for validating the peak matching process on data set 1 apart from manual checking of the peaks, which is impracticable if there are half a million possible peaks in the data set. One approach to show the repeatability of the method involves shuffling the order of the data set, performing the peak matching on the shuffled data and seeing how closely the results correspond to those for the unshuffled data. The method described in this paper does not require target chromatograms, but does identify successive potential target peaks by working successively through the chromatograms in an ordered list. The shuffle test can be used to demonstrate that target peaks are extracted irrespective of the order of the chromatograms.

3.5.2. Data set 2: comparison of true and estimated peaks

For data set 2, the areas and mass spectra of each of the $K_b$ peaks in chromatogram $b$ are known, as well as the number of peaks in each cluster over all chromatograms, so the estimates of these parameters can be compared to their true values.

To validate the peak detection, each chromatogram is taken in turn, and the true number of compounds $K_b$ is compared to the number of detected compounds $J_b = \hat{K}_b$. For true compound $k$, to check whether that compound has been detected correctly, the cosine match factor $e_{kj}$ is calculated between the normalised mass spectrum $b_k$ and all normalised extracted mass spectra where $r_k - 5 + d_b < r_j < r_k + 5 + d_b$. If a peak is found where $e_{kj} > 0.9$, then detected peak $j$ is assigned as originating from peak $k$, $\hat{b}_k$ is taken as $g_j$, and $\hat{a}_k$ is taken as $a_j$. It is then possible to calculate the difference between $\hat{a}_k$ and $a_k$ to study the similarity between true and estimated values of peak area. If more than one peak is found where $e_{kj} > 0.9$, it would not be possible to determine confidently which peak $j$ is a match to peak $k$, and no match would be assigned.

To assess the peak matching, each of the $K = 1000$ peaks is compared to the $L$ peak clusters from peak matching. To check which cluster $l$ corresponds to peak $k$, the cosine match factor $e_{kl}$ is calculated between the normalised mass spectrum $b_k$ and the normalised average mass spectra of each cluster where $r_k - V_5 < r_l < r_k + V_5$. If a cluster is found where $e_{kl} > 0.9$, then cluster $l$ is assigned as originating from peak $k$. The number of chromatograms included in cluster $l$ is taken as the estimate of the cluster size, $\hat{q}_k$, and this is compared with the true size of the cluster $q_k$.

4. RESULTS AND DISCUSSION

4.1. Data set 1: human sweat

4.1.1. Influence of stage A3

Stage A3 has an important influence on the number of unique peaks found in the data set. The aim is to merge clusters of peaks that are associated with targets that effectively share the same origin. The number of unique peaks reduces from 71 724 to 61 384, meaning that over 10 000 of the original target peaks (14.4%) were merged into joint clusters.
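The pairwise merging test of stage A3 (Section 3.3.3), which produces the reduction in the number of unique peaks reported here, might be sketched as follows. Several points are assumptions rather than the authors' stated method: clusters are dictionaries with a `scan` and a `members` mapping from chromatogram identifier to spectrum, spectra are scaled to unit length before averaging, and the stricter threshold of 0.95 is applied whenever either cluster contains a single peak. All names are hypothetical.

```python
import math


def unit(spectrum):
    """Scale a {mass: intensity} spectrum to unit length."""
    norm = math.sqrt(sum(v * v for v in spectrum.values()))
    return {m: v / norm for m, v in spectrum.items()}


def mean_spectrum(spectra):
    """Average of the unit-length spectra of all peaks in a cluster."""
    spectra = list(spectra)
    total = {}
    for spectrum in spectra:
        for m, v in unit(spectrum).items():
            total[m] = total.get(m, 0.0) + v
    return {m: v / len(spectra) for m, v in total.items()}


def cosine_match(a, b):
    """Cosine match factor between two {mass: intensity} spectra."""
    dot = sum(v * b.get(m, 0.0) for m, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_merge(a, b, v5=100):
    """Decide whether two target clusters should be merged."""
    if abs(a["scan"] - b["scan"]) > v5:
        return False  # targets are not within the same retention time window
    if a["members"].keys() & b["members"].keys():
        return False  # some chromatogram contains both peaks
    v7 = 0.9 if min(len(a["members"]), len(b["members"])) > 1 else 0.95
    score = cosine_match(mean_spectrum(a["members"].values()),
                         mean_spectrum(b["members"].values()))
    return score > v7
```

The sliding of the window by half of $V_5$ and the re-evaluation after each merger are left out; the sketch only covers the decision for a single pair of clusters.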
Table III. Number of clusters of peaks of given size or larger, before and after stage A3

Cluster size | Before: Number | Before: % | After: Number | After: %
100 | 226 | 3.381 | 297 | 5.913
200 | 92 | 1.376 | 157 | 3.126
300 | 60 | 0.898 | 94 | 1.871
400 | 41 | 0.613 | 72 | 1.433
500 | 22 | 0.329 | 48 | 0.956
600 | 14 | 0.209 | 35 | 0.697
700 | 7 | 0.105 | 17 | 0.338
800 | 5 | 0.075 | 9 | 0.179
900 | 4 | 0.060 | 5 | 0.100
Table III presents the number of clusters of peaks of a given size before and after stage A3. As an example, before stage A3 there are 226 peak clusters which have a cluster size of at least 100 (the peak being found in at least 100 chromatograms); after stage A3, there are 297 peak clusters which have a cluster size of at least 100. It can be seen that the number of occurrences of all cluster sizes increases after stage A3, showing that this stage has merged together several peak clusters which were originally assumed to originate from different compounds. The information is illustrated graphically in Figure 8. Note that only two internal standards were added to the mixture. There were very few compounds that were present in all chromatograms. The most useful potential marker compounds are likely to be present in between 25% and 75% of samples, as these are indicative of substantial groups. Compounds present in just a small number of samples or individuals are probably due to the environment.

From Table III and Figure 8 it can be seen that the number of large clusters has increased substantially, almost doubling in number for sizes of 300 or more. This suggests that stage A3 has had an important effect on the merging of clusters. For relatively small data sets this is not such a serious problem. The difficulty is that if there are 1000 chromatograms, there is a much higher chance of a mismatch and of a new target being found part way through the data set, which grows a new cluster. In order to study this, 100 chromatograms were selected randomly from the data set. These were subjected to the peak matching process and 11 680 targets were found prior to stage A3, reducing to 11 020 after this stage, a reduction of 5.65% compared to 14.4% on the overall matrix. Manual inspection of the data could protect against this problem, but once the data set becomes large, this is not practicable.
4.1.2. Validation by shuffling

As discussed in Section 3.5, an important way of validating the peak matching procedure is by shuffling the data. Table IV presents the number of peaks found in data set 1 as analysed both in its original order and in the shuffled order. The right-hand column represents the number of peaks found in at least one subject four out of five times (or three out of four times if they are sampled four times). It can be seen that there is good agreement, although a few differences remain. Closer inspection of the data finds that, of the peaks in the right-hand column, 20 unique targets are present in only one of the two data sets (original order and shuffled). However, the average occurrence of these 20 peaks is 60 samples and the maximum 187 samples (out of 965), so it is only the less abundant peaks where this problem occurs. This suggests that there may be some quite small clusters which could represent overlapping or noisy peaks. There will, of course, always be a few errors in peak matching. However, the difference in the numbers of peaks at each stage is small, giving good confidence in the analysis and indicating that the shuffle test is a useful one for determining the robustness of the method.
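The comparison reported in Table IV amounts to running the same pipeline on two orderings of the chromatograms and expressing the change in a count as a percentage of the original-order value. A small sketch is given below; `count_clusters` stands in for the whole detection and matching pipeline and, like the other names, is hypothetical.

```python
import random


def percent_difference(original, shuffled):
    """Absolute difference as a percentage of the original-order count."""
    return 100.0 * abs(original - shuffled) / original


def shuffle_test(chromatograms, count_clusters, seed=0):
    """Run a counting pipeline on the original and on a shuffled ordering."""
    shuffled = list(chromatograms)
    random.Random(seed).shuffle(shuffled)
    n_original = count_clusters(chromatograms)
    n_shuffled = count_clusters(shuffled)
    return n_original, n_shuffled, percent_difference(n_original, n_shuffled)
```

Applied to the stage A2 counts of Table IV (71 724 in the original order and 71 493 after shuffling), `percent_difference` gives the 0.32% listed there.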
Figure 8. Number of peak clusters of varying size ($q_k$) in data set 1 before and after stage A3.

4.1.3. Peak distributions
The peak distributions are illustrated by reference to the unshuffled data. Very similar conclusions can be obtained from analysis of the shuffled chromatograms. Figure 9 shows the distribution of peak cluster sizes on the entire data set when the minimum cluster size is set to 5 chromatograms.
Table IV. Number of peaks found in data set 1, in the original order and in a shuffled order

 | Unique peaks (stage A2) | After clustering (stage A3) | Peaks in at least five samples | Peaks retained in Yp − 1 subjects (who had at least four repeats)
Original | 71724 | 61384 | 5023 | 470
Shuffled | 71493 | 61257 | 4971 | 463
Difference | 231 | 127 | 52 | 7
% difference | 0.32 | 0.21 | 1.03 | 1.5

The numbers are very similar in both cases, showing that the order of analysis of the chromatograms is not important.
Figure 9. Cumulative distribution of peak cluster sizes in data set 1 after peak matching.
It can be seen in Table III that there are only 5 peaks found in more than 900 samples. The mass spectra of these peaks are shown in Figure 10. This small number is not unexpected. Whereas sweat itself will contain plasma proteins and lipids, the sweat sampling technique and the GC-MS analytical conditions in this study are optimised towards volatile and semi-volatile compounds.

Of interest is the number of peaks that are found in several repeats of an individual, as these are likely to be diagnostic markers. It is important to recognise that there can be environmental and analytical factors that influence repeatability. Even the weather can affect a sample: if a person sweats more, this may mean higher concentrations in samples and so more peaks above detection limits. However, a good marker should nevertheless be detected in the majority of repeats of at least one individual. Table V summarises this information. Note that 182 individuals were sampled five times, 13 four times, 1 twice and 1 once. There is a very steep increase when all individuals are considered, because a few individuals sampled once or twice introduce a large number of extra compounds that probably would be removed if five repeats were available. We recommend using a $Y_p - 1$ criterion for individuals with $Y_p \geq 4$, which results in 470 peaks. Note that nearly 100 of these peaks are due to siloxanes from the analytical process, which should be removed prior to metabolomic analysis. However, the purpose of this paper is to present the peak picking and matching algorithm and not to make further analyses.

Figure 10. Mass spectra of the 5 peaks found in more than 900 samples.

Table V. Number of peak clusters associated with a target in a minimum number of repeat samples from at least one individual, taking into account all individuals (including those sampled once or more), individuals sampled at least four times and individuals sampled five times

No. of repeat samples | All subjects | Only subjects with four or five repeats | Only subjects with five repeats
Yp | 387 | 239 | 220
Yp − 1 | 685 | 470 | 414
Yp − 2 | 999 | 999 | 787
Yp − 3 | 3316 | 3316 | 2104
Yp − 4 | 5023 | 5023 | 5023

4.2. Data set 2: synthetic data

Data set 2 had similar characteristics to data set 1, and so the same tuneable parameters were used.

4.2.1. Peak detection

Figure 11 compares the estimated number of peaks ($J_b = \hat{K}_b$) with the true number of peaks ($K_b$) for all chromatograms. It can be seen that there is a good correlation between the two values (correlation coefficient = 0.95). For all of the $B$ chromatograms, it can be seen that $\hat{K}_b > K_b$, with the maximum difference being 29 peaks (12.39% of the mean number of peaks), and the mean difference being 13.5 peaks (6.11%). This shows that some false peaks were being detected. These are likely to be in regions of noise, or could be due to a single peak being misinterpreted as two peaks.

Figure 12 shows the estimated area of each peak ($\hat{a}_k$) versus the true area $a_k$ for all $K_b$ peaks in all $B$ chromatograms. The correlation coefficient between all estimated and true values is 0.98. The mean of all true peak areas is $\sum_{k=1}^{K_b} a_k / K_b = 1.31 \times 10^6$ and the root mean square error in peak area is $1.28 \times 10^5$, corresponding to 9.77% of the overall mean.

Figure 11. Number of detected peaks ($J_b = \hat{K}_b$) in each data set 2 chromatogram versus true number ($K_b$).

4.2.2. Peak matching

A plot of estimated cluster size ($\hat{q}_k$) versus true cluster size ($q_k$) for each of the 1000 peaks is given in Figure 13. There are several examples where $\hat{q}_k < q_k$. This difference is due to overlapping peaks in the simulated data set. If two closely overlapping peaks have similar mass spectra, then the individual peak shapes in the associated SICs cannot be separated, and will be treated as one peak. The extracted mass spectrum of one (or both) of the overlapping peaks will be distorted, leading to the peak not being accurately matched to its correct cluster. Selecting only clusters where $q_k \geq 100$, there are only 23 clusters out of 327 where $\hat{q}_k$ is less than 50% of $q_k$, and 107 clusters where $\hat{q}_k$ is less than 90% of $q_k$. There are 20 instances where $\hat{q}_k > q_k$. This would occur if two clusters from two separate compounds had been merged together, and the averaged mass spectrum had a high match factor with only one of the $k$ true mass spectra.

There were 68 peaks from the 1000 studied which were not included in the graph. Eighteen of these were peaks for which no cluster had a value of $e_{kl} > 0.9$, probably due to the mass spectra being distorted during detection, or the peaks having a low signal to noise ratio. The remaining 50 peaks were not plotted because in each case there was a given cluster $l$ which had a value of $e_{kl} > 0.9$ for 2 different values of $k$. This would occur if two (or more) peaks were merged together into one cluster, and this could happen either in the peak detection procedure (if two peaks overlapped, the average mass spectrum of cluster $l$ would contain information from both the synthetic mass spectra) or in the peak matching procedure (if both peaks had similar mass spectra and the retention times were within a given window, they would be combined in the same cluster). The average mass spectrum of cluster $l$ would have a high similarity to both of the original spectra.

Figure 13. Estimated cluster size $\hat{q}_k$ versus true cluster size $q_k$, for 932 of the 1000 peaks in data set 2.
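The summary counts quoted for the cluster size comparison (clusters with an estimated size below 50% or below 90% of the true size, and clusters whose size is overestimated) can be tallied with a few lines of code. The sketch below is illustrative only: it assumes a list of (true size, estimated size) pairs for the matched compounds, counts overestimates over all pairs rather than only the large clusters, and uses a hypothetical function name.

```python
def summarise_cluster_sizes(pairs, min_true_size=100):
    """Summarise agreement between true and estimated cluster sizes.

    pairs: iterable of (true_size, estimated_size), one per matched compound.
    """
    pairs = list(pairs)
    large = [(q, q_hat) for q, q_hat in pairs if q >= min_true_size]
    return {
        "large_clusters": len(large),
        "below_50_percent": sum(1 for q, q_hat in large if q_hat < 0.5 * q),
        "below_90_percent": sum(1 for q, q_hat in large if q_hat < 0.9 * q),
        "overestimated": sum(1 for q, q_hat in pairs if q_hat > q),
    }
```

Every cluster counted in the 50% category is also counted in the 90% category, so the second figure is always at least as large as the first.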
5. CONCLUSIONS
Figure 12. Estimated peak area $\hat{a}_k$ versus true peak area $a_k$ in data set 2.
Increasingly, chemometricians are faced with large data sets, especially in metabolomics and proteomics. Many such data sets arise from coupled chromatography. With the capacity of modern instruments some data sets can be very large, and pose specific problems. Over the past decade the chemometric analysis of coupled chromatographic data has developed from quite modest applications (for example, deconvoluting a two or three component peak cluster), to medium sized applications as happens quite commonly in pharmaceutical chemistry (10–100 chromatograms containing 10–50 interesting peaks each), to data sets of the size analysed in this paper.

Data pre-processing, including peak picking and matching, poses special challenges when faced with such large data sets. There is almost no alternative to a semi-automated method given the size of the problem, and little chance to check all the peak tables manually. In this paper, we present a method that has been very effective on a large real world data set and on a synthetic data set. Human sweat is especially rich in compounds and can provide information on, for example, genetics, personal habits and environment, but is hard to sample and contains many more compounds than, for example, urine. There is also increasing interest in the use of odor to detect disease, as it is well known that many mammals can detect disease through sniffing. Pattern recognition [33] can be performed on the data to try to unlock these signals, but a major challenge is to prepare the data for the pattern recognition algorithms. There are alternatives to the approaches outlined in this paper, and in some areas such as plant metabolomics, where it is easier to control the extraction procedure and there are many more common compounds that are diagnostic of specific groups, these alternatives may be more appropriate. However, the algorithms discussed in this paper are very powerful and suggest a potential strategy for peak detection and matching. In addition, we present some new approaches for validating the peak matching algorithms that can be applied more generally when data sets are large.
Acknowledgements We thank Yun Xu, Fan Gong and Hejun Duan of the Centre for Chemometrics for helpful discussions. Lisa Oberzaucher and Karl Grammer of the University of Vienna are thanked for sampling. Alexandra Katzer is thanked for her superb organisational skills. This work was sponsored by ARO Contract DAAD19-03-1-0215. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
REFERENCES

1. Kuhara T. Gas chromatographic-mass spectrometric urinary metabolome analysis to study mutations of inborn errors of metabolism. Mass. Spec. Rev. 2005; 24: 814–827.
2. Carrillo JD, López AG, Tena MT. Determination of volatile oak compounds in wine by headspace solid-phase microextraction and gas chromatography-mass spectrometry. J. Chromatogr. A 2006; 1102: 25–36.
3. Shen H, Carter JF, Brereton RG, Eckers C. Discrimination between tablet production methods using pyrolysis-gas chromatography-mass spectrometry and pattern recognition. Analyst 2003; 128: 287–292.
4. Service K, Brereton RG, Harris S. Analysis of badger urine volatiles using gas chromatography-mass spectrometry and pattern recognition techniques. Analyst 2001; 126: 615–623.
5. Mishra S, Tripathi RM, Bhalke S, Shukla VK, Puranik VD. Determination of methylmercury and mercury (II) in a marine ecosystem using solid-phase microextraction gas chromatography-mass spectrometry. Anal. Chim. Acta 2005; 551: 192–198.
6. Soini HA, Bruce KE, Klouckova I, Brereton RG, Penn DJ, Novotny MV. In-situ surface sampling of biological objects and preconcentration of their volatiles for chromatographic analysis. Anal. Chem. 2006; 78: 7161–7168.
7. Baltussen E, Cramers CA, Sandra PJ. Sorptive sample preparation – a review. Anal. Bioanal. Chem. 2002; 373: 3–22.
8. Gong F, Wang BT, Chau FT, Liang YZ. Data preprocessing for chromatographic fingerprint of herbal medicine with chemometric approaches. Anal. Letts. 2005; 38: 2475–2492.
9. Nielsen NPV, Carstensen JM, Smedsgaard J. Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. J. Chromatogr. A 1998; 805: 17–35.
10. Kassidas A, MacGregor JF, Taylor PA. Synchronization of batch trajectories using dynamic time warping. AIChE J. 1998; 44: 864–875.
11. Torgrip RJO, Åberg M, Karlberg B, Jacobsson SP. Peak alignment using reduced set mapping. J. Chemom. 2003; 17: 573–582.
12. Fraga CG, Prazen BJ, Synovec RE. Objective data alignment and chemometric analysis of comprehensive two-dimensional separations with run-to-run peak shifting on both dimensions. Anal. Chem. 2001; 73: 5833–5840.
13. Andersson R, Hämäläinen MD. Simplex focusing of retention times and latent variable projections of chromatographic profiles. Chemom. Intell. Lab. Syst. 1994; 22: 49–61.
14. Jonsson P, Bruce SJ, Moritz T, Trygg J, Sjöström M, Plumb R, Granger J, Maibaum E, Nicholson JK, Holmes E, Antti H. Extraction, interpretation and validation of information for comparing samples in metabolic LC/MS data sets. Analyst 2005; 130: 701–707.
15. Windig W, Phalp JM, Payne AW. A noise and background reduction method for component detection in liquid chromatography/mass spectrometry. Anal. Chem. 1996; 68: 3602–3606.
16. Hastings CA, Norton AM, Roy S. New algorithms for processing and peak detection in liquid chromatography/mass spectrometry data. Rapid Commun. Mass Spectrom. 2002; 16: 462–467.
17. Vivó-Truyols G, Torres-Lapasió JR, van Nederkassel AM, van der Heyden Y, Massart DL. Automatic program for peak detection and deconvolution of multioverlapped chromatographic signals, Part I, Peak detection. J. Chromatogr. A 2005; 1096: 133–145.
18. Jarman KH, Daly DS, Anderson KK, Wahl KL. A new approach to automated peak detection. Chemom. Intell. Lab. Syst. 2003; 69: 61–76.
19. Stein SE. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J. Am. Soc. Mass Spectrom. 1999; 10: 770–781.
20. Silva-Wilkinson RA, Burkhard LP, Sheedy BR, DeGraeve GM, Lordo RA. A simple comparison of mass spectral search results and implications for environmental screening analyses. Arch. Environ. Cont. Toxicol. 1999; 36: 109–114.
21. Pesyna GM, Venkataraghavan R, Dayringer HE, McLafferty FW. Probability based matching system using a large collection of reference mass spectra. Anal. Chem. 1976; 48: 1362–1368.
22. Stein SE, Scott DR. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 1994; 5: 859–866.
23. Shen H, Grung B, Kvalheim OM, Eide I. Automated curve resolution applied to data from multi-detection instruments. Anal. Chim. Acta 2001; 446: 313–328.
24. Eide I, Neverdal G, Thorvaldsen B, Arneberg R, Grung B, Kvalheim OM. Toxicological evaluation of complex
340 S. J. Dixon et al.
25. 26.
27.
28.
29.
30.
31.
mixtures: fingerprinting and multivariate analysis. Environ. Toxicol. Pharmacol. 2004; 18: 127–133. Jiang JH, Ozaki Y. Self-modeling curve resolution (SMCR), Principles, techniques, and applications. Appl. Spectrosc. Rev. 2002; 37: 321–345. Idborg H, Edlund PO, Jacobsson SP. Multivariate approaches for efficient detection of potential metabolites from liquid chromatography/mass spectrometry data. Rapid Commun. Mass Spectrom. 2004; 18: 944–954. Steffen B, Mu¨ller KP, Komenda M, Koppmann R, Schaub A. A new mathematical procedure to evaluate peaks in complex chromatograms. J. Chromatogr. A 2005; 1071: 239–246. Jonsson P, Johansson AI, Gullberg J, Trygg JAJ, Grung B, Marklund S, Sjo¨stro¨m M, Antti H, Moritz T. Highthroughput data analysis for detecting and identifying differences between samples in GC/MS-based metabolomic analyses. Anal. Chem. 2005; 77: 5635–5642. Dixon SJ, Brereton RG, Carter JF, Sleeman R. Determination of cocaine contamination on banknotes using tandem mass spectrometry and pattern recognition. Anal. Chim. Acta 2006; 559: 54–63. Jonsson P, Gulberg J, Nordstro¨m A, Kusano M, Kowalczyk M, Sjo¨stro¨m M, Moritz T. A strategy for identifying differences in large series of metabolomic samples analyzed by GC/MS. Anal. Chem. 2004; 76: 1738–1745. Christensen JH, Mortensen J, Hansen AB, Andersen O. Chromatographic preprocessing of GC-MS data for
Copyright # 2007 John Wiley & Sons, Ltd.
32. 33. 34. 35.
36.
37. 38. 39. 40.
analysis of complex chemical mixtures. J. Chromatogr. A 2005; 1062: 113–123. Duran AL, Yang J, Wang LJ, Sumner LW. Metabolomics spectral formatting, alignment and conversion tools (MSFACTs). Bioinformatics 2003; 19: 2283–2293. Dixon SJ, Xu Y, Brereton RG, Soini HA, Novotny MV, Oberzaucher E, Grammer K, Penn DJ. Chemom. Intell. lab. Sys. In press. http://mexcdf.sourceforge.net/. Windig W, Phalp JM, Payne AW. A noise and background reduction method for component detection in liquid chromatography/mass spectrometry. Anal. Chem. 1996; 68: 3602–3606. Zomer S, Brereton RG, Wolff JC, Airiau CY, Smallwood C. Component detection weighted index of analogy: similarity recognition on liquid chromatographic mass spectral data for the characterization of route/process specific impurities in pharmaceutical tables. Anal. Chem. 2005; 77: 1607–1621. Donoho DL. De-noising by soft thresholding. IEEE Trans. Inf. Theory 1995; 41: 613–627. Special issue on wavelets. Proc. IEEE. 1996; 84: 507–685. Savitzky A, Golay MJE. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 1964; 36: 1627–1639. Brereton RG. Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Wiley: Chichester, UK, 2003; p 132.
J. Chemometrics 2006; 20: 325–340 DOI: 10.1002/cem