Research

Assessing Bias in Experiment Design for Large Scale Mass Spectrometry-based Quantitative Proteomics*□S

Amol Prakash‡§¶‖, Brian Piening§**, Jeff Whiteaker§, Heidi Zhang§, Scott A. Shaffer‡‡, Daniel Martin§ §§, Laura Hohmann§§, Kelly Cooke§§, James M. Olson§, Stacey Hansen§, Mark R. Flory¶¶, Hookeun Lee‖‖, Julian Watts§§, David R. Goodlett‡‡, Ruedi Aebersold§§‖‖, Amanda Paulovich§, and Benno Schwikowski‡¶

Mass spectrometry-based proteomics holds great promise as a discovery tool for biomarker candidates in the early detection of diseases. Recently, much emphasis has been placed upon producing highly reliable data for quantitative profiling, for which highly reproducible methodologies are indispensable. The main problems that affect experimental reproducibility stem from variations introduced by sample collection, preparation, and storage protocols and LC-MS settings and conditions. On the basis of a formally precise and quantitative definition of similarity between LC-MS experiments, we have developed Chaorder, a fully automatic software tool that can assess experimental reproducibility of sets of large scale LC-MS experiments. By visualizing the similarity relationships within a set of experiments, this tool can form the basis of systematic quality control and thus help assess the comparability of mass spectrometry data over time, across different laboratories, and between instruments. Applying Chaorder to data from multiple laboratories and a range of instruments, experimental protocols, and sample complexities revealed biases introduced by the sample processing steps, experimental protocols, and instrument choices. Moreover we show that reducing bias by correcting for just a few steps, for example randomizing the run order, does not provide much gain in statistical power for biomarker discovery. Molecular & Cellular Proteomics 6:1741–1748, 2007.

From the Departments of ‡Computer Science and Engineering and ‡‡Medicinal Chemistry and **Molecular and Cellular Biology Program, University of Washington, Seattle, Washington 98195; §Fred Hutchinson Cancer Research Center, Seattle, Washington 98109; §§Institute for Systems Biology, Seattle, Washington 98103; ¶Systems Biology Group, Institut Pasteur, 75015 Paris Cedex 15, France; ‖‖Institute of Molecular Systems Biology, 8093 Zurich, Switzerland; and ¶¶Department of Molecular Biology and Biochemistry, Wesleyan University, Middletown, Connecticut 06459

Received December 14, 2006, and in revised form, June 6, 2007. Published, MCP Papers in Press, July 24, 2007, DOI 10.1074/mcp.M600470-MCP200

© 2007 by The American Society for Biochemistry and Molecular Biology, Inc. This paper is available on line at http://www.mcponline.org

Mass spectrometry has exhibited tremendous promise in probing complex biological samples globally at the protein level (1–4). This capability is of key importance for the identification of diagnostic biomarkers for developing early detection methodologies for many human diseases, including cancers, and for unbiased, global measurements of cellular processes that are a key component of systems biology approaches (5). Although mass spectrometers are capable of both selective and sensitive measurements, mass analyzers are limited in their dynamic range. The consequence is a limited capability to detect very low abundance analytes in biological samples with large dynamic range. Concomitantly, mass spectrometer duty cycles limit the number of CID events per unit of time and often lead to significant undersampling of more complex proteomes (6). Furthermore, the subset of peptides sampled for CID can vary from one experiment to the next, hindering both interpretation and confidence in quantification.

Many approaches therefore go beyond the straightforward use of CID for large scale protein identification. These include the accurate mass and time tag approach (7), clustering (8), complete workflow solutions for LC-MS data sets (9–12), alignment algorithms for LC-MS (13, 14), and feature detection approaches for SELDI platforms (15–17). These approaches rely heavily on high quality LC-MS profiles for peak alignment, peptide identification, and quantitation and thus require a high degree of reproducibility in the sample collection, processing, and analytical run conditions (12).

Reproducibility in LC-MS experiments has become increasingly essential with the popularization of large scale quantitative proteomics (19). The sources of variation that most affect this reproducibility can vary depending on the experimental platform and, in particular, the choice of instrument. Critical sources of variation common to all LC-MS experiments include variation in signal intensity, mass accuracy (20–24), and elution profile; the latter can be further decomposed into variations in peptide elution time, elution order, and peak width (25).


Sample collection (e.g. harvesting conditions), preparation protocols (e.g. freeze/thaw cycles), experimental design (e.g. run order), platform stability (e.g. column and spray), and sample stability can all affect results, thereby leading to biomarkers that have no biological basis or to biomarker discovery with high false positive rates. Further experimental variation is often observed when comparing results across laboratories (26, 27) and instruments. Previous work highlighting these problems (19–24) has suggested that careful experimental design can minimize variation. Recently, focused studies have assessed the reproducibility of MS/MS acquisition (28) and ICAT labeling (29). Thus, to improve the reproducibility of various protocol steps, biologists and chemists are exploring various techniques and ideas, for example different numbers of washes, different columns, different sample processing strategies, randomizing run order, etc. However, the lack of proper measures of experimental reproducibility prevents investigators from fully understanding the impact of their protocol modifications, thus impeding progress.

A common strategy to study reproducibility in sample and platform quality control is to spike a set of standard peptides/proteins into a sample and then describe variance in the measurement through coefficient of variation scores (30, 31). This strategy is problematic in two respects. First, using 5–10 peptides to represent the complexity of the entire sample is a classic case of statistical under-representation, especially for complex samples like human serum, which potentially may contain a few hundred thousand peptides (32). Second, each coefficient of variation score captures only one particular reproducibility factor, e.g. mass accuracy, intensity, etc., ignoring all other factors. In addition, there are no well characterized coefficient of variation scores for factors such as signal-to-noise ratio, elution profile, elution time, etc.

Comparative measures for experimental data have proven to be key enablers of progress in other scientific domains. Examples include the BLAST (Basic Local Alignment Search Tool) (33) E-value score for the degree of homology between proteins and the Phred (34) score, a key ingredient in enabling quality-assessed, large scale, automated DNA sequencing. An ideal measure of reproducibility in LC-MS-based proteomics would build on a qualitative and quantitative measurement of every peptide or protein present in a sample and compare experiments based on this knowledge. However, our inability to attain identification for the majority of peaks in an experiment makes a corresponding definition of reproducibility impractical.

RESULTS

An Approach to Measure Similarity in LC-MS Experiments—Here we present a practical, quantitative measure of experimental reproducibility and, more generally, similarity in LC-MS-based proteomics. Our measure quantifies how similar all raw MS1 peaks from two LC-MS experiments are relative to each other after proper alignment in the retention time dimension. As we illustrate below, this definition captures changes in mass resolution, signal intensity, signal elution profile, and noise levels.


Building upon an existing alignment algorithm to compute this measure of similarity between two LC-MS experiments (9), we present a tool, Chaorder,¹ that produces a visual representation of the similarity relationships in a set of experiments. Chaorder is highly efficient, scalable, and parallelizable and thus can handle the gigabytes of data that current state-of-the-art instruments generate. It can handle data generated from a variety of instruments and of varying sample complexities (shown under “Results”). It takes into account all features (mass accuracy, elution profile, intensity, elution times, signal-to-noise, etc.) of all signals present in the data. Building on this tool, we propose a methodology for quality control of LC-MS data that is flexible enough to be applied to any LC-MS proteomics platform.

The application of our methodology to data from different laboratories revealed significant and consistent biases in large scale LC-MS experiments despite careful experimental design. The biases detected in this study are caused by experimental protocol elements such as HPLC washing, sample freeze/thaw cycles, run order, and date. The fact that these are standard elements of any sample protocol suggests that these biases may be occurring in many proteomics experiments today. The systematic exploration of these biases may thus be an important first step on the way to the design of unbiased LC-MS proteomics experiments.

Chaorder takes as input a list of experiments and computes the bounded alignment score between each pair of experiments (9). It then represents each experiment as a point in two dimensions such that the Euclidean distance between a pair of points approximates the inverse of the alignment score between the corresponding two experiments (details are described under “Workflow”). A set of experiments that are expected to be similar (e.g. almost perfect technical repeat experiments) corresponds to high pairwise alignment scores, and the corresponding points tend to appear in close proximity in the two-dimensional image. Conversely, distant points correspond to dissimilar experiments.

In our experience, approximate ranges of pairwise alignment scores (Euclidean distances) are empirically correlated with different qualitative classes of similarity. A distance between two points close to 0 represents ideal levels of similarity. A distance between 0 and 0.2 represents achievable (and tolerable, depending on the experimental setup) levels of reproducibility usually seen in repeat experiments with the exact same experimental setup. A distance between 0.2 and 0.5 represents biases in the experimental setup if the experiments were expected to be similar (e.g. technical replicates); the experimental setup could then be studied further for more in-depth analysis. Distances greater than 0.5 usually represent similarities caused by the same sample being analyzed under very different perturbations, on different platforms, etc. Distances greater than 0.7–0.8 represent unrelated experiments, e.g. comparisons between yeast cell lysate and human serum.
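These empirical bands lend themselves to a simple automated triage of pairwise comparisons. The following is a minimal sketch of our own, not part of Chaorder; the function name and class labels are hypothetical, and the cutoffs merely restate the approximate ranges above.

```python
def classify_distance(d: float) -> str:
    """Map a pairwise distance from the two-dimensional Chaorder image
    to the qualitative similarity classes described in the text.
    The boundaries are approximate empirical guides, not hard rules."""
    if d <= 0.2:
        # Near 0 is ideal; up to ~0.2 is typical of technical repeats
        # under the exact same experimental setup.
        return "reproducible"
    if d <= 0.5:
        # Suggests bias in the setup if the runs were expected to match
        # (e.g. technical replicates); worth further investigation.
        return "biased"
    if d <= 0.7:
        # Same sample under very different perturbations, platforms, etc.
        return "strongly perturbed"
    # Roughly 0.7-0.8 and beyond: unrelated experiments,
    # e.g. yeast cell lysate vs. human serum.
    return "unrelated"
```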

¹ Taken from the book Birth of the Chaordic Age by D. W. Hock and Visa International (18).


FIG. 1. Chaorder plot for instrument reproducibility analysis. Each square represents an LC-MS experiment generated from the same sample. Blue squares represent experiments run on a QSTAR instrument, and red squares represent experiments run on a Q-TOF instrument. Data labels represent the run order.

FIG. 2. Chaorder plot for repeats of human serum on a time-of-flight instrument. Each square represents an LC-MS experiment. The key for each experiment is written in the square. Data labels represent the run order.

Beyond interpretations of individual pairwise distances, the two-dimensional image can be used to understand systematic effects that may occur over time due to possible sample degradation or changing experimental conditions (presented in detail below).

We applied Chaorder to a variety of data sets, ranging from simple quality control studies through simulated biomarker experiments to real biomarker studies. Most of these experiments were LC-MS runs with no MS/MS acquisition. Detailed experimental protocols for these data sets are provided in the Supplemental Appendix.

Instrument Comparison for Reproducibility: Instrument Variability and Run Order Effects—Ten repeat LC-MS experiments using human serum were performed on two instruments, a Q-TOF and a QSTAR, each with an electrospray source. Both instruments are based on the time-of-flight principle but come from different manufacturers. Fig. 1 shows the Chaorder image of this data set. Each point represents a complete LC-MS run of the unfractionated human serum. The colors correspond to the different instruments (QSTAR = blue, Q-TOF = red), and the data labels represent the order in which the experiments were run. A number of observations can be made. First, the data from the two instruments form two distinct clusters. Second, the QSTAR instrument shows much more variation than the Q-TOF instrument. Third, run order effects are revealed: successive experiments tend to move in one direction. Sample carryover is one possible explanation for this effect. These data also illustrate the problems the community faces given the different instrument choices available. There is little hope that results generated on the QSTAR instrument can be reproduced on a Q-TOF instrument, and comparing data between the two platforms would present a significant challenge.

Human Serum Repeats: Sample Preparation Effects—Periodic experiments were performed using human serum on a time-of-flight (LCT Premier) instrument. Between experiments, the sample went through freeze/thaw cycles, and immediately before each LC-MS experiment, an aliquot was digested using trypsin. Fig. 2 shows the Chaorder image for this data set. Again the image reveals strong run order effects; the run order can be inferred from the plot without any prior information. Our own manual follow-up analysis revealed that, globally, the total ion intensity was reduced over this series of experiments. Many peptides maintain their intensity levels, but the intensities of many others are significantly reduced. This suggests that the sample is degrading over time. Another possibility is a systematic change in digestion efficiency. Overall, Fig. 2 represents an example where sample preparation steps appear to introduce biases that are not well characterized.

Repeated Injections of Angiotensin II: Column Variability and Run Order Effects—Experiments with Angiotensin II injections were performed periodically to study the integrity, reproducibility, and sensitivity of the LC column used for a yeast genetics study. These injections were performed between the yeast experiments on an FT-ICR instrument.
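Run order effects like those visible in Figs. 1 and 2 (and, below, Fig. 3) can also be screened for numerically. One simple check, our own suggestion rather than a Chaorder feature, is to test whether each run's position in the two-dimensional embedding drifts monotonically with its run index:

```python
import numpy as np
from scipy.stats import spearmanr

def run_order_drift(coords):
    """Spearman rank correlation between run index and each run's
    distance from the first run in the 2-D embedding (rows of `coords`
    are assumed to be in run order). Values near +1 suggest systematic
    drift with run order (carryover, column or sample degradation);
    values near 0 suggest no trend."""
    coords = np.asarray(coords, dtype=float)
    run_index = np.arange(len(coords))
    dist_from_first = np.linalg.norm(coords - coords[0], axis=1)
    rho, _p = spearmanr(run_index, dist_from_first)
    return rho
```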


FIG. 3. Chaorder plot for Angiotensin II analyzed by LC-MS. Each data point represents an LC-MS experiment. The two geometric shapes, squares and triangles, represent two different C18 HPLC columns used. Colors represent differences in data acquisition date: light blue, April 22, 2005; dark blue, April 25, 2005; pink, May 6, 2005; red, May 9, 2005; yellow, June 13, 2005; green, June 23, 2005. Each data point is numbered in sequence from the first (1) to last (36) date of acquisition.

Because the LC column degrades over time, the column was replaced with a new one after 212 injections. For each experiment, the raw LC-MS data produced were recorded. Fig. 3 shows the Chaorder image for this data set. Each data point represents one Angiotensin II LC-MS experiment. Squares and triangles correspond to the two different LC columns, colors represent the different dates on which experiments were run, and the data label corresponds to the run order. Informed by nothing but the raw data, Fig. 3 reveals a clear split between the two columns. Surprisingly, the variation between colors (dates) appears to be even larger. Beyond this, the image reveals a strong association by date, and within each date, strong run order effects. Upon further examination, we found that although the sample contained only Angiotensin II, there was significant carryover: many non-Angiotensin features resembling peptide peaks were observed in the output. Feature identification using msInspect (35) resulted in the detection of more than 100 features. Because these experiments were performed between the yeast experiments, the carryover is expected, but the strong association of these experiments by run order and date reveals potential problems with the stability of the analytical platform.

Simulated Biomarker Hand-mixed Data: Day-to-day Variations—In experiments designed to emulate a test case for biomarker discovery, two protein samples were prepared.


FIG. 4. Chaorder plot for four- to five-protein hand-mixed data on TOF. Each square represents an LC-MS experiment. Blue squares represent four-protein samples, and red squares represent five-protein samples. Data labels represent the dates on which experiments were run.

One “control sample” consisted of a four-protein digest; in the second, “disease” sample, β-lactoglobulin was spiked in, in addition to the four digested proteins, as a simulated “biomarker.” Multiple LC-MS experiments of the two samples were performed on a TOF instrument. Despite the low number of proteins, the resulting LC-MS maps tend to be very complex. Possible reasons are that the peptides exhibit multiple charge and isotopic states, the proteins are not absolutely pure, the tryptic digestion is imperfect (leading to missed cleavages and miscleaved peptides), and the sample was run over a short 30-min gradient. All these issues arise in any proteomics setup and significantly increase the complexity of the output. This held for this mixture as well: thousands of peptide-like features were observed in the experiments (without deconvoluting isotopic patterns), whereas the tryptic digestion theoretically yields only on the order of a hundred peptides.

Fig. 4 shows the Chaorder image for this data set. The four-protein experiments are shown in blue, the five-protein experiments in red, and the data label corresponds to the date on which the experiments were run. A first observation is that, even after randomizing the run order, the experiments cluster by the dates they were run on, especially on the 19th and 21st. Second, the four- and five-protein data seem to be distinguishable, but the clustering of four-protein data sets versus five-protein data sets is somewhat weak; this could make it difficult to differentially identify the spiked-in fifth protein. This test experiment reveals a fundamental problem for biomarker discovery caused by low reproducibility: any actual biomarkers may remain hidden behind all the less interesting experimental variations.


FIG. 5. Chaorder plot for LC/LC-MS yeast cell lysate cell cycle time series study on the LCQ Deca XP. Each square represents an LC-MS experiment on an SCX fraction. Blue squares represent time point 0, and red squares represent time point 30 minutes. Data labels represent the number of the SCX fraction. Multiple squares with the same label represent repeat experiments.

Time Course Study for Yeast Cell Cycle: Freeze/Thaw Cycle Effects—Our next analysis concerns two LC/LC-MS/MS measurements of tryptic digests of a whole-cell lysate of the yeast Saccharomyces cerevisiae (from Ref. 36). The two samples were collected from cells synchronized in the G1 phase of the cell cycle and at 30 min following release. Both protein samples were digested into peptides using trypsin and then separated by strong cation exchange (SCX) chromatography into a number of fractions. For each time point, each fraction was split into two (or three) parts: one part was analyzed immediately, and the second and third parts were analyzed after multiple freeze/thaw cycles using RPLC-ESI-MS (ThermoFinnigan LCQ Deca XP). Fig. 5 shows the Chaorder plot for time points 0 min (blue) and 30 min (red) of this data set. The data labels correspond to the SCX fraction, and multiple data points with the same label are the repeat experiments of that particular SCX fraction after additional freeze/thaw cycles. A first observation is that the horizontal axis of the image approximately reflects the different time points and the vertical axis reflects the SCX chromatography. This is remarkable because Chaorder generated the plot without any additional prior knowledge. As one would expect, successive SCX fractions differ significantly but also share a set of common proteins, which then occur in related RPLC fractions. Fig. 5 captures this similarity of certain RPLC fractions as well.

FIG. 6. Chaorder plot for Huntington disease study in mouse performed on a Q-TOF instrument for 3-month-old mice. Each square represents an LC/MS experiment. Different colors represent different mice. Shape represents the biological condition: squares are for wild type, and triangles are for heterozygous/homozygous. Data labels represent the run order. Multiple data points with the same color and shape represent repeat experiments.

Another interesting point to note is that the repeat experiments do not cluster as tightly as one might expect. Our detailed analysis revealed many unexpected artifacts presumably generated by the freeze/thaw cycles, e.g. changes in noise levels, changes in the intensity levels of many peptides, MS/MS undersampling, etc.

Mouse Models for Huntington Disease: Run Order Effects—Serum from a mouse model of Huntington disease (homozygous and heterozygous) and from normal mice was analyzed on a Q-TOF mass spectrometer in LC-MS mode. Experiments were performed for mice at the ages of 3, 6, 9, and 12 months. In the mouse model, Huntington disease is known to result in symptoms at 12 months, and the aim of this study was the identification of biomarkers in younger mice. Many of the mice were littermates. Each experiment was performed in triplicate, i.e. the serum collected from a mouse was divided into three aliquots, and each was digested and analyzed separately.

Fig. 6 shows the Chaorder plot for the 3-month data. This data set has three wild type mice (squares) and five mice in the homozygous/heterozygous class (triangles). Triplicate experiments for a single mouse are shown in a single color. The data label represents the run order. A first observation to be made from the plot is that the triplicate experiments are not as tightly clustered as one could have expected. Because there are multiple sources of variation (mouse-to-mouse differences, homozygous/heterozygous/wild type status, replicates, etc.), one does not expect to see a very simple cluster structure, but one might hope to see at least that mice in one disease state would cluster together.


Looking at Fig. 6, one can see that this is not the case. Instead an added run order effect creates a more complex pattern of associations. For example, some experiments close in run order (0-1-2-3, 5-6-7, 14-15, etc.) lie close together. Note that Chaorder identified these clusters without any added prior knowledge. The clustering suggests that run order creates artifacts in the data that reduce the statistical power of 1) clustering replicates and 2) distinguishing wild type from disease type. In particular, the 3-month data have to be treated with caution. Similar effects are suggested in the 6- and 9-month data, although the effects of run order are much weaker.

WORKFLOW

LC-MS experiments and their peptide signals can vary in characteristics such as signal intensity, elution profile, elution time, mass resolution, signal-to-noise level, etc. A global measure of similarity between experiments has to take all of these variations into account. Certain global variations, such as different amounts of sample loaded between experiments, are expected and can be compensated for computationally by normalization of signal intensities. Other examples are different gradients that lead to varying retention times (37) and the different resolutions of different mass analyzers. The similarity measure presented here aims to ignore these unproblematic technical variations but to capture all remaining technical and biological variation.

Prakash et al. (9) presented the ChAMS² method to align LC-MS experiments based on raw MS1 signals following the principle described above. The alignment algorithm is based on a score that measures the similarity between pairs of mass spectra. On this basis, ChAMS produces a mapping between related spectra, i.e. the spectra that contain peaks generated by the same peptide. The alignment score is the average spectrum similarity score over all spectra paired by the alignment, and the alignment is entirely based on the MS1 spectra. The algorithm is capable of handling data from different mass analyzers (e.g. FT, TOF, LTQ, etc.) by tuning the mass resolution parameter (called ε in Ref. 9). The Supplemental Appendix describes the alignment algorithm in more detail.

Given a list containing N LC-MS experiments (possibly from different instruments), Chaorder computes the alignment score between every pair of experiments. If A and B are two LC-MS experiments, their distance d(A,B) (9) is

d(A,B) = 1 − sc(A,B) / √(sc(A,A) × sc(B,B))    (Eq. 1)
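As a concrete illustration, the sketch below implements Eq. 1 and the two-dimensional embedding described next. It is a minimal sketch under stated assumptions: `score` is a placeholder for the ChAMS alignment score sc of Ref. 9 (computed as described there and in the Supplemental Appendix), the function names are ours rather than Chaorder's actual interface, and the embedding uses off-the-shelf metric multidimensional scaling.

```python
import numpy as np
from sklearn.manifold import MDS

def distance_matrix(score, experiments):
    """Assemble the N x N matrix of Eq. 1 distances from a pairwise
    alignment-score function score(A, B) (assumed symmetric and positive)."""
    n = len(experiments)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = experiments[i], experiments[j]
            d = 1.0 - score(a, b) / np.sqrt(score(a, a) * score(b, b))
            D[i, j] = D[j, i] = d
    return D

def embed_2d(D, seed=0):
    """Embed the experiments in two dimensions so that Euclidean
    distances approximate the pairwise distances (metric MDS, Ref. 38).
    The axes carry no particular meaning; only relative distances do."""
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(D)  # N x 2 coordinates, one row per experiment
```

Plotting the rows of the returned coordinates, labeled by run order or date, yields images analogous to Figs. 1–6.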

This results in an N × N matrix of distances. Multidimensional scaling (38) then identifies each experiment with a point in two dimensions such that the distances between any two experiments A and B approximately represent the pairwise distances d(A,B). Such an embedding in two dimensions necessitates a certain distortion of some of the distances d(A,B), but the present study suggests that the global embedding in two dimensions still reveals major global effects. Furthermore, as the embedding can be rotated without changing any of the embedding distances, the axes do not have a particular significance; the embedding just illustrates the relative distance relations between all experiments. The Supplemental Appendix describes multidimensional scaling in more detail. Chaorder provides other views of the data as well; e.g. a pair of experiments can be analyzed for their differences and similarities using ChAMS (9).

² The abbreviations used are: ChAMS, chromatography aligner using mass spectra; SCX, strong cation exchange; RPLC, reversed-phase LC.

Most of the analysis under “Results” could also have been obtained by manual analysis of the data. In fact, many of our conclusions about run order effects, etc. were validated by manual analysis. However, just as with the manual analysis of tandem mass spectra, manual analysis alone is not feasible for current, let alone expected future, large scale experimentation. Chaorder performed all of the above analyses in a matter of minutes or hours, depending on the size of the data, on a single Linux desktop computer. We are not aware of any other method that can perform the above similarity analysis with such high efficiency and quality. The software Chaorder is available on request.

DISCUSSION

The assessment of reproducibility and similarity between LC-MS experiments is the first step toward bias-free and statistically powerful experimentation that can form the basis of biomarker discovery. High reproducibility is an essential prerequisite for any analytical approach that aims to address this problem. Careful design features, such as randomized run order, may help minimize artifacts but do not eliminate them. This is evident in the simulated biomarker experiment, where the relatively strong variation introduced by the emulated biomarker was offset by other types of variation. The fact that such strong effects compound the data analysis of a controlled and relatively simple analytical scenario suggests that variation presents even greater challenges in many real world applications.

The other case studies illustrate the need to systematically address the biases of each step of an experimental setup, for example freeze/thaw cycles, number of washes, chemicals used, trypsin digestion efficiency, instrument choices, column choices, the day the experiments were run, the scientist performing each step, etc. Judicious decisions are required about each of these to achieve the highest levels of reproducibility at each step, thus yielding an experiment design that allows statistically valid conclusions about the underlying biological phenomena.


To this end, Chaorder is a robust and efficient tool and, to our knowledge, the first to assess global LC-MS experimental reproducibility and similarity. We present results from various studies completed in a number of different laboratories that show experimental reproducibility being significantly affected by sample processing steps, experimental protocol, and instrument choices. The low degree of reproducibility indicated by Chaorder in all case studies suggests a widespread need for quality control experiments and the use of quality assessment tools before any kind of comparative data analysis (e.g. by MS or MS² with Sequest (39)).

Chaorder can also be used to identify outlier experiments, which can either be analyzed manually or be removed from downstream analysis if feasible. When deciding between different types of columns, Chaorder allows an assessment of their reproducibility. By measuring the experimental quality of the wash runs, Chaorder can suggest how many wash runs lead to a clean column. Chaorder can help tune the experimental setup, e.g. indicate whether a longer column is required for higher reproducibility or whether the column is degraded beyond tolerable limits. It can also help tune the sample processing steps, e.g. the choice of parameters for freeze/thaw cycles, the choice of chemicals, etc.

The problems uncovered here may be tolerable for studies in which qualitative aspects matter most but appear critical for rendering mass spectrometry useful as a quantitative survey and discovery tool. Because we used data from multiple mass spectrometry laboratories, these issues do not appear to be unique to any one of them; instead they seem to represent general challenges facing the mass spectrometry community. We found that one of the strongest parameters affecting experimental reproducibility is the order in which experiments are run. There are multiple possible causes, e.g. changes in instrument calibration over time, sample degradation, LC column degradation, etc. Because results from different laboratories show similar characteristics, it is likely that a combination of several causes underlies the observed biases. Significant effort will need to be focused in this direction to understand and eliminate these biases and thereby strengthen downstream analysis.

Beyond the potential to improve reproducibility across different instruments and sample processing protocols within a single laboratory, Chaorder is also a first step toward the comparison and standardization of experimental platforms and conditions across laboratories, another important step toward making mass spectrometry more reliable, trustworthy, and relevant for biomedical research.

Acknowledgments—We thank Leo Bonilla, Jimmy Eng, Jennifer Sutton, and Sébastien Li-Thiao-Té for helpful discussions.

* This work was supported in part by the National Institutes of Health through the University of Washington NIEHS-sponsored Center for Ecogenetics and Environmental Health (NIEHS, National Institutes of Health Grant P30ES07033) and the Northwest Regional Center of Excellence for Biodefense and Emerging Infectious Diseases (Grant 1U54 AI57141-01). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

□S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.

‖ To whom correspondence and requests for the software Chaorder should be addressed. Fax: 206-543-2969; E-mail: [email protected].

REFERENCES

1. Aebersold, R., and Goodlett, D. R. (2001) Mass spectrometry in proteomics. Chem. Rev. 101, 269–295
2. Aebersold, R., and Mann, M. (2003) Mass spectrometry-based proteomics. Nature 422, 198–207
3. Peng, J., Elias, J. E., Thoreen, C. C., Licklider, L. J., and Gygi, S. P. (2003) Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2, 43–50
4. Tyers, M., and Mann, M. (2003) From genomics to proteomics. Nature 422, 193–197
5. Ideker, T., Galitski, T., and Hood, L. (2001) A new approach to decoding life: systems biology. Annu. Rev. Genomics Hum. Genet. 2, 343–372
6. Desiere, F., Deutsch, E. W., Nesvizhskii, A. I., Mallick, P., King, N. I., Eng, J. K., Aderem, A., Boyle, R., Brunner, E., Donohoe, S., Fausto, N., Hafen, E., Hood, L., Katze, M. G., Kennedy, K. A., Kregenow, F., Lee, H., Lin, B., Martin, D., Ranish, J. A., Rawlings, D. J., Samelson, L. E., Shiio, Y., Watts, J. D., Wollscheid, B., Wright, M. E., Yan, W., Yang, L., Yi, E. C., Zhang, H., and Aebersold, R. (2005) Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol. 6, R9
7. Smith, R. D., Anderson, G. A., Lipton, M. S., Pasa-Tolic, L., Shen, Y., Conrads, T. P., Veenstra, T. D., and Udseth, H. R. (2002) An accurate mass tag strategy for quantitative and high throughput proteome measurements. Proteomics 2, 513–523
8. Beer, I., Barnea, E., Ziv, T., and Admon, A. (2004) Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics 4, 950–960
9. Prakash, A., Mallick, P., Whiteaker, J., Zhang, H., Paulovich, A., Flory, M., Lee, H., Aebersold, R., and Schwikowski, B. (2006) Signal maps for mass spectrometry-based comparative proteomics. Mol. Cell. Proteomics 5, 423–432
10. Listgarten, J., and Emili, A. (2005) Statistical and computational methods for comparative proteome profiling using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 4, 419–434
11. Radulovic, D., Jelveh, S., Ryu, S., Guy Hamilton, T., Foss, E., Mao, Y., and Emili, A. (2004) Informatics platform for global proteomic profiling and biomarker discovery using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 10, 984–997
12. Wang, W., Zhou, H., Lin, H., Roy, S., Shaler, T. A., Hill, L. R., Norton, S., Kumar, P., Anderle, M., and Becker, C. H. (2003) Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal. Chem. 75, 4818–4826
13. Bylund, D., Danielsson, R., Malmquist, G., and Markides, K. E. (2002) Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography mass spectrometry data. J. Chromatogr. A 961, 237–244
14. Listgarten, J., Neal, R. M., Roweis, S. T., and Emili, A. (2005) Multiple alignment of continuous time series, in Advances in Neural Information Processing Systems (Saul, L. K., Weiss, Y., and Bottou, L., eds) pp. 817–824, The MIT Press, Cambridge, MA
15. Yasui, Y., McLerran, D., Adam, B., Winget, M., Thornquist, M., and Feng, Z. (2003) An automated peak identification/calibration procedure for high-dimensional protein measures from mass spectrometers. J. Biomed. Biotechnol. 242–248
16. Coombes, K., Fritsche, H. A., Jr., Clarke, C., Chen, J., Baggerly, K. A., Morris, J. S., Xiao, L., Hung, M., and Kuerer, H. M. (2003) Quality control and peak finding for proteomics data collected from nipple aspirate fluid by surface-enhanced laser desorption and ionization. Clin. Chem. 49, 1615–1623
17. Qu, Y., Adam, B., Thornquist, M., Potter, J. D., Thompson, M. L., Yasui, Y., Davis, J., Schellhammer, P. F., Cazares, L., Clements, M., Wright, G. L., Jr., and Feng, Z. (2003) Data reduction using a discrete wavelet transform in discriminant analysis of very high dimensionality data. Biometrics 59, 143–151


18. Hock, D. W., and Visa International (2000) Birth of the Chaordic Age, Berrett-Koehler Publishers, San Francisco
19. Ransohoff, D. (2005) Bias as a threat to the validity of cancer molecular-marker research. Nature 5, 142–149
20. Hu, J., Coombes, K. R., Morris, J. S., and Baggerly, K. A. (2005) The importance of experimental design in proteomic mass spectrometry experiments: some cautionary tales. Brief. Funct. Genomics Proteomics 3, 322–331
21. Sorace, J. M., and Zhan, M. (2003) A data review and re-assessment of ovarian cancer serum proteomic profiling. BMC Bioinformatics 4, 24
22. Baggerly, K. A., Morris, J. S., and Coombes, K. R. (2004) Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics 20, 777–785
23. Ransohoff, D. (2004) Rules of evidence for cancer molecular-marker discovery and validation. Nat. Rev. Cancer 4, 309–314
24. Baggerly, K. A., Morris, J. S., Edmonson, S. R., and Coombes, K. R. (2004) Signal in noise: can experimental bias explain some results of serum proteomics tests for ovarian cancer? M. D. Anderson Biostatistics Technical Report UTMDABTR-008-04, The University of Texas M. D. Anderson Cancer Center, Houston, TX
25. Jaffe, J. D., Mani, D. R., Leptos, K. C., Church, G. M., Gillette, M. A., and Carr, S. A. (2006) PEPPeR, a platform for experimental proteomic pattern recognition. Mol. Cell. Proteomics 5, 1927–1941
26. ABRF 2006: Integrating science, tools and technology with systems biology. (2006) J. Biomol. Tech. 17, 1–89
27. Rai, A. J., Gelfand, C. A., Haywood, B. C., Warunek, D. J., Yi, J., Schuchard, M. D., Mehigh, R. J., Cockrill, S. L., Scott, G. B., Tammen, H., Schulz-Knappe, P., Speicher, D. W., Vitzthum, F., Haab, B. B., Siest, G., and Chan, D. W. (2005) HUPO Plasma Proteome Project specimen collection and handling: towards the standardization of parameters for plasma proteome samples. Proteomics 5, 3262–3277
28. Elias, J. E., Haas, W., Faherty, B. K., and Gygi, S. P. (2005) Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat. Methods 2, 667–675


29. Molloy, M. P., Donohoe, S., Brzezinski, E. E., Kilby, G. W., Stevenson, T. I., Baker, J. D., Goodlett, D. R., and Gage, D. A. (2005) Large-scale evaluation of quantitative reproducibility and proteome coverage using acid cleavable isotope coded affinity tag mass spectrometry for proteomic profiling. Proteomics 5, 1204–1208
30. Silva, J., Gorenstein, M., Li, G., Vissers, J. P. C., and Geromanos, S. (2006) Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition. Mol. Cell. Proteomics 5, 144–156
31. Ishihama, Y., Sato, T., Tabata, T., Miyamoto, N., Sagane, K., Nagasu, T., and Oda, Y. (2005) Quantitative mouse brain proteomics using culture-derived isotope tags as internal standards. Nat. Biotechnol. 23, 617–621
32. Anderson, N. L., and Anderson, N. G. (2002) The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell. Proteomics 1, 845–867
33. Karlin, S., and Altschul, S. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U. S. A. 90, 5873–5877
34. Ewing, B., and Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194
35. Bellew, M., Coram, M., Fitzgibbon, M., Igra, M., Randolph, T., Wang, P., May, D., Eng, J., Fang, R., Lin, C., Chen, J., Goodlett, D. R., Whiteaker, J., Paulovich, A., and McIntosh, M. (2006) A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics 22, 1902–1909
36. Flory, M. R., Lee, H., Bonneau, R., Mallick, P., Serikawa, K., Morris, D. R., and Aebersold, R. (2006) Quantitative proteomic analysis of the budding yeast cell cycle using acid-cleavable isotope-coded affinity tag reagents. Proteomics 6, 6146–6157
37. Snyder, L. R., Kirkland, J. J., and Glajch, J. L. (1997) Practical HPLC Method Development, 2nd Ed., Wiley Interscience, Hoboken, NJ
38. Abdi, H. (2007) Metric multidimensional scaling, in Encyclopedia of Measurement and Statistics (Salkind, N. J., ed) pp. 598–605, Sage, Thousand Oaks, CA
39. Eng, J., McCormack, A., and Yates, J. (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989
