Angela D. Norbeck, Heather M. Mottaz, Alan R. Dabney, Mary S ... Richard D. Smith ... SpRank d. M. 4.015 0.325. 1. 0.14. ARHGGEDYVFSLLTGYCEPPTGVSLR.
A Statistical Approach to Quantifying Uncertainties in the Accurate Mass and Time (AMT) tag Pipeline Navdeep Jaitly, Joshua Adkins, Matthew E. Monroe, Angela D. Norbeck, Heather M. Mottaz, Alan R. Dabney, Mary S Lipton, Gordon A. Anderson, Richard D. Smith
Outline z
z z z z
Analytical Pipelines using Liquid Chromatography coupled to High Resolution Mass spectrometry Experiments (the Accurate Mass and Time Tag approach) Need for a statistical foundation to characterize identifications obtained under such a paradigm Summarize a statistical method (Peptide Prophet) which tackles this problem for MS/MS based analysis pipelines Our approach to characterizing confidence in identifications from AMT tag pipeline Some results
LC-MS based Proteomics Analysis Pipelines z
Availability of High mass accuraccy instruments has led to novel identification methods using LC-MS experiments z
AMT, PePPer, MSInspect, SuperHirn, CRAWDAD z Liquid Chromatography retention time used in identification z z
z
Generally higher sensitivity than MS/MS analyses; better for broad comparative proteomics studies Lower specificity – matching is performed using mass and LC elution time alone, not multidimensional fragmentation patterns. An approach is required to estimate probability that individual identifications are correct to help separate good and bad results at user-specified false discovery rates
Accurate Mass and Time (AMT) tag pipeline z
Features in an LC-MS dataset are matched to a database of peptides previously identified by LC-MS/MS analyses using specified mass and elution time tolerances AMT tag Database
Alignment & Matching
Monoisotopic mass
Monoisotopic mass
LC-MS dataset
Normalized elution time (NET)
scan # (monoisotopic mass, elution time, abundance)
(Peptide Sequence, monoisotopic mass, elution time)
|Δmass| < mass tolerance |ΔNET| < elution time tolerance
Current Method for Assessing and Controlling Rate of Random Matches z
Decoy database matching used to assess rate of randon matching z
z
z
Shift database (or features) by some mass, and re-match to database – the numbers of matches reflects the random rate of errors
Rate of error controlled by z
Reducing mass and elution time tolerances
z
Building more stringent LC-MS/MS databases (higher XCorr, hyperscore, Mascot score, Peptide Prophet score, etc)
Overall method consists of iteratively controlling and assessing error to choose “optimal” parameters
Challenges with current method z
z
Balancing false positives and false negatives is a tricky game! z
Building more confident LC-MS/MS database decreases background false positives, but, increases false negatives
z
Reducing Mass and Elution time tolerances has a similar effect
z
Manually chosen parameters may look suitable, but in reality be sub-optimal
Each identification is either accepted or rejected z
In reality some identifications are better than others – higher MS/MS confidence, lower mass and elution time differences, etc
Statistical Method for MS/MS Identifications z
Peptide Prophet – A Statistical Model to estimate probability that an MS/MS spectrum is correctly identified z
Uses a Linear function (F-Score) of result metrics from SEQUEST to calculate an overall value representing the confidence of identifications
ARHGGEDYVFSLLTGYCEPPTGVSLR
c1
c2
c3
c4
XCorr
DelCN
SpRank
dM
4.015
0.325
1
0.14
Parameters Used
s
F ( x 1 , x 2 ,.., x s ) = c 0 + ∑ c i x i i =1
Keller, A., Nesvizhskii, A. I., Kolker, E., Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem, 2002, 74, 20, pg. 5383-92
Peptide Prophet F-Score Distributions
All Hits Correct Hits Incorrect Hits
Incorrect Matches
Correct Matches
0.0 0.1 0.2 0.3
z
Overall F-Score distribution is bimodal, with distinct distributions for correct and incorrect matches Probability that an identification is correct can be computed from relative probabilities of coming from correct or incorrect distribution
relative frequency
z
-2
0
2
4 F Score
6
8
Metrics Associated with a Candidate Identification from AMT Tag Pipeline z
Each match between an LC-MS feature and a peptide AMT tag is described by a mass error and an LC NET error, and metrics about the AMT Tag, related to the MS/MS searches from which is was identified
Mass
Scan
Aligned NET
Peptide
NET
Mass
Discriminant Score (Peptide Prophet)
2228.114
1097
0.218
TETQEKNPLPSKETIEQEK
0.220
2228.117
3.1
Δ mass = -1.35 ppm Δ NET = -0.002
Would like to use these metrics to separate results into correct and incorrect matches
Statistical Method for Assignment of Relative Truth (SMART) Score z
A SMART score combines the mass and LC NET error of a peak match with the probability the MS/MS identification* (AMT tag) was correct, to estimate the probability that a feature was correctly assigned p(+match | δm, δnet, Fscore, peptide) = p(δm, δnet | +match) p(Fscore| + peptide) p(+ peptide ) p(δm, δnet | +match) p(Fscore| + peptide) p(+ peptide)+ p(δm, δnet | randommatch) p(Fscore| randommatch) p(randommatch)
Assumes Independence of mass error, net error and Fscore
The SMART score is a q-value for the matching process.
* Probability that a peptide in a database can be computed with existing methods such as Peptide Prophet based on properties of an MS/MS identification
Distribution of Mass and LC Normalized Elution Time (NET) differences True and False matches resulting from peak matching display different Mass and LC NET error distributions • charge 2
Centrally Distributed True Matches
• charge >2
0.08 0.06 0.04
LC NET Error
z
0.02 0 -0.02
Random Matches Distribution
-0.04 -0.06 -0.08 -25
-20
-15
-10
-5
0
5
10
Mass Error (PPM)
15
20
25
Estimating the Probability a Match is Correct z 70
The probability that a peak match is correct depends on where its mass and LC NET error value lies on the two-dimensional distribution
60
NET Error = -0.008 50
Mass Error = 1ppm
probability of correct match= h+
40
30
20
10
0 20
h0 -2 0 - 0 .1
- 0 .0 8
probabilit y of correct match =
-0 .0 6
-0 .0 4
-0 .0 2
0
0 .0 2
0 .0 4
0 .0 6
height of central part h+ = height of central part +height of random part h+ + h−
0 .0 8
52 52 + 2
Estimating the Probability a Match is Correct z 70
The probability that a peak match is correct depends on where its mass and LC NET error value lies on the two-dimensional distribution
60
NET Error = -0.07 50
Mass Error = 2 ppm
probability of correct match= 40
30
20
10
h+ ≈ 0 0 20
h0 -2 0 - 0 .1
- 0 .0 8
probabilit y of correct match =
-0 .0 6
-0 .0 4
-0 .0 2
0
0 .0 2
0 .0 4
0 .0 6
height of central part h+ = height of central part +height of random part h+ + h−
0 .0 8
0 0+2
Data Model and Model Fitting z
What distributions describe the observation vectors appropriately for positive and negative matches and can be used in the Bayes Formula ?
p(+match | δm, δnet, Fscore, peptide) = p(δm, δnet | +match) p(Fscore| + peptide) p(+ peptide) p(δm, δnet | +match) p(Fscore| + peptide) p(+ peptide)+ p(δm, δnet | randommatch) p(Fscore| randommatch) p(randommatch) z
Expectation Maximization Algorithm used to find optimal parameters for the distributions
z
Data to Model: Mass Errors, NET Errors, F-Score distribution Real/Random Type
Distribution
Real
Mass Errors
ppm errors distributed normally (truncated)
Random
Mass Errors
Da Errors distributed normally (truncated)
Real
NET Errors
Distributed normally
Random
NET Errors
Uniform Background
Real
F Scores
FScores Normally distributed as a function of mass
Random
F Scores
Gamma distribution
Data Model - Example (Real and Random Distributions)
(Random Distribution)
Density
Density Mass error (ppm)
Mass error (Da) Quantiles from Data
Decoy (11 Da Shifted) Matches
Mass error (ppm) Quantiles from Data
Regular Matches
Mass error (ppm) Quantiles from model
Mass error (Da)
Mass error (Da) Quantiles from model
Mass Error Distribution for Salmonella Typhimurium protein extract sample
Data Model - Example (Real and Random Distributions)
(Random Distribution)
Density
Density
NET Error
NET Error Quantiles from Data
Decoy (11 Da Shifted) Matches
NET Error Quantiles from Data
Regular Matches
NET Error Quantiles from model
NET Error
NET Error Quantiles from model
NET Error Distribution for Salmonella Typhimurium protein extract sample
Performance Curves
500
Using different thresholds for SMART score, allows us to get results at different levels of error
300
400
Lower probability value cutoff score results in more true positive, but at the cost of false positives
100
200
Higher probability value cutoff score results in fewer false positives (~ 0)
Trade off region where reducing probability threshold score results in accelerating number of false positives
0
# of false (decoy) peptides
z
High number of true positives achieved with very few false positives!
0
100
200
300
# of matched standard peptides
QC_05_3_13Jan06_Andromeda_05-10-03_1_18_2006 Matched to database with 8000 decoy proteins.
1 2 3
SMART Pep Prophet >.99, 4ppm, 0.05 NET Pep Prophet >.99, 2ppm, 0.02 NET Pep Prophet >.999, 1ppm, 0.01 NET
0.20
1
0.15
Estimated FDR
0.25
0.30
0.35
Performance Curves – Example 2
0.10
2
0.00
0.05
3
0
5000
10000
15000
# of Matches
Performance Curve for Salmonella Typhimurium protein extract sample compared to typical criteria
Summary z
Developed a model to estimate confidence of peak matching
z
The Statistical Method for Assignment of Relative Truth (SMART) score provides a measure to prioritize acceptable matches using one number, by defining a probability score combining disparate information
z
Allows calculation of FDR for identifications and estimates the tradeoff between false negatives and false positives
z
Initial evaluation: shows good correlation with observed number of correct answers
z
Making Implementation available as free software downloadable from http://omics.pnl.gov/
Acknowledgements
z
Co-authors Samuel Purvine (for acronym) Dick Smith
z
$upport
z z
z
DOE Office of Biological and Environmental Research z NIH National Center for Research Resources z National Institute of Allergy and Infectious Diseases
Accurate Mass and Time (AMT) Tag Pipeline z
Uses database of LC-MS/MS results to identify features discovered in LC-MS analyses1 LC-MS/MS Analyses
LC-MS/MS Analysis
MS/MS Search Engine
Scan by Scan Deisotoping MS Features
(Mass, Intensity, Scan)
(SEQUEST, Mascot, X!Tandem, etc)
Peptide Identifications
Elution Time Normalization
LC Peak Definition LC-MS Features
(Representative Mass,Intensity, Scan)
Elution Time Alignment, Mass Recalibration, Peak Matching AMT Tag Identifications
(Peptide, elution time)
Peptide Identifications
(Peptide, normalized elution time)
AMT Tag Database (Peptide, Theoretical Mass normalized elution time)
(Peptide, Intensity)
1. http://ncrr.pnl.gov/training/workshops/2007HUPO/LCMSBasedProteomicsDataProcessing.pdf
2500 2000 1500 500
1000
Empirical Decoy Model
0
Estimated # of false Matches
3000
Decoy Vs Model
0
2000 4000 6000 8000 10000 12000
# of Matches