A Statistical Approach to Quantifying Uncertainties in the ... - PNNL

A Statistical Approach to Quantifying Uncertainties in the Accurate Mass and Time (AMT) Tag Pipeline

Navdeep Jaitly, Joshua Adkins, Matthew E. Monroe, Angela D. Norbeck, Heather M. Mottaz, Alan R. Dabney, Mary S. Lipton, Gordon A. Anderson, Richard D. Smith

Outline

• Analytical pipelines using liquid chromatography coupled to high-resolution mass spectrometry experiments (the Accurate Mass and Time tag approach)
• Need for a statistical foundation to characterize identifications obtained under such a paradigm
• Summary of a statistical method (Peptide Prophet) that tackles this problem for MS/MS-based analysis pipelines
• Our approach to characterizing confidence in identifications from the AMT tag pipeline
• Some results

LC-MS based Proteomics Analysis Pipelines

• Availability of high mass accuracy instruments has led to novel identification methods using LC-MS experiments
  • AMT, PePPer, MSInspect, SuperHirn, CRAWDAD
  • Liquid chromatography retention time is used in identification
• Generally higher sensitivity than MS/MS analyses; better for broad comparative proteomics studies
• Lower specificity: matching is performed using mass and LC elution time alone, not multidimensional fragmentation patterns
• An approach is required to estimate the probability that individual identifications are correct, to help separate good and bad results at user-specified false discovery rates

Accurate Mass and Time (AMT) tag pipeline

• Features in an LC-MS dataset are matched to a database of peptides previously identified by LC-MS/MS analyses, using specified mass and elution time tolerances

[Figure: alignment and matching of an LC-MS dataset against the AMT tag database. Both panels plot monoisotopic mass against elution time (scan # or normalized elution time, NET). LC-MS features are (monoisotopic mass, elution time, abundance); AMT tag database entries are (peptide sequence, monoisotopic mass, elution time).]

Match criteria: |Δmass| < mass tolerance and |ΔNET| < elution time tolerance
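The match criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual implementation: the `Feature` and `AmtTag` types, the function names, the brute-force loop, and the choice to express the mass tolerance in ppm are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    mass: float  # monoisotopic mass of the LC-MS feature (Da)
    net: float   # aligned normalized elution time


@dataclass
class AmtTag:
    peptide: str
    mass: float  # monoisotopic mass of the database peptide (Da)
    net: float   # normalized elution time stored in the database


def ppm_error(observed_mass, reference_mass):
    """Mass difference in parts per million, relative to the reference."""
    return (observed_mass - reference_mass) / reference_mass * 1e6


def match_features(features, tags, mass_tol_ppm, net_tol):
    """Return every (feature, tag, d_mass, d_net) pair inside both tolerances."""
    matches = []
    for feature in features:
        for tag in tags:
            d_mass = ppm_error(feature.mass, tag.mass)
            d_net = feature.net - tag.net
            if abs(d_mass) < mass_tol_ppm and abs(d_net) < net_tol:
                matches.append((feature, tag, d_mass, d_net))
    return matches
```

A real implementation would likely index the database by mass rather than compare every pair, but the acceptance test would be the same.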

Current Method for Assessing and Controlling Rate of Random Matches

• Decoy database matching is used to assess the rate of random matching
  • Shift the database (or features) by some mass and re-match to the database; the number of matches reflects the random rate of errors
• Rate of error is controlled by
  • Reducing mass and elution time tolerances
  • Building more stringent LC-MS/MS databases (higher XCorr, hyperscore, Mascot score, Peptide Prophet score, etc.)
• Overall method consists of iteratively controlling and assessing error to choose “optimal” parameters
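As a rough illustration of the decoy approach described above, the sketch below counts matches once with the real feature masses and once with every feature mass shifted by a fixed offset. The function names, the (mass, NET) tuple layout, the ppm tolerance, and the 11 Da default shift (taken from the decoy label used in later figures) are assumptions for this example only.

```python
def count_matches(features, tags, mass_tol_ppm, net_tol, mass_shift=0.0):
    """Count feature/tag pairs within tolerance.

    features, tags: lists of (mass, net) tuples.
    mass_shift: offset in Da added to every feature mass (0 for real matching).
    """
    count = 0
    for f_mass, f_net in features:
        shifted_mass = f_mass + mass_shift
        for t_mass, t_net in tags:
            d_ppm = (shifted_mass - t_mass) / t_mass * 1e6
            if abs(d_ppm) < mass_tol_ppm and abs(f_net - t_net) < net_tol:
                count += 1
    return count


def estimate_random_match_rate(features, tags, mass_tol_ppm, net_tol, mass_shift=11.0):
    """Compare real matches with decoy (mass-shifted) matches."""
    real = count_matches(features, tags, mass_tol_ppm, net_tol)
    decoy = count_matches(features, tags, mass_tol_ppm, net_tol, mass_shift)
    rate = decoy / real if real else 0.0
    return real, decoy, rate
```

Tightening `mass_tol_ppm` or `net_tol` and re-running this comparison is one way to picture the iterative "control and assess" loop the slide describes.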

Challenges with current method

• Balancing false positives and false negatives is a tricky game!
  • Building a more confident LC-MS/MS database decreases background false positives but increases false negatives
  • Reducing mass and elution time tolerances has a similar effect
  • Manually chosen parameters may look suitable but in reality be sub-optimal
• Each identification is either accepted or rejected
  • In reality some identifications are better than others: higher MS/MS confidence, lower mass and elution time differences, etc.

Statistical Method for MS/MS Identifications

• Peptide Prophet: a statistical model to estimate the probability that an MS/MS spectrum is correctly identified
  • Uses a linear function (F-Score) of result metrics from SEQUEST to calculate an overall value representing the confidence of identifications

Parameters used, for the example peptide ARHGGEDYVFSLLTGYCEPPTGVSLR:

Coefficient   Metric   Value
c1            XCorr    4.015
c2            DelCN    0.325
c3            SpRank   1
c4            dM       0.14

\[
F(x_1, x_2, \ldots, x_s) = c_0 + \sum_{i=1}^{s} c_i x_i
\]

Keller, A., Nesvizhskii, A. I., Kolker, E., Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74(20), 5383-5392.
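The linear F-Score can be written as a one-line function. The sketch below only illustrates the form of the formula: the slides do not give the coefficient values, so any coefficients passed to `discriminant_score` (a hypothetical name) are placeholders chosen by the caller, not Peptide Prophet's fitted values.

```python
def discriminant_score(metrics, coefficients, intercept):
    """Linear discriminant F = c0 + sum(c_i * x_i).

    metrics:      search-result metrics x_1..x_s (e.g. XCorr, DelCN, SpRank, dM)
    coefficients: weights c_1..c_s, one per metric (placeholders in this sketch)
    intercept:    constant term c_0
    """
    if len(metrics) != len(coefficients):
        raise ValueError("need exactly one coefficient per metric")
    return intercept + sum(c * x for c, x in zip(coefficients, metrics))


# Metric values from the slide, in the order XCorr, DelCN, SpRank, dM.
example_metrics = [4.015, 0.325, 1, 0.14]

# PLACEHOLDER coefficients for illustration only; not taken from the source.
placeholder_coefficients = [1.0, 1.0, -0.1, -0.1]

score = discriminant_score(example_metrics, placeholder_coefficients, intercept=0.0)
```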

Peptide Prophet F-Score Distributions

[Figure: relative frequency (0.0 to 0.3) versus F Score (-2 to 8) for all hits, correct hits, and incorrect hits, with the incorrect-match and correct-match distributions labeled.]

• Overall F-Score distribution is bimodal, with distinct distributions for correct and incorrect matches
• The probability that an identification is correct can be computed from the relative probabilities of coming from the correct or the incorrect distribution
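To make the "relative probabilities" idea concrete, here is a small sketch of a two-component mixture posterior. It assumes a normal density for correct matches and a shifted gamma density for incorrect ones (the shapes named in the data-model table later in these slides); the function names and all parameter values used with them are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def shifted_gamma_pdf(x, shape, scale, shift):
    """Gamma density moved left by `shift` so it can cover negative scores."""
    y = x - shift
    if y <= 0:
        return 0.0
    return y ** (shape - 1) * math.exp(-y / scale) / (math.gamma(shape) * scale ** shape)


def probability_correct(f_score, prior_correct, correct_params, incorrect_params):
    """Posterior probability that a score came from the 'correct' component.

    correct_params:   (mean, sd) of the normal component
    incorrect_params: (shape, scale, shift) of the gamma component
    """
    p_correct = prior_correct * normal_pdf(f_score, *correct_params)
    p_incorrect = (1.0 - prior_correct) * shifted_gamma_pdf(f_score, *incorrect_params)
    total = p_correct + p_incorrect
    return p_correct / total if total > 0 else 0.0
```

With any reasonable parameter choice, a score deep in the right-hand mode gets a probability near 1 and a score in the left-hand mode gets a probability near 0, which is the behavior the bimodal figure suggests.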

Metrics Associated with a Candidate Identification from AMT Tag Pipeline

• Each match between an LC-MS feature and a peptide AMT tag is described by a mass error and an LC NET error, plus metrics about the AMT tag related to the MS/MS searches from which it was identified

LC-MS feature:  Mass 2228.114   Scan 1097   Aligned NET 0.218
AMT tag:        Peptide TETQEKNPLPSKETIEQEK   NET 0.220   Mass 2228.117   Discriminant Score (Peptide Prophet) 3.1

Δ mass = -1.35 ppm
Δ NET = -0.002

We would like to use these metrics to separate results into correct and incorrect matches.
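The two error values follow directly from the tabulated masses and NETs. The short calculation below reproduces them, assuming the differences are taken as feature minus AMT tag and that the ppm error is expressed relative to the AMT tag mass; the variable names are illustrative.

```python
# Values from the example match on this slide.
feature_mass, feature_net = 2228.114, 0.218
tag_mass, tag_net = 2228.117, 0.220

# Assumed convention: feature minus AMT tag, ppm relative to the tag mass.
delta_mass_ppm = (feature_mass - tag_mass) / tag_mass * 1e6
delta_net = feature_net - tag_net

print(f"delta mass = {delta_mass_ppm:.2f} ppm, delta NET = {delta_net:.3f}")
# prints: delta mass = -1.35 ppm, delta NET = -0.002
```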

Statistical Method for Assignment of Relative Truth (SMART) Score

• A SMART score combines the mass and LC NET error of a peak match with the probability that the MS/MS identification* (AMT tag) was correct, to estimate the probability that a feature was correctly assigned

\[
p(+\mathrm{match} \mid \delta m, \delta\mathrm{net}, \mathrm{Fscore}, \mathrm{peptide}) =
\frac{p(\delta m, \delta\mathrm{net} \mid +\mathrm{match})\, p(\mathrm{Fscore} \mid +\mathrm{peptide})\, p(+\mathrm{peptide})}
{p(\delta m, \delta\mathrm{net} \mid +\mathrm{match})\, p(\mathrm{Fscore} \mid +\mathrm{peptide})\, p(+\mathrm{peptide}) + p(\delta m, \delta\mathrm{net} \mid \mathrm{random\ match})\, p(\mathrm{Fscore} \mid \mathrm{random\ match})\, p(\mathrm{random\ match})}
\]

• Assumes independence of mass error, NET error, and F-score
• The SMART score is a q-value for the matching process.

* The probability that a peptide in a database is correct can be computed with existing methods such as Peptide Prophet, based on properties of an MS/MS identification
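The Bayes formula above reduces to a ratio of two products once the individual densities are known. The sketch below takes those densities as plain numbers supplied by the caller; the function and argument names are hypothetical, and how each density is obtained is left to the fitted model.

```python
def smart_score(err_density_true, fscore_density_true, prior_true,
                err_density_random, fscore_density_random, prior_random):
    """Posterior probability of a correct match under the independence assumption.

    err_density_*:    p(delta_m, delta_net | +match) or p(... | random match)
    fscore_density_*: p(Fscore | +peptide) or p(Fscore | random match)
    prior_*:          p(+peptide) or p(random match)
    """
    true_term = err_density_true * fscore_density_true * prior_true
    random_term = err_density_random * fscore_density_random * prior_random
    total = true_term + random_term
    return true_term / total if total > 0 else 0.0
```

Because the three factors are simply multiplied, a match with a small mass and NET error can still receive a low score if the underlying MS/MS identification was weak, and vice versa.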

Distribution of Mass and LC Normalized Elution Time (NET) differences

• True and false matches resulting from peak matching display different mass and LC NET error distributions

[Figure: scatter plot of LC NET error (-0.08 to 0.08) versus mass error (-25 to 25 ppm), with separate markers for charge 2 and charge >2. Labeled regions: "Centrally Distributed True Matches" and "Random Matches Distribution".]

Estimating the Probability a Match is Correct

• The probability that a peak match is correct depends on where its mass and LC NET error values lie on the two-dimensional distribution

[Figure: two-dimensional histogram (height 0 to 70) over mass error (-20 to 20 ppm) and NET error (-0.1 to 0.08), marking the point with NET error = -0.008 and mass error = 1 ppm.]

\[
\text{probability of correct match} = \frac{\text{height of central part}}{\text{height of central part} + \text{height of random part}} = \frac{h_+}{h_+ + h_-} = \frac{52}{52 + 2}
\]

Estimating the Probability a Match is Correct

• The probability that a peak match is correct depends on where its mass and LC NET error values lie on the two-dimensional distribution

[Figure: two-dimensional histogram (height 0 to 70) over mass error (-20 to 20 ppm) and NET error (-0.1 to 0.08), marking the point with NET error = -0.07 and mass error = 2 ppm, where h+ ≈ 0.]

\[
\text{probability of correct match} = \frac{\text{height of central part}}{\text{height of central part} + \text{height of random part}} = \frac{h_+}{h_+ + h_-} = \frac{0}{0 + 2}
\]
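The two examples can be mimicked with a toy two-dimensional model: a central peak for true matches sitting on a flat background of random matches. Everything numeric in this sketch is an assumption made for illustration (the mixing weight, the standard deviations, the window sizes, and the use of a uniform background for both dimensions); it does not reproduce the heights read off the figures.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def central_height(d_mass_ppm, d_net, weight, mass_sd, net_sd):
    """Height of the true-match peak (h+) at a given error pair."""
    return weight * normal_pdf(d_mass_ppm, 0.0, mass_sd) * normal_pdf(d_net, 0.0, net_sd)


def random_height(weight, mass_window_ppm, net_window):
    """Height of the flat random-match background (h-), uniform over the window."""
    return (1.0 - weight) / (mass_window_ppm * net_window)


def probability_correct(d_mass_ppm, d_net, weight=0.6, mass_sd=2.0, net_sd=0.01,
                        mass_window_ppm=50.0, net_window=0.16):
    """h+ / (h+ + h-) for one candidate match; all defaults are hypothetical."""
    h_plus = central_height(d_mass_ppm, d_net, weight, mass_sd, net_sd)
    h_minus = random_height(weight, mass_window_ppm, net_window)
    return h_plus / (h_plus + h_minus)
```

Under these assumed parameters, a point near the center such as (1 ppm, -0.008 NET) scores close to 1, while a point far out in NET such as (2 ppm, -0.07 NET) scores close to 0, mirroring the contrast between the two slides.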

Data Model and Model Fitting

• What distributions describe the observation vectors appropriately for positive and negative matches and can be used in the Bayes formula?

\[
p(+\mathrm{match} \mid \delta m, \delta\mathrm{net}, \mathrm{Fscore}, \mathrm{peptide}) =
\frac{p(\delta m, \delta\mathrm{net} \mid +\mathrm{match})\, p(\mathrm{Fscore} \mid +\mathrm{peptide})\, p(+\mathrm{peptide})}
{p(\delta m, \delta\mathrm{net} \mid +\mathrm{match})\, p(\mathrm{Fscore} \mid +\mathrm{peptide})\, p(+\mathrm{peptide}) + p(\delta m, \delta\mathrm{net} \mid \mathrm{random\ match})\, p(\mathrm{Fscore} \mid \mathrm{random\ match})\, p(\mathrm{random\ match})}
\]

• Expectation Maximization algorithm used to find optimal parameters for the distributions
• Data to model: mass errors, NET errors, F-Score distribution

Real/Random   Type          Distribution
Real          Mass Errors   ppm errors distributed normally (truncated)
Random        Mass Errors   Da errors distributed normally (truncated)
Real          NET Errors    Distributed normally
Random        NET Errors    Uniform background
Real          F Scores      F-Scores normally distributed as a function of mass
Random        F Scores      Gamma distribution
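As a simplified illustration of the Expectation Maximization step, the sketch below fits only the NET-error part of the table: a normal component for real matches plus a uniform background for random ones. It is a one-dimensional toy under assumed starting values and a fixed iteration count, not the full multi-component fit described on the slide, and the function name is hypothetical.

```python
import math


def fit_normal_uniform_em(errors, lo, hi, n_iter=200):
    """EM for a mixture of Normal(mu, sigma) and Uniform(lo, hi).

    Returns (weight, mu, sigma), where weight is the estimated fraction
    of observations that belong to the normal (real-match) component.
    """
    uniform_density = 1.0 / (hi - lo)
    mu = sum(errors) / len(errors)   # assumed starting values
    sigma = (hi - lo) / 4.0
    weight = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of the normal component for each observation.
        resp = []
        for x in errors:
            z = (x - mu) / sigma
            normal = weight * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
            background = (1.0 - weight) * uniform_density
            resp.append(normal / (normal + background))
        # M-step: re-estimate parameters from the weighted observations.
        total = sum(resp)
        weight = total / len(errors)
        mu = sum(r * x for r, x in zip(resp, errors)) / total
        variance = sum(r * (x - mu) ** 2 for r, x in zip(resp, errors)) / total
        sigma = max(math.sqrt(variance), 1e-6)  # guard against collapse to zero
    return weight, mu, sigma
```

The per-observation responsibilities computed in the E-step are the same kind of quantity as the h+ / (h+ + h-) ratio from the earlier slides, which is why the fitted model can be reused directly for scoring.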

Data Model - Example (Real and Random Distributions)

[Figure: mass error distribution for Salmonella Typhimurium protein extract sample. Regular matches: density versus mass error (ppm), with a plot of quantiles from data versus quantiles from model. Decoy (11 Da shifted) matches (random distribution): density versus mass error (Da), with a plot of quantiles from data versus quantiles from model.]

Data Model - Example (Real and Random Distributions)

[Figure: NET error distribution for Salmonella Typhimurium protein extract sample. Regular matches: density versus NET error, with a plot of quantiles from data versus quantiles from model. Decoy (11 Da shifted) matches (random distribution): density versus NET error, with a plot of quantiles from data versus quantiles from model.]

Performance Curves

• Using different thresholds for the SMART score allows us to get results at different levels of error
• A lower probability cutoff results in more true positives, but at the cost of false positives
• A higher probability cutoff results in fewer false positives (~0)
• There is a trade-off region where reducing the probability threshold results in an accelerating number of false positives
• High number of true positives achieved with very few false positives!

[Figure: # of false (decoy) peptides (0 to 500) versus # of matched standard peptides (0 to 300).]

QC_05_3_13Jan06_Andromeda_05-10-03_1_18_2006 matched to database with 8000 decoy proteins.
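A curve like this can be traced by sweeping a score threshold and counting how many accepted matches are standards and how many are decoys. The sketch below assumes each match is available as a (SMART score, is_decoy) pair; that layout and the function name are illustrative choices.

```python
def performance_curve(matches, thresholds):
    """Count accepted standard and decoy matches at each score threshold.

    matches:    list of (smart_score, is_decoy) pairs
    thresholds: score cutoffs to evaluate
    Returns a list of (threshold, n_standard, n_decoy) tuples.
    """
    curve = []
    for threshold in thresholds:
        accepted = [is_decoy for score, is_decoy in matches if score >= threshold]
        n_decoy = sum(accepted)
        n_standard = len(accepted) - n_decoy
        curve.append((threshold, n_standard, n_decoy))
    return curve
```

Plotting `n_decoy` against `n_standard` for a fine grid of thresholds would give the shape described in the bullets: flat near zero at strict cutoffs, then rising quickly as the cutoff is relaxed.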

Performance Curves – Example 2

[Figure: estimated FDR (0.00 to 0.35) versus # of matches (0 to 15000). The SMART curve is compared with three fixed criteria, plotted as points 1 to 3:
1. Pep Prophet >.99, 4 ppm, 0.05 NET
2. Pep Prophet >.99, 2 ppm, 0.02 NET
3. Pep Prophet >.999, 1 ppm, 0.01 NET]

Performance curve for Salmonella Typhimurium protein extract sample compared to typical criteria
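One common way to turn per-match probabilities into an estimated FDR is to sum the probabilities of being wrong over the accepted matches. The slides do not spell out how the estimated FDR on this plot was computed, so the sketch below is an assumption about the general technique rather than a description of the authors' procedure; the function name is hypothetical.

```python
def estimated_fdr(smart_scores, threshold):
    """Estimate FDR among matches whose probability is at least `threshold`.

    Treats each score p as the probability that the match is correct, so the
    expected number of false matches is the sum of (1 - p) over accepted matches.
    Returns (n_accepted, expected_false, fdr).
    """
    accepted = [p for p in smart_scores if p >= threshold]
    if not accepted:
        return 0, 0.0, 0.0
    expected_false = sum(1.0 - p for p in accepted)
    return len(accepted), expected_false, expected_false / len(accepted)
```

Evaluating this at a range of thresholds gives pairs of (number of matches, estimated FDR) that could be plotted in the same way as the figure.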

Summary

• Developed a model to estimate confidence of peak matching
• The Statistical Method for Assignment of Relative Truth (SMART) score provides a measure to prioritize acceptable matches using one number, by defining a probability score combining disparate information
• Allows calculation of FDR for identifications and estimates the tradeoff between false negatives and false positives
• Initial evaluation shows good correlation with observed number of correct answers
• Making implementation available as free software downloadable from http://omics.pnl.gov/

Acknowledgements

• Co-authors
• Samuel Purvine (for acronym)
• Dick Smith
• $upport
  • DOE Office of Biological and Environmental Research
  • NIH National Center for Research Resources
  • National Institute of Allergy and Infectious Diseases

Accurate Mass and Time (AMT) Tag Pipeline

• Uses database of LC-MS/MS results to identify features discovered in LC-MS analyses [1]

[Figure: pipeline flow diagram with two branches.
Database branch: LC-MS/MS Analyses -> MS/MS Search Engine (SEQUEST, Mascot, X!Tandem, etc) -> Peptide Identifications (Peptide, elution time) -> Elution Time Normalization -> Peptide Identifications (Peptide, normalized elution time) -> AMT Tag Database (Peptide, Theoretical Mass, normalized elution time).
Feature branch: LC-MS/MS Analysis -> Scan by Scan Deisotoping -> MS Features (Mass, Intensity, Scan) -> LC Peak Definition -> LC-MS Features (Representative Mass, Intensity, Scan) -> Elution Time Alignment, Mass Recalibration, Peak Matching -> AMT Tag Identifications (Peptide, Intensity).]

1. http://ncrr.pnl.gov/training/workshops/2007HUPO/LCMSBasedProteomicsDataProcessing.pdf

Decoy Vs Model

[Figure: estimated # of false matches (0 to 3000) versus # of matches (0 to 12000), comparing the empirical decoy estimate with the model estimate.]
