ResqMi - a versatile algorithm and software for Resequencing Microarrays
Stephan Symons, Kirstin Weber, Michael Bonin and Kay Nieselt
GCB 2008, Dresden, September 9-12, 2008
September 12, 2008
Stephan Symons
ResqMi
Outline
1. Introduction
   - Resequencing Microarrays
2. ResqMi Calling Algorithms
   - Efficient calling algorithm
   - ReAnalyze
3. ResqMi
4. Results
   - Experiments
   - Results
5. Discussion
Introduction
Resequencing Microarrays
Resequencing Microarrays
- Hybridization-based sequencing of known nucleotide sequences
- Applications:
  - Genetic diseases → analyze the respective genes
  - Infectious diseases → analyze characteristic sequences
- Major producers:
  - Affymetrix: GeneChip Sequence Analysis Platform
  - NimbleGen: CGS Platform
- Oligonucleotide arrays, probe length 25 (Affymetrix)
  - The base at position 13 interrogates the current position
  - 4 × 2 probes for each position in the target sequence: one for each possible base, as sense and antisense probes
Analysis Workflow
Workflow focusing on Affymetrix arrays (CEL and CHP files):
Hybridized array → Scanning → Image analysis → CEL file (spot intensities) → Base calling → CHP file (called sequence)
Crucial step: base calling
- Task: calculate the sequence from the intensities
- Naïve: call the base with the highest intensity
- The naïve algorithm makes unreliable calls on poor data.
[Figure: intensities of the A, C, G and T probes on the sense and antisense strands; called sequence ...AC...]
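The naïve rule can be sketched in a few lines. This is only an illustration, assuming a hypothetical layout in which every position of one strand carries four intensities ordered a, c, g, t; the function names and example values are not taken from ResqMi.

```python
# Hypothetical layout: one intensity per candidate base, ordered a, c, g, t.
BASES = "acgt"


def naive_call(intensities):
    """Return the base whose probe has the highest intensity."""
    if len(intensities) != len(BASES):
        raise ValueError("expected one intensity per base (a, c, g, t)")
    best = max(range(len(BASES)), key=lambda k: intensities[k])
    return BASES[best]


def naive_calls(strand):
    """Apply the naive rule to every position of one strand."""
    return "".join(naive_call(row) for row in strand)


# Illustrative values only: the first position is clear, the second is not.
example = [
    (120.0, 900.0, 80.0, 60.0),   # c clearly brightest
    (400.0, 390.0, 50.0, 45.0),   # a and c nearly equal, still called a
]
calls = naive_calls(example)
```

The second position shows why the naïve rule alone is unreliable: a difference of a few percent between the two brightest probes still produces a confident-looking call.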
Calling Algorithms
- ABACUS: Adaptive Background Genotype Calling Scheme [Cutler et al. 2001]
  - Data integrity checks to filter out sites of poor quality
  - Likelihood-based method
- Model-P [Zhan et al. 2005]
  - Employs a physical model based on the target sequence
- All calling algorithms perform quite well.
- But: no-call ratios of ≈ 5% [Affymetrix 2006] → ≥ 500 bases per experiment to inspect.
- Manual inspection and editing is time-consuming.
Problems and opportunities
Applications for the analysis of resequencing arrays:
- Efficiency: fast base calling, fewest possible ambiguous calls.
- Usability: user-friendly interaction, swift navigation through sequence and intensity data, identifying the impact of mutations.
- Visualization: overview and position-specific visualization of intensity and sequence data.
Here we present:
- ResqMi: a free and extensible application offering calling algorithms and visualization, easy navigation and editing.
- An efficient base calling algorithm.
- ReAnalyze: a scheme to revise no-calls.
ResqMi Calling Algorithms
Efficient calling algorithm
Calling Algorithm
- Model-based algorithm, based on the work of Clark et al. 2007
- Originally used for whole-genome SNP detection, where it has been applied extensively
- General rationale: use naïve calls, unless quality is poor
- Two quality measures:
  - Conformance
  - Ratio of brightest to second-brightest intensity
- Natural approach, similar to manual editing
Conformance
- Non-reference calls: lower intensities at adjacent positions
- Less reliable signals around mismatches [Hacia et al. 1999]
- Calls are reliable if their neighbors are reference calls
- Conformance $C_i$ = fraction of base calls equal to the reference within a sliding window around $i$; the window depends on the raw call:
  - Reference call: $i-10 \ldots i+10$
  - Otherwise: $i-20 \ldots i-10$ and $i+10 \ldots i+20$
- Threshold $\mu$ for $C_i$
[Figure: window placement around a mismatch site and within a stretch of match sites]
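As a rough sketch, the conformance of one position could be computed as follows. The window bounds come from the slide; clipping the window at the ends of the sequence and returning 0.0 for an empty window are assumptions, and the function name is hypothetical.

```python
def conformance(raw, ref, i):
    """Fraction of raw calls equal to the reference in the window around i.

    raw and ref are equally long strings of base calls.
    Assumption: window positions outside the sequence are simply skipped.
    """
    if len(raw) != len(ref):
        raise ValueError("raw calls and reference must have equal length")
    if raw[i] == ref[i]:
        # Reference call: one contiguous window i-10 ... i+10.
        window = list(range(i - 10, i + 11))
    else:
        # Non-reference call: two flanking windows that skip the
        # less reliable positions next to the putative mismatch.
        window = list(range(i - 20, i - 9)) + list(range(i + 10, i + 21))
    positions = [j for j in window if 0 <= j < len(raw)]
    if not positions:
        return 0.0  # assumption: no evidence means no conformance
    matches = sum(1 for j in positions if raw[j] == ref[j])
    return matches / len(positions)
```

For a single isolated mismatch, the flanking windows contain only reference calls, so the mismatch itself gets a conformance of 1, while a reference call five positions away sees the mismatch inside its window.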
Intensities Ratio
- Naïve base calls are unreliable if bases have nearly equally bright intensities.
- Small differences might be due to chance.
- Positions with larger differences are more reliable.
- Ratio of the brightest ($P_i$) to the second-brightest ($Q_i$) intensity: $\Delta_i = P_i / Q_i$ → signal-to-noise ratio
- Threshold $\nu$ for $\Delta_i$
[Figure: examples of ambiguous sites and clearly distinguishable sites]
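A minimal sketch of the ratio, assuming four non-negative intensities per position; treating a second-brightest intensity of zero as an infinitely clear signal is an assumption, and the function name is hypothetical.

```python
def intensity_ratio(intensities):
    """Ratio of the brightest (P) to the second-brightest (Q) intensity."""
    ordered = sorted(intensities, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two intensities")
    p, q = ordered[0], ordered[1]
    if q <= 0:
        # Assumption: no competing signal at all counts as maximally clear.
        return float("inf")
    return p / q


clear = intensity_ratio((120.0, 900.0, 80.0, 60.0))      # 900 / 120
ambiguous = intensity_ratio((400.0, 390.0, 50.0, 45.0))  # 400 / 390
```

With a threshold such as 1.25, the first position would pass and the second would not.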
Strict Call
For position $i$, consider the raw base call $B_i$. Using thresholds $\mu$ for the conformance and $\nu$ for the intensities ratio:
- A strict call $S_i = B_i$ is made if $C_i > \mu$ and $\Delta_i > \nu$.
- If $C_i$ or $\Delta_i$ is below its threshold: $S_i = n$.
- If $B_i$ is a non-reference call and there is a non-reference call with higher intensity within $i-5 \ldots i+5$: $S_i = n$.
→ Keeps only reliable calls.
→ Rejects non-reference calls that are likely due to chance.
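The three rules can be sketched for one strand as below. This is an illustration under stated assumptions, not the ResqMi implementation: conformance, ratio and brightest intensity are passed in as precomputed lists, "higher intensity" is read as the brightest intensity at the neighboring position, and values exactly equal to a threshold are rejected.

```python
def strict_calls(raw, ref, conf, ratio, peak, mu=0.8, nu=1.25):
    """Return strict calls for one strand as a string over a, c, g, t, n.

    raw, ref : raw calls and reference bases (equal-length strings)
    conf     : conformance C_i per position
    ratio    : intensities ratio Delta_i per position
    peak     : brightest intensity P_i per position (assumed comparison value)
    """
    n = len(raw)
    out = []
    for i in range(n):
        call = raw[i]
        if conf[i] <= mu or ratio[i] <= nu:
            # Rule 2: either quality measure fails its threshold.
            call = "n"
        elif raw[i] != ref[i]:
            # Rule 3: a brighter non-reference call within i-5 ... i+5
            # makes this non-reference call suspicious.
            for j in range(max(0, i - 5), min(n, i + 6)):
                if j != i and raw[j] != ref[j] and peak[j] > peak[i]:
                    call = "n"
                    break
        out.append(call)
    return "".join(out)
```

The default thresholds are the values reported as working well in the results; any other pair from the explored parameter space can be passed instead.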
Consensus Call
Compare the strict calls from the sense ($S_i^s$) and antisense ($S_i^a$) strands:
- If $S_i^s$ is complementary to $S_i^a$, call the respective base.
- A non-reference call is rejected if there is a brighter non-reference call within $i-5 \ldots i+5$.
- If the strict calls of the two strands disagree:
  1. Strict consensus call: return n.
  2. Relaxed consensus call: return the IUPAC code covering both bases. Example: $S_i^s$ = a, $S_i^a$ = c (which corresponds to g on the sense strand), return r.
The resulting sequence can be used for further analysis.
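The strand comparison can be sketched as follows. It assumes that antisense calls are given in antisense orientation (so they are complemented before comparison) and that an n on either strand yields n, which the slide does not state; the neighborhood check for non-reference calls is left out. All names are hypothetical.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}

# Standard IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("ag"): "r",
    frozenset("ct"): "y",
    frozenset("cg"): "s",
    frozenset("at"): "w",
    frozenset("gt"): "k",
    frozenset("ac"): "m",
}


def consensus_call(sense, antisense, relaxed=False):
    """Combine the strict calls of both strands at one position."""
    if sense == "n" or antisense == "n":
        return "n"  # assumption: a missing strand call cannot be resolved
    other = COMPLEMENT[antisense]  # antisense call in sense orientation
    if sense == other:
        return sense
    if relaxed:
        return IUPAC[frozenset((sense, other))]
    return "n"
```

For the example on the slide, `consensus_call("a", "c", relaxed=True)` complements c to g and returns r, the IUPAC code for a or g.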
Base Calling
For all positions $i$ in both strands:
1. Calculate raw (naïve) calls
2. Calculate conformance using the raw calls
3. Calculate intensities ratios
4. Produce strict calls
Using the strict calls from both strands:
5. Build consensus calls
[Figure: flow chart. For each strand: raw data → raw calls → conformance and ratio → strict call; both strands → consensus call → called sequence]
ResqMi Calling Algorithms
ReAnalyze
ReAnalyze
Any calling algorithm produces n calls. Many n calls are easily resolvable:
1. Homozygous: the highest intensities of both strands correspond to complementary bases.
2. The call would be a reference call.
3. High $\Delta$: clear signal in both strands.
ReAnalyze: call the base if all criteria are satisfied.
- Relaxed version of the calling algorithm.
- Useful to revise data called by any algorithm.
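The three criteria can be sketched as a single check per position. This is a hedged illustration: the function and parameter names are hypothetical, the brightest antisense base is assumed to be given in antisense orientation, and the default cutoff of 1.2 is the empirically determined value quoted in the results.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def reanalyze(call, ref_base, sense_best, antisense_best,
              sense_ratio, antisense_ratio, nu=1.2):
    """Try to resolve an n call; every other call is returned unchanged.

    sense_best / antisense_best   : brightest base on each strand
    sense_ratio / antisense_ratio : intensities ratio on each strand
    """
    if call != "n":
        return call
    homozygous = COMPLEMENT.get(antisense_best) == sense_best
    is_reference = sense_best == ref_base
    clear_signal = sense_ratio >= nu and antisense_ratio >= nu
    if homozygous and is_reference and clear_signal:
        return sense_best
    return "n"
```

Because the function only looks at n calls and raw intensities, it could be applied to sequences produced by any calling algorithm.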
ResqMi
Presenting ResqMi
- “Resequencing using Microarrays”
- Focuses on the Affymetrix GeneChip Sequence Analysis platform
- Interface optimized for usability
  - Editing the called sequence
  - Fast navigation around bases
- Base calling algorithms: model-based, ReAnalyze
- Visualization and quality control
- Freely available for Windows, MacOS, Linux
Sequence Browsing and Editing
- Sequences organized fragment-wise
- Overview of each fragment
- Editable sequence table
- Intensity windows are linked and adjust to the currently selected position
Quality Control
- Plots of CEL files
  - Intensity, standard deviation, coefficient of variation
  - Highest base per position, base color intensities
- Reference $R$ vs. target $T$
  - M-values: $M = \log_2(R/T)$
  - A-values: $A = (\log_2 R + \log_2 T)/2$
- Enlarge and export images
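A small sketch of the two quantities for one pair of intensities, assuming M is the log ratio $\log_2(R/T)$ and A the mean log intensity, as in a standard MA plot; the function name is hypothetical.

```python
import math


def ma_values(r, t):
    """Return (M, A) for a reference intensity r and a target intensity t."""
    if r <= 0 or t <= 0:
        raise ValueError("intensities must be positive to take logarithms")
    m = math.log2(r / t)                      # log ratio
    a = (math.log2(r) + math.log2(t)) / 2.0   # mean log intensity
    return m, a


m, a = ma_values(8.0, 2.0)
```

Here the reference is four times as bright as the target, so M is 2, and the mean of the two log intensities (3 and 1) gives A = 2.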
ResqMi in Detail
- Visualization: intensities profile, spike, sequence logo
- Detailed table of calls: calls, quality, force, original calls
- Annotations for the called sequence
- Overview of all sequences
Results
Experiments
Experiments
Name   Target                Bases   Fragm.   Exp.
CFTR   CFTR                   9511       84     17
Mito   Human mitochondrion   37756      480     14
SARS   SARS Coronavirus      30588        3     44

- For CFTR and Mito, calls made using Affymetrix software were available.
- Exploring the parameter space: $\mu \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$ and $\nu \in \{1.01, 1.05, 1.1, 1.25, 1.33\}$ in each combination, for strict and relaxed consensus calls.
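The explored parameter space can be enumerated as below. Only the threshold values and the two consensus modes come from the slide; the variable names are illustrative.

```python
from itertools import product

MU_VALUES = (0.5, 0.6, 0.7, 0.8, 0.9)        # conformance thresholds
NU_VALUES = (1.01, 1.05, 1.1, 1.25, 1.33)    # intensities ratio thresholds
CONSENSUS_MODES = ("strict", "relaxed")

# Every combination of mu, nu and consensus mode.
grid = list(product(MU_VALUES, NU_VALUES, CONSENSUS_MODES))
```

This gives 5 × 5 × 2 = 50 settings per dataset.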
Results
No-call rate
[Figure: mean no-call ratio (0.0 to 0.5) for each combination of µ = 0.5, 0.6, 0.7, 0.8, 0.9 (panels) and ν = 1.01, 1.05, 1.1, 1.25, 1.33 (x-axis). Series: CFTR, Mito and SARS with relaxed and strict consensus; CFTR and Mito also with Affymetrix calls.]
No-call ratio: rate of n calls
- Linear effect of µ and ν on the calling rate
- µ has a smaller effect than ν; µ = 0.9 yields many n calls
- Strict calls: more n calls; relaxed calls: more ambiguous calls
Non-reference call rate
[Figure: mean non-reference ratio (0.00 to 0.15) for each combination of µ = 0.5, 0.6, 0.7, 0.8, 0.9 (panels) and ν = 1.01, 1.05, 1.1, 1.25, 1.33 (x-axis). Series: CFTR, Mito and SARS with relaxed and strict consensus; CFTR and Mito also with Affymetrix calls.]
Non-reference call ratio: rate of non-reference calls
- Parameters have a similar impact on non-reference call rates
- Conformance cutoff: more effect than the ratio cutoff
- Consensus method: heavy impact
Results
- Concordance between calls and reference: Cohen's $\kappa \geq 0.5$
- Results comparable to Affymetrix
- Relaxing thresholds: little effect, $C_i \approx 0.25$ for some $i$
- Data called using parameters $\mu = 0.8$ and $\nu = 1.25$ ($\kappa \geq 0.54$)
- Applying ReAnalyze: cutoff $\nu \geq 1.2$ (empirically determined), reducing the n-rate by up to 9.35%
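For readers unfamiliar with the concordance measure, here is a sketch of the standard definition of Cohen's kappa applied to two equally long call sequences. How the values above were actually computed (for example, how n calls were handled) is not stated, so this is only the textbook formula with a hypothetical function name.

```python
from collections import Counter


def cohens_kappa(calls, reference):
    """Cohen's kappa between two equally long sequences of labels."""
    if len(calls) != len(reference) or not calls:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(calls)
    # Observed agreement: fraction of positions with identical labels.
    observed = sum(1 for a, b in zip(calls, reference) if a == b) / n
    # Agreement expected by chance from the label frequencies.
    freq_calls, freq_ref = Counter(calls), Counter(reference)
    expected = sum(freq_calls[k] * freq_ref[k] for k in freq_calls) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences consist of one identical label
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects the raw agreement for agreement expected by chance, which matters here because most positions are reference calls in both sequences.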
Discussion
Calling algorithm
Calling algorithm
- Call ratios better than or comparable to Affymetrix
- Good concordance with the reference
- Faster, easier handling in ResqMi
- Impact of parameters is linear and predictable
- $\mu = 0.8$ and $\nu = 1.25$ give good results
- ReAnalyze can further improve the calling rate
Take home message
Download ResqMi!
- It’s free: open source software
- Fast calling (5 min for our datasets; other software takes far longer)
- Fast and flexible navigation, editing, data handling
- Helps to identify the impact of mutations
- ReAnalyze reassesses all called sequences
- Visualization, quality control, export...
www-ps.informatik.uni-tuebingen.de/resqmi
Acknowledgements
Kirstin Weber
Michael Bonin
Kay Nieselt
Gunnar Rätsch
Thank you!