
SKYLINE NORMALIZATION OF DNA CHROMATOGRAMS BY REGRESSION

Lucio Andrade, Elias S. Manolakos

Communications and Digital Signal Processing (CDSP) Center
Electrical and Computer Engineering Department, Northeastern University, Boston MA 02115
{landrade, elias}@ece.neu.edu

ABSTRACT

A skyline normalization method for electrophoretic DNA data is described. In contrast to commonly used heuristic windowing methods, we propose a channel-coupled exponential decay model to describe the inevitable drop in peak height towards the low-SNR region of the trace. We provide means for its estimation that are robust to artifacts, compressions, false peaks and the applied chemistry. We demonstrate that this technique can improve the interpretation of chromatograms, and therefore base-calling accuracy, when using either a pattern recognition approach (as in [1]) or a peak shape fitting base-calling approach (as in [2, 3, 4]).

1. INTRODUCTION

Automated DNA sequencing methods rely on the capability to separate DNA fragments according to their length. This task is accomplished using gel (or capillary) electrophoresis, giving rise to the so-called DNA chromatograms (or electropherograms). The size of a peak is directly related to the number of DNA fragments, in a subpopulation of fixed-length fragments, ending with the same base. By considering the order of arrival of tagged subpopulations of DNA fragments at a detector, the order of DNA bases (A, C, G, T) in the sample under analysis can be inferred [5].

The objective of base-calling software is to accurately discriminate true peaks, generated by a significant population of same-length DNA strands passing the detection point, from insignificant spurious peaks, possibly generated by noise, cross-correlation between the four channels, or several other well-known artifacts.

Recently, there have been efforts to address base-calling in statistically sound modeling frameworks [1, 3, 2, 4]. Such methods have shown a lot of promise in terms of generality, read-length accuracy and base confidence estimation.
However, experience has taught us that the signal pre-processing steps should be carefully re-examined before the full potential of such methods is realized and their performance reaches a level that routinely exceeds that of Phred [6], a popular base-caller which relies on data-sensitive heuristics.

We introduce here a new algorithm for accurate signal normalization, currently used in the signal pre-processing stage of the BEM base-caller [1]. We first discuss the limitations of simplistic normalization approaches to motivate the need for more intelligent schemes. Then we formulate the problem of data normalization and solve it using regression methods. We show that the resulting simple algorithm can track the “skyline” and correctly normalize data in both the good and the poor quality parts of a trace, and that it performs well with data generated using different dye chemistries. We demonstrate how it improves unsupervised clustering in the context of the BEM base-caller and discuss why we expect it to also work well with base-calling methods that utilize peak shape fitting, such as those reported in [2, 3, 4].

2. BACKGROUND AND RELATED WORK

The uncertainty in peak height is a known problem when interpreting DNA electrophoretic data. On average, peak height gets smaller as the length of the synthesized strands increases, due to the competition between ddNTPs and dNTPs: as strand synthesis progresses, the size of the longer strand subpopulations gets smaller [5]. The dNTP:ddNTP ratio can be carefully manipulated to obtain meaningfully sized subpopulations of labeled strands beyond 1000 bp; however, the exponential drop of the resulting signal height is inevitable.

Since we expect this decaying exponential peak height behavior, it is intuitively appealing to try to estimate its trend (the signal “skyline”) and use it to correctly normalize the signal. Among the published works in base-calling this intuitive approach has not been used before, perhaps because the variation of peak height is also affected by other factors, making it difficult to correctly estimate the exponential decay. Some of these factors are: 1) the appearance of unexpected large peaks due to unincorporated dyes; 2) the appearance of band compressions resulting from the formation of secondary structures in the DNA fragments; 3) false stops in dye primer chemistry caused by primer-dimer formations or default fragments; 4) overlapping peaks towards the end of the chromatogram due to loss of resolution; and 5) the randomness in the quantity of strands at a certain length, e.g. the increased uncertainty associated with using terminator dye labels. For examples of these types of problems see [7].
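As a hedged illustration of the normalization idea (not the exact formulation of this paper), suppose the skyline of channel $c$ is written $s_c(p)$, with a channel-specific level $A_c$ and a decay rate $\lambda$ shared by all channels. The symbols $s_c$, $A_c$, $\lambda$, $x_c$ and $\tilde{x}_c$ are introduced here only for illustration.

```latex
s_c(p) = A_c \, e^{-\lambda p}
\quad\Longrightarrow\quad
\log s_c(p) = \log A_c - \lambda p ,
\qquad
\tilde{x}_c(p) = \frac{x_c(p)}{s_c(p)} .
```

Under this assumption the decay becomes a straight line on a logarithmic scale, with intercept $\log A_c$ and slope $-\lambda$, and dividing the trace $x_c(p)$ by the estimated skyline brings the expected peak height to a common level along the trace and across channels.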
In addition to normalizing the dynamic range towards the low-SNR region of the data, it is also desirable to normalize the expected peak height across channels, in order to disambiguate among multiple peaks appearing in the same region. At large read lengths, small peaks underneath bigger ones may represent true base-calls and can be missed if the data are not correctly normalized.

Alternative methods for scaling the raw data of DNA chromatograms have not been adequately discussed in the literature. Published base-callers [8, 7] typically use a windowing approach for normalization: the maximum sample in each window is found, and linear interpolation is used before applying normalization to avoid sharp discontinuities. The authors of [6] use the average of peak areas at [...]
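The windowing approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of any published base-caller: the window length, the placement of each maximum at the window center, the flat extension beyond the first and last window, and the function names `window_maxima`, `linear_envelope` and `window_normalize` are all hypothetical choices.

```python
def window_maxima(signal, window):
    """Return one (center_index, max_value) knot per consecutive window."""
    knots = []
    for start in range(0, len(signal), window):
        chunk = signal[start:start + window]
        center = start + (len(chunk) - 1) / 2.0
        knots.append((center, max(chunk)))
    return knots


def linear_envelope(knots, n):
    """Interpolate linearly between knots; hold the end values flat."""
    envelope = []
    for i in range(n):
        if i <= knots[0][0]:
            envelope.append(knots[0][1])
        elif i >= knots[-1][0]:
            envelope.append(knots[-1][1])
        else:
            for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
                if x0 <= i <= x1:
                    w = (i - x0) / (x1 - x0)
                    envelope.append(y0 + w * (y1 - y0))
                    break
    return envelope


def window_normalize(signal, window, floor=1e-9):
    """Divide each sample by the interpolated window-maximum envelope."""
    knots = window_maxima(signal, window)
    envelope = linear_envelope(knots, len(signal))
    # The floor (an assumed safeguard) avoids division by zero in empty windows.
    return [s / max(e, floor) for s, e in zip(signal, envelope)]


# Toy trace with one peak per window (made-up values).
trace = [0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0]
normalized = window_normalize(trace, window=4)
```

In this toy trace both peaks are rescaled to the same height even though one is 2.5 times larger in the raw data. A window that contains only noise or an artifact would be rescaled by its own maximum in the same way, which is consistent with the weakness this paper attributes to windowing.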

[Fig. 1. Channel T, D1090: three panels plotting Strength versus Time over samples 7200 to 8000.]

[Equations not recoverable from the source.]

where [...] is the set of peaks in channel [...] and |·| denotes set cardinality. The two upper panels of Figure 2 show the linearized regressor for two data sets sequenced using different chemistries: primer dyes (PD) on the left and terminator dyes (TD) on the right. TD chromatograms usually present a larger decay rate (slope) than PD. The bottom scatter plots show the residual analysis for the respective data sets. In both cases the residuals for every channel are confined within a “box”, indicating that the assumptions made are reasonable and the regressor is satisfactory.

It can be observed that [...] and [...] may differ from channel to channel. It is well known that the uncertainty in peak height is larger in TD data, which is also confirmed by the relatively larger size of the boxes in the right panel compared to those in the left panel. However, [...] does not depend on the type of chemistry (as [...] and [...] do). It rather depends on the quantities of dyes of different types in the solution and on the proportion of different types of bases in the underlying sequence.

Fig. 2. Skyline regression for two data sets (panel titles: Skyline regression D:1090, Skyline regression D:4106, Residual analysis D:1090, Residual analysis D:4106). Primer (terminator) dye labeled data are used in the top left (right) panel respectively. The top panels plot log(dl_p) against p (location) for channels G, A, C and T. Each channel has a different intercept but all have the same decay rate (slope). The bottom panels plot log(dl_p) minus its fitted value against the fitted value. Residual analysis (bottom panels) suggests that in both cases the assumptions are reasonable and the regressor is satisfactory.
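The channel-coupled fit described above (a different intercept per channel, one slope for all channels) can be sketched with ordinary least squares. This is only an illustration under stated assumptions. The regressed quantity is taken to be the logarithm of some positive per-peak value, following the axis label log(dl_p) of Figure 2. The estimator is the standard closed-form least-squares fit for parallel lines, which may differ from the estimator used in the paper. The names `fit_shared_slope` and `skyline` are hypothetical.

```python
import math


def fit_shared_slope(peaks_by_channel):
    """Fit log(value) = intercept[channel] + slope * location by least squares.

    peaks_by_channel maps a channel name to a list of (location, value)
    pairs with value > 0. The slope is shared by all channels.
    """
    means = {}
    sxy = 0.0
    sxx = 0.0
    for channel, peaks in peaks_by_channel.items():
        xs = [p for p, _ in peaks]
        ys = [math.log(v) for _, v in peaks]
        mx = sum(xs) / len(xs)
        my = sum(ys) / len(ys)
        means[channel] = (mx, my)
        # Pool the within-channel cross products over all channels.
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercepts = {ch: my - slope * mx for ch, (mx, my) in means.items()}
    return slope, intercepts


def skyline(location, channel, slope, intercepts):
    """Evaluate the fitted exponential curve for one channel."""
    return math.exp(intercepts[channel] + slope * location)


# Synthetic, noise-free peaks (assumed values, not data from the paper).
true_slope = -2.0e-4
true_intercepts = {"A": 6.0, "T": 5.5}
locations = [2000, 4000, 6000, 8000, 10000]
peaks = {ch: [(p, math.exp(a + true_slope * p)) for p in locations]
         for ch, a in true_intercepts.items()}
slope, intercepts = fit_shared_slope(peaks)
```

On noise-free synthetic data the fit recovers the slope and intercepts it was generated from. With real peaks, outliers such as dye blobs or compressions would call for a robust variant, which this sketch does not attempt.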

4. EXPERIMENTAL RESULTS AND DISCUSSION

4.1. Impact of normalization on the BEM base-caller

In [1] we used a mixture of Gaussians to model peak events in a DNA chromatogram. Every peak is represented by two features: width (time support) and strength (normalized height). Typically, BEM scatter plots (Figure 3) exhibit three clusters of events. The first includes the spurious peaks due to noise and artifacts and is located close to the origin. The second, a large cluster of events with widths typically between 10 and 20 samples located towards the center of the plane, contains the single base-call peaks. The third, usually situated in the area above the second, includes events due to overlapped alleles which represent more than one base-call. Correct learning leads directly to fewer boundary classification errors and thus to fewer base-calling errors.

The top row panels of Figure 3 show two examples of extracted peak events in two different channels when the windowing normalization approach is used. The corresponding bottom row panels show the events extracted when the regressed model is used to normalize the data instead. The rest of the pre-processing operations applied to the raw data were the same in both cases and included de-correlation, baseline correction and mobility shift correction. The solid lines represent the contours at [...] the maximum value of the class conditional densities after BEM learning.

From the scatter plots of Figure 3 we observe that the proposed normalization method improves the shape of the clusters, especially that of [...] and [...]. Furthermore, we observe that the width uncertainty is also reduced, because correct normalization leads to better estimation of the time support of every peak. The average probability of error of the parametric model was reduced from [...] to [...] for the PD example and from [...] to [...] for the TD example. Although not all base-calling errors are due to errors of the mixture model classifier, fewer base-calling errors are expected when the model has a larger likelihood. Furthermore, regressive normalization seems to have a larger impact on chromatograms with poorly separable peak event clusters.

Fig. 3. Strength-width scatter plots of peak events extracted by the BEM base-caller (panel titles: D1089−T, D4107−A; axes: strength, width). The left (right) column of plots shows the clustering results for channel T (A) of a primer (terminator) dye labeled data set respectively. Normalization based on the proposed (windowing) approach is used in the bottom (top) row of panels respectively. See text for discussion.
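Classifying a peak event with a Gaussian mixture over (strength, width) can be sketched as follows. This is not the BEM learning procedure of [1]: it only evaluates class posteriors for mixture parameters that are assumed to be already learned. All priors, means and covariances, the class names `spurious`, `single` and `multiple`, and the function names are made up for illustration.

```python
import math


def gaussian2d(x, mean, cov):
    """Density of a 2-D Gaussian with a full 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean) for a 2x2 matrix.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def classify(event, classes):
    """Return the most probable class and all posteriors for one event."""
    joint = {name: prior * gaussian2d(event, mean, cov)
             for name, (prior, mean, cov) in classes.items()}
    total = sum(joint.values())
    posteriors = {name: value / total for name, value in joint.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Hypothetical mixture over (strength, width); every number is an assumption.
CLASSES = {
    "spurious": (0.3, (0.1, 4.0), ((0.01, 0.0), (0.0, 4.0))),
    "single": (0.6, (1.0, 15.0), ((0.04, 0.0), (0.0, 9.0))),
    "multiple": (0.1, (1.3, 28.0), ((0.09, 0.0), (0.0, 25.0))),
}

label, posteriors = classify((1.0, 15.0), CLASSES)
```

An event far from every class center could make all densities underflow to zero and the division fail; a production classifier would work with log densities instead. The sketch also shows why normalization matters here: a mis-scaled strength moves an event across a class boundary.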

4.2. Impact on peak fitting base-calling approaches

Some authors [2, 3, 4] have proposed to model the DNA chromatogram time series data as a sum of kernels, usually of Gaussian shape. Although extensive experimental evidence of the success of such an approach has yet to be provided, peak shape fitting is in principle a possible alternative to feature extraction. We believe that putting more emphasis on signal pre-processing algorithms (such as correct baseline estimation [9], improved mobility shift estimation, etc.) could pay off by leading to better performance when employing a maximum likelihood peak shape fitting base-caller.

We show next, using an illustrative example, how correct skyline normalization may contribute to correct decision making in a peak shape fitting base-calling context. In Figure 4 we tried to fit the correct (incorrect) number of 12 (11) base symbols to the same region of the T channel of chromatogram D1090 depicted in Figure 1. The first (second) panel corresponds to the outcome using the proposed normalization method under the correct (incorrect) hypothesis. The third (fourth) panel corresponds to the outcome using the more traditional windowing approach under the correct (incorrect) hypothesis.

Fig. 4. Fitting Gaussian shapes, after normalizing the same trace region of Figure 1 using the proposed regressed skyline approach (top two panels) versus using a windowing approach (bottom two panels). A correct and an incorrect hypothesis are considered in each case. The correct (incorrect) hypothesis tries to fit 5 (4) base symbols to the wide bump in panels 1 and 3 (2 and 4) respectively. Estimated amplitude values are shown next to each fitted Gaussian. See text for discussion.

The correct hypothesis tries to fit five base symbols into the broad bump, one more than the incorrect hypothesis. The fitted model (dashed line) is given by

$$ y(t) = \sum_{k=1}^{K} a_k \exp\left( -\frac{(t - \tau_k)^2}{2\sigma_k^2} \right) + n(t) \qquad (5) $$

where $K$ is the number of base symbols considered by the hypothesis, and $\tau_k$, $\sigma_k$ and $a_k$ are the time location, spreadness and amplitude of the peak shape representing base symbol $k$. The term $n(t)$ is the noise component, usually modeled as additive and Gaussian. In general, $\sigma_k$ is a slowly varying function of time, i.e. it is expected that neighboring peaks spread similarly. Accurate base-calling depends on the correct estimation of $a_k$; its value helps disambiguate against symbols appearing in the same region but in other channels (lanes). After normalization of the complete chromatogram [...]. Any deviation of $a_k$ from [...] potentially increases the risk of base-calling errors, especially when under-calls are generated.

For the cases shown in Figure 4 we used exhaustive search to find the optimal estimates of the unknown parameters ($\tau_k$, $\sigma_k$, $a_k$). We show the value of $a_k$ for each candidate base symbol next to the respective peak. When using skyline normalization by regression (top two panels) all single base peaks are within [...], while using the windowing approach (bottom two panels) the peaks surrounding the broad bump get suppressed to less than [...] (footnote 1).

Let us now consider only the peaks inside the broad bump. The correct hypothesis (top panel) has mean square error (footnote 2) value [...], while the wrong hypothesis (second panel) has [...]. The lower value suggests assigning five base symbols to the broad bump, which is correct. However, using a windowing normalization approach, the incorrect hypothesis (bottom panel) exhibits [...], which is less than that of the correct hypothesis, [...] (third panel), leading us to accept the wrong hypothesis.

Footnote 1: In the example, [...] when using skyline normalization by regression and [...] when using the windowing approach.
Footnote 2: For every peak the error is [...].
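The comparison of hypotheses by mean square error can be sketched as follows, using a sum of Gaussian peak shapes as the noise-free part of the fitted model. This is a hedged illustration, not the procedure used in the paper: the error is computed per sample rather than per peak, no search over parameters is performed, and the synthetic bump, the candidate parameters and the names `gaussian_sum` and `mean_square_error` are all assumptions.

```python
import math


def gaussian_sum(t, peaks):
    """Noise-free model value at time t.

    peaks is a list of (location, spread, amplitude) triples.
    """
    return sum(a * math.exp(-((t - tau) ** 2) / (2.0 * sigma ** 2))
               for tau, sigma, a in peaks)


def mean_square_error(times, samples, peaks):
    """Mean squared difference between the samples and the fitted model."""
    return sum((y - gaussian_sum(t, peaks)) ** 2
               for t, y in zip(times, samples)) / len(samples)


# Synthetic "broad bump": three overlapping unit-amplitude peaks (assumed).
truth = [(40.0, 5.0, 1.0), (50.0, 5.0, 1.0), (60.0, 5.0, 1.0)]
times = list(range(100))
samples = [gaussian_sum(t, truth) for t in times]

# Two competing hypotheses with hand-picked (not optimized) parameters.
three_peaks = truth
two_peaks = [(43.0, 7.0, 1.4), (57.0, 7.0, 1.4)]

mse_three = mean_square_error(times, samples, three_peaks)
mse_two = mean_square_error(times, samples, two_peaks)
```

Here the three-peak hypothesis reproduces the synthetic trace and therefore has the smaller error. The two-peak hypothesis can only approach the bump by using amplitudes well above 1, which is the kind of deviation that correct normalization makes visible.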


5. CONCLUSIONS

In this paper the peak height drop expected in DNA chromatograms is modeled as a channel-coupled exponential decay curve. We show how the curve (skyline) can be estimated by exploiting the proportionality of the uncertainty to the peak height per se. Regression is performed simultaneously on the four channels and over the entire chromatogram. This provides an alternative to windowing approaches, which usually fail in the presence of unexpected artifacts, false peaks, overlapping peaks, or even the absence of peaks. Skyline normalization based on regression is shown to increase the classification performance of the BEM base-caller. We also explain why it is expected to positively impact peak shape fitting base-calling methods by reducing the variance of peak heights.

When it comes to analyzing DNA chromatograms, skyline correction is only one of the commonly applied signal pre-processing operations that have not been adequately addressed in the literature. Since we believe that there is a lot of room for improvement, other signal pre-processing problems that we are currently revisiting include baseline correction [9], mobility shift correction, inter-base distance estimation, etc.

6. REFERENCES

[1] M. Pereira, L. Andrade, S. El-Difrawy, B. Karger, and E. Manolakos, “Statistical learning formulation of the DNA base-calling problem and its solution using a Bayesian EM framework,” Discrete Applied Mathematics, vol. 104, no. 1–3, pp. 229–258, 2000.
[2] D. Brady, M. Kocic, A. Miller, and B. Karger, “Maximum Likelihood Base-Calling for DNA sequencing,” IEEE Trans. on Biomedical Engineering, vol. 47, no. 9, pp. 1271–1280, 2000.
[3] S.W. Davies, M. Eizenman, and S. Pasupathy, “Optimal structure for automating processing of DNA sequences,” IEEE Trans. on Biomedical Engineering, vol. 46, no. 9, pp. 1044–1056, 1999.
[4] N.M. Haan and S.J. Godsill, “Modeling electropherogram data for DNA sequencing using variable dimension MCMC,” in ICASSP 2000, Istanbul, 2000, pp. 3542–3545, IEEE.
[5] L. Alphey, DNA sequencing: From experimental methods to bioinformatics, Springer-Verlag, 1997.
[6] B. Ewing, L. Hillier, M. Wendl, and P. Green, “Base-calling of automated sequencer traces using phred. I. Accuracy assessment,” Genome Research, vol. 8, pp. 175–185, 1998.
[7] Perkin-Elmer, Applied Biosystems, Foster City, CA, ABI PRISM, DNA sequencing analysis software, user’s manual, 1996.
[8] M.C. Giddings, J. Severin, M. Westphall, J. Wu, and L.M. Smith, “A software system for data analysis in automated DNA sequencing,” Genome Research, vol. 8, pp. 644–665, 1998.
[9] L. Andrade and E. Manolakos, “Accurate Estimation of the Signal Baseline in DNA Chromatograms,” in IEEE Neural Networks for Signal Processing, Sept. 2002, in press.