A Fast and Scalable Algorithm for Alignment of Optical DNA Mappings

3 downloads 0 Views 1MB Size Report
Nov 25, 2013 - nel confinement allows for integration of the stretching in a lab on a chip context ...... Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, et al.
1

arXiv:1311.6379v1 [q-bio.QM] 25 Nov 2013

A Fast and Scalable Algorithm for Alignment of Optical DNA Mappings Charleston Noble1,∗ , Adam N. Nilsson2 , Camilla Freitag3 , Jason P. Beech4 , Jonas O. Tegenfeldt5 , Tobias Ambj¨ ornsson6 1 Charleston Noble Department of Astronomy and Theoretical Physics, Lund University, Lund, Sweden 2 Adam N. Nilsson Department of Astronomy and Theoretical Physics, Lund University, Lund, Sweden 3 Camilla Freitag Department of Physics, Gothenburg University, Gothenburg, Sweden 4 Jason P. Beech Division of Solid State Physics, Department of Physics, Lund University, Lund, Sweden 5 Jonas O. Tegenfeldt Division of Solid State Physics, Department of Physics, Lund University, Lund, Sweden 6 Tobias Ambj¨ ornsson Department of Astronomy and Theoretical Physics, Lund University, Lund, Sweden ∗ E-mail: [email protected]

Abstract Optical mapping represents a promising complementary approach to DNA sequencing and has the potential for rapid identification of bacterial strains and species and can serve as an important clinical tool in the fight against antibiotic resistance. However, leading approaches face a difficult corrective image analysis problem for which only one method has been developed as a solution, and this method is prohibitively slow for molecules greater than a few hundred kbp in length. In this article, we introduce an algorithm which offers orders-of-magnitude improvement in computational time while simultaneously improving the quality of corrected images compared to the previous approach. We compare the two methods using several objective criteria, including noise reduction and computational time, and we present a new score for information content in optical DNA mappings which should see widespread application among experimentalists, particularly for optimizing mapping conditions. We show that our method outperforms the only current technique by each criterion considered and in every example tested, offering only improvements without requiring any compromise. Furthermore, we demonstrate our algorithm’s ability to process a simulated optical map with length corresponding to an intact human genome quickly on a typical desktop computer.

Author Summary Traditionally, analysis of an organism’s DNA content has relied on determining its DNA sequence at base pair resolution, even if this much information is unnecessary for the desired application. Optical mapping is a complementary approach which instead produces a “high level” view of the DNA sequence and offers a number of distinct advantages, particularly rapid species identification. However, leading optical mapping techniques face a difficult image analysis problem for which only one current technique serves as a solution. And this technique suffers from a number of drawbacks, most importantly its long computational time, which render it infeasible for use in many applications. We present an alternative approach which improves computational time by several orders of magnitude without requiring any tradeoffs. That is, our method improves over the previous approach in each objective criterion considered. Finally, we demonstrate its ability on a simulated optical map corresponding to the length of an intact human genome, showing that our method should be able to easily handle optical mapping data for the foreseeable future without requiring further improvement.

2

Introduction Optical mapping is an emergent complementary approach to DNA sequencing [1, 2]. Though recently “third generation” single-molecule DNA sequencing techniques have brought rapid sequencing of single DNA molecules close to reality [3–8], it is widely believed that the next great challenge in the field of genomics will not be sequencing technology itself but analyzing the incredible amount of data that results from it. In this vein, optical mapping produces a lower resolution (typically kbp) map of a DNA molecule that can serve as a scaffold upon which the sequence can then be assembled at bp resolution. Besides serving as a crucial complement to DNA sequencing, optical mapping offers a variety of important applications which bypass the need for sequencing entirely. It can be used to quickly identify large-scale structural variations, including duplications, deletions, insertions, inversions, and translocations, which are increasingly being linked to heritable traits of phenotypic significance [9, 10]. And it allows for the rapid identification of bacterial species and strains and could represent an important step against the growing problem of antibiotic resistance [11–15]. Many techniques for optical mapping have been developed, and they typically rely on sequencespecific DNA modifications at short target sites, followed by imaging and analysis. These sequencespecific modifications can include staining and denaturation (“melt mapping”) [16], fluorocoding [17], competitive binding of intercalating dyes [18], methylation [19], enzymatic nicking [20,21], and enzymatic restriction [22–24]. Now, these optical mapping techniques can roughly be divided into two groups: one based on stretching over a surface and one based on stretching by confinement in nanochannels. While surface–stretching techniques offer a few advantages, such as allowing for near 100% extension [25], mapping via nanochannel confinement allows for integration of the stretching in a lab on a chip context and can be brought to application much more easily. Furthermore, nanochannel confinement allows for the “folding” of DNA into a microscope’s field of view in order to visualize large amounts of DNA in one image [26]. One particular problem inherent to nanochannel confinement techniques is that DNA tends to undergo random diffusive processes during imaging, including center-of-mass diffusion and local stretching [27]. Thus any single image represents only one realization of these stochastic processes. To correct for these effects, kymograph analysis has emerged as an important tool [16]. In this technique, a video, rather than a picture, is taken of the DNA molecule during imaging. Each frame of the video is condensed to a single row of pixels, and these rows are stacked to produce a single image resembling a barcode (see Fig. 1a). This barcode is then aligned, and the resulting time-averaged image represents the DNA molecule at thermodynamic equilibrium (see Fig. 5). Without this alignment procedure, the barcode is obscured, and a great deal of information is lost due to motion blurring, limiting its usefulness for applications. And due to the physical constraints of the DNA molecule itself, traditional sequence and series alignment techniques cannot be employed directly. To the best of the authors’ knowledge, only one algorithm has thus far been developed to specifically address this constrained barcode alignment problem. Developed by Walter Reisner for his 2010 PNAS paper, Single-molecule denaturation mapping of DNA in nanofluidic channels [16], it involves a computationally intensive global optimization procedure based on local “stretching factors.” While it satisfies the physical constraints imposed by the DNA molecule itself, the computational cost of this method is a serious drawback. Its computational time scales roughly with O(n2 ), where n is the width of the barcode, and can require weeks of computational time for DNA segments of even modest length. So, if this barcode technique is to reach application for human-scale DNA kymographs, a new alignment technique is necessary. In this paper we present WPAlign (Weighted Path Align), a novel kymograph alignment algorithm which offers linear scaling in time and can align DNA barcodes with length corresponding to an entire human genome in less than an hour on a typical desktop computer. And besides this DNA barcode alignment problem, we expect our feature detection method, with suitable modifications, to find application in other areas of biological image analysis, such as automated organism tracking [28, 29], or automated

3 study of cellular transport along axons (`a la [30–33]), defects in which have been implicated in a number of neurodegenerative diseases, including Alzheimer’s Disease, Parkinson’s Disease, and Amytrophic Lateral Sclerosis [34, 35]. Additionally, we present a new information score which quantifies the information content of DNA barcodes and should see widespread use as a barcode quality criterion by which experimentalists can evaluate barcodes and optimize experimental mapping conditions.

Results We consider alignment of a “fuzzy” barcode (see Fig. 1a for an illustrative example). WPAlign works very intuitively by detecting pronounced bright or dark ridges (“features”) in such a fuzzy barcode, then stretching the barcode horizontally to straighten the features, increasing sharpness. The great advantage of this approach is that it reduces the problem to a simpler two-step process of (1) single-feature detection, followed by (2) single feature alignment satisfying thermodynamic constraints. This two-step process can then be applied recursively to align the entire kymograph.

Single Feature Detection Consider an image represented as a 2D gray scale intensity function with columns x and rows y, denoted I(x, y), x ∈ [1, n], y ∈ [1, m]. See Fig. 1a for an illustrative example. Then a feature F (y) is a mapping from rows Y to columns X. Intuitively the feature F we would like to detect is the “most pronounced” vertical ridge or valley in I such that: 1. (Completeness) Our detected feature F traverses each row y exactly once. Actual features might not adhere to this constraint, since anomalous events such as DNA breakage could occur, but these events should be treated separately from the alignment process. 2. (Continuity) F ’s horizontal movement is constrained. I.e., |F (y + 1) − F (y)| ≤ k for some integer k and for all y. This is to fit the physical constraints we inherit from the polymer physics of DNA molecules for the present application. Unfortunately, standard feature detection (e.g., ridge detection) methods cannot be applied directly, as the resulting features will not necessarily satisfy these constraints, so we have developed the following three-step procedure for single feature detection: 1. Image pre-processing First the entire image is smoothed with a 2D Gaussian kernel (here we use σ = 10 pixels) to remove noise from random intensity fluctuations (see Fig. 1b and Fig. 4, line 13). Then we apply a Laplacian of Gaussian filter to obtain ∇2 L (see Fig. 1c, and Fig. 4, line 14), a matrix with the same size as I which has large positive values in dark bands and large negative values in light bands. These positive and negative values are linearly rescaled to range from (0, 1] and [−1, 0), respectively. The resulting image we call the Laplacian response and denote as K. Now our feature F should be represented by a continuous region of high (valley) or low (ridge) values in K. We would like to treat these as two distinct cases to avoid detecting a single feature which is partially composed of a ridge and partially composed of a valley, so we compute two separate images, KB and KD which emphasize bright and dark regions, respectively (see Fig. 2, and Fig. 4 line 15). Both are positive or zero everywhere, with small values representing possible feature locations.

4

Figure 1. Image preprocessing. (A) A raw, unaligned, kymograph, 170 pixels wide and consisting of 200 time frames, (B) The kymograph after Gaussian smoothing, and (C) The Laplacian response image, which we refer to as K, that results from applying the Laplacian of Gaussian filter to our smoothed kymograph. In K, light regions represent positive values and dark regions represent negative values. 2. Network assembly We can think of these images KB and KD as energy landscapes, and finding the best ridge or valley becomes a problem of finding the lowest-energy paths through KB and KD , respectively. To do this, we assemble directed, acyclic networks GB and GD as follows (see Fig. 2, and Fig. 4, line 18). For simplicity, we describe our process only for GB , as the process is identical for GD . 1. First, we create one node for every pixel in KB , plus two “peripheral” nodes. Thus GB consists of (m × n + 2) nodes. 2. Now the first of the peripheral nodes is connected to each of the nodes corresponding to the first row of G, and each of the last-row nodes is connected to the second peripheral node with edge weight 1. 3. The rest of the nodes are connected as follows: the node corresponding to pixel (i, j) in KB is connected to pixels in the next row directly below, to left k columns, and to the right k columns (see below). This k value is used to satisfy the continuity constraint above. We have found k = 2 to be reasonable for our current application. Ki,j

Ki−k,j+1

Ki−k+1,j+1

...

Ki+k−1,j+1

Ki+k,j+1

4. If a pixel is too close to a border on the left or right (i.e., i − k < 1 or i + k > n), connections to non-existing nodes prescribed in step 3 are ignored. 5. Finally, every edge is assigned a weight equal to the intensity of the pixel in KB corresponding to the node it’s directed to.

5

Figure 2. Network assembly. The Laplacian response image, K, (see Fig. 1c) has been rescaled into (A) KD and (B) KB , images that emphasize dark and bright regions in K, respectively. White pixels represent barriers that potential features cannot cross, while continuous dark regions indicate likely features. (C) Here we show one example network, although separate networks are indeed assembled for KB and KD . Each node (square) within the rectangular region represents a pixel in KB or KD . The top and bottom nodes (which we term “peripheral nodes”) are added to provide starting/ending points for the shortest path finding algorithm. The width of the edges corresponds to the inverse of the edge weight, and the darkness of the nodes represents the average weight of incoming edges. The red line illustrates the shortest path through the network. For the sake of visual clarity, this network was created using a small subsection of an actual Laplacian response with k = 1.

6 3. Shortest path finding Now every path between the peripheral nodes in our networks represents a potential feature which satisfies the continuity and completeness constraints above, so our task is to find the “best” such feature. Essentially, we are searching over all paths between the peripheral nodes which do not move more than k pixels horizontally between any adjacent rows. Our cost function is the sum of pixel intensities over a path, and we seek to minimize this over the space of these acceptable paths. Thus, intuitively, the best bright feature corresponds to the shortest path in KB , and the best dark feature corresponds to the shortest path in KD . So, our task becomes a shortest path finding problem. Since our networks are directed and acyclic, the shortest path can be computed by a standard dynamic programming algorithm based on topological sorting which grows linearly with the sum of the number of edges and nodes (Fig. 4, line 19) [36]. Note that, in our graph, this sum grows bilinearly with the number of time frames and the horizontal width of our input kymographs. And since the number of time frames does not change, in practice, the sum of nodes and edges in our graph grows linearly with the kymograph width. Thus the computational time of the shortest path finding algorithm also grows linearly with the kymograph width. By this process we obtain two paths, one in KB corresponding to the best bright feature in our original kymograph, and one in KD corresponding to the best dark feature. The path with the shorter length of these two is chosen as the best overall feature (see fig. 2).

Single Feature Alignment Satisfying Thermodynamic Constraints Once the best feature F (y) has been detected by the three-step process above, it is aligned by setting F (y) = hF (y)i (rounding to the nearest integer) for all rows y, and the pixels to the left and right of F (y) are stretched (or compressed) linearly. To determine intensity values at non-integer positions during this linear stretching, we used cubic spline interpolation (see Fig. 3 and Fig. 4, line 23). Note that by setting F (y) = hF (y)i, we ensure that the feature is aligned over its position at thermodynamic equilibrium. Finally, the best single feature in the input image has been both (1) detected and (2) aligned. But we wish to detect and straighten all of the features. Thus to continue the process, the image is split vertically along the newly aligned feature (see Fig. 3c, d, and e), and w columns (where w is set to half the width of a typical feature) are removed from each on the side adjacent to the split. Then the (1) single feature detection and (2) single feature alignment routines are called again on each of these two smaller images if they are wider than typical a feature (i.e., 2×w). In our current application, features are typically ∼ 10 pixels wide, so we use w = 5 pixels, though this parameter can easily be changed depending on the application. (see Fig. 3, and Fig. 4, lines 24–25). It is worth noting that, since we remove columns before calling the process again, the algorithm is guaranteed to converge. If we did not perform this step, the same feature could conceivably be restraightened perpetually.

Comparison with Existing Methods In practice, WPAlign produces alignments which are visually quite appealing (see Fig. 5). To quantify the quality of these alignments, we propose two criteria: time-trace noise reduction and information content. We compare WPAlign to the only previous technique, the Reisner method, based on these criteria and their empirical computational costs.

7

Figure 3. Feature detection and recursion in WPAlign. (a) A raw, unaligned kymograph is given as input (see Fig. 1). (b) The “best” feature is identified from the input kymograph. (c) The feature identified in (b) is aligned via cubic spline interpolation. (d, e) The feature identification process is called recursively on the regions to the right (d) and to the left (e) of the newly-aligned feature. (f) This process is continued until all features have been aligned. Time Trace Noise Reduction The ultimate goal of producing these kymographs is to produce a 1-dimensional intensity profile that is “typical” for the DNA being analyzed. This intensity profile is then what is actually compared to existing databases for genomic analysis. To produce this intensity profile, hIi, we simply take the time-average of an aligned kymograph. That is, m 1 X hI(x)i = I(x, y), m y=1 where m is the number of rows (time frames) in the kymograph. Then the overarching purpose of alignment is to reduce the noise in this time trace. This noise can be quantified by the column-wise variance, given by m 1 X 2 σ 2 (x) = [I(x, y) − hI(x)i] . m y=1 To compare WPAlign and the Reisner method, these variances were calculated for the 10 representative kymographs in Fig. 5, see Supplementary Methods for experimental details on how these kymographs 2 2 were obtained. For every column in each of the kymographs, we obtained values σW (x) and σR (x), corresponding to the variances resulting from alignment by the respective algorithms. Then a distribution 2 2 of values σW (x)/σR (x) was calculated (see Fig. 6). This distribution is normal with mean µ = 0.75, indicating that, for a typical kymograph column, WPAlign reduced variance by 25% when compared to the Reisner method. Thus WPAlign, besides reducing computational costs, reduces kymograph noise and yields a clear improvement over the Reisner approach. In addition, we examined how aligned kymograph noise is affected by the length of the time axis (see

8

Require: Require: Require: Require: Require: Require:

I w r t k b

. Grayscale image with columns x (1 . . . m) and rows y (1 . . . n) . Width of a feature typical to the application . Gaussian smoothing radius for pre-processing . Sigma value for Laplacian of Gaussian filter . Allowable horizontal path distance between time-frames . Barrier threshold value

1: 2: 3:

// Begin the recursion with leftX:=1, rightX:=m, and depth:=1. I ← AlignSingleFeature(I, 1, m, 1)

4: 5: 6: 7: 8:

function AlignSingleFeature(I, leftX, rightX, depth)

9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30:

// Verify that the region is wide enough. if rightX − leftX > 2w then // Extract the region of the image between leftX and rightX. Region ← I(1, . . ., n; leftX, . . ., rightX) // Preprocess the extracted region. IS ← 2DGaussianConvolution(Region, r) K ← LaplacianOfGaussian(IS , t) KB , KD ← Rescale(K) // Assemble the networks and find the shortest path. GB , GD ← AssembleNetworks(KB , KD , k) shortestPath, meanX, distance ← FindShortestPath(GB , GD ) // Continue if the path is acceptable. if distance < b then I, meanPathX ← StraightenSingleFeature(I, shortestPath) I ← AlignSingleFeature(I, leftX+w, meanX, depth+1) I ← AlignSingleFeature(I, meanX, rightX+w, depth+1) end if end if return I end function Figure 4. WPAlign pseudocode.

9

Figure 5. Typical T4 melting map kymographs, raw and aligned via WPAlign. The numbered kymographs represent the raw data, and the aligned versions are displayed directly below each.

10

250 1,400

Comparisons

hI i

200

1,000 Reisner 600

150

WPAlign 0

50

100

150

Position (pixel)

100

50

0 −0.5

0

0.5

1

1.5 σ2W/σ2R

2

2.5

3

Figure 6. Kymograph noise comparison of WPAlign and the Reisner approach. 2 (x)), and the Reisner Column-wise variances were calculated for aligned barcodes using WPAlign (σW 2 2 2 approach (σR (x)). Here we show the distribution of σW /σR . A Gaussian distribution was fit to this data, resulting in a mean of µ = 0.75. Thus for a typical barcode column, WPAlign reduces variance by 25% compared to the Reisner method. (Inset) An example time trace of barcode 1 (see Fig. 5) aligned via the WPAlign and Reisner methods. Error bars represent one standard deviation (i.e., σW and σR ). Fig. 7). To do this, we calculated hσ 2 (x)i, the mean of the column-wise variances given above, given by # " n m X X 1 1 2 hσ 2 (x)i = [I(x, y) − hI(x)i] , n x=1 m y=1 for kymographs with time axes ranging from 20 to 200 time frames. From the results we can see that the noise from kymographs aligned via WPAlign is independent of the number of time frames chosen during filming. However, kymographs aligned by the Reisner approach seem to undergo an increase in noise as the number of frames increases, until plateauing at roughly 120 frames. We present a possible explanation for this result: as the number of time frames increases, the underlying DNA molecule is allowed more time to undergo conformational changes and random diffusive processes, rendering the first and last frames increasingly dissimilar. Thus any template frame chosen by the Reisner method will be increasingly dissimilar from the frames farthest from it in the kymograph. This renders the local stretching factor optimization more prone to becoming stuck in local minima for these “distantly related” frames, introducing noise in the final alignment. WPAlign, on the other hand, avoids this problem as it does not rely on the choice of a single template frame.

Information Score Before alignment, bright and dark features can stray into adjacent kymograph columns as the DNA molecule undergoes horizontal diffusion. Thus thin features are obscured in the time average, effectively broadening peaks and valleys. As alignments improve, however, features occupy less horizontal space and become more apparent in the final time-trace. Since these features are essential in later performing statistical comparisons with other data, increasing contrast between neighboring features (i.e., increasing feature sharpness) essentially increases the information contained in the time trace.

11

−3

5.5

x 10

Reisner

5

WPAlign

4.5

σ2

4 3.5 3 2.5 2 20

40

60

80

100

120

140

160

180

200

Time Frames

Figure 7. Effect of time axis length on aligned kymograph noise. Kymographs with time axes varying from 20 to 200 time frames in length were aligned by both methods. The resulting column-wise 2 2 (x) were calculated as in Fig. 6. Here we plot, for each kymograph, the mean of (x) and σW variances σR 2 2 (x)i. (x)i and hσW these column variances, i.e., hσR To the best of the authors’ knowledge, no measure has yet been developed to quantify this information content, and so we present a new scheme based on the self-information of a random variable, see Supplementary Methods for a comprehensive description. Here in brief, the information J contained in a kymograph is given by " #  X 1 log(|∆Ik |)2 J= − log p exp − 2 log(σ 2 ) 2π log(σ 2 ) k

where σ 2 represents the noise of the underlying kymograph, and the ∆I’s represent intensity differences between neighboring peaks and valleys. Thus information increases as the difference between neighboring peaks and valleys grows and as the noise of the underlying kymograph decreases. That is, time-traces with sharper, more well-defined local extrema will contain more information than time traces from noisy kymographs with broad extrema. Time traces and their corresponding information content were calculated for the representative kymographs shown in Fig. 5 after undergoing alignment by both WPAlign and the Reisner method (see Fig. 8). In every example, WPAlign produces visually sharper and more well-defined time-traces. Furthermore, the information content is greater for WPAlign in all cases. In fact, the average information gain from using WPAlign over the Reisner method on our dataset is roughly 68% (i.e., h(JW − JR )/JR i = 0.68), and in all cases considered, WPAlign produced a larger information score. Thus WPAlign produces alignments whose final time-traces unambiguously contain more information than those from the Reisner method. Furthermore, independent of our kymograph alignment technique, this information score can serve as an objective and easily interpretable barcode quality measurement by which barcodes can be compared, providing a basis for experimental optimization. For example, expected barcodes can be calculated from theory for a number of experimental conditions, and then experiments can be performed only for barcodes expected to yield the highest information content, saving valuable time and resources for experimentalists. Or if theory is not available for a particular optical mapping application, the information score can serve

12

1,400

1,800

1,400

1,600

hIi

1,000

JR = 746 JW = 1000

400 0

500

150

JR = 552

JR = 767

J

JW = 1279

W

= 785

50

600 150

50

150

R

50

JW = 1196

600 150

50

Reisner WPAlign

600

hIi

150

500

800

1,100

1,300

J = 735

JR = 746 JW = 1043

600

800 1200

JR = 668 600

JW = 986

600 50

150

Position (pixels)

JR = 523

J = 592 R JW = 1054 50

100 150

Position (pixels)

J 50

W

= 981

100 150

Position (pixels)

JR = 449 JW = 859 50

Position (pixels)

JR = 327 JW = 758

100 200

50

200

Position (pixels)

Figure 8. Comparison of average intensity traces from kymographs aligned via WPAlign (red) and the Reisner approach (black). The information score is displayed below each trace for both methods, quantifying the contrast between neighboring features. Notably, h(JW − JR )/JR i = 0.68, indicating that WPAlign on average produced time-traces with more information than the Reisner method over our sample set. as a quality criterion for experimental barcodes by which experimental conditions can be rigorously optimized.

Computational Time Empirically, WPAlign exhibits linear scaling with barcode length, n (see Fig. 9). This is somewhat intuitive, as the bottleneck of the approach is the shortest path finding algorithm, and this runs in linear time due to the directed and acyclic qualities of our networks. The Reisner approach, on the other hand, scales with O(n2 ), rendering it impractical for bacterial barcodes on the order of 1Mbp or larger (see Fig. 9). Perhaps most importantly, WPAlign was able to successfully align a simulated barcode with length on the order of a full human genome in only 40 minutes. The Reisner approach would require over 108 seconds, or roughly 3 years, to perform this task on an identical computer. Thus WPAlign is, to the best of the authors’ knowledge, the only method capable of aligning intact human genome barcodes at current kbp resolutions while satisfying the imposed constraints from polymer physics.

Discussion In this paper we presented a new DNA barcode alignment algorithm which outperforms the only current alternative not only in computational speed, but also in all of the objective alignment quality measures considered. Indeed, the orders-of-magnitude improvement in computational speed could open the door for high throughput DNA barcode alignment at the human-genome scale. And to the best of the authors’ knowledge, this is the only currently existing technique suited to perform this task. Thus it should serve

13

10

10

O(n2.2)

Reisner 8

10

WPAlign

6

Time (s)

10

4

10

2

10

O(n1.1)

0

10

−2

10

2

10

3

10

4

10

5

10

Barcode Width (pixels)

Figure 9. Comparison of time scaling for WPAlign and the Reisner approach. Empirical time scaling results for both techniques on identical kymographs ranging between ∼ 102 and ∼ 105 pixels in width, where ∼ 105 pixels is roughly the length of an intact human genome at current resolutions. Power law curves of the form axb were fit to these data (solid lines), yielding b = 1.1 for WPAlign, and b = 2.2 for the Reisner method. Thus WPAlign exhibits approximately O(n) time scaling with barcode width, while the Reisner method scales approximately with O(n2 ). Scaling data beyond ∼ 103 pixels was projected for the Reisner approach (dashed line), as alignment times became prohibitive. Simulated kymographs were generated by concatenating experimental T4 kymographs from above.

14 as an important step in data analysis for a number of nanofluidic optical mapping techniques, including denaturation mapping, fluorocoding, competitive binding, enzymatic nicking, and restriction mapping. And by providing a rapid framework for this data analysis, WPAlign can help bring the many applications of optical mapping closer to application, including bacterial strain and species identification, detection of large-scale genomic structural variation, and scaffold building for third generation de novo sequencing techniques. In addition, our feature detection method, suitably modified, should find application in other domains of biological image analysis, such as automated organism tracking and automated axonal transport studies, aiding an important branch of research which has promising applications in a number of neurodegenerative diseases, inclusing Alzheimer’s Disease, Parkinson’s Disease, and Amytrophic Lateral Sclerosis. And finally our information score, motivated as a way to compare the quality of alignments produced by different methods, should find widespread application among the optical mapping community as an easily calculable and interpretable barcode quality criterion by which to optimize experimental conditions.

Methods Calculating KD and KB Given a Laplacian response K(x, y), linearly scaled such that each value falls in the range [−1, 1], we calculate KD and KB using ( 1 − K(x, y) for K(x, y) > 0 KD (x, y) = B for K(x, y) ≤ 0 ( 1 + K(x, y) for K(x, y) < 0 KB (x, y) = , B for K(x, y) ≥ 0 where B  1 is an arbitrary large constant which creates “barrier” pixels through which paths will not traverse. In this way, we prevent the feature detection algorithm from “jumping” between adjacent features. Then in KD , small values represent dark regions, and in KB small values represent bright regions.

Laplacian of Gaussian Filter First we convolve I with a Gaussian kernel g(x, y, t) =

 2  x + y2 1 exp − 2πt 2t

to give a scale space representation L(x, y; t) = g(x, y, t) ? f (x, y). Then the Laplacian operator ∇2 L = √ Lxx + Lyy is computed, resulting in strong positive responses for dark regions of extent 2t and strong negative responses for bright regions of similar extent [37]. For our data, we have found t = 10 pixels to be reasonable.

Information Score Here we introduce an information score associated with a DNA barcode, given as a 2D intensity profile, I. This score was chosen to quantify the amount, and sharpness, of “robust” peaks and valleys in the barcode.

15 We define these “robust” extrema roughly as those which differ from neighboring pixels by an amount greater than the background noise. In order to quantify this background noise, we assume that I has already been aligned. By the nature of the alignment, each column in I represents a single intensity value obscured by the addition of Gaussian noise which results from random photo-effects and thermal fluctuations. Thus we create a new image whose pixel in the ith row, jth column, is given by I 0 (i, j) = I(i, j) − hIij , where I(i, j) is simply the intensity of this pixel in the aligned kymograph, and hIij is the mean intensity of the jth column of the kymograph. Then I 0 is an image with intensities attributable only to our kymograph’s background noise, so our background variance σ 2 is given by the intensity variance of I 0 , or σ 2 = h(I 0 − hI 0 i)2 i. Now we calculate the time average of I, denoted hI(x)i, given by m

hI(x)i =

1 X I(x, y) m y=1

and we locate the “robust” peaks and valleys by treating hI(x)i as an energy landscape and employing the method of Azbel, developed for a simplified description of the random melting of a one dimensional two-component Ising model [38]. Thus we obtain only local extrema which are “robust”, i.e., regions which to the left and right are surround by “energy barriers” larger than some threshold typically chosen to be on the order of σ. Finally our information score is that of self information [39], and we define the information content J of a DNA barcode by "  # X log(|∆Ik |)2 1 , (1) J= − log p exp − 2 log(σ 2 ) 2π log(σ 2 ) k

where ∆I(k) is the difference between two neighboring “robust” peaks and valleys in the “free energy landscape” and k denotes the numbering of ∆Is found. To justify the use of the logarithm of ∆I(k) rather than I(k) itself in Eq. (1) we note that in a simplistic approach to DNA melting we have that the probability for a DNA basepair to be open is related to the Boltzmann weight P (i) ∝ exp(−β∆E(i)), where ∆E(i) is the energy difference between a basepair being open and closed. Therefore, by using log[I(k)] our information score utilizes free energy differences.

Experimental Procedure To generate the optical DNA mappings shown above, we employed the following experimental protocol. T4 GT7 DNA (Wako Chemicals GmbH, Neuss, Germany) was mixed at a ratio of 1 dye molecule per 6 base pairs with YOYO1 Iodide (Molecular Probes, Oregon, USA) and kept at 50C for 2 hours to maximize the homogeneity of the staining. Melting experiments were performed in a buffer consisting of 10mM NaCl in 0.05 x TBE (1 x TBE is 89mM Tris, 89mM Boric acid and 2mM EDTA). Beta Mercaptoethanol was added to a final concentration of 2% and Formamide to a final concentration of 50%. Using compressed nitrogen the buffer carrying the DNA molecules was forced into 100 nm by 150 nm channels etched in fused silica. Once in the nanochannels the molecules were imaged in a Nikon TE2000 microscope (Nikon, Tokyo, Japan) using a high-pressure mercury lamp for excitation, a FITC filter cube to pick out the fluorescence from the stained DNA and an Andor Ixon DU 897 EMCCD (Andor Technology, Belfast, Ireland) camera for image acquisition. In order to form the melt maps the nanochannel device was brought into contact with an aluminium block, heated to 31.5C at which temperature the molecules are partially denatured.

16

Acknowledgements We thank Karl ˚ Astr¨ om for helpful discussions on image analysis. This work was supported by the Swedish Research Council (VR) grant nos. 2009-2924, 2007-584 and 2007-4454, the EUs seventh Framework Programme (7RP/2007-2013) under grant agreement no. 201418 (READNA), a project grant from VINNOVA (grant number P35735-1), The Carl Tryggers Foundation grant no. 12:13, the Hasselblad Foundation and the Crafoord Foundation grant nos. 2008-0841 and 2005-1123.

References 1. Aston C, Mishra B, Schwartz DC (1999) Optical mapping and its potential for large-scale sequencing projects. Trends in Biotechnology 17: 297–302. 2. Samad A, Huff E, Cai W, Schwartz DC (1995) Optical mapping: a novel, single-molecule approach to genomic analysis. Genome research 5: 1–4. 3. Eid J, Fehr A, Gray J, Luong K, Lyle J, et al. (2009) Real-time dna sequencing from single polymerase molecules. Science 323: 133–138. 4. Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, et al. (2008) Single-molecule dna sequencing of a viral genome. Science 320: 106–109. 5. Gupta PK (2008) Single-molecule dna sequencing technologies for future genomics research. Trends in biotechnology 26: 602–611. 6. Xu M, Fujita D, Hanagata N (2009) Perspectives and challenges of emerging single-molecule dna sequencing technologies. Small 5: 2638–2649. 7. Pushkarev D, Neff NF, Quake SR (2009) Single-molecule sequencing of an individual human genome. Nature biotechnology 27: 847–850. 8. Metzker ML (2009) Sequencing technologies–the next generation. Nature Reviews Genetics 11: 31–46. 9. Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, et al. (2009) Origins and functional impact of copy number variation in the human genome. Nature 464: 704–712. 10. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, et al. (2006) Global variation in copy number in the human genome. nature 444: 444–454. 11. Shukla SK, Kislow J, Briska A, Henkhaus J, Dykes C (2009) Optical mapping reveals a large genetic inversion between two methicillin-resistant staphylococcus aureus strains. Journal of bacteriology 191: 5717–5723. 12. Kotewicz ML, Jackson SA, LeClerc JE, Cebula TA (2007) Optical maps distinguish individual strains of escherichia coli o157: H7. Microbiology 153: 1720–1733. 13. Zhou S, Kile A, Bechner M, Place M, Kvikstad E, et al. (2004) Single-molecule approach to bacterial genomic comparisons via optical mapping. Journal of bacteriology 186: 7773–7782. 14. Zhou S, Deng W, Anantharaman TS, Lim A, Dimalanta ET, et al. (2002) A whole-genome shotgun optical map of yersinia pestis strain kim. Applied and environmental microbiology 68: 6321–6331.

17 15. Welch R, Burland V, Plunkett G, Redford P, Roesch P, et al. (2002) Extensive mosaic structure revealed by the complete genome sequence of uropathogenic escherichia coli. Proceedings of the National Academy of Sciences 99: 17020–17024. 16. Reisner W, Larsen NB, Silahtaroglu A, Kristensen A, Tommerup N, et al. (2010) Single-molecule denaturation mapping of dna in nanofluidic channels. Proceedings of the National Academy of Sciences 107: 13294–13299. 17. Neely RK, Dedecker P, Hotta Ji, Urbanaviˇci¯ ut˙e G, Klimaˇsauskas S, et al. (2010) Dna fluorocode: a single molecule, optical map of dna with nanometre resolution. Chemical Science 1: 453–460. 18. Nyberg LK, Persson F, Berg J, Bergstr¨om J, Fransson E, et al. (2012) A single-step competitive binding assay for mapping of single dna molecules. Biochemical and biophysical research communications 417: 404–408. 19. Lim SF, Karpusenko A, Sakon JJ, Hook JA, Lamar TA, et al. (2011) Dna methylation profiling in nanochannels. Biomicrofluidics 5: 034106. 20. Baday M, Cravens A, Hastie A, Kim H, Kudeki DE, et al. (2012) Multicolor super-resolution dna imaging for genetic analysis. Nano letters 12: 3861–3866. 21. Das SK, Austin MD, Akana MC, Deshpande P, Cao H, et al. (2010) Single molecule linear analysis of dna in nano-channel labeled with sequence specific fluorescent probes. Nucleic acids research 38: e177–e177. 22. Schwartz DC, Li X, Hernandez LI, Ramnarain SP, Huff EJ, et al. (1993) Ordered restriction maps of saccharomyces cerevisiae chromosomes constructed by optical mapping. Science 262: 110–114. 23. Meng X, Benson K, Chada K, Huff EJ, Schwartz DC (1995) Optical mapping of lambda bacteriophage clones using restriction endonucleases. Nature Genetics 9: 432–438. 24. Cai W, Aburatani H, Stanton VP, Housman DE, Wang YK, et al. (1995) Ordered restriction endonuclease maps of yeast artificial chromosomes created by optical mapping on surfaces. Proceedings of the National Academy of Sciences 92: 5164–5168. 25. Marie R, Pedersen JN, Bauer DL, Rasmussen KH, Yusuf M, et al. (2013) Integrated view of genome structure and sequence of a single dna molecule in a nanofluidic device. Proceedings of the National Academy of Sciences 110: 4893–4898. 26. Freitag C, Fritzsche J, Persson F, Ambj¨ornsson T, Noble C, et al. (2013) Mapping of intact chromosomes in meandering nano-channels. Unpublished Manuscript . 27. Reisner W, Morton KJ, Riehn R, Wang YM, Yu Z, et al. (2005) Statics and dynamics of single dna molecules confined in nanochannels. Phys Rev Lett 94: 196101. 28. Tsibidis GD, Tavernarakis N (2007) Nemo: a computational tool for analyzing nematode locomotion. BMC neuroscience 8: 86. 29. Huang KM, Cosman P, Schafer WR (2008) Automated detection and analysis of foraging behavior in c. elegans. In: Image Analysis and Interpretation, 2008. SSIAI 2008. IEEE Southwest Symposium on. IEEE, pp. 29–32. 30. Chetta J, Shah SB (2011) A novel algorithm to generate kymographs from dynamic axons for the quantitative analysis of axonal transport. Journal of neuroscience methods 199: 230–240.

18 31. Obashi K, Okabe S (2013) Regulation of mitochondrial dynamics and distribution by synapse position and neuronal activity in the axon. European Journal of Neuroscience : n/a–n/a. 32. Zhou HM, Brust-Mascher I, Scholey JM (2001) Direct visualization of the movement of the monomeric axonal transport motor unc-104 along neuronal processes in livingcaenorhabditis elegans. The Journal of Neuroscience 21: 3749–3755. 33. Miller KE, Sheetz MP (2004) Axonal mitochondrial transport and potential are correlated. Journal of Cell Science 117: 2791-2804. 34. Chevalier-Larsen E, Holzbaur EL (2006) Axonal transport and neurodegenerative disease. Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease 1762: 1094–1108. 35. Haghnia M, Cavalli V, Shah SB, Schimmelpfeng K, Brusch R, et al. (2007) Dynactin is required for coordinated bidirectional motility, but not for dynein membrane attachment. Molecular biology of the cell 18: 2081–2089. 36. Leiserson CE, Rivest RL, Stein C, Cormen TH (2001) Introduction to algorithms. The MIT press. 37. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60: 91–110. 38. Azbel MY (1973) Random two-component one-dimensional ising model for heteropolymer melting. Physical Review Letters 31: 589–592. 39. Cover TM, Thomas JA (1991) Entropy, relative entropy and mutual information. Elements of Information Theory : 12–49.

Suggest Documents