Efficient Algorithms for SNP Haplotype Block ... - Semantic Scholar

1 downloads 0 Views 410KB Size Report
Keywords: algorithm, SNP, haplotype, diversity, haplotype block selection, missing data. ...... using 10 tag SNPs, and there are total of 61 SNPs in these blocks.
Efficient Algorithms for SNP Haplotype Block Selection Problems Yaw-Ling Lin? Dept. Computer Science and Information Engineering Providence University, Taiwan E-mail: [email protected]

Abstract. Global patterns of human DNA sequence variation (haplotypes) defined by common single nucleotide polymorphisms (SNPs) have important implications for identifying disease associations and human traits. Recent genetics research reveals that SNPs within certain haplotype blocks induce only a few distinct common haplotypes in the majority of the population. The existence of haplotype block structure has serious implications for association-based methods for the mapping of disease genes. Our ultimate goal is to select haplotype block designations that best capture the structure within the data. Here in this paper we propose several efficient combinatorial algorithms related to selecting interesting haplotype blocks under different diversity functions that generalizes many previous results in the literatures. In particular, given an m × n haplotype matrix A, we show linear time algorithms for finding all interval diversities, farthest sites, and the longest block within A. For selecting the multiple long blocks with diversity constraint, we show that selecting k blocks with longest total length can be be found in O(nk) time. We also propose linear time algorithms in calculating the all intra-longest-blocks and all intra-klongest-blocks. Keywords: algorithm, SNP, haplotype, diversity, haplotype block selection, missing data.

1

Introduction

Mutation in DNA is the principle factor that is responsible for the phenotypic differences among human beings, and SNPs (single nucleotide polymorphisms) are the most common mutations. An SNP is defined as a position in a chromosome where each one of two (or more) specific nucleotides are observed in at least 10% of the population [15]. The nucleotides involved in a SNP are called alleles. It has been observed that for almost all SNPs only two different alleles are present, in such case the SNP is said biallelic; otherwise the SNP is said multiallelic. In this paper, we will consider exclusively biallelic SNPs. In diploid organisms, such as humans, each chromosome is made of two distinct copies and each copy is called a haplotype. There is currently a great deal of interest in how the recombination rate varies over the human genome. The primary reason for this is the excitement about using linkage disequilibrium (LD) to map and identify human disease genes. The blocks of LD are regions, typically less than 100 kb, in which LD decreases very little with distance between markers. Between these blocks, however, LD is observed to decay rapidly with physical distance. Concomitant with the blocks of LD, these studies also find low haplotype diversity within blocks. Often, more than 90% of the chromosomes in a sample possess one of only two to five haplotypes within a block [4, 16, 15, 6, 5]. These results also show that the SNPs within each block induce only a few distinct common haplotypes in the majority of the population, even though the theoretical number of different haplotypes for a block containing n SNPs is exponential in n. The existence of haplotype block structure has serious implications for association-based methods for the mapping of disease genes. Conducting genome-wide association mapping studies could be much easier than it used to be based on single marker [7]. On the one hand, if a causative allele occurs within a long block of LD, then it may be difficult to localize that causative gene at a fine scale by association mapping. On the other hand, if the diversity of haplotypes within blocks is low, then common disease genes may be mapped, at least to within a haplotype block, by using far fewer markers than previously imagined [7]. ?

This work is supported in part by the National Science Council, Taiwan, R.O.C, grant NSC-96-2221-E-126-002.

Terminology and Problem Definition Abstractly, input to the haplotype blocking problem consists of m haplotype vectors. Each position in a vector is associated with a site of interest on the chromosome. Usually, the position in the haplotype vector has a value of 0 if it is the major allele or 1 if minor allele. Let the haplotype matrix A be an m × n matrix of m observations over n markers (sites). We refer to the jth allele of observation i by Aij . For simplicity, we first assume that Aij ∈ {0, 1}. A block, or marker interval, [j, k] = hj, j + 1, . . . , ki is defined by two marker indices 1 ≤ j ≤ k ≤ n. A segmentation is a set of non-overlapping non-empty marker intervals. A segmentation is full if the union of the intervals is [1, n]. The data matrix limited to interval [j, k] is denoted by A(j, k); the values of the ith observation are denoted by A(i, j, k), a binary string of length k − j + 1. Figure 1 illustrates an example of a 7 × 11 0-1 haplotype matrix.

1

2

3

4

5

6

7

8

9

10 11

1 1

0 0

0 1

0 1

0 0

0 0

1 0

0 1

0 1

0 0

1 0

h4

0 1

0 1

1 1

0 0

0 1

0 0

1 0

1 0

0 1

0 1

1 1

h5 h6

1 0

1 0

0 0

0 0

0 0

1 0

1 1

0 0

1 0

1 0

1 1

H7

0

1

0

0

0

0

1

1

0

0

0

h1 h2 h3

1

2

3

4

5

6

7

8

9

10 11

Fig. 1. A haplotype matrix A and its corresponding submatrix A(8, 11).

Given an interval [j, k], a diversity function, δ : [j, k] → δ(j, k) ∈ R is an evaluation function measuring the diversity of the submatrix A(j, k). We say an interval [j 0 , k 0 ] is a subinterval of [j, k], written [j 0 , k 0 ] ⊂ [j, k], if j ≤ j 0 and k 0 ≤ k. Note that δ-function is a monotonic non-decreasing function from [1..n, 1..n] to the unit real interval [0, 1]; that is, 0 ≤ δ(j 0 , k 0 ) ≤ δ(j, k) ≤ 1 whenever [j 0 , k 0 ] ⊂ [j, k]. Given an input set of n haplotype vectors, a solution to the Haplotype Block Selection (HBS) Problem is a segmentation of marker intervals, revealing these non-overlapped haplotype blocks of interest in the chromosome. Problem 1 (Haplotype Block Selection – HBS (Generic)). Input: A haplotype matrix A. Output: A set S of marker intervals satisfying some given constraints. Here in this paper we propose several efficient algorithms related to selecting interesting haplotype blocks under different evaluation (diversity) functions that generalizes many previous results in the literatures [15, 22, 6, 19, 1]. Diversity functions Several operational definitions have been used to identify haplotype-block structures, including LD-based [6, 19], recombination-based [12, 20], information-complexity-based [1, 13, 8] and diversity-based [2, 15, 21] methods. The result of block partition and the meaning of each haplotype block may be different by using different measuring formula. For simplicity, haplotype samples can be converted into haplotype matrices by assigned major alleles to 0 and minor alleles to 1. Definition 1 (haplotype block diversity). Given an interval [i, j] of a haplotype matrix A, a diversity function, δ : [i, j] → δ(i, j) ∈ R is an evaluation function measuring the diversity of the submatrix A(i, j). Haplotype blocks are the genome regions with high LD, thus it implies that no matter what kinds of haplotype block definition we used, the patterns of haplotype within the block will be small, and the diversity of the block will be low. In terms of diversity functions, the block selection problem can be viewed

as finding a segmentation of given haplotype matrix such that the diversities of chosen blocks satisfy certain value constraint. Following we examine several haplotype block diversity evaluation functions. Given an m × n haplotype matrix A, a block S(i, j) (i, j are the block boundaries) of matrix A is viewed as m haplotype strings; they are partitioned into groups by merging identical haplotype strings P into the same group. The probability pi of each haplotype pattern si , is defined accordingly such that pi = 1. As an example, Li [14] proposes a diversity formula defined by X δD (S) = 1 − p2i . (1) si ∈S

Note that δD (S) is the probability that two haplotype strings chosen at random from S are different from each other. Other measurements of diversity can be obtained by choosing different diversity function; for example, to measure the information-complexity one can choose the information entropy (negative-log) function [1, 8]: X δE (S) = − pi log pi . (2) si ∈S

In the literatures [15, 21], Patil and Zhang et al. define a haplotype block as a region where at least 80% of observed haplotypes within a block must be common haplotype. As the same definition of common haplotype in the literatures, the coverage of common haplotype of the block can be formulated as a form of diversity: P P 1 pi m s ∈C s ∈M δC (S) = 1 − iP = iP . (3) pi pi si ∈U

si ∈U

Here U denotes the unambiguous haplotypes, C denotes the common haplotypes, and M denotes the singleton haplotypes. In other words, Patil et al. require that δC (S) ≤ 20%. Some studies [6, 21] propose the haplotype block definition based on LD measure D0 ; however, there is no consensus definition for it so far. Zhang and Jin [22] define a haplotype block as a region in which all pair-wise |D0 | values are not lower than a threshold α. Let S denote a haplotype interval [i, j]. We define the diversity as the complement of minimal |D0 | of S. By the definition, S is a haplotype block if its diversity is lower than 1 − α. δL1 (S) = 1 − min{(|Di00 j 0 |)|i ≤ i0 < j 0 ≤ j}.

(4)

Zhang et al. [21] also propose the other definition for haplotype block; they require at least α proportion of SNP pairs having strong LD (the pair-wise |D0 | greater than a threshold) in each block. Similarly, we can use the diversity to redefine the function. We define the diversity as the proportion of SNP pairs that do not have strong LD. Therefore, haplotype interval S is a feasible haplotype block if its diversity is smaller than a threshold. We can use the following diversity function to calculate the diversity of S. Here N (i, j) denotes the number of SNP pairs that do not have strong LD in the interval [i, j]. N (i, j) δL2 (S) = ¡(j−i)+1¢ = 2

1 2 [(j

N (i, j) . − i)2 + j − i]

(5)

Diversity measurement usually reflects the activity of recombination events occurred during the evolutionary process. Generally, haplotype blocks with low diversity indicates conserved regions of genome. By using appropriate diversity functions, the block selection problem can be viewed as finding a segmentation of given haplotype matrix such that the diversities of chosen blocks satisfy certain value constraint. In this paper, we consider the following problems: Problem 2 (all-interval-diversities). Given a haplotype matrix A, compute the all pairs block diversity values. That is, output the set {δ(i, j) | 1 ≤ i ≤ j ≤ n}. Using techniques of the suffix tree [9, 17], we are able to propose an O(mn + n2 ) time, linear proportional to the input plus output size, algorithm for computing the all-interval-diversities problem.

Problem 3 (farthest-sites). Given a haplotype matrix A and a given real diversity upper limit D, for each marker site i find its corresponding farthest right marker j = R[i] so that δ(i, j) ≤ D. That is, output the set {R(i) = j | δ(i, j) ≤ D < δ(i, j + 1), 1 ≤ i ≤ j ≤ n}. To simplify the boundary condition, we assume that δ(i, n + 1) = ∞. We show that all farthest-sites can be found totally in O(mn) time, linear proportional to the input size, by the the similar techniques of the suffix tree as in the all-pair-block-diversity problem. Problem 4 (longest-block). Given a haplotype matrix A and a diversity upper limit D, find the longest block [i, j] so that δ(i, j) ≤ D. That is, output the arg maxi,j {|[i, j]| = j − i + 1 | [i, j] ⊂ [1, n] and δ(i, j) ≤ D}. We show that the longest-block can be found in O(n) time by examining the O(n)-sized farthest-sites array; that is, the the longest-block can be found in linear time. Problem 5 (k-longest-blocks). Given a haplotype matrix A and a diversity upper limit D, find a segmentation consists of k feasible blocks such that the total length is maximized. That is, output the set S = {B1 , B2 , . . . , Bk }, with δ(B) ≤ D for each B ∈ S, such that |B1 | + |B2 | + · · · + |Bk | is maximized. We show that the k-longest-blocks can be found in O(nk). These blocks can be identified by examining the farthest-sites array and sing the technique of sparse dynamic programming. Problem 6 (all-intra-longest-block). Given a haplotype matrix A, a diversity upper limit D, for all constrained interval [i, j] ⊂ [1, n], find the corresponding longest block B ⊂ [i, j] with δ(B) ≤ D. That is, output the arg maxB⊂[i,j] {|B| | B ⊂ [i, j] and δ(B) ≤ D}. We show that the all-intra-longest-block can be found totally O(n2 ) time, which is a linear time algorithm proportional to the output size. Problem 7 (all-intra-k-longest-blocks). Given a haplotype matrix A, a diversity upper limit D, for all constrained interval [i, j] ⊂ [1, n], find the longest segmentation with k feasible blocks such that the total length is maximized. That is, output the set S = {B1 , B2 , . . . , Bk }, with δ(B) ≤ D for each B ∈ S, B ⊂ [i, j], such that |B1 | + |B2 | + · · · + |Bk | is maximized. We show that the all-intra-k-longest-blocks can be found totally in O(n2 k) time, which is a linear time algorithm proportional to the output size. Results The rest of the paper is organized as the following. In Section 2, we propose an O(mn + n2 ) time, linear proportional to the input plus output size, algorithm for computing the all-interval-diversities problem. We also show that all farthest-sites can be found totally in O(mn) time, linear proportional to the input size. As a corollary, we show that the longest-block can be found in O(n) time by examining the O(n)-sized farthest-sites array; that is, the the longest-block can be found in linear time. In Section 3, we show that the k-longest-blocks can be found in O(nk). These blocks can be identified by examining the farthest-sites array and sing the technique of sparse dynamic programming. We also show that the all-intra-longest-block can be found totally O(n2 ) time, which is a linear time algorithm proportional to the output size. The problem of all-intra-k-longest-blocks can be found totally in O(n2 k) time, which is a linear time algorithm proportional to the output size.

2

Computing Diversities of all Blocks

Given an m × n haplotype matrix A, the goal here is to compute the all pairs block diversity values. That is, output the set {δ(i, j) | 1 ≤ i ≤ j ≤ n}. Here we show that by using techniques of the suffix tree [9, 17], there exits an O(mn + n2 ) time, linear proportional to the input plus output size, algorithm for computing the all-interval-diversities problem. Consider the 7 × 11 haplotype matrix illustrated at Figure 1, the computation of the submatrix diversity, say A(4, 7), can be illustrates at Figure 2.

δ

= −



=

Fig. 2. Compute the diversity of submatrix A(4, 7).

A naive algorithm would try to enumerate all different O(n2 ) blocks, each with data sized as much as O(mn). While it takes at least linear time in computing the diversity of each submatrix, thus a totally O(mn3 ) operations would be needed. To obtain the desired linear time algorithm for computingP all diversities, we use the suffix tree data ∗ structure [9, 17]. It is well known that, given a string s ∈ of length n, the suffix tree of s, Ts , that consists of all suffixes of s as leaves can be constructed in linear time. Thus, given an m × n haplotype matrix A, our linear time algorithm begins by constructing m suffix trees of m rows of A, T = {TA(1,1,n) , TA(2,1,n) , . . . , TA(m,1,n) }, in totally O(mn) time. Figure 3 illustrates the m suffix trees of the given m × n haplotype matrix, where Si (j) denotes the jth suffix of row i.

Fig. 3. The m suffix trees for m rows of a m × n haplotype matrix.

After obtaining these m suffix trees, we can merge these m trees and obtain the total suffix tree T ∗ with mn leaves. Note that T ∗ can also be obtained by concatenating the original m row-strings of the haplotype matrix into one long string u = A(1, 1, n)]A(2, 1, n)] · · · ]A(m, 1, n) and constructing the corresponding suffix tree Tu , where ‘]’ denotes a symbol that does not appear in alphabets of the matrix. Note that the structure of T ∗ is equivalent to Tu . The next step of the algorithm is to construct n confluent subtrees, which is discussed in the following. Confluent LCA Subtrees The lowest common ancestor (LCA) between two nodes u and v in a tree is the furthest node from the root node that exists on both paths from the root to u and v. Harel and Tarjan [11] have shown that any n-node tree can be preprocessed in O(n) time such that subsequently LCA queries can be answered in constant time. Let T be a tree with leaf nodes L. Given S ⊂ L, let set Λ(S) = {Lca(x, y) | x 6= y ∈ S} denote the collection of all (proper) lowest common ancestors defined over S. Definition 2 (confluent subtree). The confluent subtree of S in T , denoted by T↑S , is a subtree of T with leaves S and internal nodes Λ(S). Further, u ∈ Λ(S) is a parent of v in T↑S if and only if u is the lowest ancestor of v in T comparing to any other node in Λ(S).

Our notation of confluent subtree is called induced subtrees in the literature [3]. Let T be a phylogenetic tree with leaf nodes L. A post-order (pre-order, or in-order) traversal of nodes of L within T defines a tree ordering of nodes on L. The following results is a generalization of the section 8 of [3]. Lemma 1. Let T be an n-node phylogenetic tree with leaf nodes L. The following subsequent operation can be done efficiently after an O(n) time preprocessing. Given a query set A ⊂ L, the confluent subtree T↑A can be constructed in O(|A|) time if A is given in sorted tree ordering; otherwise, T↑A can be constructed in O(|A| log log |A|) time.

Confluent(T, A) Input:A trees T with leaves L = {1, 2, . . . , n}, A ⊂ L. Output: The confluent subtree of A in T , T↑A . Preprocessing: Compute the tree ordering of L on T and the Lca queries preparation [11]. Notations: p[·, T 0 ], lef t[·, T 0 ], right[·, T 0 ]: parent, left, right children links. 1 2 3 4 5 6 7 8

Let A = hv1 , v2 , . . . , vk i be nodes of A in the tree ordering. Create a dummy node λ and let level[T, λ] ← +∞ ; Push(S, λ) for i ← 1 to k − 1 do B visit each vi ’s. x ← Lca(vi , vi+1 ) ; y ← vi while level[x, T ] > level[top[S], T ] do y ← Pop(S) Push(S, x) ; p[vi+1 , T 0 ] ← x ; p[y, T 0 ] ← x ; p[x, T 0 ] ← top[S] lef t[x, T 0 ] ← y ; right[x, T 0 ] ← vi+1 ; right[top[S], T 0 ] ← x root[T 0 ] ← right[λ, T 0 ] ; return T 0 as T↑A ; Fig. 4. Computing the confluent subtree.

Proof. We propose an O(|A|) time algorithm, Confluent(T, A), shown in Figure 4. The algorithm requires O(n) time preprocessing phase for building up the tree ordering of L on T , and perform the Lca constant time queries preprocessing [11]. Further, the input A ⊂ L is assumed to be listed according to the tree ordering of L; otherwise, we can use the data structure of van Emde Boas [18] for sorting these finite ranged integers in O(|A| log log |A|) time. The correctness and the time analysis of the proof is omitted for most of the details are similiar to the arguments presented in [3]. ¤ By Lemma 1, in O(mn) time, we can partition leaves of T ∗ into n sets {A1 , A2 , . . . , An }, where Ai consists of all m suffixes started at column i in given haplotype matrix. These Ai ’s subsets can be used to construct ∗ the corresponding n confluent LCA subtrees, T↑A ’s, as illustrated in Figure 5. i

Fig. 5. The corresponding n confluent LCA trees for n columns in a haplotype matrix. ∗ represent the changing structure of m rows of submatrix A(i, n). Figure 6 shows an Each LCA tree T↑A i ∗ , taken from the haplotype matrix shown in Figure 1. Observe example the confluent LCA subtrees, T↑A 8 how the LCA tree reveals the structure of A(8, 11).

Fig. 6. The 8-LCA tree reveals the topological structure of 7 strings of A started at the 8th column.

∗ Event List The idea is to build up the event-list for each LCA tree T↑A so that all diversity values i started at i, δ(i, ·)’s, can be obtained in O(m + n) time. To do so, we need to perform a BFS-like search on ∗ ∗ each LCA tree T↑A . This can be done by the bucket-sort technique. We first visit each node of all T↑A ’s, i i and append each node into a global depth-list, ranked according to its distance to the root. After that, we pick up nodes in the depth-list by the increasing depth order, and append them into the final event-list, as illustrated in Figure 7.

Fig. 7. The event list of the 8-LCA tree.

Lemma 2. The event-list of a given m × n haplotype matrix can be computed in O(mn) time. By using the event-list, it is easily verified that diversities started at i, δ(i, ·)’s, can be obtained by following the i-th entry of the event-list. Observe that there are at most m events along with each entry, and there are at most n diversities need to be calculated. Thus diversities δ(i, ·)’s can be obtained in O(m + n) time for each column i. It follows that all-interval-diversities can be found in O(mn + n2 ) time, linear proportional to the input plus output size. Theorem 1 (all-interval-diversities). Given an m × n haplotype matrix A, O(mn + n2 ) time suffices to compute all interval diversities {δ(i, j) | 1 ≤ i ≤ j ≤ n}. Recall that, in the farthest-sites problem, with a given real diversity upper limit D, for each marker site i we need to find its corresponding farthest right marker j = R[i] so that δ(i, j) ≤ D. By using the event-list and Lemma 2, it is easily verified that diversities started at i, δ(i, ·)’s, can be examined by following the i-th entry of the event-list. Since there are at most m events along with each entry, the farthest site can thus be identified in O(m) time for each column i. It follows that all farthest sites can be found in O(mn) time, linear proportional to the input size. Theorem 2 (farthest-sites). Given an m × n haplotype matrix and a given real diversity upper limit D, for each marker site i we can find its corresponding farthest right marker j = R[i] so that δ(i, j) ≤ D < δ(i, j + 1), totally in O(mn) time. Corollary 1 (longest-block). Given an m×n haplotype matrix and a diversity upper limit D, the longest block with diversity not exceeding D can be found in O(mn) time.

3

Longest k Blocks with Diversity Constraints

We discuss a series of problem related to finding multiple blocks with diversity constraints in the section. Given a haplotype matrix A and a diversity upper limit D, let S = {B1 , B2 , . . . , Bk } be a segmentation of A with δ(B) ≤ D for each B ∈ S. The length of S is the total length of all block in S; i.e., `(S) = |B1 | + |B2 | + · · · + |Bk |. Our objective is to find a segmentation consists of k feasible blocks such that the total length `(S) is maximized. Given A and D, first we consider the most general form of the problem and define the block length evaluation function f (k, i, j) = max{`(S) | S a feasible segmentation of A(i, j) with k blocks} Note that the k-longest-blocks asks to find the value f (k, 1, n). First of all, we prepare the left farthest sites, L[j]’s, for each site j of A; according to Theorem 2, the array can be identified in O(mn) time. Here we show that the answer can be found in O(nk) time after the preprocessing. The idea behind the dynamic programming formula is illustrated at Figure 8.

Fig. 8. The O(kn) algorithm to compute f (k, 1, n).

It can be verified that f (k, 1, j) = max{f (k, 1, j − 1), f (k − 1, 1, L[j] − 1) + j − L[j] + 1} That is, the k-th block of the maximal segment S in [1, j] either does not include site j; otherwise, the block [L[j], j] must be the last block of S. Note that f (k, 1, j) can be determined in O(1) time suppose f (k − 1, 1, ·)’s and f (k, 1, 1..(j − 1))’s being ready. It follows that f (k, 1, ·)’s can be calculated from f (k − 1, 1, ·)’s, totally in O(n) time. Thus a computation ordering from f (1, 1, ·)’s, f (2, 1, ·)’s, . . . , to f (k, 1, ·)’s leads to the following result. Theorem 3 (k-longest-blocks). Given a haplotype matrix A and a diversity upper limit D, find a segmentation consists of k feasible blocks such that the total length is maximized can be done in O(nk) time after a linear time preprocessing. Given A and D, the all-intra-longest-block asks to find, for all constrained interval [i, j] ⊂ [1, n], the corresponding longest block B ⊂ [i, j] with δ(B) ≤ D. That is, the values f (1, i, j)’s. Similarly, we first prepare the farthest sites of A in O(mn) time. Here we show that the answer can be found in O(n2 ) time after the preprocessing. Using similar idea as illustrated at Figure 8. It can be verified that f (1, i, j) = max{f (1, i, j − 1), (j − L[j] + 1)} That is, the longest block in [i, j] either does not include site j; otherwise, the block [L[j], j] must be the longest (feasible) block within [i, j]. Note that f (1, i, j) can be determined in O(1) time whenever f (1, i, 1..(j − 1))’s being ready. It follows that f (1, i, ·)’s can be calculated totally in O(n) time. Thus a computation ordering from f (1, 1, ·)’s, f (1, 2, ·)’s, . . . , to f (1, n, ·)’s leads to the following result. Theorem 4 (all-intra-longest-block). Given a haplotype matrix A and a diversity upper limit D, find for all constrained interval [i, j] ⊂ [1, n], the corresponding longest block B ⊂ [i, j] with δ(B) ≤ D can be done in O(n2 ) time after a linear time preprocessing.

Note that the computation time is linear proportional to the output plus input size, and thus is an optimal algorithm in regarding of time efficiency. or all constrained interval [i, j] ⊂ [1, n], find the longest segmentation with k feasible blocks such that the total length is maximized. That is, output the set S = {B1 , B2 , . . . , Bk }, with δ(B) ≤ D for each B ∈ S, B ⊂ [i, j], such that |B1 | + |B2 | + · · · + |Bk | is maximized. The all-intra-k-longest-blocks asks to find, for all constrained interval [i, j] ⊂ [1, n], the longest segmentation with k feasible blocks such that the total length is maximized. Again, the problem can be solved by first prepare the farthest sites of A in O(mn) time. Here we show that the answer can be found in O(n2 k) time after the preprocessing. Using similar idea as illustrated at Figure 8. It can be verified that f (k, i, j) = max{f (k, i, j − 1), f (k − 1, i, L[j] − 1) + j − L[j] + 1} That is, the k-th block of the maximal segment S in [i, j] either does not include site j; otherwise, the block [L[j], j] must be the last block of S. Note that f (k, i, j) can be determined in O(1) time suppose f (k−1, i, ·)’s and f (k, i, 1..(j −1))’s being ready. It follows that f (k, i, ·)’s can be calculated from f (k −1, i, ·)’s, totally in O(n) time. Thus f (k, ·, ·)’s can be calculated from f (k − 1, ·, ·)’s totally in O(n2 ) time. Thus a computation ordering from f (1, ·, ·)’s, f (2, ·, ·)’s, . . . , to f (k, ·, ·)’s leads to the following result. Theorem 5 (all-intra-k-longest-blocks). Given A, D, there exits an O(kn2 )-time algorithm to find, for all constrained interval [i, j] ⊂ [1, n], the longest segmentation with k feasible blocks such that the total length is maximized.

4

Experimental Results

In response to needs for analysis and observation of human genome variation, we establish web systems and apply the algorithms in this paper to provide several analysis tools for bioinformatics and genetics researchers. The web consists of a list of PHP and Perl CGI-scripts together with several C programs for selecting haplotype blocks and analyzing the haplotype diversity. We collect the haplotype data of human chromosome 21 from Patil et al. [15] and download haplotypes for all the autosomes from phase II data of HapMap [10] so that the bioinformatics researchers can use the data to evaluate the performance of the tools. Researchers can also input their own haplotype data by on-line inputting the haplotype sample or uploading the file of haplotype sample, or providing a hyperlink of other website’s haplotype data. The website provides tools to examine the diversity of haplotypes and partition haplotypes into blocks by using different diversity functions. Using the diversity visualization tool, researchers can observe the diversity of haplotypes in the form of the diagram of curves; the diversity function (1), δD , is used to calculate the diversities of intervals based on sliding widows manner. Our website also provides tools for researchers to partition haplotypes into blocks with constraints on diversity and tag SNP number; by using the tool, researchers can find the longest segmentation consists of non-overlapping blocks with limited number of tag SNPs. Researchers can upload a haplotype sample or select a sample from our database, and then input a number of tag SNPs to partition the sample. As an example; we pick the shortest contig of Patil’s haplotype sample (contig number: NT 001035, 69 SNPs) and paste to the system; applying the diversity function (3), δC , with the requirement that the diversity is not greater than 0.2 (at least 80% of common haplotype) in each block; as a result, the sample can be partitioned into 19 haplotype blocks by using 10 tag SNPs, and there are total of 61 SNPs in these blocks. Our web system is freely accessible at http://bioinfo.cs.pu.edu.tw/~hap/index.php. Some preliminary results, including the selection of different diversity functions as well as choosing meaningful diversity constraints, in using our tools can also be found in the web system.

5

Concluding Remarks

By using appropriate diversity functions, the block selection problem can be viewed as finding the segmentation of a given haplotype matrix such that the diversities of chosen blocks satisfy certain value constraint.

In this paper, we propose several efficient combinatorial algorithms related to selecting interesting haplotype blocks under different diversity functions. By using the powerful data structures including suffix trees [9, 17] and the confluent LCA subtrees [11, 3], we are able to show that the kernel combinatorial structure, event-list, can be constructed in time linear proportional to the input. We need to point out that these time-efficiency results of our algorithms can be applied in many different definitions of diversity functions; there we assume that the computation of the diversity of submatrix A(j, k 0 ) can be obtained from a shorter prefix of A(j, k), with k < k 0 , in time proportional to the size of changing events . It is not hard to verify that the diversity functions listed in Eq. (1), (2), (3), as well as many other different diversity functions, possess this property.

References 1. Eric C. Anderson and John Novembre. Finding haplotype block boundaries by using the minimum-descriptionlength principle. Am. J. of Human Genetics, 73:336–354, 2003. 2. D. Clayton. Choosing a set of haplotype tagging SNPs from a larger set of diallelic loci. Nature Genetics, 29(2), 2001. 3. R. Cole, M. Farach, R. Hariharan, T. Przytycka, and M. Thorup. An O(n log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM Journal on Computing, 30(5):1385–1404, 2002. 4. M. Daly, J. Rioux, S. Schafiner, T. Hudson, and E. Lander. Highresolution haplotype structure in the human genome. Nature Genetics, 29:229–232, 2001. 5. E. Dawson, G. Abecasis, et al. A first-generation linkage disequilibrium map of human chromosome 22. Nature, 418:544–548, 2002. 6. S. B. Gabriel, S. F. Schaffner, H. Nguyen, et al. The structure of haplotype blocks in the human genome. Science, 296(5576):2225–2229, 2002. 7. D.B. Goldstein. Islands of linkage disequilibrium. Nature Genetics, 29:109–111, 2001. 8. G. Greenspan and D. Geiger. Model-based inference of haplotype block variation. In Seventh Annual International Conference on Computational Molecular Biology (RECOMB), 2003. 9. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. 10. International HapMap Project. http://www.hapmap.org/index.html.en. 11. D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338–355, 1984. 12. R. R. Hudson and N. L. Kaplan. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111:147–164, 1985. 13. M. Koivisto, M. Perola, R. Varilo, W. Hennah, J. Ekelund, M. Lukk, L. Peltonen, E. Ukkonen, and H. Mannila. An MDL method for finding haplotype blocks and for estimating the strength of haplotype block boundaries. In 8th Pacific Symposium on Biocomputing (PSB), pages 502–513, 2003. 14. W.H. Li and D. Graur. Fundamentals of Molecular Evolution. Sinauer Associates, Inc, 1991. 15. N. Patil, A. J. Berno, D. A. Hinds, et al. Blocks of limited haplotype diversity revealed by high resolution scanning of human chromosome 21. Science, 294:1719–1723, 2001. 16. D. Reich, M. Cargill, E. Lander, et al. Linkage disequilibrium in the human genome. Nature, 411:199–204, 2001. 17. Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995. 18. P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6:80–82, 1977. 19. J.D. Wall and J.K Pritchard. Haplotype blocks and linkage disequilibrium in the human genome. Nature Reviews Genetics, 4(8):587–597, 2003. 20. N. Wang, J.M. Akey, K. Zhang, R. Chakraborty, and L. Jin. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am. J. Human Genetics, 71:1227–1234, 2002. 21. K. Zhang, Z.S. Qin, J.S. Liu, T. Chen, M.S. Waterman, and F. Sun. Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Res., 14(5):908–916, 2004. 22. Kui Zhang, Zhaohui Qin, Ting Chen, Jun S. Liu, Michael S. Waterman, and Fengzhu Sun. HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics, 21(1):131–134, 2005.