LINEAR REDUCTION METHODS FOR TAG SNP SELECTION

Jingwu He, Alex Zelikovsky

Department of Computer Science, Georgia State University, Atlanta, GA 30303
E-Mail: [email protected], [email protected]

ABSTRACT

It is widely hoped that constructing a complete human haplotype map will help to associate complex diseases with certain SNP’s. Unfortunately, the number of SNP’s is huge and it is very costly to sequence many individuals. Therefore, it is desirable to reduce the number of SNP’s that should be sequenced to a considerably smaller number of informative representatives, so-called tag SNP’s. In this paper, we propose a new linear algebra based method for selecting and using tag SNP’s. Our method is purely combinatorial and can be combined with linkage disequilibrium (LD) and block based methods. We measure the quality of our tag SNP selection algorithm by comparing actual SNP’s with SNP’s linearly predicted from linearly chosen tag SNP’s. We obtain extremely good compression and prediction rates. For example, for long haplotypes (> 25000 SNP’s), knowing only 0.4% of all SNP’s we predict the entire unknown haplotype with an error of 2%, while the prediction method is based on a 10% sample of the population.

Keywords: Single nucleotide polymorphism, tag SNP, linear independence.

1. INTRODUCTION

Genome-wide SNP scans for disease association tests are still infeasible. In order to decrease the SNP genotyping cost, it is attractive to sequence only a small number of SNP’s, so-called tag SNP’s, and then infer the rest of the SNP’s (or certain suspicious SNP’s) from the sequenced tag SNP’s. An interesting open problem is to find optimal subsets of tag SNP’s. Since the SNP’s responsible for complex diseases are unknown, the tag SNP’s should allow reconstruction of all (or almost all) SNP’s. Note that a complete, 100% correct reconstruction is impossible, simply because a single mutation may spoil an otherwise reliable reconstruction. The assumption is that the genotyped tag SNP’s would carry sufficient statistical power for identifying disease associations.
Partially supported by NIH Award 1 P20 GM065762-01A1.

An established way of selecting tag SNP’s [3, 1, 10, 11, 12] is based on linkage disequilibrium (LD) and a partition of the entire SNP sequence into blocks, i.e., contiguous SNP segments within which the number of different haplotypes is comparatively small. Due to the low diversity within a block, the SNP’s are highly correlated, and a very small number of SNP’s can predict the values of all other SNP’s. A valuable discussion of the tasks and limitations of LD approaches can be found in [4]. The paper [2] also relies on LD but applies an optimization approach, associating the data compression problem with tag SNP selection. Following [2], we can informally formulate the problem as follows.

Tag SNP Selection Problem. Given the full pattern of all SNP’s for a small sample, select the minimum number of tag SNP’s that allow reconstruction of the full data set, i.e., reconstruction of any haplotype from its tag SNP’s.

In this paper, we suggest a completely different method of reconstructing SNP values from tag SNP’s, which is based on linear algebra. Accordingly, the set of tag SNP’s is chosen to be linearly independent. Linear reduction methods have already been applied successfully to the haplotype inference problem, i.e., the problem of inferring haplotypes from observed genotypes [8]. In what follows, we assume that the reconstruction of non-tag SNP’s from tag SNP’s is based on our knowledge of haplotype data (rather than genotypes). In order to measure the quality of our approach, we directly count the number of correct predictions rather than introduce intermediate objectives.

Our experimental results show that we need far fewer tag SNP’s than LD-based methods to predict the entire haplotype. For example, we extract 100 tag SNP’s from 100 randomly chosen haplotypes out of 1000 generated haplotypes, each with 25000 SNP’s. Given the values of the tag SNP’s of any haplotype h out of the 900 haplotypes which did not participate in tag SNP selection, we can reconstruct all 25000 SNP’s of h with an average error of 2%. This is in contrast with LD-based methods, which require more tag SNP’s to achieve the same prediction accuracy [2].

The rest of the paper is organized as follows. In the next section we give a formal description of the tag SNP selection and haplotype prediction problem, based on the suggested methodology for experimental verification of solution methods. In Section 3 we describe the ideas and variants of the suggested linear reduction method. In Section 4 we present our experimental study of the suggested methods on simulated and real data.

2. THE TAG SNP SELECTION AND HAPLOTYPE RECONSTRUCTION

Assume that there is a population P of haplotype vectors. The input to the tag SNP selection problem is a set of n haplotype vectors $H = \{h_i \mid i = 1, \dots, n\}$, each having m coordinates (positions), $h_i = (h_{i,1}, \dots, h_{i,m})$. Traditionally, each $h_{i,j} \in \{0, 1\}$ corresponds to a diallelic SNP (one taking only two values). Each of the n haplotype vectors corresponds to a haplotype drawn from the population P, and each of the m positions corresponds to a SNP site in a haplotype. The tag SNP’s are k position-sites $t_1, t_2, \dots, t_k$, $t_i \in \{1, \dots, m\}$, which are “characteristic” of all haplotypes, i.e., one can reconstruct an entire (preferably unique) haplotype $h \in P$ from its tag SNP vector-value $h^k = (h_{t_1}, \dots, h_{t_k})$.

In order to formally describe reconstruction, we introduce the notion of a reconstruction function, which is a vector-function $f = (f_1, \dots, f_m)$, where $f_j = f_j(x_1, \dots, x_k)$ is a k-variable function equal to the j-th site of the reconstructed haplotype $\bar{h} = f(h^k)$. Obviously, $f_{t_j}(h^k) = h_{t_j}$. Now we are ready to formulate the problem as follows.

Statistical Tag SNP Selection and Haplotype Reconstruction problem (STTS). Given a set of n haplotype vectors H on m sites and k < m, find k tag sites $t_1, \dots, t_k$ and a reconstruction function $f = (f_1, \dots, f_m)$ such that for any haplotype $h \in P$, the expected Hamming distance between h and its prediction $\bar{h}$ is minimized.

Note that the STTS problem is a natural generalization of the standard tag SNP selection problem.
For example, a block approach relies only on certain tag SNP’s (namely, those in the same block) when reconstructing a particular SNP, while the blocks themselves are usually defined based on LD. The novel idea of our formulation is that it places no restrictions on which tag SNP’s can be used when the value of a particular SNP is decided. Indeed, if the values of all tag SNP’s are available, then nothing forbids using all of them for determining the value of that SNP.

The main drawback of the above formulation of the STTS problem is that it does not specify how the population P is given. But, as one would expect, P cannot be efficiently represented since otherwise the STTS problem itself would not make sense. The standard approaches to tackling STTS are biostatistical. In this paper we suggest a different approach which is usual for computer science or, in general, engineering. We argue that the engineering approach is valid since the problem itself can be viewed as an engineering one: with the least expense (number of tag SNP’s), identify the entire haplotype (presumably valuable information for, e.g., disease association and drug discovery). From the engineering point of view, the deep biostatistical meaning of each particular SNP is secondary to the savings in the number of tag SNP’s.

In any case, the input format of the population P should be decided, and we assume that the population is given as a set of already sequenced haplotypes, since any other knowledge about P is inferred and, therefore, arguable. Then the standard experimental way of checking any solution is to pick a random subset H of the set P, extract tag SNP sites and find a reconstruction function based on H and, finally, check the average prediction accuracy over all haplotypes in $P \setminus H$. In order to get more trustworthy results, the reported results should be averaged over multiple random choices of H. Naturally, the larger the set H, the more accuracy can be achieved. Therefore, we give a new optimization problem formulation.

Optimum Tag SNP Selection and Haplotype Reconstruction problem (OTTS). Given a population as a set P of p haplotypes on m sites, a population sample $H \subset P$ of n haplotypes and an integer k < m, find k tag sites $t_1, \dots, t_k$ and a reconstruction function $f = (f_1, \dots, f_m)$ such that the average Hamming distance $|h, \bar{h}|$ between any haplotype $h \in P \setminus H$ and the predicted haplotype $\bar{h} = f(h^k)$ is minimized.

3. LINEAR REDUCTION OF SITES AND HAPLOTYPES

In this section we give the motivation, formal description and implementation variations of the suggested linear reduction method for the OTTS problem. Usually, in genetic sequences derived from human haplotypes (see [11]), the number of sites is much larger than the number of individuals. Because of this disproportion, many columns corresponding to SNP sites are similar. Indeed, as noted in [11], the number of synonymous sites in real data is considerably large; here two sites are synonymous (or equivalent) if the corresponding 0-1 columns are either the same or complementary (i.e., the same after each entry x is replaced with 1 − x). It is common to keep only one site out of several synonymous sites since they are assumed not to carry any additional information [11]. In general, if k columns are “dependent”, then we can drop the k-th site. In [8] we suggested relying on standard linear dependence by replacing 0’s with −1’s, which makes two synonymous SNP columns linearly dependent. In (−1, 1)-notation, two sites are synonymous if and only if they are collinear (i.e., linearly dependent). The following theorem gives a biological justification of linear dependency.


Theorem 1 Let H be a set of haplotypes obtained from two haplotypes by recombination events at g sites. Then the linear rank of H, rank(H) ≤ g + 2.
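The bound in Theorem 1 can be checked numerically on a toy example. The sketch below is only an illustration under an assumed reading of the theorem: every haplotype in H copies one of the two founder haplotypes on each segment between g fixed breakpoints. The founders `a` and `b`, the breakpoints, and the helper names `rank` and `mosaics` are hypothetical and do not come from the paper. Exact rational arithmetic is used so that the rank is not affected by rounding.

```python
from fractions import Fraction
from itertools import product


def rank(matrix):
    """Rank of an integer matrix via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0  # number of pivots found so far
    for col in range(len(rows[0])):
        # find a row at or below r with a nonzero entry in this column
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def mosaics(a, b, breakpoints):
    """All haplotypes that copy a or b on each segment between breakpoints."""
    cuts = [0] + list(breakpoints) + [len(a)]
    segments = list(zip(cuts, cuts[1:]))
    result = []
    for choice in product((a, b), repeat=len(segments)):
        hap = []
        for source, (lo, hi) in zip(choice, segments):
            hap.extend(source[lo:hi])
        result.append(hap)
    return result


# Hypothetical founders in (-1, 1) notation and g = 2 breakpoints.
a = [1, 1, 1, -1, 1, -1, -1, 1, 1, -1]
b = [-1, 1, -1, -1, -1, 1, -1, -1, 1, 1]
breakpoints = [3, 7]

H = mosaics(a, b, breakpoints)
print(len(H), "haplotypes, rank", rank(H), "<= g + 2 =", len(breakpoints) + 2)
```

In this example all eight mosaics of the two founders span a space of dimension 4, so the bound g + 2 is attained.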


Our linear reduction consists of the following steps:

1. From the population sample H, extract r = rank(H) sites $T(H) = \{t_1, \dots, t_r\}$ forming a basis of the column-sites.
2. For each column-site $f_j$, $j = 1, \dots, m$, in H, find the unique representation $f_j = \sum_{i=1}^{r} \alpha_{i,j} h_{t_i}$.
3. Output the set of tag SNP’s T(H) and the reconstruction function $f = (f_1, \dots, f_m)$.

The suggested linear reduction method can be implemented very efficiently. Indeed, using $O(n^2 m)$ Gaussian elimination, we can transform the n × m matrix H into the reduced echelon form $H'$, which has exactly r nonzero rows. The r tag SNP’s, formed by the linearly independent column-sites corresponding to the nonzero rows, can be easily found from $H'$. Let F be the matrix $H'$ in which the zero rows are dropped, so F is an r × m matrix. Then for any haplotype h with tag SNP values $h^r$, the predicted reconstruction $\bar{h} = f(h^r)$ equals

$$\bar{h} = h^r F \qquad (1)$$
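The steps above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: it assumes haplotypes are given as lists in (−1, 1)-notation, uses 0-based site indices, and relies on exact fractions instead of floating point. The function names `linear_reduction` and `predict` and the toy `sample` are hypothetical.

```python
from fractions import Fraction


def linear_reduction(sample):
    """Return (tags, F) for a list of haplotypes in (-1, 1) notation.

    tags: 0-based indices of the pivot columns (the tag SNP sites).
    F:    nonzero rows of the reduced row echelon form, an r x m matrix.
    """
    rows = [[Fraction(x) for x in h] for h in sample]
    tags = []
    r = 0  # number of pivots found so far
    for col in range(len(rows[0])):
        # find a row at or below r with a nonzero entry in this column
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # this site depends linearly on earlier tag sites
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][col]
        rows[r] = [x / lead for x in rows[r]]
        # clear the column in every other row (Gauss-Jordan step)
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        tags.append(col)
        r += 1
        if r == len(rows):
            break
    return tags, rows[:r]


def predict(tag_values, F):
    """Equation (1): the row vector of tag values times F."""
    return [sum(v * F[i][j] for i, v in enumerate(tag_values))
            for j in range(len(F[0]))]


# Toy sample: sites 0 and 2 are synonymous, as are sites 1, 3 and 4.
sample = [
    [1, 1, -1, 1, -1],
    [1, -1, -1, -1, 1],
    [-1, 1, 1, 1, -1],
]
tags, F = linear_reduction(sample)
print("tag sites:", tags)
print("prediction for tag values (-1, -1):", [int(x) for x in predict([-1, -1], F)])
```

Because F restricted to the tag columns is the identity matrix, every haplotype in the row space of the sample is reproduced exactly from its tag values; for other haplotypes the product in (1) is only a prediction and may contain values other than −1 and 1.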

Gaussian elimination greedily chooses the first linearly independent tag SNP’s. Intuitively, the haplotype information is spread over the entire haplotype length. Therefore, we compare two linear reduction (LR) implementations: (i) LR, where the SNP’s are taken in the order in which they are given in H, and (ii) Randomized LR (RLR), where the SNP’s are randomly permuted. While there is no obvious way to fix falsely predicted SNP’s, the unresolved SNP’s (i.e., SNP values predicted to be neither −1 nor 1) are assigned −1 if the predicted value is negative and 1 otherwise. We report results for Randomized LR with Postprocessing (RLRP), which is RLR where unresolved SNP’s are recovered by the method specified above. Finally, we report results for the following 3RLRP method: the RLRP algorithm is run 3 times, so three bases are extracted from the same sample and three different predictions of each SNP are obtained; out of these three predictions we choose the one in the majority. Although 3RLRP uses the same sample size as RLRP, it is based on 3 times more tag SNP’s.
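The postprocessing and voting steps described above are simple enough to sketch directly. The snippet below is an illustration only: the raw predictions in `raw_runs` and the true haplotype `actual` are made-up values standing in for the output of three randomized runs, and the helper names `resolve`, `majority` and `error_rate` are hypothetical.

```python
def resolve(raw):
    """RLRP postprocessing: negative predictions become -1, all others 1."""
    return [-1 if value < 0 else 1 for value in raw]


def majority(p1, p2, p3):
    """3RLRP vote: per-site majority of three resolved (-1/1) predictions."""
    return [1 if x + y + z > 0 else -1 for x, y, z in zip(p1, p2, p3)]


def error_rate(predicted, actual):
    """Fraction of sites predicted incorrectly (normalized Hamming distance)."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical raw linear predictions from three runs with different bases.
raw_runs = [
    [1, -1, 3, 0, -1],
    [1, 1, -1, -2, -1],
    [-3, -1, 1, 1, -1],
]
actual = [1, -1, 1, -1, -1]  # hypothetical true haplotype

resolved = [resolve(run) for run in raw_runs]
consensus = majority(*resolved)
print("per-run errors:", [error_rate(run, actual) for run in resolved])
print("3RLRP error:", error_rate(consensus, actual))
```

Since every resolved value is −1 or 1 and there are three votes, the per-site sum is never zero, so the majority is always well defined.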

[Figure 1: plot of total error % (y-axis) against sample population size (x-axis); curves LR, RLR, RLRP and 3RLRP.]

Fig. 1. Simulated data with 25000 sites and a haplotype population of 1000. The total number of errors, as a percentage of the total number of SNP’s, depending on the size of the sample population for the four algorithms LR, RLR, RLRP and 3RLRP.

4. EXPERIMENTAL RESULTS

For generating the test data, we have used the haplotype generator ms [9]. This generator is a well-known standard based on the coalescent model of SNP sequence evolution. The ms generator can generate a given number of haplotypes with a prescribed number of sites and recombination rate. In our tests, we have generated haplotype populations of different sizes (300, 500, 1000 and 2000) with 25000 sites, based on the recombination rate 40. For the population size of 1000 (see Figure 1), we report the error depending on the size of the population sample (50 to 500), while the number of tag SNP’s is always close to the size of the sample.

The second plot (see Figure 2) is devoted to experiments with the real data set derived from the 616 kilobase region of human genotypes [5]. The resolution of missing data and the haplotyping are taken from [6]. We report the error depending on the size of the population sample, while the number of tag SNP’s is always less than the size of the sample and reaches 60 for 100 haplotypes. The results are averaged over 10 random draws from the set of all haplotypes. In 3RLRP, the number of tags used is almost three times that in RLRP. The third plot (see Figure 3) compares the errors of the two algorithms RLRP and 3RLRP for the same number of tag SNP’s.

The last plot (see Figure 4) compares the error rate of RLRP for the same sample size while the population grows from 300 to 2000. As one can see, the error rate does not change with the population size.

[Figure 2: plot of total error % (y-axis) against sample population size (x-axis); curves LR, RLR, RLRP and 3RLRP.]

Fig. 2. The dataset of 158 haplotypes with 103 SNP’s from [5]. The total number of errors, as a percentage of the total number of SNP’s, depending on the size of the sample population for the four algorithms LR, RLR, RLRP and 3RLRP.

[Figure 3: plot of total error % (y-axis) against number of tags (x-axis, 30 to 60); curves RLRP and 3RLRP.]

Fig. 3. The dataset of 158 haplotypes with 103 SNP’s from [5]. The total number of errors, as a percentage of the total number of SNP’s, depending on the number of tags for the algorithms RLRP and 3RLRP.

[Figure 4: plot of total error % (y-axis) against sample population size (x-axis); curves RLRP (p=300), RLRP (p=500), RLRP (p=1000) and RLRP (p=2000).]

Fig. 4. Simulated data with 25000 sites and different sizes of haplotype population. The total number of errors, as a percentage of the total number of SNP’s, depending on the size of the sample population for the different population sizes (p = 300, 500, 1000, 2000).

5. CONCLUSIONS & FUTURE WORK

We have suggested a linear reduction method which can reliably (error rate below 2%) recover all SNP’s based on a very small portion of tag SNP’s (e.g., 100 tags out of a total of 25K SNP’s) while sampling below 10% of the population. In our future work, we will apply tag selection to genotype data. Also, we will explore different possibilities of combining our methods with block methods and will apply them to recovering missing SNP data.

6. REFERENCES

[1] H. I. Avi-Itzhak, X. Su, F. M. De La Vega. Selection of minimum subsets of single nucleotide polymorphism to capture haplotype block diversity. In Proceedings of Pacific Symposium on Biocomputing, pages 466–477, 2003.
[2] V. Bafna, B. V. Halldorsson, R. Schwartz, A. G. Clark, S. Istrail. Haplotypes and informative SNP selection algorithms: don’t block out information. In Proceedings of the Seventh International Conference on Research in Computational Molecular Biology, pages 9–18, 2003.
[3] C. S. Carlson, M. A. Eberle, M. J. Rieder, Q. Yi, L. Kruglyak, and D. A. Nickerson. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. American Journal of Human Genetics, 74(1):106–120, 2004.
[4] A. G. Clark. Finding genes underlying risk of complex disease by linkage disequilibrium mapping. Current Opinion in Genetics & Development, 13(3):296–302, 2003.
[5] M. Daly, J. Rioux, S. Schaffner, T. Hudson, and E. Lander. High resolution haplotype structure in the human genome. Nature Genetics, 29:229–232, 2001.
[6] E. Eskin, E. Halperin, and R. Karp. Efficient reconstruction of haplotype structure via perfect phylogeny. Technical report, Hebrew University Computer Science, 2003, to appear in Bioinformatics, 2004.
[7] G. Gabriel et al. The structure of haplotype blocks in the human genome. Science, 296:2225–2229, 2002.
[8] J. He, A. Zelikovsky. Linear reduction for haplotype inference. Submitted, 2004.
[9] R. Hudson. Gene genealogies and the coalescent process. Oxford Survey of Evolutionary Biology, 7:1–44, 1990.
[10] R. Judson, B. Salisbury, J. Schneider, A. Windemuth, and J. C. Stephens. How many SNPs does a genome-wide haplotype map require? Pharmacogenomics, 3:379–391, 2002.
[11] N. Patil et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294:1719–1723, 2001.
[12] K. Zhang, Z. S. Qin, J. S. Liu, T. Chen, M. S. Waterman, F. Sun. Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Research, to appear, 2004.
