Fine-scale mapping of disease susceptibility locus

Genes & Genomics (2012) 34: 401-407 DOI 10.1007/s13258-011-0220-0

RESEARCH ARTICLE

Fine-scale mapping of disease susceptibility locus with Bayesian partition model

Sungkyoung Choi · Sungho Won

Received: 11 November 2011 / Accepted: 21 February 2012 / Published online: 25 June 2012 © The Genetics Society of Korea and Springer 2012

Abstract The causal relationship between genes and diseases has been investigated with the development of DNA sequencing. Polymorphisms included in the HapMap Project have enabled fine mapping with linkage disequilibrium (LD), and prior clustering of haplotypes on the basis of a similarity measure has often been performed in an attempt to capture coalescent events, because it can reduce the amount of computation. However, an inappropriate choice of similarity measure can lead to wrong conclusions, and we propose a new haplotype-based clustering algorithm for fine-scale mapping that uses a Bayesian partition model. To handle phase-unknown genotypes, we propose a new algorithm based on a Metropolized Gibbs sampler, implemented in C++. Our simulation studies found that the proposed method improves the accuracy of the estimator for the disease susceptibility locus. We illustrate the practical implications of the new method by applying it to fine-scale mapping of CYP2D6 in drug metabolism.

Keywords

Linkage disequilibrium; Metropolized Gibbs Sampling; Similarity Measure

S. Choi · S. Won, Department of Applied Statistics, Chung-Ang University, Korea, e-mail: [email protected]
S. Won, The Research Center for Data Science, Chung-Ang University, Korea

Introduction

A fundamental assumption underlying LD-based fine mapping is that the majority of the mutations that give rise to variants with increased disease risk are relatively young on an evolutionary time scale compared to the origin of most SNPs. Two consequences follow: (1) at the causal loci, the affected individuals are more closely related evolutionarily than a random pair of individuals from the population; (2) some adjacent loci also exhibit this excess level of evolutionary relatedness in the affected individuals, because LD between these loci and the causal loci has not yet been completely eroded by historical recombination events. Given the current genome coverage by SNPs, we often do not expect the causal SNPs to be genotyped. Instead, our fine-mapping data usually contain some SNPs that are in LD with the causal SNPs, and these collectively provide information on the location of the causal SNPs via their correlated ancestry. Many methods (Durrant and Mott, 2010; Molitor et al., 2003a; Su et al., 2008; Templeton et al., 1987; Templeton et al., 1988; Waldron et al., 2006; Zollner and Pritchard, 2005) have been developed on this principle in the past few years to estimate the location of the causal SNPs. These methods differ in the extent of the ancestry details they attempt to model, ranging from explicitly modeling the ancestry with coalescent events (Templeton et al., 1987; Templeton et al., 1988; Zollner and Pritchard, 2005) to clustering haplotypes with similarity measures that have been designed to reflect the evolutionary relatedness of the haplotypes (Li and Jiang, 2005; Molitor et al., 2003a; Waldron et al., 2006). In a previous study, we (Won et al., 2007) found that the haplotype-clustering-based approach (Waldron et al., 2006) is generally as efficient as, or slightly more advantageous than, the coalescent-based approach TreeLD (Zollner and Pritchard, 2005). At the same time, the former can provide confidence intervals with more stable coverage probabilities. Last but not least, the haplotype-clustering-based approach is computationally much less demanding and hence can handle data sets with much larger numbers of individuals and markers.
Conceptually, haplotype-clustering-based approaches (Denison and Holmes, 2001; Knorr-Held and Rasser, 2000; Molitor et al., 2003a, 2003b; Morris, 2005, 2006; Seaman et al., 2002) reduce the number of parameters by clustering similar haplotypes into the same group in order to minimize multiple testing problems (Tzeng et al., 2006). All haplotypes are partitioned into several groups where genotypes in the same group

convey the same risk of disease (Seaman et al., 2002). However, these approaches did not carefully consider the evolutionary histories. For instance, the assumption of a recent disease mutation may indicate that haplotypes at the causal loci are similar in cases while they are not in controls. The similarity measure should also provide different weights according to the evolutionary events at the trait locus: mutation or genotyping errors, and recombination or gene conversion. These evolutionary events can be handled by using both a length measure and a match measure. In addition, two other factors have to be considered to provide optimal weights: allele frequencies and genetic or physical distance. When alleles from two different haplotypes are the same, the match gives stronger evidence for similarity of the haplotypes if the alleles are rare or close to the putative causal SNP (Durrant et al., 2004; Tzeng et al., 2006; Waldron et al., 2006; Yu et al., 2004). However, these factors have not been clearly investigated. The uncertainty of haplotypes also has to be considered. According to Clark (2004), the units of biological function, that is, the protein-coding genes, produce proteins whose sequences correspond to maternal and paternal haplotypes, so that variation in a population is inherently structured into haplotypes, and the statistical power of association tests is likely to be improved with phased data because there are fewer haplotypes than the corresponding diplotypes. However, ignoring the inherent uncertainty in haplotype reconstruction can lead to substantial inflation in the estimated level of LD across a region (Morris et al., 2003) and hence to over-confidence in a subsequent haplotype-based association analysis (Morris et al., 2004).
Here, we compared results according to similarity measures and weights, and suggested a new method that handles phase-unknown genotypes via haplotype reconstruction, as implemented in the software DECIPHER (S.A.G.E., 2007) and PHASE (Stephens and Donnelly, 2003; Stephens et al., 2001). Under Hardy-Weinberg equilibrium (HWE), we showed that the mean squared error (MSE) is improved with both length and match measures, and we found that our estimates are more accurate when phase uncertainty is taken into account.

Methods

Consider a case-control study with N unrelated individuals and M SNPs in a candidate region. $y_i$ denotes the disease status of individual i, where 0 (1) indicates controls (cases). We denote the genotype of individual i by $g_i$. We assume that there are $n_i$ pairs of haplotypes that are compatible with the observed genotype of individual i, and denote each pair of haplotypes by $(h_{i1,j}, h_{i2,j})$ where $j \in \{1, 2, \ldots, n_i\}$. We also assume that the hypothetical ancestral haplotype a bears the original disease mutation allele and x is the putative causal locus.

Similarity Measure

Li and Jiang (2005) generalized similarity (or dissimilarity) measures, and we improve their similarity measure by adding weights $w_{1,x,a}$ and $w_{2,x,a}$. Our weights depend on a and x. If matching alleles between h and a are rare, the evidence for similarity between h and a is strong; and if the matching alleles are far from the putative causal base pair, x, the evidence for similarity between h and a is weaker than for matching alleles near x. We let $a^{i}$ (or $h^{i}$) indicate the allele of haplotype a (or h) at location i, and $d_{x,i}$ be the distance between base pairs x and i relative to the total length of the region of interest. Then the similarity of h to a is defined as

$$S_{x,a}(h) = \sum_{i=-r,\, i \neq 0}^{r} w_{1,x,a}(a^{x+i})\, I_{1,x}(h^{x+i}, a^{x+i}) + \sum_{i=-r,\, i \neq 0}^{r} w_{2,x,a}(a^{x+i})\, I_{2,x}(h^{x+i}, a^{x+i}),$$

where

$$I_{1,x}(h^{x+i}, a^{x+i}) = \begin{cases} 1, & \text{if } h^{x+i} = a^{x+i} \\ 0, & \text{if } h^{x+i} \neq a^{x+i}, \end{cases}$$

$$I_{2,x}(h^{x+i}, a^{x+i}) = \begin{cases} \prod_{l \in \{1, 2, \ldots, i\}} I_{1,x}(h^{x+l}, a^{x+l}), & \text{if } i > 0 \\ \prod_{l \in \{-1, -2, \ldots, i\}} I_{1,x}(h^{x+l}, a^{x+l}), & \text{if } i < 0, \end{cases}$$

$$w_{1,x,a}(a^{x+i}) = w_{2,x,a}(a^{x+i}) = f_1(d_{x,x+i}) \cdot f_2(p_{a^{x+i}}).$$

Here Li and Jiang (2005) suggested two choices, 1 and 1 0.1·d_{x,x+i}, for $f_1(d_{x,x+i})$, but we consider $1 - d_{x,x+i}$. We introduce $f_2(p)$ to allow for the effect of allele frequency, and there are several possibilities for $f_2(p)$: $1 - p$, $1 - p^2$, and $(1 - p)/p$ (Durrant et al., 2004; Waldron et al., 2006; Yu et al., 2004). Based on these, the similarity of h and a can be defined as follows. If we let

$$M_1 \equiv \sum_{i=-r,\, i \neq 0}^{r} w_{1,x,a}(a^{x+i})\, I_{1,x}(h^{x+i}, a^{x+i}), \qquad M_2 \equiv \sum_{i=-r,\, i \neq 0}^{r} w_{2,x,a}(a^{x+i})\, I_{2,x}(h^{x+i}, a^{x+i}),$$

then the four similarity measures SM1, SM2, SM3 and SM4 are

$$SM^1 = M_1 + M_2 \quad \text{with } f_2(p) = 1,$$
$$SM^2 = M_1 + M_2 \quad \text{with } f_2(p) = (1 - p)/p,$$
$$SM^3 = M_1 + M_2 \quad \text{with } f_2(p) = 1 - p,$$
$$SM^4 = M_1 + M_2 \quad \text{with } f_2(p) = 1 - p^2.$$
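The definitions above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation. It assumes, for simplicity, that x is the index of a slot in the marker array (the allele at x itself is never compared, since i ≠ 0), that `pos` holds base-pair positions, and that `allele_freq[k]` is a hypothetical pair giving the frequencies of alleles 0 and 1 at marker k. The function name `similarity` and all argument names are illustrative. The default `f2` corresponds to SM4.

```python
from typing import Callable, Sequence, Tuple


def similarity(
    h: Sequence[int],
    a: Sequence[int],
    x: int,
    r: int,
    pos: Sequence[float],
    allele_freq: Sequence[Tuple[float, float]],
    f2: Callable[[float], float] = lambda p: 1.0 - p * p,
) -> float:
    """Return M1 + M2 for haplotype h against ancestral haplotype a around index x."""
    span = pos[-1] - pos[0]  # total length of the region of interest
    m1 = m2 = 0.0
    for direction in (1, -1):  # walk right (i > 0), then left (i < 0)
        unbroken = True  # I2: every marker between x and x+i matches
        for step in range(1, r + 1):
            k = x + direction * step
            if k < 0 or k >= len(a):
                break  # assumption: the window is truncated at the region edge
            match = h[k] == a[k]  # I1
            unbroken = unbroken and match
            d = abs(pos[k] - pos[x]) / span  # relative distance d_{x,x+i}
            w = (1.0 - d) * f2(allele_freq[k][a[k]])  # f1(d) * f2(p)
            m1 += w * match
            m2 += w * unbroken
    return m1 + m2
```

Passing `f2=lambda p: 1.0`, `lambda p: (1 - p) / p` or `lambda p: 1 - p` gives the analogues of SM1, SM2 and SM3. A single mismatch next to x removes the whole run contribution to M2 on that side, which is how the length measure penalizes differences close to the putative causal locus.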

Here r indicates the size of window that consists of marker SNPs around x. The performance of the proposed method depends on r, and it may be optimized when 2r corresponds to the number of SNPs that are in LD with the causal variant. We rescale the similarity measure between 0 and 1 by dividing Sx,a(a) by the maximum because Sx,a(h) depends on various parameters. Haplotypes are clustered to two groups based on their similarities to a; haplotypes preserving disease allele and

normal allele. For simplicity we call the former disease haplotypes and the latter normal haplotypes. If the similarity between h and a around x exceeds the threshold δ, h is clustered as a disease haplotype and otherwise as a normal haplotype. For each individual we provide the likelihood for the disease status with penetrance parameters $\theta_k$ ($k \in \{0, 1, 2\}$), where k indicates the number of disease haplotypes. It should be noted that δ, h, a, x and $\theta_k$ are considered as parameters and, in each iteration, they are generated by the Gibbs sampling algorithm (see Appendix for detail).

Likelihood and Metropolized Gibbs sampling algorithm

The likelihood for individual i can be expressed as a summation over all possible haplotype pairs, weighted by their probabilities:

$$f(y_i \mid g_i, \theta, T) \propto \sum_{j=1}^{n_i} f(y_i \mid h_{i1,j}, h_{i2,j}, \theta, T)\, f(h_{i1,j}, h_{i2,j} \mid g_i),$$

where $\theta = (\theta_0, \theta_1, \theta_2)$ and $T = (a, x)$. The probability $f(h_{i1,j}, h_{i2,j} \mid g_i)$ under HWE can be calculated as

$$f(h_{i1,j}, h_{i2,j} \mid g_i) = \frac{P(h_{i1,j}, h_{i2,j})}{\sum_{j=1}^{n_i} P(h_{i1,j}, h_{i2,j})}.$$

If we let $p_h$ denote the haplotype frequencies, $P(h_{i1,j}, h_{i2,j})$ is $p_{h_{i1,j}}^2$ when $h_{i1,j} = h_{i2,j}$, and otherwise it is $2\, p_{h_{i1,j}}\, p_{h_{i2,j}}$. Several freely available software packages (Stephens and Donnelly, 2003; Stephens et al., 2001) estimate haplotype frequencies, and we used the software DECIPHER (S.A.G.E., 2007). The choice of software may not make a substantial difference, because similar likelihoods are maximized to estimate the haplotype frequencies, even though the computational intensity can differ. If we let $\theta = (\theta_0, \theta_1, \theta_2)$, our posterior distribution becomes

$$f(T, \theta \mid y, g) \propto f(T)\, f(\theta)\, f(y \mid T, \theta, g) \propto f(T)\, f(\theta) \prod_{i=1}^{N} \sum_{j=1}^{n_i} f(y_i \mid h_{i1,j}, h_{i2,j}, \theta, T)\, f(h_{i1,j}, h_{i2,j} \mid g_i),$$

and the likelihood for individual i is

$$f(y_i \mid h_{i1,j}, h_{i2,j}, \theta, T) = \begin{cases} \theta_0^{y_i} (1 - \theta_0)^{1 - y_i}, & \text{if there are no disease haplotypes in } (h_{i1,j}, h_{i2,j}) \\ \theta_1^{y_i} (1 - \theta_1)^{1 - y_i}, & \text{if there is one disease haplotype in } (h_{i1,j}, h_{i2,j}) \\ \theta_2^{y_i} (1 - \theta_2)^{1 - y_i}, & \text{if there are two disease haplotypes in } (h_{i1,j}, h_{i2,j}). \end{cases}$$

Therefore, if we let the latent variable $H_i$ indicate the unknown haplotype pair for individual i, our posterior is

$$f(T, \theta, H \mid y, g) \propto f(T)\, f(\theta)\, f(H \mid g)\, f(y \mid T, H, \theta, g) \propto f(T)\, f(\theta) \prod_{i=1}^{N} \left[ f(y_i \mid H_i, \theta, T)\, f(H_i \mid g_i) \right],$$

where $H = \{H_1, H_2, \ldots, H_N\}$ and $H_i \in \{(h_{i1,1}, h_{i2,1}), (h_{i1,2}, h_{i2,2}), \ldots, (h_{i1,n_i}, h_{i2,n_i})\}$. Based on this posterior, we implemented a Metropolized Gibbs sampling algorithm (see the Appendix).
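As a rough illustration of the two ingredients above, the following Python sketch computes the HWE phase probabilities for the haplotype pairs compatible with one genotype and then the phase-averaged likelihood of that individual's disease status. It is only a sketch under stated assumptions: haplotypes are represented as strings, `hap_freq` is a hypothetical dictionary of estimated haplotype frequencies, and `is_disease` is a hypothetical predicate standing in for the threshold rule on the similarity measure.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def phase_probs(pairs: Sequence[Pair], hap_freq: Dict[str, float]) -> List[float]:
    """f(h1, h2 | g) under HWE, normalised over the compatible pairs."""
    raw = [
        hap_freq[h1] ** 2 if h1 == h2 else 2.0 * hap_freq[h1] * hap_freq[h2]
        for h1, h2 in pairs
    ]
    total = sum(raw)
    return [p / total for p in raw]


def individual_likelihood(
    y: int,
    pairs: Sequence[Pair],
    hap_freq: Dict[str, float],
    theta: Sequence[float],
    is_disease: Callable[[str], bool],
) -> float:
    """Sum over compatible pairs of f(y | pair, theta) * f(pair | g)."""
    lik = 0.0
    for (h1, h2), p in zip(pairs, phase_probs(pairs, hap_freq)):
        k = int(is_disease(h1)) + int(is_disease(h2))  # number of disease haplotypes
        lik += (theta[k] if y == 1 else 1.0 - theta[k]) * p
    return lik
```

For a double heterozygote with compatible pairs ("00", "11") and ("01", "10"), the two phase probabilities are proportional to 2·p00·p11 and 2·p01·p10, so the more frequent configuration dominates the sum.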

Results

To compare the different similarity measures and methods, we calculated the root mean squared error (RMSE) and mean squared error (MSE) of the estimated disease location. We used the median of the sampled putative disease locations, x, as a point estimate of the disease location.

Simulation studies under phase certainty

To compare the performance of each similarity measure, 4000 haplotypes were simulated with the software ms (Hudson, 2002). For a 750 kb region, a mutation rate of 10^-8 and a recombination rate of 10^-8 per base pair, and an effective population size of 10^4, were assumed. We used the same disease model as in Waldron et al. (2006). A disease SNP was randomly selected. The disease allele frequency was assumed to be between 0.15 and 0.25 for a common disease, and between 0.05 and 0.15 for a rare disease. We assumed that the disease prevalence is 0.12, and that the genotype relative risks (GRR) are 1:2:4 or 1:3:9. 200 cases and 200 controls were sampled without replacement from the 2000 simulated individuals. The causal SNP was not selected as a marker. If we let p be the disease allele frequency, the probability of being selected as a marker was proportional to 4p(1 - p). We repeated this procedure 100 times to calculate the RMSE of the causal SNP location. The following scenarios were considered:

M1,1: 30 SNPs for markers, common disease allele and 1:3:9 GRR
M1,2: 30 SNPs for markers, rare disease allele and 1:3:9 GRR
M1,3: 30 SNPs for markers, common disease allele and 1:2:4 GRR
M1,4: 30 SNPs for markers, rare disease allele and 1:2:4 GRR
M1,5: 45 SNPs for markers, rare disease allele and 1:3:9 GRR
M1,6: 75 SNPs for markers, common disease allele and 1:2:4 GRR
M1,7: 150 SNPs for markers, rare disease allele and 1:2:4 GRR.

Table 1 shows the RMSE of the point estimate for the disease SNP location. We compared the results from 6 different similarity measures and Waldron's approach (Waldron et al., 2006). Waldron's approach is denoted by WAL. We assumed that the genotypes are phased.
Our results show that SM4 performs the best, followed by SM3. Even though we cannot find a model that is always best, we can see that SM4 is quite good compared to the other models.

Table 1. RMSE for simulated haplotypes. Haplotypes were simulated with ms (Hudson, 2002) and then phenotypes were generated for each disease model. WAL indicates Waldron et al.'s algorithm (Waldron et al., 2006).

        SM1      SM2      SM3      SM4      WAL
M1,1    123612   129460   118491   111086   129190
M1,2    134425   149967   135794   135720   152151
M1,3    195729   188892   183603   188122   185661
M1,4    175271   170499   175670   176239   183603
M1,5    131415   107610   121779   117090   115845
M1,6    171289   168315   168671   168375   177313
M1,7    128880   120083   121120   119833   119833
Mean    151517   147832   146447   145209   151942

We also considered real genotypes from HapMap data. The cystic fibrosis transmembrane conductance regulator (CFTR) gene spans 200 kb in region q31.2 on the long arm of chromosome 7. From the HapMap database, we downloaded the CFTR haplotypes of the CEU (Utah) sample and applied several disease models to these haplotype data. SNPs and phenotypes were generated as for the simulated data. From 120 haplotypes (60 individuals), haplotypes for 100 cases and 100 controls were sampled with replacement. These simulations were repeated 50 times to obtain the RMSE. The following disease models were considered:

M2,1: 40 SNPs for markers, common disease allele and 1:3:9 GRR
M2,2: 40 SNPs for markers, rare disease allele and 1:3:9 GRR
M2,3: 40 SNPs for markers, common disease allele and 1:2:4 GRR
M2,4: 40 SNPs for markers, rare disease allele and 1:2:4 GRR.

In Table 2 we calculated the RMSE for the point estimate of the disease SNP location. We assumed that each genotype was phased, and the table shows that SM1 performs the best, followed by SM4. In summary, we conclude that SM4 is generally good even though results can differ depending on the simulation settings. Also, WAL is usually not as good as SM1, SM2, SM3, and SM4. WAL uses a similarity measure similar to M1, and this may be the reason for its lower efficiency. For instance, for the situation in Figure 1, it is obvious that h and a are not similar because the markers near x are different. However, the similarities between markers that are located further away assign them to the same group when M1 is utilized.

Table 2. RMSE for CFTR haplotypes. From the HapMap, CFTR haplotypes were sampled with replacement as genotypes for each individual and then phenotypes were generated with four disease models. We assumed that genotypes for each individual are phased. WAL indicates Waldron et al.'s algorithm (Waldron et al., 2006).

        SM1      SM2      SM3      SM4      WAL
M2,1    18440    18947    18974    18493    21517
M2,2    36373    40125    37094    36729    42426
M2,3    24597    23770    25120    25357    27386
M2,4    31623    30725    32078    32450    31097
Mean    27758    28392    28316    28257    30607

Figure 1. D indicates an unknown disease locus and we consider two haplotypes h1 and h2. 0 and 1 respectively indicate the major and minor alleles at each locus. In this example h1 and h2 have different alleles near D (within the box) but similar alleles for SNPs located further away from D (outside the box). If the similarity measure M1 is utilized, then h1 and h2 are clustered into the same group.

Simulation studies under phase uncertainty

HapMap data around CFTR were also used for the simulations in which genotypes are not phased. To make haplotype frequency estimation faster, only 20 markers were selected, in the same way as for Table 2. 75 cases and 75 controls were generated and we considered the following disease models:

M3,1: 20 SNPs for markers, common disease allele and 1:3:9 GRR
M3,2: 20 SNPs for markers, rare disease allele and 1:3:9 GRR
M3,3: 20 SNPs for markers, common disease allele and 1:2:4 GRR
M3,4: 20 SNPs for markers, rare disease allele and 1:2:4 GRR.

For each individual, the genotypes were assumed to be unphased and the haplotype frequencies were estimated with DECIPHER (S.A.G.E., 2007). The uncertainty of haplotypes implies that there are several haplotype pairs compatible with the observed genotype of each individual, and this uncertainty can lead to over-confidence if it is ignored (Morris et al., 2004). Therefore the proposed Metropolized Gibbs sampler was applied for each similarity measure. WAL was not considered because it cannot handle unphased markers, and its performance in Table 1 and Table 2 was not comparable to the others. In Table 3, the MSE for each disease model is decomposed into the squared bias and the variance. Our results show that SM4 performs the best, followed by SM2.

Table 3. MSE (×10^5) for CFTR haplotypes when genotypes are not phased. From the HapMap, CFTR haplotypes were utilized to generate the genotypes for each individual, but the genotypes are assumed to be unphased. Phenotypes were generated with four disease models. Haplotype frequencies were estimated by DECIPHER (S.A.G.E., 2007).

        SM1                     SM2                     SM3                     SM4
        bias^2  var     MSE     bias^2  var     MSE     bias^2  var     MSE     bias^2  var     MSE
M3,1    481     4540    5021    973     4953    5926    732     4689    5421    623     4399    5022
M3,2    0.1     17257   17257   47      16790   16837   3       17849   17852   2       17491   17493
M3,3    761     11377   12138   907     10573   11479   750     10983   11733   723     10932   11655
M3,4    636     14435   15071   580     14386   14966   625     14399   15024   622     14260   14882
Mean    470     10902   12372   627     11676   12302   528     11980   12507   493     11770   12263

Real data analysis: CYP2D6

CYP2D6 has an important role in drug metabolism and it is located on human chromosome 22q13. Hosking et al. (2002) genotyped 32 SNPs to evaluate whether LD mapping is appropriate, and they generated the traits of each individual from their genotypes. We used only individuals with no missing values in their genotypes. This left 268 individuals, of whom 12 are affected. We considered the 32 SNPs flanking CYP2D6 and applied the proposed method using similarity measure SM4. It should be noted that the true causal SNP is known for these data. For WAL, we used the most probable haplotype pairs for the analysis, and the Cochran-Armitage test was also conducted. Figure 2 shows that the point estimate and the confidence interval at the 0.05 significance level from the proposed method are the most accurate. The true location of the causal SNP is 52.53Mb on chromosome 22. The posterior modes for the proposed method and WAL were 50.5Mb and 40.55Mb respectively, which illustrates the substantial improvement of the proposed method.
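The summary statistics used throughout the Results (the posterior median as point estimate, and the split of MSE into squared bias and variance reported in Table 3) can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and the variance is assumed to be the population variance over replicates (divisor n), which makes the decomposition exact.

```python
from statistics import fmean, median
from typing import Sequence, Tuple


def point_estimate(sampled_x: Sequence[float]) -> float:
    """Median of the sampled putative disease locations from one run."""
    return median(sampled_x)


def mse_decomposition(
    estimates: Sequence[float], truth: float
) -> Tuple[float, float, float]:
    """Return (bias^2, variance, MSE) of replicate point estimates."""
    centre = fmean(estimates)
    bias2 = (centre - truth) ** 2
    var = fmean([(e - centre) ** 2 for e in estimates])  # divisor n, by assumption
    return bias2, var, bias2 + var


def rmse(estimates: Sequence[float], truth: float) -> float:
    """Root mean squared error of replicate point estimates."""
    return fmean([(e - truth) ** 2 for e in estimates]) ** 0.5
```

With this convention the third returned value equals the directly computed mean of squared errors, so the bias and variance columns of a table like Table 3 should add up to the MSE column up to rounding.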

Figure 2. Application for CYP2D6 at the 0.05 significance level. "Estd" and "Tloc" respectively indicate the estimated and the true locations of a causal SNP. "New Approach" indicates the results from the proposed method using similarity measure SM4.

Discussion

In this manuscript, we investigated two important issues for clustering analysis. First, we considered various similarity measures. For the similarity measures, two things should be considered: matching alleles provide more evidence for similarity of haplotypes if they are rare or if they are located near the causal SNP. From the simulation studies we found that the statistical efficiency can be improved by providing the appropriate weights. Second, we implemented a new algorithm that can handle phase uncertainty. If phase uncertainty is not considered, our estimates of the disease location can be biased and the confidence interval may not preserve the significance level in some situations (Morris et al., 2004). We developed the software HCAT (haplotype clustering analysis tool for fine-mapping) in C++, which can be obtained by e-mail from S.W. ([email protected]). HCAT can incorporate several different weights (1, 1 - p, 1 - p^2, and (1 - p)/p) for allele frequency weighting, and it can handle phase-unknown genotypes. HCAT uses a non-informative prior, so that the prior does not affect the results and the estimate is approximately equal to the maximum likelihood estimate (MLE). Calculation of the similarity measure is simpler than in WAL. When genotypes are unphased, it takes 4 minutes on a 1994MHz x86_64 Linux system (cluster) for 100,000 iterations with 20 markers and 150 individuals.

However, even though the proposed method improves on the previous approaches, it still has several limitations. First, our algorithm cannot handle continuous traits or covariates that affect the disease. Second, common diseases usually arise from the complicated interplay of multiple variants, but the proposed method cannot handle multiple variants. These issues will be investigated as a part of our ongoing research.

Acknowledgement This research was supported by the Chung-Ang University Research Scholarship Grants in 2009.

References

Denison DG, Holmes CC (2001) Bayesian partitioning for estimating disease risk. Biometrics 57: 143-9.
Durrant C, Mott R (2010) Bayesian quantitative trait locus mapping using inferred haplotypes. Genetics 184: 839-52.
Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P, Morris AP (2004) Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. Am. J. Hum. Genet. 75: 35-43.
Hosking LK, Boyd PR, Xu CF, Nissum M, Cantone K, Purvis IJ, Khakhar R, Barnes MR, Liberwirth U, Hagen-Mann K, Ehm MG, Riley JH (2002) Linkage disequilibrium mapping identifies a 390 kb region associated with CYP2D6 poor drug metabolising activity. Pharmacogenomics. J. 2: 165-75.
Hudson RR (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337-8.
Knorr-Held L, Rasser G (2000) Bayesian detection of clusters and discontinuities in disease maps. Biometrics 56: 13-21.
Li J, Jiang T (2005) Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics 21: 4384-93.
Molitor J, Marjoram P, Thomas D (2003a) Fine-scale mapping of disease genes with multiple mutations via spatial clustering techniques. Am. J. Hum. Genet. 73: 1368-84.
Molitor J, Marjoram P, Thomas D (2003b) Application of Bayesian spatial statistical methods to analysis of haplotypes effects and gene mapping. Genet. Epidemiol. 25: 95-105.
Morris A, Pedder A, Ayres K (2003) Linkage disequilibrium assessment via log-linear modeling of SNP haplotype frequencies. Genet. Epidemiol. 25: 106-14.
Morris A, Whittaker J, Balding D (2004) Little Loss of Information Due to Unknown Phase for Fine-Scale Linkage-Disequilibrium Mapping with Single-Nucleotide-Polymorphism Genotype Data. Am. J. Hum. Genet. 74: 945-53.
Morris AP (2005) Direct analysis of unphased SNP genotype data in population-based association studies via Bayesian partition modelling of haplotypes. Genet. Epidemiol. 29: 91-107.
Morris AP (2006) A flexible Bayesian framework for modeling haplotype association with disease, allowing for dominance effects of the underlying causative variants. Am. J. Hum. Genet. 79: 679-94.
S.A.G.E. (2007) Statistical analysis for genetic epidemiology, Release 5.2: http://darwin.cwru.edu/sage/
Seaman SR, Richardson S, Stucker I, Benhamou S (2002) A Bayesian partition model for case-control studies on highly polymorphic candidate genes. Genet. Epidemiol. 22: 356-68.
Stephens M, Donnelly P (2003) A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73: 1162-9.
Stephens M, Smith NJ, Donnelly P (2001) A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978-89.
Su SY, Balding DJ, Coin LJ (2008) Disease association tests by inferring ancestral haplotypes using a hidden markov model. Bioinformatics 24: 972-8.
Templeton AR, Boerwinkle E, Sing CF (1987) A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. I. Basic theory and an analysis of alcohol dehydrogenase activity in Drosophila. Genetics 117: 343-51.
Templeton AR, Sing CF, Kessling A, Humphries S (1988) A cladistic analysis of phenotype associations with haplotypes inferred from restriction endonuclease mapping. II. The analysis of natural populations. Genetics 120: 1145-54.
Tzeng JY, Wang C, Kao J, Hsiao C (2006) Regression-Based Association Analysis with Clustered Haplotypes through Use of Genotypes. Am. J. Hum. Genet. 78: 231-42.
Waldron ER, Whittaker JC, Balding DJ (2006) Fine mapping of disease genes via haplotype clustering. Genet. Epidemiol. 30: 170-9.
Won S, Sinha R, Luo Y (2007) Fine-scale linkage disequilibrium mapping: a comparison of coalescent-based and haplotype-clustering-based methods. BMC. Proc. 1 Suppl 1: S133.
Yu K, Martin RB, Whittemore AS (2004) Classifying disease chromosomes arising from multiple founders, with application to fine-scale haplotype mapping. Genet. Epidemiol. 27: 173-81.
Zollner S, Pritchard JK (2005) Coalescent-based association mapping and fine mapping of complex trait loci. Genetics 169: 1071-92.

Appendix

Notation

$y_i$: phenotype of individual i = 1, 2, …, N; 0 for unaffected and 1 for affected; $y = (y_1, \ldots, y_N)^t$
$g_i$: genotype of individual i; $g = (g_1, g_2, \ldots, g_N)^t$
$h_{i1,j}$ (or $h_{i2,j}$): j-th possible maternal (or paternal) haplotype for genotype $g_i$, $j \in \{1, 2, \ldots, n_i\}$
$H_i$: latent variable for the phase of the i-th individual's genotype, $H_i \in \{(h_{i1,1}, h_{i2,1}), (h_{i1,2}, h_{i2,2}), \ldots, (h_{i1,n_i}, h_{i2,n_i})\}$
x: putative disease locus
δ: threshold of the similarity measure used to determine the disease cluster
a: hypothetical ancestral haplotype for the disease haplotype
T: tessellation structure
D: all the parameters that affect the haplotype tessellation, (a, δ, x)

Prior

We used the following non-informative priors: $\delta \sim U(0, 1)$, $\theta_k \sim U(0, 1)$, $f(a) \propto 1$, $x \sim U(x_{min}, x_{max})$. Here $x_{min}$ and $x_{max}$ respectively indicate the minimum and maximum values of the marker locations (base pairs), and k can be 0, 1, or 2.

Metropolized Gibbs Sampling Algorithm

1. Given the (t)-th samples, update T = (a, δ, x) in the following way to generate the (t+1)-th samples:

$$a_{new} = \varepsilon_1, \quad f(\varepsilon_1) \propto 1,$$
$$\delta_{new} = \delta^{(t)} + \varepsilon_2, \quad \varepsilon_2 \sim N(0, 1/5),$$
$$x_{new} = x^{(t)} + \varepsilon_3, \quad \varepsilon_3 \sim N(0, \sigma^2).$$

In particular, x is updated with reflection at $x_{min}$ and $x_{max}$. Take $D^{new} = (a_{new}, \delta_{new}, x_{new})$ with probability α:

$$D^{(t+1)} = \begin{cases} D^{(t)} & \text{with probability } 1 - \alpha \\ D^{new} & \text{with probability } \alpha, \end{cases}$$

$$\alpha = \min\left(1, \frac{f(y \mid H, T^{new})}{f(y \mid H, T^{old})}\right) = \min\left(1, \frac{B_f(s_0^{new} + 1, n_0 - s_0^{new} + 1)\, B_f(s_1^{new} + 1, n_1 - s_1^{new} + 1)\, B_f(s_2^{new} + 1, n_2 - s_2^{new} + 1)}{B_f(s_0^{old} + 1, n_0 - s_0^{old} + 1)\, B_f(s_1^{old} + 1, n_1 - s_1^{old} + 1)\, B_f(s_2^{old} + 1, n_2 - s_2^{old} + 1)}\right).$$

Here $B_f$ is the beta function, $B_f(x, y) = \int_0^1 t^{x-1} (1 - t)^{y-1}\, dt$.

2. Update θ = (θ0, θ1, θ2):

$$f(\theta \mid y, G, H, T) \propto f(y \mid H, \theta, T)\, f(\theta) \propto \theta_0^{s_0} (1 - \theta_0)^{n_0 - s_0}\, \theta_1^{s_1} (1 - \theta_1)^{n_1 - s_1}\, \theta_2^{s_2} (1 - \theta_2)^{n_2 - s_2}$$
$$\propto B_D(\theta_0; s_0 + 1, n_0 - s_0 + 1)\, B_D(\theta_1; s_1 + 1, n_1 - s_1 + 1)\, B_D(\theta_2; s_2 + 1, n_2 - s_2 + 1),$$

where $B_D$ is a beta distribution. We update each $\theta_k$ with a separate beta distribution because the $\theta_k$ are independent.

3. Update $H_i$ ($\in \{h_{i,1}, h_{i,2}, \ldots, h_{i,n_i}\}$):

$$f(H_i \mid y_i, g_i, \theta, T) = \frac{f(y_i \mid H_i, \theta, T)\, f(H_i \mid g_i)}{f(y_i \mid g_i, \theta, T)} = \frac{f(y_i \mid H_i, \theta, T)\, f(H_i \mid g_i)}{\sum_{j=1}^{n_i} f(y_i \mid h_{i,j}, g_i, \theta, T)\, f(h_{i,j} \mid g_i)}.$$
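The three updates in the algorithm above can be sketched in Python as follows. This is a hedged illustration, not the HCAT code. The appendix does not define $s_k$ and $n_k$ explicitly; the sketch assumes that $n_k$ is the number of individuals carrying k disease haplotypes under the current clustering and $s_k$ is the number of affected individuals among them, and it passes a full set of (s_k, n_k) pairs for both the proposed and the current state. All function and argument names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Counts = Sequence[Tuple[int, int]]  # assumed (s_k, n_k) for k = 0, 1, 2


def log_beta(x: float, y: float) -> float:
    """Logarithm of the beta function B_f(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def acceptance(new: Counts, old: Counts) -> float:
    """alpha = min(1, ratio of products of beta functions), on the log scale."""
    log_ratio = sum(log_beta(s + 1, n - s + 1) for s, n in new) - sum(
        log_beta(s + 1, n - s + 1) for s, n in old
    )
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def reflect(x: float, lo: float, hi: float) -> float:
    """Fold a proposed location back into [lo, hi]; assumes lo < hi."""
    while x < lo or x > hi:
        x = 2 * lo - x if x < lo else 2 * hi - x
    return x


def draw_theta(counts: Counts, rng: random.Random) -> List[float]:
    """Step 2: independent Beta(s_k + 1, n_k - s_k + 1) draws."""
    return [rng.betavariate(s + 1, n - s + 1) for s, n in counts]


def draw_phase(
    y: int,
    n_disease: Sequence[int],
    phase_prob: Sequence[float],
    theta: Sequence[float],
    rng: random.Random,
) -> int:
    """Step 3: index of the haplotype pair sampled for one individual."""
    weights = [
        (theta[k] if y == 1 else 1.0 - theta[k]) * p
        for k, p in zip(n_disease, phase_prob)
    ]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Working with `math.lgamma` on the log scale avoids overflow of the beta functions when the group sizes are in the hundreds. A proposal would then be accepted when `rng.random() < acceptance(new, old)`.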