Population structure
Population structure • What causes population structure? • Population structure can be caused by an adaptive response of organisms to their environment, or structure can be a by-product of an event that affected the population. • Adaptive responses – local adaptation to a particular environment. • By-product response – vicariant event divides a species into two populations, and the two populations start accumulating differences.
Population structure • Population structure can also be divided into ongoing events and historical events. • Ongoing event – restricted gene flow would cause population structuring. • Historical event – past fragmentation would cause population structuring.
Population structure • What is our null hypothesis? • The species of interest is panmictic. • This means that the individuals of a species have an equal probability of being found anywhere throughout its geographic distribution. • Panmixia is also one of the assumptions of a population at Hardy Weinberg equilibrium, therefore if other assumptions are not violated, panmictic populations will be at Hardy Weinberg equilibrium.
Nested Clade Analysis • Methodology proposed by Templeton et al. (1995). • The method tests for non-random distribution of haplotypes on a geographic landscape. • The method assumes no a priori structure – i.e. the null hypothesis is panmixia.
Nested Clade Analysis • What must we know to conduct a nested clade analysis? • Genealogical relationships of haplotypes – i.e. a haplotype network. • Geographic distribution of haplotypes – this can either be in the form of geographic coordinates, or in the form of a distance matrix.
Haplotype networks • There are several methods of reconstructing haplotype networks – the relationship among haplotypes. • Three most popular methods are Statistical Parsimony networks, Likelihood networks, and Median-Joining networks. • Each network reconstruction uses a different optimality criterion for resolving relationships among haplotypes.
Statistical parsimony • The statistical parsimony algorithm begins by estimating the maximum number of differences among haplotypes as a result of single substitutions with a 95% statistical confidence. This number is called the parsimony connection limit. • After this, haplotypes differing by one change are connected, then those differing by two, by three and so on, until all the haplotypes are included in a single network or the parsimony connection limit is reached. • The statistical parsimony method emphasizes what is shared among haplotypes that differ minimally rather than the differences among the haplotypes.
Likelihood networks • The likelihood network approach assumes some model of molecular evolution. Networks are then generated according to some stochastic process, and likelihood of the evolution of characters on that network assuming the given model of evolution is evaluated. • Network is rearranged, and likelihood recalculated until a network with the highest likelihood is obtained (essentially same approach as ML tree reconstruction).
Median-joining networks • The median-joining network method begins by combining the Minimum-Spanning Trees (MSTs) within a single network. With a parsimony criterion, median vectors (which represent missing intermediates) are added to the network. • Median-joining networks can handle large data sets and multistate characters. It is an exceptionally fast method that can analyze thousands of haplotypes in a reasonable amount of time and can also be applied to amino acid sequences.
Haplotype networks • Due to homoplasy, Statistical Parsimony and Median-Joining networks result in “loops”. • Loops are instances where alternate, equally good connections exist. • Do loops need to be resolved? • How can loops be resolved?
Loop resolution • The coalescent makes several predictions that aid in loop resolution. – Older alleles (those of higher frequency in the population) have a greater probability of becoming interior haplotypes (haplotypes with more than one mutational connection) than younger haplotypes. – On average, older alleles will be more broadly distributed geographically. – Haplotypes with greater frequency will tend to have more mutational connections. – Singletons are more likely to be connected to nonsingletons than to other singletons. – Singletons are more likely to be connected to haplotypes from the same population than to haplotypes from different populations.
Haplotype networks • Statistical Parsimony software – TCS 1.21 by Clement et al. (2000) • Median-Joining network software – Network by Rohl • Likelihood network software – PAL by Strimmer and Drummond or other likelihood program (Treefinder by Jobb)
Geographic localities • Geographic distribution of haplotypes can be supplied in one of two forms. • Each locality is identified as a set of geographic coordinates, and the program calculates great circle distances between localities. • Or localities are supplied in the form of a distance matrix. • Advantages? Disadvantages?
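The step from coordinates to a distance matrix can be sketched as follows. This is a minimal illustration using the haversine formula, not the routine used by any particular program; the function names and the Earth radius constant are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_matrix(localities):
    """Pairwise distance matrix for a list of (latitude, longitude) tuples."""
    return [[great_circle_km(*a, *b) for b in localities] for a in localities]


# Hypothetical localities, for illustration only
sites = [(0.0, 0.0), (0.0, 90.0), (45.0, 45.0)]
matrix = distance_matrix(sites)
```

Supplying coordinates lets the program derive distances itself, whereas a user-built matrix can encode distances that are not straight lines (for example, distances along a river).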
Nested Clade Analysis • To perform nested clade analysis, we need to nest the haplotypes into higher level clades. • By nesting, we are successively reducing the number of mutations observed in our data, and are approximating the coalescent process – i.e. we are going back in time.
Nesting rules • Reduce all haplotypes connected by one mutational step into one nested haplotype. • Proceed from tip haplotypes to interior haplotypes, nest all haplotypes. • Proceed from exterior to interior haplotypes, nest all haplotypes. • A stranded haplotype should be added to the sister haplotype that has the smallest number of individuals. • Once finished, proceed with the next nesting level.
Inferring Structure • Once we reconstruct the haplotype network, nest the haplotype network, and generate a file of geographic coordinates or a geographic distance matrix, we construct an input file for the program Geodis 2.1 which will test for association between haplotype relationships and geographic distribution of the haplotypes.
Haplotype geography association • The program Geodis computes two distances. • Dc distance – clade distance; it measures how geographically widespread individuals bearing a particular haplotype are. • Dn distance – nested clade distance; it measures how far individuals bearing a particular haplotype are from all other individuals within a particular clade.
Haplotype geography association
Haplotype geography association • Dc distance – three haplotypes exist, each has a geographic center • Dc (1) = 0 • Dc (2) = 1/3 (2) + 2/3 (1) = 1.33 • Dc (3) = 1/3 (1.9) + 1/3 (1.9) + 1/3 (1.9) = 1.90
Haplotype geography association • Dn distance – three haplotypes exist, all have one geographic center • Dn (1) = 1 (1.6) = 1.6 • Dn (2) = 1/3 (1.6) + 2/3 (1.5) = 1.53 • Dn (3) = 1/3 (1.6) + 1/3 (1.5) + 1/3 (2.3) = 1.80
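The arithmetic in the two examples above is a frequency-weighted mean of distances from a geographic center. A minimal sketch, assuming the distances from each locality to the relevant center are already known (the function name is hypothetical and this is not Geodis code):

```python
def mean_distance_from_center(counts, distances):
    """Frequency-weighted mean distance of individuals from a geographic center.

    counts[i]    - number of individuals bearing the haplotype at locality i
    distances[i] - distance from locality i to the center
                   (the haplotype's own center for Dc, the nesting clade's center for Dn)
    """
    total = sum(counts)
    return sum(c * d for c, d in zip(counts, distances)) / total


# Values taken from the slide examples
dc_2 = mean_distance_from_center([1, 2], [2.0, 1.0])            # 1/3 (2) + 2/3 (1)
dn_3 = mean_distance_from_center([1, 1, 1], [1.6, 1.5, 2.3])    # 1/3 (1.6) + 1/3 (1.5) + 1/3 (2.3)
```

The same function serves for both distances; only the center from which the distances are measured changes.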
Haplotype geography association • The primary information comes from the contrast of these Dc and Dn distance values. • The null hypothesis is panmixia – no association of haplotype distribution over the landscape, and therefore the Dc and Dn distances should be equal. • Statistical equality is measured via random permutations of haplotypes among localities, and significance of actual vs. expected Dc and Dn values is evaluated using a χ² test. • Dc and Dn values can be significantly small or significantly large.
Dc – Dn contrast • The primary information comes from the contrast of these Dc and Dn distance values. • E.g. significantly small Dc values would indicate restricted dispersal of this clade in comparison to other clades. • E.g. significantly large Dn values would indicate long distance dispersal. • Very important – all of these inferences are done at a relative scale, not an absolute scale.
NCA • Recently, NCA has been questioned on several fronts. • Inference key is dichotomous, and there is no statistical probability associated with inferences. • Geodis is prone to α error – we are likely to reject the null hypothesis falsely. • How permutations are made – permuting individuals among localities.
SAMOVA • The Spatial Analysis of Molecular Variance (SAMOVA) is an adaptation of AMOVA – the Analysis of Molecular Variance. • The objective of the methodology is to find groups of sampling localities that are maximally differentiated from each other. • Only genetic information is used by this method; no additional a priori information about population membership is used.
SAMOVA • Distribute sampling localities on a geographic landscape. • Assume some number of K clusters. • Partition sampling localities among clusters. • Calculate the between-cluster F (Fct). • Transfer a portion of the sampling localities between clusters. • Recalculate the between-cluster F. • Accept the new cluster configuration if its F is greater than the F of the original cluster configuration; otherwise reject the new cluster configuration.
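The accept/reject loop described above can be sketched as a simple greedy search. This is only an illustration of the steps on the slide, not the SAMOVA program itself: the score function is supplied by the caller, and the toy score used here (between-group share of variance of a single number per locality) is a stand-in assumption for a real Fct calculation. All names are hypothetical.

```python
import random


def between_group_fraction(values, labels):
    """Toy stand-in for Fct: between-group sum of squares / total sum of squares."""
    grand = sum(values) / len(values)
    total = sum((v - grand) ** 2 for v in values)
    between = 0.0
    for group in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == group]
        group_mean = sum(members) / len(members)
        between += len(members) * (group_mean - grand) ** 2
    return between / total


def hill_climb_partition(n_localities, k, score, steps=1000, seed=0):
    """Greedy search: move one locality at a time, keep moves that raise the score."""
    rng = random.Random(seed)
    labels = [i % k for i in range(n_localities)]  # every cluster starts non-empty
    rng.shuffle(labels)
    best = score(labels)
    for _ in range(steps):
        i = rng.randrange(n_localities)
        new_label = rng.randrange(k)
        if new_label == labels[i] or labels.count(labels[i]) == 1:
            continue  # nothing would change, or the move would empty a cluster
        candidate = labels[:]
        candidate[i] = new_label
        candidate_score = score(candidate)
        if candidate_score > best:  # accept only improvements
            labels, best = candidate, candidate_score
    return labels, best


# Hypothetical per-locality values, for illustration only
toy_values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
toy_labels, toy_best = hill_climb_partition(
    len(toy_values), 2, lambda labs: between_group_fraction(toy_values, labs)
)
```

A search that accepts only improvements can stop at a local optimum, so in practice several random starting partitions would be compared.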
SAMOVA
SAMOVA
SAMOVA • The number of populations that maximize between cluster F (Fct), is the number of populations assumed to be correct. • K = 2; Fct = 0.10277 • K = 3; Fct = 0.15486 • K = 4; Fct = 0.21312
Structure • The program Structure (Pritchard et al., 2000), is a Bayesian clustering approach to assign individuals to populations. • Their model assumes Hardy Weinberg and linkage equilibria and attempts to define groups of individuals that minimize departures from these equilibria. • The concept is based on the fact that populations are groups of interbreeding individuals, and therefore these groups should be at Hardy Weinberg equilibrium.
Structure • The program Structure will calculate the probability of a given number of groups under several different models. • Ancestry models – No admixture – Admixture – Linkage – Prior population information • Allele frequency models – Independent – Correlated
Structure • Ancestry models – No admixture – each individual comes purely from one population – Admixture – individuals may have mixed ancestry – Linkage – there is linkage disequilibrium among loci – Prior population information – predefined population information is assumed to be correct
Structure • Ancestry models • In general, we want to use the admixture model. It is robust, in that it allows for the possibility of some individuals being of mixed origin. The linked loci model should be used with a large number of loci that are in linkage disequilibrium. The prior population model can be used for assigning unknown individuals to known groups, or to test for the a priori existence of groups.
Structure • Allele frequency models • Independent models assume that frequencies of different alleles are independent of each other in each population, i.e. that we expect different populations to have different frequencies of alleles. In correlated model, we expect frequencies of alleles to be correlated between populations. • Using the correlated model we bias ourselves to finding structure.
Structure • Assume some number of K clusters. • Randomly partition individuals among clusters. • Calculate the deviation from Hardy Weinberg equilibrium within clusters. • Transfer a portion of individuals between clusters. • Recalculate the deviation from HWE within clusters. • Accept the new cluster configuration if the HWE deviation is smaller than in the original cluster configuration; otherwise reject the new cluster configuration.
Structure • If there is real structure, this will lead to mild deviations from Hardy Weinberg equilibrium, and to linkage disequilibrium among loci. • However, deviations from HWE and the presence of LD may also result from inbreeding, and from genotyping errors such as occasional, undetected null alleles. • Some models, such as the correlated allele frequencies and no admixture models, will bias results toward finding structure. • So be conservative in inferring the existence of structure.
Structure • For each number of populations, a posterior probability value exists. • Question is if the inference of an additional population is justified by the increase in probability. • Heuristically, we want a relatively large increase in probability, however, we are getting absolute changes in probability.
Structure • When is structure real? • The ln Pr(X│K) will plateau. • In admixture analyses α value will have little variance. • Populations will not be of equal size.
Structure • Bayes’ rule for determining number of populations. • $P = \dfrac{e^{\ln \Pr(\text{data} \mid K = X)}}{\sum_{i=1}^{N} e^{\ln \Pr(\text{data} \mid K = i)}}$
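Structure • Bayes’ rule for determining number of populations. • K = 1: ln Pr(X│K) = -4356, P = 2.16 × 10^-163 • K = 2: ln Pr(X│K) = -3983, P = 0.211 • K = 3: ln Pr(X│K) = -3982, P = 0.576 • K = 4: ln Pr(X│K) = -3983, P = 0.211 • K = 5: ln Pr(X│K) = -4006, P = 2.17 × 10^-11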
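Structure • Bayes’ rule for determining number of populations. • K = 15: ln Pr(X│K) = -7735.4, P = 2.4 × 10^-29 • K = 16: ln Pr(X│K) = -7698.2, P = 3.4 × 10^-13 • K = 17: ln Pr(X│K) = -7669.4, P = 1.0 • K = 18: ln Pr(X│K) = -7699.9, P = 6.1 × 10^-14 • K = 19: ln Pr(X│K) = -7762.0, P = 6.5 × 10^-41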
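The posterior probabilities in these tables can be reproduced from the ln Pr(X│K) values. The sketch below assumes a uniform prior over the candidate values of K and subtracts the largest log value before exponentiating so that very negative numbers stay within floating-point range; the function name is hypothetical.

```python
import math


def posterior_over_k(log_probs):
    """Posterior probability of each K from ln Pr(X|K), assuming a uniform prior on K.

    log_probs maps K -> ln Pr(X|K).
    """
    largest = max(log_probs.values())
    # Subtracting the maximum rescales numerator and denominator equally
    weights = {k: math.exp(v - largest) for k, v in log_probs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# ln Pr(X|K) values from the first table above
posterior = posterior_over_k({1: -4356, 2: -3983, 3: -3982, 4: -3983, 5: -4006})
```

Because only differences between the log values matter, a change of one log unit between K = 2 and K = 3 already shifts the posterior from about 0.21 to about 0.58.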
Structure • Bayes’ rule is very powerful, but often gives a biologically unrealistic result (it tends to overestimate the number of populations). • Therefore a methodology has been developed by Evanno et al. (2005) based on variance in posterior probabilities of estimates in repeated runs.
Structure • Evanno et al. (2005) observed that the distribution of L(K) did not show a clear mode for the true K, but found that the second order rate of change of the likelihood function with respect to K (∆K) did show a clear peak at the true value of K. • This works because the variance in likelihood estimates between independent runs is much smaller at the true value of K than when K is not correct. • $\Delta K = m(|L(K + 1) - 2L(K) + L(K - 1)|) / s[L(K)]$, where m is the mean and s the standard deviation across repeated runs.
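A minimal sketch of the ∆K calculation from repeated runs follows. It assumes one reading of the formula: L(K) is first averaged over runs for each K, the absolute second difference is taken from those means, and the result is divided by the sample standard deviation of L(K) across runs. The function name and the input values are hypothetical.

```python
from statistics import mean, stdev


def delta_k(runs):
    """Delta K for each interior K.

    runs maps K -> list of L(K) values from repeated runs (at least two per K).
    Returns a dict K -> Delta K for every K that has both K-1 and K+1 available.
    """
    avg = {k: mean(values) for k, values in runs.items()}
    result = {}
    for k in sorted(runs):
        if k - 1 in avg and k + 1 in avg:
            second_difference = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
            result[k] = second_difference / stdev(runs[k])  # sample SD across runs
    return result


# Made-up L(K) values from two runs per K, for illustration only
example_runs = {
    1: [-100.0, -102.0],
    2: [-50.0, -50.5],
    3: [-49.0, -51.0],
    4: [-48.0, -52.0],
}
example_delta = delta_k(example_runs)
best_k = max(example_delta, key=example_delta.get)
```

∆K cannot be computed for the smallest or largest K tested, because the second difference needs a value on each side.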
Structure
Structure
Structure • [Figure: two plots, series label "Series1". First plot: x-axis ticks 1 to 23 (presumably K), y-axis from -14000.0 to -7000.0. Second plot: x-axis ticks 2 to 23, y-axis from -5000.0 to 5000.0.]
Structure • [Figure: two plots, series label "Series1". First plot: x-axis ticks 3 to 23, y-axis from 0.0 to 9000.0. Second plot: x-axis ticks 3 to 23, y-axis from 0 to 120.]
Structure • Population structuring may be the result of historical and/or ongoing processes. • Historical processes occurred in the past and are no longer acting. • Ongoing processes are still acting. • Ongoing processes are usually associated with different levels of gene flow. • One such process may be restricted gene flow resulting in isolation-by-distance.
Structure • To test for isolation-by-distance, we can run a Mantel test. • The Mantel test calculates a coefficient of correlation between two matrixes, in this case a matrix of genetic distances and a matrix of geographic distances, and through permutation of one of the matrixes it tests the significance of the correlation coefficient. • The Mantel test also calculates a regression coefficient of one matrix onto the other.
Correlation = Association • Correlation coefficient – measures the strength of association between two variables of the same experimental unit; in other words, a measure of the precision of the relation between two variables (e.g. size vs. weight).
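Describing a normal distribution • Dispersion: – Variance: average squared distance of the observations from their mean, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ – Standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ • [Figure: normal curve with 68.27% of observations within ±1 s of the mean, 95.00% within ±1.96 s, and 99.00% within ±2.58 s.]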
Correlation = Association • Covariance – a statistical measure of the tendency of two variables to change in conjunction with each other • Or… a measure of how much two variables vary together. • $\mathrm{cov}_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ • Equivalently, $\mathrm{cov}_{xy} = \frac{\sum x_i y_i - \frac{1}{n} \left( \sum x_i \sum y_i \right)}{n - 1}$
Correlation coefficient • $r = \frac{\mathrm{cov}_{xy}}{s_x s_y}$ • Does not have units • Ranges from -1 to +1 • Sign indicates the direction of the correlation • It does NOT indicate a cause-effect relation • Traits can be correlated but have very different values
What do correlated traits look like in a diagram?
Regression analysis • Provides a precise quantitative relationship between variables • Regression line $y = a + bx$, where b is the slope: $b = \frac{\mathrm{cov}_{xy}}{s_x^2}$
Regression is used to measure the extent to which one variable explains the other variable
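The covariance, correlation coefficient and regression slope defined in this part of the lecture can be computed directly from their definitions. The sketch below uses plain Python with the n − 1 denominator; the function names are chosen for illustration, and the intercept formula a = ȳ − b x̄ is the standard least-squares result rather than something stated on the slides.

```python
import math


def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def variance(xs):
    """Sample variance: the covariance of a variable with itself."""
    return covariance(xs, xs)


def correlation(xs, ys):
    """Correlation coefficient r = cov_xy / (s_x * s_y)."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))


def regression_line(xs, ys):
    """Intercept a and slope b of y = a + b x, with b = cov_xy / s_x^2."""
    b = covariance(xs, ys) / variance(xs)
    a = sum(ys) / len(ys) - b * sum(xs) / len(xs)
    return a, b


# Made-up measurements, for illustration only
size = [1.0, 2.0, 3.0, 4.0]
weight = [2.0, 4.0, 6.0, 8.0]
r = correlation(size, weight)
intercept, slope = regression_line(size, weight)
```

The correlation coefficient is unitless, while the slope carries the units of y per unit of x, which is why two traits can be perfectly correlated and still have very different values.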
Mantel test • Calculate the covariance between the genetic and geographic matrixes. • Calculate the variances of the genetic and geographic matrixes. • Calculate the correlation coefficient. • Permute one of the matrixes, recalculate the covariance and variances, and recalculate the correlation coefficient. • Get a distribution of correlation coefficients based on permuted matrixes, and see if the actual correlation coefficient is significantly larger than expected by chance.
Mantel test • In a two dimensional habitat, it is advisable to analyze the natural log of geographic and genetic distance. • Our expectation is that there will be a positive correlation coefficient between geographic distance and genetic differentiation (such as Fst). • If we measure genetic differentiation as migration, then our expectation is that there will be a negative correlation between geographic distance and migration rate. • If the Mantel test yields a significant correlation coefficient, we can interpret this as restricted gene flow with isolation-by-distance.
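The Mantel procedure described above can be sketched in a few lines. This is a minimal, one-tailed version (testing for a positive correlation) written for illustration, not the implementation of any particular package. Rows and columns of the genetic matrix are permuted together so that it stays a valid distance matrix; the function names, the number of permutations and the seed are assumptions.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mantel_test(genetic, geographic, permutations=999, seed=0):
    """Return (observed r, one-tailed p-value) for two square distance matrixes."""
    rng = random.Random(seed)
    n = len(genetic)
    geo_values = upper_triangle(geographic)
    observed = pearson(upper_triangle(genetic), geo_values)
    as_extreme = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        # Relabel localities: permute rows and columns of the genetic matrix together
        permuted = [[genetic[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), geo_values) >= observed:
            as_extreme += 1
    return observed, as_extreme / (permutations + 1)
```

Entries of a distance matrix are not independent of one another, which is why significance is assessed by permuting localities rather than by the usual test for a correlation coefficient.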
Continuously differentiated pops • In some populations isolation-by-distance is a significant structuring factor; however, beyond some distance individuals no longer exchange genes, i.e. isolation-by-distance is a structuring factor only over certain distances, and at larger distances individuals are genetically isolated. • This type of structuring may be analyzed through a spatial autocorrelation analysis.
Spatial autocorrelation • Divide geographic distances into an inclusive and non-inclusive distance, e.g. above and below 500 km. • Run a Mantel test. • Select a new distance class, e.g. above and below 1000 km. • Run a Mantel test. • Proceed until geographic distance range is exhausted. • Distance at which point correlation is no longer significant represents a distance at which there is significant differentiation among localities.
Spatial autocorrelation • Distance at which point correlation is no longer significant represents a distance at which there is significant differentiation among localities. • Below this distance threshold individuals are connected by gene flow, above this threshold they are not. • Below this threshold individuals belong to the same population as their geographic neighbors; above this threshold, they are members of different populations. • This distance window is a sliding window, and therefore there are no fixed boundaries among the inferred populations.
Spatial autocorrelation • [Figure: Microsatellite autocorrelogram. x-axis: Geographic distance (km), 0 to 4000; y-axis: Genetic distance (Fst), 0.00 to 0.70.]
Spatial autocorrelation
Analysis of Molecular Variance • Analysis of Molecular Variance (AMOVA; Excoffier et al., 1992) is an implementation of Analysis of Variance (ANOVA) for molecular data. • It is based on calculating variance components (as sums of squared deviations) within various levels of hierarchical grouping. • Variance components of the different hierarchical groups are then compared. • The expectation is that there will be no difference if the hierarchical groups do not exist.
Analysis of Molecular Variance • AMOVA is useful for hypothesis testing. • We have some a priori hypothesis of population differentiation (in the form of grouping of individuals or groups of individuals), and we want to test if this grouping is a significantly better explanation of the data than no grouping. • We either reject this grouping, or we accept this grouping (and then hopefully have some relevant process that would have resulted in this grouping).