STATISTICS BY EXAMPLES (USING R)

OCTAVIO MARTÍNEZ DE LA VEGA
LANGEBIO - CINVESTAV - IRAPUATO

Abstract. For years I taught the Biostatistics course in Cinvestav - Irapuato beginning with elementary probability theory and, once that part was understood, introducing the 'central' (or core) concepts needed for statistical inference using practical examples. I think that is the correct approach; however, time limitations have forced me to change it. This time I will introduce practical problems with real data and proceed to do the full analysis, introducing the concepts as they are needed until the questions generated by the experiment at hand are solved by the use of statistical methods. For this teaching method to work you need to participate, asking questions and carefully following the ideas presented. Hopefully you will be able to assimilate the central concepts of Statistics.

In this document, anything shown in bold face carries a web link to a particular resource. We will be using R for demonstrations; however, this is NOT an R course (we do not have the time). You may install R on your computer and follow the examples and analyses performed. You are also welcome to download the notes for previous editions of the course.

1. The analysis of a very small dataset

We are going to analyze just two data points obtained in the laboratory of Rafael Rivera Bustamante1. The facts and data are very simple:

(1) A chilli pepper plant found in a collection turned out to be resistant to a particular geminivirus.
(2) When self-pollinated, this plant produces both resistant and susceptible descendants (no intermediates were found).
(3) In the particular experiment reported, Marco found 156 resistant and 130 susceptible plants (a total of 286 descendants).

This dataset is of the smallest possible size: two data points plus some background information. Note that the background information can be used only if the analyst has a good understanding of genetics. Now we can ask ourselves the following questions:

– Which reasonable hypotheses can be inferred from this information?
– How 'strong' (or otherwise weak) is the evidence backing up the 'more likely' hypothesis?
– How close (in 'likelihood') comes the next most likely hypothesis?

My idea during the lecture is to induce you to find the responses to these questions. We will then see the answers after the 're-discovery' (written out in these notes) and thus hopefully learn something.

1.1. Which information is given by these data? First, notice that the data make more sense if we interpret them using our knowledge of Genetics. To make sense of the data we must assume that the phenomenon observed, i.e., the resistance of the plant to the virus, has a genetic basis.
Accepting this, we reason that the original parental plant cannot be homozygous for all the genes determining the characteristic; this is because a homozygous plant, when self-pollinated, will give only descendants of the same class as the parent (resistant in this case). The fact that we have two discrete classes in the

Date: November 26, 2014.
1 See: García-Neria M. A. and R. F. Rivera-Bustamante (2011) Characterization of Geminivirus Resistance in an Accession of Capsicum chinense Jacq. MPMI Vol. 24, No. 2, pp. 172–182. doi:10.1094/MPMI-06-10-0126.


descendants (resistant and susceptible) indicates that this plant is heterozygous, at least for one locus, but it could be heterozygous for two or more loci.

A second fact is that the data give information about a proportion. The fact that exactly 286 descendants were studied is one of the results, but it was not designed a priori like that; the design was very simple, say: obtain as many descendants as you can and do the bioassay to determine if each one is 'resistant' or 'susceptible' to the virus. Thus the important bit here is that we have an estimated proportion of resistant plants

p̂ = 156/286 ≈ 0.54545

How can we 'model' this experiment? That is, how can we set up a mathematical (statistical) framework to evaluate this evidence? For this we will first derive the Bernoulli distribution and then, as the sum of 'successes', the Binomial distribution.

1.2. The Bernoulli distribution. Consider a 'random experiment' (i.e., an experiment in which we are not sure of the output; see my notes for a more formal definition) in which only two exclusive results are possible, say 'success' (S) or 'failure' (F). This very simple model assumes that the probability of obtaining a 'success' is p, 0 ≤ p ≤ 1, and the probability of obtaining a 'failure' is 1 − p. We can write this down as

P[S] = p,  P[F] = 1 − P[S] = 1 − p

What does the word 'probability' mean? Here we will think of a 'probability' as the limit of the relative frequency of occurrence of an event when the number of times that the experiment is performed is 'very large'. A typical example of the Bernoulli distribution is the toss of a 'fair' coin; we reason that there is the same 'chance' of the coin coming up heads or tails, thus we can assume that p = 1/2 = 0.5, the maximum uncertainty. Let's simulate in R the experiment of tossing a fair coin 10,000 times...
> # Use the 'sample' function to obtain n=10,000 flips of a 'fair' coin
> results <- sample(x=c(0,1), size=10000, replace=TRUE)
> # Now we calculate the 'estimated' (observed) probabilities as function of n
> p.est <- rep(NA, 10000)
> # Get the estimated probabilities
> for(i in 1:10000){p.est[i] <- sum(results[1:i])/i}
> # Example of a plot
> plot(c(1:10), p.est[1:10], type="l", ylim=c(0,1),
+      xlab="Number of tries (Bernoulli experiments)", ylab="\"Estimated\" p")
> abline(h=0.5, col="red", lty=2)

From Figure 1 we can see how, as the number of tosses (experiments) increases, the estimated proportion p̂ 'tends' to be closer to its true (parametric) value of p = 0.5. Note that this is NOT, strictly speaking, a 'mathematical result', but a fact of nature. The only thing that we MUST assume is that the probability of success does not change, i.e., it stays the same as we do all the experiments. Under that condition the relative frequency of occurrence of 'successes' tends to the true value, p. Also notice that if we call 'success' the fact of obtaining a 'resistant' plant and 'failure' the fact of obtaining a 'susceptible' plant, then the observation of each one of the descendants is a Bernoulli trial with (unknown) probability p. In this case it is sensible to assume that the probability of obtaining a resistant plant does not change, because it is determined by the (unknown but fixed) genotype of the parent plant.
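The four panels of Figure 1 can be generated in a single run; a possible self-contained sketch (the seed, the cumsum shortcut and the panel layout are my choices, not from the notes):

```r
set.seed(7)  # for reproducibility (the notes do not fix a seed)
results <- sample(x = c(0, 1), size = 10000, replace = TRUE)
# Running estimate of p after each toss (equivalent to the loop above)
p.est <- cumsum(results) / (1:10000)
# Four panels as in Figure 1: first 10, 100, 1000 and all 10,000 tosses
par(mfrow = c(2, 2))
for (n in c(10, 100, 1000, 10000)) {
  plot(1:n, p.est[1:n], type = "l", ylim = c(0, 1),
       xlab = "Number of tries (Bernoulli experiments)", ylab = "\"Estimated\" p")
  abline(h = 0.5, col = "red", lty = 2)
}
```

The `cumsum` version avoids the explicit loop and scales better, but produces exactly the same running proportions.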

[Figure 1 shows four panels of the "estimated" p against the number of tries (Bernoulli experiments): (a) the first 10 tosses, (b) the first 100 tosses, (c) the first 1000 tosses, (d) all tosses.]

Figure 1. Relative frequency of 'heads' when tossing a fair coin

In our data we have only the sum of successes (resistant plants) from a total of 286, thus we must obtain the 'probability function' for that 'random variable'2.

1.3. The Binomial distribution. Consider a random variable X defined as the sum of successes in n Bernoulli trials. We want to obtain the 'probability function' for X, i.e., an explicit expression for P[X = x]; x = 0, 1, 2, ..., n, given p and n. The procedure to obtain such an expression is simple; just consider 'all things that can happen' (all possible events) and obtain the probability P[X = x] as a function of x, p and n.

2 For definitions of these terms see my notes for previous editions of this course.
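Before deriving the formula, note that one realization of such an X can be simulated directly as a sum of Bernoulli outcomes; a quick sketch with illustrative values (the seed and object names are mine):

```r
set.seed(2)  # for reproducibility
n <- 3; p <- 0.5
# One draw of X: the sum of n Bernoulli trials with success probability p
x <- sum(sample(c(0, 1), size = n, replace = TRUE, prob = c(1 - p, p)))
x  # an integer between 0 and n
# R's shortcut for the same random variable: rbinom(1, size = n, prob = p)
```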


We can begin with a small number of trials, say n = 3; then we can generalize our finding to any value of n. Table 1 shows all possible cases for n = 3 and their corresponding probabilities. For n = 3, taking into account that each trial can end in only two possible results, we have a total of 2 × 2 × 2 = 8 possible cases.

Table 1. Probabilities of individual results of 3 Bernoulli trials

Case | 1st 2nd 3rd | Probabilities (1st, 2nd, 3rd) | Total P   | x
  1  |  F   F   F  | (1−p)(1−p)(1−p)               | (1−p)^3   | 0
  2  |  F   F   S  | (1−p)(1−p)p                   | p(1−p)^2  | 1
  3  |  F   S   F  | (1−p)p(1−p)                   | p(1−p)^2  | 1
  4  |  S   F   F  | p(1−p)(1−p)                   | p(1−p)^2  | 1
  5  |  F   S   S  | (1−p)pp                       | p^2(1−p)  | 2
  6  |  S   F   S  | p(1−p)p                       | p^2(1−p)  | 2
  7  |  S   S   F  | pp(1−p)                       | p^2(1−p)  | 2
  8  |  S   S   S  | ppp                           | p^3       | 3

In Table 1 we obtained the probability of each one of the 8 possible cases by multiplying the individual probabilities of the outcome of each consecutive trial. We can do so because consecutive trials are independent, i.e., the result in a previous trial does not affect the probability of the following one. For n = 3 the random variable X can take four distinct values, say x = 0, 1, 2, 3, and we can find the explicit form of the probability function by adding the probabilities of the distinct cases that have the SAME value of x. We have that P[X = 0] = (1 − p)^3, because this can happen in only one way (Case 1 in Table 1). In a similar way, P[X = 1] = p(1 − p)^2 + p(1 − p)^2 + p(1 − p)^2 = 3p(1 − p)^2, by adding the probabilities of cases 2 to 4. Thus we find

P[X = 0] = (1 − p)^3, P[X = 1] = 3p(1 − p)^2, P[X = 2] = 3p^2(1 − p), P[X = 3] = p^3

or, putting the results in a table (Table 2).

Table 2. Binomial probabilities for n = 3

x | Coefficient | Case probability | P[X = x]
0 |      1      | (1 − p)^3        | (1 − p)^3
1 |      3      | p(1 − p)^2       | 3p(1 − p)^2
2 |      3      | p^2(1 − p)       | 3p^2(1 − p)
3 |      1      | p^3              | p^3

Also note that the probabilities sum to one:

Σ_{x=0}^{3} P[X = x] = (1 − p)^3 + 3p(1 − p)^2 + 3p^2(1 − p) + p^3 = (p + (1 − p))^3 = 1

The only part that we need to generalize to obtain an explicit formula for any value of n is the 'Coefficient' (second column in Table 2). This coefficient is the number of ways to obtain x successes from n trials, and it is called the 'Binomial coefficient' for obvious reasons. This coefficient is equal to

(n choose x) = n! / (x!(n − x)!)

where n! = n(n − 1)(n − 2) ··· 1 and by definition 0! = 1. Thus for example we have

(3 choose 0) = 3! / (0!(3 − 0)!) = 3!/3! = 1


(3 choose 1) = 3! / (1!(3 − 1)!) = 3!/(1! 2!) = 3
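These coefficients, and the resulting probabilities of Table 2, can be checked numerically; a quick sketch using R's built-in choose and dbinom functions (the value p = 0.5 is only illustrative):

```r
# Binomial coefficients for n = 3: should be 1 3 3 1
choose(3, 0:3)
# Binomial probabilities P[X = x] for n = 3 and p = 0.5
dbinom(x = 0:3, size = 3, prob = 0.5)
# The four probabilities must sum to 1
sum(dbinom(x = 0:3, size = 3, prob = 0.5))
```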

Can you see the reasoning to find this coefficient? (I leave this as a homework exercise). Now we can write the general expression for the binomial probability

P[X = x] = (n choose x) p^x (1 − p)^(n−x) = [n!/(x!(n − x)!)] p^x (1 − p)^(n−x)

I also leave the demonstration of this fact as an exercise; if you are interested, it can be proved by mathematical induction.

1.4. Estimation by maximum likelihood. Going back to our data, we can now assume, with very good reasons, that the number of resistant plants in the segregating population follows a Binomial distribution with n = 286. This is because each one of the plants can only be resistant (success) or susceptible (failure), and the data that we have are the number of successes (156) in n = 286 trials (or Bernoulli experiments). In this case the distribution depends on one unknown parameter, p, the probability of obtaining a resistant plant in the progeny of the parental one. We need to give an estimate of this parameter to be able to use our Binomial formula to calculate probabilities. Which value do you think is a 'good' estimate of the parameter? Very likely your answer to this question will be

p̂ = 156/286 ≈ 0.54545

The question is WHY this is your answer! Your almost automatic reasoning was based on the evidence that we have; you thought something like 'we observed 156/286 ≈ 0.54545 resistant plants in the population, thus it appears likely that the true value of the parameter is close to this quantity, i.e., it appears reasonable that p ≈ 0.54545'. And you are right, of course. By pure common sense you applied one of the principles of Statistics, estimation by maximum likelihood. In simple English the principle of maximum likelihood estimation (MLE) says:

Take as estimate of the parameter the value that maximizes the likelihood of observing the data that you have

And what is the 'likelihood'?
Well, the likelihood function is identical to the probability function, except that now we will look at it as a function of the parameter of interest (p in our case) and we will fix the value of the random variable to the one observed. In our case we know that n = 286 and x = 156, thus we will now view P[X = 156] = P[X = 156|n = 286, p] as a function of p, with x fixed; say, the likelihood of our data is

l(p) = P[X = 156|n = 286, p] = [286!/(156!(286 − 156)!)] p^156 (1 − p)^130 = c p^156 (1 − p)^130

Now, all that we need to do is to find the maximum of the function l(p) = c p^156 (1 − p)^130 over all possible values of p, 0 ≤ p ≤ 1. That value will be the maximum likelihood estimate of p, say p̂. Figure 2 presents the plot of the likelihood function l(p) for distinct values of p.

> put.p <- seq(from=0, to=1, by=0.001)  # grid of p values (reconstructed; the
>                                       # original definition was lost in the transcript)
> plot(put.p, dbinom(x=156, size=286, prob=put.p), type="l", xlab="p", ylab="l(p)",
+      main="Likelihood function for Binomial data n=286")
> abline(v=156/286, col="red", lty=2)
> plot(put.p, dbinom(x=156, size=286, prob=put.p), type="l", xlab="p", ylab="l(p)")
> abline(v=156/286, col="red", lty=2)
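Instead of reading the maximum off the plot, we can also locate it numerically; a small sketch using R's general-purpose optimize function (this numerical check is mine, not part of the original notes):

```r
# Numerically maximize the likelihood l(p) = dbinom(156, 286, p) over 0 <= p <= 1
opt <- optimize(function(p) dbinom(x = 156, size = 286, prob = p),
                interval = c(0, 1), maximum = TRUE)
opt$maximum   # very close to the intuitive estimate
156/286       # 0.5454545...
```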

[Figure 2 shows two panels of l(p) against p: (a) over all the interval 0 ≤ p ≤ 1, and (b) a close-up around p = 156/286.]

Figure 2. Likelihood function for Binomial data, n = 286

In fact, it is relatively easy to prove analytically that the MLE of p is given by

p̂ = X/n

and this is ALWAYS the value that maximizes the binomial likelihood3. It also happens that MLEs have good properties: they are 'consistent', i.e., they converge in probability to the true value of the parameter; they are asymptotically normally distributed; and they have a relatively small variance. We have seen that from the statistical point of view our 'best guess' for the value of p is p ≈ 0.54545; however, does it make sense from the genetic point of view? In science we make 'hypotheses' about nature, and we want to test them by questioning nature through experiments. In this example reasonable hypotheses must be based on (Mendelian) Genetics, because we are observing a phenomenon with a genetic basis. Thus we will want to perform a 'statistical test' to evaluate the evidence given by the data.

1.5. Genetic hypothesis for the example: First try. In the example we have a plant that is heterozygous for one or more loci affecting the resistance character. The simplest hypothesis is that only one locus is affecting the phenomenon. Under that situation we will have the following Mendelian scheme

Rr ⊗ Rr
⇓
1 RR : 2 Rr : 1 rr

Note that, given that the parental plant was of the 'resistant' phenotype, the dominant allele 'R' must confer the resistance; thus, if that is true, we will observe the phenotypic segregation

Rr (Resistant) ⊗ Rr (Resistant)
⇓
3 Resistant (R_) : 1 Susceptible (rr)

3 How to do that? This is left as a 'not so easy exercise'; anyone presenting a proof this afternoon will get a 10 (only the first student who sends an e-mail).


This genetic hypothesis implies a statistical one; if the segregation is 3 Resistant : 1 Susceptible, then our parameter p in the Binomial distribution of the data must be equal to 3/4 = 0.75. Is that the case? Let's formally put our statements in statistical terms:

H0 : Data follow a Binomial distribution with p = 0.75
Ha : Data follow a Binomial distribution with p ≠ 0.75

Now we need a 'Statistical Test' to decide if H0 is reasonable, i.e., if the data at hand support H0, which is called the 'null hypothesis' in statistical terms. If we 'reject' H0 this implies that we accept Ha, which is called the 'alternative hypothesis'. Note that one of the two hypotheses must be true because their union, H0 ∪ Ha, completely covers the parameter space, that is, all possible values of p. Let's give two definitions.

Test Statistic - A 'test statistic' is a function of the observations which, under the null hypothesis H0, has a known distribution

That is, we need some statistic that has a completely known distribution when the null hypothesis is true. If we can find such a statistic, we can evaluate the likelihood of the data assuming this hypothesis; this takes us to the second definition.

Statistical Test - A 'statistical test' is a rule to decide if the null hypothesis, H0, must be rejected.

Before applying a formal statistical test, let's assess the probabilities of finding the data that we have under the null (H0) and alternative (Ha) hypotheses. If we believe the null hypothesis we calculate

P[X = 156|n = 286, H0] = P[X = 156|n = 286, p = 3/4] = [286!/(156! 130!)] (3/4)^156 (1/4)^130

To evaluate the alternative hypothesis, Ha, we can 'believe' the result that we obtained and use as p the value obtained from the experiment, p̂ = 156/286; this is the maximum probability if Ha is true, say

P[X = 156|n = 286, Ha] = P[X = 156|n = 286, p = 156/286] = [286!/(156! 130!)] (156/286)^156 (130/286)^130

Evaluating in R,

> # Probability under the null hypothesis:
> dbinom(x=156, size=286, prob=3/4)
[1] 3.144458e-14
> # Probability under the alternative hypothesis:
> dbinom(x=156, size=286, prob=156/286)
[1] 0.04733422

This evaluation tells us that the probability of obtaining the observed data under the null hypothesis is much, much smaller than the probability under the (free) alternative hypothesis. In fact, the ratio of these quantities is immensely large,

> dbinom(x=156, size=286, prob=156/286) / dbinom(x=156, size=286, prob=3/4)
[1] 1.505322e+12

This means that Ha is approximately 1,500,000,000,000 times more likely than H0! It stands to reason that we must reject H0 in favor of Ha. Figure 3 shows the probabilities for each possible value of x = 0, 1, 2, ..., 286 under the two competing hypotheses. From Figure 3 we can see that the two distributions are very well separated, i.e., the more likely values (large probabilities) are clearly different in the two cases. The observed data do not support H0. We can conclude, then, that p ≠ 3/4.


Figure 3. Binomial probabilities under the hypotheses H0 (red) and Ha (blue)

1.6. The Maximum Likelihood Ratio Test. There are multiple statistical tests for the same situation, which can be confusing. However, all 'parametric' tests adjust to the definitions that we have given above, and the 'non-parametric' ones generally rely on approximating the distribution of the statistic by some kind of simulation or resampling procedure. Here we are going to present the Maximum Likelihood Ratio Test (MLRT), which is an approximate but general and powerful test. As its name implies, the MLRT relies on a statistic equal to the ratio of the likelihood functions; the test statistic takes the form

D = −2 log_e [ l(y|H0) / l(y|Ha) ]

where y are the data and l(y|H0), l(y|Ha) are the likelihood functions evaluated under the null and alternative hypotheses, respectively. D will be 0 only when l(y|H0) = l(y|Ha), and will give positive values when l(y|Ha) > l(y|H0); for example, if Ha is twice as likely as H0, we will get a value of D = −2 log_e(1/2) ≈ 1.386294; if Ha is ten times more likely than H0, we get a value of D = −2 log_e(1/10) ≈ 4.60517, etc. This is a test statistic (see definition above) because under very general conditions the distribution of the observed D when the null hypothesis is true is known, and turns out to be the χ² (chi-square) distribution with degrees of freedom equal to the number of restrictions imposed by the null hypothesis H0. In our example H0 imposes only one restriction on the distribution, say p = 3/4; thus to test the hypothesis we can compare the value of D with the value of the χ² distribution with 1 degree of freedom (df). Before doing the test we must fix a probability of 'Type I' error. This is the probability of erroneously rejecting H0 when it is true.
We call this probability α,

α = P[rejecting H0 | H0 is true]

We will reject H0 when the value of D is 'large', i.e., when the probability of obtaining a value of D as large as or larger than the one observed, assuming the null hypothesis, is small. In fact, the formal test can be written as

LRT - Reject H0 if D ≥ χ²(α, df).


where χ²(α, df) is the quantile of the χ² distribution, with the value of α selected by the researcher and the df as mentioned above. The selection of an appropriate value of α is sometimes complicated; the researcher must ask herself which error is 'small' enough for her particular case. For our example we can calculate the value of D using R,

> -2*log(dbinom(x=156, size=286, prob=3/4)/dbinom(x=156, size=286, prob=156/286))
[1] 56.08006

and we can calculate the critical values for, say, α = 0.05, 0.01, 0.001, in R:

> # Here the values of p correspond to the alphas
> qchisq(p=0.05, df=1, lower.tail=FALSE)
[1] 3.841459
> qchisq(p=0.01, df=1, lower.tail=FALSE)
[1] 6.634897
> qchisq(p=0.001, df=1, lower.tail=FALSE)
[1] 10.82757

From these calculated values we see that we reject H0 at α = 0.05, 0.01 and 0.001; we 'expected' this, because we have seen that the data do not agree with the null hypothesis. The previously explained way to judge the plausibility of the null hypothesis dates from the time before computers, when one needed to consult a printed table in a statistics book to make the decision. Currently, most scientific papers quote the value of 'P' associated with the test. This corresponds to calculating the probability of a value as extreme as or more extreme than the one obtained in the test; i.e., to calculating P[D ≥ d|H0], where D is a random variable while d is the value obtained by the researcher. We can calculate such probability in R as

> pchisq(56.08006, df=1, lower.tail = FALSE)
[1] 6.957922e-14

Note that the 'P' value is in this case extremely small, P ≈ 6.957922e-14, and of course points to the fact that we must reject H0. You can interpret the value of P as the probability of obtaining the data observed under the assumption that H0 is true, i.e., assuming that p = 3/4. We can see that this very small value of P is of comparable size to the inverse of the ratio of the probabilities that we obtained before, 1/1.505322e+12 ≈ 6.643097e-13.
This means that our informal calculation is close to the value obtained with the formal test; in fact, our result (which is a LRT in disguise) is more precise than the approximate LRT performed; our result is 'exact', and the difference is due to the approximation to the χ² distribution. There are other tests for this case, for example Pearson's χ² test, where the statistic is defined as

χ² = Σ_i (O_i − E_i)² / E_i

where O_i and E_i are the observed and expected values, respectively, in each one of the i classes. The value must be compared with χ²(α, df) to make the decision about H0. In fact, this was the test used by García-Neria and Rivera-Bustamante in their paper. However, the LRT is more powerful, and there is no excuse to use Pearson's χ² test when a better alternative exists. To obtain the value of this statistic and compare it with D is left as an exercise.
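To illustrate how the statistic works (with made-up coin-tossing counts, deliberately NOT the plant data of the exercise), suppose we observed 60 heads and 40 tails in 100 tosses of a supposedly fair coin:

```r
O <- c(60, 40)   # observed counts (hypothetical data)
E <- c(50, 50)   # expected counts under H0: p = 1/2
chi2 <- sum((O - E)^2 / E)
chi2             # (10^2)/50 + (10^2)/50 = 4
pchisq(chi2, df = 1, lower.tail = FALSE)  # the associated 'P' value
```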


1.7. Statistical Errors and Sample Sizes. Whenever a decision is taken, an error can be committed. Table 3 presents the contingency table for the 'States of Nature' (the true fact about H0) versus the decisions that can be taken. The inner cells present the errors (or correctness) of the decisions and their corresponding probabilities. Note that in Table 3 the sums of the probabilities (by column) are 1. (See Type I and II errors if you are curious).

Table 3. Results of a Statistical Test

             | H0 is true           | H0 is false
Reject H0    | Type I error, P = α  | Right, P = 1 − β
Accept H0    | Right, P = 1 − α     | Type II error, P = β

Type I errors are sometimes called 'false positives', while Type II errors are 'false negatives'. Ideally, we would want to have no errors at all, say α = 0 and β = 0; however, that is not possible (with finite samples!). In fact, in almost all cases it happens that decreasing α will increase β and vice versa. We have seen a way to 'control' the Type I error: just fix the value of α that you are willing to accept and this error is controlled. But what about β, the probability of Type II error (accepting H0 when in fact it is wrong)? The problem with fixing β is that it depends on 'how far' the true value of the parameter is from the one assumed by H0. Because the true value of the parameter is unknown, β cannot be easily controlled. However, increasing the sample size will usually decrease β for a fixed α. To understand why increasing the sample size decreases the probability of Type II error, we need to understand the concept of the variance of an estimator. We do not have time to see the details (see my notes of the previous course), but in general the variance of an estimator decreases with the sample size. For example, the variance of the mean (an estimator of µ in the normal distribution) is given by

V[X̄] = σ²/n

where X̄ is the arithmetic mean based on n observations of a variable with variance σ². The square root of V[X̄] is called the 'standard error',

se(X̄) = √V[X̄] = σ/√n

Thus, the variation that we can observe in repeated evaluations of the arithmetic mean decreases as the sample size, n, increases; with larger sample sizes we will have a more precise estimation and a smaller error probability in the associated statistical tests.
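A quick simulation sketch (not in the original notes) makes the σ/√n behaviour visible: for standard normal data (σ = 1), the standard deviation of repeated sample means should be close to 1/√n:

```r
set.seed(1)  # for reproducibility
# Standard deviation of 2000 sample means, for two sample sizes
sd.mean.10  <- sd(replicate(2000, mean(rnorm(10))))   # close to 1/sqrt(10)  ~ 0.316
sd.mean.100 <- sd(replicate(2000, mean(rnorm(100))))  # close to 1/sqrt(100) ~ 0.100
c(sd.mean.10, sd.mean.100)
```

Increasing n tenfold shrinks the standard error by a factor of √10 ≈ 3.16, as the formula predicts.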

In the case of binomial variates, the variance of the estimator P̂ is given by

V[P̂] = V[X/n] = p(1 − p)/n

and the corresponding standard error is

se(P̂) = √(p(1 − p)/n)

Figure 4 shows the probabilities of the P̂ estimator for Binomial distributions with p = 0.5 and increasing sample sizes. From Figure 4 we can see that, as the sample size increases, the probability function of the estimator becomes more 'concentrated' at the true value of the parameter, p = 0.5. This is a result of the fact that the variance of the estimator decreases with the increase in sample size, and thus a 'slimmer' density appears. Note that the data on the X-axis are normalized; thus for n = 50 values of x from 0 to 50 are shown, while for n = 100 x goes from 0 to 100, and for n = 500 x goes from 0 to 500. In general the sample size needed for a given experiment can be calculated using 1) an estimate of the variation that will be present in the experiment, 2) the 'minimum difference' that the researcher is


Figure 4. Probabilities of the estimator P̂ for binomial distributions with p = 0.5 and distinct sample sizes (n = 50, 100, 500)

interested in detecting as 'significant', and 3) a value of Type I error (α) that the researcher is willing to accept. Having these quantities it is possible (for univariate experiments) to calculate the number of 'experimental units' or replicates that will be needed in the experiment. This must be done before performing the experiment; otherwise there is a risk of performing a wasteful experiment!

1.8. Genetic hypothesis for the example: Second try. With the rejection of the null hypothesis p = 3/4 we need to look for a genetic alternative that will explain the data. The next level of complexity is to assume that the resistance is the result of two (independent) loci. Under that situation we will have the following Mendelian scheme

R1r1, R2r2 ⊗ R1r1, R2r2
⇓
1 R1R1, R2R2 : 2 R1R1, R2r2 : 1 R1R1, r2r2 : 2 R1r1, R2R2 : 4 R1r1, R2r2 : 2 R1r1, r2r2 : 1 r1r1, R2R2 : 2 r1r1, R2r2 : 1 r1r1, r2r2

Or, better, in a Punnett table. From Table 4 we can see that with this putative cross we have 16 combinations in 9 distinct genotypes.

Table 4. Punnett table of Genotypes

         | 1 R1R1        | 2 R1r1        | 1 r1r1
1 R2R2   | 1 R1R1, R2R2  | 2 R1r1, R2R2  | 1 r1r1, R2R2
2 R2r2   | 2 R1R1, R2r2  | 4 R1r1, R2r2  | 2 r1r1, R2r2
1 r2r2   | 1 R1R1, r2r2  | 2 R1r1, r2r2  | 1 r1r1, r2r2

We need to look for a 'phenotypic' function that could give a proportion of resistant individuals 'close' to the observed proportion p̂ = 156/286 ≈ 0.54545. Note that we are looking for some kind of pleiotropic or epistatic effect, because 9 distinct genotypes must map to two phenotypes: 'resistant' (R) or 'susceptible' (S). The first thing that we must take into account is that the genotype 'R1r1, R2r2' present in the parent must be 'resistant' (R); thus we obtain the following table

Table 5. Putative table of Genotypes and Phenotypes (1)

         | 1 R1R1           | 2 R1r1           | 1 r1r1
1 R2R2   | 1 R1R1, R2R2  ?  | 2 R1r1, R2R2  ?  | 1 r1r1, R2R2  ?
2 R2r2   | 2 R1R1, R2r2  ?  | 4 R1r1, R2r2  R  | 2 r1r1, R2r2  ?
1 r2r2   | 1 R1R1, r2r2  ?  | 2 R1r1, r2r2  ?  | 1 r1r1, r2r2  ?

From Table 5 we can see that if we assign the 'susceptible' (S) phenotype to all the genotypes with a question mark (?), we obtain a putative proportion of p = 4/16 = 0.25 resistant; this is far from the observed proportion of p̂ = 156/286 ≈ 0.54545 (we could do the formal test to see if that proportion is also rejected; in fact it is, and you could do the test as an exercise). What happens if we assign the 'susceptible' (S) phenotype to all individuals with 'r1r1' OR 'r2r2', and all others are 'resistant' (R)? Table 6 presents such results.

Table 6. Putative table of Genotypes and Phenotypes (2)

         | 1 R1R1           | 2 R1r1           | 1 r1r1
1 R2R2   | 1 R1R1, R2R2  R  | 2 R1r1, R2R2  R  | 1 r1r1, R2R2  S
2 R2r2   | 2 R1R1, R2r2  R  | 4 R1r1, R2r2  R  | 2 r1r1, R2r2  S
1 r2r2   | 1 R1R1, r2r2  S  | 2 R1r1, r2r2  S  | 1 r1r1, r2r2  S

In this case (Table 6) we find a proportion of resistant plants equal to p = 9/16 = 0.5625, promisingly 'close' to the observed proportion, p̂ = 156/286 ≈ 0.54545. As before we set our (new) null hypothesis,

H0 : Data follow a Binomial distribution with p = 9/16 = 0.5625

versus

Ha : Data follow a Binomial distribution with p ≠ 9/16

Before proceeding to do the formal test, we can compare the distributions of X near the observed value of x = 156. Figure 5 presents the binomial probabilities for 100 ≤ x ≤ 200 under the two hypotheses. In Figure 5 we can see that both distributions overlap, signaling that the two hypotheses do not produce very different results (compare Figure 5 with Figure 3, where we present the hypothesis p = 3/4).
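The 9/16 proportion of the double-recessive-epistasis model can be verified by brute force; a sketch that enumerates the 16 equally likely allele combinations of the dihybrid cross and applies the rule (object names are mine, not from the notes):

```r
# Each parent contributes one allele per locus, each allele with probability 1/2
cross <- expand.grid(locus1.mat = c("R1", "r1"), locus1.pat = c("R1", "r1"),
                     locus2.mat = c("R2", "r2"), locus2.pat = c("R2", "r2"))
# Susceptible only if homozygous recessive at locus 1 OR at locus 2
r1r1 <- cross$locus1.mat == "r1" & cross$locus1.pat == "r1"
r2r2 <- cross$locus2.mat == "r2" & cross$locus2.pat == "r2"
resistant <- !(r1r1 | r2r2)
mean(resistant)  # 9/16 = 0.5625
```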
Now we can compare the probabilities (and their ratio) under the new hypothesis,

> # Probability under the (new) null hypothesis:
> dbinom(x=156, size=286, prob=9/16)
[1] 0.03999901
> # (maximum) Probability under the alternative hypothesis:
> dbinom(x=156, size=286, prob=156/286)
[1] 0.04733422
> # Ratio of the two probabilities
> dbinom(x=156, size=286, prob=156/286) / dbinom(x=156, size=286, prob=9/16)
[1] 1.183385

In this case the probability of obtaining the observed value of x = 156 given the null hypothesis H0 ⇒ p = 9/16 and the maximum probability under Ha ⇒ p = 156/286 are close enough to consider them


Figure 5. Binomial probabilities under the hypotheses H0 (red) and Ha (blue)

compatible; their ratio is only 1.183385, and thus we do not have a reason to reject H0 ⇒ p = 9/16. But let's perform the formal LRT in R,

> # The test statistic (d) for the new null hypothesis:
> -2*log(dbinom(x=156, size=286, prob=9/16)/dbinom(x=156, size=286, prob=156/286))
[1] 0.3367578
> # The probability of such a result
> pchisq(0.3367578, df=1, lower.tail = FALSE)
[1] 0.5617067

This time we find that the value of the LRT statistic, d = 0.3367578, is NOT significant at all, P = 0.5617067; thus we cannot reject the null hypothesis H0 ⇒ p = 9/16, because a result like the one observed is 'very likely' under it. The interpretation is that the resistance is inherited in this case by a mechanism of 'double recessive epistasis', as presented in Table 6. As a homework exercise you can perform Pearson's χ² test and compare the results with the LRT presented here.

1.9. Confidence Intervals. We have seen that with the MLE we obtain a good 'estimate' of the true value of the parameter. However, it is a 'point' estimate, in the sense that it gives you a single value, without any idea of how confident you can be that it is close to the true value. It would be desirable to have an idea of the region where the true value of the parameter is likely to be found; this is the idea behind confidence intervals. We can see a definition and then show how to obtain an approximate confidence interval for our example of interest.

Confidence Interval - The interval (L(X), U(X)) is said to be a confidence interval (CI) of size 1 − α for the parameter θ if P[L(X) ≤ θ ≤ U(X)] = 1 − α.

Note that in this definition the quantities L(X) and U(X) (the lower and upper limits) are 'random variables', i.e., functions of the data X. This justifies the fact that we can make a probabilistic statement, because on the other hand the true value of the parameter, θ, is unknown but fixed.
This is important for the INTERPRETATION of a confidence interval. When we have the realized data, X = x, the quantities L(x) and U(x) are also fixed; thus we cannot state that 'the probability that the true value of the parameter is in (L(x), U(x)) is 1 − α'. That is incorrect. What can be said is that if we repeat the experiment 'many' times, and each time we calculate the CI, then in 1 − α of the cases the true value of the parameter will lie within those intervals.

Given that, in many cases, estimators have an approximately normal distribution, a very useful approximate confidence interval is given by

L(X) = θ̂ − 2 se(θ̂);  U(X) = θ̂ + 2 se(θ̂)

where θ̂ is the estimator of the parameter and se(θ̂) is its standard error. Under these conditions we have that

P[L(X) ≤ θ ≤ U(X)] ≈ 0.95

i.e., this is for α = 0.05 (the more accurate value in this formula, instead of 2, is 1.96, but 2 is easier to remember; if you are going to memorize a formula, pick this one!).

Even though there are many useful approximations to obtain a CI, in the majority of cases it is possible to calculate an 'exact' CI; this is possible when the true distribution is known and thus there is no need to use asymptotic results. We are going to explain the algorithm in the Binomial case, then apply it to our example and perform some simulations to obtain a better understanding of the interpretation.

In the case of a discrete variable –which, like the Binomial, can take only values in the natural numbers– we cannot achieve all possible values of α, simply because not all of them will be exactly obtainable. However, it is simple to obtain a CI centered at the estimated value of the parameter using the following algorithm:

(1) Input x, n and α (the number of successes obtained, the number of trials and the Type I error probability, respectively)
(2) Calculate p̂ = x/n (the estimated value of the probability of success)
(3) Set a = P[X = x | n, p̂] and i = 0
(4) Set i = i + 1, L = x − i, U = x + i, a = a + P[X = L | n, p̂] + P[X = U | n, p̂]
(5) Is a ≥ 1 − α? If NOT go back to 4; if YES continue
(6) Set L = L/n, U = U/n
(7) Output L, U, p̂, a

This algorithm is programmed in R in the function bin.con, which is presented in the Appendix.
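The course's own bin.con (listed in the Appendix) implements these steps; as an illustration, here is a minimal sketch of the algorithm under my own naming (bin.con.sketch and its default arguments are my assumptions, not the course code):

```r
# Sketch of the centered 'exact' binomial CI algorithm above.
# 'bin.con.sketch' is a hypothetical name; the course's bin.con
# (see the Appendix) may differ in details.
bin.con.sketch <- function(x=156, n=286, alpha=0.05) {
  p.est <- x/n                              # step 2: estimate of p
  a <- dbinom(x, size=n, prob=p.est)        # step 3: probability at the center
  i <- 0
  repeat {                                  # steps 4-5: widen symmetrically
    i <- i + 1
    L <- x - i
    U <- x + i
    a <- a + dbinom(L, size=n, prob=p.est) + dbinom(U, size=n, prob=p.est)
    if (a >= 1 - alpha) break               # stop once coverage reaches 1 - alpha
  }
  c(L=L/n, U=U/n, p.est=p.est, Conf=a)      # steps 6-7
}
bin.con.sketch()   # should agree with the alpha=0.05 interval for our example
```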
Evaluating this function at values of α = 0.05, 0.01, 0.005 and 0.001 for our example we find

> bin.con() # With defaults, x=156, n=286, alpha=0.05
        L         U     p.est      Conf
0.4895105 0.6013986 0.5454545 0.9501328
> bin.con(alpha=0.01) # With defaults, x=156, n=286 and alpha=0.01
        L         U     p.est      Conf
0.4685315 0.6223776 0.5454545 0.9925665
> bin.con(alpha=0.005) # With defaults, x=156, n=286 and alpha=0.005
        L         U     p.est      Conf
0.4615385 0.6293706 0.5454545 0.9964522
> bin.con(alpha=0.001) # With defaults, x=156, n=286 and alpha=0.001
        L         U     p.est      Conf
0.4475524 0.6433566 0.5454545 0.9993128

Note that the true value of the parameter (after our hypothesis test) is p = 9/16 = 0.5625, and this value is included (between the lower and upper limits) in all the intervals calculated. Calculating a CI is in some sense equivalent to a hypothesis test: if, after calculating the interval (at a given α), we see that the value determined by our null hypothesis (in this case H0 ⇒ p = 9/16) is included in the interval, then
we cannot reject H0; otherwise, if the value is NOT included in the interval, we reject the corresponding hypothesis with 1 − α 'confidence' (a significance of at least α).

To complete this section I want to perform a simulation of 100 cases of the Binomial distribution with parameters n = 286, p = 9/16. In each case I will calculate the CI at α = 0.05 and see in how many of the cases we find the true parameter within the intervals. Note that the expected value of X, E[X] = np = 286 × 9/16 = 160.875 ≈ 161, is not a whole number. Here is the procedure for the simulation (it will be explained in class):

> # An empty data frame to hold the simulation results
> my.res <- data.frame(x=rep(NA,100), L=NA, U=NA, p.est=NA, Conf=NA)
> my.res[1:5,]
   x  L  U p.est Conf
1 NA NA NA    NA   NA
2 NA NA NA    NA   NA
3 NA NA NA    NA   NA
4 NA NA NA    NA   NA
5 NA NA NA    NA   NA
> # Simulate the 100 Binomial results
> my.res[,1] <- rbinom(n=100, size=286, prob=9/16)
> my.res[1:5,]
    x  L  U p.est Conf
1 154 NA NA    NA   NA
2 153 NA NA    NA   NA
3 155 NA NA    NA   NA
4 171 NA NA    NA   NA
5 142 NA NA    NA   NA
> summary(my.res$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    141     155     160     161     167     181
> summary(my.res$x)/286
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.4930070 0.5419580 0.5594406 0.5629371 0.5839161 0.6328671
> var(my.res$x)
[1] 71.57414
> var(my.res$x)/286
[1] 0.2502592
> # Calculate the CI for each simulated result
> for(i in 1:100) my.res[i,2:5] <- bin.con(x=my.res[i,1])
> my.res[1:5,]
    x         L         U     p.est      Conf
1 154 0.4790210 0.5979021 0.5384615 0.9622699
2 153 0.4755245 0.5944056 0.5349650 0.9621704
3 155 0.4825175 0.6013986 0.5419580 0.9623788
4 171 0.5419580 0.6538462 0.5979021 0.9535973
5 142 0.4370629 0.5559441 0.4965035 0.9617003
> summary(my.res)
       x             L                U              p.est            Conf
 Min.   :141   Min.   :0.4336   Min.   :0.5524   Min.   :0.4930   Min.   :0.9501
 1st Qu.:155   1st Qu.:0.4825   1st Qu.:0.6014   1st Qu.:0.5420   1st Qu.:0.9508
 Median :160   Median :0.5035   Median :0.6154   Median :0.5594   Median :0.9524
 Mean   :161   Mean   :0.5062   Mean   :0.6199   Mean   :0.5631   Mean   :0.9546
 3rd Qu.:167   3rd Qu.:0.5280   3rd Qu.:0.6399   3rd Qu.:0.5839   3rd Qu.:0.9617
 Max.   :181   Max.   :0.5769   Max.   :0.6888   Max.   :0.6329   Max.   :0.9624
> # Is the true value of p within each interval?
> p.in.CI <- (my.res$L <= 9/16) & (9/16 <= my.res$U)
> sum(p.in.CI)
[1] 94
> sum(p.in.CI)/100
[1] 0.94
my.col
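As a complementary check (my own sketch, not part of the course code), the same coverage experiment can be run with the quick normal-approximation interval p̂ ± 2 se(p̂) from the beginning of this section; the seed is an arbitrary choice of mine:

```r
# Sketch: coverage of the normal-approximation CI p.hat +/- 2*se(p.hat)
# over 100 simulated Binomial(n=286, p=9/16) samples.
# The seed is an arbitrary choice, used only for reproducibility.
set.seed(123)
x <- rbinom(100, size=286, prob=9/16)
p.hat <- x/286
se <- sqrt(p.hat*(1 - p.hat)/286)
covered <- (p.hat - 2*se <= 9/16) & (9/16 <= p.hat + 2*se)
mean(covered)   # proportion of intervals containing the true p; close to 0.95
```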