Correcting for Classification Errors when Estimating the Number of Genes Using Recombinant Inbred Chromosome Lines

K. M. Eskridge,* M. M. Shah, P. S. Baenziger, and D. A. Travnicek

ABSTRACT

Techniques based on intercultivar chromosome substitution lines in wheat (Triticum aestivum L.) have been used to identify and locate the genes controlling quantitative traits on a specific chromosome. For a particular trait, the number of segregating loci affecting differences between two parental lines is generally determined from the frequency distribution of recombinant inbred chromosome lines (RICLs). Recombinant inbred chromosome lines are the inbred progeny of crosses between a chromosome substitution line and its parent cultivar. The determination of the presence and the number of segregating loci becomes difficult (i) when the distribution exhibits no clear discrete classes, (ii) when there is a considerable chance of misclassifying lines into parental and recombinant types, and (iii) when loci are linked. We describe an approach to estimate the number of segregating loci responsible for the difference between a chromosome substitution line and parental cultivar using the derived RICLs when classification errors are likely. We also discuss the effects of linked loci on the estimates. The method was used to estimate the number of genes on chromosome 3A controlling grain yield, kernels spike⁻¹, kernel weight, spikes m⁻², grain volume weight, plant height, and anthesis date in wheat.

For a better understanding of the genetics of continuous variation, it is important that the genes controlling quantitative characters be identified so that their individual properties may be investigated. Techniques based on the use of intercultivar chromosome substitution lines have been used to locate genes controlling important agronomic traits in wheat (Law, 1966; Zemetra et al., 1986; Berke et al., 1992; Muira and Worland, 1994). This technique exploited an intercultivar chromosome substitution line in which a single homologous pair of chromosomes in one cultivar has been replaced by its homologues from a second cultivar (Sears, 1953). However, this technique provides information only concerning the chromosomal location of genes, rather than their number and linkage relationship on the chromosome. Identifying the number and location of genes for agronomic traits on a chromosome is important for breeding strategies.

Recombinant inbred chromosome lines developed between a chromosome substitution line (containing the chromosome of interest) and the parental cultivar (containing the homologue for the chromosome of interest) provide a useful tool to determine the number of genes and nature of gene action, controlling economically important quantitative traits (Law, 1966). Recombinant inbred chromosome lines are partial substitution lines that carry one or a few chromosomal segments instead of the whole substitution chromosome in an otherwise common genetic background. Generally, two strategies have been adopted to determine the number and location of genes on a chromosome using RICL populations, namely, (i) interpreting the phenotypic distribution of the trait and (ii) following the monogenic inheritance of marker genes (e.g., disease resistance genes) linked to the quantitative trait of interest (Law, 1966, 1967; Law et al., 1976; Snape et al., 1985). However, these approaches are restrictive for the determination of gene number and the nature of gene action. The first approach depends entirely on the recognition of discrete classes; where such discontinuities are not found, it is difficult to estimate the number of genes controlling a quantitative trait. In the second case, for a single chromosome, there often are not enough gene markers available in wheat. Overcoming this difficulty is important, and it would be useful for plant breeders using RICLs to know the number of loci ($k$) that differ between the chromosome substitution line and the parent cultivar and that control a desired quantitative trait.

The method of Wehrhahn and Allard (1965) may be used to estimate the number of segregating loci responsible for differences between a chromosome substitution line and a parent cultivar and to test hypotheses. The Wehrhahn and Allard method has several advantages over other biometrical approaches, such as the Castle-Wright model, since the method can circumvent problems due to transgressive segregation and variation among loci for allelic effects (Lynch and Walsh, 1998). However, the Wehrhahn and Allard method has several limitations when using RICLs. With field data, there is often a considerable chance of incorrectly classifying RICLs as either parental or nonparental types. Wehrhahn and Allard's method does not account for these errors, and the estimate of $k$ can be seriously biased when the probability of misclassification is large. In addition, Wehrhahn and Allard's method is based on the assumption of unlinked loci. However, RICLs differ by only one chromosome, or a part of a chromosome, resulting in a good chance of tightly linked loci, which may seriously bias Wehrhahn and Allard's estimate.

The objective of this study was to describe an approach to estimate and test hypotheses about the number of loci with genes that differ between a parent cultivar and a chromosome substitution line when studying RICL populations. The method explicitly incorporates errors of incorrectly classifying RICLs as either parental or nonparental types, and we consider how the estimates are affected if loci are linked. The method is used to estimate the number of genes controlling yield, yield components (kernels spike⁻¹, 1000-kernel weight, spikes m⁻²), grain volume weight, plant height, and anthesis date in wheat.

K.M. Eskridge and D.A. Travnicek, Dep. of Biometry, University of Nebraska, Lincoln, NE 68583; M.M. Shah and P.S. Baenziger, Dep. of Agronomy, University of Nebraska, Lincoln, NE 68583. Contributions from NE, Agriculture Research Division, Journal Series no. 12525. Received 23 Apr. 1999. *Corresponding author (keskridge1@unl.edu).

Abbreviations: ANOVA, analysis of variance; CNN, 'Cheyenne'; G × E, genotype × environment interaction; RICLs, recombinant inbred chromosome lines; WI, 'Wichita'.

Published in Crop Sci. 40:398–403 (2000).


THEORY AND METHODS

The strategy for the development of RICLs was developed by Law (1966, Fig. 1) and later modified and improved by Yen and Baenziger (1992) to remove potential background heterogeneity. An F1 is produced by crossing a substitution line (containing the chromosome of interest) to the parent cultivar (containing a homologue of the chromosome of interest). In the F1 genotype, a single chromosome pair will be heterozygous in a uniform genetic background. By crossing this F1 plant as a male parent to the parent cultivar, monosomic for the chromosome being investigated, monosomic progeny can be obtained which will be hemizygous (being monosomic) either for a recombinant chromosome or, if no recombination occurred, a nonrecombinant chromosome. By selfing these monosomic lines it is possible to obtain euploid (disomic) individuals that will be homozygous. As a result, true-breeding RICLs can be produced in only two generations, and propagated until enough seed is available for replicated trials. The replicated trials are used for determining the number of RICLs that differ in response from the parent cultivar.

At any particular locus (with alleles $A$ and $a$, where $a$ is from the parental cultivar), the probability that a RICL differs from the parent cultivar is $q = 1/2$ (Wehrhahn and Allard, 1965). This follows since the cross of two pure lines results in homozygous disomic lines of which one-half have allele $A$. Thus the proportion of all lines that are $A$, that is, differ from the parent cultivar, is $q = 1/2$. When the loci are unlinked, the expected proportion of lines that differ from the parent cultivar at one or more of the $k$ loci is $1 - (1 - q)^k$ (Wehrhahn and Allard, 1965; Mulitze and Baker, 1985). If there are $m$ lines and $r$ differ from the parent, the estimated proportion is $\hat{p} = r/m$.

The number of loci with genes affecting the difference between the parental and substitution line for the trait of interest may be estimated by setting the observed proportion of RICLs that differ from the parental cultivar equal to $1 - (1 - q)^k$ and solving for $k$ (Wehrhahn and Allard, 1965; Mulitze and Baker, 1985). For example, given $m$ lines with $r$ RICLs that differ from the parent cultivar, and $m - r$ that do not, the proportion of lines that differ from the parent is $\hat{p} = r/m$. Solving the equation $\hat{p} = 1 - (1 - q)^k$ for $k$ with $q = 1/2$ gives

$$\hat{k} = -1.4427\,\ln(1 - \hat{p}) \qquad [1]$$

Derived in this way, $\hat{k}$ is the moment estimator of $k$, but $\hat{k}$ can also be shown to be the maximum likelihood estimator of $k$ (Agresti, 1990; Eskridge and Coyne, 1996). $\hat{k}$ is based on the assumptions of no epistasis, no linkage, and normal diploid meiosis. Any deviation from these assumptions will probably decrease the precision of the estimates as stated in Mulitze and Baker (1985). In the case of RICLs, the assumption of no linkage may not be justified since the lines differ from the parent only by one chromosome or by a part of a chromosome. Loci on the same chromosome may be linked. Linked loci will bias $\hat{k}$, with the direction and magnitude of the bias depending on the form (coupling or repulsion) and strength of the linkage. For example, let $A_i$ ($i = 1, \ldots, k$) be an allele of the $i$th locus from the substitution line that improves the trait compared with the parent, and let $a_i$ be the allele from the parent for the same locus. The probability that all $k$ loci contain the parental alleles is $P(\bigcap_{i=1}^{k} a_i)$. If it can be assumed that all loci that affect differences between the parent and the substitution line are either unlinked or in coupling phase, then for two linked loci, $i$ and $j$, $P(a_i \cap a_j) > P(a_i)P(a_j)$. Thus,

$$P\!\left(\bigcap_{i=1}^{k} a_i\right) > P(a_k)\,P(a_{k-1}) \cdots P(a_1) \qquad [2]$$

Now if $P(a_i) = 1/2$, Eq. [2] results in $P(\bigcap a_i) > (1/2)^k$. The estimate $\hat{k}$ is chosen to make $(1/2)^{\hat{k}} = P(\bigcap a_i)$, where $1 - P(\bigcap a_i)$ is estimated from the data using $\hat{p}$. Since with the true $k$, $(1/2)^k$ is less than $P(\bigcap a_i)$, the estimate $\hat{k}$ will generally be smaller than the true $k$. That is, $\hat{k}$ is an underestimate of $k$ if two or more loci are in coupling phase and the remaining are independent. Similar reasoning can be used to show that if some pairs of loci are in repulsion phase while the remaining are unlinked, $P(a_i \cap a_j) < P(a_i)P(a_j)$ and $\hat{k}$ will overestimate $k$.

The estimate $\hat{k}$ is also based on the assumption of correct classification of the RICLs into parental or nonparental types. In practice, a parental line may be incorrectly classified as a nonparental line (Type I error) or a nonparental line may be incorrectly classified as a parental line (Type II error). Failure to account for these errors will result in biased estimates of $k$. An unbiased estimate of $k$ is possible only if it is based on an unbiased estimate of the true proportion ($P$) of RICLs that differ from the parental line (Mulitze and Baker, 1985). In most previous applications, $\hat{p}$ has been used as an estimate of $P$. With classification errors, $\hat{p}$ is a biased estimate of $P$ and thus using $\hat{p}$ in Eq. [1] will lead to a biased estimate of $k$. With large classification errors, this bias can be severe.

To obtain an unbiased estimate of $P$, assume that $m$ RICLs are available for classification where the presence of one or more substitution line alleles ($A_i$) makes the RICL different from the parental cultivar. Suppose that each nonparental RICL has probability $1 - \beta$ of being correctly classified as a nonparental line [i.e., $1 - P(\text{Type II error}) = 1 - \beta$], while the probability that a parental RICL (i.e., one that contains only parental alleles, $a_i$) is incorrectly classified as nonparental is $\alpha$, where $P(\text{Type I error}) = \alpha$. Let $y$ denote the unknown number of nonparental RICLs out of $m$. If $x$ is the number of RICLs correctly classified as nonparental and $w$ is the number of the $m - y$ parental RICLs that are incorrectly classified as nonparental lines, then the estimated number of nonparental RICLs is $r = x + w$. Kotz and Johnson (1982) show that $r$ has a binomial distribution with parameters $m$ and $P(1 - \beta) + (1 - P)\alpha$. To obtain an unbiased estimate of $P$, set $r/m$ (the biased estimate of $P$) equal to $P(1 - \beta) + (1 - P)\alpha$ and solve for $P$, which results in the following formula:

$$\hat{P} = \frac{(r/m) - \alpha}{1 - \beta - \alpha} \qquad [3]$$
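As a minimal sketch, Eq. [1] and Eq. [3] can be written as two small Python functions. The function names are illustrative and not part of the original analysis; the example values ($r = 34$, $m = 50$, $\alpha = 0.05$, $\beta = 0.014$) are the anthesis date entries of Table 1.

```python
import math

LOG2_E = 1.0 / math.log(2.0)  # about 1.4427, the constant in Eq. [1]


def k_uncorrected(p_hat):
    """Eq. [1]: solve p_hat = 1 - (1/2)**k for k."""
    return -LOG2_E * math.log(1.0 - p_hat)


def p_corrected(p_hat, alpha, beta):
    """Eq. [3]: proportion adjusted for Type I (alpha) and Type II (beta) errors."""
    return (p_hat - alpha) / (1.0 - beta - alpha)


# Anthesis date in Table 1: 34 of 50 lines classified as nonparental.
p_hat = 34 / 50
k_hat = k_uncorrected(p_hat)
k_adj = k_uncorrected(p_corrected(p_hat, alpha=0.05, beta=0.014))
print(f"k_hat = {k_hat:.3f}, k_adj = {k_adj:.3f}")
```

Under these inputs the two printed values should be close to the 1.644 and 1.613 reported for anthesis date in Table 1. The sketch assumes $\alpha < r/m < 1 - \beta$; the boundary cases are handled separately in the text.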

Since $r/m$ is obtained from the experiment and $\alpha$ is set by the researcher when classifying lines, $\hat{P}$ may be computed given an estimate of $\beta$. Ideally, $\beta$ would be estimated given the sample size, an estimate of the experimental error variance, and the mean difference between the parent cultivar and the nonparental lines. An estimate of $k$, adjusted for classification errors, could then be obtained by using $\hat{P}$ instead of $\hat{p}$ in Eq. [1]. In most applications, it will not be possible to estimate $\beta$ directly since rarely will the geneticist know the mean difference between the parent cultivar and the nonparental lines. This mean difference depends on $k$, giving rise to a seemingly circular argument: to obtain an error-adjusted estimate of $k$, one must know $\beta$, but to estimate $\beta$, one must know $k$. One


CROP SCIENCE, VOL. 40, MARCH–APRIL 2000

solution to this problem is to use an iterative scheme to estimate $k$. Begin with an initial $k^{(i)}$ at $i = 0$. Use $k^{(i)}$ to estimate $\beta^{(i)}$; then, using $\beta^{(i)}$ in Eq. [3] and [1], estimate $k^{(i+1)}$. Substitute $k^{(i+1)}$ for $k^{(i)}$ and continue until $|k^{(i+1)} - k^{(i)}|$ is small.

To estimate $k$ using this iterative scheme, it is necessary to specify how the mean difference between the parent cultivar and the nonparental lines is influenced by $k$. If loci are unlinked, each with two alleles ($A_i$, $a_i$; $i = 1, \ldots, k$) and each $A_i$ allele having the same effect ($\Delta$), then it may be shown that the mean difference between the parent cultivar and the nonparental lines is $[2^{k-1}/(2^k - 1)](\mu_s - \mu_p)$, where $\mu_s$ and $\mu_p$ are the means of the substitution line and parent cultivar, respectively (see Appendix). Estimating this mean difference, using it to compute $\beta$ based on the standard power formula (e.g., see Eq. [5.30] in Steel and Torrie, 1980), substituting this $\beta$ into Eq. [3], and substituting Eq. [3] into Eq. [1] results in an iterative equation for $k$:

$$k^{(i+1)} = -1.4427\,\ln\!\left[1 - \frac{\hat{p} - \alpha}{1 - \alpha - \Phi\!\left\{Z_{1-\alpha} - t\left[2^{k^{(i)}-1}/(2^{k^{(i)}} - 1)\right]\right\}}\right] \qquad [4]$$

where

$t = |\bar{x}_s - \bar{x}_p| / \sqrt{MS_e/r_1}$
$\bar{x}_s$, $\bar{x}_p$ = means of substitution line and parent cultivar
$MS_e$ = appropriate mean square error for testing effects of entries
$r_1$ = number of plots in computation of $\bar{x}_s$ and $\bar{x}_p$
$\Phi(\cdot)$ = standard normal cumulative distribution function
$Z_{1-\alpha}$ = "z" value such that $\Phi(Z_{1-\alpha}) = 1 - \alpha$.

Iterative use of Eq. [4] will generally converge to unique $k$ estimates, which do not depend on the initial value $k^{(0)}$. Here we used $k^{(0)} = 1$. However, some cases may occur where final estimates depend on the starting values. In such cases, care should be used in interpreting estimates of $k$. Once a final estimate of $k$ has been obtained, weighted least-squares, as described below, may then be used on this new estimator to obtain standard errors and to test hypotheses. Weighted least-squares estimates are not possible if $\hat{P} \le 0$ or $\hat{P} \ge 1$. To avoid such values of $\hat{P}$, $\hat{P}$ is set to 0.01 when $(r/m) \le \alpha$, and $\hat{P}$ is set to 0.99 if $(r/m) \ge 1 - \beta$. See Agresti (1990) for a technical discussion of the effects of adding small constants to obtain weighted least-squares estimates.

Standard errors and hypothesis tests may be based on weighted least-squares (Grizzle et al., 1969; Eskridge and Coyne, 1996). Assume there are $s$ independent groups, where group is any factor (e.g., environment or trait) thought to explain variation among the $k$ values. For the $i$th group, $\hat{P}_i$ is obtained from Eq. [3] with the final estimate of $\beta$ based on the final estimate of $k$ from Eq. [4]. Also, let $m_i$ be the number of lines in the $i$th group ($i = 1, \ldots, s$) and $\hat{k} = (\hat{k}_1 \ldots \hat{k}_s)$, where $k_i$ is a function of $P_i$ as expressed via Eq. [1] using $\hat{P}_i$. The covariance matrix of $\hat{k}$, $S$, is an $s \times s$ diagonal matrix with $S_i$ ($i = 1, \ldots, s$) values on the diagonal, where $S_i \approx H_i^2 V_i$, $V_i = \mathrm{var}(\hat{P}_i) = (1/m_i)\hat{P}_i(1 - \hat{P}_i)$, and $H_i = dk_i/d\hat{P}_i$ evaluated at $P_i = \hat{P}_i$. Now define the model $k = X\beta$, where $X$ is an $s \times u$ design matrix and $\beta$ is a $u \times 1$ vector of coefficients.
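The iterative scheme of Eq. [4] can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' program: the function names, the tolerance, and the iteration cap are hypothetical choices, and the value $t = 2.4$ in the example is invented for demonstration (only $\hat{p} = 0.38$ comes from Table 1). The limits 0.01 and 0.99 on $\hat{P}$ follow the rule described above.

```python
import math
from statistics import NormalDist

LOG2_E = 1.4427          # 1 / ln(2), the constant in Eq. [1]
_NORMAL = NormalDist()   # standard normal: Phi is cdf, Z is inv_cdf


def mean_diff_factor(k):
    """Return 2**(k-1) / (2**k - 1), the multiplier of (mu_s - mu_p)."""
    return 2.0 ** (k - 1.0) / (2.0 ** k - 1.0)


def update_k(k, p_hat, t, alpha=0.05):
    """One pass of Eq. [4]: current k -> beta -> P-hat (Eq. [3]) -> new k (Eq. [1])."""
    beta = _NORMAL.cdf(_NORMAL.inv_cdf(1.0 - alpha) - t * mean_diff_factor(k))
    if p_hat <= alpha:
        big_p = 0.01                      # lower limit described in the text
    elif p_hat >= 1.0 - beta:
        big_p = 0.99                      # upper limit described in the text
    else:
        big_p = (p_hat - alpha) / (1.0 - beta - alpha)
    return -LOG2_E * math.log(1.0 - big_p)


def iterate_k(p_hat, t, k0=1.0, alpha=0.05, tol=1e-6, max_iter=500):
    """Repeat update_k from k0 until successive values differ by less than tol."""
    if k0 <= 0:
        raise ValueError("k0 must be positive; the factor is undefined at k = 0")
    k = k0
    for _ in range(max_iter):
        k_new = update_k(k, p_hat, t, alpha)
        if abs(k_new - k) < tol:
            return k_new
        k = k_new
    return k  # not converged within max_iter; interpret with caution


if __name__ == "__main__":
    # p_hat = 0.38 is the 1000-kernel weight value in Table 1; t = 2.4 is hypothetical.
    for k0 in (1.0, 2.0):
        print(k0, round(iterate_k(0.38, t=2.4, k0=k0), 3))
```

Running the update from more than one starting value, as in the loop at the end, is a cheap way to check whether the final estimate depends on $k^{(0)}$.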
Estimated weighted least-squares may be used to estimate $\beta$: $\hat{\beta} = (X'S^{-1}X)^{-1}X'S^{-1}\hat{k}$, which has the covariance matrix $(X'S^{-1}X)^{-1}$. Estimates of various quantities (e.g., $\hat{k}_i$, $\hat{k}_i - \hat{k}_j$, etc.) may be obtained by appropriately defining a $1 \times u$ vector of constants, $l'$, and computing $l'\hat{\beta}$. The standard error of the estimate $l'\hat{\beta}$ is $[l'(X'S^{-1}X)^{-1}l]^{1/2}$. Any linear hypothesis that may be stated as $H_0$: $L\beta = 0$, where $L$ is a $c \times u$ matrix of rank $c$ with $c \le u$, may be tested with $X^2 = (L\hat{\beta})'[L(X'S^{-1}X)^{-1}L']^{-1}(L\hat{\beta})$, which is asymptotically chi-square with $c$ degrees of freedom when $H_0$ is true. The GENMOD procedure in SAS may be used to obtain weighted least-squares estimates and standard errors using appropriately defined link and inverse-link statements (FWDLINK and INVLINK; see Appendix) (SAS Institute, 1997). In this application, trait is used as the group factor to simplify programming.

To demonstrate the approach, data were used from a study on the inheritance of yield, kernels spike⁻¹, 1000-kernel weight, spikes m⁻², grain volume weight, plant height, and anthesis date, using a population of RICLs for chromosome 3A of hexaploid wheat (Shah et al., 1999a). Chromosome 3A of winter wheat 'Wichita' (WI) differs from that of 'Cheyenne' (CNN) for a number of important quantitative traits (Berke et al., 1992). Fifty recombinant inbred chromosome lines for chromosome 3A were obtained from a cross between the hard red winter wheat CNN and a chromosome substitution line CNN(WI3A), where chromosome 3A of WI was substituted for its homologue in CNN. In the F1, the only effective recombination occurs between the WI3A and CNN3A chromosomes, as all the other chromosomes should be from CNN. The resulting crossover products were isolated by crossing the F1 (as male) to the parent cultivar, CNN monosomic for chromosome 3A, as female, and selecting the monosomic progeny (recombinant monosomic lines). Upon selfing the recombinant monosomic lines and selecting the disomic progeny, 50 homozygous RICLs were developed in the CNN background. The selection of monosomic or disomic lines was carried out by cytological examination of root-tip cells. These 50 RICLs were grown in replicated field trials using a randomized complete block design during 3 yr (1994–1996) in four to nine diverse environments (Shah et al., 1999a).
Single degree of freedom contrasts were tested, using the genotype × environment (G × E) mean square as the error variance, to identify which of the 50 lines (RICLs-3A) differed significantly (P < 0.05) from the parental cultivar (CNN) for each of the seven agronomic traits. Using the number of lines (RICLs-3A), out of 50, that significantly differed from the parent cultivar (CNN), the weighted least-squares approach was used to estimate the number of loci with gene(s) affecting the difference between the parental cultivar CNN and the chromosome substitution line CNN(WI3A) for these traits. Two sets of estimates were computed: (i) ignoring classification errors and (ii) correcting for classification errors.

RESULTS AND DISCUSSION

Correcting for misclassification error had very little impact on the estimated number of genes that differed between CNN and CNN(WI3A) for all traits (Table 1). Both the uncorrected and the corrected estimates ($\hat{k}$ and $\hat{k}_{adj}$) indicated that a single locus (or possibly a group of tightly linked genes) was segregating for 1000-kernel weight and plant height, while two loci were segregating for anthesis date (Table 1). Neither the corrected nor the uncorrected estimates indicated differences between CNN and CNN(WI3A) in the number of genes for grain yield, kernels spike⁻¹, spikes m⁻², and grain volume weight. These traits are known to have large G × E interaction effects, which probably obscured the genetic effects for the trait of interest, making it difficult to identify differences between the RICLs and CNN.


Table 1. Number of lines out of 50 classified as significantly different ($\alpha = 0.05$) from 'Cheyenne' (CNN) ($r$) or not significantly different ($m - r$), uncorrected ($\hat{p}$) and corrected ($\hat{P}$) proportions to adjust for classification errors, estimates of the number of genes that differ between CNN and a chromosome substitution line CNN(WI3A), where chromosome 3A of 'Wichita' was substituted for its homologue in CNN, using results from Shah et al. (1999b) ($\hat{k}_s$), estimates uncorrected ($\hat{k}$) and corrected ($\hat{k}_{adj}$) for classification errors, estimated probability of a Type II error ($\beta$), and standard errors (SE) of gene number estimates.

Trait                r    m − r   p̂      P̂        k̂s††   k̂       SE      β       k̂adj         SE
Yield                5    45      0.10   0.071    0      0.152   0.068   0       0.078        0.068
Kernels spike⁻¹      0    50      0.00   0.010†   1      0       ¶       0       0.015‡       ¶
1000-kernel weight   19   31      0.38   0.452    0      0.690   0.160   0.049   0.658(7)‡§   0.173
Spikes m⁻²           2    48      0.04   0.010†   1      0.059   0.042   0       0.015‡       ¶
Grain vol. wt.       0    50      0.00   0.010†   0      0       ¶       0       0.015‡       ¶
Plant ht.            18   32      0.36   0.397    1      0.644   0.153   0.011   0.578        0.156
Anthesis date        34   16      0.68   0.663    1#     1.644   0.297   0.014   1.613        0.311

† Upper limit (0.99) and lower limit (0.01) set to obtain estimates using weighted least squares.
‡ Weighted least-squares upper limit (6.64) and lower limit (0.015) resulting from the limits described above.
§ Using $k^{(0)} = 1$ as starting value in Eq. [4] gave $\hat{k}_{adj} = 0.658$, but when $k^{(0)} = 2$, $\hat{k}_{adj} = 6.64$.
¶ Standard errors not relevant since estimates are constants.
# From Shah et al. (1999a).
†† Using Bonferroni corrected level of significance (0.00385) on the single-factor ANOVA F test from Shah et al. (1999b).
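As a rough check, the uncorrected $\hat{k}$ and SE columns of Table 1 can be recomputed from $r$ and $m$ alone, using Eq. [1] and the variance approximation $S_i \approx H_i^2 V_i$ evaluated at $\hat{p}$. The sketch below is illustrative: the function names are hypothetical, and the Wald-type test is the single-group, one-degree-of-freedom special case of the $X^2$ statistic described in the methods.

```python
import math
from statistics import NormalDist

M = 50  # RICLs evaluated per trait
R = {   # lines classified as different from CNN (column r of Table 1)
    "Yield": 5,
    "Kernels per spike": 0,
    "1000-kernel weight": 19,
    "Spikes per m2": 2,
    "Grain volume weight": 0,
    "Plant height": 18,
    "Anthesis date": 34,
}


def k_hat_and_se(r, m):
    """Uncorrected k-hat (Eq. [1]) and its SE, |H| * sqrt(V), at p-hat = r/m."""
    p = r / m
    k = math.log2(1.0 / (1.0 - p))
    # H = dk/dp = 1 / ((1 - p) ln 2) and V = p (1 - p) / m
    se = math.sqrt(p / ((1.0 - p) * m)) / math.log(2.0)
    return k, se


def wald_p_value(k, se, k_null):
    """Two-sided P value for H0: k = k_null (chi-square with 1 df)."""
    z = (k - k_null) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


for trait, r in R.items():
    k, se = k_hat_and_se(r, M)
    print(f"{trait:<20s} k_hat = {k:.3f}  SE = {se:.3f}")

k_ad, se_ad = k_hat_and_se(R["Anthesis date"], M)
print("Anthesis date, H0: k = 1, P =", round(wald_p_value(k_ad, se_ad, 1.0), 3))
```

Traits with $r = 0$ give $\hat{k} = 0$ with a zero SE, which corresponds to the entries marked ¶ in the table. The corrected SE column is not reproduced here because it also depends on how $\alpha$ and $\beta$ enter the link function in the GENMOD fit.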

The general agreement between the uncorrected and corrected gene number estimates indicated that with these data, misclassification had only a very minor effect on the estimates. For three of the seven traits (kernels spike⁻¹, spikes m⁻², and grain volume weight), a very small proportion (0.04 or less) of the RICLs differed from CNN, indicating that CNN and CNN(WI3A) had the same genes for these traits. For the remaining traits, the corrected estimates were slightly smaller than the uncorrected estimates. A corrected estimate smaller than the uncorrected estimate would tend to indicate relatively more adjustment for $\alpha$ than for $\beta$, while a larger corrected estimate, compared with the uncorrected, would mean more of an adjustment for $\beta$. If the power had been very poor ($1 - \beta < 0.20$), then $\hat{k}_{adj}$ could have been substantially larger than the uncorrected estimate $\hat{k}$. However, with these traits, the power was greater than 0.95, causing only a small adjustment to the estimates.

With these data, the correction for misclassification was small. However, the correction would probably be large in trials where the substitution line differed from the parent but there was only a small chance of correctly identifying a RICL as nonparental. Trials such as these have genetic differences between the parental and substitution lines, but because of misclassification, the unadjusted estimate would be an underestimate of the number of genes. However, if there is very little segregation among the RICLs, the adjustment will be minimal even if power is extremely poor.

The sensitivity of the final $k$ (Eq. [4]) to the initial $k^{(0)}$ is an important consideration when using this procedure. For all traits except 1000-kernel weight, the final adjusted estimates were unaffected by the initial $k^{(0)}$ in the range of 0 to 6. However, for 1000-kernel weight, the final $k$ converged to $\hat{k}_{adj} = 0.658$ when $k^{(0)} < 2$, whereas it converged to $\hat{k}_{adj} = 6.64$ when $k^{(0)}$ was 2 or larger.
Both estimates were reasonable given the data. For $\hat{k}_{adj} = 0.658$, the final $1 - \beta$ estimate was 0.951, indicating a nonparental line would probably be correctly classified as nonparental. Given that 38% of the lines were classified as nonparental, the single gene model was reasonable. Alternatively, when $\hat{k}_{adj} = 6.64$ (a large number of genes for our method), the $1 - \beta$ estimate was 0.33, indicating a poor chance that a nonparental line would be correctly classified. With $k$ large, most of the lines would be nonparental types, but only about one-third would be correctly classified as nonparental. Thus, 38% of the RICLs being classified as nonparental was consistent with a large number of genes given a large $\beta$. With these data, the final $k$ estimate was affected by $k^{(0)}$ when $\beta$ was quite sensitive to $k$ ($t \ge 2.5$ in Eq. [4]) and when $\hat{p}$ was between 0.3 and 0.4. In general, final $k$ estimates should be obtained for several values of $k^{(0)}$, and care should be used in interpretation when the final $k$ estimate is sensitive to the initial $k^{(0)}$ value.

Both the corrected and uncorrected estimates were similar to those found by Shah et al. (1999b) ($\hat{k}_s$) using RFLP markers with a Bonferroni correction to limit the experimentwise error rate (Table 1). With the exception of 1000-kernel weight when $k^{(0)} \ge 2$, both $\hat{k}$ and $\hat{k}_{adj}$ were within one gene of $\hat{k}_s$ (Table 1), and for five of the seven traits, our estimates were the same or smaller than $\hat{k}_s$ based on Shah et al. (1999b).

Our estimates are based on the assumption of no linkage between loci. At the present state of knowledge, there is some debate about the likelihood of linkage in this particular application. However, if some pairs of loci are in coupling phase and the others are unlinked, then both $\hat{k}$ and $\hat{k}_{adj}$ will underestimate the true number of loci. Coupling phase linkage may be a reasonable assumption with these data since all positive alleles were contributed by WI at the loci detected in Shah et al. (1999b). It is also important to recognize that even if all genes are unlinked, the type of variation observed in the RICLs could also be explained by a genetic model based on a very large number of loci (Mulitze and Baker, 1985).
This result adds further justification to considering $\hat{k}$ and $\hat{k}_{adj}$ to be lower bounds of the estimated number of loci with genes affecting the difference between the two parental lines. For anthesis date and 1000-kernel weight, both $\hat{k}$ and $\hat{k}_{adj}$ estimates were larger than $\hat{k}_s$ based on Shah et al. (1999a, 1999b) (anthesis date: $\hat{k}_s = 1$, $\hat{k} = 1.64$, $\hat{k}_{adj} = 1.61$; 1000-kernel weight: $\hat{k}_s = 0$, $\hat{k} = 0.69$, $\hat{k}_{adj} = 0.66$). For anthesis date, our estimates did not strongly contradict those of Shah et al. (1999a, 1999b) since the standard error for anthesis date was large (0.31) and the evidence


against the single-gene hypothesis was not strong (P > 0.01). The Bonferroni correction for $\hat{k}_s$ may have been too stringent for 1000-kernel weight since the data indicated a genetic difference between CNN and CNN(WI3A) because the mean 1000-kernel weight differed between CNN and CNN(WI3A) (P < 0.02) and 38% of the RICLs differed from CNN, but $\hat{k}_s$ was 0.

Compared with some commonly used methods based on molecular markers, there are some clear advantages to using $\hat{k}_{adj}$ (or $\hat{k}$) in estimating the number of segregating loci that differ between wheat cultivars using RICLs. Only field data are required to obtain $\hat{k}_{adj}$ (or $\hat{k}$), giving substantial cost savings over molecular marker estimates, which require genomic DNA analyses. In addition, since field data must be used for both molecular marker estimates and $\hat{k}_{adj}$ estimates (or $\hat{k}$), there can be large statistical errors associated with classifying the RICLs. Use of $\hat{k}_{adj}$ explicitly accounts for both Type I and Type II statistical errors of misclassification of the RICLs via Eq. [3].

Some of the most commonly used approaches to analyzing molecular markers do not adequately account for misclassification errors, resulting in questionable estimates of gene numbers. One common method is to conduct a single factor analysis of variance (ANOVA) on the RICLs' trait means with the molecular marker (present or absent) as the classification variable. These tests are then conducted for each marker separately to identify markers that are significantly associated with the trait. The error variance used in these ANOVAs contains among-RICL variance not associated with the marker of interest, with the likely consequence of overestimating error variance, which may result in failure to detect important markers. In addition, this procedure does not include an initial test of significance among RICLs.
If this among-RICL variation is nonsignificant, then using multiple ANOVAs to identify significant markers, without somehow controlling overall experimentwise Type I error, may result in a substantial overestimate of the number of markers related to the trait of interest. Thus, an inflated error variance coupled with an uncontrolled experimentwise Type I error results in unknown levels of actual Type I and Type II errors. Type I errors appeared to have more of an impact on gene number estimates in Shah et al. (1999b; Table 1), assuming $\hat{k}_{adj}$ (or $\hat{k}$ from our Table 1) are more accurate. For some traits they considered, the estimated number of marker loci, without using the Bonferroni correction for experimentwise Type I error, was two to three times larger than the estimates based on the Bonferroni correction for experimentwise error.

The strategy of using field data and $\hat{k}_{adj}$ is a cost-effective method of estimating the number of genes responsible for the difference between a substitution line and a parent. When assumptions are met, the method gives reasonable, unbiased estimates of the number of genes. The method can also be applied to data from other breeding plans such as the inbred-backcross approach. However, it is important to recognize that the method is based on a number of critical assumptions. The observed genetic variation is implicitly assumed to be caused by only a few genes even though genetic models based on a large number of loci could explain the data equally well. Failure of this assumption will cause $\hat{k}_{adj}$ to be an underestimate (Mulitze and Baker, 1985). In addition, even if all assumptions hold, classification errors may cause the final $\hat{k}_{adj}$ estimate to depend on the initial $k^{(0)}$ values, resulting in situations where both large and small $k$ explain the data equally well. Results in this study are based on the assumptions that loci are either unlinked or in coupling phase. It is not clear how the presence of both coupling and repulsion phase linkage would affect the estimates. Finally, the method requires the standard assumptions that the traits are normally distributed and the appropriate alpha level is 0.05. It is not clear how violation of either of these assumptions affects the estimates.

ACKNOWLEDGMENTS

We gratefully acknowledge Dr. Steve Kachman for his critical review and help with the SAS program. We also thank Drs. Charles O. Gardner, Jr., Robert Curnow, Shawn Kaeppler, and the reviewers for their helpful comments and critical reviews of the manuscript.

APPENDIX

Derivation of the mean difference between the parent cultivar and the nonparental lines as a function of $k$ and the difference between the substitution line and the parent ($\mu_s - \mu_p$).

Assume there are $k$ unlinked loci, each with two alleles ($A_i$, $a_i$; $i = 1, \ldots, k$), where all $A_i$ alleles (i) are supplied by the substitution line and (ii) have the same effect ($\Delta$). Using the binomial distribution, the probability that a line has $j$ loci with positive alleles is ${}_kC_j/2^k$ and this line will have mean value $\mu_p + j\Delta$. Thus, the mean values for the lines will range from $\mu_p$ (parent mean) to $\mu_p + k\Delta$ ($= \mu_s$, the substitution line mean). To find the mean difference between the nonparental lines and the parent ($\mu_{np} - \mu_p$), it is necessary to obtain the expected value of the nonparental lines ($\mu_{np}$). Recall that the nonparental lines are the lines that have at least one locus with a substitution line allele ($A_i$). The probability distribution of the nonparental lines may be obtained by finding a constant, $c$, such that $c\sum_{j=1}^{k} {}_kC_j/2^k = 1$. Using properties of summations of combinations, $c = 2^k/(2^k - 1)$, and the probability that a nonparental line has $j$ loci ($j = 1, \ldots, k$) with substitution line alleles ($A_i$) is $[2^k/(2^k - 1)]\,{}_kC_j/2^k$. Using this probability distribution for the nonparental lines with their means ($\mu_p + j\Delta$), the expected value of the nonparental lines is $\mu_p + k[2^{k-1}/(2^k - 1)]\Delta$. Since there are $k$ loci that differ between the substitution and the parent lines, $\mu_s - \mu_p = k\Delta$. Substitution of $\Delta = (\mu_s - \mu_p)/k$ gives $\mu_{np} - \mu_p = [2^{k-1}/(2^k - 1)](\mu_s - \mu_p)$. This difference is then used as the true difference in computing $\beta$.

SAS statements to estimate uncorrected ($\hat{k}$) and corrected ($\hat{k}_{adj}$) estimates of the number of loci with genes affecting the difference between the parental and substitution lines for several traits.
    data a;
      input tn trait$ r m_r beta;
      alpha = .05;
      m = r + m_r;
      d = 1 - beta - alpha;
      p = r/m;
      output;
      if p < alpha then do;
        r = .5; p = r/m; trait = trim(trait)||'adj'; output;
      end;
      if p > 1 - beta then do;
        r = 49.5; p = r/m; trait = trim(trait)||'adj'; output;
      end;
    cards;
    1 yld 5 45 .000
    2 snt 0 50 .000
    3 tk 19 31 .049
    4 till 2 48 .000
    5 tst 0 50 .000
    6 ht 18 32 .011
    7 hd 34 16 .014
    proc print;
    *************** standard estimators *****************;
    proc genmod order=data;
      class trait;
      model r/m = trait / dist=bin noint;
      fwdlink link = log(1 - _mean_)/log(.5);
      invlink ilink = 1 - exp(log(.5)*_xbeta_);
    *************** corrected estimators ****************;
    data b; set a;
      if p < alpha or p > 1 - beta then delete;
    proc print;
    proc genmod order=data;
      class trait;
      model r/m = trait / dist=bin noint;
      fwdlink link = log(1 - (_mean_ - alpha)/d)/log(.5);
      invlink ilink = d*(1 - exp(log(.5)*_xbeta_)) + alpha;
    run;

REFERENCES

Agresti, A. 1990. Categorical data analysis. John Wiley, New York.
Berke, T.G., P.S. Baenziger, and R. Morris. 1992. Chromosomal location of wheat quantitative trait loci affecting agronomic performance of seven traits, using reciprocal chromosome substitutions. Crop Sci. 32:621–627.
Eskridge, K.M., and D.P. Coyne. 1996. Estimating and testing hypotheses about the number of genes using inbred-backcross data. J. Hered. 87:410–412.
Grizzle, J.E., C.F. Starmer, and G.G. Koch. 1969. Analysis of categorical data by linear models. Biometrics 25:489–504.
Kotz, S., and N.L. Johnson. 1982. Errors in inspection and grading: Distributional aspects of screening and hierarchal screening. Commun. Statist. A: Theor. Meth. 11:1997–2016.


Law, C.N. 1966. The location of genetic factors affecting a quantitative character in wheat. Genetics 53:487–498.
Law, C.N. 1967. The location of genetic factors controlling a number of quantitative characters in wheat. Genetics 56:445–461.
Law, C.N., A.J. Worland, and B. Giorgi. 1976. The genetic control of ear-emergence time by chromosome 5A and 5D of wheat. Heredity 36:49–58.
Lynch, M., and B. Walsh. 1998. Genetics and analysis of quantitative traits. Sinauer Assoc., Sunderland, MA.
Muira, H., and A.J. Worland. 1994. Genetic control of vernalization, day length response, and earliness per se by homoeologous group 3 chromosomes in wheat. Plant Breed. 113:160–169.
Mulitze, D.K., and R.J. Baker. 1985. Evaluation of biometrical methods for estimating the number of genes. 1. Effects of sample size. Theor. Appl. Genet. 69:553–558.
SAS Institute. 1997. SAS/STAT software: Changes and enhancements through release 6.12. SAS Inst., Cary, NC.
Sears, E.R. 1953. Nullisomic analysis in common wheat. Am. Naturalist 87:245–252.
Shah, M.M., P.S. Baenziger, Y. Yen, K.S. Gill, B. Moreno-Sevilla, and K. Haliloglu. 1999a. Genetic analyses of agronomic traits controlled by wheat chromosome 3A. Crop Sci. 39:1016–1021.
Shah, M.M., K.S. Gill, P.S. Baenziger, Y. Yen, S.M. Kaeppler, and H.M. Ariyarathne. 1999b. Molecular mapping of loci for agronomic traits on chromosome 3A of bread wheat. Crop Sci. 39:1728–1732.
Snape, J.W., C.N. Law, A. Parker, and A.J. Worland. 1985. Genetical analysis of chromosome 5A of wheat and its influence on important agronomic characters. Theor. Appl. Genet. 71:518–526.
Steel, R.G.D., and J.H. Torrie. 1980. Principles and procedures of statistics. 2nd ed. McGraw-Hill, New York.
Wehrhahn, C., and R.W. Allard. 1965. The detection and measurement of the effects of individual genes involved in the inheritance of a quantitative character in wheat. Genetics 51:109–119.
Yen, Y., and P.S. Baenziger. 1992. A better way to construct recombinant inbred chromosome lines and their controls. Genome 35:827–830.
Zemetra, R.S., R. Morris, and J. Schmidt. 1986. Gene location for heading date using reciprocal chromosome substitutions in winter wheat. Crop Sci. 26:531–533.
