Genetic Resources and Crop Evolution 47: 515–526, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.
Development of an algorithm identifying maximally diverse core collections
Jane M. Marita1, Julie M. Rodriguez2 & James Nienhuis2
1 Dept. of Forestry, 1925 Linden Drive, USDFRC, Madison, WI 53706, U.S.A.; 2 Dept. of Horticulture, 1575 Linden Drive, University of Wisconsin, Madison, WI 53706, U.S.A. (∗ Author for correspondence. E-mail: [email protected])
Received 2 July 1999; accepted in revised form 12 January 2000
Key words: Capsicum, core collection, genetic diversity, RAPD, Theobroma cacao
Abstract
The development of a core collection, one that represents the genetic diversity of a crop with minimal redundancy and increases the utility of the collection as a whole, is especially important as funding for germplasm collections decreases. With limited resources, it is difficult to manage large germplasm collections and to distribute genetically diverse germplasm to plant breeders. An algorithm was developed to assist in the selection of core collections based on estimates of genetic distance. The criteria for selecting the maximally genetically diverse set were based on rankings of the genetic distance between each accession and all other accessions. Depending on the core size chosen by the user, a zone around each selected accession was determined, and no other accession within that zone was selected. The premise of the algorithm was that the genetic variability in the core must be representative of the distribution of genetic distances within the population of interest. In the present study, the algorithm was applied to RAPD-marker-based estimates of genetic distance for 270 Theobroma cacao L. accessions and 134 Capsicum accessions, and in each case chose a set representing 18.5% of the population and the breadth of RAPD-based variation.
Abbreviations: AVRDC – Asian Vegetable Research and Development Center; CEPEC – Centro de Pesquisa do Cacau; MDS – multi-dimensional scaling; RAPD – random amplified polymorphic DNA.
Introduction
A core collection is a subset of accessions from a larger collection of a particular crop species that represents, with a minimum of repetitiveness, the genetic diversity of that crop species and its wild relatives (Brown and Clegg, 1983; Frankel, 1984). The purpose of a core is to provide potential users with a representative sample of the genetic variation available in the crop gene pool in a subset of manageable size (Brown, 1995).
Because the core collection is the focus of evaluation and use, it provides a preliminary look at the diversity available in the larger collection. Core collections are increasingly useful as more genebanks are established and the volume of the collections assembled worldwide outgrows the management resources available. Many collections have become too
large and diffuse for use, with the effect that utilization is discouraged rather than assisted (Frankel and Brown, 1984). Many different criteria have been used to analyze genetic diversity in order to construct core collections. These criteria have included geographical and morphological data (Holbrook et al., 1993; Diwan et al., 1994; Basigalup et al., 1995), and biochemical data (Grauke and Thompson, 1995). However, the paucity of characterization data available for an accession in base collections of a crop species may result in sampling biases. DNA marker systems, which provide large numbers of polymorphic loci dispersed in the genome, were initially considered too expensive and time-consuming to be used efficiently for development of core collections (Gepts, 1995). Nevertheless, the amount of molecular and biochemical marker data
on crop plants has increased dramatically (Doebley, 1989, 1992; Clegg, 1990; Gepts, 1990, 1993), making them attractive as core collection criteria. The major advantage of using molecular markers as criteria for developing a core collection is that molecular markers are genotypic, directly reflecting changes at the DNA level, while morphological markers reflect phenotypic traits frequently defined by multiple genotypes (Gepts, 1995). In addition, accessions with similar phenotypes may sometimes be evolutionarily unrelated, as in the case of common bean. Races Durango and Chile show similar phenotypes, yet biochemical markers indicate that they belong to different gene pools (Middle America and Andes, respectively) (Singh et al., 1991).
Depending on the desired application, a core collection can be defined in two ways: (i) the taxonomist’s perspective, where rare, highly restricted alleles must be represented in any core collection; and (ii) the breeder’s perspective, where broadly adapted and heterotic cultivars containing ‘generalist’ alleles are represented in a core collection. Following the breeder’s perspective, it may not be necessary to maximize the total diversity of the core but instead maximize the representativeness of the genetic diversity in the core (Brown, 1989).
Knowledge of genetic structure of germplasm collections can improve the efficiency of characterization and evaluation of these germplasm collections. Development of a core collection, which maximizes the genetic diversity among accessions stored by germplasm centers, may assist in preservation of the available genetic variability within these collections and assist plant breeders in efficiently utilizing this resource. Core collections should not be static and should provide opportunities to respond to change as additional accessions are added to the collection. 
Systematic procedures are needed for the selection of accessions to be included in core collections based on genetic relationships. Our objective was to develop a computer program that would select a core of accessions representing the maximum genetic diversity among all accessions and would allow for cores of different sizes. Accessions of Theobroma cacao and Capsicum spp. characterized with RAPD markers were used to validate the computer program and to identify sets of accessions that maximize genetic diversity and could potentially be used as a core. An additional objective was to compare the genetic diversity of sets selected by our algorithm with that of sets of equal size developed by random sampling.
Materials and methods
Germplasm
Two data sets were included for validation and testing of the computer algorithm, one representing a single species and the other including six different species. The first data set included accessions from the Centro de Pesquisa do Cacau (CEPEC; Itabuna, Bahia, Brazil) Theobroma cacao collection. A subset of 270 accessions was sampled during 1996–97 for a preliminary analysis aimed at characterizing the CEPEC cacao collection and assembling a core collection (Marita, 1998). The sample included accessions with some resistance to witches’ broom disease (Crinipellis perniciosa), plus accessions included in CEPEC’s characterization experiments. Also included were different ‘series’, which represented accessions categorized under a single acronym, e.g. C Sul representing unique clones from Cruzeiro do Sul in the Upper Amazon region of Brazil (Table 1). Among the 270 accessions selected, twelve countries of origin were represented: Nicaragua (1 accession), Grenada (2), Guatemala (3), Venezuela (5), French Guiana (6), Mexico (9), Costa Rica (9), Colombia (12), Trinidad and Tobago (18), Ecuador (30), Peru (38), and Brazil (134). Three accessions were classified as USA, referring to the location where the crosses were made and not to the genetic material involved in the cross. Therefore, the USA classification was not included in the total of countries of origin.
The second data set included a total of 134 Capsicum accessions from the Asian Vegetable Research and Development Center (AVRDC; Shanhua, Tainan, Taiwan). Sampling of these accessions from the total Capsicum collection at the AVRDC and characterization with RAPDs were described by Rodriguez et al. (1999). The accessions included 6 Capsicum species: C. annuum L., C. frutescens L., C. chinense (Jacq.), C. baccatum L., C. pubescens (Ruiz & Pavon), and C. chacoense (A. T. Hunz.). 
DNA extraction, RAPD reactions, and genetic distance
DNA extraction and amplification procedures for T. cacao accessions were as described in Marita (1999). Thirty-eight primers resulting in 133 polymorphic bands were evaluated. DNA extraction procedures for Capsicum accessions were described in Rodriguez et al. (1999) and followed Johns et al. (1997).
Table 1. The 270 accessions from the Centro de Pesquisa do Cacau germplasm collection included in the cacao analysis
Acc. # | Clone name | Country of origin
1 | AMAZON 2-1 | Peru
2 | AMAZON 3-2 | Peru
3 | CAB 5.003-23 | Brazil
4 | CAB 36 | Brazil
5 | CAB 15 | Brazil
6 | CCN 10 | Ecuador
7 | CCN 34 | Ecuador
8 | CCN 51 | Ecuador
9 | C SUL 8 | Brazil
10 | C SUL 3 | Brazil
11 | C SUL 4 | Brazil
12 | SCA 6 | Peru
13 | SCA 12 | Peru
14 | TSA 516 | Trinidad/Tobago
15 | TSA 641 | Trinidad/Tobago
16 | TSH 1188 | Trinidad/Tobago
17 | TSH 565 | Trinidad/Tobago
18 | EET 376 | Ecuador
19 | EET 390 | Ecuador
20 | IAC 1 | Brazil
21 | CEPEC 38 | Brazil
22 | CEPEC 46 | Brazil
23 | CEPEC 89 | Brazil
24 | CEPEC 92 | Brazil
25 | NA 33 | Peru
26 | NA 312 | Peru
27 | NA 727 | Peru
28 | MA 16 | Brazil
29 | MA 13 | Brazil
30 | MA 15 | Brazil
31 | IMC 67 | Peru
32 | IMC 27 | Peru
33 | IMC 47 | Peru
34 | MOQ 216 | Ecuador
35 | MOQ 417 | Ecuador
36 | CA 1 | Brazil
37 | CA 5 | Brazil
38 | CA 2 | Brazil
39 | OC 77 | Venezuela
40 | OC 66 | Venezuela
41 | SGU 26 | Guatemala
42 | SGU 50 | Guatemala
43 | SGU 54 | Guatemala
44 | EET 45 | Ecuador
45 | EET 62 | Ecuador
46 | EET 59 | Ecuador
47 | SIAL 84 | Brazil
48 | SIAL 70 | Brazil
49 | SIC 24 | Brazil
50 | SIC 662 | Brazil
51 | EEG 29 | Brazil
52 | EEG 50 | Brazil
53 | ICS 9 | Trinidad/Tobago
54 | ICS 1 | Trinidad/Tobago
55 | ICS 39 | Trinidad/Tobago
56 | UF 221 | Costa Rica
57 | UF 667 | Costa Rica
58 | UF 677 | Costa Rica
59 | CAS 2 | Brazil
60 | MOCO 1 | Brazil
61 | P 19 | Peru
62 | P4B | Peru
63 | P 11 | Peru
64 | SPA 12 | Colombia
65 | SPA 5 | Colombia
66 | SPA 17 | Colombia
67 | PA 51 | Peru
68 | PA 150 | Peru
69 | PA 13 | Peru
70 | Be 4 | Brazil
71 | Be 6 | Brazil
72 | SPEC 54-1 | Colombia
73 | SPEC 138-8 | Colombia
74 | RB 36 | Brazil
75 | RB 39 | Brazil
76 | RB 37 | Brazil
77 | GS 36 | Grenada
78 | GS 29 | Grenada
79 | SC 5 | Colombia
80 | SC 49 | Colombia
81 | RIM 76 | Mexico
82 | RIM 52 | Mexico
83 | RIM 15 | Mexico
84 | 21 P (J) | Mexico
85 | 8 (P) | Mexico
86 | CC 41 | Costa Rica
87 | CC 11 | Costa Rica
88 | CC10 | Costa Rica
89 | CJ 7 | Brazil
90 | CJ 4 | Brazil
91 | EET 377 | Ecuador
92 | EET 392 | Ecuador
93 | CEPEC 90 | Brazil
94 | C SUL 7 | Brazil
95 | C SUL 10 | Brazil
96 | EQX 107 | Ecuador
97 | CEPEC 94 | Brazil
98 | TSA 644 | Trinidad/Tobago
99 | CEPEC 25 | Brazil
100 | PLAYA ALTA 4 | Venezuela
101 | CHUAO 120 | Venezuela
102 | OC 67 | Venezuela
103 | C 87.56 | Trinidad/Tobago
104 | CAB 4 | Brazil
105 | RB 38 | Brazil
106 | CEPEC 86 | Brazil
107 | CEPEC 11 | Brazil
108 | CEPEC 12 | Brazil
109 | CEPEC 13 | Brazil
110 | CEPEC 14 | Brazil
111 | CEPEC 15 | Brazil
112 | CEPEC 533 | Brazil
113 | CEPEC 538 | Brazil
114 | CEPEC 550 | Brazil
115 | SIAL 20 | Brazil
116 | SIC 19 | Brazil
117 | EET 53 | Ecuador
118 | EET 94 | Ecuador
119 | P7 | Peru
120 | P 16 | Peru
121 | PA 4 | Peru
122 | OB 52 | Brazil
123 | SIAL 512 | Brazil
124 | EEG 14 | Brazil
125 | ICS 8 | Trinidad/Tobago
126 | ICS 60 | Trinidad/Tobago
127 | Be 3 | Brazil
128 | CC 34 | Costa Rica
129 | EET 397 | Ecuador
130 | CEPEC 42 | Brazil
131 | SIAL 283 | Brazil
132 | Be 5 | Brazil
133 | C SUL 2 | Brazil
134 | C SUL 5 | Brazil
135 | C SUL 9 | Brazil
136 | CEPEC 87 | Brazil
137 | SPA 7 | Colombia
138 | EET 399 | Ecuador
139 | RB 30 | Brazil
140 | CAS 3 | Brazil
141 | Be 8 | Brazil
142 | SIAL 407 | Brazil
143 | PA 169 | Peru
144 | ICS 98 | Trinidad/Tobago
145 | ICS 89 | Trinidad/Tobago
146 | EET 61 | Ecuador
147 | CA 6 | Brazil
148 | CEPEC 16 | Brazil
149 | EET 228 | Ecuador
150 | SIAL 543 | Brazil
151 | SIAL 505 | Brazil
152 | CA 3 | Brazil
153 | IMC 76 | Peru
154 | APA 4 | Colombia
155 | PA 148 | Peru
156 | RIM 10 | Mexico
157 | ICS 6 | Trinidad/Tobago
158 | EEG 65 | Brazil
159 | SIC 328 | Brazil
160 | TSA 654 | Trinidad/Tobago
161 | TSH 774 | Trinidad/Tobago
162 | CEPEC 523 | Brazil
163 | CEPEC 541 | Brazil
164 | ICS 16 | Trinidad/Tobago
165 | CEPEC 73 | Brazil
166 | JA 546 | Ecuador
167 | C 13.5 | Nicaragua
168 | CCN 2 | Ecuador
169 | MO 9 | Peru
170 | CJ 8 | Brazil
171 | 10 (P) | Mexico
172 | UF 12 | Costa Rica
173 | CEPEC 519 | Brazil
174 | CEPEC 48 | Brazil
175 | CEPEC 30 | Brazil
176 | UF 296 | Costa Rica
177 | ICS 75 | Trinidad/Tobago
178 | CEPEC 532 | Brazil
179 | RIM 105 | Mexico
180 | 22 (P) | Mexico
181 | CAB 2 | Brazil
182 | CEPEC 95 | Brazil
183 | CEPEC 108 | Brazil
184 | CEPEC 131 | Brazil
185 | CEPEC 171 | Brazil
186 | CAB 262 | Brazil
187 | CAB 263 | Brazil
188 | CAB 148 | Brazil
189 | CAB 94 | Brazil
190 | CAB 520 | Brazil
191 | CAB 157 | Brazil
192 | LCTEEN 28s1 | Ecuador
193 | GU 125 C | French Guiana
194 | GU 136 H | French Guiana
195 | CAB 414 | Brazil
196 | CAB 61 | Brazil
197 | CEPEC 144 | Brazil
198 | CEPEC 148 | Brazil
199 | CAB 275 | Brazil
200 | CAB 53 | Brazil
201 | CAB 65 | Brazil
202 | CEPEC 159 | Brazil
203 | CEPEC 125 | Brazil
204 | H 28 | Peru
205 | LCTEEN 37F | Ecuador
206 | CAB 194 | Brazil
207 | CAB 103 | Brazil
208 | CEPEC 151 | Brazil
209 | H7 | Peru
210 | CAB 486 | Brazil
211 | H 39 | Peru
212 | CAB 21 | Brazil
213 | CAB 155 | Brazil
214 | CEPEC 158 | Brazil
215 | H 17 | Peru
216 | CEPEC 166 | Brazil
217 | CEPEC 147 | Brazil
218 | H9 | Peru
219 | CEPEC 150 | Brazil
220 | CEPEC 136 | Brazil
221 | U 14 | Peru
222 | CAB 312 | Brazil
223 | CAB 201 | Brazil
224 | CAB 505 | Brazil
225 | CAB 299 | Brazil
226 | CAB 531 | Brazil
227 | CAB 108 | Brazil
228 | CAB 68 | Brazil
229 | CAB 165 | Brazil
230 | CAB 130 | Brazil
231 | SA 3 | Brazil
232 | CAB 382 | Brazil
233 | CAB 223 | Brazil
234 | CAB 231 | Brazil
235 | CAB 44 | Brazil
236 | CAB 283 | Brazil
237 | CAB 353 | Brazil
238 | CAB 460 | Brazil
239 | CAB 224 | Brazil
240 | U 32 | Peru
241 | COCA 3370-5 | Ecuador
242 | AMAZON 15 | Peru
243 | GU 221 C | French Guiana
244 | CAB 121 | Brazil
245 | CAB 305 | Brazil
246 | CAB 380 | Brazil
247 | CAB 389 | Brazil
248 | CAB 252 | Brazil
249 | LCTEEN 7A | Ecuador
250 | GU 154 C | French Guiana
251 | PA 175 | Peru
252 | LCTEEN 163A | Ecuador
253 | U2 | Peru
254 | GNV 225 | USA
255 | IMC 83 | Peru
256 | GNV 31 | USA
257 | GNV 111 | USA
258 | EET 19 | Ecuador
259 | EET 58 | Ecuador
260 | P 18 | Peru
261 | SC 10 | Colombia
262 | EQXZ | Ecuador
263 | SPEC 160.9 | Colombia
264 | GU 121 | French Guiana
265 | GU 133 C | French Guiana
266 | LCTEEN 241 | Ecuador
267 | SC 3 | Colombia
268 | SCA 2 | Peru
269 | P5C | Peru
270 | MO 20 | Peru
The RAPD reaction mixtures for Capsicum accessions followed procedures used for cacao accessions except that the 25 RAPD primers (156 polymorphic bands) differed and were listed in Rodriguez et al. (1999). Genetic distance matrices were computed for all pairwise combinations among cacao accessions and among Capsicum accessions using the complement to the simple matching coefficient (Gower, 1985; Marita, 1999; Rodriguez et al., 1999). Each genetic distance
matrix was converted to two-dimensional coordinates using a multi-dimensional scaling procedure (PROC MDS; SAS Institute, 1990).
Maximum genetic diversity algorithm
The program (copyright pending) was written in the C++ programming language (Symantec, 1993–94) to carry out two functions: (1) to select a core that maximized genetic distance, and (2) to rank all other accessions relative to any given accession.
The first function maximizes the genetic diversity among a set of accessions based on their genetic distance matrix. The user selects the number of accessions to be included in the core, which can vary from 1 to n (n = total number of accessions available), and a ‘seed’ accession number as the first accession to be included in the core. The algorithm then sorts and ranks the genetic distance matrix, which represents all pairwise comparisons between each accession and all remaining accessions. Each line of the matrix is sorted by decreasing genetic distance, and each accession in the line is given a rank between 1 and the total number of accessions. A rank of 1 indicates the accession with the maximum difference from the accession that the line represents, whereas a rank equal to the total number of accessions indicates the accession with the maximum similarity to it. Starting with the line of the ‘seed’ accession, the accession with the lowest rank (maximally diverse) is included in the core. The remaining accessions (up to the number specified by the user) are selected on the basis of the ranks of each accession not yet selected for the core. The ranks of a given accession are summed across all lines of the matrix that correspond to accessions already included in the core, and the accession with the lowest sum of ranks is selected as a candidate for the core.
In addition to the above criteria, a candidate accession must not fall within a certain distance of any previously selected accession. This assures the user that accessions representing the breadth of genetic diversity are included, not just accessions at either extreme of the genetic distance matrix. The distance equals the total number of accessions analyzed minus the number of accessions the user selected for the core, divided by the number of accessions selected for the core. This value defines an area around each previously selected accession within which additional accessions cannot be selected. This criterion spreads the sampling so that accessions representing different genetic clusters along the extremes of the genetic distribution are included in the core collection. At the end of this option, the user is given the choice of saving the genetic distance matrix of the core representing the maximum genetic diversity. The final maximum genetically diverse set represents a core collection based on RAPD-marker-based estimates of genetic distance.
Figure 1. A multidimensional scaling plot of 270 cacao accessions from the CEPEC germplasm collection. Solid circles represent the 50 accessions most frequently chosen in 100 repetitions of the maximally diverse algorithm.
Figure 2. A multidimensional scaling plot of the 134 Capsicum accessions. Solid circles represent the 25 accessions most frequently chosen in 100 repetitions of the maximally diverse algorithm.
Validation of ‘core’ algorithm
The objective was to compare the genetic diversity of sets selected by our algorithm with that of sets of equal size developed by random sampling. Two independent comparisons between random sets and program-generated sets were performed, using the T. cacao and Capsicum spp. data sets.
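The selection procedure described under ‘Maximum genetic diversity algorithm’ can be sketched in a few lines of Python. This is only an illustrative reading of the prose description, not the original C++ program. The function names (rank_rows, select_core), the tie-breaking by index, the rounding of the exclusion zone up to a whole number of nearest neighbours, and the early stop when no eligible candidate remains are all assumptions of this sketch, and the one-dimensional toy distances are hypothetical.

```python
import math


def rank_rows(dist):
    """Rank, within each row i, all accessions by decreasing distance from i.

    ranks[i][j] == 1 means j is the accession most distant from i;
    ranks[i][j] == n means j is the most similar (normally i itself).
    Ties are broken by index order, which is an assumption of this sketch.
    """
    n = len(dist)
    ranks = []
    for i in range(n):
        order = sorted(range(n), key=lambda j: (-dist[i][j], j))
        row = [0] * n
        for rank, j in enumerate(order, start=1):
            row[j] = rank
        ranks.append(row)
    return ranks


def select_core(dist, core_size, seed):
    """Greedy rank-sum selection with an exclusion zone around each pick."""
    n = len(dist)
    ranks = rank_rows(dist)
    # Zone size (n - m) / m, rounded up to whole neighbours (assumed rounding).
    zone = math.ceil((n - core_size) / core_size)
    core = [seed]
    blocked = set()

    def block_neighbours(i):
        # Exclude the `zone` accessions genetically closest to accession i.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: (dist[i][j], j))
        blocked.update(others[:zone])

    block_neighbours(seed)
    while len(core) < core_size:
        candidates = [j for j in range(n) if j not in core and j not in blocked]
        if not candidates:
            break  # assumed behaviour: stop early if every accession is blocked
        # Lowest sum of ranks across the lines of accessions already in the core.
        best = min(candidates, key=lambda j: (sum(ranks[i][j] for i in core), j))
        core.append(best)
        block_neighbours(best)
    return core


# Hypothetical toy data: three tight clusters of three accessions on a line.
positions = [0, 1, 2, 10, 11, 12, 20, 21, 22]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
core = select_core(toy_dist, core_size=3, seed=0)
print(core)
```

With the seed placed in the first cluster, the first pick is the accession farthest from the seed, and the exclusion zone then forces the third pick into the remaining cluster, so each cluster contributes one accession.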
One hundred different program-generated core collections were generated for each data set, of 50 accessions each for cacao and 25 accessions each for Capsicum, using random seeds so that no bias would be introduced. For comparison, 100 ‘core’ collections of size 50 and 25 for the cacao and Capsicum data sets, respectively, were also generated by random sampling without replacement. Nei genetic diversity values were calculated for each RAPD locus among all accessions in each replication, and the mean was calculated for each replication in the random sets and program-generated sets for both the cacao and Capsicum data sets. The Nei genetic diversity at a locus is

$$h = \left(1 - \sum x_i^2\right)\frac{n_i}{n_i - 1} = \frac{2 p_i q_i n_i}{n_i - 1}$$

where $x_i$ is the allele frequency at the $i$th locus, $p_i$ is the frequency of the presence, and $q_i$ is the frequency of the absence of RAPD amplification among $n_i$ accessions for the $i$th RAPD marker (Nei, 1987). The mean Nei genetic diversity was calculated for each replication by dividing the sum of all Nei genetic diversity values across RAPD markers by the total number of RAPD markers. This can be represented by

$$\sum\left[\frac{2 p_i q_i n_i}{n_i - 1}\right] \Big/ m$$

where $m$ is the total number of RAPD markers. To calculate an overall mean for all replications in the random sets and program-generated sets, the sum of the mean Nei genetic diversity values for the replications was divided by the total number of replications. The overall mean can be represented by the equation

$$\sum\left(\sum\left[\frac{2 p_i q_i n_i}{n_i - 1}\right] \Big/ m\right) \Big/ r$$

where $r$ is the total number of replications.
In order to determine whether the maximum genetic diversity algorithm selected the same accessions in each sample set, the number of times each accession was chosen across 100 replications was summed for the random and program-generated sets for both the cacao and Capsicum data sets. In this comparison m denotes the number of accessions in a set: for the cacao study, m equaled 50 accessions, while for each Capsicum set m equaled 25, to keep the percentage of accessions selected consistent between the two studies. Furthermore, n equaled 270 and 134 in the cacao and Capsicum studies, respectively, and r equaled 100. A Chi-square goodness of fit test (Steel and Torrie, 1980) was used to compare the expected versus the observed number of times an accession was chosen for the random sets and for the program-generated sets.
Figure 3. Distributions of mean Nei genetic diversity values averaged over 100 repetitions of 50 randomly selected cacao accessions compared to 50 maximally diverse selected accessions.
Results
RAPD analysis
The genetic relationships among the 270 cacao accessions and the 134 Capsicum accessions from the base collections were displayed as MDS plots (Figures 1 and 2, respectively). A total of 134 cacao accessions were classified as Brazilian, representing unique accessions and hybrids between accessions originating in Brazil and other countries. As a result, accessions categorized as Brazilian were spread throughout the MDS plot (Figure 1). Peru, Ecuador, Trinidad/Tobago, Mexico, Costa Rica, Venezuela, and French Guiana formed unique but non-discrete clusters in the MDS plot (Figure 1). Conclusions could not be drawn concerning Guatemala, Grenada, and Nicaragua because each was represented by only 1 to 3 accessions. Colombian accessions did not form a distinct cluster but were spread among Brazilian hybrids and the Peru, Venezuela, and Trinidad/Tobago clusters. Six discrete clusters were observed in the MDS plot of Capsicum accessions, corresponding to the six Capsicum species (Rodriguez et al., 1999), except for two C. annuum accessions (#19 and #28; Figure 2) that formed a cluster distinct from all other C. annuum accessions as well as from all other species.
Maximum genetic diversity algorithm
Based on Nei’s genetic diversity, the mean genetic diversity of the program-generated cores, 0.377 (± 0.002) for cacao and 0.361 (± 0.005) for Capsicum, was greater than that of the random cores, 0.305 (± 0.012) and 0.269 (± 0.032), respectively (Figure 3). Moreover, the mean genetic distance among accessions increased from 0.310 (± 0.090) and 0.267 (± 0.162) for all cacao accessions and all Capsicum accessions, respectively, to 0.375 (± 0.080) and 0.359 (± 0.159) for the program-generated cores, supporting the algorithm’s selection of a maximum genetically diverse core in each study. Since different ‘seed’ numbers (accessions) were used to initiate the core development, we also examined whether the program consistently chose similar accessions in each replicate core. In the random cores, the mean number of times an accession was selected did not differ from the expectation (18.520). In contrast, in the program-generated cores, the mean number of times an accession was selected did differ from its expectation, with χ² = 14279 (P < 0.001) for the cacao data set and χ² = 7258.21 (P < 0.0001) for the Capsicum data set (Figure 5). This indicates that the program selected similar cores regardless of the initial seed number.
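The validation statistics can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, the toy band matrix and observed counts are invented for illustration, and the expected count m·r/n is inferred from the reported values (50 × 100 / 270 ≈ 18.52 for cacao) rather than stated as a formula in the text. The chi-square statistic is the usual goodness-of-fit sum over accessions.

```python
def nei_diversity(bands):
    """Mean Nei diversity over RAPD loci for one set of accessions.

    bands[a][k] is 1 if accession a shows band k and 0 otherwise.
    Per locus: h = 2 * p * q * n / (n - 1), with p the frequency of band
    presence and q = 1 - p; the mean is taken over all loci.
    """
    n = len(bands)
    n_loci = len(bands[0])
    total = 0.0
    for k in range(n_loci):
        p = sum(row[k] for row in bands) / n
        q = 1.0 - p
        total += 2.0 * p * q * n / (n - 1)
    return total / n_loci


def expected_count(set_size, n_accessions, replications):
    """Expected number of times one accession is drawn under random sampling."""
    return set_size * replications / n_accessions


def chi_square(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over accessions."""
    return sum((o - expected) ** 2 / expected for o in observed)


# Hypothetical toy band matrix: 4 accessions scored at 2 loci.
toy_bands = [[1, 0], [1, 1], [0, 1], [0, 1]]
toy_h = nei_diversity(toy_bands)

# Expected selection counts for the two studies (m = 50 or 25, r = 100).
cacao_expected = expected_count(50, 270, 100)
capsicum_expected = expected_count(25, 134, 100)

# Hypothetical observed counts: two accessions always chosen, two never chosen.
toy_chi2 = chi_square([100, 100, 0, 0], expected=50)
print(round(toy_h, 4), round(cacao_expected, 2), toy_chi2)
```

A selection procedure that keeps returning the same accessions produces counts near 0 or near r, far from the uniform expectation, which is what drives a large chi-square value.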
Discussion
In the first study, cacao accessions chosen most frequently by the program as the maximum genetically diverse set represented the peripheral regions of the full MDS plot (Figure 1). Nevertheless, the accessions were well dispersed in an MDS plot when only these 50 most frequently selected accessions were analyzed (Figure 4). An MDS plot is a representation of the best fit of the relationships between accessions in two dimensions and is subject to stress. The physical distance between two accessions on the plot does not necessarily reflect the genetic distance. The stress of fitting multi-dimensional distances into two dimensions helps explain the contrasting appearance of the distribution of the accessions in the full (Figure 1) and core (Figure 4) displays. Several accessions in the maximally diverse core of cacao were selected from a tight cluster of accessions in the upper right-hand corner of the cacao MDS plot (Figure 1). These accessions were all closely related to the accessions within the cluster, but they were not as closely related to each other. For example, accessions 53 (ICS 9) and 73 (SPEC 138-8) lay next to each other on the MDS plot, but a list of accessions most closely related to accession 53 revealed that accession 73 was only the 46th closest; that is, 45 other accessions were more closely related to accession 53 based on genetic distances. This was also reflected by the diverse countries of origin represented among accessions selected as the maximum genetically diverse set (Marita, 1998). Among those accessions selected by the program and located near the upper right-hand cluster listed in the box in
the cacao MDS plot (Figure 1), six countries of origin were represented including Colombia, Trinidad and Tobago, Venezuela, Guatemala, Costa Rica, and Grenada. Capsicum accessions chosen most frequently by the algorithm as the maximum genetically diverse set represented the peripheral regions of the MDS plot (Figure 2) similar to the cacao maximally diverse set. A total of 14 C. annuum accessions, four C. chinense, four C. frutescens, two C. baccatum, one C. chacoense, and zero C. pubescens accessions were selected (Figure 2). The 25 accessions were selected from a broad geographic range including Brazil, Costa Rica, Cuba, France, Guatemala, India, Indonesia, Italy, Mexico, Peru, Thailand, Turkey, USA, USSR, and Zaire. Neither C. pubescens accession was chosen for the maximally diverse set. The genetic distance between the two C. pubescens accessions was 0.009. The mean genetic distance of the C. pubescens cluster to any other cluster was ≥ 0.293. The maximally diverse algorithm ensures that any potential accession does not fall within a certain distance from any previously selected accession. For the Capsicum set, the five most closely related accessions to any previously selected accession could not be chosen [(134–25)/25 = 4.36] to ensure a spread of sampling from different clusters. However, the C. pubescens and C. chacoense clusters had less than five accessions. C. chacoense accession (#120) was included 79 out of the 100 times that a maximally diverse set was created. Both C. pubescens accessions were third and fourth most closely related to this accession (genetic distances equaled 0.355 and 0.364). Therefore, each time accession #120 was chosen, neither of the C. pubescens accessions could be chosen. C. pubescens was anticipated to be included in the final maximally diverse set because it is a distinctive species based on morphology and interspecific breeding behavior (Heiser and Smith, 1953; Smith and Heiser, 1957). 
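The exclusion arithmetic behind the C. pubescens result can be checked with a small Python sketch. The helper names are hypothetical, and rounding (n − m)/m up to a whole number of neighbours is an assumption that matches the "five most closely related accessions" stated above. In the toy example only the two distances 0.355 and 0.364 come from the text; the remaining distances are invented so that those two accessions are the third and fourth nearest neighbours of the selected accession.

```python
import math


def exclusion_zone(n_accessions, core_size):
    """Number of nearest neighbours blocked around each selected accession."""
    return math.ceil((n_accessions - core_size) / core_size)


def nearest_neighbours(dist, i, k):
    """Indices of the k accessions closest to accession i (ties by index)."""
    others = [j for j in range(len(dist)) if j != i]
    return sorted(others, key=lambda j: (dist[i][j], j))[:k]


capsicum_zone = exclusion_zone(134, 25)  # (134 - 25) / 25 = 4.36, rounded up
cacao_zone = exclusion_zone(270, 50)     # (270 - 50) / 50 = 4.40, rounded up

# Index 0 stands in for the selected accession (#120 in the text); indices 1
# and 2 stand in for the two C. pubescens accessions. All other values are
# hypothetical.
positions = [0.0, 0.355, 0.364, 0.30, 0.33, 0.50, 0.55, 0.60]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
blocked = nearest_neighbours(toy_dist, 0, capsicum_zone)
print(capsicum_zone, cacao_zone, blocked)
```

Because both stand-in accessions fall inside the five-neighbour zone of the selected accession, neither remains eligible, which mirrors the behaviour described for the small C. pubescens cluster.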
This may be resolved by including more accessions from these species or by using the algorithm to select maximally diverse intra- rather than interspecific sets.
Figure 4. A multidimensional scaling plot of the 50 cacao accessions most frequently chosen in 100 repetitions of the maximally diverse algorithm.
The two data sets evaluated by the maximum genetic diversity core program had very different genetic structures. Accessions chosen for the cacao core and for the Capsicum core were selected by the program in ≥ 43% and ≥ 40%, respectively, of the 100 replications. The two data sets differed in how consistently accessions were selected for their respective cores. Thirty-one of the fifty cacao core accessions were chosen ≥ 90% of the time, with the rest of the core accessions chosen between 43% and 89% of the time. In contrast, accessions chosen for the Capsicum core were selected at much lower frequencies among the 100 replications. The greater consistency in selection of accessions for the cacao core than for the Capsicum core was due to differences in their distributions of genetic distances. The distribution of genetic distances for the cacao core and Capsicum core was similar to the distribution of genetic distances for their full data sets. Cacao had a normal distribution ranging from 0.000 to 0.569 with a mean (± SD) of 0.310 (± 0.090), whereas Capsicum had a bimodal distribution ranging from 0.000 to 0.591 with a mean (± SD) of 0.267 (± 0.162). The bimodal distribution resulted from inclusion of six different species, with distances reflecting both within- and between-species relationships. Regardless, the cores selected by the program for each data set resulted in an increase
in mean genetic diversity values and mean genetic distances.
Figure 5. Histogram of the number of times each cacao accession was chosen in 100 repetitions of random or program-generated sets of 50 accessions.
The computer program developed in this study is a useful tool for plant breeders and germplasm collection curators because it uses genetic distance matrices to better understand the genetic diversity within a germplasm collection. Specifically, the program allows a user to make groupings of nearest and farthest neighbors to an accession of particular importance and to create sets of variable size representing the maximum genetic diversity in the collection. The described algorithm selects a core collection using molecular-marker-based estimates of genetic distance. Other methods of selecting a core collection with molecular marker data include ranking populations by diversity levels and selecting subsets proportional to the relative diversity level of each population (Schoen and Brown, 1993) and principal component scoring maximizing sample diversity (Noirot et al., 1996). Other statistical approaches have selected core collections based on morphological data (Diwan et al., 1994; Franco et al., 1997; Cole-Rodgers et al., 1997) and biochemical data (Schoen and Brown, 1995). By contrast, molecular marker data reflect changes at the DNA level, not at the phenotypic level, resulting in a more robust data set from which to select. As this computer program needs only the importation of a genetic distance matrix, it can be used to analyze the many crop species where genetic distance relationships have been estimated (Lerceteau et al., 1997; Villand et al., 1998; Marita, 1999). Based on both the cacao and Capsicum germplasm analyses, a core collection selected by the maximum genetic diversity program would suitably represent sets that maximize the genetic diversity of the respective population sampled. With the help of the maximum genetic diversity program, breeders may further understand the genetic diversity represented within a
particular germplasm collection and use cores selected by the program to maintain the genetic diversity represented within the germplasm collection in a more systematic and cost-effective way. Studies such as these could be used to guide more efficient utilization of the much larger germplasm collections. Applications of the maximum genetic diversity program are available on 3½ inch disc for a nominal price to cover the cost of copying and shipping. Applications for most Macintosh computers are currently available, with PC applications pending. Requests should be directed to the corresponding author.
Acknowledgements
We thank CEPEC and Fazenda Almirante for supplying and extracting the DNA from the Theobroma cacao samples. We thank AVRDC, especially T. Berke and L.M. Engle, for supplying the Capsicum accessions.
References
Basigalup, D.H., D.K. Barnes & R.E. Stucker, 1995. Development of a core collection for perennial Medicago plant introductions. Crop Sci. 35: 1163–1168.
Brown, A.H.D., 1989. Core collections: a practical approach to genetic resources management. Genome 31: 818–824.
Brown, A.H.D., 1995. The core collection at the crossroads. In: Hodgkin, T., A.H.D. Brown, Th.J.L. van Hintum & E.A.V. Morales (Eds), Core Collections of Plant Genetic Resources, John Wiley & Sons, Chichester, U.K., pp. 3–19.
Brown, A.H.D. & M.T. Clegg, 1983. Isozyme assessment of plant genetic resources. In: Rattazzi, M.C., J.G. Scandalios & G.S. Whitt (Eds), Isozymes: Current Topics in Biological and Medical Research, Volume 11, Alan R. Liss, New York, NY, pp. 285–295.
Clegg, M.T., 1990. Molecular diversity in plant populations. In: Brown, A.H.D., M.T. Clegg, A.L. Kahler & B.S. Weir (Eds), Plant Population Genetics, Breeding, and Genetic Resources, Sinauer, Sunderland, MA.
Cole-Rodgers, P., D.W. Smith & P.W. Bosland, 1997. A novel statistical approach to analyze genetic resource evaluations using Capsicum as an example. Crop Sci. 37: 1000–1002.
Diwan, N., G.R. Bauchan & M.S. McIntosh, 1994. A core collection for the United States annual Medicago germplasm collection. Crop Sci. 34: 279–285.
Doebley, J., 1989. Isozymic evidence and the evolution of crop plants. In: Soltis, D.E. & P.S. Soltis (Eds), Isozymes in Plant Biology, Dioscorides, Portland, OR.
Doebley, J., 1992. Molecular systematics and crop evolution. In: Soltis, D.E. & P.S. Soltis (Eds), Molecular Systematics of Plants, Chapman & Hall, New York, NY.
Franco, J., J. Crossa, J. Villaseñor, S. Taba & S.A. Eberhart, 1997. Classifying Mexican maize accessions using hierarchical and density search methods. Crop Sci. 37: 972–980.
Frankel, O.H., 1984. Genetic perspectives of germplasm conservation. In: Arber, W.K., K. Illmensee, W.J. Peacock & P. Starlinger (Eds), Genetic Manipulation: Impact on Man and Society, Cambridge University Press, Cambridge, pp. 161–170.
Frankel, O.H. & A.H.D. Brown, 1984. Current plant genetic resources – a critical appraisal. In: Genetics: New Frontiers, Volume 4, Oxford and IBH Publishing Co., New Delhi, pp. 1–11.
Gepts, P., 1990. Genetic diversity of seed storage proteins in plants. In: Brown, A.H.D., M.T. Clegg, A.L. Kahler & B.S. Weir (Eds), Plant Population Genetics, Breeding, and Genetic Resources, Sinauer, Sunderland, MA.
Gepts, P., 1993. The use of molecular and biochemical markers in crop evolution studies. Evol. Biol. 27: 51–94.
Gepts, P., 1995. Genetic markers and core collections. In: Hodgkin, T., A.H.D. Brown, Th.J.L. van Hintum & E.A.V. Morales (Eds), Core Collections of Plant Genetic Resources, John Wiley & Sons, Chichester, U.K., pp. 127–146.
Gower, J.C., 1985. Measures of similarity, dissimilarity, and distance. In: Kotz, S. & N.L. Johnson (Eds), Encyclopedia of Statistical Sciences, Volume 5, Wiley, New York, NY, pp. 297–405.
Grauke, L.J. & T.E. Thompson, 1995. Evaluation of Pecan [Carya illinoinensis (Wangenh.) K. Koch] germplasm collection and designation of a core subset. HortScience 30: 950–954.
Heiser, C.B. & P.G. Smith, 1953. The cultivated Capsicum peppers. Econ. Bot. 7: 214–226.
Holbrook, C.C., W.F. Anderson & R.N. Pittman, 1993. Selection of a core collection from the U.S. germplasm collection of peanut. Crop Sci. 33: 859–861.
Johns, M.A., P.W. Skroch, J. Nienhuis, P. Hinrichsen, G. Bascur & C. Munoz-Schick, 1997. Gene pool classification of common bean landraces from Chile based on RAPD and morphological data. Crop Sci. 37: 605–613.
Lerceteau, E., T. Robert, V. Pétiard & D. Crouzillat, 1997. Evaluation of the extent of genetic variability among Theobroma cacao accessions using RAPD and RFLP markers. Theor. Appl. Genet. 95: 10–19.
Marita, J.M., 1998. Characterization of Theobroma cacao using RAPD-marker based estimates of genetic distance and recommendations for a core collection to maximize genetic diversity. M.S. Thesis, University of Wisconsin, Madison, WI.
Nei, M., 1987. Molecular Evolutionary Genetics. Columbia University Press, New York, NY.
Noirot, M., S. Hamon & F. Anthony, 1996. The principal component scoring: a new method of constituting a core collection using quantitative data. Genet. Resour. Crop Evol. 43: 1–6.
Rodriguez, J.M., T. Berke, L. Engle & J. Nienhuis, 1999. Variation among and within Capsicum species revealed by RAPD markers. Theor. Appl. Genet. 99(1/2): 147–156.
SAS Institute, 1990. SAS/STAT User’s Guide, Version 6, Fourth Edition. SAS Institute Inc., Cary, NC.
Schoen, D.J. & A.H.D. Brown, 1993. Conservation of allelic richness in wild crop relatives is aided by assessment of genetic markers. Proc. Natl. Acad. Sci. USA 90: 10623–10627.
Schoen, D.J. & A.H.D. 
Brown, 1995. Maximising genetic diversity in core collections of wild relatives of crop species. In: Hodgkin, T., A.H.D. Brown, Th. J.L. van Hintum & E.A.V. Morales (Eds), Core Collections of Plant Genetic Resources, John Wiley & Sons, Chichester, U.K., pp. 55–76. Singh, S.P., R. Nodari & P. Gepts, 1991. Genetic diversity in cultivated common bean. I. Allozymes. Crop Sci. 31: 23–29. Smith, P.G & C.B. Heiser, 1957. Breeding behavior of cultivated peppers. Am. Soc. Hort. Sci. 70: 286–290. Steel, R.G.D. & J.H. Torrie, 1980. Principles and procedures of statistics: a biometrical approach. McGraw-Hill Inc., New York, NY. Symantec, 1993–94. Symantec C++ for Macintosh, Version 7.0.4. Symantec Corporation, Cupertino, CA. Villand, J.M., P.W. Skroch, T. Lai, P. Hanson, C.G. Kuo & J. Nienhuis, 1998. Genetic variation among tomato accessions from primary and secondary centers of diversity. Crop Sci. 38: 1339–1347.