SSRs in Indian Mustard (Brassica juncea)

Identification, characterization, validation and cross-species amplification of genic-SSRs in Indian Mustard (Brassica juncea)
Binay K. Singh, Dwijesh C. Mishra, Sushma Yadav, Supriya Ambawat, Era Vaidya, Kishor U Tribhuvan, Arun Kumar, Sujith Kumar, et al.
Journal of Plant Biochemistry and Biotechnology, ISSN 0971-7811, J. Plant Biochem. Biotechnol., DOI 10.1007/s13562-016-0353-y




Author's personal copy J. Plant Biochem. Biotechnol. DOI 10.1007/s13562-016-0353-y

ORIGINAL ARTICLE

Identification, characterization, validation and cross-species amplification of genic-SSRs in Indian Mustard (Brassica juncea)

Binay K. Singh 1 & Dwijesh C. Mishra 2 & Sushma Yadav 3 & Supriya Ambawat 3 & Era Vaidya 3 & Kishor U Tribhuvan 1 & Arun Kumar 3 & Sujith Kumar 3 & Sanjeev Kumar 2 & K. K. Chaturvedi 2 & Reema Rani 3 & Prashant Yadav 3 & Anil Rai 2 & P. K. Rai 3 & V. V. Singh 3 & Dhiraj Singh 3

Received: 11 October 2015 / Accepted: 3 March 2016
© Society for Plant Biochemistry and Biotechnology 2016

Abstract Brassica juncea is an economically important oilseed crop worldwide, but its genomic resources are limited at present. We generated 47,962,057 expressed sequence reads, which were assembled into 45,280 unigenes. A total of 4108 SSR loci (≥10 bp) were identified in these unigenes. Trinucleotide was the most frequent repeat unit (59.91 %), followed by di- (38.66 %), tetra- (0.71 %), hexa- (0.49 %) and pentanucleotide repeats (0.24 %). Primers were designed for 2863 SSR loci, among which 460 were selected for primer synthesis. A total of 339 loci amplified successfully, of which 134 (39.5 %) exhibited polymorphism among six B. juncea genotypes, with PIC values ranging from 0.18 to 0.81. Further, 25 polymorphic SSRs were used for analysis of genetic variability in 25 genotypes of Brassicas and their wild relatives. Two to five alleles with PIC values of 0.22–0.66 were detected at these loci. The dendrogram grouped the genotypes according to their known pedigree/systematic position.

Keywords Brassica juncea . Transcriptome . Simple sequence repeats . Cross-species amplification

Electronic supplementary material The online version of this article (doi:10.1007/s13562-016-0353-y) contains supplementary material, which is available to authorized users.

* Binay K. Singh

1 ICAR - Indian Institute of Agricultural Biotechnology, Ranchi 834 010, Jharkhand, India
2 ICAR - Indian Agricultural Statistics Research Institute, New Delhi 110 012, India
3 ICAR - Directorate of Rapeseed-Mustard Research, Bharatpur 321 303, Rajasthan, India

Introduction

Brassica juncea (L.) Czern & Coss, commonly known as ‘Indian Mustard’, is one of the most important crop species of the family Brassicaceae. The species is divided into four subspecies, i.e. napiformis, tsatsai, integrifolia and juncea (Spect and Diederichsen 2001), of which the oilseed subsp. juncea is extremely important as an edible oil crop in the Indian sub-continent (Getinet and Sharma 1996). B. juncea (2n = 36, AABB) is a predominantly self-pollinated allotetraploid species. It originated in the Middle East and neighbouring regions through natural doubling of chromosomes after hybridization of its progenitor species, B. rapa (2n = 20, AA) and B. nigra (2n = 16, BB) (Nagaharu 1935). Both progenitor species were reportedly produced by extensive genome triplication in their ancestral species (Lysak et al. 2005). Consequently, the genome of B. juncea is very large (~1105 Mbp) (Arumuganathan and Earle 1991).

India contributes a major share of world mustard production, and the crop is predominantly used there as a source of edible oil. Both B. juncea and B. rapa are cultivated as oil crops in India, but B. juncea accounts for around 80 % of the approximately 6–7 million hectares of net sown area of Brassicas (Chauhan et al. 2011). In India, B. juncea is largely cultivated under rain-fed farming systems, where the crop is exposed to several biotic and abiotic stresses that cause considerable losses to farmers. Breeding B. juncea cultivars that are more adaptive and resilient to these stresses is therefore a necessity (Bhardwaj et al. 2015). Current breeding practices in the country are largely based on field selection, which is inefficient. A molecular marker-assisted breeding approach would be better suited to hasten the selection and development of improved mustard genotypes.

Molecular mapping of some important qualitative and quantitative traits in B. juncea has already been reported. However, most of these studies used markers that were non-PCR based (Cheung et al. 1998; Sabharwal et al. 2004), poorly reproducible (Mukherjee et al. 2001; Sharma et al. 2002) or derived from related species (Bisht et al. 2009; Panjabi-Massand et al. 2010). Further, these traits need to be fine-mapped along with several other economically important traits such as tolerance to Alternaria blight, Sclerotinia stem rot, high temperature, salinity and drought. Developing a large number of PCR-based, reproducible markers in B. juncea would be an ideal way to initiate these studies.

The advent of next generation sequencing (NGS) technologies has changed the dynamics and pace of genomic research in crop plants. The reduced cost and the ability to rapidly generate enormous amounts of sequence data have made these technologies highly effective for developing DNA-based markers in large numbers (Dutta et al. 2011). During the last few years, particular emphasis has been laid on the development of gene-based SSR markers, since they have greater potential for linkage to loci associated with agronomic phenotypes. Moreover, they are more likely to be conserved and therefore have wider applicability across related genera/species (Ellis and Burke 2007). A large number of SSR markers have been developed and extensively characterized in B. rapa (Suwabe et al. 2002; Ramchiary et al. 2011) and B. napus (Wang et al. 2012; Li et al. 2013). However, only a few SSR markers have been reported in B. juncea (Hopkins et al. 2007) and, to the best of our knowledge, no comprehensive study has so far been undertaken to identify genic-SSRs on a large scale.

In this study, a large dataset of expressed sequences was developed by Illumina MiSeq sequencing of a cDNA library from B. juncea cv CS-52. A large number of SSRs were then mined from the expressed sequence dataset. These SSRs were characterized and validated, and a subset was used for studying cross-species amplification.

Materials and methods

Plant materials

Young roots, leaves, inflorescences and developing seeds from B. juncea cv CS-52 were used for isolation of total RNA and transcriptomic analysis. For studying polymorphism at the SSR loci, six commercially released cultivars of B. juncea were used. Cross-species amplification of SSRs was evaluated using a total of 25 genotypes, which included (i) six commercially released cultivars, two registered genotypes and two exotic collections of B. juncea, (ii) three cultivars of B. rapa, (iii) one cultivar each of B. nigra, B. napus and B. carinata, and (iv) one genotype each of nine wild species belonging to five different genera of the family Brassicaceae. Seed samples of cultivated Brassica species were obtained from the Germplasm Unit, ICAR - Directorate of Rapeseed-Mustard Research, Rajasthan (India), while the seeds of wild species were collected from the National Research Centre on Plant Biotechnology, New Delhi (India). The details of the plant materials used in this study are shown in Supplementary Table 1.

RNA extraction, library preparation and cDNA sequencing

Fresh roots, leaves, inflorescences and developing seeds were collected and preserved in RNAlater RNA Stabilization Reagent procured from Qiagen (Madison, USA). Total RNA was extracted separately from one gram of each tissue using TRIzol Reagent (Life Technologies Inc., USA) following the manufacturer’s protocol. Following extraction, the total RNA was qualitatively analysed on a 1.0 % denaturing agarose gel and quantified on a Qubit fluorometer (Life Technologies Inc.). One microgram of extracted RNA from each of the four tissue types was mixed together, and the pooled RNA was used to prepare a paired-end cDNA sequencing library using the Illumina TruSeq RNA Sample Preparation Kit (Illumina, San Diego, USA) following the manufacturer’s protocol. The cDNA sequencing library was validated using an Agilent Technologies 2100 Bioanalyzer and sequenced on an Illumina MiSeq platform.

De novo assembly of Illumina MiSeq reads

Expressed sequence reads obtained by paired-end sequencing of the normalised cDNA library on the Illumina MiSeq platform were checked for quality. Sequences of low quality, judged by (i) base quality score distribution, (ii) sequence quality score distribution, (iii) average base content per read, and (iv) GC distribution, were trimmed. Adaptor sequences were also removed. Trimming and contamination removal were performed using Trimmomatic v 0.32. The resulting high quality reads were assembled using SOAPdenovo-Trans v 1.03; the gap filling option was chosen to produce the set of unigenes.
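The assembly described above is summarised in the Results by the mean contig length and the N50 value. As a minimal sketch of how these two statistics are usually computed from a list of contig lengths, the following Python fragment may help; the function names and the toy lengths are illustrative assumptions and are not taken from the study or from SOAPdenovo-Trans.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths supplied")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def mean_length(lengths):
    """Return the arithmetic mean contig length."""
    if not lengths:
        raise ValueError("no contig lengths supplied")
    return sum(lengths) / len(lengths)


# Hypothetical contig lengths in bp, for illustration only.
toy_contigs = [200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(n50(toy_contigs), mean_length(toy_contigs))
```

For the toy list the total is 5400 bp; the three longest contigs (1000, 900 and 800 bp) already hold 2700 bp, so the N50 is 800 bp while the mean is 600 bp.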
Mining of genic-SSRs

SSRs were detected in the unigene contigs using the MIcroSAtellite (MISA) tool (http://pgrc.ipk-gatersleben.de/misa). This study included only those loci which contained simple sequences of 2–6 nucleotides reiterated at least five times. Complex SSRs were not considered. SSR primers were designed with the help of BatchPrimer3 v1.0 (http://probes.pw.usda.gov/cgi-bin/batchprimer3/batchprimer3.cgi) with the following parameters: (i) primer length = 18–27 bases (optimal 22 bases), (ii) PCR product size = 100–200 bp, (iii) GC content = 40–60 % (optimal 50 %), and (iv) annealing temperature = 50–60 °C (optimal 55 °C).

Characterization of genic-SSRs and functional annotation of genes

High-quality genomic DNA of all the genotypes used in this study was extracted by the method described by Murray and Thompson (1980). Initially, a total of 460 genic-SSRs (≥18 bp) were tested for PCR amplification. The PCR reaction mixtures consisted of 1× PCR buffer, 200 μM dNTPs, 250 nM of each primer, 1.5 mM MgCl2, 0.25 U of Taq DNA polymerase and 25 ng of genomic DNA extracted from leaf tissues of CS-52, in a total volume of 10 μl. The thermal profile used for PCR was: initial denaturation at 94 °C for 5 min, followed by 35 cycles of 94 °C for 45 s, 60 °C for 30 s and 72 °C for 30 s, and a final extension at 72 °C for 7 min. PCR conditions for SSR primers that failed to amplify satisfactorily under these conditions were standardised by varying the MgCl2 concentration and the annealing temperature. The SSR primers yielding successful amplification in CS-52 were then tested for polymorphism in six genotypes of B. juncea. The PCR products were size-fractionated on 3.5 % MetaPhor (FMC BioProducts) agarose gels. FGENESH was used for the prediction of genes in the unigene sequences that were selected on the basis of successful amplification of the SSRs contained in them. BLASTx against the non-redundant protein database was performed to predict the putative function(s) of these unigenes. Further, the position of SSRs within the unigenes was analyzed with respect to the open reading frame.

Scoring and analysis of genic-SSRs

Polymorphism Information Content (PIC) values of genic-SSRs were calculated to measure their efficiency in distinguishing the genotypes/taxa. Presence of an allele was scored as (1) and absence as (0) for each of the genotypes. The PIC value was calculated according to the formula described by Weber (1990):

$$\mathrm{PIC} = 1 - \sum_{i=1}^{k} P_i^{2}$$

where k is the total number of alleles detected for an SSR locus and P_i is the frequency of the ith allele in the set of genotypes investigated. The Nei and Li (1979) similarity index was used to estimate genetic similarity (GS) between all possible pairs of genotypes by the following formula:

$$GS_{ij} = \frac{2a}{2a + b + c}$$

where a is the number of fragments shared by lines i and j, and b and c are the numbers of fragments present only in line i or only in line j, respectively. The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm was used for cluster analysis among the genotypes. The confidence limits of the dendrogram were determined by bootstrap analysis. Clustering of the genotypes and bootstrap analysis were carried out using a series of executable programs in the FreeTree software (Pavlicek et al. 1999).
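The two scoring formulas in this section, PIC = 1 − ΣP_i² and GS_ij = 2a/(2a + b + c), can be illustrated with a short Python sketch. It is not the authors' scoring script: the function names and the toy data are hypothetical, and the PIC helper assumes one allele call per genotype (a simplification that suits largely homozygous, self-pollinated lines).

```python
from collections import Counter


def pic(allele_calls):
    """PIC = 1 - sum(P_i^2), with P_i the frequency of the i-th allele."""
    counts = Counter(allele_calls)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no allele calls supplied")
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def genetic_similarity(profile_i, profile_j):
    """GS_ij = 2a / (2a + b + c) for two equal-length 0/1 fragment profiles."""
    if len(profile_i) != len(profile_j):
        raise ValueError("profiles must have the same length")
    pairs = list(zip(profile_i, profile_j))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # shared fragments
    b = sum(1 for x, y in pairs if x == 1 and y == 0)  # present only in line i
    c = sum(1 for x, y in pairs if x == 0 and y == 1)  # present only in line j
    if 2 * a + b + c == 0:
        raise ValueError("neither profile contains a fragment")
    return 2 * a / (2 * a + b + c)


# Hypothetical data: one allele call per genotype at a single locus.
balanced_locus = ["a1", "a1", "a1", "a2", "a2", "a2"]
skewed_locus = ["a1", "a1", "a1", "a1", "a1", "a2"]
print(round(pic(balanced_locus), 2), round(pic(skewed_locus), 2))

# Hypothetical 0/1 fragment profiles of two lines across five fragments.
line_i = [1, 0, 1, 1, 0]
line_j = [1, 1, 0, 1, 0]
print(round(genetic_similarity(line_i, line_j), 3))
```

With two equally frequent alleles the PIC is 0.50, which is the ceiling for a biallelic locus; when one allele is carried by five of six genotypes it falls to about 0.28. For the two toy profiles, a = 2, b = 1 and c = 1, giving GS = 4/6 ≈ 0.667.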

Results

Illumina paired-end sequencing and de novo assembly of expressed sequence reads

We generated 47,962,057 high quality expressed sequence reads from the cDNA library of B. juncea cv CS-52 using the Illumina MiSeq platform. The primary sequence output was deposited in the Sequence Read Archive (SRA) of GenBank (Acc. No. SRX831149). The dataset represented 13.3 Gb of sequence with a mean read length of 139 bp. A total of 45,280 unigene contigs were obtained from the assembly of these reads. The length of the contigs ranged from 280 to 8557 bp, with an average of 1162 bp and an N50 value of 1324 bp. Nearly half of the unigene contigs (48.4 %) were larger than 1.0 kbp. The 45,280 unigene sequences covered 52,642 kbp (~4.76 %) of the B. juncea genome, which is estimated to be 1105 Mbp.

Types and frequency distribution of genic-SSRs

The analysis of the 45,280 assembled unigene contigs resulted in the discovery of a total of 4108 discrete SSR loci in 3955 unigenes. Among these unigenes, 3806 (96.2 %) contained a single SSR locus, while 149 (3.8 %) possessed 2–3 SSR loci each. Overall, this corresponds to one SSR for every 11.0 unigenes, or an average of one SSR locus per 12.8 kbp of unigene sequence. Frequency analysis of the different repeat types revealed that trinucleotide repeats were the most abundant, with a frequency of 59.91 %. They were followed by di-, tetra-, hexa- and pentanucleotide repeats with frequencies of 38.66, 0.71, 0.49 and 0.24 %, respectively (Fig. 1a). For a given repeat unit, the number of reiterations ranged from 5 to 20, the most frequent being n = 5 (1586 loci; 38.6 %). Repeat motifs exceeding 11 reiterations were rare (less than 1 % each, Fig. 1b). SSR loci of 15 bp were the most frequent (37.9 %), followed by loci of 18 bp (17.8 %), 12 bp (13.2 %) and 14 bp (9.6 %). The longest SSR locus was 42 bp (Fig. 1c). In the present study, a total of 63 distinct repeat motifs were identified. Among them, AG/CT and AAG/CTT were the most abundant di- and trinucleotide repeats, with frequencies of 33.06 and 13.07 %, respectively. The frequencies of the 10 most abundant repeat motifs are shown in Table 1.
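The mining criterion stated in the Methods (perfect repeats of a 2–6 nucleotide motif reiterated at least five times) can be sketched with a regular expression. This is an illustrative stand-in, not the MISA tool used in the study: the function names and the toy sequence are hypothetical, compound SSRs are ignored, and the motif label simply pairs a motif with its reverse complement in the style of the AG/CT labels in Table 1, without merging rotated motifs.

```python
import re

# A 2-6 nt motif (shortest unit tried first) followed by at least four more copies.
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{4,}")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit
    (this rejects homopolymer runs such as 'AA' and units such as 'ATAT')."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_ssrs(sequence):
    """Return (start, motif, number_of_repeats) for each perfect SSR found."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        if not is_primitive(motif):
            continue
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


def motif_label(motif):
    """Label a motif together with its reverse complement, e.g. 'AG/CT'."""
    reverse_complement = motif.translate(COMPLEMENT)[::-1]
    return "/".join(sorted({motif, reverse_complement}))


# Hypothetical unigene fragment: (AG)6, (AAG)5 and a poly-A run that must be skipped.
toy_unigene = "TTGACC" + "AG" * 6 + "CATGCT" + "AAG" * 5 + "GT" + "A" * 12 + "C"
for start, motif, repeats in find_ssrs(toy_unigene):
    print(start, motif_label(motif), repeats, repeats * len(motif))
```

On the toy fragment the sketch reports an AG/CT locus of six repeats (12 bp) starting at position 6 and an AAG/CTT locus of five repeats (15 bp) starting at position 24, and it discards the poly-A run because mononucleotide repeats fall outside the 2–6 nucleotide criterion.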

Fig. 1 Graphical representation of occurrence and distribution of Brassica juncea genic-SSRs: (a) Length of repeat motifs, (b) Number of reiterations of repeat motifs, (c) Length of SSRs

Table 1 Frequency distribution of the 10 most abundant SSR repeat motifs in Brassica juncea

                      Number of reiterations of the motif
S. no.  Repeat motif       5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20   Total
1       AG/CT              –   434   342   207   132    75    31    32    26    13    10     7     8    15    21     5    1358
2       AAG/CTT          326   121    55    18     6     5     1     1     4     –     –     –     –     –     –     –     537
3       AGA/TCT          188    80    25    13     3     5     –     –     2     –     –     –     –     –     –     –     316
4       AGG/CCT          159    65    17     8     1     –     –     –     –     –     –     –     –     –     –     –     250
5       AAC/GTT          104    59    18     6     5     –     1     –     –     –     –     –     –     –     –     –     193
6       ACT/AGT           91    46    15     6     4     1     2     –     1     –     –     –     –     –     –     –     166
7       AC/GT              –    72    45    17    20     2     2     1     –     2     –     –     –     –     –     –     161
8       ATC/GAT           92    40    20     6     –     –     1     –     –     –     –     –     –     –     –     –     159
9       CTC/GAG          102    30    19     2     2     –     –     –     –     –     –     –     –     –     –     –     155
10      ACC/GGT           95    30     9     –     –     1     –     –     –     –     –     –     –     –     –     –     135
        Other motifs a   429   157    44    23     7     6     6     3     2     1     –     –     –     –     –     –     678
        Total           1586  1134   609   306   180    95    44    37    35    16    10     7     8    15    21     5    4108

a Additional 53 types of repeat motifs in Brassica juncea transcriptome

Validation and cross-species amplification of genic-SSRs

Out of the total 4108 SSR loci identified in this study, PCR primers were successfully designed for 2863 SSR loci. These were designated as BjSSR-1 to BjSSR-2863 (Supplementary Table 2). Among the 2863 SSR loci, 887 had a length of ≥18 bp, of which 460 were selected for primer synthesis. Of these, 339 primer pairs yielded successful amplification in CS-52, of which 270 amplified products of the expected size. A total of 61 primer pairs amplified products

that exceeded the expected size; eight primer pairs amplified more than one band, while 121 primer pairs did not show detectable amplification (Fig. 2). Among the 339 SSR-containing unigenes, 203 (59.9 %) showed similarity with functionally characterized proteins, while 92 (27.1 %) showed similarity with proteins of unknown/hypothetical function. Forty-four (13 %) sequences were found to be unique, as they did not match significantly with any known protein. Among the 339 genic-SSRs, 232 (68.4 %) were located in the protein coding region, while the remaining SSRs were located in the un-translated regions (UTRs) (Supplementary Table 3).

All 339 genic-SSRs showing successful amplification were scored for polymorphism among six B. juncea genotypes. A total of 134 (39.5 %) SSR loci were found to be polymorphic, of which 119 loci had only two alleles each. Thirteen SSR loci exhibited three alleles each, while one SSR locus each exhibited four and nine alleles. The polymorphic SSRs displayed PIC values ranging from 0.18 to 0.81, with an average of 0.30 (Supplementary Table 4).

Twenty-five polymorphic genic-SSRs were used for studying cross-species amplification in a set of 25 genotypes representing commercially released cultivars/registered genotypes/exotic collections of B. juncea, and related cultivated and wild species. A total of 82 different alleles, ranging from 2 to 5 per locus, were identified among the 25 genotypes. On average, 3.04 alleles per locus were detected, with PIC values ranging from 0.22 to 0.66. The average number of alleles per locus and the average PIC value for the wild species were 2.8 and 0.53, respectively. These values were lower for B. juncea (alleles per locus = 2.28; PIC = 0.40) and for the other cultivated Brassicas (alleles per locus = 2.6; PIC = 0.49) (Table 2, Fig. 3).

Cluster analysis using Nei and Li similarity coefficients generated two major clusters in the dendrogram (Fig. 4). Cluster I included the crop Brassicas, while cluster II included all the wild species. Both clusters were further divided into sub-clusters. Sub-cluster Ia comprised the ‘B’ genome-containing species, namely B. juncea (AABB), B. nigra (BB) and B. carinata (BBCC). Brassica juncea included Rohini, RH-819, NPJ-112, RGN-48, Navagold, DRMRIJ-31, Bio-YSR, PHR-2, Donskaja-IV and Loriet, while B. nigra and B. carinata were represented by Surya and Kiran, respectively. Sub-cluster Ib was composed of B. napus (AACC) and one of its ancestral diploid species

B. rapa (AA). B. napus was represented by GSC-6, while B. rapa included PT-30, NRCYS-05 and KOS-1. The average genetic distance between the genotypes of the ‘B’ genome-containing polyploid species B. juncea and B. carinata, and one of their ancestral diploid species, B. nigra, was 65 and 63 %, respectively, while it was 61 and 57 % with B. rapa. The average genetic distance between B. napus and the other two polyploids, B. carinata and B. juncea, was 53.6 and 63.13 %, respectively, while it was 71.3 % with B. rapa. Among the different genotypes of B. juncea, the average genetic distance between the genotypes of Indian origin and the exotic genotypes Donskaja IV and Loriet was 73 and 70 %, respectively. Among the Indian genotypes of B. juncea, the level of genetic similarity ranged from 83 to 96 %.

Cluster II was divided into two major sub-clusters, IIa and IIb. Sub-cluster IIa was further subdivided into sub-clusters IIa1 and IIa2. Sub-cluster IIa1 comprised B. fruticulosa, B. spinescens and B. tournefortii, among which B. fruticulosa and B. spinescens were closer to each other (similarity level of 95.6 %) than to B. tournefortii. Diplotaxis siettiana, D. tenuisiliqua and D. assurgens constituted sub-cluster IIa2. Among these taxa, D. assurgens and D. tenuisiliqua exhibited more similarity (93.3 %) with each other than with D. siettiana. Sub-cluster IIb was represented by Lepidium sativum, Camelina sativa and Capsella bursa-pastoris. Within sub-cluster IIb, the maximum similarity (90.3 %) was obtained between C. sativa and C. bursa-pastoris, while the minimum (76 %) was between L. sativum and C. sativa.

Fig. 2 Agarose gel showing PCR amplification of genic-SSRs using genomic DNA extracted from Brassica juncea cv CS-52: 1 BjSSR-207, 2 BjSSR-220, 3 BjSSR-228, 4 BjSSR-258, 5 BjSSR-259, 6 BjSSR-261, 7 BjSSR-264, 8 BjSSR-277, 9 BjSSR-278, 10 BjSSR-280, 11 BjSSR-310, 12 BjSSR-312, 13 BjSSR-313, 14 BjSSR-347, 15 BjSSR-348, 16 BjSSR-358, 17 BjSSR-361, 18 BjSSR-362, 19 BjSSR-373, 20 BjSSR-376, 21 BjSSR-392, 22 BjSSR-393, 23 BjSSR-411, 24 BjSSR-418. M 100 bp DNA ladder

Table 2 Particulars of 25 polymorphic genic-SSRs in B. juncea which were used for studying cross-species amplification in a set of 25 genotypes of Brassica and related genera (Cultivated = related cultivated species; Wild = wild species)

                                         Number of alleles                     PIC value
S. no.  SSR ID       Allele size (bp)    B. juncea  Cultivated  Wild  Total    B. juncea  Cultivated  Wild  Total
1       BjSSR-55     500–700                     3           3     3      3         0.56        0.64  0.66   0.64
2       BjSSR-100    240–280                     2           2     4      4         0.50        0.44  0.58   0.52
3       BjSSR-133    160–180                     2           2     2      2         0.50        0.34  0.50   0.50
4       BjSSR-190    200–250                     2           4     4      5         0.20        0.52  0.58   0.46
5       BjSSR-678    100–180                     2           2     3      3         0.20        0.38  0.59   0.43
6       BjSSR-835    400–500                     3           4     3      4         0.44        0.68  0.51   0.56
7       BjSSR-903    200–350                     3           3     3      3         0.29        0.46  0.57   0.49
8       BjSSR-904    200–240                     2           2     2      3         0.17        0.24  0.24   0.22
9       BjSSR-1146   150–260                     3           2     4      4         0.56        0.42  0.61   0.57
10      BjSSR-1248   200–220                     2           3     3      3         0.46        0.59  0.64   0.61
11      BjSSR-1331   200–300                     2           4     3      4         0.54        0.69  0.34   0.56
12      BjSSR-1579   80–150                      1           3     3      3         0.00        0.59  0.50   0.60
13      BjSSR-1737   120–150                     2           3     2      3         0.30        0.59  0.47   0.52
14      BjSSR-1907   80–120                      2           2     4      4         0.17        0.38  0.59   0.44
15      BjSSR-1982   120–180                     3           3     3      3         0.67        0.64  0.66   0.66
16      BjSSR-2018   150–200                     2           3     3      3         0.50        0.41  0.66   0.62
17      BjSSR-2242   180–200                     2           2     2      2         0.30        0.35  0.43   0.36
18      BjSSR-2384   150–200                     2           2     2      2         0.15        0.22  0.49   0.40
19      BjSSR-2386   250–300                     2           2     3      3         0.48        0.50  0.65   0.60
20      BjSSR-2404   200–210                     2           2     2      2         0.47        0.50  0.50   0.49
21      BjSSR-2553   250–300                     3           2     2      3         0.47        0.44  0.44   0.54
22      BjSSR-2606   300–320                     2           2     2      2         0.50        0.49  0.50   0.50
23      BjSSR-2624   180–210                     3           3     3      3         0.58        0.67  0.62   0.66
24      BjSSR-2677   180–200                     3           3     3      3         0.44        0.63  0.51   0.63
25      BjSSR-2794   150–200                     2           2     2      2         0.43        0.35  0.29   0.36
        Average                               2.28         2.6   2.8   3.04         0.40        0.49  0.53   0.52

Fig. 3 Agarose gel showing allelic variation among Brassica juncea and related cultivated and wild species with genic-SSR BjSSR-1146: 1 Rohini, 2 RH-819, 3 NPJ-112, 4 RGN-48, 5 Navagold, 6 DRMRIJ-31, 7 Bio-YSR, 8 PHR-2, 9 Donskaja-IV, 10 Loriet, 11 PT-30, 12 NRCYS-05, 13 KOS-1, 14 Surya, 15 GSC-6, 16 Kiran, 17 Brassica fruticulosa, 18 B. spinescens, 19 B. tournefortii, 20 Diplotaxis assurgens, 21 D. tenuisiliqua, 22 D. siettiana, 23 Lepidium sativum, 24 Capsella bursa-pastoris, 25 Camelina sativa. M 50 bp DNA ladder

Fig. 4 Dendrogram showing phylogenetic relationship among 10 Brassica juncea cultivars/registered genotypes, exotic collections, six cultivars from other crop Brassicas and nine wild species generated from 25 genic-SSRs
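The clusters reported in this section were obtained by UPGMA on pairwise similarity values. A minimal pure-Python sketch of the UPGMA merging rule is given below, under stated assumptions: it is not the FreeTree program used in the study, it omits bootstrapping, the genotype labels and distances are hypothetical, and distances are taken to be 100 − GS (in %).

```python
from itertools import combinations


def average_distance(cluster_a, cluster_b, dist):
    """Mean of all pairwise distances between members of two clusters."""
    total = sum(dist[frozenset((x, y))] for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def upgma(names, dist):
    """Repeatedly merge the two clusters with the smallest average distance.

    names: list of labels; dist: dict mapping frozenset({x, y}) to a distance.
    Returns the merges in order as (cluster_a, cluster_b, average_distance).
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        merges.append((a, b, average_distance(a, b, dist)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return merges


# Hypothetical distances (100 - GS, in %) among four placeholder genotypes.
toy_names = ["g1", "g2", "g3", "g4"]
toy_dist = {
    frozenset(("g1", "g2")): 4,
    frozenset(("g3", "g4")): 10,
    frozenset(("g1", "g3")): 40,
    frozenset(("g1", "g4")): 44,
    frozenset(("g2", "g3")): 42,
    frozenset(("g2", "g4")): 46,
}
for a, b, d in upgma(toy_names, toy_dist):
    print(a, b, d)
```

In the toy example g1 and g2 join first (distance 4), g3 and g4 join next (distance 10), and the two pairs are finally joined at the mean of the four between-pair distances, (40 + 44 + 42 + 46)/4 = 43. Recomputing averages from the original matrix at every step is slow for large data sets but keeps the rule easy to check for a panel of 25 genotypes.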

Discussion

DNA-based markers are important genetic tools for understanding genome dynamics and facilitating molecular breeding. While detailed high-density integrated genetic maps derived from different populations and marker types have been generated for most of the cultivated Brassica species during the last two decades (Ramchiary et al. 2011; Paritosh et al. 2013; Raman et al. 2014), progress in this area has remained slow in B. juncea. Therefore, in this study, we have identified, characterized and validated a large set of genic-SSRs which can be used for extensive genetic mapping and phylogeny studies in B. juncea. The genic-SSRs were identified in an expressed sequence dataset generated by whole transcriptome sequencing on the Illumina MiSeq platform.

Illumina paired-end sequencing and assembly

In order to produce a maximally informative B. juncea sequence resource, total RNA was extracted from four different tissues, namely young roots, leaves, inflorescences and developing seeds of B. juncea cv CS-52. The cDNA library was prepared after pooling equal amounts of extracted RNA from the four tissues, and it was normalised before sequence analysis. Normalization reduces the disparity in concentrations of cDNAs from various genes and at the same time facilitates efficient detection of rare transcripts (Shcheglov et al. 2007). We used the Illumina MiSeq platform to obtain an extensive collection of expressed sequence reads from the cDNA library prepared from B. juncea cv CS-52. Due to their short read length, Illumina sequencing platforms have mainly been utilized in organisms with reference genomes (Nagalakshmi et al. 2008). However, during the last few years it has been established that relatively short reads can also be assembled effectively, particularly with the help of paired-end sequencing (Maher et al. 2009). Moreover, with the advent of Illumina MiSeq technology, which provides significantly longer read lengths and thus better contig assembly, transcriptome or whole genome de novo sequencing and assembly are being extended to non-model organisms as well (Garg et al. 2011).

In the present study, 13.3 Gb of Illumina MiSeq sequence data was generated. A total of 47,962,057 high quality reads from this dataset were assembled into 45,280 unigene contigs with a mean length of 1162 bp. The mean length of the unigene contigs obtained in this study was longer than in many previous studies. Moreover, the N50 size of the unigene contigs was also longer than in many earlier reports (Wei et al. 2011; Zhang et al. 2012). In this study, about 42 % of the sequence reads were assembled, and about 59.9 % of the SSR-containing unigenes found a significant match in the NCBI non-redundant protein database. A significant proportion of unigenes may lack matches in the protein database because they represent new rare transcripts (Emrich et al. 2007) or because the unique contigs are part of UTRs (Roy et al. 2007).

Frequency and distribution of SSR loci

A total of 4108 perfect SSR loci of ≥10 bp were identified from 45,280 unigene contigs, with an overall frequency of nearly one in every 11 unigenes (9.1 %).
It is comparable to the mean value (9.0 %) of EST-SSR frequency obtained by Ellis and Burke (2007) in their studies on 33 different plant genera. In this study, we obtained a genic-SSR density of one SSR per 12.8 kbp. This is slightly higher than in the related species Arabidopsis thaliana (1/13.83 kbp) (Victoria et al. 2011). The variation in frequency and abundance of genic-SSRs may be attributed partly to the redundancy of expressed sequences, the number and length of unigene contigs, and the tools and parameters used to detect SSRs. The size of the genome may also have a significant impact on the frequency of genic-SSRs (Varshney et al. 2005a).

In our investigation, trinucleotide repeats were the most common repeat type, representing 59.91 % of the SSR loci identified. This is in agreement with a number of earlier reports (Izzah et al. 2014; Hendre and Aggarwal 2014). It was also found that 232 (68.4 %) out of 339 validated SSR loci were located in the protein coding regions, while the remaining 107 (31.6 %) were located in UTRs. A higher frequency of trinucleotide repeats in the protein coding regions than in UTRs has already been reported by Yu et al. (2004a, b), who found 74 % of the trinucleotide repeats in the protein coding regions and only 26 % in UTRs. A high frequency of trinucleotide repeats in the protein coding region can generate non-frameshift mutations which may lead to variation of the amino acid residues (Metzgar et al. 2000). ‘AG/CT’ was the most common dinucleotide repeat motif and ‘AAG/CTT’ the most common trinucleotide repeat motif, accounting for 33.06 and 13.07 % of the SSR loci, respectively. These results are consistent with earlier studies on expressed sequences of dicotyledonous plants (Izzah et al. 2014).

Cross-species amplification of genic-SSR loci

Out of the 4108 genic-SSRs identified, PCR primers were designed for 2863 SSRs, of which 460 SSR primer pairs were tested for amplification in CS-52. A total of 339 (73.7 %) primer pairs yielded amplification products. The success rate for SSR amplification generally ranges between 60 and 90 %, depending on the quality of the sequence or the location of the primers within the SSR-containing genes (Varshney et al. 2005a). A majority (79.6 %) of the genic-SSRs amplified a single amplicon of the expected size. Errors introduced during the assembly of sequence reads may have contributed, at least partly, to the deviation of the remaining amplicons from the expected size. The presence of large introns within the SSR-containing unigenes may be another possible reason for this deviation. To determine the level of polymorphism among the genic-SSRs, all 339 primer pairs which successfully yielded PCR amplicons were subjected to a polymorphism study using a set of six B. juncea genotypes.
Out of 339 SSRs tested, 134 (39.5 %) were found to be polymorphic, with PIC values ranging from 0.18 to 0.81. The proportion of polymorphic SSRs was close to the lower end of the 40–89 % range reported for genic-SSRs in other crops (Yu et al. 2004a, b; Varshney et al. 2005b). A subset of 25 polymorphic genic-SSRs was further employed for cross-species amplification and analysis of genetic relationships among 25 different genotypes of cultivated Brassicas and related wild species. These SSRs clearly distinguished the different genotypes and produced distinct groupings of genera and species that corresponded well with their taxonomic classification. In the present study, the diploid species B. nigra (BB) and the polyploid species B. juncea (AABB) and B. carinata (BBCC) clustered in the same group, which indicated that the genomes of the two polyploids were similar to that of their common ancestral species B. nigra. However, as compared to B. napus (AACC), they were more divergent from B. rapa (AA). This suggested that the degree of change in the A, B or C genomes differs among the allopolyploids of Brassica. These results are similar to earlier reports (Liu and Wang 2006; Singh et al. 2012a, b). Among the 10 genotypes of B. juncea considered in the study, all the commercially released cultivars/registered genotypes of Indian origin clustered together, which may be attributed to the fact that most of these genotypes are pedigree selections from a few common ancestors of the Indian gene pool. These genotypes, however, were quite distinct from the east European and Chinese genotypes Donskaja IV and Loriet, respectively, as reported earlier (Pradhan et al. 1993).

All the wild species clustered into two major subgroups. Subgroup IIa was represented by the tribe Brassiceae, while subgroup IIb contained members of the two closely related tribes Lepidieae and Camelineae. Subgroup IIa1 was constituted by B. fruticulosa, B. spinescens and B. tournefortii. These taxa have been extensively studied at both the cytological and molecular levels by earlier workers, and a high level of genetic similarity between them has already been reported (Warwick and Black 1993; Sánchez-Yélamo 2004; Warwick and Black 1991). Subgroup IIa2 was shared by the members of the genus Diplotaxis. As indicated in earlier reports, D. siettiana of section Heterocarpum was found to be quite distinct from D. assurgens and D. tenuisiliqua, which belong to section Rynchocarpum (Gómez-Campo 1999). Subgroup IIb contained C. sativa, C. bursa-pastoris and L. sativum. As expected, L. sativum, belonging to the tribe Lepidieae, was markedly distinguished from C. bursa-pastoris and C. sativa of the tribe Camelineae.

In conclusion, the genic-SSRs identified in this study constitute a set of potential markers that can be applied across the different species of Brassica and related genera and used for the assessment of genetic relationships as well as for genetic mapping studies.
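The PIC range reported for the polymorphic markers (0.18 to 0.81) can be related to allele frequencies with a short calculation. The sketch below assumes the widely used definition of polymorphism information content attributed to Botstein and co-workers, PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², where p_i is the frequency of the i-th allele at a locus. The formula actually applied in this study is the one described in its methods and may differ (for example, the simplified form 1 − Σ p_i²). The function name `pic` and the example frequencies are hypothetical.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphism information content for one locus.

    Assumes the Botstein-style definition:
        PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2
    Input may be frequencies or raw allele counts; it is normalized to sum to 1.
    """
    total = sum(allele_freqs)
    if total <= 0:
        raise ValueError("allele frequencies must sum to a positive value")
    p = [f / total for f in allele_freqs]

    sum_sq = sum(x * x for x in p)
    cross = sum(2 * (a * a) * (b * b) for a, b in combinations(p, 2))
    return 1.0 - sum_sq - cross


# Hypothetical allele frequency profiles, for illustration only.
examples = {
    "two alleles, equal": [0.5, 0.5],
    "two alleles, skewed": [0.9, 0.1],
    "five alleles, equal": [0.2] * 5,
}
for label, freqs in examples.items():
    print(f"{label}: PIC = {pic(freqs):.3f}")
```

Under this definition a locus with two equally frequent alleles has a PIC of 0.375, so values near the upper end of the reported range imply several alleles at comparable frequencies, while values near the lower end imply one predominant allele.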
Acknowledgments

We sincerely acknowledge the Director, ICAR-Directorate of Rapeseed-Mustard Research, Bharatpur-321303, Rajasthan, India, for providing financial support and the facilities to carry out this research work. We also gratefully acknowledge the Director, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, India, for providing the computational facilities of the Centre for Agricultural Bioinformatics, and Dr. S.R. Bhatt (Principal Scientist), ICAR-National Research Centre on Plant Biotechnology, New Delhi-110012, India, for providing seed materials of the wild species used in this study.

Compliance with ethical standards

Conflict of Interest The authors declare that they have no competing interest.

References

Arumuganathan K, Earle ED (1991) Nuclear DNA content of some important plant species. Plant Mol Biol Rep 9:208–218
Bhardwaj AR, Joshi G, Kukreja B, Malik V, Arora P, Pandey R, Shukla RN, Bankar KG, Agarwal SK et al (2015) Global insights into high temperature and drought stress regulated genes by RNA-Seq in economically important oilseed crop Brassica juncea. BMC Plant Biol 15:9
Bisht NC, Gupta V, Ramchiary N, Sodhi YS, Mukhopadhyay A, Arumugam N, Pental D, Pradhan AK (2009) Fine mapping of loci involved with glucosinolate biosynthesis in oilseed mustard (Brassica juncea) using genomic information from allied species. Theor Appl Genet 118:413–421
Chauhan JS, Singh KH, Singh VV, Satyanshu K (2011) Hundred years of rapeseed-mustard breeding in India: accomplishments and future strategies. Indian J Agric Sci 81:1093–1109
Cheung WY, Gugel RK, Landry BS (1998) Identification of RFLP markers linked to the white rust resistance gene (Acr) in mustard (Brassica juncea (L.) Czern. and Coss.). Genome 41:626–628
Dutta S, Kumawat G, Singh BP, Gupta DK, Singh S, Dogra V, Gaikwad K, Sharma TR et al (2011) Development of genic-SSR markers by deep transcriptome sequencing in pigeonpea [Cajanus cajan (L.) Millspaugh]. BMC Plant Biol 11:17
Ellis JR, Burke JM (2007) EST-SSRs as a resource for population genetic analyses. Heredity 99:125–132
Emrich SJ, Barbazuk WB, Li L, Schnable PS (2007) Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res 17:69–73
Garg R, Patel RK, Tyagi AK, Jain M (2011) De novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification. DNA Res 18:53–63
Getinet A, Sharma SM (eds) (1996) Niger [Guizotia abyssinica (L.f.) Cass.]: promoting the conservation and use of underutilized and neglected crops. International Plant Genetic Resources Institute, Rome
Gómez-Campo C (1999) Taxonomy. In: Biology of Brassica Coenospecies. Elsevier, Amsterdam
Hendre PS, Aggarwal RK (2014) Development of genic and genomic SSR markers of robusta coffee (Coffea canephora Pierre Ex A. Froehner). PLoS One 9(12):e113661
Hopkins CJ, Cogan NOI, Hand M, Jewell E, Kaur J et al (2007) Sixteen new simple sequence repeat markers from Brassica juncea expressed sequences and their cross-species amplification. Mol Ecol Notes 7:697–700
Izzah NK, Lee J, Jayakodi M, Perumal S, Jin M, Beom-Seok P, Ahn K, Yang TJ (2014) Transcriptome sequencing of two parental lines of cabbage (Brassica oleracea L. var. capitata L.) and construction of an EST-based genetic map. BMC Genomics 15:149
Li H, Younas M, Wang X, Li X, Chen L, Zhao B, Chen X, Xu J, Hou F, Hong B, Liu G, Zhao H, Wu X, Du H, Wu J, Liu K (2013) Development of a core set of single-locus SSR markers for allotetraploid rapeseed (Brassica napus L.). Theor Appl Genet 126:937–947
Liu A, Wang J (2006) Genomic evolution of Brassica allopolyploids revealed by ISSR marker. Genet Resour Crop Ev 53:603–611
Lysak MA, Koch MA, Pecinka A, Schubert I (2005) Chromosome triplication found across the tribe Brassiceae. Genome Res 15:516–525
Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, Luo S, Khrebtukova I et al (2009) Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci 106:12353–12358
Metzgar D, Bytof J, Wills C (2000) Selection against frame shift mutations limits microsatellite expansion in coding DNA. Genome Res 10:72–80
Mukherjee AK, Mohapatra T, Varshney A, Sharma R, Sharma RP (2001) Molecular mapping of a locus controlling resistance to Albugo candida in Indian mustard. Plant Breed 120:483–487
Murray MG, Thompson WF (1980) Rapid isolation of high molecular weight plant DNA. Nucleic Acids Res 8:4321–4326
Nagaharu U (1935) Genome analysis in Brassica with special reference to the experimental formation of B. napus and peculiar mode of fertilization. Jpn J Bot 7:389–452
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320:1344–1349
Nei M, Li WH (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci 76:5269–5273
Panjabi-Massand P, Yadava SK, Sharma P, Kaur A, Kumar A, Arumugam N, Sodhi YS, Mukhopadhyay A, Gupta V, Pradhan AK, Pental D (2010) Molecular mapping reveals two independent loci conferring resistance to Albugo candida in the east European germplasm of oilseed mustard Brassica juncea. Theor Appl Genet 121:137–145. doi:10.1007/s00122-010-1297-6
Paritosh K, Yadava SK, Gupta V, Panjabi-Massand P, Sodhi YS, Pradhan AK, Pental D (2013) RNA-seq based SNPs in some agronomically important oleiferous lines of Brassica rapa and their use for genome-wide linkage mapping and specific-region fine mapping. BMC Genomics 14:463
Pavlicek A, Hrda S, Flegr J (1999) Free-Tree–freeware program for construction of phylogenetic trees on the basis of distance data and bootstrap/jackknife analysis of the tree robustness. Application in the RAPD analysis of genus Frenkelia. Folia Biol (Praha) 45:97–99
Pradhan AK, Sodhi YS, Mukhopadhyay A, Pental D (1993) Heterosis breeding in Indian mustard (Brassica juncea L. Czern & Coss): analysis of component characters contributing to heterosis for yield. Euphytica 69:219–229
Raman H, Dalton-Morgan J, Diffey S, Raman R, Alamery S, Edwards D, Batley J (2014) SNP markers-based map construction and genome-wide linkage analysis in Brassica napus. Plant Biotechnol J 12:851–860
Ramchiary N, Nguyen VD, Li X, Hong CP, Dhandapani V, Choi SR, Yu G, Piao ZY, Lim YP (2011) Genic microsatellite markers in Brassica rapa: development, characterization, mapping, and their utility in other cultivated and wild Brassica relatives. DNA Res 18:305–320
Roy SW, Penny D, Neafsey DE (2007) Evolutionary conservation of UTR intron boundaries in Cryptococcus. Mol Biol Evol 24:1140–1148
Sabharwal V, Negi MS, Banga SS, Lakshmikumaran M (2004) Mapping of AFLP markers linked to seed coat colour loci in Brassica juncea (L.) Czern. Theor Appl Genet 109:160–166
Sánchez-Yélamo MD (2004) Taxonomic relationships among Erucastrum and Brassica species based on flavonoids compounds. Eucarpia Cruciferae Newsl 25:13–14
Sharma R, Aggarwal RA, Kumar R, Mohapatra T, Sharma RP (2002) Construction of an RAPD linkage map and localization of QTLs for oleic acid level using recombinant inbreds in mustard (Brassica juncea). Genome 45:467–472
Shcheglov AS, Zhulidov PA, Bogdanova EA, Shagin DA (2007) Normalization of cDNA libraries. In: Nucleic acids hybridization. Springer, p 97–124
Singh BK, Thakur AK, Rai PK (2012a) Genetic diversity and relationships in wild species of Brassica and allied genera as revealed by cross-transferable genomic STMS marker assays. Aust J Crop Sci 6:815–821
Singh BK, Thakur AK, Tiwari SK, Siddiqui SA, Singh VV, Rai PK (2012b) Transferability of Brassica-derived microsatellites to related genera and their implications for phylogenetic analysis. Natl Acad Sci Lett 35:37–44
Spect CE, Diederichsen A (2001) Brassica. In: Mansfeld’s encyclopedia of agricultural and horticultural crops. Springer, p 1453–1456
Suwabe K, Iketani H, Nunome T, Kage T, Hirai M (2002) Isolation and characterization of microsatellites in Brassica rapa L. Theor Appl Genet 104:1092–1098
Varshney RK, Graner A, Sorrells ME (2005a) Genic microsatellite markers in plants: features and applications. Trends Biotechnol 23:48–55
Varshney RK, Sigmund R, Borner A, Korzun V, Stein N, Sorrells ME, Langridge P, Graner A (2005b) Interspecific transferability and comparative mapping of barley EST-SSR markers in wheat, rye and rice. Plant Sci 168:195–202
Victoria FC, da Maia LC, de Oliveira AC (2011) In silico comparative analysis of SSR markers in plants. BMC Plant Biol 11:15
Wang F, Wang XF, Chen X, Xiao Y, Li H, Zhang S, Xu J, Fu J, Huang L, Liu C, Wu J, Liu K (2012) Abundance, marker development and genetic mapping of microsatellites from unigenes in Brassica napus. Mol Breeding 30:731–744
Warwick SI, Black LD (1991) Molecular systematics of Brassica and allied genera (subtribe Brassicinae, Brassiceae)-chloroplast genome and cytodeme congruence. Theor Appl Genet 82:81–92
Warwick SI, Black LD (1993) Molecular relationships in subtribe Brassicinae (Cruciferae, tribe Brassiceae). Can J Bot 71:906–918
Weber JL (1990) Informativeness of human (dC - dA)n (dG - dT)n polymorphisms. Genomics 7:524–530
Wei W, Qi X, Wang L, Zhang Y, Hua W, Li D, Haixia L, Zhang X (2011) Characterization of the sesame (Sesamum indicum L.) global transcriptome using Illumina paired-end sequencing and development of EST-SSR markers. BMC Genomics 12:451
Yu JK, La Rota M, Kantety RV, Sorrells ME (2004a) EST derived SSR markers for comparative mapping in wheat and rice. Mol Gen Genet 271:742–751
Yu JK, Dake TM, Singh S, Benscher D, Li W, Gill B, Sorrells ME (2004b) Development and mapping of EST-derived simple sequence repeat (SSR) markers for hexaploid wheat. Genome 47:805–818
Zhang H, Wei L, Miao H, Zhang T, Wang C (2012) Development and validation of genic-SSR markers in sesame by RNA-seq. BMC Genomics 13:316