Species used in comparison with M. cinxia are the butterflies Kallima inachus and H. melpomene and the moths B. mori and P. xylostella.
Supplementary Figures
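[Workflow diagram with three columns: DATA, PROCEDURE and RESULTS (N50, total length).
DATA, in order of use: 454 single reads and Illumina PE; Illumina PE; Illumina MP, SOLiD MP and 454 MP; PacBio long reads, 454 MP, SOLiD MP and linkage map.
PROCEDURE: read error correction*, contig assembly, repeat masking, read error correction*, mapping, scaffolding*, gap closing, superscaffolding*.
RESULTS: initial contigs, N50 2 kb, total length 354 Mb; final scaffolds, N50 119 kb, total length 390 Mb; final contigs, N50 13 kb, total length 361 Mb; superscaffolds, N50 258 kb, total length 393 Mb.]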
Supplementary Figure 1. Genome assembly workflow. The sequenced data sets (left), the assembly workflow (middle), and the key statistics (right) of the M. cinxia genome assembly. Stages with novel methods1-3 developed during this project are marked with *; PE, Paired-end library; MP, Mate-pair library.
[Gel image. Lanes: Intact, Sonicated. Marker band sizes, left: 10 kbp, 8 kbp, 6 kbp, 5 kbp, 4 kbp; right: 48502 bp, 17000 bp, 10171 bp.]
Supplementary Figure 2. High molecular weight DNA used in MP library construction. Intact gradient-isolated DNA and sonicated DNA used in library construction are shown. Molecular weight markers are Left: GeneRuler 1 kb DNA Ladder (Fermentas) and Right: GeneRuler High Range DNA Ladder (Fermentas).
[Histograms, one panel per library: IlluminaPE1, IlluminaPE2, IlluminaMP1, IlluminaMP2, IlluminaMP3, IlluminaMP4, SOLiDMP1, SOLiDMP2, 454MP1 and 454MP2. x-axis: Insert size; y-axis: Frequency.]
Supplementary Figure 3. Insert size distributions of the PE and MP libraries. The insert size distributions of the SOLiDMP2, IlluminaMP2, IlluminaMP3 and IlluminaMP4 libraries show two peaks. The lower peaks between 0.5 kb and 2 kb represent fragmented DNA.
[Plot: N50 (axis from 0 to 300000) at each stage: Initial contigs; +IlluminaPE1; +IlluminaPE2+IlluminaMP1; +SOLiDMP1; +SOLiDMP2+IlluminaMP2; +IlluminaMP3+IlluminaMP4; +454MP1; +454MP2 (Final scaffolds); Superscaffolds.]
Supplementary Figure 4. N50 values at intermediate stages of scaffolding. Scaffolds of the previous stage were merged using the PE and MP libraries with longer insert size.
[Plot: Proportion of assembly (%) (y-axis, 0 to 100) against Minimum scaffold length (x-axis, log scale, 1000 to 1e+06), with N50 and N90 marked.]
Supplementary Figure 5. Cumulative scaffold length of the final assembly. For each minimum scaffold length, the proportion of the assembly covered by scaffolds that are longer than the minimum length is shown. N50 (green) and N90 (red) values are indicated.
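The quantities shown in this figure can be derived from a plain list of scaffold lengths. The following Python sketch illustrates one way to do so; the function names and the toy lengths are hypothetical and are not taken from the assembly pipeline, and counting scaffolds of exactly the minimum length as covered is an assumption.

```python
def nx(lengths, fraction):
    """Return the Nx value (N50 for fraction=0.5, N90 for fraction=0.9):
    the length L such that scaffolds of length >= L together cover
    at least `fraction` of the total assembly length."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold downwards, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


def proportion_covered(lengths, min_length):
    """Percentage of the assembly contained in scaffolds of at least
    `min_length` (assumption: ties are counted as covered)."""
    total = sum(lengths)
    return 100.0 * sum(l for l in lengths if l >= min_length) / total


# Toy example with made-up scaffold lengths.
toy = [10, 8, 6, 4, 2]
print(nx(toy, 0.5), nx(toy, 0.9), proportion_covered(toy, 6))
```

Evaluating `proportion_covered` over a grid of minimum lengths gives a curve of the kind plotted here, and `nx` gives the marked N50 and N90 values.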
Rows in each alignment block: contig1 (top), contig2 (bottom).
451 GATCAGATGTCTcGCCCTAAATACTTGCGTGCCATGTAGGGTTAAGGTTACCTTTAAAaTGCATTTTTAGCATTAAGCTGagAAAaGGGTCGATACTGTA ||||||||||||||||||||||||.||||||||||||||||||||||||.||||||||||||||||| 1 ---------------------------------ATGTAGggTTAAGGTTACcTTTAATATGCATTTTTAGCATTAAGCTGAGgAAaGGGTCGATACTGTA
550
551 TAACGTTAAA---ATGAaCGTTAAAACGAGTACTATAATAAGTGTGAAATTTTATCGTATTTACTTGTAAATTGTGATTTCTTTCATTGTAATTATAACT |||||||||| |||||||||||||||||||||||||||||||||||||||.|.||||||||||||||||||||.||||||.|.|||.||||||||..| 68 TAAcGTTAAATTTAtGAaCgTTAAAAcGAgTACTATAATAAGTGTGAAATTTGACCGTATTTACTtGTAAaTTGTAATTTCTCTtATTTTAATtATATAT
647
648 TTCTAATTAGGTATGTTATAAACCTAAAAaTTTtA-AAAAGAaTATTATtGTtATATG-CTAATTATCTACTTTA--------------AATGGAAAGAG ||.|..|||..||..|||||||||||||||||||| |||||||||||.|||||||||| ||||||.||||||.|| ||||||||||| 168 TTTTTTTtAATTAAATTATAAACcTAAAAaTTTtAGAAAAGAaTATTTTtGTtATATGCCTAATTGTCTACTATAAATGATCTAGTATGAATGGAAAGAG
731
732 TACGCCCAATTTCAAGGCAGGAATCGAATTCTAGAaTAAAAaGCAGGACCGCTGCTAACGGCGACAACCGAAGAGTCACTAATAAATGAAGGTAATAAAT |||||||||||||||||||||||||.||||||||||||||||||||||||.||||||||.||||||||.||||||||||||||||||||||||||||||| 268 TACGCCCAATTTCAAGGCAGGAATCAAATTCTAGAATAAAAAGCAGGACCACTGCTAACAGCGACAACGGAAGAGTCACTAATAAATGAAGGTAATAAAT
831
832 TAATTAGTACAATAGACATCAGCGAGGATAACGAATTTAATGTCGTTATAATATCGAAATATGTGTAATTAAGGATACaGTTAAATGCAATTA---CATA |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||.||||||||||||||||||||||||| |||| 368 TAATTAGTACAATAGACATCAGCGAggATAACGAATTTAATGTCGTTATAATATCGAAATATGTGTAGTTAAGGATACAGTTAAATGCAATTACGTCATA
928
929 ATAATTAAAATTTAACGCTTAACTTCAATTTtCGAAACATTCAACTCACGGAGCAGATTTTAAGCATTCGTCAGTTCCA-TCTTTCACATATTCTGAATA |||||||||||||||||||||||||||||||||||||||||||||||||.||||||||||||||||||||||||||.|| |||||||||||||||||||| 468 ATAATTAAAATTTAACGCTTAACTTCAATTTTCGAAACATTCAACTCACCGAGCAGATTTTAAGCATTCGTCAGTTACATTCTTTCACATATTCTGAATA
1027
1028 TGATTTAACTTATTACAGATTTTCAAATAAGAAAAGGACGGTGCTTTGAAATATAaCTAACATGTTtATAgTTTtAGGTGCATCCAAaGAAAaTAATTCC ||||||||.||||||||||..||.||||||||||||.|||||||.|.|||||||||| |||| 568 TGATTtAATTTATTACAGACATTTAAATAaGAAAaGAACGGTGCCTCGAAATATAAC-----------------------------------ATAA----
1127
1128 ATCAAAcAGCtCTGTAACAAAAGCTCAAGCTACGAAAATTAATtCCAATAGAATATAtGCATAGgTTtAAATTATATATtGATGGTTAAGGATATCATTG ||||.||.|||||...|.||||||||||||.|||||..|||||||||||.|.|||||||||||||||||||||| 629 --------------------------AAGCCACTAAAaTGTcTCCCAATAGAATATCTGCATCAGTTTAAATtATTTTTTGATGGTTAaGGATATCATTG
1227
1228 TGACGTTATTTGTATACaTTTtAAAAAaTTCGagAaGaTATCTGCCTAGCtCGTGCGTAaCATTCCAaCACACGACAAATATTCATATAGGCTTTACAAA |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 703 TGACGTTATTTGTATACaTTTTAAAAAATTCGAGAaGATATCTGCCTAGCTCGTGCGTAACATTCCAACACACGACAAATATTCATATAGGCTTTACAAA
1327
1328 TaTTTGTCTGGCGTTTGTGACTGTGTtGTGTGTACTTCAAGAGTTCTGgAAATCtGGAACAGgATTtACTTTATaTTTGGgACCGTTGACTGTgAAaGGT |||||||.|||||||||||||||||||||||||||||||||||| ||||.||||||||||||||||||||||||||||||||||||||||||| 803 TATTTGTTTGGCGTTTGTGACTGTGTTGTGTGTACTTCAAGAGT--------TCTGAAACAGGATTTACTTTATATTTGGGACCGTTGACTGTGAAAGGT
1427
1428 TTATTTACAAACTCGCAGTGAAACA----------TCGTtGAACGGGAAAaGGgTAAGAACCCTTCTtCAAAtttttCATTCttAtttttctatCTTaaa ||||||||||||||||||||||||| |.||| ||||||||||.|||||||.|.|||||||||||||||||||.|| 895 TTATTTACAAACTCGCAGTGAAACATTCCCCAAATTTGTT-----------------GAaCCCTTCTCCAAATTTGTTATTCTTATTTTTCTATCTTCAA
1517
1518 aCAATTAATAATAGA------------------------------------------------------------------------------------|||.|||||||||.| 978 aCATTTAaTAATAAACAAGTATATAGCAAAGGAAAATAATTTAAAAGTTCAAAatGGGACTtAAATTACCAGATCTGCATTATTTATAATCTAGCTTGTT
1532
End positions of the contig2 rows in the twelve blocks above: 67, 167, 267, 367, 467, 567, 628, 702, 802, 894, 977, 1077.
Supplementary Figure 6. Alignment of the 3’ end of contig1 and 5’ end of contig2 coding for the gene cytochrome P450 cyp337. The alignment shows indel polymorphism, which prevented assembly of the two contigs.
[Plots: a) Superscaffolds; b) Superscaffolds + Unplaced scaffolds. In both panels the y-axis is Proportion of assembly (%) (0 to 100) and the x-axis is Minimum scaffold length (log scale, 1000 to 1e+07), with N50 and N90 marked.]
Supplementary Figure 7. Cumulative a) superscaffold length and b) the total length of the superscaffolds and unplaced scaffolds. For each minimum scaffold length, the proportion of the assembly covered by scaffolds that are longer than the minimum length is shown. N50 (green) and N90 (red) values can be read from the figure.
[Dot plots: four panels with Melitaea cinxia on the y-axis and Heliconius melpomene, Plutella xylostella, Helicoverpa armigera or Bombyx mori on the x-axis; all axes run from 0 to 16000.]
Supplementary Figure 8. Alignment of the mitochondrial sequence of Melitaea cinxia against the mitochondrial sequences of Heliconius melpomene, Plutella xylostella, Bombyx mori, and Helicoverpa armigera. The red segments show forward and the blue ones reverse alignments. Two red lines in the same figure indicate different cutting points of the circular mtDNA sequence between the two species.
[Bar chart: Proportion of transcripts (y-axis, 0% to 100%) for three settings: Unordered, Unlimited gap, Max gap 5000. Legend: Proportion (%) of transcript aligned against scaffolds, in the classes Unaligned, 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-100.]
Supplementary Figure 9. The fraction of transcript contigs covered by the final scaffolds. The leftmost bar shows how much of the sequence of the transcript contigs is present in the genome when the ordering of the matches is not considered. The middle bar shows how well the transcript contigs can be aligned against the genome assembly when the ordering of the matched parts is taken into account. The rightmost bar shows the results when we further restrict the gaps in the scaffolds between alignments to at most 5,000 bp, representing the longest probable intron size or missing sequence.
[Dot plot axes: Chromosomes of Melitaea cinxia; Chromosomes of Bombyx mori.]
Supplementary Figure 10. Sequence-level synteny between Melitaea cinxia and Bombyx mori. Visualization of all genomic positions in which pairwise alignments between scaffolds from M. cinxia and B. mori share at least 200 bp. The figure is filtered to contain only the “best” hits onto the B. mori reference. The red and blue dots represent the forward and reverse complemented hits, respectively.
Supplementary Figure 11. Examples of secondary structures of microRNA precursors predicted from the Melitaea cinxia genome. The color-coding represents thermodynamically likely base-pair probabilities as indicated in the figure.
[Panels a) and b): dot plots with Melitaea cinxia on the y-axis (0 to 8000) and a) Attacus ricini (0 to 12000) or b) Papilio xuthus (0 to 7000) on the x-axis. Panel c): schematic of the rRNA genes.]
Supplementary Figure 12. rDNA sequence alignments of Melitaea cinxia against a) Attacus ricini and b) Papilio xuthus sequences, and c) a schematic representation of 18S, 5.8S, and 28S rRNA genes of Melitaea cinxia. In figures a) and b) the red segments show forward and the blue ones reverse alignments. Figure c) shows the pairwise alignment of scaffold34886 containing rRNA genes from M. cinxia against GenBank entry AF463459.1 containing an A. ricini rDNA repeat unit. Melitaea cinxia rRNA sequences are shown in gray and the corresponding rRNAs from A. ricini are shown in blue. Internal transcribed spacers (ITS) are indicated by thin lines. The identities of the rRNA and the spacer regions are indicated below. ETS1 refers to the external transcribed spacer. The continuous top gray bar shows the entire M. cinxia rDNA scaffold34886. Mismatches between the two species are indicated by red and insertions by blue bars in the M. cinxia sequence.
Supplementary Figure 13. The mitochondrial genome of Melitaea cinxia. Protein-coding genes (green) are denoted as COI, COII and COIII for subunits 1-3 of cytochrome c oxidase, cytB for the cytochrome b gene, ND1, 2, 3, 4, 4L, 5 and 6 for subunits 1-6 of the NADH dehydrogenase system, and ATP6 and ATP8 for subunits 6 and 8 of ATP synthase. tRNA (pink) nomenclature follows the standard three letter IUPAC amino acid code. rRNAs (red) are denoted 12S (small subunit rRNA) and 16S (large subunit rRNA). The AT-rich control region (grey) is shown on the top. The direction of transcription for each coding region is depicted as an arrow.
[Flowchart.
A: Uncharacterized sequences; similarity search against sequence database(s); descriptions in sequence database(s); scoring of results by identity percentages, coverage percentages and taxonomic distances (regression model); Protein Naming Utility for alternative descriptions and corrections; clustering of results by TF-IDF and cosine similarity (word frequency, description frequency); sorting of clusters; cluster representative selection using the highest scoring description, with the result printed together with the GO and EC classes associated with cluster members; FUNCTIONAL ANNOTATION.
B: KEGG orthology and pathway mapping (KAAS server); protein signatures, e.g. domain annotations (InterProScan); transmembrane regions (TMHMM); secreted peptides (SignalP).]
Supplementary Figure 14. Flowchart of the functional annotation procedure. Part (A) illustrates the PANNZER workflow4, while part (B) shows the other procedures used for functional annotation.
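The clustering step in part (A) compares textual protein descriptions by TF-IDF weighting and cosine similarity. The following Python sketch shows the general idea only; it is not the PANNZER implementation, the function names are hypothetical, and the particular weighting (term frequency times log(N/df) + 1) is an assumption chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(descriptions):
    """Turn each description into a sparse {word: weight} vector.
    Weight = term frequency * (log(N / document frequency) + 1)."""
    docs = [Counter(d.lower().split()) for d in descriptions]
    n_docs = len(docs)
    # Document frequency: in how many descriptions each word occurs.
    df = Counter(word for doc in docs for word in doc)
    vectors = []
    for doc in docs:
        n_words = sum(doc.values())
        vectors.append({
            word: (count / n_words) * (math.log(n_docs / df[word]) + 1.0)
            for word, count in doc.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy descriptions, made up for illustration.
vecs = tfidf_vectors(["cytochrome P450", "cytochrome P450 family", "heat shock protein"])
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Descriptions that share informative words receive a similarity close to 1 and can be placed in the same cluster, whereas descriptions without shared words receive a similarity of 0.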
[Bar chart: Number of gene models (y-axis, 0 to 18000) for the annotation types DE, GO, EC, InterPro and KEGG. Legend: No functional annotation, Both, Transcripts, Gene models.]
Supplementary Figure 15. Summary of the number of gene models with functional annotation for Melitaea cinxia. DE refers to protein descriptions, GO to Gene Ontology classes and EC to Enzyme Commission numbers. InterPro shows the predicted protein signatures, and KEGG includes KEGG orthology (KO) and pathway mapping. MAKER refers to predicted gene models and Transcripts to assembled transcript data.
[Plot: Number of gene models (y-axis, 0 to 60) against Protein sequence length (x-axis, 1 to 1340). Legend: No functional annotation, Functional annotation.]
Supplementary Figure 16. The length of the functionally annotated proteins and proteins without functional annotation.
OrthoDB: triangulated clusters
EPT: species clashes disallowed
Supplementary Figure 17. Schematic illustration of the principles used by ortholog clustering algorithms. Nodes represent proteins, species are indicated by color, and line width indicates the strength of similarity. TOP: OrthoDB clusters built by triangulation. The basic unit is a triangle formed by reciprocal best hits from three species. Triangles that share an edge are merged. BOTTOM: EPT clusters built hierarchically. In-paralogs are merged within a cluster and out-paralogs are excluded. In this example, triangulation generates one cluster while EPT generates two clusters due to the out-paralog exclusion rule. The second red protein is an out-paralog because its similarity to the multispecies cluster is lower than that of another red protein which is already a member of the cluster.
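The triangulation principle described for the OrthoDB clusters can be sketched in Python as follows. This is a simplified illustration and not the OrthoDB code: the function name, the input format (a list of reciprocal best hit pairs and a protein-to-species mapping) and the toy protein identifiers are assumptions, and similarity strengths are ignored.

```python
from itertools import combinations


def triangulated_clusters(rbh_pairs, species_of):
    """Build clusters from triangles of reciprocal best hits (RBH)
    between three species, merging triangles that share an edge."""
    neighbours = {}
    for a, b in rbh_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # A triangle is three proteins from three species, all pairwise RBH.
    triangles = set()
    for a, b in rbh_pairs:
        for c in neighbours[a] & neighbours[b]:
            if len({species_of[a], species_of[b], species_of[c]}) == 3:
                triangles.add(frozenset((a, b, c)))
    triangles = sorted(triangles, key=sorted)

    # Union-find over triangles; triangles sharing an edge are merged.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    by_edge = {}
    for idx, tri in enumerate(triangles):
        for edge in combinations(sorted(tri), 2):
            by_edge.setdefault(edge, []).append(idx)
    for members in by_edge.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = {}
    for idx, tri in enumerate(triangles):
        clusters.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(c) for c in clusters.values())


# Toy input: proteins a1, b1, c1, d1 come from four hypothetical species.
species = {"a1": "A", "b1": "B", "c1": "C", "d1": "D"}
pairs = [("a1", "b1"), ("b1", "c1"), ("a1", "c1"), ("b1", "d1"), ("c1", "d1")]
print(triangulated_clusters(pairs, species))
```

In the toy input the two triangles share the edge between b1 and c1, so they are merged into a single cluster of four proteins. The hierarchical EPT procedure with its out-paralog exclusion rule is not covered by this sketch.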
Supplementary Figure 18. Cladogram of representative species used for analysis of one-to-one orthologs. Bootstrap values above 50% are presented. Species are from the following taxa: FELCA and RATNO are mammalian outgroups; IXOSC (deer tick) and DAPPU (waterflea) are arthropod outgroups, and the rest are Hexapoda (insects). Lepidoptera include MELCI, HELME, DANPL, BOMMO and PLUXY. Ants (HARSA, SOLIN), bees (APIME) and wasps (NASVI) are Hymenoptera. Diptera include mosquitoes (AEDAE, ANOGA, CULQU) and fruit flies (DROME, DROMO, DROSI). Pea aphid (ACYPI), red flour beetle (TRICA) and head louse (PEDHU) represent other insects. Species codes are as used in SwissProt: FELCA = Felis catus; RATNO = Rattus norvegicus; IXOSC = Ixodes scapularis; DAPPU = Daphnia pulex; MELCI = Melitaea cinxia; HELME = Heliconius melpomene; DANPL = Danaus plexippus; BOMMO = Bombyx mori; PLUXY = Plutella xylostella; HARSA = Harpegnathos saltator; SOLIN = Solenopsis invicta; APIME = Apis mellifera; NASVI = Nasonia vitripennis; AEDAE = Aedes aegypti; ANOGA = Anopheles gambiae; CULQU = Culex quinquefasciatus; DROME = Drosophila melanogaster; DROMO = Drosophila mojavensis; DROSI = Drosophila simulans; ACYPI = Acyrthosiphon pisum; TRICA = Tribolium castaneum; PEDHU = Pediculus humanus.
[Heatmap: columns are species in the order RATNO, FELCA, ACYPI, PEDHU, APIME, NASVI, HARSA, SOLIN, TRICA, DROSI, DROME, DROMO, ANOGA, AEDAE, CULQU, MELCI, DANPL, HELME, BOMMO, PLUXY, IXOSC, DAPPU; row axis: Number of clusters (x 10^4); color scale: Number of paralogs per cluster.]
Supplementary Figure 19. Heatmap of orthologous groups in representative Arthropoda and mammalian outgroups. The figure represents a visualization of orthologous groups between 22 species. Rows show corresponding orthologous groups and columns are species. The darker color indicates in-paralogs. Species are in taxonomic order and the rows are ordered using hierarchical clustering5. FELCA (cat) and RATNO (rat) are mammalian outgroups. Species codes are as used in SwissProt: RATNO = Rattus norvegicus; FELCA = Felis catus; ACYPI = Acyrthosiphon pisum; PEDHU = Pediculus humanus; APIME = Apis mellifera; NASVI = Nasonia vitripennis; HARSA = Harpegnathos saltator; SOLIN = Solenopsis invicta; TRICA = Tribolium castaneum; DROSI = Drosophila simulans; DROME = Drosophila melanogaster; DROMO = Drosophila mojavensis; ANOGA = Anopheles gambiae; AEDAE = Aedes aegypti; CULQU = Culex quinquefasciatus; MELCI = Melitaea cinxia; DANPL = Danaus plexippus; HELME = Heliconius melpomene; BOMMO = Bombyx mori; PLUXY = Plutella xylostella; IXOSC = Ixodes scapularis; DAPPU = Daphnia pulex. The lepidopteran species are highlighted with a box.
[Histogram: Frequency (y-axis, 0 to 35) against Chromosomes (x-axis, 0 to 30).]
Supplementary Figure 20. The frequency distribution of manually annotated genes of Melitaea cinxia across chromosomes.
Supplementary Figure 21. ShxA (Special homeobox A) alignment. Dp: Danaus plexippus; Hm: Heliconius melpomene; Mc: Melitaea cinxia. Amino acids differing from the consensus are highlighted. All species have a short first exon and a long second exon containing the homeodomain (red bar). The exon boundaries are indicated by a black vertical bar. Mc/ShxA-2 has a 70 amino acid insertion relative to the other sequences in exon 2.
[Phylogeny with clade labels: bcd, zen, zen2, ShxA, lab, ShxB, ShxC, pb, ShxD, Antp, Dfd, abdA, Scr, Ubx, ftz, AbdB.]
Supplementary Figure 22. Maximum likelihood phylogeny of insect Hox homeodomain sequences. Homeodomains were excised from the Hox cluster genes of Apis mellifera (Am), Tribolium castaneum (Tc), Drosophila melanogaster (Dm), Bombyx mori (Bm), Danaus plexippus (Dp), Melitaea cinxia (Mc, sequence names enlarged), and Heliconius melpomene (Hm). For both B. mori and M. cinxia the ShxA paralogues are more closely related to one another than they are to any other species, suggesting independent duplication of ShxA in these species.
[Histogram: Count (y-axis, log scale, 1e+00 to 1e+06) against indel size (x-axis, -150 to 150).]
Supplementary Figure 23. Distribution of variant lengths in Melitaea cinxia in the Åland Islands population. Negative lengths are deletions and positive lengths are insertions. Zero length indicates a SNP (gray bar).
[Boxplots: a) SNP density / kb and b) Indel density / kb, each for the categories Genic, Coding and Intronic.]
Supplementary Figure 24. Boxplot for a) SNP and b) indel density in all (16,667) gene models of Melitaea cinxia in the Åland Islands population. Genic refers to the region from 5’ UTR to 3’ UTR. The box shows the interquartile range, while the band in the box is the median. The whiskers extend to data points that are within 1.5 times the interquartile range. Dots are outliers.
[Plots: a) LD (r2) and b) LD (D’), each on a y-axis from 0 to 1, against distance (in bp) (x-axis, 0 to 3000). Legend: a) r2, fitted E(r2), moving average; b) D’, fitted E(D’), moving average.]
Supplementary Figure 25. Linkage disequilibrium described by a) r2 and b) D’ as a function of the physical distance in the Åland Islands population. Curves for the expected values of r2 and D’ were fitted to the data as described in Marroni et al.6. For comparison, a moving average was calculated for windows of 200 points.
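For reference, the two linkage disequilibrium measures plotted here follow the standard definitions for a pair of biallelic loci. The Python sketch below computes them from haplotype counts; the function name and the assumption that haplotype counts are directly available are illustrative only, and the fitting of the expected curves is not reproduced.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (r2, D') for two biallelic loci from the counts of the
    four haplotypes AB, Ab, aB and ab."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n   # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / n   # frequency of allele B at the second locus

    d = p_AB - p_A * p_B      # coefficient of linkage disequilibrium
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    r2 = d * d / denom

    # D' scales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return r2, abs(d) / d_max


# Toy counts: complete association, then no association.
print(ld_stats(50, 0, 0, 50))
print(ld_stats(25, 25, 25, 25))
```

Both measures lie between 0 and 1; r2 reaches 1 only when two of the four haplotypes are absent, whereas D’ reaches 1 as soon as one haplotype is absent.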
[Density plot: Density (y-axis, 0.0 to 0.3) against GC% (x-axis, 25 to 45) for Melitaea cinxia and Bombyx mori.]
Supplementary Figure 26. GC content within 100 kb sliding windows and 10 kb shift in Melitaea cinxia and Bombyx mori chromosomes.
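The windowing scheme named in this caption (100 kb windows shifted by 10 kb) can be illustrated with a short Python sketch. The function name is hypothetical, and the treatment of ambiguous bases (ignored in the denominator, with windows lacking any A, C, G or T skipped) is an assumption rather than the documented procedure.

```python
def gc_windows(sequence, window=100_000, step=10_000):
    """Return (start, GC%) for each full sliding window over `sequence`.
    Only A, C, G and T are counted; other symbols such as N are ignored."""
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        acgt = sum(chunk.count(base) for base in "ACGT")
        if acgt == 0:
            continue  # window consists only of ambiguous bases
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / acgt))
    return results


# Toy sequence with a small window so that the result is easy to follow.
print(gc_windows("GGCCAATT", window=4, step=2))
```

Because the shift is one tenth of the window length, consecutive windows overlap by 90% and neighbouring values are strongly correlated.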
[Boxplots: GC content (%) (y-axis) per chromosome (x-axis: Chromosomes) for a) chromosomes 1 to 31 and b) chromosomes 1 to 28.]
Supplementary Figure 27. Boxplot for GC content within 100 kb sliding windows and 10 kb shift in a) the 31 chromosomes of Melitaea cinxia and b) the 28 chromosomes of Bombyx mori. Red lines show the mean GC content of M. cinxia (30.7%) and B. mori (35.4%) across the sliding windows. The number of sliding windows in the chromosomes varies from 11 to 681. The box shows the interquartile range, while the band in the box is the median. The whiskers extend to data points that are within 1.5 times the interquartile range. Dots are outliers.
[Boxplots: Gene density (y-axis) per chromosome (x-axis: Chromosomes) for a) chromosomes 1 to 31 and b) chromosomes 1 to 28.]
Supplementary Figure 28. Boxplot for gene density within 100 kb sliding windows and 10 kb shift in a) the 31 chromosomes of Melitaea cinxia and b) the 28 chromosomes of Bombyx mori. Red lines show the median gene density across the sliding windows. The number of sliding windows in the chromosomes varies from 11 to 681. The box shows the interquartile range, while the band in the box is the median. The whiskers extend to data points that are within 1.5 times the interquartile range. Dots are outliers.
[Boxplots: Repeat content (%) (y-axis) per chromosome (x-axis: Chromosomes) for a) chromosomes 1 to 31 and b) chromosomes 1 to 28.]
Supplementary Figure 29. Boxplot for the proportion of repeats within 100 kb sliding windows and 10 kb shift in a) the 31 chromosomes of Melitaea cinxia and b) the 28 chromosomes of Bombyx mori. Black lines show the mean repeat contents across the sliding windows for the superscaffolds of M. cinxia (22.6%) and for the whole genome of B. mori (38.8%). The number of sliding windows in the chromosomes varies from 11 to 681. The box shows the interquartile range, while the band in the box is the median. The whiskers extend to data points that are within 1.5 times the interquartile range. Dots are outliers.
[Plots: a) GC %, b) Gene density and c) Repeat % (y-axes) against Position (x-axis, 0 to 7500000).]
Supplementary Figure 30. a) GC content (%), b) gene density and c) repeat content (%) within 100 kb sliding windows and 10 kb shift in Melitaea cinxia chromosome 1 (Z). GC distribution is strikingly even across the chromosome. The gene and repeat densities fluctuate more but are not notably lower or higher in any chromosome segment.
[Plots: a) GC %, b) Gene density and c) Repeat % (y-axes) against Position (x-axis, 0e+00 to 8e+06).]
Supplementary Figure 31. a) GC content (%), b) gene density and c) repeat content (%) within 100 kb sliding windows and 10 kb shift in Melitaea cinxia chromosome 5. GC distribution varies very little across the chromosome. Genes and repeats appear at higher densities, but there is no clear pattern in the variation.
Libytheana carinenta Danaus gilippus Danaus plexippus Lycorea halia Lycorea ilione Anetia briarea Anetia pantheratus Tellervo zoilus Melinaea menophilus Athyrtis mechanitis Patricia dercillidas Athesis clearista Thyridia psidii Sais rosalia Scada reckia Forbestra equicola Mechanitis lysimnia Mechanitis polymnia Methona themisto Aeria eurimedia Tithorea harmonia Elzunia pavoni Hyposcada illinissa Hyposcada anchiala Hyposcada virginiana Megoleria orestilla Ollantaya aegineta Ollantaya canilla Oleria aquata Oleria didymaea Oleria onega Oleria alexina Oleria ilerdina Oleria quintina Oleria aegle Oleria gunilla Oleria estella Oleria zelica Oleria amalda Oleria paula Oleria fasciata Oleria athalina Oleria victorine Oleria phenomoe Oleria cyrene Oleria makrena Oleria padilla Callithomia lenea Dircenna dero Pteronymia teresita Ceratinia neso Episcada hymenaea Velamysta pupilla Godyris duillia Hypoleria lavinia Heterosais guilia Greta polissena Pseudoscada timna Greta oto Greta diaphanus Epityches eupompe Napeogenes pharo Napeogenes cranto Hypothyris cantobrica Hypothyris ninonia Hypothyris daphnis Hyaliris antea Placidina euryanassa Pagyris cymothoe Ithomia terra Ithomia jucunda Ithomia agnosia Ithomia lichyi Ithomia drymo Ithomia lagusa Ithomia hyala Ithomia ellara Ithomia celemia Ithomia patilla Ithomia diasia Ithomia iphianassa Ithomia salapia Archaeoprepona amphimachus Archaeoprepona demophon Prepona hewitsonius Prepona laertes Hypna clytemnestra Siderone galanthis Zaretis itys Consul fabius Memphis glauce Memphis appias Memphis anna Memphis offa Memphis moruus Memphis acidalia Memphis leonida Anaea troglodyta Fountainea halice Fountainea nobilis Polygrapha tyrianthina Fountainea glycerium Fountainea eurypyle Fountainea ryphea Manataria hercyna Haetera piera Cithaerias aurora Cithaerias pireta Pierella lena Pierella lamia Pierella luna Antirrhea philoctetes Caerois chorinaeus Morpho cisseis Morpho hecuba Morpho anaxibia Morpho amathonte Morpho menelaus Morpho 
epistrophus Morpho achilles Morpho helenor Dynastor darius Narope cyllabarus Opoptera aorsa Opoptera syme Eryphanis automedon Caligo illioneus Caligo teucer Caligo atreus Caligo euphorbus Caligo idomeneus Brassolis sophorae Catoblepia amphiroe Selenophanes cassiope Penetes pamphanis Blepolenis batea Opsiphanes cassiae Opsiphanes tamarindi Opsiphanes boisduvalii Opsiphanes invirae Opsiphanes quiteria Oressinoma typhla Bicyclus anynana Steremnia umbracina Steroma modesta Etcheverrius chiliensis Chillanella stelligera Auca coctei Pedaliodes phrasiclea Praepedaliodes phanias Corades enyo Oxeoschistus leucospilos Pronophila thelebe Moneuptychia soter Pharneuptychia sp. Cissia penelope Yphthimoides borasta Moneuptychia paeon Hermeuptychia hermes Praefaunula armilla Godartiana muscosa Amphidecta calliomma Amphidecta reynoldsi Archeuptychia cluena Chloreuptychia arnaca Erichthodes antonina Pareuptychia ocirrhoe Paryphthimoides poltys Magneuptychia moderata Posttaygetis penelea Harjesia blanda Taygetis ypthima Pseudodebis valentina Taygetomorpha celia Taygetis kerea Taygetis mermeria Taygetis larua Taygetis tripunctata Taygetis virgilia Taygetis leuctra Taygetis laches Taygetis echo Taygetis thamyra Taygetis sosis Taygetis cleopatra Adelpha alala Adelpha cytherea Adelpha thessalia Adelpha saundersii Adelpha malea Adelpha cocala Adelpha justina Adelpha mesentina Adelpha lycorias Adelpha epione Euptoieta hegesia Yramea cytheris Yramea lathonoides Actinote alcione Actinote parapheles Actinote pellenea Actinote melanisans Actinote carycina Actinote pyrrha Philaethria wernickei Dryadula phaetusa Podotricha telesiphe Dryas iulia Agraulis vanillae Dione moneta Dione glycera Dione juno Eueides isabella Eueides procula Heliconius telesiphe Heliconius erato Heliconius charithonia Heliconius sara Heliconius antiochus Heliconius congener Heliconius aoede Heliconius ethilla Heliconius hecale Heliconius pardalinus Heliconius xanthocles Heliconius doris Heliconius egeria Heliconius 
burneyi Heliconius wallacei Asterocampa leilia Doxocopa laure Marpesia zerynthia Baeotus deucalion Historis acheronta Smyrna blomfildia Tigridia acesta Colobura dirce Vanessa carye Vanessa virginiensis Hypanartia bella Hypanartia lethe Hypanartia dione Hypanartia kefersteini Siproeta epaphus Siproeta stelenes Metamorpha elissa Anartia jatrophae Anartia amathea Anartia fatima Junonia evarete Junonia coenia Junonia genoveva Melitaea cinxia Chlosyne lacinia Chlosyne hippodrome Chlosyne janais Chlosyne gaudialis Chlosyne narva Anthanassa drusilla Castilia eranites Ortilia ithra Telenassa teletusa Eresia lansdorfi Eresia emerantia Eresia datis Mestra hypermestra Vila azeka Biblis hyperia Catonephele antinoe Catonephele numilia Nessaea obrinus Catonephele nyctimus Myscelia capenas Pyrrhogyra edocla Epiphile huebneri Epiphile orea Peria lamis Nica flavilla Asterope markii Temenis laothoe Haematera pyrame Callicore tolima Callicore hydaspes Diaethria candrena Diaethria clymena Mesotaenia vaninka Perisama humboldti Perisama oppeli Perisama bomplandii Perisama moronina Eunica eurota Eunica bechina Eunica orphise Eunica tatila Eunica malvina Eunica cuvieri Eunica monima Dynamine myrrhina Dynamine tithia Dynamine coenus Dynamine athemon Dynamine mylitta Cybdelis phaisile Panacea regina Ectima liriope Hamadryas atlantis Hamadryas feronia Hamadryas guatemalena Hamadryas glauconome Hamadryas laodamia Hamadryas amphinome Hamadryas arinome
[Legend: chromosome categories from the parsimony reconstruction (unordered; 138 steps): 6-8, 9-14, 15-19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33-41, 42-78.]
Supplementary Figure 32. Haploid chromosome number mapped onto a phylogenetic hypothesis of Nymphalidae. Haploid chromosome numbers were treated as discrete character states, which were mapped onto the phylogeny using the principle of parsimony. Character state “31” is shown to be the most likely ancestral state for the family. The arrow indicates Melitaea cinxia.
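Parsimony mapping of an unordered discrete character can be illustrated with the bottom-up pass of the Fitch algorithm. The Python sketch below is a generic textbook version and is not claimed to be the software or settings used for this figure; the function name, the nested-tuple tree format, the tip names and the toy character states are all hypothetical.

```python
def fitch(tree, states):
    """Bottom-up Fitch pass on a rooted binary tree.

    `tree` is either a tip name (str) or a pair (left, right) of subtrees;
    `states` maps each tip name to its observed character state.
    Returns (set of most parsimonious states at this node,
             minimum number of state changes in the subtree)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_steps = fitch(left, states)
    right_states, right_steps = fitch(right, states)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_steps + right_steps
    # Otherwise one change is required on one of the two branches.
    return left_states | right_states, left_steps + right_steps + 1


# Toy tree with made-up haploid chromosome numbers at the tips.
toy_tree = ((("tip1", "tip2"), "tip3"), ("tip4", "tip5"))
toy_states = {"tip1": 31, "tip2": 31, "tip3": 30, "tip4": 31, "tip5": 21}
print(fitch(toy_tree, toy_states))
```

In the toy example the root is assigned state 31 with two changes along the tree. In an unordered character, a change between any two states costs one step regardless of how different the numbers are.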
[Histograms: Frequency (y-axis) against Sequence identity (%) (x-axis, 0 to 60). Panel a) legend: Syntenic genes, Non-syntenic genes. Panel b) legend: High identity, Low identity.]
Supplementary Figure 33. Frequency distribution of minimum sequence identity among one-to-one orthologs of Melitaea cinxia, Bombyx mori and Heliconius melpomene. The identity distribution is shown in a) syntenic and non-syntenic genes, b) only in non-syntenic (translocated) genes classified according to their pairwise identities. Out of 182 translocated genes 37 (20%) had identity less than 20%.
[Phylogeny: Melitaea (n=31), Heliconius (n=21), Bombyx (n=28); node ages 71 to 86 My and 107 to 127 My.]
Supplementary Figure 34. The number of potential translocated genes (red) in the Melitaea, Heliconius and Bombyx phylogeny. n refers to the haploid chromosome number.
[Bar chart: Translocated genes (%) (y-axis, 0 to 6) per Chromosome (x-axis, 1 to 31). Chromosomes marked with * are 2, 4, 6, 9 to 15 and 22 to 31.]
Supplementary Figure 35. Distribution of 42 potential translocated genes in Melitaea cinxia chromosomes. The numbers are scaled based on the number of one-to-one orthologs between M. cinxia and B. mori. Chromosome 1 is the Z chromosome and * indicates fusion chromosomes in B. mori and H. melpomene.
Supplementary Figure 36. Chromosomes of Melitaea cinxia (left), Heliconius melpomene (middle) and Bombyx mori (right). Each box represents one superscaffold in M. cinxia or a scaffold in H. melpomene. Colors and small numbers above the boxes show orthologous M. cinxia chromosomes and chromosome numbers, and thus indicate fusion chromosomes and translocated sites in H. melpomene and B. mori genomes. Horizontal lines within boxes show corresponding loci in M. cinxia chromosomes, and red vertical lines indicate bin borders showing recombination sites in the linkage map.
Supplementary Figure 37. Alignment of Melitaea cinxia chromosomes 12 and 31, 14 and 30, and 27 and 29 against Bombyx mori fusion chromosomes 11, 23, and 24. Colored boxes show aligned regions between M. cinxia and B. mori. Upper and lower boxes in M. cinxia indicate forward and reverse alignments against B. mori, respectively. Orange lines denote bin boundaries and black vertical lines mark chromosome boundaries in M. cinxia.
37
Supplementary Tables Supplementary Table 1. DNA samples used in sequencing and sequencing library statistics. 1) used only in mitochondrial DNA assembly, 2) used in scaffold validation and superscaffolding, 3) used only in variation detection. Library
DNA sample
454 single IlluminaPE1 IlluminaPE2 IlluminaMP1 SOLiDMP1 SOLiDMP2 IlluminaMP2 IlluminaMP3 IlluminaMP4 2
SOLiDMP3 454MP1 454MP2 PacBio
3
SOLiD_ÅLpool
1 male, 1 1 female 1 male, 10 full-sib pool 10 full-sib pool 10 full-sib pool 1 male 1 male 10 full-sib pool 10 full-sib pool 10 full-sib pool 100 full-sib pool 100 full-sib pool 100 full-sib pool 100 full-sib pool 53 individuals in 4 pools
DNA isolation method II.2.2. II.2.2., II.2.3. II.2.3. II.2.3. II.2.2. II.2.2. II.2.3. II.2.3. II.2.3. II.2.4. II.2.4. II.2.4. II.2.4. II.2.3.
Average insert Average Number Number of size expected read length of raw filtered (observed) (bp) reads reads/readpairs (M) (M)
Number of Coverage mapped or mapped reads/readpairs reads (M)
single read
360
10
10
-
9.2
500 (460) bp
2 x 58
150
83
56
16.7
800 (710) bp 1 (1.0) kb 2 (1.9) kb 3 (2.7) kb 3 (2.3) kb 3 (3.1) kb 5 (3.1) kb 5 (4.7) kb 8 (6.5) kb 16 (17.0) kb single read
2 x 125 2 x 76 2 x 50 2 x 50 2 x 66 2 x 69 2 x 68 2 x 50 312 354 2,480
20 23 300 200 186 73 97 132 1.8 2.1 3.3
9.9 7.8 38.9 40.5 113.6 59.4 74.7 131 0.95 0.72 2.7
4.3 2.9 20.8 18.4 46.2 24.6 36.6 10.9 0.78 0.59 2
2.8 1.1 5.3 4.7 15.6 8.7 14.3 2.8 0.6 0.5 12.7
115 bp
-
636
-
338
-
Supplementary Table 2. RNA samples and library information. The average number of mappable reads is listed per individual, except for pooled samples, for which the total number of mappable reads is reported. n/a - not applicable. Experiment
Annotation1
Annotation2 Annotation3 Variation
Sample definition
Sample size: RNA-seq library males, females
Read length (bp)
Average number of mappable reads
abdomen pool 4: 2, 2 (PoolA) mixed tissue 57: n/a pool (PoolMix)
Full-length transcriptome PE Full-length transcriptome PE
2 x 76
54.8M
2 x 76
31.4M
mixed tissue 155: n/a pool 3 days old 49: 26, 23 adults 2-3 days old 40: 15, 25 adults
454 single read
110, 220
-
PolyA-anchored 75, 100 single read PolyA-anchored PE 2 x 101
5.0M 3.9M
Supplementary Table 3. Contig and scaffold statistics. Initial contigs refer to the contig assembly produced by Newbler. The final contigs were produced by scaffolding the initial contigs with MIP Scaffolder and closing gaps between adjacent contigs with SOAPdenovo GapCloser. Final scaffolds include a set of scaffolds with minimum length of 1,500 bp.
Assembly | Number of contigs/scaffolds | Max length (bp) | N50 (bp) | Total length (bp)
Initial contigs | 217,638 | 26,736 | 2,105 | 354,538,866
Final contigs | 49,851 | 144,962 | 13,489 | 360,975,554
Final scaffolds | 8,262 | 668,473 | 119,328 | 389,896,394
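The N50 values in Supplementary Tables 3 to 5 can be illustrated with a minimal sketch. It assumes the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length); the supplement does not spell out the definition it used, and the function name `n50` and the toy lengths are illustrative only, not project data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: walk the lengths from longest to shortest and
    return the length at which the running sum first reaches half of
    the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


# Toy example (hypothetical lengths, not taken from the assembly).
toy_lengths = [6, 5, 4, 3, 2]
print(n50(toy_lengths))  # 5: the two longest sequences (6 + 5) cover at least half of 20
```

Applied to the real contig or scaffold lengths, the same function would be expected to reproduce the N50 column, provided the same definition was used.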
Supplementary Table 4. The statistics of superscaffolds per chromosome.
Chromosome | Number of superscaffolds | N50 (bp) | Total length (bp)
1 (Z) | 74 | 338,757 | 14,178,551
2 | 63 | 338,247 | 13,061,208
3 | 58 | 322,278 | 11,714,550
4 | 50 | 406,822 | 12,875,956
5 | 48 | 388,358 | 11,529,948
6 | 59 | 287,105 | 12,012,768
7 | 46 | 461,311 | 11,220,220
8 | 49 | 346,751 | 10,737,528
9 | 47 | 369,359 | 10,754,370
10 | 56 | 396,269 | 11,891,256
11 | 64 | 296,143 | 11,117,473
12 | 45 | 400,393 | 10,573,299
13 | 41 | 550,370 | 10,139,467
14 | 52 | 239,652 | 9,704,362
15 | 51 | 275,318 | 9,849,234
16 | 51 | 269,742 | 9,945,888
17 | 39 | 447,995 | 10,102,199
18 | 46 | 285,373 | 9,814,747
19 | 40 | 330,724 | 8,116,537
20 | 37 | 400,666 | 9,187,434
21 | 44 | 362,865 | 8,449,847
22 | 47 | 415,447 | 8,522,659
23 | 47 | 296,019 | 8,539,619
24 | 33 | 333,881 | 7,263,460
25 | 41 | 308,233 | 7,055,637
26 | 38 | 355,976 | 6,128,182
27 | 33 | 247,940 | 5,385,080
28 | 36 | 167,516 | 4,132,698
29 | 47 | 114,306 | 3,014,943
30 | 30 | 184,294 | 3,241,965
31 | 41 | 108,363 | 2,242,263
all | 1,456 | 330,752 | 283,283,699
Supplementary Table 5. Statistics of superscaffolds and unplaced scaffolds. The unplaced scaffolds include scaffolds without chromosome assignment and chimeric scaffolds as indicated by the linkage map.
Set | Number of (super)scaffolds | N50 (bp) | Total length (bp)
Superscaffolds | 1,453 | 330,752 | 282,503,348
Unplaced scaffolds | 4,846 | 97,739 | 110,805,803
Superscaffolds and unplaced scaffolds | 6,299 | 258,308 | 393,309,151
Supplementary Table 6. Chromosome statistics based on the linkage map. The table lists the number of supporting markers and scaffolds within chromosomes, and chromosome lengths as the total length of scaffolds and as centiMorgans (cM). Only data from non-chimeric scaffolds are reported.
Chromosome | Supporting markers | Number of scaffolds | Length (Mb) | Length (cM)
1 (Z) | 1566 | 202 | 14 | 50
2 | 1419 | 130 | 11.6 | 75
3 | 1237 | 124 | 11.5 | 67
4 | 1203 | 125 | 11.4 | 50
5 | 1355 | 114 | 10.9 | 33
6 | 1212 | 121 | 10.6 | 58
7 | 1276 | 127 | 10.4 | 58
8 | 1094 | 108 | 10.4 | 33
9 | 1204 | 107 | 10.4 | 67
10 | 1039 | 127 | 10.1 | 25
11 | 1063 | 120 | 10.1 | 58
12 | 1104 | 112 | 9.8 | 67
13 | 958 | 109 | 9.6 | 50
14 | 998 | 112 | 9.6 | 50
15 | 1059 | 108 | 9.2 | 58
16 | 939 | 100 | 9 | 58
17 | 1000 | 93 | 9 | 58
18 | 951 | 98 | 8.9 | 58
19 | 868 | 100 | 8.4 | 42
20 | 951 | 87 | 8.2 | 67
21 | 892 | 100 | 8 | 67
22 | 642 | 91 | 7.8 | 67
23 | 736 | 105 | 7.8 | 58
24 | 673 | 74 | 6.4 | 67
25 | 635 | 87 | 6.3 | 50
26 | 679 | 73 | 6.2 | 58
27 | 455 | 63 | 5.4 | 25
28 | 408 | 70 | 3.9 | 50
29 | 434 | 81 | 3.2 | 50
30 | 346 | 59 | 3 | 42
31 | 327 | 78 | 2.3 | 25
total | 28,723 | 3,205 | 263.3 | 1641
Supplementary Table 7. Summary of genome assembly validation steps.
Validation method | Result
Estimating correctness of assembly by mapping PE and MP reads | 89-99% of mapped pairs are concordant with the genome
Estimating correctness of scaffold by rescaffolding the contigs using PacBio reads | 82-87% of contig joins are concordant with the scaffolds
Estimating completeness of genome by mapping transcripts | 80% of transcripts have an alignment that covers at least 80% of the transcript
Detecting non-chimeric scaffolds with a linkage map | 91% of scaffolds are non-chimeric
Estimating completeness of genome by identifying conserved core genes | 77% (84%) of core genes have a complete (partial) match
Estimating completeness of orthologous regions by aligning scaffolds with other butterfly genomes | 17-19% of the bases in M. cinxia genome can be aligned with other butterfly genomes
Estimating correctness of superscaffolds by comparing gene order against B. mori | 90% of scaffold joins are concordant with gene order of B. mori
Detecting non-chimeric superscaffolds with an independent linkage map | 97.6% of superscaffolds are non-chimeric
Supplementary Table 8. Contig validation based on mapping PE and MP libraries. The libraries are described in Supplementary Table 1.
Library | Correctness estimate (%)
IlluminaPE1 | 88.7
IlluminaPE2 | 99.1
IlluminaMP1 | 98.2
IlluminaMP2 | 93.8
IlluminaMP3 | 93.2
IlluminaMP4 | 93.7
SOLiDMP1 | 95.8
SOLiDMP2 | 94.5
SOLiDMP3 | 96.7
454PE1 | 90.6
Supplementary Table 9. Genome assembly validation statistics deduced from the linkage map.
Scaffold class | Number of scaffolds | Total length (Mb) | % of genome assembly length
non-chimeric | 3205 | 263.3 | 67.5
chimeric | 302 | 54.2 | 13.9
no clear assignment | 1090 | 45.6 | 11.7
no markers | 3665 | 26.8 | 6.9
Supplementary Table 10. Completeness of the genome assemblies of five lepidopteran species assessed using the set of conserved core (CEGMA) genes.
Match | Melitaea cinxia v.1.0 | Plutella xylostella v.1.0 | Danaus plexippus v.1.0 | Danaus plexippus v.3.0 | Heliconius melpomene v.1.1 | Bombyx mori v.2.0
Complete | 77.0 % | 82.3 % | 87.1 % | 89.1 % | 81.9 % | 82.7 %
Partial | 83.9 % | 86.4 % | 89.5 % | 90.7 % | 85.9 % | 86.3 %
Supplementary Table 11. Links to published genome sequences used for comparative analysis.
Species (version) | Reference | Link to the genome (accession date)
Bombyx mori (v. 2.3) | 7 | http://sgp.dna.affrc.go.jp/data/scaffold.txt.gz (May 16 2013)
Danaus plexippus (v. 3.0) | 8-9 | http://monarchbase.umassmed.edu/download/Dp_genome_v3.fasta.gz (May 16 2013)
Heliconius melpomene (v. 1.1) | 10 | http://www.butterflygenome.org/sites/default/files/Hmel11_Release_20120601.tgz (May 16 2013)
Plutella xylostella (v. 1.1) | 11 | ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates/Plutella_xylostella/DBM_FJ_V1.1/Primary_Assembly/unplaced_scaffolds/FASTA/unplaced.scaf.fa.gz (May 17 2013)
Tribolium castaneum (v. 3.0) | 12-13 | ftp://ftp.bioinformatics.ksu.edu/pub/BeetleBase/3.0/Tribolium_genome_sequence.fasta (May 16 2013)
Supplementary Table 12. Percentage of bases contained within a pairwise alignment as reported by the MUMmer tools. A higher number suggests a more similar pair of scaffolds, reflecting a larger fraction of aligned regions. Note that the matrix values are not symmetric. The most similar pair according to this comparison is M. cinxia aligned against H. melpomene. P. xylostella and T. castaneum have the most dissimilar scaffolds.
 | Melitaea cinxia | Heliconius melpomene | Danaus plexippus | Bombyx mori | Plutella xylostella | Tribolium castaneum
Melitaea cinxia | 100 | 19.11 | 17.26 | 17.38 | 11.46 | 8.02
Heliconius melpomene | 25.31 | 100 | 20.1 | 18.05 | 13.36 | 9.23
Danaus plexippus | 23.59 | 20.64 | 100 | 17.81 | 12.88 | 8.81
Bombyx mori | 13.6 | 10.26 | 9.86 | 100 | 7.43 | 5
Plutella xylostella | 9.1 | 7.7 | 7.15 | 7.73 | 100 | 3.6
Tribolium castaneum | 12.04 | 10.17 | 9.7 | 10.48 | 7.87 | 100
Supplementary Table 13. Genome conservation distances reported by Mauve. A smaller number suggests a more similar pair of scaffolds. The matrix values are symmetric. The most similar pair of scaffolds is between M. cinxia and H. melpomene. Overall, the results are similar to those obtained with MUMmer.
 | Melitaea cinxia | Heliconius melpomene | Danaus plexippus | Bombyx mori | Plutella xylostella | Tribolium castaneum
Melitaea cinxia | 0 | 0.648 | 0.65 | 0.719 | 0.785 | 0.783
Heliconius melpomene | 0.648 | 0 | 0.662 | 0.734 | 0.795 | 0.791
Danaus plexippus | 0.65 | 0.662 | 0 | 0.731 | 0.793 | 0.79
Bombyx mori | 0.719 | 0.734 | 0.731 | 0 | 0.831 | 0.827
Plutella xylostella | 0.785 | 0.795 | 0.793 | 0.831 | 0 | 0.865
Tribolium castaneum | 0.783 | 0.791 | 0.79 | 0.827 | 0.865 | 0
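The statements in the caption of Supplementary Table 13 (symmetric matrix, most similar pair) can be checked directly from the tabulated distances. The following is a small sketch; the abbreviated species labels and variable names are chosen here for illustration and are not part of the Mauve output.

```python
# Mauve genome conservation distances copied from Supplementary Table 13.
species = ["M. cinxia", "H. melpomene", "D. plexippus",
           "B. mori", "P. xylostella", "T. castaneum"]
dist = [
    [0,     0.648, 0.65,  0.719, 0.785, 0.783],
    [0.648, 0,     0.662, 0.734, 0.795, 0.791],
    [0.65,  0.662, 0,     0.731, 0.793, 0.79],
    [0.719, 0.734, 0.731, 0,     0.831, 0.827],
    [0.785, 0.795, 0.793, 0.831, 0,     0.865],
    [0.783, 0.791, 0.79,  0.827, 0.865, 0],
]

n = len(species)

# The caption states that the matrix is symmetric.
is_symmetric = all(dist[i][j] == dist[j][i] for i in range(n) for j in range(n))

# Enumerate each unordered pair once and pick the smallest and largest distance.
pairs = [(dist[i][j], species[i], species[j])
         for i in range(n) for j in range(i + 1, n)]
closest = min(pairs)
farthest = max(pairs)

print(is_symmetric)  # True
print(closest)       # (0.648, 'M. cinxia', 'H. melpomene')
print(farthest)      # (0.865, 'P. xylostella', 'T. castaneum')
```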
Supplementary Table 14. Classification of links between scaffolds within superscaffolds based on synteny to Bombyx mori. The order of orthologs between M. cinxia and B. mori was compared within adjacent scaffolds, and the links between scaffolds were classified as agreeing with synteny, disagreeing with synteny, or unknown if synteny information was not available.
Chromosome | Agree with synteny | Disagree with synteny | Unknown
1 (Z) | 60 | 10 | 31
2 | 42 | 2 | 36
3 | 38 | 5 | 35
4 | 51 | 4 | 51
5 | 48 | 1 | 29
6 | 34 | 6 | 23
7 | 41 | 6 | 48
8 | 28 | 4 | 43
9 | 37 | 4 | 24
10 | 39 | 9 | 47
11 | 33 | 1 | 35
12 | 43 | 4 | 35
13 | 42 | 4 | 31
14 | 25 | 4 | 33
15 | 23 | 2 | 44
16 | 22 | 1 | 37
17 | 34 | 5 | 31
18 | 44 | 2 | 35
19 | 18 | 4 | 35
20 | 47 | 0 | 11
21 | 41 | 4 | 18
22 | 25 | 4 | 31
23 | 20 | 4 | 55
24 | 26 | 3 | 27
25 | 19 | 2 | 39
26 | 14 | 1 | 22
27 | 8 | 2 | 20
28 | 17 | 2 | 20
29 | 4 | 1 | 27
30 | 7 | 1 | 24
31 | 4 | 2 | 21
Total | 934 | 104 | 990
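The 90% concordance with B. mori gene order quoted in Supplementary Table 7 appears to follow from the totals of Supplementary Table 14 when links of unknown status are left out of the denominator. This is an assumption about how the figure was derived; the sketch below only shows that the arithmetic is consistent with it, and the function name is hypothetical.

```python
def synteny_concordance(agree, disagree):
    """Percentage of informative scaffold links that agree with synteny.

    Links classified as 'unknown' are excluded, which is an assumption
    about how the summary figure was computed.
    """
    return 100 * agree / (agree + disagree)


# 'Total' row of Supplementary Table 14.
total_agree, total_disagree = 934, 104
percent = synteny_concordance(total_agree, total_disagree)
print(round(percent))  # 90
```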
Supplementary Table 15. Classification and distribution of transposable elements and other repeats in the Melitaea cinxia genome. Note that some repeat elements are overlapping. Redundancy has been filtered from the total statistics.
Repeat class | Number of hits | Length occupied (bp) | Percentage (%)
DNA transposons
Total DNA transposons | 93,507 | 19,313,388 | 4.954
CMC | 218 | 75,380 | 0.019
Ginger | 171 | 67,051 | 0.017
Harbinger | 126 | 72,258 | 0.019
MULE | 6,064 | 1,418,378 | 0.364
P | 43 | 26,556 | 0.007
Sola | 5,698 | 959,915 | 0.246
TcMar | 5,868 | 2,112,494 | 0.542
Zator | 329 | 93,195 | 0.024
hAT | 5,604 | 923,252 | 0.237
Other | 11,805 | 2,051,127 | 0.526
RC/Helitron | 61,968 | 11,825,568 | 3.033
Maverick | 134 | 33,282 | 0.009
Non-LTR retrotransposons
Total LINEs | 183,104 | 32,633,079 | 8.370
LINE/CR1 | 105,787 | 16,255,273 | 4.169
LINE/L2 | 26,843 | 5,329,151 | 1.367
LINE/R1 | 44,255 | 6,826,910 | 1.751
LINE/other | 19,406 | 5,243,115 | 1.345
PLE/Penelope | 245 | 54,463 | 0.014
SINE | 252,386 | 42,842,863 | 10.988
LTR retrotransposons
Total LTR elements | 56,585 | 7,613,672 | 1.953
LTR/other | 11 | 5,988 | 0.002
LTR/Copia | 187 | 147,882 | 0.038
LTR/DIRS | 97 | 48,019 | 0.012
LTR/Gypsy | 49,977 | 6,155,508 | 1.579
LTR/Pao | 6,681 | 1,277,806 | 0.328
Other repeats
Satellites | 3,561 | 817,438 | 0.210
Simple repeats | 143,448 | 8,957,148 | 2.297
Other | 25,806 | 1,697,616 | 0.435
Unclassified | 14,869 | 4,487,527 | 1.151
Total repeats | 625,490 | 107,290,012 | 27.518
Supplementary Table 16. Comparison of transposable element contents among five lepidopteran species. The highest copy number is omitted for D. plexippus and P. xylostella due to a high percentage of unclassified TE elements.
 | Melitaea cinxia (%) | Heliconius melpomene (%) [ref. 14] | Danaus plexippus (%) [ref. 8] | Bombyx mori (%) [ref. 15] | Plutella xylostella (%) [ref. 11]
Proportion of genome
DNA transposons | 5.0 | 10.1 | 1 | 3 | 1.9
Non-LTR retrotransposons: LINEs | 8.4 | 3.9 | 2.4 | 13.8 | 5.2
Non-LTR retrotransposons: SINEs | 11.0 | 8.2 | 0.5 | 12.8 | 0.5
LTR retrotransposons | 2.0 | 0.5 | 0.2 | 1.7 | 2.5
Unclassified | 1.2 | 2.4 | 6.7 | 2.4 | 28.2
Total | 27.4 | 24.9 | 10.8 | 35.4 | 34
Highest copy number
DNA transposons | Helitron (3.0) | Helitron (5.4) | - | Tc1-mariner (2.7) | -
Retrotransposons | SINE/5S-Deu (5.1) | SINE/Metulj (8.2) | - | SINE/Bm1 (11.4) | -
Supplementary Table 17. Number of gene models supported by transcriptome data based on mapped RNA-seq reads and contigs.
Feature | Total number | Mapped RNA-seq reads | % | Mapped RNA-seq contigs | % | Mapped RNA-seq reads or contigs | %
genes | 16,667 | 15,268 | 91.6 | 11,737 | 70.4 | 15,941 | 95.6
coding exons | 96,875 | 79,365 | 81.9 | 47,121 | 48.6 | 82,334 | 85
5’UTR exons | 11,112 | 10,448 | 94 | 5,441 | 49 | 10,606 | 95.4
3’UTR exons | 7,089 | 6,591 | 93 | 5,158 | 72.8 | 7,037 | 99.3
Supplementary Table 18. Attributes of the genome of Melitaea cinxia and four other Lepidoptera.
Attribute | Melitaea cinxia (v1) | Heliconius melpomene (v1.1) [ref. 10] | Danaus plexippus (v3) [ref. 9] | Bombyx mori (v2) [ref. 7] | Plutella xylostella (v1) [ref. 11]
Number of chromosomes | 31 | 21 | 30 | 28 | 31
Assembly length in Mb (with Ns) | 393.3 | 273.8 | 248.6 | 481.8 | 393.5
Number of (super)scaffolds | 6,299 | 4,309 | 5,397 | 43,463 | 1,793
Scaffold N50 (kb) | 258.3 | 194.3 | 715.6 | 4,008.4 | 737.2
Number of protein coding genes | 16,667 | 12,817 | 15,130 | 14,622 | 18,071
Avg (sd) span of coding genes in bp | 8,129 (10,275) | 6,779 (8,555) | 6,001 (10,492) | 6,029 (7,127) | 8,081 (11,232)
Avg (sd) protein size in aa | 317.1 (337.6) | 453 (482.2) | 460.1 (520.5) | 406.8 (498.6) | 460.8 (475.8)
Avg (sd) number of coding exons | 5.8 (4.2) | 6.6 (6.3) | 6.7 (7.1) | 5.4 (5.8) | 6.5 (7.2)
Avg (sd) exon size in bp | 163 (259.9) | 206 (332.7) | 205 (300.9) | 224 (418.1) | 213 (327.7)
Avg (sd) intron size in bp | 1493 (3,404) | 965 (2,190) | 810 (3,475) | 1083 (1,250) | 1225 (2,658)
Total repeat content % | 27.9 % | 24.9 % | 10.2 % | 35.4 % | 34.0 %
Total GC % | 32.6 % | 32.8 % | 31.6 % | 38.8 % | 38.0 %
Total coding length % | 4 % | 7 % | 9 % | 4 % | 6 %
Total intron length % | 31 % | 26 % | 29 % | 16 % | 31 %
Supplementary Table 19. The 45 predicted microRNAs in the Melitaea cinxia genome. The last column lists homologous miRNAs from B. mori (bmo) and H. melpomene (hme).
Pre-miRNA name | Scaffold | Start | End | Strand | Homologous miRNA
mci-mir-279c | scaffold1180 | 8254 | 8333 | + | bmo-mir-279c
mci-mir-7 | scaffold1252 | 108675 | 108762 | - | bmo-mir-7
mci-mir-278 | scaffold1277 | 127020 | 127106 | - | bmo-mir-278
mci-mir-1a | scaffold1382 | 155022 | 155105 | + | bmo-mir-1a
mci-mir-1b | scaffold1382 | 155025 | 155102 | - | bmo-mir-1b
mci-mir-965 | scaffold1391 | 52841 | 52943 | + | bmo-mir-965
mci-mir-274 | scaffold1415 | 93378 | 93472 | - | bmo-mir-274
mci-mir-87 | scaffold148 | 106760 | 106857 | + | bmo-mir-87
mci-mir-285 | scaffold1516 | 57828 | 57915 | + | bmo-mir-285
mci-mir-193 | scaffold1595 | 78432 | 78509 | - | hme-mir-193
mci-mir-2788 | scaffold1595 | 74602 | 74682 | - | hme-mir-2788
mci-mir-2788 | scaffold1595 | 74604 | 74679 | - | bmo-mir-2788
mci-mir-282 | scaffold1770 | 219359 | 219443 | + | bmo-mir-282
mci-mir-10 | scaffold22 | 37401 | 37484 | + | bmo-mir-10
mci-mir-993a | scaffold22 | 120468 | 120556 | - | bmo-mir-993a
mci-mir-3327 | scaffold2296 | 61062 | 61166 | - | bmo-mir-3327
mci-mir-3338 | scaffold2296 | 61216 | 61322 | - | bmo-mir-3338
mci-mir-279a | scaffold2731 | 54187 | 54268 | + | bmo-mir-279a
mci-mir-iab-4 | scaffold2892 | 152106 | 152192 | + | bmo-mir-iab-4
mci-mir-iab-8 | scaffold2892 | 152106 | 152192 | - | bmo-mir-iab-8
mci-mir-2817 | scaffold307 | 112610 | 112675 | + | bmo-mir-2817
mci-mir-750 | scaffold331 | 38900 | 38980 | - | bmo-mir-750
mci-mir-1175 | scaffold331 | 38753 | 38834 | - | bmo-mir-1175
mci-mir-276 | scaffold3363 | 235757 | 235840 | + | bmo-mir-276
mci-mir-927 | scaffold4048 | 31990 | 32078 | - | bmo-mir-927
mci-mir-1926 | scaffold4048 | 31989 | 32079 | + | bmo-mir-1926
mci-mir-133 | scaffold4307 | 97572 | 97672 | - | bmo-mir-133
mci-mir-2a-1 | scaffold496 | 55874 | 55963 | + | bmo-mir-2a-1
mci-mir-2a-2 | scaffold496 | 56278 | 56352 | + | bmo-mir-2a-2
mci-mir-13a | scaffold496 | 56030 | 56107 | + | bmo-mir-13a
mci-mir-13b | scaffold496 | 56160 | 56239 | + | bmo-mir-13b
mci-mir-281 | scaffold5086 | 10861 | 10937 | - | bmo-mir-281
mci-mir-989 | scaffold5254 | 14941 | 15031 | - | bmo-mir-989
mci-mir-2755 | scaffold55 | 100756 | 100834 | + | bmo-mir-2755
mci-mir-210 | scaffold611 | 83830 | 83916 | + | bmo-mir-210
mci-mir-3286 | scaffold611 | 86876 | 86995 | + | bmo-mir-3286
mci-mir-307 | scaffold6220 | 50345 | 50440 | + | bmo-mir-307
mci-mir-263a | scaffold66 | 73039 | 73133 | + | bmo-mir-263a
mci-mir-124 | scaffold701 | 346075 | 346158 | - | bmo-mir-124
mci-mir-279d | scaffold703 | 329304 | 329380 | + | bmo-mir-279d
mci-mir-3362 | scaffold7248 | 2498 | 2608 | - | bmo-mir-3362
mci-mir-137 | scaffold727 | 131885 | 131975 | - | bmo-mir-137
mci-mir-277 | scaffold923 | 84009 | 84129 | + | bmo-mir-277
mci-mir-317 | scaffold923 | 63251 | 63341 | + | bmo-mir-317
mci-mir-9a | scaffold997 | 262482 | 262570 | + | bmo-mir-9a
Supplementary Table 20. Predicted microRNAs located in intragenic regions.
miRNA name | Location | Gene model ID
mci-mir-279c | intronic | MCINX000703
mci-mir-7 | intronic | MCINX001004
mci-mir-278 | intronic | MCINX001113
mci-mir-965 | intronic | MCINX001656
mci-mir-274 | intronic | MCINX001774
mci-mir-10 | coding+intronic | MCINX006394
mci-mir-750 | coding+3'UTR | MCINX009447
mci-mir-1175 | 3'UTR | MCINX009447
mci-mir-2a-1 | intronic | MCINX012468
mci-mir-2a-2 | intronic | MCINX012468
mci-mir-13a | intronic | MCINX012468
mci-mir-13b | intronic | MCINX012468
mci-mir-281 | intronic | MCINX012603
mci-mir-2755 | intronic | MCINX013168
mci-mir-124 | coding+intronic | MCINX014412
mci-mir-317 | coding+intronic | MCINX016287
Supplementary Table 21. Summary of putative Melitaea cinxia transfer RNAs. Three first columns show individual codons, codon percentage and counts in protein coding genes. The five last columns show the respective anticodons, detected number of tRNA genes, and putative tRNA pseudogenes (predictions with poor primary or secondary structure) representing each codon. Codon GCA GCC GCG GCT AGA AGG CGA CGG CGT AAC AAT GAC GAT TGC TGT CAA CAG GAA GAG GGA GGC GGG CAC ATA ATC ATT CTA CTC CTG CTT TTA TTG AAA AAG ATG TTC TTT CCA CCC CCG CCT TGA AGC AGT TCA TCC TCG TCT TAA ACA ACC ACG ACT TGG TAC TAT GTA GTC GTG GTT CAT CGC GGT TAG
Percentage
Codon count
1.29 % 0.82 % 0.90 % 1.40 % 1.93 % 0.90 % 1.04 % 0.57 % 0.90 % 1.96 % 3.45 % 1.44 % 2.18 % 1.01 % 1.75 % 2.24 % 1.26 % 2.94 % 1.39 % 1.33 % 0.87 % 0.55 % 1.09 % 2.88 % 1.41 % 2.97 % 1.28 % 0.97 % 1.17 % 1.50 % 2.81 % 1.83 % 4.69 % 2.16 % 2.02 % 1.66 % 3.31 % 1.29 % 0.61 % 0.74 % 1.07 % 1.28 % 1.08 % 1.66 % 1.76 % 0.87 % 1.00 % 1.49 % 2.03 % 2.10 % 0.96 % 1.08 % 1.65 % 1.17 % 1.64 % 2.65 % 1.58 % 1.04 % 1.29 % 1.84 % 1.48 % 0.72 % 1.19 % 0.86 %
106210 67386 74079 115148 159209 74486 85621 46675 74560 161312 284398 119017 179404 83373 144140 184558 104065 242359 114481 109259 71622 45594 89688 237526 116578 244784 105194 79975 96122 123727 231187 151040 386720 177699 166813 136763 272857 106306 50217 60895 88419 105199 88795 137205 144876 71569 82336 123014 167376 173128 79278 88825 136204 96337 134737 218280 130504 86051 106399 151244 122173 59238 97868 70530
tRNA type
tRNA type count
TGC Ala GGC Ala CGC Ala AGC Ala TCT Arg CCT Arg TCG Arg CCG Arg ACG Arg GTT Asn ATT Asn GTC Asp ATC Asp GCA Cys ACA Cys TTG Gln CTG Gln TTC Glu CTC Glu TCC Gly GCC Gly CCC Gly GTG His TAT Ile GAT Ile AAT Ile TAG Leu GAG Leu CAG Leu AAG Leu TAA Leu CAA Leu TTT Lys CTT Lys CAT Met GAA Phe AAA Phe TGG Pro GGG Pro CGG Pro AGG Pro TCA STOP/SeC GCT Ser ACT Ser TGA Ser GGA Ser CGA Ser AGA Ser TTA STOP TGT Thr GGT Thr CGT Thr AGT Thr CCA Trp GTA Tyr ATA Tyr TAC Val GAC Val CAC Val AAC Val ATG His GCG Arg ACC Gly CTA STOP
10 55 10 32 10 7 7 1 19 34 2 74 1 11 4 7 10 15 15 10 17 5 22 8 27 17 6 1 6 13 6 8 10 11 22 10 7 6 2 18 15 2 15 1 7 11 11 11 2 91 1620 16 58 11 21 1 13 1 11 14 NO HITS NO HITS NO HITS NO HITS
Anticodon
Pseudo Pseudo tRNA tRNA count Pseudo Pseudo
25 3
Pseudo Pseudo Pseudo Pseudo
1 178 2 1
Pseudo Pseudo Pseudo
281 15 8
Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo
4 4 10 8 2 1 3 3
Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo
396 62 7 3 1 5 1 3 3
Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo
53 1 3 6 42 12 2 4 1
Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo
8 3 19 2 6 1 1 1899 151 12 12
Pseudo Pseudo Pseudo Pseudo
153 13 2 3
Pseudo Pseudo Pseudo
1 30 13
Pseudo
1
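The codon and anticodon columns of Supplementary Table 21 are related by reverse complementation: the anticodon listed for a codon is the reverse complement of that codon, written 5' to 3' in DNA letters (for example GCA pairs with TGC, and ATG with CAT). A minimal sketch of that relationship follows; the function name `anticodon` is chosen here for illustration.

```python
# Complement table for DNA letters, as used in the codon and anticodon columns.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def anticodon(codon):
    """Return the reverse complement of a three-letter DNA codon."""
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three letters")
    return codon.upper().translate(COMPLEMENT)[::-1]


print(anticodon("GCA"))  # TGC (Ala)
print(anticodon("AGA"))  # TCT (Arg)
print(anticodon("ATG"))  # CAT (Met)
```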
Supplementary Table 22. Putative high-confidence (E < 1e-5) genes encoding small nuclear RNA components of U2- and U12-dependent spliceosomes. The candidates were visually inspected to confirm that they contained specific sequence elements such as the sm-site and sequences functioning in intron recognition. U snRNA
Scaffold
Start
End
U1
scaffold1909 scaffold3257 scaffold1383 scaffold233 scaffold1202 scaffold87 scaffold2349 scaffold3170 scaffold1909 scaffold616 scaffold1352 scaffold1202 scaffold1687
162264 18833 2727 40825 85179 167204 134321 34418 177345 28659 159872 82331 9252
162424 18993 2887 40985 85339 167365 134478 34581 177186 28500 159713 82172 9117
+ + + + + + + + -
4.20E-35 6.80E-33 2.70E-32 6.60E-32 2.70E-31 3.60E-29 3.40E-17 1.70E-14 6.60E-44 8.20E-44 1.30E-41 4.10E-38 2.10E-05
U2
scaffold1202 scaffold5582 scaffold606 scaffold1462 scaffold2452 scaffold4588
89520 12338 28813 98538 26989 1848
89713 12524 28747 98472 26923 1782
+ + -
6.60E-42 2.20E-17 2.60E-25 2.60E-25 2.60E-25 2.60E-25
U4
scaffold3126
27046
26922
-
5.60E-33
scaffold7000
15406
15282
-
5.60E-33
U5
U6
Strand
Eval
scaffold2452
27395
27506
+
5.00E-18
scaffold1447 scaffold96 scaffold1684 scaffold1081
141496 120440 28531 46480
141609 120552 28643 46590
+ + + +
2.40E-17 6.00E-17 7.90E-17 2.40E-16
scaffold2051
734
841
+
3.10E-08
scaffold2231
10757
10847
+
4.70E-23
scaffold3494 scaffold4568 scaffold2311 scaffold3359 scaffold2322 scaffold882 scaffold1116 scaffold3881
13850 2228 58101 13047 184062 15422 185794 62527
13940 2318 58191 13137 184163 15511 185875 62617
+ + + + + + + +
4.70E-23 4.70E-23 6.50E-23 1.10E-22 1.20E-13 2.20E-11 1.90E-06 2.20E-06
scaffold1227
64561
64640
+
3.40E-05
U11
none
U12
scaffold1158
186798
186943
+
1.80E-09
U4atac
scaffold185
154921
154789
-
8.30E-20
U6atac
scaffold363
29852
29942
+
6.40E-15
Supplementary Table 23. Annotation data for the Melitaea cinxia mitochondrial genome. For each feature the start and stop positions and length are given in bp. Direction indicates the direction of transcription for coding regions. Start and stop codons are shown for PCGs (Protein Coding Genes), and AT content is given for PCGs, rRNAs and control region. tRNA nomenclature follows the standard three letter IUPAC amino acid code. Type
Gene
tRNA
Met
tRNA tRNA PCG
NADH dehydrogenase subunit 2
tRNA
Trp
tRNA
Cys
tRNA PCG
Start
Stop Length Direction Start codon Stop codon AT content
1
68
68
forward
Ile
70
136
67
forward
Gln
134
202
69
reverse
249
1262
1014
forward
1265
1333
69
forward
1326
1389
64
reverse
Tyr
1390
1456
67
reverse
Cytochrome Oxidase subunit 1
1459
2994
1536
forward
tRNA
Leu
2990
3056
67
forward
PCG
Cytochrome Oxidase subunit 2
3057
3732
676
forward
tRNA
Lys
3733
3803
71
forward
tRNA
Asp
3806
3872
67
forward
PCG
ATP sythase subunit 8
3873
4040
168
PCG
ATP sythase subunit 6
4030
4707
PCG
Cytochrome Oxidase subunit 3
4707
tRNA
Gly
5498
PCG
NADH dehydrogenase subunit 3
tRNA tRNA
ATT
TAA
84
CGA
TAA
70.8
ATG
T--
75.6
forward
ATT
TAA
91.7
678
forward
ATG
TAA
78
5495
789
forward
ATG
TAA
73.3
5566
69
forward
5567
5920
354
forward
ATT
TAA
81.9
ALa
5924
5988
65
forward
Arg
5988
6049
62
forward
tRNA
Asn
6050
6113
64
forward
tRNA
Ser
6114
6174
61
forward
tRNA
Glu
6180
6249
70
forward
tRNA
Phe
6250
6313
64
reverse
PCG
NADH dehydrogenase subunit 5
6312
8046
1735
reverse
ATT
T--
81.2
tRNA
His
8047
8114
68
reverse
PCG
NADH dehydrogenase subunit 4
8114
9453
1340
reverse
ATG
T--
79.7
PCG
NADH dehydrogenase subunit 4L
9454
9741
288
reverse
ATG
TAA
83.3
tRNA
Thr
9744
9807
64
forward
tRNA
Pro
9808
9872
65
reverse
PCG
NADH dehydrogenase subunit 6
9875
10402
528
forward
ATT
TAA
83.7
PCG
Cytochrome B
10410
11558
1149
forward
ATG
TAA
75.7
tRNA
Ser
11557
11624
68
forward
PCG
NADH dehydrogenase subunit 1
11649
12586
938
reverse
ATG
T--
78.7
tRNA
Leu
12588
12656
69
reverse
rRNA
Large subunit ribosomal RNA
12664
14001
1338
reverse
tRNA
Val
14002
14067
66
reverse
rRNA
Small subunit ribosomal RNA
14069
14840
772
reverse
14841
15171
331
Control region
84.7 84.8 93.7
Supplementary Table 24. Comparison of mitochondrial protein coding gene start and stop codons between Melitaea cinxia and four other Lepidoptera. Start and stop codons are shown for each of the 13 protein coding genes of the mitochondrial genome. Species used in comparison with M. cinxia are the butterflies, Kallima inachus and H. melpomene, and the moths, B. mori and P. xylostella.
Gene | Melitaea cinxia start | stop | Kallima inachus start | stop | Heliconius melpomene start | stop | Bombyx mori start | stop | Plutella xylostella start | stop
NADH dehydrogenase subunit 2 | ATT | TAA | ATT | TAA | ATT | TAA | ATA | TAA | ATT | TAA
Cytochrome Oxidase subunit 1 | CGA | TAA | CGA | TAA | CGA | TAA | CGA | TAA | CGA | TAA
Cytochrome Oxidase subunit 2 | ATG | T-- | ATG | T-- | ATG | T-- | ATG | T-- | ATG | T--
ATP synthase subunit 8 | ATT | TAA | ATT | TAA | ATT | TAA | ATA | TAA | ATC | TAA
ATP synthase subunit 6 | ATG | TAA | ATG | TAA | ATG | TAA | ATG | TAA | ATG | TAA
Cytochrome Oxidase subunit 3 | ATG | TAA | ATG | TAA | ATG | TAA | ATG | TAA | ATG | TAA
NADH dehydrogenase subunit 3 | ATT | TAA | ATT | T-- | ATT | TAG | ATT | TAA | ATG | TAA
NADH dehydrogenase subunit 5 | ATT | T-- | ATT | T-- | ATT | TAA | ATT | TAA | ATT | TAA
NADH dehydrogenase subunit 4 | ATG | T-- | ATG | T-- | ATG | T-- | ATG | TAA | ATG | T--
NADH dehydrogenase subunit 4L | ATG | TAA | ATG | TAA | ATG | TAA | ATG | TAA | ATG | TAA
NADH dehydrogenase subunit 6 | ATT | TAA | ATT | TAA | ATT | TAA | ATT | TAA | ATT | TAA
Cytochrome B | ATG | TAA | ATG | TAA | ATG | TAA | ATG | TAA | ATG | TAA
NADH dehydrogenase subunit 1 | ATG | T-- | ATG | T-- | ATA | TAA | ATT | TAA | ATG | TAA
Supplementary Table 25. Groups of manually annotated genes.
Gene group | Number of manually curated genes
Heat shock proteins | 72
Other chaperone-related genes | 35
Immunity related genes | 65
Cytochrome P450 | 51
Muscles, muscle development, flight | 28
Odorant binding proteins | 44
Glycolysis | 18
Growth | 28
Homeobox genes | 16
Other genes | 172
Total | 529
Supplementary Table 26. Variants from genomic and RNA-seq data from the Åland Islands population. The total number of variants in the top row has been divided into five categories shown as percents. Each variant belongs to only one category; if an indel spanned several categories, the priority in class assignment was 1) coding exon, 2) 5’UTR, 3) 3’UTR, 4) intron, 5) intergenic.
 | Genomic SNP | Genomic indel | RNA SNP | Genomic SNP | Genomic indel
sequencing library | SOLiD_Ålpool | SOLiD_Ålpool | PolyA-anchored PE | IlluminaPE1 IlluminaPE2 | IlluminaPE1 IlluminaPE2
total number | 3,020,826 | 378,206 | 20,281 | 5,245,947 | 563,225
coding exon% | 3.5 | 1 | 19 | 2.9 | 0.7
5’UTR% | 0.6 | 0.7 | 1.9 | 0.5 | 0.5
3’UTR% | 1.4 | 1.8 | 25.9 | 1.5 | 1.8
intron% | 36 | 38.3 | 29.1 | 36.2 | 37.7
intergenic% | 58.5 | 58.3 | 24.1 | 59 | 59.4
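The priority rule in the caption of Supplementary Table 26 (coding exon, then 5'UTR, then 3'UTR, then intron, then intergenic) can be written as a short sketch. The function name, the plain-ASCII category labels and the treatment of a variant with no overlapping annotation as intergenic are assumptions made for illustration; the annotation pipeline actually used is not described in this section.

```python
# Categories in decreasing priority, following the caption of Supplementary Table 26.
PRIORITY = ["coding exon", "5'UTR", "3'UTR", "intron", "intergenic"]


def assign_category(overlapping):
    """Return the single category of a variant.

    `overlapping` is the collection of annotation categories that the
    variant spans. A variant overlapping no annotated feature is treated
    as intergenic (an assumption of this sketch).
    """
    for category in PRIORITY:
        if category in overlapping:
            return category
    return "intergenic"


# An indel spanning an intron and a 3'UTR is counted once, as 3'UTR.
print(assign_category({"intron", "3'UTR"}))        # 3'UTR
print(assign_category({"intron", "coding exon"}))  # coding exon
print(assign_category(set()))                      # intergenic
```

Because each variant receives exactly one category, the five percentages in each column of the table sum to 100.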
Supplementary Table 27. Summary statistics of SNP and indel densities per kb for genic, coding, intronic and UTR regions.
Region | Median (25%, 75% quantile) SNP density per kb | Median (25%, 75% quantile) indel density per kb
genic: 5'UTR-3'UTR | 14.6 (7.9, 20.8) | 1.38 (0.5, 2.3)
coding | 8.2 (3.8, 13.8) | 0.0 (0, 0)
intronic | 15.3 (7.5, 22.8) | 1.53 (0.4, 2.7)
5’ and 3’ UTRs | 7.34 (0, 17.2) | 0.0 (0, 0)
Supplementary Table 28. Chromosome mapping of one-to-one orthologous proteins between Melitaea cinxia (x-axis) and Bombyx mori (y-axis). The M. cinxia chromosomes have been reordered to match the corresponding B. mori chromosomes. Chromosome 1 represents the Z chromosome in both species. Boxed elements indicate one-to-one (25 cases) and two-to-one (3 cases) mapped chromosomes between M. cinxia and B. mori. From the total set of 4,485 orthologs, only 4% (178) map to non-orthologous chromosomes. 1 1 208 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 2 11 1 12 0 13 0 14 1 15 0 16 1 17 1 18 0 19 0 20 0 21 0 22 0 23 0 24 0 25 0 26 0 27 0 28 0
28 17 2 6 5 20 13 10 9 12 0 0 1 1 2 2 0 1 1 0 52 0 2 0 0 0 0 0 0 0 0 163 2 0 1 0 1 0 1 1 0 0 254 0 0 0 0 0 0 0 0 0 0 240 0 0 0 0 1 0 0 0 0 0 148 0 1 0 0 0 1 0 0 0 0 141 0 0 0 0 0 0 0 0 0 0 161 0 0 0 0 2 0 0 0 0 0 176 0 0 0 0 0 0 0 0 0 1 218 0 0 0 0 0 0 1 0 0 1 188 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 2 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 2 0 2 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 2 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 2 2 0 0 3 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
31 4 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 37 1 0 0 189 0 0 0 147 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 1 0
26 3 21 8 7 16 25 19 11 14 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 2 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 68 2 0 2 1 0 0 0 0 0 0 235 1 1 0 1 1 0 0 0 0 0 166 0 0 0 0 0 0 0 0 1 0 147 0 0 0 0 0 0 0 0 3 0 136 0 0 0 0 0 0 1 0 0 0 136 0 1 0 0 0 0 0 0 0 0 101 0 0 0 0 0 1 0 0 0 0 110 1 5 4 0 0 0 0 0 1 0 178 0 0 0 0 0 0 0 0 0 0 131 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0
30 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40 0 0 0 0 0
27 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 50 0 0 0 0
29 18 23 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 3 1 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0 0 45 1 0 0 187 1 1 0 104 0 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 79 0
22 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 3 0 72
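The summary figure in the caption of Supplementary Table 28 (178 of 4,485 orthologs, about 4%, on non-orthologous chromosomes) is the share of counts that fall outside the boxed chromosome pairs. A sketch of that calculation is given below; the 3 x 3 matrix and the set of boxed cells are toy values invented for illustration, not rows of the table.

```python
def non_orthologous_fraction(counts, orthologous_cells):
    """Fraction of ortholog counts lying outside the orthologous chromosome pairs.

    `counts[i][j]` is the number of one-to-one orthologs on chromosome i of
    one species and chromosome j of the other; `orthologous_cells` is the set
    of (i, j) index pairs regarded as orthologous (the boxed cells).
    """
    total = sum(sum(row) for row in counts)
    inside = sum(counts[i][j] for i, j in orthologous_cells)
    return (total - inside) / total


# Toy matrix (hypothetical values).
toy_counts = [
    [208, 0, 1],
    [2, 52, 0],
    [0, 1, 163],
]
toy_boxed = {(0, 0), (1, 1), (2, 2)}
toy_fraction = non_orthologous_fraction(toy_counts, toy_boxed)  # 4 of 427 counts

# Figures quoted in the captions of Supplementary Tables 28 and 29.
print(round(100 * 178 / 4485))     # 4
print(round(100 * 181 / 3869, 1))  # 4.7
```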
Supplementary Table 29. Chromosome mapping of one-to-one orthologous proteins between Melitaea cinxia (x-axis) and Heliconius melpomene (y-axis). The M. cinxia chromosomes have been reordered to match the corresponding H. melpomene chromosomes. Chromosome 1 represents the Z chromosome in M. cinxia. Boxed elements indicate one-to-one (11 cases) and two-to-one (10 cases) mapped chromosomes between M. cinxia and H. melpomene chromosomes. From the total set of 3,869 orthologs, 4.7% (181) map to non-orthologous chromosomes. Z 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 2 47 2 0 260 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 1 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 1 0 0
27 0 44 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
21 5 0 0 0 1 73 0 0 123 0 0 0 1 1 0 3 0 0 0 0 0 1 0 6 2 0 0 0 0 0 0 0 0 1 0 10 0 0 0 0 0 0 0
19 17 10 0 0 0 0 3 0 0 0 0 0 0 0 79 0 0 0 149 0 1 3 162 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0
31 12 0 0 0 0 0 0 0 0 0 0 0 1 36 0 0 156 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0
28 18 20 6 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 60 0 0 0 0 167 0 0 1 1 144 0 0 0 0 235 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 3 0 0 0 0 1 1 0 1 2 1 0 0 2 0 0 0 1 1 0 0 0 0
22 3 13 25 11 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 3 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 89 0 0 0 0 0 232 1 0 0 1 0 158 102 0 0 1 0 1 148 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 2 0 0 0 2 0 0 0
26 16 8 7 15 0 0 1 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 2 0 0 0 0 0 1 1 0 66 0 0 0 0 0 113 0 0 0 0 0 158 0 0 0 0 0 170 0 0 3 1 0 104 1 0 2 0 0 1 0 1 1 0 0 0 0 0 0
29 14 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 38 0 0 129 0 0 0 0
24 4 23 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 2 0 0 0 2 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 65 0 1 0 193 100 0 0 0
9 0 1 1 0 10 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 75
30 0 0 0 0 0 0 0 0 0 0 7 0 1 0 0 1 0 0 0 0 13
Supplementary Table 30. The total counts of one-to-one orthologous proteins mapped between the chromosomes of Melitaea cinxia and Biston betularia. The x-axis represents M. cinxia chromosomes reordered to match the corresponding B. betularia chromosomes on the y-axis. The diagonal shows one-to-one mapping to orthologous chromosomes between the two species. From the total set of 113 orthologs, 95% map to orthologous chromosomes. Chromosome 1 represents the Z chromosome in both species.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
1 12 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
28 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
31 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 1
29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 5 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0
30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0
27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3
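The share of orthologs that map to orthologous chromosomes is the sum of the matrix diagonal divided by the sum of all cells. The following is a minimal sketch of that calculation; the function name and the toy matrix are illustrative assumptions, not part of the original analysis. Summing the rows above in the same way gives 107 of 113 counts on the diagonal (about 94.7%), consistent with the 95% stated in the caption.

```python
def diagonal_fraction(matrix):
    """Return (on_diagonal, total, fraction) for a square count matrix.

    Rows and columns are assumed to be ordered so that orthologous
    chromosome pairs fall on the main diagonal.
    """
    total = sum(sum(row) for row in matrix)
    on_diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    fraction = on_diagonal / total if total else 0.0
    return on_diagonal, total, fraction


# Toy 3x3 matrix for illustration only (not data from the table).
toy = [
    [12, 0, 1],
    [0, 2, 0],
    [0, 1, 4],
]
on_diagonal, total, fraction = diagonal_fraction(toy)
print(f"{on_diagonal}/{total} = {fraction:.1%}")
```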
Supplementary Table 31. Chromosome mapping of one-to-one orthologous proteins between Melitaea cinxia (x-axis) and Plutella xylostella (y-axis). The chromosomes of M. cinxia have been reordered to match the corresponding P. xylostella chromosomes. Chromosome 1 represents the Z chromosome in both species. Boxed elements show orthologous chromosomes between M. cinxia and P. xylostella. From the total set of 701 orthologs, 23.8% map to non-orthologous chromosomes (16% in autosomes). M. cinxia chromosomes 27, 28 and 30 have no clear orthologous chromosome in P. xylostella.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
1 21 0 0 0 0 0 0 0 0 1 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 0 0 12 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 10 0 0 57 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 12 0 0 0 58 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
5 1 0 0 0 0 9 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0
20 2 0 0 0 0 0 10 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
13 1 0 0 0 0 0 0 18 0 3 0 0 0 0 1 1 0 2 0 2 0 0 0 0 0 0 1 0 0 0 0
10 1 0 0 0 0 0 0 0 11 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 4 0 0 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
31 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0 0 21 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
15 2 0 0 0 0 0 0 1 0 0 0 0 36 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 7 0 0 0 0 0 0 0 0 0 0 0 0 1 35 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
21 7 0 0 0 0 1 0 0 0 0 0 0 0 0 0 35 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
8 1 0 0 0 0 0 0 0 0 7 1 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 22 0 0 0 0 0 0 0 0 0 0 0 0 0
16 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 1 1 1 0 0 0 0 0 0
25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 1 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 11 0 0 1 0 0 19 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 17 0 0 0 0 0 0 0 0
29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0
18 0 0 0 1 0 1 0 0 0 5 0 0 0 0 0 3 0 0 0 0 0 1 0 0 25 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 5 0 0 0 0 4 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0
22 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0
12 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 3 0 14 0 0
27 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
30 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Supplementary Table 32. Chromosome mapping using pairwise sequence alignments of genomic scaffolds between Melitaea cinxia (x-axis) and Plutella xylostella (y-axis). The chromosomes of M. cinxia have been reordered to match the corresponding P. xylostella chromosomes. Chromosome 1 represents the Z chromosome in both species. The matrix shows the total length of aligned sequences mapped between chromosomes. Boxed elements show orthologous chromosomes between M. cinxia and P. xylostella.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
1 374 0 0 0 0 0 0 0 0 0 0 247 0 0 0 0 0 0 123 0 0 0 149 0 0 0 0 0 138 0 0
28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 2 6 5 20 13 0 0 0 0 0 94 0 0 0 0 0 0 826 0 0 0 0 0 0 11895 0 294 0 76 0 263 6702 232 0 0 0 0 0 2221 0 0 0 0 0 0 2196 0 0 0 489 0 0 7748 0 0 0 0 0 0 0 0 0 107 0 0 0 0 133 0 0 0 0 0 0 217 0 0 0 0 0 0 361 0 0 0 0 0 0 0 0 359 152 0 0 0 0 0 0 0 0 442 0 0 567 0 0 0 0 0 0 256 0 139 0 0 0 0 0 0 0 149 571 0 0 0 0 0 0 0 0 0 0 0 0 0 0 137 0 0 0 0 108 0 0 0 0 0 0 260 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 681 0 0 0 0 0 0 0 0 0 0
10 9 0 0 0 0 139 0 0 0 281 0 0 0 0 0 0 0 910 0 0 2925 0 0 0 181 0 0 101 0 0 0 0 0 113 0 0 0 0 0 176 0 0 297 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
31 4 15 26 3 21 8 7 96 0 0 0 240 0 0 0 87 0 0 0 0 0 0 0 0 0 0 0 147 0 0 0 0 113 0 400 332 0 0 0 363 0 0 0 0 0 0 0 88 0 0 0 0 0 618 0 225 0 0 0 138 0 0 0 0 0 0 0 149 0 0 0 89 0 0 0 0 320 0 0 89 0 0 0 412 0 0 154 0 0 0 0 0 0 0 0 0 2323 0 0 0 0 0 0 0 246 2688 0 82 0 0 0 0 0 0 1266 0 0 0 0 176 1024 0 0 2528 0 0 0 88 0 0 0 0 3605 0 0 109 0 0 0 0 0 1424 0 231 0 0 0 149 0 0 2622 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 120 0 0 0 0 0 0 0 0 0 0 0 153 0 0 0 143 253 0 0 0 0 0 0 89 0 0 0 0 0 0 0 102 0 0 0 0 0 0 0 204 121 0 0 128 0 0 0 408 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 142 0 0 0 0 0 0 0 0 0 0 0
16 25 19 11 14 0 0 0 0 568 0 0 0 0 0 0 0 0 0 0 278 0 91 0 0 0 0 0 102 0 0 0 0 0 0 0 0 0 0 151 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 152 0 0 0 435 0 0 0 0 560 0 0 0 0 0 189 0 141 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 207 0 150 89 0 0 1600 0 0 0 0 0 3372 0 0 0 0 0 1710 0 0 0 0 0 3051 143 0 0 0 0 0 0 0 0 186 0 0 0 0 0 0 0 0 0 0 149 0 0 0 0 0 0 0 0 0 0 0 0 0 611 0 0 0 0 0
29 18 23 24 22 0 0 1446 0 0 0 0 0 0 0 0 78 0 0 0 0 1584 0 0 0 645 374 1011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 182 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 75 212 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 802 0 0 186 0 0 0 0 0 1996 0 0 0 0 248 233 0 0 0 0 0 3104 0 0 0 102 0 3304 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 121 0 0 0 0 0 0 0 159 0 0 0 0 0 0 0 0 0 0 0 502 0 0
30 0 0 0 0 0 0 0 0 0 0 0 92 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Supplementary Table 33. Classification and distribution of repetitive elements in the whole genome and in the estimated fusion sites in Bombyx mori. Whole genome Number Length Percentage of hits occupied (bp) (%) 44,179 37,810 2,887 1,904 871 403 242 62
10,215,142 9,228,198 415,788 267,754 65,485 111,311 103,796 22,810
2.708 2.447 0.110 0.071 0.017 0.030 0.028 0.006
95 81 9 4
17,924 16,597 820 420
2.769 2.564 0.127 0.065
1
87
0.013
Non-LTR Retrotransposons Total LINEs LINE/Jockey LINE/RTE LINE/CR1 LINE/R1 LINE/L2 LINE/R4 LINE/LTR LINE/I LINE/Daphne LINE/CRE LINE/DNA LINE/L1 LINE/chimera LINE/R2 LINE/Unknown
233,685 49,249 48,520 49,476 16,814 13,582 7,521 1,514 720 864 117 85 58 41 18 45,106
49,225,708 13,170,459 8,911,387 8,201,382 5,215,787 2,932,217 1,987,607 369,266 194,652 192,932 28,723 27,125 19,894 19,095 14,550 7,940,632
13.051 3.492 2.363 2.174 1.383 0.777 0.527 0.098 0.052 0.051 0.008 0.007 0.005 0.005 0.004 2.105
442 81 66 113 34 27 26 3 2 1 1
104,629 25,347 11,840 20,762 11,920 5,775 7,230 505 740 73 183
16.163 3.916 1.829 3.207 1.841 0.892 1.117 0.078 0.114 0.011 0.028
4 2 82
2,422 2,899 14,933
0.374 0.448 2.307
SINE/Bm1 SINE/Unknown
196,265 64,339
41,163,590 9,987,682
10.914 2.648
332 73
71,706 13,189
11.077 2.037
220
42,366
0.011
12,301 4,167 3,391 327 43 18 4,355
3,907,127 1,536,109 1,199,945 106,852 14,347 10,113 1,039,761
1.036 0.407 0.318 0.028 0.004 0.003 0.276
36 23 6
13,362 10,925 1,844
2.064 1.688 0.285
7
593
0.092
24,765 115,626 219,518
946,810 4,286,427 46,739,607
0.251 1.136 12.392
48 113 432
2,630 4,379 101,863
0.406 0.676 15.736
910,898
166,514,459
44.147
1,571
329,682
50.928
DNA transposons
Total DNA transposons DNA/Tc1_mariner DNA/Helitron DNA/hAT DNA/BMC1 DNA/piggybac DNA/Harbinger DNA/P
Fusion sites Number Length Percentage of hits occupied (bp) (%)
Penelope LTR Retrotransposons Total LTR/Pao LTR/Gypsy LTR/Copia LTR/Micropia LTR/Helitron LTR/Unknown Simple_repeat Low_complexity Unknown Total repeats
Supplementary Table 34. Classification and distribution of repetitive elements in the whole genome and in the estimated fusion sites in Heliconius melpomene. Whole genome Number of Length Percentage hits occupied (bp) (%) DNA transposons
Total DNA transposons DNA/Helitron DNA/Mariner DNA/Tc3 DNA/piggBac DNA/hAT DNA/Harbinger DNA/TcMar-Fot1 DNA/Unknown
Non-LTR Retrotransposons: Total LINEs LINE/RTE LINE/Daphne LINE/L2 LINE/Jockey LINE/Zenon LINE/Vingi LINE/R1 LINE/R4 LINE/I LINE/CR1 LINE/Proto2 LINE/Unknown SINE LTR Retrotransposons Total LTR LTR/Gypsy LTR/Copia Simple repeat Low_complexity Unknown Total repeats
Fusion sites Number of Length Percentage hits occupied (bp) (%)
151,099 54,157 49,346 36,787 4,712 2,568 740 5 2,784
27,204,205 14,314,694 5,690,635 3,970,760 1,197,562 1,049,617 237,052 2,645 741,240
10.089 3.671 1.460 1.018 0.307 0.269 0.061 0.001 0.190
2,617 901 851 644 97 52 16
464,938 223,988 107,869 71,112 20,378 18,839 5,360
11.027 5.312 2.558 1.687 0.483 0.447 0.127
56
17,392
0.412
32,088 7,255 3,838 3,379 1,786 3,496 240 1,082 1,189 2,032 901 128 6,762
9,821,700 2,390,781 1,224,728 1,169,726 941,745 851,094 64,090 580,811 475,682 408,053 366,346 56,091 1,292,553
3.642 0.613 0.314 0.300 0.242 0.218 0.016 0.149 0.122 0.105 0.094 0.014 0.332
591 116 72 57 44 56 5 33 21 41 11 2 133
179,449 26,783 23,945 19,533 23,171 10,842 352 19,455 10,068 10,839 3,535 567 30,359
4.256 0.635 0.568 0.463 0.550 0.257 0.008 0.461 0.239 0.257 0.084 0.013 0.720
155,981
21,958,707
5.632
2,541
356,801
8.462
4,918 2,039 2,850 29
1,258,731 634,494 612,458 11,779
0.467 0.163 0.157 0.003
106 39 65 2
22,920 11,071 11,557 292
0.544 0.263 0.274 0.007
20,730 294,250 54,914 713,980
856,174 13,049,905 6,074,828 80,224,250
0.216 3.347 1.558 29.751
346 4,401 975 11,577
14,079 194,615 113,577 1,346,379
0.334 4.616 2.694 31.932
Supplementary Notes
Supplementary Note 1. Biology of the Glanville fritillary butterfly
The Glanville fritillary butterfly (Melitaea cinxia L.) has become a widely recognized model system in population and evolutionary biology, especially in the study of the ecological, genetic, and evolutionary consequences of habitat fragmentation 16. The Glanville fritillary study system in the Åland Islands in Finland has played a pivotal role in the development of the metapopulation theory (reviewed in17, 18). The empirical study was initiated in 1991 across a very large area, 50 by 70 km, covering the main Åland Island and several other large islands in SW Finland 17, 19. The landscape is highly fragmented, and the habitat suitable for reproduction consists of 4,248 (in 2012) small dry meadows with the pooled area of 783 ha, which covers 0.5% of the total land area (1,552 km²)19. This study system comprises a prime example of classic metapopulations, systems of extinction-prone local populations that persist in a balance between stochastic local extinctions and re-colonizations of currently unoccupied habitat patches 17. Local extinctions and re-colonizations have been documented systematically since 1993 18, 19, yielding an unparalleled record of ca. 2,000 extinction and re-colonization events. Knowledge of the long-term history of hundreds of local populations provides unique material for experimental studies 20-22. The long-term research has yielded several significant ‘firsts’, including the demonstration of elevated risk of local extinction due to inbreeding23, alternative stable states in metapopulation dynamics24, unequivocal evidence for an extinction threshold in metapopulation dynamics 25, and demonstration of allelic variation in a metabolic gene influencing population dynamics 26.
The Glanville fritillary has one generation per year in northern Europe, adults flying in June to early July.
Females oviposit in clusters of 50 to 250 eggs on two host plant species, Plantago lanceolata L. and Veronica spicata L27, 28. Larvae hatch in 2-3 weeks, forage gregariously and spin a web around the host plant, in which they stay at night, during bad weather and when not feeding. Half-grown larvae overwinter in compact ‘winter nests’, which they spin at the base of the host plant at the end of August (see Figure 1 in Ojanen et al. 19). The larvae resume feeding in the spring when host plants renew growth, usually in the beginning of April, and remain gregarious until the final instar. Pupation takes place in May. Further details of the life-cycle and life history are reported by Kuussaari27, Nieminen et al.28, Hanski17, Hanski et al.21, Saastamoinen29 and Saastamoinen et al.30. The fact that each larval group spins a winter nest before winter diapause makes the large-scale survey of local populations possible. The winter nests are conspicuous in early September, making it feasible to aim at counting all winter nests on every meadow in a network of thousands of meadows, giving an estimate of local population sizes across the entire study area as well as an opportunity to sample larval family groups for experiments 19.
The transcriptome of the Glanville fritillary was initially sequenced with a Roche 454 FLX sequencer (454 Life Sciences, CT, USA)31, which was the starting point for the present work. Two gene expression experiments have compared female butterflies originating from newly established and old local populations22 and full-sib families of post-diapause larvae reared under different thermal conditions 32. Since 2005, we have conducted association studies on several candidate genes. In particular, more than 10 studies on the gene Pgi, encoding the glycolytic enzyme phosphoglucose isomerase, have revealed strong associations with a range of life history traits and measures of individual performance, such as the peak flight metabolic rate 20, dispersal rate in the field33, body temperature in low ambient temperatures 34, egg clutch size29, lifespan30, and population growth rate in the field26.
Supplementary Note 2. Genome Sequence
Genome sequencing strategy
In northern Europe, the Glanville fritillary has one generation per year and an obligatory diapause of six to eight months19. Because of the long generation time, it is not feasible to establish a laboratory colony to produce a highly inbred line. Moreover, the yield of DNA from a single individual (5-20 μg) is insufficient both in quantity and quality to complete whole genome sequencing. We therefore decided to use a hierarchical genome assembly strategy, in which DNA from a single male and a few full-sibs provided material for the initial genomic contigs. Those were linked to scaffolds using several paired-end (PE) and mate-pair (MP) libraries of varying insert sizes (0.5-16 kb). The high molecular weight DNA needed for the MP libraries required hundreds of μg of genomic DNA, which was obtained from full-sibs to minimize the amount of variation in the DNA pool. To account for several platform-specific features such as read length, data yield and error profiles35, we used several 2nd generation (454, Illumina, SOLiD) and 3rd generation (PacBio) sequencers. For genome assembly we used MIP Scaffolder 3, developed in-house, which can incorporate data from the different sequencing platforms. Finally, a high-density linkage map36 together with long MP reads and PacBio data was used in building the superscaffolds. Each of the key steps in the genome assembly (contig assembly, scaffolding, and superscaffolding) was independently validated as described in Supplementary Note 6. Furthermore, substantial transcriptome data sets (Supplementary Table 2) also contributed to assembly validation, building of the gene models, functional annotation and variation analyses. The key steps in genome assembly are depicted in Supplementary Figure 1.
DNA sequencing
DNA samples
The initial 454 sequencing was performed from a single M. cinxia male 7th instar larva and used for initial contig assembly. A male larva was used to avoid repeat-containing DNA expected from the female-specific W chromosome 37, 38. PE and MP sequencing was performed for scaffolding and complementing the initial contig assembly (Supplementary Fig. 1). Two SOLiD MP libraries (SOLiDMP1, SOLiDMP2) were constructed using the same single male larva as for the initial sequencing. For an Illumina PE library (IlluminaPE1), again the same male larva together with thorax tissues from ten full-sibs were used. For the other sequencing libraries, abdomen or thorax tissues were used. The other PE and MP libraries with short insert sizes (150 aa) (Supplementary Fig. 16).
Orthology analyses
Whole proteomes from 22 species were used in the orthology analyses. Lepidopteran species included all five species for which the whole proteome was available: M. cinxia, H. melpomene, D. plexippus, B. mori, and P. xylostella. Diptera included three mosquitoes (Aedes aegypti, Anopheles gambiae, Culex quinquefasciatus) and three fruit flies (Drosophila simulans, Drosophila melanogaster, Drosophila mojavensis). Ants (Harpegnathos saltator, Solenopsis invicta), bees (A. mellifera) and wasps (Nasonia vitripennis) represent Hymenoptera, and pea aphid (A. pisum), red flour beetle (T. castaneum) and head louse (Pediculus humanus) other insects. Deer tick (Ixodes scapularis) and water flea (Daphnia pulex) were treated as arthropodan outgroups, while cat and rat (Felis catus and Rattus norvegicus) were mammalian outgroups. The data sets were downloaded from NCBI RefSeq, except for rat and cat, which were obtained from Ensembl, and P. xylostella, which was downloaded from http://www.iae.fafu.edu.cn.
BLAST was used to perform an all-against-all comparison of the protein sequences. Pairwise similarities with an e-value below 1e-5 and query/subject coverage above 70 % were retained in the sequence graph (Supplementary Fig. 17). The proteins were clustered into orthologous groups using the EPT algorithm86. Orthologs were defined as reciprocal best hits. Gene duplication events were identified as in InParanoid, i.e., in-paralogs are proteins within one (reconstructed) species which have stronger similarity to each other than to the nearest ortholog in another species. EPT reconstructs ancestral proteomes at the branch points of a phylogenetic guide tree. A flat tree with star-like topology (every species branching off from the root) yielded similar results (data not shown).
The orthologous groups in Figure 1 (main text) were classified based on the presence of a group in the clades Lepidoptera, Hymenoptera and Diptera. A group is defined as conserved in a clade if it is present in at least two species of the above clades. We define one-to-one-orthologs as missing from at most four (out of 22) species or duplicated in at most two species. All other groups conserved in Lepidoptera, Hymenoptera and Diptera are called N-to-N-orthologs. A group conserved in two of these three clades is classified as Patchy. A group conserved in only one of the three clades is classified as Lepidopteran, Dipteran or Hymenopteran. In the case of Diptera, a group is considered conserved if it is present in both mosquitoes and fruit flies. Mosquito groups are conserved in mosquitoes but not fruit flies, and vice versa for Drosophila groups. Species-specific groups are present in a single species. Arthropod groups are present but not conserved in Lepidoptera, Hymenoptera or Diptera, and are absent from the mammals. All remaining groups are classified as Other. Proteins which have non-detectable sequence similarity to other proteins in the set are in the class “no-hits”.
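The hit filtering and reciprocal-best-hit definition described above can be sketched as follows. This is a simplified illustration and not the EPT algorithm itself: the `Hit` record, the use of bit score to rank hits, and the reading of "query/subject coverage" as requiring both coverages to exceed the threshold are assumptions made for the example, and all identifiers are hypothetical.

```python
from collections import namedtuple

# Hypothetical record for one pairwise hit; coverage values are percentages.
Hit = namedtuple("Hit", "query subject evalue qcov scov bitscore")


def filter_hits(hits, max_evalue=1e-5, min_cov=70.0):
    """Keep hits with e-value below the cutoff and both coverages above it."""
    return [h for h in hits
            if h.evalue < max_evalue and h.qcov > min_cov and h.scov > min_cov]


def reciprocal_best_hits(hits, species_of):
    """Return protein pairs that are each other's best cross-species hit."""
    best = {}  # (query protein, target species) -> best hit by bit score
    for h in hits:
        target_species = species_of[h.subject]
        if species_of[h.query] == target_species:
            continue  # within-species hits are not ortholog candidates
        key = (h.query, target_species)
        if key not in best or h.bitscore > best[key].bitscore:
            best[key] = h
    pairs = set()
    for (query, _), h in best.items():
        back = best.get((h.subject, species_of[query]))
        if back is not None and back.subject == query:
            pairs.add(tuple(sorted((query, h.subject))))
    return pairs


# Toy data for illustration only.
species_of = {"mc1": "Mcin", "mc2": "Mcin", "mc3": "Mcin",
              "bm1": "Bmor", "bm2": "Bmor"}
hits = [
    Hit("mc1", "bm1", 1e-50, 95, 90, 300),
    Hit("bm1", "mc1", 1e-50, 90, 95, 300),
    Hit("mc1", "bm2", 1e-20, 80, 85, 120),
    Hit("bm2", "mc1", 1e-20, 85, 80, 120),
    Hit("mc2", "bm2", 1e-3, 90, 90, 40),    # fails the e-value cutoff
    Hit("mc3", "bm1", 1e-30, 60, 90, 150),  # fails the coverage cutoff
]
kept = filter_hits(hits)
rbh_pairs = reciprocal_best_hits(kept, species_of)
print(len(kept), sorted(rbh_pairs))
```

In this toy set, bm2 finds mc1 as its best hit, but mc1 prefers bm1, so only the mc1/bm1 pair is reciprocal.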
Phylogenetic analysis of one-to-one-orthologs (Supplementary Fig. 18) was based on reciprocal best hits detected by SANS49. The phylogenetic tree was constructed using 191 groups of one-to-one orthologs. Multiple sequence alignments for the groups of one-to-one-orthologs were constructed using MUSCLE v. 3.8.31 (ref. 76). To find the optimal tree, phylogenetic analyses were carried out in RAxML v. 7.3.0 (ref. 87) by bootstrapping 100 times per group and creating a majority rule consensus tree from 19,100 bootstrapped trees, which was then visualized with iTOL 88.
The predicted M. cinxia proteins were also mapped to OrthoDB orthologous groups (data not shown, http://cegg.unige.ch/orthodb6). OrthoDB89 orthologous groups are built progressively, with an e-value cutoff of 1e-3 for triangulating best reciprocal hits (BRHs), and 1e-6 for pair-only BRHs, requiring an overall minimum sequence alignment overlap of 30 amino acids. New genes are added to existing groups by a mapping procedure which first compares all genes from the new organism to all genes in OrthoDB groups, and then performs the BRH clustering procedure only allowing new genes to be added to existing groups. Supplementary Figure 17 illustrates the algorithm schematically. M. cinxia is now also included in the complete-clustering release 7 of OrthoDB (http://cegg.unige.ch/orthodb7).
Supplementary Figure 19 clearly shows a broad band of protein families conserved across all species, and blocks of protein families conserved in taxonomic orders, suborders or families. It is also notable that many blocks appear patchy. Apparent deletions or extra genes can result from incomplete genome data (e.g., genes split over more than one scaffold), errors in gene prediction, false-negative homology detection, or false-positive similarity links accepted during clustering. Visual inspection of multiple alignments of orthologous groups showed that although sequence identity was quite high, protein lengths differed greatly and there were gaps due to missing ends (protein fragments) or missing exons.
The analysis of orthologous groups gives a similar picture (Fig. 1) to previous comparative studies 8, 11, 90. The statistics for M. cinxia are very similar to those for the other Lepidoptera, notably other Nymphalidae, H. melpomene10 and D. plexippus8. The dominant classes of orthologous groups are (i) a conserved core genome, (ii) taxonomic order- or family-specific proteins (species-specific in orders represented by a single species), and (iii) proteins without detectable sequence similarity to others (“no-hits”). The aphid genome (ACYPI) is an outlier in this analysis: it has lost many protein families from the core genome that is conserved in other arthropods and between other insects and mammals91.
Manual annotation
Manual annotations of gene predictions were made using the Apollo genome annotation editor 92. GFF3 files generated with MAKER and wiggle files including coverage of RNA-seq mapping were used for checking the gene models in Apollo. Additionally, manual annotators used BAM files from TopHat54 for visualizing RNA-seq mappings in Artemis 93. The modified gene models were sent to the VectorBase Community Annotation Portal at the European Bioinformatics Institute (EBI) 94. Genes belonging to families and pathways of special interest were manually annotated. Altogether 558 gene models and protein names were manually curated, 29 of which were deleted. The list of manually annotated genes together with their manually and automatically predicted protein descriptions is shown in Supplementary Data 1. The curated genes included heat shock and other chaperone-related genes, Cytochrome P450, Hox genes and immunity-related genes (Supplementary Table 25). We also focused on genes related to muscles and muscle development and genes that were previously reported to be activated after flight 43. Additional validation was carried out on random gene models in the Z chromosome and all gene models in random scaffolds. Most (88 %) of the gene models examined needed manual correction. One third of the corrected models remained partial due to, for example, gaps in the scaffolds or short scaffolds that split genes into two or several scaffolds. On average 12 gene models were manually curated in each chromosome (Supplementary Fig. 20).
Hox cluster
The Hox gene cluster was annotated according to sequence conservation with other insects and alignment of M. cinxia transcripts. Conservation of intron/exon structure and VISTA alignment with other Lepidoptera were also used to aid identification of conserved sequence around the Special homeobox (Shx) genes. All M. cinxia Hox cluster genes except fushi tarazu (ftz) were represented in at least one of the two expression libraries mapped to the genome (Annotation1; Supplementary Table 2, Supplementary Note 2). Shx genes may be lepidopteran-specific Hox cluster genes 95. Shx genes have been described from the genomes of B. mori, H. melpomene and D. plexippus10, 96, 97. M. cinxia was found to have all the canonical Hox genes plus four Shx genes: two copies of ShxA, and a single copy each of ShxB and ShxC. The two copies of ShxA were found in tandem on the same scaffold as the proboscipedia (pb) gene (scaffold442), and were numbered according to the direction of transcription. Sequence identity between the two M. cinxia ShxA paralogs was low (53.2 % amino acid / 55.8 % nucleotide; ClustalW / translation alignments using Geneious R6 V6.1.4), and alignment of the translated sequence with H. melpomene and D. plexippus showed that McShxA-2 has a 70 amino acid insertion in exon 2 relative to the other sequences (Supplementary Fig. 21). A recent study has addressed the evolution of the Shx genes in the Lepidoptera95. In this study it was found that all species examined except B. mori had four Shx genes, with one copy each of ShxA-D. This analysis included data from four other nymphalid butterflies, H. melpomene, D. plexippus, Pararge aegeria (speckled wood) and Polygonia c-album (comma). The M. cinxia duplication of ShxA and loss of ShxD therefore stand in contrast to these species. In order to verify the loss of ShxD in M. cinxia, PacBio and transcript reads over the region were manually inspected.
Searches were also carried out for conserved ShxD motifs95, and open reading frames in the region were translated and aligned with ShxD from the other nymphalids. We found that no ShxD homeodomain could be identified, but that there were limited regions of sequence similarity with other butterfly ShxD genes in the expected location and orientation. These regions were fragmented, however, and contained multiple stop codons in every translation frame. All transcripts from ShxD also contained multiple stop codons, suggesting that this gene has become highly degraded in M. cinxia. B. mori also has multiple duplications of ShxA and loss of ShxD10, but phylogenetic analysis suggests that both events are convergent between the two species (Supplementary Fig. 22). In addition, 5 of the 8 B. mori ShxA genes contain multiple homeodomains, many of which are degenerate, whereas both M. cinxia ShxA genes have a single intact homeodomain motif (Supplementary Fig. 21).
The M. cinxia Hox/Shx genes were located in nine scaffolds, all of which mapped to chromosome 5. After manual inspection, all scaffolds except those containing labial (lab), ftz and the 3’ end of Sex combs reduced (Scr) mapped to bin 2, while the scaffold containing lab mapped to bin 4. This suggests that the lab/pb split in the Hox cluster recorded for B. mori and H. melpomene10 is likely to be conserved in M. cinxia. No bin was determined for the scaffold containing ftz and 3’ end of Scr, but the scaffold containing the 5’ end of Scr and the Deformed (Dfd) gene was in bin 2, making it very likely that the Hox cluster is maintained in this region. As a result of further PacBio sequencing and assembly, the scaffolds containing pb and Shx genes (scaffold442; scaffold2095) were manually combined into a single scaffold. Superscaffolding joined five other scaffolds into two superscaffolds: chr5_superscaffold27 (scaffold5113; 5’ end of scaffold22; scaffold6633) and chr5_superscaffold14 (scaffold5218; scaffold2892). Thus, the M. cinxia Hox/Shx genes can be positioned in six superscaffolds.
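The check for stop codons in every translation frame, used above as evidence that ShxD is degraded, can be illustrated with a short sketch. It assumes the standard genetic code and scans six frames (three per strand); the source does not state how many frames were examined, and the function names and the example sequence are hypothetical.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_codon_counts(seq):
    """Count stop codons in each of the six reading frames of seq."""
    seq = seq.upper()
    counts = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            codons = (strand_seq[i:i + 3]
                      for i in range(frame, len(strand_seq) - 2, 3))
            counts[f"{strand}{frame + 1}"] = sum(c in STOP_CODONS for c in codons)
    return counts


def has_open_frame(seq):
    """True if at least one of the six frames is free of stop codons."""
    return any(n == 0 for n in stop_codon_counts(seq).values())


# Toy sequence for illustration only.
example_counts = stop_codon_counts("ATGAAATAG")
print(example_counts)
```

A region in which `has_open_frame` returns False for every candidate window would match the description of a locus interrupted in all frames.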
Supplementary Note 9. SNP and indel variation
Genome-wide variation
SNPs and indels were detected from four data sets: 1) SOLiD genomic pools from 53 individuals (SOLiD_ÅLpool; Supplementary Table 1), 2) an Illumina genomic pool from 10 individuals (IlluminaPE1, IlluminaPE2; Supplementary Table 1), 3) Illumina polyA-anchored RNA-seq data from 40 individuals (Variation; Supplementary Table 2), and 4) a PacBio genomic pool from 100 individuals (PacBio; Supplementary Table 1). These data sets were used to describe nucleotide variation of M. cinxia in the Åland Island population. SOLiD genomic pool (SOLiD_ÅLpool) data were mapped to the genomic contigs and variants were detected using LifeScope 2.5 diBayes, LifeScope 2.5 SmallIndel, and LifeScope 2.1 LargeIndel software (Applied Biosystems) with default parameters. For comparison, variants were also detected from Illumina PE reads (IlluminaPE1, IlluminaPE2) used in the genome assembly. The coverage of mapped reads in these data was 30X, which was higher than in the SOLiD pool data (SOLiD_ÅLpool). Two libraries consisting of one independent individual and ten full-sibs (IlluminaPE1 and IlluminaPE2) were first analyzed separately and then the results were merged. Variants were detected using a GATK pipeline 98, 99. Read alignments were performed with BWA version 0.5.9-r16 with mutation rate set to 0.06 and 3' trimming quality to 5. Variants were called with UnifiedGenotyper of GATK version 2.2 using a ploidy value of 8, stand_call_conf value of 50.0 and stand_emit_conf value of 10.0. Additionally, the SNP positions were filtered using a minimum minor allele count of three. The minimum distance to the nearest SNP position was set to five bp in order to minimize the effect of mapping errors. The polyA-anchored RNA-seq reads (Variation; Supplementary Table 2) were filtered and mapped onto genomic scaffolds as described in Supplementary Note 4. The allele counts were extracted from the mapped RNA-seq reads.
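The two post-calling SNP filters described above (minimum minor allele count of three, minimum distance of five bp to the nearest SNP) can be sketched as below. The tuple layout, the order in which the filters are applied, and the choice to discard both members of a too-close pair are assumptions of this example; the source does not specify them.

```python
def filter_snps(snps, min_minor_count=3, min_distance=5):
    """Filter SNPs given as (contig, position, ref_count, alt_count) tuples.

    Step 1 keeps SNPs whose minor allele count is at least min_minor_count.
    Step 2 drops SNPs lying closer than min_distance bp to a neighbouring
    retained SNP on the same contig (both neighbours are dropped).
    """
    kept = [s for s in snps if min(s[2], s[3]) >= min_minor_count]
    kept.sort(key=lambda s: (s[0], s[1]))
    result = []
    for i, snp in enumerate(kept):
        contig, pos = snp[0], snp[1]
        near_prev = (i > 0 and kept[i - 1][0] == contig
                     and pos - kept[i - 1][1] < min_distance)
        near_next = (i + 1 < len(kept) and kept[i + 1][0] == contig
                     and kept[i + 1][1] - pos < min_distance)
        if not (near_prev or near_next):
            result.append(snp)
    return result


# Toy data for illustration only.
toy_snps = [
    ("c1", 100, 20, 5),
    ("c1", 103, 18, 7),   # 3 bp from the previous SNP: both are removed
    ("c1", 200, 25, 2),   # minor allele count below three
    ("c1", 300, 10, 10),
    ("c2", 302, 12, 4),   # different contig, so distance to c1:300 is ignored
]
filtered = filter_snps(toy_snps)
print(filtered)
```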
No indels were called from the RNA-seq data. The results from the two sequencing libraries with different insert sizes were combined. For filtering the data, only bi-allelic SNPs were included, and at least 20 individuals were expected to be polymorphic at each SNP site. Long indels were detected using PacBio data. PacBio reads were mapped onto genomic scaffolds with BWA-SW46, and indels whose length exceeded 50 bp were detected from CIGAR alignments.
The numbers of SNPs and indels in each data set are shown in Supplementary Table 26. Only variants located in scaffolds with at least one gene model are included. Variants located at the overlap of two genes were removed; these represented 2 % in the genomic data. Only 76 % of mappable RNA-seq data were within predicted gene models, which can be partly explained by missing or incomplete gene model 3’ UTRs. The distribution of variant lengths based on SOLiD and PacBio data shows that deletions (2,165) were more abundant than insertions (313) (Supplementary Fig. 23).
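Detecting long indels from CIGAR alignments, as described above for the PacBio reads, amounts to walking the CIGAR string and reporting insertion or deletion operations longer than 50 bp. The sketch below follows the operation codes of the SAM format; the function name, the 0-based coordinate convention, and the example CIGAR string are illustrative assumptions rather than the pipeline actually used.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference sequence.
_REF_CONSUMING = set("MDN=X")


def long_indels(cigar, ref_start, min_exceeds=50):
    """Return (type, reference_position, length) for indels longer than min_exceeds."""
    position = ref_start
    found = []
    for length_text, op in _CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in ("I", "D") and length > min_exceeds:
            kind = "insertion" if op == "I" else "deletion"
            found.append((kind, position, length))
        if op in _REF_CONSUMING:
            position += length
    return found


# Toy alignment for illustration only.
calls = long_indels("100M60D20M5I30M75I10M", ref_start=1000)
print(calls)
```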
SNP density
Two Illumina PE libraries consisting of one male and ten full-sibs (IlluminaPE1) and of the same ten full-sibs (IlluminaPE2; Supplementary Table 1) were used to estimate the average level of SNP density within M. cinxia gene models. The variants were called and filtered as explained above. The median SNP and indel densities in genomic regions are shown in Supplementary Table 27. SNP density in 16,667 gene models was on average 8.2/kb in coding and 15.3/kb in intronic regions. The majority of genic regions (88%) included indels, most of which are located in introns and only 2,567 (15%) and 3,484 (21%) in coding and UTR regions, respectively. The median indel density in intronic regions was 1.5/kb (Supplementary Table 27).
The distributions of SNP and indel densities (Supplementary Fig. 24) have very long tails towards high density values. We found 595 and 115 gene models with a SNP density higher than 30 and 50 SNPs/kb in the coding region, where the SNP density limits corresponded to 96.4% and 99.3% of the density distribution, respectively. We also found gene models with high intronic indel density: a total of 116 and 786 gene models had 0.5% and 1% indel density in intronic regions, respectively.
Comparable information about levels of SNP and indel variation and density is limited to model organisms and laboratory strains. The few published reports of wild population variation also illustrate high variation densities in other insects. Sequencing of 11 wild populations of B. mandarina identified over 13 million SNPs (30 SNPs/kb) and 251,000 indels (0.58 indels/kb) in the genome sequence100. In the coding exons, 363,792 SNPs (~20/kb) and 1,206 indels (~0.07/kb) were found. Whole genome sequencing of 192 inbred D. melanogaster lines yielded over 4.7 million SNPs (33 SNPs/kb)101. In the analysis of the P. xylostella genome, 558,374 SNPs were detected from a pool of a laboratory population, yielding a coding exon SNP density of 4 SNPs/kb11. In M. cinxia the SNP density was estimated from one independent individual and ten full-sibs, yielding SNP and indel densities of 13.2/kb (8.2/kb in coding regions) and 1.7/kb, respectively. In this data set, the SNP density was lower than in B. mandarina and D. melanogaster, whereas the indel density was higher than in those species, which might have impeded genome assembly (Supplementary Note 3, Supplementary Fig. 6).
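The SNP filters described for the Illumina PE calls (minimum minor allele count of three, minimum distance of 5 bp to the nearest SNP) and the per-kb density can be sketched as follows. This is an illustrative example under stated assumptions, not the pipeline used here: the function names are hypothetical, the allele-count filter is assumed to be applied before the distance filter, and every SNP lying closer than 5 bp to a neighbouring SNP is assumed to be discarded.

```python
def filter_snps(snps, min_minor_count=3, min_distance=5):
    """Filter SNPs on one scaffold.

    snps: iterable of (position, minor_allele_count) tuples.
    Returns the sorted positions that pass both filters.
    Assumption: the count filter is applied first, and a SNP is dropped
    whenever its nearest remaining neighbour is closer than min_distance bp.
    """
    kept = sorted(pos for pos, mac in snps if mac >= min_minor_count)
    passed = []
    for i, pos in enumerate(kept):
        left_ok = i == 0 or pos - kept[i - 1] >= min_distance
        right_ok = i == len(kept) - 1 or kept[i + 1] - pos >= min_distance
        if left_ok and right_ok:
            passed.append(pos)
    return passed


def density_per_kb(n_variants, region_length_bp):
    """Variant density expressed as variants per 1,000 bp."""
    return 1000.0 * n_variants / region_length_bp


if __name__ == "__main__":
    # Hypothetical SNP calls: (position, minor allele count).
    calls = [(10, 5), (12, 4), (30, 2), (40, 3), (45, 6), (100, 3)]
    positions = filter_snps(calls)
    print(positions)
    print(density_per_kb(len(positions), 1500))
```

With these definitions, a density of 8.2/kb corresponds, for example, to 82 SNPs in 10 kb of coding sequence.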
Linkage disequilibrium
Linkage disequilibrium (LD) was evaluated from RNA-seq data from 40 individuals (Variation; Supplementary Table 2) for the Åland Islands population. The raw allele counts for each individual were first converted into conditional probabilities of each genotype. The allele counts of an individual at a SNP, with n reads in total, were assumed to be binomially distributed as Bin(0.95, n) if the corresponding genotype was homozygous for the major allele, Bin(0.05, n) if the genotype was homozygous for the minor allele, or Bin(0.5, n) if the genotype was heterozygous. Potential SNPs were then filtered, keeping only those in which all three possible genotypes were present with 80% confidence, and in which, with 80% confidence, less than 15% of the data were missing. After filtering, the final data set included 3,331 SNPs located in 1,312 scaffolds. Finally, the LD (r² and D') was evaluated for all SNP pairs within each scaffold. This was done by first iteratively finding the maximum likelihood haplotype frequency estimate for each pair of SNPs using a modified version of the EM-algorithm 102 that takes genotype uncertainty into account. From the maximum likelihood haplotype frequencies, estimates of r² and D' were obtained. All LD computations were executed using in-house Awk 103 scripts. Supplementary Figure 25 illustrates the values of r² and D' for each marker pair as a function of physical distance. The LD (r²) reaches a level of 0.4 at a distance of about 300 bp. This range of LD in M. cinxia was comparable to estimates for B. mori (r² = 0.4 at ~400 bp) and B. mandarina (r² always