Species used in comparison with M. cinxia are the butterflies Kallima inachus and H. melpomene, and the moths B. mori and P. xylostella.
Supplementary Figures

[Figure: flowchart in three columns, DATA (left), PROCEDURE (middle) and RESULTS (right).]

DATA: 454 single reads; Illumina PE; Illumina MP; SOLiD MP; 454 MP; PacBio long reads; linkage map.

PROCEDURE: read error correction*, contig assembly, repeat masking, read error correction*, mapping, scaffolding*, gap closing, superscaffolding*.

RESULTS:
Stage             N50      Total length
Initial contigs   2 kb     354 Mb
Final scaffolds   119 kb   390 Mb
Final contigs     13 kb    361 Mb
Superscaffolds    258 kb   393 Mb

Supplementary Figure 1. Genome assembly workflow. The sequenced data sets (left), the assembly workflow (middle), and the key statistics (right) of the M. cinxia genome assembly. Stages with novel methods (refs 1-3) developed during this project are marked with *. PE, paired-end library; MP, mate-pair library.

[Figure: agarose gel with lanes for intact and sonicated DNA. Marker bands labeled 10, 8, 6, 5 and 4 kbp (left ladder) and 48,502, 17,000 and 10,171 bp (right ladder).]

Supplementary Figure 2. High molecular weight DNA used in MP library construction. Intact gradient-isolated DNA and sonicated DNA used in library construction are shown. Molecular weight markers are, left: GeneRuler 1 kb DNA Ladder (Fermentas), and right: GeneRuler High Range DNA Ladder (Fermentas).

[Figure: ten histogram panels, one per library: IlluminaPE1, IlluminaPE2, IlluminaMP1, SOLiDMP1, SOLiDMP2, IlluminaMP2, IlluminaMP3, IlluminaMP4, 454MP1 and 454MP2. x-axis: insert size; y-axis: frequency.]

Supplementary Figure 3. Insert size distributions of the PE and MP libraries. The insert size distributions of the SOLiDMP2, IlluminaMP2, IlluminaMP3 and IlluminaMP4 libraries show two peaks. The lower peaks, between 0.5 kb and 2 kb, represent fragmented DNA.

[Figure: bar chart of N50 (axis 0 to 300,000) at each stage: Initial contigs, +IlluminaPE1, +IlluminaPE2+IlluminaMP1, +SolidMP1, +SolidMP2+IlluminaMP2, +IlluminaMP3+IlluminaMP4, +454MP1, +454MP2 (Final scaffolds), Superscaffolds.]

Supplementary Figure 4. N50 values at intermediate stages of scaffolding. Scaffolds of the previous stage were merged using the PE and MP libraries with longer insert size.

[Figure: line plot. x-axis: minimum scaffold length (1,000 to 1e+06, logarithmic); y-axis: proportion of assembly (%), 0 to 100. N50 and N90 are marked.]

Supplementary Figure 5. Cumulative scaffold length of the final assembly. For each minimum scaffold length, the proportion of the assembly covered by scaffolds that are longer than the minimum length is shown. N50 (green) and N90 (red) values are indicated.
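The quantities plotted in this figure can be illustrated with a minimal sketch. The function names `nx` and `proportion_covered` and the toy scaffold lengths are assumptions made for illustration; they are not taken from the assembly pipeline used in the study.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx value: the length L such that scaffolds of length >= L
    together contain at least `fraction` of the total assembly length.
    fraction=0.5 gives N50, fraction=0.9 gives N90."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("no scaffold lengths given")


def proportion_covered(lengths, minimum):
    """Percentage of the assembly contained in scaffolds longer than `minimum`."""
    total = sum(lengths)
    return 100.0 * sum(l for l in lengths if l > minimum) / total


# Hypothetical scaffold lengths (bp), for illustration only.
toy_lengths = [100, 80, 60, 40, 20]
n50 = nx(toy_lengths, 0.5)
n90 = nx(toy_lengths, 0.9)
```

Evaluating `proportion_covered` over a range of minimum lengths gives a curve of the kind shown in the figure, and the N50 and N90 values are the minimum lengths at which that curve crosses 50% and 90%.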

[Figure: pairwise alignment of contig1 and contig2 in 100-column blocks. Each row gives the start coordinate, the aligned sequence and the end coordinate.]

contig1  451 GATCAGATGTCTcGCCCTAAATACTTGCGTGCCATGTAGGGTTAAGGTTACCTTTAAAaTGCATTTTTAGCATTAAGCTGagAAAaGGGTCGATACTGTA 550
contig2    1 ---------------------------------ATGTAGggTTAAGGTTACcTTTAATATGCATTTTTAGCATTAAGCTGAGgAAaGGGTCGATACTGTA 67

contig1  551 TAACGTTAAA---ATGAaCGTTAAAACGAGTACTATAATAAGTGTGAAATTTTATCGTATTTACTTGTAAATTGTGATTTCTTTCATTGTAATTATAACT 647
contig2   68 TAAcGTTAAATTTAtGAaCgTTAAAAcGAgTACTATAATAAGTGTGAAATTTGACCGTATTTACTtGTAAaTTGTAATTTCTCTtATTTTAATtATATAT 167

contig1  648 TTCTAATTAGGTATGTTATAAACCTAAAAaTTTtA-AAAAGAaTATTATtGTtATATG-CTAATTATCTACTTTA--------------AATGGAAAGAG 731
contig2  168 TTTTTTTtAATTAAATTATAAACcTAAAAaTTTtAGAAAAGAaTATTTTtGTtATATGCCTAATTGTCTACTATAAATGATCTAGTATGAATGGAAAGAG 267

contig1  732 TACGCCCAATTTCAAGGCAGGAATCGAATTCTAGAaTAAAAaGCAGGACCGCTGCTAACGGCGACAACCGAAGAGTCACTAATAAATGAAGGTAATAAAT 831
contig2  268 TACGCCCAATTTCAAGGCAGGAATCAAATTCTAGAATAAAAAGCAGGACCACTGCTAACAGCGACAACGGAAGAGTCACTAATAAATGAAGGTAATAAAT 367

contig1  832 TAATTAGTACAATAGACATCAGCGAGGATAACGAATTTAATGTCGTTATAATATCGAAATATGTGTAATTAAGGATACaGTTAAATGCAATTA---CATA 928
contig2  368 TAATTAGTACAATAGACATCAGCGAggATAACGAATTTAATGTCGTTATAATATCGAAATATGTGTAGTTAAGGATACAGTTAAATGCAATTACGTCATA 467

contig1  929 ATAATTAAAATTTAACGCTTAACTTCAATTTtCGAAACATTCAACTCACGGAGCAGATTTTAAGCATTCGTCAGTTCCA-TCTTTCACATATTCTGAATA 1027
contig2  468 ATAATTAAAATTTAACGCTTAACTTCAATTTTCGAAACATTCAACTCACCGAGCAGATTTTAAGCATTCGTCAGTTACATTCTTTCACATATTCTGAATA 567

contig1 1028 TGATTTAACTTATTACAGATTTTCAAATAAGAAAAGGACGGTGCTTTGAAATATAaCTAACATGTTtATAgTTTtAGGTGCATCCAAaGAAAaTAATTCC 1127
contig2  568 TGATTtAATTTATTACAGACATTTAAATAaGAAAaGAACGGTGCCTCGAAATATAAC-----------------------------------ATAA---- 628

contig1 1128 ATCAAAcAGCtCTGTAACAAAAGCTCAAGCTACGAAAATTAATtCCAATAGAATATAtGCATAGgTTtAAATTATATATtGATGGTTAAGGATATCATTG 1227
contig2  629 --------------------------AAGCCACTAAAaTGTcTCCCAATAGAATATCTGCATCAGTTTAAATtATTTTTTGATGGTTAaGGATATCATTG 702

contig1 1228 TGACGTTATTTGTATACaTTTtAAAAAaTTCGagAaGaTATCTGCCTAGCtCGTGCGTAaCATTCCAaCACACGACAAATATTCATATAGGCTTTACAAA 1327
contig2  703 TGACGTTATTTGTATACaTTTTAAAAAATTCGAGAaGATATCTGCCTAGCTCGTGCGTAACATTCCAACACACGACAAATATTCATATAGGCTTTACAAA 802

contig1 1328 TaTTTGTCTGGCGTTTGTGACTGTGTtGTGTGTACTTCAAGAGTTCTGgAAATCtGGAACAGgATTtACTTTATaTTTGGgACCGTTGACTGTgAAaGGT 1427
contig2  803 TATTTGTTTGGCGTTTGTGACTGTGTTGTGTGTACTTCAAGAGT--------TCTGAAACAGGATTTACTTTATATTTGGGACCGTTGACTGTGAAAGGT 894

contig1 1428 TTATTTACAAACTCGCAGTGAAACA----------TCGTtGAACGGGAAAaGGgTAAGAACCCTTCTtCAAAtttttCATTCttAtttttctatCTTaaa 1517
contig2  895 TTATTTACAAACTCGCAGTGAAACATTCCCCAAATTTGTT-----------------GAaCCCTTCTCCAAATTTGTTATTCTTATTTTTCTATCTTCAA 977

contig1 1518 aCAATTAATAATAGA------------------------------------------------------------------------------------- 1532
contig2  978 aCATTTAaTAATAAACAAGTATATAGCAAAGGAAAATAATTTAAAAGTTCAAAatGGGACTtAAATTACCAGATCTGCATTATTTATAATCTAGCTTGTT 1077

Supplementary Figure 6. Alignment of the 3’ end of contig1 and the 5’ end of contig2, which code for the cytochrome P450 gene cyp337. The alignment shows indel polymorphism that prevented assembly of the two contigs.

[Figure: two line plots, a) Superscaffolds and b) Superscaffolds + Unplaced scaffolds. x-axis: minimum scaffold length (1,000 to 1e+07, logarithmic); y-axis: proportion of assembly (%), 0 to 100. N50 and N90 are marked.]

Supplementary Figure 7. Cumulative a) superscaffold length and b) the total length of the superscaffolds and unplaced scaffolds. For each minimum scaffold length, the proportion of the assembly covered by scaffolds that are longer than the minimum length is shown. N50 (green) and N90 (red) values can be read from the figure.

[Figure: four dot plots of the Melitaea cinxia mitochondrial sequence (y-axis) against Heliconius melpomene, Plutella xylostella, Helicoverpa armigera and Bombyx mori (x-axes). Both axes run from 0 to 16,000.]

Supplementary Figure 8. Alignment of the mitochondrial sequence of Melitaea cinxia against the mitochondrial sequences of Heliconius melpomene, Plutella xylostella, Bombyx mori, and Helicoverpa armigera. The red segments show forward and the blue ones reverse alignments. Two red lines in the same figure indicate different cutting points of the circular mtDNA sequence between the two species.

[Figure: stacked bar chart. y-axis: proportion of transcripts (0% to 100%); bars: Unordered, Unlimited gap, Max gap 5000. Legend: proportion (%) of transcript aligned against scaffolds, in the classes Unaligned, 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90 and 90-100.]

Supplementary Figure 9. The fraction of transcript contigs covered by the final scaffolds. The leftmost bar shows how much of the sequence of the transcript contigs is present in the genome when the ordering of the matches is not considered. The middle bar shows how well the transcript contigs can be aligned against the genome assembly when the ordering of the matched parts is taken into account. The rightmost bar shows the results when we further restrict the gaps in the scaffolds between alignments to at most 5,000 bp, representing the longest probable intron size or missing sequence.

[Figure: dot plot. Axes: chromosomes of Melitaea cinxia and chromosomes of Bombyx mori.]

Supplementary Figure 10. Sequence-level synteny between Melitaea cinxia and Bombyx mori. Visualization of all genomic positions in which pairwise alignments between scaffolds from M. cinxia and B. mori share at least 200 bp. The figure is filtered to contain only the “best” hits onto the B. mori reference. The red and blue dots represent the forward and reverse complemented hits, respectively.


Supplementary Figure 11. Examples of secondary structures of microRNA precursors predicted from the Melitaea cinxia genome. The color coding represents the thermodynamic base-pair probabilities, as indicated in the figure.

[Figure: a) dot plot of Melitaea cinxia (0 to 8,000) against Attacus ricini (0 to 12,000); b) dot plot of Melitaea cinxia (0 to 8,000) against Papilio xuthus (0 to 7,000); c) schematic of the rRNA genes.]

Supplementary Figure 12. rDNA sequence alignments of Melitaea cinxia against a) Attacus ricini and b) Papilio xuthus sequences, and c) a schematic representation of 18S, 5.8S, and 28S rRNA genes of Melitaea cinxia. In figures a) and b) the red segments show forward and the blue ones reverse alignments. Figure c) shows the pairwise alignment of scaffold34886 containing rRNA genes from M. cinxia against GenBank entry AF463459.1 containing an A. ricini rDNA repeat unit. Melitaea cinxia rRNA sequences are shown in gray and the corresponding rRNAs from A. ricini are shown in blue. Internal transcribed spacers (ITS) are indicated by thin lines. The identities of the rRNA and the spacer regions are indicated below. ETS1 refers to the external transcribed spacer. The continuous top gray bar shows the entire M. cinxia rDNA scaffold34886. Mismatches between the two species are indicated by red and insertions by blue bars in the M. cinxia sequence.


Supplementary Figure 13. The mitochondrial genome of Melitaea cinxia. Protein-coding genes (green) are denoted as COI, COII and COIII for subunits 1-3 of cytochrome c oxidase, cytB for the cytochrome b gene, ND1, 2, 3, 4, 4L, 5 and 6 for subunits 1-6 of the NADH dehydrogenase system, and ATP6 and ATP8 for subunits 6 and 8 of ATP synthase. tRNA (pink) nomenclature follows the standard three letter IUPAC amino acid code. rRNAs (red) are denoted 12S (small subunit rRNA) and 16S (large subunit rRNA). The AT-rich control region (grey) is shown on the top. The direction of transcription for each coding region is depicted as an arrow.

[Figure: flowchart in two parts, both leading to FUNCTIONAL ANNOTATION.]

A
- Uncharacterized sequences
- Similarity search against sequence database(s)
- Descriptions in sequence database(s)
- Scoring results by identity percentages, coverage percentages and taxonomic distances (regression model)
- Protein Naming Utility for alternative descriptions and corrections
- Clustering of results by TF-IDF and cosine similarity (word frequency, description frequency)
- Sorting clusters
- Cluster representative selection by using the highest scoring description, and printing out the result with the GO and EC classes associated with cluster members

B
- KEGG orthology and pathway mapping: KAAS server
- Protein signatures, e.g. domain annotations: InterProScan
- Transmembrane regions: TMHMM
- Secreted peptides: SignalP

Supplementary Figure 14. Flowchart of the functional annotation procedure. Part (A) illustrates the PANNZER workflow (ref. 4), while part (B) shows the other procedures used for functional annotation.
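The clustering step in part (A) rests on TF-IDF weighting and cosine similarity of description texts. The following is a minimal sketch of those two ingredients only, not of PANNZER itself: the whitespace tokenization, the smoothed IDF formula, the function names and the example descriptions are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(descriptions):
    """Turn free-text descriptions into sparse TF-IDF vectors (dict word -> weight).
    Assumption: lowercase whitespace tokens and a smoothed IDF,
    idf = ln((1 + N) / (1 + df)) + 1."""
    docs = [d.lower().split() for d in descriptions]
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            word: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical hit descriptions, for illustration only.
vecs = tfidf_vectors([
    "cytochrome P450 monooxygenase",
    "cytochrome P450",
    "heat shock protein 70",
])
similar = cosine(vecs[0], vecs[1])     # descriptions sharing words
unrelated = cosine(vecs[0], vecs[2])   # descriptions sharing no words
```

Descriptions whose pairwise similarity exceeds a chosen threshold would then be grouped into one cluster; the threshold and the grouping rule are left out here because the caption does not specify them.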

[Figure: bar chart. y-axis: number of gene models (0 to 18,000); x-axis categories: DE, GO, EC, InterPro, KEGG. Legend: No functional annotation, Both, Transcripts, Gene models.]

Supplementary Figure 15. Summary of the number of gene models with functional annotation for Melitaea cinxia. DE refers to protein descriptions, GO to Gene Ontology classes and EC to Enzyme Commission numbers. InterPro shows the predicted protein signatures, and KEGG includes KEGG orthology (KO) and pathway mapping. MAKER refers to predicted gene models and Transcripts to assembled transcript data.

[Figure: histogram. x-axis: protein sequence length (1 to 1,340); y-axis: number of gene models (0 to 60). Legend: No functional annotation, Functional annotation.]

Supplementary Figure 16. The length of the functionally annotated proteins and proteins without functional annotation.

[Figure: two schematic panels. Top: OrthoDB, triangulated clusters. Bottom: EPT, species clashes disallowed.]

Supplementary Figure 17. Schematic illustration of the principles used by ortholog clustering algorithms. Nodes represent proteins, species are indicated by color, and line width indicates the strength of similarity. TOP: OrthoDB clusters built by triangulation. The basic unit is a triangle formed by reciprocal best hits from three species. Triangles that share an edge are merged. BOTTOM: EPT clusters built hierarchically. In-paralogs are merged within a cluster and out-paralogs are excluded. In this example, triangulation generates one cluster while EPT generates two clusters due to the out-paralog exclusion rule. The second red protein is an out-paralog because its similarity to the multispecies cluster is lower than that of another red protein which is already a member of the cluster.
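The triangulation principle described for the top panel can be sketched in a few lines. This is an illustrative toy, not the OrthoDB implementation: the function name, the input format (a protein-to-species mapping plus a list of reciprocal-best-hit pairs) and the protein identifiers are assumptions.

```python
from itertools import combinations


def triangulated_clusters(species_of, rbh_pairs):
    """Build clusters from reciprocal best hits (RBH) by triangulation.
    A triangle is three proteins from three different species that are
    pairwise RBHs; triangles sharing an edge (two proteins) are merged."""
    edges = {frozenset(pair) for pair in rbh_pairs}
    triangles = [
        frozenset(trio)
        for trio in combinations(sorted(species_of), 3)
        if len({species_of[p] for p in trio}) == 3
        and all(frozenset(e) in edges for e in combinations(trio, 2))
    ]

    # Union-find over triangle indices.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:  # shared edge
            parent[find(i)] = find(j)

    clusters = {}
    for i, tri in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical proteins from five species, for illustration only.
toy_species = {"a1": "A", "b1": "B", "c1": "C", "d1": "D", "e1": "E"}
toy_rbh = [("a1", "b1"), ("b1", "c1"), ("a1", "c1"),
           ("b1", "d1"), ("c1", "d1"), ("d1", "e1")]
toy_clusters = triangulated_clusters(toy_species, toy_rbh)
```

In this toy input the triangles a1-b1-c1 and b1-c1-d1 share the edge b1-c1 and are merged, while e1 is left out because it is not part of any triangle. The exhaustive search over triples is cubic in the number of proteins and is only meant to make the rule explicit.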


Supplementary Figure 18. Cladogram of representative species used for analysis of one-to-one orthologs. Bootstrap values above 50% are presented. Species are from the following taxa: FELCA and RATNO are mammalian outgroups; IXOSC (deer tick) and DAPPU (waterflea) are arthropod outgroups, and the rest are Hexapoda (insects). Lepidoptera include MELCI, HELME, DANPL, BOMMO and PLUXY. Ants (HARSA, SOLIN), bees (APIME) and wasps (NASVI) are Hymenoptera. Diptera include mosquitoes (AEDAE, ANOGA, CULQU) and fruit flies (DROME, DROMO, DROSI). Pea aphid (ACYPI), red flour beetle (TRICA) and head louse (PEDHU) represent other insects. Species codes are as used in SwissProt: FELCA = Felis catus; RATNO = Rattus norvegicus; IXOSC = Ixodes scapularis; DAPPU = Daphnia pulex; MELCI = Melitaea cinxia; HELME = Heliconius melpomene; DANPL = Danaus plexippus; BOMMO = Bombyx mori; PLUXY = Plutella xylostella; HARSA = Harpegnathos saltator; SOLIN = Solenopsis invicta; APIME = Apis mellifera; NASVI = Nasonia vitripennis; AEDAE = Aedes aegypti; ANOGA = Anopheles gambiae; CULQU = Culex quinquefasciatus; DROME = Drosophila melanogaster; DROMO = Drosophila mojavensis; DROSI = Drosophila simulans; ACYPI = Acyrthosiphon pisum; TRICA = Tribolium castaneum; PEDHU = Pediculus humanus.

[Figure: heatmap. Columns (species): RATNO, FELCA, ACYPI, PEDHU, APIME, NASVI, HARSA, SOLIN, TRICA, DROSI, DROME, DROMO, ANOGA, AEDAE, CULQU, MELCI, DANPL, HELME, BOMMO, PLUXY, IXOSC, DAPPU. Rows: number of clusters (x 10^4). Color bar: number of paralogs per cluster.]

Supplementary Figure 19. Heatmap of orthologous groups in representative arthropods and mammalian outgroups. The figure represents a visualization of orthologous groups between 22 species. Rows show corresponding orthologous groups and columns are species. The darker color indicates in-paralogs. Species are in taxonomic order and the rows are ordered using hierarchical clustering (ref. 5). FELCA (cat) and RATNO (rat) are the mammalian outgroups. Species codes are as used in SwissProt: RATNO = Rattus norvegicus; FELCA = Felis catus; ACYPI = Acyrthosiphon pisum; PEDHU = Pediculus humanus; APIME = Apis mellifera; NASVI = Nasonia vitripennis; HARSA = Harpegnathos saltator; SOLIN = Solenopsis invicta; TRICA = Tribolium castaneum; DROSI = Drosophila simulans; DROME = Drosophila melanogaster; DROMO = Drosophila mojavensis; ANOGA = Anopheles gambiae; AEDAE = Aedes aegypti; CULQU = Culex quinquefasciatus; MELCI = Melitaea cinxia; DANPL = Danaus plexippus; HELME = Heliconius melpomene; BOMMO = Bombyx mori; PLUXY = Plutella xylostella; IXOSC = Ixodes scapularis; DAPPU = Daphnia pulex. The lepidopteran species are highlighted with a box.

[Figure: histogram. Axes: Chromosomes and Frequency.]

Supplementary Figure 20. The frequency distribution of manually annotated genes of Melitaea cinxia across chromosomes.


Supplementary Figure 21. ShxA (Special homeobox A) alignment. Dp: Danaus plexippus; Hm: Heliconius melpomene; Mc: Melitaea cinxia. Amino acids differing from the consensus are highlighted. All species have a short first exon and a long second exon containing the homeodomain (red bar). The exon boundaries are indicated by a black vertical bar. Mc/ShxA-2 has a 70 amino acid insertion relative to the other sequences in exon 2.

[Figure: phylogenetic tree with labeled clades: bcd, zen, zen2, ShxA, lab, ShxB, ShxC, pb, ShxD, Antp, Dfd, abdA, Scr, Ubx, ftz, AbdB.]

Supplementary Figure 22. Maximum likelihood phylogeny of insect Hox homeodomain sequences. Homeodomains were excised from the Hox cluster genes of Apis mellifera (Am), Tribolium castaneum (Tc), Drosophila melanogaster (Dm), Bombyx mori (Bm), Danaus plexippus (Dp), Melitaea cinxia (Mc, sequence names enlarged), and Heliconius melpomene (Hm). In both B. mori and M. cinxia, the ShxA paralogues are more closely related to one another than to the ShxA genes of any other species, suggesting independent duplication of ShxA in these species.

[Figure: histogram. x-axis: indel size (-150 to 150); y-axis: count (1e+00 to 1e+06, logarithmic).]

Supplementary Figure 23. Distribution of variant lengths in Melitaea cinxia in the Åland Islands population. Negative lengths are deletions and positive lengths are insertions. Zero length indicates a SNP (gray bar).

[Figure: two boxplot panels, a) SNP density / kb and b) indel density / kb, each with the categories Genic, Coding and Intronic.]

Supplementary Figure 24. Boxplot for a) SNP and b) indel density in all (16,667) gene models of Melitaea cinxia in the Åland Islands population. Genic refers to the region from 5’ UTR to 3’ UTR. The box shows the interquartile range, while the band in the box is the median. The whiskers extend to data points that are within 1.5 times the interquartile range. Dots are outliers.
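The boxplot conventions stated in this caption can be made concrete with a small sketch. The quartile definition is an assumption (linear interpolation, Python's "inclusive" method); plotting packages differ on this detail, and the function name and example values are hypothetical.

```python
import statistics


def boxplot_summary(values):
    """Summarize data the way the caption describes a boxplot:
    box = interquartile range, band = median, whiskers = most extreme
    data points within 1.5 x IQR of the box, the rest = outliers."""
    ordered = sorted(values)
    # Assumption: quartiles by linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


# Hypothetical per-gene densities, for illustration only.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In the example the value 100 lies beyond the upper fence (q3 + 1.5 x IQR) and is reported as an outlier, so the upper whisker stops at 9.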

[Figure: two scatter plots against distance (0 to 3,000 bp): a) LD (r2) and b) LD (D’), each on a scale of 0 to 1. Legend: observed values, fitted expected value, moving average.]

Supplementary Figure 25. Linkage disequilibrium described by a) r2 and b) D’ as a function of the physical distance in the Åland Islands population. Curves for the expected values of r2 and D’ were fitted to the data as described in Marroni et al. (ref. 6). For comparison, a moving average was calculated for windows of 200 points.
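For readers unfamiliar with the two statistics, the sketch below computes r2 and D’ for one pair of biallelic sites and a simple moving average. It assumes phased haplotypes coded 0/1, which may differ from how the study estimated LD from its sequencing data; the function names are hypothetical, and the fitted expectation curve is not reproduced here.

```python
def ld_stats(haplotypes):
    """Return (r2, |D'|) for two biallelic loci.
    `haplotypes` is a list of (allele_at_locus_1, allele_at_locus_2)
    tuples with alleles coded 0 or 1 (assumed phased)."""
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n
    p_b = sum(h[1] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    r2 = d * d / denom
    # D is scaled by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max else 0.0
    return r2, d_prime


def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical haplotypes: complete association, then no association.
complete = ld_stats([(1, 1)] * 5 + [(0, 0)] * 5)
independent = ld_stats([(1, 1), (1, 0), (0, 1), (0, 0)])
```

Applied to site pairs sorted by physical distance, `moving_average(r2_values, 200)` would give a smoothed curve comparable to the one described in the caption.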

[Figure: density plot. x-axis: GC% (25 to 45); y-axis: density (0.0 to 0.3). Curves: Melitaea cinxia, Bombyx mori.]

Supplementary Figure 26. GC content within 100 kb sliding windows and 10 kb shift in Melitaea cinxia and Bombyx mori chromosomes.
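The windowing scheme named in this caption (100 kb windows moved in 10 kb steps) can be sketched as follows. The function name, the handling of ambiguous bases and the choice to skip a trailing partial window are assumptions; the caption does not state how the study treated these cases.

```python
def gc_windows(sequence, window=100_000, step=10_000):
    """Return (start, GC%) for each full window along a sequence.
    Assumptions: GC% is computed over A/C/G/T only, windows with no
    unambiguous bases are skipped, and a trailing partial window is dropped."""
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        if acgt == 0:
            continue  # e.g. a window made up entirely of N
        results.append((start, 100.0 * gc / acgt))
    return results


# Tiny hypothetical sequence with a 4 bp window, for illustration only.
toy = gc_windows("GGCCAATTGCAT", window=4, step=4)
```

With the default arguments, consecutive windows overlap by 90 kb, so neighbouring values are strongly correlated; this is worth keeping in mind when reading the density and boxplot figures built from such windows.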

[Figure: two boxplot panels of GC content (%) per chromosome: a) Melitaea cinxia, chromosomes 1 to 31; b) Bombyx mori, chromosomes 1 to 28.]

Supplementary Figure 27. Boxplot for GC content within 100 kb sliding windows and 10 kb shift in a) the 31 chromosomes of Melitaea cinxia and b) the 28 chromosomes of Bombyx mori. Red lines show the mean GC content of M. cinxia (30.7%) and B. mori (35.4%) across the sliding windows. The number of sliding windows in the chromosomes varies from 11 to 681. The box shows the interquartile range, while the band in the box is the median. The whiskers extend to data points that are within 1.5 times the interquartile range. Dots are outliers.

[Figure: two boxplot panels of gene density per chromosome: a) Melitaea cinxia, chromosomes 1 to 31; b) Bombyx mori, chromosomes 1 to 28.]

Supplementary Figure 28. Boxplot for gene density within 100 kb sliding windows and 10 kb shift in a) the 31 chromosomes of Melitaea cinxia and b) the 28 chromosomes of Bombyx mori. Red lines show the median gene density across the sliding windows. The number of sliding windows in the chromosomes varies from 11 to 681. The box shows the interquartile range, while the band in the box is the median. The whiskers extend to data points that are within 1.5 times the interquartile range. Dots are outliers.

[Figure: two boxplot panels of repeat content (%) per chromosome: a) Melitaea cinxia, chromosomes 1 to 31; b) Bombyx mori, chromosomes 1 to 28.]

Supplementary Figure 29. Boxplot for the proportion of repeats within 100 kb sliding windows and 10 kb shift in a) the 31 chromosomes of Melitaea cinxia and b) the 28 chromosomes of Bombyx mori. Black lines show the mean repeat contents across the sliding windows for the superscaffolds of M. cinxia (22.6%) and for the whole genome of B. mori (38.8%). The number of sliding windows in the chromosomes varies from 11 to 681. The box shows the interquartile range, while the band in the box is the median. The whiskers extend to data points that are within 1.5 times the interquartile range. Dots are outliers.

[Figure: three panels along chromosome position (0 to 7,500,000): a) GC %, b) gene density, c) repeat %.]

Supplementary Figure 30. a) GC content (%), b) gene density and c) repeat content (%) within 100 kb sliding windows and 10 kb shift in Melitaea cinxia chromosome 1 (Z). GC distribution is strikingly even across the chromosome. The gene and repeat densities fluctuate more but are not notably lower or higher in any chromosome segment.

[Figure: three panels along chromosome position (0e+00 to 8e+06): a) GC %, b) gene density, c) repeat %.]

Supplementary Figure 31. a) GC content (%), b) gene density and c) repeat content (%) within 100 kb sliding windows and 10 kb shift in Melitaea cinxia chromosome 5. GC distribution varies very little across the chromosome. Genes and repeats appear at higher densities, but there is no clear pattern in the variation.


Libytheana carinenta Danaus gilippus Danaus plexippus Lycorea halia Lycorea ilione Anetia briarea Anetia pantheratus Tellervo zoilus Melinaea menophilus Athyrtis mechanitis Patricia dercillidas Athesis clearista Thyridia psidii Sais rosalia Scada reckia Forbestra equicola Mechanitis lysimnia Mechanitis polymnia Methona themisto Aeria eurimedia Tithorea harmonia Elzunia pavoni Hyposcada illinissa Hyposcada anchiala Hyposcada virginiana Megoleria orestilla Ollantaya aegineta Ollantaya canilla Oleria aquata Oleria didymaea Oleria onega Oleria alexina Oleria ilerdina Oleria quintina Oleria aegle Oleria gunilla Oleria estella Oleria zelica Oleria amalda Oleria paula Oleria fasciata Oleria athalina Oleria victorine Oleria phenomoe Oleria cyrene Oleria makrena Oleria padilla Callithomia lenea Dircenna dero Pteronymia teresita Ceratinia neso Episcada hymenaea Velamysta pupilla Godyris duillia Hypoleria lavinia Heterosais guilia Greta polissena Pseudoscada timna Greta oto Greta diaphanus Epityches eupompe Napeogenes pharo Napeogenes cranto Hypothyris cantobrica Hypothyris ninonia Hypothyris daphnis Hyaliris antea Placidina euryanassa Pagyris cymothoe Ithomia terra Ithomia jucunda Ithomia agnosia Ithomia lichyi Ithomia drymo Ithomia lagusa Ithomia hyala Ithomia ellara Ithomia celemia Ithomia patilla Ithomia diasia Ithomia iphianassa Ithomia salapia Archaeoprepona amphimachus Archaeoprepona demophon Prepona hewitsonius Prepona laertes Hypna clytemnestra Siderone galanthis Zaretis itys Consul fabius Memphis glauce Memphis appias Memphis anna Memphis offa Memphis moruus Memphis acidalia Memphis leonida Anaea troglodyta Fountainea halice Fountainea nobilis Polygrapha tyrianthina Fountainea glycerium Fountainea eurypyle Fountainea ryphea Manataria hercyna Haetera piera Cithaerias aurora Cithaerias pireta Pierella lena Pierella lamia Pierella luna Antirrhea philoctetes Caerois chorinaeus Morpho cisseis Morpho hecuba Morpho anaxibia Morpho amathonte Morpho menelaus Morpho 
epistrophus Morpho achilles Morpho helenor Dynastor darius Narope cyllabarus Opoptera aorsa Opoptera syme Eryphanis automedon Caligo illioneus Caligo teucer Caligo atreus Caligo euphorbus Caligo idomeneus Brassolis sophorae Catoblepia amphiroe Selenophanes cassiope Penetes pamphanis Blepolenis batea Opsiphanes cassiae Opsiphanes tamarindi Opsiphanes boisduvalii Opsiphanes invirae Opsiphanes quiteria Oressinoma typhla Bicyclus anynana Steremnia umbracina Steroma modesta Etcheverrius chiliensis Chillanella stelligera Auca coctei Pedaliodes phrasiclea Praepedaliodes phanias Corades enyo Oxeoschistus leucospilos Pronophila thelebe Moneuptychia soter Pharneuptychia sp. Cissia penelope Yphthimoides borasta Moneuptychia paeon Hermeuptychia hermes Praefaunula armilla Godartiana muscosa Amphidecta calliomma Amphidecta reynoldsi Archeuptychia cluena Chloreuptychia arnaca Erichthodes antonina Pareuptychia ocirrhoe Paryphthimoides poltys Magneuptychia moderata Posttaygetis penelea Harjesia blanda Taygetis ypthima Pseudodebis valentina Taygetomorpha celia Taygetis kerea Taygetis mermeria Taygetis larua Taygetis tripunctata Taygetis virgilia Taygetis leuctra Taygetis laches Taygetis echo Taygetis thamyra Taygetis sosis Taygetis cleopatra Adelpha alala Adelpha cytherea Adelpha thessalia Adelpha saundersii Adelpha malea Adelpha cocala Adelpha justina Adelpha mesentina Adelpha lycorias Adelpha epione Euptoieta hegesia Yramea cytheris Yramea lathonoides Actinote alcione Actinote parapheles Actinote pellenea Actinote melanisans Actinote carycina Actinote pyrrha Philaethria wernickei Dryadula phaetusa Podotricha telesiphe Dryas iulia Agraulis vanillae Dione moneta Dione glycera Dione juno Eueides isabella Eueides procula Heliconius telesiphe Heliconius erato Heliconius charithonia Heliconius sara Heliconius antiochus Heliconius congener Heliconius aoede Heliconius ethilla Heliconius hecale Heliconius pardalinus Heliconius xanthocles Heliconius doris Heliconius egeria Heliconius 
burneyi Heliconius wallacei Asterocampa leilia Doxocopa laure Marpesia zerynthia Baeotus deucalion Historis acheronta Smyrna blomfildia Tigridia acesta Colobura dirce Vanessa carye Vanessa virginiensis Hypanartia bella Hypanartia lethe Hypanartia dione Hypanartia kefersteini Siproeta epaphus Siproeta stelenes Metamorpha elissa Anartia jatrophae Anartia amathea Anartia fatima Junonia evarete Junonia coenia Junonia genoveva Melitaea cinxia Chlosyne lacinia Chlosyne hippodrome Chlosyne janais Chlosyne gaudialis Chlosyne narva Anthanassa drusilla Castilia eranites Ortilia ithra Telenassa teletusa Eresia lansdorfi Eresia emerantia Eresia datis Mestra hypermestra Vila azeka Biblis hyperia Catonephele antinoe Catonephele numilia Nessaea obrinus Catonephele nyctimus Myscelia capenas Pyrrhogyra edocla Epiphile huebneri Epiphile orea Peria lamis Nica flavilla Asterope markii Temenis laothoe Haematera pyrame Callicore tolima Callicore hydaspes Diaethria candrena Diaethria clymena Mesotaenia vaninka Perisama humboldti Perisama oppeli Perisama bomplandii Perisama moronina Eunica eurota Eunica bechina Eunica orphise Eunica tatila Eunica malvina Eunica cuvieri Eunica monima Dynamine myrrhina Dynamine tithia Dynamine coenus Dynamine athemon Dynamine mylitta Cybdelis phaisile Panacea regina Ectima liriope Hamadryas atlantis Hamadryas feronia Hamadryas guatemalena Hamadryas glauconome Hamadryas laodamia Hamadryas amphinome Hamadryas arinome

Chromosome categories, parsimony reconstruction (unordered) [steps: 138]: 6-8, 9-14, 15-19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33-41, 42-78

Supplementary Figure 32. Haploid chromosome number mapped onto a phylogenetic hypothesis of Nymphalidae. Haploid chromosome numbers were treated as discrete character states, which were mapped onto the phylogeny using the principle of parsimony. Character state “31” is shown to be the most likely ancestral state for the family. The arrow indicates Melitaea cinxia.
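Parsimony mapping of an unordered discrete character can be illustrated with Fitch's algorithm on a small binary tree. This is a generic textbook sketch, not the software or settings used for the figure; the tree, the taxon names and the chromosome numbers below are hypothetical.

```python
def fitch(tree, states):
    """Bottom-up pass of Fitch parsimony for an unordered character.
    `tree` is a leaf name (str) or a pair (left, right) of subtrees;
    `states` maps leaf names to their character state.
    Returns (candidate states at this node, minimum number of changes)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_steps + right_steps
    # Children disagree: keep all candidates and count one change.
    return left_set | right_set, left_steps + right_steps + 1


# Hypothetical five-taxon tree and haploid chromosome numbers.
toy_tree = ((("A", "B"), "C"), ("D", "E"))
toy_states = {"A": 31, "B": 31, "C": 30, "D": 31, "E": 21}
root_states, steps = fitch(toy_tree, toy_states)
```

For this toy tree the only most-parsimonious root state is 31, reached with two changes. Treating the states as unordered means that a change from 31 to 30 costs the same as a change from 31 to 21.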

[Figure: two histograms of sequence identity (%) (0 to 60) against frequency: a) syntenic genes and non-syntenic genes; b) non-syntenic genes only, classed as high identity or low identity.]

Supplementary Figure 33. Frequency distribution of minimum sequence identity among one-to-one orthologs of Melitaea cinxia, Bombyx mori and Heliconius melpomene. The identity distribution is shown in a) syntenic and non-syntenic genes, b) only in non-syntenic (translocated) genes classified according to their pairwise identities. Out of 182 translocated genes 37 (20%) had identity less than 20%.

[Figure: phylogeny of Melitaea (n=31), Heliconius (n=21) and Bombyx (n=28), with node ages of 71-86 My and 107-127 My.]

Supplementary Figure 34. The number of potential translocated genes (red) in the Melitaea, Heliconius and Bombyx phylogeny. n refers to the haploid chromosome number.

[Figure: bar chart. x-axis: chromosome (1 to 31; chromosomes 2, 4, 6, 9-15 and 22-31 are marked with *); y-axis: translocated genes (%), 0 to 6.]

Supplementary Figure 35. Distribution of 42 potential translocated genes in Melitaea cinxia chromosomes. The numbers are scaled based on the number of one-to-one orthologs between M. cinxia and B. mori. Chromosome 1 is the Z chromosome and * indicates fusion chromosomes in B. mori and H. melpomene.


Supplementary Figure 36. Chromosomes of Melitaea cinxia (left), Heliconius melpomene (middle) and Bombyx mori (right). Each box represents one superscaffold in M. cinxia or a scaffold in H. melpomene. Colors and small numbers above the boxes show orthologous M. cinxia chromosomes and chromosome numbers, and thus indicate fusion chromosomes and translocated sites in H. melpomene and B. mori genomes. Horizontal lines within boxes show corresponding loci in M. cinxia chromosomes, and red vertical lines indicate bin borders showing recombination sites in the linkage map.


Supplementary Figure 37. Alignment of Melitaea cinxia chromosomes 12 and 31, 14 and 30, and 27 and 29 against Bombyx mori fusion chromosomes 11, 23, and 24. Colored boxes show aligned regions between M. cinxia and B. mori. Upper and lower boxes in M. cinxia indicate forward and reverse alignments against B. mori, respectively. Orange lines denote bin boundaries and black vertical lines mark chromosome boundaries in M. cinxia.


Supplementary Tables

Supplementary Table 1. DNA samples used in sequencing and sequencing library statistics. 1) used only in mitochondrial DNA assembly, 2) used in scaffold validation and superscaffolding, 3) used only in variation detection.

Library | DNA sample | DNA isolation method | Average insert size, expected (observed) | Average read length (bp) | Number of raw reads (M) | Number of filtered reads/readpairs (M) | Number of mapped reads/readpairs (M) | Coverage of mapped reads
454 single | 1 male, 1 female (1) | II.2.2. | single read | 360 | 10 | 10 | - | 9.2
IlluminaPE1 | 1 male, 10 full-sib pool | II.2.2., II.2.3. | 500 (460) bp | 2 x 58 | 150 | 83 | 56 | 16.7
IlluminaPE2 | 10 full-sib pool | II.2.3. | 800 (710) bp | 2 x 125 | 20 | 9.9 | 4.3 | 2.8
IlluminaMP1 | 10 full-sib pool | II.2.3. | 1 (1.0) kb | 2 x 76 | 23 | 7.8 | 2.9 | 1.1
SOLiDMP1 | 1 male | II.2.2. | 2 (1.9) kb | 2 x 50 | 300 | 38.9 | 20.8 | 5.3
SOLiDMP2 | 1 male | II.2.2. | 3 (2.7) kb | 2 x 50 | 200 | 40.5 | 18.4 | 4.7
IlluminaMP2 | 10 full-sib pool | II.2.3. | 3 (2.3) kb | 2 x 66 | 186 | 113.6 | 46.2 | 15.6
IlluminaMP3 | 10 full-sib pool | II.2.3. | 3 (3.1) kb | 2 x 69 | 73 | 59.4 | 24.6 | 8.7
IlluminaMP4 | 10 full-sib pool | II.2.3. | 5 (3.1) kb | 2 x 68 | 97 | 74.7 | 36.6 | 14.3
SOLiDMP3 | 100 full-sib pool | II.2.4. | 5 (4.7) kb | 2 x 50 | 132 | 131 | 10.9 | 2.8
454MP1 | 100 full-sib pool | II.2.4. | 8 (6.5) kb | 312 | 1.8 | 0.95 | 0.78 | 0.6
454MP2 | 100 full-sib pool | II.2.4. | 16 (17.0) kb | 354 | 2.1 | 0.72 | 0.59 | 0.5
PacBio (2) | 100 full-sib pool | II.2.4. | single read | 2,480 | 3.3 | 2.7 | 2 | 12.7
SOLiD_ÅLpool (3) | 53 individuals in 4 pools | II.2.3. | 115 bp | - | 636 | - | 338 | -

Supplementary Table 2. RNA samples and library information. The average number of mappable reads is listed per individual, except for pooled samples, for which the total number of mappable reads is reported. n/a - not applicable.

Sample definition | Sample size: males, females | RNA-seq library | Read length (bp) | Average number of mappable reads
abdomen pool (PoolA) | 4: 2, 2 | Full-length transcriptome PE | 2 x 76 | 54.8M
mixed tissue pool (PoolMix) | 57: n/a | Full-length transcriptome PE | 2 x 76 | 31.4M
mixed tissue pool | 155: n/a | 454 single read | 110, 220 | -
3 days old adults | 49: 26, 23 | PolyA-anchored single read | 75, 100 | 5.0M
2-3 days old adults | 40: 15, 25 | PolyA-anchored PE | 2 x 101 | 3.9M

Experiment (first column of the original table): Annotation1, Annotation2, Annotation3, Variation.

Supplementary Table 3. Contig and scaffold statistics. Initial contigs refer to the contig assembly produced by Newbler. The final contigs were produced by scaffolding the initial contigs with MIP Scaffolder and closing gaps between adjacent contigs with SOAPdenovo GapCloser. Final scaffolds include a set of scaffolds with minimum length of 1,500 bp.

Assembly | Number of contigs/scaffolds | Max length (bp) | N50 (bp) | Total length (bp)
Initial contigs | 217,638 | 26,736 | 2,105 | 354,538,866
Final contigs | 49,851 | 144,962 | 13,489 | 360,975,554
Final scaffolds | 8,262 | 668,473 | 119,328 | 389,896,394

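The N50 values reported in the assembly tables can be computed from a list of contig or scaffold lengths. The sketch below assumes the conventional definition (the length L such that sequences of length at least L contain at least half of the total assembled bases); the function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (conventional definition)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() requires at least one sequence length")


# Hypothetical lengths in bp, for illustration only.
example_lengths = [10, 8, 6, 4, 2]
print(n50(example_lengths))
```

For the toy lengths above, the two longest sequences (10 and 8) already cover 18 of 30 bases, so the N50 is 8.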
Supplementary Table 4. The statistics of superscaffolds per chromosome.

Chromosome | Number of superscaffolds | N50 (bp) | Total length (bp)
1 (Z) | 74 | 338,757 | 14,178,551
2 | 63 | 338,247 | 13,061,208
3 | 58 | 322,278 | 11,714,550
4 | 50 | 406,822 | 12,875,956
5 | 48 | 388,358 | 11,529,948
6 | 59 | 287,105 | 12,012,768
7 | 46 | 461,311 | 11,220,220
8 | 49 | 346,751 | 10,737,528
9 | 47 | 369,359 | 10,754,370
10 | 56 | 396,269 | 11,891,256
11 | 64 | 296,143 | 11,117,473
12 | 45 | 400,393 | 10,573,299
13 | 41 | 550,370 | 10,139,467
14 | 52 | 239,652 | 9,704,362
15 | 51 | 275,318 | 9,849,234
16 | 51 | 269,742 | 9,945,888
17 | 39 | 447,995 | 10,102,199
18 | 46 | 285,373 | 9,814,747
19 | 40 | 330,724 | 8,116,537
20 | 37 | 400,666 | 9,187,434
21 | 44 | 362,865 | 8,449,847
22 | 47 | 415,447 | 8,522,659
23 | 47 | 296,019 | 8,539,619
24 | 33 | 333,881 | 7,263,460
25 | 41 | 308,233 | 7,055,637
26 | 38 | 355,976 | 6,128,182
27 | 33 | 247,940 | 5,385,080
28 | 36 | 167,516 | 4,132,698
29 | 47 | 114,306 | 3,014,943
30 | 30 | 184,294 | 3,241,965
31 | 41 | 108,363 | 2,242,263
all | 1,456 | 330,752 | 283,283,699

Supplementary Table 5. Statistics of superscaffolds and unplaced scaffolds. The unplaced scaffolds include scaffolds without chromosome assignment and chimeric scaffolds as indicated by the linkage map.

Set | Number of (super)scaffolds | N50 (bp) | Total length (bp)
Superscaffolds | 1,453 | 330,752 | 282,503,348
Unplaced scaffolds | 4,846 | 97,739 | 110,805,803
Superscaffolds and unplaced scaffolds | 6,299 | 258,308 | 393,309,151

Supplementary Table 6. Chromosome statistics based on the linkage map. The table lists the number of supporting markers and scaffolds within chromosomes, and chromosome lengths as the total length of scaffolds and as centiMorgans (cM). Only data from non-chimeric scaffolds are reported.

Chromosome | Supporting markers | Number of scaffolds | Length (Mb) | Length (cM)
1 (Z) | 1566 | 202 | 14 | 50
2 | 1419 | 130 | 11.6 | 75
3 | 1237 | 124 | 11.5 | 67
4 | 1203 | 125 | 11.4 | 50
5 | 1355 | 114 | 10.9 | 33
6 | 1212 | 121 | 10.6 | 58
7 | 1276 | 127 | 10.4 | 58
8 | 1094 | 108 | 10.4 | 33
9 | 1204 | 107 | 10.4 | 67
10 | 1039 | 127 | 10.1 | 25
11 | 1063 | 120 | 10.1 | 58
12 | 1104 | 112 | 9.8 | 67
13 | 958 | 109 | 9.6 | 50
14 | 998 | 112 | 9.6 | 50
15 | 1059 | 108 | 9.2 | 58
16 | 939 | 100 | 9 | 58
17 | 1000 | 93 | 9 | 58
18 | 951 | 98 | 8.9 | 58
19 | 868 | 100 | 8.4 | 42
20 | 951 | 87 | 8.2 | 67
21 | 892 | 100 | 8 | 67
22 | 642 | 91 | 7.8 | 67
23 | 736 | 105 | 7.8 | 58
24 | 673 | 74 | 6.4 | 67
25 | 635 | 87 | 6.3 | 50
26 | 679 | 73 | 6.2 | 58
27 | 455 | 63 | 5.4 | 25
28 | 408 | 70 | 3.9 | 50
29 | 434 | 81 | 3.2 | 50
30 | 346 | 59 | 3 | 42
31 | 327 | 78 | 2.3 | 25
total | 28,723 | 3,205 | 263.3 | 1641

Supplementary Table 7. Summary of genome assembly validation steps.

Validation method | Result
Estimating correctness of assembly by mapping PE and MP reads | 89-99% of mapped pairs are concordant with the genome
Estimating correctness of scaffolds by rescaffolding the contigs using PacBio reads | 82-87% of contig joins are concordant with the scaffolds
Estimating completeness of genome by mapping transcripts | 80% of transcripts have an alignment that covers at least 80% of the transcript
Detecting non-chimeric scaffolds with a linkage map | 91% of scaffolds are non-chimeric
Estimating completeness of genome by identifying conserved core genes | 77% (84%) of core genes have a complete (partial) match
Estimating completeness of orthologous regions by aligning scaffolds with other butterfly genomes | 17-19% of the bases in the M. cinxia genome can be aligned with other butterfly genomes
Estimating correctness of superscaffolds by comparing gene order against B. mori | 90% of scaffold joins are concordant with gene order of B. mori
Detecting non-chimeric superscaffolds with an independent linkage map | 97.6% of superscaffolds are non-chimeric

Supplementary Table 8. Contig validation based on mapping PE and MP libraries. The libraries are described in Supplementary Table 1.

Library | Correctness estimate (%)
IlluminaPE1 | 88.7
IlluminaPE2 | 99.1
IlluminaMP1 | 98.2
IlluminaMP2 | 93.8
IlluminaMP3 | 93.2
IlluminaMP4 | 93.7
SOLiDMP1 | 95.8
SOLiDMP2 | 94.5
SOLiDMP3 | 96.7
454PE1 | 90.6

Supplementary Table 9. Genome assembly validation statistics deduced from the linkage map.

Scaffold class | Number of scaffolds | Total length (Mb) | % of genome assembly length
non-chimeric | 3205 | 263.3 | 67.5
chimeric | 302 | 54.2 | 13.9
no clear assignment | 1090 | 45.6 | 11.7
no markers | 3665 | 26.8 | 6.9

Supplementary Table 10. Completeness of the genome assemblies of five lepidopteran species assessed using the set of conserved core (CEGMA) genes.

Match | Melitaea cinxia v.1.0 | Plutella xylostella v.1.0 | Danaus plexippus v.1.0 | Danaus plexippus v.3.0 | Heliconius melpomene v.1.1 | Bombyx mori v.2.0
Complete | 77.0 % | 82.3 % | 87.1 % | 89.1 % | 81.9 % | 82.7 %
Partial | 83.9 % | 86.4 % | 89.5 % | 90.7 % | 85.9 % | 86.3 %

Supplementary Table 11. Links to published genome sequences used for comparative analysis.

Species (version) | Ref. | Link to the genome (accession date)
Bombyx mori (v. 2.3) | 7 | http://sgp.dna.affrc.go.jp/data/scaffold.txt.gz (May 16 2013)
Danaus plexippus (v. 3.0) | 8-9 | http://monarchbase.umassmed.edu/download/Dp_genome_v3.fasta.gz (May 16 2013)
Heliconius melpomene (v. 1.1) | 10 | http://www.butterflygenome.org/sites/default/files/Hmel11_Release_20120601.tgz (May 16 2013)
Plutella xylostella (v.1.1) | 11 | ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates/Plutella_xylostella/DBM_FJ_V1.1/Primary_Assembly/unplaced_scaffolds/FASTA/unplaced.scaf.fa.gz (May 17 2013)
Tribolium castaneum (v. 3.0) | 12-13 | ftp://ftp.bioinformatics.ksu.edu/pub/BeetleBase/3.0/Tribolium_genome_sequence.fasta (May 16 2013)

Supplementary Table 12. Percentage of bases contained within a pairwise alignment as reported by the MUMmer tools. A higher number suggests a more similar pair of scaffolds, reflecting a larger fraction of aligned regions. Note that the matrix values are not symmetric. The most similar pair according to this comparison is M. cinxia aligned against H. melpomene. P. xylostella and T. castaneum have the most dissimilar scaffolds.

 | Melitaea cinxia | Heliconius melpomene | Danaus plexippus | Bombyx mori | Plutella xylostella | Tribolium castaneum
Melitaea cinxia | 100 | 19.11 | 17.26 | 17.38 | 11.46 | 8.02
Heliconius melpomene | 25.31 | 100 | 20.1 | 18.05 | 13.36 | 9.23
Danaus plexippus | 23.59 | 20.64 | 100 | 17.81 | 12.88 | 8.81
Bombyx mori | 13.6 | 10.26 | 9.86 | 100 | 7.43 | 5
Plutella xylostella | 9.1 | 7.7 | 7.15 | 7.73 | 100 | 3.6
Tribolium castaneum | 12.04 | 10.17 | 9.7 | 10.48 | 7.87 | 100

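One reason a matrix of aligned percentages can be asymmetric is that each value is expressed relative to a different genome's total length. The sketch below illustrates this with hypothetical numbers (they are not taken from Supplementary Table 12). How MUMmer itself counts aligned bases is not described here, so the function is only an assumed simplification.

```python
def percent_aligned(aligned_bases, genome_length):
    """Percentage of a genome's bases that fall inside pairwise alignments."""
    return 100.0 * aligned_bases / genome_length


# Hypothetical values: the same amount of aligned sequence in two genomes
# of different size gives two different percentages.
shared_aligned_bases = 50_000_000
genome_a_length = 400_000_000
genome_b_length = 250_000_000

a_vs_b = percent_aligned(shared_aligned_bases, genome_a_length)
b_vs_a = percent_aligned(shared_aligned_bases, genome_b_length)
print(a_vs_b, b_vs_a)
```

In practice the number of aligned bases can also differ between the two genomes of a pair, for example when a region in one genome aligns to several copies in the other.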
Supplementary Table 13. Genome conservation distances reported by Mauve. A smaller number suggests a more similar pair of scaffolds. The matrix values are symmetric. The most similar pair of scaffolds is between M. cinxia and H. melpomene. Overall, the results are similar to those obtained with MUMmer.

 | Melitaea cinxia | Heliconius melpomene | Danaus plexippus | Bombyx mori | Plutella xylostella | Tribolium castaneum
Melitaea cinxia | 0 | 0.648 | 0.65 | 0.719 | 0.785 | 0.783
Heliconius melpomene | 0.648 | 0 | 0.662 | 0.734 | 0.795 | 0.791
Danaus plexippus | 0.65 | 0.662 | 0 | 0.731 | 0.793 | 0.79
Bombyx mori | 0.719 | 0.734 | 0.731 | 0 | 0.831 | 0.827
Plutella xylostella | 0.785 | 0.795 | 0.793 | 0.831 | 0 | 0.865
Tribolium castaneum | 0.783 | 0.791 | 0.79 | 0.827 | 0.865 | 0

Supplementary Table 14. Classification of links between scaffolds within superscaffolds based on synteny to Bombyx mori. The order of orthologs between M. cinxia and B. mori was compared within adjacent scaffolds, and the links between scaffolds were classified as agreeing with synteny, disagreeing with synteny, or unknown if synteny information was not available.

Chromosome | Agree with synteny | Disagree with synteny | Unknown
1 (Z) | 60 | 10 | 31
2 | 42 | 2 | 36
3 | 38 | 5 | 35
4 | 51 | 4 | 51
5 | 48 | 1 | 29
6 | 34 | 6 | 23
7 | 41 | 6 | 48
8 | 28 | 4 | 43
9 | 37 | 4 | 24
10 | 39 | 9 | 47
11 | 33 | 1 | 35
12 | 43 | 4 | 35
13 | 42 | 4 | 31
14 | 25 | 4 | 33
15 | 23 | 2 | 44
16 | 22 | 1 | 37
17 | 34 | 5 | 31
18 | 44 | 2 | 35
19 | 18 | 4 | 35
20 | 47 | 0 | 11
21 | 41 | 4 | 18
22 | 25 | 4 | 31
23 | 20 | 4 | 55
24 | 26 | 3 | 27
25 | 19 | 2 | 39
26 | 14 | 1 | 22
27 | 8 | 2 | 20
28 | 17 | 2 | 20
29 | 4 | 1 | 27
30 | 7 | 1 | 24
31 | 4 | 2 | 21
Total | 934 | 104 | 990

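The totals of Supplementary Table 14 can be turned into a concordance estimate. The sketch below assumes that the percentage is taken over links with synteny information only (links classified as unknown are excluded); the table does not state the denominator, so this is an assumption. Under it, the result agrees with the 90% concordance with B. mori gene order listed in Supplementary Table 7.

```python
# Totals from the last row of Supplementary Table 14.
agree, disagree, unknown = 934, 104, 990

# Assumption: only links with synteny information enter the denominator.
informative = agree + disagree
concordance_percent = 100.0 * agree / informative

# Share of all classified links for which synteny information was available.
informative_percent = 100.0 * informative / (informative + unknown)

print(f"{concordance_percent:.1f}% of informative links agree with synteny")
print(f"{informative_percent:.1f}% of links had synteny information")
```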
Supplementary Table 15. Classification and distribution of transposable elements and other repeats in the Melitaea cinxia genome. Note that some repeat elements are overlapping. Redundancy has been filtered from the total statistics.

Repeat class | Number of hits | Length occupied (bp) | Percentage (%)
DNA transposons | | |
Total DNA transposons | 93,507 | 19,313,388 | 4.954
CMC | 218 | 75,380 | 0.019
Ginger | 171 | 67,051 | 0.017
Harbinger | 126 | 72,258 | 0.019
MULE | 6,064 | 1,418,378 | 0.364
P | 43 | 26,556 | 0.007
Sola | 5,698 | 959,915 | 0.246
TcMar | 5,868 | 2,112,494 | 0.542
Zator | 329 | 93,195 | 0.024
hAT | 5,604 | 923,252 | 0.237
Other | 11,805 | 2,051,127 | 0.526
RC/Helitron | 61,968 | 11,825,568 | 3.033
Maverick | 134 | 33,282 | 0.009
Non-LTR Retrotransposons | | |
Total LINEs | 183,104 | 32,633,079 | 8.370
LINE/CR1 | 105,787 | 16,255,273 | 4.169
LINE/L2 | 26,843 | 5,329,151 | 1.367
LINE/R1 | 44,255 | 6,826,910 | 1.751
LINE/other | 19,406 | 5,243,115 | 1.345
PLE/Penelope | 245 | 54,463 | 0.014
SINE | 252,386 | 42,842,863 | 10.988
LTR Retrotransposons | | |
Total LTR elements | 56,585 | 7,613,672 | 1.953
LTR/other | 11 | 5,988 | 0.002
LTR/Copia | 187 | 147,882 | 0.038
LTR/DIRS | 97 | 48,019 | 0.012
LTR/Gypsy | 49,977 | 6,155,508 | 1.579
LTR/Pao | 6,681 | 1,277,806 | 0.328
Satellites | 3,561 | 817,438 | 0.210
Simple repeats | 143,448 | 8,957,148 | 2.297
Other | 25,806 | 1,697,616 | 0.435
Unclassified | 14,869 | 4,487,527 | 1.151
Total repeats | 625,490 | 107,290,012 | 27.518

Supplementary Table 16. Comparison of transposable element contents among four lepidopteran species. The highest copy number is omitted for D. plexippus and P. xylostella due to a high percentage of unclassified TE elements.

Melitaea cinxia (%) Heliconius melpomene (%)14 Proportion of genome

Bombyx Plutella mori (%) 15 xylostella (%) 11

DNA transposons

5.0

10.1

1

3

1.9

Non-LTR retrotransposons: LINEs

8.4

3.9

2.4

13.8

5.2

Non-LTR retrotransposons: SINEs

11.0

8.2

0.5

12.8

0.5

LTR retrotransposons

2.0

0.5

0.2

1.7

2.5

Unclassified

1.2

2.4

6.7

2.4

28.2

27.4

24.9

10.8

35.4

34

Helitron (3.0)

Helitron (5.4)

Tc1-mariner (2.7)

SINE/5S-Deu (5.1)

SINE/Metulj (8.2)

SINE/Bm1 (11.4)

Total Highest copy number

Danaus plexippus (%)8

DNA transposons Retrotransposons


Supplementary Table 17. Number of gene models supported by transcriptome data based on mapped RNA-seq reads and contigs.

Feature | Total number | Mapped RNA-seq reads | % | Mapped RNA-seq contigs | % | Mapped RNA-seq reads or contigs | %
genes | 16,667 | 15,268 | 91.6 | 11,737 | 70.4 | 15,941 | 95.6
coding exons | 96,875 | 79,365 | 81.9 | 47,121 | 48.6 | 82,334 | 85
5’UTR exons | 11,112 | 10,448 | 94 | 5,441 | 49 | 10,606 | 95.4
3’UTR exons | 7,089 | 6,591 | 93 | 5,158 | 72.8 | 7,037 | 99.3

Supplementary Table 18. Attributes of the genome of Melitaea cinxia and four other Lepidoptera.

Attribute | Melitaea cinxia (v1) | Heliconius melpomene (v1.1) (ref. 10) | Danaus plexippus (v3) (ref. 9) | Bombyx mori (v2) (ref. 7) | Plutella xylostella (v1) (ref. 11)
Number of chromosomes | 31 | 21 | 30 | 28 | 31
Assembly length in Mb (with Ns) | 393.3 | 273.8 | 248.6 | 481.8 | 393.5
Number of (super)scaffolds | 6,299 | 4,309 | 5,397 | 43,463 | 1,793
Scaffold N50 (kb) | 258.3 | 194.3 | 715.6 | 4,008.4 | 737.2
Number of protein coding genes | 16,667 | 12,817 | 15,130 | 14,622 | 18,071
Avg (sd) span of coding genes in bp | 8,129 (10,275) | 6,779 (8,555) | 6,001 (10,492) | 6,029 (7,127) | 8,081 (11,232)
Avg (sd) protein size in aa | 317.1 (337.6) | 453 (482.2) | 460.1 (520.5) | 406.8 (498.6) | 460.8 (475.8)
Avg (sd) number of coding exons | 5.8 (4.2) | 6.6 (6.3) | 6.7 (7.1) | 5.4 (5.8) | 6.5 (7.2)
Avg (sd) exon size in bp | 163 (259.9) | 206 (332.7) | 205 (300.9) | 224 (418.1) | 213 (327.7)
Avg (sd) intron size in bp | 1493 (3,404) | 965 (2,190) | 810 (3,475) | 1083 (1,250) | 1225 (2,658)
Total repeat content % | 27.9 % | 24.9 % | 10.2 % | 35.4 % | 34.0 %
Total GC % | 32.6 % | 32.8 % | 31.6 % | 38.8 % | 38.0 %
Total coding length % | 4 % | 7 % | 9 % | 4 % | 6 %
Total intron length % | 31 % | 26 % | 29 % | 16 % | 31 %

Supplementary Table 19. The 45 predicted microRNAs in the Melitaea cinxia genome. The last column lists homologous miRNAs from B. mori (bmo) and H. melpomene (hme).

Pre-miRNA name | Scaffold | Start | End | Strand | Homologous miRNA
mci-mir-279c | scaffold1180 | 8254 | 8333 | + | bmo-mir-279c
mci-mir-7 | scaffold1252 | 108675 | 108762 | - | bmo-mir-7
mci-mir-278 | scaffold1277 | 127020 | 127106 | - | bmo-mir-278
mci-mir-1a | scaffold1382 | 155022 | 155105 | + | bmo-mir-1a
mci-mir-1b | scaffold1382 | 155025 | 155102 | - | bmo-mir-1b
mci-mir-965 | scaffold1391 | 52841 | 52943 | + | bmo-mir-965
mci-mir-274 | scaffold1415 | 93378 | 93472 | - | bmo-mir-274
mci-mir-87 | scaffold148 | 106760 | 106857 | + | bmo-mir-87
mci-mir-285 | scaffold1516 | 57828 | 57915 | + | bmo-mir-285
mci-mir-193 | scaffold1595 | 78432 | 78509 | - | hme-mir-193
mci-mir-2788 | scaffold1595 | 74602 | 74682 | - | hme-mir-2788
mci-mir-2788 | scaffold1595 | 74604 | 74679 | - | bmo-mir-2788
mci-mir-282 | scaffold1770 | 219359 | 219443 | + | bmo-mir-282
mci-mir-10 | scaffold22 | 37401 | 37484 | + | bmo-mir-10
mci-mir-993a | scaffold22 | 120468 | 120556 | - | bmo-mir-993a
mci-mir-3327 | scaffold2296 | 61062 | 61166 | - | bmo-mir-3327
mci-mir-3338 | scaffold2296 | 61216 | 61322 | - | bmo-mir-3338
mci-mir-279a | scaffold2731 | 54187 | 54268 | + | bmo-mir-279a
mci-mir-iab-4 | scaffold2892 | 152106 | 152192 | + | bmo-mir-iab-4
mci-mir-iab-8 | scaffold2892 | 152106 | 152192 | - | bmo-mir-iab-8
mci-mir-2817 | scaffold307 | 112610 | 112675 | + | bmo-mir-2817
mci-mir-750 | scaffold331 | 38900 | 38980 | - | bmo-mir-750
mci-mir-1175 | scaffold331 | 38753 | 38834 | - | bmo-mir-1175
mci-mir-276 | scaffold3363 | 235757 | 235840 | + | bmo-mir-276
mci-mir-927 | scaffold4048 | 31990 | 32078 | - | bmo-mir-927
mci-mir-1926 | scaffold4048 | 31989 | 32079 | + | bmo-mir-1926
mci-mir-133 | scaffold4307 | 97572 | 97672 | - | bmo-mir-133
mci-mir-2a-1 | scaffold496 | 55874 | 55963 | + | bmo-mir-2a-1
mci-mir-2a-2 | scaffold496 | 56278 | 56352 | + | bmo-mir-2a-2
mci-mir-13a | scaffold496 | 56030 | 56107 | + | bmo-mir-13a
mci-mir-13b | scaffold496 | 56160 | 56239 | + | bmo-mir-13b
mci-mir-281 | scaffold5086 | 10861 | 10937 | - | bmo-mir-281
mci-mir-989 | scaffold5254 | 14941 | 15031 | - | bmo-mir-989
mci-mir-2755 | scaffold55 | 100756 | 100834 | + | bmo-mir-2755
mci-mir-210 | scaffold611 | 83830 | 83916 | + | bmo-mir-210
mci-mir-3286 | scaffold611 | 86876 | 86995 | + | bmo-mir-3286
mci-mir-307 | scaffold6220 | 50345 | 50440 | + | bmo-mir-307
mci-mir-263a | scaffold66 | 73039 | 73133 | + | bmo-mir-263a
mci-mir-124 | scaffold701 | 346075 | 346158 | - | bmo-mir-124
mci-mir-279d | scaffold703 | 329304 | 329380 | + | bmo-mir-279d
mci-mir-3362 | scaffold7248 | 2498 | 2608 | - | bmo-mir-3362
mci-mir-137 | scaffold727 | 131885 | 131975 | - | bmo-mir-137
mci-mir-277 | scaffold923 | 84009 | 84129 | + | bmo-mir-277
mci-mir-317 | scaffold923 | 63251 | 63341 | + | bmo-mir-317
mci-mir-9a | scaffold997 | 262482 | 262570 | + | bmo-mir-9a

Supplementary Table 20. Predicted microRNAs located in intragenic regions.

miRNA name | Location | Gene model ID
mci-mir-279c | intronic | MCINX000703
mci-mir-7 | intronic | MCINX001004
mci-mir-278 | intronic | MCINX001113
mci-mir-965 | intronic | MCINX001656
mci-mir-274 | intronic | MCINX001774
mci-mir-10 | coding+intronic | MCINX006394
mci-mir-750 | coding+3'UTR | MCINX009447
mci-mir-1175 | 3'UTR | MCINX009447
mci-mir-2a-1 | intronic | MCINX012468
mci-mir-2a-2 | intronic | MCINX012468
mci-mir-13a | intronic | MCINX012468
mci-mir-13b | intronic | MCINX012468
mci-mir-281 | intronic | MCINX012603
mci-mir-2755 | intronic | MCINX013168
mci-mir-124 | coding+intronic | MCINX014412
mci-mir-317 | coding+intronic | MCINX016287

Supplementary Table 21. Summary of putative Melitaea cinxia transfer RNAs. The first three columns show the individual codons and their percentages and counts in protein coding genes. The last five columns show the respective anticodons, the detected number of tRNA genes, and the putative tRNA pseudogenes (predictions with poor primary or secondary structure) representing each codon.

Codon GCA GCC GCG GCT AGA AGG CGA CGG CGT AAC AAT GAC GAT TGC TGT CAA CAG GAA GAG GGA GGC GGG CAC ATA ATC ATT CTA CTC CTG CTT TTA TTG AAA AAG ATG TTC TTT CCA CCC CCG CCT TGA AGC AGT TCA TCC TCG TCT TAA ACA ACC ACG ACT TGG TAC TAT GTA GTC GTG GTT CAT CGC GGT TAG

Percentage

Codon count

1.29 % 0.82 % 0.90 % 1.40 % 1.93 % 0.90 % 1.04 % 0.57 % 0.90 % 1.96 % 3.45 % 1.44 % 2.18 % 1.01 % 1.75 % 2.24 % 1.26 % 2.94 % 1.39 % 1.33 % 0.87 % 0.55 % 1.09 % 2.88 % 1.41 % 2.97 % 1.28 % 0.97 % 1.17 % 1.50 % 2.81 % 1.83 % 4.69 % 2.16 % 2.02 % 1.66 % 3.31 % 1.29 % 0.61 % 0.74 % 1.07 % 1.28 % 1.08 % 1.66 % 1.76 % 0.87 % 1.00 % 1.49 % 2.03 % 2.10 % 0.96 % 1.08 % 1.65 % 1.17 % 1.64 % 2.65 % 1.58 % 1.04 % 1.29 % 1.84 % 1.48 % 0.72 % 1.19 % 0.86 %

106210 67386 74079 115148 159209 74486 85621 46675 74560 161312 284398 119017 179404 83373 144140 184558 104065 242359 114481 109259 71622 45594 89688 237526 116578 244784 105194 79975 96122 123727 231187 151040 386720 177699 166813 136763 272857 106306 50217 60895 88419 105199 88795 137205 144876 71569 82336 123014 167376 173128 79278 88825 136204 96337 134737 218280 130504 86051 106399 151244 122173 59238 97868 70530

tRNA type

tRNA type count

TGC Ala GGC Ala CGC Ala AGC Ala TCT Arg CCT Arg TCG Arg CCG Arg ACG Arg GTT Asn ATT Asn GTC Asp ATC Asp GCA Cys ACA Cys TTG Gln CTG Gln TTC Glu CTC Glu TCC Gly GCC Gly CCC Gly GTG His TAT Ile GAT Ile AAT Ile TAG Leu GAG Leu CAG Leu AAG Leu TAA Leu CAA Leu TTT Lys CTT Lys CAT Met GAA Phe AAA Phe TGG Pro GGG Pro CGG Pro AGG Pro TCA STOP/SeC GCT Ser ACT Ser TGA Ser GGA Ser CGA Ser AGA Ser TTA STOP TGT Thr GGT Thr CGT Thr AGT Thr CCA Trp GTA Tyr ATA Tyr TAC Val GAC Val CAC Val AAC Val ATG His GCG Arg ACC Gly CTA STOP

10 55 10 32 10 7 7 1 19 34 2 74 1 11 4 7 10 15 15 10 17 5 22 8 27 17 6 1 6 13 6 8 10 11 22 10 7 6 2 18 15 2 15 1 7 11 11 11 2 91 1620 16 58 11 21 1 13 1 11 14 NO HITS NO HITS NO HITS NO HITS

Anticodon

Pseudo Pseudo tRNA tRNA count Pseudo Pseudo

25 3

Pseudo Pseudo Pseudo Pseudo

1 178 2 1

Pseudo Pseudo Pseudo

281 15 8

Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo

4 4 10 8 2 1 3 3

Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo

396 62 7 3 1 5 1 3 3

Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo

53 1 3 6 42 12 2 4 1

Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo Pseudo

8 3 19 2 6 1 1 1899 151 12 12

Pseudo Pseudo Pseudo Pseudo

153 13 2 3

Pseudo Pseudo Pseudo

1 30 13

Pseudo

1


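The codon counts and percentages of Supplementary Table 21, and the pairing of each codon with its anticodon (for example codon GCA with anticodon TGC), can be illustrated with a short sketch. The function names and the toy coding sequences are hypothetical, and the sketch assumes in-frame coding sequences written in the DNA alphabet.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def anticodon(codon):
    """Reverse complement of a codon, written 5'->3' in the DNA alphabet."""
    return codon.upper().translate(COMPLEMENT)[::-1]


def codon_usage(cds_sequences):
    """Map each codon to (count, percentage) over a set of in-frame CDS sequences."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3   # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: (n, 100.0 * n / total) for codon, n in counts.items()}


# Hypothetical toy sequences, for illustration only.
toy_cds = ["ATGGCAGCATAA", "ATGAAATAG"]
usage = codon_usage(toy_cds)
for codon, (count, percent) in sorted(usage.items()):
    print(codon, anticodon(codon), count, f"{percent:.2f} %")
```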
Supplementary Table 22. Putative high-confidence (E < 1e-5) genes encoding small nuclear RNA components of U2- and U12-dependent spliceosomes. The candidates were visually inspected to confirm that they contained specific sequence elements such as the sm-site and sequences functioning in intron recognition.

U snRNA | Scaffold | Start | End | Strand | Eval
U1 | scaffold1909 | 162264 | 162424 | + | 4.20E-35
U1 | scaffold3257 | 18833 | 18993 | + | 6.80E-33
U1 | scaffold1383 | 2727 | 2887 | + | 2.70E-32
U1 | scaffold233 | 40825 | 40985 | + | 6.60E-32
U1 | scaffold1202 | 85179 | 85339 | + | 2.70E-31
U1 | scaffold87 | 167204 | 167365 | + | 3.60E-29
U1 | scaffold2349 | 134321 | 134478 | + | 3.40E-17
U1 | scaffold3170 | 34418 | 34581 | + | 1.70E-14
U1 | scaffold1909 | 177345 | 177186 | - | 6.60E-44
U1 | scaffold616 | 28659 | 28500 | - | 8.20E-44
U1 | scaffold1352 | 159872 | 159713 | - | 1.30E-41
U1 | scaffold1202 | 82331 | 82172 | - | 4.10E-38
U1 | scaffold1687 | 9252 | 9117 | - | 2.10E-05
U2 | scaffold1202 | 89520 | 89713 | + | 6.60E-42
U2 | scaffold5582 | 12338 | 12524 | + | 2.20E-17
U2 | scaffold606 | 28813 | 28747 | - | 2.60E-25
U2 | scaffold1462 | 98538 | 98472 | - | 2.60E-25
U2 | scaffold2452 | 26989 | 26923 | - | 2.60E-25
U2 | scaffold4588 | 1848 | 1782 | - | 2.60E-25
U4 | scaffold3126 | 27046 | 26922 | - | 5.60E-33
U4 | scaffold7000 | 15406 | 15282 | - | 5.60E-33
U5 | scaffold2452 | 27395 | 27506 | + | 5.00E-18
U5 | scaffold1447 | 141496 | 141609 | + | 2.40E-17
U5 | scaffold96 | 120440 | 120552 | + | 6.00E-17
U5 | scaffold1684 | 28531 | 28643 | + | 7.90E-17
U5 | scaffold1081 | 46480 | 46590 | + | 2.40E-16
U5 | scaffold2051 | 734 | 841 | + | 3.10E-08
U6 | scaffold2231 | 10757 | 10847 | + | 4.70E-23
U6 | scaffold3494 | 13850 | 13940 | + | 4.70E-23
U6 | scaffold4568 | 2228 | 2318 | + | 4.70E-23
U6 | scaffold2311 | 58101 | 58191 | + | 6.50E-23
U6 | scaffold3359 | 13047 | 13137 | + | 1.10E-22
U6 | scaffold2322 | 184062 | 184163 | + | 1.20E-13
U6 | scaffold882 | 15422 | 15511 | + | 2.20E-11
U6 | scaffold1116 | 185794 | 185875 | + | 1.90E-06
U6 | scaffold3881 | 62527 | 62617 | + | 2.20E-06
U6 | scaffold1227 | 64561 | 64640 | + | 3.40E-05
U11 | none | | | |
U12 | scaffold1158 | 186798 | 186943 | + | 1.80E-09
U4atac | scaffold185 | 154921 | 154789 | - | 8.30E-20
U6atac | scaffold363 | 29852 | 29942 | + | 6.40E-15

Supplementary Table 23. Annotation data for the Melitaea cinxia mitochondrial genome. For each feature the start and stop positions and length are given in bp. Direction indicates the direction of transcription for coding regions. Start and stop codons are shown for PCGs (Protein Coding Genes), and AT content is given for PCGs, rRNAs and control region. tRNA nomenclature follows the standard three letter IUPAC amino acid code.

Type | Gene | Start | Stop | Length | Direction | Start codon | Stop codon | AT content
tRNA | Met | 1 | 68 | 68 | forward | | |
tRNA | Ile | 70 | 136 | 67 | forward | | |
tRNA | Gln | 134 | 202 | 69 | reverse | | |
PCG | NADH dehydrogenase subunit 2 | 249 | 1262 | 1014 | forward | ATT | TAA | 84
tRNA | Trp | 1265 | 1333 | 69 | forward | | |
tRNA | Cys | 1326 | 1389 | 64 | reverse | | |
tRNA | Tyr | 1390 | 1456 | 67 | reverse | | |
PCG | Cytochrome Oxidase subunit 1 | 1459 | 2994 | 1536 | forward | CGA | TAA | 70.8
tRNA | Leu | 2990 | 3056 | 67 | forward | | |
PCG | Cytochrome Oxidase subunit 2 | 3057 | 3732 | 676 | forward | ATG | T-- | 75.6
tRNA | Lys | 3733 | 3803 | 71 | forward | | |
tRNA | Asp | 3806 | 3872 | 67 | forward | | |
PCG | ATP synthase subunit 8 | 3873 | 4040 | 168 | forward | ATT | TAA | 91.7
PCG | ATP synthase subunit 6 | 4030 | 4707 | 678 | forward | ATG | TAA | 78
PCG | Cytochrome Oxidase subunit 3 | 4707 | 5495 | 789 | forward | ATG | TAA | 73.3
tRNA | Gly | 5498 | 5566 | 69 | forward | | |
PCG | NADH dehydrogenase subunit 3 | 5567 | 5920 | 354 | forward | ATT | TAA | 81.9
tRNA | Ala | 5924 | 5988 | 65 | forward | | |
tRNA | Arg | 5988 | 6049 | 62 | forward | | |
tRNA | Asn | 6050 | 6113 | 64 | forward | | |
tRNA | Ser | 6114 | 6174 | 61 | forward | | |
tRNA | Glu | 6180 | 6249 | 70 | forward | | |
tRNA | Phe | 6250 | 6313 | 64 | reverse | | |
PCG | NADH dehydrogenase subunit 5 | 6312 | 8046 | 1735 | reverse | ATT | T-- | 81.2
tRNA | His | 8047 | 8114 | 68 | reverse | | |
PCG | NADH dehydrogenase subunit 4 | 8114 | 9453 | 1340 | reverse | ATG | T-- | 79.7
PCG | NADH dehydrogenase subunit 4L | 9454 | 9741 | 288 | reverse | ATG | TAA | 83.3
tRNA | Thr | 9744 | 9807 | 64 | forward | | |
tRNA | Pro | 9808 | 9872 | 65 | reverse | | |
PCG | NADH dehydrogenase subunit 6 | 9875 | 10402 | 528 | forward | ATT | TAA | 83.7
PCG | Cytochrome B | 10410 | 11558 | 1149 | forward | ATG | TAA | 75.7
tRNA | Ser | 11557 | 11624 | 68 | forward | | |
PCG | NADH dehydrogenase subunit 1 | 11649 | 12586 | 938 | reverse | ATG | T-- | 78.7
tRNA | Leu | 12588 | 12656 | 69 | reverse | | |
rRNA | Large subunit ribosomal RNA | 12664 | 14001 | 1338 | reverse | | | 84.7
tRNA | Val | 14002 | 14067 | 66 | reverse | | |
rRNA | Small subunit ribosomal RNA | 14069 | 14840 | 772 | reverse | | | 84.8
Control region | | 14841 | 15171 | 331 | | | | 93.7

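The lengths in Supplementary Table 23 are consistent with 1-based, inclusive coordinates (length = stop - start + 1), and some neighbouring features share bases. The sketch below checks this for three features whose coordinates are taken from the table; the helper names are hypothetical.

```python
# (gene, start, stop) copied from Supplementary Table 23; coordinates are
# assumed to be 1-based and inclusive, which matches the listed lengths.
features = [
    ("ATP synthase subunit 8", 3873, 4040),
    ("ATP synthase subunit 6", 4030, 4707),
    ("Cytochrome Oxidase subunit 3", 4707, 5495),
]


def feature_length(start, stop):
    """Length in bp of a feature given inclusive coordinates."""
    return stop - start + 1


def overlap_bp(first, second):
    """Number of bases shared by two features (0 if they do not overlap)."""
    shared_start = max(first[1], second[1])
    shared_stop = min(first[2], second[2])
    return max(0, shared_stop - shared_start + 1)


lengths = [feature_length(start, stop) for _, start, stop in features]
overlaps = [overlap_bp(a, b) for a, b in zip(features, features[1:])]
print(lengths, overlaps)
```

With these coordinates the two ATP synthase genes share 11 bp, and ATP synthase subunit 6 and Cytochrome Oxidase subunit 3 share a single base.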
Supplementary Table 24. Comparison of mitochondrial protein coding gene start and stop codons between Melitaea cinxia and four other Lepidoptera. Start and stop codons are shown for each of the 13 protein coding genes of the mitochondrial genome. Species used in comparison with M. cinxia are the butterflies, Kallima inachus and H. melpomene, and the moths, B. mori and P. xylostella.

Gene | Melitaea cinxia (start, stop) | Kallima inachus (start, stop) | Heliconius melpomene (start, stop) | Bombyx mori (start, stop) | Plutella xylostella (start, stop)
NADH dehydrogenase subunit 2 | ATT, TAA | ATT, TAA | ATT, TAA | ATA, TAA | ATT, TAA
Cytochrome Oxidase subunit 1 | CGA, TAA | CGA, TAA | CGA, TAA | CGA, TAA | CGA, TAA
Cytochrome Oxidase subunit 2 | ATG, T-- | ATG, T-- | ATG, T-- | ATG, T-- | ATG, T--
ATP synthase subunit 8 | ATT, TAA | ATT, TAA | ATT, TAA | ATA, TAA | ATC, TAA
ATP synthase subunit 6 | ATG, TAA | ATG, TAA | ATG, TAA | ATG, TAA | ATG, TAA
Cytochrome Oxidase subunit 3 | ATG, TAA | ATG, TAA | ATG, TAA | ATG, TAA | ATG, TAA
NADH dehydrogenase subunit 3 | ATT, TAA | ATT, T-- | ATT, TAG | ATT, TAA | ATG, TAA
NADH dehydrogenase subunit 5 | ATT, T-- | ATT, T-- | ATT, TAA | ATT, TAA | ATT, TAA
NADH dehydrogenase subunit 4 | ATG, T-- | ATG, T-- | ATG, T-- | ATG, TAA | ATG, T--
NADH dehydrogenase subunit 4L | ATG, TAA | ATG, TAA | ATG, TAA | ATG, TAA | ATG, TAA
NADH dehydrogenase subunit 6 | ATT, TAA | ATT, TAA | ATT, TAA | ATT, TAA | ATT, TAA
Cytochrome B | ATG, TAA | ATG, TAA | ATG, TAA | ATG, TAA | ATG, TAA
NADH dehydrogenase subunit 1 | ATG, T-- | ATG, T-- | ATA, TAA | ATT, TAA | ATG, TAA

Supplementary Table 25. Groups of manually annotated genes.

Gene group | Number of manually curated genes
Heat shock proteins | 72
Other chaperone-related genes | 35
Immunity related genes | 65
Cytochrome P450 | 51
Muscles, muscle development, flight | 28
Odorant binding proteins | 44
Glycolysis | 18
Growth | 28
Homeobox genes | 16
Other genes | 172
Total | 529

Supplementary Table 26. Variants from genomic and RNA-seq data from the Åland Islands population. The total number of variants in the top row has been divided into five categories shown as percentages. Each variant belongs to only one category; if an indel spanned several categories, the priority in class assignment was 1) coding exon, 2) 5’UTR, 3) 3’UTR, 4) intron, 5) intergenic.

 | Genomic SNP | Genomic indel | RNA SNP | Genomic SNP | Genomic indel
sequencing library | SOLiD_Ålpool | SOLiD_Ålpool | PolyA-anchored PE | IlluminaPE1, IlluminaPE2 | IlluminaPE1, IlluminaPE2
total number | 3,020,826 | 378,206 | 20,281 | 5,245,947 | 563,225
coding exon% | 3.5 | 1 | 19 | 2.9 | 0.7
5’UTR% | 0.6 | 0.7 | 1.9 | 0.5 | 0.5
3’UTR% | 1.4 | 1.8 | 25.9 | 1.5 | 1.8
intron% | 36 | 38.3 | 29.1 | 36.2 | 37.7
intergenic% | 58.5 | 58.3 | 24.1 | 59 | 59.4

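The priority rule stated in the caption of Supplementary Table 26 can be sketched as a simple lookup. The category order is taken from the caption; the function name and the example inputs are hypothetical, and determining which categories a variant overlaps is assumed to be done elsewhere.

```python
# Category order from the caption of Supplementary Table 26 (highest priority first).
PRIORITY = ["coding exon", "5'UTR", "3'UTR", "intron", "intergenic"]


def classify_variant(overlapped_categories):
    """Return the single highest-priority category among those a variant overlaps."""
    for category in PRIORITY:
        if category in overlapped_categories:
            return category
    # A variant overlapping no annotated feature is treated as intergenic.
    return "intergenic"


# Hypothetical examples: an indel spanning an intron/3'UTR boundary, and one
# spanning an intron/coding exon boundary.
print(classify_variant({"intron", "3'UTR"}))
print(classify_variant({"intron", "coding exon"}))
```

Because every variant is assigned to exactly one category, the five percentages in each column of the table sum to 100.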
Supplementary Table 27. Summary statistics of SNP and indel densities per kb for genic, coding, intronic and UTR regions.

Region | Median (25%, 75% quantile) SNP density per kb | Median (25%, 75% quantile) indel density per kb
genic:5'UTR-3'UTR | 14.6 (7.9, 20.8) | 1.38 (0.5, 2.3)
coding | 8.2 (3.8, 13.8) | 0.0 (0, 0)
intronic | 15.3 (7.5, 22.8) | 1.53 (0.4, 2.7)
5’ and 3’ UTRs | 7.34 (0, 17.2) | 0.0 (0, 0)

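Summary statistics of the kind shown in Supplementary Table 27 can be obtained by converting variant counts to densities per kb and then taking the median and quartiles. The sketch below assumes one density value per region and uses the "inclusive" quantile definition of the Python standard library; the quantile definition actually used for the table is not stated, and the example regions are hypothetical.

```python
from statistics import quantiles


def density_per_kb(n_variants, region_length_bp):
    """Number of variants per 1,000 bp of a region."""
    return 1000.0 * n_variants / region_length_bp


# Hypothetical (variant count, region length in bp) pairs, for illustration only.
regions = [(12, 1000), (3, 1500), (40, 2000), (0, 800), (9, 600)]
densities = [density_per_kb(n, length) for n, length in regions]

# Quartile cut points: 25% quantile, median, 75% quantile.
q25, median, q75 = quantiles(densities, n=4, method="inclusive")
print(f"{median} ({q25}, {q75})")
```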
Supplementary Table 28. Chromosome mapping of one-to-one orthologous proteins between Melitaea cinxia (x-axis) and Bombyx mori (y-axis). The M. cinxia chromosomes have been reordered to match the corresponding B. mori chromosomes. Chromosome 1 represents the Z chromosome in both species. Boxed elements indicate one-to-one (25 cases) and two-to-one (3 cases) mapped chromosomes between M. cinxia and B. mori. From the total set of 4,485 orthologs, only 4% (178) map to non-orthologous chromosomes. 1 1 208 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 2 11 1 12 0 13 0 14 1 15 0 16 1 17 1 18 0 19 0 20 0 21 0 22 0 23 0 24 0 25 0 26 0 27 0 28 0

28 17 2 6 5 20 13 10 9 12 0 0 1 1 2 2 0 1 1 0 52 0 2 0 0 0 0 0 0 0 0 163 2 0 1 0 1 0 1 1 0 0 254 0 0 0 0 0 0 0 0 0 0 240 0 0 0 0 1 0 0 0 0 0 148 0 1 0 0 0 1 0 0 0 0 141 0 0 0 0 0 0 0 0 0 0 161 0 0 0 0 2 0 0 0 0 0 176 0 0 0 0 0 0 0 0 0 1 218 0 0 0 0 0 0 1 0 0 1 188 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 2 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 2 0 2 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 2 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 2 2 0 0 3 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0

31 4 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 37 1 0 0 189 0 0 0 147 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 1 0

26 3 21 8 7 16 25 19 11 14 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 2 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 68 2 0 2 1 0 0 0 0 0 0 235 1 1 0 1 1 0 0 0 0 0 166 0 0 0 0 0 0 0 0 1 0 147 0 0 0 0 0 0 0 0 3 0 136 0 0 0 0 0 0 1 0 0 0 136 0 1 0 0 0 0 0 0 0 0 101 0 0 0 0 0 1 0 0 0 0 110 1 5 4 0 0 0 0 0 1 0 178 0 0 0 0 0 0 0 0 0 0 131 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0

30 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40 0 0 0 0 0

27 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 50 0 0 0 0

29 18 23 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 3 1 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0 0 45 1 0 0 187 1 1 0 104 0 0 0 0 0 0

24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 79 0

22 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 3 0 72


Supplementary Table 29. Chromosome mapping of one-to-one orthologous proteins between Melitaea cinxia (x-axis) and Heliconius melpomene (y-axis). The M. cinxia chromosomes have been reordered to match the corresponding H. melpomene chromosomes. Chromosome 1 represents the Z chromosome in M. cinxia. Boxed elements indicate one-to-one (11 cases) and two-to-one (10 cases) mapped chromosomes between M. cinxia and H. melpomene chromosomes. From the total set of 3,869 orthologs, 4.7% (181) map to non-orthologous chromosomes. Z 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

1 2 47 2 0 260 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 1 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 1 0 0

27 0 44 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

21 5 0 0 0 1 73 0 0 123 0 0 0 1 1 0 3 0 0 0 0 0 1 0 6 2 0 0 0 0 0 0 0 0 1 0 10 0 0 0 0 0 0 0

19 17 10 0 0 0 0 3 0 0 0 0 0 0 0 79 0 0 0 149 0 1 3 162 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0

31 12 0 0 0 0 0 0 0 0 0 0 0 1 36 0 0 156 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0

28 18 20 6 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 60 0 0 0 0 167 0 0 1 1 144 0 0 0 0 235 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 3 0 0 0 0 1 1 0 1 2 1 0 0 2 0 0 0 1 1 0 0 0 0

22 3 13 25 11 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 3 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 89 0 0 0 0 0 232 1 0 0 1 0 158 102 0 0 1 0 1 148 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 2 0 0 0 2 0 0 0

26 16 8 7 15 0 0 1 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 2 0 0 0 0 0 1 1 0 66 0 0 0 0 0 113 0 0 0 0 0 158 0 0 0 0 0 170 0 0 3 1 0 104 1 0 2 0 0 1 0 1 1 0 0 0 0 0 0

29 14 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 38 0 0 129 0 0 0 0

24 4 23 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 2 0 0 0 2 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 65 0 1 0 193 100 0 0 0

9 0 1 1 0 10 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 75

30 0 0 0 0 0 0 0 0 0 0 7 0 1 0 0 1 0 0 0 0 13


Supplementary Table 30. The total counts of one-to-one orthologous proteins mapped between the chromosomes of Melitaea cinxia and Biston betularia. The x-axis represents M. cinxia chromosomes reordered to match the corresponding B. betularia chromosomes on the y-axis. The diagonal shows one-to-one mapping to orthologous chromosomes between the two species. From the total set of 113 orthologs, 95% map to orthologous chromosomes. Chromosome 1 represents the Z chromosome in both species.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

1 12 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

28 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

17 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

6 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

20 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

13 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

10 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

9 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

31 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

15 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

26 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0

7 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0

16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0

19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0

11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0

14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 1

29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 5 0 0 0 0 0 0

23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0

12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0

30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0

27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3
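The percentage quoted in the caption of a reordered chromosome-by-chromosome matrix such as this one can be understood as the share of ortholog counts that fall on the diagonal. The following Python sketch shows that calculation under this reading; the function name and the small matrix are hypothetical illustrations and are not taken from the table.

```python
def fraction_on_diagonal(matrix):
    """Return the share of all counts that lie on the main diagonal.

    `matrix` is a list of rows; rows and columns are assumed to be ordered
    so that orthologous chromosomes share the same index.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix contains no counts")
    size = min(len(matrix), len(matrix[0]))
    diagonal = sum(matrix[i][i] for i in range(size))
    return diagonal / total


# Hypothetical 3 x 3 example (not data from the supplementary tables).
toy_counts = [
    [12, 0, 1],
    [0, 4, 0],
    [0, 1, 2],
]
share = fraction_on_diagonal(toy_counts)
print(f"{share:.1%} of orthologs map to orthologous chromosomes")
```

The same function could be applied to any of the ortholog count matrices once they are parsed into rows of integers.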


Supplementary Table 31. Chromosome mapping of one-to-one orthologous proteins between Melitaea cinxia (x-axis) and Plutella xylostella (y-axis). The chromosomes of M. cinxia have been reordered to match the corresponding P. xylostella chromosomes. Chromosome 1 represents the Z chromosome in both species. Boxed elements show orthologous chromosomes between M. cinxia and P. xylostella. From the total set of 701 orthologs, 23.8% map to non-orthologous chromosomes (16% in autosomes). M. cinxia chromosomes 27, 28 and 30 have no clear orthologous chromosome in P. xylostella.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

1 21 0 0 0 0 0 0 0 0 1 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

17 0 0 12 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2 10 0 0 57 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

6 12 0 0 0 58 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

5 1 0 0 0 0 9 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0

20 2 0 0 0 0 0 10 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

13 1 0 0 0 0 0 0 18 0 3 0 0 0 0 1 1 0 2 0 2 0 0 0 0 0 0 1 0 0 0 0

10 1 0 0 0 0 0 0 0 11 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

9 4 0 0 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

31 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4 1 0 0 0 0 0 0 0 0 0 0 21 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

15 2 0 0 0 0 0 0 1 0 0 0 0 36 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

26 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3 7 0 0 0 0 0 0 0 0 0 0 0 0 1 35 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

21 7 0 0 0 0 1 0 0 0 0 0 0 0 0 0 35 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0

8 1 0 0 0 0 0 0 0 0 7 1 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0

7 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 22 0 0 0 0 0 0 0 0 0 0 0 0 0

16 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 1 1 1 0 0 0 0 0 0

25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0

19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 1 0 0 0 0 0 0

11 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 11 0 0 1 0 0 19 0 0 0 0 0 0 0 0 0

14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 17 0 0 0 0 0 0 0 0

29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0

18 0 0 0 1 0 1 0 0 0 5 0 0 0 0 0 3 0 0 0 0 0 1 0 0 25 0 0 0 0 0 0

23 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 5 0 0 0 0 4 0 0 0 0 0

24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0

22 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0

12 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 3 0 14 0 0

27 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

30 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


Supplementary Table 32. Chromosome mapping using pairwise sequence alignments of genomic scaffolds between Melitaea cinxia (x-axis) and Plutella xylostella (y-axis). The chromosomes of M. cinxia have been reordered to match the corresponding P. xylostella chromosomes. Chromosome 1 represents the Z chromosome in both species. The matrix shows the total length of aligned sequences mapped between chromosomes. Boxed elements show orthologous chromosomes between M. cinxia and P. xylostella.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

1 374 0 0 0 0 0 0 0 0 0 0 247 0 0 0 0 0 0 123 0 0 0 149 0 0 0 0 0 138 0 0

28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

17 2 6 5 20 13 0 0 0 0 0 94 0 0 0 0 0 0 826 0 0 0 0 0 0 11895 0 294 0 76 0 263 6702 232 0 0 0 0 0 2221 0 0 0 0 0 0 2196 0 0 0 489 0 0 7748 0 0 0 0 0 0 0 0 0 107 0 0 0 0 133 0 0 0 0 0 0 217 0 0 0 0 0 0 361 0 0 0 0 0 0 0 0 359 152 0 0 0 0 0 0 0 0 442 0 0 567 0 0 0 0 0 0 256 0 139 0 0 0 0 0 0 0 149 571 0 0 0 0 0 0 0 0 0 0 0 0 0 0 137 0 0 0 0 108 0 0 0 0 0 0 260 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 681 0 0 0 0 0 0 0 0 0 0

10 9 0 0 0 0 139 0 0 0 281 0 0 0 0 0 0 0 910 0 0 2925 0 0 0 181 0 0 101 0 0 0 0 0 113 0 0 0 0 0 176 0 0 297 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

31 4 15 26 3 21 8 7 96 0 0 0 240 0 0 0 87 0 0 0 0 0 0 0 0 0 0 0 147 0 0 0 0 113 0 400 332 0 0 0 363 0 0 0 0 0 0 0 88 0 0 0 0 0 618 0 225 0 0 0 138 0 0 0 0 0 0 0 149 0 0 0 89 0 0 0 0 320 0 0 89 0 0 0 412 0 0 154 0 0 0 0 0 0 0 0 0 2323 0 0 0 0 0 0 0 246 2688 0 82 0 0 0 0 0 0 1266 0 0 0 0 176 1024 0 0 2528 0 0 0 88 0 0 0 0 3605 0 0 109 0 0 0 0 0 1424 0 231 0 0 0 149 0 0 2622 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 120 0 0 0 0 0 0 0 0 0 0 0 153 0 0 0 143 253 0 0 0 0 0 0 89 0 0 0 0 0 0 0 102 0 0 0 0 0 0 0 204 121 0 0 128 0 0 0 408 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 142 0 0 0 0 0 0 0 0 0 0 0

16 25 19 11 14 0 0 0 0 568 0 0 0 0 0 0 0 0 0 0 278 0 91 0 0 0 0 0 102 0 0 0 0 0 0 0 0 0 0 151 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 152 0 0 0 435 0 0 0 0 560 0 0 0 0 0 189 0 141 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 207 0 150 89 0 0 1600 0 0 0 0 0 3372 0 0 0 0 0 1710 0 0 0 0 0 3051 143 0 0 0 0 0 0 0 0 186 0 0 0 0 0 0 0 0 0 0 149 0 0 0 0 0 0 0 0 0 0 0 0 0 611 0 0 0 0 0

29 18 23 24 22 0 0 1446 0 0 0 0 0 0 0 0 78 0 0 0 0 1584 0 0 0 645 374 1011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 182 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 75 212 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 802 0 0 186 0 0 0 0 0 1996 0 0 0 0 248 233 0 0 0 0 0 3104 0 0 0 102 0 3304 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

12 0 0 0 0 0 0 0 0 121 0 0 0 0 0 0 0 159 0 0 0 0 0 0 0 0 0 0 0 502 0 0

30 0 0 0 0 0 0 0 0 0 0 0 92 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


Supplementary Table 33. Classification and distribution of repetitive elements in the whole genome and in the estimated fusion sites in Bombyx mori. Whole genome Number Length Percentage of hits occupied (bp) (%) 44,179 37,810 2,887 1,904 871 403 242 62

10,215,142 9,228,198 415,788 267,754 65,485 111,311 103,796 22,810

2.708 2.447 0.110 0.071 0.017 0.030 0.028 0.006

95 81 9 4

17,924 16,597 820 420

2.769 2.564 0.127 0.065

1

87

0.013

Non-LTR Retrotransposons Total LINEs LINE/Jockey LINE/RTE LINE/CR1 LINE/R1 LINE/L2 LINE/R4 LINE/LTR LINE/I LINE/Daphne LINE/CRE LINE/DNA LINE/L1 LINE/chimera LINE/R2 LINE/Unknown

233,685 49,249 48,520 49,476 16,814 13,582 7,521 1,514 720 864 117 85 58 41 18 45,106

49,225,708 13,170,459 8,911,387 8,201,382 5,215,787 2,932,217 1,987,607 369,266 194,652 192,932 28,723 27,125 19,894 19,095 14,550 7,940,632

13.051 3.492 2.363 2.174 1.383 0.777 0.527 0.098 0.052 0.051 0.008 0.007 0.005 0.005 0.004 2.105

442 81 66 113 34 27 26 3 2 1 1

104,629 25,347 11,840 20,762 11,920 5,775 7,230 505 740 73 183

16.163 3.916 1.829 3.207 1.841 0.892 1.117 0.078 0.114 0.011 0.028

4 2 82

2,422 2,899 14,933

0.374 0.448 2.307

SINE/Bm1 SINE/Unknown

196,265 64,339

41,163,590 9,987,682

10.914 2.648

332 73

71,706 13,189

11.077 2.037

220

42,366

0.011

12,301 4,167 3,391 327 43 18 4,355

3,907,127 1,536,109 1,199,945 106,852 14,347 10,113 1,039,761

1.036 0.407 0.318 0.028 0.004 0.003 0.276

36 23 6

13,362 10,925 1,844

2.064 1.688 0.285

7

593

0.092

24,765 115,626 219,518

946,810 4,286,427 46,739,607

0.251 1.136 12.392

48 113 432

2,630 4,379 101,863

0.406 0.676 15.736

910,898

166,514,459

44.147

1,571

329,682

50.928

DNA transposons

Total DNA transposons DNA/Tc1_mariner DNA/Helitron DNA/hAT DNA/BMC1 DNA/piggybac DNA/Harbinger DNA/P

Fusion sites Number Length Percentage of hits occupied (bp) (%)

Penelope LTR Retrotransposons Total LTR/Pao LTR/Gypsy LTR/Copia LTR/Micropia LTR/Helitron LTR/Unknown Simple_repeat Low_complexity Unknown Total repeats


Supplementary Table 34. Classification and distribution of repetitive elements in the whole genome and in the estimated fusion sites in Heliconius melpomene. Whole genome Number of Length Percentage hits occupied (bp) (%) DNA transposons

Total DNA transposons DNA/Helitron DNA/Mariner DNA/Tc3 DNA/piggBac DNA/hAT DNA/Harbinger DNA/TcMar-Fot1 DNA/Unknown

Non-LTR Retrotransposons: Total LINEs LINE/RTE LINE/Daphne LINE/L2 LINE/Jockey LINE/Zenon LINE/Vingi LINE/R1 LINE/R4 LINE/I LINE/CR1 LINE/Proto2 LINE/Unknown SINE LTR Retrotransposons Total LTR LTR/Gypsy LTR/Copia Simple repeat Low_complexity Unknown Total repeats

Fusion sites Number of Length Percentage hits occupied (bp) (%)

151,099 54,157 49,346 36,787 4,712 2,568 740 5 2,784

27,204,205 14,314,694 5,690,635 3,970,760 1,197,562 1,049,617 237,052 2,645 741,240

10.089 3.671 1.460 1.018 0.307 0.269 0.061 0.001 0.190

2,617 901 851 644 97 52 16

464,938 223,988 107,869 71,112 20,378 18,839 5,360

11.027 5.312 2.558 1.687 0.483 0.447 0.127

56

17,392

0.412

32,088 7,255 3,838 3,379 1,786 3,496 240 1,082 1,189 2,032 901 128 6,762

9,821,700 2,390,781 1,224,728 1,169,726 941,745 851,094 64,090 580,811 475,682 408,053 366,346 56,091 1,292,553

3.642 0.613 0.314 0.300 0.242 0.218 0.016 0.149 0.122 0.105 0.094 0.014 0.332

591 116 72 57 44 56 5 33 21 41 11 2 133

179,449 26,783 23,945 19,533 23,171 10,842 352 19,455 10,068 10,839 3,535 567 30,359

4.256 0.635 0.568 0.463 0.550 0.257 0.008 0.461 0.239 0.257 0.084 0.013 0.720

155,981

21,958,707

5.632

2,541

356,801

8.462

4,918 2,039 2,850 29

1,258,731 634,494 612,458 11,779

0.467 0.163 0.157 0.003

106 39 65 2

22,920 11,071 11,557 292

0.544 0.263 0.274 0.007

20,730 294,250 54,914 713,980

856,174 13,049,905 6,074,828 80,224,250

0.216 3.347 1.558 29.751

346 4,401 975 11,577

14,079 194,615 113,577 1,346,379

0.334 4.616 2.694 31.932


Supplementary Notes

Supplementary Note 1. Biology of the Glanville fritillary butterfly

The Glanville fritillary butterfly (Melitaea cinxia L.) has become a widely recognized model system in population and evolutionary biology, especially in the study of the ecological, genetic, and evolutionary consequences of habitat fragmentation 16. The Glanville fritillary study system in the Åland Islands in Finland has played a pivotal role in the development of metapopulation theory (reviewed in 17, 18). The empirical study was initiated in 1991 across a very large area, 50 by 70 km, covering the main Åland Island and several other large islands in SW Finland 17, 19. The landscape is highly fragmented, and the habitat suitable for reproduction consists of 4,248 (in 2012) small dry meadows with a pooled area of 783 ha, which covers 0.5% of the total land area (1,552 km²) 19. This study system is a prime example of a classic metapopulation, a system of extinction-prone local populations that persists in a balance between stochastic local extinctions and recolonizations of currently unoccupied habitat patches 17. Local extinctions and recolonizations have been documented systematically since 1993 18, 19, yielding an unparalleled record of ca. 2,000 extinction and recolonization events. Knowledge of the long-term history of hundreds of local populations provides unique material for experimental studies 20-22. The long-term research has yielded several significant ‘firsts’, including the demonstration of elevated risk of local extinction due to inbreeding 23, alternative stable states in metapopulation dynamics 24, unequivocal evidence for an extinction threshold in metapopulation dynamics 25, and demonstration of allelic variation in a metabolic gene influencing population dynamics 26.

The Glanville fritillary has one generation per year in northern Europe, with adults flying from June to early July.
Females oviposit in clusters of 50 to 250 eggs on two host plant species, Plantago lanceolata L. and Veronica spicata L. 27, 28. Larvae hatch in 2-3 weeks, forage gregariously and spin a web around the host plant, in which they stay at night, during bad weather and when not feeding. Half-grown larvae overwinter in compact ‘winter nests’, which they spin at the base of the host plant at the end of August (see Figure 1 in Ojanen et al. 19). The larvae resume feeding in the spring when host plants renew growth, usually in the beginning of April, and remain gregarious until the final instar. Pupation takes place in May. Further details of the life cycle and life history are reported by Kuussaari 27, Nieminen et al. 28, Hanski 17, Hanski et al. 21, Saastamoinen 29 and Saastamoinen et al. 30.

The fact that each larval group spins a winter nest before winter diapause makes the large-scale survey of local populations possible. The winter nests are conspicuous in early September, making it feasible to aim at counting all winter nests on every meadow in a network of thousands of meadows. This gives an estimate of local population sizes across the entire study area as well as an opportunity to sample larval family groups for experiments 19.

The transcriptome of the Glanville fritillary was initially sequenced with a Roche 454 FLX sequencer (454 Life Sciences, CT, USA) 31, which was the starting point for the present work. Two gene expression experiments have compared female butterflies originating from newly established and old local populations 22 and full-sib families of post-diapause larvae reared under different thermal conditions 32. Since 2005, we have conducted association studies on several candidate genes. In particular, more than 10 studies on the gene Pgi, encoding the glycolytic enzyme phosphoglucose isomerase, have revealed strong associations with a range of life history traits and measures of individual performance, such as the peak flight metabolic rate 20, dispersal rate in the field 33, body temperature in low ambient temperatures 34, egg clutch size 29, lifespan 30, and population growth rate in the field 26.


Supplementary Note 2. Genome Sequence

Genome sequencing strategy

In northern Europe, the Glanville fritillary has one generation per year and an obligatory diapause of six to eight months 19. Because of the long generation time, it is not feasible to establish a laboratory colony to produce a highly inbred line. Moreover, the yield of DNA from a single individual (5-20 μg) is insufficient both in quantity and quality to complete whole genome sequencing. We therefore decided to use a hierarchical genome assembly strategy, in which DNA from a single male and a few full-sibs provided material for the initial genomic contigs. Those were linked into scaffolds using several paired-end (PE) and mate-pair (MP) libraries of varying insert sizes (0.5-16 kb). The high molecular weight DNA needed for the MP libraries required hundreds of μg of genomic DNA, which was obtained from full-sibs to minimize the amount of variation in the DNA pool. To account for several platform-specific features such as read length, data yield and error profiles 35, we used several 2nd generation (454, Illumina, SOLiD) and 3rd generation (PacBio) sequencers. For genome assembly we used MIP Scaffolder 3, developed in-house, which can incorporate data from the different sequencing platforms. Finally, a high-density linkage map 36 together with long MP reads and PacBio data was used in building the superscaffolds. Each of the key steps in the genome assembly (contig assembly, scaffolding, and superscaffolding) was independently validated as described in Supplementary Note 6. Furthermore, substantial transcriptome data sets (Supplementary Table 2) also contributed to assembly validation, building of the gene models, functional annotation and variation analyses. The key steps in genome assembly are depicted in Supplementary Figure 1.

DNA sequencing

DNA samples

The initial 454 sequencing was performed on a single M. cinxia male 7th instar larva and used for initial contig assembly. A male larva was used to avoid repeat-containing DNA expected from the female-specific W chromosome 37, 38. PE and MP sequencing was performed for scaffolding and complementing the initial contig assembly (Supplementary Fig. 1). Two SOLiD MP libraries (SOLiDMP1, SOLiDMP2) were constructed using the same single male larva as for the initial sequencing. For an Illumina PE library (IlluminaPE1), again the same male larva together with thorax tissues from ten full-sibs were used. For the other sequencing libraries, abdomen or thorax tissues were used. The other PE and MP libraries with short insert sizes (150 aa) (Supplementary Fig. 16).

Orthology analyses

The whole proteome from 22 species was used in the orthology analyses. Lepidopteran species included all five species from which the whole proteome was available: M. cinxia, H. melpomene, D. plexippus, B. mori, and P. xylostella. Diptera included three mosquitoes (Aedes aegypti, Anopheles gambiae, Culex quinquefasciatus) and three fruit flies (Drosophila simulans, Drosophila melanogaster, Drosophila mojavensis). Ants (Harpegnathos saltator, Solenopsis invicta), bees (A. mellifera) and wasps (Nasonia vitripennis) represent Hymenoptera, and pea aphid (A. pisum), red flour beetle (T. castaneum) and head louse (Pediculus humanus) represent other insects. Deer tick (Ixodes scapularis) and water flea (Daphnia pulex) were treated as arthropodan outgroups, while cat and rat (Felis catus and Rattus norvegicus) were mammalian outgroups. The data sets were downloaded from NCBI RefSeq, except for rat and cat, which were obtained from Ensembl, and P. xylostella, which was downloaded from http://www.iae.fafu.edu.cn.


BLAST was used to perform an all-against-all comparison of the protein sequences. Pairwise similarities with an e-value below 1e-5 and query/subject coverage above 70 % were retained in the sequence graph (Supplementary Fig. 17). The proteins were clustered into orthologous groups using the EPT algorithm86. Orthologs were defined as reciprocal best hits. Gene duplication events were identified as in InParanoid, i.e., in-paralogs are proteins within one (reconstructed) species which have stronger similarity to each other than to the nearest ortholog in another species. EPT reconstructs ancestral proteomes at the branch points of a phylogenetic guide tree. A flat tree with star-like topology (every species branching off from the root) yielded similar results (data not shown).

The orthologous groups in Figure 1 (main text) were classified based on the presence of a group in the clades Lepidoptera, Hymenoptera and Diptera. A group is defined as conserved in a clade if it is present in at least two species of the above clades. We define one-to-one orthologs as missing from at most four (out of 22) species or duplicated in at most two species. All other groups conserved in Lepidoptera, Hymenoptera and Diptera are called N-to-N orthologs. A group conserved in two of these three clades is classified as Patchy. A group conserved in only one of the three clades is classified as Lepidopteran, Dipteran or Hymenopteran. In the case of Diptera, a group is considered conserved if it is present in both mosquitoes and fruit flies. Mosquito groups are conserved in mosquitoes but not fruit flies, and vice versa for Drosophila groups. Species-specific groups are present in a single species. Arthropod groups are present but not conserved in Lepidoptera, Hymenoptera or Diptera, and are absent from the mammals. All remaining groups are classified as Other. Proteins which have non-detectable sequence similarity to other proteins in the set are in the class “no-hits”.
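The classification rules above can be read as a decision procedure over the copy number of a group in each species. The Python sketch below is one possible reading, not the implementation used in the study. Several points are assumptions: the species identifiers are hypothetical labels; the missing-species and duplicated-species limits for one-to-one orthologs are applied jointly; "conserved in mosquitoes" (or fruit flies) is taken to mean presence in at least two of the three species; the order in which the rules are tested is a choice made here; and Arthropod groups are required to occur in at least one of the three focal orders.

```python
# Hypothetical species labels for the 22 proteomes named in the text.
LEPIDOPTERA = {"M_cinxia", "H_melpomene", "D_plexippus", "B_mori", "P_xylostella"}
MOSQUITOES = {"A_aegypti", "A_gambiae", "C_quinquefasciatus"}
FRUIT_FLIES = {"D_simulans", "D_melanogaster", "D_mojavensis"}
HYMENOPTERA = {"H_saltator", "S_invicta", "A_mellifera", "N_vitripennis"}
MAMMALS = {"F_catus", "R_norvegicus"}
OTHER_SPECIES = {"A_pisum", "T_castaneum", "P_humanus", "I_scapularis", "D_pulex"}
N_SPECIES = 22


def classify_group(copies):
    """Classify one orthologous group given {species: number of member proteins}."""
    present = {species for species, n in copies.items() if n > 0}
    if not present:
        raise ValueError("empty group")
    if len(present) == 1:
        return "Species-specific"

    # Conserved in a clade: present in at least two of its species.
    lep = len(present & LEPIDOPTERA) >= 2
    hym = len(present & HYMENOPTERA) >= 2
    # Diptera: present in both mosquitoes and fruit flies.
    dip = bool(present & MOSQUITOES) and bool(present & FRUIT_FLIES)
    conserved = [name for name, flag in
                 (("Lepidopteran", lep), ("Hymenopteran", hym), ("Dipteran", dip))
                 if flag]

    if len(conserved) == 3:
        missing = N_SPECIES - len(present)
        duplicated = sum(1 for n in copies.values() if n > 1)
        # Assumption: both limits must hold for a one-to-one group.
        return "one-to-one" if missing <= 4 and duplicated <= 2 else "N-to-N"
    if len(conserved) == 2:
        return "Patchy"
    if len(conserved) == 1:
        return conserved[0]
    if len(present & MOSQUITOES) >= 2:
        return "Mosquito"
    if len(present & FRUIT_FLIES) >= 2:
        return "Drosophila"
    in_orders = present & (LEPIDOPTERA | HYMENOPTERA | MOSQUITOES | FRUIT_FLIES)
    if in_orders and not present & MAMMALS:
        return "Arthropod"
    return "Other"


print(classify_group({"M_cinxia": 1, "H_melpomene": 2}))
```

Proteins without any detectable similarity ("no-hits") never enter a group, so they fall outside this function.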
Phylogenetic analysis of one-to-one orthologs (Supplementary Fig. 18) was based on reciprocal best hits detected by SANS49. The phylogenetic tree was constructed using 191 groups of one-to-one orthologs. Multiple sequence alignments for the groups of one-to-one orthologs were constructed using MUSCLE v. 3.8.31 76. To find the optimal tree, phylogenetic analyses were carried out in RAxML v. 7.3.0 87 by bootstrapping 100 times per group and creating a majority rule consensus tree from the 19,100 bootstrapped trees, which was then visualized with iTOL 88.

The predicted M. cinxia proteins were also mapped to OrthoDB orthologous groups (data not shown, http://cegg.unige.ch/orthodb6). OrthoDB89 orthologous groups are built progressively, with an e-value cutoff of 1e-3 for triangulating best reciprocal hits (BRHs), and 1e-6 for pair-only BRHs, requiring an overall minimum sequence alignment overlap of 30 amino acids. New genes are added to existing groups by a mapping procedure which first compares all genes from the new organism to all genes in OrthoDB groups, and then performs the BRH clustering procedure only allowing new genes to be added to existing groups. Supplementary Figure 17 illustrates the algorithm schematically. M. cinxia is now also included in the complete-clustering release 7 of OrthoDB (http://cegg.unige.ch/orthodb7).

Supplementary Figure 19 shows a broad band of protein families conserved across all species, and blocks of protein families conserved in taxonomic orders, suborders or families. It is also notable that many blocks appear patchy. Apparent deletions or extra genes can result from incomplete genome data (e.g., genes split over more than one scaffold), errors in gene prediction, false-negative homology detection, or false-positive similarity links accepted during clustering. Visual inspection of multiple alignments of orthologous groups showed that although sequence identity was quite high, protein lengths differed greatly and there were gaps due to missing ends (protein fragments) or missing exons.

The analysis of orthologous groups gives a similar picture (Fig. 1) to previous comparative studies 8, 11, 90. The statistics for M. cinxia are very similar to those for the other Lepidoptera, notably the other Nymphalidae, H. melpomene10 and D. plexippus8. The dominant classes of orthologous groups are (i) a conserved core genome, (ii) taxonomic order- or family-specific proteins (species-specific in orders represented by a single species), and (iii) proteins without detectable sequence similarity to others (“no-hits”). The aphid genome (ACYPI) is an outlier in this analysis: it has lost many protein families from the core genome that is conserved in other arthropods and between other insects and mammals91.
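The reciprocal-best-hit criterion used for ortholog detection in this note, combined with the e-value and coverage thresholds quoted for the all-against-all BLAST comparison, can be sketched in Python as follows. This is a minimal illustration and not the EPT, SANS or OrthoDB implementation. The `Hit` record, the protein identifiers and the use of bit score to rank hits are assumptions, as is the reading that both query and subject coverage must exceed 70 %.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float
    query_cov: float    # fraction of the query covered by the alignment
    subject_cov: float  # fraction of the subject covered by the alignment


def keep(hit, max_evalue=1e-5, min_cov=0.70):
    """Apply the e-value and coverage thresholds to one pairwise similarity."""
    return (hit.evalue < max_evalue
            and hit.query_cov > min_cov
            and hit.subject_cov > min_cov)


def best_hits(hits):
    """Map each query to its highest-scoring subject among the retained hits."""
    best = {}
    for hit in filter(keep, hits):
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical hits between two proteomes.
a_vs_b = [
    Hit("mc1", "hm1", 1e-50, 200.0, 0.90, 0.95),
    Hit("mc1", "hm2", 1e-20, 90.0, 0.80, 0.80),
    Hit("mc2", "hm2", 1e-30, 120.0, 0.50, 0.90),  # fails the coverage filter
]
b_vs_a = [
    Hit("hm1", "mc1", 1e-50, 200.0, 0.95, 0.90),
    Hit("hm2", "mc1", 1e-20, 90.0, 0.80, 0.80),
]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy input only the pair (mc1, hm1) is reciprocal: hm2 points to mc1, but the best hit of mc1 is hm1.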

Manual annotation

Manual annotations of gene predictions were made using the Apollo genome annotation editor 92. GFF3 files generated with MAKER and wiggle files containing the coverage of RNA-seq mappings were used for checking the gene models in Apollo. Additionally, manual annotators used BAM files from TopHat54 for visualizing RNA-seq mappings in Artemis 93. The modified gene models were sent to the VectorBase Community Annotation Portal at the European Bioinformatics Institute (EBI) 94. Genes belonging to families and pathways of special interest were manually annotated. Altogether 558 gene models and protein names were manually curated, 29 of which were deleted. The list of manually annotated genes together with their manually and automatically predicted protein descriptions is shown in Supplementary Data 1. The curated genes included heat shock and other chaperone-related genes, Cytochrome P450, Hox genes and immunity-related genes (Supplementary Table 25). We also focused on genes related to muscles and muscle development and genes that were previously reported to be activated after flight 43. Additional validation was carried out on random gene models in the Z chromosome and all gene models in random scaffolds. Most (88 %) of the gene models examined needed manual correction. One third of the corrected models remained partial due to, for example, gaps in the scaffolds or short scaffolds that split genes into two or several scaffolds. On average 12 gene models were manually curated in each chromosome (Supplementary Fig. 20).

Hox cluster

The Hox gene cluster was annotated according to sequence conservation with other insects and alignment of M. cinxia transcripts. Conservation of intron/exon structure and VISTA alignment with other Lepidoptera were also used to aid identification of conserved sequence around the Special homeobox (Shx) genes. All M. cinxia Hox cluster genes except fushi tarazu (ftz) were represented in at least one of the two expression libraries mapped to the genome (Annotation1; Supplementary Table 2, Supplementary Note 2). Shx genes may be lepidopteran-specific Hox cluster genes 95. Shx genes have been described from the genomes of B. mori, H. melpomene and D. plexippus10, 96, 97. M. cinxia was found to have all the canonical Hox genes plus four Shx genes: two copies of ShxA, and a single copy each of ShxB and ShxC. The two copies of ShxA were found in tandem on the same scaffold as the proboscipedia (pb) gene (scaffold442), and were numbered according to the direction of transcription. Sequence identity between the two M. cinxia ShxA paralogs was low (53.2 % amino acid / 55.8 % nucleotide; ClustalW / translation alignments using Geneious R6 V6.1.4), and alignment of the translated sequence with H. melpomene and D. plexippus showed that McShxA-2 has a 70 amino acid insertion in exon 2 relative to the other sequences (Supplementary Fig. 21). A recent study has addressed the evolution of the Shx genes in the Lepidoptera95. In this study it was found that all species examined except B. mori had four Shx genes, with one copy each of ShxA-D. This analysis included data from four other nymphalid butterflies, H. melpomene, D. plexippus, Pararge aegeria (speckled wood) and Polygonia c-album (comma). The M. cinxia duplication of ShxA and loss of ShxD therefore stand in contrast to these species. In order to verify the loss of ShxD in M. cinxia, PacBio and transcript reads over the region were manually inspected.
Searches were also carried out for conserved ShxD motifs95, and open reading frames in the region were translated and aligned with ShxD from the other nymphalids. We found that no ShxD homeodomain could be identified, but that there were limited regions of sequence similarity with other butterfly ShxD genes in the expected location and orientation. These regions were fragmented, however, and contained multiple stop codons in every translation frame. All transcripts from ShxD also contained multiple stop codons, suggesting that this gene has become highly degraded in M. cinxia. B. mori also has multiple duplications of ShxA and loss of ShxD10, but phylogenetic analysis suggests that both events are convergent between the two species (Supplementary Fig. 22). In addition, 5 of the 8 B. mori ShxA genes contain multiple homeodomains, many of which are degenerate, whereas both M. cinxia ShxA genes have a single intact homeodomain motif (Supplementary Fig. 21).

The M. cinxia Hox/Shx genes were located in nine scaffolds, all of which mapped to chromosome 5. After manual inspection, all scaffolds except those containing labial (lab), ftz and the 3’ end of Sex combs reduced (Scr) mapped to bin 2, while the scaffold containing lab mapped to bin 4. This suggests that the lab/pb split in the Hox cluster recorded for B. mori and H. melpomene10 is likely to be conserved in M. cinxia. No bin was determined for the scaffold containing ftz and the 3’ end of Scr, but the scaffold containing the 5’ end of Scr and the Deformed (Dfd) gene was in bin 2, making it very likely that the Hox cluster is maintained in this region. As a result of further PacBio sequencing and assembly, the scaffolds containing pb and Shx genes (scaffold442; scaffold2095) were manually combined into a single scaffold. Superscaffolding joined five other scaffolds into two superscaffolds: chr5_superscaffold27 (scaffold5113; 5’ end of scaffold22; scaffold6633) and chr5_superscaffold14 (scaffold5218; scaffold2892). Thus, the M. cinxia Hox/Shx genes can be positioned in six superscaffolds.
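The ShxD check described in this section rests on finding stop codons in every translation frame of the candidate region. A minimal Python sketch of such a six-frame scan is given below. It is an illustration only and not the procedure used for the manual inspection: the function names and the toy sequence are hypothetical, and the standard stop codons TAA, TAG and TGA are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def stop_codons_per_frame(seq):
    """Count stop codons in the three forward and three reverse frames."""
    seq = seq.upper()
    counts = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = (strand_seq[i:i + 3]
                      for i in range(offset, len(strand_seq) - 2, 3))
            counts[f"{strand}{offset + 1}"] = sum(c in STOP_CODONS for c in codons)
    return counts


def has_open_frame(seq):
    """True if at least one of the six frames is free of stop codons."""
    return any(n == 0 for n in stop_codons_per_frame(seq).values())


# Hypothetical 9-nt sequence, not M. cinxia data.
frames = stop_codons_per_frame("ATGAAATAG")
print(frames)
```

A region in which every frame carries several stop codons, as reported for the ShxD remnant, would give `has_open_frame(...) == False`.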


Supplementary Note 9. SNP and indel variation

Genome-wide variation

SNPs and indels were detected from four data sets: 1) SOLiD genomic pools from 53 individuals (SOLiD_ÅLpool; Supplementary Table 1), 2) an Illumina genomic pool from 10 individuals (IlluminaPE1, IlluminaPE2; Supplementary Table 1), 3) Illumina polyA-anchored RNA-seq data from 40 individuals (Variation; Supplementary Table 2), and 4) a PacBio genomic pool from 100 individuals (PacBio; Supplementary Table 1). These data sets were used to describe nucleotide variation of M. cinxia in the Åland Island population. SOLiD genomic pool (SOLiD_ÅLpool) data were mapped to the genomic contigs and variants were detected using LifeScope 2.5 diBayes, LifeScope 2.5 SmallIndel, and LifeScope 2.1 LargeIndel software (Applied Biosystems) with default parameters. For comparison, variants were also detected from Illumina PE reads (IlluminaPE1, IlluminaPE2) used in the genome assembly. The coverage of mapped reads in these data was 30X, which was higher than in the SOLiD pool data (SOLiD_ÅLpool). Two libraries consisting of one independent individual and ten full-sibs (IlluminaPE1 and IlluminaPE2) were first analyzed separately and then the results were merged. Variants were detected using a GATK pipeline 98, 99. Read alignments were performed with BWA version 0.5.9-r16 with the mutation rate set to 0.06 and the 3' trimming quality to 5. Variants were called with UnifiedGenotyper of GATK version 2.2 using a ploidy value of 8, a stand_call_conf value of 50.0 and a stand_emit_conf value of 10.0. Additionally, the SNP positions were filtered using a minimum minor allele count of three. The minimum distance to the nearest SNP position was set to five bp in order to minimize the effect of mapping errors. The polyA-anchored RNA-seq reads (Variation; Supplementary Table 2) were filtered and mapped onto genomic scaffolds as described in Supplementary Note 4. The allele counts were extracted from the mapped RNA-seq reads.
No indels were called from the RNA-seq data. The results from the two sequencing libraries with different insert sizes were combined. For filtering the data, only bi-allelic SNPs were included, and at least 20 individuals were expected to be polymorphic at each SNP site.

Long indels were detected using PacBio data. PacBio reads were mapped onto genomic scaffolds with BWA-SW46, and indels whose length exceeded 50 bp were detected from CIGAR alignments.

The numbers of SNPs and indels in each data set are shown in Supplementary Table 26. Only variants located in scaffolds with at least one gene model are included. Variants located at the overlap of two genes were removed; these represented 2% of the variants in the genomic data. Only 76% of mappable RNA-seq data were within predicted gene models, which can be partly explained by missing or incomplete gene model 3’UTRs. The distribution of variant lengths based on SOLiD and PacBio data shows that deletions (2,165) were more abundant than insertions (313) (Supplementary Fig. 23).
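The long-indel step described above (indels longer than 50 bp read from the CIGAR strings of mapped PacBio reads) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `long_indels`, the 1-based start coordinate (as in the SAM POS field) and the returned tuple layout are assumptions made for illustration only.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def long_indels(ref_start, cigar, min_exclusive=50):
    """Return (kind, reference_position, length) for each insertion or
    deletion in `cigar` whose length exceeds `min_exclusive` bp.

    ref_start is assumed to be the 1-based leftmost mapping position
    of the read on the scaffold (hypothetical input layout).
    """
    events = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in "ID" and length > min_exclusive:
            kind = "insertion" if op == "I" else "deletion"
            events.append((kind, ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return events

if __name__ == "__main__":
    # Soft clip, 100 matches, 60 bp deletion, 30 matches,
    # 51 bp insertion, 10 matches, 5 bp deletion (too short).
    for event in long_indels(100, "20S100M60D30M51I10M5D"):
        print(event)
```

The reported position is the reference coordinate at which the event starts; an indel of exactly 50 bp is not reported, because the text requires the length to exceed 50 bp.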

SNP density

Two Illumina PE libraries consisting of one male and ten full-sibs (IlluminaPE1) and of the same ten full-sibs (IlluminaPE2; Supplementary Table 1) were used to estimate the average level of SNP density within M. cinxia gene models. The variants were called and filtered as explained above. The median SNP and indel densities in genomic regions are shown in Supplementary Table 27. SNP density in 16,667 gene models was on average 8.2/kb in coding and 15.3/kb in intronic regions. The majority of genic regions (88%) included indels, most of which are located in introns and only 2,567 (15%) and 3,484 (21%) in coding and UTR regions, respectively. The median indel density in intronic regions was 1.5/kb (Supplementary Table 27).

The distributions of SNP and indel densities (Supplementary Fig. 24) have very long tails towards high density values. We found 595 and 115 gene models with a coding-region SNP density higher than 30 and 50 SNPs/kb, respectively, where the SNP density limits corresponded to 96.4% and 99.3% of the density distribution. We also found gene models with high intronic indel density: a total of 116 and 786 gene models had 0.5% and 1% indel density in intronic regions, respectively.

Comparable information about levels of SNP and indel variation and density is limited to model organisms and laboratory strains. The few published reports of wild population variation also illustrate high variation densities in other insects. Sequencing of 11 wild populations of B. mandarina identified over 13 million SNPs (30 SNPs/kb) and 251,000 indels (0.58 indels/kb) in the genome sequence100. In the coding exons, 363,792 SNPs (~20/kb) and 1,206 indels (~0.07/kb) were found. Whole genome sequencing of 192 inbred D. melanogaster lines yielded over 4.7 million SNPs (33 SNPs/kb)101. In the analysis of the P. xylostella genome, 558,374 SNPs were detected from a pool of a laboratory population, yielding a coding exon SNP density of 4 SNPs/kb11. In M. cinxia the SNP density was estimated from one independent individual and ten full-sibs, yielding SNP and indel densities of 13.2/kb (8.2/kb in coding regions) and 1.7/kb, respectively. In this data set, the SNP density was lower than in B. mandarina and D. melanogaster, whereas the indel density was higher than in those species, which might have impeded genome assembly (Supplementary Note 3, Supplementary Fig. 6).
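As an illustration of the two SNP filters applied to the Illumina PE calls (a minimum minor allele count of three and a minimum distance of five bp to the nearest SNP) and of a per-kb density, the following Python sketch may help. It is not the pipeline used in this study. The input layout, the function names `filter_snps` and `density_per_kb`, the choice to apply the count filter before the spacing filter, and the choice to drop both members of a too-close pair are all assumptions.

```python
def filter_snps(snps, min_minor_count=3, min_spacing=5):
    """Filter candidate SNPs.

    snps: iterable of (scaffold, position, minor_allele_count) tuples
    (hypothetical layout). Returns sorted (scaffold, position) pairs.
    """
    # Step 1: keep SNPs with enough reads supporting the minor allele.
    passed = sorted((scaf, pos) for scaf, pos, count in snps
                    if count >= min_minor_count)
    # Step 2: drop every SNP that has a neighbour on the same scaffold
    # closer than min_spacing bp (assumed: both neighbours are dropped).
    kept = []
    for i, (scaf, pos) in enumerate(passed):
        prev_close = (i > 0 and passed[i - 1][0] == scaf
                      and pos - passed[i - 1][1] < min_spacing)
        next_close = (i + 1 < len(passed) and passed[i + 1][0] == scaf
                      and passed[i + 1][1] - pos < min_spacing)
        if not (prev_close or next_close):
            kept.append((scaf, pos))
    return kept

def density_per_kb(n_variants, region_length_bp):
    """Variants per 1,000 bp of the region examined."""
    return 1000.0 * n_variants / region_length_bp

if __name__ == "__main__":
    candidates = [
        ("s1", 100, 5), ("s1", 103, 4),  # 3 bp apart: both removed
        ("s1", 120, 2),                  # minor allele count too low
        ("s1", 200, 3), ("s2", 101, 7),  # kept
    ]
    print(filter_snps(candidates))
    print(density_per_kb(82, 10_000))
```

With these made-up inputs, 82 SNPs in 10 kb of coding sequence correspond to 8.2 SNPs/kb, the same scale as the coding-region average reported above.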


Linkage disequilibrium

Linkage disequilibrium (LD) was evaluated from RNA-seq data from 40 individuals (Variation; Supplementary Table 2) for the Åland Islands population. The raw allele counts for each individual were first converted into conditional probabilities of each genotype. The allele counts n of an individual at a SNP were assumed to be binomially distributed as Bin(0.95, n) if the corresponding genotype was homozygous for the major allele, Bin(0.05, n) if the genotype was homozygous for the minor allele, or Bin(0.5, n) if the genotype was heterozygous. Potential SNPs were then filtered, keeping only those in which all three possible genotypes were present with 80% confidence, and in which, with 80% confidence, less than 15% of data were missing. After filtering, the final data set included 3,331 SNPs located in 1,312 scaffolds.

Finally, the LD (r² and D′) was evaluated for all SNP pairs within each scaffold. This was done by first iteratively finding the maximum likelihood haplotype frequency estimate for each pair of SNPs using a modified version of the EM-algorithm 102 that takes genotype uncertainty into account. From the maximum likelihood haplotype frequencies, estimates of r² and D′ were obtained. All LD computations were executed using in-house Awk 103 scripts. Supplementary Figure 25 illustrates the values of r² and D′ for each marker pair as a function of physical distance. The LD (r²) reaches a level of 0.4 at about 300 bp distance. This range of LD in M. cinxia was comparable to estimates for B. mori (r²=0.4 at ~400 bp) and B. mandarina (r² always