Supporting Information (SI Appendix)
Overview: In this study, we generated the genome sequence of a rare ‘super-white’ white-throated sparrow homozygous for a rearrangement on chromosome two (ZAL2m/ZAL2m) (1). Prior to our work, Tuttle et al. (2) published the genome sequence of a tan individual (homozygous for ZAL2). Because both of these individuals are homozygous, we could use both data sets to confidently identify ZAL2m-specific substitutions. To investigate genetic divergence between the rearranged regions of ZAL2 and ZAL2m, we first classified ZAL2-linked tan scaffolds using multiple lines of evidence (including, for example, mapping tan scaffolds to the zebra finch chromosome that is homologous to ZAL2/2m). This step improved the previously available list of ZAL2-linked scaffolds by identifying some previously unidentified ZAL2-linked scaffolds and eliminating an erroneous assignment (see below, ‘Identification of scaffolds on the second chromosome’). We then mapped our newly generated sequence reads from the super-white bird to the tan scaffolds. To confidently identify substitutions that distinguish the ZAL2 and ZAL2m chromosomes, we used the Genome Analysis Toolkit (GATK) to call variants in the available genome sequence data and in RNA-seq data published by Zinzow-Kramer et al. (3). Additionally, we investigated morph-biased and allele-specific expression patterns using these RNA-seq data from multiple tan and white birds. We identified many genes with differential expression using DESeq2 (4).
Sequencing. We sequenced a super-white bird homozygous for the rearrangement (ZAL2m/ZAL2m) (1). High molecular weight genomic DNA was extracted from liver and sequenced on a HiSeq 2500 at the Roy Carver Genome Center of the University of Illinois. Approximately 240 million reads of 150 bp were generated, which are available in the SRA database (SRA accession number: SRR4191732).
Identification of scaffolds on the second chromosome.
Genomic scaffolds from a tan bird were recently published by Tuttle et al. (2). These data are available in NCBI (GCF_000385455.1). To confidently identify scaffolds that originate from the second chromosome of white-throated sparrows, we mapped those scaffolds onto the zebra finch genome using LASTZ 1.03.73 (5) (parameters: --step=20 --chain --gfextend --gapped --traceback=2000M --ydrop=300400 --identity=85 --matchcount=1000), with additional criteria of >30% coverage for scaffolds longer than 10 kb and >80% coverage for scaffolds shorter than 10 kb. These cutoffs were selected, on the basis of the distribution of coverage % across scaffolds (Fig. S8), to avoid scaffolds that mapped to ZAL2 only because of partial mapping to repetitive sequences (6).
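The length-dependent coverage criterion described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name and arguments are hypothetical, coverage is taken as mapped length divided by scaffold length (as defined in the legend of Fig. S8), and the treatment of a scaffold of exactly 10 kb is an assumption because the text does not specify it.

```python
def passes_coverage_filter(scaffold_length_bp, mapped_length_bp,
                           length_cutoff_bp=10_000):
    """Return True if a scaffold meets the coverage criterion for its length class.

    Coverage is the fraction of the scaffold that maps to the homologous
    zebra finch chromosome. Scaffolds longer than the cutoff need >30%
    coverage; shorter scaffolds need >80% coverage.
    """
    if scaffold_length_bp <= 0:
        raise ValueError("scaffold length must be positive")
    coverage = mapped_length_bp / scaffold_length_bp
    # Assumption: a scaffold of exactly 10 kb is treated as 'long'.
    if scaffold_length_bp >= length_cutoff_bp:
        return coverage > 0.30
    return coverage > 0.80
```

For example, a 50-kb scaffold with 20 kb mapped (40% coverage) would pass, whereas a 5-kb scaffold with 3.5 kb mapped (70% coverage) would not.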
We identified ZAL2 scaffolds using homology to the corresponding zebra finch chromosome (commonly referred to as TGU3 due to its homology to chicken chromosome 3 (7)), previous fluorescence in situ hybridization (FISH) studies (2, 8, 9), as well as homology with two other passerine birds with chromosome-level assemblies, collared flycatcher (Ficedula albicollis) and great tit (Parus major) (10, 11). Following these procedures, we identified 56 scaffolds on the ZAL2 chromosome (Table S1). Compared with the results of Tuttle et al. (2), our results included 19 additional ZAL2 scaffolds (corresponding to ~25 Mb) that were previously unrecognized as linked to this chromosome. In addition, we excluded a ~45 Mb scaffold (NW_005081536.1) that had been denoted as residing on a non-rearranged portion of ZAL2 (2). Regions homologous to this scaffold were found on a different chromosome in zebra finch, collared flycatcher and great tit, making it unlikely that it has moved to ZAL2 in Z. albicollis given the well-conserved chromosomal homology in birds (7, 11, 12). FISH studies also previously showed that regions outside of the rearrangement on chromosome two are only ~10 Mb in length (8, 9).
Genetic divergence between ZAL2 and ZAL2m from genome sequences and RNA-seq data. We then called variants (SNPs and indels) that distinguish ZAL2 and ZAL2m sequences by following the Genome Analysis Toolkit (GATK) best practices for variant calling in genome sequencing data (13-15). First, super-white reads from whole-genome sequencing were aligned to the reference genome from a tan bird (2) using BWA 0.7.12 (16). GATK 3.4 was used to call variants (the variant call data are available at Figshare DOI: http://dx.doi.org/10.6084/m9.figshare.5716039), and those variants with quality lower than 30 or read depth less than 5 were excluded. We then used a sliding window approach to calculate short indel frequencies and dXY following Jukes-Cantor correction (17).
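As an illustration of the windowed divergence calculation, the sketch below applies the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), to the proportion p of differing sites in each window. It is a simplified example under stated assumptions: windows are non-overlapping and 10 kb long (as in the legend of Fig. S1), positions are 0-based, every site in a window is treated as callable, and the function names are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


def windowed_dxy(diff_positions, scaffold_length, window=10_000):
    """Corrected divergence per non-overlapping window along one scaffold.

    diff_positions: 0-based positions of differences between ZAL2 and ZAL2m.
    Assumption: all sites in each window are callable.
    """
    n_windows = math.ceil(scaffold_length / window)
    counts = [0] * n_windows
    for pos in diff_positions:
        counts[pos // window] += 1
    values = []
    for i, count in enumerate(counts):
        # The last window may be shorter than the others.
        size = min(window, scaffold_length - i * window)
        values.append(jukes_cantor(count / size))
    return values
```

At the low divergence levels reported here (about 1%), the correction changes the raw proportion only slightly; for p = 0.01 the corrected distance is approximately 0.01007.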
We also called variants from available transcriptome data from the hypothalamus and nucleus taeniae of nine tan and 11 white individuals (3) following the GATK best practices for variant calling in RNA-seq data (13-15) (the variant call data are available at Figshare DOI: http://dx.doi.org/10.6084/m9.figshare.5715037). A variant was considered putatively fixed in the sampled individuals if: 1) the variant was biallelic; 2) tan individuals were homozygous for the reference allele (AA); 3) white individuals (ZAL2/ZAL2m) were heterozygous (Aa); 4) the super-white (ZAL2m/ZAL2m) individual was homozygous for the alternative allele (aa). Coordinates for fixed differences can be accessed through the following Figshare DOI: http://dx.doi.org/10.6084/m9.figshare.5715079.
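The four criteria above translate directly into a genotype check. The following is a minimal sketch, assuming genotypes are given as pairs of allele indices (0 = reference, 1 = alternative); the function name and data layout are hypothetical, and the handling of missing genotypes, which the text does not describe, is left out.

```python
def is_putatively_fixed(alleles, tan_genotypes, white_genotypes,
                        superwhite_genotype):
    """Apply the four criteria for a putatively fixed ZAL2/ZAL2m difference.

    alleles: alleles observed at the site, reference allele first.
    Genotypes are pairs of allele indices, e.g. (0, 1) for a heterozygote.
    Assumption: every listed individual has a called genotype.
    """
    # 1) The variant must be biallelic.
    if len(alleles) != 2:
        return False
    # 2) All tan birds are homozygous for the reference allele (AA).
    tan_ok = all(sorted(gt) == [0, 0] for gt in tan_genotypes)
    # 3) All white birds are heterozygous (Aa).
    white_ok = all(sorted(gt) == [0, 1] for gt in white_genotypes)
    # 4) The super-white bird is homozygous for the alternative allele (aa).
    superwhite_ok = sorted(superwhite_genotype) == [1, 1]
    return tan_ok and white_ok and superwhite_ok
```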
Scaffolds inside versus outside the rearrangement. Scaffolds residing inside versus outside the rearrangement were identified on the basis of distinctive patterns of genetic divergence. Specifically, the distributions of dXY and FST were bimodal, categorized as ‘high’ or ‘low’ dXY or FST (Fig. S9). Scaffolds with high dXY, high FST and putatively fixed differences were designated as ‘confidently inside’. If only one of the two criteria was satisfied, the scaffolds were designated as ‘likely inside’. Scaffolds were defined as ‘confidently outside’ if they had low dXY, low FST, and no putatively fixed differences, and one of the following two conditions was satisfied: either they were already shown to be outside the rearrangement (2, 8, 9), or they shared homology with the homologous chromosome of two other passerine birds (F. albicollis and P. major) (10, 11). Scaffolds that exhibited low dXY, low FST and an absence of putatively fixed differences, with no extra supporting evidence, were designated as ‘likely outside’. We used only ‘confidently inside’ and ‘confidently outside’ scaffolds for the calculations shown in Fig. 1 C-E, but including ‘likely inside’ and ‘likely outside’ scaffolds did not change the patterns (Fig. S1).
Analyses of protein coding sequences. For each ZAL2-linked gene, we extracted the longest transcript and constructed the ZAL2m version with the putatively fixed differences. Additionally, we downloaded all available genome annotations for 13 other avian species in the order Passeriformes (the same order as the white-throated sparrow) from NCBI, and the tree for all 14 species was inferred from several avian phylogeny studies (18-21) (Fig. S6). Each gene was aligned by MAFFT v7.245 (22), low-quality alignment parts were trimmed by trimAl v1.4 (23), and the codon alignment was constructed by PAL2NAL v14 (24). We obtained codon alignments for a total of 800 genes.
We calculated dN and dS for the ZAL2 and ZAL2m branches with a free-ratio model using codeml from the PAML 4.8 package (25). The Hon-New package (26), which adopts the amino acid classification system that considers charge and polarity, was used to estimate rates of radical amino acid substitution (dR) and conservative amino acid substitution (dC). To test for positive selection, a branch-site model was run with codeml in PAML, which was then compared with the null model following the simulation approach used by Nielsen et al. (27). Briefly, 10,000 random DNA sequence alignments with the same substitution parameters used in the null model were generated using Evolver (25). The resulting empirical LRT distribution served as the null distribution of the test statistic, and the P-value was inferred from the empirical distribution of LRTs. Since incomplete lineage sorting may result in discordance between the gene trees and the species tree and therefore may prevent the identification of positively selected genes (PSGs), we constructed maximum-likelihood gene trees in MEGA7 (28) for the candidate PSGs from the previous step. We reran PAML using the gene trees.
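The simulation-based test described above amounts to comparing the observed likelihood-ratio statistic with LRT values obtained from alignments simulated under the null model. The sketch below is illustrative only: the function names are hypothetical, the log-likelihoods are assumed to come from the codeml null and alternative runs, and the P-value is taken as the plain fraction of simulated values at least as large as the observed one (an assumption that is consistent with the P-values of 0 reported in Table S3).

```python
def lrt_statistic(lnl_null, lnl_alternative):
    """Likelihood-ratio test statistic, 2 * (lnL_alternative - lnL_null)."""
    return 2.0 * (lnl_alternative - lnl_null)


def empirical_p_value(observed_lrt, simulated_lrts):
    """Fraction of null-simulated LRT values at least as large as the observed one."""
    n = len(simulated_lrts)
    if n == 0:
        raise ValueError("at least one simulated LRT value is required")
    n_extreme = sum(1 for value in simulated_lrts if value >= observed_lrt)
    return n_extreme / n
```

With 10,000 simulated alignments, the smallest non-zero P-value that this estimator can return is 0.0001.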
Positively selected functional categories were identified by first assigning genes according to the PANTHER Classification System version 10 (29). For each category with more than 10 genes, the difference between the cumulative distributions of P for genes in the category, versus that of genes not in that category, was tested using a one-tailed Mann-Whitney U (MWU) test (30, 31).
Gene expression. We examined genes with morph-biased (TS≠WS) expression and allele-specific expression (ZAL2≠ZAL2m) patterns. To account for potential mapping bias towards the reference (ZAL2/ZAL2) genome caused by mismatches between ZAL2 and ZAL2m, we N-masked (32) putatively fixed differences in the reference. We additionally checked for residual bias by aligning whole-genome sequences from three white birds described by Tuttle et al. (2) to this N-masked genome. ZAL2 and ZAL2m alleles should have roughly equal coverage per site if mapping bias has been eliminated. Indeed, for all three white birds, we did not observe significant coverage bias towards the ZAL2 allele (see ‘Examining mapping bias in the N-masked reference genome’ and Fig. S7). We mapped the aforementioned RNA-seq data from nine tan and 10 white individuals (3) to the N-masked genome with STAR 2.4.1d under the 2-pass mode (33). Only uniquely mapped reads were retained for further differential expression analysis. SNPsplit 0.3.3 (32) was run to assign reads to ZAL2 or ZAL2m for the white samples and to filter out reads without fixed differences in the tan samples. Read counts per gene at the morph and allele level were calculated by htseq-count 0.9.1 (34) with ‘-s no -m intersection-nonempty’. To detect morph-biased expression, we calculated size factors, normalized libraries with these factors, and then identified differential expression with ‘design = ~ morph’ in DESeq2 1.12.3 (4).
To detect allele-specific expression, we normalized libraries with the size factors generated in the previous step, and identified differential expression with ‘design = ~ allele + sample’ in DESeq2. Only genes with average expression levels (‘baseMean’ in the DESeq2 output) higher than 5 at the morph level were retained for later analysis (809 genes for the hypothalamus and 806 for nucleus taeniae). All differential expression data can be accessed through Figshare DOI: http://dx.doi.org/10.6084/m9.figshare.5715064.
De novo assembly of the super-white genome, whole-genome alignment, and detection of gene deletion. Paired-end sequences from the super-white bird were first trimmed by PRINSEQ 0.20.4 (35) and assembled by ABySS 1.5.2 (36). The final assembly has a contig N50 of 26,601 bp and a scaffold N50 of 32,876 bp. The total assembly size is 1.01 Gb, which is close to the estimated genome size of the white-throated sparrow (2). This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession PKOH00000000. The version described in this paper is version PKOH01000000. We aligned the newly generated super-white assembly to the tan reference genome using LASTZ with the aforementioned parameters. We found no evidence of deletion of exons and/or large (>50 bp) indels.
Annotation of repetitive sequences. We used both de novo and homology-based approaches to annotate repetitive sequences in the tan and super-white assemblies. First, both genomes were annotated with RepeatModeler 1.0.8 (37). The generated de novo library was merged with the avian RepeatMasker library (20150807 version) (38). Then, RepeatMasker 4.0.6 was used to annotate both genomes on the basis of homology with repeats in the merged library (parameters: -xsmall -s -nolow -norna -nocut). According to these analyses, the identified repeat content was highly similar on ZAL2 and ZAL2m, 7.11% and 7.38% respectively (Table S5). However, since the super-white genome assembly is more fragmented than the reference, the repeat content might be underestimated on ZAL2m.
Examining mapping bias in the N-masked reference genome. We examined potential residual mapping bias by aligning whole-genome sequences of three white birds (Sample IDs: 10_083, 10_092 and 10_093) published by Tuttle et al. (2) to the N-masked genome using HISAT2 2.1.0 (39) (parameters: --no-spliced-alignment --sp 1000,1000). Reads were assigned to the ZAL2 or ZAL2m chromosome by SNPsplit 0.3.3 (32) on the basis of putative fixed differences, and bedtools genomecov (40) was used to count per-base coverage on ZAL2 and ZAL2m, respectively. Across putative fixed differences, we observed roughly equal per-base coverage between ZAL2 and ZAL2m (Fig. S7), suggesting that mapping bias was largely eliminated by the SNP N-masking approach.
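The idea behind N-masking is that replacing each putatively fixed difference in the reference with ‘N’ makes reads carrying either allele mismatch the reference equally at that site. The sketch below shows only this masking operation on a single sequence; it is not the pipeline used in this study. The function name is hypothetical and positions are assumed to be 0-based.

```python
def n_mask(sequence, positions):
    """Return a copy of the sequence with 'N' at each given 0-based position.

    sequence: reference sequence of one scaffold as a string.
    positions: 0-based coordinates of putatively fixed differences
               (assumption: coordinates have already been converted from
               the 1-based convention used in VCF files).
    """
    bases = list(sequence)
    for pos in positions:
        bases[pos] = "N"
    return "".join(bases)
```

For example, masking positions 1 and 4 of `ACGTACGT` gives `ANGTNCGT`.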
SI Figures
[Figure S1. Panels: (A) dXY, (B) FST, (C) Indel frequency; x-axis categories: Inside, Outside, Genome.]
Fig. S1. A similar divergence pattern is found using a less conservative approach. Fig. 1 of the main text shows the patterns of divergence that were revealed using only scaffolds that are ‘confidently inside’ and ‘confidently outside’ the ZAL2m rearrangement. This figure shows that adding all ‘likely inside’ and ‘likely outside’ scaffolds does not change the pattern. (A) Pairwise nucleotide divergences (dXY), (B) degrees of population differentiation (FST), and (C) indel frequencies, all measured in 10-kb non-overlapping windows, were significantly higher in scaffolds within the rearrangement than in those outside the rearrangement (Mann–Whitney U test, ***:P < 0.001; NS: not significant).
Fig. S2. Alternative scenario for dosage compensation in Z. albicollis. Fig. 3 in the main paper shows a potential scenario for dosage compensation when dosage imbalance is caused by down-regulation of ZAL2m alleles. When dosage imbalance is instead caused by up-regulation of ZAL2m alleles, we propose the following alternative scenario. A) Initially, expression dosage (shown as black waves) is similar between ZAL2 and ZAL2m and between tan and white birds. B) Expression of the ZAL2m allele may increase due to mis-regulation. Consequently, heterozygous (white) individuals should show increased expression. C) Dosage may be re-balanced, to be similar in the two morphs, via under-expression of the ZAL2 (non-degenerated) allele in white birds. Consequently, expression of the ZAL2 allele should be greater in tan than white birds. To test this prediction, we examined the ratio of ZAL2 in tan to ZAL2 in white birds. D-E) Levels of compensation (tan-ZAL2/white-ZAL2) were significantly elevated for tan≈white genes compared with tanZAL2m), those with relatively similar expression in tan and white birds (tan≈white) exhibit significantly higher expression than that of the background genes. The definitions of ZAL2>ZAL2m and tan≈white genes are based on A-B) P values and C-D) FDR-corrected Q values from DESeq2 (Table 1). The Y-axis represents ‘baseMean’ from the DESeq2 output, which is essentially the mean of normalized counts of all samples. Mann-Whitney U test, ***:P < 0.001.
Fig. S4. Potentially dosage-compensated ZAL2m>ZAL2 genes are more highly expressed than background genes in nucleus taeniae. In the nucleus taeniae, but not in the hypothalamus, the average expression (irrespective of morph) of genes with higher ZAL2m than ZAL2 expression (ZAL2m>ZAL2) and relatively similar expression in tan and white birds (tan≈white) is significantly higher than that of the background genes (i.e., genes that do not exhibit allele-specific or morph-biased expression patterns). The definitions of ZAL2m>ZAL2 and tan≈white genes are based on A-B) P values and C-D) FDR-corrected Q values from DESeq2 (Table 1). The Y-axis represents ‘baseMean’ from the DESeq2 output which is essentially the mean of normalized counts of all samples. Mann-Whitney U test, *:P < 0.05; **:P < 0.01; NS: not significant.
Fig. S5. Evidence for dosage compensation is robust in the set of genes defined by FDR-corrected Q values. A-B) For ZAL2>ZAL2m genes, the ratio of white-ZAL2 expression to tan-ZAL2 expression is significantly elevated for tan≈white genes compared with tan>white genes. C-D) For ZAL2m>ZAL2 genes, the ratio of tan-ZAL2 expression to white-ZAL2 expression is significantly elevated for tan≈white genes compared with tan 0.05).
Fig. S8. Distribution of percent coverage (Coverage %) for scaffolds in the tan reference genome. Coverage % was calculated as the length of the region that could be mapped to the TGU3 chromosome, divided by the total length of that scaffold. The cutoff for coverage % by our criteria is shown as the red dashed line. Only scaffolds >10kbps are included.
Fig. S9. Bimodal patterns of divergence for pairwise nucleotide divergence and degrees of population differentiation. The graphs show the distribution of average pairwise nucleotide divergence (dXY) between ZAL2 and ZAL2m chromosomes (left panel) and degrees of population differentiation (FST) between white and tan samples over ZAL2 scaffolds (right panel). Cutoffs to distinguish between ‘Low’ and ‘High’ (as well as ‘Intermediate’ in the case of FST) of the two measures were determined by the clear divisions in the distributions. Thus, low dXY is in the range of [0.00080, 0.00588], and high dXY corresponds to [0.00968, 0.01431]. Similarly, low FST is in the range of [0.00448, 0.01420], intermediate FST is [0.11821, 0.22853], and high FST is [0.28586, 0.45355].
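The ranges listed in this legend can be applied to a scaffold-level value as in the sketch below. This is only an illustration: the ranges are the observed bounds quoted above rather than formal thresholds, so a value that falls in a gap between ranges is returned as ‘Unassigned’, and a missing value is returned as ‘NA’. The function and constant names are hypothetical.

```python
# Observed ranges quoted in the legend of Fig. S9.
DXY_RANGES = {"Low": (0.00080, 0.00588), "High": (0.00968, 0.01431)}
FST_RANGES = {
    "Low": (0.00448, 0.01420),
    "Intermediate": (0.11821, 0.22853),
    "High": (0.28586, 0.45355),
}


def categorize(value, ranges):
    """Return the label of the range containing value.

    Returns 'NA' for a missing value (None) and 'Unassigned' for a value
    that lies outside every listed range.
    """
    if value is None:
        return "NA"
    for label, (lower, upper) in ranges.items():
        if lower <= value <= upper:
            return label
    return "Unassigned"
```

For instance, a dXY of 0.01125 is labelled ‘High’ and an FST of 0.21094 is labelled ‘Intermediate’.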
SI Tables
Table S1. Designation of scaffolds inside or outside the inversion.
Scaffold | Length (bp) | dXY | FST | Presence of fixed differences | Inversion | Extra evidence for being outside the inversion
NW_005081548.1 | 11927161 | 0.01125 | 0.40361 | yes | Inside |
NW_005081553.1 | 10178608 | 0.01111 | 0.21094 | yes | Likely Inside |
NW_005081561.1 | 7587459 | 0.01111 | 0.39278 | yes | Inside |
NW_005081569.1 | 6428186 | 0.01231 | 0.34407 | yes | Inside |
NW_005081574.1 | 5690812 | 0.01261 | 0.40039 | yes | Inside |
NW_005081577.1 | 5965230 | 0.01153 | 0.33666 | yes | Inside |
NW_005081582.1 | 4964275 | 0.01265 | 0.36901 | yes | Inside |
NW_005081589.1 | 4458466 | 0.01243 | 0.40993 | yes | Inside |
NW_005081591.1 | 4515391 | 0.0118 | 0.38121 | yes | Inside |
NW_005081596.1 | 4339766 | 0.01182 | 0.40298 | yes | Inside |
NW_005081602.1 | 4246152 | 0.01356 | 0.35859 | yes | Inside |
NW_005081609.1 | 3616420 | 0.01415 | 0.37739 | yes | Inside |
NW_005081611.1 | 3581226 | 0.01233 | 0.41416 | yes | Inside |
NW_005081615.1 | 3504361 | 0.00968 | 0.42848 | yes | Inside |
NW_005081620.1 | 3189763 | 0.01130 | 0.39955 | yes | Inside |
NW_005081621.1 | 3168318 | 0.01273 | 0.31734 | yes | Inside |
NW_005081632.1 | 2520776 | 0.01176 | 0.38496 | yes | Inside |
NW_005081635.1 | 2503139 | 0.00222 | 0.00639 | no | Outside | Supported by (2, 8, 9)
NW_005081642.1 | 2153052 | 0.01251 | 0.33943 | yes | Inside |
NW_005081653.1 | 2022704 | 0.01198 | 0.40088 | yes | Inside |
NW_005081654.1 | 1930009 | 0.01194 | 0.28586 | yes | Likely Inside |
NW_005081662.1 | 1786786 | 0.01343 | 0.42324 | yes | Inside |
NW_005081697.1 | 1189320 | 0.01268 | 0.36119 | yes | Inside |
NW_005081699.1 | 1172919 | 0.01334 | 0.33628 | yes | Inside |
NW_005081708.1 | 1082959 | 0.00080 | 0.01420 | no | Outside | Supported by (2, 9)
NW_005081720.1 | 1134561 | 0.00381 | 0.00616 | no | Outside | Supported by homology of chicken, zebra finch, great tit and flycatcher
NW_005081729.1 | 913800 | 0.01168 | 0.37611 | yes | Inside |
NW_005081742.1 | 825557 | 0.01062 | 0.44194 | yes | Inside |
NW_005081746.1 | 758667 | 0.01175 | 0.45355 | yes | Inside |
NW_005081754.1 | 810001 | 0.01220 | NA | no | Likely Inside |
NW_005081771.1 | 578351 | 0.01431 | NA | no | Likely Inside |
NW_005081821.1 | 411424 | 0.00473 | 0.34085 | yes | Likely Inside |
NW_005081827.1 | 355097 | 0.01366 | NA | no | Likely Inside |
NW_005081831.1 | 438198 | 0.00327 | 0.00448 | no | Likely Outside |
NW_005081832.1 | 349547 | 0.00588 | 0.30376 | yes | Likely Inside |
NW_005081844.1 | 292358 | 0.01314 | 0.22853 | yes | Likely Inside |
NW_005081876.1 | 266586 | 0.01127 | 0.22601 | yes | Likely Inside |
NW_005081883.1 | 224620 | 0.00583 | 0.11821 | yes | Likely Inside |
NW_005081958.1 | 224289 | 0.00358 | 0.03357 | no | Likely Outside |
NW_005082055.1 | 43098 | 0.00527 | NA | no | Unknown |
NW_005082170.1 | 28300 | 0.00573 | NA | no | Unknown |
NW_005082187.1 | 36427 | 0.00344 | 0.21837 | yes | Likely Inside |
NW_005082848.1 | 7948 | NA | NA | no | Unknown |
NW_005082865.1 | 18815 | 0.00090 | NA | no | Unknown |
NW_005083054.1 | 6122 | NA | NA | no | Unknown |
NW_005083097.1 | 5868 | NA | NA | no | Unknown |
NW_005083174.1 | 5400 | NA | NA | no | Unknown |
NW_005083671.1 | 3948 | NA | NA | no | Unknown |
NW_005083866.1 | 5132 | NA | NA | no | Unknown |
NW_005083989.1 | 2649 | NA | NA | no | Unknown |
NW_005084012.1 | 2613 | NA | 0.14742 | no | Unknown |
NW_005084703.1 | 1690 | NA | NA | no | Unknown |
NW_005085200.1 | 1515 | NA | NA | no | Unknown |
NW_005085751.1 | 1335 | NA | NA | no | Unknown |
NW_005085851.1 | 1254 | NA | NA | no | Unknown |
NW_005086488.1 | 1178 | NA | NA | no | Unknown |
Table S2. Genes with disrupted ORFs (open reading frames).
Protein accession | Gene symbol | Mutation type | Protein length | Mutation position | Mutation
XP_005483072.1 | CENPF | Stop gain | 2937 | 1934 | Q/*
XP_005483141.1 | PREPL | Stop lost | 729 | 729 | */E
XP_005484012.1 | CASP8AP2 | Stop gain | 2002 | 652 | Q/*
XP_005485183.2 | KIAA1919 | Stop gain | 569 | 565 | W/*
XP_005485996.1 | PGM3 | Stop lost | 543 | 543 | */S
XP_005486070.1 | HMGN3 | Start lost | 96 | 1 | M/I
XP_005486910.1 | ACYP2 | Stop gain | 99 (incomplete annotation) | 16 | R/*
XP_005487649.1 | GGPS1 | Start lost | 299 | 1 | M/T
XP_005488567.1 | SYNDIG1 | Start lost | 270 | 1 | M/T
XP_005489463.1 | PROKR1 | Start lost | 395 | 1 | M/I
XP_005489744.1 | HSP90AB1 | Stop gain | 737 | 2 | Y/*
XP_005489744.1 | HSP90AB1 | Start lost | 737 | 1 | M/I
XP_005493909.1 | LOC102067625 | Stop gain | 1283 | 1028 | W/*
XP_014119811.1 | LOC102071579 | Stop gain | 511 | 250 | R/*
XP_014120380.1 | EYS | Stop gain | 1379 | 106 | K/*
XP_014121156.1 | TRAF3IP2 | Stop gain | 576 | 17 | W/*
XP_014121156.1 | TRAF3IP2 | Start lost | 576 | 1 | M/V
XP_014121760.1 | LOC102064740 | Stop gain | 378 | 21 | Q/*
XP_014121813.1 | CEP162 | Stop gain | 1419 | 469 | R/*
XP_014122104.1 | LOC106629365 | Stop lost | 1078 | 1078 | */W
XP_014122157.1 | PARK2 | Stop gain | 515 | 4 | Q/*
XP_014122326.1 | LOC106629373 | Stop lost | 266 | 266 | */Q
XP_014122352.1 | CFAP61 | Stop gain | 1162 | 1089 | Q/*
XP_014122355.1 | LOC106629377 | Stop gain | 145 | 70 | W/*
XP_014122639.1 | LOC106629392 | Start lost | 308 | 1 | M/T
XP_014122653.1 | PQLC3 | Stop lost | 166 | 166 | */W
XP_014123260.1 | MYB | Start lost | 795 | 1 | M/V
XP_014123293.1 | FNDC1 | Stop gain | 1696 | 11 | W/*
XP_014124168.1 | LOC102065471 | Start lost | 548 | 1 | M/V
XP_014127383.1 | LOC106629801 | Stop gain | 362 | 160 | R/*
Table S3. Genes with signatures of positive selection that are detected using a branch-site model.
Foreground branches: ZAL2, ZAL2m
Gene | LRT statistic | Psim value (Qsim value)1 | PK2a+PK2b (%)2 | dN/dS | Positively selected sites (BEB probability3)
DIEXF | 19.159 | 0 (0) | 0.415 | >10 | 311S (0.876), 336G (0.841), 314R (0.834)
LGALSL | 3.424 | 0 (0) | 6.768 | 2.118 | -
SLC35F3 | 7.994 | 0.0006 (0.16) | 0.335 | >10 | 26N (0.979)
AKAP12 | 14.699 | 0 (0) | 0.222 | >10 | 1912P (0.883)
DISC1 | 11.121 | 0 (0) | 0.252 | >10 | 469L (0.959)
All genes with FDR-adjusted Q < 0.2 are shown. Positions of positively selected sites are based on the longest transcripts.
1 P and FDR-adjusted Q values were calculated using a simulation method (SI Appendix) with 10,000 replications.
2 Proportion of K2a and K2b sites; specifically, K2a sites are those under positive selection (dN/dS ≥ 1) on the foreground branch and under purifying selection (dN/dS < 1) on background branches, and K2b sites are those under positive selection (dN/dS ≥ 1) on the foreground branch and under neutral evolution (dN/dS = 1) on background branches.
3 Probability of a site being positively selected was estimated by the Bayes Empirical Bayes (BEB) method (41). Only sites with BEB probability > 0.8 are shown.
Table S4. Candidate functional categories (biological process and molecular function) under positive selection. The cumulative distribution of P values of each functional category was compared with that of genomic background using Mann-Whitney U tests. Categories with FDR-adjusted Q < 0.1 are shown. For testing signatures of positive selection, see Materials and Methods and SI Appendix.
Chromosome | Ontology | Functional Category | # of Assigned Genes | PMWU | FDR-Adjusted QMWU
ZAL2 | Biological Process | rRNA processing (GO:0006364) | 14 | 1.48 × 10^-5 | 0.0924
ZAL2 | Biological Process | response to drug (GO:0042493) | 13 | 0.0039 | 0.0681
ZAL2 | Biological Process | cilium assembly (GO:0042384) | 13 | 0.0039 | 0.0681
ZAL2m | Biological Process | Wnt signaling pathway (GO:0016055) | 10 | 3.18 × 10^-11 | 2.80 × 10^-9
ZAL2m | Biological Process | mitotic anaphase (GO:0000090) | 11 | 0.0016 | 0.0468
ZAL2m | Biological Process | positive regulation of neuron projection development (GO:0010976) | 10 | 0.0009 | 0.0381
ZAL2m | Molecular Function | ATPase activity (GO:0016887) | 12 | 0.0027 | 0.0511
ZAL2m | Molecular Function | protein domain specific binding (GO:0019904) | 12 | 0.0026 | 0.0511
ZAL2m | Molecular Function | protein C-terminus binding (GO:0008022) | 11 | 0.0017 | 0.0511
ZAL2m | Molecular Function | receptor binding (GO:0005102) | 15 | 0.0078 | 0.0879
ZAL2m | Molecular Function | protein complex binding (GO:0032403) | 13 | 0.0037 | 0.0524
Table S5. RepeatMasker (38) annotation of interspersed repeat content on ZAL2 and ZAL2m.
Repeat class | Repeat type | Sequence% (ZAL2) | Sequence% (ZAL2m)
SINEs | Total | 0.08 | 0.08
SINEs | ALUs | 0 | 0
SINEs | MIRs | 0.04 | 0.04
LINEs | Total | 3.77 | 4.00
LINEs | LINE1 | 0 | 0
LINEs | LINE2 | 0.04 | 0.05
LINEs | L3/CR1 | 3.72 | 3.93
LTR elements | Total | 2.23 | 2.28
LTR elements | ERVL | 1.28 | 1.49
LTR elements | ERVL-MaLRs | 0 | 0
LTR elements | ERVL-classI | 0.17 | 0.22
LTR elements | ERVL-classII | 0.76 | 0.56
DNA elements | Total | 0.37 | 0.37
Unclassified | Total | 0.66 | 0.65
Total interspersed repeats | Total | 7.11 | 7.38
SI References
1. Horton BM, et al. (2013) Behavioral characterization of a white-throated sparrow homozygous for the ZAL2m chromosomal rearrangement. Behavior genetics 43(1):60-70.
2. Tuttle EM, et al. (2016) Divergence and functional degradation of a sex chromosome-like supergene. Curr Biol 26(3):344-350.
3. Zinzow-Kramer WM, et al. (2015) Genes located in a chromosomal inversion are correlated with territorial song in white-throated sparrows. Genes Brain Behav 14(8):641-654.
4. Love MI, Huber W, & Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12).
5. Harris RS (2007) Improved pairwise alignment of genomic DNA. Doctoral thesis (The Pennsylvania State University).
6. Zhou Q, et al. (2014) Complex evolutionary trajectories of sex chromosomes across bird taxa. Science 346(6215):1246338.
7. Warren WC, et al. (2010) The genome of a songbird. Nature 464(7289):757-762.
8. Thomas JW, et al. (2008) The chromosomal polymorphism linked to variation in social behavior in the white-throated sparrow (Zonotrichia albicollis) is a complex rearrangement and suppressor of recombination. Genetics 179(3):1455-1468.
9. Davis JK, et al. (2011) Haplotype-based genomic sequencing of a chromosomal polymorphism in the white-throated sparrow (Zonotrichia albicollis). J Hered 102(4):380-390.
10. Ellegren H, et al. (2012) The genomic landscape of species divergence in Ficedula flycatchers. Nature 491(7426):756-760.
11. Laine VN, et al. (2016) Evolutionary signals of selection on cognition from the great tit genome and methylome. Nat Commun 7:10474.
12. Shetty S, Griffin DK, & Graves JA (1999) Comparative painting reveals strong chromosome homology over 80 million years of bird evolution. Chromosome Res 7(4):289-295.
13. McKenna A, et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20(9):1297-1303.
14. Van der Auwera GA, et al. (2013) From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 43:11.10.1-11.10.33.
15. DePristo MA, et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43(5):491-498.
16. Li H & Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26(5):589-595.
17. Jukes TH & Cantor CR (1969) Evolution of protein molecules. Mammalian Protein Metabolism, ed Munro HN (Academic Press), pp 21-132.
18. Prum RO, et al. (2015) A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing. Nature 526(7574):569-U247.
19. Jetz W, Thomas GH, Joy JB, Hartmann K, & Mooers AO (2012) The global diversity of birds in space and time. Nature 491(7424):444-448.
20. Jetz W, et al. (2014) Global distribution and conservation of evolutionary distinctness in birds. Curr Biol 24(9):919-930.
21. Jarvis ED, et al. (2014) Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346(6215):1320-1331.
22. Katoh K & Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772-780.
23. Capella-Gutierrez S, Silla-Martinez JM, & Gabaldon T (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972-1973.
24. Suyama M, Torrents D, & Bork P (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res 34(Web Server issue):W609-612.
25. Yang ZH (2007) PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24(8):1586-1591.
26. Zhang J (2000) Rates of conservative and radical nonsynonymous nucleotide substitutions in mammalian nuclear genes. Journal of Molecular Evolution 50(1):56-68.
27. Nielsen R, et al. (2005) A scan for positively selected genes in the genomes of humans and chimpanzees. Plos Biol 3(6):e170.
28. Kumar S, Stecher G, & Tamura K (2016) MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol Biol Evol 33(7):1870-1874.
29. Mi H, Poudel S, Muruganujan A, Casagrande JT, & Thomas PD (2016) PANTHER version 10: expanded protein families and functions, and analysis tools. Nucleic Acids Res 44(D1):D336-342.
30. Clark AG, et al. (2003) Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302:1960-1963.
31. Haygood R, Fedrigo O, Hanson B, Yokoyama KD, & Wray GA (2007) Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat Genet 39(9):1140-1144.
32. Krueger F & Andrews SR (2016) SNPsplit: allele-specific splitting of alignments between genomes with known SNP genotypes. F1000Res 5:1479.
33. Dobin A, et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15-21.
34. Anders S, Pyl PT, & Huber W (2015) HTSeq - a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166-169.
35. Schmieder R & Edwards R (2011) Quality control and preprocessing of metagenomic datasets. Bioinformatics 27(6):863-864.
36. Simpson JT, et al. (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19(6):1117-1123.
37. Smit AFA & Hubley R (2008-2015) RepeatModeler Open-1.0.
38. Smit AFA, Hubley R, & Green P (2013-2015) RepeatMasker Open-4.0.
39. Kim D, Langmead B, & Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12(4):357-360.
40. Quinlan AR & Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841-842.
41. Yang ZH, Wong WSW, & Nielsen R (2005) Bayes empirical Bayes inference of amino acid sites under positive selection. Molecular Biology and Evolution 22(4):1107-1118.