
www.sciencemag.org/cgi/content/full/344/6182/380/DC1

Supplementary Materials for
Genome Sequence of the Tsetse Fly (Glossina morsitans): Vector of African Trypanosomiasis
International Glossina Genome Initiative*
*Corresponding author. E-mail: [email protected] (Serap Aksoy); [email protected] (Geoffrey Attardo); [email protected] (Matthew Berriman)
Published 25 April 2014, Science 344, 380 (2014)
DOI: 10.1126/science.1249656

This PDF file includes:
Materials and Methods
Supplementary Text
Figs. S1 to S9
References

Other Supplementary Materials for this manuscript include the following (available at www.sciencemag.org/cgi/content/full/344/6182/380/DC):
Tables S1 to S43 as Excel files

Materials and Methods

Fly maintenance
The Glossina morsitans morsitans colony maintained in the insectary at Yale University was originally established with puparia from fly populations in Zimbabwe. Newly emerged flies are separated by sex and mated at three to four days post-eclosion. Flies are maintained at 24±1°C and 50-55% relative humidity, and receive defibrinated bovine blood every 48 h using an artificial membrane system (39). This line of tsetse is used for research in most labs working on tsetse.

Preparation of genomic DNA
DNA was prepared from several lines, each consisting of a female and her adult female progeny. Mothers were maintained on a blood diet supplemented with 25 micrograms of tetracycline, to remove all symbiotic bacteria and vitamin metabolites. Fertility was maintained in these flies following the yeast supplementation method developed in Pais et al. (40). This dietary protocol could only partially rescue tetracycline-induced sterility, and the majority of lines yielded only one to two progeny, which necessitated the use of multiple lines for genomic material.

Library preparation and genome sequencing
Approximately 50 pupae were used to produce a BAC library (Library number VMRC-29, available via bacpac.chori.org) for end sequencing. A further 15 tetracycline-treated flies were used for whole genome shotgun sequencing (Table S1). To minimize polymorphism within the assembly, the number of individual flies was kept to a minimum and a combination of whole-genome amplified (using Genomiphi v2 kit, Illustra) and non-amplified material was used for sequencing (Table S1). Preparation of Sanger-sequencing libraries was performed at the Wellcome Trust Sanger Institute using in-house protocols available upon request. From small insert plasmid (pOT12 and pMAQ) and fosmid (pCCfos) libraries, 2.86 million and 20,000 reads were produced, respectively, using ABI BigDye version 3.1 and standard primers, and analyzed on an ABI 3730 Capillary DNA Analyser.
A BAC library was prepared from agarose plugs of immobilized tsetse fly DNA using a CopyControl BAC Cloning Kit (Epicentre Biotechnologies, Madison, WI) with the size-fractionated DNA fragments (up to 100 kb) following methods outlined in (41). The protocols used for DNA extraction, preparation of agarose plugs, size fractionation and BAC library construction have been described previously (42). End-sequences were produced from approximately 20,000 clones. Libraries for 454 and Illumina sequencing were produced according to the manufacturers’ protocols with the exception of one PCR-free library made from a single female larva. Illumina sequencing was performed for 108 cycles with both the forward and reverse sequencing primers.

Sequence assembly
The genome (366 Mb) was assembled using the Celera assembler from a combination of 454 and Sanger reads and screened for quality and contamination (Table S2). Using 20.6 Gb of Illumina reads from a PCR-free small fragment (~300 bp) library and 9 iterations of IMAGE (43), approximately 20,000 gaps were closed. The initial assembly was termed version 1 and used for subsequent gene finding and annotation. Within the assembly, 50% of the bases were assembled into 570 scaffolds of at least 120 kb (full statistics provided in Table S3).
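The statistic quoted above (50% of the bases in 570 scaffolds of at least 120 kb) is an N50-style summary, and it can be reproduced from a list of scaffold lengths. The following is a minimal Python sketch and not the procedure used for the assembly itself; the function name `n50` and the toy lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return (N50 length, number of scaffolds needed to reach half the bases).

    Scaffolds are sorted from longest to shortest and accumulated until the
    running total reaches at least 50% of all assembled bases.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy scaffold lengths (hypothetical values, not taken from Table S3).
toy_lengths = [500, 400, 300, 200, 100]
n50_length, n50_count = n50(toy_lengths)
print(n50_length, n50_count)
```

For the toy input, half of the 1,500 total bases is reached after the two longest scaffolds, so the function reports a length of 400 and a count of 2. Applied to the version 1 scaffolds, the same calculation would be expected to give the 120 kb and 570 scaffolds reported in the text.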

To assess the completeness of the version 1 assembly, two approaches were taken. First, the software CEGMA (Core Eukaryotic Genes Mapping Approach), which searches for conserved single-copy eukaryotic orthologs, was used to determine that the genome sequence appears 99% complete. Of the 248 test genes, 246 could be found as complete genes and 247 could be found using a more permissive setting that allows gene fragments to be predicted. Second, the sequences of 3 BACs that had previously been completely manually sequenced aligned in their entirety with near-perfect identity.

To improve the scaffolding of the version 1 assembly, two Illumina Hi-Seq mate-pair libraries with an insert of approximately 1.6 kb were generated from female larval DNA (100 bp reads, giving a total of 60 Gb, or 160x coverage). In order to conserve annotation based on the version 1 assembly, the version 1 scaffolds were used as input into the SOAPdenovo scaffolder (44) with the HiSeq libraries to generate “superscaffolds”. As a result, the N50 of the original annotated assembly approximately tripled in size, to 386 kb, and the total number of scaffolds dropped by over 40%, from 13,807 to 7,886. In a minority of cases SOAPdenovo merged some of the original scaffolds, with the result that redundant sequence was lost, although the total size of the genome remained much the same. These superscaffolds, termed assembly version 2, were used for analysis of synteny.

Transcriptomic resources utilized for gene prediction
Transcriptome data from the midgut (NCBI accession #: PRJNA194), fat body (NCBI BioSample accession #: LIBEST_018342), salivary gland (NCBI accession #: PRJNA40877) and reproductive tissues (NCBI accession #: PRJNA43601) were used for gene predictions. In addition, Illumina sequencing data from two libraries created from lactating and non-lactating (post parturition) whole female tsetse were mapped to the genomic scaffolds to assist in gene predictions.
Library construction was performed using standard protocols for Illumina mRNASeq sequencing by the W. M. Keck Foundation Microarray Resource at the Yale School of Medicine as described in detail in the satellite paper on the tsetse pregnancy transcriptome (10).

Gene prediction and functional annotation
Genome annotation was performed using a modified version of the Ensembl gene prediction pipeline (45) and aggregation of predictions from multiple sources using the MAKER software (46). The annotation process was broken down into the following phases. De novo repeat sequences were identified using RepeatScout (47) and RECON (48). These were supplemented with publicly available repeat sequences from GenBank and mapped to the genome assembly using RepeatMasker (49). The first instance of training of the ab initio gene prediction programs SNAP (50) and Augustus (51) was performed using full-length cDNA sequences from EST/transcriptome projects. Subsequent rounds of re-training were based on the output of the prediction programs themselves. Similarity-based gene predictions were generated using Exonerate alignments of EST sequences (52), alignment and merging of transcriptional fragments (transfrags) from RNAseq datasets using TopHat and Bowtie (53), and protein-based predictions using the GeneWise tools (54). Gene predictions from both the ab initio and similarity approaches were aggregated into a final set using three rounds of the MAKER software. The first two rounds were designed to iteratively improve the training of the ab initio gene predictions, and a final round was informed with protein similarities to all metazoan sequences in the nr protein database to guide the final

predictions. The resulting gene set (GmorY1.0) contained 12,362 protein-coding genes, which formed the basis for community-led prediction appraisal and improvement (see Manual annotation section).

Peptidase annotation
The entire Glossina proteome was submitted to the MEROPS batch Blast (55) where it was compared to a reduced set of peptidase and peptidase inhibitor sequences. A hit was only considered a homologue if the e-value was less than 1e-10, or if the hit appeared to be a fragment, i.e. it did not overlap all of the known active site residues in the peptidase.

Manual annotation
Genes were annotated by multiple research groups. Annotations were performed primarily by homology-based BLAST searches (56) with characterized dipteran and insect gene sequences against predicted gene sequences and raw genomic scaffold sequences to identify genes of interest. Glossina sequence data and annotation data were obtained from the Community Annotation Portal at VectorBase (57). Gene model corrections and renaming were performed with the Artemis (58) sequence viewing and annotation software. The manual and automatic annotations were then merged into a single gene set with community-assessed models taking precedence over automatically generated ones. This final set, GmorY1.2, contains 12,308 protein-coding genes; VectorBase is responsible for hosting it and for undertaking future updates.

Orthology Analysis
Orthologue detection was performed using peptide sequences from VectorBase: Aedes-aegypti-Liverpool_PEPTIDES_AaegL1.3, Anopheles-gambiae-PEST_PEPTIDES_AgamP3.7, Culex-quinquefasciatus_Johannesburg_PEPTIDES_CpipJ1.3, Phlebotomus-papatasi-Israel_PEPTIDES_PpapI1.0; FlyBase: Drosophila_melanogaster_r5.53; and Glossina morsitans. This study was performed using OrthoMCL 1.4 with default settings (59). The resulting clusters were parsed using custom Perl scripts to produce groups of genes shared between species.

Synteny analysis of the D. melanogaster and G. morsitans genomes
Pairwise whole genome alignments were performed using the SatsumaSynteny program from the Spines software package (60). From these alignments, pairwise homologous synteny blocks (HSBs) were automatically defined using the SyntenyTracker software (61). Briefly, the SyntenyTracker software defines an HSB as a set of two or more consecutive orthologous markers in homologous regions of the two genomes, such that no other HSB can be defined within the region bordered by these markers. There are two exceptions to this rule: the first involves single orthologous markers that are not otherwise defined within HSBs; and the second involves two consecutive singleton markers separated by a distance less than the resolution threshold (either 10 kb or 100 kb for this analysis). A further analysis was performed using seven other insect genomes with D. melanogaster as the reference genome instead of Glossina. Results of this analysis were uploaded to the Evolution Highway comparative chromosome browser (http://eh-demo.ncsa.uiuc.edu/Drosophila). To view these results, first load the URL. Then in “Select data source” click on Drosophila; next tick a box under “Reference Genome”, e.g. “dm3:100” for 100 kb syntenic block resolution of the D. melanogaster reference genome; in “Reference Chromosome” tick one or more boxes to visualize alignments to a D. melanogaster chromosome, e.g. 2L and 2R; and in “Comparative

Genome” tick the boxes for “tsetse:100” and “tsetse_new” to view synteny for versions 1 and 2 of the genome, respectively; finally click on “Apply”. It may take a minute or so before the synteny visualizations appear. Zoom in and out using the controls just above the visualization window. Other insect genomes available under “Comparative Genome” include Apis mellifera (amel), Culex quinquefasciatus (cquin), Drosophila grimshawi (dgri), Anopheles gambiae (mosquito), Bombyx mori (silk2), Tribolium castaneum (tca), and Nasonia vitripennis (wasp). The software has been tested with the following browsers: Firefox 23.0.1, Opera 15, Google Chrome 28 and Internet Explorer 10 (it may be necessary to alter the browser's cache settings).

Enrichment of Gene Ontology annotations within synteny blocks
Genomic functions that are relatively stable over long evolutionary time scales may be revealed by analysing pairwise blocks of homologous synteny between D. melanogaster and Glossina for enrichment of different GO categories. To perform this analysis, Glossina scaffolds were divided into 10 kb windows and the number of genes from each Gene Ontology (GO) category (with more than 50 assigned genes in all Glossina scaffolds) was counted in each 10 kb window. Next, the average number of genes for each GO category was calculated separately for the 10 kb windows located within HSBs and for those in the remainder of the Glossina scaffolds, and the two averages were compared using a t-test with unequal variances.

Characterization of horizontal gene transfer events
Wolbachia: For the identification of Wolbachia-specific sequences, the complete genomes of wMel (AE017196), wRi (CP001391), wBm (AE017321), and wGmm (AWUH01000000) were used as reference sequences. Characterized chromosomal sequences were assembled de novo using a reference genome with MIRA and AMOS (62, 63).
Detailed methods and results can be found in the accompanying satellite paper (13).

Transposable Elements: Transposable elements with sequence similarities to known arthropod sequences were identified using a modified version of the TarGet algorithm (64). De novo sequences were identified using RepeatModeler (49). Sequences identified using automated methods were reviewed manually and, when possible, extended beyond the boundaries reported by the automated programs. Previously identified Glossina transposable element sequences from the sialotranscriptome were also incorporated and extended (16). MITE sequences were identified using the program MITE-Hunter (65). The percentage of the genome occupied by each family was determined using RepeatMasker (49).

Supplementary Text

Transporter Genes
To identify nutrient transporters, membrane transporters from Drosophila melanogaster, Aedes aegypti and Anopheles gambiae were assembled from membranetransport.org. We used these IDs to pull Glossina homologs for transporters from the predicted transcriptome. A total of 228 transporters were identified, of which 135 transcripts were assigned a putative type and function (Table S9). Gene models were corrected where appropriate. We identified 104 transcripts predicted to be from the Major Facilitator Superfamily (MFS) in Glossina. These include glycerol-3-phosphate, lipid, oxalate/formate, and acetyl-CoA transporters, members of the Sugar Porter (SP) subgroup (sugars and organic cations/carnitine substrates), the Oligosaccharide H+ symporter (OHS) subgroup, the H+ symporter (FGHS)

subgroup (fucose-galactose-glucose substrates) and the monocarboxylate porter (MP) subgroup (acetate, pyruvate, mevalonate substrates). Thirty-seven identified transporters did not have an identifiable substrate. The hypothetical lipid transporter is a possible ortholog of the Drosophila sterol carrier protein 2; however, it does not appear to have any transmembrane domains and is not likely to be membrane bound. The ATP-binding Cassette (ABC) Superfamily of transporters was represented by 53 predicted transcripts, with putative substrates predominantly related to multidrug resistance. Seven members of the Organo Anion Transporter (OAT) Family or Organic Anion-Transporting Polypeptide (OATP) were found. The Solute:Sodium Symporter (SSS) Family was represented by 12 members; this family typically imports multivitamin and monocarboxylate substrates. Nucleoside and sugar transport was represented by one member of the Concentrative Nucleoside Transporter (CNT) family, eight members of the Nucleoside Sugar Transporters (NST) family and one member of the Glycoside-Pentoside-Hexuronide (GPH):Cation Symporter family. Amino acid and peptide transport was represented by several families. We found 19 members of the Amino Acid/Auxin Permease (AAAP) family. The Amino Acid-Polyamine-Organocation (APC) family was represented by 11 members, including orthologs to Drosophila minidiscs, slimfast, and genderblind as well as orthologs of several members of the Cationic Amino Acid Transporter (CAT) and Heterodimeric Amino Acid Transporter (HAT) groups, which are important for blood meal activation of yolk protein synthesis in Aedes aegypti (66). Two P-proteins were identified that belong to the Arsenite-Antimonite (ArsB) Efflux family 3. One member of the Proton-dependent Oligopeptide Transporter (POT) family was identified.
The annotation of the Glossina genome also resulted in the identification of a single folate transporter (gene ID: GMOY005445) that was classified by the Pfam database (Accession PF01770.11) as a reduced folate carrier (RFC). This carrier is described as a major route for folate transport and a member of a large superfamily of transporters, each with distinct substrate specificity (67). We annotated a large number of nutrient transporters representing many different superfamilies of membrane transporters, based primarily on their homology to orthologous proteins in Aedes aegypti, Anopheles gambiae, and Drosophila and on protein domain prediction. More transporters likely exist within the genome and are yet to be annotated. Identification of putative substrates is predominantly based on similarities with transporters identified in other organisms. However, many have only been categorized in E. coli or other bacteria so far, and much more work is required to identify actual substrates and specific biological roles for these proteins.

Bracoviral Gene Insertions
Analysis of Glossina genes lacking orthologous Drosophila sequences revealed genes bearing high homology to bracovirus genes. Bracoviruses are members of the Polydnavirus family of insect viruses. These viruses are normally associated with parasitic braconid wasps, are synthesized within the wasp and are injected with wasp embryos into host insects. Injected virus particles function to suppress the host immune system to prevent phagocytosis and melanization of developing wasp embryos/larvae (68). Hits bearing high homology to bracoviral genes are spread throughout the genome, with 151 of the Glossina genomic scaffolds containing sequences with BLAST hits with e-values lower than 1E-50 to bracoviral sequences (Tables S11 and S12). The bracoviral sequences with the highest homology to those found in Glossina originate from the parasitic braconid wasp Glyptapanteles flavicoxis.
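The scaffold count above depends on an e-value cutoff applied to BLAST hits. As a minimal sketch of how such a filter might be written, the Python below collects the query identifiers that have at least one hit under the 1E-50 threshold. It assumes the standard 12-column tabular BLAST output (query ID in column 1, e-value in column 11) and that scaffolds were the queries; neither the output format nor the search direction is stated in the text, and the identifiers in the toy data are hypothetical.

```python
def scaffolds_below_evalue(blast_lines, max_evalue=1e-50):
    """Return the set of query IDs with at least one hit below max_evalue.

    Assumes standard 12-column tabular BLAST output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    scaffolds = set()
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        query_id = fields[0]
        evalue = float(fields[10])
        if evalue < max_evalue:
            scaffolds.add(query_id)
    return scaffolds


# Hypothetical hits for illustration only (not taken from Tables S11 or S12).
toy_hits = [
    "scaffold_A\tvirus_gene_1\t92.1\t300\t20\t1\t1\t300\t5\t304\t1e-80\t410",
    "scaffold_B\tvirus_gene_2\t60.0\t120\t40\t3\t10\t129\t2\t121\t1e-12\t95",
    "scaffold_C\tvirus_gene_1\t88.4\t250\t25\t2\t5\t254\t11\t260\t3e-55\t350",
]
strong = scaffolds_below_evalue(toy_hits)
print(sorted(strong))
```

In the toy data, scaffold_A and scaffold_C pass the cutoff while scaffold_B, with an e-value of 1e-12, does not.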

Glyptapanteles flavicoxis is associated with the parasitization of lepidopteran caterpillar hosts such as the gypsy moth (Lymantria dispar). The viral genes identified here suggest that tsetse is a target for parasitization by an unidentified braconid wasp. This aspect of Glossina’s biology remains unstudied and deserves further effort due to the potential for the development of biological control strategies.

Antioxidant response genes
Antioxidant proteins are associated with multiple physiological processes in insects, including aging, stress response and immunity. We identified 19 Glossina genes directly involved in oxidative stress tolerance (Table S15). In the classic oxidative stress response, two superoxide dismutase genes (Mn/Fe and Cu/Zn sod) convert superoxide to H2O2 and two catalase genes (catalase 1 and 2) convert H2O2 to water and oxygen. Nine genes (peroxiredoxin and thioredoxin peroxidases) are involved in the reduction of H2O2 using electrons provided via thioredoxin reductase (two genes present). This pathway is important in the regulation of thioredoxins, proteins that act as strong antioxidants and facilitate the reduction of other proteins by cysteine thiol-disulfide exchange. Nitric oxide synthase, dual oxidase (Duox1), oxidation resistance 1 (Oxr1) and prophenoloxidase (PPO) genes are involved in immunity. Previous studies have shown that the antioxidant response in tsetse is critical to Glossina-trypanosome interactions (69).

Homeodomain gene orthologs in Glossina
Homeodomain proteins are a conserved family of transcription factors. These factors are characterized by homeodomain helix-turn-helix DNA binding motifs. Homeodomain proteins are known for their coordination of gene expression associated with body plan organization, body polarity, segmentation, organ/tissue differentiation and other functions. These factors are conserved throughout the metazoan evolutionary timeline.
The function of some of these proteins is well understood within the context of embryonic and immature development. However, their function within adult organisms remains understudied (70). The Drosophila genome contains 104 homeobox genes/pseudogenes. Homology-based searches of these genes identified a total of 96 orthologous/homologous tsetse homeobox proteins (Table S16). The 8 genes that lacked significant hits are achintya, bicoid, buttonless, CG34031, CG7056, ods-site homeobox, twin of eyg, vsx2, zerknullt-related.

Comparative syntenic analysis of the Glossina genome with other arthropods
We performed two synteny analyses, using a small resolution threshold (10 kb) and a large resolution threshold (100 kb). The 10 kb resolution analyses tend to elucidate relatively recent changes in the genomes by identifying lineage-specific breaks in synteny and conserved sequences free from small-scale rearrangements. On the other hand, the 100 kb resolution analyses tend to merge many of the small-scale rearrangements and lineage-specific breaks into longer syntenic blocks, thus elucidating more distant ancestral organisation of the chromosome. Supplemental Figure 4 shows the total amount of genome sequence in HSBs (Homologous Syntenic Blocks) for Glossina (the reference genome) as compared to D. melanogaster, A. gambiae, and N. vitripennis. The 10 kb resolution analyses (left set of columns) indicate that the relative sizes of the HSBs are roughly proportional to the sizes of the insects’ genomes. The HSBs are usually several times larger in Glossina and in fact are at least 63 Mb and 28 Mb in Glossina and Drosophila, respectively, consistent with the Glossina genome being about 3 times larger than that of D. melanogaster. The 100 kb resolution analyses (right set of columns) relate

more to shared ancestral chromosomal organisation. Here the total amount of sequence in synteny is nearly equal for Glossina and D. melanogaster. Supplemental Figure 5 compares the distribution of the synteny block sizes for Glossina and D. melanogaster. The 10 kb resolution analyses are characterized by a relatively large number of small blocks (i.e. blocks less than 100 kb in size), with the largest no bigger than about 650 kb. In comparison, the 100 kb resolution analyses define a more irregular distribution of block sizes, with a significant number of large blocks up to around 4 Mb in size. Using D. melanogaster as the reference genome, Supplemental Figure 6 gives a visual representation of synteny defined with a 10 kb resolution threshold by mapping the synteny blocks defined in the Glossina and A. gambiae genomes to the D. melanogaster chromosomes: it shows that a number of synteny blocks are conserved across all three species.

The GO enrichment results are shown in Supplemental Figure 7. All GO categories above the significance threshold are more likely to occur in HSBs when compared to the rest of the genome, whereas GO categories below this threshold do not demonstrate any preferential localisation in HSBs compared with other regions of the genome. The analysis was performed separately for HSBs larger than 10 kb and 100 kb in the Glossina genome. These results demonstrate that the regions of microsynteny (>10 kb) between D. melanogaster and Glossina are enriched for a number of GO categories including transport, protein binding, hydrolysis activity, etc. However, larger regions of synteny (>100 kb) are enriched only for genes related to protein binding, suggesting that despite a large evolutionary distance between the genomes, genes involved in interactions between proteins tend to have a more stable order in these insect chromosomes.

Blocks of synteny conserved between the three dipterans, D. melanogaster, A. gambiae, and G. morsitans (defined using Evolution Highway, see the Annotation section, above), are displayed in Supplemental Figure 8. In Supplemental Figure 9 the fraction of genes allocated a particular GO category is shown for G. morsitans genes lying within the multi-species homologous synteny blocks (msHSBs), and for all G. morsitans genes. In this analysis we have ignored unannotated G. morsitans genes, and we only show results for the 5 most common GO categories from the set of all annotated G. morsitans genes (almost 100 GO categories are represented in these msHSBs). The syntenic relationships between genes in these blocks may be conserved because they are important for particular biological processes: if this is the case, then it would be expected that certain GO categories would be either over- or under-represented in the msHSBs as compared to the entire genome. Supplemental Figure 9 does show this: the fractions of genes with the GO categories “Intracellular” and “Zinc ion binding” are almost equal in the set of all Glossina genes. However, in the msHSBs there is much greater variation in these two GO categories, with the fraction of “Intracellular” genes markedly less, and the fraction of “Zinc ion binding” genes greater, than the corresponding fractions from the set of all genes.

Peptidase genes
The number of peptidase homologues found in the Glossina proteome totals 619, which is fewer than the 769 found in the Drosophila proteome. As is often the case for families of peptidases, several homologues in each family do not have all the active site residues conserved and are therefore unlikely to have peptidase activity. Excluding these non-peptidase homologues and apparent fragments, 418 homologues are predicted to be active peptidases in Glossina compared with 501 in Drosophila. However, there are fewer peptidase inhibitor homologues in Drosophila than in Glossina (187 and 206, respectively).
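The counts above can be put on a common scale by expressing the predicted active peptidases as a share of all peptidase homologues in each species. The short Python sketch below does this arithmetic using only the numbers quoted in the text; the dictionary layout and function name are illustrative choices.

```python
# Counts quoted in the text for each proteome.
COUNTS = {
    "Glossina": {"homologues": 619, "predicted_active": 418, "inhibitors": 206},
    "Drosophila": {"homologues": 769, "predicted_active": 501, "inhibitors": 187},
}


def active_percentage(species):
    """Percentage of peptidase homologues predicted to be active peptidases."""
    counts = COUNTS[species]
    return 100 * counts["predicted_active"] / counts["homologues"]


for species in COUNTS:
    print(f"{species}: {active_percentage(species):.1f}% predicted active")
```

This gives about 67.5% for Glossina and 65.1% for Drosophila, so although Glossina has fewer peptidase homologues overall, the proportion predicted to be active is similar in the two species.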

Peptidases in Glossina represent 68 different families, from five of the six known catalytic types (aspartic, metallo, cysteine, serine and threonine; the missing glutamate peptidases are restricted to fungi and viruses). The families containing the most homologues are: S1 (chymotrypsin family, 177 homologues), S9 (α/β hydrolases including prolyl oligopeptidase, 53 homologues), M14 (metallocarboxypeptidases, 30 homologues), C19 (deubiquitinating enzymes, 25 homologues) and M13 (neprilysin family, 24 homologues). Peptidase inhibitors are derived from 15 families, with the most homologues from I63 (of which the only known inhibitor is proeosinophil major basic protein, an inhibitor of pappalysin), I2 (Kunitz-BPTI family, 26), I1 (Kazal family, 21) and I4 (serpins, 17) (71).

Tsetse fly peptidases were compared with those from ten other insects whose genomes have been completely sequenced: the mosquitoes Aedes aegypti, Anopheles gambiae and Culex pipiens quinquefasciatus; four species of Drosophila (D. melanogaster, D. pseudoobscura, D. simulans and D. yakuba); the bee Apis mellifera; the wasp Nasonia vitripennis; and the beetle Tribolium castaneum. For all eleven insect species, the peptidase family with the most members is S1 (chymotrypsin). D. melanogaster has 269 members, whereas the mosquitoes each have over 300. Glossina has fewer (177); however, A. mellifera has even fewer, with only 52. Because the genes for members of family S1 often occur in clusters, there may be issues with their correct identification. Table S19 shows the 44 peptidases common to all eleven insect species, of which nine are biochemically uncharacterized. Glossina is missing only two peptidases common to all the other insects: OTLD1 deubiquitinylating enzyme (C85.001) and CG7432 protein (S01.A53).
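Comparisons of this kind reduce to set operations on the identifiers assigned to each proteome. The Python sketch below illustrates the idea with toy data; the function names, species labels and identifiers are all hypothetical, and it is not the procedure used to produce Table S19.

```python
def common_to_all(proteomes):
    """Identifiers present in every proteome."""
    return set.intersection(*(set(ids) for ids in proteomes.values()))


def missing_from(target, proteomes):
    """Identifiers shared by all other proteomes but absent from the target."""
    others = [set(ids) for name, ids in proteomes.items() if name != target]
    if not others:
        raise ValueError("at least one proteome besides the target is required")
    shared_by_others = set.intersection(*others)
    return shared_by_others - set(proteomes[target])


# Toy identifier sets (hypothetical, for illustration only).
toy_proteomes = {
    "species_A": {"ID1", "ID2", "ID3", "ID4"},
    "species_B": {"ID1", "ID2", "ID3"},
    "target": {"ID1", "ID3", "ID5"},
}
print(sorted(common_to_all(toy_proteomes)))
print(sorted(missing_from("target", toy_proteomes)))
```

With the toy sets, ID1 and ID3 are common to all three proteomes, and ID2 is the only identifier shared by the other two species that the target lacks.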
Of the peptidases common to all members of the order Diptera, Glossina is missing: UfSP1 peptidase (C78.001); CG1299 protein (S01.B33); lysosomal Pro-Xaa carboxypeptidase (S28.001); and CG9953 protein (S28.A10). Glossina has peptidase homologues in the following families that are either not present in other dipterans or cannot be mapped to D. melanogaster peptidases (number of homologues in brackets): A1 (2), C69 (1), M16 (1), M17 (2), M20 (2), M22 (1), M24 (4), M38 (1), M49 (1), M50 (2), M67 (1), M79 (1), S1 (43), S8 (3), S9 (8), S33 (6), T1 (4), T2 (1) and U69 (18).

Degradation of haemoglobin
The process of host haemoglobin degradation is characterized in a number of parasites. The peptidases involved are shown in Table S20. Only cathepsin D is known from the tsetse fly, and this is one of the peptidases found in all eleven insect proteomes. This is a lysosomal peptidase that is found in most animals and degrades haemoglobin in mammals (72). In addition, two homologues of cathepsin D are present in the tsetse fly proteome that are not present in the other insects. These could be candidate peptidases for the degradation of haemoglobin. Surprisingly, legumain, which is a lysosomal cysteine peptidase that cleaves after asparagine bonds and is found in most animal species, is absent from all eleven insect proteomes.

Anticoagulants
Haematophages must deal with host haemostasis, and anticoagulants are usually a part of this process. Often these anticoagulants are peptidase inhibitors that bind and inactivate coagulation factors, particularly thrombin. Table S21 shows families with known thrombin inhibitors and the number of homologues found in the G. m. morsitans proteome. The large number of homologues in family I63 reflects the fact that the inhibitory domain is also a lectin domain, which is found in many animal proteins not known as peptidase inhibitors. One thrombin inhibitor has been biochemically characterized from the tsetse fly (73), and has been

assigned the MEROPS identifier I76.001. No homologues are known from any other species. Additionally, the tsetse fly possesses six homologues of phosphatidylethanolamine-binding protein (I51.002), which is known to inhibit thrombin weakly (74).

Autophagy-associated genes in tsetse flies
Autophagy is a highly conserved process during which proteins and organelles present within the cytoplasm are non-selectively degraded within lysosomes. The products of the autophagic degradation can be broken into small molecules that can be recycled for the generation of macromolecular products or utilized in the generation of energy. This process is critical for a wide variety of biological processes including development, stress tolerance and starvation. Through utilization of a comparative approach, we have annotated the G. morsitans genes associated with autophagosome induction, nucleation and expansion (Table S22). Based on these analyses, Glossina contains the same core set of genes associated with autophagy as does Drosophila.

High variability speciation loci
In the last decade significant progress has been made in the identification and characterization of speciation genes, especially within the Drosophila lineage (75). A common pattern observed in these loci is that they exhibit high rates of molecular evolution, which may drive speciation. Accordingly, many such genes might be expected to be lineage specific and/or difficult to identify in distantly related taxa. The search in Glossina included homologs of rapidly evolving speciation genes, such as Odysseus (OdsH, hybrid sterility), its paralog unc-4, hybrid male rescue (hmr, hybrid lethality and female sterility), Lethal hybrid rescue (Lhr, hybrid male lethality), and Nuclear pore protein (Nup96, hybrid male lethality). All of these genes were identified in Glossina with the exception of Lhr (Table S23).
In comparison, all of these proteins except for OdsH (and unc-4) were also found in Lucilia sericata based on transcriptome analysis (76). Thus while some speciation loci may be rapidly evolving among lineages, they remain conserved within Drosophilidae, Calliphoridae, and Glossinidae. Comparative analysis of these genes between Glossina species will be of particular interest.

Fatty acid elongation enzyme genes
Glossina exhibits an expansion of genes encoding enzymes associated with fatty acid elongation pathways. In total, 21 genes orthologous to Drosophila genes associated with lipid elongation and diverse lipid biosynthesis were identified within Glossina. This set includes multiple paralogs of genes encoding acetyl-CoA acyltransferase, enoyl-CoA hydratase, and elongase activities. These gene expansions suggest that enzymatic functions associated with fatty acid elongation are of particular importance to Glossina’s physiology. This fits with physiological observations demonstrating that Glossina relies primarily upon stored lipids for energy, as sugars and glycogen are not utilized. Fatty acids also play a crucial role in determining membrane structure and stability. This could be of particular importance during the process of involution and regeneration of the milk gland tissue that occurs between each pregnancy. The genes coding for the key enzymes in the fatty acid elongation pathway have been classified within Drosophila. The Glossina gene ID and description for each enzyme in the fatty acid elongation pathway are shown together with the putative D. melanogaster orthologs and EC IDs (Table S24). Abbreviations used for nomenclature of the enzymes in fatty acid elongation: ACAT, acetyl-CoA acyltransferase; ECH, enoyl-CoA hydratase; ECR, trans-2-enoyl-CoA

reductase; PPT, palmitoyl-protein thioesterase; ELG, elongase; ACD, 3-hydroxy acyl-CoA dehydratase; ENR, enoyl reductase. Heat shock proteins (Hsps) Small Hsps Four small heat shock proteins, similar to those found in Drosophila, were identified in the Glossina genome. Two other genes, l2efl and HSPB8, containing alpha-crystalline domains were also recovered (Table S25). These genes probably function as chaperones, but are unlikely to be inducible under stress because of the presence of introns. Of note is the proximity of the small heat shock proteins on supercontig scf7180000650671. This is similar to the small HSP gene clustering observed on Drosophila chromosome 3L 9366000 to 9378000. Summary of hsp83 (hsp90) and hsp70/hsc70 identified from the Glossina genome. Two orthologs to heat shock protein 83 (hsp83a, b), one heat shock protein 70 (hsp70A), and six heat shock protein cognate (hsc70-1, -2, -3, -4a, -4b, and -5) genes were recovered from the predicted genes associated with the Glossina genome based on their amino acid sequence homology to those of the corresponding genes in Drosophila (Table S26). Interestingly, two copies of hsp83 gene are present in Glossina, but only one in Drosophila. Additionally, Drosophila’s genome contains at least six unique hsp70 genes while only one is found in Glossina. The six hsc70 genes from Drosophila are represented in the Glossina genome with the exception of hsc70-4, which has two orthologs in the Glossina. The presence of more hsc70 and less hsp70 genes may indicate that these gene sets developed independently to deal with the different stressors associated with the unique lifestyles of Glossina and Drosophila. Further work is necessary to determine if Glossina heat shock protein genes are constitutively-expressed, stress-inducible or function in a combined capacity. In addition, the lack of predicted hsp70 genes requires further investigation. 
Previously, the hsp70, hsc70-3, hsc70-4 and hsc70-5 sequences were identified in an analysis of the Glossina sialome (16). These sequences are 99-100% identical to their genomic counterparts, which suggests a nearly perfect match between the previous transcriptome and this genome study. Also, an hsp83 gene was previously identified as an endoplasmic reticulum glucose-regulated protein (16), which also belongs to the hsp90 family. However, detailed primary structure analysis and sequence alignment suggest that this gene is an hsp83 or hsp90 instead of an endoplasmic reticulum glucose-regulated protein. Thus we named these two genes hsp83a and hsp83b.

Cuticular protein genes

Eighty-seven putative cuticular protein sequences containing the R&R consensus chitin binding domain, hence belonging to the CPR family, were manually annotated (Table S27). All but 18 (P after gene symbol) appear to be complete sequences. Many predicted genes need to be revised (R after gene name). Both Apis and Nasonia have far fewer CPR genes (32 and 62 respectively) than Drosophila melanogaster and Anopheles gambiae (101 and 158 respectively). The annotation was performed to test whether the protected immature lifestyle of the two Hymenoptera is responsible for their reduced number of CPR genes, by asking whether a similar reduction is observed in Glossina, which has a comparably protected lifestyle during intrauterine larval development. The presence of 87 CPR genes in Glossina indicates that a reduction in the number of cuticular protein genes need not correlate with protected immature development. Many of the CPR genes have unambiguous orthologs in Drosophila; other, less certain orthology assignments were based primarily on the Consensus region. An additional 20 sequences belonging to other cuticular protein families were also identified. All of these putative cuticular proteins were defined based upon sequence similarity to proteins that were either isolated directly from cuticle or that appeared in proteomic analysis of Anopheles cuticles, where ~90% of genes predicted to encode cuticular proteins were confirmed to be cuticular components.

Neuropeptide and protein hormone receptors

Neuropeptides are the most diverse class of signaling molecules in the central nervous system (CNS). By operating as hormones, neurotransmitters or neuromodulators, they are involved in many biological processes such as homeostasis and metabolism, reproduction, growth and development, feeding, circadian rhythms, sensorimotor integration, adaptive behaviors and cognition (77). A number of neuropeptidergic systems are evolutionarily conserved in the animal kingdom. As such, they are categorized in several families, based on conserved structural features. A core set of about 20 conserved neuropeptide genes is found in the genomes of all insects sequenced to date, while a variable set of 26 neuropeptides is present in some species but not in others (78). Our results identify 39 neuropeptide genes in Glossina, which include all conserved core genes, as well as a variable cohort of neuropeptide genes that is most similar to that of Drosophila (see Table S28). However, we also detected particular differences in the Glossina genome that may reflect specific physiological and/or behavioral adaptations to the different life cycle and environment of this dipteran insect.
As in Drosophila, genes encoding adipokinetic hormone/corazonin-like neuropeptide (ACP; function unknown), allatotropin (stimulation of juvenile hormone synthesis and cardiac activity), inotocin (control of water balance) and orcokinin (regulation of circadian rhythms) are not found in Glossina. The sulfakinin encoding gene (control of feeding and of larval locomotion/odor preferences), which is present in Drosophila, is probably degenerated in Glossina. Similarly, the sex peptide involved in female fruit fly postmating behavior is absent in Glossina as in all other investigated insects and seems to be an evolutionary invention in most, but not all, Drosophila species. Like all other dipterans, the Glossina genome encodes the neuropeptide-like precursor 1 (NPLP-1), a modulator of the innate IMD immune pathway. The other Drosophila NPLP genes (2-4), with unknown functions, are absent in Glossina, as in all other investigated dipterans. In contrast to Drosophila and Anopheles, but similar to Aedes and Culex quinquefasciatus, the Glossina genome contains a neuroparsin gene (ecdysteroid synthesis and ovarian development) (Table S28). Most neuropeptides carry out their function by activating specific G-protein coupled receptors (GPCRs). A few exceptions include eclosion hormone and NPLP-1, which interact with guanylate cyclase receptors, and the larger protein hormones, prothoracicotropic hormone and insulin-like peptides, which interact with receptor tyrosine kinases. Neuropeptide GPCRs are transmembrane proteins that, upon activation by their ligand, transduce this signal by changing conformation, resulting in an intracellular response (77). We annotated 43 GPCRs in the Glossina genome database (Table S29), 32 of which can be associated with their putative neuropeptide ligand. Eleven GPCRs are classified as orphan receptors with as yet unknown ligand(s). The vasopressin/oxytocin related neuropeptide, inotocin, is absent in all dipteran genomes investigated so far.
The allatotropin and the ACP neuropeptides and their receptors are absent both in Glossina and Drosophila, but do occur in mosquito genomes. The presence of a sulfakinin-like pseudogene and absence of a sulfakinin receptor in Glossina points to a remarkable evolutionary loss of this signaling system in Glossina which may be attributed to their particular feeding behavior.

Insulin-like peptides

Insulin signaling regulates a wide range of physiological processes in insects, including growth, metabolism, aging, immunity and reproduction (79). Dipterans and other insects typically express multiple insulin-like peptides (ILPs), but have only a single insulin receptor. We identified five ILPs in the Glossina genome (Table S30). Glossina ILP1 and ILP2 are separated by 1025 bp in the genome and are encoded on the forward and reverse strand, respectively. A phylogenetic analysis indicates that ILP1 is most similar to DILP3 in Drosophila, while ILP2 does not cluster with another dipteran ILP. Glossina ILP3 and ILP4 are also linked in the genome and only separated by 6441 bases. A phylogenetic analysis indicates that these two ILPs are most similar to the mosquito ILP3s and likely arose from a recent duplication event. Glossina ILP5 is located on its own contig and is most closely related to DILP7. Interestingly, Glossina ILP2 and ILP5 have a unique primary structure with four amino acids between the second and third cysteines of the A-chain, similar to that of ILP5 in mosquitoes and DILP7 in Drosophila (79). We did not identify a putative insulin growth factor (IGF) ortholog in the Glossina genome.

Juvenile hormone and ecdysone signaling (Table S31)

Genes identified as components of the JH signaling pathway are conserved between Glossina and Drosophila (Table S31). This is perhaps unsurprising given the deep evolutionary conservation of the fundamental aspects of JH signaling (80). There are a small number of known JH inducible genes. Examples of these include the early 20E-inducible genes Broad, E75, and E74. Other JH-inducible genes known in Drosophila include JhIP-1, -21, and -26, two isoforms of E74, minidiscs (mnd), Phosphoenolpyruvate carboxykinase (pepck), and JH Esterase (JHE).
Each of these genes has a clear homolog in Glossina with the exception of JhIP-21. The Methoprene tolerant (Met) gene is recognized as the best candidate JH receptor. In Brachycera, including Glossina and Drosophila, two paralogous JH receptor candidates, Met and gce, are thought to transduce the JH signal (81). The gce gene is homologous to "Met" in other Holometabola. Knockdown of Met and gce genes in Glossina suggests functional divergence of these genes with respect to reproduction (82). Met and gce are subject to regulatory control in Drosophila by key negative regulators of the Wnt signaling pathway, whose constituent genes in Glossina include homologs of Drosophila armadillo, axin, disheveled, frizzled, slmb, APC, pangolin, and shaggy.

JH Biosynthesis

Homologs of enzymes involved in JH biosynthesis in Drosophila were identified in Glossina, including HMG Coenzyme A synthase (HMGS), HMG Coenzyme A reductase (HMGCR), Farnesoic Acid Methyl Transferase (FAMeT), and Juvenile Hormone Acid Methyl Transferase (JHAMT). Recently we have identified JHAMT as a downstream target of Glossina FOXO activity.

JH Catabolism

A gene duplication in Drosophila produced JHE and Jhedup, the latter of which is absent in Glossina. Two Glossina genes showing moderate sequence identity with Jhedup are predicted to have general carboxylesterase activity. Additionally, Glossina harbors only two JH Epoxide Hydrolase (JHEH) copies, homologous to JHEH1 and JHEH3 in Drosophila. Two proteins, FKBP39 and Chd64, bind JH response elements and function as critical regulators of JH activity. A Glossina ortholog of FKBP39 was identified, whereas a clear Chd64 ortholog was not.

Ecdysone biosynthesis and catabolism

The Halloween gene family is involved in the biosynthesis of ecdysteroids and includes several cytochrome P450 paralogs in addition to a short-chain dehydrogenase/reductase (Shroud). The Glossina genome harbors orthologs of each of the Halloween genes excluding shadow (Sad; Cyp315A1); tsetse orthologs were identified for disembodied (dib; Cyp302A1), phantom (phm; Cyp306A1), shade (shd; Cyp314A1), spook (spo; Cyp307A1), spookier (spk; Cyp307A2) and shroud (sro).

Circadian regulation in Glossina

Nine canonical circadian clock genes were identified in Glossina based on circadian clock genes described from Drosophila. These genes, with the exception of period and timeless, are of similar length and exon structure between Glossina and Drosophila (Table S32). For period and timeless the lengths of the genes are similar between Glossina and Drosophila, but several large exons in Drosophila have been divided into shorter exons in Glossina. Moreover, we failed to identify a mammalian-type Cryptochrome gene (Crym or Cry2) in Glossina. This suggests that Glossina has a Drosophila-like circadian clock where the PERIOD and TIMELESS proteins serve as the key negative regulators of CLOCK and CYCLE. For a detailed description of the circadian clock model in Drosophila see (83). In addition to core circadian clock genes we also identified two genes with clock-related functions based on their high sequence homology to described genes in Drosophila.

miRNAs

We identified 74 unique miRNA genes (precursors), with 72 unique mature miRNAs. A second round of Mapmi analysis with a setting allowing 3 mismatches did not identify more miRNAs.
In addition to running Mapmi as described above, both the mature and stem-loop sequences of all known Drosophila melanogaster miRNAs (miRBase v18) were used as queries in BLAST searches against the Glossina genome (e = 0.01). Three additional miRNA genes, which have Mapmi scores slightly below 35, were recovered in this manner, bringing the total predicted miRNA genes to 77 (75 mature miRNAs). Out of the 77 miRNA genes, 39 have both the 3p arm and the 5p arm recovered by Mapmi with the 1 mismatch setting, indicating a high conservation of the whole stem-loop sequences. Thirty-one miRNA genes have only one arm (either 3p or 5p) recovered by the stringent Mapmi setting, while the other arm is less conserved and can only be recovered by BLAST. The remaining 7 miRNA genes appear to have evolved so fast that only the mature sequences were conserved and the stem-loop structures were maintained. To investigate the species distribution of the Glossina miRNAs, we used Mapmi to examine the distribution of the predicted miRNAs in 11 insect species, including 4 Drosophila species, 3 mosquito species and 4 non-dipteran species (Table S33). Five Glossina miRNAs are shared with Drosophila species only and 3 additional miRNAs are shared among dipterans in general.

Iron metabolism

Proteins involved in iron uptake are relatively conserved in the Glossina genome (Table S34). In contrast, proteins involved in cellular iron export in vertebrates are virtually absent (ceruloplasmin (CP), SLC40A1 (FPN/IREG1), hephaestin (Heph/Hp)), with the exception of the two conserved homologues of FLVCR1 (heme exporter) and putative orthologs of ABCG2 (heme exporter). Several orthologs for proteins involved in the regulation of iron absorption are absent from the Glossina genome, including hepcidin (HEPC/HAMP), hemojuvelin (HJV), high Fe protein (hemochromatosis/HFE) and β2-microglobulin (B2M). Conversely, 4 homologues for bone morphogenetic protein 6 (BMP 6), 4 homologues for α2-macroglobulin (A2M) and 2 homologues for furin are present. These latter proteins regulate hepcidin expression and activity. Several homologues also are present for the transmembrane protease, serine 6 family for matriptase-2 (TMPRSS6) that regulates hemojuvelin. However, the role of these homologues in iron absorption in insects without their regulatory partners remains questionable. The iron transport protein homologues in the Glossina genome are incomplete, with three conserved transferrin paralogues (Tsf1-3) that bind iron for extracellular transport in vertebrates. However, no homologues are found for transferrin receptor (TfR) 1 and 2, or for the six transmembrane epithelial antigen of the prostate 3 (STEAP3) that releases endosomal iron from transferrin, indicating that an alternate mechanism may be used for the uptake of iron from transferrin. The intracellular iron storage protein, ferritin, may function as the major iron transport protein in insects including Glossina. In insects, transferrin can also act as an acute phase protein for the sequestration of iron during pathogenic challenge.
In vertebrates, the bacteriostatic protein, lipocalin 2 (LCN2), is expressed to prevent bacterial growth by tightly binding bacterial enterobactin siderophores, small iron-binding molecules synthesized by bacteria to acquire iron from their host. A conserved lipocalin 2 ortholog is not present in the Glossina genomic database. Most of the protein orthologs involved in intracellular iron homeostasis are conserved in the Glossina genome, including proteins involved in post-transcriptional regulation of iron metabolism (iron regulatory protein (Irp-1A/C-Acon)), heme detoxification (heme oxygenase 1 and 2 (Ho)), mitochondrial heme and iron-sulfur cluster formation (aminolevulinate synthase (Alas) and frataxin homolog (fh)), an iron chaperone (e.g. poly(rC) binding protein (PCBP1)), and iron storage proteins (ferritin 1 heavy chain homologue (Fer1HCH), ferritin 2 light chain homologue (Fer2LCH)). The iron oxidative stress response proteins are also present, including superoxide dismutase 1 (SOD1), nitric oxide synthase (NOS), dual oxidase 2 (DUOX2), NADPH oxidase (nox), and Cytochrome b5 (Cyt-b5). Iron protein homologues involved in DNA synthesis such as Ribonucleoside diphosphate reductase large and small subunits (RnrL & RnrS) are conserved as well. Intracellular iron homeostasis mechanisms of regulation are the most significantly conserved of all of the areas of the iron metabolic pathway that we documented. These results are consistent with previous analysis of other insect genomes (84).

Iron response element characterization

Iron responsive elements (IREs) are the riboregulators of iron and are located in the untranslated regions that control the synthesis rate of iron metabolism mRNAs (85). IREs, as part of the iron regulatory system, are recognized and targeted by cytoplasmic iron regulatory proteins, whose binding results in the stabilization or inhibition of specific mRNAs.
We performed a genome-wide search for IREs in the untranslated regions of Glossina genes. UTR sequences, 1000 bp up- and down-stream of the start and stop codons respectively, were retrieved and searched for patterns associated with IREs using the dna-pattern program of the RSAT package (86). The IRE-containing UTRs were then analyzed using SIRE, a Perl-based program that searches for 19-20 nt sequences related to IRE core patterns (http://ccbg.imppc.org/sires/). The results of SIRE were further filtered based on the score provided for each prediction. Patterns corresponding to a "High" or "Medium" score with no mismatches were considered as putative IRE structures. The genes identified to contain IREs were searched against the nr database and further annotated using the blast2go program (http://www.blast2go.com). Of the 24440 UTRs that were analyzed, 1233 (5.04%) were identified to contain canonical forms of IRE, while 5383 (22.02%) have the non-canonical form. The UTR sequences of these genes were further analyzed to confirm the presence of IREs and to assess the folding energies of the identified motifs. Among the 6616 UTR sequences provided as input to SIRE, 902 were identified to have an IRE in their untranslated regions. Evaluation of the quality of the predicted IREs indicated 72 IREs of "High" quality, 299 of "Medium" quality and 531 of "Low" quality. Based on score-based filtering, a total of 153 putative IRE-containing genes were retrieved, of which 74 genes contained 5' IREs and 79 contained 3' IREs (Tables S36 and S37). Categorization of the identified genes into major GO (Gene Ontology) categories revealed them to be components of cell communication, transport, development, and metabolic systems (a subset of genes could not be allocated to any of these major GO categories), whereby the abundance of metabolic and development-related genes is evident.
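The flank-retrieval step and the tallies reported above can be sketched as follows. This is a minimal illustration and not the pipeline used in the study, which relied on RSAT and SIRE. The function names, the 0-based half-open coordinate convention, and the strand handling are assumptions introduced here. The last function only recomputes totals from the counts given in the text; it suggests that the printed percentages are truncated, not rounded, to two decimals.

```python
from typing import Dict, Tuple


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A, C, G, T; other letters pass through)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def flanking_windows(genome: str, cds_start: int, cds_end: int,
                     strand: str, flank: int = 1000) -> Tuple[str, str]:
    """Return (five_prime_flank, three_prime_flank) for one gene.

    Assumed convention (hypothetical, for illustration only):
    cds_start is the 0-based index of the leftmost coding base on the
    forward strand and cds_end is one past the rightmost coding base.
    Flanks are clipped at the sequence ends.
    """
    left = genome[max(0, cds_start - flank):cds_start]
    right = genome[cds_end:cds_end + flank]
    if strand == "+":
        return left, right
    if strand == "-":
        # On the reverse strand the 5' flank lies to the genomic right.
        return reverse_complement(right), reverse_complement(left)
    raise ValueError("strand must be '+' or '-'")


def ire_tallies() -> Dict[str, int]:
    """Recompute the totals reported in the text from their components."""
    total_utrs = 24440
    canonical, non_canonical = 1233, 5383
    high, medium, low = 72, 299, 531
    five_prime, three_prime = 74, 79
    return {
        "sire_input": canonical + non_canonical,    # UTRs passed to SIRE
        "sire_hits": high + medium + low,           # UTRs with a SIRE-predicted IRE
        "filtered_genes": five_prime + three_prime,  # genes kept after filtering
        # Percentages times 100, truncated with integer division.
        "pct_canonical_x100": (10000 * canonical) // total_utrs,
        "pct_non_canonical_x100": (10000 * non_canonical) // total_utrs,
    }


if __name__ == "__main__":
    toy = "GGAATGCCCTAACTT"  # toy sequence; coding bases occupy indices 3..11
    print(flanking_windows(toy, 3, 12, "+", flank=3))
    print(flanking_windows(toy, 3, 12, "-", flank=3))
    print(ire_tallies())
```

With the default `flank=1000`, the two returned strings correspond to the 1000 bp windows described in the text; a real analysis would read coordinates from the gene annotation and not from hand-entered indices.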
Pathway analysis suggests that genes involved in glycine biosynthesis, circadian rhythm signalling, folate polyglutamylation, RNA signalling, ketogenesis, cell cycle control of chromosomal replication, sonic hedgehog signalling, nucleotide excision repair and serotonin receptor signalling are overrepresented among IRE-containing genes.

Sex determination

RNA transcripts of the sex-determining transformer (tra) gene are sex-specifically spliced in Glossina. TRA, together with the RNA binding protein TRA2, is required for sex-specific splicing of doublesex (dsx) and fruitless (fru) in Drosophila (87). TRA/TRA2 also likely regulate sex-specific splicing of Glossina dsx and fru, as several predicted TRA/TRA2 binding sites were found within the putative female-specific exons of both genes. Gmm TRA2 and Gmm intersex (IX) are very similar to orthologs identified in other Diptera. For completeness, Sxl and genes that regulate early Sxl expression in Drosophila (e.g. run, sisA) are listed in Table S38. However, it should be noted that Sxl likely plays no role in sex determination in tsetse, as tra expression appears to be auto-regulated as in other non-drosophilid Diptera (e.g. L. cuprina, C. capitata) (88).

Analysis of X chromosome gene dosage compensation genes

In Drosophila, the male specific lethal (MSL) complex is required for X chromosome dosage compensation (89). There was no significant difference in the activity of two X-linked enzymes in male and female Glossina morsitans morsitans (90), suggesting that this species has a mechanism for X chromosome gene dosage compensation. Orthologs of the five MSL proteins are present in the Glossina genome. Protein motifs identified as important for interaction between the MSL proteins (91, 92) are also well conserved in the Glossina orthologs. However, the motifs associated with X chromosome binding in Drosophila (e.g. MSL1 amino terminal end (93)) are not well conserved. Further, although the nucleic acid-binding CXC motif of MSL2 (94) is conserved, the tsetse motif is more similar to that of mosquito (Ae. aegypti) and beetle (T. castaneum) than D. melanogaster. This suggests that the Glossina MSL complex is likely binding to a different DNA sequence than that recognized by the Drosophila complex (95) (Table S39).

Hexamerins

In addition to milk proteins, larval serum proteins (hexamerins/arylphorins/hemocyanin related proteins) were annotated as these are a primary mechanism of nutrient storage during larval development. These proteins are highly conserved, making them a valuable tool for studying evolutionary relationships in the arthropod lineage. We identified 2 tsetse larval serum protein genes that appear orthologous to the lsp-1 (GMOY012011) and lsp-2 (GMOY005820) genes in Calliphora vicina (urban bluebottle fly) and Musca domestica (common house fly).

Embryonic anterior/posterior polarity determination

A critical process in development is the determination of embryo anterior/posterior polarity (96). Absent from the Glossina genome are both the bicoid and the nanos genes, which are responsible for the well-defined anterior and posterior embryonic polarity system in Drosophila. Orthologs for these genes were not found in the genomic scaffolds or in de novo assemblies created using Illumina data from reproductively active whole female flies. Orthologs to genes immediately flanking the 5' and 3' ends of the bicoid and nanos loci in Drosophila are present in the Glossina assembly. This polarity mechanism is thought to be specific to the Brachycera. These findings suggest that the conservation of this system between Drosophila and other brachyceran flies may not be as well defined as previously thought (97). Other insects determine embryonic polarity through a gradient of maternal RNA for orthologs of the ocelliless/orthodenticle (oc/otd) (GMOY006617) and hunchback (hb) (GMOY004735) genes, both of which are present in Glossina.
Glossina specific genes and gene expansions

Analysis of predicted genes within the genome identified many Glossina specific genes. Orthology analysis of these Glossina specific genes by OrthoMCL (http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi) (59) revealed that a number of these hypothetical proteins form gene families as the result of gene expansions unique to tsetse. The largest Glossina specific orthology group is represented by 30 proteins, which contain BTB/POZ domains and Kelch motifs which form β-propeller structures. Proteins with BTB/POZ domains are known for their ability to form homo- and heterodimers. Kelch containing proteins cover a broad range of functions and are often associated with actin binding and cytoskeletal organization (98). One possible function for these proteins could be during the recovery stage immediately after pregnancy in tsetse. Milk gland tubules between pregnancy cycles undergo a significant amount of breakdown and reconstruction, which would require significant amounts of cytoskeletal activity (Table S40). Another orthology group contains 27 predicted members. Motif analysis of these proteins reveals the presence of "Regulator of chromosome condensation (RCC1)" repeats in some members of this group. The RCC1 repeat is associated with chromatin binding and the binding of a nuclear GTP-binding protein called RAN and is thought to play an important role in regulation of gene expression (99) (Table S41).

Another interesting Glossina specific expansion is that of the C2H2 zinc finger containing proteins. We found 49 proteins containing zinc finger motifs without orthologous sequences in Drosophila. These proteins break into 10 orthology groups, with the largest group containing 17 members, some of which appear to be derived from recent duplication events due to their genomic co-localization. Zinc finger motifs are the most abundant nucleic acid binding motif in higher eukaryotes. They are found in RNA binding proteins, transcription factors and chromatin components. Proteins containing this motif are consistently expanded within the genomes of Diptera that have been sequenced. Of the 10 tsetse specific orthology groups, 1 group (comprising 2 genes) contains zinc finger associated domain (ZAD) motifs. ZAD containing proteins have undergone lineage specific expansions in higher holometabolous insects and are hypothesized to contribute to the developmental plasticity associated with these insects (100) (Table S42). Finally, a group of 9 genes bearing homology to the Drosophila protein Yuri Gagarin (yuri), which has been associated with graviperception and sperm tail elongation functions, were identified. However, this gene is expressed ubiquitously in Drosophila. This protein is thought to regulate F-actin and tubulin function. Observed phenotypes are primarily associated with tissues that produce cilia (101). The expansion of this group of proteins, in combination with the Kelch proteins described previously, suggests the possibility of cytoskeletal specializations within tsetse due to the expansion of actin/tubulin associated protein families (Table S43).

Fig. S1. Adapted phylogeny illustrating the relationship of Glossina morsitans morsitans within the Brachycera. The relative relationships between tsetse species and other selected members of the Brachycera. This tree was adapted from a maximum parsimony tree based upon the combined sequence data from four genes: mitochondrial 16S ribosomal DNA (16S rDNA), nuclear 28S ribosomal DNA (28S rDNA), the carbamoylphosphate synthase (CPSase) domain of the nuclear CAD gene and the mitochondrial gene cytochrome oxidase I (COI). The full tree with additional species, bootstrap support values and posterior probabilities can be found in Petersen et al. 2007 (3).

Fig. S2 Overview comparing genomic statistics from Glossina with Drosophila melanogaster, Anopheles gambiae, Aedes aegypti and Culex quinquefasciatus. In panels B-D thick bars are associated with the left axis and thin bars are associated with the right axis. (A) Comparison of genome sizes, (B) Comparison of the number and length of gene predictions, (C) Comparison of the number and length of exons, (D) Comparison of the number and length of introns.

Fig. S3 Comparative Orthology Analysis of the predicted Glossina Transcriptome against other Dipteran transcriptomes (Drosophila melanogaster, Aedes aegypti, Anopheles gambiae, Culex quinquefasciatus, and Phlebotomus papatasi). (A) Orthology groups common between taxa. (B) Shared orthology groups between Glossina and representative species.

Fig. S4 Total amount of genome sequence in HSBs. The genome sizes of D. melanogaster, A. gambiae, and N. vitripennis are 122 Mb, 278 Mb, and 298 Mb respectively.

Fig. S5 Distribution of homologous synteny block sizes for Glossina and D. melanogaster up to 500 kb, for 10 kb resolution (top) and 100 kb (bottom).

Fig. S6 Comparative visualization of synteny between Drosophila, Glossina and Anopheles. Visualized within D. melanogaster chromosomes are blocks of homologous synteny between D. melanogaster and Glossina (right column) and between D. melanogaster and A. gambiae (left column), defined with a 10 kb resolution; the blocks are shown in pink and blue, according to the direction of the alignment.

Fig. S7 Enrichment for GO categories within HSBs between D. melanogaster and Glossina genomes. Dark gray blocks represent GO categories enriched in HSBs >10 kb while light gray blocks show the same categories in HSBs >100 kb. Black line shows the significance threshold equal to P=0.05.

Fig. S8 Comparative visualization of shared syntenic blocks between Drosophila, Glossina and Anopheles. Visualized within D. melanogaster chromosomes are blocks of homologous synteny between D. melanogaster and Glossina (right column) and between D. melanogaster and A. gambiae (left column). Red lines indicate positions of msHSBs that are present in all three genomes.

Fig. S9 Comparison of proportions of msHSBs associated gene types in syntenic blocks. Comparing the proportion of genes with a particular GO category: for genes present in msHSBs defined across three dipteran species, and for genes present in the entire Glossina genome.

Additional Data table S1 (separate file) Composition of libraries used for sequencing.
Additional Data table S2 (separate file) Sequence data used for initial assembly.
Additional Data table S3 (separate file) Properties of the version 1.0 assembly.
Additional Data table S4 (separate file) Number of transposable elements and percent of the genome occupied by fragments or full length elements in Glossina. Totals for each major group of transposable elements are shown in bold; comparable numbers for Drosophila are also shown.
Additional Data table S5 (separate file) Salivary protein genes.
Additional Data table S6 (separate file) List of peritrophins and peritrophin-like proteins, containing 1 or more chitin binding domains (CBD), from Glossina.
Additional Data table S7 (separate file) Glossina aquaporin genes.
Additional Data table S8 (separate file) Gene expansions associated with lipid metabolism.
Additional Data table S9 (separate file) Comparison of transporter gene numbers between Glossina, Drosophila, Aedes and Anopheles.
Additional Data table S10 (separate file) Characteristics of Wolbachia genomic insertions.
Additional Data table S11 (separate file) Classes of Bracoviral proteins with homologs within the Glossina genome.

Additional Data table S12 (separate file) Glossina transcripts homologous to Bracoviral proteins.
Additional Data table S13 (separate file) Immune function associated gene orthologs.
Additional Data table S14 (separate file) Milk protein genes.
Additional Data table S15 (separate file) Glossina antioxidant associated genes.
Additional Data table S16 (separate file) Glossina homeodomain gene orthologs.
Additional Data table S17 (separate file) Comparison of chemoreceptor genes between Glossina, Drosophila melanogaster and Anopheles gambiae. CSPs: chemosensory proteins, GRs: Gustatory receptors, OBPs: Odorant Binding Proteins, ORs: Odorant Receptors, IRs: Ionotropic Receptors, SNMPs: Sensory Neuron Membrane Proteins.
Additional Data table S18 (separate file) Retinal gene conservation in Glossina and Drosophila. Grey background indicates lack of detectable ortholog in Glossina.
Additional Data table S19 (separate file) Peptidases common to all eleven insect species.
Additional Data table S20 (separate file) Peptidases known to be involved in haemoglobin degradation in various parasites.
Additional Data table S21 (separate file) Thrombin inhibitor genes.


Additional Data table S22 (separate file) Autophagy associated gene orthologs.
Additional Data table S23 (separate file) High Variability Speciation Loci.
Additional Data table S24 (separate file) Predicted fatty acid elongation enzyme orthologs from Glossina and Drosophila genes.
Additional Data table S25 (separate file) Small heat shock protein orthologs in Glossina.
Additional Data table S26 (separate file) Comparison of heat shock protein 83, 70, and heat shock protein cognates between Glossina and Drosophila.
Additional Data table S27 (separate file) Genes encoding putative structural cuticular proteins in Glossina.
Additional Data table S28 (separate file) Neuropeptide genes identified in Glossina morsitans (identification number), and their presence/absence in other dipteran species, as deduced by homology searches of the respective genomes in silico.
Additional Data table S29 (separate file) Neuropeptide receptor genes identified in Glossina (identification number) and their Drosophila orthologs as deduced by homology searches of the respective genomes in silico.
Additional Data table S30 (separate file) Putative orthologs of the insulin signaling cascade from Glossina. The % identity column shows the % amino acid match between the putative Glossina signaling molecule and the Drosophila ortholog. Percent identity was determined using LALIGN (www.ch.embnet.org).
Additional Data table S31 (separate file) Genes associated with Juvenile Hormone and Ecdysone Signaling pathways.


Additional Data table S32 (separate file) Circadian clock gene structure in Drosophila and Glossina.
Additional Data table S33 (separate file) Lineage-specific miRNAs identified by MapMi and homology searches.
Additional Data table S34 (separate file) Iron metabolism associated genes identified by tblastn analysis.
Additional Data table S35 (separate file) Comparative analysis of amino acid metabolism genes.
Additional Data table S36 (separate file) Details of the high ranked IRE containing genes in Glossina.
Additional Data table S37 (separate file) Details of the medium ranked IRE containing genes in Glossina.
Additional Data table S38 (separate file) Sex Determination associated gene orthologs in Glossina.
Additional Data table S39 (separate file) X Chromosome gene dosage compensation gene orthologs.
Additional Data table S40 (separate file) Kelch/BTB/POZ domain containing proteins.
Additional Data table S41 (separate file) RCC1 repeat containing proteins.
Additional Data table S42 (separate file) Zinc finger proteins.


Additional Data table S43 (separate file) Nucleotide-binding (Yuri gagarin-like) proteins.


References and Notes
1. S. C. Welburn, I. Maudlin, P. P. Simarro, Controlling sleeping sickness: A review. Parasitology 136, 1943–1949 (2009). doi:10.1017/S0031182009006416
2. B. M. Wiegmann, M. D. Trautwein, I. S. Winkler, N. B. Barr, J. W. Kim, C. Lambkin, M. A. Bertone, B. K. Cassel, K. M. Bayless, A. M. Heimberg, B. M. Wheeler, K. J. Peterson, T. Pape, B. J. Sinclair, J. H. Skevington, V. Blagoderov, J. Caravas, S. N. Kutty, U. Schmidt-Ott, G. E. Kampmeier, F. C. Thompson, D. A. Grimaldi, A. T. Beckenbach, G. W. Courtney, M. Friedrich, R. Meier, D. K. Yeates, Episodic radiations in the fly tree of life. Proc. Natl. Acad. Sci. U.S.A. 108, 5690–5695 (2011). doi:10.1073/pnas.1012675108
3. F. T. Petersen, R. Meier, S. N. Kutty, B. M. Wiegmann, The phylogeny and evolution of host choice in the Hippoboscoidea (Diptera) as reconstructed using four molecular markers. Mol. Phylogenet. Evol. 45, 111–122 (2007). doi:10.1016/j.ympev.2007.04.023
4. M. J. Lehane, S. Aksoy, E. Levashina, Immune responses and parasite transmission in blood-feeding insects. Trends Parasitol. 20, 433–439 (2004). doi:10.1016/j.pt.2004.07.002
5. B. L. Weiss, J. Wang, M. A. Maltz, Y. Wu, S. Aksoy, Trypanosome infection establishment in the tsetse fly gut is influenced by microbiome-regulated host immune barriers. PLOS Pathog. 9, e1003318 (2013). doi:10.1371/journal.ppat.1003318
6. S. Aksoy, M. Berriman, N. Hall, M. Hattori, W. Hide, M. J. Lehane, A case for a Glossina genome project. Trends Parasitol. 21, 107–111 (2005). doi:10.1016/j.pt.2005.01.006
7. V. Michalkova, J. B. Benoit, G. M. Attardo, J. Medlock, S. Aksoy, Amelioration of reproduction-associated oxidative stress in a viviparous insect is critical to prevent reproductive senescence. PLOS ONE 9, e87554 (2014). doi:10.1371/journal.pone.0087554
8. J. B. Benoit, I. A. Hansen, G. M. Attardo, V. Michalkova, P. O. Mireji, J. L. Bargul, L. L. Drake, D. K. Masiga, S. Aksoy, Aquaporins are critical for provision of water during lactation and intrauterine progeny hydration to maintain tsetse fly reproductive success. PLOS Negl. Trop. Dis. 8, e2517 (2014). doi:10.1371/journal.pntd.0002517
9. G. M. Attardo, J. B. Benoit, V. Michalkova, K. R. Patrick, T. B. Krause, S. Aksoy, The homeodomain protein ladybird late regulates synthesis of milk proteins during pregnancy in the tsetse fly (Glossina morsitans). PLOS Negl. Trop. Dis. 8, e2645 (2014). doi:10.1371/journal.pntd.0002645
10. J. B. Benoit, G. M. Attardo, V. Michalkova, T. B. Krause, J. Bohova, Q. Zhang, A. A. Baumann, P. O. Mireji, P. Takac, D. L. Denlinger, J. M. Ribeiro, S. Aksoy, A novel highly divergent protein family identified from a viviparous insect by RNA-seq analysis: a potential target for tsetse fly-specific abortifacients. PLOS Genet. 10, e1003874 (2014). doi:10.1371/journal.pgen.1003874
11. C. Rose, R. Belmonte, S. D. Armstrong, G. Molyneux, L. R. Haines, M. J. Lehane, J. Wastling, A. Acosta-Serrano, An investigation into the protein composition of the teneral Glossina morsitans morsitans peritrophic matrix. PLOS Negl. Trop. Dis. 8, e2691 (2014). doi:10.1371/journal.pntd.0002691
12. E. L. Telleria, J. B. Benoit, X. Zhao, A. F. Savage, S. Regmi, M. O'Neill, S. Aksoy, Insights into the trypanosome transmission process revealed through transcriptomic analysis of parasitized tsetse salivary glands. PLOS Negl. Trop. Dis. 8, e2649 (2014). doi:10.1371/journal.pntd.0002649
13. C. Brelsfoard, G. Tsiamis, M. Falchetto, L. Gomulski, E. Telleria, U. Alam, E. Ntountoumis, F. Scolari, M. Swain, P. Takac, A. R. Malacrida, K. Bourtzis, S. Aksoy, Wolbachia symbiont genome sequence and extensive chromosomal insertions present in the host Glossina morsitans morsitans genome. PLOS Negl. Trop. Dis. 8, e2728 (2014). doi:10.1371/journal.pntd.0002728
14. G. F. O. Obiero, P. O. Mireji, S. R. G. Nyanjom, A. Christoffels, H. M. Robertson, D. K. Masiga, Odorant and gustatory receptors in the tsetse fly Glossina morsitans morsitans. PLOS Negl. Trop. Dis. 8, e2663 (2014). doi:10.1371/journal.pntd.0002663
15. J. S. Kaminker, C. M. Bergman, B. Kronmiller, J. Carlson, R. Svirskas, S. Patel, E. Frise, D. A. Wheeler, S. E. Lewis, G. M. Rubin, M. Ashburner, S. E. Celniker, The transposable elements of the Drosophila melanogaster euchromatin: A genomics perspective. Genome Biol. 3, RESEARCH0084 (2002). doi:10.1186/gb-2002-3-12-research0084
16. J. Alves-Silva, J. M. C. Ribeiro, J. Van Den Abbeele, G. Attardo, Z. Hao, L. R. Haines, M. B. Soares, M. Berriman, S. Aksoy, M. J. Lehane, An insight into the sialome of Glossina morsitans morsitans. BMC Genomics 11, 213 (2010). doi:10.1186/1471-2164-11-213
17. G. Caljon, K. De Ridder, B. Stijlemans, M. Coosemans, S. Magez, P. De Baetselier, J. Van Den Abbeele, Tsetse salivary gland proteins 1 and 2 are high affinity nucleic acid binding proteins with residual nuclease activity. PLOS ONE 7, e47233 (2012). doi:10.1371/journal.pone.0047233
18. J. M. Ribeiro, B. J. Mans, B. Arcà, An insight into the sialome of blood-feeding Nematocera. Insect Biochem. Mol. Biol. 40, 767–784 (2010). doi:10.1016/j.ibmb.2010.08.002
19. T. Dolezal, E. Dolezelova, M. Zurovec, P. J. Bryant, A role for adenosine deaminase in Drosophila larval development. PLOS Biol. 3, e201 (2005). doi:10.1371/journal.pbio.0030201
20. J. Van Den Abbeele, G. Caljon, K. De Ridder, P. De Baetselier, M. Coosemans, Trypanosoma brucei modifies the tsetse salivary composition, altering the fly feeding behavior that favors parasite transmission. PLOS Pathog. 6, e1000926 (2010). doi:10.1371/journal.ppat.1000926
21. M. J. Lehane, Peritrophic matrix structure and function. Annu. Rev. Entomol. 42, 525–550 (1997). doi:10.1146/annurev.ento.42.1.525
22. E. M. Campbell, A. Ball, S. Hoppler, A. S. Bowman, Invertebrate aquaporins: A review. J. Comp. Physiol. B 178, 935–955 (2008). doi:10.1007/s00360-008-0288-2
23. D. A. Norden, D. J. Paterson, Carbohydrate metabolism in flight muscle of the tsetse fly (Glossina) and the blowfly (Sarcophaga). Comp. Biochem. Physiol. 31, 819–827 (1969). doi:10.1016/0010-406X(69)92082-9
24. B. L. Weiss, M. Maltz, S. Aksoy, Obligate symbionts activate immune system development in the tsetse fly. J. Immunol. 188, 3395–3403 (2012). doi:10.4049/jimmunol.1103691
25. R. V. Rio, R. E. Symula, J. Wang, C. Lohs, Y. N. Wu, A. K. Snyder, R. D. Bjornson, K. Oshima, B. S. Biehl, N. T. Perna, M. Hattori, S. Aksoy, Insight into the transmission biology and species-specific functional capabilities of tsetse (Diptera: glossinidae) obligate symbiont Wigglesworthia. MBio 3, e00240-11 (2012). doi:10.1128/mBio.00240-11
26. U. Alam, J. Medlock, C. Brelsfoard, R. Pais, C. Lohs, S. Balmand, J. Carnogursky, A. Heddi, P. Takac, A. Galvani, S. Aksoy, Wolbachia symbiont infections induce strong cytoplasmic incompatibility in the tsetse fly Glossina morsitans. PLOS Pathog. 7, e1002415 (2011). doi:10.1371/journal.ppat.1002415
27. A. M. Abd-Alla, F. Cousserans, A. G. Parker, J. A. Jehle, N. J. Parker, J. M. Vlak, A. S. Robinson, M. Bergoin, Genome analysis of a Glossina pallidipes salivary gland hypertrophy virus reveals a novel, large, double-stranded circular DNA virus. J. Virol. 82, 4595–4611 (2008). doi:10.1128/JVI.02588-07
28. N. A. Dyer, C. Rose, N. O. Ejeh, A. Acosta-Serrano, Flying tryps: Survival and maturation of trypanosomes in tsetse flies. Trends Parasitol. 29, 188–196 (2013). doi:10.1016/j.pt.2013.02.003
29. V. Bosco-Drayon, M. Poidevin, I. G. Boneca, K. Narbonne-Reveau, J. Royet, B. Charroux, Peptidoglycan sensing by the receptor PGRP-LE in the Drosophila gut induces immune responses to infectious bacteria and tolerance to microbiota. Cell Host Microbe 12, 153–165 (2012). doi:10.1016/j.chom.2012.06.002
30. C. G. Elsik, The pea aphid genome sequence brings theories of insect defense into question. Genome Biol. 11, 106 (2010). doi:10.1186/gb-2010-11-2-106
31. K. Hens, P. Lemey, N. Macours, C. Francis, R. Huybrechts, Cyclorraphan yolk proteins and lepidopteran minor yolk proteins originate from two unrelated lipase families. Insect Mol. Biol. 13, 615–623 (2004). doi:10.1111/j.0962-1075.2004.00520.x
32. R. Liu, X. He, S. Lehane, M. Lehane, C. Hertz-Fowler, M. Berriman, L. M. Field, J. J. Zhou, Expression of chemosensory proteins in the tsetse fly Glossina morsitans morsitans is related to female host-seeking behaviour. Insect Mol. Biol. 21, 41–48 (2012). doi:10.1111/j.1365-2583.2011.01114.x
33. R. Hardie, K. Vogt, A. Rudolph, The compound eye of the tsetse fly (Glossina morsitans morsitans and Glossina palpalis palpalis). J. Insect Physiol. 35, 423–431 (1989). doi:10.1016/0022-1910(89)90117-0
34. G. Gibson, S. J. Torr, Visual and olfactory responses of haematophagous Diptera to host stimuli. Med. Vet. Entomol. 13, 2–23 (1999). doi:10.1046/j.1365-2915.1999.00163.x
35. J. Brady, Flying mate detection and chasing by tsetse flies (Glossina). Physiol. Entomol. 16, 153–161 (1991). doi:10.1111/j.1365-3032.1991.tb00551.x
36. M. Friedrich, Encyclopedia of Life Sciences (Wiley, Chichester, 2010). doi:10.1002/9780470015902.a0022898
37. J. M. Lindh, P. Goswami, R. S. Blackburn, S. E. Arnold, G. A. Vale, M. J. Lehane, S. J. Torr, Optimizing the colour and fabric of targets for the control of the tsetse fly Glossina fuscipes fuscipes. PLOS Negl. Trop. Dis. 6, e1661 (2012). doi:10.1371/journal.pntd.0001661
38. A. Schmitt, A. Vogt, K. Friedmann, R. Paulsen, A. Huber, Rhodopsin patterning in central photoreceptor cells of the blowfly Calliphora vicina: Cloning and characterization of Calliphora rhodopsins Rh3, Rh5 and Rh6. J. Exp. Biol. 208, 1247–1256 (2005). doi:10.1242/jeb.01527
39. S. K. Moloo, An artificial feeding technique for Glossina. Parasitology 63, 507–512 (1971). doi:10.1017/S0031182000080021
40. R. Pais, C. Lohs, Y. Wu, J. Wang, S. Aksoy, The obligate mutualist Wigglesworthia glossinidia influences reproduction, digestion, and immunity processes of its host, the tsetse fly. Appl. Environ. Microbiol. 74, 5965–5974 (2008). doi:10.1128/AEM.00741-08
41. K. Osoegawa, P. Y. Woon, B. Zhao, E. Frengen, M. Tateno, J. J. Catanese, P. J. de Jong, An improved approach for construction of bacterial artificial chromosome libraries. Genomics 52, 1–8 (1998). doi:10.1006/geno.1998.5423
42. J. Wang, C. Hu, Y. Wu, A. Stuart, C. Amemiya, M. Berriman, A. Toyoda, M. Hattori, S. Aksoy, Characterization of the antimicrobial peptide attacin loci from Glossina morsitans. Insect Mol. Biol. 17, 293–302 (2008). doi:10.1111/j.1365-2583.2008.00805.x
43. I. J. Tsai, T. D. Otto, M. Berriman, Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol. 11, R41 (2010). doi:10.1186/gb-2010-11-4-r41
44. R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, S. Li, H. Yang, J. Wang, J. Wang, De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010). doi:10.1101/gr.097261.109
45. V. Curwen, E. Eyras, T. D. Andrews, L. Clarke, E. Mongin, S. M. Searle, M. Clamp, The Ensembl automatic gene annotation system. Genome Res. 14, 942–950 (2004). doi:10.1101/gr.1858004
46. C. Holt, M. Yandell, MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011). doi:10.1186/1471-2105-12-491
47. A. L. Price, N. C. Jones, P. A. Pevzner, De novo identification of repeat families in large genomes. Bioinformatics 21, (Suppl 1), i351–i358 (2005). doi:10.1093/bioinformatics/bti1018
48. Z. Bao, S. R. Eddy, Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12, 1269–1276 (2002). doi:10.1101/gr.88502
49. A. Smit, R. Hubley, http://www.repeatmasker.org (2010).
50. I. Korf, Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004). doi:10.1186/1471-2105-5-59
51. M. Stanke, O. Schöffmann, B. Morgenstern, S. Waack, Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006). doi:10.1186/1471-2105-7-62
52. G. S. Slater, E. Birney, Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005). doi:10.1186/1471-2105-6-31
53. C. Trapnell, L. Pachter, S. L. Salzberg, TopHat: Discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009). doi:10.1093/bioinformatics/btp120
54. E. Birney, M. Clamp, R. Durbin, GeneWise and Genomewise. Genome Res. 14, 988–995 (2004). doi:10.1101/gr.1865504
55. N. D. Rawlings, A. J. Barrett, A. Bateman, MEROPS: The database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 40, (D1), D343–D350 (2012). doi:10.1093/nar/gkr987
56. M. Johnson, I. Zaretskaya, Y. Raytselis, Y. Merezhuk, S. McGinnis, T. L. Madden, NCBI BLAST: A better web interface. Nucleic Acids Res. 36, (Web Server), W5–W9 (2008). doi:10.1093/nar/gkn201
57. K. Megy, S. J. Emrich, D. Lawson, D. Campbell, E. Dialynas, D. S. Hughes, G. Koscielny, C. Louis, R. M. Maccallum, S. N. Redmond, A. Sheehan, P. Topalis, D. Wilson; VectorBase Consortium, VectorBase: Improvements to a bioinformatics resource for invertebrate vector genomics. Nucleic Acids Res. 40, (D1), D729–D734 (2012). doi:10.1093/nar/gkr1089
58. T. Carver, S. R. Harris, M. Berriman, J. Parkhill, J. A. McQuillan, Artemis: An integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 28, 464–469 (2012). doi:10.1093/bioinformatics/btr703
59. S. Fischer et al., Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups, in Current Protocols in Bioinformatics, Andreas D. Baxevanis et al., Eds. (2011), Chapter 6, Unit 6 12 11-19.
60. M. G. Grabherr, P. Russell, M. Meyer, E. Mauceli, J. Alföldi, F. Di Palma, K. Lindblad-Toh, Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics 26, 1145–1151 (2010). doi:10.1093/bioinformatics/btq102
61. R. Donthu, H. A. Lewin, D. M. Larkin, SyntenyTracker: A tool for defining homologous synteny blocks using radiation hybrid maps and whole-genome sequence. BMC Res. Notes 2, 148 (2009). doi:10.1186/1756-0500-2-148
62. B. Chevreux, T. Pfisterer, B. Drescher, A. J. Driesel, W. E. Müller, T. Wetter, S. Suhai, Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 14, 1147–1159 (2004). doi:10.1101/gr.1917404
63. T. J. Treangen, D. D. Sommer, F. E. Angly, S. Koren, M. Pop, Next generation sequence assembly with AMOS, in Current Protocols in Bioinformatics, Andreas D. Baxevanis et al., Eds. (2011), Chapter 11, Unit 11 18.
64. Y. Han, J. M. Burnette, 3rd, S. R. Wessler, TARGeT: A web-based pipeline for retrieving and characterizing gene and transposable element families from genomic sequences. Nucleic Acids Res. 37, e78 (2009). doi:10.1093/nar/gkp295
65. Y. Han, S. R. Wessler, MITE-Hunter: A program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010). doi:10.1093/nar/gkq862
66. V. K. Carpenter, L. L. Drake, S. E. Aguirre, D. P. Price, S. D. Rodriguez, I. A. Hansen, SLC7 amino acid transporters of the yellow fever mosquito Aedes aegypti and their role in fat body TOR signaling and reproduction. J. Insect Physiol. 58, 513–522 (2012). doi:10.1016/j.jinsphys.2012.01.005
67. L. H. Matherly, D. I. Goldman, Membrane transport of folates. Vitam. Horm. 66, 403–456 (2003). doi:10.1016/S0083-6729(03)01012-4
68. J. A. Kroemer, B. A. Webb, Polydnavirus genes and genomes: Emerging gene families and new insights into polydnavirus replication. Annu. Rev. Entomol. 49, 431–456 (2004). doi:10.1146/annurev.ento.49.072103.120132
69. E. T. MacLeod, I. Maudlin, A. C. Darby, S. C. Welburn, Antioxidants promote establishment of trypanosome infections in tsetse. Parasitology 134, 827–831 (2007). doi:10.1017/S0031182007002247
70. L. Pick, A. Heffer, Hox gene evolution: Multiple mechanisms contributing to evolutionary novelties. Ann. N. Y. Acad. Sci. 1256, 15–32 (2012). doi:10.1111/j.1749-6632.2011.06385.x
71. A. J. Barrett, N. D. Rawlings, ‘Species’ of peptidases. Biol. Chem. 388, 1151–1157 (2007). doi:10.1515/BC.2007.151
72. I. Fruitier, I. Garreau, J. M. Piot, Cathepsin D is a good candidate for the specific release of a stable hemorphin from hemoglobin in vivo: VV-hemorphin-7. Biochem. Biophys. Res. Commun. 246, 719–724 (1998). doi:10.1006/bbrc.1998.8614
73. M. Cappello, S. Li, X. Chen, C. B. Li, L. Harrison, S. Narashimhan, C. B. Beard, S. Aksoy, Tsetse thrombin inhibitor: Bloodmeal-induced expression of an anticoagulant in salivary glands and gut tissue of Glossina morsitans morsitans. Proc. Natl. Acad. Sci. U.S.A. 95, 14290–14295 (1998). doi:10.1073/pnas.95.24.14290
74. U. Hengst, H. Albrecht, D. Hess, D. Monard, The phosphatidylethanolamine-binding protein is the prototype of a novel family of serine protease inhibitors. J. Biol. Chem. 276, 535–540 (2001). doi:10.1074/jbc.M002524200
75. P. Nosil, D. Schluter, The genes underlying the process of speciation. Trends Ecol. Evol. 26, 160–167 (2011). doi:10.1016/j.tree.2011.01.001
76. S. H. Sze, J. P. Dunham, B. Carey, P. L. Chang, F. Li, R. M. Edman, C. Fjeldsted, M. J. Scott, S. V. Nuzhdin, A. M. Tarone, A de novo transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted alternative splices, single nucleotide polymorphisms and transcript expression estimates. Insect Mol. Biol. 21, 205–221 (2012). doi:10.1111/j.1365-2583.2011.01127.x
77. J. Caers, H. Verlinden, S. Zels, H. P. Vandersmissen, K. Vuerinckx, L. Schoofs, More than two decades of research on insect neuropeptide GPCRs: An overview. Front. Endocrinol. (Lausanne) 3, 151 (2012). doi:10.3389/fendo.2012.00151
78. F. Hauser, S. Neupert, M. Williamson, R. Predel, Y. Tanaka, C. J. Grimmelikhuijzen, Genomics and peptidomics of neuropeptides and protein hormones present in the parasitic wasp Nasonia vitripennis. J. Proteome Res. 9, 5296–5310 (2010). doi:10.1021/pr100570j
79. Y. Antonova, A. J. Arik, W. Moore, M. A. Riehle, M. R. Brown, in Insect Endocrinology, L. I. Gilbert, Ed. (Academic Press, Waltham, MA, 2012), pp. 63–92.
80. M. Jindra, S. R. Palli, L. M. Riddiford, The juvenile hormone signaling pathway in insect development. Annu. Rev. Entomol. 58, 181 (2013). doi:10.1146/annurev-ento-120811-153700
81. M. A. Abdou, Q. He, D. Wen, O. Zyaan, J. Wang, J. Xu, A. A. Baumann, J. Joseph, T. G. Wilson, S. Li, J. Wang, Drosophila Met and Gce are partially redundant in transducing juvenile hormone action. Insect Biochem. Mol. Biol. 41, 938–945 (2011). doi:10.1016/j.ibmb.2011.09.003
82. A. A. Baumann, J. B. Benoit, V. Michalkova, P. O. Mireji, G. M. Attardo, J. K. Moulton, T. G. Wilson, S. Aksoy, Juvenile hormone and insulin suppress lipolysis between periods of lactation during tsetse fly pregnancy. Mol. Cell. Endocrinol. 372, 30–41 (2013). doi:10.1016/j.mce.2013.02.019
83. Q. Yuan, D. Metterville, A. D. Briscoe, S. M. Reppert, Insect cryptochromes: Gene duplication and loss define diverse ways to construct insect circadian clocks. Mol. Biol. Evol. 24, 948–955 (2007). doi:10.1093/molbev/msm011
84. J. J. Winzerling, D. Q. Pham, Iron metabolism in insect disease vectors: Mining the Anopheles gambiae translated protein database. Insect Biochem. Mol. Biol. 36, 310–321 (2006). doi:10.1016/j.ibmb.2006.01.006
85. H. Nichol, J. H. Law, J. J. Winzerling, Iron metabolism in insects. Annu. Rev. Entomol. 47, 535–559 (2002). doi:10.1146/annurev.ento.47.091201.145237
86. M. Thomas-Chollier, O. Sand, J. V. Turatsinze, R. Janky, M. Defrance, E. Vervisch, S. Brohée, J. van Helden, RSAT: Regulatory sequence analysis tools. Nucleic Acids Res. 36, (Web Server), W119–W127 (2008). doi:10.1093/nar/gkn304
87. E. C. Verhulst, L. van de Zande, L. W. Beukeboom, Insect sex determination: It all evolves around transformer. Curr. Opin. Genet. Dev. 20, 376–383 (2010). doi:10.1016/j.gde.2010.05.001
88. M. Hediger, C. Henggeler, N. Meier, R. Perez, G. Saccone, D. Bopp, Molecular characterization of the key switch F provides a basis for understanding the rapid divergence of the sex-determining pathway in the housefly. Genetics 184, 155–170 (2010). doi:10.1534/genetics.109.109249
89. T. Conrad, A. Akhtar, Dosage compensation in Drosophila melanogaster: Epigenetic fine-tuning of chromosome-wide transcription. Nat. Rev. Genet. 13, 123–134 (2012). doi:10.1038/nrg3124
90. R. H. Gooding, G. S. McIntyre, Glossina morsitans morsitans and Glossina palpalis palpalis: Dosage compensation raises questions about the Milligan model for control of trypanosome development. Exp. Parasitol. 90, 244–249 (1998). doi:10.1006/expr.1998.4332
91. M. J. Scott, L. L. Pan, S. B. Cleland, A. L. Knox, J. Heinrich, MSL1 plays a central role in assembly of the MSL complex, essential for dosage compensation in Drosophila. EMBO J. 19, 144–155 (2000). doi:10.1093/emboj/19.1.144
92. J. Kadlec, E. Hallacli, M. Lipp, H. Holz, J. Sanchez-Weatherby, S. Cusack, A. Akhtar, Structural basis for MOF and MSL3 recruitment into the dosage compensation complex by MSL1. Nat. Struct. Mol. Biol. 18, 142–149 (2011). doi:10.1038/nsmb.1960
93. F. Li, D. A. Parry, M. J. Scott, The amino-terminal region of Drosophila MSL1 contains basic, glycine-rich, and leucine zipper-like motifs that promote X chromosome binding, self-association, and MSL2 binding, respectively. Mol. Cell. Biol. 25, 8913–8924 (2005). doi:10.1128/MCB.25.20.8913-8924.2005
94. T. Fauth, F. Müller-Planitz, C. König, T. Straub, P. B. Becker, The DNA binding CXC domain of MSL2 is required for faithful targeting the Dosage Compensation Complex to the X chromosome. Nucleic Acids Res. 38, 3209–3221 (2010). doi:10.1093/nar/gkq026
95. A. A. Alekseyenko, S. Peng, E. Larschan, A. A. Gorchakov, O. K. Lee, P. Kharchenko, S. D. McGrath, C. I. Wang, E. R. Mardis, P. J. Park, M. I. Kuroda, A sequence motif within chromatin entry sites directs MSL establishment on the Drosophila X chromosome. Cell 134, 599–609 (2008). doi:10.1016/j.cell.2008.06.033
96. F. van Eeden, D. St Johnston, The polarisation of the anterior-posterior and dorsal-ventral axes during Drosophila oogenesis. Curr. Opin. Genet. Dev. 9, 396–404 (1999). doi:10.1016/S0959-437X(99)80060-4
97. S. Lemke, M. Stauber, P. J. Shaw, A. M. Rafiqi, A. Prell, U. Schmidt-Ott, Bicoid occurrence and Bicoid-dependent hunchback regulation in lower cyclorrhaphan flies. Evol. Dev. 10, 413–420 (2008). doi:10.1111/j.1525-142X.2008.00252.x
98. J. Adams, R. Kelso, L. Cooley, The kelch repeat superfamily of proteins: Propellers of cell function. Trends Cell Biol. 10, 17–24 (2000). doi:10.1016/S0962-8924(99)01673-6
99. M. Dasso, RCC1 in the cell cycle: The regulator of chromosome condensation takes on new roles. Trends Biochem. Sci. 18, 96–101 (1993). doi:10.1016/0968-0004(93)90161-F
100. H. R. Chung, U. Löhr, H. Jäckle, Lineage-specific expansion of the zinc finger associated domain ZAD. Mol. Biol. Evol. 24, 1934–1943 (2007). doi:10.1093/molbev/msm121
101. M. J. Texada, R. A. Simonette, C. B. Johnson, W. J. Deery, K. M. Beckingham, Yuri gagarin is required for actin, tubulin and basal body functions in Drosophila spermatogenesis. J. Cell Sci. 121, 1926–1936 (2008). doi:10.1242/jcs.026559
