
A Novel Method for Discovering Primate Gene Loss using the Ancestral Genome

J. Zachary Sanborn, Jing Zhu, Mark Diekhans & David Haussler

June 15, 2006

Abstract

The evolution of a species can be described by that species’ gradual acquisition of a set of genes that help it become better suited for its environment. This set of genes can result from the addition of new genes (e.g. functional diversification following a gene duplication event), the mutation of an ancestral gene to something more efficient, or the complete loss of a gene (e.g. introduction of a premature stop codon). For this project, we are specifically interested in discovering gene losses caused by mutations in the gene’s protein-coding region that occurred during the course of human evolution. We hypothesize that a known (RefSeq) mouse gene that maps to (1) a functional gene in the boroeutherian ancestral genome and (2) a non-functional gene in the human genome is a good candidate for a human gene loss and worth further study. We use the mappings produced by TransMap as a basis for our automated method. Known positive controls, such as the verified loss of urate oxidase (Uox), will be used to validate the results of the automated method [23]. Furthermore, we describe HTMaLigner, a visualization tool developed to aid the manual inspection of the TransMap gene loss predictions.

1 Introduction

The present-day genome of an extant species is the culmination of millions of years of evolutionary change. In essence, the genome contains a record of that species’ entire evolutionary history. Comparative genomics uses sequence-based homology to discover similar genes in distinct organisms. This simple finding leads to the amazing conclusion that organisms as different as humans and rats must have at some point emerged from a common ancestor. Careful analysis of the specific differences between these genes allows an estimate of the evolutionary distance between these organisms (alternatively, the approximate time at which these organisms diverged from their common ancestor) to be made.

Throughout a species’ history, its genome is under experimentation, constantly searching for a set of genes that helps its chances for survival and, therefore, procreation. Anthropomorphisms aside, this is a primarily blind search since many of the methods for genetic modification are inherently random processes (e.g. mutation due to replication error, gene duplication, etc.). Mutations accumulate due to mistakes during DNA replication or via DNA damage in the germ line. New genes may be introduced via the incorporation of intraspecies (e.g. sex) or interspecies (e.g. viral) genetic material. However, the beautifully simple evolutionary process sorts through these myriad possibilities to find the paths that give rise to a gradual increase in the species’ fitness with respect to its current environment.

The focus of this paper is on a specific method for genetic modification known as gene loss. A gene loss is characterized by an established gene (i.e. one found currently active in multiple species) that was inactivated by some mechanism in a particular lineage. Genes can be inactivated by a variety of mutational events in the protein-coding sequence or by a disruption of the transcription
or translation signals (e.g. the promoter) that the gene requires to be expressed in the eukaryotic cell. The focus of this research is to detect gene losses caused by two types of mutational events (nonsense mutations and frame-shift inducing insertions/deletions) that occur within the protein-coding sequence.

In the majority of cases, a gene inactivation spells death for the organism, since the gene may provide an essential and unique function in the cell. Other times, a gene inactivation may have a deleterious effect on the organism, but not be entirely fatal. However, some gene losses are actually beneficial to the cell and undergo positive selection in the population. In these cases, the particular gene may be acting as a barrier to the organism’s further evolution or the gene may simply be obsolete, providing a function that the cell no longer requires.

Of course, being human, it is only natural to begin this study by attempting to discover human-specific gene losses or, at the very least, gene losses that are specific to the primate lineage. Some noteworthy examples of human- or primate-specific gene losses are the genes that code for cysteinyl aspartate proteinase (CASPASE12), myosin heavy chain 16 (Myh16), and gulonolactone (L-) oxidase (Gulo). It is believed that the human-specific loss of CASPASE12 may have led to increased resistance to severe sepsis. This resistance is thought to have become important as human populations coalesced into farming communities, which, due to the concentrated exposure to microbes of other humans and domesticated animals coupled with the lack of proper hygiene, increased the probability of receiving a blood-borne infection [22]. A gene loss with an entirely different effect is the primate-specific loss of Myh16, which is hypothesized to be one of the many stepping stones to the development of our increased brain size by reducing the size of the jawbone muscles [20].
The gene loss of Gulo is a case where its function (biosynthesis of vitamin C) is no longer needed as humans obtain vitamin C from their diet [15]. The human cell thus becomes more efficient by ceasing production of an obsolete gene.

What is presented here is an automated method for discovering non-specific human gene losses since the divergence from the mouse and human common ancestor. These are identified as non-specific human losses since our method, while correctly identifying the human gene loss event, is currently incapable of determining where in the primate lineage the loss occurred. A manual comparison is made using the human, chimp, and rhesus conservation tracks of the UCSC Genome Browser to fulfill this end. Although the method described here was developed to discover human gene losses, it should be noted that it can readily be applied to discover gene losses in other species for which high-quality genomic data exist.

2 Method

The recent introduction of mathematically reconstructed ancestral genomes allows a novel method for discovering human gene losses. In this method, the hypothesis is that a human gene loss has occurred when the gene in question has been inactivated in the human genome while a functioning copy of the gene is found in both the mouse genome and the common ancestor of mouse and human, termed the boroeutherian ancestor. A full schematic of the method is shown in Figure 1; details are given below.

2.1 TransMap Mappings

An integral part of this project is the mapping of genes between species’ genomes. To determine these interspecies mappings, the TransMap software developed at UCSC by Mark Diekhans was used [5]. It is named after the transitive property of algebra (if A → B and B → C, then A → C), but as applied to comparative genomics. The software begins by mapping a known mRNA from the
“query” species to the genome of that query species using the BLAT alignment tool [11]. Once the mRNA has been mapped onto the query species’ genome, a BLASTZ syntenic alignment between the query species and the “target” species can then be used to map the known gene to the target species’ genome [3, 17]. By mapping through an intermediate (the query species’ genome), TransMap does a better job handling the untranslated regions (UTRs) and introns than would a simple BLAT alignment of the query mRNA to the target species’ genome.

The mappings produced by TransMap are then used to construct a gene prediction for the target species by attempting to correct for any evolutionary changes that may have occurred between the query and target species. The corrections try to produce a “good” gene annotation by moving the boundaries of the predicted exons, adjusting the start and end of the prediction’s coding region, and fixing any splice sites, among many other heuristics. After these corrections are applied, the final TransMap-predicted gene is validated by checking for the necessary qualities of a good gene. In order to be classified a “good” gene, the gene prediction must have the following qualities: the coding region begins with a start codon and ends with a stop codon, every splice site conforms to one of the three most prevalent splice-site motifs, no in-frame stop codons are present, the number of bases in the coding region is a multiple of three, and the reading frame remains consistent as it moves from one exon into the next.

In order to test the hypothesis, TransMap gene predictions for human and the boroeutherian ancestor are required, using a reliable set of mRNAs from a chosen query species.
It was decided that the best database to use for the query species would be the curated RefSeq genes for the latest mouse genome (assembly: February 2006) since not only does mouse share the boroeutherian ancestor with human, but the RefSeq database is well known for its reliability. In addition, aside from human, mouse is the only other species with a comprehensive, high-quality gene set. Starting with mouse RefSeq mRNAs, Mark Diekhans used his software to produce TransMap gene predictions for both the human genome (assembly: March 2006) and the boroeutherian ancestor (assembly: April 24, 2006). All gene predictions were validated as described above.
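The sequence-level part of the validation criteria above can be illustrated with a minimal sketch. This is not TransMap's actual code: the function names `validate_cds` and `is_loss_candidate` are hypothetical, the input is assumed to be an already spliced coding sequence, and the splice-site and exon-to-exon frame checks are omitted because they require exon coordinates.

```python
# Hypothetical sketch of the sequence-level "good gene" checks; not TransMap code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def validate_cds(cds):
    """Return a list of problems found in a spliced coding sequence.

    An empty list means the sequence passes these checks ("good").
    """
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of three")
    if cds[:3] != "ATG":
        problems.append("no start codon")
    if cds[-3:] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    # Scan complete codons in frame 0, excluding the final three bases.
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append(f"in-frame stop codon at base {i}")
            break
    return problems


def is_loss_candidate(ancestor_cds, human_cds):
    """A candidate loss: "good" in the ancestor but not "good" in human."""
    return not validate_cds(ancestor_cds) and bool(validate_cds(human_cds))


if __name__ == "__main__":
    ancestor = "ATGGCCTAA"
    human = "ATGGCTAA"  # one base deleted relative to the ancestor
    print(validate_cds(human))
    print(is_loss_candidate(ancestor, human))
```

Returning a list of problems rather than a single boolean keeps the reason for a "bad" call available for later manual inspection.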

2.2 Determining Gene Losses

The gene predictions for the human and boroeutherian ancestor were analyzed to determine which of these predictions were validated as “good” genes in the ancestor but not in human. Since we know the gene is still active in mouse (from the RefSeq annotation) and is predicted to have also been active in the boroeutherian ancestor, it is thus a candidate for a human gene loss. However, knowing that the variability in the TransMap method has the potential to introduce mistakes in its gene predictions, it is necessary to check that the locations of the predicted gene losses show no evidence of transcription. To filter out any mistakes, the locations of the predicted gene losses shall be intersected with known human RefSeq genes. If any predicted loss overlaps the genomic region of a human RefSeq gene, the predicted gene loss will be removed from consideration. As a more conservative filter, the predicted gene losses shall also be intersected with human mRNAs. The database of human mRNAs is less reliable, given the low threshold on the quality of data it accepts, but using it as a filter provides the strongest criterion that the final set of predicted gene losses have never shown any evidence of active transcription in human. Nevertheless, the final list produced via this filter is a conservative one, since good gene losses could be struck from the list due to a bad human mRNA that was incorrectly mapped to the same location.
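The two filters described above amount to removing any candidate whose genomic interval overlaps an interval with transcription evidence. A minimal sketch follows, assuming half-open, 0-based coordinates; the `Interval` type, the function names, and the coordinates in the usage example are all hypothetical and do not come from the analysis.

```python
# Hypothetical sketch of the overlap filters; coordinates below are made up.
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def filter_candidates(candidates, evidence):
    """Keep only candidates that overlap no evidence interval."""
    return [c for c in candidates if not any(overlaps(c, e) for e in evidence)]


if __name__ == "__main__":
    candidates = [
        Interval("chr1", 100, 200),
        Interval("chr1", 500, 600),
        Interval("chr2", 100, 200),
    ]
    refseq = [Interval("chr1", 150, 300)]
    mrna = [Interval("chr2", 50, 120)]

    after_refseq = filter_candidates(candidates, refseq)  # first filter
    final = filter_candidates(after_refseq, mrna)         # second, stricter filter
    print(len(candidates), len(after_refseq), len(final))
```

The quadratic scan is adequate for about a thousand candidates; a sorted sweep or an interval tree would be the natural replacement for genome-scale evidence sets.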

Figure 1: A schematic of the automated method used to determine gene losses in the human lineage. [Schematic: mouse RefSeq mRNA → (BLAT) → mouse genome → (BLASTZ) → target genomes (human, ancestor) → TransMap predicted genes → validation (“good” or “bad”) → initial list → filters → final list.]

3 Results

Of the 19,541 mouse RefSeq mRNAs mapped to both target genomes (human and ancestor), 1,008 predicted genes were classified as potential human gene losses using the initial criteria described above. This list then underwent the filtering process described above and summarized in Figure 2. The first filter removed those that intersect with human RefSeq genes, reducing the list to 174 gene losses. The second, more conservative filter using human mRNAs further reduced this list to the final set of 78 predicted gene losses.

Figure 2: A breakdown of the filtering performed on the initial list of potential gene losses produced by the TransMap mappings. [Initial list: 1,008; 834 removed for overlap with RefSeq, leaving 174; 96 removed for overlap with mRNA, leaving a final list of 78.]

Of these 78 predicted gene losses, 50 have known functions in mouse: 16 are annotated as olfactory receptors and 34 have various other annotated functions. The remaining 28 predicted gene losses are annotated as “hypothetical” genes in mouse; their biological functions are currently unknown. It is reassuring to find the list populated with a large number of olfactory receptors that have been lost in the human lineage. It is common knowledge that many other mammals (dogs, cats, etc.) have a well-developed sense of smell, leading one to naturally conclude that they have a greater number and variety of olfactory receptors than humans. Research has shown that humans accumulate disruptive mutations in olfactory receptors four times faster than observed in other primates [8]. Therefore, the fact that roughly 20% (16 of 78) of the gene losses predicted here are olfactory receptors bodes well for the effectiveness of the method.

3.1 HTMaLigner Visualization Tool

The number of predicted gene losses is now at a size that allows for a manual inspection of the TransMap mappings. This is necessary in order to verify that there exists reasonable evidence of the gene truly being lost in human. Unfortunately, the only tool available at the time was the UCSC Genome Browser, which is not designed to quickly scan alignments at the amino acid level [13]. Also, the TransMap mappings, on which our analysis is based, are not available within the browser in a manner appropriate for the necessary analysis. Therefore, the need arose to develop a tool that could visualize the TransMap mapping data at the nucleotide and amino acid level. With such a tool, it becomes possible not only to locate the mechanism of gene inactivation (e.g. mutation, frame shift, etc.), but also to verify that the TransMap mappings and subsequent validation of the predicted genes performed properly for both the human and ancestor.

To verify these mappings, the tool named HTMaLigner was developed. It constructs an HTML file from the genome mappings produced by TransMap and the human, mouse, and ancestral genomic data available on the cluster. The mapping files are stored in the PSL format and thus contain all the data necessary to reconstruct the pairwise alignment between the query and each target species at the nucleotide level. HTMaLigner builds a multiple alignment from the pairwise alignments of mouse RefSeq mRNAs to the latest assemblies of the mouse (mm8), human (hg18), and ancestral (boroEut13) genomes. Although it is not necessary to align the mRNA to the mouse genome, doing so aids in visualizing the mouse gene’s intron/exon structure and how this structure remains conserved in the human and ancestral genomes.

After constructing the multiple alignment, the genetic sequences for each species (mouse, human, and ancestor) are translated into amino acids. Since the start codon is known for the mouse genome, only the correct reading frame is used for its translation. However, since the beginnings of the human and ancestral genes are merely predictions, it is necessary to provide translations using all three reading frames. To aid the eye in scanning these alignments, color is used to indicate a good match (at the nucleotide or amino acid level) between the mouse sequence and either of the two target species. A pair of aligned nucleotides that are identical is marked by a yellow background. Similarly, a pair of aligned amino acids is colored in various shades of green. The shade of green is determined by the BLOSUM62 substitution matrix, with darker shades indicating a higher substitution score (which correlates with amino acid similarity) between the aligned amino acids. This colorization greatly improves the readability of the amino acid track, as it makes it very easy to see where frame shifts occur.

Figure 3 shows an example of the output of HTMaLigner. Notice that at the end of each line of the genomic tracks is a direct link to the UCSC Genome Browser for the specific sequence given on that line.
This gives the user the ability to inspect the exact area where a mutation, insertion, or deletion may have occurred and tap the vast stores of information available in the browser to check if this particular problem is found in other species, is polymorphic in the human population, etc.
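The three-frame translation that makes frame shifts visible can be sketched in a few lines. This is an illustration only and not the HTMaLigner source: the function names are hypothetical, the standard genetic code is assumed, and the BLOSUM62 shading and HTML output are left out.

```python
# Hypothetical sketch of three-frame translation; not the HTMaLigner source.
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)
}


def translate(seq):
    """Translate complete codons of seq; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translation(seq):
    """Translate seq in each of the three forward reading frames."""
    return [translate(seq[offset:]) for offset in range(3)]


if __name__ == "__main__":
    mouse = "ATGGCCAAATTT"
    human = "ATGCCAAATTT"  # mouse sequence with a single base deleted
    print(translate(mouse))
    for frame, protein in enumerate(three_frame_translation(human)):
        print(frame, protein)
```

In this toy example the mouse sequence translates to MAKF, while the human sequence reads MPN in its first frame and AKF in its third, so the downstream match to mouse reappears in a different frame. That is the pattern the colored amino acid tracks are meant to expose.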

Figure 3: An example of the output of the HTMaLigner tool. This example is from the mouse RefSeq gene NM_146126, annotated as sorbitol dehydrogenase 1. The human predicted gene shown here begins in the frame designated H2, then a single base pair deletion causes the frame to shift relative to the mouse gene (denoted as MM for the mRNA translation and MG for the genome translation). The human gene remains in frame H2 until it crosses through the intron and into the exon below, where it emerges in frame H1 and encounters an early stop codon. Note that the single base pair deletion is found in human, but not in the ancestor, and thus the gene would be classified as a good gene loss candidate.

3.2 Analysis of the Gene Losses

The HTMaLigner tool was used to assess the quality of the mappings for 62 of the 78 predicted gene losses. The 16 olfactory receptors are excluded from further analysis in part due to time considerations, but also because they are the least interesting gene losses. Mapping quality was categorized in the following way:

“Good”: Verifiable loss in human. Ancestral copy (nearly) intact.
“Decent”: Verifiable loss in human but possible problems with mapping. Ancestral copy (nearly) intact.
“Mediocre”: Suspect loss in human and/or ancestral mapping suspicious.
“Bad”: Gene is not lost in human and/or terrible ancestral mapping.

It should be noted that, while the “good” and “bad” categories have sufficiently strict definitions, the “decent” and “mediocre” categorizations are likely to be interpreted differently by different people. This quality assessment is therefore intended as a means to focus efforts on the most reliable candidates first, progressing to lower categories as the higher-quality subsets are exhausted.

Those predicted gene losses that are determined to have “good” mapping quality are to be checked for any single nucleotide polymorphisms (SNPs) at the gene loss event and for when the gene was likely lost in the primate lineage. This analysis took advantage of the HTMaLigner feature that allows any section of the alignment to be shown in the UCSC Genome Browser. Any mutations or insertions/deletions believed to cause the gene loss were checked using the browser. While the SNP datasets are not yet available for the latest human assembly, each mutation or insertion/deletion causing a gene loss was checked to see if it also exists in the chimp and rhesus genomes using the syntenic conservation track available on the UCSC Genome Browser [1, 6, 10, 12, 18, 19, 24]. By identifying which primates exhibit the same gene mutation or insertion/deletion, the point at which the gene loss occurred during the evolution of primates can be roughly approximated. For example, if the identical AGA→TGA nonsense mutation is found in human, chimp, and rhesus, then the gene loss very likely occurred before human and chimp diverged from their common ancestor with rhesus.

Table 1 summarizes the analysis performed on the 62 predicted gene losses. 23 genes are classified as having “good” mappings (17 annotated, 6 hypothetical), 13 genes are classified as “decent” (7 annotated, 6 hypothetical), and 6 genes are classified as “mediocre” (1 annotated, 5 hypothetical). The remaining 20 gene loss candidates exhibited “bad” mappings and are therefore removed from further analysis.
The “good” gene loss candidates were further analyzed using the UCSC Genome Browser to check for signs of identical gene loss in chimp and rhesus. Of the 23 “good” candidates, 7 losses were specific to the human lineage (6 annotated, 1 hypothetical), 12 were found in both human and chimp (9 annotated, 3 hypothetical), and 3 were found in human, chimp, and rhesus (2 annotated, 1 hypothetical). The lineage of the one remaining “good” candidate, a hypothetical gene, has yet to be determined (TBD in Table 1).
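The lineage labels used in Table 1 follow mechanically from which primates share the identical inactivating mutation. The sketch below is an illustration only: the function name is hypothetical, and the handling of a mutation shared with rhesus but not chimp is an assumption, since the text does not discuss that case.

```python
# Hypothetical sketch of the lineage labelling; not part of the original pipeline.
def classify_lineage(shared_with):
    """Label a human gene loss by the other primates carrying the same mutation.

    shared_with is a set drawn from {"chimp", "rhesus"}.
    """
    shared = set(shared_with)
    if shared == {"chimp", "rhesus"}:
        return "HCR"  # loss predates the split of human/chimp from rhesus
    if shared == {"chimp"}:
        return "HC"   # loss predates the human/chimp split
    if not shared:
        return "H"    # loss is specific to the human lineage
    # Assumption: any other pattern (e.g. rhesus only) is left undetermined.
    return "TBD"


if __name__ == "__main__":
    print(classify_lineage(set()))
    print(classify_lineage({"chimp"}))
    print(classify_lineage({"chimp", "rhesus"}))
```

As the Htr5b case in the Discussion shows, a rule of this kind only detects identical mutations; an independent loss in chimp caused by different mutations would still be labelled H.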

Table 1: Primate Gene Losses Categorized by Lineage where Loss Occurred and Quality of Alignment.

| ID | Mouse-Annotated Function | Location (hg18) | Lineage (a) | Quality |
|---|---|---|---|---|
| Ctf2 | cardiotrophin 2 | chr7:127510107-127516764 | H | Good |
| Htr5b | serotonin receptor 5B | chr1:123337303-123356011 | H | Good |
| Gpr33 | G protein-coupled receptor 33 | chr12:52944623-52949683 | H | Good |
| Taar4 | trace amine-associated receptor 4 | chr10:23649909-23650953 | H | Good |
| Krt1-17 | keratin complex 1, acidic, gene 17 | chr11:100072308-100077079 | H | Good |
| Sord | sorbitol dehydrogenase 1 | chr2:121926279-121956778 | H | Good |
| Frg1 | FSHD region gene 1 | chr8:42896281-42915891 | HC | Good |
| Gsta4 | glutathione S-transferase, alpha 4 | chr9:77977715-77995031 | HC | Good |
| Cyp21a1 | cytochrome P450 21a, | chr17:34409836-34412455 | HC | Good |
| Uox | urate oxidase | chr3:146534534-146568869 | HC | Good |
| Slc7a15 | aromatic-preferring amino acid transporter | chr12:8554490-8625074 | HC | Good |
| Cetn4 | centrin 4 | chr3:37500176-37503995 | HC | Good |
| Crygf | crystallin, gamma F | chr1:65860797-65862580 | HC | Good |
| Sult1d1 | sulfotransferase family 1D, member 1 | chr5:88629211-88643520 | HC (x2) | Good |
| Gimap5 | GTPase, IMAP family member 5 | chr6:48675782-48683784 | HC | Good |
| Nr1h5 | nuclear receptor 1H5 | chr3:103068719-103093194 | HCR | Good |
| Csnd | casein delta | chr5:88882715-88898978 | HCR | Good |
| Gulo | gulonolactone (L-) oxidase | chr14:64940896-64963317 | TBD | Decent |
| S100a15 | S100 calcium binding protein A15 | chr3:90740228-90744057 | TBD | Decent |
| Tsx | testis specific X-linked gene | chrX:99617200-99627299 | TBD | Decent |
| 2210010C04Rik | trypsinogen 7 | chr6:40959874-40965116 | TBD | Decent |
| Twist2 | twist homolog 2 | chr1:93631882-93678433 | TBD | Decent |
| Krtap14 | keratin associated protein 14 | chr16:88714149-88714989 | TBD | Decent |
| Ngp | neutrophilic granule protein | chr9:110265015-110268177 | TBD | Decent |
| Cryge | crystallin, gamma E | chr1:64982834-64985290 | TBD | Mediocre |
| + 8 Bad Quality | | | | |
| 5330437I02Rik | hyp. protein – acyltransferase 3 (b) | chr18:65823636-65876906 | H | Good |
| 2610318N02Rik | hypothetical protein LOC70458 | chr16:17026961-17038669 | HC | Good |
| E030002O03Rik | hypothetical protein LOC244180 | chr7:104026834-104038637 | HC | Good |
| LOC434214 | hypothetical protein LOC434214 | chr7:98346724-98352670 | HC | Good |
| 0610012H03Rik | hypothetical protein LOC74088 | chr2:105025182-105180630 | HCR | Good |
| 5730521E12Rik | hypothetical protein LOC66650 | chr10:52080022-52093019 | TBD | Good |
| 4933429E10Rik | hypothetical protein LOC380701 | chr11:61117825-61159055 | TBD | Decent |
| 4933422H20Rik | hypothetical protein LOC432613 | chr11:115256634-115264358 | TBD | Decent |
| 8030411F24Rik | RIKEN cDNA 8030411F24 | chr2:148473449-148477377 | TBD | Decent |
| 1700052K11Rik | hypothetical protein LOC73431 | chr11:104995913-104997523 | TBD | Decent |
| 1700016D06Rik | hypothetical protein LOC76413 | chr8:11654895-11678722 | TBD | Decent |
| 1700037C18Rik | hypothetical protein LOC73261 | chr16:3821028-3823886 | TBD | Decent |
| Gm266 | hypothetical protein LOC212539 | chr12:111932418-111933604 | TBD | Mediocre |
| 9330180L21Rik | hypothetical protein LOC77268 | chr14:29650628-29651712 | TBD | Mediocre |
| 1700013G24Rik | hypothetical protein LOC69380 | chr4:136725372-136727537 | TBD | Mediocre |
| LOC546166 | hypothetical protein LOC546166 | chr9:119982160-119985027 | TBD | Mediocre |
| B930011P16Rik | RIKEN cDNA B930011P16 | chr5:27380179-27382917 | TBD | Mediocre |
| + 12 Bad Quality | | | | |
| + 16 Olf. Receptors | | | | |

(a) H = Human-specific loss, HC = Human and Chimp identical losses, HCR = Human, Chimp, and Rhesus identical losses. The HC (x2) indicates that two gene loss events were identical in both Human and Chimp.
(b) Annotation of this hypothetical gene is provided by Pfam [7].

4 Discussion

The success of this automated method for discovering human gene losses is largely dependent on its ability to find verified gene losses. A literature search of the six human-specific, annotated genes in Table 1 found that all six have been researched before. The five genes Ctf2, Htr5b, Taar4, Gpr33, and Krt1-17 are all verified human gene losses [4, 9, 14, 16, 22]. The only remaining human-specific, annotated gene loss is that of Sord, which actually has a functioning copy in human [2]. However, this does not indicate a failure of the method, as there are two copies of this gene, Sord1 and Sord2, and the inactivated version, Sord2, was the gene caught by this method. In addition, two verified gene losses known prior to implementing this method, Uox and Gulo, are also present in the list.

Interestingly, new information was gained about when the gene losses of Uox and Htr5b likely occurred. Both of these genes were known to have been lost at some point during the evolution of primates, but it was unclear in which lineage(s) they may still be active. It now appears that the inactivating mutation of Uox occurred prior to the human and chimp split, but after the human/chimp split from rhesus. The story of Htr5b is slightly more complicated. Another graduate student at UCSC, Tim Drezner, independently found the gene loss of Htr5b using a radically different method than the one presented here. However, in his analysis, Tim discovered that not only was Htr5b lost in human, but it was also lost independently in chimp. Since our method only searched the other primate genomes for the specific mutations that occurred in human, we missed the fact that it was also lost in chimp by a different set of mutations. The fact that Htr5b is a serotonin receptor that was inactivated in two of the most intelligent species on the planet makes this an intriguing gene loss event.
Thus far, one human-specific gene loss categorized as a hypothetical protein, 5330437I02Rik, appears to be a new discovery. According to Pfam, this hypothetical protein has an acyltransferase 3 structural domain, but its specific function in mouse is currently unknown. This particular protein was sequenced from the pituitary gland of an adult male mouse and is predicted to have an endopeptidase inhibitor activity. The fact that this loss is specific to humans and that the gene is preferentially expressed in the mammalian pituitary gland (the “master gland” of the endocrine system) might make this loss very interesting within the context of human development.

Unfortunately, the gene loss candidates shown in Table 1 are by no means a comprehensive list of all human gene losses. For instance, the aforementioned gene Myh16 was not discovered using this method. The reason for this is that the RefSeq database does not contain a Myh16 gene for mouse. Since the mouse RefSeq mRNAs seed the search for gene loss candidates, any genes not included in this database have zero chance of being discovered as human gene losses using this method. Also, the gene CASPASE12 was not discovered because its loss is actually polymorphic in the population, and the latest human assembly used in this analysis (hg18) has the functional copy of CASPASE12. Thus, the automated method correctly identified that CASPASE12 was not inactivated in the human assembly.

A group at the University of Michigan, Ann Arbor, recently performed a similar study in which they found 67 human-specific gene losses, of which 36 were olfactory receptors [22]. They worked from a database of human pseudogenes (i.e. segments of DNA that are similar to genes, but have lost their functional activity) and found those that have a still-functional ortholog in chimp [21]. Comparing the gene losses found by that research and those given in Table 1, only two gene losses, Krt1-17 and Gpr33, were discovered using both methods†. This is a surprising result and suggests that, while the other method found a significantly larger number of human-specific gene losses, neither method in its present state is able to form the complete set.

† The University of Michigan group’s automated method did not actually discover either of these genes automatically, but rather found both genes via a literature search and then showed that they satisfied their criteria for human-specific pseudogenes. The count of two gene losses found by both methods does not include an analysis of the olfactory receptors.

Despite its shortcomings, the automated method presented here appears to be highly capable of discovering gene losses in primates. Although the full analysis of the gene loss predictions has yet to be completed, there has yet to be a case where this method has erroneously predicted a gene loss. Those few cases where an actual functional gene in human was predicted to be lost are easily explained by the fact that TransMap had mapped to the non-functional copy of that gene. This is clearly not a failure of the automated method, since it correctly predicted that the non-functional copy is indeed non-functional, but instead suggests that the subsequent filtering process needs improvement.

Lastly, it should be noted that this automated method can be generalized immediately to discover gene losses in any organism. The only requirements for such an analysis to work with this method are genomes of a target species, a distantly related query species with a reliable gene set, and a reconstruction of their common ancestor. If these genomes are of sufficiently high quality, the automated method should work as well as observed in the research discussed here.

5 Future Work

While this method has successfully discovered verifiable human gene losses, there are ways it can be improved and extended. Firstly, coming up with a way to automate the quality assessment of the TransMap mappings would remove the most time-consuming part of the method presented here. This process would begin by using the current quality assessments to improve the heuristics that TransMap employs to fix the predicted gene structure. With improved heuristics, it is hoped that fewer “bad” quality mappings would reach the final stage of the analysis. Then, the gene validation process could also be improved using the knowledge gained with the manual inspection of the mappings to better determine when the human and ancestral gene predictions are truly “good” or “bad” gene predictions. Overall, these improvements would lead to better discrimination of the gene loss predictions, ideally producing fewer false-positives as well as a larger yield of potentially good gene loss candidates to investigate. As in the case of Myh16, it is likely that some human gene losses were missed simply because these genes do not exist in mouse, the query species used in this analysis. Since the query species’ genes are the basis for all human gene losses, a simple extension to this method would be to expand the RefSeq mRNAs used in the analysis to include those from all available mammalian genomes. As before, the common ancestor of human and each query species would be used to verify the loss in the human lineage. With a larger database of basis mRNAs, the list of gene loss candidates will almost certainly grow, ideally including genes like Myh16 that were missed by this initial analysis. To further expand the list to include those gene losses that are not completely fixed in the population, the SNP datasets should be incorporated in the automated method. 
Starting with the initial TransMap gene predictions, a search for SNPs in the human-specific gene predictions that potentially introduce a nonsense mutation in the reading frame (e.g. CGA→TGA) or introduce a frame shift via a single insertion or deletion will be performed. If a normally “good” human gene prediction has one of these types of SNPs, such as CASPASE12, the SNP will be introduced in its human genome sequence and the automated method will be re-run on the SNP-altered sequence. If the validation step then determines that this “good” human gene has become a “bad” human gene, a polymorphic human gene loss will likely have been found. The same analysis can be performed in reverse, whereby human genes initially classified as gene losses are checked for any SNPs that remove the cause of the gene inactivation, indicating that the gene loss, initially assumed to be
fixed in the population, is actually polymorphic. Both of these analyses can provide a candidate gene list for any future genome re-sequencing studies, providing data that can determine if any of these gene losses have undergone a recent selective sweep in the population, strong evidence that the gene loss provides some sort of evolutionary advantage to humans. Jing Zhu had the idea to alter the current method to find not total gene loss, but a loss of a particular alternatively-spliced transcript, after hearing a talk given by Julie Ni of the Ares Lab in UCSC. The Ares Lab is investigating the possibility that the introduction of nonsense mutations in a single exon is being used as a regulatory mechanism to remove specific alternatively-spliced transcripts through nonsense mediated decay (NMD). If the exon containing the nonsense mutation were skipped via alternative splicing, a functional gene could still be transcribed, assuming the other exons contained no other nonsense mutations. In order to find these types of gene losses, we would first have to restrict the list to include only those gene losses caused by an in-frame stop codon in a single exon. The human RefSeq and mRNA filters would then be relaxed so that they find any gene loss prediction where all but the offending exon is present in any mRNA evidence. The alternatively-spliced gene loss candidates found through this method could be given to the Ares lab so that they may collect evidence that the transcripts including the nonsense-mutated exons are associated with NMD. Also, the relative selective pressure between the intact exons and the problem exon could be computed to see if mutations are accumulating at a faster rate in the problem exon. This would suggest that the good alternatively-spliced transcript is being maintained through evolution. The long term goal of this project would be to determine as precisely as possible when gene losses occurred throughout the evolution of mammals. 
This would require reliable ancestral genomes for every node of the mammalian evolutionary tree. To do this properly, the anthropocentric aspect of this project will have to be removed. Gene losses would have to be predicted for all available mammalian genomes, as many ancestral nodes are not linked to human. By doing this, however, we will not only be able to determine where in a lineage a gene loss occurs, but will also automatically find those interesting cases where independent gene losses occur in two (or more) species, such as the independent loss of Htr5b in chimp and human.

Finally, the tool HTMaLigner will be released for general use. This tool was extraordinarily helpful for quickly determining the problem(s) in a particular genomic alignment, the purpose for which it was developed. Currently, the tool functions well, but it can be made more user-friendly and expanded to handle more than a three-species multiple alignment. In addition, the code will need to be cleaned up and documented so that it is left in a state that allows for future maintenance.
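The SNP screen proposed at the start of this section can be illustrated with a minimal sketch. It is not the TransMap-based pipeline, which would re-run the full automated method on the SNP-altered sequence. It only checks whether a single variant introduces or removes an in-frame stop codon, or shifts the reading frame, in a coding sequence. The function names, the 0-based coordinate convention, and the ref/alt representation of a variant are all assumptions made for illustration.

```python
# Hypothetical helper names; a simplified stand-in for the validation step.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds):
    """Return the 0-based index of the first in-frame stop codon that
    occurs before the final codon of `cds`, or None if there is none."""
    n_codons = len(cds) // 3
    for i in range(n_codons - 1):  # the final codon is the expected stop
        if cds[3 * i:3 * i + 3] in STOP_CODONS:
            return i
    return None


def classify_variant(cds, pos, ref, alt):
    """Classify a variant that replaces `ref` with `alt` at 0-based
    position `pos` of the coding sequence `cds`.

    Returns "frameshift", "nonsense", "stop_removed", or "other".
    """
    if cds[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    # An insertion or deletion whose length is not a multiple of three
    # shifts the reading frame downstream of the variant.
    if (len(alt) - len(ref)) % 3 != 0:
        return "frameshift"
    altered = cds[:pos] + alt + cds[pos + len(ref):]
    before = first_premature_stop(cds)
    after = first_premature_stop(altered)
    if before is None and after is not None:
        return "nonsense"      # a "good" gene becomes a "bad" gene
    if before is not None and after is None:
        return "stop_removed"  # reverse case: the SNP repairs the gene
    return "other"
```

For example, `classify_variant("ATGCGAGGCTAA", 3, "C", "T")` models the CGA→TGA change mentioned above, and the reverse call on the altered sequence models a SNP that removes the cause of an apparent gene loss.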

References

[1] M. Blanchette, W.J. Kent, C. Riemer, L. Elnitski, A.F.A. Smit, K.M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E.D. Green, D. Haussler, and W. Miller. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Research, 14(4):708–715, 2004.
[2] I.M. Carr, A. Whitehouse, P.L. Coletta, and A.F. Markham. Structural and evolutionary characterization of the human sorbitol dehydrogenase gene duplication. Mammalian Genome, 9:1042–1048, 1998.
[3] F. Chiaromonte, V.B. Yap, and W. Miller. Scoring pairwise genomic sequence alignments. Pacific Symposium on Biocomputing, pages 115–126, 2002.


[4] D. Derouet, F. Rousseau, F. Alfonsi, J. Froger, J. Hermann, F. Barbier, D. Perret, C. Diveu, C. Guillet, L. Preisser, A. Dumont, M. Barbado, A. Morel, O. deLapeyriere, H. Gascan, and S. Chevalier. Neuropoietin, a new IL-6-related cytokine signaling through the ciliary neurotrophic factor receptor. PNAS, 101(14):4827–4832, 2004.
[5] M. Diekhans and D. Haussler. Analysis of the evolution of gene structure using cross-species cDNA alignment mapping. 2006. In preparation.
[6] J. Felsenstein and G.A. Churchill. A hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution, 13:93–104, 1996.
[7] R.D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer, and A. Bateman. Pfam: clans, web tools and services. Nucleic Acids Research, 34:D247–D251, 2006.
[8] Y. Gilad, O. Man, S. Paabo, and D. Lancet. Human specific loss of olfactory receptors. PNAS, 100(6):3324–3327, March 2003.
[9] R. Grailhe, G. Grabtree, and R. Hen. Human 5-HT(5) receptors: the 5-HT(5A) receptor is functional but the 5-HT(5B) receptor was lost during mammalian evolution. European Journal of Pharmacology, 418:157–167, 2001.
[10] D. Karolchik, R. Baertsch, M. Diekhans, T.S. Furey, A. Hinrichs, Y.T. Lu, K.M. Roskin, M. Schwartz, C.W. Sugnet, D.J. Thomas, R.J. Weber, D. Haussler, and W.J. Kent. The UCSC genome browser database. Nucleic Acids Research, 31(1):51–54, 2003.
[11] W.J. Kent. BLAT - the BLAST-like alignment tool. Genome Research, 12(4):656–664, 2002.
[12] W.J. Kent, R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler. Evolution’s cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proceedings of the National Academy of Sciences, 100(20):11484–11489, 2003.
[13] W.J. Kent, C.W. Sugnet, T.S. Furey, K.M. Roskin, T.H. Pringle, A.M. Zahler, and D. Haussler. The human genome browser at UCSC. Genome Research, 12(6):996–1006, 2002.
[14] L. Lindemann, M. Ebeling, N.A. Kratochwil, J.R. Bunzow, D.K. Grandy, and M.C. Hoener. Trace amine-associated receptors form structurally and functionally distinct subfamilies of novel G protein-coupled receptors. Genomics, pages 372–385, 2005.
[15] M. Nishikimi, R. Fukuyama, S. Minoshima, N. Shimizu, and K. Yagi. Cloning and chromosomal mapping of the human nonfunctional gene for L-gulono-γ-lactone oxidase, the enzyme for L-ascorbic acid biosynthesis missing in man. The Journal of Biological Chemistry, 269(18):13685–13688, 1994.
[16] H. Rompler, A. Schulz, C. Pitra, G. Coop, M. Przeworski, S. Paabo, and T. Schoneberg. The rise and fall of the chemoattractant receptor GPR33. The Journal of Biological Chemistry, 280(35):31068–31075, 2005.
[17] S. Schwartz, W.J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison, D. Haussler, and W. Miller. Human-mouse alignments with BLASTZ. Genome Research, 13(1):103–107, 2003.


[18] A. Siepel, G. Bejerano, J.S. Pedersen, A. Hinrichs, M. Hou, K. Rosenbloom, H. Clawson, J. Spieth, L.W. Hillier, S. Richards, G.M. Weinstock, R.K. Wilson, R.A. Gibbs, W.J. Kent, W. Miller, and D. Haussler. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15:1034–1050, 2005.
[19] A. Siepel and D. Haussler. Phylogenetic hidden Markov models, pages 325–351. Springer, New York, NY, 2005.
[20] H.H. Stedman, B.W. Kozyak, A. Nelson, D.M. Thesier, L.T. Su, et al. Myosin gene mutation correlates with anatomical changes in the human lineage. Nature, 428:415–418, 2004.
[21] D. Torrents, M. Suyama, E. Zdobnov, and P. Bork. A genome-wide survey of human pseudogenes. Genome Research, 13:2559–2567, 2003.
[22] X. Wang, W.E. Grus, and J. Zhang. Gene losses during human origins. PLoS Biology, 4(3):e52, 2006.
[23] X. Wu, C.C. Lee, D.M. Muzny, and C.T. Caskey. Urate oxidase: Primary structure and evolutionary implications. Proceedings of the National Academy of Sciences, 86:9412–9416, December 1989.
[24] Z. Yang. A space-time process model for the evolution of DNA sequences. Genetics, 139:993–1005, 1995.
