A Locus-Based Paradigm for Generating Systems

A Locus-Based Paradigm for Generating Systems Biological Inferences from Large Scale Functional Genomics Datasets by Ajish Dominic George

A Dissertation Submitted to the University at Albany, State University of New York in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

School of Public Health Department of Biomedical Sciences 2009

Acknowledgements

“There is a theory which states that if anybody ever discovers exactly what the Universe is for and why it is here, it will instantly disappear and be replaced by something even more bizarre and inexplicable. There is another theory which states that this has already happened.” – Douglas Adams, Hitchhiker's Guide to the Galaxy

This thesis is dedicated to my family, friends, colleagues, and mentors who kept me alive (barely), sane (questionably), and smiling (often ruefully) throughout this ordeal. I would like to thank Scott for believing in me and trusting that my coat-tails would be long enough to ride in on. I would like to thank Julio for showing me how careful, meticulous, science is done and Tom for demonstrating how to be professionally skeptical. I would like to thank Dave for teaching me how to make analysis into science and Randy for always coming up with the most insightful questions at our meetings.

I need to thank Sridar for reminding me to stay honest and passionate in my work and Dave and Marcy at the microarray core for the camaraderie that I would not have survived without. I want to thank Frank and Chris for being awesomely fun, incredibly sharp and helping realize this work, Tim, Denis and other lab members both in Albany and New York for all the good times and bad ideas, Wen and Anna Lisa for playing at angel and demon, and of course Tiffany for all of the drugs. From some to we, Los Gatos!

This last bit is for my family – my mother who can finally stop worrying if I'll ever get done, to my father who just lost the bet on whether I'd finish before I went mad, my brother who's just now placing his own bets, and my sister who knows just how cooked the odds were to start with – I love you all!

The smiling face that hides behind the curtains of reality keeps taunting me to steal a glance and so I carry on.


Abstract

Genomics data is growing at an exponential rate. The ability to integrate new results with existing knowledge about genomic biology is rapidly becoming the limiting factor, as there is no universal language with which to describe genomic functional elements. To integrate and compare new and existing genomic data, we define our basic functional unit of a genome to be a locus -- a set of positional coordinates along any genome with an arbitrary amount of functional annotations attached. The locus concept enables addressing genomic elements and annotations at any level of granularity from entire swaths of chromosomes to single base-positions. We define a locus-based framework to compare a given set of genomic elements to any existing genomic annotations. We use this to build a tool to find genomic annotations significantly and frequently overlapping with a set. We also use this to build a tool to infer functional interactions from locus intersections and show how the inference of regulatory interactions from genomics data and the analysis of the topological properties of genomic networks can provide useful biological insights. We demonstrate the importance of the locus-based paradigm in the emergent field of ribonomic profiling. We show how locus-based comparisons reveal novel overlaps between the binding sites of microRNAs and the HSL-mRBP and explore other potential coordinate regulatory sites. We use locus-based intersections to compare RIP-Chip derived targets of the HuR mRNA binding protein across cell-lines and platforms to each other, to known HuR targets in the literature, and to AREs thought to be associated with HuR targets. We reveal a startling lack of overlap between targets from matched samples profiled on different platforms and across the target sets in general. This shows the need for further refinement of RIP-Chip technology and highlights the efficacy of the locus-based approach. Finally, we profile well-characterized sets of RBP-binding sites against sets of regulatory, functional, and disease annotations, and find that we can not only recapitulate the known functional properties of these sites but also find relatively unknown aspects of these sets that are supported by literature.


Table of Contents

Acknowledgements
Abstract
Index of Tables
Figure Index
List of Terms & Abbreviations
1. Background
    1.1 Summary
    1.2 Genomic Biology and the Evolution of Genomic Data
    1.3 High-Throughput Genomic Studies
    1.4 Warehouses of Functional Genomics Data
2. A Locus-Based Framework for Genomic Data Analysis
    2.1 Summary
    2.2 Motivation
    2.3 A Genomic Coordinate Space
    2.4 Operations on Genomic Loci
    2.5 Inference Mining with Genomic Locus Sets
3. Ribonomic Profiling of HuR Targets
    3.1 Summary
    3.2 Introduction
    3.3 Methods
    3.4 Results
    3.5 Discussion
4. Inferring Functional Properties of UTR Elements
    4.1 Summary
    4.2 Collections of Post-Transcriptional Regulatory Elements
    4.3 UTR Elements in UTRSite and UTRdb
    4.4 Functionally Annotating Sets of UTR Elements
    4.5 Histone-Stem Loop Associations
    4.6 Iron-Response Element Associations
    4.7 Selenocysteine Insertion Sequence Annotations
    4.8 Discussion
5. Summary & Future Prospects
6. Bibliography
7. Appendix I: Inferring Transcriptional Regulatory Networks from Gene Expression Data
    Abstract
    Introduction
    Materials and methods
    Results
    Discussion
    Acknowledgements
    References
    Figures
    Supplementary Figures
8. Appendix II: Estimating Significance of Graph-Theoretic Properties of Sub-Networks of Interactomes
    Abstract
    Description of Approach
    References
9. Appendix III: Informatic Resources for Studying RNA Motifs
    Summary
    Introduction
    Acknowledgments
    Table 1. Databases of RNA Structural Motifs
    Table 2. Search Tools for Known RNA Structural Motifs
    Table 3. Programs for Aligning Sequences of RNA Consensus Structures
    Table 4. Tools for Identifying Consensus Structures in Unaligned RNA Sequences
    Table 5. Benchmark Datasets for Consensus RNA Structure Prediction
    Bibliography
10. Appendix IV: MicroRNA Modulation of RNA-Binding Protein Regulatory Elements
    Introduction
    A New Role for MicroRNA
    A Hypothesis of MicroRNA Regulation of the Post-Transcriptional Regulatory Code
    Non-Coding RNA
    Bibliography


Index of Tables

Table 1: High-Confidence Reports of HuR Targets from Literature
Table 2: Summary of Present Probesets on Arrays used for HuR RIP-Chip
Table 3: Correlations Between HuR-IP and Total Samples Across Platforms
Table 4: Summary of Available HuR RIP-Chip Datasets
Table 5: Intersection Coefficients for HuR-IP Sets Across Studies
Table 6: Genes Associated with HuR Across All Studies
Table 7: Simulation of IP targets under different population assumptions & analysis methods
Table 8: Top annotations associated with histone stem loop instances
Table 9: Annotations associated with IRE elements
Table 10: Annotations associated with SECIS elements


Figure Index

Figure 1: An Overview of Functional Genomics Technologies
Figure 2: Varieties of Genomics Data
Figure 3: Overview of Genomic Regulation Databases
Figure 4: Locus-Based Genomics Datasets
Figure 5: A Generic, Flexible Framework for Defining Locus-Based Data
Figure 6: Locus Intersection Parameters
Figure 7: Effect of Bridging Parameters on Locus Set Intersections
Figure 8: Squishing Locus Sets
Figure 9: Flattening Locus Sets
Figure 10: Parsing annotations associated with a locus set
Figure 11: Annotation Term Mining with Locus Sets
Figure 12: Control Statistics for Annotations Intersecting with a Locus Set
Figure 13: Commodus Interface as a Filterable Radial Graph
Figure 14: Commodus Viewer Recursive Exploration of Annotation Combinations
Figure 15: Locus Intersections as Interactions
Figure 16: Building Regulatory Networks from Locus Intersections
Figure 17: Connexus Interface
Figure 18: The multiple aspects of gene-expression
Figure 19: Ribonomic Profiling using RIP-Chip
Figure 20: A Locus-Based Framework for Genomics Data Analysis
Figure 21: Clustering of replicate samples on the Affymetrix Human Exon 1.0ST platform
Figure 22: Clustering of replicate samples on the Agilent Human Whole Genome platform
Figure 23: Correlations between Affymetrix and Agilent HuR IP mRNA fold changes over total
Figure 24: Similarities between HuR target sets as determined by cross-intersection score
Figure 25: Associations by Platform of HuR Targets with HuR-Relevant Annotations
Figure 26: Partitioning of HuR-Relevant Annotations with HuR-Targets
Figure 27: Mapping Sets of Functionally Related UTR Structures


List of Terms & Abbreviations

ARE: AU-Rich Element (motif associated with mRNAs)
DABG: Detected Above Background (for Affymetrix GeneChips)
HSL: Histone 3'UTR Stem Loop (motif associated with mRNAs)
HuEx: abbreviation of Affymetrix Human Exon 1.0ST microarray platform
HuR: Hu Antigen R, also known as embryonic lethal abnormal vision 1 (ELAVL1)
Input: Total RNA after preparation for immunoprecipitation (in RIP-Chip)
IP: RNA immunoprecipitated using antibodies to the RBP of interest (in RIP-Chip)
IRE: Iron Response Element (motif associated with mRNAs)
Locus: Genomic coordinates -- minimally a chromosome with start and end positions
LocusSet: A set of genomic loci (see locus above)
PLIER: Probe Logarithmic Intensity Error Estimate (for Affymetrix GeneChips)
RBP: RNA Binding Protein
RIP-Chip: RNA immunoprecipitation coupled with microarray chip profiling
RMA: Robust Multiarray Average (for Affymetrix GeneChips)
SECIS: Selenocysteine Insertion Sequence (motif associated with mRNAs)
TFBS: Transcription Factor Binding Site
Total: A sample of total RNA (in RIP-Chip)
TTR: Translation and Turnover Regulating (a class of RNA binding proteins)
UTR: Untranslated Region (part of a messenger RNA)


1. Background

1.1 Summary

Genomics data is growing at an exponential rate. The ability to integrate new results with existing knowledge about genomic biology is rapidly becoming the limiting factor, as there is no universal language with which to describe genomic functional elements. To integrate and compare new and existing genomic data, we define our basic functional unit of a genome to be a locus -- a set of positional coordinates along any genome with an arbitrary amount of functional annotations attached. The locus concept enables addressing genomic elements and annotations at any level of granularity from entire swaths of chromosomes to single base-positions. We define a locus-based framework to compare a given set of genomic elements to any existing genomic annotations. We use this to build a tool to find genomic annotations significantly and frequently overlapping with a set. We also use this to build a tool to infer functional interactions from locus intersections and show how the inference of regulatory interactions from genomics data and the analysis of the topological properties of genomic networks can provide useful biological insights.
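As a rough illustration of the locus concept described above, the following Python sketch models a locus as a chromosome, a start, an end, and a free-form dictionary of annotations. It is not the implementation developed in this work: the class layout, the field names, the half-open coordinate convention, and the example coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Locus:
    """Hypothetical locus record: coordinates plus arbitrary annotations."""
    chrom: str
    start: int  # assumed convention: 0-based, inclusive
    end: int    # assumed convention: exclusive, so length == end - start
    annotations: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject intervals whose end precedes their start.
        if self.start > self.end:
            raise ValueError("start must not exceed end")

    def length(self) -> int:
        """Number of base positions covered by this locus."""
        return self.end - self.start

    def overlaps(self, other: "Locus") -> bool:
        """True when both loci share at least one base on the same chromosome."""
        return (
            self.chrom == other.chrom
            and self.start < other.end
            and other.start < self.end
        )


# Made-up coordinates: the same structure describes a large region
# and a single base position.
region = Locus("chr1", 0, 1_000_000, {"kind": "chromosomal region"})
snp = Locus("chr1", 5_000, 5_001, {"kind": "SNP"})
elsewhere = Locus("chr2", 5_000, 5_001)

print(region.overlaps(snp))     # True
print(snp.overlaps(elsewhere))  # False
print(snp.length())             # 1
```

Keeping the annotations in an open-ended dictionary is one simple way to let the same record type carry data at any level of granularity; a stricter schema would be an equally reasonable choice.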


1.2 Genomic Biology and the Evolution of Genomic Data

Cellular life can be described as an anti-chaotic, auto-catalytic system evolving at the molecular level (Kauffman 1993). Intricate metabolic pathways provide the animating force for cellular life, regulating everything from nutrient uptake to catabolism (Kauffman 1993). Environmental or developmental cues cascade through a variety of signal transduction pathways to induce coordinated responses that provide for survival and reproduction of the organism (Kauffman 1993). The entire set of instructions for producing the molecular biological system can be encoded by the genome of a single cell.

There are some 3 billion base-pairs in the haploid human genome, distributed across 23 pairs of chromosomes. Genomics is the study of their function. What does each of these bases do? How much of the genome is random, functionless junk, how much is evolutionary noise, and just how many varieties of competing and cooperating sequence organisms thrive in this complex biochemical dimension of life? How much of it codes for protein, and how much for ribozymes, small RNAs, and regulatory elements? What determines their interactions with each other, and with the environment? How do they produce cellular behaviour and coordinate the development of multi-cellular organisms?

Genomics technologies have been developed and used to sequence the genome, catalog genes, proteins, and other elements, and to generate live pictures of protein and RNA expression, modification, localization, and association (Knudsen 2004). Bioinformatics tools have been developed to model features such as promoters, domains, secondary structures and splice sites (Teufel et al. 2006). There are numerous databases of such predicted and catalogued features and of phylogeny, polymorphisms, pathways, functions and other associations (Birney 2003; Karolchik et al. 2003; Maglott et al. 2007).

The evolution of genomic data has been driven by the evolution of genetic and molecular biological techniques. After the revelation of DNA as the genetic material and the acceptance of the central dogma of the genetic code (Crick 1958), scientific endeavours in this field took two forms. Molecular biologists developed catalogs of expressed genes using molecular cloning techniques – libraries which formed the basis for the current databases of transcripts and proteins. Simultaneously, geneticists began to use linkage mapping techniques to derive relative distances between genes and to assign disease and other functional associations to chromosomal loci. When sequencing technologies emerged, many identifiers for particular gene products were merged as uniqueness was now measured by sequence identity. With the completion of the Human Genome Project (Waterston et al. 2002), it became possible to integrate all of the accumulated knowledge on genomic elements by tying them back to genomic positional coordinates.


1.3 High-Throughput Genomic Studies

Functional genomics today is characterized by large scale systematic studies of genomic elements and their behaviour within a particular biological context. High-throughput profiling technologies have been developed to assay the genome, transcriptome, and proteome (Figure 1) and they are commonly used in modern molecular biology. Studying the regulation and misregulation of genomic elements in particular disease conditions and treatment parameters has become an effective way to uncover the molecules and interactions driving a phenotype of interest, to better understand its etiology, and to design targeted interventions.

Genomic and transcriptomic elements can be assayed with DNA microarrays and with high-throughput sequencing methods. DNA microarrays use single stranded DNA oligonucleotide linked probes that are complementary to sequences of interest in the experimental sequence pool (Ramsay 1998). Older arrays spotted randomly fragmented amplicons from cDNA libraries while newer oligonucleotide arrays use sets of short synthesized probes complementary to different regions of the target sequence (Knudsen 2004). The probes are linked specifically to a gridded surface and subsequently hybridized to fluorescence or radioactivity linked cDNA amplicons from the experimental sequence pool (Knudsen 2004).

Sequencing is moving past laborious and low-throughput methods such as Sanger sequencing and pyrosequencing to massively parallelized high-throughput sequencing platforms (Harris et al. 2008). There are a variety of high-throughput sequencing platforms characterized by different amplification techniques and sequence reading strategies (Shendure & H. Ji 2008). Sequences generated from these new techniques range from 13 bp to 250 bp depending on the platform, and a variety of tools are used to map these sequences back to the genome or sequence pool of interest (Wang et al. 2009).

DNA microarrays and high-throughput sequencing are used for a variety of tasks. Genomic sequence variation is interrogated using Single Nucleotide Polymorphism (SNP) chips (Sellick et al. 2004) or sequencing, while methylation patterns, histone binding, and chromatin accessibility are assayed by combining various assays with promoter arrays (containing probe sets complementary to promoter regions), tiling arrays (Bertone et al. 2004) or sequencing. Transcriptional regulation is studied using chromatin immunoprecipitation of transcription-factor associated sequence combined with promoter arrays, tiling arrays, or sequencing. Post-transcriptionally, RNA expression is interrogated using feature specific arrays (such as gene, exon, or miRNA arrays), genome-wide tiling arrays or sequencing, while association of RNA with structural features, RNA-binding proteins or translational machinery is studied by coupling these platforms to various RNA isolation methods.

Functional genomics at the proteome level is most often studied using mass spectrometry (usually coupled with liquid chromatography) techniques that allow characterization of expressed peptides and peptide modifications in a sample pool (Coon et al. 2005). There is also a growing variety of techniques that can be used to study the nature, extent, and type of interactions between proteins and so draw out biological signalling and regulation networks.


Figure 1: An Overview of Functional Genomics Technologies. Genomic and transcriptomic features are studied using microarrays or sequencing in combination with chromatin and RNA isolation techniques. Protein expression is studied using mass spectrometry technologies while protein interactions are assayed using many different techniques.

[Figure 1 image not reproduced. Labels recovered from the figure: Polymorphisms; DNA Transcription; Epigenetics; chromatin isolation techniques; RNA Expression Regulation; RNA isolation techniques; microarrays; sequencing; Interactions; Protein Expression; protein separation; mass spectrometry; various methods.]

1.4 Warehouses of Functional Genomics Data

High-throughput genomics platforms and the numerous projects for sequencing and annotating genomic elements all contribute to the steady exponential rise in the amount of available genomics data and the number of genomics databases. Databases of genomic functional elements such as genes and proteins and their structures, transcribed sequences, non-coding RNAs, cis-regulatory elements, and polymorphisms have been growing as our knowledge of the different species of the molecular world expands. Databases of known functions, disease associations, interactions, orthology, and conservation are also being assembled. Projects are also underway to warehouse the results of functional genomics experiments and provide publicly accessible services for exploring and analyzing these sets. Large warehouses of these functional genomics databases are being assembled at large institutions under national and international auspices to better facilitate their consolidation, integration, and exploration.

Genomics data repositories can be broadly divided into four major groups (Figure 2). Catalogs of genomic elements such as genes, mRNAs and proteins constitute the first group. Databases cataloging the expression of these elements in various cells and under particular physicochemical conditions form the second, while repositories of regulatory interactions at the transcriptional, post-transcriptional, and post-translational levels make up a related third group. The last group consists of sets of phenotypes, diseases, and functions attributable to genomic elements, their expression, and their regulation or dysregulation.


Genomic elements exist at the DNA, RNA and protein level. DNA elements include genes (and the related concepts of promoters, exons, and introns), polymorphisms such as SNPs and alleles, and repeating elements such as LINES, SINES and microsatellites. There are countless databases of genes but the most widely used are RefSeq Gene (Sayers et al. 2009), UniGene (Sayers et al. 2009), and ENSEMBL Gene (Birney 2003). SNPs and alleles are catalougued by the International Haplotype Map (HapMap) project (Frazer et al. 2007), by the Online Mendelian Inheritance in Man Database (OMIM) (Hamosh et al. 2005), and by the database of Single Nucleotide Polymorphisms (dbSNP) (Sherry et al. 2001). Repeat regions in the known genome sequence have been catalogued by using tools such as RepeatMasker (Chen 2004) along with repeat signature repositories like the Repeat Database (RepBase) (Jurka et al. 2005). RNA elements include the mRNA transcripts of protein-coding genes as well as a variaty of non-coding RNAs including miRNAs, piRNAs, snoRNAs, and various ribozymes. The major repository of transcribed sequences is the GenBank database but curated sets of transcripts can also be found in RefSeq mRNA (Sayers et al. 2009), and ENSEMBL Transcript (Birney et al. 2004). These databases also house information about non-coding RNA sequences but more specialized resources for these exist at miRBase (Griffiths-Jones 2006) and Rfam (Griffiths-Jones et al. 2003). Protein information includes protein sequences, structures and domain-level data. The largest repositories of protein level information, besides the protein-level extensions of RefSeq and ENSEMBL, are UniProt (The UniProt Consortium 2008) and the Protein DataBank (PDB) (Berman et al. 2003). These include data not only on protein sequence 9

but also on known and predicted structures and domains within each protein. Expression databases include curated databases cataloging mRNA expression across a variety of selected tissues such as the GNF Atlas (Su et al. 2004), OncoMine (Rhodes et al. 2007) and TiGER database (Liu et al. 2008), as well as general repositories for gene expression based experimental data such as the Gene Expression Omnibus (GEO) (Barrett et al. 2007), the Stanford Microarray Database (SMD) (Demeter et al. 2007), and EBI's ArrayExpress (Parkinson et al. 2007). The latter type of database holds both raw and transformed expression information, standard information about each experiment to ensure reproducibility of results, and has the goal of making all published expression data publicly accessible. There are currently no well known databases of protein expression but projects currently under development such as the Yale Protein Expression Database (YPED) (Shifman et al. 2007)are looking to change this situation. Genomic regulation databases are of particular interest in functional genomics. These include sets of transcriptional, post-transcriptional, and post-transcriptional regulators (Figure 3). Transcriptional regulation database include databases of chromatin accessibility, usually compiled results from nuclease or chromatin immunoprecipitation screens, repositories of enhancers such as the Vista Enhancer database (Visel et al. 2007), the Enhancer Element Locator (EEL) (Palin et al. 2006), or disaggregate results from various enhancer screens, databases of epigenetically important sites such as the UCSC Genome Browser database of CpG islands (Karolchik et al. 2003), the Methylation Database (methDB) (Amoreira et al. 2003), or results from methylation screens. They also include databases of known and predicted transcription start sites such as Eponine 10

(Down & Hubbard 2002) and Database of Transciption Start Sites (DBTSS) (Wakaguri et al. 2008) along with the results of various 5' TSS screens, as well as databases of known and predicted transcription factor binding sites from TRANSFAC (Wingender et al. 1996) and JASPAR (Bryne et al. 2008) or individual chromatin immunoprecipitation (ChIP) experiments. Post-transcriptional regulation databases include collections of known and predicted splice junctions and polyadenlyation sites such as the Splice Database (spliceDB) (Burset et al. 2001), polyA_DB (Lee et al. 2007), and those inferred from mRNA sequences. They also include the collections of binding sites for RNA binding proteins housed at UTRdb (Mignone et al. 2005) and Rfam (Griffiths-Jones et al. 2003) and for small-RNAs such as miRNAs and piRNAs at PicTar (Krek et al. 2005), and TargetScanS (Lewis et al. 2005) along with corresponding binding sites inferred from RNA immunoprecipitation (RIP) studies. Post-translational regulation databases include sets of functional protein domains which include SMART (Letunic et al. 2009), Pfam (Sammut et al. 2008), and ProSite (Hulo et al. 2008); databases of known post-translational modifications like dbPTM (Lee et al. 2006), MODi (Kim et al. 2006), and Signal-3L (Hong-Bin Shen & Chou 2007) which holds information about post-translationally modified residues in proteins; databases with both biologically known and computationally predicted protein subcellular localization information such as the Signal Peptide Database (Spdb) (Choo et al. 2005), LOCATE (Sprenger et al. 2008), and DBSubLoc (Guo et al. 2004); and repositories of protein-protein interaction such as the Database of Interacting Proteins 11

(DIP) (Salwinski et al. 2004), Molecular Interactions Database (MINT) (Zanzoni et al. 2002), and Biomolecular Interaction Network Database (BIND) (Willis & Hogue 2006).

The databases of genomic elements, their regulation, and expression ultimately feed into producing observed phenotypes at the molecular and physiological levels. Molecular functions are catalogued in gene ontology databases such as the Gene Ontology (GO) Database (Harris et al. 2004) and the PANTHER Database (Mi et al. 2005). These databases also include information on cellular compartmentalization and biological processes or pathways. There are also databases dedicated to pathway information such as the public KEGG Pathway Database (Kanehisa & Goto 2000). In addition to being associated with pathways and molecular functions, genomic elements can also be associated with physiological phenotypes and diseases. These associations can include general associations of genes to diseases associated with their malfunction such as those found in the Online Mendelian Inheritance in Man (OMIM) (Hamosh et al. 2005) and the Genetic Association Database (GAD) (Becker et al. 2004). They can also include more specific connections between polymorphisms at the single nucleotide (SNP) or allele level and phenotypes. These are found in dbSNP and the Haplotype Map (HapMap) database, as well as in various quantitative trait locus (QTL) databases.

The most commonly used repositories of functional genomics information are maintained at NCBI, ENSEMBL and UCSC (Kersey & Apweiler 2006). NCBI provides the above mentioned RefSeq, GenBank, GEO, dbSNP, OMIM, and GAD databases, gene-centric descriptions, publications, and cross-references in the Entrez database,

along with other databases of homology, orthology, structures, and mapping information (Sayers et al. 2009). The ENSEMBL collection holds similar information about genes, proteins, polymorphisms, and phenotypes but also maintains databases of pseudogenes, RNA genes, gene predictions, and more comprehensive protein structure databases (Birney et al. 2004). The UCSC Genome Browser Database takes a more aggregate approach to holding functional genomic information. All data in the UCSC Genome Browser Database is mapped to positional coordinates in a specific assembly of the human genome (provided by NCBI). This allows divergent datasets to be visualized in a two-dimensional space and elements from multiple data sources such as those provided by ENSEMBL and NCBI can be visualized together and compared (Karolchik et al. 2003). The BioMart project is also notable in that it aims to integrate ENSEMBL data with functional annotations from sources like the Gene Ontology Database and KEGG Pathway Database, known and predicted domains such as those from INTERPRO, Prosite, and Pfam, and probe-identifiers from leading DNA array vendors (Smedley et al. 2009). It accomplishes this with a genomic coordinate system similar to that implemented by UCSC.


Figure 2: Varieties of Genomics Data Existing repositories of genomics data can be divided into a few basic families. Catalogs of genomic elements such as genes, mRNAs and proteins constitute the first group. Databases cataloging the expression of these elements in various cells and under particular physiochemical conditions exist alongside repositories of regulatory interactions at the transcriptional, post-transcriptional, and post-translational levels. There are also catalogs of phenotypes, diseases, and other functions attributable to genomic elements, their expression, and their regulation or dysregulation.


Genomic Elements
  DNA: Genes (promoters, exons, introns), Polymorphisms, Repeats
  RNA: mRNAs, ncRNAs, ribozymes, cis-regulatory sites
  Protein: sequence, structure, domains

Regulation
  Transcriptional: chromatin, methylation, TSSs, TFBSs
  Post-Transcriptional: splicing, poly(A), RBP-BSs, ncRNA-BSs
  Post-Translational: modification, localization, interactions

Expression
  RNA Expression: microarray, sequencing
  Protein Expression: mass spectrometry

Phenotype
  Function: Molecular Function, Biological Process/Pathway
  Disease: Quantitative Trait Loci, Genetic Associations


Figure 3: Overview of Genomic Regulation Databases Genomic regulation databases are of particular interest in functional genomics. These include sets of transcriptional, post-transcriptional, and post-translational regulators. Transcriptional regulation databases include databases of chromatin accessibility, enhancers, epigenetically important sites, transcription start sites, and transcription factor binding sites. Post-transcriptional regulation information includes splice junctions, polyadenylation sites, and binding sites for RNA binding proteins and small RNAs such as miRNAs and piRNAs. Post-translational regulation information includes functional protein domains, post-translational modifications, sub-cellular localization and protein-protein interactions. Typical examples of databases and screens used to catalogue each type of regulation are provided in the given table.


DNA -> (transcription) -> RNA -> (translation) -> Protein

Transcriptional (DNA)
  Chromatin Accessibility: Nuclease Screens, ChIP Screens
  Enhancers: Vista Enhancers, EEL, Enhancer-TRAP Screens
  Methylation: UCSC CpG Islands, MethDB, Methylation Screens
  Transcription Start Sites: Eponine, DBTSS, Transcription Screens
  Transcription Factor Binding Sites: TRANSFAC, JASPAR, ChIP Screens

Post-Transcriptional (RNA)
  Splice Junctions: spliceDB, Inferred from mRNA
  Polyadenylation Sites: polyA_DB, Inferred from mRNA
  RBP Binding Sites: UTRDB, Rfam, RIP Screens
  Small RNA Binding Sites: TargetScanS, PicTar, miRBase/miRanda

Post-Translational (Protein)
  Domains: SMART, Pfam, ProSite
  Post-translational Modification: dbPTM, MODi, Signal3L
  Localization: Spdb, LOCATE, DBSubLoc
  Interactions: BIND, MINT, DIP


2. A Locus-Based Framework for Genomic Data Analysis

2.1 Summary

In this section, we outline a method for genomic data comparison predicated on genomic positional coordinates. The method allows comparison of disparate data ranging from expression data to functional annotations and motifs using a locus-based framework. Here we define the concepts of a genomic locus and locus set. We define the possible comparisons between sets of genomic loci in terms of locus intersections and provide an algorithm for flexible and efficient Locus Intersection Analysis. We then go on to describe an algorithm, Interlocutor, for flexibly and systematically intersecting a query locus set against collections of functional annotation locus sets and summarizing the intersections of the query with functional annotation terms. We show how these annotation associations can be mined in terms of direct association analysis using the Commodus tool and how they can be recast as functional interaction networks using the Connexus tool.


2.2 Motivation

Genomics data is growing at an exponential rate. The ability to integrate new results with existing knowledge about genomic biology is rapidly becoming the limiting factor in developing new functional inferences. Massive amounts of data exist in various heterogeneous formats (Teufel et al. 2006). Each database has its own conventions for naming and organizing data, and so there are many names and ways to describe the same functional element. Currently every method of analysis needs to be separately reengineered for each new and existing platform and database (Teufel et al. 2006). A tool that uses ENSEMBL identifiers as its reference will not be compatible with RefSeq identifiers; either the tool must be re-engineered or cross-reference tables must be built to accommodate the synonymous annotations. Moreover, many of the cross-reference databases that exist are lossy and do not maintain the full level of granularity. For example, transcript level annotations in one database may be tied to gene level annotations in another database (Sherman et al. 2007). There is no efficient way to compare elements across databases in terms of uniqueness and granularity – no universal language with which to describe genomic functional elements.

2.3 A Genomic Coordinate Space

Some of the problems in coming up with a uniform system for addressing genomic elements can be solved by thoroughly decompressing the genetic information. The completion of the human genome and other sequencing projects has allowed us to take collections of functional elements and map them back to the positions in the

genome. All transformations of the DNA to RNA or protein can be extrapolated from the coordinates. More complex modifications such as splicing, insertions, deletions, or mutations can also be mapped back to the affected sequence fragments. We call this representation of a genomic functional element a locus and define it as a set of genomic positional coordinates with any number of annotations attached to it. Allowing many annotations for the same locus allows unification of many disparate nomenclatures that refer to some functional facet of the same element (i.e. gene implies mRNA and protein and all three correspond to the same set of genomic bases). Allowing recursive composability (i.e. a locus can be composed of any number of sub-loci, just as a gene is composed of its introns, exons, and UTRs) enables addressing genomic elements and annotations at any level of granularity from entire swaths of chromosomes to single base-positions.


2.3.1 A Galaxy of Locus-Based Resources

The locus concept and our mapping-based integration approach not only allow homogeneous addressing of a wide variety of genomic functional elements but also enable incorporation of more indirectly related information such as expression patterns, pathway and ontology information, disease associations, and even genetic polymorphisms into our genomic analysis context. The largest genomic data warehouses including UCSC, NCBI, and ENSEMBL keep and provide genomic annotations in a locus-conformable fashion (see Figure 4). Using identifiers from these tables, it becomes possible to link in annotations from databases with no directly attached locus information (such as GO (Harris et al. 2004), KEGG (Kanehisa & Goto 2000), and BIND (Bader et al. 2001)). Motif search algorithms (binding sites, protein domains, secondary structures, etc.) could also be run on locus-annotated sequences and their output transformed into a database of loci. This expands the set of minable genomic information to include any data that can be directly or indirectly tied to the genome.

The UCSC Genome Browser Database along with BioMart and BioMart-based genome databases provide the largest connected repositories of genomic functional annotations (Kersey & Apweiler 2006). Built as a back-end for the Genome Browser sequence feature visualization tool, the UCSC database provides a wide variety of genomic annotations including chromosomal features, known and predicted transcripts, sequence and structure motifs, conservation and orthology information, and even expression scores derived from microarray and sequencing experiments (Karolchik et al. 2003). All the data is warehoused in the context of a particular assembly of an organism's

genome and anchored to genomic positional coordinates. The BioMart project provides a similar interface to a similarly broad range of functional annotations based around the ENSEMBL and EBI data sets (Smedley et al. 2009). It distinguishes itself in its ability to aggregate and maintain data from existing annotation sources using the Distributed Annotation Service platform and so provides immediate updates to BioMart data when a source dataset is modified (Smedley et al. 2009).
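As a minimal sketch of the identifier-based linking described in this section, the following Python fragment attaches coordinate-free annotation terms to loci through a shared identifier. The identifiers, coordinates, and term strings are invented for illustration, and `attach_terms` is a hypothetical helper rather than part of any of the databases named above.

```python
from collections import defaultdict

# Hypothetical locus table keyed by gene identifier: (chromosome, start, end, strand).
gene_loci = {
    "GENE_A": ("chr1", 1000, 5000, "+"),
    "GENE_B": ("chr2", 200, 900, "-"),
}

# Hypothetical annotations that carry an identifier but no coordinates of their own.
term_table = [
    ("GENE_A", "ontology: transcription"),
    ("GENE_A", "pathway: cell cycle"),
    ("GENE_B", "ontology: nuclear"),
    ("GENE_C", "ontology: development"),  # no locus is known for this identifier
]

def attach_terms(loci, terms):
    """Return (locus -> sorted terms, identifiers that could not be placed)."""
    by_locus = defaultdict(set)
    unplaced = set()
    for identifier, term in terms:
        if identifier in loci:
            by_locus[loci[identifier]].add(term)
        else:
            unplaced.add(identifier)
    return {locus: sorted(found) for locus, found in by_locus.items()}, unplaced

annotated, unplaced = attach_terms(gene_loci, term_table)
```

Identifiers without a locus are reported rather than silently dropped, since such gaps are one of the ways an identifier-based link can lose information.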

2.3.2 Visualizing and Comparing Sets of Loci

The data analysis features provided by both UCSC and BioMart are limited and focused around relational database style filtering and joining of data-tables. The UCSC browser also provides rudimentary facilities for filtering one data set using the positional coordinates of another (Blankenberg et al. 2007). Recently, though, we have seen the genesis of the Galaxy platform, which provides a serious attempt at delineating and executing locus-based operations on sets of genomic elements (Giardine et al. 2005). The Galaxy software provides facilities for intersecting, merging, and subtracting two sets of genomic loci as well as tools for importing and converting data from various sources to be locus-conformable. While Galaxy provides a great resource for performing simple locus-based operations, we believe that the locus-based model for genomic data and its implied operations can be more flexibly defined.


Figure 4: Locus-Based Genomics Datasets A view of locus-based data from the UCSC Genome Browser. Chromosomes are displayed in linear fashion from left to right, with coordinate markers appearing across the top. Data sets (such as genes) are shown in the same manner, with each item appearing at its appropriate coordinates. Multiple data sets can be shown simultaneously by stacking them top to bottom. Figure courtesy of Chris Zaleski.


Figure 4 labels: database 'tracks' representing data (e.g. genes) placed at their genomic coordinates; examples of additional coordinate-based data such as conserved regions and SNPs; chromosomal coordinates, increasing from left to right; example of a single data record, representing a gene.


2.4 Operations on Genomic Loci

2.4.1 Definition of Locus

In our model, a locus is simply defined as a recursively composable set of genomic positional coordinates with an arbitrary number of annotations attached to it. Allowing many annotations for the same locus allows unification of many disparate nomenclatures that refer to some functional facet of the same element. Allowing recursive composability (i.e. a locus can be composed of any number of sub-loci, just as a gene is composed of its introns, exons, and UTRs) enables addressing genomic elements and annotations at any level of granularity from entire swaths of chromosomes to single base-positions. This concept of a genomic locus is illustrated in Figure 5.
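The definition above can be sketched as a small recursive data structure. This is an illustrative Python rendering under our own assumptions, not the implementation used in this work: the field names, the "." symbol for undefined strand, and the placeholder identifier `NM_EXAMPLE` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Locus:
    """Positional coordinates plus any number of annotations and child loci."""
    chrom: str
    start: int
    end: int
    strand: str = "."  # "+", "-", or "." when strandedness is undefined (assumed convention)
    annotations: Dict[str, str] = field(default_factory=dict)
    children: List["Locus"] = field(default_factory=list)

    def depth(self) -> int:
        """Number of levels in the parent-child hierarchy rooted at this locus."""
        return 1 + max((child.depth() for child in self.children), default=0)

# A gene locus composed of sub-loci; all coordinates and names are invented.
gene = Locus("chr1", 100, 900, "+", {"type": "gene", "refseq": "NM_EXAMPLE"}, [
    Locus("chr1", 100, 150, "+", {"type": "5'UTR"}),
    Locus("chr1", 150, 400, "+", {"type": "exon"}),
    Locus("chr1", 700, 900, "+", {"type": "3'UTR"}),
])
```

Because `children` holds further `Locus` objects, the same type addresses a whole gene, a single exon, or a single base position.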


Figure 5: A Generic, Flexible Framework for Defining Locus-Based Data Each coordinate-based item represented in the data is translated into a Locus object, defined by chromosome, start, end, and strand. Locus objects can contain other Locus objects, allowing a hierarchical nesting structure. For example, the RefSeq gene (shown in blue) is used to create a Locus object with 'children' for each of its component regions – UTRs, exons, etc. Locus objects can be nested so that children loci can contain their own children and the framework allows parent-child hierarchies of arbitrary depth. Additionally, each complete data set is translated into a LocusSet of Locus objects (roughly corresponding to a track at the UCSC Genome Browser Database), such as Human ESTs (shown in the center of the figure in black). Figure courtesy of Chris Zaleski.


Figure 5 labels:
  A Locus Set representing a sample of Human ESTs: LocusSet containing four Locus (EST) objects.
  A nested Locus structure, representing a Gene and some of its possible "child" Loci: Locus (Gene) containing Locus (5'UTR), three Locus (Exon) objects, and Locus (3'UTR).


2.4.2 Locus Intersections

Comparisons between genomic loci are formulated in terms of the nature and extent of their positional proximity or overlap. These comparisons are called locus intersections. For each intersection operation, a variable number of nucleotides can be defined for either the minimum required overlap or the maximum allowed gap between loci. This minimum overlap or maximum gap can be set as either a fixed number or a percentage. Locus intersections can also be parameterized in terms of strand-specificity, and intersections can be considered valid or not depending on whether the loci being compared share the same strand, have opposite strands, or have undefined strandedness (Figure 6). For the case of three or more loci being intersected, the option of allowing bridging intersections is provided. Where bridging is disallowed, all pairwise intersections between loci in the set are required for the set to be considered intersecting. Where bridging is allowed, the condition loosens and any chain of pairwise intersections that covers the entire set will mark it as intersecting. For example, assume loci A, B, and C. A & B do not intersect; however, if A & C do intersect and B & C do intersect, then C bridges A & B, and all three are considered to intersect (Figure 7).
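The bridging rule for three or more loci can be sketched as follows. This is an illustrative Python fragment, not the implementation described in this work: loci are reduced to hypothetical `(start, end)` tuples in half-open coordinates, and `overlaps` and `set_intersects` are names introduced only for this example.

```python
from itertools import combinations

def overlaps(a, b, min_overlap=1):
    """True if half-open intervals a and b share at least min_overlap bases."""
    return min(a[1], b[1]) - max(a[0], b[0]) >= min_overlap

def set_intersects(loci, bridging, test=overlaps):
    """Decide whether a group of loci counts as intersecting."""
    if not loci:
        return False
    if not bridging:
        # Without bridging, every pair of loci must intersect.
        return all(test(a, b) for a, b in combinations(loci, 2))
    # With bridging, a chain of pairwise intersections must reach every locus.
    reached, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in range(len(loci)):
            if j not in reached and test(loci[i], loci[j]):
                reached.add(j)
                frontier.append(j)
    return len(reached) == len(loci)

# A and B do not overlap, but C overlaps both and therefore bridges them.
A, B, C = (100, 200), (300, 400), (150, 350)
```

With these coordinates, `set_intersects([A, B, C], bridging=False)` is false because A and B share no bases, while `set_intersects([A, B, C], bridging=True)` is true because C links the two.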


Figure 6: Locus Intersection Parameters Intersection between query loci and target loci can be parametrized in terms of strand specificity and in terms of the type and extent of positional coincidence. Strand specificity: If NEUTRAL, any target locus could match. If MATCH_STRICT, only A, B, and E could match. If MATCH_PERMISSIVE, A, B, C, and E could match. If COMPLEMENT_STRICT or COMPLEMENT_PERMISSIVE, only D, or C and D, respectively, could match. Coincidence: This can be specified either as a FIXED distance value or a PERCENTAGE overlap value. If FIXED distance, positive values indicate the maximum base-pair distance allowed between two intersecting loci and negative values indicate the minimum number of overlapping bases. For instance, values above -15 would allow E to intersect, above -12 would allow C, D, and E to intersect, above -7 would allow B, C, D, and E to intersect, and values above 5 would allow A, B, C, D, and E to intersect. If PERCENTAGE overlap, two loci are said to intersect if they overlap by the given fraction of the smaller locus. For instance, the query overlaps with B by 7 bases and since the query is the smaller of the two, the fractional overlap is 0.47. The query overlaps with C by 13 bases and since C is smaller, the fractional overlap is 1.0.
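The two parameter families in this caption can be sketched in Python as below. The mode names are taken from the caption; everything else is an assumption made for illustration: loci are `(start, end)` tuples in half-open coordinates, "." marks an undefined strand, thresholds are treated as inclusive, and the coordinates are invented so that the distances match the labels for A, B, C, and E.

```python
def signed_distance(a, b):
    """Gap between two half-open intervals; negative values count overlapping bases."""
    return max(a[0], b[0]) - min(a[1], b[1])

def strand_ok(query_strand, target_strand, mode):
    """Apply one of the five strand-specificity modes named in the caption."""
    if mode == "NEUTRAL":
        return True
    if query_strand == "." or target_strand == ".":
        # Assumption: undefined strands pass only in the PERMISSIVE modes.
        return mode.endswith("_PERMISSIVE")
    same = query_strand == target_strand
    return same if mode.startswith("MATCH") else not same

def coincides(query, target, kind, value):
    """FIXED: signed distance at most value. PERCENTAGE: overlap fraction of the smaller locus."""
    dist = signed_distance(query, target)
    if kind == "FIXED":
        return dist <= value
    smaller = min(query[1] - query[0], target[1] - target[0])
    return smaller > 0 and max(0, -dist) / smaller >= value

query = (100, 115)   # 15 bases long
A = (80, 95)         # 5-base gap before the query
B = (108, 140)       # overlaps the query by 7 bases
C = (101, 114)       # 13 bases, entirely inside the query
E = (90, 150)        # contains the whole query
```

Here `signed_distance(query, B)` is -7, so B passes a FIXED threshold of -7 but not -8, and its PERCENTAGE overlap is 7/15, or about 0.47.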


Figure 6 diagram: a query locus shown on a nucleotide sequence and compared against target loci A to E. Labels shown for the target loci:
  A: str: same, dist: 5, olap: 0.0
  B: str: same, dist: -7, olap: 0.47
  C: str: undef, dist: -13, olap: 1.0
  E: str: same, dist: -15, olap: 100%


Figure 7: Effect of Bridging Parameters on Locus Set Intersections Fig A represents three locus sets to be intersected. Figures B and C represent what the results would look like from two slightly different parameter sets – one with bridging and one without bridging. When bridging is not allowed, there must exist an intersection region (highlighted in yellow) which is overlapped by at least one locus from each of the locus sets. The intersection result consists of a union region whose children consist of only loci overlapping the intersection region from each of the sets and the intersection region itself. When bridging is allowed, an intersection region need not exist and a union region is created for any contiguous space where any two of the given locus sets are positive (have at least one locus existing). All loci contained within this chain of overlaps are returned as children of the union region. Figure courtesy of Chris Zaleski.
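The non-bridging case in this caption can be sketched as follows. This is a simplified Python illustration under our own assumptions rather than the algorithm used to produce the figure: loci are `(start, end)` tuples in half-open coordinates on one chromosome, the minimum overlap is 1 nucleotide, and the function names and coordinates are hypothetical.

```python
def merged(intervals):
    """Collapse one locus set into sorted, non-overlapping coverage blocks."""
    blocks = []
    for start, end in sorted(intervals):
        if blocks and start <= blocks[-1][1]:
            blocks[-1][1] = max(blocks[-1][1], end)
        else:
            blocks.append([start, end])
    return blocks

def common(a, b):
    """Regions covered by both of two sorted block lists."""
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            shared.append([start, end])
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared

def intersect_without_bridging(locus_sets):
    """Find regions covered by every set, then gather the loci that overlap them."""
    regions = merged(locus_sets[0])
    for locus_set in locus_sets[1:]:
        regions = common(regions, merged(locus_set))
    results = []
    for start, end in regions:
        children = [l for s in locus_sets for l in s if l[0] < end and l[1] > start]
        union = (min(c[0] for c in children), max(c[1] for c in children))
        results.append({"intersection": (start, end), "union": union, "children": children})
    return results

set_a = [(100, 200), (150, 260)]
set_b = [(180, 300)]
set_c = [(190, 220), (500, 600)]
result = intersect_without_bridging([set_a, set_b, set_c])
```

In this example only the span from 190 to 220 is covered by all three sets, so a single union region from 100 to 300 is returned, and the locus at 500 to 600 is left out because it overlaps no intersection region.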


Figure 7 panels (each panel plots Chr1 coordinates from 100 to 650 and shows loci from SET A, SET B, and SET C; union loci and intersection loci are marked in the result panels):
  A. Original Loci
  B. Result – Overlap of minimum 1 nucleotide, and bridging = FALSE
  C. Result – Overlap of minimum 1 nucleotide, and bridging = TRUE


2.4.3 Transformations of Locus-Sets

2.4.3.1 Squishing

LocusSet squishing means that all the Locus objects within a LocusSet are checked for overlap. Loci within the set which overlap (they share a common region of at least 1 nucleotide) are added to a 'parent' Locus. This parent Locus is given a specific type called a REGION, and acts as a container for the overlapping loci. This ensures that all the loci directly contained within the LocusSet are linear, but the original data is maintained via the parent/child hierarchy. See Figure 8.
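A minimal sketch of squishing, assuming loci are `(start, end, name)` tuples on a single chromosome in half-open coordinates. The `squish` function, its `wrap_singletons` option, and the dictionary layout of a REGION are hypothetical choices made for this example.

```python
def squish(loci, wrap_singletons=False):
    """Group overlapping loci under REGION parents; leave lone loci as they are by default."""
    groups = []
    for locus in sorted(loci):
        if groups and locus[0] < groups[-1]["end"]:
            # Shares at least 1 nucleotide with the current region, so join it.
            groups[-1]["children"].append(locus)
            groups[-1]["end"] = max(groups[-1]["end"], locus[1])
        else:
            groups.append({"type": "REGION", "start": locus[0],
                           "end": locus[1], "children": [locus]})
    return [g if len(g["children"]) > 1 or wrap_singletons else g["children"][0]
            for g in groups]

squished = squish([(100, 200, "A1"), (150, 250, "A2"), (400, 450, "A3")])
```

Here A1 and A2 overlap and are placed under one REGION spanning 100 to 250, while A3 stays a plain locus unless `wrap_singletons=True` is passed.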

2.4.3.2 Flattening

Sets of aggregate loci, where any locus might contain lower-level child loci, can be recursively flattened so that all loci and sub-loci in the original hierarchy become peers in one locus set (Figure 9). Flattening is especially useful for filtering procedures where parent, children, and sub-children loci need to be treated homogeneously. For example, a set of genes could be flattened to include component exons, and both gene-level and exon-level loci could be filtered for criteria such as conservation or G/C content.
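Flattening amounts to a recursive walk over the hierarchy. The sketch below assumes, purely for illustration, that each locus is a dictionary with a `type` and an optional `children` list; the `flatten` name and the example gene are hypothetical.

```python
def flatten(loci):
    """Return every locus and sub-locus as peers, parents before their children."""
    flat = []
    for locus in loci:
        flat.append(locus)
        flat.extend(flatten(locus.get("children", [])))
    return flat

# One gene locus with five nested component loci.
gene_set = [{
    "type": "Gene",
    "children": [
        {"type": "5'UTR"},
        {"type": "Exon1"},
        {"type": "Exon2"},
        {"type": "Exon3"},
        {"type": "Exon4"},
    ],
}]
flat_types = [locus["type"] for locus in flatten(gene_set)]
```

The single-locus set becomes a six-locus set in which the gene and its components can be filtered side by side.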

2.4.3.3 Filtering

A locus set may also be filtered on strings and numeric values associated with the member loci. The filters allow inclusion or exclusion of loci matching predicates of either a string regular expression or a numeric range (minimum and maximum) and may be applied to one of the predefined locus properties such as chromosome, strand, start, end, and type or to any of the other associated arbitrary annotations.
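A filter of this kind can be sketched as a single function. The example assumes loci are plain dictionaries whose keys hold both the predefined properties and arbitrary annotations; `filter_loci` and its parameter names are hypothetical.

```python
import re

def filter_loci(loci, key, pattern=None, minimum=None, maximum=None, exclude=False):
    """Keep (or, with exclude=True, drop) loci whose value for key matches the predicate."""
    kept = []
    for locus in loci:
        value = locus.get(key)
        if value is None:
            matched = False  # a locus lacking the property never matches
        elif pattern is not None:
            matched = re.search(pattern, str(value)) is not None
        else:
            matched = ((minimum is None or value >= minimum) and
                       (maximum is None or value <= maximum))
        if matched != exclude:
            kept.append(locus)
    return kept

loci = [
    {"chrom": "chr1", "start": 100, "end": 150, "type": "5'UTR"},
    {"chrom": "chr1", "start": 150, "end": 400, "type": "exon"},
    {"chrom": "chr2", "start": 900, "end": 990, "type": "3'UTR"},
]
utrs = filter_loci(loci, "type", pattern=r"UTR$")
early = filter_loci(loci, "start", minimum=0, maximum=200)
not_chr1 = filter_loci(loci, "chrom", pattern=r"^chr1$", exclude=True)
```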


Figure 8: Squishing Locus Sets Section A represents the original raw data – three LocusSets, shown in blue, red and green. LocusSets may contain loci which overlap with each other, for instance A1 & A2, B2 & B3, etc. Loci that overlap within each set are added to a 'region' parent locus. Example: locus A1 and locus A2 become Region A1-R. Each region maintains information about the loci which it contains, but gives the LocusSet a linear data structure which can be used by other algorithms. The user can choose whether all loci are added to a parent container (even if no overlaps are present), or if only overlapping loci are aggregated while leaving the unique individuals alone. Figure courtesy of Chris Zaleski.


A. Original LocusSets

B. “Squished” LocusSets


Figure 9: Flattening Locus Sets It is often appropriate to break apart nested locus structures, such as genes, into their component loci and treat each of the subcomponents as peers in a larger set. The flattening algorithm simply walks through each locus in a set, traversing any child loci in a recursive fashion, and adding each to the resulting flat set. For example, the composed gene locus above consisting of nested exon and UTR structures is flattened into six separate loci.


Figure 9 panels:
  Original LocusSet (contains 1 locus): Gene, with nested 5'UTR, Exon1, Exon2, Exon3, and Exon4
  Flattened LocusSet (contains 6 loci): 5'UTR, Exon1, Exon2, Exon3, Exon4, Gene

2.5 Inference Mining with Genomic Locus Sets

In a high-throughput genomics screen, one is most often left with a set of genomic elements that were coexpressed in or otherwise coregulated by the experimental factor (for example, some genes are induced and others are repressed by the factor). Often the first question asked by the researcher is this: what are the properties that this set of genomic elements share that could explain their biological relevance to this experiment? To answer this, a researcher is forced to naively interrogate their list against known sets of functionally annotated elements and query it using any variety of pattern mining tools. The problem with the naive approach is that, given limited time to analyze the large volume of data and given that only limited sets of pattern-mining tools and annotation databases are compatible with any given experimental readout, a researcher is only capable of querying the few compatible databases they are aware of and may thus miss many commonalities and patterns they could find by using the entire breadth of available genomics data.

2.5.1 Existing Tools for Annotating Sets of Genomic Elements

In a conventional bioinformatics workflow, the process of comparing a dataset to any collection of annotations would involve building tables of cross-references connecting identifiers in the query set to identifiers in every target annotation set. For instance, to compare targets from different gene-expression platforms, vendor-IDs for probesets on each platform would have to be mapped to more universal identifiers such as RefSeq (Pruitt et al. 2007), ENSEMBL (Birney 2003), or Entrez Gene (Maglott et al. 2007). The mappings provided by each vendor would have to be checked for obsolete identifiers and unmapped entries. For each pair of platforms to be compared, the universal identifier simultaneously maximizing coverage for both platforms would have to be selected. This problem grows exponentially with each functional annotation that needs to be queried. In the worst case, a different intermediate identifier would have to be provided to map each platform to each functional annotation resource – e.g. UniProt (The UniProt Consortium 2008) to GO (Harris et al. 2004), HGNC Symbols (Bruford et al. 2008) to transcription factor binding sites, RefSeq to microRNA targets, cytoband locus to OMIM (Hamosh et al. 2005), etc. The task becomes one of managing a potentially intractable and often incomplete set of links between arbitrary collections of functional annotations.

There are many existing tools concerned with annotating genes and with maintaining annotations. They include annotation frameworks like Gaggle (Shannon et al. 2006) and Atlas (Shah et al. 2005) and integrated cross-database lookup tools such as Resourcerer (Tsai et al. 2001) and GeneNotes (Hong & Wong 2005). In comparison, there are few integrative tools for finding annotation terms enriched in a gene-set. The two most comprehensive (in terms of gene identifiers supported and annotation databases integrated) efforts at providing this are found in the Database for Annotation, Visualization and Integrated Discovery (DAVID) (Sherman et al. 2007) and at Babelomics with FatiGO (Al-Shahrour et al. 2007) (and to a more limited extent Marmite (Al-Shahrour et al. 2006)).

The base layer of both DAVID and Babelomics comprises a set of known cross-database mappings of gene identifiers. These mappings encompass gene-based identifiers from the main gene/transcript repositories such as GenBank and ENSEMBL as well as mappings into widely used expression profiling platforms such as Affymetrix arrays. DAVID aggregates available cross-references between databases into DAVID “Genes” using a graph-theoretic single-linkage approach. Any cross-reference indicating that two identifiers in different databases are the same entity – be it at the gene, transcript, mRNA, or protein level – will cause the identifiers to be merged into the same DAVID ID cluster (Sherman et al. 2007). Babelomics uses a more limited version of the same approach based on cross-references compiled in the ENSEMBL database. These sets of gene identifiers are used to pull associations to a wide range of annotations including Gene Ontology and pathway terms, protein-protein interactions, known or predicted protein/sequence features, and mined literature associations (usually associating a gene to a paper).

Both tools provide facilities for entering a list of gene identifiers to mine and selecting an appropriate background gene list. Both also provide enrichment statistics for each annotation term, estimated by bootstrap selection of gene-lists from the background set in DAVID (Huang et al. 2007) and by Fisher’s Exact Test comparison to the background set in FatiGO (Al-Shahrour et al. 2007). DAVID provides a functional annotation clustering tool that groups together enriched annotations based on similarities in their gene-membership. FatiGO can feed into FatiScan, which allows measurement of term associations across ranking-induced partitions of the gene-list (Al-Shahrour et al. 2006).

Both tools are limited in that users must either use or map into one of the known identifiers to query the database. They are also limited in their inability to deal with both annotations and identifiers in the non-genic space. Chromosomal features, non-coding genes, and other extra-genic annotation data must necessarily be discarded. Another, more subtle issue is the inability of these tools to handle differentially spliced transcripts, as all identifiers associated with the same gene will inevitably be merged and considered synonymous. Moreover, the capabilities provided by these tools for incorporating new annotations are either limited (FatiGO) or non-existent (DAVID). DAVID provides a static set of functional annotations that are updated and supplemented on an irregular schedule. FatiGO and Marmite allow a user to bring in hand-built flat-files of terms annotating gene identifiers.
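To make the kind of enrichment statistic mentioned above concrete, the following Python sketch computes a one-sided Fisher's exact test p-value for one annotation term as a hypergeometric tail probability. It illustrates the general test only and is not the code used by FatiGO or DAVID; the function name and the example counts are hypothetical, and the gene list is assumed to be a subset of the background.

```python
from math import comb

def fisher_enrichment_p(hits_in_list, list_size, hits_in_background, background_size):
    """Probability of drawing at least hits_in_list annotated genes in a random list."""
    total = comb(background_size, list_size)
    upper = min(list_size, hits_in_background)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, list_size - k)
        for k in range(hits_in_list, upper + 1)
    )
    return tail / total

# Invented counts: 3 of the 10 background genes carry the term,
# and all 3 genes in the query list carry it.
p_value = fisher_enrichment_p(3, 3, 3, 10)
```

With these counts only one of the 120 possible three-gene lists consists entirely of annotated genes, so the p-value is 1/120. In practice such p-values would also need correction for the number of terms tested.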

2.5.2 Interlocutor: A Generic Method for Inference Mining

To address the above-mentioned issues and provide a more flexible system for inference mining, we develop a method for deriving functional associations for a set of genomic elements using our locus-based framework. The method is denoted Interlocutor and is designed as follows.

Interlocutor takes as input a query set of functional loci and a collection of target sets of functional loci that the query set should be compared to. The query set may be the results of a functional genomics screen such as a microarray or proteomics experiment, and the target sets would include annotations of interest such as ontology and disease terms or collections of cis-regulatory motifs. One may also provide a query population set and/or a collection of query control sets to be interrogated in parallel with the query set. The population set or population control would model the population that the query is derived from. For example, a query set of differentially expressed genes in a microarray study may be modelled as a subset of all genes found to be expressed in the study. The control sets or sampling controls would provide a population model for the query set where the entire population is nondeterminable. For example, if the query set consisted of non-coding regions found to be expressed with a sequencing study, one might model the source population with a collection of sets of non-coding regions matched to the query set on size, length distribution, and other parameters of interest. Both the population control and sampling control can subsequently be used to derive measures of significance for associations found in the query set.

Once a query set and corresponding control sets are provided, parametrizations are selected for intersections of the query set with each target annotation set. One may choose to apply any of the locus-transformations such as flattening and filtering to either the query or the target set for a particular comparison. In addition, the parameters for locus intersection must be selected for each pair of query and target set. For example, let us suppose the query set consists of gene-loci and the target set consists of particular secondary structure motifs. To find the association between UTRs of the query genes and secondary structure motifs of type X, one could flatten and filter the query set to include only UTR regions and filter the target set to include only motifs of type X. One could further vary the locus-intersection parameters to find motifs completely contained in the query UTRs or motifs partially overlapping with them.

After pre-transformation and intersection criteria are selected for each target set, query loci are systematically intersected with each target locus set and a LocusNexus describing each pairwise intersection of query locus with target locus is derived. For each intersection of query locus with target locus, all annotations associated with the target locus that can be parsed as a string or set of strings are extracted. A row is added to the interlocutor data-structure (essentially a matrix of bits with columns corresponding to loci in the query set and rows corresponding to associated terms) for each string annotation from the target locus if one does not already exist. The column corresponding to the given query locus in the current row is set to a positive bit and the next intersection in the LocusNexus is parsed. Terms directly associated with the query loci such as chromosome, strand, and type may also be mined and added to the bit-matrix in this fashion. The process of mining words associated with query loci and those associated through target intersections in a LocusNexus is depicted in Figure 10 and Figure 11.

When all intersections in the LocusNexus have been parsed, the query set is replaced with each of the given control sets and intersected with the target set using identical parameters. For each LocusNexus from the control set intersections, term annotations from intersecting target loci are added to corresponding bit-matrix data structures with columns corresponding to items in the current control set and rows corresponding to terms already associated with the query set. Annotations intersecting with control loci that do not have intersections with query loci are discarded. This process is repeated for each target locus set, and a result consisting of a query intersection bit-matrix and control intersection bit-matrices is returned (Figure 11). Corresponding matrices would exist for any control sets provided.
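The bit-matrix bookkeeping described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the Interlocutor implementation: a LocusNexus is reduced to a list of `(locus identifier, annotation strings)` pairs, the matrix is a dictionary of rows, and all identifiers and terms are invented.

```python
def build_bit_matrix(locus_ids, nexus, allowed_terms=None):
    """Rows are terms, columns are loci; a 1 marks a locus associated with a term.

    When allowed_terms is given (the control case), rows exist only for those
    terms and any other term met in the nexus is discarded.
    """
    column = {locus_id: i for i, locus_id in enumerate(locus_ids)}
    if allowed_terms is None:
        matrix = {}
    else:
        matrix = {term: [0] * len(locus_ids) for term in allowed_terms}
    for locus_id, terms in nexus:
        for term in terms:
            if allowed_terms is not None and term not in matrix:
                continue  # term never seen with a query locus
            row = matrix.setdefault(term, [0] * len(locus_ids))
            row[column[locus_id]] = 1
    return matrix

# Hypothetical intersections of query and control loci with annotated target loci.
query_ids = ["q1", "q2", "q3"]
query_nexus = [("q1", ["tfbs", "FoxM1"]), ("q3", ["FoxM1"])]
control_ids = ["c1", "c2"]
control_nexus = [("c1", ["FoxM1", "CpG"]), ("c2", ["tfbs"])]

query_matrix = build_bit_matrix(query_ids, query_nexus)
control_matrix = build_bit_matrix(control_ids, control_nexus, allowed_terms=list(query_matrix))
```

The term `CpG` intersects only a control locus, so it produces no row in either matrix, matching the rule that control-only annotations are discarded.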


Figure 10: Parsing annotations associated with a locus set. For each intersection of query locus with target locus, all annotations associated with the target locus that can be parsed as a string or set of strings are extracted. A row is added to the interlocutor data-structure (essentially a matrix of bits with columns corresponding to loci in the query set and rows corresponding to associated terms) for each string annotation from the target locus if one does not already exist. The column corresponding to the given query locus in the current row is set to a positive bit and the next intersection in the LocusNexus is parsed. Terms directly associated with the query loci, such as chromosome, strand, and type, are also added to the bit-matrix in this fashion. Here terms associated with a query locus directly (“refseq”, “UTR”, “chr”) and those associated with an intersecting annotation locus (“tfbs”, “v$foxm1”, “FoxM1”) are found in the bit-matrix.


Figure 11: Annotation Term Mining with Locus Sets. A query locus set (in orange) is intersected systematically with each of the target locus sets derived from various annotation databases (shown above the query set) and each intersection of query locus with target locus is marked. A bit-matrix with columns corresponding to each locus in the query set and rows corresponding to associated annotation terms is created (note: the bit-matrix shown in the figure is transposed). To associate strings with the query locus set, first, a row is created in the bit-matrix for each unique string directly associated with a query locus and the cells corresponding to the relevant query loci are marked with a positive bit. Next, a row is created for each unique term associated with an annotation locus that overlaps a query locus. The cells corresponding to the intersecting query loci in these rows are then marked with a positive bit.

[Figure 11 graphic: two query loci from the user's study (Query1 and Query2; chr2, + strand, 3'UTR) drawn beneath target annotation tracks (GO, GAD, InterPro, ARED, miRNAMap, TargetScanS, UTRSite, TRANSFAC Known, HMR TFBSCons), with example terms on each track (e.g. GO10295: Development, MESH: breast neoplasms, ZnF, ARE: Type II, mir-133, lin7, IRES, SOX2, ATF2, HOXA1, SP1, CREB, P53) and the resulting transposed bit-matrix listing each term against Query1 and Query2.]

2.5.3 Commodus: Inferring Functional Term Associations

One of the most common tasks in mining functional inferences from a set of genomic loci is simply to look at how frequently and how significantly it is associated with any given functional annotation or combination of annotations. To provide this functionality, we develop a renderer for the Interlocutor data structure that can calculate the frequency and significance with which a locus set associates with an annotation, display these characteristics in an intuitive visual fashion, and further explore combinations of annotations in a recursive fashion. To calculate the frequency with which an annotation is associated with a locus set, one simply takes the cardinality (the number of bits set to true) of the row of the bit-matrix representing the term of interest. To calculate the significance of a given annotation-association, one uses the population-control bit-matrix or the sampling-control bit-matrices as follows. In the case of a population control, each query set is modeled as if drawn from a finite population, and intersections of the query set with an annotation term are compared to intersections of its parent population with the same term. The module then computes the significance of each term by comparing its frequency in the query with its frequency in the population, using a binomial approximation to the hypergeometric test for sampling without replacement (Bain & Engelhardt 2000). This yields a good estimate of the hypergeometric statistic when the population is more than 20 times the size of the query (ref) and suffices for most cases we are interested in. See section A of Figure 12. In the case of sampling controls, the query set is modeled as if sampled from a potentially infinite (or finite but computationally intractable) space and

intersections of the query set with an annotation term are compared to intersections of several matched control sets with the same term. The module computes the significance of each term by building an empirical probability distribution of the frequencies of annotation-associations for the query and sampling control sets and returns the area under the curve for frequencies greater than or equal to the query set annotation-association frequency. See section B of Figure 12. Given a minimum support (frequency) and a minimum confidence (significance), the Commodus module renders all annotation terms meeting these thresholds. It displays the terms as a radial graph about an arbitrary root node where the size of each node corresponds to the support of the term and the distance from a node to the center is inversely proportional to its significance (Figure 13). A user is able to vary the minimum support and confidence parameters as desired and instantly render the updated graph. By clicking on a given node, the user is able to see frequent and significant combinations of the selected word with other annotation terms. In this case, the root node is replaced by the selected node, the query items intersecting with the term are given as a list on the right-hand side, and all significant and frequent term combinations are displayed as a radial graph around the new root (Figure 14). Finally, the user is also able to produce a table containing all the desired information by simply providing a minimum support, a minimum confidence, and a maximum length of term-combinations to be considered.
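The two significance measures described above can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the Commodus code: the function names are hypothetical, the population test is taken to be a one-sided upper-tail binomial probability with the population frequency as the success rate, and the sampling controls are assumed to be size-matched to the query so that raw hit counts can be compared directly. It uses `math.comb`, which requires Python 3.8 or later.

```python
from math import comb

def support(row_bits):
    """Support of a term: number of bits set in its row of the bit-matrix."""
    return sum(row_bits)

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def population_pvalue(query_hits, query_size, pop_hits, pop_size):
    """Binomial approximation to the hypergeometric upper tail.

    The success probability is the term's frequency in the parent population.
    Per the text, the approximation is reasonable when the population is
    more than about 20 times the size of the query.
    """
    return binomial_tail(query_hits, query_size, pop_hits / pop_size)

def sampling_pvalue(query_hits, control_hits):
    """Empirical significance from sampling controls.

    Returns the fraction of sets (query plus controls) whose hit count is
    greater than or equal to that of the query set.
    """
    counts = [query_hits] + list(control_hits)
    return sum(c >= query_hits for c in counts) / len(counts)
```

For example, a term hitting 5 query loci, with matched control sets hitting 1, 2, 5, and 7 loci, gives `sampling_pvalue(5, [1, 2, 5, 7])` equal to 3/5, since three of the five sets reach a count of at least 5.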


Figure 12: Control Statistics for Annotations Intersecting with a Locus Set. To estimate the significance of association of a given term with a query locus set, a population control or sampling control is created. Population controls (A) are appropriate when the query set is drawn from a finite population whose intersections with the annotation term can also be computed. The association of the term in question with the entire population set is measured and then a binomial approximation of the hypergeometric test for sampling without replacement is used to find whether the query set is significantly more enriched in intersections with the term than the overall population. Sampling controls (B) are appropriate where the query set is drawn from a potentially infinite population. Here control sets matching controlled characteristics of the query set (e.g. set size, locus length, region type, G/C content) are used to cover the selection space. The association of the term in question with each of the control sets is measured and the empirical PDF of the frequencies of annotation-association for query and control sets is used to estimate the significance of the query set association (i.e. the total fraction of sets with frequency of association equal to or greater than that of the query set).


Figure 13: Commodus Interface as a Filterable Radial Graph. Commodus viewer interface for displaying frequent and significant intersections. The size of the text reflects the support of an annotation term, while its distance from the center corresponds inversely to its significance. A node is highlighted when the mouse cursor hovers over it. The term, support, and confidence associated with the highlighted node are displayed in the bottom left. The minimum support and minimum confidence thresholds can be set in the boxes at the bottom right and the graph will be updated when the update button is clicked. The table button can be used to generate a table-style output with configurable parameters. The user may also click on a given node to show annotations that are frequent and significant in combination with the selected one or to see member loci associated with it.


Figure 14: Commodus Viewer Recursive Exploration of Annotation Combinations. Clicking on a term in the main view produces a new graph view. The list on the right shows all members of the query set intersecting with the word at the center (GO:0051252). Annotations that are frequent and significant in combination with the center annotation are shown as a radial graph around it. Clicking on any term in this view will show the members of the query set associated with both the root node here and the clicked node, and will further show all words frequent and significant in combination with both of these terms.
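The recursive exploration of term combinations can be approximated by a level-wise search over the rows of the bit-matrix. The sketch below is an assumption about how such a search might be organised, not the Commodus implementation: the function name and the dictionary-of-sets input are hypothetical, and only the minimum-support threshold is applied (a significance filter using the control matrices would be applied to each surviving combination).

```python
def frequent_combinations(rows, min_support, max_len):
    """Enumerate term combinations shared by at least `min_support` loci.

    rows -- {term: set of query locus ids}, i.e. the rows of the bit-matrix
    Returns {tuple of terms in sorted order: set of loci carrying all of them}.
    """
    results = {}
    # Level 1: single terms that meet the support threshold.
    frontier = {(t,): loci for t, loci in rows.items() if len(loci) >= min_support}
    length = 1
    while frontier and length <= max_len:
        results.update(frontier)
        next_frontier = {}
        for combo, loci in frontier.items():
            for term, term_loci in rows.items():
                # Extend only with terms sorting after the last one, so each
                # combination is generated exactly once.
                if term > combo[-1]:
                    shared = loci & term_loci  # AND of the two bit rows
                    if len(shared) >= min_support:
                        next_frontier[combo + (term,)] = shared
        frontier = next_frontier
        length += 1
    return results
```

Because the loci shared by a combination can only shrink as terms are added, a combination that falls below the minimum support is never extended, which keeps the search tractable for small maximum lengths.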


2.5.4 Connexus: Inferring Association Networks

The genomic landscape is composed not only of individual genomic elements with particular functional characteristics but also of a multipartite web of functional interactions, ranging from the trans-interactions of regulators such as transcription factors and miRNAs with their DNA and RNA targets to the protein-protein and signaling interactions that constitute biological pathways. By mining the various databases of known and predicted interactions, one can recast the results of a high-throughput genomics experiment in terms of the underlying network of interactions that drive the biology. There are many different types of genomic interactions and many different databases and tool-sets for cataloguing known interactions and predicting unknown ones from the known sequence biology. The two major categories of interactions are protein-nucleotide interactions, the most well studied group of which is transcription-factor-DNA interactions, and protein-protein interactions, usually studied using imaging, TAP, or two-hybrid systems. Protein-protein interactions are cataloged by a few major sources including the Database of Interacting Proteins (DIP), the Molecular Interactions Database (MINT), the Human Protein Reference Database (HPRD), the IntAct database, and the Biomolecular Interaction Network Database (BIND) (Lehne & Schlitt 2009). These data are usually curated sets of molecular interactions derived from individual screens for interacting partners and are provided in one of a few protein-interaction description formats that include the Simple Interaction Format (SIF), the Graph Markup Language (GML), and the Proteomics

Standards Initiative Molecular Interactions XML schema (PSI-MI) (Shannon et al. 2003). Data in these formats are easily portable to common tools for visualizing protein interaction networks such as Cytoscape (Shannon et al. 2003) and VisANT (Hu et al. 2004). Protein-nucleotide interactions and the related set of nucleotide-nucleotide interactions are characterized by the binding of a regulatory protein or functional RNA to a cis-regulatory motif on the target nucleotide molecule. Most resources for studying this family of functional interactions are dedicated to cataloguing the cis-regulatory motifs bound by the regulatory actors. These include databases of transcription factor binding sites such as TRANSFAC and JASPAR (Kielbasa et al. 2005), databases of RNA cis-regulatory motifs such as Rfam (Griffiths-Jones et al. 2005) and UTRdb (Mignone et al. 2005), which catalogue protein-binding motifs in RNA, and miRBase, which catalogues microRNA sequences (Griffiths-Jones et al. 2006). There also exist a variety of motif-searching tools that can be used to search for a given cis-regulatory motif from these databases. They include sequence-motif matchers such as TRANSFAC's Match (Wingender et al. 2000) or miRBase's miRanda (John et al. 2004) and structural motif matchers such as UTRdb's PatSearch (Pesole et al. 2000). Predicted motifs may also be filtered based on phylogenetic conservation; the miRNA target predictor PicTar provides a good example of this (Krek et al. 2005). Many of these databases, including TRANSFAC, Rfam, PicTar, miRBase, and UTRdb, provide compiled sets of known and predicted interactions. In addition, the rise of technologies that combine chromatin immunoprecipitation (ChIP) or RNA immunoprecipitation (RIP) with microarrays (Chip) or high-throughput sequencing (Seq)

is providing new data sources for interactions that are more biologically grounded. One of the most common tasks in analyzing the results of a high-throughput genomics experiment is to recast them in the context of the genomic-interaction network. In a microarray experiment, recasting upregulated and downregulated genes into the context of transcriptional regulatory networks lets us visualize the regulatory cascades by which these genes are turned on or off. It also allows us to find important control points (factors that regulate many of the genes) and isolated subcomponents in the system (where treatments can be effected with fewer side-effects). Despite this utility, relatively few methods have been developed for visualizing the transcriptional responses underlying gene expression change in high-throughput genomics data. We have developed a novel method for integrating databases of known and predicted transcriptional regulatory interactions with expression correlation data from microarrays to infer the strength and extent of transcriptional responses controlling the observed changes in expression. This procedure is applied to the analysis of transcriptional responses downstream of p38 inhibition in a tumor dormancy model and is detailed in Appendix I. Once a transcriptional or post-transcriptional network is inferred, its systemic properties can be readily observed. Molecular biological interaction networks can be modelled as a multipartite directed graph composed of interactions at the transcriptional, post-transcriptional, and protein level that together are responsible for the progression of the developmental and environmental cascades into the available survivable potential wells. In any single partition of this network of networks, the interactions between the extant nodes can be analyzed using graph theoretic methods. While

the study of graph theoretic properties such as degree distribution, average path length, and clustering coefficient has become an accepted first approach to understanding the systemic properties of these networks, its application has been hampered by a lack of control measures to determine whether the measured properties are significant. We have developed methods and tools for measuring the graph theoretic properties of a subset of nodes within a larger interactome and developed control generation methods to assess the significance of both the selection and the connectivity properties of the network. This result is presented in Appendix II. In the locus-based analysis framework, we proceed by intersecting any given result set of genomic loci with databases of real and predicted interaction sites, be they binding sites for trans-actors such as transcription factors, mRNA binding proteins, and miRNAs, or protein domains responsible for protein-protein interactions or signal reception. The result is a set of pairs of interactors where the first element is a node from the query set and the second is extrapolated from the interaction site loci. The nodes in this network can then be overlaid with expression data or other functional annotations, and edges can be tied to scores of predicted interactions or correlations in expression. This allows a dynamic multipartite network view where many different types of genomic interactions can be simultaneously visualized and filtered on the properties of the nodes and edges. The major advantage of the locus-based framework is the ability to continuously update and integrate new datasets of discovered and predicted interaction sites and the results of new genomics experiments. Once a framework for network-building using locus-based resources is established, it can be used homogeneously across any variety of

interaction-annotations. Once a query locus-set is intersected with any number of target data sets, we use the resulting interlocutor bit-matrix data structure to build the network. We interpret both the rows and the columns of the matrix as potential nodes and true bits in the matrix as edges. The edges are drawn from target locus to query locus, as in most cases the target loci represent regulatory sites (such as transcription factor binding sites or miRNA target sites) affecting the query loci. This corresponds to drawing an edge between each target locus and intersecting query locus (Figure 15). The Connexus tool takes the interlocutor bit-matrix and builds a network visualization by drawing nodes and edges for all true bits in the matrix (Figure 16). The initial network contains just connections between target and query loci. In many instances, however, the target loci are binding sites for the products of query loci. For example, when drawing connections between transcription factor binding sites and target genes, it is important to translate the transcription factor identifiers to the notation used by the query gene loci so that synonymous nodes can be merged and the hierarchy of regulatory cascades visualized (Figure). We provide a rudimentary facility for importing sets of synonyms corresponding to identifiers in the query and target locus sets. Nodes that can be linked by any chain of synonyms are recursively merged into a single node. The Connexus interface also provides a facility for filtering the visualized network to include only nodes that are derived, at least in part, from a target locus. This allows visualization of just the regulatory nodes (for example, transcription factors, miRNAs, RNA-binding proteins) and the interactions between them, which lets us more quickly pick out master regulators responsible for driving the phenotype of interest.
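The network construction described above (true bits as target-to-query edges, merging of nodes linked by chains of synonyms, and filtering to target-derived nodes) can be sketched as follows. This is an illustrative sketch, not the Connexus source: the function names are hypothetical, the bit-matrix is represented as a dictionary from target term to the set of intersecting query ids, and synonym chains are merged with a simple union-find, which is one of several ways to implement the recursive merge.

```python
def build_network(rows, synonyms=()):
    """Interpret a sparse bit-matrix as a directed network.

    rows     -- {target term: set of query ids}; every true bit is one edge
    synonyms -- iterable of (identifier, identifier) pairs to be merged
    Returns (nodes, edges): nodes maps each representative id to the set of
    merged identifiers; edges is a set of (regulator, regulated) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the alphabetically smaller id as representative.
            parent[max(ra, rb)] = min(ra, rb)

    for a, b in synonyms:
        union(a, b)

    # Edges run from target locus (regulatory site) to query locus.
    edges = {(find(t), find(q)) for t, qs in rows.items() for q in qs}

    nodes = {}
    for x in list(parent):
        nodes.setdefault(find(x), set()).add(x)
    return nodes, edges

def regulators_only(nodes, edges, target_terms):
    """Keep only edges between nodes derived, at least in part, from a target locus."""
    targets = set(target_terms)
    keep = {rep for rep, ids in nodes.items() if ids & targets}
    return {(s, d) for s, d in edges if s in keep and d in keep}
```

As a usage example with made-up edges that borrow the identifiers of Figure 16: with `rows = {"TFX": {"Gene2"}, "TFA": {"Gene1", "Gene3"}}` and synonyms linking `Gene1` and `TFX` to `SOX2` and `TFA` to `HOXA1`, the site `TFX` and the gene `Gene1` collapse into one node, so the edge from `TFA` to `Gene1` becomes a regulator-to-regulator edge that survives `regulators_only`.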

Figure 15: Locus Intersections as Interactions A LocusNexus depicting the pairwise intersections between loci from a query set (green) and a target set of binding sites (yellow) is interpreted as nodes and edges in a network. Each intersection of a binding site locus with a query locus is treated as an edge between the target locus and the query locus.

[Figure 15 graphic: a LocusNexus pairing intersecting query loci with intersecting target loci (q2-t1, q2-t3, q3-t4, qN-t4, qN-tK). Example query locus: id “locus1”, source “refseq”, type “UTR”, chr “chr1”, strand “+”, start 673894, end 687631, annotations symbol “CDC2”, trait “zounds”. Example target locus: id “v$foxm1”, source “tfbs”, type “tfbs.pred”, chr “chr1”, strand “+”, start 687627, end 687631, annotations symbol “FoxM1”, score 0.983. Nodes in graph: CDC2 (LocusSet: q2, t3) and FoxM1 (LocusSet: t1, t4, q9). Edges in graph: Node(FoxM1)->Node(CDC2) ...]

Figure 16: Building Regulatory Networks from Locus Intersections Intersections between a target locus and a query locus are treated as edges between the implied target node and the implied query node in the network. Terms that bridge the query locus gene identifiers with the target locus transcription factor identifiers are used to merge synonymous nodes.

[Figure 16 graphic: a locus view, in which binding sites TFX (SOX2) and TFA (HOXA1) intersect query loci Gene1 (SOX2), Gene2 (COLT45), and Gene3 (WEBL), alongside the corresponding network view with merged nodes SOX2 (Gene1, TFX), HOXA1 (TFA), COLT45 (Gene2), and WEBL (Gene3).]

Figure 17: Connexus Interface The Connexus interface loads an Interlocutor bit-matrix and interprets it as an adjacency matrix for a network. Synonyms can be loaded to bridge query and target identifiers and the network can be filtered to include only target-derived nodes (ostensibly the regulators). Loaded synonyms for a connected node are displayed when it is selected and targets of its outbound edges are shown in a box to the right.


3. Ribonomic Profiling of HuR Targets

Note: The data discussed in the following section was derived from RNA immunoprecipitation work carried out by Ritu Jain and Tiffany Devine at the Tenenbaum Lab and from gene-expression arrays prepared by David Frank, Marcy Keuntzel, and Sridar Chittur at the SUNY Albany Center for Functional Genomics Microarray Core. The methods were collated from discussions with the above-mentioned parties. The analysis work discussed, however, was conducted solely by the author of this thesis.


3.1 Summary

In the following section we demonstrate the application of the locus-based tools in comparing data from different microarray platforms to each other and to various functional annotations. Specifically, we conduct the first cross-platform, cross-cell-line comparison of RIP-Chip targets on differential expression profiling platforms for a single RNA binding protein, HuR. The low level of cross-cell-line intersection among HuR target sets suggests that HuR-mediated mRNA stabilization may act on different sets of mRNAs under different conditions. The discordance between HuR RIP targets found across platforms indicates that protocol optimization and performance analysis must be done for each differential expression profiling platform before its RIP-Chip targets can be considered representative. The available information does not allow us to conclude whether differences between the HuR target sets are biological or due to differences in array technology. We show that all HuR IP sets studied here show enrichment of ARE elements but that the presence of an ARE element does not necessarily imply HuR association. Furthermore, this study demonstrates the application of novel locus-based approaches to comparing result sets from different expression profiling platforms to each other and to collections of cis-regulatory elements.


3.2 Introduction

3.2.1 Post-Transcriptional Gene Regulation

The central dogma of molecular biology, the mechanism by which genomic code becomes functional protein, revolves around the processes of transcription and translation (Kauffman 1993). Most work on genomic regulation to date has been concerned with transcription, whereby transcription factors coordinately regulate the transcription of DNA sequences into messenger RNA (Boguski 2004). Most measures of whether a particular gene is expressed in a given experimental system are measures of the presence of the gene's mRNA, and the transcription of a gene is usually taken to be sufficient indication of its activity and relevance (Knudsen 2004). The reality differs from this binary viewpoint wherein genes are simply on or off in response to transcriptional signals. In fact, a cell carefully regulates the quantities, varieties, and localities of the various mRNAs it produces, not only through combinations of transcriptional signals, but also post-transcriptionally through the activity of various ribozymes and mRNA binding proteins (Keene & Tenenbaum 2002). A multitude of crucial regulatory processes are handled post-transcriptionally, including alternative splicing, localization, degradation, and translation efficiency (Kauffman 1993; Keene & Tenenbaum 2002). The power and scope of expression profiling methods are rapidly expanding. Gene and exon array platforms profile known isoforms of protein-coding genes (Dalma-Weiszhausz et al. 2006). Arrays to profile the expression of microRNAs and other


functional RNAs have recently become available (Shingara et al. 2005). Tiling arrays have been developed to naively profile expression from large stretches of chromosomes (Bertone et al. 2004). High-throughput sequencing technologies are now emerging in which the expression levels and sequences of nearly any transcript can be directly interrogated (Hutchison 2007). We can now look at expression changes not only in known genes but also in a whole array of other functional sequences, such as those in UTRs, whose expression patterns have thus far remained a mystery. We can now explore the actions of RNA binding proteins, the post-transcriptional analogues and ancestors of transcription factors, that control the splicing, stability, and localization of expressed RNA.

3.2.2 RNA Binding Proteins & Ribonomic Profiling

There are an estimated 500 to 3000 RNA-binding proteins (RBPs) encoded in the human genome, making this family of proteins as abundant as transcription factors (Keene 2001). RBP-RNA interactions are mediated by motifs, usually in the 3' and 5' untranslated regions (UTRs), with a specific combination of sequence and/or secondary structure. This makes RBP binding sites difficult to characterize using conventional tools, most of which are geared towards detecting sequence motifs such as those found in transcription factor binding sites or in protein domains. To date, there are very few RBPs whose binding motifs have been well characterized (Tenenbaum et al. 2000). Most known RBP binding sites are located in the 5' or 3' UTRs of the mRNA, though there are a few reports of RBPs binding coding regions of genes (Keene & Tenenbaum 2002). We focus on targeting RBPs to better understand the post-transcriptional gene-regulation networks responsible for splicing, editing, localization, stabilizing, and

regulating translation of mRNAs as they move from transcription to translation (Figure 18). We have developed genome-scale, antibody-based technologies for immunoprecipitating endogenous RBP-mRNA complexes from cellular extracts and have identified associated messages using microarray technology (Figure 19) (Penalva et al. 2004). This new approach to post-transcriptional functional genomics (ribonomic profiling) has facilitated the high-throughput, quantitative identification of multiple mRNA targets of many RBPs and facilitated the analysis of the structural and functional relationship of encoded proteins from these co-regulated mRNAs. Ribonomic profiling, unlike total mRNA expression profiling, has the potential to shed light on the infrastructure of coordinated eukaryotic post-transcriptional gene expression and the involvement of RBPs in mechanisms compartmentalizing and regulating RNA subsets.


Figure 18: The multiple aspects of gene-expression Gene expression can be regulated at different levels. Two families of proteins control most of the steps that link the genome to the proteome. Transcription factors are responsible for the first step and regulate RNA transcription with all the events after this point (RNA splicing, transport, stability, localization and translation) being regulated by RNA binding proteins.

[Figure 18 graphic: flow from genomic DNA through transcription (transcription factors) to immature RNA; through splicing and editing to mature RNA (mRNA, ncRNA); through stabilization, localization, and translation (RNA binding proteins) to protein; and through modification and localization (various proteins).]

Figure 19: Ribonomic Profiling using RIP-Chip In a typical ribonomic profiling workflow: 1) the total RNA pool of interest (usually a total cell lysate from the cell-line of interest) is obtained, 2) antibody specific to the RNA-binding protein of interest is then bound to protein A or G coated beads, 3) the cell lysate is then added to the prepared beads, 4) 10% of the mixture is separated for use as the input RNA sample (to control for conditions in the pre-immunoprecipitation mixture), 5) RBP-immunoprecipitation is performed by incubating the remaining cell-lysate with the antibody bound beads, 6) the bead mixture is washed to remove unbound lysate, 7) the bound RNA-protein complexes are eluted from the beads, 8) the bound RNA is separated from the RBP and set aside as the IP RNA, 9) mass matched amounts of IP RNA and input RNA are put through amplification steps and hybridized to microarrays, and 10) finally microarray profiles of the IP RNA and input RNA are compared to each other and to profiles of total RNA to determine RBP-associated RNAs.

[Figure 19 graphic: RIP-Chip workflow. Antibody to the RBP is bound to immobilized protein A/G; total cell lysate containing endogenous RBP associated with RNA is added to the IP bead mixture; 10% is isolated as input RNA and the remainder is incubated; bound RNA is eluted; IP RNA and input RNA are mass matched (50ng) to produce the IP profile and input profile; a total RNA prep produces the total profile.]

3.2.3 Profiling of mRBP Targets

There are a variety of approaches currently used to find an RBP's targets using differential gene-expression profiling platforms. Most commonly, the gene expression signal in RIP-Chip samples for the RBP of interest is compared to expression signals in negative control RIP-Chip samples. A negative-control RIP-Chip sample is created by following the RIP-Chip procedure using antibody to a protein that does not bind RNA (IgG for example) or to a protein not found in the cells of interest (usually G10). If a tag-recombinase system, such as T7, is used for the specific IP, the negative control can be created by running the same IP with lysate from cells not transfected with the recombinant protein. Enrichment measures designed to detect differences between total RNA samples (such as Z-score or normalized fold-change differentials) are usually employed without modification. RIP-Chip studies to date have only performed secondary validation (by RT-PCR or Northern blot) on a small, top-scoring subset of identified RBP targets, as validation of association or non-association of all probed transcripts is precluded by time and cost limitations. This makes it impossible to directly gauge the effectiveness of any given expression profiling platform or analysis methodology in finding true targets. We can, however, study the global regulatory profile of a given RBP by asking a few specific questions. How does the RIP-Chip identified set of targets compare to the known set of targets for the RBP? How does this set compare to previously identified RIP-Chip target sets for the protein of interest? Finally, how does the functional profile of the RIP-Chip target set compare to the known functions of the RBP in question? We have engineered a platform for asking these questions in a robust fashion using an analysis

platform grounded in genomic positional coordinates (or loci) (Figure 20), and we illustrate its application to profiling targets of the HuR mRNA binding protein.


Figure 20: A Locus-Based Framework for Genomics Data Analysis. The locus-based framework for genomics data analysis allows comparisons of data and annotations from a wide variety of platforms and databases. The framework proceeds by first associating each dataset with its genomic positional coordinates. Once all sets are converted to genomic loci, comparisons between two genomic locus sets, A and B, can be enumerated in terms of the number of loci in A that overlap loci in B and vice versa. The criteria for overlap can be as loose as a one base-pair overlap or as strict as 100% overlap of one locus with another. The principle of overlapping loci can be used to enumerate similarities and differences between the datasets. Locus-based comparisons can also be made between a dataset of interest and functional annotations (such as ontological, regulatory, and disease associations) that have been mapped to genomic positional coordinates. By examining how many members of a dataset overlap each functional annotation, it is possible to profile a dataset in terms of the functional and regulatory characteristics associated with it.
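The overlap criteria described in this caption can be made concrete with a small Python sketch. It is a minimal illustration under stated assumptions: the `Locus` type and function names are hypothetical, coordinates are taken to be 1-based and inclusive, loci are required to share both chromosome and strand, and the all-pairs comparison is written for clarity rather than speed.

```python
from typing import NamedTuple

class Locus(NamedTuple):
    chrom: str
    strand: str
    start: int  # inclusive (assumed convention)
    end: int    # inclusive (assumed convention)

def overlap_bp(a, b):
    """Number of base pairs shared by two loci (0 if none)."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)

def intersects(a, b, min_bp=1, min_fraction_of_b=0.0):
    """True if a and b overlap under the given criteria.

    min_bp=1 is the loosest criterion (a single shared base pair);
    min_fraction_of_b=1.0 requires b to be completely contained in a.
    """
    bp = overlap_bp(a, b)
    return bp >= min_bp and bp / (b.end - b.start + 1) >= min_fraction_of_b

def count_overlapping(set_a, set_b, **criteria):
    """Number of loci in set_a that overlap at least one locus in set_b."""
    return sum(any(intersects(a, b, **criteria) for b in set_b) for a in set_a)
```

Counting in both directions, `count_overlapping(A, B)` and `count_overlapping(B, A)`, gives the two figures mentioned in the caption, and dividing by the size of a set gives the percentage of its members overlapping a given annotation. For large locus sets one would normally replace the all-pairs loop with a sorted sweep or an interval tree.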

[Figure 20 schematic. Genomics datasets (probe signals, sequences) are mapped to genomic coordinates and placed, as Dataset 1 and Dataset 2, alongside annotation tracks on the same chromosome and strand: GO (Gene Ontology terms), GAD (Genetic Association DB), InterPro (protein domains), ARED (AU-rich elements), miRNAMap (cross-DB miRNA targets), TargetScanS (conserved miRNA targets), UTRSite (exonic UTRSite motifs), Transfac (known validated TF-gene relations), and HMR TFBSCons (conserved predicted TFBS). The comparison to other annotations is summarized as a table listing, for each annotation term, the percentage of loci in Dataset 1 and in Dataset 2 that overlap it.]
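The overlap principle described in the Figure 20 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LIA implementation: the `Locus` type, the half-open coordinate convention, and the choice to measure the overlap fraction against the shorter locus are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Locus:
    """Hypothetical locus record: half-open interval [start, end)."""
    chrom: str
    start: int
    end: int
    strand: Optional[str] = None  # "+", "-", or None when unknown


def overlaps(a: Locus, b: Locus, min_fraction: float = 0.0) -> bool:
    """Return True if a and b overlap on the same chromosome.

    min_fraction is measured against the shorter locus (an assumption):
    0.0 accepts a single shared base, 1.0 requires the shorter locus to
    lie entirely within the longer one. Strand is compared only when
    both loci carry one ("permissive" strand handling).
    """
    if a.chrom != b.chrom:
        return False
    if a.strand and b.strand and a.strand != b.strand:
        return False
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return False
    shorter = min(a.end - a.start, b.end - b.start)
    return shared >= min_fraction * shorter


def count_in(a_set: List[Locus], b_set: List[Locus],
             min_fraction: float = 0.0) -> int:
    """Number of loci in a_set that overlap at least one locus in b_set."""
    return sum(any(overlaps(a, b, min_fraction) for b in b_set)
               for a in a_set)
```

Calling `count_in(A, B)` and `count_in(B, A)` gives the two directional counts ("A in B" and "B in A"). The quadratic scan is adequate for a sketch; a real tool would presumably index loci by chromosome and position.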

3.2.4 The mRNA Binding Protein HuR

HuR (Hu-antigen R, also known as ELAVL1, embryonic lethal abnormal vision like 1) is one of four human members of the ELAV or Hu family of developmental mRNA binding proteins and is thought to be involved in cell cycle regulation, carcinogenesis, and muscle differentiation (Peng et al. 1998). HuR is localized primarily in the nucleus in unstimulated cells but is actively shuttled from the nucleus to the cytoplasm in response to MAPK, AMPK, or PKC signaling (Hayakawa et al. 2004). HuR is expressed ubiquitously across tissues, unlike its family members HuB, HuC, and HuD, which show primarily neuronal expression. HuR protein is made up of three conserved RNA recognition motifs (RRMs) and shows a high affinity for AU- or U-rich sequences in its cytoplasmic role as a stabilizer of ARE-containing mRNAs (Peng et al. 1998). HuR has recently been described as a translation and turnover regulating (TTR) RBP, a group that includes proteins that either destabilize (TTP, AUF1, KSRP, hnRNPA1, BRF1) or stabilize (Hu proteins, NF90) mRNAs or inhibit protein translation (TIA-1, TIAR).

The ARE is the most well studied example of a TTR RBP motif and is the most common determinant of RNA stability in mammalian cells, targeting a large collection of early-response-gene mRNAs, including differentiation factors, proto-oncogenes, transcription factors, and cytokines, for rapid degradation (Barreau et al. 2006). AREs are normally found in the 3' UTRs of mRNAs and can be divided into two families: those that contain the AUUUA pentamer and those that do not (Chen & Shyu 1994). The AUUUA-containing AREs are further broken down into Class I AREs, which contain 1-3 AUUUA pentamers (or UUAUUUAUU nonamers) coupled with a long, continuous U-rich sequence, and Class II AREs, which contain multiple consecutive and overlapping instances of the AUUUA sequence (e.g., AUUUAUUUAUUUA) (Barreau et al. 2006). The non-AUUUA-containing AREs are less well defined, but some known instances contain a long, continuous U-rich sequence and multiple GUUUG pentamers (Chen & Shyu 1994). Given the lack of understanding of ARE composition, some simply define it as a long stretch of mRNA that contains mostly U or A residues and sometimes a few G residues but rarely ever a C residue (Barreau et al. 2006).

Many studies have shown that HuR binds to ARE-containing, specifically class I ARE-containing, mRNAs and that HuR-mediated mRNA stabilization can be affected by modifying a target ARE sequence (Chen et al. 2002). Other groups dismiss the ARE as being the specific HuR binding element. One of these groups finds a 17-20 bp stem-loop motif with a 6-base stem, an 8-base loop, and a U residue on the end of the loop that is conserved across their seed HuR target sequences (de Silanes et al. 2004). Another finds that HuR shows poly(A) binding activity and appears to be able to simultaneously bind both the ARE and the poly(A) tail in vitro to stabilize its targets (Peng et al. 1998). There are also reports of HuR binding to large non-ARE sequences that potentially form complex, nested stem-loop structures in 3' UTR regions (Goldberg-Cohen et al. 2002), 5' UTR regions (Meng et al. 2005), and even coding regions (Prechtel et al. 2006).

In this study, we take HuR IP material from the K562 cell line (an immortalized myelogenous leukemia cell line that is one of the two Tier I ENCODE cell lines) and interrogate the same RNA with two different gene expression profiling platforms: the Human Exon 1.0 ST from Affymetrix and the Human Whole Genome G4112F from Agilent. We also profile HuR targets from the other ENCODE Tier I cell line, GM12878 (of lymphoblastoid origin), on the same Affymetrix Exon platform. We compare targets from each of these experiments to known HuR targets (from literature where confirmed by RT-PCR) (see Table 1 and Methods) and to literature-derived reports of HuR RIP-Chip targets in HeLa and RKO cells on the MGC array platform from the Gorospe group and in normal and MCT-1 overexpressing MCF7 cells on the Illumina HumanRef8 platform from the Gartenhaus group. We compare the overall functional profiles across sets generally and pick out associated annotations of particular interest, including AU-rich elements and RT-PCR validated HuR targets, for a more detailed examination.


Table 1: High-Confidence Reports of HuR Targets from Literature

HGNC | ENSG ID | Gene Name | Reference
ACTB | ENSG00000075624 | ACTIN, BETA | (Dormoy-Raclet et al. 2007)
ACTG1 | ENSG00000184009 | ACTIN, GAMMA 1 | (de Silanes et al. 2004)
ADRB1 | ENSG00000043591 | ADRENERGIC, BETA-1-, RECEPTOR | (Blaxall et al. 2000)
ADRB2 | ENSG00000169252 | ADRENERGIC, BETA-2-, RECEPTOR | (Blaxall et al. 2000)
AR | ENSG00000169083 | ANDROGEN RECEPTOR | (Yeap et al. 2002)
BMP6 | ENSG00000153162 | BONE MORPHOGENETIC PROTEIN 6 | (L. Burt Nabors et al. 2001)
CALCR | ENSG00000004948 | CALCITONIN RECEPTOR | (Yasuda et al. 2004)
CCL11 | ENSG00000172156 | CHEMOKINE (C-C MOTIF) LIGAND 11 | (Atasoy et al. 2003)
CCNA2 | ENSG00000145386 | CYCLIN A2 | (W. Wang et al. 2002)
CCNB1 | ENSG00000134057 | CYCLIN B1 | (W. Wang et al. 2002)
CCND1 | ENSG00000110092 | CYCLIN D1 | (W. Wang et al. 2002)
CCND2 | ENSG00000118971 | CYCLIN D2 | (Briata et al. 2003)
CCNE1 | ENSG00000105173 | CYCLIN E1 | (Guo & Hartley 2006)
CD40LG | ENSG00000102245 | CD40 LIGAND | (Sakai et al. 2003)
CD83 | ENSG00000112149 | CD83 ANTIGEN | (Prechtel et al. 2006)
CDH2 | ENSG00000170558 | CADHERIN 2, TYPE 1, N-CADHERIN | (de Silanes et al. 2004)
CDKN1A | ENSG00000124762 | CYCLIN-DEPENDENT KINASE INHIBITOR 1A | (W. Wang et al. 2000)
CDKN1B | ENSG00000111276 | CYCLIN-DEPENDENT KINASE INHIBITOR 1B | (Kullmann et al. 2002)
CSF2 | ENSG00000164400 | COLONY STIMULATING FACTOR 2 | (J. Wang et al. 2006)
CTNNB1 | ENSG00000168036 | CATENIN, BETA 1 | (López de Silanes et al. 2003)
DEK | ENSG00000124795 | DEK ONCOGENE | (de Silanes et al. 2004)
EIF4E | ENSG00000151247 | EUKARYOTIC TRANSLATION INITIATION FACTOR 4E | (Lal et al. 2004)
FOS | ENSG00000170345 | V-FOS HOMOLOG | (Chen & Shyu 1994b)
FSHB | ENSG00000131808 | FOLLICLE STIMULATING HORMONE, BETA | (Manjithaya & Dighe 2004)
GAP43 | ENSG00000172020 | GROWTH ASSOCIATED PROTEIN 43 | (Chung et al. 1997)
GSK3B | ENSG00000082701 | GLYCOGEN SYNTHASE KINASE 3 BETA | (Tiruchinapalli et al. 2008)
HDAC2 | ENSG00000196591 | HISTONE DEACETYLASE 2 | (de Silanes et al. 2004)
HLF | ENSG00000108924 | HEPATIC LEUKEMIA FACTOR | (de Silanes et al. 2004)
IL1B | ENSG00000125538 | INTERLEUKIN 1, BETA | (Meisner et al. 2004)
IL2 | ENSG00000109471 | INTERLEUKIN 2 | (Seko et al. 2004)
IL3 | ENSG00000164399 | INTERLEUKIN 3 | (Raghavan et al. 2001)
IL4 | ENSG00000113520 | INTERLEUKIN 4 | (Meisner et al. 2004)
IL6 | ENSG00000136244 | INTERLEUKIN 6 | (Nabors et al. 2001)
IL8 | ENSG00000169429 | INTERLEUKIN 8 | (Suswam et al. 2005)
JUN | ENSG00000177606 | V-JUN HOMOLOG | (Briata et al. 2003)
MARCKS | ENSG00000155130 | MYRISTOYLATED ALANINE-RICH PROTEIN KINASE C SUBSTR | (Wein et al. 2003)
MMP9 | ENSG00000100985 | MATRIX METALLOPEPTIDASE 9 | (Akool et al. 2003)
MTA1 | ENSG00000182979 | METASTASIS ASSOCIATED 1 | (de Silanes et al. 2004)
MYC | ENSG00000136997 | V-MYC HOMOLOG | (Lafon et al. 1998)
MYCN | ENSG00000134323 | V-MYC, NEUROBLASTOMA DERIVED | (Ma et al. 1996)
MYOD1 | ENSG00000129152 | MYOGENIC DIFFERENTIATION 1 | (van der Giessen et al. 2003)
MYOG | ENSG00000122180 | MYOGENIN (MYOGENIC FACTOR 4) | (van der Giessen et al. 2003)
NDUFB6 | ENSG00000165264 | NADH DEHYDROGENASE 1 BETA 6 | (de Silanes et al. 2004)
NF1 | ENSG00000196712 | NEUROFIBROMIN 1 | (Haeussler et al. 2000)
NOS2A | ENSG00000007171 | NITRIC OXIDE SYNTHASE 2A | (Rodriguez-Pascual et al. 2000)
PITX2 | ENSG00000164093 | PAIRED-LIKE HOMEODOMAIN TRANSCRIPTION FACTOR 2 | (Briata et al. 2003)
PLAU | ENSG00000122861 | PLASMINOGEN ACTIVATOR, UROKINASE | (Tran et al. 2003)
PLAUR | ENSG00000011422 | PLASMINOGEN ACTIVATOR, UROKINASE RECEPTOR | (Tran et al. 2003)
PPP1CB | ENSG00000213639 | PROTEIN PHOSPHATASE 1, CATALYTIC SUBUNIT, BETA | (Lal et al. 2004)
PTGS2 | ENSG00000073756 | PROSTAGLANDIN-ENDOPEROXIDE SYNTHASE 2 | (Nabors et al. 2001)
PTMA | ENSG00000187514 | PROTHYMOSIN, ALPHA | (Lemay et al. 2008)
RAB10 | ENSG00000084733 | RAB10, MEMBER RAS ONCOGENE FAMILY | (Lal et al. 2004)
SERPINB2 | ENSG00000197632 | SERPINE PEPTIDASE INHIBITOR, CLADE B, MEMBER 2 | (Manohar et al. 2002)
SLC11A1 | ENSG00000018280 | SOLUTE CARRIER FAMILY 11, MEMBER 1 | (Xu et al. 2005)
SLC2A1 | ENSG00000117394 | SOLUTE CARRIER FAMILY 2, MEMBER 1 | (Xu et al. 2005)
SLC5A1 | ENSG00000100170 | SOLUTE CARRIER FAMILY 5, MEMBER 1 | (Loflin & Lever 2001)
SLC7A1 | ENSG00000139514 | SOLUTE CARRIER FAMILY 7, MEMBER 1 | (Yaman et al. 2003)
THBS1 | ENSG00000137801 | THROMBOSPONDIN 1 | (Mazan-Mamczarz et al. 2008b)
TIMP3 | ENSG00000100234 | TIMP METALLOPEPTIDASE INHIBITOR 3 | (Lal et al. 2004)
TLR4 | ENSG00000136869 | TOLL-LIKE RECEPTOR 4 | (Lin et al. 2006)
TNF | ENSG00000206439 | TUMOR NECROSIS FACTOR | (Dean et al. 2001)
TP53 | ENSG00000141510 | TUMOR PROTEIN P53 | (Mazan-Mamczarz et al. 2003)


3.3 Methods

3.3.1 IP Sample Preparation

Lysates of cultured K562 and GM12878 cells were prepared as described in (Penalva et al. 2004; Baroni et al. 2008). Briefly, cells were washed with ice-cold PBS, then harvested using a scraper, and pelleted by centrifugation at 4°C at 3000g for 5 min. Polysome lysis buffer II, in 1.5 times the volume of the cell pellet, was added and the combination was mixed to uniformity by pipetting and then spun in a microcentrifuge for 10 min (14,000g) at 4°C. The supernatant was removed and saved. The pellet was resuspended in 1 volume (compared to the cell pellet) of lysis buffer, the pipetting and centrifugation steps were repeated, and the resulting supernatant was added to the previous batch. One g samples of each lysate were set aside to be used as total samples.

For the immunoprecipitation steps, the RNP lysate was centrifuged in a Microfuge for 10 min (14,000g) at 4°C and the resultant supernatant was transferred to a new tube on ice. Antibody-coated beads, sufficient to provide 5% (v/v) for each immunoprecipitation reaction, were rinsed repeatedly with ice-cold NT2 buffer and then resuspended in NT2 buffer supplemented with 100 U/mL RNase OUT, 0.2% vanadyl ribonucleoside complexes, 1 mM dithiothreitol, and 20–30 mM EDTA so that the volume of resuspended beads in NT2 buffer corresponded to approximately 10 times the volume of the RNP lysate being used. The resuspended antibody-coated beads were mixed several times by inversion, the RNP lysate was added, and the immunoprecipitation reaction mixtures were tumbled end-over-end at room temperature for 2–4 h. 10% of each sample was collected at the beginning of this incubation to serve as the input RNA control to assess RNase activity. After the incubation, the beads were spun down and washed four to six times with approximately 10–20 bead volumes of ice-cold NT2 buffer, with vigorous mixing between each rinse.

The washed beads were resuspended in 600 μL of proteinase K digestion buffer plus 25 μL of the proteinase K stock solution and incubated for 30 min at 50°C, with occasional mixing. 600 μL of phenol-chloroform was added to the bead suspension, which was then vortexed for 1 min and centrifuged at 14,000g at 4°C for 10 min. The upper phase was removed and the extraction was repeated with 1 vol of chloroform. The RNA was precipitated by adding 1 vol of isopropanol, 60 μL of 4 M ammonium acetate, 3 μL of 1 M MgCl2, and 8 μL of glycogen. The samples were stored at -80°C until ready for gene expression analysis, at which point they were centrifuged for 30 min (14,000g) at 4°C and washed with 100 μL of 80% ethanol.

3.3.2 Affymetrix Human Exon 1.0 ST Microarray Protocol

RNA isolated from input as well as IP samples was checked for quality using a NanoDrop spectrophotometer and an Agilent Bioanalyzer. 50 ng of IP RNA from HuR and G10 IP, 50 ng of input RNA, and 1 μg of total RNA (without the Affymetrix Small Sample RNA Amplification Kit used for the smaller RNA samples) were converted to labeled, fragmented cRNA using the Affymetrix WT Sense Target Labeling protocol. Briefly, the RNA was treated with the Invitrogen RiboMinus Human/Mouse Transcriptome Isolation Kit and subsequently converted to single-stranded cDNA using Superscript II reverse transcriptase and the GeneChip WT cDNA Amplification Kit (Affymetrix). The single-stranded cDNA was converted to double-stranded cDNA using DNA polymerase I, DNA ligase, and RNase H from Escherichia coli. The double-stranded cDNA was subsequently purified using a cleanup kit from Affymetrix and converted to cRNA, fragmented and end-labeled using the Affymetrix GeneChip Cleanup Column Kit and the Affymetrix WT Sense Terminal Labeling kit. The labeled cRNA was hybridized to Affymetrix Human Exon 1.0 ST arrays. After hybridization, the chip was washed and stained with streptavidin-phycoerythrin before being scanned. An antibody amplification staining protocol that uses biotinylated goat IgG followed by a second streptavidin-phycoerythrin staining increases the sensitivity of the assay. The chip was then scanned, and images were analyzed qualitatively using Affymetrix GeneChip Operating System software and summarized as Affymetrix Binary CEL files.

3.3.3 Agilent Human Whole Genome G4112F Microarray Protocol

RNA was quantified with a NanoDrop ND-1000, followed by quality assessment with a 2100 Bioanalyzer (Agilent Technologies) according to the manufacturer's protocol. Sample amplification and labeling procedures were carried out with the Low RNA Input Fluorescent Linear Amplification Kit (Agilent Technologies) with minor modifications. Briefly, 0.4 μg of total RNA was reverse-transcribed into cDNA by MMLV-RT using an oligo dT primer (System Biosciences) that incorporates a T7 promoter sequence. The cDNA was then used as a template for in vitro transcription in the presence of T7 RNA polymerase and Cyanine-3 labeled CTP. Labeled cRNA was purified using the RNeasy mini kit (Qiagen). RNA spike-in controls (Agilent Technologies) were added to RNA samples before amplification and labeling according to the manufacturer's protocol. 0.825 μg of each Cy3-labeled sample was used for hybridization on the Agilent 4x44K whole human genome microarray (G4112F) at 65°C for 17 hours in a hybridization oven with rotation. After hybridization, microarrays were washed and dried according to the manufacturer's protocols. Microarrays were scanned with an Agilent Scanner using Agilent Scan Control software, and data were extracted with Agilent Feature Extraction 10.5 software.

3.3.4 Affymetrix Human Exon 1.0 ST Analysis

CEL files for three HuR IP samples and three corresponding input samples were processed using the apt-probeset-summarize binary provided by Affymetrix. One qualitative (DABG) and two quantitative (RMA, PLIER) measures of expression were derived for each gene-level metaprobeset on the chips using the Affymetrix Power Tools apt-probeset-summarize. The DABG signal values were converted to a negative log10 scale for expedient parsing. The RMA summary values were generated using RMA background correction (rma-bg), followed by G/C content based background correction of perfect-match probe values (gc-bg), followed by median polish summarization. The PLIER summary values were generated using G/C content based background correction of perfect-match probe values followed by PLIER summarization. No normalization was performed across chips. The apt-probeset-summarize command line used to generate the summary values is given below:

    apt-probeset-summarize -a rma -a quant-norm,pm-gcbg,plier -a dabg.negloG10=true \
        --precision 10 \
        -p lib/HuEx-1_0-st-v2.r2.pgf \
        -c lib/HuEx-1_0-st-v2.r2.clf \
        -b lib/HuEx-1_0-st-v2.r2.antigenomic.bgp \
        -m lib/HuEx-1_0-st-v2.r2.dt1.hg18.full.mps \
        -o gene/ --cel-files cel_list

The output files for each of the three measures were imported into an R session for further analysis. A DABG p-value cutoff of 0.001 was used for calling present probesets across the HuR IP chips. For the probesets present in all HuR IP replicates, a t-test for overexpression in IP compared to input was performed. The subset of genes both present in all IP replicates and overexpressed in IP with a p-value of 0.05 or less across both RMA and PLIER methods were called HuR targets. Of the 22011 core transcripts, 17881 are mapped to the hg18 (NCBI b36) human genome build using annotations from Affymetrix. 2461 HuR targets are mapped this way. The sets of HuR targets from this experiment are denoted elsewhere as HuEx.GM12878 and HuEx.K562.
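The target-calling rule in this section (present in every IP replicate by DABG, and enriched by t-test on both RMA and PLIER summaries) can be sketched as a simple filter. The analysis itself was run in R; the Python below is only an illustration, and the `ProbesetStats` record, its field names, and the assumption that the t-test p-values have already been computed are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbesetStats:
    """Hypothetical per-probeset summary used only for this sketch."""
    probeset_id: str
    dabg_pvalues: List[float]  # one DABG p-value per HuR IP replicate
    rma_ttest_p: float         # one-sided t-test p-value on RMA values
    plier_ttest_p: float       # one-sided t-test p-value on PLIER values


def from_neg_log10(value: float) -> float:
    """Convert a DABG score stored as -log10(p) back to a p-value."""
    return 10 ** -value


def call_targets(stats: List[ProbesetStats],
                 dabg_cutoff: float = 0.001,
                 ttest_cutoff: float = 0.05) -> List[str]:
    """Return IDs of probesets that pass both the presence and enrichment filters."""
    targets = []
    for s in stats:
        # Present: below the DABG cutoff in every IP replicate.
        present = all(p < dabg_cutoff for p in s.dabg_pvalues)
        # Enriched: p-value of 0.05 or less under both summarization methods.
        enriched = (s.rma_ttest_p <= ttest_cutoff
                    and s.plier_ttest_p <= ttest_cutoff)
        if present and enriched:
            targets.append(s.probeset_id)
    return targets
```

Requiring agreement between RMA and PLIER makes the call conservative: a probeset that looks enriched under only one summarization method is dropped.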

3.3.5 Agilent Human Whole Genome G4112F Analysis

Agilent Feature Extraction v 10.5 (Agilent Technologies 2009) result files for three HuR IP samples and three corresponding input samples were imported into an R session. Background-corrected signal values as well as isWellAboveBackground flags were pulled for each probeset from each chip. The signal values were log2 transformed and a t-test for overexpression in IP samples over input samples was performed. Probesets marked as above background (present) in all IP samples and showing greater expression in IP with a p-value of 0.05 or less were called HuR targets. The ~45000 probesets are mapped to 21423 RefSeq mRNA loci in the hg18 human genome build, leaving 2459 mapped HuR target mRNAs. The set of HuR targets from this experiment is denoted elsewhere simply as Agilent.

3.3.6 Correlation Analysis of HuR-IPs between Agilent and Affymetrix

The LIA tool for locus intersection analysis (Zaleski et al., unpub. data) was used to intersect (75% overlap, strand permissive) mRNA loci corresponding to probesets on the two platforms. The table of pairwise intersections between loci from the two sets was extracted. Comparisons of signals and fold-changes across platforms were made by indexing the sets with corresponding entries from the intersection table.
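As a rough illustration of the indexing step, the sketch below pairs per-probeset values from two platforms through an intersection table and computes an average log2 fold change. It is not LIA output handling: the table is assumed to be a plain list of (HuEx ID, Agilent ID) pairs, and all function and variable names are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_log2_fold_change(ip_signals: List[float],
                          reference_signals: List[float]) -> float:
    """Average log2 IP signal minus average log2 reference signal."""
    return (mean(math.log2(v) for v in ip_signals)
            - mean(math.log2(v) for v in reference_signals))


def paired_values(intersections: Iterable[Tuple[str, str]],
                  values_a: Dict[str, float],
                  values_b: Dict[str, float]) -> Tuple[List[float], List[float]]:
    """Collect matched values for every intersecting probeset pair.

    Pairs whose IDs are missing from either value table are skipped, so
    the two returned lists always have equal length and matching order.
    """
    xs, ys = [], []
    for id_a, id_b in intersections:
        if id_a in values_a and id_b in values_b:
            xs.append(values_a[id_a])
            ys.append(values_b[id_b])
    return xs, ys
```

The two returned lists can be passed directly to a rank correlation or plotted against each other. A probeset that intersects several loci on the other platform contributes one pair per intersection.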

3.3.7 Cataloging Known HuR Targets

A PubMed literature search was conducted for the terms “HuR” and “ELAVL1” and the results were filtered for papers that showed evidence of mRNAs associating with HuR protein. Immunoprecipitation followed by specific RT-PCR for the mRNA of interest was considered the minimum level of evidence for association. The 46 HuR targets collected were then annotated with their official gene symbols, ENSEMBL gene identifiers, and corresponding genomic coordinates on the human NCBI b36 (hg18) genome and are provided in Table 6.

3.3.8 HuR Targets in RKO Cells

HuR-associated mRNAs profiled by RIP-Chip in human colorectal carcinoma RKO cells using contemporaneous MGC arrays, as described by de Silanes (de Silanes et al. 2004), were retrieved from the web link provided. As described in the publication, the 1121 mRNAs with a z-score difference between HuR and IgG IP samples of 1.0 or greater were taken to be HuR targets. GenBank mRNA identifiers from the MGC were used to assign hg18 genomic coordinates from the UCSC Genome Browser to 9257 of 9600 probesets on the chip. The set of HuR targets from this experiment is denoted elsewhere as MGC.RKO.

3.3.9 HuR Targets in HeLa Cells

HuR-associated mRNAs profiled by RIP-Chip in human cervical cancer HeLa cells using contemporaneous MGC arrays, as described by Lal (Lal et al. 2004), were retrieved from the Gene Expression Omnibus (GSE1361). The analysis was recapitulated as described by importing the series matrix file into R, normalizing each column by Z-score transformation, and selecting the 1294 mRNAs with a z-score difference between HuR-IP and IgG-IP of 1.0 or greater. GenBank mRNA identifiers from the MGC were used to assign hg18 genomic coordinates from the UCSC Genome Browser to 9257 of 9600 probesets on the chip. The set of HuR targets from this experiment is denoted elsewhere as MGC.HeLa.
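The Z-score selection described for the MGC array data can be sketched as follows. The original analysis was done in R on the GEO series matrix; this Python version is only a minimal illustration, and the use of the sample standard deviation, the function names, and the in-memory list inputs are assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscores(column: Sequence[float]) -> List[float]:
    """Z-score transform one array column (sample standard deviation assumed)."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]


def zscore_targets(ids: Sequence[str],
                   hur_ip: Sequence[float],
                   igg_ip: Sequence[float],
                   min_diff: float = 1.0) -> List[str]:
    """Return IDs whose HuR-IP z-score exceeds the IgG-IP z-score by min_diff or more."""
    z_hur, z_igg = zscores(hur_ip), zscores(igg_ip)
    return [i for i, zh, zi in zip(ids, z_hur, z_igg) if zh - zi >= min_diff]
```

Because each column is standardized independently, the difference compares where an mRNA sits within each array's own distribution rather than comparing raw intensities.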

3.3.10 HuR Targets in MCF7 Cells

HuR-associated mRNAs profiled by RIP-Chip in mammary epithelial adenocarcinoma MCF7 cells stably transfected with pLXSN empty vector (baseline) or pLXSN-MCT-1-V5-histidine (MCT-1 overexpressing) using Illumina HumanRef-8 v2 Expression BeadChips were retrieved from the web link provided (Mazan-Mamczarz et al. 2008). The 9409 mRNAs marked as being HuR-associated in either cell line (the specific method was not provided) were taken to be HuR targets. RefSeq identifiers provided by Illumina were used in conjunction with the UCSC Genome Browser to assign hg18 genomic coordinates to 19911 of the probesets on the chip. The set of HuR targets from this experiment is denoted elsewhere as Ilmn.MCF7.


3.3.11 Comparison of HuR Targets Across Platforms

Comparison of HuR mRNA targets was performed using the LIA tool. Two loci were considered to intersect if they overlapped by at least 75% of the smaller locus (LIA: percent-overlap >= 0.5). Strand-specificity was only considered if available for both loci of interest (LIA: comparison strand - permissive). Intersections were taken between each pair of HuR target locus sets, A and B, and an intersection-coefficient was defined based on the sizes of A and B and the numbers of intersecting transcripts as follows:

\[
\text{Intersection-Coefficient} = \frac{\dfrac{\text{size}(A) \cdot \text{size}(A \text{ in } B)}{\text{size}(B)} + \dfrac{\text{size}(B) \cdot \text{size}(B \text{ in } A)}{\text{size}(A)}}{\text{size}(A) + \text{size}(B)}
\]
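A direct transcription of the intersection-coefficient into Python may make the arithmetic easier to follow. This is a sketch of the formula as printed in this section, with the numerator and denominator read from its two-line layout; the function and argument names are hypothetical.

```python
def intersection_coefficient(size_a: int, size_b: int,
                             a_in_b: int, b_in_a: int) -> float:
    """Intersection-coefficient for two locus sets A and B.

    size_a, size_b: number of loci in A and in B.
    a_in_b: number of loci in A that intersect a locus in B.
    b_in_a: number of loci in B that intersect a locus in A.
    """
    if size_a <= 0 or size_b <= 0:
        raise ValueError("both sets must contain at least one locus")
    numerator = (size_a * a_in_b / size_b) + (size_b * b_in_a / size_a)
    return numerator / (size_a + size_b)
```

For two identical sets of 100 loci the coefficient is 1.0, and for two sets with no intersecting loci it is 0.0.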

3.3.12 Comparison of HuR Targets to Likely Cis-Regulatory Elements

Sets of AU-rich element instances potentially associated with HuR targets were extracted from the ARE Database (Bakheet et al. 2006) and UTRdb (Mignone et al. 2005) and were annotated with their genomic coordinates. Locus intersections of the HuR target sets with each of these element sets (100% overlap, strand-permissive) were taken. To test if the HuR target sets were enriched in particular motifs, the number of loci in a HuR target set intersecting with a particular annotation term was taken and compared to the number of loci in the source platform intersecting with the same annotation term using Fisher's Exact Test for significant associations (Fisher 1922). Next, to test if any of the known or predicted binding motifs for HuR showed association with the RIP-Chip derived HuR targets, the fraction of each motif-containing locus set intersecting with each HuR target set was computed. This was divided by the fraction of the platform represented by a given set of HuR targets to obtain a partition score reflecting the likelihood that a particular cis-regulatory motif was more strongly associated with HuR targets than with probed elements on the profiling platform in general.
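The two statistics in this section can be sketched with the Python standard library. This is an illustration, not the analysis code: the one-sided (enrichment) form of Fisher's Exact Test is an assumption, since the text does not state whether a one- or two-sided test was used, and all function and argument names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact p-value, P(X >= k).

    N: loci on the platform; K: platform loci carrying the annotation;
    n: loci in the HuR target set; k: target loci carrying the annotation.
    X follows a hypergeometric distribution with parameters (N, K, n).
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def partition_score(motif_in_targets: int, motif_total: int,
                    n_targets: int, n_platform: int) -> float:
    """Fraction of motif loci that fall in the target set, divided by the
    fraction of the platform that the target set represents."""
    return (motif_in_targets / motif_total) / (n_targets / n_platform)
```

A partition score above 1 indicates that motif-containing loci land in the target set more often than the target set's share of the platform would suggest; a score near 1 indicates no preference.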


3.4 Results

3.4.1 Samples

To analyse targets specifically bound to HuR, we ran three replicates of the RIP-Chip protocol against HuR and two replicates against G10 to measure background hybridization. Three input samples were run in parallel for the HuR IPs. This set was replicated for both the GM12878 and K562 cell lines on the Affymetrix Human Exon 1.0 ST platform and for just the K562 cell line on the Agilent platform. Total samples from the ENCODE cell-line phenotyping study from GEO (GSE12760) were also included and used for comparison. A full list of samples and platforms used is provided in Table 2.

3.4.2 Clustering

To look at replicate clustering of the samples (Table 2) within a given platform, correlations were taken between all pairs of samples and projected as distances in a 2D space (Figure 21) using a multidimensional scaling approach. The correlation values and MDS projections revealed that the input samples on the Affymetrix platform had low cross-replicate correlation as well as low correlations to the total samples. The input samples (from both cell lines, for analysis consistency) were thus discarded from the analysis. The GM12878 total samples showed a distinctly different density profile on the DABG metric (data not shown) and so were excluded from further analysis. Since the inputs from both cell lines had already been discarded, this meant that all total comparisons would be made with the ENCODE phenotyping samples.

The G10 negative control samples showed close correlation to the total samples on the MDS profile, which would not normally be expected. In addition, only one replicate of G10 IP was available from the GM12878 line. This, combined with the fact that the two G10 IP replicates from the K562 line showed the lowest replicate correlation, led us to discard G10 IP from the analysis. The K562 samples on the Agilent platform showed similar incongruity between both the G10 IP replicates and the input replicates (Figure 22). All HuR IPs were therefore compared to the triplicates of K562 total available on the platform.
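The first half of the clustering procedure, pairwise Spearman correlations converted to distances, can be sketched as below. The conversion `1 - rho` is an assumption (the text only says correlations were projected as distances), the Sammon MDS projection itself is not implemented here, and all names are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def distance_matrix(samples: Dict[str, Sequence[float]]
                    ) -> Dict[Tuple[str, str], float]:
    """Pairwise distances between samples, defined here as 1 - Spearman rho."""
    names = list(samples)
    return {(a, b): 1.0 - spearman(samples[a], samples[b])
            for a in names for b in names}
```

The resulting matrix is the kind of input a multidimensional scaling routine would take to produce a two-dimensional layout.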


Table 2: Summary of Present Probesets on Arrays used for HuR RIP-Chip

Sample | Cell_Line | IP_Condition | Platform | Count | Fraction
G10_GM12878-1 | GM12878 | G10 | HuEx | 19255 | 0.875
HuR_GM12878-1 | GM12878 | HuR | HuEx | 19128 | 0.869
HuR_GM12878-2 | GM12878 | HuR | HuEx | 19097 | 0.868
HuR_GM12878-3 | GM12878 | HuR | HuEx | 19094 | 0.867
Input_GM12878-1 | GM12878 | Input | HuEx | 18958 | 0.861
Input_GM12878-2 | GM12878 | Input | HuEx | 18348 | 0.834
Input_GM12878-3 | GM12878 | Input | HuEx | 14781 | 0.672
Input_GM12878-4 | GM12878 | Input | HuEx | 14527 | 0.660
Input_GM12878-5 | GM12878 | Input | HuEx | 14444 | 0.656
Total_GM12878-1 | GM12878 | Total | HuEx | 14346 | 0.652
Total_GM12878-2 | GM12878 | Total | HuEx | 14188 | 0.645
Total_GM12878-3 | GM12878 | Total | HuEx | 14163 | 0.643
Total_GM12878-4 | GM12878 | Total | HuEx | 13710 | 0.623
Total_GM12878-5 | GM12878 | Total | HuEx | 12970 | 0.589
G10_K562-1 | K562 | G10 | HuEx | 12857 | 0.584
G10_K562-2 | K562 | G10 | HuEx | 12641 | 0.574
HuR_K562-1 | K562 | HuR | HuEx | 12467 | 0.566
HuR_K562-2 | K562 | HuR | HuEx | 11655 | 0.530
HuR_K562-3 | K562 | HuR | HuEx | 11224 | 0.510
Input_K562-1 | K562 | Input | HuEx | 11114 | 0.505
Input_K562-2 | K562 | Input | HuEx | 11057 | 0.502
Input_K562-3 | K562 | Input | HuEx | 10518 | 0.478
Input_K562-4 | K562 | Input | HuEx | 10502 | 0.477
Total_K562-1 | K562 | Total | HuEx | 10449 | 0.475
Total_K562-2 | K562 | Total | HuEx | 10111 | 0.459
Total_K562-3 | K562 | Total | HuEx | 9680 | 0.440
Total_K562-4 | K562 | Total | HuEx | 9343 | 0.424
G10_K562-1 | K562 | G10 | Agil | 35426 | 0.787
G10_K562-2 | K562 | G10 | Agil | 36084 | 0.802
HuR_K562-1 | K562 | HuR | Agil | 39580 | 0.879
HuR_K562-2 | K562 | HuR | Agil | 37595 | 0.835
HuR_K562-3 | K562 | HuR | Agil | 39632 | 0.880
Input_K562-1 | K562 | Input | Agil | 37662 | 0.837
Input_K562-2 | K562 | Input | Agil | 9854 | 0.219
Total_K562-1 | K562 | Total | Agil | 37250 | 0.828
Total_K562-2 | K562 | Total | Agil | 36888 | 0.819
Total_K562-3 | K562 | Total | Agil | 34823 | 0.77

Figure 21: Clustering of replicate samples on the Affymetrix Human Exon 1.0 ST platform. Spearman correlation values between pairs of samples run on the Affymetrix Human Exon 1.0 ST platform were projected into a two-dimensional space using Sammon multidimensional scaling (MDS). The plot shows a gross separation between samples from different cell lines (marked by the dark black line) and the similarity between replicate samples (labeled in red).


Figure 22: Clustering of replicate samples on the Agilent Human Whole Genome platform. Spearman correlation values between pairs of samples run on the Agilent Human Whole Genome platform were projected into a two-dimensional space using Sammon multidimensional scaling (MDS). The plot shows clustering of the sets of replicate samples.


3.4.3 HuR IP Targets

Subsequent to the filtering steps mentioned above, we compared the HuR IP samples to the corresponding total samples in both cell lines and platforms. For the HuEx platform, we noted probesets that were present across all HuR IP samples (DABG p-value < 0.001), and from these we picked out those that were significantly enriched in the HuR IP over the total (uncorrected one-sided t-test p-value < 0.05) and had greater average expression in HuR IP than in total according to both the RMA and PLIER expression measures. We found 5152 transcripts matching these criteria in the GM12878 cell line and 3808 in the K562 cell line (Table 2). For the Agilent platform, we performed a similar analysis, picking out probesets present across all HuR IP samples (SignalWellAboveBG = 1), significantly enriched in HuR IP over total (uncorrected one-sided t-test p-value < 0.05), and with higher average expression in HuR IP than in total. We found 2459 transcripts matching our threshold in this set (Table 2).

3.4.4 Cross-Platform Correlations between Totals, HuR, and HuR.FC

To compare gene-expression scores for the K562 samples across the two platforms, we took available genomic mappings for both platforms and used the Locus Intersection Analysis tool (LIA) to find pairs of probesets from each chip whose mapped transcripts overlapped by 50% or more of the length of the shorter transcript. The 39035 pairs of overlapping probesets were used to measure the Spearman rank correlation between the HuEx PLIER expression values and log2 transformed Agilent signal values for total and HuR IP samples (Table 3). We found the highest correlations between totals from the two platforms (average of approximately 0.47). We also found that HuEx HuR IPs correlated as well with Agilent totals as with Agilent HuR IPs and that Agilent HuR IPs actually correlated better with HuEx totals than with HuEx HuR IPs. We profiled 5152 mapped transcripts that met our criteria in the GM12878 cell line and 3808 mapped HuR IP transcripts in the K562 cell line.

Next, we took average log2 fold changes between HuR IP and total from the two platforms to see if this more platform-independent relative score would correlate between platforms. In fact, we found that the Spearman correlation between the fold-changes across platforms was even weaker, with a value of approximately 0.22. We plotted the fold-change values against each other (Figure 23) to confirm that there was no correlation bias across fold-change value ranges (i.e., that higher fold-change genes were not better correlated). We found 2459 mapped HuR IP targets meeting these criteria in the K562 cell line.


Table 3: Correlations Between HuR-IP and Total Samples Across Platforms

(Rows: HuEx samples; columns: Agilent samples.)

HuEx vs Agilent | HuR-1 | HuR-2 | HuR-3 | Total-1 | Total-2 | Total-3
HuR-1 | 0.361 | 0.297 | 0.27 | 0.335 | 0.330 | 0.327
HuR-2 | 0.381 | 0.308 | 0.271 | 0.354 | 0.349 | 0.345
HuR-3 | 0.376 | 0.303 | 0.269 | 0.345 | 0.339 | 0.335
Total-1 | 0.397 | 0.323 | 0.286 | 0.463 | 0.458 | 0.454
Total-2 | 0.402 | 0.331 | 0.294 | 0.474 | 0.468 | 0.464
Total-3 | 0.404 | 0.331 | 0.293 | 0.475 | 0.470 | 0.466
Total-4 | 0.391 | 0.317 | 0.28 | 0.452 | 0.447 | 0.44


Figure 23: Correlations between Affymetrix and Agilent HuR IP mRNA fold changes over total

The fold-change correlation between each pair of mRNAs from the K562 cell-line on the HuEx and Agilent platforms showing at least 75% base-pair overlap is shown as an X-Y plot. The scores for both platforms are given as average log2 fold changes of HuR IP samples over corresponding input samples. Affymetrix scores are plotted on the X axis and Agilent scores are shown on the Y axis. The overall Spearman correlation between the two sets was found to be approximately 0.22.

[Plot axes: X, HuEx IP Fold-Change (log2); Y, Agil IP Fold-Change (log2)]

3.4.5 Intersections of HuR Target Sets

To compare the HuR IP targets across cell lines and across platforms, we used the LIA algorithm to intersect the target sets. We considered two transcripts to overlap if the shorter of the two overlapped the other by 75% or more in a permissive strand-specific manner (strand is considered only if both transcripts are stranded). Looking through the literature for other reports of HuR RIP-Chip experiments, we found two experiments by the Gorospe group: one by deSilanes et al. that found 1142 transcripts in RKO cells and the other by Lal et al. that found 1294 targets in HeLa cells. In addition, we found a recent report of HuR RIP-Chip profiling by Mazan-Mamczarz et al. from which we culled the 9409 transcripts associated with HuR in either normal or TSP-1 overexpressing MCF7 cells according to Illumina Human Whole Genome Arrays. Finally, we also included in the intersection analysis the set of genes known to be associated with HuR (with the minimum evidence being RT-PCR after HuR-IP) from a PubMed literature search. A description of each of these sets can be found in Table 4.
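The overlap rule described above can be sketched as follows. This is an illustrative simplification rather than the LIA implementation: each transcript is treated as a single half-open genomic interval (exon structure is ignored), and the `Locus` record and `overlaps` function are hypothetical names introduced here.

```python
from collections import namedtuple

# Half-open interval [start, end); strand is '+', '-' or None when unknown.
Locus = namedtuple("Locus", "chrom start end strand")


def overlaps(a, b, min_fraction=0.75):
    """True if the shorter locus is covered by at least min_fraction of its length."""
    if a.chrom != b.chrom:
        return False
    # Permissive strand rule: strands are compared only when both are known.
    if a.strand is not None and b.strand is not None and a.strand != b.strand:
        return False
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return False
    shorter = min(a.end - a.start, b.end - b.start)
    return shared / shorter >= min_fraction


short_plus = Locus("chr1", 100, 200, "+")
long_plus = Locus("chr1", 120, 400, "+")
long_minus = Locus("chr1", 120, 400, "-")
long_unstranded = Locus("chr1", 120, 400, None)

same_strand = overlaps(short_plus, long_plus)        # 80 of 100 bases shared
opposite_strand = overlaps(short_plus, long_minus)   # rejected on strand
unstranded = overlaps(short_plus, long_unstranded)   # strand ignored
```

Measuring the shared length against the shorter locus keeps a short transcript nested inside a long one from being rejected simply because it covers little of the longer one.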


Table 4: Summary of Available HuR RIP-Chip Datasets

Name           Cell-Line                 No. of Targets   Antibody              Profiling Platform             Baseline   Differential
HuEx.GM12878   GM12878                   5152             Santa Cruz SC_21027   Affymetrix Human Exon 1.0 ST   Total      fold-change & p-value
HuEx.K562      K562                      3808             Santa Cruz SC_21027   Affymetrix Human Exon 1.0 ST   Total      fold-change & p-value
Agil.K562      K562                      2459             Santa Cruz SC_21027   Agilent HWG G4112F             Total      fold-change & p-value
MGC.RKO        RKO                       1142             Santa Cruz SC_5261    NIA MGC Human 9.6K             IgG1 IP    Z-score differential
MGC.HeLa       HeLa                      1294             Santa Cruz unknown    NIA MGC Human 9.6K             IgG1 IP    Z-score differential
Ilmn.MCF7      MCF7 Vector, MCF7 MCT1+   9409             Santa Cruz unknown    Illumina HumanRef-8 v2         none       detection calls only

Since overlapping transcripts are present in the data sets, the intersection relationship is not commutative (the number of transcripts in A intersecting with B is not the same as the number in B intersecting with A). To build a summary relationship between any pair of sets A and B, we built a score based on the sizes of A and B and the numbers of intersecting transcripts as follows:

\[
\text{Intersection-Coefficient}(A, B) = \frac{\dfrac{\mathrm{size}(A)\,\mathrm{size}(A \text{ in } B)}{\mathrm{size}(B)} + \dfrac{\mathrm{size}(B)\,\mathrm{size}(B \text{ in } A)}{\mathrm{size}(A)}}{\mathrm{size}(A) + \mathrm{size}(B)}
\]
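The intersection coefficient can be written as a small function. The sketch below takes the four counts as plain numbers; the function name and argument names are illustrative, and the example counts are hypothetical.

```python
def intersection_coefficient(size_a, size_b, a_in_b, b_in_a):
    """Symmetric summary of two directional intersection counts.

    size_a, size_b: number of transcripts in sets A and B
    a_in_b: number of transcripts in A that intersect B
    b_in_a: number of transcripts in B that intersect A
    """
    if size_a <= 0 or size_b <= 0:
        raise ValueError("both sets must be non-empty")
    weighted = size_a * a_in_b / size_b + size_b * b_in_a / size_a
    return weighted / (size_a + size_b)


# Two identical sets score 1.0; two disjoint sets score 0.0.
identical = intersection_coefficient(100, 100, 100, 100)
disjoint = intersection_coefficient(100, 250, 0, 0)

# Hypothetical counts: swapping the roles of A and B leaves the score unchanged.
forward = intersection_coefficient(200, 50, 20, 10)
reverse = intersection_coefficient(50, 200, 10, 20)
```

Although the two directional counts differ, the coefficient itself is symmetric in A and B, which is what allows it to be turned into a distance between sets.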

To better visualize the intersection-coefficient relationships between the HuR-associated transcript sets (Table 5), we transformed the intersection coefficient scores into a distance measure (according to the formula (1-coeff)/2). We took the intersection distance score matrix and used multidimensional scaling with the Sammon method to project the relationships into two dimensions. We plotted the sets on their MDS dimensions, showing for each set a circle corresponding to its relative size (Figure 24). Targets from the two HuEx platform based profiling experiments were tightly clustered together. All of the sets were distributed approximately equidistantly in a circle around the Ilmn.MCF7 set, which calls 9409 of 19911 mappable targets on the Illumina platform positive for HuR association. The Agil.K562, MGC.HeLa, and Ilmn.MCF7 sets were closest to the set of known targets, but no set intersected with more than two-thirds of it. The farthest sets from the known targets by this metric were both of the HuEx sets and the MGC.RKO set.

To find sets of transcripts shared between analysis methodologies, we intersected the three sets analysed using the fold-change metric (the GM12878 and K562 cell lines) and found 721 targets in common across the cell lines on the HuEx and Agilent platforms. The same intersection in the sets analyzed using a z-score differential contained 256 targets common to MGC.RKO, MGC.HeLa, and Ilmn.MCF7. When all sets of HuR targets were intersected, the target set shrank to 20 genes (Table 6). A cursory GO analysis (data not shown) revealed no common theme among these 20 genes.
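The (1-coeff)/2 distance transform described above can be sketched for a few of the Table 5 coefficients. The Sammon projection itself is not reproduced here; the function below only builds the symmetric distance matrix that such an MDS routine would take as input, and its name and data layout are illustrative assumptions.

```python
# Three of the pairwise intersection coefficients reported in Table 5.
labels = ["Known", "HuEx.GM12878", "HuEx.K562"]
coefficients = {
    ("Known", "HuEx.GM12878"): 0.324,
    ("Known", "HuEx.K562"): 0.246,
    ("HuEx.GM12878", "HuEx.K562"): 0.817,
}


def distance_matrix(labels, coefficients):
    """Convert upper-triangle coefficients into a full symmetric distance matrix."""
    n = len(labels)
    dist = [[0.0] * n for _ in range(n)]  # diagonal: coefficient 1.0 gives distance 0
    for i in range(n):
        for j in range(i + 1, n):
            coeff = coefficients[(labels[i], labels[j])]
            d = (1.0 - coeff) / 2.0
            dist[i][j] = d
            dist[j][i] = d
    return dist


dist = distance_matrix(labels, coefficients)
```

Under this transform the two HuEx sets lie much closer to each other than either does to the known targets, in line with the tight clustering described in the text.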


Table 5: Intersection Coefficients for HuR-IP Sets Across Studies

               Known   HuEx.GM12878   HuEx.K562   Agil.K562   MGC.RKO   MGC.HeLa   Ilmn.MCF7
Known          1.000   0.324          0.246       0.290       0.178     0.492      0.497
HuEx.GM12878           1.000          0.817       0.319       0.268     0.290      0.602
HuEx.K562                             1.000       0.273       0.201     0.220      0.627
Agil.K562                                         1.000       0.109     0.139      0.636
MGC.RKO                                                       1.000     0.252      0.593
MGC.HeLa                                                                1.000      0.581
Ilmn.MCF7                                                                          1.000

Figure 24: Similarities between HuR target sets as determined by cross-intersection score

For each permutation of pairs of HuR target sets A and B, the direct intersection of A with B and the intersection of A with the platform of B is measured and the fraction of A in B is computed. The similarity between any two sets A and B is then computed by the formula (size(A)*size(A_in_B)/size(B) + size(B)*size(B_in_A)/size(A))/(size(B) + size(A)). The similarities are projected into two dimensions by Sammon MDS and plotted. The sizes of the circles correspond to the set size.


Table 6: Genes Associated with HuR Across All Studies

Symbol        RefSeq         Gene Name
SMAD2         NM_001003652   SMAD, MOTHERS AGAINST DPP HOMOLOG 2 (DROSOPHILA)
RP1-21O18.1   NM_001018000   KAZRIN
CALM3         NM_001743      CALMODULIN 1 (PHOSPHORYLASE KINASE, DELTA)
KPNA4         NM_002268      KARYOPHERIN ALPHA 4 (IMPORTIN ALPHA 3)
PPP4C         NM_002720      PROTEIN PHOSPHATASE 4 (FORMERLY X), CATALYTIC SUBUNIT
COX7A2L       NM_004718      CYTOCHROME C OXIDASE SUBUNIT VIIA POLYPEPTIDE 2 LIKE
PSMD5         NM_005047      PROTEASOME (PROSOME, MACROPAIN) 26S SUBUNIT, NON-ATPASE, 5
STX6          NM_005819      SYNTAXIN 6
AGPAT1        NM_006411      1-ACYLGLYCEROL-3-PHOSPHATE O-ACYLTRANSFERASE 1 (LYSOPHOSPHATIDIC ACID ACYLTRANSFERASE, ALPHA)
METTL7A       NM_014033      METHYLTRANSFERASE LIKE 7A
NFYC          NM_014223      NUCLEAR TRANSCRIPTION FACTOR Y, GAMMA
TMEM97        NM_014573      TRANSMEMBRANE PROTEIN 97
KIAA0152      NM_014730      KIAA0152
SCOTIN        NM_016479      SCOTIN
SIKE          NM_025073      SUPPRESSOR OF IKK EPSILON
PPP1R15B      NM_032833      PROTEIN PHOSPHATASE 1, REGULATORY (INHIBITOR) SUBUNIT 15B
C18ORF45      NM_032933      CHROMOSOME 18 OPEN READING FRAME 45
HNRPUL1       NM_144732      HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN U-LIKE 1
IFT20         NM_174887      INTRAFLAGELLAR TRANSPORT 20 HOMOLOG (CHLAMYDOMONAS)
TUBB          NM_178014      TUBULIN, BETA

3.4.6 HuR Target Associations with Cis-Regulatory Elements

The mRNA cis-regulatory element most commonly thought to be associated with HuR targets and to mediate HuR-mRNA binding is the AU-rich element (ARE). We pulled ARE elements from the ARE Database (Bakheet et al. 2006) and associated each with its genomic coordinates. We then manually searched across genomic 3' UTR regions (as annotated by KnownGene (Hsu et al. 2006)) for instances of the ARE Pentamer (AUUUA), the ARE Nonamer (UUAUUUAUU), and the ARE2 motif from UTRdb (Mignone et al. 2005). We also added the known HuR targets as a comparison set to the analysis so that the enrichment of each of these motifs could be compared to the enrichment of known targets in a given set.

Intersecting the sets of HuR mRNA targets (Known, HuEx.GM12878, HuEx.K562, Agil.K562, MGC.RKO, MGC.HeLa, and Ilmn.MCF7) with each of the above mentioned sets revealed that all of the sets were frequently (25 to 50%) and at least marginally significantly (hypergeometric test comparison to frequency in all ENSEMBL (Birney et al. 2004) genes, p < 0.05) associated with ARE Pentamers and the UTRdb ARE2 element (Figure 25). Notably, the two HuEx experiments and the Agil.K562 experiment showed the greatest enrichment in ARE targets, with both HuEx target sets showing >80% association to ARE Pentamers and >55% association to ARE2 elements. Many of the sets also showed significant enrichment in the other cis-regulatory elements, but the frequencies of association were for the most part below 10%. No set distinguishes itself thoroughly from the others on this metric, but AREs are enriched in all.
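The hypergeometric enrichment test mentioned above can be sketched with exact integer arithmetic. This is a minimal illustration, not the code used in the study: the function name and argument names are assumptions, and the example numbers are a toy case chosen so the result can be checked by hand.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Upper-tail probability P(X >= hits) for the hypergeometric distribution.

    population: all loci on the platform
    annotated:  loci on the platform carrying the element (e.g. an ARE pentamer)
    sample:     loci in the target set
    hits:       loci in the target set carrying the element
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # math.comb returns 0 when k > n, so impossible draws contribute nothing.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy case: 10 loci, 4 annotated, 3 drawn, at least 2 annotated among them.
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40 / 120
p_value = hypergeom_enrichment_p(population=10, annotated=4, sample=3, hits=2)
```

The population argument is what makes the test platform-aware: an element that is common among all profiled loci needs a correspondingly higher frequency in the target set before it counts as enriched.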


Figure 25: Associations by Platform of HuR Targets with HuR-Relevant Annotations

Annotations with known or predicted relevance to HuR, including known HuR targets and instances of various AU-Rich Elements, are compared across platforms to the HuR target sets. Locus-sets for the Known target set and for each of the ARE motifs were intersected at the base-pair level with the RIP-Chip sets. The number of Known targets overlapping by 75% or more and the number of each ARE subtype completely intersecting with the RIP-Chip sets were compiled as the frequencies of association. To assign significance values to these frequencies, the number of Known and ARE elements likewise intersecting with all loci from each RIP-Chip platform was taken and a hypergeometric test was used to derive a p-value for enrichment of the elements in each RIP-Chip target set. For each RIP-Chip target set, the frequency of association or support for each element is indicated by the size of the plotted circle and is given by the number in the center of each. The significance is converted to a -log10 scale and used as the value on the x-axis (more significant associations are further to the right on each plot).


3.4.7 Cis-Regulatory Element Partitioning by HuR Target Sets

To see if the low-frequency enrichment of many of these cis-regulatory elements was due to biases induced by small total numbers of the elements, we decided to analyse the reverse associations. Having tested for the relationship "set implies motif", we now tested for "motif implies set". For this we defined the partition score as the fraction of instances of a cis-regulatory element (rather than target transcripts, as in the previous section) intersecting with a HuR target set, divided by the fraction of the platform's profiled targets represented by the HuR target set. For example, if a HuR target set Y represented 10% of profiled targets on its platform and 10% of all instances of a motif X intersected with set Y, the partition score for X in Y would be 1.0 and the association would be considered random. We calculated this association score for each pair of HuR target set and motif mentioned in the previous section and plotted the scores for each target set as a series of barplots (Figure 26). We found greater than 2-fold enrichment of known targets in the Agilent, MGC.RKO and MGC.HeLa target sets.
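The partition score defined above reduces to a ratio of two fractions. The sketch below follows the worked example in the text; the function name and argument names are illustrative, the counts are hypothetical, and taking the motif denominator as all motif instances on the platform is an assumption about how "all instances" was counted.

```python
def partition_score(motif_in_set, motif_total, set_size, platform_size):
    """Fraction of motif instances falling in a target set, relative to
    the fraction of the platform that the target set represents."""
    motif_fraction = motif_in_set / motif_total
    set_fraction = set_size / platform_size
    return motif_fraction / set_fraction


# Target set Y covers 10% of the platform and captures 10% of motif X:
# the score is 1.0, i.e. no better than random partitioning.
random_like = partition_score(motif_in_set=50, motif_total=500,
                              set_size=2000, platform_size=20000)

# The same set capturing 20% of motif X would score 2.0 (2-fold enrichment).
enriched = partition_score(motif_in_set=100, motif_total=500,
                           set_size=2000, platform_size=20000)
```

Scores near 1.0 therefore indicate that a motif is distributed across the platform without regard to the target set, even when most members of the set carry the motif.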


Figure 26: Partitioning of HuR-Relevant Annotations with HuR-Targets

Annotations with known or predicted relevance to HuR, including known HuR targets and instances of various AU-Rich Elements, are compared across platforms to the HuR target sets. The number of annotation intersections with each target set is divided by the number of annotation intersections with the corresponding platform set, and this fraction is given as the partition score.


3.5 Discussion

3.5.1 High background poses a problem for RIP-Chip

In this study we found IP samples that showed greater total signal and a greater number of expressed mRNAs than their corresponding input samples, even when RT-PCR of known positive and negative controls showed a clean IP profile. Besides degradation of RNA in the input samples or high background in the IP reactions, there are three possible reasons for this. First, the amplification and cDNA creation steps of the microarray sample processing protocols could be inflating the levels of background mRNAs. Second, off-target binding of IP RNA to non-specific probes could, at high levels, produce a shift in the overall signal. Finally, normalization or scaling procedures in the image processing and signal calling steps can produce a spurious increase in overall signal if a given RBP's targets are non-randomly distributed on a chip. The analysis methods we apply are designed to mitigate these problems and derive a “best-evidence” ranking for all potential RBP targets.

In general, RIP-Chip experiments pose a variety of challenges that differentiate them from the majority of gene expression studies. In a typical gene-expression study, since all samples are of total mRNA, all chips in the study can be expected to have a similar number of expressed genes, or at least a large number of invariant genes unresponsive to the condition being modulated. Most analysis methods make assumptions about the overall distribution of expression on each chip. Some model this with a normal distribution (z-score transformation) while others use more complex models that integrate error estimates (PLIER or RMA on Affymetrix chips). These methods are optimized to reproducibly produce differential expression estimates between two similarly distributed sets of expression scores. In a RIP-Chip experiment the number of mRNA targets can vary greatly between RBPs, between cell-lines, and between treatments. In addition, the expression distribution for the RBP-associated mRNAs cannot be assumed to be normal (for instance, some RBPs may only bind highly expressed genes while others bind a more varied set).

Given that most microarray platforms are best suited for measuring differential expression, most RIP-Chip strategies employ either a negative control IP (such as with IgG) or a sample of total RNA as a baseline condition. The baseline is subtracted from the RBP-IP to measure enrichment of the RBP's mRNA targets. There are two possible outcomes when running a negative control IP with a protein that does not bind mRNAs. The first is that the negative IP brings down mRNAs showing generalized affinity for IgG and so brings down a random subset of mRNAs. The second is that the negative IP brings down not mRNAs that show any sort of specificity, but rather some small subset of total RNA that could not be removed by the clean-up steps. In the first case the negative IP should show little or no correlation to total mRNA, but in the second case it should be a representative subset of the total mRNA population.

For any particular protein not known to be mRNA-binding, there is no guarantee that there is not a set of naturally affine mRNAs. In this case, comparing RBP-IP to a negative IP can produce both false positives and false negatives. False positives arise where expression of a non-specific target in the RBP-IP is many fold-levels higher than its expression in the negative IP. False negatives arise where a high-level non-specific target in the negative IP is a real target of the RBP which does not show a high fold change (Table 7).

The case where the negative IP is similar to the total mRNA pool poses the same issues as when the total mRNA pool is used as the baseline for comparison, but for a few caveats. The amount of RNA extracted from the RBP-IP is usually much less than the total amount of mRNA extracted and often less than the optimal amount of RNA to be used for a microarray platform. In practice, the RBP-IPs are usually put through a two-cycle amplification step before the necessary amount of IP RNA is available for a microarray platform. If this IP is then compared directly to a total mRNA sample, there is no guarantee that the IP targets will have been sufficiently enriched in the amplification process to exceed the levels in the total. In fact, the higher the expression of a target mRNA in the total sample, the closer it is likely to be to saturating probes on the chip and the less likely it is that the IP will show enrichment.

This false negative error problem can be mitigated by mass-matching the IP and total samples before they are put through any amplification process. The RBP-IP should contain a smaller set of messages than the total profile, and so a message with high affinity for the RBP should be amplified many more times than a target in the more heterogeneous total population. Using the same amount of input RNA ensures that a direct fold-change comparison is valid here. In our case, the mass-matched input samples across both cell-lines and both expression platforms were much more inconsistent than corresponding sets of HuR IP and total mRNA, indicating potential RNA degradation in at least one of the input samples. This forced us to compare the HuR IPs directly to non-mass-matched total samples.


Table 7: Simulation of IP targets under different population assumptions & analysis methods

Feature  Total  Neg_Rep  Neg_Rep_Sig  Neg_Rep_Z  Neg_Rand  Neg_Rand_Sig  Neg_Rand_Z  Pos_IP  Pos_IP_sig  Pos_IP_Z  FC_v_tot  FC_v_rep  FC_v_rand  Z-diff-rep  Z-diff-rand
Gene_1   1000   200      1379.66      0.58       670.43    1618.42       0.76        4       165.01      -0.28     -2.6      -3.06     -3.29      -0.85       -1.04
Gene_2   900    81       558.76       -0.02      199.51    481.62        -0.08       92.43   3812.94     2.09      2.08      2.77      2.98       2.12        2.17
Gene_3   800    128      882.98       0.21       340.36    821.62        0.17        5.12    211.21      -0.25     -1.92     -2.06     -1.96      -0.46       -0.42
Gene_4   800    192      1324.47      0.53       417.72    1008.37       0.31        3.84    158.41      -0.28     -2.34     -3.06     -2.67      -0.82       -0.59
Gene_5   700    196      1352.07      0.55       43.65     105.37        -0.36       5.88    242.56      -0.23     -1.53     -2.48     1.2        -0.78       0.13
Gene_6   500    25       172.46       -0.3       313.5     756.79        0.12        1       41.25       -0.36     -3.6      -2.06     -4.2       -0.05       -0.48
Gene_7   300    9        62.08        -0.38      128.73    310.75        -0.21       0.09    3.71        -0.38     -6.34     -4.06     -6.39      0           -0.17
Gene_8   300    9        62.08        -0.38      17.15     41.41         -0.4        30.27   1248.7      0.43      2.06      4.33      4.91       0.81        0.83
Gene_9   300    36       248.34       -0.25      252.4     609.3         0.01        1.44    59.4        -0.34     -2.34     -2.06     -3.36      -0.1        -0.36
Gene_10  150    11.25    77.61        -0.37      5.75      13.88         -0.43       0.11    4.64        -0.38     -5.01     -4.06     -1.58      -0.01       0.04
Gene_11  100    3        20.69        -0.41      30.99     74.81         -0.38       0.15    6.19        -0.38     -4.01     -1.74     -3.6       0.04        0
Gene_12  100    5        34.49        -0.4       1.45      3.49          -0.43       0.2     8.25        -0.38     -3.6      -2.06     1.24       0.03        0.06
Gene_13  75     0.56     3.88         -0.43      23.67     57.14         -0.39       0.01    0.46        -0.38     -7.34     -3.06     -6.94      0.04        0.01
Gene_14  50     0.25     1.72         -0.43      41.6      100.42        -0.36       5.01    206.67      -0.25     2.05      6.9       1.04       0.18        0.11
Gene_15  40     0.8      5.52         -0.43      31.81     76.79         -0.38       0.03    1.32        -0.38     -4.92     -2.06     -5.86      0.04        0
Gene_16  40     0.64     4.41         -0.43      29.78     71.9          -0.38       0.03    1.32        -0.38     -4.92     -1.74     -5.77      0.04        0
Gene_17  20     0.2      1.38         -0.43      10.64     25.68         -0.42       0.01    0.25        -0.38     -6.34     -2.48     -6.7       0.05        0.03
Gene_18  10     0.05     0.34         -0.43      2.32      5.61          -0.43       0       0.06        -0.38     -7.34     -2.48     -6.5       0.05        0.05
Gene_19  5      0        0.02         -0.43      3.7       8.93          -0.43       0.5     20.63       -0.37     2.04      10.22     1.21       0.06        0.06
Gene_20  3      0        0.02         -0.43      0.29      0.71          -0.44       0       0           -0.38     -11.4     -4.06     -9.31      0.05        0.05
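The two comparison metrics in Table 7 can be sketched for a single row. The reading assumed here is that the FC columns are log2 ratios of the IP signal over the baseline signal and that the Z-diff columns are differences of the tabulated z-scores; the Gene_2 values are consistent with that reading to within rounding, but how the z-scores themselves were derived is not reconstructed. Function names are illustrative.

```python
from math import log2


def log2_fold_change(ip_signal, baseline_signal):
    """log2 ratio of IP signal over a baseline (total or negative IP) signal."""
    return log2(ip_signal / baseline_signal)


def z_differential(ip_z, baseline_z):
    """Difference between the z-score in the IP and in the baseline."""
    return ip_z - baseline_z


# Gene_2 row of Table 7 (a simulated true target).
gene_2 = {
    "Total": 900,
    "Neg_Rep_Sig": 558.76,
    "Neg_Rep_Z": -0.02,
    "Pos_IP_sig": 3812.94,
    "Pos_IP_Z": 2.09,
}

fc_v_tot = log2_fold_change(gene_2["Pos_IP_sig"], gene_2["Total"])
fc_v_rep = log2_fold_change(gene_2["Pos_IP_sig"], gene_2["Neg_Rep_Sig"])
z_diff_rep = z_differential(gene_2["Pos_IP_Z"], gene_2["Neg_Rep_Z"])
```

The fold change depends only on the two signals for that gene, whereas the z-score differential also depends on how every other gene on each chip is distributed, which is why the two metrics can rank the same target differently.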


3.5.2 Targets vary by platform more than by cell-line

In the cross-experiment comparison, the sets are spread out evenly with little clustering between IP sets. It is notable that samples from different cell-lines run on the same platform cluster more tightly than samples from the same cell-line run on different platforms. The difference between platforms in their ability to specifically detect subsets of mRNAs seems to supersede the biological differences between the mRNA subsets themselves in this case. Overall, the sparse overlaps between the different sets of HuR IP targets, especially those that are biologically identical but show large platform-based differences, indicate that each platform should be benchmarked independently to determine its false-positive and false-negative rates for expression analysis of mRNA subpopulations. We cannot, with the available information, conclude whether differences between the HuR target sets are biological or due to differences in array technology. A digital profiling method for RNA populations, such as high-throughput sequencing, may be better suited to identifying RBP-associated RNAs.

3.5.3 HuR targets may bind elements besides AREs

In the comparison of HuR target sets to sets of potential cis-regulatory elements (Figure 25), we found most significantly that AU-rich elements, specifically the ARE pentamer and the ARE2 element (from UTRdb), were frequently and significantly enriched in all of the sets. The converse, however, is not true, since the fraction of ARE elements associated with each set is close to random expectation and shows partition scores close to 1.0 (Figure 26). While most or all HuR targets contain an ARE, the presence of an ARE does not guarantee association with HuR. This result suggests that binding of HuR to its mRNA targets may be regulated in a combinatorial fashion. There may be specific cofactors associated with HuR that allow it to bind at a particular ARE, and other factors bound to mRNAs that prevent HuR from binding. HuR binding may not be regulated by the ARE alone, and another cis-regulatory element could provide the specificity for HuR binding.

Finally, this study demonstrates novel locus-based approaches to comparing result sets from different expression profiling platforms, be it to each other or to other sets of cis-regulatory elements. Rather than having to find cross references between the identifiers on each pair of platforms, we were able to simply annotate each platform directly with its genomic coordinates and use the coordinates to check across data sets for intersection and identity. The locus-based approach also allowed us to homogeneously compare each set of genic loci with loci for regulatory elements such as AREs, thus saving the time required to repredict each motif in our sets of sequences. Further studies that explore motif discovery in the known target set and a more in-depth analysis of the sources of variation in RIP-Chip experiments are warranted.


4. Inferring Functional Properties of UTR Elements

4.1 Summary

In this section, we use the Commodus tool described in Chapter 2 to interrogate functionally related sets of genomic loci and provide a benchmark for our ability to pick out known and novel functional associations for genomic locus sets. This work was largely motivated by our serendipitous discovery of associations between RNA binding protein binding sites and microRNA target sites using a very crude precursor to the locus-based system. That work is described in more detail in Appendix IV. To develop a more systematic method for mining functional associations, we proceed to integrate genic (KnownGene, RefSeq), ontology (GO), disease (GAD), regulatory (TRANSFAC, ARED, UTRSite, TargetScanS), and other functional annotation data (MeSH, CHEM, InterPro) into our locus collection to provide a broad cross-section of currently available functional annotations. We then use the Commodus tool to mine annotations associated with three sets of well-characterized RNA secondary structure motifs (HSL, IRE, and SECIS) from the UTRdb. We find that most of the known functional properties of these sets are recapitulated. In addition, we uncover more unexpected characteristics of these data sets that are independently corroborated by literature search.


4.2 Collections of Post-Transcriptional Regulatory Elements

Ribonomic profiling and the RIP-Chip technology have great potential to expand our knowledge of the post-transcriptional regulatory architecture and to greatly expand existing collections of post-transcriptional regulatory motifs. Once a RIP-Chip experiment is analysed and a set of genes or binding sites potentially associated with an RNA-binding protein is identified, the next step is to find functional properties associated with the given set. What pathways are regulated by particular mRNA cis-regulatory elements? Are combinations of these elements associated with particular functional roles? Answering these questions can help to shed light on the role of the protein or motif of interest in the post-transcriptional sphere.

There are a few bioinformatics resources that attempt to find and catalog structural and sequential motifs important to post-transcriptional regulation. The UTResource developed by Giovanni Pesole at the Italian CNR houses a scheme for defining functional secondary structure patterns and detecting them in sequence, as well as a collection of literature-derived functional patterns and their occurrence in genomic UTRs (Mignone et al. 2005). EvoFold is a dataset developed by Jakob Pedersen at UCSC that picks out phylogenetically conserved secondary structures in the genome (Pedersen et al. 2006). Infernal is one of the many projects churned out by Sean Eddy and his group that harnesses stochastic context-free grammars (SCFGs) to define and detect secondary structure motifs (Eddy 2006). The miRNA field tends to be more active in tool and database development, and we find frequently updated miRNA repositories like miRBase and a constantly evolving breadth of target prediction tools such as miRanda (John et al. 2004) that will be useful for identifying interacting post-transcriptional players.

Our limited understanding of regulatory UTR motifs follows from our limited knowledge of RNA binding proteins. From well-characterized, simple conserved structures like the histone stem loop to complex and variable structures like the iron response element, new UTR motifs are catalogued only when a new RNA binding protein is discovered and footprinted. This is a model of motif discovery analogous to transcription factor discovery, a slow process that took more than a decade to come up with the still incomplete set of transcriptional regulators and enhancers. While large collections of known and predicted UTR elements do exist, there exist few methods for comparing these motif instances to other known or predicted functional annotations at the genomic, proteomic, or pathway level to better characterize their role.

To generically mine functional properties associated with sets of genomic elements, including these UTR elements, we have developed the interlocutor algorithm and the commodus viewer described previously in this thesis. We now proceed to benchmark the efficacy of this approach in mining functional properties associated with instances of three well-characterised sets of UTR elements from the UTRdb: the Histone Stem Loop (HSL3) element, the Selenocysteine Insertion Sequence (SECIS) element, and the Iron Response Element (IRE). We will determine if the associations found using our methodology recapitulate what is known about each motif and its role in gene regulation.

4.3 UTR Elements in UTRSite and UTRdb

The UTResource developed by the Pesole lab (Pesole & Liuni 1999) is the post-transcriptional analog of the well-known TRANSFAC and JASPAR transcriptional binding motif databases (Heinemeyer et al. 1998). It houses a collection of hundreds of known and suspected mRBP binding site motifs defined by both sequence and structure. The UTResource provides catalogs of UTR motifs (UTRSite), tools for finding instances of UTR motifs in sequences (UTRScan and PatSearch), and a database of motif instances across transcribed sequences (Mignone et al. 2005).

The UTRSite hosts a collection of about 100 distinct UTR elements. They include both sequence-defined elements such as the ARE and the K-Box as well as structural elements such as the histone stem loop and the internal ribosomal entry site (Pesole et al. 2000). A UTRSite entry consists of a motif definition given in the PatSearch expression language, which allows denotation of both sequential and structural features in terms of a flexible alphabet and blocks of repeating or complementing sequences (Mignone et al. 2005).

There are two tools available at the UTResource for finding sequence and structure patterns. The first, UTRScan, searches for instances of all of the motifs catalogued in UTRSite across a user-provided set of sequences. The second, PatSearch, is a bit more flexible and allows the user to fully define the pattern of interest and to search across input sequences or curated collections of 5' and 3' UTRs across clades of organisms (Mignone et al. 2005).

Finally, the UTRdb houses predicted instances of UTRSite motifs compiled across 5' and 3' UTR regions of known transcripts from EMBL in the general UTR section, or from RefSeq mRNAs only in the UTRef section. Entries in the UTRdb are organized by unique UTR sequence, and the position of the UTR and of known UTRSite motifs within the source transcript is provided along with cross-references to EMBL or RefSeq (Mignone et al. 2005).

4.4 Functionally Annotating Sets of UTR Elements

We downloaded instances of secondary structure elements from the UTRef section of the UTRdb, then mapped the RefSeq transcript identifiers for each motif-containing UTR to the hg18 build of the genome at the UCSC Genome Browser database (Karolchik et al. 2003). We then converted the transcript-relative position of each motif instance to absolute genomic positional coordinates using these mappings. From the entire UTRef database we recovered hg18 mappings for 16,429 motif instances representing 24 distinct motifs and including 54 HSL3 sites, 49 IRE sites and 188 SECIS sites. We compiled files containing the UTR element coordinates for the entire set of UTRef elements and for the subsets of histone stem loop (HSL3), iron response element (IRE), and selenocysteine insertion sequence (SECIS) motif instances (Figure 27).
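The conversion from transcript-relative to genomic coordinates can be sketched as a walk along a transcript's exon blocks. This is an illustrative sketch, not the code used for the UTRef mapping: it assumes exons are given as half-open genomic intervals sorted by start, that offsets are 0-based from the 5' end of the spliced transcript, and the function name is hypothetical.

```python
def transcript_to_genome(exons, strand, offset):
    """Map a 0-based transcript offset to a 0-based genomic position.

    exons:  list of (start, end) half-open genomic intervals, ascending
    strand: '+' or '-'
    offset: position counted from the 5' end of the spliced transcript
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # On the minus strand the transcript starts in the last genomic exon.
    ordered = exons if strand == "+" else list(reversed(exons))
    remaining = offset
    for start, end in ordered:
        length = end - start
        if remaining < length:
            if strand == "+":
                return start + remaining
            return end - 1 - remaining
        remaining -= length
    raise ValueError("offset lies beyond the end of the transcript")


# Hypothetical two-exon transcript: 10 bases, an intron, then 20 bases.
exons = [(100, 110), (200, 220)]
plus_first = transcript_to_genome(exons, "+", 0)     # first base of exon 1
plus_spliced = transcript_to_genome(exons, "+", 10)  # first base of exon 2
minus_first = transcript_to_genome(exons, "-", 0)    # last genomic base of exon 2
minus_spliced = transcript_to_genome(exons, "-", 20) # last genomic base of exon 1
```

A motif instance is mapped by converting its first and last transcript positions; when the two land in different exons, the genomic locus spans the intervening intron.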


Figure 27: Mapping Sets of Functionally Related UTR Structures

Mapping the UTRef database to hg18 genomic coordinates yielded 16429 mapped sites representing 34 distinct motifs. Instances of the histone stem loop (HSL3 - U001), iron response element (IRE - U002), and selenocysteine insertion sequence (SECIS1 - U003) were culled out for use as test collections of functionally related UTR elements. Structures depicted are based on corresponding entries for these elements in the Rfam database.


Next we prepared query functional annotation tracks from a variety of functional databases. We used the mappings between ENSEMBL Gene and GO Terms in the BioMart collection to annotate genomic coordinates with ontology terms. For each ontology term associated with a locus, we used the GO Database to recursively process its parent terms and associate those directly with the locus. We also harvested links between ENSEMBL Gene and protein domain identifiers from PROSITE, PFAM, and InterPro and assigned these annotations to the gene loci. We then proceeded to annotate NCBI Gene coordinates with disease associations (from MeSH & OMIM) and chemical interactions as curated by the Comparative Toxicogenomics Database.

We moved on to integrating databases of cis-regulatory elements that may coincide with our loci of interest. To do this, we first integrated the TargetScanS track of microRNA target sites from the UCSC Genome Browser Database. We added to this collections of AU-Rich Element instances across 3' UTR regions of the genome using both the precompiled ARE Database (ARED) and a manual search of all KnownGene 3' UTR regions for the canonical ARE pentamer (AUUUA) and nonamer (UUAUUUAUU). We also used the TRANSFAC database of transcription factor binding site (TFBS) motifs and the TRANSFAC match tool to search upstream promoters (2000 bp upstream to 200 bp downstream of the TSS) of KnownGene transcripts. To compress this large dataset to a more manageable transcription-factor to gene relation set, each entire transcript was annotated with the TFBS elements found in its promoter.

Once the sets of functional annotations were converted to loci, we proceeded to systematically intersect each of our three UTR element sets with this collection. Using the commodus module described previously, we derived associations between members of each UTR element set and functional annotation terms from our collection. We filtered out terms that were associated with fewer than 10% or fewer than 10 individual elements of a given set. We compared the association of a term with a UTR element set to the term's association with the entire UTRef set and used the hypergeometric approximation of Fisher's Exact Test to gauge significance. We picked the twenty most significant annotations associated with each set for further study.
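The filtering and ranking step described above can be sketched as follows. This is not the commodus implementation: the function name and input layout are hypothetical, the term statistics are invented for illustration, and the filter is read as discarding a term when it fails either threshold (fewer than 10 elements, or less than 10% of the set), which is one interpretation of the text.

```python
def select_terms(term_stats, set_size, min_fraction=0.10, min_count=10, top_n=20):
    """Keep well-supported terms and return the top_n by ascending p-value.

    term_stats: dict mapping term -> (elements_with_term, p_value)
    set_size:   number of elements in the UTR element set
    """
    kept = [
        (p_value, term)
        for term, (hits, p_value) in term_stats.items()
        if hits >= min_count and hits / set_size >= min_fraction
    ]
    kept.sort()  # most significant (smallest p-value) first
    return [term for _, term in kept[:top_n]]


# Hypothetical statistics for a set of 54 elements (the size of the HSL3 set).
example_stats = {
    "nucleosome": (40, 1e-30),
    "DNA binding": (30, 1e-12),
    "chromosome": (35, 1e-20),
    "rare term": (4, 1e-8),  # dropped: supported by fewer than 10 elements
}
selected = select_terms(example_stats, set_size=54)
```

Applying the support filter before ranking keeps a handful of elements sharing a rare annotation from crowding out terms that describe a substantial part of the set.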

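The manual search for canonical ARE motifs mentioned in the methods above could look something like the sketch below. It only scans a single UTR sequence and reports offsets within that sequence; mapping those offsets back to genomic coordinates (strand and exon structure) is left out. The function name and the `ARE_Nonamer` label are hypothetical, and accepting DNA input by converting T to U is an assumption about how the sequences are stored.

```python
import re

# Canonical motifs named in the text, written as RNA.
ARE_MOTIFS = {
    "ARE_Pentamer": "AUUUA",
    "ARE_Nonamer": "UUAUUUAUU",  # label is hypothetical
}


def find_are_sites(utr_sequence):
    """Return (motif_name, start, end) tuples, 0-based and half-open."""
    rna = utr_sequence.upper().replace("T", "U")
    sites = []
    for name, motif in ARE_MOTIFS.items():
        # A zero-width lookahead lets overlapping instances
        # (for example AUUUAUUUA) all be reported.
        for match in re.finditer("(?=" + motif + ")", rna):
            start = match.start()
            sites.append((name, start, start + len(motif)))
    return sorted(sites, key=lambda site: (site[1], site[2]))
```

Overlapping matches are kept deliberately, since tandem pentamers share their flanking adenosines and would otherwise be undercounted.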

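The propagation of ontology parent terms to each locus, described in the methods above, can be illustrated with a small sketch. The text describes a recursive walk over the GO Database; the version here uses an explicit stack for the same traversal. The parent map is a toy example typed in by hand rather than loaded from GO, and the locus key and function names are hypothetical.

```python
def ancestors(term, parents):
    """Collect every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(locus_terms, parents):
    """Attach each locus's direct terms plus all of their ancestors."""
    expanded = {}
    for locus, terms in locus_terms.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[locus] = full
    return expanded


# Toy parent map (illustrative edges only, not read from the GO Database).
toy_parents = {
    "GO:0000786": ["GO:0032993"],  # nucleosome -> protein-DNA complex
    "GO:0032993": ["GO:0032991"],  # protein-DNA complex -> macromolecular complex
}
toy_loci = {"hypothetical_locus_1": {"GO:0000786"}}
expanded_loci = propagate(toy_loci, toy_parents)
```

Associating ancestors directly with the locus is what lets broad terms appear alongside specific ones in the result tables, at the cost of strongly correlated rows.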
4.5 Histone-Stem Loop Associations

The histone stem loop (HSL) is a 26 base pair stem-loop structure present in the 3' UTRs of histone mRNAs (Dominski & Marzluff 1999). Most mRNAs are polyadenylated, then bound by poly-A binding protein (PABP) at their poly-A tails and by a cap binding protein complex (CBC) at their 5' cap sites, and then translated when PABP complexes with CBC and eukaryotic translation initiation factor 4E (eIF4E) to recruit ribosomes to the mRNA loop (Sonenberg & Hinnebusch 2009). In contrast, most histone messages are not polyadenylated but are instead bound by stem-loop binding protein (SLBP) at the HSL elements in their 3' UTRs (Marzluff et al. 2008). SLBP then mediates interactions of the mRNA with the U7 snRNP, which removes all bases 3' of the HSL (Wang et al. 1996), and with the eukaryotic initiation factors responsible for initiating translation (Sonenberg & Hinnebusch 2009). Dysregulation of the HSL or SLBP leads to aberrant translation of histones in both proliferating and endocycling cells, leading to developmental and epigenetic aberrations (Marzluff et al. 2008).

The Commodus algorithm showed the expected associations of HSL elements with histone domains and with related terms such as protein-DNA complex, nucleosome, chromosome, and DNA binding (Table 8). In addition, we recapitulated literature links between the HSL elements and autoimmune diseases (Czaja 2005; Kurien & Scofield 2006), as well as colorectal cancer (Konishi & Issa 2007; Kondo & Issa 2004). Finally, we see that genes containing HSLs are likely to be regulated by the transcription factor NFYC (also known as CBF). This is particularly interesting given that CBF-mediated gene expression is vital for the progression of cells through G2/M phase, the transition that serves as the checkpoint for cell division and the prerequisite process of histone production (Hu et al. 2006; Sterner et al. 1987).

4.6 Iron-Response Element Associations

The iron response element (IRE) is a short stem loop found in the UTRs of genes involved in iron metabolism (Muckenthaler et al. 2008). IREs are bound by iron response proteins (IRPs), such as IRP1 and IRP2, that are responsible for monitoring cytosolic iron concentration and optimizing iron availability (Cairo & Recalcati 2007). IRPs can either stabilize or translationally repress IRE-containing mRNAs. When the cell experiences a low iron concentration, for instance, IRPs bind IRE elements to stabilize the transferrin receptor mRNA, which is responsible for iron import, and to translationally repress ferritin, which is responsible for iron storage (Rouault 2006; Muckenthaler et al. 2008). Maintenance of iron levels is crucial in cellular processes such as cell growth, apoptosis, and inflammation, and iron is an integral component of cytochromes and oxygen-carrying globins (Rouault 2006).

We found that IRE instances were, as expected, associated with iron ion transport and iron ion homeostasis (Table 9). Most of the remaining mined associations were with diseases and processes known to have a direct relationship to iron homeostasis. They include metabolic diseases such as hemochromatosis (Fix & Kowdley 2008), thalassemia (Collard 2009), anemia (Cairo & Recalcati 2007), and porphyria cutanea tarda (Batts 2007), and neurological diseases such as Friedreich's ataxia (Pandolfo & Pastore 2009), multiple sclerosis (Agarwal & Griffith 2008), and Parkinson's disease (Wolozin & Golts 2002). Less well-categorized diseases and systemic problems such as sudden infant death syndrome (SIDS) (Weber et al. 2009), cardiomyopathies (Hershko et al. 2004), and susceptibility to bacterial infections (Collard 2009) also show associations with the IRE element set coordinates. The literature supports associations between iron metabolism and restless legs syndrome (Agarwal & Griffith 2008; Teive & Munhoz 2009), alpha-1 antitrypsin deficiency (Pandur et al. 2009), and osteoarthritis (Weinberg 2006), among the other given annotations.

4.7 Selenocysteine Insertion Sequence Annotations

Selenocysteine insertion sequence (SECIS) motifs are found downstream of UGA codons in selenoprotein mRNAs (Kryukov et al. 2003). Where UGA normally acts as a stop codon, the SECIS element recruits protein factors that allow its translation as selenocysteine (Lescure et al. 2002). Selenoproteins are characterized by the presence of one or more selenocysteine residues, most of which are involved in redox reactions (Reeves & Hoffmann 2009). Selenoproteins are known to be transcriptionally regulated by the Nrf2/Keap1 system (Lu & Holmgren 2009), where Nrf2 in association with Maf regulates the transcription of detoxification and antioxidant enzymes (Surh 2003). Selenoproteins include families of thioredoxin reductases involved in apoptosis and cell proliferation, as well as glutathione peroxidases, the major components of the antioxidant defense system that reduce hydroperoxides in the cytoplasm, the gastrointestinal tract, and in plasma (Reeves & Hoffmann 2009). Selenoproteins are also known to be important in the testis, where they are vital to sperm maturation and male fertility (Reeves & Hoffmann 2009).


Using the Commodus tool, we found that selenium binding was the functional term most significantly associated with the SECIS element set (Table 10). This is notable because only 18 of the 188 SECIS elements were found in genes associated with selenium binding, and it demonstrates the utility of the tool for mining sets of less well-characterized elements. The role of SECIS-containing genes as oxidoreductases was also recapitulated, with 22 of the genes being associated with this term. In addition, we find many transcription factor binding site motifs significantly associated with genes containing SECIS elements. The list includes MAF (V$VMAF_01), which is known to be a regulator of selenoprotein transcription; RFX1 (V$RFX1_01) and OVOL2 (V$MOVOB_01), which are known to be involved in spermatogenesis, a process intricately tied to selenium metabolism (Wolfe et al. 2006; Li et al. 2002); and FOXI1 (V$FOX_Q2 & V$HFH3_01), known to be a master regulator of proton transport and metal ion transport (Vidarsson et al. 2009). We also see an association of the SECIS element loci with influenza, which is corroborated by literature showing more severe symptomatology and the rise of new influenza virus strains when hosts are selenium deficient (Beck et al. 2001).


Table 8: Top annotations associated with histone stem loop instances

Term	Count (54)	Significance
IPR007125|Histone_core_D	31	-66.31
PF00125|	31	-65.61
GO:0032993|protein-DNA complex	31	-53.86
GO:0000786|nucleosome	31	-53.86
GO:0006334|nucleosome assembly	31	-52.59
GO:0031497|chromatin assembly	31	-51.43
GO:0007001|chromosome organization and biogenesis (sensu Eukaryota)	31	-50.57
GO:0022607|cellular component assembly	31	-49.18
GO:0006333|chromatin assembly or disassembly	31	-48.99
GO:0006323|DNA packaging	31	-48.81
GO:0065004|protein-DNA complex assembly	31	-48.63
GO:0000785|chromatin	31	-41.71
GO:0006325|establishment and/or maintenance of chromatin architecture	31	-40.37
GO:0051276|chromosome organization and biogenesis	31	-37.63
GO:0044427|chromosomal part	31	-37.24
GO:0005694|chromosome	31	-34.76
GO:0065003|macromolecular complex assembly	31	-31.02
D008628|Mercury	21	-22.71
GO:0006996|organelle organization and biogenesis	31	-22.10
MESH:D001327|Autoimmune Diseases	21	-21.26
IPR002119|Histone_H2A	10	-19.90
IPR000164|Histone_H3	8	-18.54
GO:0043232|intracellular non-membrane-bounded organelle	31	-17.92
GO:0043228|non-membrane-bounded organelle	31	-17.92
D013849|Thimerosal	11	-17.17
MESH:D015179|Colorectal Neoplasms	28	-16.43
V$NFY_C	26	-15.48
P$DOF_Q2	52	-14.92
GO:0003677|DNA binding	32	-14.40
GO:0032991|macromolecular complex	31	-14.39
C027579|periodate-oxidized adenosine	6	-14.24
GO:0044446|intracellular organelle part	31	-13.72
GO:0044422|organelle part	31	-13.70
V$NFY_Q6_01	15	-13.34
C024352|fludarabine	7	-12.72
GO:0016043|cellular component organization and biogenesis	31	-12.25
V$E2F_Q6_01	12	-12.05
V$OCT1_B	15	-11.94
V$TATA_01	26	-10.34
GO:0003676|nucleic acid binding	32	-10.30
V$NFY_Q6	12	-10.26
C022838|nickel chloride	8	-10.16
V$CAAT_01	16	-10.06
V$OCT1_Q6	22	-10.05
C014347|decitabine	7	-10.03
C012589|trichostatin A	7	-9.92
V$OCT_C	19	-9.14
D004121|Dimethyl Sulfoxide	7	-9.09
V$OCT1_Q5_01	11	-8.82
V$E2F_Q6	10	-8.80


Table 9: Annotations associated with IRE elements.

Term	Count (49)	Significance
GO:0006826|iron ion transport	11	-20.38
GO:0000041|transition metal ion transport	11	-18.68
MESH:D009202|Cardiomyopathies	13	-18.25
C013531|ferric ammonium citrate	7	-16.74
GO:0006879|cellular iron ion homeostasis	8	-15.58
GO:0055072|iron ion homeostasis	8	-15.58
D003676|Deferoxamine	7	-15.01
MESH:D013398|Sudden Infant Death	11	-14.93
MESH:D019896|alpha 1-Antitrypsin Deficiency	11	-14.93
MESH:D012148|Restless Legs Syndrome	11	-14.93
MESH:D009404|Nephrotic Syndrome	11	-14.93
MESH:D001145|Arrhythmias, Cardiac	11	-14.93
MESH:D005621|Friedreich Ataxia	11	-14.93
MESH:D001424|Bacterial Infections	11	-14.93
MESH:D007006|Hypogonadism	11	-14.93
MESH:D011528|Protozoan Infections	11	-14.93
MESH:D007037|Hypothyroidism	11	-14.93
MESH:D020961|Lewy Body Disease	11	-14.93
MESH:D006525|Hepatitis, Viral, Human	11	-14.93
MESH:D008268|Macular Degeneration	11	-14.93
MESH:D020528|Multiple Sclerosis, Chronic Progressive	11	-14.83
MESH:D006432|Hemochromatosis	11	-14.74
MESH:D009181|Mycoses	11	-14.64
MESH:D017119|Porphyria Cutanea Tarda	11	-14.64
MESH:D010003|Osteoarthritis	11	-14.55
MESH:D000756|Anemia, Sideroblastic	12	-14.16
MESH:D011041|Poisoning	11	-14.12
MESH:D006486|Hemosiderosis	11	-13.96
GO:0015674|di-, tri-valent inorganic cation transport	13	-13.68
MESH:D002311|Cardiomyopathy, Dilated	11	-13.50
GAD_DT:58|Iron Overload	6	-13.38
D007455|Iodine	5	-13.21
GAD_DT:64|Porphyria Cutanea Tarda	5	-13.21
GAD_BP:48|porphyria cutanea tarda	5	-13.21
MESH:D017086|beta-Thalassemia	5	-13.21
D000431|Ethanol	7	-12.62
D006427|Hemin	6	-12.56
D007501|Iron	7	-12.42
PF04253|	5	-12.37
GAD_DT:46|Hemochromatosis	6	-12.23
GAD_BP:37|hemochromatosis	5	-11.77
MESH:D002056|Burns	8	-11.18
GO:0030001|metal ion transport	15	-10.56
PF02225|	5	-10.56
D017373|Cyproterone Acetate	5	-10.26
MESH:D020734|Parkinsonian Disorders	7	-10.17
MESH:D020149|Manganese Poisoning	7	-10.17
MESH:D000740|Anemia	11	-10.15
MESH:D006529|Hepatomegaly	13	-10.10
MESH:D003922|Diabetes Mellitus, Type 1	11	-10.07

Table 10: Annotations associated with SECIS elements

Term	Count (188)	Significance
GO:0008430|selenium binding	18	-16.78
TargetScanS miRNA Tgts	27	-10.53
V$FOX_Q2	35	-8.41
V$PEA3_Q6	69	-6.90
V$SMAD_Q6_01	31	-6.81
V$RFX1_01	58	-6.36
F$MATALPHA2_01	66	-6.16
V$VMAF_01	79	-6.01
V$PAX4_04	70	-5.80
V$AP4_Q6	20	-5.80
V$MOVOB_01	65	-5.73
V$HFH3_01	38	-5.55
V$SOX10_Q6	65	-5.54
V$ARP1_01	47	-5.54
V$CEBPGAMMA_Q6	44	-5.45
V$CBF_01	32	-5.44
V$NKX25_01	63	-5.32
V$FXR_Q3	31	-5.27
V$FREAC7_01	42	-5.21
V$AIRE_02	34	-5.15
I$BRCZ3_01	34	-4.87
V$MYB_Q6	59	-4.83
V$PPARG_02	103	-4.73
ARE_Pentamer	18	-4.69
AU-Rich Elements	18	-4.69
P$ALFIN1_Q2	32	-4.66
V$XFD1_01	34	-4.61
GO:0016491|oxidoreductase activity	22	-4.61
V$HMGIY_Q6	43	-4.50
V$GR_Q6	89	-4.47
V$RORA_Q4	106	-4.43
V$TAL1_Q6	57	-4.43
V$ZF5_01	106	-4.39
MESH:D007251|Influenza, Human	22	-4.36
P$WRKY_Q2	20	-4.35
P$ABI4_01	55	-4.35
V$MEF2_04	28	-4.34
V$SZF11_01	68	-4.32
V$ALPHACP1_01	24	-4.14
P$GT1_Q6	49	-4.12
F$STRE_B	43	-4.05
V$MYB_Q5_01	54	-4.02
V$ETF_Q6	69	-4.02
V$SREBP1_02	60	-3.99
V$SREBP1_01	31	-3.98
P$RAV1_02	24	-3.96
V$CP2_02	43	-3.79
V$PPARA_01	53	-3.76
V$LEF1_Q2	78	-3.74
V$TAXCREB_01	21	-3.73


4.8 Discussion

We find that the Commodus tool can find annotations both directly and indirectly associated with a set of functionally related loci. The histone stem loop example shows that most of the mined annotations were directly associated with histone messages and with their biological function. The analysis also reveals associations of histone stem loop loci with loci affected by chemical agents such as thimerosal and dimethyl sulfoxide, and with loci annotated with diseases such as colorectal disorders and autoimmune diseases, suggesting its utility in genomics-based drug-discovery tasks. The fact that Commodus is able to detect functional annotations associated with histone motifs serves as an effective positive control for the tool. The case of iron response element loci is similar and reveals associations with disease terms known to be related to deficiencies in iron metabolism.

The case of selenocysteine insertion sequence elements is particularly interesting in two ways. First, only 18 of the 188 elements have any sort of significant functional annotation associated with them, but the fact that the associated annotation was in fact selenium binding shows that the locus-based approach is applicable even when the set is culled from annotation deserts within a genome. Second, most of the other associations of the SECIS loci were with binding sites for transcription factors such as MAF and OVOL2, which were, remarkably, associated at least indirectly with selenoprotein biology. This shows that the Commodus approach can be used not only to functionally characterize a locus set but also to find its potential transcriptional and post-transcriptional regulators.


The iron response element set shows a cluster of associations with Sudden Infant Death Syndrome (SIDS), a less well-characterized disease for which no mechanism is known (Opdal & Rognum 2004). The current literature on SIDS branches into theories involving neurological deficits, cardiac arrhythmias, and various genetic and environmental effects that may contribute to these. In the Commodus analysis we see that the cluster of genes associated with SIDS is also significantly associated with cardiac arrhythmias and bacterial infections. Germane to this discussion is a recent hypothesis that explicitly suggests that SIDS may be caused by bacterial toxins that produce cardiac arrhythmias in genetically prone individuals (Morris et al. 2009). We suggest that the IRE-regulated genes may be integrally involved in the etiology of SIDS, as they show associations with all elements of the hypothesis propounded by Morris et al. This also suggests that the Commodus tool can be used to explore diseases with unknown etiologies, providing a mechanism both for generating potential hypotheses and for ranking existing theories according to the available data.

In addition to the ontological and disease annotations, we were able to uncover transcription factors that potentially regulate these target sets. An interesting study of the functional significance of the associations mined here would be to measure the effect, on each subset of UTR motif associated targets, of induction or repression of the mined transcription factors, for instance to see how NFYC affects histone production and how MAF, RFX1, or OVOL2 affect the expression and metabolism of selenoproteins. To screen for potentially interesting regulatory networks here, we plan to look at correlations between pairs of RBPs, RBP targets, and the transcription factors mentioned above in publicly available gene expression datasets.
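One simple way the proposed correlation screen could be set up is sketched below. It is only an illustration of the idea: Pearson correlation is assumed as the measure, the function names are hypothetical, and the expression profiles are made-up numbers standing in for values that would come from a public expression dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A flat profile has no defined correlation.
        raise ValueError("profile with zero variance")
    return sxy / sqrt(sxx * syy)


def rank_pairs(profiles):
    """Score every pair of profiles and sort by absolute correlation."""
    scored = [
        (a, b, pearson(profiles[a], profiles[b]))
        for a, b in combinations(sorted(profiles), 2)
    ]
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)


# Made-up profiles across four samples; names are placeholders.
toy_profiles = {
    "factor_a": [1.0, 2.0, 3.0, 4.0],
    "target_b": [2.0, 4.0, 6.0, 8.0],
    "target_c": [1.0, 3.0, 2.0, 5.0],
}
ranked = rank_pairs(toy_profiles)
```

Sorting by absolute value keeps strongly anti-correlated pairs near the top, which matters when the factor in question is a repressor.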

Commodus is the first tool for functional annotation term mining built on a locus-based framework. We have demonstrated here its ability to recapitulate known functional associations in sets of functionally associated genomic elements. We have also shown how Commodus could be used to characterize diseases in terms of known and potential mechanisms. The locus-based framework thus provides a platform for homogeneously analyzing sets of disparate genomics data sets, expanding the space of minable annotations well beyond those associated with genes. As the space of available locus-based annotations grows, Commodus and the locus-based framework can be used to rapidly and effectively mine these for interesting functional inferences.


5. Summary & Future Prospects

Using the increasingly well-accepted concept of genomic coordinates to maximally decompress and simplify genomic information, we are able to develop a framework that can integrate, compare, and analyze the widest variety of existing and generated genomics data sets. This paradigm is suited to be both a lossless format for storing genomics and proteomics data and a robust development platform for integrative genomics analysis tools. We have described here an algorithm, Commodus, to quickly distill genomic locus-sets into lists of significantly enriched annotation terms by systematic intersection with sets of annotation loci. We have also developed a tool to systematically query interaction databases for all functional interactions affecting a set of loci, and to build a dynamically filterable, information-rich visualization of the network.

We have begun to compile datasets and resources for the study of post-transcriptional regulation and ribonomic profiling. The datasets produced by this field go well beyond the gene-based concepts usually used to analyze genomics data and provide a fertile testing ground for the locus-based framework. We outline the informatic resources that exist for the study of post-transcriptional regulation and begin to convert the most useful of these into locus-based resources. We apply the locus-intersection tools to study correlations between different classes of regulatory elements, namely between RNA binding protein binding sites and miRNA target sites, and discover potential coordinate regulation across the two classes. We use the locus-intersection framework and the Commodus tool to compare and contrast ribonomic profiling results from a variety of platforms to each other and to significant cis-regulatory sites. Finally, we show how the Commodus tool can be used to mine regulatory, ontological, disease, and chemical annotations associated with sets of functionally associated loci.

We demonstrate how functional interaction mining can be coupled to microarray analysis to discover transcriptional cascades responsible for the phenotypes of interest. We also show how interaction networks can be mined using graph-theoretic analysis to discover important clustering properties and points of control. It is important to integrate these methods with the locus-based framework and expand on the limited interaction mining module we have built.

The locus-based framework provides an intuitive model for integrating various disparate functional genomics datasets. While sets can now be integrated at the coordinate level and terms associated with these can be mined and explored, we still have only limited capabilities for dealing with numerical data such as the results of microarray experiments. Building tools and modules to explore the statistical properties of locus-sets and to build numerical comparisons, such as correlations across sets, provides the next major challenge. An interesting further goal would be to build methods that can treat the genomic coordinate space as just two dimensions (chromosomal positions and strands) of a multidimensional genomic annotation space and to look across this landscape for structures, patterns, and anomalies.
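The locus intersection that underlies this summary can be sketched in a few lines. This is a simplified illustration and not the Commodus code: the `Locus` tuple, the 0-based half-open coordinate convention, the strand-matching default, and the function names are all assumptions made for the example.

```python
from collections import Counter, namedtuple

# A locus as chromosome, start, end (half-open), and strand.
Locus = namedtuple("Locus", "chrom start end strand")


def overlaps(a, b, same_strand=True):
    """True when two loci share at least one base."""
    if a.chrom != b.chrom:
        return False
    if same_strand and a.strand != b.strand:
        return False
    return a.start < b.end and b.start < a.end


def count_terms(query_loci, annotation_loci):
    """Count, per term, how many query loci overlap an annotated locus.

    annotation_loci is a list of (Locus, term) pairs. Each query locus
    contributes at most once to a given term.
    """
    counts = Counter()
    for query in query_loci:
        hit_terms = {term for locus, term in annotation_loci if overlaps(query, locus)}
        counts.update(hit_terms)
    return counts


# Hypothetical loci and term labels.
queries = [Locus("chr1", 100, 200, "+"), Locus("chr1", 500, 600, "-")]
annotations = [
    (Locus("chr1", 150, 160, "+"), "term_x"),
    (Locus("chr1", 190, 300, "+"), "term_x"),
    (Locus("chr1", 550, 560, "+"), "term_y"),
]
term_counts = count_terms(queries, annotations)
```

The nested scan is quadratic in the number of loci, so a genome-scale version would presumably need a sorted or indexed interval structure; the per-term counts it produces are the kind of input an enrichment test would then take.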


6. Bibliography

Agarwal, P. & Griffith, A., 2008. Restless legs syndrome: a unique case and essentials of diagnosis and treatment. Medscape Journal of Medicine, 10(12), 296.
Agilent Technologies, 2009. Agilent Feature Extraction Software (v10.5) Reference Guide. Available at: http://www.chem.agilent.com/Library/usermanuals/Public/G446090020_FE_10.5_Reference.pdf [Accessed April 5, 2009].
Akool, E. et al., 2003. Nitric Oxide Increases the Decay of Matrix Metalloproteinase 9 mRNA by Inhibiting the Expression of mRNA-Stabilizing Factor HuR. Mol. Cell. Biol., 23(14), 4901-4916.
Al-Shahrour, F. et al., 2007. FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Research, 35(Web Server issue), W91-96.
Al-Shahrour, F. et al., 2006. BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments. Nucleic acids research, 34(Web Server issue), W472-6.
Amoreira, C., Hindermann, W. & Grunau, C., 2003. An improved version of the DNA Methylation database (MethDB). Nucleic Acids Research, 31(1), 75-77.
Atasoy, U. et al., 1998. ELAV protein HuA (HuR) can redistribute between nucleus and cytoplasm and is upregulated during serum stimulation and T cell activation. Journal of cell science, 111 (Pt 21), 3145-56.
Atasoy, U. et al., 2003. Regulation of eotaxin gene expression by TNF-alpha and IL-4 through mRNA stabilization: involvement of the RNA-binding protein HuR. Journal of immunology (Baltimore, Md.: 1950), 171(8), 4369-78.
Bader, G.D. et al., 2001. BIND--The Biomolecular Interaction Network Database. Nucleic acids research, 29(1), 242-5.
Bain, L.J. & Engelhardt, M., 2000. Introduction to Probability and Mathematical Statistics 2nd ed., Duxbury Press.
Bakheet, T., Williams, B.R.G. & Khabar, K.S.A., 2006. ARED 3.0: the large and diverse AU-rich transcriptome. Nucleic acids research, 34(Database issue), D111-4.
Baroni, T.E. et al., 2008. Advances in RIP-chip analysis: RNA-binding protein immunoprecipitation-microarray profiling. Methods in Molecular Biology (Clifton, N.J.), 419, 93-108.
Barreau, C., Paillard, L. & Osborne, H.B., 2006. AU-rich elements and associated factors: are there unifying principles? Nucl. Acids Res., 33(22), 7138-7150.
Barrett, T. et al., 2007. NCBI GEO: mining tens of millions of expression profiles-database and tools update. Nucleic Acids Research, 35(Database issue), D760-765.
Batts, K.P., 2007. Iron overload syndromes and the liver. Mod Pathol, 20(1s), S31-S39.
Becker, K.G. et al., 2004. The genetic association database. Nature genetics, 36(5), 431-2.
Berman, H., Henrick, K. & Nakamura, H., 2003. Announcing the worldwide Protein Data Bank. Nat Struct Mol Biol, 10(12), 980.
Bertone, P. et al., 2004. Global identification of human transcribed sequences with genome tiling arrays. Science (New York, N.Y.), 306(5705), 2242-6.
Birney, E., 2003. Ensembl: a genome infrastructure. Cold Spring Harbor symposia on quantitative biology, 68, 213-5.
Birney, E. et al., 2004. An overview of Ensembl. Genome research, 14(5), 925-8.
Blankenberg, D. et al., 2007. A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly. Genome research, 17(6), 960-4.
Blaxall, B.C. et al., 2000. Purification and Characterization of beta-Adrenergic Receptor mRNA-binding Proteins. J. Biol. Chem., 275(6), 4290-4297.
Boguski, M.S., 2004. ENCODE and ChIP-chip in the genome era. Genomics, 83(3), 347-8.
Briata, P. et al., 2003. The Wnt/β-Catenin→Pitx2 Pathway Controls the Turnover of Pitx2 and Other Unstable mRNAs. Molecular Cell, 12(5), 1201-1211.
Bruford, E.A. et al., 2008. The HGNC Database in 2008: a resource for the human genome. Nucl. Acids Res., 36(suppl_1), D445-448.
Bryne, J.C. et al., 2008. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Research, 36(Database issue), D102-106.
Burset, M., Seledtsov, I.A. & Solovyev, V.V., 2001. SpliceDB: database of canonical and non-canonical mammalian splice sites. Nucleic Acids Research, 29(1), 255-259.
Cairo, G. & Recalcati, S., 2007. Iron-regulatory proteins: molecular biology and pathophysiological implications. Expert Reviews in Molecular Medicine, 9(33), 1-13.
Chen, C.Y. & Shyu, A.B., 1994. Selective degradation of early-response-gene mRNAs: functional analyses of sequence features of the AU-rich elements. Molecular and cellular biology, 14(12), 8471-82.
Chen, C.A., Xu, N. & Shyu, A., 2002. Highly selective actions of HuR in antagonizing AU-rich element-mediated mRNA destabilization. Molecular and cellular biology, 22(20), 7268-78.
Chen, N., 2004. Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics / Editoral Board, Andreas D. Baxevanis ... [et Al, Chapter 4, Unit 4.10.
Chen, T. et al., 1991. Nucleosome fractionation by mercury affinity chromatography. Contrasting distribution of transcriptionally active DNA sequences and acetylated histones in nucleosome fractions of wild-type yeast cells and cells expressing a histone H3 gene altered to encode a cysteine 110 residue. J. Biol. Chem., 266(10), 6489-6498.
Choo, K., Tan, T. & Ranganathan, S., 2005. SPdb - a signal peptide database. BMC Bioinformatics, 6(1), 249.
Chung, S. et al., 1997. The Elav-like Proteins Bind to a Conserved Regulatory Element in the 3'-Untranslated Region of GAP-43 mRNA. J. Biol. Chem., 272(10), 6593-6598.
Collard, K.J., 2009. Iron Homeostasis in the Neonate. Pediatrics, 123(4), 1208-1216.
Coon, J.J. et al., 2005. Tandem mass spectrometry for peptide and protein sequence analysis. BioTechniques, 38(4), 519, 521, 523.
Crick, F.H., 1958. On Protein Synthesis. In The Symposia of the Society for Experimental Biology. pp. 138-163. Available at: http://profiles.nlm.nih.gov/SC/B/B/Z/Y/ [Accessed April 13, 2009].
Czaja, A.J., 2005. Autoantibodies in autoimmune liver disease. Advances in Clinical Chemistry, 40, 127-64.
Dalma-Weiszhausz, D.D. et al., 2006. The affymetrix GeneChip platform: an overview. Methods in enzymology, 410, 3-28.

Dean, J.L. et al., 2001. The 3' untranslated region of tumor necrosis factor alpha mRNA is a target of the mRNA-stabilizing factor HuR. Molecular and cellular biology, 21(3), 721-30.
Demeter, J. et al., 2007. The Stanford Microarray Database: implementation of new analysis tools and open source release of software. Nucleic Acids Research, 35(Database issue), D766-770.
Dominski, Z. & Marzluff, W.F., 1999. Formation of the 3' end of histone mRNA. Gene, 239(1), 1-14.
Dormoy-Raclet, V. et al., 2007. The RNA-binding protein HuR promotes cell migration and cell invasion by stabilizing the beta-actin mRNA in a U-rich-element-dependent manner. Molecular and cellular biology, 27(15), 5365-80.
Down, T.A. & Hubbard, T.J.P., 2002. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Research, 12(3), 458-461.
Eddy, S.R., 2006. Computational analysis of RNAs. Cold Spring Harbor Symposia on Quantitative Biology, 71, 117-28.
Fisher, R.A., 1922. On the Interpretation of χ2 from Contingency Tables, and the Calculation of P. Journal of the Royal Statistical Society, 85(1), 87-94.
Fix, O.K. & Kowdley, K.V., 2008. Hereditary hemochromatosis. Minerva Medica, 99(6), 605-617.
Frazer, K.A. et al., 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449(7164), 851-861.
Giardine, B. et al., 2005. Galaxy: a platform for interactive large-scale genome analysis. Genome research, 15(10), 1451-5.
van der Giessen, K. et al., 2003. RNAi-mediated HuR Depletion Leads to the Inhibition of Muscle Cell Differentiation. J. Biol. Chem., 278(47), 47119-47128.
Goldberg-Cohen, I., Furneauxb, H. & Levy, A.P., 2002. A 40-bp RNA element that mediates stabilization of vascular endothelial growth factor mRNA by HuR. The Journal of biological chemistry, 277(16), 13635-40.
Griffiths-Jones, S., 2006. miRBase: the microRNA sequence database. Methods in molecular biology (Clifton, N.J.), 342, 129-38.

Griffiths-Jones, S. et al., 2003. Rfam: an RNA family database. Nucleic acids research, 31(1), 439-41.
Griffiths-Jones, S. et al., 2006. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic acids research, 34(Database issue), D140-4.
Griffiths-Jones, S. et al., 2005. Rfam: annotating non-coding RNAs in complete genomes. Nucleic acids research, 33(Database issue), D121-4.
Guo, T. et al., 2004. DBSubLoc: database of protein subcellular localization. Nucleic Acids Research, 32(Database issue), D122-124.
Guo, X. & Hartley, R.S., 2006. HuR contributes to cyclin E1 deregulation in MCF-7 breast cancer cells. Cancer Research, 66(16), 7948-56.
Haeussler, J. et al., 2000. Tumor antigen HuR binds specifically to one of five protein-binding segments in the 3'-untranslated region of the neurofibromin messenger RNA. Biochemical and Biophysical Research Communications, 267(3), 726-32.
Hamosh, A. et al., 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res, 33(Database issue), D514-D517.
Harris, M.A. et al., 2004. The Gene Ontology (GO) database and informatics resource. Nucleic acids research, 32(Database issue), D258-61.
Harris, T.D. et al., 2008. Single-molecule DNA sequencing of a viral genome. Science (New York, N.Y.), 320(5872), 106-109.
Hayakawa, J. et al., 2004. Identification of promoters bound by c-Jun/ATF2 during rapid large-scale gene activation following genotoxic stress. Molecular Cell, 16(4), 521-35.
Henikoff, S., 2007. ENCODE and our very busy genome. Nature genetics, 39(7), 817-8.
Hershko, C. et al., 2004. Purging iron from the heart. British Journal of Haematology, 125(5), 545-551.
Hong, P. & Wong, W.H., 2005. GeneNotes--a novel information management software for biologists. BMC Bioinformatics, 6, 20.
Hsu, F. et al., 2006. The UCSC Known Genes. Bioinformatics (Oxford, England), 22(9), 1036-46.
Hu, Q. et al., 2006. Inhibition of CBF/NF-Y mediated transcription activation arrests cells at G2/M phase and suppresses expression of genes activated at G2/M phase of the cell cycle. Nucl. Acids Res., 34(21), 6272-6285.
Hu, Z. et al., 2004. VisANT: an online visualization and analysis tool for biological interaction data. BMC bioinformatics, 5, 17.
Huang, D.W. et al., 2007. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic acids research, 35(Web Server issue), W169-75.
Hulo, N. et al., 2008. The 20 years of PROSITE. Nucleic Acids Research, 36(Database issue), D245-249.
Hutchison, C.A., 2007. DNA sequencing: bench to bedside and beyond. Nucleic acids research, 35(18), 6227-37.
John, B. et al., 2004. Human MicroRNA targets. PLoS biology, 2(11), e363.
Jurka, J. et al., 2005. Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research, 110(1-4), 462-467.
Kanehisa, M. & Goto, S., 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1), 27-30.
Karolchik, D. et al., 2003. The UCSC Genome Browser Database. Nucleic acids research, 31(1), 51-4.
Kauffman, S.A., 1993. The Origins of Order: Self-Organization and Selection in Evolution, New York: Oxford University Press.
Keene, J.D., 2001. Ribonucleoprotein infrastructure regulating the flow of genetic information between the genome and the proteome. Proceedings of the National Academy of Sciences of the United States of America, 98(13), 7018-24.
Keene, J.D. & Tenenbaum, S.A., 2002. Eukaryotic mRNPs may represent posttranscriptional operons. Molecular cell, 9(6), 1161-7.
Kersey, P. & Apweiler, R., 2006. Linking publication, gene and protein data. Nat Cell Biol, 8(11), 1183-1189.
Kielbasa, S.M., Gonze, D. & Herzel, H., 2005. Measuring similarities between transcription factor binding sites. BMC Bioinformatics, 6, 237.
Kim, S. et al., 2006. MODi: a powerful and convenient web server for identifying multiple post-translational peptide modifications from tandem mass spectra. Nucleic Acids Research, 34(Web Server issue), W258-263.
Knudsen, S., 2004. Guide to Analysis of DNA Microarray Data 2nd ed., Hoboken, N.J: Wiley-Liss.
Kondo, Y. & Issa, J.J., 2004. Epigenetic changes in colorectal cancer. Cancer Metastasis Reviews, 23(1-2), 29-39.
Konishi, K. & Issa, J.J., 2007. Targeting aberrant chromatin structure in colorectal carcinomas. Cancer Journal (Sudbury, Mass.), 13(1), 49-55.
Krek, A. et al., 2005. Combinatorial microRNA target predictions. Nat Genet, 37(5), 495-500.
Kryukov, G.V. et al., 2003. Characterization of mammalian selenoproteomes. Science (New York, N.Y.), 300(5624), 1439-1443.
Kullmann, M. et al., 2002. ELAV/Hu proteins inhibit p27 translation via an IRES element in the p27 5'UTR. Genes & Development, 16(23), 3087-99.
Kurien, B.T. & Scofield, R.H., 2006. Autoantibody determination in the diagnosis of systemic lupus erythematosus. Scandinavian Journal of Immunology, 64(3), 227-35.
Lafon, I. et al., 1998. Developmental expression of AUF1 and HuR, two c-myc mRNA binding proteins. Oncogene, 16(26), 3413-21.
Lal, A. et al., 2004. Concurrent versus individual binding of HuR and AUF1 to common labile target mRNAs. The EMBO journal, 23(15), 3092-102.
Lee, J.Y. et al., 2007. PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic Acids Research, 35(Database issue), D165-168.
Lee, T. et al., 2006. dbPTM: an information repository of protein post-translational modification. Nucleic Acids Research, 34(Database issue), D622-627.
Lehne, B. & Schlitt, T., 2009. Protein-protein interaction databases: keeping up with growing interactomes. Human Genomics, 3(3), 291-297.
Lemay, J. et al., 2008. HuR interacts with human immunodeficiency virus type 1 reverse transcriptase, and modulates reverse transcription in infected cells. Retrovirology, 5(1), 47.
Lescure, A. et al., 2002. Protein factors mediating selenoprotein synthesis. Current Protein & Peptide Science, 3(1), 143-151.

Letunic, I., Doerks, T. & Bork, P., 2009. SMART 6: recent updates and new developments. Nucleic Acids Research, 37(Database issue), D229-232. Lewis, B.P., Burge, C.B. & Bartel, D.P., 2005. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell, 120(1), 15-20. Li, B. et al., 2002. Ovol2, a Mammalian Homolog of Drosophila ovo: Gene Structure, Chromosomal Mapping, and Aberrant Expression in Blind-Sterile Mice. Genomics, 80(3), 319-325. Lin, F. et al., 2006. The role of human antigen R, an RNA-binding protein, in mediating the stabilization of toll-like receptor 4 mRNA induced by endotoxin: a novel mechanism involved in vascular inflammation. Arteriosclerosis, thrombosis, and vascular biology, 26(12), 2622-9. Liu, X. et al., 2008. TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics, 9, 271. Loflin, P. & Lever, J.E., 2001. HuR binds a cyclic nucleotide-dependent, stabilizing domain in the 3' untranslated region of Na(+)/glucose cotransporter (SGLT1) mRNA. FEBS Letters, 509(2), 267-71. Lopez de Silanes, I. et al., 2003. Role of the RNA-binding protein HuR in colon carcinogenesis. Oncogene, 22(46), 7146-54. Lu, J. & Holmgren, A., 2009. Selenoproteins. The Journal of Biological Chemistry, 284(2), 723-727. Ma, W.J. et al., 1996. Cloning and characterization of HuR, a ubiquitously expressed Elav-like protein. The Journal of Biological Chemistry, 271(14), 8144-51. Maglott, D. et al., 2007. Entrez Gene: gene-centered information at NCBI. Nucleic acids research, 35(Database issue), D26-31. Manjithaya, R.R. & Dighe, R.R., 2004. The 3' Untranslated Region of Bovine Follicle-Stimulating Hormone beta Messenger RNA Downregulates Reporter Expression: Involvement of AU-Rich Elements and Transfactors. Biol Reprod, 71(4), 1158-1166. Manohar, C.F. et al., 2002. HuD, a Neuronal-specific RNA-binding Protein, Increases the in Vivo Stability of MYCN RNA. J. Biol. Chem., 277(3), 1967-1973.
Marzluff, W.F., Wagner, E.J. & Duronio, R.J., 2008. Metabolism and regulation of

canonical histone mRNAs: life without a poly(A) tail. Nat Rev Genet, 9(11), 843-854. Mazan-Mamczarz, K. et al., 2008. Post-transcriptional gene regulation by HuR promotes a more tumorigenic phenotype. Oncogene, 27(47), 6151-63. Meisner, N. et al., 2004. mRNA openers and closers: modulating AU-rich element-controlled mRNA stability by a molecular switch in mRNA secondary structure. Chembiochem: A European Journal of Chemical Biology, 5(10), 1432-47. Meng, Z. et al., 2005. The ELAV RNA-stability factor HuR binds the 5'-untranslated region of the human IGF-IR transcript and differentially represses cap-dependent and IRES-mediated translation. Nucleic acids research, 33(9), 2962-79. Mi, H. et al., 2005. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Database issue), D284-288. Mignone, F. et al., 2005. UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic acids research, 33(Database issue), D141-6. Muckenthaler, M.U., Galy, B. & Hentze, M.W., 2008. Systemic iron homeostasis and the iron-responsive element/iron-regulatory protein (IRE/IRP) regulatory network. Annual Review of Nutrition, 28, 197-213. Nabors, L.B. et al., 2001. HuR, a RNA Stability Factor, Is Expressed in Malignant Brain Tumors and Binds to Adenine- and Uridine-rich Elements within the 3' Untranslated Regions of Cytokine and Angiogenic Factor mRNAs. Cancer Res, 61(5), 2154-2161. Palin, K., Taipale, J. & Ukkonen, E., 2006. Locating potential enhancer elements by comparative genomics using the EEL software. Nature Protocols, 1(1), 368-374. Pandolfo, M. & Pastore, A., 2009. The pathogenesis of Friedreich ataxia and the structure and function of frataxin. Journal of Neurology, 256, 9-17. Pandur, E. et al., 2009. α-1 Antitrypsin binds preprohepcidin intracellularly and prohepcidin in the serum. FEBS Journal, 276(7), 2012-2021. Parkinson, H. et al., 2007.
ArrayExpress--a public database of microarray experiments and gene expression profiles. Nucleic Acids Research, 35(Database issue), D747-750. Pedersen, J.S. et al., 2006. Identification and Classification of Conserved RNA Secondary Structures in the Human Genome. PLoS Computational Biology, 2(4), e33.

Penalva, L.O., Tenenbaum, S.A. & Keene, J.D., 2004. Gene expression analysis of messenger RNP complexes. Methods in molecular biology (Clifton, N.J.), 257, 125-34. Peng, S.S. et al., 1998. RNA stabilization by the AU-rich element binding protein, HuR, an ELAV protein. The EMBO journal, 17(12), 3461-70. Pesole, G. et al., 2000. UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Nucleic acids research, 28(1), 193-6. Prechtel, A.T. et al., 2006a. Expression of CD83 is regulated by HuR via a novel cis-active coding region RNA element. The Journal of Biological Chemistry, 281(16), 10912-25. Prechtel, A.T. et al., 2006b. Expression of CD83 Is Regulated by HuR via a Novel cis-Active Coding Region RNA Element. J. Biol. Chem., 281(16), 10912-10925. Pruitt, K.D., Tatusova, T. & Maglott, D.R., 2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res, 35(Database issue), D61-D65. Raghavan, A. et al., 2001. HuA and Tristetraprolin Are Induced following T Cell Activation and Display Distinct but Overlapping RNA Binding Specificities. J. Biol. Chem., 276(51), 47958-47965. Ramsay, G., 1998. DNA chips: State-of-the art. Nat Biotech, 16(1), 40-44. Reeves, M.A. & Hoffmann, P.R., 2009. The human selenoproteome: recent insights into functions and regulation. Cellular and Molecular Life Sciences: CMLS, 66(15), 2457-2478. Rhodes, D.R. et al., 2007. Oncomine 3.0: Genes, Pathways, and Networks in a Collection of 18,000 Cancer Gene Expression Profiles. Neoplasia (New York, N.Y.), 9(2). Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=17356713 [Accessed February 29, 2008]. Rodriguez-Pascual, F. et al., 2000. Complex Contribution of the 3'-Untranslated Region to the Expressional Regulation of the Human Inducible Nitric-oxide Synthase Gene. INVOLVEMENT OF THE RNA-BINDING PROTEIN HuR. J. Biol.
Chem., 275(34), 26040-26049. Rouault, T.A., 2006. The role of iron regulatory proteins in mammalian iron homeostasis and disease. Nature Chemical Biology, 2(8), 406-414.

Sakai, K. et al., 2003. Binding of the ELAV-like protein in murine autoimmune T-cells to the nonameric AU-rich element in the 3' untranslated region of CD154 mRNA. Molecular Immunology, 39(14), 879-883. Salwinski, L. et al., 2004. The Database of Interacting Proteins: 2004 update. Nucleic Acids Research, 32(Database issue), D449-451. Sammut, S.J., Finn, R.D. & Bateman, A., 2008. Pfam 10 years on: 10,000 families and still growing. Briefings in Bioinformatics, 9(3), 210-219. Sayers, E.W. et al., 2009. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 37(Database issue), D5-15. Seko, Y. et al., 2004. Selective cytoplasmic translocation of HuR and site specific binding to the IL-2 mRNA are not sufficient for CD28-mediated stabilization of the mRNA. J. Biol. Chem., M312306200. Sellick, G.S. et al., 2004. Genomewide linkage searches for Mendelian disease loci can be efficiently conducted using high-density SNP genotyping arrays. Nucleic Acids Research, 32(20), e164. Shah, S.P. et al., 2005. Atlas - a data warehouse for integrative bioinformatics. BMC bioinformatics, 6, 34. Shannon, P. et al., 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research, 13(11), 2498-504. Shannon, P.T. et al., 2006. The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC bioinformatics, 7, 176.

Shen, H. & Chou, K., 2007. Signal-3L: A 3-layer approach for predicting signal peptides. Biochemical and Biophysical Research Communications, 363(2), 297-303. Shendure, J. & Ji, H., 2008. Next-generation DNA sequencing. Nat Biotech, 26(10), 1135-1145. Sherman, B.T. et al., 2007. DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC bioinformatics, 8, 426. Sherry, S.T. et al., 2001. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res, 29(1), 308-311. Shifman, M.A. et al., 2007. YPED: a web-accessible database system for protein

expression analysis. Journal of Proteome Research, 6(10), 4019-4024. Shingara, J. et al., 2005. An optimized isolation and labeling platform for accurate microRNA expression profiling. RNA, 11(9), 1461-1470. de Silanes, I.L. et al., 2004. Identification of a target RNA motif for RNA-binding protein HuR. Proceedings of the National Academy of Sciences, 101(9), 2987-2992. Smedley, D. et al., 2009. BioMart--biological queries made easy. BMC Genomics, 10, 22. Sonenberg, N. & Hinnebusch, A.G., 2009. Regulation of Translation Initiation in Eukaryotes: Mechanisms and Biological Targets. Cell, 136(4), 731-745. Sprenger, J. et al., 2008. LOCATE: a mammalian protein subcellular localization database. Nucleic Acids Research, 36(Database issue), D230-233. Sterner, R. et al., 1987. Cell cycle-dependent changes in conformation and composition of nucleosomes containing human histone gene sequences. Nucleic Acids Research, 15(11), 4375-91. Su, A.I. et al., 2004. A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America, 101(16), 6062-6067. Surh, Y., 2003. Cancer chemoprevention with dietary phytochemicals. Nature Reviews. Cancer, 3(10), 768-780. Suswam, E.A. et al., 2005. IL-1beta induces stabilization of IL-8 mRNA in malignant breast cancer cells via the 3' untranslated region: Involvement of divergent RNA-binding factors HuR, KSRP and TIAR. International journal of cancer. Journal international du cancer, 113(6), 911-9. Teive, H.A.G. & Munhoz, R.P., 2009. Karl-Axel Ekbom and iron deficiency in restless legs syndrome. Movement Disorders, 9999(9999), NA. Tenenbaum, S.A. et al., 2000. Identifying mRNA subsets in messenger ribonucleoprotein complexes by using cDNA arrays. Proceedings of the National Academy of Sciences of the United States of America, 97(26), 14085-90. Tenenbaum, S.A. et al., 2002.
Ribonomics: identifying mRNA subsets in mRNP complexes using antibodies to RNA-binding proteins and genomic arrays. Methods (San Diego, Calif.), 26(2), 191-8. Teufel, A. et al., 2006. Current bioinformatics tools in genomic biomedical research (Review). International journal of molecular medicine, 17(6), 967-73.

The UniProt Consortium, 2008. The Universal Protein Resource (UniProt). Nucl. Acids Res., 36(suppl_1), D190-195. Tiruchinapalli, D.M., Caron, M.G. & Keene, J.D., 2008. Activity-dependent expression of ELAV/Hu RBPs and neuronal mRNAs in seizure and cocaine brain. Journal of Neurochemistry, 107(6), 1529-43. Tran, H., Maurer, F. & Nagamine, Y., 2003. Stabilization of urokinase and urokinase receptor mRNAs by HuR is linked to its cytoplasmic accumulation induced by activated mitogen-activated protein kinase-activated protein kinase 2. Molecular and cellular biology, 23(20), 7177-88. Tsai, J. et al., 2001. RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biology, 2(11), SOFTWARE0002. Vidarsson, H. et al., 2009. The Forkhead Transcription Factor Foxi1 Is a Master Regulator of Vacuolar H+-ATPase Proton Pump Subunits in the Inner Ear, Kidney and Epididymis. PLoS ONE, 4(2), e4471. Visel, A. et al., 2007. VISTA Enhancer Browser--a database of tissue-specific human enhancers. Nucleic Acids Research, 35(Database issue), D88-92. Wakaguri, H. et al., 2008. DBTSS: database of transcription start sites, progress report 2008. Nucleic acids research, 36(Database issue), D97-101. Wang, J.G. et al., 2006. LFA-1-dependent HuR nuclear export and cytokine mRNA stabilization in T cell activation. Journal of immunology (Baltimore, Md. : 1950), 176(4), 2105-13. Wang, W. et al., 2002. AMP-Activated Kinase Regulates Cytoplasmic HuR. Molecular and Cellular Biology, 22(10), 3425–3436. Wang, W. et al., 2000. HuR Regulates p21 mRNA Stabilization by UV Light. Mol. Cell. Biol., 20(3), 760-769. Wang, W. et al., 2001. Loss of HuR Is Linked to Reduced Expression of Proliferative Genes during Replicative Senescence. Mol. Cell. Biol., 21(17), 5889-5898. Wang, W. et al., 2003. Increased AMP:ATP ratio and AMP-activated protein kinase activity during cellular senescence linked to reduced HuR function. 
The Journal of biological chemistry, 278(29), 27016-23. Wang, Z.F. et al., 1996. The protein that binds the 3' end of histone mRNA: a novel RNA-binding protein required for histone pre-mRNA processing. Genes &

Development, 10(23), 3028-3040. Wang, Z., Gerstein, M. & Snyder, M., 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews. Genetics, 10(1), 57-63. Waterston, R.H. et al., 2002. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915), 520-62. Weber, M.A. et al., 2009. The frequency and significance of alveolar haemosiderin-laden macrophages in sudden infant death. Forensic Science International, 187(1-3), 51-57. Wein, G. et al., 2003. The 3'-UTR of the mRNA coding for the major protein kinase C substrate MARCKS contains a novel CU-rich element interacting with the mRNA stabilizing factors HuD and HuR. European Journal of Biochemistry / FEBS, 270(2), 350-65. Weinberg, E.D., 2006. Iron loading: a risk factor for osteoporosis. Biometals: An International Journal on the Role of Metal Ions in Biology, Biochemistry, and Medicine, 19(6), 633-635. Willis, R.C. & Hogue, C.W.V., 2006. Searching, viewing, and visualizing data in the Biomolecular Interaction Network Database (BIND). Current Protocols in Bioinformatics / Editoral Board, Andreas D. Baxevanis ... [et Al, Chapter 8, Unit 8.9. Wingender, E. et al., 2000. TRANSFAC: an integrated system for gene expression regulation. Nucleic acids research, 28(1), 316-9. Wingender, E. et al., 1996. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic acids research, 24(1), 238-41. Wolfe, S.A., Wert, J.V. & Grimes, S.R., 2006. Transcription factor RFX2 is abundant in rat testis and enriched in nuclei of primary spermatocytes where it appears to be required for transcription of the testis-specific histone H1t gene. Journal of Cellular Biochemistry, 99(3), 735-746. Wolozin, B. & Golts, N., 2002. Iron and Parkinson's disease. The Neuroscientist: A Review Journal Bringing Neurobiology, Neurology and Psychiatry, 8(1), 22-32. Xu, Y.Z. et al., 2005. RNA-binding protein HuR is required for stabilization of SLC11A1 mRNA and SLC11A1 protein expression. 
Molecular and cellular biology, 25(18), 8139-49. Yaman, I. et al., 2003. The zipper model of translational control: a small upstream ORF is

the switch that controls structural remodeling of an mRNA leader. Cell, 113(4), 519-31. Yasuda, S. et al., 2004. Interaction between 3' untranslated region of calcitonin receptor messenger ribonucleic acid (RNA) and adenylate/uridylate (AU)-rich element binding proteins (AU-rich RNA-binding factor 1 and Hu antigen R). Endocrinology, 145(4), 1730-8. Yeap, B.B. et al., 2002. Novel binding of HuR and poly(C)-binding protein to a conserved UC-rich motif within the 3'-untranslated region of the androgen receptor messenger RNA. The Journal of biological chemistry, 277(30), 27183-92. Agarwal, P. & Griffith, A., 2008. Restless legs syndrome: a unique case and essentials of diagnosis and treatment. Medscape Journal of Medicine, 10(12), 296. Agilent Technologies, 2009. Agilent Feature Extraction Software (v10.5) Reference Guide. Available at: http://www.chem.agilent.com/Library/usermanuals/Public/G446090020_FE_10.5_Reference.pdf [Accessed April 5, 2009]. Akool, E. et al., 2003. Nitric Oxide Increases the Decay of Matrix Metalloproteinase 9 mRNA by Inhibiting the Expression of mRNA-Stabilizing Factor HuR. Mol. Cell. Biol., 23(14), 4901-4916. Al-Shahrour, F. et al., 2007. FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Research, 35(Web Server issue), W91-96. Al-Shahrour, F. et al., 2006. BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments. Nucleic acids research, 34(Web Server issue), W472-6. Amoreira, C., Hindermann, W. & Grunau, C., 2003. An improved version of the DNA Methylation database (MethDB). Nucleic Acids Research, 31(1), 75-77. Atasoy, U. et al., 1998. ELAV protein HuA (HuR) can redistribute between nucleus and cytoplasm and is upregulated during serum stimulation and T cell activation. Journal of cell science, 111 ( Pt 21), 3145-56. Atasoy, U. et al., 2003.
Regulation of eotaxin gene expression by TNF-alpha and IL-4 through mRNA stabilization: involvement of the RNA-binding protein HuR. Journal of immunology (Baltimore, Md. : 1950), 171(8), 4369-78. Bader, G.D. et al., 2001. BIND--The Biomolecular Interaction Network Database. Nucleic acids research, 29(1), 242-5.

Bain, L.J. & Engelhardt, M., 2000. Introduction to Probability and Mathematical Statistics 2nd ed., Duxbury Press. Bakheet, T., Williams, B.R.G. & Khabar, K.S.A., 2006. ARED 3.0: the large and diverse AU-rich transcriptome. Nucleic acids research, 34(Database issue), D111-4. Baroni, T.E. et al., 2008. Advances in RIP-chip analysis: RNA-binding protein immunoprecipitation-microarray profiling. Methods in Molecular Biology (Clifton, N.J.), 419, 93-108. Barreau, C., Paillard, L. & Osborne, H.B., 2006. AU-rich elements and associated factors: are there unifying principles? Nucl. Acids Res., 33(22), 7138-7150. Barrett, T. et al., 2007. NCBI GEO: mining tens of millions of expression profiles-database and tools update. Nucleic Acids Research, 35(Database issue), D760-765. Batts, K.P., 2007. Iron overload syndromes and the liver. Mod Pathol, 20(1s), S31-S39. Becker, K.G. et al., 2004. The genetic association database. Nature genetics, 36(5), 431-2. Berman, H., Henrick, K. & Nakamura, H., 2003. Announcing the worldwide Protein Data Bank. Nat Struct Mol Biol, 10(12), 980. Bertone, P. et al., 2004. Global identification of human transcribed sequences with genome tiling arrays. Science (New York, N.Y.), 306(5705), 2242-6. Birney, E., 2003. Ensembl: a genome infrastructure. Cold Spring Harbor symposia on quantitative biology, 68, 213-5. Birney, E. et al., 2004. An overview of Ensembl. Genome research, 14(5), 925-8. Blankenberg, D. et al., 2007. A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly. Genome research, 17(6), 960-4. Blaxall, B.C. et al., 2000. Purification and Characterization of beta-Adrenergic Receptor mRNA-binding Proteins. J. Biol. Chem., 275(6), 4290-4297. Boguski, M.S., 2004. ENCODE and ChIP-chip in the genome era. Genomics, 83(3), 347-8. Briata, P. et al., 2003. The Wnt/β-Catenin→Pitx2 Pathway Controls the Turnover of Pitx2 and Other Unstable mRNAs. Molecular Cell, 12(5), 1201-1211.

Bruford, E.A. et al., 2008. The HGNC Database in 2008: a resource for the human genome. Nucl. Acids Res., 36(suppl_1), D445-448. Bryne, J.C. et al., 2008. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Research, 36(Database issue), D102-106. Burset, M., Seledtsov, I.A. & Solovyev, V.V., 2001. SpliceDB: database of canonical and non-canonical mammalian splice sites. Nucleic Acids Research, 29(1), 255-259. Cairo, G. & Recalcati, S., 2007. Iron-regulatory proteins: molecular biology and pathophysiological implications. Expert Reviews in Molecular Medicine, 9(33), 1-13. Chen, C.Y. & Shyu, A.B., 1994. Selective degradation of early-response-gene mRNAs: functional analyses of sequence features of the AU-rich elements. Molecular and cellular biology, 14(12), 8471-82. Chen, C.A., Xu, N. & Shyu, A., 2002. Highly selective actions of HuR in antagonizing AU-rich element-mediated mRNA destabilization. Molecular and cellular biology, 22(20), 7268-78. Chen, N., 2004. Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics / Editoral Board, Andreas D. Baxevanis ... [et Al, Chapter 4, Unit 4.10. Choo, K., Tan, T. & Ranganathan, S., 2005. SPdb - a signal peptide database. BMC Bioinformatics, 6(1), 249. Chung, S. et al., 1997. The Elav-like Proteins Bind to a Conserved Regulatory Element in the 3'-Untranslated Region of GAP-43 mRNA. J. Biol. Chem., 272(10), 6593-6598. Collard, K.J., 2009. Iron Homeostasis in the Neonate. Pediatrics, 123(4), 1208-1216. Coon, J.J. et al., 2005. Tandem mass spectrometry for peptide and protein sequence analysis. BioTechniques, 38(4), 519, 521, 523. Crick, F.H., 1958. On Protein Synthesis. In The Symposia of the Society for Experimental Biology. pp. 138-163. Available at: http://profiles.nlm.nih.gov/SC/B/B/Z/Y/ [Accessed April 13, 2009]. Czaja, A.J., 2005. Autoantibodies in autoimmune liver disease.
Advances in Clinical Chemistry, 40, 127-64.

Dalma-Weiszhausz, D.D. et al., 2006. The Affymetrix GeneChip platform: an overview. Methods in enzymology, 410, 3-28. Dean, J.L. et al., 2001. The 3' untranslated region of tumor necrosis factor alpha mRNA is a target of the mRNA-stabilizing factor HuR. Molecular and cellular biology, 21(3), 721-30. Demeter, J. et al., 2007. The Stanford Microarray Database: implementation of new analysis tools and open source release of software. Nucleic Acids Research, 35(Database issue), D766-770. Dominski, Z. & Marzluff, W.F., 1999. Formation of the 3' end of histone mRNA. Gene, 239(1), 1-14. Dormoy-Raclet, V. et al., 2007. The RNA-binding protein HuR promotes cell migration and cell invasion by stabilizing the beta-actin mRNA in a U-rich-element-dependent manner. Molecular and cellular biology, 27(15), 5365-80. Down, T.A. & Hubbard, T.J.P., 2002. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Research, 12(3), 458-461. Eddy, S.R., 2006. Computational analysis of RNAs. Cold Spring Harbor Symposia on Quantitative Biology, 71, 117-28. Fisher, R.A., 1922. On the Interpretation of χ2 from Contingency Tables, and the Calculation of P. Journal of the Royal Statistical Society, 85(1), 87-94. Fix, O.K. & Kowdley, K.V., 2008. Hereditary hemochromatosis. Minerva Medica, 99(6), 605-617. Frazer, K.A. et al., 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449(7164), 851-861. Giardine, B. et al., 2005. Galaxy: a platform for interactive large-scale genome analysis. Genome research, 15(10), 1451-5. van der Giessen, K. et al., 2003. RNAi-mediated HuR Depletion Leads to the Inhibition of Muscle Cell Differentiation. J. Biol. Chem., 278(47), 47119-47128. Goldberg-Cohen, I., Furneaux, H. & Levy, A.P., 2002. A 40-bp RNA element that mediates stabilization of vascular endothelial growth factor mRNA by HuR. The Journal of biological chemistry, 277(16), 13635-40.

Griffiths-Jones, S., 2006. miRBase: the microRNA sequence database. Methods in molecular biology (Clifton, N.J.), 342, 129-38. Griffiths-Jones, S. et al., 2003. Rfam: an RNA family database. Nucleic acids research, 31(1), 439-41. Griffiths-Jones, S. et al., 2006. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic acids research, 34(Database issue), D140-4. Griffiths-Jones, S. et al., 2005. Rfam: annotating non-coding RNAs in complete genomes. Nucleic acids research, 33(Database issue), D121-4. Guo, T. et al., 2004. DBSubLoc: database of protein subcellular localization. Nucleic Acids Research, 32(Database issue), D122-124. Guo, X. & Hartley, R.S., 2006. HuR contributes to cyclin E1 deregulation in MCF-7 breast cancer cells. Cancer Research, 66(16), 7948-56. Haeussler, J. et al., 2000. Tumor antigen HuR binds specifically to one of five protein-binding segments in the 3'-untranslated region of the neurofibromin messenger RNA. Biochemical and Biophysical Research Communications, 267(3), 726-32. Hamosh, A. et al., 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res, 33(Database issue), D514-D517. Harris, M.A. et al., 2004. The Gene Ontology (GO) database and informatics resource. Nucleic acids research, 32(Database issue), D258-61. Harris, T.D. et al., 2008. Single-molecule DNA sequencing of a viral genome. Science (New York, N.Y.), 320(5872), 106-109. Hayakawa, J. et al., 2004. Identification of promoters bound by c-Jun/ATF2 during rapid large-scale gene activation following genotoxic stress. Molecular Cell, 16(4), 521-35. Heinemeyer, T. et al., 1998. Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic acids research, 26(1), 362-7. Hershko, C. et al., 2004. Purging iron from the heart. British Journal of Haematology, 125(5), 545-551. Hong, P. & Wong, W.H., 2005. GeneNotes--a novel information management software for biologists. BMC Bioinformatics, 6, 20.

Hsu, F. et al., 2006. The UCSC Known Genes. Bioinformatics (Oxford, England), 22(9), 1036-46. Hu, Q. et al., 2006. Inhibition of CBF/NF-Y mediated transcription activation arrests cells at G2/M phase and suppresses expression of genes activated at G2/M phase of the cell cycle. Nucl. Acids Res., 34(21), 6272-6285. Hu, Z. et al., 2004. VisANT: an online visualization and analysis tool for biological interaction data. BMC bioinformatics, 5, 17. Huang, D.W. et al., 2007. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic acids research, 35(Web Server issue), W169-75. Hulo, N. et al., 2008. The 20 years of PROSITE. Nucleic Acids Research, 36(Database issue), D245-249. Hutchison, C.A., 2007. DNA sequencing: bench to bedside and beyond. Nucleic acids research, 35(18), 6227-37. John, B. et al., 2004. Human MicroRNA targets. PLoS biology, 2(11), e363. Jurka, J. et al., 2005. Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research, 110(1-4), 462-467. Kanehisa, M. & Goto, S., 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1), 27-30. Karolchik, D. et al., 2003. The UCSC Genome Browser Database. Nucleic acids research, 31(1), 51-4. Kauffman, S.A., 1993. The Origins of Order: Self-Organization and Selection in Evolution, New York: Oxford University Press. Keene, J.D., 2001. Ribonucleoprotein infrastructure regulating the flow of genetic information between the genome and the proteome. Proceedings of the National Academy of Sciences of the United States of America, 98(13), 7018-24. Keene, J.D. & Tenenbaum, S.A., 2002. Eukaryotic mRNPs may represent posttranscriptional operons. Molecular cell, 9(6), 1161-7. Kersey, P. & Apweiler, R., 2006. Linking publication, gene and protein data. Nat Cell Biol, 8(11), 1183-1189. Kielbasa, S.M., Gonze, D. & Herzel, H., 2005. Measuring similarities between

transcription factor binding sites. BMC Bioinformatics, 6, 237. Kim, S. et al., 2006. MODi: a powerful and convenient web server for identifying multiple post-translational peptide modifications from tandem mass spectra. Nucleic Acids Research, 34(Web Server issue), W258-263. Knudsen, S., 2004. Guide to Analysis of DNA Microarray Data 2nd ed., Hoboken, N.J: Wiley-Liss. Kondo, Y. & Issa, J.J., 2004. Epigenetic changes in colorectal cancer. Cancer Metastasis Reviews, 23(1-2), 29-39. Konishi, K. & Issa, J.J., 2007. Targeting aberrant chromatin structure in colorectal carcinomas. Cancer Journal (Sudbury, Mass.), 13(1), 49-55. Krek, A. et al., 2005. Combinatorial microRNA target predictions. Nat Genet, 37(5), 495-500. Kryukov, G.V. et al., 2003. Characterization of mammalian selenoproteomes. Science (New York, N.Y.), 300(5624), 1439-1443. Kullmann, M. et al., 2002. ELAV/Hu proteins inhibit p27 translation via an IRES element in the p27 5'UTR. Genes & Development, 16(23), 3087-99. Kurien, B.T. & Scofield, R.H., 2006. Autoantibody determination in the diagnosis of systemic lupus erythematosus. Scandinavian Journal of Immunology, 64(3), 227-35. Lafon, I. et al., 1998. Developmental expression of AUF1 and HuR, two c-myc mRNA binding proteins. Oncogene, 16(26), 3413-21. Lal, A. et al., 2004. Concurrent versus individual binding of HuR and AUF1 to common labile target mRNAs. The EMBO journal, 23(15), 3092-102. Lee, J.Y. et al., 2007. PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic Acids Research, 35(Database issue), D165-168. Lee, T. et al., 2006. dbPTM: an information repository of protein post-translational modification. Nucleic Acids Research, 34(Database issue), D622-627. Lehne, B. & Schlitt, T., 2009. Protein-protein interaction databases: keeping up with growing interactomes. Human Genomics, 3(3), 291-297. Lemay, J. et al., 2008.
HuR interacts with human immunodeficiency virus type 1 reverse transcriptase, and modulates reverse transcription in infected cells. Retrovirology,

5(1), 47. Lescure, A. et al., 2002. Protein factors mediating selenoprotein synthesis. Current Protein & Peptide Science, 3(1), 143-151. Letunic, I., Doerks, T. & Bork, P., 2009. SMART 6: recent updates and new developments. Nucleic Acids Research, 37(Database issue), D229-232. Lewis, B.P., Burge, C.B. & Bartel, D.P., 2005. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell, 120(1), 15-20. Li, B. et al., 2002. Ovol2, a Mammalian Homolog of Drosophila ovo: Gene Structure, Chromosomal Mapping, and Aberrant Expression in Blind-Sterile Mice. Genomics, 80(3), 319-325. Lin, F. et al., 2006. The role of human antigen R, an RNA-binding protein, in mediating the stabilization of toll-like receptor 4 mRNA induced by endotoxin: a novel mechanism involved in vascular inflammation. Arteriosclerosis, thrombosis, and vascular biology, 26(12), 2622-9. Liu, X. et al., 2008. TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics, 9, 271. Loflin, P. & Lever, J.E., 2001. HuR binds a cyclic nucleotide-dependent, stabilizing domain in the 3' untranslated region of Na(+)/glucose cotransporter (SGLT1) mRNA. FEBS Letters, 509(2), 267-71. Lopez de Silanes, I. et al., 2003. Role of the RNA-binding protein HuR in colon carcinogenesis. Oncogene, 22(46), 7146-54. Lu, J. & Holmgren, A., 2009. Selenoproteins. The Journal of Biological Chemistry, 284(2), 723-727. Ma, W.J. et al., 1996. Cloning and characterization of HuR, a ubiquitously expressed Elav-like protein. The Journal of Biological Chemistry, 271(14), 8144-51. Maglott, D. et al., 2007. Entrez Gene: gene-centered information at NCBI. Nucleic acids research, 35(Database issue), D26-31. Manjithaya, R.R. & Dighe, R.R., 2004. The 3' Untranslated Region of Bovine Follicle-Stimulating Hormone beta Messenger RNA Downregulates Reporter Expression: Involvement of AU-Rich Elements and Transfactors. Biol Reprod, 71(4), 1158-1166.

Manohar, C.F. et al., 2002. HuD, a Neuronal-specific RNA-binding Protein, Increases the in Vivo Stability of MYCN RNA. J. Biol. Chem., 277(3), 1967-1973. Marzluff, W.F., Wagner, E.J. & Duronio, R.J., 2008. Metabolism and regulation of canonical histone mRNAs: life without a poly(A) tail. Nat Rev Genet, 9(11), 843-854. Mazan-Mamczarz, K. et al., 2008. Post-transcriptional gene regulation by HuR promotes a more tumorigenic phenotype. Oncogene, 27(47), 6151-63. Meisner, N. et al., 2004. mRNA openers and closers: modulating AU-rich element-controlled mRNA stability by a molecular switch in mRNA secondary structure. Chembiochem: A European Journal of Chemical Biology, 5(10), 1432-47. Meng, Z. et al., 2005. The ELAV RNA-stability factor HuR binds the 5'-untranslated region of the human IGF-IR transcript and differentially represses cap-dependent and IRES-mediated translation. Nucleic acids research, 33(9), 2962-79. Mi, H. et al., 2005. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Database issue), D284-288. Mignone, F. et al., 2005. UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic acids research, 33(Database issue), D141-6. Morris, J.A. et al., 2009. Sudden infant death syndrome and cardiac arrhythmias. Future Cardiology, 5(2), 201-207. Muckenthaler, M.U., Galy, B. & Hentze, M.W., 2008. Systemic iron homeostasis and the iron-responsive element/iron-regulatory protein (IRE/IRP) regulatory network. Annual Review of Nutrition, 28, 197-213. Nabors, L.B. et al., 2001. HuR, a RNA Stability Factor, Is Expressed in Malignant Brain Tumors and Binds to Adenine- and Uridine-rich Elements within the 3' Untranslated Regions of Cytokine and Angiogenic Factor mRNAs. Cancer Res, 61(5), 2154-2161. Opdal, S.H. & Rognum, T.O., 2004. The sudden infant death syndrome gene: does it exist? Pediatrics, 114(4), e506-512. Palin, K., Taipale, J. & Ukkonen, E., 2006.
Locating potential enhancer elements by comparative genomics using the EEL software. Nature Protocols, 1(1), 368-374. Pandolfo, M. & Pastore, A., 2009. The pathogenesis of Friedreich ataxia and the structure and function of frataxin. Journal of Neurology, 256, 9-17. 169

Pandur, E. et al., 2009. α-1 Antitrypsin binds preprohepcidin intracellularly and prohepcidin in the serum. FEBS Journal, 276(7), 2012-2021. Parkinson, H. et al., 2007. ArrayExpress--a public database of microarray experiments and gene expression profiles. Nucleic Acids Research, 35(Database issue), D747750. Pedersen, J.S. et al., 2006. Identification and Classification of Conserved RNA Secondary Structures in the Human Genome. PLoS Computational Biology, 2(4), e33. Penalva, L.O., Tenenbaum, S.A. & Keene, J.D., 2004. Gene expression analysis of messenger RNP complexes. Methods in molecular biology (Clifton, N.J.), 257, 125-34. Peng, S.S. et al., 1998. RNA stabilization by the AU-rich element binding protein, HuR, an ELAV protein. The EMBO journal, 17(12), 3461-70. Pesole, G. & Liuni, S., 1999. Internet resources for the functional analysis of 5' and 3' untranslated regions of eukaryotic mRNAs. Trends in Genetics: TIG, 15(9), 378. Pesole, G. et al., 2000. UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Nucleic acids research, 28(1), 193-6. Prechtel, A.T. et al., 2006a. Expression of CD83 is regulated by HuR via a novel cisactive coding region RNA element. The Journal of Biological Chemistry, 281(16), 10912-25. Prechtel, A.T. et al., 2006b. Expression of CD83 Is Regulated by HuR via a Novel cisActive Coding Region RNA Element. J. Biol. Chem., 281(16), 10912-10925. Pruitt, K.D., Tatusova, T. & Maglott, D.R., 2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res, 35(Database issue), D61-D65. Raghavan, A. et al., 2001. HuA and Tristetraprolin Are Induced following T Cell Activation and Display Distinct but Overlapping RNA Binding Specificities. J. Biol. Chem., 276(51), 47958-47965. Ramsay, G., 1998. DNA chips: State-of-the art. Nat Biotech, 16(1), 40-44. Reeves, M.A. & Hoffmann, P.R., 2009. 
The human selenoproteome: recent insights into functions and regulation. Cellular and Molecular Life Sciences: CMLS, 66(15), 2457-2478. 170

Rhodes, D.R. et al., 2007. Oncomine 3.0: Genes, Pathways, and Networks in a Collection of 18,000 Cancer Gene Expression Profiles. Neoplasia (New York, N.Y.), 9(2). Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi? tool=pubmed&pubmedid=17356713 [Accessed February 29, 2008]. Rodriguez-Pascual, F. et al., 2000. Complex Contribution of the 3'-Untranslated Region to the Expressional Regulation of the Human Inducible Nitric-oxide Synthase Gene. INVOLVEMENT OF THE RNA-BINDING PROTEIN HuR. J. Biol. Chem., 275(34), 26040-26049. Rouault, T.A., 2006. The role of iron regulatory proteins in mammalian iron homeostasis and disease. Nature Chemical Biology, 2(8), 406-414. Sakai, K. et al., 2003. Binding of the ELAV-like protein in murine autoimmune T-cells to the nonameric AU-rich element in the 3 untranslated region of CD154 mRNA. Molecular Immunology, 39(14), 879-883. Salwinski, L. et al., 2004. The Database of Interacting Proteins: 2004 update. Nucleic Acids Research, 32(Database issue), D449-451. Sammut, S.J., Finn, R.D. & Bateman, A., 2008. Pfam 10 years on: 10,000 families and still growing. Briefings in Bioinformatics, 9(3), 210-219. Sayers, E.W. et al., 2009. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 37(Database issue), D5-15. Seko, Y. et al., 2004. Selective cytoplasmic translocation of HuR and site specific binding to the IL-2 mRNA are not sufficient for CD28-mediated stabilization of the mRNA. J. Biol. Chem., M312306200. Sellick, G.S. et al., 2004. Genomewide linkage searches for Mendelian disease loci can be efficiently conducted using high-density SNP genotyping arrays. Nucleic Acids Research, 32(20), e164. Shah, S.P. et al., 2005. Atlas - a data warehouse for integrative bioinformatics. BMC bioinformatics, 6, 34. Shannon, P. et al., 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research, 13(11), 2498-504. Shannon, P.T. et al., 2006. 
The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC bioinformatics, 7, 176.

Shen, H. & Chou, K., 2007. Signal-3L: A 3-layer approach for predicting signal peptides. Biochemical an 171

Communications, 363(2), 297-303. Shendure, J. & Ji, H., 2008. Next-generation DNA sequencing. Nat Biotech, 26(10), 1135-1145. Sherman, B.T. et al., 2007. DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC bioinformatics, 8, 426. Sherry, S.T. et al., 2001. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res, 29(1), 308-311. Shifman, M.A. et al., 2007. YPED: a web-accessible database system for protein expression analysis. Journal of Proteome Research, 6(10), 4019-4024. Shingara, J. et al., 2005. An optimized isolation and labeling platform for accurate microRNA expression profiling. RNA, 11(9), 1461-1470. de Silanes, I.L. et al., 2004. Identification of a target RNA motif for RNA-binding protein HuR. Proceedings of the National Academy of Sciences, 101(9), 2987-2992. Smedley, D. et al., 2009. BioMart--biological queries made easy. BMC Genomics, 10, 22. Sonenberg, N. & Hinnebusch, A.G., 2009. Regulation of Translation Initiation in Eukaryotes: Mechanisms and Biological Targets. Cell, 136(4), 731-745. Sprenger, J. et al., 2008. LOCATE: a mammalian protein subcellular localization database. Nucleic Acids Research, 36(Database issue), D230-233. Sterner, R. et al., 1987. Cell cycle-dependent changes in conformation and composition of nucleosomes containing human histone gene sequences. Nucleic Acids Research, 15(11), 4375-91. Su, A.I. et al., 2004. A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America, 101(16), 6062-6067. Surh, Y., 2003. Cancer chemoprevention with dietary phytochemicals. Nature Reviews. Cancer, 3(10), 768-780. Suswam, E.A. et al., 2005. IL-1beta induces stabilization of IL-8 mRNA in malignant breast cancer cells via the 3' untranslated region: Involvement of divergent RNAbinding factors HuR, KSRP and TIAR. International journal of cancer. 
Journal international du cancer, 113(6), 911-9. 172

Teive, H.A.G. & Munhoz, R.P., 2009. Karl-Axel Ekbom and iron deficiency in restless legs syndrome. Movement Disorders, 9999(9999), NA. Tenenbaum, S.A. et al., 2000. Identifying mRNA subsets in messenger ribonucleoprotein complexes by using cDNA arrays. Proceedings of the National Academy of Sciences of the United States of America, 97(26), 14085-90. Teufel, A. et al., 2006. Current bioinformatics tools in genomic biomedical research (Review). International journal of molecular medicine, 17(6), 967-73. The UniProt Consortium, 2008. The Universal Protein Resource (UniProt). Nucl. Acids Res., 36(suppl_1), D190-195. Tiruchinapalli, D.M., Caron, M.G. & Keene, J.D., 2008. Activity-dependent expression of ELAV/Hu RBPs and neuronal mRNAs in seizure and cocaine brain. Journal of Neurochemistry, 107(6), 1529-43. Tran, H., Maurer, F. & Nagamine, Y., 2003. Stabilization of urokinase and urokinase receptor mRNAs by HuR is linked to its cytoplasmic accumulation induced by activated mitogen-activated protein kinase-activated protein kinase 2. Molecular and cellular biology, 23(20), 7177-88. Tsai, J. et al., 2001. RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biology, 2(11), SOFTWARE0002. Vidarsson, H. et al., 2009. The Forkhead Transcription Factor Foxi1 Is a Master Regulator of Vacuolar H+-ATPase Proton Pump Subunits in the Inner Ear, Kidney and Epididymis. PLoS ONE, 4(2), e4471. Visel, A. et al., 2007. VISTA Enhancer Browser--a database of tissue-specific human enhancers. Nucleic Acids Research, 35(Database issue), D88-92. Wakaguri, H. et al., 2008. DBTSS: database of transcription start sites, progress report 2008. Nucleic acids research, 36(Database issue), D97-101. Wang, J.G. et al., 2006. LFA-1-dependent HuR nuclear export and cytokine mRNA stabilization in T cell activation. Journal of immunology (Baltimore, Md. : 1950), 176(4), 2105-13. Wang, W. et al., 2002. AMP-Activated Kinase Regulates Cytoplasmic HuR. 
Molecular and Cellular Biology, 22(10), 3425–3436. Wang, W. et al., 2000. HuR Regulates p21 mRNA Stabilization by UV Light. Mol. Cell. Biol., 20(3), 760-769. 173

Wang, W. et al., 2001. Loss of HuR Is Linked to Reduced Expression of Proliferative Genes during Replicative Senescence. Mol. Cell. Biol., 21(17), 5889-5898. Wang, W. et al., 2003. Increased AMP:ATP ratio and AMP-activated protein kinase activity during cellular senescence linked to reduced HuR function. The Journal of biological chemistry, 278(29), 27016-23. Wang, Z.F. et al., 1996. The protein that binds the 3' end of histone mRNA: a novel RNAbinding protein required for histone pre-mRNA processing. Genes & Development, 10(23), 3028-3040. Wang, Z., Gerstein, M. & Snyder, M., 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews. Genetics, 10(1), 57-63. Waterston, R.H. et al., 2002. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915), 520-62. Weber, M.A. et al., 2009. The frequency and significance of alveolar haemosiderin-laden macrophages in sudden infant death. Forensic Science International, 187(1-3), 51-57. Wein, G. et al., 2003. The 3'-UTR of the mRNA coding for the major protein kinase C substrate MARCKS contains a novel CU-rich element interacting with the mRNA stabilizing factors HuD and HuR. European Journal of Biochemistry / FEBS, 270(2), 350-65. Weinberg, E.D., 2006. Iron loading: a risk factor for osteoporosis. Biometals: An International Journal on the Role of Metal Ions in Biology, Biochemistry, and Medicine, 19(6), 633-635. Willis, R.C. & Hogue, C.W.V., 2006. Searching, viewing, and visualizing data in the Biomolecular Interaction Network Database (BIND). Current Protocols in Bioinformatics / Editoral Board, Andreas D. Baxevanis ... [et Al, Chapter 8, Unit 8.9. Wingender, E. et al., 2000. TRANSFAC: an integrated system for gene expression regulation. Nucleic acids research, 28(1), 316-9. Wingender, E. et al., 1996. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic acids research, 24(1), 238-41. Wolfe, S.A., Wert, J.V. & Grimes, S.R., 2006. 
Transcription factor RFX2 is abundant in rat testis and enriched in nuclei of primary spermatocytes where it appears to be required for transcription of the testis-specific histone H1t gene. Journal of Cellular Biochemistry, 99(3), 735-746. 174

Wolozin, B. & Golts, N., 2002. Iron and Parkinson's disease. The Neuroscientist: A Review Journal Bringing Neurobiology, Neurology and Psychiatry, 8(1), 22-32. Xu, Y.Z. et al., 2005. RNA-binding protein HuR is required for stabilization of SLC11A1 mRNA and SLC11A1 protein expression. Molecular and cellular biology, 25(18), 8139-49. Yaman, I. et al., 2003. The zipper model of translational control: a small upstream ORF is the switch that controls structural remodeling of an mRNA leader. Cell, 113(4), 519-31. Yasuda, S. et al., 2004. Interaction between 3' untranslated region of calcitonin receptor messenger ribonucleic acid (RNA) and adenylate/uridylate (AU)-rich element binding proteins (AU-rich RNA-binding factor 1 and Hu antigen R). Endocrinology, 145(4), 1730-8. Yeap, B.B. et al., 2002. Novel binding of HuR and poly(C)-binding protein to a conserved UC-rich motif within the 3'-untranslated region of the androgen receptor messenger RNA. The Journal of biological chemistry, 277(30), 2718392. Zanzoni, A. et al., 2002. MINT: a Molecular INTeraction database. FEBS Letters, 513(1), 135-140.

175

7. Appendix I: Inferring Transcriptional Regulatory Networks from Gene Expression Data

*Note: In the following section, the thesis author was responsible for all of the microarray data analysis and derivation of transcriptional regulatory networks. Alejandro Adam was primarily responsible for generating the initial microarray data and confirming the hypotheses suggested by the analysis.

The following section is excerpted from previously published work: Adam, A.P. et al., 2009. Computational identification of a p38SAPK-regulated transcription factor network required for tumor cell quiescence. Cancer Research, 69(14), 5664-5672.


Computational identification of a p38SAPK-regulated transcription factor network required for tumor cell quiescence.

Alejandro P. Adam2,3,#, Ajish George1,2,#, Denis Schewe1,2,5, Paloma Bragado1, Bibiana V. Iglesias2,4, Aparna C. Ranganathan1,2, Antonis Kourtidis2, Douglas S. Conklin2 and Julio A. Aguirre-Ghiso1,2*.

1Division of Hematology and Oncology, Department of Medicine and Department of Otolaryngology, Mount Sinai School of Medicine, New York, NY 10029.

2Department of Biomedical Sciences, School of Public Health and Center for Excellence in Cancer Genomics, University at Albany, State University of New York, One Discovery Drive, Rensselaer, NY 12144-3456.

#These authors contributed equally to this work.

Running title: p38 and tumor cell quiescence

Key words: p38, FoxM1, BHLHB3, c-Jun, p53, quiescence.

*Correspondence to: Julio A. Aguirre-Ghiso, Division of Hematology and Oncology, Department of Medicine and Department of Otolaryngology, Mount Sinai School of Medicine, New York, NY, 10029. Phone: 212-241-9582 Fax: 1212-426-4390 Box: 1079 E-mail: [email protected]

3,4,5Present address: 3Center for Cardiovascular Sciences and 4Center for Immunology & Microbial Disease, Albany Medical College, Albany, NY 12208. 5Kinderklinik und Kinderpoliklinik, Dr. von Haunersches Kinderspital, Ludwig-Maximilians-Universität, Lindwurmstr. 2, 80337 München

# of words: 5311


Abstract

The stress-activated kinase p38 is an important player in development and disease. In particular, it has been proposed as a key determinant of tumor suppression during oncogene-induced transformation. While p38 activation can also suppress the growth of advanced cancer cells, the programs regulating this process are unknown. This is significant because, although it is known that tumor cells can undergo genetic and/or epigenetic changes and exhibit non-malignant behavior, the mechanisms remain poorly understood. Using computational tools we have identified a transcription factor network regulated by p38α/β that is required for human squamous carcinoma cell quiescence in vivo. We found that p38 transcriptionally regulates a core network of 46 genes that includes 16 transcription factor nodes. Perturbation using genetic or pharmacological inhibitors showed that the identified transcription factor network was highly predictive of the contribution of each gene to the quiescent phenotype. Activation of p38 induced the expression of the endogenous p53 and BHLHB3 genes while inhibiting the expression of the c-Jun and FoxM1 genes, and induction of p53 by p38 was dependent on the downregulation of c-Jun. Accordingly, while RNAi downregulation of BHLHB3 or p53 interrupted tumor cell quiescence, downregulation of c-Jun or FoxM1 or overexpression of BHLHB3 in malignant cells mimicked the onset of quiescence. Our results reveal, at a systems level, novel genetic determinants regulating p38-induced tumor cell quiescence. These findings identify several components of the regulatory mechanisms driving cancer cell dormancy.


Introduction

The stress-activated protein kinase p38 is an evolutionarily conserved pathway involved in transducing inflammatory and stress signals (1). In mammals there are four p38 isoforms encoded by different genes (α, β, γ and δ) (1). The best functionally characterized isoforms are p38α and p38β (1). Deletion of p38 is embryonic lethal in mice due to problems with placental development. In adult tissues, p38 isoforms mediate inflammatory signals, post-transcriptional gene expression and apoptotic signals important for tissue homeostasis (1). Activation of p38 is required for suppression of oncogene-induced transformation and tumorigenesis (1). These functions depend in part on p53, the cellular senescence pathway and the regulation of genes important for G1-S and G2-M checkpoints (1-3). Several tumors overexpress the p38 phosphatase WIP1, suggesting that downregulation of p38 is important in allowing tumor progression (2). Further, conditional deletion of p38 in lungs and livers favors unscheduled proliferation of progenitor cells and tumor formation through enhanced JNK and c-Jun signaling in mice (4, 5). Although increased p38 activity may be advantageous for tumor cells in highly advanced cancers (6), in some instances tumor cells may still be susceptible to the negative regulation of p38 (7-11). For instance, MKK4 mediates suppression of metastases in ovarian cancer via activation of p38 (10).

When metastatic squamous carcinoma HEp3 cells (T-HEp3) are adapted to the in vitro microenvironment for >40 generations, these cells re-program and lose their tumorigenic and metastatic potential in vivo (9, 12). Loss of malignancy is not due to enhanced apoptosis, which is less than 5% in these tumors, but due to a G0-G1 arrest that leads to the acquisition of a dormant phenotype upon re-inoculation in vivo (13). This phenotype is also not due to selection of pre-existent “dormancy predisposed” clones, as it happens in multiple clones isolated from HEp3 tumors at almost 100% cloning efficiency (12). Furthermore, the dormant phenotype is transient because after ~11 weeks in vivo all individual clones or polyclonal tumor cell populations resumed growth (12). The cells from these tumors, when placed in vitro, can once more adopt a dormant behavior, suggesting that the mechanisms driving this reversible dormancy behavior might be epigenetic in nature (12). This phenotypic shift can be activated repeatedly just by changing the microenvironment (12). Thus, any genetic drift that occurs in vitro and/or in vivo is not dominant over epigenetic regulation or is selected against, because the phenotypes remain reversible and dependent on the cell growth environment.

We hypothesized that if tumor cell dormancy does not result from the appearance of rare clones, then an epigenetic reprogramming might be responsible for the phenotypic shift. Exploration of this hypothesis revealed that loss of malignancy of HEp3 cells is due to the activation of p38, which inhibits ERK signaling, causing a downregulation of the urokinase receptor (uPAR) and reduced trans-activation of EGFR (13). Genetic or pharmacologic inhibition of p38 was sufficient to restore ERK activation, uPAR expression and proliferation of these cells in vivo (8, 9). In addition to HEp3 cells, activation of p38 was also predictive of proliferation vs. growth arrest in breast, prostate and fibrosarcoma cancer cell lines (7). Further, independent studies showed that even in malignant cervical carcinoma HeLa cells, p38 activation suppresses their malignant phenotype (11). However, the global changes in gene expression responsible for such phenotypic outcomes were unknown.

Here we have combined gene expression profiling with a powerful bioinformatics analysis to provide a systems view of p38 and the transcription factor (TF) network signaling it regulates in dormant HEp3 cells. We identified a putative TF network using data from conventional gene array platforms and, through in vivo validation, found that it was predictive of gene function in dormant HEp3 cells. Using this approach we discovered that p38-induced tumor cell quiescence is controlled in part by the regulation of TFs including R213Q mutant p53, BHLHB3, Jun and FoxM1. These results provide insight into the mechanisms by which stress signaling might reprogram tumor cells to acquire a quiescence program and may help identify genetic determinants of cancer cell dormancy.


Materials and methods

Reagents, siRNAs and antibodies.
SB203580, SB202190 and PD98059 were from Calbiochem (Beverly, MA). Doxorubicin and DMSO were from Sigma (St. Louis, MO). Rabbit polyclonal anti-p53 and anti-Bax antibodies were from Cell Signaling (Danvers, MA); rabbit polyclonal anti-c-Jun (H-79) and FoxM1 (C-20) and mouse monoclonal anti-phospho-Erk (E-4) were from Santa Cruz Biotechnology (Santa Cruz, CA). Anti-Erk, anti-HA, anti-V5, anti-phospho-p38 and anti-p38 monoclonal antibodies were from BD Biosciences (San Jose, CA). Horseradish peroxidase (HRP)-conjugated anti-mouse IgG antibody was from Vector Laboratories (Burlingame, CA). HRP-conjugated anti-rabbit IgG antibody was from Chemicon International (Temecula, CA). siRNAs to p38 were from Ambion (p38-1, Austin, TX) and New England Biolabs (p38-2, Ipswich, MA); siRNAs to p38 and DEC2 were from Santa Cruz Biotechnology (Santa Cruz, CA); siRNA to c-Jun was from Dharmacon (Chicago, IL); custom FoxM1 siRNA was purchased from Dharmacon (sense sequence: CCUUUCCCUGCACGACAUGUU); control and GAPDH siRNAs were from Ambion (Austin, TX).

Cell lines and Xenograft studies.
Tumorigenic (T-HEp3) and "spontaneous" dormant (D-HEp3) human epidermoid carcinoma HEp3 cells (8), D-HEp3-neo, and D-HEp3-p38 DN cell lines were described previously (9). For reagents, antibodies and siRNAs please see supplementary materials and methods. Tumor growth on chick embryo CAMs or Balb/c nude mice has been described previously (8). All animal experiments were approved by IACUC (SUNY-Albany and MSSM).

Microarray Analysis and TF Network Mapping

A total of 20 Affymetrix Hgu133a gene chips were run with the following samples: 4 SB203580 treatments (5μM, 48h) and 4 DMSO controls; 3 SB202190 treatments (5μM, 48h) and 3 DMSO controls; 3 DNp38 expressing D-HEp3 cells and 3 empty vector (Neo) expressing D-HEp3 cells. Raw data were obtained in the .CEL file format after processing by the GCOS array reader. The batch of .CEL files was imported into a Bioconductor session using the affy package (S-Fig 3A). The batch was background-corrected, normalized, and summarized using both the gcrma and mas5 functions in parallel. Linear models of differential expression were fit to each of the transformed datasets and contrasts were estimated between each set of p38-inhibited samples and its corresponding set of controls. P-values of the F-statistic (8) were used to gauge the likelihood that a gene was significantly differentially expressed (S-Fig 3A). The distributions of the F-statistic P-value as determined by the classifyTestsF function of the limma package (8) were examined and a cutoff of significance was established at the inflection point of 0.05 occurring in both data sets. Genes significant in both data sets at this level were marked as a low-stringency group. The response of each gene in each contrast (SB203580, SB202190, and DNp38) was then calculated using the same P-value cutoff of 0.05, and a response matrix with values -1 (repressed by p38), 0 (no change), and +1 (induced by p38) was created. Response matrices for the gcrma and mas5 datasets were merged to show only responses common to both. Genes significantly changing in all three treatments according to both methods were marked as a high-stringency group. Tables of the differentially expressed genes in each of these lists were generated, including probe-ids, expression and significance statistics along with other descriptive annotations (S-Fig 3A).
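The response-matrix step described above can be illustrated with a minimal Python sketch. It is not the R/Bioconductor session used in this work. The gene names, P-values and log fold changes are invented for illustration, and the function names are hypothetical. The sketch assumes each contrast is computed as p38-inhibited minus control, so a gene that rises on inhibition is scored as repressed by p38 (-1). It also assumes the set definitions of the TF Network Mapping section (low stringency: responds in any treatment; high stringency: responds in all three).

```python
# Illustrative sketch only: toy statistics, not data from the study.
P_CUTOFF = 0.05
CONTRASTS = ("SB203580", "SB202190", "DNp38")


def response(p_value, log_fc, cutoff=P_CUTOFF):
    """Score one gene in one contrast as -1, 0 or +1.

    Assumption: log_fc is (p38-inhibited minus control), so a positive
    value means the gene goes up when p38 is blocked, i.e. p38 represses it.
    """
    if p_value >= cutoff or log_fc == 0:
        return 0
    return -1 if log_fc > 0 else 1


def response_row(stats):
    """Turn [(p_value, log_fc), ...] for the three contrasts into a row."""
    return [response(p, fc) for p, fc in stats]


def merge_rows(row_a, row_b):
    """Keep a response only where both preprocessing methods agree."""
    return [a if a == b else 0 for a, b in zip(row_a, row_b)]


def stringency_sets(merged):
    """Low stringency: any contrast responds. High stringency: all respond."""
    low = {g for g, row in merged.items() if any(r != 0 for r in row)}
    high = {g for g, row in merged.items() if all(r != 0 for r in row)}
    return low, high


# Hypothetical (p_value, log_fc) pairs per contrast, one table per method.
gcrma_stats = {
    "geneA": [(0.01, 1.2), (0.02, 0.8), (0.001, 1.5)],
    "geneB": [(0.01, -1.0), (0.30, -0.2), (0.04, -0.7)],
    "geneC": [(0.20, 0.1), (0.60, 0.0), (0.90, -0.1)],
}
mas5_stats = {
    "geneA": [(0.03, 1.0), (0.01, 0.9), (0.002, 1.1)],
    "geneB": [(0.02, -0.8), (0.40, -0.1), (0.20, -0.5)],
    "geneC": [(0.50, 0.2), (0.70, 0.1), (0.80, 0.0)],
}

merged = {
    gene: merge_rows(response_row(gcrma_stats[gene]), response_row(mas5_stats[gene]))
    for gene in gcrma_stats
}
low, high = stringency_sets(merged)

for gene in sorted(merged):
    print(gene, dict(zip(CONTRASTS, merged[gene])))
print("low-stringency:", sorted(low))
print("high-stringency:", sorted(high))
```

Requiring agreement between the two preprocessing methods makes the merged matrix conservative: a call made by only one method is reset to 0 rather than kept.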

TF Network Mapping

The genes responding in the pharmacological and genetic treatments were condensed into two sets: a low-stringency set, which moved significantly in any of the three treatments; and a high-stringency set, which moved significantly in all three treatments. A database of known and predicted interactions between transcription factors and target gene promoters was developed using the TRANSFAC database, the MATCH tool (8), and a database of gene promoters derived from LocusLink (8). Predicted interactions were derived by running MATCH with all known TRANSFAC binding site matrices against the regions from 2000 bp upstream to 500 bp downstream of the transcription start site of the LocusLink promoters (8). The resulting MATCH scores (ranging between 0 and 1) passing the default thresholds were recorded for each combination of factor and gene. Known interactions between factors and genes were compiled directly from TRANSFAC and assigned a score of 10 (8). Mappings from Bioconductor's hgu133a package were used to map from LocusLink identifiers to Affymetrix probeset identifiers so that subsequent analysis of the significant gene-sets could be carried out as follows. For each combination of a transcription factor T and a gene G present on the array: 1) first, the scores for all known or potential binding sites for T in the promoter of G were summed up as the co-regulation score; 2) next, the non-parametric Spearman correlation between T and G across all samples was taken as the co-expression score; 3) and finally, the co-regulation and co-expression scores were multiplied together to obtain a combined score whose magnitude reflects the strength of both the co-regulation and co-expression scores and whose sign reflects correlation or anti-correlation (and by extension transcriptional induction or repression) (S-Fig 3). This inferred transcriptional regulation network was examined across different thresholds of the combined score for both the high- and low-stringency responder gene sets, and including or excluding transcription factors that were not in the responsive gene sets. A combined score threshold of 0.75, restriction of genes to the high-stringency set, and restriction of TFs to the low-stringency set were used to keep the networks interpretable. The R analysis session and workflow used here are made openly available. The algorithm for building the TF network is available upon request.
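The three scoring steps above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the algorithm as implemented for this work. The expression profiles, binding-site scores and gene names are invented. The helper names are hypothetical. The threshold of 0.75 is assumed to apply to the absolute value of the combined score, which the text does not state explicitly.

```python
from math import sqrt

THRESHOLD = 0.75  # assumption: applied to the absolute combined score


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: undefined correlation, treated as 0 here
    return cov / sqrt(var_x * var_y)


def combined_score(site_scores, tf_expr, gene_expr):
    """Co-regulation score (sum of site scores) times co-expression score."""
    co_regulation = sum(site_scores)  # predicted sites in [0, 1]; known = 10
    co_expression = spearman(tf_expr, gene_expr)
    return co_regulation * co_expression


# Hypothetical expression of one TF and four candidate targets in 6 samples.
tf_profile = [5.1, 5.4, 6.0, 6.8, 7.2, 7.9]
targets = {
    "geneX": {"sites": [0.80, 0.90], "expr": [4.0, 4.3, 4.9, 5.5, 6.1, 6.6]},
    "geneY": {"sites": [10.0], "expr": [9.0, 8.1, 7.7, 6.2, 5.5, 5.0]},
    "geneZ": {"sites": [], "expr": [4.0, 4.3, 4.9, 5.5, 6.1, 6.6]},
    "geneW": {"sites": [0.85], "expr": [4.0, 6.0, 4.5, 6.5, 4.2, 6.2]},
}

scores = {
    gene: combined_score(info["sites"], tf_profile, info["expr"])
    for gene, info in targets.items()
}
edges = {gene: s for gene, s in scores.items() if abs(s) >= THRESHOLD}

for gene, s in sorted(edges.items()):
    kind = "induction" if s > 0 else "repression"
    print(f"TF -> {gene}: combined score {s:.2f} ({kind})")
```

Because the two scores are multiplied, a gene with no binding sites for the factor can never produce an edge however well its expression correlates, and a weakly correlated gene needs strong binding evidence to pass the threshold.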

RNA interference studies

Retroviral delivery of shRNAs targeting the genes of interest (Open Biosystems, Huntsville, AL), or of firefly luciferase or an empty vector as controls, and development of stable cell lines were done as previously described (14). Transfections of siRNAs targeting the desired sequences or, as controls, an siRNA targeting GAPDH or a scrambled siRNA, were performed as previously described (14). Cells were either used for in vivo experiments or lysed 24-72 h later for immunoblotting and/or qPCR analysis, as described previously (14).


RT-PCR

For RT-PCR experiments, cells were lysed in Trizol reagent (Invitrogen, Carlsbad, CA) and total RNA was purified as instructed by the manufacturer. Two micrograms of total RNA were then reverse-transcribed using MMuLV RT (NEB, Ipswich, MA). The cDNA was then amplified by standard PCR using Taq DNA polymerase (NEB, Ipswich, MA). Primers were purchased from IDT (Coralville, IA). Primer sequences used were the following:

GAPDH-F, CGTCATGGGTGTGAACCATGAG; GAPDH-R, GTAGACGGCAGGTCAGGTCCA;
p38-F, TGCATAATGGCCGAGCTGTTGACTGG; p38-R, AAGGGCTTGGGCCGCTGTAATTCTCT;
p38-F, GCCTGGAGATTGAGCAGTGAGGTG; p38-R, GACACTTGTGCCCAGACTCCTACAC;
p38-F, AGCTGAAGATCCTGGACTTCGGCC; p38-R, GGGAGGCCCTTCATGTAGTTCTTGG;
p38-F, AGCAGCCGTTTGATGATTCCTTAGAAC; p38-R, TTTGGTAGTGACAAATACTGGTCCTTG;
BHLHB3-F, ACGGAGGTTCAAGCAGAGTGAGAA; BHLHB3-R, TCAGCCACAGAACAGACCCTTCTT;
p53-F, GCCCCTCCTCAGCATCTTATCCG; p53-R, TCCCAGGACAGGCACAAACACGC;
cJun-F, TTAACAGTGGGTGCCAACTCATGCTAACGC; cJun-R, GAGATCGAATGTTAGGTCCATGCAGTTCTTG;
FoxM1-F, GCAGCAGGCTGCACTATCAACAAT; FoxM1-R, TTCCCTGGTCCTGCAGAAGAAAGA.

Luciferase reporter assays

In vitro and in vivo dual luciferase assays were performed using the Dual Luciferase Reporter Kit (Promega, Madison, WI) following the vendor’s instructions as described previously (7).

Statistical analysis

In vivo assays were evaluated by the non-parametric Mann-Whitney test or by the Kruskal-Wallis test followed by Dunn’s multiple comparison test. Luciferase assays were evaluated by ANOVA followed by Bonferroni multiple comparison tests. A p value of less than 0.05 was considered statistically significant. These tests were performed using GraphPad Prism 5.0.
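For readers unfamiliar with the rank-based tests named above, the following Python sketch computes the Mann-Whitney U statistic for two groups. It is only an illustration of the statistic itself. The analyses in this work were run in GraphPad Prism, the cell counts below are invented, and the function name is hypothetical. Turning U into a p value (exact or by normal approximation) is left to a statistics package.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) using average ranks for tied observations."""
    # Pool both groups, remembering which group each value came from.
    pooled = [(v, 0) for v in sample_a] + [(v, 1) for v in sample_b]
    pooled.sort(key=lambda pair: pair[0])

    # Assign 1-based ranks, averaging within each run of tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1

    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical cells-per-nodule counts (arbitrary units) for two groups.
control = [12, 15, 11, 14]
knockdown = [70, 85, 66, 90]
u_control, u_knockdown = mann_whitney_u(control, knockdown)
print("U (control):", u_control, "U (knockdown):", u_knockdown)
```

A U of 0 for one group means every observation in that group ranks below every observation in the other, which is the most extreme separation the test can detect for the given sample sizes.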

Results

Identification of a putative TF network embedded in the p38SAPK-regulated gene expression profile.

We previously showed that pharmacologic or genetic inhibition of p38α/β signaling with SB203580 or dominant negative p38 (DNp38) caused a reversion of quiescence in vivo (7-9). Similar results were obtained with siRNAs or shRNAs targeting p38 (see S-Fig 1 and supplementary results). Further, activation of p38 signaling in T-HEp3 cells inhibited T-HEp3 cell proliferation in vivo, mimicking the induction of dormancy (S-Fig 1). Thus, we used this model of forced inhibition of p38 to reveal the gene program driving D-HEp3 tumor cell quiescence in vivo. To uncover the p38-regulated growth arrest program we used Affymetrix gene array profiling. We compared gene expression in cells treated with or without the p38 pharmacological inhibitors SB203580 or SB202190 (5 μM) for 48 h in the absence of FBS and, as a genetic approach, in D-HEp3 cells stably expressing an empty neomycin-resistance vector (Neo) or DNp38 (9) (Fig 1A and S-results). The genes changing in expression in these treatments were grouped into two sets: a low-stringency set, which changed significantly in any of the three treatments (15); and a high-stringency set, in which gene expression changed significantly in all three treatments (see S-results). To identify TFs responsible for the p38-regulated growth arrest program, we looked in both the high- and low-stringency sets for TFs whose activities might explain the observed expression changes (16). For this we used a database in which associations between TFs and their target gene promoters are represented numerically with interaction scores (see S-methods, Fig 1 and S-Fig 3). For each combination of a TF, T, and gene, G, present on the array, we asked the following questions. First, were there known or potential binding sites for T in the promoter of G? If so, we summed the scores for these to use as a “co-regulation score” (S-Fig 3). Next, we asked if the expression change of T correlated with that of G. We took the non-parametric Spearman correlation between T and G across all samples as our “co-expression score” (S-Fig 3).

Finally, we multiplied the co-regulation and co-expression scores together to obtain a combined score. The magnitude of the combined score reflects the strength of the scores for both co-regulation (i.e. if there are no binding sites for T upstream of G, the coregulation score and the product overall score will be zero) and co-expression (i.e. low correlations will be weighted proportionally lower). The sign of the combined score reflects correlation or anti-correlation, which by extension corresponds to potential transcriptional induction or repression (see Methods Suppl.-Methods and S-Fig 3). We used a combined score threshold of 0.75, restricting genes to the high-stringency set, and restricting TFs to the low-stringency set to keep the networks manageable. The low-stringency set (S-Fig 4) revealed an association network containing 129 genes. Of these 43 represent known TFs (boxes), the remainder correspond to other nonTF genes (circles) (Fig 1B and S-Fig 4). In the high-stringency set (Benjamini-Hochberg corrected test p6 fold more cells per tumor nodule than the tumors from control cells (Fig 2C). The same cells inoculated in nude mice showed a statistically significant shortening in the dormancy period for D-HEp3 cells with BHLHB3 knock down (Fig 2C). In agreement, over-expression of BHLHB3 in T-HEp3 was able to significantly inhibit their proliferation (Fig 2D). This effect could be further enhanced by co-expression of an MKK6 active mutant that activates p38(Fig 2D) (7). We conclude that BHLHB3 is a novel negative regulator of tumor growth functionally linked to p38 signaling. Downregulation of FoxM1 or c-Jun inhibits T-HEp3 growth in vivo. We next tested whether FoxM1 and c-Jun down-regulation by p38 may be linked to D-HEp3 cell dormancy (Fig 1). RT- and qPCR analysis revealed that c-Jun expression was higher in T-HEp3 than D-HEp3 cells (Fig 3A). 
Further, p38 inhibition resulted in c-Jun upregulation only in D-HEp3 cells, while Mek1/2 inhibition downregulated c-Jun expression in both cell types (Fig 3A). siRNA-mediated inhibition of c-Jun expression (Fig 3B) did not affect BHLHB3 and FoxM1 mRNA levels (data not shown), suggesting that these genes are not downstream of c-Jun. However, c-Jun downregulation by RNAi (si c-Jun-I) stimulated p53 expression in T-HEp3 cells and, to some extent, in D-HEp3 cells (Fig 3B). Thus, our data support that p38-mediated upregulation of the p53 transcript is at least in part mediated by a p38-dependent inhibition of c-Jun transcript levels. In addition, RNAi to c-Jun was sufficient to cause a strong inhibition of T-HEp3 proliferation in vivo (Fig 3B). This correlated with reduced levels of the proliferation marker phospho-histone H3 in siRNA-treated tumors, but not with increased apoptosis as measured by cleaved caspase-3 staining (Fig 3C). FoxM1 expression was repressed by p38, as SB203580 treatment upregulated FoxM1 mRNA in D-HEp3 cells (Fig 3D). RNAi-mediated downregulation of FoxM1 was detected by RT-PCR and qPCR (Fig 3D) and was also very efficient in inhibiting T-HEp3 tumor growth in vivo (Fig 3D). Reduced tumor growth was also associated with reduced phospho-H3 levels in these tumors. However, inhibition of FoxM1 expression induced apoptosis by ~2-fold in T-HEp3 cells (Fig 3C). We conclude that both FoxM1 and c-Jun are important for promoting T-HEp3 tumor growth (and survival in the case of FoxM1).

Opposing regulation of p53mutR213Q by ERK and p38 contributes to D-HEp3 tumor cell quiescence.

We found that p38 induces p53 mRNA levels in quiescent D-HEp3 cells (Fig 1). In agreement, p53 transcript levels are lower in tumor vs. normal head and neck cancer tissue (22) (S-Fig 5A). However, in ~50% of head and neck squamous cell carcinoma (HNSCC) tumors p53 is mutated (23). Thus, we sequenced p53 in HEp3 cells (both T- and D-HEp3 cells) and found a mutation comprising an Arg to Gln substitution at codon 213 (R213Q) (Fig 4A), which is present in oral cancers (24). Sequencing data in S-Fig 2C showed that the G->A substitution (CGA to CAA) appears to be complete, suggesting that both alleles are mutated. Although mutations or post-translational modifications can inactivate p53, we tested whether p53mutR213Q was functional in HEp3 cells and whether transcriptional downregulation of p53mutR213Q may be an additional mechanism to overcome p53 inhibitory effects in HNSCC. We confirmed that p53 mRNA and protein levels were upregulated in D-HEp3 vs. T-HEp3 cells (Fig 4A). Basal luciferase reporter expression driven by a p53-binding element was 6-10-fold higher in D-HEp3 than in T-HEp3 cells (S-Fig 2D), suggesting that the p53 protein is functional despite the R213Q mutation and that protein abundance is an important contributor to the difference in p53 activity between T- and D-HEp3 cells. SB203580 (10 μM) treatment for 48 hrs inhibited p53 mRNA, protein expression and activity (Fig 4A and S-Fig 2D) in D-HEp3 cells. Also, Mek1/2 inhibition caused a dramatic upregulation of p53 mRNA and activity in both cell types (Fig 4A and S-Fig 2D). In addition, phosphorylation of p53 at Ser15 after modulation of ERK or p38 activity and protein turnover following proteasome inhibition were unaffected in T-HEp3 and D-HEp3 cells (data not shown). Endogenous p53 protein level and activity could be induced by doxorubicin in T-HEp3 and D-HEp3 cells, albeit to a lesser extent in the latter (S-Fig 2D). However, the downstream target Bax was induced poorly in T-HEp3 cells and did not change in D-HEp3 cells, which have high basal p53 and Bax expression (S-Fig 2D). The marginal Bax induction is in agreement with the abrogated capacity of the R213Q p53 mutant to induce apoptotic genes (24). Finally, p53 activity was significantly higher in D-HEp3 cells than in T-HEp3 cells after three days in vivo. This suggests that higher p53mutR213Q expression and activity in D-HEp3 cells persist upon in vivo growth arrest (Fig 4B). This implies that this mutant p53 may allow HNSCC cells to uncouple growth arrest from apoptosis in response to stress.

We next tested whether p53 is required for p38-induced quiescence by downregulating p53 using different shRNAs or siRNAs. The shRNAs showed a consistent and specific downregulation of p53 protein of ~50-60% (Fig 4C) compared to an shRNA to luciferase. siRNAs to p53 targeting a different region of the mRNA generated an almost complete reduction in mRNA and protein that was observed after 48 hrs by qPCR and Western blot, respectively (Fig 4D). In all instances downregulation of p53 was sufficient to allow these cells to resume proliferation in vivo (Fig 4C&D). These results further support the predictive value of the TF network and suggest that p53mutR213Q is in part required for p38-induced quiescence in D-HEp3 cells.

Discussion

We set out to identify the gene expression program responsible for p38-induced tumor cell quiescence and the TFs executing this program. To achieve this goal we used computational tools that (i) revealed TF networks, (ii) were predictive of gene function and (iii) were of a complexity level addressable experimentally. Our TF network analysis identified p38 co-regulated targets that were functionally linked to the quiescent phenotype of D-HEp3 cells. Further, we revealed clues about the activating or repressing functions of several TFs (present in the TRANSFAC database) on numerous gene promoters. We also included TFs that did not show a statistically significant change in expression (white boxes) but that remained correlated with changes in target gene expression, as these may be regulated at the activity level (e.g., ATF2 and MyoD1 are phosphorylated by p38 (25-27)).

Our analysis identified TFs potentially responsible for the quiescent phenotype of D-HEp3 cells that could be validated in vivo. For example, inhibiting p53 and BHLHB3 resulted in a phenotype similar to that of p38 inhibition. BHLHB3 and p53 appear to have similar abilities to suppress growth despite the apparent difference in the number of predicted downstream targets in the network. This also suggests that the number of connected genes may not directly indicate the importance of a TF in regulating a phenotype. Although not weighed statistically, this “connectivity” difference may be due to the fact that the p38 regulation profiles were obtained from cells in vitro, where D-HEp3 cells proliferate normally. For instance, out of 22 genes expressed in D-HEp3 cells that have p53 binding elements, only Apex1, a DNA repair enzyme, and the tumor suppressor APC were correlated significantly. It was noticeable that for c-Jun and FoxM1, or Nr2F1 and BHLHB3, although clearly co-regulated by p38, there was no single common regulating TF. Further identification of p38 target genes in vivo, where D-HEp3 cells fully growth arrest (28), and/or temporal studies will help identify additional p53 targets and upstream regulators of these TFs.

We found that activation of p38 during D-HEp3 cell quiescence required downregulation of c-Jun, consistent with the role of c-Jun during the G1-S transition (29) and its observed upregulation in tumor vs. normal HNSCC tissue (S-Fig 5C). This is supported by the finding that knockdown of c-Jun in T-HEp3 cells inhibited proliferation and tumor growth. Accordingly, recent data showed that p38α-deficient mouse fetal livers strongly upregulated c-Jun, which resulted in accelerated liver cancer development (5). Similar results were observed in mice lacking p38 in the lung (4). In our studies, the upregulation of p53 by p38 was dependent on the downregulation of c-Jun, as its knockdown was sufficient to enhance p53 expression and inhibit T-HEp3 tumor growth (Fig 5B). The antagonistic effect of c-Jun on p53 was shown to be linked to p38 signaling in liver regeneration (30), a process in which the ERK/p38 ratio regulates hepatocyte exit from quiescence (31), and in mouse liver cancer development (32). Further, c-Jun -/- MEFs have a proliferative defect mostly attributed to the transcriptional upregulation of p53 (33). This was dependent on c-Jun-mediated transcriptional repression of p53 from the PF-1 site on the p53 promoter (33).

Similarly, blocking FoxM1 in T-HEp3 cells inhibited their growth in vivo. However, this was linked not only to reduced proliferation but also to a 2-fold increase in apoptosis. This suggests that in D-HEp3 cells, downregulation of FoxM1 and the potential loss of pro-survival signaling might be compensated by other survival pathways. These may include the recently described ATF6-Rheb-mTOR pathway that regulates survival of D-HEp3 cells in vivo (34). FoxM1 is a known regulator of the G1-S and G2-M transitions located at the 12p13 locus, which is usually amplified in HNSCC (35). Further, the Oncomine database shows that the FoxM1 transcript is upregulated in HNSCC vs. normal tissues (36) (S-Fig 5D), and if expressed at high levels in primary breast tumors it might be a poor prognosis indicator (37, 38). Our results suggest that reprogramming of D-HEp3 cells by p38 requires decreased FoxM1 and c-Jun expression to enter prolonged G0-G1 arrest in vivo.

Transcriptional regulation of p53 was important for p38-induced D-HEp3 cell quiescence. This was interesting, since p53 is mutated in ~50% of squamous carcinomas (39, 40), and T- and D-HEp3 cells carry an R213Q mutation. We show that quiescence induction is still functional in p53mutR213Q and that it is regulated by p38 (1, 41). Our data diverged from other studies on post-translational regulation of p53 (1, 41), in that ERK and p38 had opposing roles in the regulation of p53 at the transcript level (see also supplementary results). The Oncomine data support that transcriptional downregulation of p53 occurs in patients (22) (S-Fig 5A). This suggests that spontaneous reprogramming and quiescence in this type of cancer might occur upon upregulation of p53 even in patients displaying the p53mutR213Q. These results also suggest that this mutation might predict dormant disease, as it may eliminate p53 pro-apoptotic functions (24) and allow tumor cells to enter quiescence but survive stress imposed by therapy or new microenvironments. Mutants in codon 213 fail to bind the p53 binding proteins ASPP1/2, which stimulate p53 to trans-activate apoptosis but not growth arrest genes (42, 43). It is possible that p53mutR213Q may have lost this function. In some cases p38 induces pro-survival signals while inducing growth arrest (44). However, even the growth inhibitory function of p38 might be lost, as some squamous carcinoma cell lines form tumors and take advantage of p38α and p38δ for survival (6).

Another required component that appears to operate in parallel to p53 is BHLHB3. This transcriptional repressor is very important for p38-induced D-HEp3 cell quiescence and we are currently dissecting its mechanism of action. Analysis of the Oncomine database revealed that BHLHB3 is expressed at higher levels in tumor vs. normal HNSCC tissues (S-Fig 5E). However, high expression of BHLHB3 in primary breast cancer tumors is a good prognosis indicator. Thus, the clinical relevance of BHLHB3 expression might vary between tumor types.

We showed that the ERK/p38 ratio is predictive of dormancy/quiescence of fibrosarcoma, breast, prostate and squamous carcinoma cell lines (7-9), and similar results were observed with cervical carcinoma HeLa cells (11). Our new findings show a transcriptional program regulated by the ERK/p38 ratio. How these two pathways converge to oppositely regulate TF function is unknown. However, a recent report (45) showed that PIASx, a co-activator of Elk1, can repress this TF's transcriptional activity in response to p38 activation, while the opposite response holds true for ERK signaling (45).

Our studies reveal a previously unrecognized network of TFs regulated by p38 and required for the induction of tumor cell quiescence. Understanding how these TFs contribute to p38-induced growth arrest is important because inhibiting and/or reactivating more than one gene may be required to inhibit tumor progression. The strategy presented here may help identify new therapeutic targets or prognostic indicators when applied to large array datasets from patient samples.

Acknowledgements

We thank Drs. Thomas Begley (SUNY-Albany), Ari Melnick (Cornell University) and Yang Zhang (MSSM) for critical reading of the manuscript and help with some control experiments. This work is supported by NIH/National Cancer Institute grant CA109182 (J.A.A-G), the U.S. Army Medical Research Acquisition Activity grant W8IWXH-04-1-0474 (D.S.C) and the Samuel Waxman Cancer Research Foundation Tumor Dormancy Program (J.A.A-G and D.S.C). A.C.R. is a recipient of a Ruth L. Kirschstein National Research Service Award (NIH/NCI) Fellowship. D.M.S. is a recipient of a Dr. Mildred-Scheel postdoctoral grant from the Deutsche Krebshilfe.


References

1. Bulavin DV, Fornace AJ, Jr. p38 MAP kinase's emerging role as a tumor suppressor. Adv Cancer Res 2004;92:95-118.
2. Bulavin DV, Demidov ON, Saito S, et al. Amplification of PPM1D in human tumors abrogates p53 tumor-suppressor activity. Nat Genet 2002;31:210-5.
3. Bulavin DV, Phillips C, Nannenga B, et al. Inactivation of the Wip1 phosphatase inhibits mammary tumorigenesis through p38 MAPK-mediated activation of the p16(Ink4a)-p19(Arf) pathway. Nat Genet 2004;36:343-50.
4. Ventura JJ, Tenbaum S, Perdiguero E, et al. p38alpha MAP kinase is essential in lung stem and progenitor cell proliferation and differentiation. Nat Genet 2007;39:750-8.
5. Hui L, Bakiri L, Mairhorfer A, et al. p38alpha suppresses normal and cancer cell proliferation by antagonizing the JNK-c-Jun pathway. Nat Genet 2007;39:741-9.
6. Junttila MR, Ala-Aho R, Jokilehto T, et al. p38alpha and p38delta mitogen-activated protein kinase isoforms regulate invasion and growth of head and neck squamous carcinoma cells. Oncogene 2007;26:5267-79.
7. Aguirre-Ghiso JA, Estrada Y, Liu D, Ossowski L. ERK(MAPK) activity as a determinant of tumor growth and dormancy; regulation by p38(SAPK). Cancer Res 2003;63:1684-95.
8. Aguirre-Ghiso JA, Ossowski L, Rosenbaum SK. Green fluorescent protein tagging of extracellular signal-regulated kinase and p38 pathways reveals novel dynamics of pathway activation during primary and metastatic growth. Cancer Res 2004;64:7336-45.
9. Aguirre-Ghiso JA, Liu D, Mignatti A, Kovalski K, Ossowski L. Urokinase receptor and fibronectin regulate the ERK(MAPK) to p38(MAPK) activity ratios that determine carcinoma cell proliferation or dormancy in vivo. Mol Biol Cell 2001;12:863-79.
10. Hickson JA, Huo D, Vander Griend DJ, et al. The p38 kinases MKK4 and MKK6 suppress metastatic colonization in human ovarian carcinoma. Cancer Res 2006;66:2264-70.
11. Timofeev O, Lee TY, Bulavin DV. A subtle change in p38 MAPK activity is sufficient to suppress in vivo tumorigenesis. Cell Cycle 2005;4:118-20.
12. Ossowski L, Reich E. Changes in malignant phenotype of a human carcinoma conditioned by growth environment. Cell 1983;33:323-33.
13. Liu D, Aguirre Ghiso J, Estrada Y, Ossowski L. EGFR is a transducer of the urokinase receptor initiated signal that is required for in vivo growth of a human carcinoma. Cancer Cell 2002;1:445-57.
14. Ranganathan AC, Ojha S, Kourtidis A, Conklin DS, Aguirre-Ghiso JA. Dual function of pancreatic endoplasmic reticulum kinase in tumor cell growth arrest and survival. Cancer Res 2008;68:3260-8.
15. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3:Article3.
16. Tuck DP, Kluger HM, Kluger Y. Characterizing disease states from topological properties of transcriptional regulatory networks. BMC Bioinformatics 2006;7:236.
17. Ferrer-Martinez A, Marotta M, Baldan A, Haro D, Gomez-Foix AM. Chicken ovalbumin upstream promoter-transcription factor I represses the transcriptional activity of the human muscle glycogen phosphorylase promoter in C2C12 cells. Biochim Biophys Acta 2004;1678:157-62.
18. Azmi S, Ozog A, Taneja R. Sharp-1/DEC2 inhibits skeletal muscle differentiation through repression of myogenic transcription factors. J Biol Chem 2004;279:52643-52.
19. Kawamoto T, Noshiro M, Furukawa M, et al. Effects of fasting and re-feeding on the expression of Dec1, Per1, and other clock-related genes. J Biochem (Tokyo) 2006;140:401-8.
20. Rohleder N, Langer C, Maus C, et al. Influence of photoperiodic history on clock genes and the circadian pacemaker in the rat retina. Eur J Neurosci 2006;23:105-11.
21. Falvella FS, Colombo F, Spinola M, et al. BHLHB3: a candidate tumor suppressor in lung cancer. Oncogene 2008;27:3761-4.
22. Toruner GA, Ulger C, Alkan M, et al. Association between gene expression profile and tumor invasion in oral squamous cell carcinoma. Cancer Genet Cytogenet 2004;154:27-35.
23. Gath HJ, Brakenhoff RH. Minimal residual disease in head and neck cancer. Cancer Metastasis Rev 1999;18:109-26.
24. Pan Y, Haines DS. Identification of a tumor-derived p53 mutant with novel transactivating selectivity. Oncogene 2000;19:3095-100.
25. Puri PL, Wu Z, Zhang P, et al. Induction of terminal differentiation by constitutive activation of p38 MAP kinase in human rhabdomyosarcoma cells. Genes Dev 2000;14:574-84.
26. Wu Z, Woodring PJ, Bhakta KS, et al. p38 and extracellular signal-regulated kinases regulate the myogenic program at multiple steps. Mol Cell Biol 2000;20:3951-64.
27. Raingeaud J, Whitmarsh AJ, Barrett T, Derijard B, Davis RJ. MKK3- and MKK6-regulated gene expression is mediated by the p38 mitogen-activated protein kinase signal transduction pathway. Mol Cell Biol 1996;16:1247-55.
28. Aguirre Ghiso JA, Kovalski K, Ossowski L. Tumor dormancy induced by downregulation of urokinase receptor in human carcinoma involves integrin and MAPK signaling. J Cell Biol 1999;147:89-104.
29. Nicolaides NC, Correa I, Casadevall C, et al. The Jun family members, c-Jun and JunD, transactivate the human c-myb promoter via an Ap1-like element. J Biol Chem 1992;267:19665-72.
30. Stepniak E, Ricci R, Eferl R, et al. c-Jun/AP-1 controls liver regeneration by repressing p53/p21 and p38 MAPK activity. Genes Dev 2006;20:2306-14.
31. Carreras MC, Converso DP, Lorenti AS, et al. Mitochondrial nitric oxide synthase drives redox signals for proliferation and quiescence in rat liver development. Hepatology 2004;40:157-66.
32. Eferl R, Ricci R, Kenner L, et al. Liver tumor development. c-Jun antagonizes the proapoptotic activity of p53. Cell 2003;112:181-92.
33. Schreiber M, Kolbus A, Piu F, et al. Control of cell cycle progression by c-Jun is p53 dependent. Genes Dev 1999;13:607-19.
34. Schewe DM, Aguirre-Ghiso JA. ATF6alpha-Rheb-mTOR signaling promotes survival of dormant tumor cells in vivo. Proc Natl Acad Sci U S A 2008;105:10519-24.
35. Laoukili J, Stahl M, Medema RH. FoxM1: at the crossroads of ageing and cancer. Biochim Biophys Acta 2007;1775:92-102.
36. Rhodes DR, Yu J, Shanker K, et al. ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia 2004;6:1-6.
37. van 't Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415:530-6.
38. van de Vijver MJ, He YD, van 't Veer LJ, et al. A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002;347:1999-2009.
39. Boyle JO, Hakim J, Koch W, et al. The incidence of p53 mutations increases with progression of head and neck cancer. Cancer Res 1993;53:4477-80.
40. Lopez-Martinez M, Anzola M, Cuevas N, Aguirre JM, De-Pancorbo M. Clinical applications of the diagnosis of p53 alterations in squamous cell carcinoma of the head and neck. Med Oral 2002;7:108-20.
41. Bulavin DV, Saito S, Hollander MC, et al. Phosphorylation of human p53 by p38 kinase coordinates N-terminal phosphorylation and apoptosis in response to UV radiation. Embo J 1999;18:6845-54.
42. Samuels-Lev Y, O'Connor DJ, Bergamaschi D, et al. ASPP proteins specifically stimulate the apoptotic function of p53. Mol Cell 2001;8:781-94.
43. Thukral SK, Blain GC, Chang KK, Fields S. Distinct residues of human p53 implicated in binding to DNA, simian virus 40 large T antigen, 53BP1, and 53BP2. Mol Cell Biol 1994;14:8315-21.
44. Ranganathan AC, Zhang L, Adam AP, Aguirre-Ghiso JA. Functional Coupling of p38-Induced Up-regulation of BiP and Activation of RNA-Dependent Protein Kinase-Like Endoplasmic Reticulum Kinase to Drug Resistance of Dormant Carcinoma Cells. Cancer Res 2006;66:1702-11.
45. Yang SH, Sharrocks AD. PIASx acts as an Elk-1 coactivator by facilitating derepression. Embo J 2005;24:2161-71.


Figures

Figure 1. A, Venn diagrams of genes changing significantly (91 up, 133 down) in response to each of the three strategies, using both the gcRMA and MAS5 background correction and normalization methods. Significance was established with a p value of the F-statistic at 0.05. The total numbers of genes induced or repressed by p38 are indicated outside the Venn diagram. B, TF network regulated by p38 signaling showing only genes that move significantly up or down in all three of the p38 inhibition experiments and the TFs that potentially control them. BHLHB3, Jun, FoxM1 and Nr2F1 appear as TF nodes potentially regulating the expression changes. TFs are depicted as boxes and target genes other than TFs as circles. The colors indicate whether they were upregulated (red) or downregulated (green) by p38 signaling. Where a gene's promoter contains known or predicted binding sites for a given TF, the binding site scores are summed and multiplied by their expression correlation (Spearman across all samples). An arrow is drawn between the TF and gene where the score exceeds a threshold (here, 0.75). The color of the arrows indicates whether the genes were negatively (orange) or positively (green) co-regulated. The white boxes depict changes in expression that were not statistically significant but were still significantly correlated with the putative target genes. For genes depicted by white boxes, the expression level trend can be inferred from the color of the connecting arrow.
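The way arrows are drawn in panel B can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the figure: the function name `build_edges`, the dictionary layout, the TF and gene labels and all numeric values are hypothetical, and applying the 0.75 threshold to the absolute value of the combined score is an assumption made so that both positive and negative co-regulation can produce an arrow.

```python
def build_edges(coregulation, coexpression, threshold=0.75):
    """Return (tf, gene, combined_score, effect) for every TF-gene pair
    whose combined score passes the threshold.

    coregulation: dict mapping (tf, gene) -> summed binding site score
    coexpression: dict mapping (tf, gene) -> Spearman correlation
    """
    edges = []
    for (tf, gene), reg_score in coregulation.items():
        # A pair with no binding sites has reg_score 0, so its product is 0.
        combined = reg_score * coexpression.get((tf, gene), 0.0)
        # Assumption: the threshold is applied to the magnitude of the score.
        if abs(combined) > threshold:
            effect = "induction" if combined > 0 else "repression"
            edges.append((tf, gene, combined, effect))
    return edges

# Hypothetical scores for two TFs and two target genes.
coregulation = {
    ("TF_A", "GENE_1"): 1.2,  # strong summed binding site score
    ("TF_A", "GENE_2"): 0.0,  # no binding sites, so no arrow is possible
    ("TF_B", "GENE_1"): 0.9,
}
coexpression = {
    ("TF_A", "GENE_1"): 0.8,
    ("TF_A", "GENE_2"): 0.95,
    ("TF_B", "GENE_1"): -0.9,  # anti-correlated expression
}

for tf, gene, score, effect in build_edges(coregulation, coexpression):
    print(f"{tf} -> {gene}: {score:+.2f} ({effect})")
```

In this toy input, the pair with no binding sites is dropped regardless of how well the expression profiles correlate, which mirrors the point that the combined score is zero whenever the co-regulation score is zero.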


Figure 2. BHLHB3 is required for D-HEp3 cell quiescence. A, basal expression of BHLHB3 detected by RT-PCR (left panel) or qPCR (right panel) under the indicated treatments for 24 hrs. PD = 20μM PD98059 Mek1/2 inhibitor; SB = 10μM SB203580 p38 inhibitor; DMSO = dimethyl sulfoxide used as vehicle. B, siRNA-mediated knockdown of BHLHB3 caused a strong decrease in BHLHB3 mRNA as measured by RT-PCR (upper panel) or qPCR (middle panel) and interrupted the quiescence of D-HEp3 cells, stimulating tumor cell proliferation (lower panel). C, shRNA stable knockdown of BHLHB3, measured by RT-PCR (left panel), results in restored proliferation of D-HEp3 cells in vivo (right panel). When inoculated in nude mice (2 x 10^5 cells/mouse), D-HEp3 cells expressing BHLHB3 shRNA have a statistically significant shortening of their dormancy period. Results are representative of at least two independent experiments. D, Overexpression of V5-BHLHB3 inhibits proliferation of T-HEp3 cells on CAMs in vivo (left panel). Right panel, detection of the V5-tag after transfection of T-HEp3 cells with an empty vector or V5-BHLHB3. MKK6 has no tag and its expression has been verified previously (21).


Figure 3. A, Basal c-Jun expression level is lower in D-HEp3 cells than in T-HEp3 cells, is increased by 10μM SB203580 (SB) treatment in D-HEp3 cells and is reduced by 20μM PD98059 (PD) in T-HEp3 cells, as measured by RT-PCR (upper panel) and qPCR (middle panel). B, siRNA-mediated downregulation of c-Jun, detected by Western blot (upper panel), reduced proliferation of T-HEp3 cells after 4 days in vivo (lower panel). Scrambled siRNAs were used as controls. C, Basal (Control DMSO) FoxM1 expression level is lower in D-HEp3 cells than in T-HEp3 cells and can be induced by 10μM SB203580 (SB) treatment in D-HEp3 cells (upper panel). siRNA-mediated knockdown of FoxM1 for 48 hrs, as detected by qPCR (lower left panel), reduced proliferation of T-HEp3 cells after 3 days in vivo (lower right panel). D, Quantification of apoptosis as measured by cleaved caspase-3 staining in 2-day-old tumors of T-HEp3 cells transfected with control, c-Jun or FoxM1 siRNAs (left panel). Right panel: quantification of phospho-histone H3 levels in the same samples as in the left panel. Results are representative of at least three independent experiments.


Figure 4. A, protein sequence alignment along aa 201-250 between wt (NM_000546) and mutant p53 in HEp3 cells shows an R213Q substitution (upper panel). Basal p53 mRNA (middle panel) and protein (lower panel) expression is higher in D-HEp3 cells than in T-HEp3 cells; the indicated lanes show treatments with vehicle alone (DMSO), Mek1/2 inhibitor PD98059 (20μM) or p38 inhibitor SB203580 (10μM), as measured by RT-PCR and Western blot, respectively. B, p53 luciferase reporter activity measured in vivo, using a luciferase reporter gene downstream of a p53-responsive element and a minimal TA promoter (pTA-p53-Luc) or a minimal TA promoter alone (pTA-Luc), was higher in D-HEp3 cells than in T-HEp3 cells after 72 h of in vivo growth on CAMs. C, shRNA stable knockdown of p53, measured by Western blot (upper panel), results in restored proliferation of D-HEp3 cells in vivo (lower panel). D, Knockdown of p53 by siRNAs in D-HEp3 cells, measured by Western blot and qPCR (upper panel), promoted cell proliferation in vivo (lower panel). In all cases, the results are representative of at least three independent experiments.


Supplementary Figures

S-Figure 1. A, An outline of the overall approach for gene array analysis. Two independent normalization and background correction methods were applied to the data (gcRMA and MAS 5.0). Both transformed datasets were then fitted to a linear model of differential expression between measured conditions (limma; see methods). Overall significance of genes was measured using the Fisher (F) statistic p-value, and a cut-off of p=0.05 was found to visually separate its bipartite distribution of signal (significant) and background (non-significant) in the same graph. The intersection of significant genes from the MAS5 and gcRMA sets is then taken to be the low-confidence set of p38 responders, and genes that are further significant in all three of the treatments are taken to be the high-confidence set. B, Scores for known and predicted binding sites for each TF were summed for each gene. This summed regulation score matrix was multiplied by the matrix of pairwise Spearman correlations for the corresponding genes and TFs. The resulting matrix is recast as a network and examined at various thresholds of the combined score (30). For a more detailed protocol, see Methods. C, Using raw gcRMA expression values, the hierarchical clustering of the genes significant in both p38 pharmacological inhibitor experiments (by both gcRMA and MAS 5.0) showed the tightest clustering between the two pharmacological inhibitor experiments and their controls. The replicates from the genetic experiment with Neo and DNp38 clustered separately but together. Neither the high stringency filter on all three strategies (Benjamini-Hochberg corrected p value for differential expression
