INTEGRATING GENE EXPRESSION SIGNALS WITH BOUNDED COLLECTION GRAMMARS

Jonathan Schug

A DISSERTATION

in

Computer and Information Science

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

2005

M. Mintz and C. J. Stoeckert Jr. Supervisor of Dissertation

Benjamin C. Pierce Graduate Group Chairperson

COPYRIGHT Jonathan Schug 2005

In Memoriam - G. Christian Overton

Though his name does not appear on the title page as my advisor, without Chris Overton, who sadly passed away June 1, 2000, this dissertation would not be what it is. My interest in transcription factors and gene regulation was sparked by Chris, who assigned me the task of creating the TESS web site. He was energetic, thoughtful, and had a great sense of humor. Our conversations would often lead me to a new insight or a new approach to explore. I hope he would be pleased to see the results of his inspiration.


Dedication

This dissertation is dedicated to my wife Seran, our children Jorin and Ani, our parents John, Jeanne, Bedros, and Sirvart, and our extended family whose love, support, and patience have been unending.


Acknowledgements

As I finish writing this dissertation I cannot help but remember that twenty-five years ago, almost to the day, I sat on the front porch of my West Philadelphia house enjoying the spring weather and finishing the last assignment of my undergraduate career. It was an essay about the intuitionistic logic movement for a logic course taught by Scott Weinstein. I enjoyed it but was glad that classes were over. I had not touched a computer since learning BASIC on a teletype in high school and had no plans to return to academia.

Since that time I have had the pleasure of working with a number of scientists whose love for their work has led me back into academics and in particular computational biology. At the LRSM here at Penn, Bob White and Alex Vaskelis were kind enough to allow me to relearn programming on the Data General Nova 2 and DEC PDP 11/2 computers that ran the electron microscopes. At Britton Chance’s Biostructures Institute, Gerd Rosenbaum, Grant Bunker, Robert Fischetti, Ke Zhang, and the late Richard Korzun were a pleasure to work with. Through many nights working on the beamline and more meals than I can remember at the Windmill Diner, they taught me the joys of doing science. They encouraged and supported my journey back to graduate school. At CCCC, Noah Prywes and Insup Lee let me begin to put my formal training to use.

In 1995, Peter Buneman finally recruited me to the then fledgling CBIL run by David Searls and the late Chris Overton. My coworkers and mentors here have been uniformly enjoyable and inspiring people: Steve Fischer, Jonathan Crabtree, Charley Bailey, Mark Gibson, Fidel Salas, Jurgen Haas, Barb Eckman, Shan Dong, Brian Brunk, Debbie Pinney, Sharon Diskin, Thomas Gan, Gary Chen, Greg Grant, Yury Kondrakhin, Georgi Kostov, Andrew Selden, Mike Saffitz, Regina White, Jessie Kissinger, Martin Fraunholz, Jules Milgram, Val Tannen, Susan Davidson, Warren Ewens, Sridhar Hannenhalli, and David Roos. They have given me much advice, prodding, and help in statistics, R, and databases. In particular, I had many helpful conversations with Joan Mazzarelli about promoters and transcription factors. Elisabetta Manduchi has been the patient first user of the grammar learning system. Shilesh Date provided timely advice on cross-validation. Phil Le and Josh Friedman shared their ChIP-chip data and biological insights.

I have worked directly with the members of my committee in other settings and am grateful for the interesting problems they have posed and the advice and support they have given. Finally, I thank my advisors Max Mintz and Chris Stoeckert. Max has supplied years of constant encouragement, critical and thought-provoking questions, war stories, coffee, and advice: always at the right time and in the right amounts. Chris kept CBIL growing after Chris Overton’s death, bent over backwards to allow me to spend time doing my research, sparked stimulating discussions, and guided me through the maze of distractions that is scientific research.

Thank you all.


ABSTRACT

INTEGRATING GENE EXPRESSION SIGNALS WITH BOUNDED COLLECTION GRAMMARS

Jonathan Schug

M. Mintz and C. J. Stoeckert Jr.

Tissue-specific expression is one of the most obvious and important patterns of gene expression in complex eukaryotes. Every cell in an organism has the same set of genes, yet only a subset of the genes is expressed in a given cell type. This regulation is accomplished in large part by transcription factors (TF’s) that bind to short degenerate genomic sequences called binding sites near the genes they regulate. TF’s work in combination to provide precise regulation of gene expression. Understanding the combinatorics of TF regulation is still an open problem in postgenomic biology.

In this dissertation we develop and apply a bounded collection grammar (BCG) formalism, similar to permutation grammars, and a machine-learning algorithm to model, search for, and learn the combinations and arrangements of TF’s that regulate tissue-specific expression. Our machine-learning algorithm optimizes the free parameters in a grammar, such as spacing and scores, to identify the best possible performance of a rule. This system provides a unique combination of modeling power and learning ability.

To identify tissue-specific genes from tissue surveys of gene expression, we apply Shannon entropy Hg to quantify overall specificity, then develop and apply a new entropy-based metric Qg|t to quantify specificity to a particular tissue, t. We take a stepwise approach to promoter analysis by first studying specific and ubiquitous promoters in general to determine global characteristics. We then study the genes specific to a particular tissue in this global context. Our analysis of mouse and human promoters ranked by Hg identifies the TATA box and CpG island as the major determinants of tissue specificity. We find there are functional correlates of the TATA/CpG class of a gene’s promoter. We identified TF’s enriched in liver promoters and studied their arrangements to refine and extend earlier results by identifying one known rule and many new rules. Finally, we performed sequence analysis of ChIP-chip experiments to identify the companion factors of the ChIP-chip target factor that help define the active sites in the direct target genes, demonstrating that our machine-learning system can also contribute to the understanding of other regulatory events.


Contents

In Memoriam - G. Christian Overton
Dedication
Acknowledgements

1 Introduction
  1.1 Background
  1.2 Goals
  1.3 Conclusion

2 Related Work
  2.1 Identification of Regulatory Elements
  2.2 Biological Applications of Formal Grammars
  2.3 Permutation and ID/LP Grammars
  2.4 Grammar Induction
  2.5 Discussion

3 Bounded Collection Grammars
  3.1 Overview of BCG Extensions
    3.1.1 Bounded Collection Extensions
    3.1.2 Genomic Data Extensions
  3.2 Input Data
    3.2.1 Alphabets
    3.2.2 Annotation Nodes
    3.2.3 Comments on Annotation and Tokenization
  3.3 Formal Grammar Specification
    3.3.1 Comments
    3.3.2 Stream Definition Statements
    3.3.3 Nonterminals
    3.3.4 Recognition Elements
    3.3.5 Bounds and Collections
    3.3.6 Sequences and Feasible Intervals
    3.3.7 Linking to Database Rows
  3.4 Theoretical Assessment of BCG Extensions
    3.4.1 Abstract Syntax
    3.4.2 Collections
    3.4.3 Bounds
  3.5 Discussion

4 Learning Simple Collections
  4.1 Introduction
  4.2 Comparison to Standard Grammar Induction Problems and Solutions
    4.2.1 Learning Scenario
    4.2.2 Comparison with Related Work
  4.3 Preliminaries
    4.3.1 Probabilities, Information Content, and Characteristic Length
  4.4 Ordinary Productions - Features with Fixed Spacing
  4.5 Collections of Two Features
    4.5.1 2-Lists
    4.5.2 2-Sets
  4.6 Larger Collections
  4.7 System Architecture
    4.7.1 Overview
    4.7.2 Notes
  4.8 Empirical Evaluation of Grammars
    4.8.1 ROC Graphs
    4.8.2 Estimating the True True Positive Rate
  4.9 The Evaluation Engine
  4.10 Exploring the Grammar Space
    4.10.1 Number of Rules
    4.10.2 Static Strategies
    4.10.3 Dynamic Strategy
    4.10.4 Other Strategies
  4.11 How Much Data Do We Need?
  4.12 Conclusion

5 Entropy and Tissue Specificity
  5.1 Introduction
  5.2 Background
  5.3 Results
    5.3.1 Defining Tissue Specificity
    5.3.2 Measuring Tissue Specificity with Entropy
  5.4 Evaluating a Set of Housekeeping Genes
  5.5 Most Genes are Regulated in a Tissue-Dependent Manner
  5.6 Clustering Tissues Using Q
  5.7 CpG Islands are Associated with the Least Tissue-Specific Genes
  5.8 Base Composition of Promoters Depends on Specificity
  5.9 Selected Transcription Factor Motifs in the Core Promoter
  5.10 Promoter Classes
  5.11 Discussion
  5.12 Conclusions
  5.13 Materials and Methods

6 Transcription Factor Binding Sites in the Core Promoter
  6.1 Introduction
  6.2 Results
  6.3 Methods
  6.4 Discussion

7 Transcription Factor Binding Sites in Liver-Specific Promoters
  7.1 Introduction
  7.2 Background
  7.3 Results
    7.3.1 Identifying Liver-Specific Genes
    7.3.2 Identifying TF’s Over-Represented in Liver-Specific Promoters
    7.3.3 Combinations and Arrangements of TF’s
    7.3.4 Two-Feature Rules
    7.3.5 Three-Feature Rules
    7.3.6 HNF1 Companions
    7.3.7 Combinations with the TATA Box
    7.3.8 Performance
    7.3.9 Liver Selectivity
    7.3.10 Selectivity for Liver-Specific CpG+ Genes
  7.4 Methods
    7.4.1 Promoter Sequences
    7.4.2 Building and Sampling from Markov Models
    7.4.3 Scoring Positional Weight Matrices
    7.4.4 Definition of PWM Families
    7.4.5 Definition and Evaluation of ROC Graphs
    7.4.6 Exploring Combination Rules
  7.5 Discussion
    7.5.1 Biology
    7.5.2 Clustering TFBS Models
    7.5.3 Augmenting the Learning Algorithm

8 Identifying Companion Factors of ChIP-Chip Target Transcription Factors
  8.1 Introduction
    8.1.1 ChIP-chip Experiments
    8.1.2 The Goals Sequence Analysis
    8.1.3 Microarray Chips
  8.2 C/EBP-beta Targets in Regenerating Liver
    8.2.1 Background
    8.2.2 Results
    8.2.3 Discussion
    8.2.4 Methods
  8.3 Glucocorticoid Receptor Targets in Fasted Dexamethazone-Injected Mice
    8.3.1 Background
    8.3.2 Results
    8.3.3 Discussion
    8.3.4 Methods
  8.4 Discussion

9 Discussion
  9.1 Summary
  9.2 Grammar Formalism and Learning System
  9.3 Tissue Specificity, the TATA Box, and CpG Islands
  9.4 Liver-Specific Genes
  9.5 ChIP-chip Data
  9.6 The Structure of Regulatory Modules

10 Future Work

Bibliography

Appendices
  A GLE yapp Specification
  B Liver Specific Genes

List of Tables

3.1 Description of attributes of annotation features.
3.2 Implementation of Markov model of TFBS chain
4.1 Transition probabilities for state machine.
5.1 The top 5 most tissue specific known genes for representative tissues.
5.2 The list of tissues used in this study
5.3 The top 5 most group-specific known mouse genes for selected tissue groups.
5.4 CpG islands are correlated with embryonic expression.
5.5 The most significant indicators of the degree of tissue-specificity: start CpG island and TATA box
5.6 YY1 sites are confined to the least specific genes.
5.7 Over-represented Gene Ontology (GO) terms for cellular component and biological process of genes by promoter class.
6.1 Number of promoters by class
6.2 Frequency of transcription factors enriched in the core promoter
6.3 Mean conditional entropy for genes containing core-promoter TFBS.
7.1 Top 30 TF families in CpG– liver-specific promoters
7.2 Count of rule types found in promoter of liver-specific genes
7.3 Three-feature instance rules found in liver-specific promoters
7.4 Companion TF’s for HNF1
7.5 Top 15 most liver-selective rules
8.1 Direct targets of C/EBPβ
8.2 Companions of C/EBPβ in direct targets of C/EBPβ.
8.3 Performance of three TRANSFAC PWMs for GR
8.4 Performance of top 20 PWM groups for DEB set
8.5 Companion factors for GR monomer
8.6 Companion factors for GR dimer
B.1 Top 100 most liver-specific CpG- mouse genes

List of Figures

3.1 Specification and example of stream definition statement.
3.2 Example of a web interface to the parser
4.1 Plot of p(P, L) as a function of log P and log L.
4.2 Schematic drawing of a compound feature occurring within its length bound n within a sequence interval of length L
4.3 Point probabilities of the same two features in different orders
4.4 State machine for computing p(S −→ F1 , F2 ; L)
4.5 Agreement between calculated and observed frequency of 2-list rule.
4.6 Architecture of the machine learning system
4.7 Example ROC curve with suboptimal points
4.8 Complete grammar space predecessor graph
4.9 Partial grammar space predecessor graphs
4.10 Number of rules for 200 simple features
4.11 Comparison of ROC Curves
4.12 Examples of Bonferroni correction tolerance.
5.1 Examples of GNF-GEA expression patterns for mouse genes at selected Hg and Q.
5.2 Distributions of H and Q for different data sources and tissues.
5.3 Consensus tissue tree of tissues from human and mouse data.
5.4 The fraction of start CpG islands in genes ranked by entropy Hg increases with entropy.
5.5 Base composition profiles for ubiquitous and tissue-specific genes with and without start CpG islands.
5.6 A new YY1 motif found downstream of the TSS
5.7 Distribution of YY1 motifs in the core promoter
5.8 The distribution of TATA box and initiator element in pancreas specific genes.
5.9 The cumulative distribution of promoter classes as a function of entropy is similar in human and mouse.
6.1 Identification of optimal Sp1 position and score
7.1 Overlap of tissue-specific TF’s between three tissues.
7.2 Distribution of Qg|liver for HNF factor targets
7.3 Comparison of seven-fold cross-validation AUC’s
7.4 Cumulative distribution of optimal size bounds in liver-specific genes
7.5 Example of learning a three-set rule
7.6 Distribution of Q of genes with and without rule matches
7.7 Tissue-selectivity of consensus rules
7.8 Comparison of selectivity in liver-specific CpG– and CpG+ genes
8.1 ROC graph for C/EBPβ and combinations enriched in direct targets of C/EBPβ during liver regeneration
8.2 Distribution of Qg|liver for GR direct targets
8.3 ROC curves for GR PWMs

Chapter 1

Introduction

1.1 Background

The human genome is estimated to contain between 20,000 and 25,000 protein-coding genes [49]. Every cell in our bodies contains all of these genes, yet only a specific subset of them is ‘turned on’, or expressed, in each cell. The set of genes that are expressed depends on the type of cell, and expressing the proper set of proteins is crucial to the proper functioning of the cell. The amount of protein in each cell is controlled by a number of different processes. One important process is the control of transcription by proteins called transcription factors (TF’s). Transcription is the first step in the process of creating a messenger RNA (mRNA) from a gene. During transcription a large cluster of basal transcription factors is assembled at the start of the gene. The assembly process is controlled in part by signals called transcription factor binding sites (TFBS’s) in the genomic DNA that control the relative location of the basal transcription factors. In addition, transcription factors specific to a cell type or biological process active in the cell may bind upstream of or in a gene to enhance the binding of the basal transcription factors. Understanding the signals in genomic DNA that are responsible for tissue-specific expression of genes is one of the central problems in post-genomic biology.

Transcription factors exhibit sequence-specific binding to DNA; they bind much more tightly to certain short sequences of DNA than to others. This feature allows the genome to control the location of a transcription factor by placing a good, i.e., strong, binding site at a certain position relative to the start of transcription and by having weakly binding or non-binding sequences at other nearby locations. However, the region of the genome around each gene that may contain functional binding sites is rather large: a few hundred base pairs for the proximal promoter and tens of kilobases for extended upstream regions that frequently contain active sites. The fact that the binding sites are short yet may be scattered across a long distance makes it difficult to predict individual sites that are relevant to the control of a gene. An irrelevant site with good binding strength is very likely to occur by chance.

It has been shown that TF’s work cooperatively in pairs or in larger groups to control transcription. Often the TFBS’s for these TF’s are located within a few hundred to a thousand base pairs of each other. Such arrangements of TFBS’s are termed cis-regulatory modules (CRM’s) when it can be demonstrated that they are sufficient to perform some part of the regulatory program of the gene. In several cases, the details of such modules have been well characterized, perhaps most spectacularly by Davidson’s group in the developing sea urchin [243]. Using this phenomenon, the accuracy of identifying single sites may be increased by identifying them in the context of sites of other cooperating factors. This approach has been demonstrated to be effective in several instances [73, 20, 2]. However, these results were achieved by considering a very unstructured model of the arrangement of TFBS’s. Such models may place no constraints on the order, exact arrangement, or identity of the TFBS’s, only on the density of sites. Some earlier work focused on identifying closely spaced pairs of TFBS’s [229], where the relative position of sites can be very strictly controlled.

An attempt at more structured descriptions of TFBS arrangements was developed in work by Collado-Vides’ lab [181] that uses a grammar to describe the location of sites. However, that work provides no ability to search for other occurrences of the sites in similar arrangements or to perform machine learning to learn new arrangements. Furthermore, it was performed in prokaryotes, which have much simpler control systems than vertebrates. A more detailed review of related work will be presented in Chapter 2. Some very recent work [3, 222] is addressing similar concerns, but a comprehensive understanding and knowledge base do not yet exist.
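The claim that a short site is very likely to occur by chance can be made concrete with a back-of-the-envelope calculation. The sketch below is only illustrative: it assumes independent, equiprobable bases and an exact (non-degenerate) site on a single strand, none of which is stated in the text, and the function name `chance_match_stats` is hypothetical.

```python
def chance_match_stats(site_length, region_length, p_site=None):
    """Expected number of chance matches and P(at least one match).

    Assumption (not from the text): each base is independent and
    equiprobable, so an exact site of length w matches at a given
    position with probability 0.25 ** w. Pass p_site to override.
    """
    if p_site is None:
        p_site = 0.25 ** site_length
    # Number of start positions at which the site could fit.
    positions = max(region_length - site_length + 1, 0)
    expected = positions * p_site
    # Treats positions as independent, which is an approximation.
    p_at_least_one = 1.0 - (1.0 - p_site) ** positions
    return expected, p_at_least_one


# A 6 bp exact site searched in a 10 kb upstream region.
expected, p_any = chance_match_stats(site_length=6, region_length=10_000)
print(f"expected chance matches: {expected:.2f}, P(>=1): {p_any:.2f}")
```

Under these assumptions a 6 bp site is expected more than twice in 10 kb, and real binding sites are degenerate, which only raises the chance-match rate.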

1.2 Goals

The question that this dissertation seeks to answer is: what are the rules that govern the arrangements of transcription factor binding sites that control tissue specificity? To answer this question we will need a way to describe such arrangements, a means to identify such arrangements in genomic sequence, and sets of genes to study. In addition, we want to take advantage of ChIP-chip experiments to explore the regulation of sets of genes defined by processes other than tissue specificity.

Describing Arrangements of TFBS’s

How shall we describe the arrangements of TFBS’s? As the title of the dissertation suggests, we have chosen to explore a grammar-based system. We can justify this choice by listing some of the characteristics of the arrangements we wish to specify. It will be necessary to constrain the spacing, order, and orientation of TFBS’s. However, the spacing and order will not always be completely constrained. Furthermore, constraints may operate at different levels, e.g., the spacing between members of a dimer versus the spacing between the dimer and other cooperating factors. There may be common interactions between TF’s that are reused in different circumstances; it is desirable to build a library of such interactions so they can be reused. We want the models we learn to be interpretable by biologists; we do not want to build ‘black box’ models that work but are not comprehensible. Finally, we would like to use the formalism as a query language for the ad hoc queries a biologist might want to pose.

A grammar formalism satisfies many of these criteria. However, initial experimentation with this approach using GenLang [61] revealed that regular and context-free grammars are not good at describing the loosely constrained arrangements of previous work. Thus we found it necessary to augment grammars with a means of describing such loosely structured arrangements. We borrowed the notion of collections from the Kleisli programming languages to create a grammar formalism with collection productions that relax the constraints of spacing, order, and count in a syntactically concise manner. To constrain the possibly too-general rules that can be created with collection productions, we added a means of bounding the size and/or defining the location of matches to rules. Finally, we added ways to access and filter annotations from databases, files, or external programs, and then define grammatical rules in terms of these annotations.
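The idea of a collection production with a size bound can be illustrated with a small sketch. This is not the BCG syntax or parser described later in the dissertation; it is a hypothetical stand-in that treats a rule as an unordered set of feature names that must all occur within a span of at most `max_span` base pairs. The function name, the tuple layout of the annotations, and the example coordinates are all assumptions made for illustration.

```python
from itertools import product


def bounded_set_matches(sites, required, max_span):
    """Find spans where every required feature occurs within max_span bp.

    sites    : list of (name, start, end) annotations (hypothetical layout)
    required : set of feature names; order and spacing are unconstrained
    max_span : upper bound on the total length of a match
    """
    # Group candidate sites by feature name.
    by_name = {name: [s for s in sites if s[0] == name] for name in required}
    matches = set()
    # Try one site per required feature, in every combination.
    for combo in product(*(by_name[name] for name in sorted(required))):
        start = min(s[1] for s in combo)
        end = max(s[2] for s in combo)
        if end - start <= max_span:
            matches.add((start, end))
    return sorted(matches)


# Invented annotations on one promoter, for illustration only.
annotated = [("HNF1", 100, 113), ("CEBP", 150, 160), ("HNF1", 900, 913)]
print(bounded_set_matches(annotated, {"HNF1", "CEBP"}, max_span=100))
```

Because the rule is a set, the same match is found whichever feature comes first; an ordered (list) variant would add a check on the start positions. The exhaustive enumeration is chosen for clarity, not efficiency.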

Learning Grammars

In this dissertation we focus on learning extremely simple grammars that consist of one rule. Even under this restriction, the number of rules that can be formed from a set of TFBS’s grows polynomially with the size of the set and exponentially with the complexity of the rules allowed. If we start with no a priori knowledge and consider large rules built from any TF’s listed in the available databases, the number of possible grammars is astronomical. This raises two concerns: the time it takes to evaluate these possibilities and the potential for over-fitting. Thus we will need to constrain our search in ways that do not rule out biologically plausible rules yet still yield a search space that is not too large. In addition, each candidate grammar may contain a number of parameters with unknown values that must be evaluated to determine the overall and best performance of a grammar. We have developed a machine-learning system that allows the user to select a learning strategy that suits the biological problem at hand. In the simplest case, the structure of grammars to be considered can be constrained to match a simple template so that the learning task consists of simply estimating the parameters of TFBS matching and orientation. Another learning strategy can operate in a larger set of structures, but performs a search that prunes out large parts of the set to concentrate only on the grammars where all TF components are contributing to the discriminatory power of the rule.
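The growth of the search space can be made concrete with a little counting. The sketch below is only an illustration: the function name, the three rule shapes it counts, and the figure of 500 TFBS models are assumptions chosen for the example, not quantities taken from our system. For a fixed rule size k each count is a polynomial of degree k in the number of models, while for a fixed number of models the counts grow roughly exponentially as k increases.

```python
from math import comb, perm


def count_rules(n_models: int, k: int) -> dict:
    """Count candidate single-rule structures built from k of n_models TFBS models."""
    return {
        # k distinct models, order ignored
        "sets": comb(n_models, k),
        # k distinct models, order significant
        "lists": perm(n_models, k),
        # k models with repeats allowed, order ignored
        "bags": comb(n_models + k - 1, k),
    }


if __name__ == "__main__":
    # 500 is a hypothetical database size used only for illustration.
    for k in (2, 3, 4, 5):
        print(k, count_rules(500, k))
```

These counts cover structure only; every candidate also carries score, spacing, and orientation parameters, which multiplies the work per candidate.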

Identifying Tissue-Specific Genes

Finally, we need sets of genes to analyze. Previous efforts of this type used small sets of genes hand-picked from the literature. This will not be sufficient for our purposes; we will need a way to generate large sets of genes. There are now genome-wide surveys of RNA expression levels in a wide variety of tissues. As presented in Schug et al. 2005 [189], we use Shannon entropy H_g to identify genes that show evidence of overall tissue-specific expression. We then refine the notion of tissue specificity by measuring both overall tissue specificity and specificity to a particular tissue of interest (conditional specificity Q_{g|t}). This yields a stratified approach that lets us first identify a gene as specifically expressed in some tissue and subsequently identify the tissue(s) in which the gene is specifically expressed. Using entropy to rank genes by their specificity, we identify the TATA box and CpG islands as the main correlates of tissue-specific and ubiquitous expression, respectively. In addition, we find trends in the function and cellular location of gene products of genes as categorized by the presence or absence of the TATA box and CpG island. We then apply the Q_{g|t} statistic to identify liver-specific genes in mouse. We analyze genes without CpG islands as these are the typical tissue-specific genes. We consider combinations of the top 20 PWM families and successfully recapitulate and, more importantly, extend earlier work by identifying combinations of both liver-specific and liver-active TF’s that select for liver-specific expression and select against expression in unrelated tissues.
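The two statistics can be illustrated with a short sketch. The definitions used here are assumed to follow [189] and are not spelled out in this chapter: the relative expression p_{t|g} is the expression of gene g in tissue t divided by its total over all tissues, H_g = -sum_t p_{t|g} log2 p_{t|g}, and Q_{g|t} = H_g - log2 p_{t|g}. The function names and the expression values are hypothetical.

```python
from math import log2


def relative_expression(levels: dict) -> dict:
    """Normalize raw expression levels of one gene across tissues to sum to 1."""
    total = sum(levels.values())
    return {tissue: w / total for tissue, w in levels.items()}


def entropy(p: dict) -> float:
    """Shannon entropy H_g in bits; low values suggest tissue-specific expression."""
    return -sum(x * log2(x) for x in p.values() if x > 0)


def conditional_specificity(p: dict, tissue: str) -> float:
    """Q_{g|t}: low only when the gene is specific overall AND high in `tissue`.

    Assumes p[tissue] > 0; a gene not expressed in the tissue has no finite Q.
    """
    return entropy(p) - log2(p[tissue])


if __name__ == "__main__":
    ubiquitous = relative_expression({"liver": 10, "brain": 10, "heart": 10, "lung": 10})
    specific = relative_expression({"liver": 97, "brain": 1, "heart": 1, "lung": 1})
    for name, p in (("ubiquitous", ubiquitous), ("liver-specific", specific)):
        print(name, round(entropy(p), 3), round(conditional_specificity(p, "liver"), 3))
```

Ranking genes by entropy gives the first stratum (specific to some tissue); ranking by the conditional statistic for a chosen tissue gives the second.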

Identifying Companion Factors in ChIP-chip Experiments

We also found our system to be very useful in another, related, sequence analysis task. A ChIP-chip experiment is a high-throughput method of identifying the areas of the genome that are bound by a TF of interest and thus contain functional TFBS’s. Beyond the utility of learning the set of target genes of a TF, ChIP-chip experiments also provide a means to identify the companion TF’s that help define which sites are functional. We present material from two papers where we used our machine-learning system to identify the companion factors of C/EBPβ [72] and the glucocorticoid receptor [130] in mouse liver.

1.3 Conclusion

To summarize, the accomplishments described in this dissertation are as follows:

• Computational Results
  – Chapter 3: the creation of a grammar-based formalism and parser to describe and search for arrangements of TFBS’s,
  – Chapter 4: the development and implementation of a machine-learning algorithm to identify arrangements of TFBS that are correlated with tissue-specific expression or other expression patterns.
• Identification of Tissue-Specific Genes and Global Promoter Features
  – Chapter 5: identification of genes showing evidence of tissue-specific expression and regulation using Shannon entropy as developed in [189],
  – Chapter 6: examination of the core promoter to identify any TFBS’s that may indicate tissue-specific expression.
• Identification of TF Combinations
  – Chapter 7: application of these techniques to the analysis of promoters of liver-specific genes.
  – Chapter 8: application of the machine-learning system to identify companion TF’s of a target TF in ChIP-chip experiments as presented in [72, 130].

These are open problems in computational biology and advances in this area are of interest to a wide variety of researchers.

Chapter 2

Related Work

In this chapter we survey efforts related to our approach to the problem of identifying cis-regulatory modules (CRM’s). These include methods and systems for identifying CRM’s, applications of grammars to other biological problems, other grammar formalisms that are similar to bounded collection grammars, and finally the general problem of grammar induction.

2.1 Identification of Regulatory Elements

The search for regulatory elements has been an on-going and difficult problem in computational biology. Early work focussed on defining and improving models for single binding sites. The earliest models were consensus sequences [52] that represented the most common sequence with possibly ambiguous bases. The major and lasting innovation was the development and analysis of positional weight matrices (PWM’s) [18, 184, 100, 213] which can approximate the binding energy between a TF and a DNA sequence. Neural nets were also tried [239]. Efforts at building better single site models are still on-going, e.g., [86, 247], and still have a role to play. Databases of TF’s, their binding sites, and binding site models have been constructed, including EPD [31], IMD [44], TRANSFAC [145], ooTFD [82], JASPAR [182], TRED [246], and TransCOMPEL (composite sites) [120]; these provide valuable resources for the community. On-line services such as the databases just cited and our own TESS [188] provide DNA sequence search services. However, all of these prediction methods suffer from an extremely high and unavoidable false-positive rate. In addition, the databases are by no means complete catalogs of all TF’s in a species, nor do they provide complete information for the TF’s they do contain. This, then, poses a number of research problems, including how to improve the predictive performance of binding site identification and how to identify sites for as yet unknown TF’s. In this work we focus on the first problem, though the second is no less important or interesting. In fact, the problems are intertwined and many of the techniques we review below also apply to the problem of identifying novel motifs. It became clear that the key to successfully identifying functional sites by reducing the number of false positives was to bring in other information and constraints. The mainstays of current practice are 1. over-representation of sites, 2. combinations or arrangements of sites, 3. conservation between species, and 4. integration of expression, binding, or other data, which are applied in some combination in most work. The challenge has always been to apply techniques that work well in yeast or prokaryotes to more complex eukaryotes and to larger data sets.
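A minimal sketch of PWM construction and scanning may help make the single-site model concrete. It assumes a log-odds matrix built from aligned example sites with a uniform background and a pseudocount; the function names, the example sites, and the threshold are hypothetical, and the cited databases and TESS use their own scoring schemes. Input is assumed to be upper-case A, C, G, T only, and only the forward strand is scanned.

```python
from math import log2


def build_pwm(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from equal-length sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(pwm, seq, threshold):
    """Return (start, score) for every forward-strand window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    example_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]  # toy alignment
    print(scan(build_pwm(example_sites), "GGGTATAAAGGG", threshold=4.0))
```

Lowering the threshold admits more true sites but also many more spurious matches, which is the false-positive problem that the additional constraints discussed here are meant to address.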

Over-representation

Initial attempts to identify binding sites were typically done one sequence at a time by investigators who were studying one gene at a time. As more sequence became available one could ask the question ‘what motifs are over-represented in my set of sequences?’ Over-representation is the fundamental process that operates when no other constraints or data can be applied. This is perhaps obvious, as it is the basis of statistical significance, but it is worth noting the different ways it can be applied. A common scenario, e.g., work by Wasserman [234], is the study of the known or potential regulatory regions of a set of genes that are assumed to be co-regulated. The underlying assumption is that these regions will be enriched for the binding sites of the TF’s that control them. The observed frequency of the sites must be compared to their expected frequency, either via a mathematical model of random occurrence or the empirical frequency in an appropriate set of control sequences, e.g., intergenic or exonic regions, or regulatory regions for other genes. A second scenario, e.g., work by Wagner [231], that is only effective with rare patterns such as combinations of sites, is to scan the whole genome with a fixed window looking for regions that contain very unexpected groups of sites. Again, there must be a suitable model for the expected occurrence. These two approaches may be combined in some sense by looking at all known genes in a genome to find elements that are very widely used and so appear more commonly than expected. A common choice for the background model is a Markov model of some order that describes the base composition as well as the di-, tri- [20, 3], or higher-order nucleotide frequencies of the sequence. Hannenhalli and Levy [94] demonstrate that the underlying assumption of enrichment is true by considering the binding site composition of promoters of genes clustered by functional information. Most novel motif finders, e.g., complete enumeration [228], WORDUP [163], Gibbs samplers [129], MEME [11], AlignACE [105], word segmentation [36], and YMF [206], work solely under the assumption of over-representation.
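The comparison of observed and expected frequency in the first scenario can be sketched as a one-sided binomial test against an empirical control set. This is an illustrative choice of test, not the method of any of the works cited; the function names, the add-one smoothing of the background rate, and the counts in the example are assumptions.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enrichment_p_value(fg_with_site, fg_total, bg_with_site, bg_total):
    """Chance of seeing at least fg_with_site foreground sequences with a site
    if foreground sequences contained the site at the background rate."""
    # Add-one smoothing keeps the rate away from 0 and 1 for small control sets.
    p_background = (bg_with_site + 1) / (bg_total + 2)
    return binomial_tail(fg_with_site, fg_total, p_background)


if __name__ == "__main__":
    # Hypothetical counts: 8 of 10 co-regulated promoters contain the site,
    # versus 100 of 1000 control promoters.
    print(enrichment_p_value(8, 10, 100, 1000))
```

When many candidate motifs or combinations are tested, the resulting p-values need correction for multiple testing, which becomes more severe as the model space grows.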

Combinations of Sites

Given the combinatorial nature of eukaryotic gene regulation it is natural to search for combinations of sites. There are many possible interpretations of what that means, having largely to do with the class of combinations and arrangements that is the target of the search. It is worth defining some terms at this point. We will use the term combination to refer to the set of TF’s or TFBS models that are included in the model of a CRM. We will use the term arrangement to mean the constraints the model places on the number and order of instances of TF’s. There are further constraints that can be applied such as the total length of the arrangement or the orientation and score of components of the model, which we call the parameters. The simplest possible additional information one can provide about a binding site is its location and orientation preferences. Computational studies have been done for NF-Y (CCAAT box) [141] and CREB [48] (footnote 1) binding sites. Ettwiller used promoter versus intergenic enrichment in motif discovery [67] in yeast using localization near the transcription start site (TSS) to enrich for functional motifs. In nearly all species, strong positional preferences can only be expected to apply to components that are part of or interact directly with the basal transcription apparatus. Positional preferences are often noted in passing in work cited below. Perhaps the simplest combinatorial model to find is a pair of factors with close, fixed spacing. Considering two closely-spaced sites provides a large decrease in the false-positive rate (as we shall see in Chapter 4) along with a relatively small model space. Van Helden and colleagues identified novel pairs in yeast [229]. Although there is some success with this approach in eukaryotes [69], the number of such regulatory elements appears to be small.
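The pair model can be sketched directly from two lists of predicted sites. The representation below, with hits as (start, end, strand) tuples in half-open coordinates, and the function name are assumptions made for the example rather than the interface of any cited system.

```python
def find_pairs(hits_a, hits_b, min_gap, max_gap, ordered=True):
    """Pair each A hit with each B hit whose separation lies in [min_gap, max_gap].

    Hits are (start, end, strand) with half-open coordinates, so the gap is the
    number of bases between the two sites. With ordered=True, A must lie
    upstream of B; otherwise either order is accepted.
    """
    pairs = []
    for a in hits_a:
        for b in hits_b:
            gap_ab = b[0] - a[1]  # A upstream of B
            gap_ba = a[0] - b[1]  # B upstream of A
            if min_gap <= gap_ab <= max_gap:
                pairs.append((a, b))
            elif not ordered and min_gap <= gap_ba <= max_gap:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    hits_a = [(10, 16, "+"), (100, 106, "+")]   # hypothetical predicted sites
    hits_b = [(20, 28, "+"), (80, 88, "-")]
    print(find_pairs(hits_a, hits_b, 0, 10))
    print(find_pairs(hits_a, hits_b, 0, 15, ordered=False))
```

Tightening the allowed gap, fixing the order, or requiring particular strands each removes chance co-occurrences, which is the source of the reduction in false positives.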
One phenomenon that is observed in the binding of pairs of TF’s is cooperative binding, where occupation of each site is higher when both factors are present than it is when the TF’s are present individually. Kel’s work [119] provides computational evidence for cooperative binding between NFAT and AP-1. GuhaThakurta [91] finds novel motifs with a notion of cooperative binding. We do not directly address cooperative binding, though it could be addressed in our formalism by providing a threshold on the total score of a rule as well as the component features. If this threshold is more stringent than the implied threshold based on the thresholds of the components alone, then there is some evidence of cooperative binding. In this work we will often refer to the component features of a combination of sites as companion factors rather than cooperating factors to avoid the false claim of having demonstrated cooperative binding.

Footnote 1: We note with chagrin that this work scooped similar results that we also achieved with Drs. Pack and Mackiewicz.

A model that has been most heavily studied is a cluster of binding sites. Clusters may be defined in a few different ways depending on whether or not all members of the combination are required to occur and on how the scores of the component features are to be combined. Early work was done in yeast [28, 29]. Berman and coworkers [20] treated the combination as a pool of TF’s and only constrained the minimum number of total matches to component TF’s. Thus if the component features are A, B, and C, a group of three As is just as interesting as a group containing A, B, and C. Wasserman [234, 127] uses a logistic regression model that does not force each TF to appear, but uses the best score for the feature (which may be poor) in a moving window. The TOUCAN system [3, 4, 2] is a model builder and search tool that uses either an A* search or a genetic algorithm to learn models. Clusters are sets or multi-sets of features in a fixed window size, though missing elements can be penalized. Features may be allowed to overlap or not. The system appears to process at most five features in a cluster. CISTER [73] and an earlier, related, system COMET [74] use a hidden Markov model (HMM) with intra-cluster background and inter-cluster background that places no structure on the cluster beyond the pool of component features. Hannenhalli and Levy [94] consider stringent conserved TFBS’s in the first 5KB upstream of genes and find that functionally related genes have similar combinations of TF’s. Their model is effectively a set with a fixed size of 5KB. MSCAN [5] identifies unstructured clusters of user-selected sites in a user-defined size. Work by Kreiman [126] requires each feature (of up to four) to occur but does not constrain the order. Constraints are placed on the distance between features.
That system uses a search strategy of exhaustive enumeration of combinations using predefined scoring thresholds. Zhang and coworkers [138, 219] find over-represented novel motifs and pairs by enumeration of consensus strings. Work by Bluthgen [25] looks for over-representation of GO terms in genes that contain combinations of preselected binding sites. A somewhat more structured arrangement is a list in which all components of a combination must appear in a specified order. Older work by Frech [71, 123, 76] used this approach but a learning algorithm has not been published. Recent work by this group [60] uses ordered and spaced arrangements seeded by annotated pairs. More complicated models are built by hand. Very recent work by Terai and Tegaki [222] considered both sets and lists in yeast. Evaluation of score thresholds was done in a loop outside of pattern evaluation. The distance was considered when selecting models. Recently, work has been done to identify combinations of novel motifs using Gibbs sampling [224]. They use a model which consists of a matrix of transition probabilities that measure which motif is likely to follow another. The distance between the motif matches is modeled by a uniform distribution on a preset interval. The components of the combination are novel motifs identified by Gibbs sampling. Thus no individual motif is guaranteed to occur, but the order of the motif matches is guided by the transition probability matrix.

Conservation

Another technique for identifying binding sites is to use inter-species conservation. Like exonic sequence, the regulatory parts of the genome show conservation between species. A number of methods have been developed to identify conserved regions and binding sites, e.g., Bayesian methods [248], DNA block aligner [110], Vista [110] and rVista [139, 13], Pipmaker [191] and BLASTZ [190], the regulatory potential measure [125], Mulan [154], and CORG [56, 57, 58]. Several systems identify CRM’s using conservation. CREME [201, 200] looks for two or more conserved factors with at least one site in a user-defined length; it pre-filters features based on over-representation. TraFac [111] uses BLASTZ to identify conserved regions, then notes similar composition of sites, rather than exact positional matches, in a 200bp moving window. Halfon and coworkers [92] considered five known TF’s of interest in a 300bp window in D. melanogaster using conservation to identify regions of interest. An important question for studying combinatorial regulation using conservation is what fraction of regulatory sites are in fact conserved and would be identified by even an ideal algorithm. Estimates range from 50% [135] to 98% (in top 19% of conserved sequence) [235]. If the 50% estimate is correct, and sites are conserved independently of one another, then one might expect only 0.5^3 = 0.125 of three-factor combinations to be fully conserved. We do not consider conservation in our work (for now) because of this issue.

Integration of Other Data

Often mRNA expression data is processed separately from the subsequent sequence analysis, e.g., the data is clustered, a list of genes that show significant differential expression is extracted, and the genes in the cluster are analyzed. That is how the work in this dissertation was performed. However, it is also possible that better results could be had by integrating the analysis of the expression data with the sequence analysis phase.
Such work has been done by Koller, Segal, and coworkers [196, 215, 148] and involves the clustering of genes based on both expression data and the occurrence of putative motifs using an expectation maximization (EM) procedure. This allows one to possibly resolve marginal cluster membership based on motif information. Another approach by Bussemaker [37] predicts an expression pattern by a linear weighting of the motifs present in the promoters of genes. The assumptions underlying the enrichment of sites can be undermined by two confounding influences. First, data sets may include genes that are not direct regulatory targets of factors of interest. Second, regulatory elements may be located far from the promoter. The problem of data sets with heterogeneous regulatory mechanisms can be mitigated by considering expression under a wider variety of experimental conditions which can better define the group of co-regulated genes. The problem of clustering then arises to properly group co-regulated genes. The second confounding influence, regulatory elements that are located far from the promoter, can be dealt with by considering regions that are conserved between species (described above) or by direct, possibly high-throughput, identification of binding regions using chromatin immuno-precipitation. These methods and studies, originally applied in yeast [132], have since been applied to other eukaryotes [151, 72, 130]. They provide a list of candidate regulatory targets for a TF of interest. This approach suffers from the usual problems of false positives and false negatives in terms of identifying binding events. It also suffers from the problem of determining the significance of binding; it is not clear that TF’s themselves do not suffer from the same problem of promiscuous binding that our computational models exhibit [143]. In addition, such location analyses are typically performed at the resolution of a promoter, i.e., 1 to 5 KB, rather than the level of a single base pair. Thus even in this case, the details of the arrangements of sites still need to be worked out. Gao and coworkers [77] have used both binding and expression data in yeast to identify functional regulatory sites. In the case of divergently transcribed genes they are able to identify which gene is regulated by the site. Garten and coworkers [78] use yeast protein-DNA binding data to identify cooperative factors. Kato and coworkers [117] considered all possible unordered combinations of two and three novel motifs identified in yeast promoters grouped by both binding and expression data. Combinations were retained when they gave better predictions of the time-course of cell cycle data.
One can imagine combining binding, expression, conservation, and motif prediction into one process. We have not done this in this dissertation, opting instead to concentrate on the combinatorial and parameter optimization problems. Such integration could be considered in future work.

2.2 Biological Applications of Formal Grammars

Formal grammars have a number of different applications in DNA or AA sequence analysis. We follow a theme developed by Pereira [159] to help structure this brief summary. We can classify grammatical models along an axis running from numeric to symbolic. At the numeric end of the axis, the content of the grammar is largely in the numeric parameters learned and the structure is either uninformative or uninteresting, i.e., there is no symbolic content. Most common at the numeric end of the spectrum is the widespread practice of using n-gram models of DNA or AA sequences. This technique has been used to identify coding potential [33] and regulatory potential [66], just to cite a very small number of applications. We begin to see a small amount of symbolic content in hidden Markov models (HMM’s) which have been used to model protein domains [90, 64, 209], genes [33, 170, 99, 240], and promoters [157]. The cited work is just a small selection of the applications of HMM’s to these problems. HMM’s contain more structure than n-gram models, but the symbols in the grammar are not high level concepts, but simply a position in an idealized sequence. Positional weight matrices are an example of a very simple HMM. Some of the work cited above, e.g., [74], uses an HMM topology where sections of the HMM correspond to a PWM. Similarly, in context-free grammar (CFG) models of RNA secondary structure [177, 144, 102, 158, 150, 38, 232, 122, 179, 133, 178, 170, 30, 87, 88] the symbols of the grammar correspond to particular base pairings in the context of particular secondary structure. Stochastic tree-adjoining grammars have been used for RNA secondary structure with pseudo-knots [124, 140] and protein secondary structure [1]. Like the CFG models, the symbols are individual base-pairings. In all of these cases it is possible to identify parts of the model with a particular part of the parsed string, but the regions are not identified symbolically. It is unlikely that a successful biological (sequence analysis) application of grammars will ever be completely symbolic, i.e., devoid of any numeric or stochastic component; biological data is full of noise and variation that is best captured by a numeric approach. Perhaps the best we can aspire to is the stochastic, hierarchical definition of symbols through patterns based on primitive symbols identified by numeric processes. There have been several applications of grammars that match this rather general description. The program PatScan [155] offered a primitive form of grammar-like searching. Here, as in our work, the symbols represent transcription factor binding sites.
These may be defined either numerically or as consensus strings. Searls, long interested in linguistic approaches to sequence analysis [192, 61, 195, 193, 194], used definite clause grammars (DCG’s) and string variable grammars (SVG’s) as a model of intron and exon structure in gene models. In this work, there were explicit symbols for introns and exons. The quality of an exon was measured by its coding potential as measured by an n-gram model and an overall score was assigned to a gene based on the scores of its components. More recently, prokaryotic promoters were modeled using DCG’s [134]. The symbols of this work were again binding sites. This grammar formalism included a positional term or operator to adjust locations of binding sites in the promoter, as did Searls’ work. Collado-Vides and coworkers [47, 223, 181] have used grammars to structure a description of the location of binding sites in prokaryotic promoters but there was no learning component to this work. Their representation collected all binding sites for a given factor together, then combined them under a rule for the promoter. Our work is in this second category. The formal specification of bounded collection grammars (BCG’s) includes a syntax and semantics for assigning scores to rules; it is a stochastic grammar formalism. BCG’s allow for the specification of rules in terms of other rules. Annotation streams allow for the definition of symbols from sequence annotation which may be given numeric weights. A major part of the machine-learning component of our system is to identify values for numeric parameters as well as learning the symbolic structure of the rules. Although we do not pursue learning hierarchical rules in this dissertation, it is possible to define and search with such rules and our learning algorithm can work with non-recursive hierarchical definitions.

2.3 Permutation and ID/LP Grammars

Two of the collection productions we develop and apply here are very similar to permutation phrases proposed by Cameron [39]. Permutation phrases are indicated by adding an operator to the (Extended) Backus-Naur Form syntax that indicates that a pair or, by extension, a group of syntactic elements may appear in any order. The impetus for permutation phrase grammars comes from artificial languages where, for example, named arguments or components may appear in any order, but must appear at most once. The challenge in parsing is to efficiently manage all of the possible word orders. Direct translation to all possible permutations is not workable in general due to the exponential explosion of possible orderings. Cameron’s work provides a solution that reduces the complexity from O(n!) to O(n^2), but work is on-going, e.g., [9], to identify other parsing algorithms. As observed by Cameron, permutation phrases are related to immediate dominance/linear precedence (ID/LP) grammars [80] developed for natural language modeling, which separate the task of specifying the part-whole hierarchy from the task of specifying the linear ordering of components. ID/LP grammars have a set of productions that indicate the part-whole hierarchy, but are interpreted as placing no constraint on the order of the terms in the right-hand side. Partial ordering information is supplied in the form of an order relation on terms which applies any time the two ordered terms appear in the same right-hand side. Parsing ID/LP grammars is NP-complete [14] in general but can be reasonably efficient in many cases [202]. Pericliev [161] presents an algorithm for learning the precedence relations for an ID/LP grammar that has only the hierarchical component specified, but it operates in the minimally adequate teacher (MAT) learning model defined below. ID/LP and permutation grammars allow both set and multi-set (bag) rules, as does our work.
However, they lack any notion of absolute linear proximity or position, reverse complement orientation, and model scores, which biological sequence analysis requires. In addition, such grammars would be forced to include every token in the input stream in a parse. In our work the parser is free to include only the tokens it wants to in a parse. The ability BCG’s provide to constrain linear ordering is limited to list and ordinary productions which imply a total ordering of the components of the right-hand side. These are simpler than what ID/LP grammars provide. This is not an issue for now as we know so little about the constraints on actual CRM’s. It also slightly eases the grammar induction problem as the number of possible total ordering constraints for BCG’s is smaller than the number of partial orderings possible in ID/LP grammars. Supporting partial orderings will be an interesting direction for future work.
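The contrast between ordered and unordered productions, and the freedom to skip tokens, can be illustrated with a toy matcher. This is a sketch only: it is neither Cameron’s algorithm nor the BCG parser described in Chapter 3, and the function names, the (position, name) hit representation, and the window bound are assumptions. The unordered case is checked by counting symbols rather than enumerating the n! orderings.

```python
from collections import Counter


def matches_list(tokens, required):
    """True if `required` occurs in order as a subsequence of `tokens`;
    tokens not named in the rule are simply skipped."""
    remaining = iter(tokens)
    return all(any(tok == want for tok in remaining) for want in required)


def matches_bag(tokens, required):
    """True if `tokens` holds every required symbol with at least the required
    multiplicity, in any order; checked by counting, not by permuting."""
    have = Counter(tokens)
    return all(have[sym] >= n for sym, n in Counter(required).items())


def find_matches(hits, required, max_span, ordered):
    """Return positions of hits that open a window of at most max_span bases
    whose tokens satisfy the rule. Hits are (position, name) pairs."""
    hits = sorted(hits)
    test = matches_list if ordered else matches_bag
    found = []
    for i, (start, _) in enumerate(hits):
        window = [name for pos, name in hits[i:] if pos - start <= max_span]
        if test(window, required):
            found.append(start)
    return found


if __name__ == "__main__":
    hits = [(100, "A"), (130, "X"), (150, "B"), (900, "A")]  # hypothetical sites
    print(find_matches(hits, ["B", "A"], 100, ordered=False))
    print(find_matches(hits, ["B", "A"], 100, ordered=True))
```

In the example the unordered rule matches the window opening at position 100 while the ordered rule does not, because A precedes B there and the second A lies outside the size bound.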

2.4 Grammar Induction

Grammar induction is the task of learning a grammar given some kind of access to example data and/or feedback from a teacher. There are a number of variants that involve the type of data the learner has access to and how it is presented to the learner. The learnability of different classes of grammars depends on the learning setting. A related topic is the teachability of a language, that is, whether a teacher can generate examples that will allow a reasonable learner to correctly identify the grammar. We focus on learning.

Exact Learning

In the simplest learning scenario the learner has access to a set of examples that contains positive exemplars, and may or may not include negative exemplars. The task is to learn exactly the grammar that produced the data. Gold [84] showed that regular languages could not be learned from positive examples alone. Gold also showed that the problem is NP-hard for regular languages even if arbitrary positive and negative examples are presented. This scenario can be rescued by placing constraints on which exemplars are presented. For example, if all sequences less than a particular length are presented, Trakhtenbrot and Barzdin report an algorithm [225] that can learn the smallest DFA that accepts only the positive set. A stronger constraint is insisting that the training data constitute a characteristic set that is, in the case of a regular language, guaranteed to minimally cover the DFA transitions. In this model DFA’s are now polynomially learnable [84, 153], as are a number of subclasses of CFG (reviewed in [53]). An interesting subtlety that can be found in work by de la Higuera [53] is that for regular grammars, the complexity can be measured in terms of the size of either the grammar or the data. If the size of the grammar is used, then the problem is not tractable because the training data may include unnecessary exponentially long examples that must be processed. Hence, for unrestricted (characteristic) data sets the problem is measured in terms of the size of the data. Also, a CFG can generate a string that is exponentially long, so the data size, not the grammar size, has to be the underlying problem dimension. We will return to a related point in the discussion.

Minimally Adequate Teacher

Angluin [6] modified the learning scenario so that the learner has access to the data through a minimally adequate teacher (MAT). The MAT answers two kinds of requests from the learner, 1) for an arbitrary labeled example (with no restrictions) and 2) verification of a grammar. The verification either indicates that the learner’s guess is correct or provides a counter-example. In this model regular grammars are learnable as are other restricted forms of CFG’s. However, there are a number of subtle technical points that make a satisfactory definition of this kind of learning hard to find. For example, it is possible for a teacher and learner to collude to make it trivial to learn any grammar. The canonical example is that the teacher can include a string that contains a coded description of the DFA among the positive exemplars which the learner simply detects and decodes to learn the grammar.

PAC Learning

Valiant [227] introduced probably approximately correct (PAC) learning which, in terms of grammar induction, relaxes the learning task from exact identification of the correct grammar to the likely approximate identification of a well-performing grammar. The PAC learning model assumes that the examples are presented to the learner according to a fixed but unknown underlying probability distribution, D. In this case the task is to identify a rule that has a small probability of making errors as measured by D. The rule identified may not be the true rule, but it is approximately correct in terms of its performance on the data as given by the distribution D. There must be a high probability that the rule identified by a PAC algorithm meets the approximately correct criterion. Valiant has proved that a number of learning tasks are possible with this definition; unfortunately, learning regular languages is not among them [118]. However, a more recent variation of the PAC-learning approach is to assume that the exemplar strings are distributed, not by an arbitrary distribution, but by one in which the shorter strings are more common. This makes it more likely that the short positive examples are well covered and that only a small amount of generalization is needed. Denis [54] provides an algorithm for learning when the positive-only samples are simple as measured by the grammar. In this case a regular grammar is learned exactly with some small probability of failure.
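The two tolerances in the PAC criterion can be made concrete with the textbook sample-size bound for a finite hypothesis class and a learner that always returns a hypothesis consistent with its sample: m >= (ln|H| + ln(1/delta)) / epsilon examples suffice for error at most epsilon with probability at least 1 - delta. The bound is standard background rather than a result of the works cited here, it does not cover the infinite classes of grammars discussed above, and the function name and example numbers are hypothetical.

```python
from math import ceil, log


def pac_sample_size(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Examples sufficient for a consistent learner over a finite hypothesis class
    to reach error <= epsilon with probability >= 1 - delta."""
    return ceil((log(hypothesis_count) + log(1 / delta)) / epsilon)


if __name__ == "__main__":
    # Hypothetical: 1000 candidate rules, 10% error tolerance, 95% confidence.
    print(pac_sample_size(1000, epsilon=0.1, delta=0.05))
```

The logarithmic dependence on the number of hypotheses is one reason that restricting the space of candidate rules, for example to a simple template, eases the learning problem.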

Stochastic Grammars

Stochastic grammars can be learned from positive data only. The Alergia algorithm [41] is a stochastic extension of an earlier deterministic algorithm by the same researchers. Stolcke [212] presents a Bayesian model merging technique that is applied to several classes of grammars including HMM’s and SCFG’s. The pressure to generalize is expressed by priors on grammars that prefer fewer states. This is balanced by the need to prevent the total likelihood of the training set from dropping too much as a result of generalization. Model merging is accomplished by the application of a small number of basic operations to the working grammar, rather than a complete exploration of all grammars. Chen [45] describes a similar approach with different rules for modifying the grammar. The resulting grammar is made more expressive by adding productions of the form X_i → X_j X_k. The augmented grammar is passed through the inside-outside algorithm which adjusts the weights of the augmenting rules appropriately.

2.5

Discussion

Our grammar formalism, which we describe in Chapter 3, can either approximate or capture exactly most of the approaches described in previous work, and in addition brings new features, most importantly hierarchical rules and size and location bounding. Thus as a modeling system it is an advance over existing systems. In this dissertation we focus on learning only simple single collection rules. Some of the contemporary related work contains similar modeling capabilities, i.e., sets and lists, but either uses a less sophisticated search mechanism or is missing one of the collection constructs. None of the related work integrates parameter optimization into the basic analysis of a grammar. Thus in this area we cover new ground. Although we formulate our extensions in Chapter 3 in terms of arbitrary grammars in the Chomsky hierarchy, we do not expect to encounter any language more complicated than a finite regular grammar in practice. Hence many of the difficulties of learning grammars described above are not relevant. Furthermore, the algorithms presented above were developed for grammars with strict word order and thus are not directly applicable. However, there are important points that we can take advantage of when designing our machine-learning system. First is the utility of negative training sets. For a rule to rank highly it must both occur frequently in the positive training data and infrequently in the negative data. This helps prevent over-generalization. To prevent overlearning we will insist that more complex rules perform significantly better than their predecessors and use cross-validation. Finally, borrowing from stochastic grammar learning algorithms, we will structure the space of grammars in terms of primitive incremental operations that will help guide exploration of the set of possible grammars.
To return to the point raised in the discussion of de la Higuera’s work [53] about exponentially long strings, we observe the following similar situation for set or multiset productions. A set production with n terms in the right-hand side will match n! different orders of the terms. These strings are not exponentially long, but are exponential in number. Thus for correct identification of such rules, we would need to see at least n! examples. This turns out to be not unreasonable in our scenario as n is small, i.e., four or five at the most. However, we note that our learning method will not check that all n! orders were in fact observed. We let the random occurrence rates decide if the number of observed order variations in the positive data is enough to justify the set rule.


Chapter 3

Bounded Collection Grammars

This chapter describes the syntax and semantics of a collection of extensions to regular, context-free, or other grammars to make what we call bounded collection grammars (BCG). In addition to the bounded collection extensions, we also describe another set of extensions that make it more practical to create grammars that describe arrangements of genome sequence annotation and to deal with a variety of information sources for this annotation. For simplicity, we describe the extensions in the context of a context-free grammar, i.e., a tuple G = (V, T, P, S) of nonterminals or variables (V), terminals or an alphabet (T), productions (P), and the special start symbol (S).1

3.1

Overview of BCG Extensions

In this section we give a brief description of each of the various extensions we introduce as well as a biological motivation for its inclusion. We envision this system being useful beyond the applications described in this work. There are, therefore, features that we do not take advantage of or fully develop at this time.

3.1.1

Bounded Collection Extensions

Collection Productions Recent work [20] on the genes involved in specifying the body plan of D. melanogaster has focused on identifying clusters of transcription factor binding sites (TFBS’s) that regulate these genes. Current evidence suggests that the density of sites matters more than their particular arrangement. In other circumstances, a biologist may know or hypothesize that certain transcription factors (TF’s) are relevant to the control of a set of genes, but have no detailed knowledge of their interaction. She may want to identify a set of genes that contain binding sites for these factors in close proximity to each other. Expressing such configurations is extremely tedious in an ordinary grammar. For example, three TFBS’s can appear in six different orders and would require a grammar with six productions. We want to create a tool that will make it easy to perform such a task. Finally, it may be the case that it is possible to do machine learning to identify a collection of sites, but none of the individual possible orders of sites is statistically significant due to multiple testing concerns. Therefore, we introduce several kinds of collection productions that relax the constraints that ordinary grammars place on the spacing, order, and count of instances of matches to the right-hand-side (RHS) terms. In collection productions, we will also release the parser from the requirement of matching every character in the string. Current biological expectations are that much of the exact sequence is not relevant. Furthermore, essentially every predicted binding site is incorrect and so should not be included in a correct parse of a complex signal.

1 See background material on grammars elsewhere for more details about a context-free grammar.
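To make the cost of ordering concrete, the following minimal sketch expands one unordered collection of terms into the ordinary productions it would otherwise require. The function name and the term names are hypothetical, the arrow is the ordinary-production arrow used later in this chapter, and the sketch ignores the spacing between terms.

```python
from itertools import permutations

def expand_set_production(lhs, rhs_terms):
    """Return one ordinary production string per ordering of rhs_terms."""
    return [f"{lhs} ----> {', '.join(order)};" for order in permutations(rhs_terms)]

# Hypothetical term names for three binding sites.
productions = expand_set_production("S", ["Gata", "Sp1", "Tata"])
for production in productions:
    print(production)
print(len(productions))  # 3! = 6
```

With four or five terms the same call yields 24 or 120 productions, which is why a single collection production is preferable.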

Production Bounds Faced with the freedom we have now given our grammars by relaxing the spacing, order, and count of the RHS terms, we need some compensating ‘force’ to allow the grammar to identify statistically and biologically significant matches. To do this we introduce production bounds. If a production is bounded, then any match of the production must satisfy the conditions of the bounds. Bounds may either be numeric, i.e., constraints on the length of the match, or contextual location constraints that force a match to occur in a larger context of features. Numeric bounds are an easy way to ensure that, for example, TFBS’s occur in close proximity. This can ensure statistical, and possibly, biological significance. Contextual location constraints can be used to declaratively force matches to occur in locations such as the first intron or conserved regions.
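The two kinds of bound can be sketched as a simple predicate on a candidate match. This is an illustration only, not the GLE implementation: it assumes inclusive coordinates and assumes that a location bound means the whole match lies inside the context feature; the names `Interval`, `span`, and `satisfies_bounds` are hypothetical.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    start: int  # inclusive, sequence coordinates
    end: int    # inclusive, start <= end

def span(parts):
    """Smallest interval covering all matched RHS terms."""
    return Interval(min(p.start for p in parts), max(p.end for p in parts))

def satisfies_bounds(parts, max_length=None, context=None):
    """Check an optional numeric size bound and an optional location bound."""
    match = span(parts)
    if max_length is not None and match.end - match.start + 1 > max_length:
        return False
    if context is not None and not (context.start <= match.start and match.end <= context.end):
        return False
    return True

sites = [Interval(100, 109), Interval(350, 361)]  # two predicted sites
print(satisfies_bounds(sites, max_length=400))                              # True
print(satisfies_bounds(sites, max_length=200))                              # False
print(satisfies_bounds(sites, max_length=400, context=Interval(300, 900)))  # False
```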

3.1.2

Genomic Data Extensions

Input Data It is anticipated that some of the features that would be useful to include in a production are signals that may not have a (known) grammatical model and/or may be more easily or efficiently predicted by running an external algorithm or simply retrieved from a database. Examples of such features are alignments of mRNA sequences to genomic sequence, CpG island prediction, regions of inter-species sequence conservation, TFBS’s, methylation sites, and so on. To support the use of such information, we introduce two constructs: streams and stream annotation features. To help manage genomic-scale data we introduce sequences and feasible intervals that allow the parser to consider multiple regions of interest on multiple sequences.

Streams and Annotation Features BCG’s can parse input contained in multiple streams. A stream may contain two kinds of data: a sequence of characters or other scalar data types, and/or a collection of interval-based annotation. There is a main stream that corresponds to the primary character sequence to be parsed. It will be the responsibility of the parser to populate the main stream in an appropriate way, i.e., from a flat file, database, DAS server, or other data source. A particular grammar can define multiple additional streams which are populated via plug-ins. A plug-in may take parameters that tailor its behavior. These can be specified in the grammar. A more common use of secondary streams is to store interval-based annotation of the main stream. Such annotation is stored in a data structure called a node, which includes a set of standard attributes. Nodes can be arranged into a hierarchy. For example, to represent a gene model, a plug-in can create a node for the whole gene. The children of the gene node might represent the coding sequence, untranslated regions, exons, introns, and so on. Table 3.1 describes the supported node attributes. Unlike ordinary grammars, the semantics of BCG’s will not require a grammar to match all data contained in secondary streams.

Sequences and Feasible Intervals We anticipate that typical usage scenarios will include the parsing of genome-scale data. Such data consists of a collection of potentially very long strings, i.e., chromosomes, with much attendant information. It is impractical to load all such data into RAM at once. Also, each sequence will have its own coordinate system. To facilitate handling such large sets of data we structure the input into a list of sequences. Each sequence must have a unique identifier of some kind that can be handed to the plugins that populate secondary streams so that they can load the correct data. To further subdivide the input data, we also use feasible intervals, which are regions of a sequence that will actually be considered during the parse. For example, if we are looking for binding sites in CpG islands, there is no point in loading sequence for regions outside CpG islands. The definitions of the feasible intervals are extracted from the grammar. The GLE parser is also able to determine if the grammar contains any patterns that actually require the sequence. If not, i.e., if all patterns match against annotation, then the sequence itself need not be loaded. To allow the user to refer to the content of other streams and to match annotation we introduce a new kind of terminal described below.

Gaps Since we will be concerned with the spacing between features and may not care about the particular characters in the intervening sequence, we introduce a special gap character that matches any character regardless of what alphabet the grammar is working with. A special gap character also affords the opportunity to improve parsing algorithms by skipping trivial matching operations.

Position Occasionally, it may be important to use a specific position as an anchor point in a parse. BCG’s contain a special kind of terminal that corresponds to position in the input streams.

Parsing All grammars will be interpreted as having a leading and trailing gap around the start symbol. Thus authors of grammars do not need to add these. The parser will then report all matches of the grammar to the input sequences. We have implemented a parsing algorithm in a collection of Perl modules called GLE that can parse non-recursive ‘context-free’ grammars. The GLE parser reports all possible parses with their scores. These can be post-processed to yield the best score or the sum of all scores, i.e., to implement a Viterbi or inside-outside algorithm.

3.2

Input Data

BCG’s can parse two kinds of input data. The first is a list of strings made from an alphabet of terminals. The second is interval-based annotation (features) of those sequences. Here we describe these inputs in detail.

3.2.1

Alphabets

In our applications the strings to be parsed by a BCG will always be DNA sequences. Other biological applications might use an amino acid alphabet. We envision that BCG’s might be useful for non-biological applications as well, e.g., text mining. In those applications, the alphabet would be the standard ASCII or Unicode alphabets. DNA sequences are unusual in that they can be read in two directions, i.e., the forward and reverse complement strand. Since this is of such fundamental importance we build this notion into BCG’s. This support appears in two places: in specialized alphabets and in RHS terms (described below). For now, we note that the GLE parser implements alphabets as objects. These alphabet objects can provide a method that converts a letter to its complement.
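The idea of an alphabet object with a complement method can be sketched as follows. GLE itself is written in Perl; the Python class and method names here are hypothetical and cover only the four unambiguous DNA letters.

```python
class DnaAlphabet:
    """Hypothetical alphabet object; not the GLE class itself."""

    _COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

    def complement(self, letter):
        """Convert a letter to its complement."""
        return letter.translate(self._COMPLEMENT)

    def reverse_complement(self, sequence):
        """Read a sequence on the opposite strand."""
        return sequence.translate(self._COMPLEMENT)[::-1]

dna = DnaAlphabet()
print(dna.complement("g"))             # c
print(dna.reverse_complement("gata"))  # tatc
```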

3.2.2

Annotation Nodes

We define a common structure for all stream annotation, which simplifies the specification and selection of annotation in grammars. The capabilities of annotation were designed so that they could also represent BCG parses. In fact, the GLE parser stores successful partial parses as annotation on the main stream. Annotation consists of a graph of nodes that are each intervals in the sequence coordinate system. The nodes are records that have a number of attributes. Edges between nodes are typed. Relations between the feature nodes can either be part-whole or simple containment.

Attributes The node attributes were modeled on the DAS standard [62] and are described in Table 3.1. Many of the attributes have rather loose semantics. It is left to individual plugins to set useful values that comply with the intended semantics.

Relations Relations between nodes consist of typed directed edges. Here we describe the two types of edges currently supported and give use cases that should guide authors of plugins and grammars.

Part-Whole Part-whole relations should exist between annotation nodes when the child feature is required for the definition of the parent feature. For example, a gene and an exon would participate in a part-whole relation. When BCG parses are rendered as annotations, the part-whole hierarchy is used to relate the RHS term matches to the LHS nonterminal match. For example, given a match to the rule S -> A, B, the features for the A and B matches would be linked to a feature for the S match.

Containment Containment relations hold when the child feature is simply covered by the span of the parent feature. For example, a CpG island and a Sp-1 binding site located in the CpG island might participate in a containment relation. The Sp-1 binding site is not part of the definition of the CpG island so a part-whole relation is not appropriate. When rendering a BCG parse in annotation, the containment relation would be used to relate a match that occurred inside a location bound.

| Name | Type | Description |
|---|---|---|
| type | string | |
| category | string | |
| name | string | |
| id | string | |
| method | string | how the feature was derived |
| modelid | string | |
| model | string | |
| score | float | a LOD score, for example |
| reviewed | boolean | has the feature been reviewed. Assumes that plugins will not retrieve bad features that have been reviewed. |
| start | integer | start of feature in sequence coordinates regardless of strand. start ≤ end. |
| end | integer | end of feature in sequence coordinates. See start. |
| sense | +, -, or 0 | indicates strand of feature. Features may not require a strand, e.g., CpG islands; in that case the sense should be 0. |
| length | integer | length of the sequence. |
| index | integer | an ordinal number indicating the place of the feature relative to others. The others can be all features in the stream, all features of the same type, or other semantics determined by the plugin that loads the data. |
| xedni | integer | similar to index, but counts from the end of the feasible interval. |
| parts | boolean | set to true if the feature has any subparts. |
| contents | boolean | set to true if the feature spans any smaller features. |
| leaf | boolean | set to true if the feature does not have any parts or contents. |

Table 3.1: Description of attributes of annotation features.
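As an illustration of how the attributes and the two relation types fit together, here is a minimal sketch of a node record. It is not the GLE data structure: the class name and the fields `part_children` and `contained_children` are hypothetical, only a subset of the attributes is shown, and `length` is assumed to be the feature length in inclusive coordinates.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    type: str
    start: int
    end: int
    sense: str = "0"   # '+', '-', or '0' for unstranded features
    score: float = 0.0
    part_children: List["Node"] = field(default_factory=list)       # part-whole edges
    contained_children: List["Node"] = field(default_factory=list)  # containment edges

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError("start must be <= end")

    @property
    def length(self):
        return self.end - self.start + 1

    @property
    def parts(self):
        return bool(self.part_children)

    @property
    def contents(self):
        return bool(self.contained_children)

    @property
    def leaf(self):
        return not (self.parts or self.contents)

gene = Node("Gene", 1000, 5000, "+")
exon = Node("Exon", 1000, 1200, "+")
gene.part_children.append(exon)  # an exon is part of the gene's definition
print(gene.parts, gene.leaf, exon.leaf, exon.length)  # True False True 201
```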


3.2.3

Comments on Annotation and Tokenization

We end this subsection with a few comments on the relation between annotation and tokenization. Theoretical analyses of grammars describe the input strings by assuming they are members of Σ∗, which is the set of strings made from letters in an alphabet Σ. On the other hand, most practical parsers, e.g., the parsing component in a C++ compiler, preprocess input strings with a tokenizer that identifies common tokens in the string and converts the string to a series of tokens. Typical tokens might be numbers, variables or function names, arithmetic operators, or the reserved words in a programming language. A grammar for the language is then written in terms of these tokens rather than in the base alphabet. This is done for conceptual clarity, efficiency, and to support extra semantic constraints that cannot be encoded in the grammar (e.g., tracking the data type of variables). The grammar for BCG’s that the GLE parser uses is designed in this way. On the other hand, biological applications seem to demand a more flexible tokenization approach. Grammatical models like HMMs do not typically tokenize their input. In gene regulation applications, a natural token might be a TFBS. Other tokens of interest are entities like exons, genes, repeats, CpG islands, etc. The GLE parser does no tokenizing itself, but annotation streams provide a very flexible system for providing a rich source of tokens. This situation differs from conventional parsing, whether of programming languages or of natural language, in a few ways. First, we cannot perform a reliable tokenization. A particularly egregious example is TFBS’s, where essentially all predictions of individual sites are wrong. In general, many extra tokens will be produced and a few correct tokens will be missed. Second, many of the tokens will overlap rather than abut or have white space (gaps) between them. This means that not all tokens can be included in a parse without changing how grammars work.
The GLE parser allows the author to control whether or not overlapping is allowed. Furthermore, the GLE parser does not require that all characters or annotation actually be included in a parse. The annotation is a source of tokens that can be selectively included in a parse if they match the grammar.

3.3

Formal Grammar Specification

Now that we have introduced and motivated the basic extensions, we turn to a more formal description of the semantics and concrete syntax of BCG’s. We include a complete yapp specification in Appendix A. The concrete specification of a BCG consists of two kinds of statements: stream definition statements and production statements. Stream statements define alternate parallel character or feature streams. Production statements are used to define the legal expansions of nonterminals. The nonterminal in the left-hand side (LHS) of the first production is taken to be the start symbol.

    /* Load DoTS gene models from GUS */
    @GeneModels ---> 'GUS::GeneModels'
        --Species 'Mus musculus'
        --UpstreamPad 1000
        --DownstreamPad 1000 ;

Figure 3.1: Specification and example of stream definition statement.

3.3.1

Comments

The BCG concrete syntax has two kinds of comments. The first type is a C++-style comment, i.e., // .... These may be placed freely in the text and last until the end of the line. They are skipped over by the grammar parser’s tokenizer. They are optional. The second kind is a multi-line C-style comment, i.e., /* ... */, that can only be used in four different locations in a grammar. First, there must be one at the beginning of the grammar. This comment is associated with the whole grammar. Second, there may be a comment before each stream definition statement. Third, there may be a comment before the LHS of each production. Finally, there may be a comment before each term in the RHS of a production. The GLE parser adds these comments to the Perl objects that represent grammars, productions, and RHS terms. We will return to these comments later in this section.

3.3.2

Stream Definition Statements

A stream definition statement defines a stream by giving it a name and the name of a plug-in that will be used to initialize the stream as illustrated in Figure 3.1. The plug-in’s behavior can be controlled by means of optional named parameters. The GLE package comes with a number of plug-ins, but users of the system are free to create more as needed. The plug-in’s name is converted to a Perl package name and automatically imported as needed. The specification is shown in Appendix A page 197.

3.3.3

Nonterminals

A nonterminal may appear in either the standard unquoted form or a quoted form. The unquoted form is a word beginning with an uppercase letter or an underscore followed by a digit. The rest of the characters in the word must be letters, digits, underscore, or minus. The quoted form allows any characters. Thus Gata, _0999, and ’bicoid II’ are all legal nonterminals, while gata, 0999, and bicoid II are not. As we shall see, nonterminals must appear on the LHS of a production and may appear in the RHS.
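The lexical rule above can be sketched with two regular expressions. This is an illustration, not the pattern used by the GLE tokenizer: it assumes straight single quotes for the quoted form and does not allow an embedded quote character.

```python
import re

# Unquoted: an uppercase letter, or an underscore followed by a digit,
# then any run of letters, digits, underscores, or minus signs.
UNQUOTED = re.compile(r"(?:[A-Z]|_[0-9])[A-Za-z0-9_-]*")
# Quoted: anything between single quotes (assumption: no embedded quote).
QUOTED = re.compile(r"'[^']+'")

def is_nonterminal(token):
    return bool(UNQUOTED.fullmatch(token) or QUOTED.fullmatch(token))

for token in ["Gata", "_0999", "'bicoid II'", "gata", "0999", "bicoid II"]:
    print(token, is_nonterminal(token))
```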

3.3.4

Recognition Elements

In an ordinary grammar, recognition elements are either terminals or nonterminals that appear on the RHS of productions. These are the things that must match against the input string. BCG’s have a larger variety of recognition elements.

Nonterminals Nonterminals are indicated as described above. When a nonterminal appears in the RHS of a production it indicates that a match to any production for which that nonterminal is on the LHS must occur for the production to match. In addition BCG’s allow a selector to follow a nonterminal to select nonterminal matches that meet certain criteria. Selectors are defined below.

Character Literals A character literal is a string of one or more characters in the base alphabet. Character literals are usually indicated as lowercase letters. When literals need to be uppercase they must be quoted, e.g., "G" or g are legal literals. A literal may be followed by [n] where n is a positive integer. This indicates that at most n mismatches are allowed.

Examples: Here are some examples of literal matches:

1. gata indicates a perfect match to a GATA site.
2. "WATAA"[1] indicates a match to a consensus sequence for a TATA box that allows one mismatch. Note the ambiguous character at the start of the literal.
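Literal matching with a mismatch allowance can be sketched as a sliding-window comparison. The code below is an illustration under stated assumptions, not GLE's matcher: ambiguous characters are interpreted with the standard IUPAC nucleotide codes (W = A or T), matching is case-insensitive, and reported positions are 0-based.

```python
# Subset of the IUPAC nucleotide codes; each code maps to the bases it accepts.
IUPAC = {"a": "a", "c": "c", "g": "g", "t": "t", "w": "at", "s": "cg",
         "r": "ag", "y": "ct", "k": "gt", "m": "ac", "n": "acgt"}

def find_literal(literal, sequence, max_mismatches=0):
    """Return (start, mismatches) for every window within the mismatch limit."""
    literal, sequence = literal.lower(), sequence.lower()
    hits = []
    for start in range(len(sequence) - len(literal) + 1):
        window = sequence[start:start + len(literal)]
        mismatches = sum(base not in IUPAC[code] for code, base in zip(literal, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

print(find_literal("gata", "ccgatacc"))            # [(2, 0)]
print(find_literal("WATAA", "ggtataaggaagaa", 1))  # [(2, 0), (9, 1)]
```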

Path Expressions Path expressions are the terminals for stream annotation. Like character literals they are meant to match instances of annotation in the primary or, more commonly, the secondary streams. Since annotation has a record structure with attributes and also consists of spans in the coordinate space of the sequence, we need to allow a richer set of literals to more fully support matching against annotation.

Attribute Filters As shown in Table 3.1, annotation nodes have a set of attributes that provide a set of values that can be used to select annotations of interest. These correspond to a slightly extended version of DAS attributes. This selection is accomplished by the attribute filter clause in a non-terminal right-hand side term. At the moment only a single relation can be applied to an attribute, e.g., score ≤ 5 is allowed, but the combination score ≤ 5, score ≥ 2 is not.

Interval and Structural Operators The first term in the path must indicate a stream. Subsequent terms need not indicate a stream. If they do not, then the annotation will be in the same stream as the preceding term. Terms in the path may be related to each other in three ways. A term may be a child of the previous term, it may be contained in the span of the previous term, or it may simply overlap with the previous term. Once the final term in the path is reached, either the whole feature may be used or just the position of a specific anchor point. The anchor points are the beginning, center, or end of the interval. The specification of the concrete syntax begins at Appendix A page 200.

Examples: Here are some examples of path expressions.

1. Annotation::CpgIsland a CpG island loaded into the Annotation stream.
2. Annotation::CpgIsland[length>=500] a CpG island at least 500bp long.
3. Annotation::CpgIsland/BindingSites::Sp1[score>=9] an Sp1 site that is located inside a CpG island.
4. Annotation::Gene.Intron[index=1] the first intron in a gene.
5. Annotation::Gene.TranscriptStartSite%Annotation::CpGIsland=>center the center of a CpG island that overlaps the start of transcription of a gene.

Weight Matrix Since weight matrices are such a common means of identifying TFBS’s, we included them as a primitive recognition method in the grammar. The specification begins at Appendix A. The definition amounts to specifying the alphabet of the matrix followed by observations at each position. The matrix is scored using a log-odds score. Alignments are selected by comparing them to the score listed in a selector following the matrix definition. For large-scale searching it will be more efficient to perform the searching ahead of time and read in matches via a plug-in. However, the built-in weight matrices are quite effective for ad hoc testing.

Examples:

Here is an example of a weight matrix recognition term.

[score>=10.2] This matrix has the consensus sequence ACTG. Spaces have been inserted for clarity, but they are not required.
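Log-odds scoring of a weight matrix built from per-position observations can be sketched as follows. The details are assumptions made for illustration and may differ from GLE: a pseudocount of 1, a uniform background, base-2 logarithms, and made-up counts whose consensus is ACTG.

```python
import math

def log_odds_matrix(counts, alphabet="acgt", pseudocount=1.0, background=None):
    """Convert per-position observation counts into log-odds scores."""
    background = background or {a: 1.0 / len(alphabet) for a in alphabet}
    matrix = []
    for column in counts:
        total = sum(column[a] for a in alphabet) + pseudocount * len(alphabet)
        matrix.append({a: math.log2(((column[a] + pseudocount) / total) / background[a])
                       for a in alphabet})
    return matrix

def scan(matrix, sequence, threshold):
    """Return (start, score) for each window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(matrix, window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

counts = [  # illustrative observations only
    {"a": 9, "c": 1, "g": 0, "t": 0},
    {"a": 0, "c": 10, "g": 0, "t": 0},
    {"a": 0, "c": 0, "g": 1, "t": 9},
    {"a": 0, "c": 0, "g": 10, "t": 0},
]
pwm = log_odds_matrix(counts)
print(scan(pwm, "ttactgaa", threshold=4.0))  # one hit, at 0-based position 2
```

The threshold plays the role of the score selector that follows a matrix definition.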

Gap A gap is indicated by a . and matches any character in any alphabet. Gaps are most useful when they are used with a count (described below).

Position Positions are indicated either as @@n (absolute position) or @n (relative position) where n is an integer. Absolute positions are measured in the coordinates of the sequence. Relative positions are measured relative to the feasible interval. Positive values are measured from the start of the interval; negative values from the end of the interval.

Examples Here are some examples of position recognition terms:

1. @1983 position 1983 in a sequence.
2. @@5 five positions from the start of a feasible interval.
3. @@-20 twenty positions from the end of a feasible interval.
4. @@-1 last position in a feasible interval.

Selectors Selector clauses can be added to several recognition elements: path expressions, weight matrices, and literal matches. They consist of comparisons of the attributes listed in Table 3.1 to constants. There can be at most one comparison for each attribute name. Comparisons can check for equality or less than or equal relations.
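Selector evaluation can be sketched as a small parser plus a predicate. The textual form assumed here (comma-separated clauses written with the operators `<=`, `>=`, and `=`) is inferred from the examples in this section and is not taken from the Appendix A specification; the function names are hypothetical.

```python
import operator
import re

OPS = {"<=": operator.le, ">=": operator.ge, "=": operator.eq}
CLAUSE = re.compile(r"\s*(\w+)\s*(<=|>=|=)\s*(\S+)\s*")

def parse_selector(text):
    """Parse e.g. 'score>=9' into {attribute: (comparison, constant)}."""
    selector = {}
    for clause in text.split(","):
        m = CLAUSE.fullmatch(clause)
        if m is None:
            raise ValueError(f"bad comparison: {clause!r}")
        name, op, raw = m.groups()
        if name in selector:  # at most one comparison per attribute name
            raise ValueError(f"attribute {name!r} compared more than once")
        try:
            value = float(raw)
        except ValueError:
            value = raw
        selector[name] = (OPS[op], value)
    return selector

def matches(selector, attributes):
    """True if every comparison holds for the given node attributes."""
    return all(op(attributes[name], value) for name, (op, value) in selector.items())

island_filter = parse_selector("length>=500")
print(matches(island_filter, {"length": 742}))  # True
print(matches(island_filter, {"length": 120}))  # False
```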

3.3.5

Bounds and Collections

The bounds and collection type of a production are both specified in the same bit of concrete syntax. The specification starts at Appendix A. The bound and collection type are indicated by modifying the standard arrow. The collection type is indicated by the type of braces that surround the bound. If no braces (and therefore no bound) are present the production is an ordinary production. If the braces are present, but the bounds are empty, the production has the indicated collection type but is unbounded.

Bounds A size bound and a location bound can be applied independently. Furthermore, the production can be marked as requiring a minimal match or a loose match. A minimal match is one that does not span a smaller match. The size bound allows an optional units tag that must be one of bp, KB or MB. The location bound consists of a path expression which is defined above.

Limitations: In this work we do not consider minimal matches. The parser we have implemented can correctly parse simple location bounds, but does not implement minimal matches. We will focus solely on the size bound in the machine learning section.

Examples: Here are some examples of bounded collection productions.

1. S ----> A, B; This is an ordinary production requiring A and B to appear adjacent to each other.
2. S -{1KB;;}-> A, B; This is a set production that can be no longer than 1000bp. Since it is a set, A and B can appear in any order.
3. S -{400;;}-> A#2, B#3; This production requires two As and three Bs within 400bp of each other.
4. S -{;Annotation::CpgIsland;}-> A; This production requires A to occur inside an annotated CpG island.
5. S -[400;Genes::Intron[1];]-> A, B, C; Here we require three features to appear in order within 400bp of each other and in the first intron of a gene.

Semantics of Streams, Plugins, and Annotations The semantics of streams and annotation is deliberately left rather flexible. It is left to the plug-in and grammar author(s) to ensure that they are used correctly. Only one plug-in can be attached to a stream so it is simplest to map different sources of information to different streams. However, an information source may provide only a single type of annotation and so there may be some logical redundancy between the stream name and the annotation type. Furthermore, multiple types of annotation may be available from one source. Again, in this case there may be some redundancy between the stream name and the annotation type.
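Returning to the arrow syntax of the bounded collection examples in this subsection, the sketch below decodes an arrow into a collection type, a size bound in base pairs, and a location bound. It is not the Appendix A grammar. It assumes that curly braces mark a set and square brackets an ordered collection (called "list" here), that the bound always has three semicolon-terminated fields of which the third is ignored, and that KB and MB mean 1,000 and 1,000,000 bp.

```python
import re

ARROW = re.compile(r"-(?P<open>[\[{])(?P<size>[^;]*);(?P<location>.*);[^;]*(?P<close>[\]}])->")
UNITS = {"bp": 1, "KB": 1_000, "MB": 1_000_000}

def parse_arrow(arrow):
    """Decode a production arrow into collection type, size, and location."""
    if re.fullmatch(r"-+>", arrow):
        return {"collection": "ordinary", "size_bp": None, "location": None}
    m = ARROW.fullmatch(arrow)
    if m is None or (m["open"], m["close"]) not in {("{", "}"), ("[", "]")}:
        raise ValueError(f"unrecognized arrow: {arrow!r}")
    size_bp = None
    if m["size"]:
        size = re.fullmatch(r"(\d+)(bp|KB|MB)?", m["size"])
        if size is None:
            raise ValueError(f"bad size bound: {m['size']!r}")
        size_bp = int(size[1]) * UNITS[size[2] or "bp"]
    return {"collection": "set" if m["open"] == "{" else "list",
            "size_bp": size_bp,
            "location": m["location"] or None}

print(parse_arrow("-{1KB;;}->"))
print(parse_arrow("-[400;Genes::Intron[1];]->"))
```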

3.3.6

Sequences and Feasible Intervals

A BCG parser, like GLE, should allow for multiple sequences to be parsed in a single run. Many biological problems of interest need to consider multiple independent sequences, i.e., chromosomes or genes. At the present time, it does not make biological sense to consider the sequences as one long sequence and allow matches to rules that span different individual sequences. An exception to this rule might be an unfinished or draft genome sequence, where the order, orientation, and perhaps spacing is known for a collection of contigs from a single chromosome. We suggest that this situation be dealt with either by considering the contigs individually or by creating a single sequence for each chromosome. The sequences may be contained in a FASTA file, extracted from a database, retrieved from a DAS server, or obtained from any other source. The sequences are presented to the parser by a sequence plug-in that can provide the sequence as well as a unique ID for each sequence and its size. The unique IDs may be used by the annotation plugins to help retrieve the appropriate annotation.

Managing the Memory Footprint The chromosomes of ‘higher’ eukaryotes may be a few hundred megabases long. Loading all of that sequence into memory at once can be burdensome so we have taken some steps to avoid this when possible. This is done in two ways. First, some grammars do not contain any literal matches or use annotation plugins that require the actual sequence. The GLE parser examines the grammar and queries the annotation plugins to check for this condition. When it exists, the sequence plug-in is informed and need not load any actual sequence, though it will still need to provide the IDs and sizes of the sequences. Second, the production(s) for the start symbol may contain a location bound that depends on annotation to specify areas of interest that we call feasible intervals. If this annotation comes from a plug-in that does not need sequence, then the parser queries the plug-in for all of the regions of interest. It then processes each of these feasible intervals serially, looking for matches to the main rule. The only sequence requested is for the feasible intervals, and this sequence can be discarded once the feasible interval has been parsed.
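The serial processing of feasible intervals can be sketched as a generator that holds only one interval's sequence at a time. The three callbacks stand in for the annotation plug-in, the sequence plug-in, and the parser; all names, the toy data, and the 0-based end-exclusive coordinates are assumptions of this sketch.

```python
def parse_feasible_intervals(sequence_ids, intervals_for, fetch_subsequence, parse_interval):
    """Yield matches while keeping only one feasible interval in memory."""
    for seq_id in sequence_ids:
        for start, end in intervals_for(seq_id):
            subsequence = fetch_subsequence(seq_id, start, end)
            for match in parse_interval(seq_id, start, subsequence):
                yield match
            del subsequence  # drop the reference before loading the next interval

# Toy stand-ins for the plug-ins and the parser.
GENOME = {"chr1": "tttt" + "gata" + "t" * 8 + "gata" + "tt"}
ISLANDS = {"chr1": [(3, 10), (15, 21)]}

def intervals_for(seq_id):
    return ISLANDS[seq_id]

def fetch_subsequence(seq_id, start, end):
    return GENOME[seq_id][start:end]

def parse_interval(seq_id, offset, text):
    position = text.find("gata")
    while position != -1:
        yield (seq_id, offset + position)
        position = text.find("gata", position + 1)

print(list(parse_feasible_intervals(["chr1"], intervals_for, fetch_subsequence, parse_interval)))
# [('chr1', 4), ('chr1', 16)]
```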

3.3.7

Linking to Database Rows

One of the goals of this project is to store models of regulatory elements in a database, specifically GUS, so they can be queried and serve as a link between TF’s and instances of binding sites in the promoters of genes regulated by those factors. While the GUS schema stores grammars in a form that is equivalent to the form that they are used in a BCG parser, the data model is slightly different and completely different software objects are used. We have written ‘factories’ that convert from GUS objects to (and from) GLE objects. It is important that a trace be maintained from a GLE object back to its GUS counterpart so that parsing results can be loaded into GUS and be connected to the GUS rows that correspond to the major parts of the grammar. While this link is important, we want to maintain it in a GUS-independent fashion so that the GLE code may be used with other databases or even without a database.

We maintain links to permanent storage using specially formatted text called a p-store link at the very beginning of the comment. The format of a p-store link is location#id. The location can be the schema and table name or other kinds of locations. The id is required to be a series of digits and is expected to be a numeric primary key. These links are optional. They do not affect the semantics of the grammar in any way. They are there solely to allow accurate mapping from a grammar back to a source database. This is essential in our application since grammars and parses will be stored in a database and foreign key references are maintained between elements of the parse and elements of the grammar. This grammar

    S ----> 'tata';

might appear this way with a full set of p-store links.

    /* TESS::SbcgGrammar#1234 */
    /* TESS::SbcgProduction#24423 */
    S ----> /* TESS::SbcgRecognition#3453 tata box */ 'tata';

The specification of the concrete syntax of p-store comments starts at Appendix A, but the details of the link syntax are handled by the GLE tokenizer.
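Extracting a p-store link from a comment can be sketched with a regular expression. The pattern is inferred from the example comments above (a location, a `#`, and a run of digits at the start of a C-style comment) and is not the GLE tokenizer's own rule; the function name is hypothetical.

```python
import re

P_STORE = re.compile(
    r"/\*\s*(?P<location>\S+?)#(?P<id>\d+)\b\s*(?P<text>.*?)\s*\*/", re.S)

def parse_p_store(comment):
    """Return (location, id, remaining text), or None if there is no link."""
    m = P_STORE.fullmatch(comment.strip())
    if m is None:
        return None
    return m["location"], int(m["id"]), m["text"]

print(parse_p_store("/* TESS::SbcgRecognition#3453 tata box */"))
# ('TESS::SbcgRecognition', 3453, 'tata box')
print(parse_p_store("/* an ordinary comment */"))
# None
```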

3.4

Theoretical Assessment of BCG Extensions

In this section we comment on some aspects of the BCG extensions as they relate to the Chomsky hierarchy. The Chomsky hierarchy is the core of computational linguistics in the sense that it provides an initial categorization of the complexity and power of classes of grammars. Comparing a new grammar formalism to the Chomsky hierarchy is a way of measuring the power and limitations of the formalism. Since the BCG extensions do not, in themselves, constitute a grammar formalism, but are rather a modification of a base ordinary grammar formalism, we examine the effect of each extension as applied to grammars in selected levels of the Chomsky hierarchy.

3.4.1

Abstract Syntax

For the sake of clarity, brevity, and generality, we will use a more abstract mathematical syntax for grammars such as $S \xrightarrow{\{\}} F_1, F_2;$. This syntax removes the details of features and lets us concentrate on the collection type and size bound.

3.4.2

Collections

Collection productions do not affect the power of grammars at any level of the Chomsky hierarchy. In each case, a collection production or an entire grammar using collection productions can be rewritten to eliminate the collection productions. This is done by simply expanding each collection production into a set of productions that correspond to all possible orderings of the terms in the RHS. For example, Production 3.1 would be expanded as the set of alternatives listed in Production 3.2.

$$P \xrightarrow{\{\}} A\ B\ C; \tag{3.1}$$

$$P \longrightarrow A\ B\ C \mid A\ C\ B \mid B\ A\ C \mid B\ C\ A \mid C\ A\ B \mid C\ B\ A; \tag{3.2}$$
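The expansion can be sketched in a few lines of Python. This is only an illustration of the rewriting argument, not part of the BCG parser; the function name `expand_set_production` and the ASCII arrow in its output are choices made here.

```python
from itertools import permutations


def expand_set_production(lhs, rhs):
    """Rewrite a set production as ordinary alternatives, one per ordering."""
    # set() removes duplicate orderings when a term is repeated in the RHS.
    orderings = sorted(set(permutations(rhs)))
    alternatives = " | ".join(" ".join(order) for order in orderings)
    return f"{lhs} --> {alternatives} ;"


print(expand_set_production("P", ["A", "B", "C"]))
```

The number of alternatives grows as the factorial of the number of distinct terms, so the rewriting preserves the language but not the compactness of the collection production.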

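The implicit gaps in each production can then be filled with a gap nonterminal and production as shown in Productions 3.3 and 3.4.

$$\mathit{Gap} \longrightarrow \mathit{AnyChar}\ \mathit{Gap} \mid \lambda; \tag{3.3}$$

$$\mathit{AnyChar} \longrightarrow \text{"A"} \mid \text{"C"} \mid \text{"G"} \mid \text{"T"}; \tag{3.4}$$

3.4.3

Bounds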

There are two components to production bounds; size and location. We consider these separately.

Size Bounds

Applying a size bound N to all productions for a given symbol of a grammar at any level of the Chomsky hierarchy reduces that part of the grammar down to a regular grammar. The number of strings with length less than or equal to N is finite and so a finite state machine can be built to recognize any set of strings described by a completely bounded portion of a grammar. If a size bound is applied to all productions for the start symbol, then the whole grammar is regular. If the bound is applied to productions other than the start symbol, then the Chomsky class of the partially bounded grammar is controlled by the complexity of the unbounded portion of the grammar. For example, consider a grammar

$$S \longrightarrow \text{"("}\ A\ \text{")"}; \tag{3.5}$$

$$A \longrightarrow \text{"["}\ A\ \text{"]"}; \tag{3.6}$$

$$A \longrightarrow \text{"a"}; \tag{3.7}$$

As it stands this grammar is context-free because the definition of A requires an unbounded number of matched pairs of square brackets. If we bound just the recursive production for A then the grammar becomes regular. The production for S requires only a finite number of matched parentheses and so is regular in itself. Once the bounding makes the subgrammar rooted at A regular then the whole grammar becomes regular. On the other hand, bounding productions for A in this grammar:

$$S \longrightarrow \text{"("}\ S\ \text{")"}; \tag{3.8}$$

$$S \longrightarrow A; \tag{3.9}$$

$$A \longrightarrow \text{"["}\ A\ \text{"]"}; \tag{3.10}$$

$$A \longrightarrow \text{"a"}; \tag{3.11}$$

does not make the grammar regular as the definition of S would still require an unbounded number of matched parentheses. Thus, in general, applying a size bound to a production converts the subgrammar rooted at that production to a regular grammar. Whether or not this affects the class of the whole grammar depends on the complexity of the rest of the grammar.

Location Bounds

Location bounds have more complicated effects. BCG’s allow location bounds to be either matches to productions defined in the grammar or annotation from an external source. We consider these two cases in turn.

Production Matches

Bounding a production by another production effectively introduces an intersection operation between grammars. To see this consider two grammars $G_1$ and $G_2$ defined over the same set of terminals but using distinct nonterminals. Construct a new BCG grammar $G_\cap$ consisting of the union of the productions in $G_1$ and $G_2$ and this production, where $S_\cap$ is the new start symbol of $G_\cap$.

$$S_\cap \xrightarrow{S_2} S_1; \tag{3.12}$$

This grammar nearly implements the intersection $L_1 \cap L_2$. It does not exactly implement intersection because BCG’s do not require the bound production to match the entire bounding region. Thus matches to $G_1$ are merely required to be substrings of matches to $G_2$. However, we can easily remedy that by adding two special novel terminals, say B and E, that mark the beginning and end of strings. We then define new grammars $\hat{S}_i \longrightarrow B\ S_i\ E;$ and construct $\hat{G}_\cap$ as above. The presence of the new terminals forces each grammar to match the whole string and thereby implements intersection. The complexity of the intersection depends on the class of the two grammars being intersected. If either is regular, then the intersection is regular. If both are context-free then the intersection achieves the power of a Turing machine.

Annotation Matches

Since we cannot characterize the computational power needed to define the annotation sources, we cannot say anything concrete about the result of intersecting them with the base grammar. We anticipate that annotation matches will typically be to regions in genes, e.g., promoters, introns, or perhaps to conserved regions or CpG islands. We note that the standard definition of a CpG island means that the set of strings that constitute a CpG island is not a regular language, though they could be generated with high probability from a stochastic regular grammar. If the annotation intervals have a known, intrinsic, maximum length, then they constitute a regular language. Furthermore, the maximum length puts a length bound on the base grammar and hence makes it regular.

3.5

Discussion

Emulation

Our system can emulate many of the models used in previous work. For example, a cluster with a fixed number, k, of sites drawn from a set of m TF’s can be modeled as follows:

$$S \xrightarrow{\{n\}} F{:}k; \qquad F \longrightarrow F_1 \mid F_2 \mid \cdots \mid F_m; \tag{3.13}$$

where $F_i$ is the i-th feature of interest. We have not included open-ended counts, i.e., at least k, but that could be added if need be. The following grammar implements at-least-k hits, but somewhat clumsily:

$$S \xrightarrow{\{n\}} F{:}k, G; \qquad G \xrightarrow{[]} F, G \mid \lambda; \qquad F \longrightarrow F_1 \mid F_2 \mid \cdots \mid F_m; \tag{3.14}$$

Similarly, the model in Thompson’s motif discovery work [224] would be rendered in our system as shown in Table 3.2.

Rules                                        Description

S −→ [p_1] f_1, F_1;                         set of rules for first feature f_1.
S −→ [p_2] f_2, F_2;
···
S −→ [p_k] f_k, F_k;

F_1 −→ [p_{1,1}] .#{1, m}, f_1, F_1;         set of rules for feature following f_1.
F_1 −→ [p_{1,2}] .#{1, m}, f_2, F_2;
···
F_1 −→ [p_{1,k}] .#{1, m}, f_k, F_k;
F_1 −→ [p_λ] λ;                              empty rule to end chain
⋮
F_k −→ [p_{k,1}] .#{1, m}, f_1, F_1;         set of rules for feature following f_k.
F_k −→ [p_{k,2}] .#{1, m}, f_2, F_2;
···
F_k −→ [p_{k,k}] .#{1, m}, f_k, F_k;
F_k −→ [p_λ] λ;

Table 3.2: Implementation of Markov model of TFBS chain where f_i is the i-th feature, m is the maximum gap between features, p_{i,j} is the probability of f_j following f_i, and λ is an empty production that ends the chain.

Implementation

We have implemented a parser that can parse hierarchical, but non-recursive rules. It currently supports bounding by stream annotation; bounding by a symbol could be implemented easily. The parser is written in Perl and includes plugins to access the GUS database, DAS annotation servers, and XML files. We have provided a web page interface to the parser, shown in Figure 3.2, with data sources tailored to allow diabetes researchers to query for collections of pancreas-specific or user-entered TFBS’s in promoters or introns of AllGenes gene models in mouse or human.

Figure 3.2: Example of a web interface to the parser at http://www.cbil.upenn.edu/cgi-bin/EPConDB/TESS/tess.pl?mode=SearchForm.

Comparison with ID/LP Grammars

As mentioned in Chapter 2, set and multi-set productions are equivalent to the free word order grammars such as permutation grammars and ID/LP grammars without additional linear precedence constraints. ID/LP grammars allow very flexible constraints on word order. The only explicit linear precedence (LP) relation that bounded collection grammars support is the total ordering indicated by list productions. Linear precedence could be approximated in a weak sense by considering a set production that consists of all of the LP relations translated to list productions. As long as the instances of matches to the list rules are allowed to overlap, the set rule could collect them together and present a total match that satisfied the constraints. For example, the ID/LP production shown in Grammar 3.1 can be approximated by the BCG Grammar 3.2.

$$S \longrightarrow A, B, C, D, E\ [A < E,\ A < B,\ C < D]; \tag{G3.1}$$

$$S \xrightarrow{\{n\}} R_1, R_2, R_3; \qquad R_1 \xrightarrow{[n_1]} A, E; \qquad R_2 \xrightarrow{[n_2]} A, B; \qquad R_3 \xrightarrow{[n_3]} C, D; \tag{G3.2}$$

This approach however is only approximate in that it does not force the instances of the base features that satisfy the LP constraints to coincide. For example, the BCG productions R1 and R2 could match two different instances of the feature A. This could be fixed by naming feature matches, such as is done in definite clause grammars (DCGs), and using this to force coincidence of features. This would allow BCG’s to maintain the important notion of bounding the separation of features, yet support the richer set of LP relations that ID/LP allows.


Chapter 4

Learning Simple Collections

In this chapter we present issues related to machine learning of simple grammars consisting of a single production. The issues are the statistical preconditions for learning, the random occurrence of matches to grammars, techniques for evaluating grammars, and search strategies for exploring the space of possible grammars.

4.1

Introduction

We assume that we are starting with a set of promoters or other sequences of interest that constitute our positive exemplars, $E^+$. The machine learning task is to identify arrangements of combinations of TFBS’s that are enriched in $E^+$ relative to their expected frequency or observed frequency in a control set, $E^-$. The structure, component features, and feature parameters of the arrangements of interest may be partially specified by the user. A variety of structures may be considered when the structure is not known. When the exact component features are not known, they may be drawn from one or more lists of candidate feature types. Finally, when the feature recognition parameters are not known, different parameter values must be tried to determine both the overall and best possible performance of the arrangement. The combination of structure and component features yields an exponentially large space of possibilities to consider. This space must be explored in as efficient a manner as possible. The size of the set of possible feature parameter values for a particular arrangement of component features is also exponential in the number of components. Because there is a large number of possible arrangements for even a small number of TF’s we will have to apply multiple testing corrections. We will therefore want to know how large $E^+$ must be to yield statistically significant conclusions with large corrections for multiple testing.

We will need to be able to calculate or estimate the rate of random occurrence for complex features. This is made more difficult by the fact that the arrangements we are evaluating may contain some free parameters such as the scoring thresholds and orientations of TFBS’s and the maximum allowable length of an arrangement of sites. Thus the rate of random occurrence must be calculated many times for each arrangement. We cannot always guarantee that the members of $E^+$ share a single regulatory mechanism; therefore we do not assume that there is only one over-represented arrangement. Furthermore, since the regulatory regions in complex eukaryotes are widely scattered, we do not assume that every positive exemplar will contain an identifiable arrangement of sites; we may not have extracted the relevant region of the genome. Thus we need a learning algorithm that can report alternate models and can deal with a potentially low apparent true positive rate. The rest of this chapter explores these issues and presents the approaches used in the biological applications of later chapters.

4.2

Comparison to Standard Grammar Induction Problems and Solutions

In this section we compare and contrast our machine learning problem with previous work in grammar induction. We first describe the available data, then describe the differences between our scenario and those assumed in previous theoretical work.

4.2.1

Learning Scenario

The first question we must answer is what established learning scenario(s) can we use to frame our problem? In most biological applications, including this one, we do not have access to a teacher that can answer example or equivalence queries so the minimum adequate teacher (MAT) learning scenario is not possible.¹ In most biological learning scenarios we are given a finite set of positive exemplars. Negative exemplars can be either a finite set of biological controls or randomized sequences based on statistical characteristics of the positive set. We typically do not have control or knowledge of the distribution from which the positive exemplars are drawn nor do we have any chance of ensuring that we have a characteristic set to learn from. This places us in the probably approximately correct (PAC) learning model developed by Valiant [227] or in a stochastic grammar learning environment.

¹ It is interesting to speculate to what extent we could emulate a MAT environment with an integrated machine learning system and wet lab system able to perform the appropriate biological experiments.

4.2.2

Comparison with Related Work

In this section we identify the differences between grammar induction in our setting and other settings described in previous work. This serves to motivate the approach we adopted.

First, although the strings we are studying are sequences of DNA bases, we are actually defining rules using literals or terminals that are matches to positional weight matrix (PWM) models of TFBS’s. Thus matches to PWM’s are the alphabet of the languages we want to learn. We do not know a priori which TFBS’s are relevant so part of the learning task is to identify the terminals to pursue further. Furthermore, we are not sure which instances of any given terminal are actually correct and terminals can overlap each other. Part of the learning process is to identify the cutoff score between valid and non-valid instances of matches to PWM’s. While it is theoretically possible to apply existing algorithms with larger alphabets than necessary and let the algorithm determine which letters are actually used, this may impose an unnecessary performance penalty if the extra terminals are not identified immediately and removed from further consideration. Defining rules in terms of higher level tokens is commonplace in grammar induction, e.g., machine learning of natural language is typically done using the words of the target language not the underlying alphabet. However in most applications the tokenization is less ambiguous and has a much lower error rate. An application where tokenization is difficult is parsing spoken language where identifying words in audio data is a problem similar to identifying TFBS’s in a DNA sequence. Issues described in the next paragraph rule out direct application of those techniques.

Second, our terminals or tokens are not contiguous as they are in ordinary grammars. Furthermore, although we do not parse the DNA sequence between terminal features, we do consider the absolute or relative positions of terminals.
Part of the learning task is to identify positional or spacing preferences of terminals. In artificial languages, i.e., programming languages, there is white space between the tokens of the language. However the exact size of this white space is rarely significant.² The size of the ‘white space’ between TFBS’s is expected to be very significant in a number of cases and at least somewhat significant in most cases. The relatively fixed position of the TATA box with respect to the transcription start site is an example where the size of the white space matters. Furthermore, although we do not tackle this problem here, it is likely that DNA white space is actually pink space, that is, there may be some sensitivity to the structural properties of the DNA between TFBS’s.

² A blatantly silly counter-example is the programming language whitespace. See http://compsoc.dur.ac.uk/whitespace/ for details. Interestingly, this site contains a program that is legal in both C and whitespace.

Third, our positive examples have some, unknown, degree of unreliability that depends on the application. For example, in Chapter 7 when we study liver-specific promoters, we will be considering just 1200bp of promoter sequence. It is likely that some of the positive exemplars will not contain the regulatory sequences responsible for expression in liver. Even if the 1200bp does in fact contain the regulatory sequence of interest, it is not known where that sequence stops and starts. This could be dealt with by considering rules that allowed matches to any terminals before and after the actual strings, i.e., $R_0 \rightarrow \Sigma^*, R, \Sigma^*;$, but it would be important that these extra matches do not contribute to the scores. Negative examples are also problematic. Any small arrangement of TFBS’s our learning algorithm might propose will have a non-zero probability of occurring in the negative exemplars. Thus essentially or even exactly the same exemplar, as described by TFBS’s, could appear in both the positive and negative training data. These facts dictate that rather than a purely symbolic approach that will necessarily fail, we add a numeric component and strive to identify parameterized rules that are more common in the positive exemplars than in the negative exemplars along with their optimal parameter values. The negative exemplars will play a crucial role in controlling the tendency to overgeneralize that is observed in learning from positive data only.

Fourth, because our grammar formalism uses collection productions that are not directly equivalent to deterministic finite automata (DFA) we cannot use the existing algorithms without modification. While it is possible to translate collection productions into equivalent DFA models, these DFA’s are exponentially larger than the corresponding collection productions. Thus we need to develop a method that supports collection productions in a relatively efficient manner.

4.3

Preliminaries

We begin by describing the ‘landscape’ that underlies the process of learning sequence features from a set of sequences. The goal of this section is to define some basic terms and concepts that we will use to understand the random occurrence of features which will be examined in subsequent sections of this chapter. We will assume that we are working with point features, i.e., features that span a single base and cannot overlap with each other. Overlapping features can be handled easily for string matches [171]. For example, given that the sequence GATA has occurred, the probability of the sequence TACC occurring with an overlap of 2bp, i.e., GATACC, is about $p(C)^2 \approx \frac{1}{16}$, which is larger than the probability of TACC occurring independently. On the other hand, it is not possible for GATA to overlap CCTG. Extending that work to overlapping PWM’s is possible but is likely to involve complicated calculations that depend on the scoring thresholds used. However, we are seeking only the general principles of combining features and we can achieve that using the idealized assumption of point features.
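The GATA/TACC example can be checked with a short Python sketch. It assumes uniform base composition (each base has probability 1/4, consistent with the value 1/16 quoted above); the function names are hypothetical and this is not the method of [171].

```python
def overlap_lengths(first, second):
    """Lengths k for which the last k bases of `first` equal the first k of `second`."""
    limit = min(len(first), len(second))
    return [k for k in range(1, limit) if first[-k:] == second[:k]]


def completion_probability(second, k, base_prob=0.25):
    """Probability that the rest of `second` follows, given a k-base overlap.

    Assumes independent bases that each occur with probability `base_prob`.
    """
    return base_prob ** (len(second) - k)


print(overlap_lengths("GATA", "TACC"))      # possible overlaps of TACC onto GATA
print(completion_probability("TACC", 2))    # only the two C's remain to be matched
print(completion_probability("TACC", 0))    # TACC occurring independently
print(overlap_lengths("GATA", "CCTG"))      # no overlap is possible
```

Under these assumptions the overlapping occurrence is 16 times more likely than an independent occurrence of TACC, which is why overlapping features cannot simply be treated as independent.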

4.3.1

Probabilities, Information Content, and Characteristic Length

We assume that we are given sets of positive, $E^+$, and negative, $E^-$, exemplar sequences each of length L. Furthermore, we assume that the target feature A will occur in a fraction $f_A^+$ of the positive sequences and seek to determine the fraction $f_A^-$ of negative sequences that we can expect to contain A. The larger the difference $f_A^+ - f_A^-$ or ratio $f_A^+ / f_A^-$ is, the fewer exemplars we need to confidently learn the association of feature A with our positive set. Thus we need to determine the probability of A occurring at least once at random in a sequence of length L; we call this the interval probability of A. We call the probability of feature A occurring at a single position the point probability and denote it as $p(A)$. The formula for the interval probability of a point feature is shown in Equation 4.1. This equation assumes that $p(A)$ does not vary with position in each sequence and that instances of A are independent.

$$p(A; L) = 1 - (1 - p(A))^L \tag{4.1}$$
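Equation 4.1 is straightforward to evaluate numerically. The sketch below is a minimal Python version; the function name is hypothetical, and the use of `log1p` and `expm1` is a choice made here to keep precision when $p(A)$ is very small.

```python
import math


def interval_probability(p, length):
    """Equation 4.1: probability of at least one occurrence in `length` positions.

    Requires 0 <= p < 1. Computes 1 - (1 - p)**length as -expm1(length * log1p(-p)),
    which avoids losing digits when p is tiny.
    """
    return -math.expm1(length * math.log1p(-p))


for length in (100, 1000, 10000):
    print(length, interval_probability(1 / 1000, length))
```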

Figure 4.1 shows $p(A; L)$ as a function of $p(A)$ and L. The smaller the function value is for a given feature and sequence length, the easier it will be to identify that feature as being over-represented in a set of positive exemplars. The value of $p(A)$ may depend on a parameter such as a scoring threshold or orientation. If A is significantly enriched for some parameter setting then we may identify it as a regulatory feature. We define the information content of a feature as shown in Equation 4.2.

$$I_f = -\lg p(f). \tag{4.2}$$

The less likely it is for a feature to occur at random, the higher its information content. We also define a feature’s characteristic length as shown in Equation 4.3.

$$L_f = \frac{1}{p(f)} \tag{4.3}$$

The information content and characteristic length are related: $I_f = \lg L_f$. We note that

$$\lim_{p(A) \to 0} p(A; L_A) = \frac{e - 1}{e} \approx 0.63. \tag{4.4}$$

Thus, a rare feature will occur in about 63% of sequences that are as long as the feature’s characteristic length.
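For completeness, the limit in Equation 4.4 follows from Equations 4.1 and 4.3 by a standard argument, sketched here with $L_A = 1/p(A)$ treated as a real number:

```latex
\begin{align*}
p(A; L_A) &= 1 - \bigl(1 - p(A)\bigr)^{1/p(A)}
           = 1 - \exp\!\left(\frac{\ln\bigl(1 - p(A)\bigr)}{p(A)}\right), \\
\lim_{p(A)\to 0} \frac{\ln\bigl(1 - p(A)\bigr)}{p(A)} &= -1
  \quad\Longrightarrow\quad
  \lim_{p(A)\to 0} p(A; L_A) = 1 - e^{-1} = \frac{e-1}{e} \approx 0.632 .
\end{align*}
```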

Figure 4.1: Plot of p(P, L) as a function of log P and log L. Regions in the ‘lowlands’ and low on the slope are suitable for learning as the background rate of random occurrence is low.

4.4

Ordinary Productions - Features with Fixed Spacing

The simplest compound feature is represented by this grammar.

$$S \longrightarrow A, B; \tag{G4.1}$$

Given point probabilities $p(A)$ and $p(B)$ we have the following:

$$p(S \longrightarrow A, B;) = p(A)\,p(B) \tag{4.5}$$

$$I_{S \longrightarrow A, B;} = I_A + I_B \tag{4.6}$$

$$L_{S \longrightarrow A, B;} = L_A L_B \tag{4.7}$$

Equation 4.5 is easily derived; features A and B are independent and must both occur if S is to occur, hence we multiply their individual probabilities. Equations 4.6 and 4.7 result from the application of the definition of information content and characteristic length to Equation 4.5. This result can be extended to productions containing multiple RHS features, i.e.,

$$p(S \longrightarrow F_1, \cdots, F_k;) = \prod_{i=1}^{k} p(F_i) \tag{4.8}$$

$$I_{S \longrightarrow F_1, \cdots, F_k;} = \sum_{i=1}^{k} I_{F_i} \tag{4.9}$$

$$L_{S \longrightarrow F_1, \cdots, F_k;} = \prod_{i=1}^{k} L_{F_i} \tag{4.10}$$

We can also consider productions with arbitrary but fixed gaps between each member of the RHS, i.e.,

$$S \longrightarrow F_1,\ .\#g_1,\ F_2,\ .\#g_2,\ \cdots,\ .\#g_{k-1},\ F_k; \tag{G4.2}$$

Since the gaps, $g_i$, are fixed, and the features are independent, the formulae for the probability, information content, and characteristic length of compound features of this form are also given by Equations 4.8, 4.9, and 4.10. These feature combinations preserve all of the information that is in the component features. It turns out that this will be the best we can do with combinations of features allowed by grammar rules. All other rules we will consider will yield a larger $p(f)$ and hence smaller $I_f$ and shorter $L_f$ values and thus will be harder to learn from a fixed amount of data.
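The relations in Equations 4.8 to 4.10 can be checked numerically with a minimal Python sketch. The function names are hypothetical and the two point probabilities are illustrative values only.

```python
import math


def information_content(p):
    """Equation 4.2: I_f = -lg p(f), in bits."""
    return -math.log2(p)


def characteristic_length(p):
    """Equation 4.3: L_f = 1 / p(f)."""
    return 1.0 / p


def fixed_spacing_probability(point_probs):
    """Equation 4.8: product of the component point probabilities."""
    return math.prod(point_probs)


probs = [1 / 1000, 1 / 2000]           # illustrative point probabilities
p_compound = fixed_spacing_probability(probs)
print(information_content(p_compound), sum(information_content(p) for p in probs))
print(characteristic_length(p_compound))
```

Information content adds while characteristic lengths multiply, so each additional fixed-spacing component makes the compound feature rarer by the full factor $1/p(F_i)$.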

4.5

Collections of Two Features

We first consider rules made from collections (sets and lists) of two features. Since the computation for a list feature is simpler, we will do that first.

4.5.1

2-Lists

We first consider a grammar rule of the form

$$S \xrightarrow{[]} F_1, F_2; \tag{G4.3}$$

which we refer to as a 2-list rule. Let us compute $p(S)$ for this rule. The probability of $F_1$ occurring at a given position is $p(F_1)$. Since there is no size bound on the production, $F_2$ is allowed to occur anywhere after $F_1$. The probability of this is $\lim_{L \to \infty} p(F_2; L) = 1$. Thus we have $p(S \xrightarrow{[]} F_1, F_2;) = p(F_1)$, i.e., the information content of $F_2$ is totally lost. If we consider a size-bounded rule as shown in grammar 4.11

$$S \xrightarrow{[n]} F_1, F_2; \tag{4.11}$$

we get the result shown in Equation 4.12.

$$p(S \xrightarrow{[n]} F_1, F_2;) = p(F_1)\,p(F_2; n - 1) \tag{4.12}$$

Feature $F_1$ must occur at the given position and $F_2$ must occur within $n - 1$ bases afterward. The exact formulas for $I_S$ and $L_S$ are not particularly illuminating. To gain some intuition, we can consider a simple linear approximation, $p(A; n) \approx n\,p(A)$, that applies when $n \ll 1/p(A)$.

$$I_{S \xrightarrow{[n]} F_1, F_2;} \approx I_{F_1} + I_{F_2} - \lg n \tag{4.13}$$

$$L_{S \xrightarrow{[n]} F_1, F_2;} \approx L_{F_1} L_{F_2} / n \tag{4.14}$$

Figure 4.2: Schematic drawing of a compound feature occurring within its length bound n within a sequence interval of length L.

The important rule of thumb we can extract from Equation 4.13 is that $I_{S \xrightarrow{[n]} F_1, F_2;}$ decreases by a bit when the size bound is doubled.

Does Order Matter?

We must carefully interpret Equation 4.12. Assume we have two features such that $p(F_1) = 1/1000$ and $p(F_2) = 1/2000$. Figure 4.3 plots $p(S \xrightarrow{[n]} F_1, F_2;)$ and $p(S \xrightarrow{[n]} F_2, F_1;)$ as a function of the size bound, n. As the size bound increases the probabilities tend to $p(F_1)$ and $p(F_2)$ respectively as expected. In fact, $p(S \xrightarrow{[n]} F_1, F_2;) > p(S \xrightarrow{[n]} F_2, F_1;)$ for all $n > 2$. This suggests (incorrectly) that the order $F_1, F_2$ is more common than $F_2, F_1$. What the plot really indicates is that as we look to the right from any given position we are more likely to encounter the features in the order $F_1, F_2$ than $F_2, F_1$. If we were to look to the left, then the same formula would apply and we would be more likely to encounter the features in the (global) order $F_2, F_1$. Furthermore, once we encounter the rarer $F_2$ we are likely to encounter a trailing $F_1$ and thus complete a match in the reverse order. Thus both orders are equally common.

Interval Probabilities

Computing $p(S \xrightarrow{[n]} F_1, F_2; L)$ is more difficult. It is tempting to apply a version of Equation 4.1 but this would not be correct since S is not a point feature. The derivation of Equation 4.1 assumes that occurrences of the feature at different positions are independent. This is not the case for list rules or, in fact, any compound rules. Knowing that the rule did not match at position i may yield information about the probability of it matching at positions $i + 1$ to $i + n - 1$. To correctly compute $p(S \xrightarrow{[n]} F_1, F_2; L)$ it is essential to count each matching sequence once and only once. A nearly correct method for computing $p(S \xrightarrow{[n]} F_1, F_2; L)$ is given next. The method uses Markov chains and a probabilistic state machine that is sketched in Figure 4.4.

Figure 4.3: Point probabilities of the same two features in different orders. (Plot of P[S] against the size bound in bp, from 0 to 7000, with one curve for [F1,F2] and one for [F2,F1].)
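The two curves in Figure 4.3 follow directly from Equations 4.1 and 4.12. The Python sketch below tabulates them for the feature probabilities used in the text; the function names and the particular size bounds printed are choices made here for illustration.

```python
def interval_probability(p, length):
    """Equation 4.1 for a point feature."""
    return 1.0 - (1.0 - p) ** length


def list2_point_probability(p_first, p_second, n):
    """Equation 4.12: first feature here, second within the next n - 1 bases."""
    return p_first * interval_probability(p_second, n - 1)


p1, p2 = 1 / 1000, 1 / 2000
for n in (2, 10, 100, 1000, 7000):
    forward = list2_point_probability(p1, p2, n)   # [F1, F2]
    reverse = list2_point_probability(p2, p1, n)   # [F2, F1]
    print(f"n={n:5d}  [F1,F2]={forward:.3e}  [F2,F1]={reverse:.3e}")
```

At n = 2 the two orders coincide, and for larger bounds the curves separate and approach $p(F_1)$ and $p(F_2)$, matching the behavior described in the text.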

The actual machine has 3n states. The state machine is the key to organizing the counting of sequences so that classes of matches are disjoint and thus the total probability is correct. The intuition behind the machine is as follows. Scanning from left to right for a match to the pattern [A,B] in an interval, we first need to match A then find a B within $n - 1$ base pairs. If we match an A on the first try at position 1, then we must scan the next $n - 1$ positions for a B. If we find a B in that interval then we are done. If we do not find a B, then we know that there is no B in the interval (2, n) and we proceed to check for an A at position 2. If we find an A at position 2, then we do not need to rescan that interval for a B. We need only look at position $n + 1$. In general, exactly how many positions we need to scan depends on the position of the last matched A. If we do not find an A at a position we do not scan ahead for a B.

The states in Figure 4.4 correspond to the actions of checking for an A at the current position ($A_i$), scanning ahead for a B ($B_i$), skipping the search for a B ($Z_i$), or finding a match (Success). The index i tracks the number of positions we will need to scan when checking for a B. The $Z_i$ states are not necessary for the search algorithm, but are necessary to correctly apply Markov chain theory. The Start state is like $A_{n-1}$ except we have no knowledge about the current position. Each time we visit an $A_i$ we are testing for an A at the next starting position. The transitions between states have probabilities that correspond to easily-calculated match probabilities. The transition probabilities are given in Table 4.1.

The probabilities of transitions between each state are stored in a transition matrix $M = (m_{ij})$ [89]. We are interested in the probability of moving from the Start state to the Success state while visiting the set of $A_i$ states no more than L times. Markov chain theory states that given a vector $\mu^{(t)}$ whose component $\mu_i^{(t)}$ is the probability of being in state i at time t, the product $\mu^{(t+1)} = \mu^{(t)} M$ gives the state probabilities at time $t + 1$, i.e., after making one transition. In general, $\mu^{(t)} M^k = \mu^{(t+k)}$. Parsing starts in the Start state so $\mu^{(0)}$ is a unit vector where all components are equal to zero except for the component for Start which is one. The $Z_i$ states are included so that the assessment of each starting position takes the same number of steps. Thus, $M^2$ represents the entire process of checking for a match at a single position. All probability that is not in the Success state is either in $A_i$ states or $B_i$ and $Z_i$ states. To check for matches over all L positions of our interval we compute $\mu^{(0)} M^{2L}$ and report the probability of the Success state in $\mu^{(2L)}$.

The algorithm was evaluated by comparing calculated values to observed values for a range of size bounds, interval lengths, and feature probabilities. The results are shown in Figure 4.5. As stated, this algorithm is nearly correct. What it fails to do is to properly account for the approach of the end of the interval. It is approximately correct as long as $n \ll L$. This could be corrected by building another (more complicated) state machine to handle the positions $L - n$ to L. However the machine as constructed is already quite accurate.
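As a cross-check on the idea of propagating state probabilities one step at a time, the sketch below uses a different and simpler chain than the 3n-state machine of Figure 4.4. It is an assumption-laden illustration, not the implementation evaluated in Figure 4.5: each position is taken to be an A, a B, or neither, independently of other positions (point features cannot overlap), and the state records how many upcoming positions could still complete a match. Because it walks exactly `length` positions, it does not need special handling near the end of the interval. The function name is hypothetical.

```python
def list2_interval_probability(p_a, p_b, n, length):
    """P(at least one [A,B] match, B at most n - 1 bases after A) in `length` bases.

    Assumes point features: each position is A (p_a), B (p_b), or neither,
    independently of all other positions.
    """
    p_neither = 1.0 - p_a - p_b
    # window[k]: probability that no match has occurred so far and that a B at
    # any of the next k positions would complete a match (k = 0 .. n - 1).
    window = [0.0] * n
    window[0] = 1.0
    success = 0.0
    for _ in range(length):
        nxt = [0.0] * n
        for k, mass in enumerate(window):
            if mass == 0.0:
                continue
            nxt[n - 1] += mass * p_a                # an A (re)opens the full window
            if k > 0:
                success += mass * p_b               # B inside the window: a match
                nxt[k - 1] += mass * p_neither      # window shrinks by one position
            else:
                nxt[0] += mass * (p_b + p_neither)  # a B with no open window is ignored
        window = nxt
    return success


print(list2_interval_probability(1 / 1000, 1 / 2000, 200, 1200))
```

Each sequence is assigned to the first position at which a match completes, so every matching sequence is counted once, which is the same bookkeeping goal that motivates the state machine in the text.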

Transition               Probability

Start → B_{n−1}          p(A)
Start → Z                1 − p(A)
A_i → B_i                p(A) / (1 − p(B))
A_i → Z_{i+1}            1 − p(A) / (1 − p(B))
A_{n−1} → Start          1 − p(A) / (1 − p(B))
B_i → Success            p(B; i)
B_i → A_1                1 − p(B; i)
Z_i → A_{i+1}            1
Success → Success        1

Table 4.1: Transition probabilities for [A, B] state machine.

Figure 4.4: State machine for computing $p(S \xrightarrow{[n]} F_1, F_2; L)$. In states labeled $A_i$ and Start, the machine is trying to match feature A. In the square states labeled with $B_i$ the machine can scan the last i states out of n to find feature B. The $Z_i$ states are ‘resting’ states to synchronize with $B_i$ states. Lines labeled with upper case letters indicate a successful match; lowercase indicates failure.

4.5.2

2-Sets

Now let us consider a grammar of the form

$$S \xrightarrow{\{n\}} F_1, F_2; \tag{G4.4}$$

which we refer to as a 2-set rule. Let us compute $p(S)$ for this rule. Since either feature may occur first and matches to the two orders are mutually exclusive events we can simply add the two 2-list probabilities. This yields the following equation:

$$p(S \xrightarrow{\{n\}} F_1, F_2;) = p(F_1)\,p(F_2; n - 1) + p(F_2)\,p(F_1; n - 1) \tag{4.15}$$

In the case where $p(F_1) = p(F_2)$ we have

$$p(S \xrightarrow{\{n\}} F_1, F_2;) = 2\,p(F_1)\,p(F_1; n - 1) \tag{4.16}$$
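Equation 4.15 can be evaluated with the same building blocks as the 2-list case. The following Python sketch is illustrative only; the function names and the sample probabilities are assumptions made here.

```python
def interval_probability(p, length):
    """Equation 4.1 for a point feature."""
    return 1.0 - (1.0 - p) ** length


def set2_point_probability(p1, p2, n):
    """Equation 4.15: sum of the two mutually exclusive 2-list orders."""
    return (p1 * interval_probability(p2, n - 1)
            + p2 * interval_probability(p1, n - 1))


p = 1 / 1000
print(set2_point_probability(p, p, 100))                # equal probabilities
print(2 * p * interval_probability(p, 99))              # Equation 4.16, same value
print(set2_point_probability(1 / 1000, 1 / 2000, 100))  # unequal probabilities
```

For small bounds, where the linear approximation holds, the 2-set probability is close to twice the corresponding 2-list probability, so allowing either order costs about one bit of information.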

Interval Probabilities {n}

We have not developed any exact formulae for p(S −→ F1 , F2 ; ). It can be approximated using a multi-event Poisson formula. This formula was developed for a slightly different search situation. 48

Figure 4.5: Agreement between calculated and observed frequency of [A, B]. (Axes: Calculated and Observed, both on a logarithmic scale.)

In that scenario the goal is to find a cluster of sites in a fixed-size window. There is no bound on the size of the cluster within the window. Thus it is a mixture of our point and interval probabilities.

4.6

Larger Collections

The state machine method presented above can be extended to longer lists, but developing the state machines is difficult. Furthermore, the machines will have at least $O(n^d)$ states (where $d$ is the number of features). Thus managing the matrices will become intractable. In addition, we anticipate applying this algorithm to hierarchical rules in the near future, which will make the calculation even more difficult. A simple alternative is to use a control set approach to estimate the random rate of occurrence of features, as long as the features are not so rare that they require too many sequences to reliably estimate their frequency. Thus, depending on the biological question being investigated, a different set of control sequences will be selected or generated.

Other work [231] in this area that uses a less structured model of simply a dense collection of sites can apply the Poisson formula. The typical calculation is to compute the p-value of a collection of sites occurring in a fixed window of $w$ bases. If there are $m$ different varieties of sites with $n_i$ instances of each site, then the p-value is the product of the p-values for seeing $n_i$ instances of feature $i$ in $w$ bases. This calculation does not match our situation because we need to consider the probability of finding this bounded collection in a larger interval. It is clear that larger sets and bags will have a roughly $n!$-larger false positive rate than list productions of the same size since sets and bags allow so many orderings of their components.
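The fixed-window calculation described above can be sketched as follows. Two assumptions are made here and are not taken from [231]: each site type is modeled with a per-base Poisson rate, and the per-type p-value is the upper tail $P(X \ge n_i)$. The function names are hypothetical.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    below = sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k))
    return max(0.0, 1.0 - below)


def window_p_value(counts, rates_per_base, w):
    """Product over site types of P(at least n_i sites of type i in w bases).

    ASSUMPTION: site type i occurs with expected count rates_per_base[i] * w.
    """
    p = 1.0
    for n_i, rate in zip(counts, rates_per_base):
        p *= poisson_tail(n_i, rate * w)
    return p


# Two site types seen 2 and 1 times in a 200-base window (hypothetical rates).
example = window_p_value(counts=[2, 1], rates_per_base=[0.001, 0.002], w=200)
```

As the text notes, this window-based value does not account for searching for a bounded collection within a larger interval.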

4.7

System Architecture

This section briefly describes the architecture of the learning system. It summarizes the processes described in the next three sections and places the components in context. Figure 4.6 presents a block diagram of the data flow. The reader may find it helpful to return to Figure 4.6 to help understand the process.

4.7.1

Overview

The process starts (1) when the user creates a set of training sequences that have been annotated with the binding sites of interest. In addition, the user selects the search strategy to apply and supplies parameters to the search as well as some safety bounds for the scheduler. As the run begins, the strategy plugin (2) processes the user-supplied information and schedules an initial set of grammars for evaluation, which are parameterized according to the user's specification. As discussed, the parameters might be the score thresholds or size bounds, and the search strategy plugin needs to know how these rules perform and perhaps what the optimal parameters are. The next few steps are carried out in parallel using the Liniac cluster. The scheduler maintains a queue of unprocessed grammars and (3) passes the grammars to an evaluator when one is free. The evaluator sets the grammar parameters to their least stringent bounds (4) and the parser parses both positive and negative exemplars in the training set. The parser returns to the evaluator a result set (5) consisting of all possible matches of the grammar to the training sequences. The evaluator then uses either brute-force enumeration or stochastic hill climbing to produce a ROC curve (6), which is a summary of the performance of the grammar that is returned to the scheduler. The ROC graph can be processed to yield the optimal parameter values. The scheduler passes the ROC curve to the strategy plugin, which can decide to schedule more grammars for evaluation. The process ends once the strategy has scheduled and received reports for all grammars of interest.
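The control flow between the strategy plugin, scheduler, and evaluator can be illustrated with a small serial sketch. Everything here is hypothetical: the names `run`, `PairwiseStrategy`, `initial`, and `on_result` are invented for illustration, the real system is built from Perl plugins and runs evaluations in parallel on a cluster, and `evaluate` merely stands in for steps 4 to 6.

```python
from collections import deque


class PairwiseStrategy:
    """Toy strategy: evaluate solo features, then every pair of them."""

    def __init__(self, features):
        self.features = list(features)

    def initial(self):
        return [(f,) for f in self.features]

    def on_result(self, grammar, roc):
        if len(grammar) != 1:
            return []
        return [tuple(sorted((grammar[0], g)))
                for g in self.features if g != grammar[0]]


def run(strategy, evaluate, max_grammars=1000):
    """Queue grammars, evaluate each once, and report results to the strategy."""
    queue = deque(strategy.initial())                # step 2: initial wave
    reports = {}
    while queue and len(reports) < max_grammars:     # scheduler safety bound
        grammar = queue.popleft()                    # step 3: hand to evaluator
        if grammar in reports:
            continue                                 # already evaluated
        roc = evaluate(grammar)                      # steps 4-6 (stand-in)
        reports[grammar] = roc
        queue.extend(strategy.on_result(grammar, roc))
    return reports


reports = run(PairwiseStrategy("AB"), evaluate=len)
```

The loop ends when the strategy has no further grammars to schedule or the safety bound is reached.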

4.7.2

Notes

This section contains a few brief notes about the implementation. Steps 4, 5, and 6 produce a number of reports that can be processed after the run to more fully evaluate a grammar of interest. For example, the reports contain the parameters for the optimal point as well as the parameters for the entire ROC curve. The strategy plugin and the scheduler produce their own reports as well. The scheduler has a few safety parameters that limit the size of the grammars that it will evaluate. Grammars can be limited by (1) the total number of features in the combination, (2) the total number of feature instances in the arrangement, or (3) the total depth of the grammar hierarchy. These bounds can be used to prevent the strategy plugin from considering grammars that will take too long to evaluate or from considering too many features. Though grammars can in general access annotation data from a variety of sources via stream plugins, for simplicity the annotation must be presented in XML files. It would take only a relatively minor change to allow learning to work directly out of a database or DAS server, though there may be some performance penalties in taking this approach. Also, in our current implementation, access to the databases or other sources must be accomplished from the nodes of the Liniac compute cluster that do not have direct contact with the rest of the network.

4.8

Empirical Evaluation of Grammars

In this section we describe our approach for assessing the performance of a rule given sets of positive and control sequences. A similar approach could be used even if exact or approximate formulae were developed that could replace the control set. The rules we are attempting to learn will typically have one or more free parameters that will control how the rule performs, i.e., its sensitivity and specificity. We will need to evaluate a rule over the space of possible parameter values to see how it performs. Based on this evaluation we will want to decide if the performance is better than random and if so how does it rate compared to other possible rules.

4.8.1

ROC Graphs

[Figure 4.6: block diagram. Components: Search Parameters, Features, Training Sequences, Binding Sites Annotation, Strategy, Scheduler, Evaluator, Parser, Reports. Data flows: parameterized grammar, LSB grammar, result set, ROC. The Evaluator and Parser run in parallel. Numerals 1 to 6 mark the steps listed in the caption.]

Figure 4.6: An overview of the architecture of the machine learning system. 1: The user presents the annotated training data, features of interest, and search parameters to the system. 2: The search strategy generates an initial wave of grammars. 3: The scheduler schedules grammars for parallel execution by the evaluator. 4: The evaluator sets the parameters to their least stringent bounds and parses the training data. 5: The parser returns all possible parses of the training data. 6: The evaluator processes this data and returns the ROC curve to the scheduler, which presents it to the strategy module, which may schedule more grammars if need be.

One method for visualizing the performance characteristics of a rule is the receiver operating characteristic (ROC) graph. This is a plot of sensitivity versus 1 - specificity. A typical example is shown in Figure 4.7. A number of measures of performance can be extracted from a ROC curve. First is the area under the curve (AUC). This varies from 0 for a predictor that is never right (a rule that only occurs in $E^-$), to 0.5 for random guessing, to 1.0 for a perfect predictor. AUC assesses the performance of the rule over its entire curve. While this has the advantage of being independent of a particular parameter setting, this is also a disadvantage.

We can start to identify positions on the ROC curve that are of interest. First is the point where $r_{TP} - r_{FP}$ achieves its maximum (or minimum). The distance at this point corresponds to the Kolmogorov-Smirnov statistic for the hypothesis that the distribution of hits in $E^+$ follows the same distribution as those in $E^-$. A p-value can be calculated for this distance as long as the number of points in each curve can be established. Under certain ideal circumstances this is also the point where all of the true positives have been matched by the rule. In that case the upper part of the curve will be a straight line that ends at the point (1, 1). We can also identify the most significant excursion from the diagonal. Due to the shape of the iso-significance lines, this may not be the same as the KSD point. A user may want to select other points of interest that yield, say, a 10% false positive rate. Once a point has been selected, the parameter values that yield that point can be extracted, reported, and used for future searches. Of particular interest may be the size of a compound feature.
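Two of the ROC summaries discussed above, the AUC and the point of maximum $r_{TP} - r_{FP}$, can be computed from a list of point performances as in the following sketch. The function names and the trapezoidal rule are choices made for this illustration, and the example points are invented.

```python
def auc(points):
    """Trapezoidal area under a ROC curve given (r_FP, r_TP) points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def max_separation(points):
    """Point with the largest r_TP - r_FP (the KSD point in the text)."""
    return max(points, key=lambda p: p[1] - p[0])


# Hypothetical ROC points, including the (0, 0) and (1, 1) endpoints.
roc = [(0.0, 0.0), (0.1, 0.6), (0.5, 0.8), (1.0, 1.0)]
area = auc(roc)
ksd_point = max_separation(roc)
```

A curve along the diagonal gives an area of 0.5, matching the random-guessing baseline.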

4.8.2

Estimating the True True Positive Rate

We can also use point performances on the ROC graph to make a conservative estimate of the true coverage of a rule. Since we expect our positive sets to represent multiple sequence classes, we do not know whether every match we identify in the positive set is a real positive or is a false positive occurring in a positive sequence that does not actually contain a match. We are therefore interested in subtracting off the background rate from the positive set. We assume that an $r_{TP}$ corresponds to some real coverage $C$ that is increased by the $r_{FP}$ at that $r_{TP}$ to create the observed $r_{TP}$ as follows:

$$ r_{TP} = C + (1 - C)\,r_{FP}. \tag{4.17} $$

Solving this equation for $C$ yields

$$ C = \frac{r_{TP} - r_{FP}}{1 - r_{FP}} = 1 - \frac{1 - r_{TP}}{1 - r_{FP}}. \tag{4.18} $$

This is the intersection point of a line from (1, 1) through $(r_{FP}, r_{TP})$ and the Y-axis. We note that $r_{TP} - r_{FP} < C < r_{TP}$. If the ROC curve behaves in the ideal fashion, i.e., a straight line from the point performance of the last true match to (1, 1), then Equation 4.18 will yield the same value of $C$ all along this section of the curve. Otherwise, this equation must be applied cautiously as $(r_{FP}, r_{TP})$ approaches (1, 1). It always yields nonsensical results if $r_{TP} < r_{FP}$.
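Equation 4.18 is simple enough to check directly. The sketch below is illustrative only; the function name is hypothetical, and raising an error for $r_{TP} < r_{FP}$ is a design choice made here to reflect the caveat in the text.

```python
def coverage(r_tp, r_fp):
    """Estimated true coverage C from Equation 4.18."""
    if not 0.0 <= r_fp < 1.0:
        raise ValueError("r_FP must lie in [0, 1)")
    if r_tp < r_fp:
        raise ValueError("estimate is not meaningful when r_TP < r_FP")
    return (r_tp - r_fp) / (1.0 - r_fp)


# Observed 60% of positives matched at a 20% false positive rate.
c = coverage(0.6, 0.2)
# Substituting back into Equation 4.17 recovers the observed r_TP.
r_tp_check = c + (1.0 - c) * 0.2
```

In this example the estimate is about 0.5, which lies between $r_{TP} - r_{FP} = 0.4$ and $r_{TP} = 0.6$.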

Figure 4.7: Example ROC curve with suboptimal points. The vertical line indicates the point that achieves the largest $r_{TP} - r_{FP}$. The dashed diagonal line is the performance of a random guessing algorithm. (Axes: FP, 0 to 250, and TP, 0 to 50.)

4.9

The Evaluation Engine

The evaluation of a grammar is performed by several pieces of software. We first describe the algorithm mathematically, then indicate where each part is supported in the software. We assume that each rule, $R$, is parameterized by a vector $\Theta^R$ where the length of the vector may depend on $R$. Each component of the vector, $\theta_i^R$, has a type, $\tau_i^R$, drawn from a collection of supported types, and a corresponding set of possible values $\Omega_i^R$. The set of all possible values for $\Theta^R$ is $\Omega^R$.

In very high-level terms, the goal of the evaluation is to find the point(s) in $\Omega^R$ that yield the 'best' enrichment or 'optimal' performance in identifying members of $E^+$ versus $E^-$. However, the meaning of 'best' and 'optimal' depends on the use to which the predictions are put. We remain largely agnostic about the end uses, so instead we find the best possible $N_{TP}$ for each $N_{FP}$, which we will call the best performance curve. This data can be used to create the ROC graphs described above. Furthermore, the optimal point of most utility functions will be in this set. This is true of the subsequent assessment that we use.

The parameters currently supported are the start and end positions, scores, and orientation of features and the size and score of compound features. It is planned that this list will expand to include the minimum and maximum repeat count on gaps so that more precise feature spacing can be learned. Thus although the log-likelihood scores are nearly continuous, the position, size, and orientation parameters are not. This leaves us with an optimization problem in a space that is neither entirely continuous nor entirely discrete. Solving such problems is difficult. It is made more difficult by the fact that the data we have is stochastic, so normal gradient descent methods will not work without modification on even the continuous part of the problem.

We took the approach of making the problem entirely discrete. This is reasonable since the score parameters are not truly continuous, for two reasons: (1) they are estimated from a finite amount of data and so have some inherent resolution, and (2) only a relatively small number of values are actually encountered in the training set. Thus we can replace the continuous set, $\Omega_i^R$, with the set of observed values $\Omega_i^{Ro}$ in the evaluation process. In fact this can be done for all parameters to yield the universe of observed values $\Omega^{Ro}$. Furthermore, the user is often in a position to supply (or accept) a bound on the values of parameters, which yields a smaller space, $\Omega^{Rob}$. Finally, the user may optionally supply a rounding value for each parameter (by type), $\rho_i^R$. Using this rounding the space can be reduced further to the observed, bounded, and rounded universe $\Omega^{Robr}$.

We take advantage of the fact that parameters with numeric types have the property that the number of matches will increase monotonically as the value of a parameter with that type is adjusted over its range of possible values from a most stringent bound to a least stringent bound. For example, consider the log-likelihood score of a TFBS. It is always the case that if $u < v$ then there are at least as many sites with a score better than $u$ as sites with a score better than $v$. Similar properties hold for the positional parameter start. On the other hand, size and end allow more hits as they get larger. The orientation of a feature behaves differently. Applying no constraints to the orientation yields at least as many hits as either orientation, but neither orientation individually is guaranteed to yield more hits than the other. Thus orientation has a least stringent bound, but no most stringent bound, and is only partially ordered.

We can map each observed, bounded, and rounded $\Omega_i^{Robr}$ to a corresponding set of integers $\Omega_i^{Robri} = (0, \ldots, |\Omega_i^{Robr}| - 1)$ where 0 is the least stringent bound. Similarly, orientations are mapped to the set (0, 1, 2). The software then parses $E^+$ and $E^-$ with the parameters set to their least stringent bounds and rounds and maps all observed values to their integer equivalents. There are typically many matches for each exemplar. The data presented to the evaluation program is in the space $\Omega^{Robri}$.

We have implemented two algorithms for computing the best performance curve: brute-force enumeration (ENM) and a stochastic hill climbing method (SHC). The ENM method is preferred when possible since it completely covers $\Omega^{Robri}$. The software allocates a fixed number, $n_{eval}$, of parameter points to evaluate. If $|\Omega^{Robri}| \le n_{eval}$ then the ENM method is used; otherwise the SHC method is used with a budget of $n_{eval}$ evaluations. The ENM method simply considers every point in $\Omega^{Robri}$ in turn and tracks the best $N_{TP}$ encountered at each $N_{FP}$. The default value of $n_{eval}$ is 1,000,000. The SHC uses an objective function

$$ f_{obj}(\omega^{Robri}) = r_{TP}(\omega^{Robri}) - r_{FP}(\omega^{Robri}) \tag{4.19} $$

and tries to maximize this function. Given any point below the best performance line, there is at least one point on the best performance line that has a better value of $f_{obj}$. The SHC randomly chooses a seed point, $\omega_{(0)}^{Robri}$, in $\Omega^{Robri}$ and begins an ascent from there. It randomly chooses a neighbor, $\omega_{(t+1)}^{Robri}$, from the points within a fixed distance of the current point $\omega_{(t)}^{Robri}$ and moves to that point if $f_{obj}(\omega_{(t+1)}^{Robri}) \ge f_{obj}(\omega_{(t)}^{Robri})$. The neighborhood (for numeric parameters) is set to about 5 or 10 (in $\Omega^{Robri}$) so that ascents are not trapped in local maxima that are due to noise. As points are evaluated, the best performance curve is tracked. The length and number of ascents tried are controlled by the software and currently default to 100 and 10000 respectively. Experiments have shown that the best performance curve is estimated with acceptable reliability using this method.

The software has means for the user to set $\rho_i^R$ and lower bounds globally by type and individually by annotation stream. Free parameters can also be fixed at their lower bounds and not varied during the optimization.
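The hill climbing procedure and the tracking of the best performance curve can be sketched as follows. This is a minimal illustration, not the dissertation's software: the function and argument names are hypothetical, the toy data are invented, the ascent count is far smaller than the stated defaults, and only numeric (totally ordered) parameters are handled.

```python
import random


def stochastic_hill_climb(evaluate, sizes, n_pos, n_neg,
                          n_ascents=20, ascent_len=100, radius=5, seed=0):
    """Return {N_FP: best N_TP seen} over all evaluated parameter points.

    evaluate(point) -> (N_TP, N_FP); sizes[i] is the number of integer
    values of parameter i, with 0 the least stringent bound.
    """
    rng = random.Random(seed)
    curve = {}

    def score(point):
        n_tp, n_fp = evaluate(point)
        curve[n_fp] = max(curve.get(n_fp, 0), n_tp)   # track best curve
        return n_tp / n_pos - n_fp / n_neg            # f_obj = r_TP - r_FP

    for _ in range(n_ascents):
        point = tuple(rng.randrange(s) for s in sizes)   # random seed point
        value = score(point)
        for _ in range(ascent_len):
            neighbor = tuple(
                min(s - 1, max(0, x + rng.randint(-radius, radius)))
                for x, s in zip(point, sizes))
            candidate = score(neighbor)
            if candidate >= value:                       # accept ties too
                point, value = neighbor, candidate
    return curve


# Toy data (hypothetical): one score threshold with 50 integer levels.
POS = list(range(20, 50))   # scores of 30 positive exemplars
NEG = list(range(0, 30))    # scores of 30 negative exemplars


def toy_evaluate(point):
    t = point[0]
    return sum(s >= t for s in POS), sum(s >= t for s in NEG)


curve = stochastic_hill_climb(toy_evaluate, sizes=[50], n_pos=30, n_neg=30)
```

Every evaluated point updates the curve, including rejected neighbors, so the curve reflects all of the evaluation budget that was spent.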

4.10

Exploring the Grammar Space

Related work on learning regular or context-free grammars typically involves a discrete space of possible grammars and a small number of primitive operations that are used either to evolve individual candidate grammars or sets of candidate grammars, or to adjust sets of grammars that constitute the bounds on the space of possible grammars. We then define the following primitive operations:

1. add-feature: add a feature instance and convert collection as necessary
2. to-list: convert set or bag production to a list production
3. to-ordinary: convert list production to ordinary production

We can visualize part of the grammar space as shown in Figure 4.8 and Figure 4.9. Figure 4.8 shows all possible collections using two different features with at most two feature instances in a rule. The root of the space is the empty rule, $R_\emptyset$, which corresponds to random guessing. The immediate successors of this rule are single feature rules (indicated by circles) for which the collection type does not matter. Successors to single feature rules are either the unique set rule {A, B} (in the ellipse) or the two bag rules (in the rounded rectangles). In general, given $n$ simple features, there are $\binom{n}{2}$ set rules and $n$ bag rules with two instances. Figure 4.9 shows the partial space for two features and up to three feature instances but does not include the ordinary productions. We will indicate a predecessor relation between two rules as $R_y \prec R_x$ when there is a primitive operation that will transform $R_y$ into $R_x$.

The learning algorithm will explore the set of grammars that are built from features (matches to annotation in the feature streams), non-terminals, and gaps. The simplest members of this set are grammars such as grammar G4.5, which simply looks for an instance of feature $F_i$.

$$ S \longrightarrow F_i; \tag{G4.5} $$

An example of a more complex grammar is shown in grammar G4.6, which looks for a pair of TFBS's within 50bp.

$$ S \xrightarrow{\{50\}} F_1, F_2; \tag{G4.6} $$

Grammars may be hierarchical, as shown in grammar G4.7, which looks for two different closely spaced dimers occurring within 300bp of each other.

$$ \begin{aligned} S &\xrightarrow{[300]} A, B; \\ A &\xrightarrow{\{30\}} F_1, F_2; \\ B &\xrightarrow{\{20\}} F_3, F_4; \end{aligned} \tag{G4.7} $$

[Figure 4.8: graph of the grammar space. Recoverable node labels: ∅, A, B, {A,B}, [A,A], [A,B], [B,A], [B,B], (A,.,A), (A,.,B), (B,.,A), (B,.,B).]

Figure 4.8: This is the complete graph for two features and a maximum of 2 instances per rule. Arrows indicate application of a primitive operation. Adding a feature instance is indicated by a solid arrow. Converting a set or bag to a list is indicated by a dashed line. Converting a list to an ordinary production is indicated by a dotted line. Single feature rules are indicated by a circle. Set productions are indicated by an ellipse. Rounded rectangles indicate a bag production. List productions are indicated by a rectangle. Ordinary productions are indicated by a double-ended rectangle. The '.' in the ordinary productions is actually a variable-sized gap character, e.g., a shorthand for .#{n1, n2}.

So far we have not shown any grammars that contain alternate expansions for a nonterminal. We will not consider such grammars in this work except for implicit alternative definitions of the start symbol. Since it is possible that $E^+$ is not regulated by a single set of TFs or by a single arrangement of TFBS's, we will allow the start symbol to have alternate productions simply by considering any sufficiently selective rule. We will typically predict TFBS's using a weight matrix extracted from TRANSFAC, JASPAR, or the literature. Thus each prediction will have a log-odds score, and the selection of a threshold score will be part of the learning procedure.

4.10.1

Number of Rules

In this section we provide formulae for the number of rules with a given number of feature instances. We denote the size of the pool of simple features as $n$ and seek to compute the number of rules of different collection types and sizes. Figure 4.10 plots the number of rules that can be formed from 200 simple features using 1 to 10 instances of these features using the formulae described below.



[Figure 4.9: graph of the grammar space. Recoverable node labels: ∅, A, B, {A,B}, [A,A], [A,B], [B,A], [B,B], [A,A,A], [A,A,B], [A,B,A], [B,A,A], [A,B,B], [B,A,B], [B,B,A], [B,B,B].]

Figure 4.9: A partial predecessor graph for two features with a maximum of three instances. Ordinary productions are not shown. Operations are indicated as in Figure 4.8. Two types of operations may be applied to a rule and a rule may also be produced by two types of operations.

m-sets

The number of set collections with $m$ features that can be formed from $n$ simple features is given by

$$ N_{sets}(n, m) = \binom{n}{m} = \frac{n!}{(n-m)!\,m!} \tag{4.20} $$

which is easily derived from the fact that we choose $m$ of the $n$ features to be included in the set and the order does not matter.

m-bags

The number of bag collections with $m$ feature instances that can be formed from $n$ simple features is given by

$$ N_{bags}(n, m) = \binom{n+m-1}{m} \tag{4.21} $$

as the number of configurations is given by the Bose-Einstein model of indistinguishable particles with multiple occupancy. We view the $n$ simple features as states that the $m$ feature instances in the bag can occupy. We cannot distinguish the instances beyond the number associated with each feature. See [116] for a short discussion.

Figure 4.10: The number of rules for 200 simple features is plotted for each of the three collection types. Note the log scale for the y-axis. The numbers of set and bag rules are nearly equal. (Axes: Number of Instances, 2 to 10, and Number of Rules, 1e+02 to 1e+22; legend: list, bag, set.)

m-lists

The number of list collections with $m$ feature instances that can be formed from $n$ simple features is given by

$$ N_{lists}(n, m) = n^m \tag{4.22} $$

since at each of the $m$ positions in the list we can choose any of the $n$ simple features.
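The three counting formulae (Equations 4.20 to 4.22) can be evaluated directly; the sketch below tabulates them for a pool of 200 simple features, the setting plotted in Figure 4.10. The function names are chosen for this illustration only.

```python
from math import comb


def n_sets(n, m):
    """Equation 4.20: choose m distinct features, order ignored."""
    return comb(n, m)


def n_bags(n, m):
    """Equation 4.21: m instances over n features, repeats allowed."""
    return comb(n + m - 1, m)


def n_lists(n, m):
    """Equation 4.22: any of n features at each of m ordered positions."""
    return n ** m


# Counts for 200 simple features and 1 to 10 feature instances.
table = [(m, n_sets(200, m), n_bags(200, m), n_lists(200, m))
         for m in range(1, 11)]
```

For two instances this gives 19900 sets, 20100 bags, and 40000 lists, which illustrates how close the set and bag counts are relative to the list count.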

4.10.2

Static Strategies

The strategies in this section take one or more lists of features and evaluate a fixed set of rules. They are used in several of the later chapters when only very simple grammars are of interest.

Basic Evaluation

Simply generates ROC data for all $N_S$ solo features. No input dictionary is required. The basic strategy processes all streams, $S_i$, defined in the job description. Each stream may have zero or more parameters and zero or more filters. The grammar space contains $N_S$ grammars.

Fixed-Size Sets

Makes all sets of a fixed size ($n_s$) from a set of solo features, $F_i$. The grammar space is controlled by selecting a fixed number ($n_f$) of solo features based on their ROC area (features with $r_i \ge \tau_r$), which are recorded in the dictionary. The grammar space contains $\binom{n_f}{n_s}$ grammars.

Higher-Order Sets

Higher-order sets combine non-simple features from one or more sets. In the case involving one set, the dictionary for the result set will contain the solo features, $F_i$, and the complex features $G_j$. The strategy package must load the solo and upper level features into its dictionary. The filtering step must then extract a list of the (complex) features to be used in the combinations.

4.10.3

Dynamic Strategy

The strategy in this section takes a list of features but evaluates only a subset that is defined dynamically in response to the performance of simpler rules.

Whole Shebang Strategy

This strategy identifies arrangements of simple features by applying the first two primitive operations, add-feature and to-list, to explore the grammar space. The grammar space is pruned using a heuristic that ensures that only combinations with a strong potential for improved selectivity are considered. The process stops once the rules get larger than a specified size or there are no more candidates to consider.

We now describe how the strategy works. Rules are evaluated in a series of stages consisting of rules with incrementally larger sizes. Each stage has two phases: first set and bag rules are evaluated, followed by the corresponding list rules for the successful set and bag rules. The first stage consists of an evaluation of all of the simple features using rules of the form

$$ R \longrightarrow F_i[f(\Theta^{R_i})]; \tag{G4.8} $$

where $F_i$ is the $i$-th simple feature and $f(\Theta^{R_i})$ are the user-selected thresholds on position, score, and orientation of the feature. These may be defined on a feature-by-feature basis or globally for all simple features. The ROCs for the simple features are compared to the ROC curve of the $R_\emptyset$ rule, which is the diagonal line from (0, 0) to (1, 1). The first phase of the second stage consists of evaluating all set and bag rules containing two feature instances, e.g.,

$$ R_{\{ij\}} \xrightarrow{\{\theta_{size}^{R_{\{ij\}}}\}} F_i[f(\Theta^{F_i})], F_j[f(\Theta^{F_j})]; \tag{G4.9} $$

where $\theta_{size}^{R_\cdot}$ is the size bound parameter. The user specifies an upper (least stringent) bound for $\theta_{size}^{R_\cdot}$ at the start of the run. The second phase of the second stage consists of evaluating all list rules with two feature instances of the form

$$ R_{[i,j]} \xrightarrow{[\theta_{size}^{R_{[i,j]}}]} F_i[f(\Theta^{F_i})], F_j[f(\Theta^{F_j})]; \tag{G4.10} $$

for all sets and bags that were improvements over their component features. This process continues in the same fashion, i.e., adding features to sets and bags to create new sets and bags as long as all predecessors of the new sets and bags were evaluated and found to be improvements. As in stage two, each set or bag that is an improvement is converted to all possible list collections and evaluated. The process terminates either when the features reach a fixed number of feature instances or when there are no new features to evaluate.

Assessing Improvement

The strategy can assess improvement in two ways. The first way is simply to consider the AUCs of a rule and its predecessors. The user defines an improvement threshold, $\delta_{AUC}$, and if $AUC_{R_x} - AUC_{R_y} = \Delta_{AUC} \ge \delta_{AUC}$ for all $R_y \prec R_x$ then $R_x$ is considered an improvement. This assessment works well in cases illustrated in Figure 4.11(a) where $E^+$ is homogeneous and there is a single rule that matches all positive exemplars. In this case the ROC curves for rules that approach the true rule will totally dominate those of their predecessor rules. However, since we are allowing for a heterogeneous $E^+$, a more general notion of improvement is needed. In this case, as shown in Figure 4.11(b), a successor curve may have the same or smaller AUC than its predecessor, but dominates the predecessor at lower $r_{FP}$. To identify this case we use a measure called maximum area between curves, defined as

$$ \mathrm{MABC}(C_x, C_y) = \max_{r_{FP}} \int_0^{r_{FP}} \bigl(C_x(r) - C_y(r)\bigr)\,dr \tag{4.23} $$

where the $C_*$ are the ROC curves for the successor and predecessor features respectively. If the successor ROC graph totally dominates the predecessor, then $\mathrm{MABC} = \Delta_{AUC}$; otherwise it measures the area between the curves at the left end of the curves. As with $\Delta_{AUC}$, $R_x$ is considered an improvement if $\mathrm{MABC}(C_x, C_y) \ge \delta_{AUC}$ for all $R_y \prec R_x$.
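Equation 4.23 can be approximated numerically as in the sketch below. The assumptions are specific to this illustration: each ROC curve is a sorted list of $(r_{FP}, r_{TP})$ points running from $r_{FP} = 0$ to $r_{FP} = 1$, values between points are linearly interpolated, and the integral is accumulated with the trapezoid rule on a uniform grid. The function names are hypothetical.

```python
def interp(curve, r):
    """Linearly interpolate r_TP at r_FP = r on a sorted ROC point list."""
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= r <= x1:
            if x1 == x0:
                return max(y0, y1)      # vertical segment: take the top
            return y0 + (y1 - y0) * (r - x0) / (x1 - x0)
    raise ValueError("r is outside the range of the curve")


def mabc(cx, cy, steps=1000):
    """Largest running integral of (Cx - Cy) from 0 up to any r_FP."""
    best = area = 0.0
    prev = interp(cx, 0.0) - interp(cy, 0.0)
    for k in range(1, steps + 1):
        r = k / steps
        diff = interp(cx, r) - interp(cy, r)
        area += (prev + diff) / (2.0 * steps)   # trapezoid on [r - 1/steps, r]
        best = max(best, area)
        prev = diff
    return best


diagonal = [(0.0, 0.0), (1.0, 1.0)]              # random guessing
successor = [(0.0, 0.0), (0.5, 1.0), (1.0, 1.0)]  # dominates the diagonal
gain = mabc(successor, diagonal)
```

When the successor dominates everywhere, as in this example, the result equals the difference in AUC (0.75 minus 0.5); when the successor is dominated everywhere the running integral never rises above zero.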

4.10.4

Other Strategies

The machine learning package is designed so that different exploration strategies can be easily implemented in Perl packages and plugged in at run time. Using this approach our evaluation engine could be used to emulate other learning approaches as long as they can be fit into a grammatical model. In addition the evaluation engine is compatible with hierarchical rules and so is ready to support strategies that explore the space of such rules.

4.11

How Much Data Do We Need?

As described above, we may consider a large number of grammars. Any p-values reported for a grammar must be corrected for multiple testing using, for example, a Bonferroni correction, αcorrected = Ngrammars α where Ngrammars is the number of grammars considered and α = 0.05 for example. It will turn out that we need to consider not just the size of E + , but also the size of E − . We can predict the amount of data we need by fixing an α = 0.05 and then computing the Bonferroni correction, NBC , that an observed point performance (NFP , NTP ) could withstand and still yield αmt ≤ 0.05. We start first with a two-sample binomial test for comparing rFP to rTP . We select training set sizes, |E + | and |E − |, and compute NBC (NFP , NTP ) and display the isolevels of NBC . An example is shown in Figure 4.12. What these graphs do not show is variation in point performance due to sampling during cross validation. We can view a point performance as a pair of binomial variables which are governed by this equation:

$$ \sigma_{N,p} = \sqrt{N p (1 - p)} \tag{4.24} $$

where $p = (1 - q)$. For a given $N$ this equation is maximized at $p = 0.5$. Taking, for example, $N = 40$, we find that $\sigma_{40,0.5} = 3.16$. The point performances in a ROC curve are not independent, but this gives us some idea of the amount of variation we might expect to see in the AUC and therefore how fine a distinction we will want to make between ROC curves.
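The worked value above can be reproduced with a one-line function. The name `binomial_sd` is chosen for this illustration; dividing by $N$ to express the spread as a rate is an added convenience, not something stated in the text.

```python
import math


def binomial_sd(n, p):
    """Equation 4.24: standard deviation of a binomial count."""
    return math.sqrt(n * p * (1.0 - p))


sd_count = binomial_sd(40, 0.5)   # about 3.16 exemplars
sd_rate = sd_count / 40           # the same spread expressed as a rate
```

For 40 exemplars the spread corresponds to roughly 0.08 in rate units, which gives a sense of how small a difference between two ROC curves can be resolved.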

Figure 4.11: Comparison of ROC Curves. (a) shows a series of ROC graphs (solid, long dashed, and dashed) which exhibit strict dominance, i.e., each curve has a higher $r_{TP}$ at a given $r_{FP}$ than its predecessor. Identifying the best curve is easy. (b) shows a pair of curves that have the same AUC, but one (dashed) shows better performance at low $r_{FP}$ and may reflect a subpopulation of $E^+$ identified by, for example, a pair of TFBS's. (Axes in both panels: FP and TP, each from 0.0 to 1.0.)

Figure 4.12: Examples of Bonferroni correction tolerance with a range of actual positives. Panels: (a) $N_{AP} = 10$, $N_{AN} = 50$; (b) $N_{AP} = 20$, $N_{AN} = 100$; (c) $N_{AP} = 40$, $N_{AN} = 200$; (d) $N_{AP} = 80$, $N_{AN} = 400$. (Axes in each panel: FP and TP counts.) The number of actual negatives is fixed at five times the number of actual positives. Points outside the scalloped lines will yield a corrected p-value better than 0.05 with the indicated Bonferroni correction. For example, points outside the lines labeled 574 (the number of vertebrate PWM's in TRANSFAC v8.4) indicate the $(N_{FP}, N_{TP})$ points that will have a significance better than $p \times 574 = 0.05$. The five-fold ratio between actual negatives and actual positives causes the asymmetry in the iso-significance lines.

4.12

Conclusion

In this chapter we have compared our grammar induction scenario with those that have been analyzed in the literature. This analysis leads us to work in a combination of the PAC and stochastic learning scenario. Like the PAC learning scenario we work with positive and negative exemplars of unknown distribution and do not expect exact learning of the target grammar. We also incorporate the grammar space exploration techniques of stochastic grammar induction. As part of this analysis we identify the importance of the rate of random occurrence of features in the control set, E − as the main means of preventing over-generalization. We therefore define some basic terms regarding the rate of random occurrence. We then examine simple combinations of two point features and find that two is the limit of simple exact calculation of background probabilities. This forces us to use actual sequences for E − rather than calculated frequencies. We presented our method for evaluating rules that include scoring, position, and orientation parameters to characterize both overall performance and identify optimal performance parameter settings. We then examined the size and structure of the set of possible rules. There are a very large number of possible rules so we need to calculate how many sequences we will need to find statistically significant results in the face of multiple testing correction. Finally, we present strategies for exploring the space of possible rules.


Chapter 5

Entropy and Tissue Specificity

This is work done in collaboration with Winfried-Paul Schuller, Claudia Kappen, J. Michael Salbaum, Maja Bucan, and Christian J. Stoeckert Jr. It is essentially the contents of [189] with some embellishments. The Kappen group contributed the TATA and initiator element analysis of pancreas-specific genes which we include here.

5.1 Introduction

Background: The regulatory mechanisms underlying tissue specificity are a crucial part of the development and maintenance of multicellular organisms. A genome-wide analysis of promoters in the context of gene expression patterns in tissue surveys provides a means to identify the general principles for these mechanisms.

Results: We introduce a definition of tissue specificity based on Shannon entropy to rank human genes according to their overall tissue specificity and by their specificity to particular tissues. We apply our definition to microarray and EST-based expression data for human genes and use similar data for mouse genes to validate our results. We show that most genes show statistically significant tissue-dependent variations of expression level. We find that the most tissue-specific genes typically have a TATA box, no CpG island, and often code for extracellular proteins. As expected, CpG islands are found in most of the least tissue-specific genes, which often code for proteins located in the nucleus or mitochondrion. Genes with neither a CpG island nor a TATA box are the most common mid-specificity genes and commonly code for proteins located in a membrane. Sp1 was found to be a weak indicator of less specific expression. YY1 binding sites, either as an initiator or as a downstream site, were strongly associated with the least specific genes.

Conclusions: We have begun to understand the components of promoters that distinguish tissue-specific from ubiquitous genes and to identify associations that can predict the broad class of gene expression from sequence data alone.

5.2 Background

The development of an adult from the single cell of a fertilized egg requires a complex orchestration of genes to be expressed at the right time, place, and level. Basic cellular functions require the expression of certain genes in all cells and tissues (i.e., in a ubiquitous manner) while specialized functions require restricted expression of other genes in a single or small number of cells and tissues (i.e., tissue specific). Both types of genes may be needed for embryonic development as well as for the function of adult cells and tissues. While the details of regulatory mechanisms will vary for individual genes, general features of promoters (and here we will restrict our focus to RNA Pol II promoters) are likely to influence whether a gene will be expressed widely or in a restricted manner. For example, promoters with CpG islands have been associated with housekeeping genes based on the limited number of genes available at the time [22, 23]. It is desirable to reexamine this finding in the context of complete genomes for human and mouse and to place it in context with subsequent findings such as the association of CpG islands with embryonic expression [165]. Furthermore, it would also be informative to examine the relationship of CpG islands to the base composition of promoters, and the distribution of motifs thought to be bound by factors closely involved with (or part of) the basal transcription complex. The distribution among genes of major components of the core promoter, the TATA box (TBP/TFIID binding site) and initiator element (Pol II binding site, Inr) [208], and of proximal elements such as the Yin-Yang 1 (YY1) site [204, 198, 175, 174], is not yet well understood. In addition, the functional correlations between tissue-specificity and promoter structure are largely unknown beyond the CpG island association. Our goal is to place these components together in general models for tissue specificity using genome-wide surveys of expression in many tissues.
Investigators have searched for combinations of transcription factor binding sites (TFBS's) that confer tissue-specific expression to particular cell types, such as muscle [234] or liver [127] in mammals, or body plan in the fruit fly [176, 21] (see [236] for a review). In support of these efforts, analyses of genome-wide expression data have largely focused on identifying common patterns for particular tissues, disease states, or signaling inputs. For microarray data, investigators have begun defining these patterns largely through the application of clustering algorithms [104, 7]. Our approach is to rank genes in the spectrum of tissue-specificity that runs from expression restricted to one tissue to uniform ubiquitous expression. We can study in detail the distribution of human and mouse genes across the spectrum of tissue-specificity and use this to identify commonalities and differences in their promoters with the availability of complete genome sequence [237], libraries enriched for full length cDNAs [220, 40, 214], and genome-wide surveys of gene expression using microarrays [104, 83, 172, 216, 180, 97], SAGE [238], mRNAs [40] and ESTs [26]. We validate patterns discovered in human sequence and expression data by comparison to similar mouse data.

Measures have been developed for overall tissue specificity [165, 106, 230] that amount to counting the number of tissues that express a gene. These are really measuring tissue restriction as they do not consider any bias in the expression levels across tissues that express the gene. Most specificity measures for a particular tissue are equivalent to the relative expression in a tissue compared to the total expression in all tissues considered, e.g., [211]. We assert that overall tissue-specificity measures should take into account the levels of expression in different tissues, not just presence and absence, and that specificity measures for particular tissues should consider the distribution of expression among all tissues in addition to the tissue of interest. Such measures would enable the correct identification of genes as specific for a tissue when that tissue is not the primary site of expression but there are only a few other tissues where the gene is expressed. A metric for characterizing the breadth and uniformity of the expression pattern of a gene that meets our criteria is the Shannon information theoretic measure entropy. Although entropy has been used previously to identify potential drug targets [75, 51] by considering the entropy of the variation of expression levels and to cluster microarray data [164], our direct application of entropy toward measuring tissue specificity is unique. Entropy (H) measures the degree of overall tissue specificity of a gene, but does not indicate whether it is specific to a particular tissue. To quantify categorical tissue-specificity, we introduce a new statistic (Q) that incorporates overall tissue specificity and relative expression level.
We demonstrate that H and Q are effective metrics for ranking and selecting genes according to tissue specificity and then proceed to use them to investigate promoter features (CpG islands, base composition, transcription factor motifs) that may be used to distinguish tissue-specific genes from non-specific genes. The association of promoter features with a quantitative assessment of tissue-specificity using H and Q is an important step towards developing models for promoter function.

5.3 Results

5.3.1 Defining Tissue Specificity

We begin by defining the measurement of two kinds of tissue specificity, overall tissue specificity and categorical tissue specificity. Overall tissue specificity ranks a gene according to the degree to which its expression pattern differs from ubiquitous uniform expression. We use the term ubiquitous expression to mean expression at any level above background in all tissues. Categorical tissue specificity places special emphasis on a particular tissue of interest and ranks a gene according to the degree to which its expression pattern is skewed toward expression in only that particular tissue. In both cases, a gene's specificity to a tissue, cell type, or other condition is decreased as the gene is more uniformly expressed in a wider variety of conditions. In addition, the categorical tissue specificity should decrease as the tissue of interest becomes a smaller component of the overall expression pattern of the gene. Given a static multi-tissue expression profile for a gene, there are at least two dimensions along which we can assess the profile to measure tissue specificity. The first dimension is the number of tissues that express the gene above some background level. It can be argued that this dimension measures tissue restriction, i.e., a gene shows restricted expression if it is expressed in only a subset of tissues. The second dimension is the uniformity of expression over all tissues that express the gene. A gene that shows significant non-uniform expression is exhibiting tissue-dependent regulation, in addition to any tissue restriction that may be occurring. We assume that a gene that exhibits no tissue-specific regulation will be expressed at the same level in every tissue. We do not assert that such genes are not regulated, only that they are regulated in a way that is not sensitive to tissue.
The term most tissue-specific will refer to the range of genes that are closer to the extreme of expression in a single tissue than to the extreme of ubiquitous uniform expression. We will refer to genes close to the uniform and ubiquitous end as either least tissue-specific or non-specific, though the latter term may not be strictly true. The range in the middle will be termed semi-tissue specific. The term housekeeping has been applied to genes that are widely expressed and may show little tissue-specific expression level changes. We can use such genes as an example of genes that will tend to be ubiquitously and uniformly expressed and thus ought to be non-specific on average. We will use the phrase gene sharing to refer to the situation that occurs when a gene is tissue-specific, and is expressed in a small number of tissues that can be said to share the gene.

5.3.2 Measuring Tissue Specificity with Entropy

We used two gene expression data sets to evaluate our methods: Affymetrix-based data from the GNF Gene Expression Atlas (GNF-GEA) [216] and the distribution of source tissues for EST libraries in the clusters and assemblies of ESTs in the DoTS mouse and human gene index [42]. As described in Methods, the GNF-GEA data were used as provided; EST counts in the DoTS gene index were adjusted with pseudocounts and normalized to account for the different number of ESTs sampled from each tissue across all libraries. Given expression levels of a gene in N tissues, we defined the relative expression of a gene g in a tissue t as

$$p_{t|g} = \frac{w_{g,t}}{\sum_{1 \le t \le N} w_{g,t}} \qquad (5.1)$$

where wg,t is the expression level of the gene in the tissue. The entropy [199] of a gene's expression distribution is

$$H_g = -\sum_{1 \le t \le N} p_{t|g} \lg(p_{t|g}). \qquad (5.2)$$

Hg has units of bits and ranges from zero for genes expressed in a single tissue to lg(N) for genes expressed uniformly in all tissues considered. The maximum value of Hg depends on the number of tissues considered so we will report this number when appropriate. Because we use relative expression, the entropy of a gene is not sensitive to the absolute expression levels. To measure categorical tissue specificity we define

$$Q_{g|t} = H_g - \lg(p_{t|g}). \qquad (5.3)$$

The quantity −lg(pt|g) also has units of bits and has a minimum of zero that occurs when a gene is expressed in a single tissue and grows unboundedly as the relative expression level drops to zero. Thus Qg|t is near its minimum of zero bits when a gene is relatively highly expressed in a small number of tissues including the tissue of interest, and becomes higher as either the number of tissues expressing the gene becomes higher, or as the relative contribution of the tissue to the gene's overall pattern becomes smaller. By itself, ranking genes by the term −lg(pt|g) is equivalent to ranking them by pt|g. Adding the entropy term serves to favor genes that are not expressed highly in the tissue of interest, but are expressed only in a small number of other tissues. As described earlier, we want to consider such genes as categorically tissue-specific since their expression pattern is very restricted. Figure 5.1 shows examples of patterns of GNF-GEA expression data for different values of Hg and Qg|t. The top 5 genes specific to mouse amygdala, lymph node, and liver as assessed by this data are listed in Table 5.1. Tables of Hg and Qg|t values for all genes in all tissues in the GNF-GEA data sets are available in the Supplementary Material for [189].
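As a minimal illustration of Equations 5.1 to 5.3, the following Python sketch computes the relative expression, Hg, and Qg|t for a single expression profile. The function names and the toy profiles are hypothetical and are not taken from the original analysis. Two conventions are assumptions of this sketch: tissues with zero expression contribute nothing to the entropy sum (the usual 0 lg 0 = 0 convention), and Q is reported as infinite when the tissue of interest has zero relative expression.

```python
import math


def relative_expression(levels):
    """Equation 5.1: each tissue's expression level divided by the gene's total."""
    total = sum(levels)
    return [w / total for w in levels]


def entropy_bits(levels):
    """Equation 5.2: Shannon entropy, in bits, of the relative expression profile."""
    # Assumption: tissues with zero expression are skipped (0 * lg 0 taken as 0).
    return -sum(p * math.log2(p) for p in relative_expression(levels) if p > 0)


def categorical_specificity(levels, tissue_index):
    """Equation 5.3: Q = H - lg(p) for the tissue of interest."""
    p = relative_expression(levels)[tissue_index]
    if p == 0:
        # Assumption: Q grows without bound as relative expression drops to zero.
        return math.inf
    return entropy_bits(levels) - math.log2(p)


# Hypothetical profiles over four tissues, in arbitrary units.
single_tissue = [0.0, 500.0, 0.0, 0.0]
uniform = [100.0, 100.0, 100.0, 100.0]
shared = [450.0, 450.0, 50.0, 50.0]
```

For the uniform four-tissue profile this gives H = lg(4) = 2 bits and Q = 2 lg(4) = 4 bits for every tissue. For the shared profile, Q is lower for either of the two highly expressing tissues than for the two weakly expressing ones.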

[Figure 5.1 graphic, panels (a) to (d): bar charts of expression level (y-axis, "Expression") for each tissue in the mouse GNF-GEA survey (x-axis): Adipose, Adrenal gland, Amygdala, Bladder, Bone, Bone marrow, Brown fat, Cerebellum, Dorsal root ganglion, Epidermis, Eye, Frontal cortex, Gall bladder, Heart, Hippocampus, Hypothalamus, Kidney, Large intestine, Liver, Lung, Lymph node, Mammary gland, Olfactory bulb, Ovary, Placenta, Prostate, Salivary gland, Skeletal muscle, Small intestine, Spinal cord, Spleen, Stomach, Striatum, Testis, Thymus, Thyroid, Tongue, Trachea, Trigeminal, Umbilical cord, Uterus.]

Figure 5.1: Examples of GNF-GEA expression patterns for mouse genes at selected Hg and Q. Liver is the tissue of interest for Q values and is indicated in black. The figures are based on images taken from http://expression.gnf.org/. (a) Very specific liver expression: H = 1.3 bits and Qg|liver = 2.1 bits, 94777_at, Alb1, serum albumin. (b) Liver is a strong but not dominant part of the expression pattern: H = 3.7 bits and Qg|liver = 6.8 bits, 9452_g_at, Lisch7, liver-specific bHLH-Zip transcription factor. (c) Near uniform expression: H = 4.3 bits and Qg|liver = 10.2 bits, 104391_s_at, Clcn7, chloride channel 7. (d) Liver exhibits very low expression in an otherwise widely expressed gene: H = 4.4 bits and Qg|liver = 15.1 bits, 93750_at, Gsn, gelsolin.


Tissue | Probe Set ID | H | Q | RefSeq | Description
Amygdala | 96055_at | 3.2 | 5.8 | NM_031161 | cholecystokinin
Amygdala | 93178_at | 2.7 | 5.8 | NM_019867 | neuronal guanine nucleotide exchange factor
Amygdala | 93273_at | 3.7 | 5.8 | NM_009221 | synuclein, alpha
Amygdala | 92943_at | 3.5 | 6.0 | NM_008165 | glutamate receptor, ionotropic, AMPA1 (alpha 1)
Amygdala | 95436_at | 3.3 | 6.1 | NM_009215 | somatostatin
Lymph Node | 98406_at | 2.7 | 4.0 | NM_013653 | chemokine (C-C motif) ligand 5
Lymph Node | 98063_at | 1.6 | 4.1 | - | glycosylation dependent cell adhesion molecule 1
Lymph Node | 99446_at | 2.5 | 4.1 | NM_007641 | membrane-spanning 4-domains, subfamily A, member 1
Lymph Node | 92741_g_at | 3.3 | 4.5 | - | immunoglobulin heavy chain 4 (serum IgG1)
Lymph Node | 102940_at | 2.8 | 4.6 | NM_008518 | lymphotoxin B
Liver | 94777_at | 1.3 | 2.1 | - | albumin 1
Liver | 101287_s_at | 1.6 | 2.2 | NM_010005 | cytochrome P450, 2d10
Liver | 99269_g_at | 1.5 | 2.2 | NM_019911 | tryptophan 2,3-dioxygenase
Liver | 100329_at | 1.4 | 2.3 | NM_009246 | serine protease inhibitor 1-4
Liver | 94318_at | 1.6 | 2.3 | NM_013475 | apolipoprotein H

Table 5.1: The top 5 most tissue-specific known genes for representative tissues. Genes must express at least 200 au in one or more tissues. A full list of all genes is available in the Supplemental material.

To compare results from microarray and EST-based expression data we mapped the tissues from the GNF-GEA study to the hierarchical controlled vocabulary of anatomical terms used by DoTS and chose a set of 45 tissue terms grouped into 32 groups shown in Table 5.2. In both cases, the vast majority of genes are widely expressed as measured by Hg, as shown in Figure 5.2 (a). Of the 7714 probe sets in the GNF-GEA data with an average normalized intensity value above 50 arbitrary units (au), 6167 (80%) had Hg ≥ 4 bits, which implies expression in at least 16 tissues and typically corresponds to wider, but uneven, expression. Only 87 (2%) of genes had Hg ≤ 1.5 bits, which corresponds to expression in as few as three tissues. Both microarray- and EST-based data yielded similar overall curves. The EST curve peaked at a lower Hg than the microarray curve. This was due to the small numbers of EST sequences in some of the tissues we considered; EST counts for tissues ranged from 1933 in the adrenal gland to 331,582 in the central nervous system (CNS). Genes that are ubiquitously expressed may not have ESTs from several of the lightly-sequenced tissues, making them appear to have more restricted expression than they really do and hence have lower entropy. Figure 5.2 (b) shows the correlation between estimates of Hg derived from microarray and EST data. Visual inspection of the plot reveals that while there are no strong contradictions between the two methods, quantitative agreement is limited. Detailed analysis shows that the standard deviation of the difference of paired Hg values is 0.61 bits. Under the null hypothesis that the estimates from the two data sources are totally uncorrelated, the average standard deviation was found to be 0.91 bits. We can reject the null hypothesis (P < 10^-5 as estimated by Monte Carlo methods). The distribution of Qg|t for selected tissues is shown in Figure 5.2 (c).
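The Monte Carlo comparison of paired Hg estimates can be sketched as a permutation test. The text does not say how the uncorrelated null was simulated, so shuffling the pairing between the two data sources is an assumption of this sketch, and the function names are hypothetical.

```python
import random
import statistics


def sd_of_differences(h_a, h_b):
    """Standard deviation of the paired differences between two lists of H values."""
    return statistics.pstdev([a - b for a, b in zip(h_a, h_b)])


def permutation_test(h_a, h_b, n_shuffles=2000, seed=0):
    """Compare the observed SD of paired differences with an uncorrelated null.

    Assumption: the null is simulated by randomly re-pairing the H values of the
    second data source with those of the first.
    """
    rng = random.Random(seed)
    observed = sd_of_differences(h_a, h_b)
    shuffled = list(h_b)
    null_sds = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null_sds.append(sd_of_differences(h_a, shuffled))
    # One-sided p-value: how often a random pairing agrees at least as well.
    hits = sum(1 for s in null_sds if s <= observed)
    p_value = (hits + 1) / (n_shuffles + 1)
    return observed, statistics.mean(null_sds), p_value
```

With this estimator the smallest reportable p-value is 1/(n_shuffles + 1), so a bound such as P < 10^-5 requires on the order of 10^5 or more shuffles.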
These curves can be used to characterize tissues in terms of the number of tissue-specific genes and the amount of gene sharing, e.g., liver has a relatively large number of genes shared with a small number of other tissues. By contrast, there were no genes in this set that are uniquely expressed in the amygdala. The most specific genes in amygdala have Hg ≈ 5. This is because the tissue set contains a large number of nervous system tissues. Figure 5.2 (d) shows the distribution of Hg for a set of tissues that contains only three nervous system tissues: amygdala, cerebellum, and spinal cord. In this analysis amygdala can be seen to contain roughly as many tissue specific genes as skeletal muscle. It is important to determine how well the Hg and Qg|t statistics can be estimated from a data set to determine the smallest meaningful difference in scores and to guide interpretation of gene rankings. To assess the standard deviations of Hg and Qg|t, we sampled from the replicates in the GNF-GEA microarray data to compute a large number of Hg values for each probe set. We found that the standard deviation for Hg was less than 0.2 bits for 97% of genes. Qg|t was not estimated as well; the standard deviation was 1 bit or less for 95% of gene and tissue pairs. This was probably

[Figure 5.2 graphic, panels (a) to (d). Recoverable labels: series DoTS and GNF-GEA; series >= 30 ESTs and >= 100 ESTs; series Liver, Skeletal Muscle, Mammary Gland, and Amygdala; axes H [bits], Q [bits], H (DoTS), H (Novartis), Prob, N [0.1x0.1 H bins], and Cumulative Genes.]

Figure 5.2: Distributions of H and Q for different data sources and tissues. (a) Distribution of H as estimated from GNF-GEA and DoTS. DoTS curve was generated from genes with at least 6 ESTs. (b) Correlation of H estimates from GNF-GEA and DoTS. Genes with at least 30 (100) ESTs are shown in solid (dashed). (c) Cumulative distribution of Q values for selected mouse tissues and the average for all 39 tissues. Mammary gland, liver, muscle, and the amygdala have decreasing numbers of highly tissue-specific genes. Liver has a very large number of relatively specific genes. All distributions peak at 2 lg(39) = 10.6 bits and have a tail at high Q (not shown) that corresponds to genes that are ubiquitously expressed except in the tissue of interest. (d) Similar data as in part (c) but with only three nervous system tissues: cerebellum, amygdala, and spinal cord. This tissue set highlights the brain-specific genes in amygdala which is now more similar to skeletal muscle in terms of the number of specific genes.


GNF-GEA Tissues | Comparison to EST
DRG, trigeminal | PNS
hippocampus, amygdala, frontal cortex, cortex, striatum, olfactory bulb, hypothalamus, spinal cord lower, spinal cord upper, cerebellum | CNS
eye | eye
spleen | spleen
lymph node | lymph node
trachea | trachea
thymus | thymus
bone marrow, bone | bone
lung | lung
uterus | uterus
umbilical cord | umbilical cord
placenta | placenta
ovary | ovary
epidermis, snout epidermis | epidermis
heart | heart
skeletal muscle | skeletal muscle
adipose tissue, brown fat | fat
adrenal gland | adrenal gland
stomach | stomach
bladder | bladder
small intestine | small intestine
large intestine | large intestine
gall bladder | gall bladder
liver | liver
kidney | kidney
salivary gland | salivary gland
thyroid | thyroid
mammary gland | mammary gland
prostate | prostate
testis | testis

Table 5.2: The list of tissues available in the mouse GNF-GEA survey and the groupings of tissues used to compare microarray and EST-based entropy estimates.


due to the high standard deviation of the −lg(pt|g) term for low expressing gene-tissue pairs. We found much more variation when we measured reproducibility by considering genes that have two or more probe sets (and therefore two or more different transcripts) in the microarray data. In this case, the standard deviation of Hg estimates was as high as 1 bit for 97% of the genes but less than 0.3 bits for about 70 to 80% of the genes. We chose a minimum of 1 bit for Hg bins and 2 bits for Q bins in the rest of the analyses that require binning. This bin size ensured that most of the genes are in the proper bin and thus the bin could be reliably used to determine associations with the tissue-specificity of a class of genes.
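The replicate-based estimate of the standard deviation of Hg can be sketched as follows. The exact resampling scheme is not given in the text, so drawing one replicate per tissue at random for each resampled profile is an assumption of this sketch, as are the function names and the input layout (one list of replicate values per tissue, with a positive total).

```python
import math
import random
import statistics


def entropy_bits(levels):
    """Shannon entropy, in bits, of a profile's relative expression (Equation 5.2)."""
    total = sum(levels)
    return -sum((w / total) * math.log2(w / total) for w in levels if w > 0)


def entropy_sd_from_replicates(replicates_by_tissue, n_draws=1000, seed=0):
    """Estimate the standard deviation of H for one probe set.

    replicates_by_tissue holds one list of replicate expression values per tissue.
    Assumption: each resampled profile picks one replicate per tissue at random.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        profile = [rng.choice(replicates) for replicates in replicates_by_tissue]
        draws.append(entropy_bits(profile))
    return statistics.pstdev(draws)
```

A probe set whose resampled standard deviation is well below the 1-bit bin width would be expected to fall in the same Hg bin regardless of which replicates are used.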

5.4 Evaluating a Set of Housekeeping Genes

A test of the Hg and Qg|t statistics is to determine values for a set of non-specific genes such as housekeeping genes. A list of 797 human housekeeping genes [65] was evaluated using these statistics based on the GNF-GEA data set, using RefSeq accession numbers to identify appropriate probe sets. The housekeeping genes had a mean Hg = 4.6 ± 0.27 bits in a set of 27 tissues with a maximum H = lg(27) = 4.75 bits; thus they are non-specific as expected. Interestingly, a small number of these genes did show some degree of tissue specificity yet were ubiquitously expressed. For example, the median expression of NM_021983, the major histocompatibility complex, class II DR beta 4 gene (32035_at), is approximately 200 au, but it shows much higher expression in a small set of tissues (spleen, thymus, lung, heart, and whole blood), which lowered its entropy. A more extreme case is NM_001502, glycoprotein 2 (zymogen granule membrane protein 2), which is expressed between 250 and 1000 au in all tissues except pancreas, where it is expressed at 34183 au. This is a ubiquitously expressed gene that entropy categorizes as specific since it showed such extreme tissue-specific induction. The housekeeping genes had a mean Qg|t = 9.5 ± 0.14 bits in the same set of tissues. The expected Q value for a uniformly and ubiquitously expressed gene is 2 lg(27) = 9.5 bits. Thus, the Hg and Qg|t statistics successfully captured the expected expression properties of housekeeping genes.
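The expected value of Q for a uniformly expressed gene follows directly from Equations 5.1 to 5.3. The short derivation below is a worked check of that value for N = 27 tissues.

```latex
p_{t|g} = \frac{1}{N}
\;\Rightarrow\;
H_g = -\sum_{1 \le t \le N} \frac{1}{N}\lg\frac{1}{N} = \lg N,
\qquad
Q_{g|t} = H_g - \lg\frac{1}{N} = 2\lg N .

N = 27:\quad \lg 27 \approx 4.75, \qquad Q_{g|t} = 2\lg 27 \approx 9.51 .
```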

5.5 Most Genes are Regulated in a Tissue-Dependent Manner

Although the housekeeping genes assessed above have relatively high entropies, they do show some small degree of overall tissue specificity. We therefore sought to determine how many genes show evidence of tissue-dependent regulation. Since random biological and experimental variation introduce fluctuations in the expression levels of genes, we made a probability model of the effect of these fluctuations on the observed entropy. The experimental variability was estimated from the GNF-GEA data using all normal tissues. The random tissue-to-tissue biological variability was modeled by assuming that each gene has an average expression level across all tissues and that the log base 2 of the tissue-dependent fold changes from the average level follow a normal distribution with mean equal to zero and some unknown, but small, standard deviation (s). We obtain a conservative estimate of the number of genes showing evidence of tissue-dependent regulation by using s = 0.5, which allows for a relatively large amount of variation: up to 1.4-fold tissue-to-tissue variation around the mean expression level in about 63% of tissues and larger changes in the remaining tissues. As a threshold for selecting genes with tissue-dependent expression, we chose Hg = 4.52 bits, which has a p-value of 0.005 under the null hypothesis that all genes are uniform. We then find that 5837/8703 (67%) of human genes have entropies less than this and so are probably regulated in a tissue-dependent manner. If we use a more stringent definition of uniform expression that allows half as much variation in tissue-to-tissue expression levels (s = 0.25), then the threshold is Hg = 4.62 bits and we find that 7584/8703 (87%) of human genes show evidence of tissue-dependent regulation. Similar results are found in mouse using all 42 distinct tissues, where the corresponding thresholds are Hg = 5.24 bits (s = 0.5) and Hg = 5.35 bits (s = 0.25) and the fractions of genes showing tissue-dependent expression are 5467/7913 (69%) and 7482/7913 (94%) respectively. Thus we conclude that most genes show evidence of tissue-dependent expression levels.
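The biological-variability part of this null model can be sketched with a small simulation. This is a simplified illustration under stated assumptions, not the model used for the thresholds above: it omits the experimental variability estimated from the GNF-GEA replicates, so it should not be expected to reproduce the reported values of 4.52 or 4.62 bits. The function names and the number of simulated genes are hypothetical.

```python
import math
import random


def entropy_bits(levels):
    """Shannon entropy, in bits, of a profile's relative expression (Equation 5.2)."""
    total = sum(levels)
    return -sum((w / total) * math.log2(w / total) for w in levels if w > 0)


def null_entropy_threshold(n_tissues, s, p_value=0.005, n_genes=5000, seed=0):
    """Lower-tail entropy threshold for genes with no tissue-dependent regulation.

    Each simulated gene has log2 fold changes from its mean level drawn from a
    normal distribution with mean 0 and standard deviation s.
    Assumption: experimental noise is ignored in this sketch.
    """
    rng = random.Random(seed)
    entropies = []
    for _ in range(n_genes):
        levels = [2.0 ** rng.gauss(0.0, s) for _ in range(n_tissues)]
        entropies.append(entropy_bits(levels))
    entropies.sort()
    # Entropy below which a fraction p_value of the simulated null genes fall.
    return entropies[int(p_value * n_genes)]
```

Genes whose observed Hg falls below the returned threshold would be called tissue-dependent at the chosen p-value. A smaller s gives a threshold closer to lg(N), which is why the more stringent definition of uniform expression classifies more genes as tissue-dependent.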

5.6 Clustering Tissues Using Q

A test of Qg|t with respect to specific genes is to evaluate the tissues in which they rank highly (i.e., have low Q) for consistency. This was accomplished by clustering tissues with similar tissue-specific genes and inspecting the clusters formed. We used 27 normal human tissues and, separately, 39 tissues from the GNF-GEA data for mouse and selected the genes (N = 3768 human and N = 1786 mouse) that express at least 200 au in at least one tissue and have Qg|t ≤ 7 in at least one tissue. With these genes, we made a consensus hierarchical clustering of the tissues as shown in Figure 5.3. We found that the tissues in the nervous system, reproductive structures (excluding testis), immune system, and digestive system reliably cluster together in both species. In addition, skeletal muscle and heart clustered in mouse; the human survey did not have skeletal muscle. These results suggest that Qg|t is correctly identifying tissue-specific genes. Interestingly, testis is an outlier in both trees, indicating that the collection of genes expressed in testis is distinct from that of any other tissue or organ.

Furthermore, Hg and Qg|t can also be used in conjunction with a tissue hierarchy to answer more complex questions about the tissue distribution of genes such as ‘what genes are specific to the brain but are widely expressed throughout the brain?’ In Table 5.3 we list the top 5 mouse genes expressed specifically but uniformly across three of the highlighted groups in Figure 5.3(b).

5.7 CpG Islands are Associated with the Least Tissue-Specific Genes

It has been proposed that CpG islands are predominantly associated with promoters of housekeeping genes [23]. We performed a quantitative test of this hypothesis using the GNF-GEA data and determining the frequency of CpG islands in promoters as a function of Hg. We considered only predicted CpG islands that span the start of transcription (see [165] for a justification of this definition), and genes that expressed at least at the median level of 200 au (i.e., moderately expressed) in at least one tissue and were represented by a single probe set on the Affymetrix chip used in the GNF-GEA experiments. Promoter sequences were obtained from DBTSS and were based on the 5′ ends of full length transcripts [220]. We found that there is a strong, roughly linear, correlation between a gene's entropy Hg and the probability that the gene will have a predicted start CpG island, as shown in Figure 5.4. Start CpG islands were associated with only 9 of the 100 most tissue-specific human genes as compared to 80% of the least tissue-specific genes. Similar numbers were found for mouse (7% start CpG island frequency for the 100 most tissue specific genes; about 64% for the least tissue-specific genes). A comparison of CpG islands from the most and least tissue-specific genes did not reveal any significant difference in the overall base composition, or ratio of observed to expected CpG dinucleotides. The distribution of the position of the 5′ end point of CpG islands was also very similar for the most and least tissue-specific genes, though CpG islands tend to start further upstream in the least tissue-specific genes (data not shown). Another group of genes observed to be associated with CpG islands are those expressed in the early embryo [165] from the fertilized egg to the blastocyst.
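The tabulation behind Figure 5.4, the fraction of genes with a start CpG island in consecutive groups of 100 genes ranked by Hg, can be sketched as below. The input layout (pairs of entropy and a CpG island flag) and the function name are hypothetical, and keeping a final group with fewer than 100 genes is an assumption of this sketch.

```python
def cpg_fraction_by_entropy_rank(genes, group_size=100):
    """Fraction of genes with a start CpG island in consecutive entropy-ranked groups.

    genes is a list of (entropy_bits, has_start_cpg_island) pairs.
    Returns one (mean_entropy, cpg_fraction) pair per group, lowest entropy first.
    Assumption: a final, smaller group is kept rather than dropped.
    """
    ranked = sorted(genes, key=lambda gene: gene[0])
    fractions = []
    for start in range(0, len(ranked), group_size):
        group = ranked[start:start + group_size]
        mean_entropy = sum(h for h, _ in group) / len(group)
        cpg_fraction = sum(1 for _, has_cpg in group if has_cpg) / len(group)
        fractions.append((mean_entropy, cpg_fraction))
    return fractions
```

Plotting the second element of each pair against the first gives a curve of the same kind as the one shown for human and mouse in Figure 5.4.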
The question arises as to whether there is an association of genes having start CpG islands and the developmental stage of expression (i.e., embryonic versus adult) in addition to the one for tissue specificity. We investigated this possibility in the mouse using DoTS [42] EST and mRNA assemblies by tabulating the number of DoTS genes that contain at least two ESTs from a mouse early embryo library as shown in Table 5.4. We considered 933 genes with start CpG islands (CGI+) and 1007 genes without start CpG islands (CGI–) that were expressed in the adult. If there were no developmental bias, this distribution of CGI+ and CGI– genes should be maintained in genes expressed in the embryo.

[Figure 5.3 graphic: consensus tissue trees. Leaf labels in panel (a), human: cortex, testis, thyroid, trachea, salivary gland, heart, adrenal gland, DRG, pituitary gland, placenta, uterus, ovary, prostate, amygdala, whole brain, thalamus, caudate nucleus, cerebellum, spinal cord, corpus callosum, blood, thymus, spleen, lung, pancreas, kidney, liver. Leaf labels in panel (b), mouse: amygdala, hippocampus, frontal cortex, striatum, olfactory bulb, hypothalamus, spinal cord, ovary, cerebellum, placenta, trigeminal, umbilical cord, DRG, uterus, eye, fat, adrenal gland, epidermis, heart, skeletal muscle, spleen, lymph node, trachea, thymus, bone, bone marrow, lung, small intestine, large intestine, stomach, bladder, liver, gall bladder, kidney, salivary gland, testis, thyroid, mammary gland, prostate.]

Figure 5.3: Consensus tissue tree of tissues from human and mouse data. Trees are the consensus of trees created from 5000 random samples of sets of 1000 genes from (a) 3768 (human) or (b) 1786 (mouse) genes with Qg|t ≤ 7 bits in at least one tissue. The length of the line leading into a node indicates how many trees did not include the set of tissues to the right of the node. The shortest lines correspond to unanimous subgroups. We highlighted all maximal subgroups that occurred in at least half of the sampled trees. The tissues not included in a highlighted subgroup typically have statistically significant overlap with many of the highlighted tissues as estimated using the hypergeometric distribution.


Tissue Cluster | Probe Set ID | H | Q | RefSeq | Description
Nervous System | 100047_at | 3.3 | 3.4 | NM_011428 | synaptosomal-associated protein, 25 kDa
Nervous System | 103030_at | 3.5 | 3.6 | - | dynamin
Nervous System | 97983_s_at | 3.7 | 3.8 | NM_009295 | syntaxin binding protein 1
Nervous System | 98339_at | 3.7 | 3.8 | NM_018804 | synaptotagmin 11
Nervous System | 94545_at | 3.7 | 3.8 | NM_153457 | reticulon 1
Immune System | 96648_at | 2.807 | 2.882 | NM_009898 | coronin, actin binding protein 1A
Immune System | 93584_at | 3.373 | 3.622 | - | immunoglobulin heavy chain 6 (heavy chain of IgM)
Immune System | 101048_at | 3.541 | 3.876 | NM_011210 | protein tyrosine phosphatase, receptor type, C
Immune System | 94278_at | 3.495 | 3.923 | NM_008879 | lymphocyte cytosolic protein 1
Immune System | 100156_at | 3.609 | 4.039 | NM_008566 | mini chromosome maintenance deficient 5
Liver and Gall Bladder | 94777_at | 1.280 | 1.326 | - | albumin 1
Liver and Gall Bladder | 100329_at | 1.394 | 1.464 | NM_009246 | serine protease inhibitor 1-4
Liver and Gall Bladder | 99269_g_at | 1.471 | 1.561 | NM_019911 | tryptophan 2,3-dioxygenase
Liver and Gall Bladder | 99862_at | 1.503 | 1.595 | NM_013465 | alpha-2-HS-glycoprotein
Liver and Gall Bladder | 96846_at | 1.515 | 1.607 | NM_080844 | serine (or cysteine) proteinase inhibitor, clade C (antithrombin), member 1

Table 5.3: The top 5 most group-specific known mouse genes for selected tissue groups. The tissue groups were identified in a consensus clustering of tissues based on common tissue-specific genes. The Q value is for the gene and tissue group. To ensure uniform expression across the tissue group, genes were required to have an entropy on the tissue group that was 90% of the maximum possible for the group.

[Figure 5.4 image: x-axis, Entropy (0 to 6 bits); y-axis, Fraction of Promoters w/ CpG Island (0 to 0.9); series, Human and Mouse.]

Figure 5.4: The fraction of start CpG islands in genes ranked by entropy Hg increases with entropy. Each point represents the fraction of genes in consecutive groups of 100 genes ranked by entropy Hg computed from GNF-GEA data. Genes in this set express above 200 au in at least one tissue. The human data set has 26 tissues (maximum H = 4.7 bits), the mouse data set has 42 tissues (maximum H = 5.3 bits).
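The maximum entropies quoted in the caption follow from the number of tissues: a gene expressed uniformly across N tissues has entropy log2(N). The sketch below illustrates this, assuming that Hg is the Shannon entropy of a gene's relative expression across tissues and that Q takes the form Qg|t = Hg − log2(p(t|g)). Both forms are assumptions here, since the definitions appear earlier in the dissertation and are not restated in this section; the function names are illustrative only.

```python
import math

def relative_expression(levels):
    """Normalize one gene's per-tissue expression levels so they sum to 1."""
    total = sum(levels)
    return [x / total for x in levels]

def entropy_bits(levels):
    """Shannon entropy (bits) of a gene's expression distribution over tissues."""
    p = relative_expression(levels)
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def q_bits(levels, tissue_index):
    """Assumed form of Q for one gene and tissue: H_g - log2(p_{t|g})."""
    p = relative_expression(levels)
    if p[tissue_index] == 0:
        return math.inf  # a gene not expressed in the tissue is never specific to it
    return entropy_bits(levels) - math.log2(p[tissue_index])

# Uniform expression over 26 tissues gives the maximum entropy, log2(26).
uniform_human = [200.0] * 26
print(round(entropy_bits(uniform_human), 1))  # 4.7

# A gene expressed in a single tissue has zero entropy and Q = 0 in that tissue.
single_tissue = [1000.0] + [0.0] * 25
print(entropy_bits(single_tissue), q_bits(single_tissue, 0))
```

Under these assumed definitions, low Hg marks tissue-specific genes and low Qg|t marks genes specific to tissue t, which matches how both quantities are used in this chapter.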


However, only 139 (14%) of the CGI– genes were expressed in the early embryo in contrast to 365 (39%) of the CGI+ genes (P = 3 × 10^-70, exact binomial). Therefore, a gene expressed in the adult was 2.8 (= 0.39/0.14) times more likely to be expressed in the early embryo if it contained a start CpG island. Furthermore, the most tissue-specific genes expressed in the adult were 4 times more likely to have been expressed in the early embryo if their promoter contained a start CpG island. These results strongly suggest that CpG islands are promoter features for both embryonic genes and the least tissue-specific genes.

Gene Type       CpG Island State  Total Genes  Expressed Genes  Fraction  Fraction Ratio
Embryo          CGI+              933          365              39%       2.8
Embryo          CGI–              1007         139              14%
Adult Specific  CGI+              29           8                28%       4.0
Adult Specific  CGI–              180          12               7%

Table 5.4: CpG islands are correlated with embryonic expression even for tissue-specific genes. We determined the fraction of genes with (39%) and without (14%) start CpG islands that are expressed in the early embryo. A gene is 2.8 (= 0.39/0.14) times more likely to be expressed in the early embryo if it has a start CpG island. If we then consider genes that go on to be specific in the adult, we find that the ratio of CGI+ to CGI– genes is now 4 (= 0.28/0.07). The differences in rates between CpG island states within each stage are significant (P < 0.0005; binomial). Of the between-stage comparisons, only the CGI– adult-specific/embryo change is significant (P = 0.0009; hypergeometric).
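The fraction ratio in Table 5.4 and an exact binomial tail probability can be sketched as follows. The choice of null hypothesis is an assumption: the test below asks how likely it is to see at least 365 of 933 CGI+ genes expressed in the embryo if they were expressed at the CGI– rate (139/1007). The text does not state which null was used, so the tail value need not reproduce the reported P. The helper names are illustrative.

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid underflow."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    return math.exp(peak) * sum(math.exp(v - peak) for v in logs)

# Counts from Table 5.4 (embryo rows).
cgi_plus_expr, cgi_plus_total = 365, 933
cgi_minus_expr, cgi_minus_total = 139, 1007

frac_plus = cgi_plus_expr / cgi_plus_total
frac_minus = cgi_minus_expr / cgi_minus_total
ratio = frac_plus / frac_minus
print(f"fraction ratio = {ratio:.1f}")  # 2.8

# Assumed null: CGI+ genes are expressed in the embryo at the CGI- rate.
p_tail = binom_upper_tail(cgi_plus_expr, cgi_plus_total, frac_minus)
print(f"upper-tail probability = {p_tail:.1e}")
```

Working in log space matters here because the individual binomial terms are far smaller than ordinary floating-point products of small probabilities can represent reliably.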

5.8 Base Composition of Promoters Depends on Specificity

Analysis of base composition profiles of promoters provides clues to common features, including motifs, associated with promoter categories. We examined the base composition profiles of human promoters of high (0 ≤ Hg ≤ 3.5 bits) and low (4.4 ≤ Hg ≤ 4.71 bits) tissue-specificity genes. We considered CGI+ and CGI– genes separately since the presence of a CpG island strongly influences the base composition and the fraction of start CpG islands varies with entropy. In addition, the presence of a start CpG island may indicate a different regulation mechanism related to either tissue-specificity or embryonic expression (or both). The numbers of promoters from DBTSS in these four classes that were used in the analysis were: 310 CGI– and 129 CGI+ high specificity; 342 CGI– and 1501 CGI+ low specificity. Genes that have only non-start CpG islands represented a minor component and were not included in this analysis. We used the full set of normal tissues in the first GNF-GEA microarray study for human and mouse. Base composition profiles with 10bp windows are shown in Figure 5.5 for human genes. Each of the features we report was observed in human and mouse (unless noted otherwise) and compares G to C or A to T over spans of at least 10 positional bins; the probability of observing a feature at least this long by chance is less than 0.5^10 ≈ 0.001.

Promoters of CGI+ genes (Figure 5.5 a and b) shared common features but could also be distinguished based on tissue-specificity. A common feature of CGI+ promoters was the increase in C+G content that starts at 1000 bp upstream of the transcription start site and continues to 200 bp downstream. The C+G bias reached p(C+G) = 0.7 at the start of transcription and continued into the 5' UTR. Non-specific (Figure 5.5 c) and tissue-specific (Figure 5.5 d) CGI– genes still showed a C+G bias around the start of transcription, but it was much smaller in magnitude at p(C+G) = 0.54.

The low specificity CGI+ genes (Figure 5.5 a) showed upstream base composition biases that were not found in any of the other three gene classes. There was a preference for C over G (p(C) > p(G)) in the (-350, -150) region and also a preference for A over T (p(A) > p(T)) in the (-600, -200) region in human (this region is located at (-400, -150) in mouse). In tissue-specific CGI+ genes (Figure 5.5 b) the strong C+G bias held but p(C) = p(G), except for the (+50, +100) region where p(C) > p(G). These base composition differences between non-specific and tissue-specific promoters, observed over regions of hundreds of base pairs even in the context of a CpG island, suggest different structural features and regulatory mechanisms for these CGI+ classes.

Most striking were differences between non-specific and tissue-specific promoters that are independent of the presence of a CpG island. A sharp spike in the proportion of A and T was seen in the (-50, -1) region for all classes but was most pronounced in the tissue-specific promoters (Figure 5.5 b and d). These spikes correspond to the presence of a TATA box and suggest a correlation of this motif with tissue-specific genes (explored more fully later). Conversely, all low specificity genes (Figure 5.5 a and c) shared a common feature in the (+1, +200) region where p(G) > p(C) and p(T) > p(A) that was not seen in tissue-specific genes (Figure 5.5 b and d). As shown later, this low specificity feature could be partially explained by the presence of a YY1 motif. These base composition differences between non-specific and tissue-specific promoters are likely to indicate motifs that distinguish the two classes.
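A base composition profile of the kind described above can be sketched as follows. This is a minimal illustration, not the code used for Figure 5.5: it assumes the promoters are supplied as equal-length strings aligned on the transcription start site, and the function name and toy sequences are hypothetical.

```python
from collections import Counter

def base_composition_profile(seqs, window=10):
    """Per-window base frequencies over equal-length, TSS-aligned promoter sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    profile = []
    for start in range(0, length - window + 1, window):
        counts = Counter()
        for s in seqs:
            # Count only unambiguous bases in this window.
            counts.update(b for b in s[start:start + window].upper() if b in "ACGT")
        total = sum(counts.values())
        profile.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return profile

# Two toy 20 bp "promoters": an A/T-rich first window, a C/G-rich second window.
toy = ["ACGTACGTAC" + "GGGGCCCCGC",
       "AAAATTTTAT" + "GCGCGCGCGG"]
for i, freqs in enumerate(base_composition_profile(toy, window=10)):
    print(i, freqs, "C+G =", round(freqs["C"] + freqs["G"], 2))

# Chance that one base exceeds its partner in 10 consecutive bins,
# if each bin were an independent coin flip.
print(0.5 ** 10)  # 0.0009765625
```

Comparing p(G) with p(C), or p(A) with p(T), bin by bin in such a profile gives the run lengths to which the 0.5^10 bound applies.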

5.9 Selected Transcription Factor Motifs in the Core Promoter

We next examined the distribution of basic core promoter features (the TATA box and the initiator element) and of binding sites for two selected ubiquitous transcription factors, Sp1 and YY1, to see if their presence in the proximal promoter was correlated with the tissue specificity of a gene.

[Figure 5.5 image: four base composition panels, (a) through (d), one per combination of entropy band (panel titles 4.84 ≤ H ≤ 5.18 and 0.00 ≤ H ≤ 3.85) and CpG island status (CGI+ and CGI–). x-axis: Position [bp], from -1000 to 200; y-axis: P [20bp bins], from 0.10 to 0.45; curves: A, C, G, T.]

Figure 5.5: Base composition profiles for ubiquitous and tissue-specific genes with and without start CpG islands. Data is for human genes; similar patterns were observed in mouse. (a) Ubiquitous genes with a CpG island, (b) tissue specific genes with a CpG island, (c) ubiquitous genes with no CpG island, and (d) tissue specific genes with no CpG island. Note differences in upstream C+G content, peak sizes at TATA box (-35bp) and initiator positions, and downstream C versus G differences.


Two approaches were taken, using different data sets and motif searching methods; they gave similar results and so provide independent confirmation. First, we searched for core motifs using weight matrix hits in promoters of genes selected using Hg calculated from the GNF-GEA data. Second, we searched for core motif consensus sites in promoters of genes selected using Qg|t calculated from EST data.

TATA Boxes are Associated with Tissue-Specific Genes

We grouped the human genes that expressed at least 200 au (average value) in the GNF-GEA data by entropy and start CpG island status. The number of genes in each category is shown in Table 5.5 along with a summary of results. We used alignments of position-specific scoring matrices and scoring thresholds included in the Eukaryotic Promoter Database [31] to identify the TATA box and initiator element. Matches to these motifs were preferentially located at the expected positions relative to the transcription start site, based on the ratio of the number of observed matches to the number expected using a set of random sequences with the same position-dependent base composition as each of the promoters. We searched for the TATA box in the (-45, -10) region, where the average observed/expected ratio for the TATA box was 3.1. As shown in Table 5.5, the most specific CGI– genes were 6 times more likely to have a TATA box than the least specific CGI+ genes (117/215 (54%) versus 183/2072 (9%), P ≈ 0, exact binomial). Similar numbers are found in mouse (52%/11% = 4.7). This trend also holds within CGI– genes and within CGI+ genes. The most specific CGI– genes were 3 times more likely to have a TATA box than the least specific CGI– genes (117/215 versus 110/607, P ≈ 0, exact binomial). While less common in CGI+ genes, TATA boxes were still almost 4 times as likely to be found in the most specific CGI+ genes as in the least specific CGI+ genes (19/56 versus 183/2072, P = 2 × 10^-7, exact binomial). Thus TATA boxes are clearly associated with tissue-specific genes and provide a second axis (with CpG islands) for distinguishing between the most and least specific genes. By contrast, the frequency of occurrence of the initiator element (Pol II binding site) was roughly constant across all tissue-specificity classes for both CGI+ and CGI– genes. We searched for the initiator element in the (-10, +10) region. It occurred in 762 of 1118 (68%) of CGI– genes and 1273 of 2434 (52%) of CGI+ genes.
Similarly, it occurred in 149 of 215 (69%) of the most specific CGI– genes and 388 of 607 (64%) of the least specific CGI– genes. The observed frequency of TATA+/Inr+ promoters was not significantly different from the rate expected assuming independence of the two individual features (data not shown).

CGI   TATA   Total  Fraction  H 0-3 Most Specific  H 3-4 Semi-Specific  H 4-5 Least Specific
                              N     Frac  Ratio    N     Frac  Ratio    N     Frac  Ratio
             3552   1.00      271   0.08           602   0.17           2679  0.75
CGI+         2434   0.69      56    0.02  0.30     306   0.13  0.74     2072  0.85  1.13
CGI–         1118   0.31      215   0.19  2.52     296   0.26  1.56     607   0.54  0.72
      TATA+  604    0.17      136   0.23  2.95     175   0.29  1.71     293   0.49  0.64
      TATA-  2949   0.83      135   0.05  0.60     427   0.14  0.85     2387  0.81  1.07
CGI+  TATA+  284    0.08      19    0.07  0.88     82    0.29  1.70     183   0.64  0.85
CGI–  TATA+  320    0.09      117   0.37  4.79     93    0.29  1.71     110   0.34  0.46
CGI+  TATA-  2150   0.61      37    0.02  0.23     224   0.10  0.61     1889  0.88  1.16
CGI–  TATA-  798    0.22      98    0.12  1.61     203   0.25  1.50     497   0.62  0.83

Table 5.5: The most significant indicators of the degree of tissue-specificity: start CpG island and TATA box. The two leftmost columns indicate the combination of features considered; empty cells indicate that the feature is not considered. The Total and Fraction columns indicate the number of promoters with each feature combination and the corresponding fraction of all genes considered. The three rightmost column groups indicate the number (N), fraction (Frac), and enrichment ratio (Ratio) of matching genes in three bands of tissue specificity. The enrichment ratio is the fraction of promoters with the feature combination that fall in the entropy band, divided by the band's fraction among all genes considered. For example, specific genes are best recognized by a combination of a TATA box and the lack of a CpG island, which enriches the fraction of such genes from 8% to 37%, a factor of 4.79.

Sp1 Binding Sites Are Weakly Associated with the Least Tissue-Specific Genes

Sp1 [50, 136] is a ubiquitous transcription factor with a G-rich binding site with consensus sequence GGGCGGG that might explain the observed G-richness of the 5' UTR in non-specific genes. We used the GC-box weight matrix and scoring threshold from EPD [31] to identify Sp1 sites. We found that Sp1 sites are preferentially located in the (-150, +1) region in all sets of genes, where they occurred on average at twice the expected rate, in agreement with previous findings [31]. In both human and mouse, Sp1 sites were rarely found in the 5' UTR despite the G-richness of this region; they occurred at the expected rate of between 2 and 5%. Thus Sp1 sites were not the cause of the G-richness in the 5' UTR. Sp1 sites are associated with CpG islands but are an important component of CGI– promoters as well. Considering just the (-150, +1) region, Sp1 sites occurred in 1105/2434 (45%) of human CGI+ gene promoters and 316/1118 (28%) of CGI– gene promoters, at about 2.5 to 3.0 times the expected frequency in both cases. Frequencies in mouse are 927/2075 (45%) of CGI+ promoters and 464/1652 (28%) of CGI– promoters. Sp1 sites were also weakly associated with the least specific genes, occurring in 1105/2679 (41%) of these genes as compared to 94/271 (32%) of the most tissue-specific genes (P = 0.016). Similar numbers are found in the mouse; 38% of the least specific and 26% of the most specific promoters have Sp1 sites. Thus, although Sp1 shows a preference for the least tissue-specific promoters, it is not a strong predictor of the tissue-specificity of a gene.
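The enrichment ratios reported in Table 5.5 can be reproduced from the raw counts. The sketch below assumes the ratio is computed from unrounded fractions, which appears to be the case since 0.37/0.08 alone would give about 4.6 rather than 4.79; the function name is illustrative.

```python
def enrichment_ratio(n_feature_in_band, n_feature_total, n_band, n_all):
    """Fraction of feature-bearing promoters that fall in an entropy band,
    divided by the band's share of all genes considered."""
    fraction_in_band = n_feature_in_band / n_feature_total
    band_fraction = n_band / n_all
    return fraction_in_band / band_fraction

# CGI-/TATA+ promoters in the most specific band (H 0-3):
# 117 of 320 such promoters, against 271 of 3552 genes overall.
print(round(enrichment_ratio(117, 320, 271, 3552), 2))    # 4.79

# CGI+ promoters in the least specific band (H 4-5):
# 2072 of 2434 such promoters, against 2679 of 3552 genes overall.
print(round(enrichment_ratio(2072, 2434, 2679, 3552), 2)) # 1.13
```

A ratio above 1 means the feature combination is over-represented in the band relative to the band's overall share of genes, and a ratio below 1 means it is depleted.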

YY1 Binding Sites are Associated with Low-Specificity Genes

The transcription factor YY1 [204, 198, 175, 174] is also ubiquitously expressed and is thought to bind close to [131] and downstream of the TSS. There is evidence [149] that the function of YY1 depends on its orientation. The location and G-richness of the reverse complement consensus sequence (AANATGGCG) make YY1 a candidate for explaining the prominent p(G) > p(C) feature in the (+1, +200) region of low specificity genes. We consider YY1 because a YY1-like motif was frequently included among the most statistically significant motifs identified by the motif discovery programs AlignACE [146] and MEME [10] in the (+1, +60) region of non-specific CGI+ promoters; see Figure 5.6 for a sequence logo of the motif. Our form is most similar to the activating form [205], which may be associated with low-specificity genes. Because of the demonstrated functional sensitivity to the orientation of binding sites, we considered each orientation separately. Indeed, as shown in Figure 5.7 (b), we found that each orientation exhibits different position preferences. Sites in the reverse orientation (YY1r) were preferentially located in the (+1, +25) region but with some elevated levels out to +80 bp. Start positions of sites in the forward orientation (YY1f) showed a very sharp preference for -3 bp, which probably represents a YY1-like initiator sequence reviewed elsewhere [207]. Both orientations were found predominantly in the least specific genes (Table 5.6). YY1f initiator sites are rare; only 55/2679 (2%) were found above background in human low-specificity genes. The rate in mouse, 22/2832 (0.8%) of low-specificity promoters, is even lower. The YY1r sites are more common and were found above background in 217 (8%) of the 2679 least specific genes. YY1r sites were more common in CGI+ genes than in CGI– genes (202/2072 (10%) versus 15/607 (2%), P = 3.7 × 10^-9, two-population binomial). The corresponding rates in mouse confirm these observations; 178/2832 (6%) for all low-specificity genes, and 152/1779 (9%) of CGI+ and 26/1053 (2%) of CGI– low specificity promoters. These YY1-like sites therefore constitute a feature strongly associated with the least specific genes and may partially explain the observed p(G)/p(C) ratio in the (+1, +200) region.
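A two-population comparison of proportions such as the YY1r comparison above can be sketched with a pooled normal approximation. This is an assumed form of the test: the text names it only as a "two-population binomial" test, so the pooled z statistic and the one-sided tail used here are illustrative choices, and the function name is hypothetical.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic and its one-sided upper-tail P-value."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper tail of the standard normal via the complementary error function.
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_one_sided

# YY1r sites in least specific human genes: CGI+ (202/2072) versus CGI- (15/607).
z, p = two_proportion_z(202, 2072, 15, 607)
print(f"z = {z:.2f}, one-sided P = {p:.1e}")
```

Under these assumptions the one-sided value comes out on the order of 10^-9, the same order as the P reported above.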

CGI   YY1    Total  Fraction  H 0-3 Most Specific  H 3-4 Semi-Specific  H 4-5 Least Specific
                              N     Frac  Ratio    N     Frac  Ratio    N     Frac  Ratio
             3552   1.00      271   0.08           602   0.17           2679  0.75
CGI+         2434   0.69      56    0.02  0.30     306   0.13  0.74     2072  0.85  1.13
CGI–         1118   0.31      215   0.19  2.52     296   0.26  1.56     607   0.54  0.72
      YY1+   293    0.08      1     0.00  0.04     16    0.05  0.32     276   0.94  1.25
CGI+  YY1+   261    0.07      1     0.00  0.05     10    0.04  0.23     250   0.96  1.27
CGI+  YY1-   2173   0.61      55    0.03  0.33     296   0.14  0.80     1822  0.84  1.11
CGI–  YY1-   1086   0.31      215   0.20  2.59     290   0.27  1.58     581   0.53  0.71
CGI–  YY1+   32     0.01      0     0.00  0.00     6     0.19  1.11     26    0.81  1.08

Table 5.6: Non-specific genes are most specifically recognized by the combination of a CpG island and a YY1 site, which returns a set that is 96% non-specific genes but matches only 7%/75% ≈ 10% of the non-specific genes.
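The two figures in the caption of Table 5.6 correspond to what would now usually be called precision and recall. A minimal sketch using the CGI+/YY1+ row follows; the terminology and the function name are not from the text and are used here only for illustration.

```python
def precision_recall(hits_in_class, hits_total, class_total):
    """Precision: share of feature-bearing genes that are in the class.
    Recall: share of the class that carries the feature."""
    return hits_in_class / hits_total, hits_in_class / class_total

# CGI+/YY1+ promoters: 261 in total, 250 of them among the 2679 least specific genes.
precision, recall = precision_recall(250, 261, 2679)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # precision = 0.96, recall = 0.09
```

The recall computed from raw counts (about 9%) agrees with the caption's estimate of roughly 10%, which is derived from the rounded fractions 7% and 75%.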

Figure 5.6: A new YY1 motif found downstream of the TSS. A logo [183] representation of the YY1 motif found in the (+10, +20) region of human CGI+ promoters using AlignACE. It is based on 102 sequences. Additional logos show the weight matrices for YY1 in TRANSFAC v7.3.

[Figure 5.7 image: x-axis, Position [bp] (-50 to 90); y-axis, Number of Genes (0 to 80); series, Reverse and Forward.]

Figure 5.7: YY1 motifs are found at and downstream of the TSS depending on their orientation. YY1 sites were predicted using a weight matrix generated using AlignACE. YY1 sites are almost three times (P ≤ 2 × 10^-7) as common in non-specific CGI+ genes (11%; N = 2072) as in non-specific CGI– genes (4%; N = 607) and occur at more than 10 times the expected rate. Similar trends are observed in genes with 3 ≤ Hg ≤ 4, though with lower absolute and relative rates. The difference between CGI+ and CGI– genes is not statistically significant for genes in the 3 ≤ Hg ≤ 4 bin. Essentially no YY1 sites were observed in specific genes with Hg ≤ 3 bits, whether or not they had a CpG island.

Q-Based Analysis of Core Promoter Motifs

A second analysis of TATA box and Inr motifs was done to determine whether the association of the TATA box with tissue-specific genes is also found in genes ranked by Q and is robust to using EST data as well as promoters that did not specifically rely on full-length cDNA clones. The definition of Qg|t implies that a gene with a particular Qg|t value can have an Hg drawn from a range of possible values, and thus it may be more difficult to identify features related to tissue-specificity. We tabulated all DoTS genes that contained at least two ESTs from an islet cell library, then ranked the genes by Qg|pancreas computed using EST counts. We used Qg|pancreas ≤ 7 bits as the criterion for selecting pancreas-specific genes, which we grouped into 2-bit Q intervals. For comparison we selected 50 genes with Qg|pancreas = 8.5 bits, and 50 genes with 10 ≤ Qg|pancreas ≤ 10.6 bits. Genes with high specificity for the pancreas (0 ≤ Qg|pancreas ≤ 2 bits, N = 9) preferentially had TATA boxes (8 of 9), with half of these also having an initiator element (4 of 9; see Figure 5.8 (a)). With decreasing specificity, the fraction of genes containing TATA boxes drops, with only 18 of 81 (2/9) genes with Q > 6 bits having TATA boxes. Thus, the strong correlation of TATA boxes with specific genes found with Hg and microarray data was also seen with Q and EST data for pancreas-expressed genes. Also consistent is the observation that initiator elements were found at similar frequencies (about 60%) across all specificity classes (Figure 5.8 (b)). Similar patterns were observed in other tissues (data not shown).

[Figure 5.8 image: panel (a), x-axis Q(pancreas) [bits] with bins 0-2, 2-4, 4-6, >6; y-axis Genes (0 to 80); series TATA-/Inr-, TATA-/Inr+, TATA+/Inr+, TATA+/Inr-. Panel (b), x-axis Q(pancreas) [bits] with bins 0-2, 2-4, 4-6, 6-8, 8-10, >10; y-axis Genes (0.0 to 1.0); series human and mouse.]

Figure 5.8: The distribution of TATA box and initiator element in pancreas specific genes. 160 pancreas genes were divided into bins according to their Q-value. Shown are (a) the genes that have a TATA box, an initiator with the motif YYANWYY, both, or neither of these two motifs, expressed in absolute numbers, and (b) the number of TATA boxes found in orthologous human and mouse gene pairs.

The consistency of findings for the TATA box with human islet genes based on Q and ESTs was next tested with orthologous genes in mouse. This test provides a measure of whether the global pattern observed (TATA box with tissue-specific genes) is also found for the same set of genes in another mammal. We also added bins of genes with higher Q-values that represent more widely expressed genes. For each human gene, the orthologous mouse gene was determined (see Methods for details) and analyzed as described above. Overall, 18.8% of the human genes and 22.9% of the mouse genes that were analyzed carry the TATA box motif. Except for the last group (Qg|t > 10 bits), the percentage of genes with TATA box motifs decreases as the Q-value increases. The exception is to be expected, since genes with high Q may be specific to other tissues and hence are more likely to have a TATA box. Discrepancies between human and mouse promoters were noted for only about 10% of all human-mouse pairs analyzed and may reflect sequence differences and possible annotation discrepancies for the transcription start site. Nevertheless, there is overall excellent agreement for the presence of TATA motifs in human and mouse genes. Thus, our assessment of the preferential presence of transcription regulatory motifs in the human pancreas-expressed genes also applies to their mouse orthologs. We conclude that genes expressed with a restricted tissue distribution may be preferentially regulated via TATA-mediated transcription, and that genes with broader expression profiles are more likely to be regulated by non-TATA-mediated mechanisms (such as YY1).
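A consensus-site search of the kind used in the Q-based analysis can be sketched as follows. The initiator consensus YYANWYY and the search regions (-10, +10) and (-45, -10) are taken from the text. The TATA consensus "TATAAA", the coordinate convention (offset 0 is the TSS base), the function names, and the toy promoter are assumptions made for illustration only.

```python
import re

# IUPAC nucleotide codes needed for the consensus strings used here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "W": "[AT]", "S": "[CG]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regular expression."""
    return re.compile("".join(IUPAC[c] for c in consensus.upper()))

def find_sites(promoter, tss_index, consensus, region):
    """Return TSS-relative start offsets of matches lying entirely inside region.

    Assumed convention: offset 0 is the base at tss_index; region is inclusive.
    """
    pattern = consensus_to_regex(consensus)
    k = len(consensus)
    lo, hi = region
    sites = []
    for offset in range(lo, hi - k + 2):
        i = tss_index + offset
        if i < 0 or i + k > len(promoter):
            continue  # window falls off the available sequence
        if pattern.fullmatch(promoter[i:i + k]):
            sites.append(offset)
    return sites

# Toy promoter: a TATA-like element 30 bp upstream of an initiator-like element.
promoter = "CGCGCGCGCG" + "TATAAA" + "G" * 22 + "TCAGTCT" + "GCGCGC"
tss = 40  # index of the A within TCAGTCT

print(find_sites(promoter, tss, "YYANWYY", (-10, 10)))  # [-2]
print(find_sites(promoter, tss, "TATAAA", (-45, -10)))  # [-30]
```

Testing every window explicitly, rather than using a single regular-expression scan, reports overlapping matches and keeps the region bounds easy to audit.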

5.10 Promoter Classes

Since the presence or absence of a start CpG island and a TATA box appear to be the primary sequence features that correlate with tissue specificity, we consider them in more detail. We observe that CpG islands and TATA boxes are not mutually exclusive features of promoters, and so we consider all possible combinations of these features.

Frequency of Promoter Classes

Figure 5.9 shows the cumulative fraction of each class of promoter as a function of increasing Hg in human (a) and mouse (b). The data from human and mouse follow similar trends even though the mouse has a lower proportion of CGI+ genes. Overall, CGI+/TATA- genes are the most common at 50 to 60% depending on the species. Interestingly, the CGI–/TATA- class is the second most common overall, comprising 20 to 30% of genes depending on the species. Genes in this promoter class are roughly equally common across the entire entropy range and are the most common promoters in the mid-specificity range in both species. The classes CGI–/TATA+ and CGI+/TATA+ are the least common (8 to 12% overall). CGI–/TATA+ genes are concentrated in the most specific genes. CGI+/TATA+ genes are found relatively uniformly across all but the most specific genes. Although the TATA box and CpG islands are strongly predictive of a gene's entropy, Figure 5.9 also illustrates the limitations of the promoter classes as an explanation for expression patterns. First, although the CGI–/TATA+ and CGI+/TATA- classes are strongly associated with the most and least tissue-specific genes (respectively), instances of genes in each class cover virtually the entire range of tissue-specificity. Second, the CGI–/TATA- class is the second most common, illustrating that any degree of tissue specificity can be obtained without these sequence features.
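The cumulative fractions plotted in Figure 5.9 can be sketched as follows. This is an assumed reading of the figure, based on its caption: for each entropy threshold h, the fraction of genes with Hg ≤ h that fall in each promoter class. The function name and the toy gene list are hypothetical.

```python
from collections import Counter

CLASSES = ("CGI-/TATA-", "CGI+/TATA-", "CGI-/TATA+", "CGI+/TATA+")

def cumulative_class_fractions(genes, thresholds):
    """For each threshold h, the share of genes with entropy <= h in each class.

    genes: iterable of (entropy_bits, promoter_class) pairs.
    """
    genes = list(genes)
    rows = []
    for h in thresholds:
        counts = Counter(cls for entropy, cls in genes if entropy <= h)
        total = sum(counts.values())
        rows.append((h, {c: (counts[c] / total if total else 0.0) for c in CLASSES}))
    return rows

# Toy data only: eight genes with made-up entropies and promoter classes.
toy_genes = [(1.0, "CGI-/TATA+"), (2.0, "CGI-/TATA+"), (2.4, "CGI-/TATA-"),
             (2.5, "CGI+/TATA-"), (4.5, "CGI+/TATA-"), (4.6, "CGI+/TATA-"),
             (4.7, "CGI-/TATA-"), (4.7, "CGI+/TATA+")]

for h, fractions in cumulative_class_fractions(toy_genes, [2.5, 4.7]):
    print(h, fractions)
```

With few genes below a low threshold, each gene shifts the fractions by a large step, which is one way to see why the curves are noisy at low entropy.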

Functional Assessment of Promoter Classes Using Gene Ontology Terms

To try to understand the functional correlates of the four promoter classes, we looked for trends in the cellular localization and biological process of the products of genes from each promoter class. We used the DAVID system [103, 55], which identifies overrepresented Gene Ontology (GO) [96] terms in a set of genes. A summary of the results for human and mouse genes is shown in Table 5.7. In each case the set of genes in each promoter class was compared to all genes on the corresponding Affymetrix chip.

Products of genes in the CGI–/TATA+ class were often (70/198) located extracellularly. Examples of such genes are the insulin-like growth factor family, serum albumin, and chymotrypsin. Some extracellular CGI–/TATA+ genes, such as luteinizing hormone beta (Lhb) and bone morphogenetic protein 10 (Bmp10) in the mouse, have a high Hg because they are not induced in the tissues or at the developmental stages surveyed, but otherwise fit the pattern of secreted proteins. Gene products that are secreted from the cell must be produced at a high level to be effective. Indeed, we found that the maximum expression level of TATA+ genes is higher than that of TATA- genes; 454/745 (61%) of TATA+ genes express at least 1000 au in one or more tissues, whereas only 1321/3773 (35%) of TATA- genes express that highly (P ≈ 0; two-sample binomial population). A second common group of CGI–/TATA+ genes, with a p-value just over the cutoff, consists of the muscle contraction-related genes: actin, troponin, and members of the myosin family. Products of these genes are also required in large amounts to create the contractile apparatus but are only produced in a few cell types. The biological processes that are enriched in the CGI–/TATA+ class differ between human and mouse, but nearly all of them are descendants of the GO term 'response to stimulus' (GO:0050896).

The CGI+/TATA- promoters produce proteins that are typically located in the cell, especially in the cytoplasm and mitochondrion. These locations are consistent with many housekeeping functions. The human results for biological process suggest a large number of housekeeping processes, but these were not confirmed in the mouse using all CGI+/TATA- genes. When we consider just the least specific CGI+/TATA- mouse genes (4.45 ≤ Hg ≤ 5.57 bits), we find cellular locations (including the nucleus) and biological processes that match the human results.

[Figure 5.9 image: panels (a) Human and (b) Mouse; x-axis, H [bits]; y-axis, P [CDF] (0.0 to 1.0); curves, CGI–/TATA–, CGI+/TATA–, CGI–/TATA+, CGI+/TATA+.]

Figure 5.9: The cumulative distribution of promoter classes as a function of entropy is similar in human and mouse. The cumulative fractions of genes with all possible combinations of CGI and TATA box features for human (a) and mouse (b) are shown as a function of entropy Hg as computed from GNF-GEA data. For example, in human about 50% of the genes with Hg ≤ 2.5 have a CGI–/TATA+ promoter. The gray bars indicate the entropy range that is not significantly different from uniform ubiquitous expression. Curves are compiled from genes that express above 200 au in at least one tissue. As expected, CGI+/TATA- genes are most common in less specific genes and CGI–/TATA+ genes are most common in tissue-specific genes. CGI–/TATA- genes are very common and are found nearly uniformly at every level of specificity. Furthermore, CGI+/TATA- and CGI–/TATA+ genes are both common in mid-specificity (3 ≤ Hg ≤ 4) genes, showing that specificity is not determined by these features alone. The trends in human and mouse data are nearly identical despite the lower rate of CpG islands in mouse. The large variations in the graph at low entropy are due to the noise inherent in the small number of genes in this range.

No significant concentrations of cellular locations or biological processes were found among the CGI+/TATA+ genes. A manual examination of genes in both human and mouse identifies a number of heat shock proteins, histones, and ribosomal proteins, though these are not statistically significant after the multiple testing correction. Many of these genes fit the expected expression pattern in that they are widely expressed and at high levels.

Interestingly, the products of CGI–/TATA- genes are often (244/499 of human genes with a cellular location) located in the plasma membrane and support signaling and response to the environment. Such products, e.g., bradykinin receptor B2, prolactin receptor, or protocadherin 9, may be expressed in a tissue-specific pattern, but not at the high levels required for secreted proteins. The exact biological process GO terms that are statistically significant vary between mouse and human, but a common core includes defense response (GO:0006952), immune response (GO:0006955), and response to stimulus (GO:0050896). Thus these genes are similar to CGI–/TATA+ genes in that they are involved in response, but are not (typically) required to be expressed at such high levels.

5.11 Discussion

We have applied Shannon entropy as a novel measure of overall tissue specificity of gene expression and have created a new statistic Q to assess the categorical specificity of a gene for a particular tissue. We have evaluated the performance of entropy on microarray-and EST-based estimates of tissue-specific expression and found that it correctly identifies both tissue-specific and housekeeping genes. Ranking and binning genes by entropy allowed us to begin to deconstruct core promoters into features directing when and where the gene will be expressed. We verified and extended previous observations [23] about the correlation of CpG islands with housekeeping genes and embryonic genes. We then identified differences in the base composition profile of promoters of tissue-specific and non-specific genes. Next, we identified correlations between, on the one hand, the TATA box and tissue-specific genes, and on the other hand, the YY1 site and non-specific genes. Finally, we then identified trends in promoter classes based on CPG island and TATA box status and associated them with common cellular locations and biological processes. Similar observations were also observed for TATA box and Q-selected genes in pancreas. Thus entropy Hg and Q have allowed us to discover fundamental properties of mammalian Pol II promoters and should allow serve to aid understanding of expression in particular tissues of interest. The validity of our approach is supported by findings in other work. Our finding that most genes are regulated in a tissue-dependent manner is consistent with another analysis of gene expression 96

Promoter Class

Cellular Component

CGI–/TATA+

CGI+/TATA-

CGI–/TATA-

Human Only

Mouse Only

extracellular, extracellular space

microsome, vesicular fraction

intermediate toskeleton)

response to stimulus

organismal physiological process

inflammatory response, innate immune response, cell motility, defense response, response to pest/pathogen/parasite, response to wounding, response to biotic stimulus, cell-cell signaling, morphogenesis, digestion, muscle contraction, chemotaxis, taxis, response to chemical substance, response to abiotic stimulus, muscle development

cell, cytoplasm, intracellular, mitochondrion

nucleus, ribonucleoprotein complex

nucleobase, nucleoside, cleotide and nucleic metabolism

intracellular transport

metabolism, protein transport, intracellular protein transport, RNA processing, RNA metabolism, cell cycle, mitotic cell cycle

Biological Process

(integral to) (plasma) membrane organismal physiological process, defense response, immune response, response to biotic stimulus, response to stimulus, response to external stimulus

extracellular, space response to pest/pathogen/parasite, cell communication, response to wounding, cellular defense response, signal transduction

filament

(cy-

nuacid

extracellular

complement activation,complement activation (classical pathway), humoral defense mechanism (sensu Vertebrata), humoral immune response

Table 5.7: Over-represented Gene Ontology (GO) terms for cellular component and biological process of genes by promoter class. All terms were selected using a p-value ≤ 0.05 (corrected for multiple testing). Terms common to human and mouse are listed in the left column. The two columns on the right indicate any additional terms found in only one species. The CGI–/TATA+ terms are consistent with a model of strong condition-specific induction, CGI+/TATA- terms are consistent with housekeeping functions. CGI–/TATA- terms indicate support for cell sensing and communication functions. No significant results were found for CGI+/TATA+ genes.


[104] which found that housekeeping genes cluster in a tissue-specific manner. Thus, it appears, even the most basic biological functions are subject to regulation. The tissue trees we produced contain relationships similar to those in an analysis [242] of mid-specificity genes, including the close relation between lung and the immune system-related organs spleen and thymus. That this analysis is based on a different method and a different set of expression data gives us confidence that Qg|t is properly identifying genes that are specific to a tissue.

Our analysis focused on only a few sequence features and although we found good correlations, two aspects of our results indicate that there are other regulatory mechanisms not yet identified. First, there is a gradual transition in the frequency of the TATA box and CpG islands between the most and least tissue-specific genes. Second, while these features are strong indicators of high and low specificity, they are far from perfect predictors. Indeed, the middle range of entropies contains a mix of all promoter classes in large numbers, indicating that it is possible to achieve tissue-specific expression with any promoter class. YY1 may be an example of such a supplementary mechanism. While occurring in only 16% of genes, it is very strictly confined to low-specificity genes and is a better indicator of low specificity than CpG islands. We expect that other such signals will be found.

Anatomical resolution is an issue with the data sets used in this study. For example, the pancreas consists of exocrine cells, ductal cells, and islet cells of several types. The bulk pancreas was used to generate the GNF-GEA data, so the reported expression level is the average mRNA concentration weighted by the cell-type count. This approximation reduces the maximum possible entropy and, more significantly, can make the apparent entropy different from the true entropy. Genes highly and specifically expressed in a cell type with a small population may currently appear to be ubiquitous with very low overall expression. Genes expressed in a few tissues may be revealed to be less tissue specific as more cell types can be measured in detail. Genes that appear to be ubiquitously expressed may turn out not to be expressed in a few cell types. It will be interesting to see whether data with higher anatomical resolution will help to increase the accuracy of the rules we have identified here for identifying tissue-specific and non-specific promoters. It should be emphasized that the limitation is not the measure or approach used but rather the data sets available.

Our method can also be applied to other sources of expression data including SAGE, RT-PCR, and in situ hybridization data. SAGE has the advantage of sensitivity as these studies generally sequence to much greater depths than EST libraries [27]. In situ hybridization data may increase the anatomical resolution of the data. Qualitative intensities, e.g., o, +, or +++, can be converted to representative numeric values as appropriate. Our method can also be applied to other collections of conditions besides normal tissues, e.g., different types of cancers, or samples of the same cancer from multiple patients. Modification of our method to account for temporal changes in tissue specificity represents another direction for future work.

The analysis presented here focuses on genes rather than on transcripts generated from different promoters of the same gene. The rate of occurrence of alternative transcription start sites is at least 9% [245] and may be as high as 25% [244]. The promoters we used were specified by the DBTSS data set but there may be alternative promoters with different characteristics and tissue-specific usage patterns. This is not a limitation of entropy or Q; it reflects our decision to first investigate tissue-specificity at the gene level, but analysis based on different RNA species can easily be incorporated into our approach.

Our results for CpG island frequency in very tissue-specific genes are lower than recent reports [165] that were based upon present/absent calls, i.e., tissue counting, using ESTs to measure tissue-specificity. There are two likely reasons. First, as we described in the Results, a significant fraction of genes will show no evidence of expression in poorly sampled tissues. A poorly-sampled non-specific gene will therefore appear more tissue specific than it actually is, and this increases the number of apparently tissue-specific genes with CpG islands. Second, when we use microarray data and determine tissue-specificity by counting tissues expressing above the median value of 200 au, we see (data not shown) rates of CpG island occurrence in 'specific' genes similar to those reported in [165]. Thus we conclude that including the variation of expression levels rather than mere presence/absence is important for identifying very tissue-specific genes as assessed by start CpG islands.

These results present an initial look at the correlation between tissue specificity, CpG islands, and binding sites for selected transcription factors that interact with the basal transcription apparatus.
Using a novel approach with entropy-based metrics, we have begun to lay out the framework for promoter function by identifying strong correlations between tissue-specific or ubiquitous expression and a number of these sequence features. We plan to extend this work in several ways. First, we plan to identify correlations with other known TFBS’s and novel motifs identified as over-represented in promoter regions [142]. Second, these results will help to understand regulation by upstream transcription factors in genes specific to particular tissues or clusters of tissues.

5.12 Conclusions

We have used Shannon entropy to quantify and rank the tissue-specificity of genes using tissue survey data. This has allowed us to assess first the prevalence of tissue-specific regulation; we find that most genes show evidence of some degree of tissue-dependent variation in expression levels. It has also allowed us to find and evaluate associations between promoter features and tissue specificity. We have verified and extended understanding of known associations between, on the one hand, CpG islands and the least tissue-specific genes and, on the other hand, the TATA box and the most tissue-specific genes. However, these features are not the sole determinants of tissue-specific expression, as indicated by mid-specificity genes which exhibit a mix of all promoter classes. The class of CGI–/TATA- promoters has emerged as the second most common class of promoter overall and the most common promoter class in mid-specificity genes. Therefore additional determinants of tissue-specificity remain to be found. We have identified one potential determinant, a downstream YY1 site, which is very strongly associated with the least tissue-specific genes but is a relatively rare feature of these promoters. Finally, we have also been able to identify trends in the localization and function of the protein products of genes according to their promoter class. Many of the CGI–/TATA+ genes code for highly expressed, very tissue-specific, extracellular proteins involved in a cell's response to the environment. CGI–/TATA- genes are also involved in response to the environment, but are found more uniformly across the spectrum of tissue-specificity, are not as highly expressed as CGI–/TATA+ genes, and very often code for membrane-bound proteins. The products of CGI+/TATA- genes are more likely to be located in the cytoplasm or nucleus and, as expected, carry out housekeeping functions. All of the results we report are found in both human and mouse and so may reflect general principles of all mammalian species.

5.13 Materials and Methods

Processing GNF-GEA and DoTS Expression Data: The GNF-GEA data are processed as described [216]. Given a set of N tissues we define $p_{t|g} = w_{g,t} / \sum_{1 \le i \le N} w_{g,i}$, where $w_{g,t}$ is the expression level of the gene g in tissue t. DoTS, available through the AllGenes [42] site, contains ESTs and mRNAs assembled into transcripts that are then clustered into genes. We did not consider any transcript that contains only one EST since this may represent a spurious sequence, and did not consider any gene with fewer than five ESTs because they provide a poor estimate of $H_g$. To accommodate the great disparity in sampling depth across tissues we normalized EST counts by tissue. To avoid artificially low entropies for genes that contain relatively few ESTs we used pseudocounts to smooth the data. The expression level of a gene in a tissue is computed as $w_{g,t} = (n_{g,t} + 1)/(N_t + N_g)$, where $n_{g,t}$ is the number of ESTs from libraries for a tissue included in a gene, $N_t$ is the total number of ESTs from a tissue assembled into genes, and $N_g$ is the number of genes.

We used different sets of tissues depending on the task. Hg and Q measures in Figure 1 used the full GNF-GEA mouse set with a few modifications: adipose tissue and brown fat were merged, epidermis and snout epidermis were merged, and digits and tongue were not considered since they are both a combination of skeletal muscle and epidermis. The expression level for a set of merged tissues is the median of the individual tissue replicate medians. For comparison of microarray and EST data we used a set of 27 tissues that were common to both data sets and merged the central nervous system and peripheral nervous system tissues.
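The normalization and smoothing just described can be illustrated with a short sketch. It assumes the standard Shannon entropy in bits, $H_g = -\sum_t p_{t|g} \log_2 p_{t|g}$; the function names are hypothetical and the counts in the usage lines are invented for illustration only.

```python
import math

def est_expression(est_counts, tissue_totals, n_genes):
    """Pseudocount-smoothed EST expression for one gene:
    w_{g,t} = (n_{g,t} + 1) / (N_t + N_g)."""
    return [(n + 1) / (total + n_genes)
            for n, total in zip(est_counts, tissue_totals)]

def relative_expression(levels):
    """p_{t|g}: expression levels of one gene normalized to sum to 1."""
    total = sum(levels)
    return [w / total for w in levels]

def shannon_entropy(probs):
    """Entropy in bits; zero-probability tissues contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical gene observed in three tissues (illustrative counts only).
w = est_expression([12, 0, 3], [50000, 20000, 80000], n_genes=30000)
h_g = shannon_entropy(relative_expression(w))
```

A gene expressed equally in all N tissues reaches the maximum of log2(N) bits, and a gene confined to one tissue has an entropy of zero, which is why the pseudocount matters for genes with few ESTs.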

Estimating Variance: To estimate the variance in H and Q, we took advantage of tissue replicates in the GNF-GEA data. Using the mouse data set, we repeatedly sampled one of the measurements from each pair of replicates and computed H for each gene. We then computed the variance of the distribution of the estimates of H for each gene and show the survivor distribution function in Figure 2. The variance of Q was computed in a similar manner.
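The replicate-resampling procedure can be sketched as follows. This is a minimal illustration, not the code used in the study: the number of resamples, the use of the population variance, and the function names are all assumptions.

```python
import math
import random
import statistics

def entropy_bits(levels):
    """Shannon entropy (bits) of expression levels normalized to sum to 1."""
    total = sum(levels)
    probs = [w / total for w in levels if w > 0]
    return -sum(p * math.log2(p) for p in probs)

def entropy_variance(replicates, n_samples=1000, seed=0):
    """Variance of H for one gene under replicate resampling.

    replicates: one tuple of replicate measurements per tissue.
    Each resample picks one replicate per tissue and recomputes H.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = [
        entropy_bits([rng.choice(reps) for reps in replicates])
        for _ in range(n_samples)
    ]
    return statistics.pvariance(estimates)

# Hypothetical gene with replicate pairs in three tissues.
var_h = entropy_variance([(210.0, 250.0), (40.0, 55.0), (900.0, 870.0)])
```

Genes whose replicates agree closely yield a variance near zero, so the survivor function of these variances summarizes how stable H is across the data set.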

Clustering Tissues: Clustering was based on the Q scores for the set of mouse genes with Qg|t ≤ 7 for at least one tissue and expressing at least 200 au in at least one tissue in the GNF-GEA data. There were 1786 Affymetrix probe sets selected. The tree in Figure 5.3 was built by sampling 5000 sets of 1000 probe sets and clustering tissues using Pearson correlation and a centered measure using the XCLUSTER [203] program. The consensus tree was built using the program CONSENSE in the PHYLIP [68] package with the default parameters.

Identifying Genes Specific to a Set of Tissues: The total entropy of all tissues under a node can be computed at each node in the hierarchy using a generalization of the grouping theorem [8]. If the entropy of a gene at a node is close to the maximum possible entropy for the number of tissues under the node, then we select it and compute a Qg,n for the gene at the node. Using Qg,n we can rank genes by specificity to a cluster of tissues just as we can for an individual tissue.
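One way to read the grouping-theorem step is as a recursion over the tissue tree: the entropy at a node equals the entropy of the split of expression among its children plus the mass-weighted entropies of the children. The sketch below assumes that reading; the nested-list tree encoding and the function names are hypothetical, and the Q computation at the node is not shown.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def node_entropy(node, p):
    """Return (mass, entropy) of the tissues under a tree node.

    node: a tissue name (leaf) or a list of child nodes.
    p:    dict mapping tissue name to p_{t|g}.
    The entropy is that of the distribution renormalized under the node.
    """
    if isinstance(node, str):
        return p[node], 0.0
    children = [node_entropy(child, p) for child in node]
    mass = sum(m for m, _ in children)
    if mass == 0:
        return 0.0, 0.0
    between = entropy_bits([m / mass for m, _ in children])
    within = sum((m / mass) * h for m, h in children)
    return mass, between + within

# Hypothetical four-tissue profile and tree.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
tree = [["a", "b"], ["c", "d"]]
root_mass, root_h = node_entropy(tree, p)
sub_mass, sub_h = node_entropy(["c", "d"], p)
```

In this toy profile the node over tissues c and d has entropy 1 bit, the maximum for two tissues, so a gene with this profile would be a candidate for that cluster under the selection rule described above.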

Predicting CpG Islands: We predicted CpG islands using the program NEWCPGREPORT in the EMBOSS [173] package with the default parameters which require a minimum length of 200bp, C+G fraction of 0.6 and ratio of observed over expected CpG of 0.5.
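The three criteria can be illustrated on a single candidate region. This sketch is not the EMBOSS implementation: it does not scan windows along a sequence, it uses the common observed/expected definition (number of CpG dinucleotides times length, divided by the product of the C and G counts) as an assumption, and the default thresholds are simply the values quoted in the text.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    expected = c * g / n
    ratio = observed / expected if expected > 0 else 0.0
    return (c + g) / n, ratio

def meets_island_criteria(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Check one region against length, C+G, and obs/exp thresholds."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and ratio >= min_obs_exp

# Synthetic sequences for illustration only.
cg_rich = "CG" * 100
at_rich = "AT" * 100
```

Passing the thresholds as parameters keeps the sketch usable if a different parameter set is preferred.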

Statistical Significance in Embryonic-Expressed Genes: We computed statistical significance of differences between all embryonic-expressed genes and adult-specific rates using a hypergeometric distribution. We start with a collection of N CGI+ genes, ne of which are expressed in the embryo, i.e., marked as special. The NA tissue-specific genes in the adult are considered a random sample from the original N and we compute the probability of finding that at least (or at most) nae of these were expressed in the embryo.
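The hypergeometric tail probability used here can be written directly from binomial coefficients. The sketch below is a generic implementation with hypothetical function names; the argument order (population size, number of special items, sample size) is a choice made for this illustration and the numbers in the usage line are invented.

```python
from math import comb

def hypergeom_pmf(k, population, special, drawn):
    """P(exactly k special items in a sample of size `drawn`)."""
    return comb(special, k) * comb(population - special, drawn - k) / comb(population, drawn)

def hypergeom_tail(k, population, special, drawn, upper=True):
    """P(X >= k) if upper, else P(X <= k), for a hypergeometric X."""
    lowest = max(0, drawn - (population - special))
    highest = min(drawn, special)
    if upper:
        ks = range(max(k, lowest), highest + 1)
    else:
        ks = range(lowest, min(k, highest) + 1)
    return sum(hypergeom_pmf(i, population, special, drawn) for i in ks)

# Hypothetical numbers: N = 2000 CGI+ genes, n_e = 600 embryonic-expressed,
# N_A = 150 adult tissue-specific genes, n_ae = 30 of them embryonic-expressed.
p_at_most = hypergeom_tail(30, 2000, 600, 150, upper=False)
```

The same function covers the co-occurrence test described later in this section, with the promoters containing the first motif playing the role of the special items.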

Modeling Distribution of Entropy from Uniform Genes: To model the effect of experimental variability, we computed the distribution of the difference between expression levels of individual replicates for each gene and tissue and the mean expression level across replicates as a function of the mean expression level. This distribution was well fit by an exponential distribution with a parameter that depends on the mean expression level. Thus, given an ideal expression level, we can estimate what the experimental variability will be. To model a uniformly expressed gene, we assume that a gene has some average expression level across all tissues and then allow the expression levels in individual tissues to follow a narrow distribution of random fold changes from that level. Specifically, we assumed that the log base 2 of the fold changes is distributed according to a normal distribution with mean equal to 0 and a standard deviation (s). The standard deviation can be adjusted to control the amount of biological variation a uniformly expressed gene is allowed to show. For example, setting s = 0.5 means that about 68% of the fold changes between a particular tissue and the nominal level are within a factor of 1.4 up or down from the nominal level, i.e., a two-fold change from the lowest to the highest levels. Larger fold changes are expected to occur in 32% of tissues. This model allows significant variation and so is arguably close to the upper limit of variation allowable for a gene that shows no tissue specificity. We also used s = 0.25 as a more stringent definition of uniform expression. We sampled mean expression levels from the distribution of observed mean expression levels and sampled entropy values from the probability model. An entropy threshold was estimated by sampling approximately 5000 random expression profiles and determining the value for a p-value of 0.002. This process was repeated ten times and the corresponding thresholds and fraction of genes were computed.
The thresholds spanned a range of less than 0.01 bit. The tissue-dependent gene fractions never varied by more than one percentage point in either direction.

Statistical Significance of Co-occurrence: We estimated the statistical significance of the co-occurrence of motifs using the hypergeometric distribution. Given two motifs with occurrence counts n1 and n2, measured in the same set of N promoters, and a co-occurrence count of n12, we compute the significance as the probability of finding no more than (or at least) n12 hits in a random selection of n2 promoters from a pool of N promoters where n1 of them are special.

Comparison of Frequency on Independent Sets: Given two sets of size N1 and N2 and positive observations n1 and n2 in each, we computed the probability that the underlying rates are different using an exact calculation of the binomial distribution to compute the probability of seeing at least (or no more than) ni matches in Ni trials where the rate is assumed to be r = nj/Nj. We estimated r using the larger of the two sets.
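The comparison of frequencies on independent sets can be sketched as an exact binomial tail. The rate r is taken from the larger set, as stated above, and the smaller set is tested against it; the choice of tail (upper when the observed frequency is at least r) and the function names are assumptions of this sketch. Probabilities are computed in log space so that large set sizes do not overflow.

```python
import math

def binom_pmf(i, n, r):
    """P(exactly i successes in n trials at rate r), computed via log-gamma."""
    if r <= 0.0:
        return 1.0 if i == 0 else 0.0
    if r >= 1.0:
        return 1.0 if i == n else 0.0
    log_p = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
             + i * math.log(r) + (n - i) * math.log(1.0 - r))
    return math.exp(log_p)

def binom_tail(k, n, r, upper=True):
    """P(X >= k) if upper, else P(X <= k), for X ~ Binomial(n, r)."""
    ks = range(k, n + 1) if upper else range(0, k + 1)
    return sum(binom_pmf(i, n, r) for i in ks)

def compare_frequencies(n1, size1, n2, size2):
    """Tail probability of the smaller set's count under the larger set's rate."""
    if size1 >= size2:
        r, k, n = n1 / size1, n2, size2
    else:
        r, k, n = n2 / size2, n1, size1
    return binom_tail(k, n, r, upper=(k / n >= r))

# Hypothetical counts: 300 of 2000 promoters versus 40 of 150 promoters.
p_value = compare_frequencies(300, 2000, 40, 150)
```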

Two Binomial Populations: We used the normal approximation to the difference of the proportions normalized by their variance to compute a z-score.

Promoter Sequences: We obtained promoter sequences in two ways. The H-based set of analyses used links from Affymetrix probe sets to RefSeq identifiers to select alignments from the DBTSS promoter sequences covering the (-1000, 200) region downloaded from http://dbtss.hgc.jp/index.html. The Q-based analyses of TATA box and initiator elements used genomic locations of DoTS genes on UCSC Golden Path release mm3 [115, 121] to identify gene names. Promoter sequences consisting of the 350 bp of the upstream region were then extracted from ENSEMBL [24]. The mouse homologs were also used as annotated in ENSEMBL.

Core Motifs: The H-based analysis used core promoter element models from EPD [31, 162]. The fraction of promoters containing each matrix was determined as follows for each set of genes (with and without CpG islands in each entropy bin) individually. Having verified that the positional distribution of each motif was sharply peaked at the appropriate place in the promoter sequences ((-40, -20) region for TATA and (-20, +20) region for the initiator element), we considered only the predictions in these windows from all genes. We used the log-likelihood function to score each subsequence against each matrix using the published score cut-offs. The YY1 motif was found in essentially every run of AlignACE and MEME performed on the downstream regions of ubiquitous CGI+ promoters. We explored different motif widths and other settings and selected the version that achieved a combination of good coverage and conservation. In all cases we estimated the background rate of random occurrence of motifs by repeatedly scrambling the individual sequences over a 10bp window to create approximately 1000 test sequences for each combination of CpG island status and specificity range. These sequences were scored in the same manner as the unscrambled sequences. We estimated the statistical significance of differences of observed frequencies using exact computation of the binomial distribution. The Q-based analyses of core motifs used the TATA box motif (TATAA) and initiator element (YYANWYY). Motif searches were carried out using the tool patternmatch from the biological workbench 3.2 [218] (http://seqtool.sdsc.edu/CGI/BW.cgi). Only the TATAA instance located closest to the start of the mRNA's alignment to the genome was used. Matches to the initiator element were required to be downstream of the TATAA box when present.

YY1 Motif: We used an AlignACE-derived weight matrix (shown in Figure 6A) to assess the occurrence of YY1-like sites as it contained the YY1 consensus and was built using approximately 100 sites, which is many more than previously published weight matrices [205, 145], also shown in Figure 6A.

GO Association Analysis: We submitted Affymetrix probe set ids of interest to the DAVID web site (http://david.niaid.nih.gov/david/ease.htm) [103, 55] and compared them either to all probe sets on the appropriate Affymetrix chips or to all genes in the selected entropy range. We compensated for multiple testing by requiring the reported p-values be better than either 0.05/1472 (cellular component) or 0.05/8972 (biological process) using the number of GO terms for the corresponding GO divisions in a Bonferroni correction.
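For reference, the two Bonferroni cut-offs quoted above work out to roughly 3.4 × 10^-5 and 5.6 × 10^-6. The few lines below only reproduce that arithmetic; the function name is hypothetical.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cut-off that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# Numbers of GO terms per division as given in the text.
cellular_component_cutoff = bonferroni_cutoff(0.05, 1472)
biological_process_cutoff = bonferroni_cutoff(0.05, 8972)
```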


Chapter 6

Transcription Factor Binding Sites in the Core Promoter

Chapter 5 introduced Hg and Qg|t as measures of tissue-specificity and considered just a few of the possible promoter sequence features that may be correlated with and/or control the overall tissue-specificity of a gene. The primary signals we identified were the TATA box and the CpG island. We also considered, but did not report, other core promoter signals that have been identified as possible fundamental components such as the downstream promoter element (DPE) [128, 34, 35] and the MED-1 (Multiple start site Element Downstream) [108], which turned out to be only minor components of mouse and human promoters. We have not yet considered all known transcription factors (TF's) to see if any of them can indicate a bias in overall tissue specificity. In this chapter we consider known TF's with positional weight matrices (PWM's) and identify approximately 30 over-represented PWM's in the core promoter that occur in at least 10% of promoters. We find that the rule evaluation engine is able to identify the known positional preferences of TF's, e.g., TATA box, Inr, Sp-1, NFY, and YY1. However, only the TATA box has any significant effect on the distribution of Hg given a gene's CpG island class.

6.1 Introduction

Because CpG islands are more reliably identified than TATA boxes and because it is necessary to control for the base composition of a sequence when searching for transcription factor binding sites, from now on we will make the presence or absence of a CpG island the primary characteristic of a promoter. We know that a CpG island is a strong indication that a gene will have widespread expression, and now want to ask if any known TF can alter the distribution of Hg conditioned on the presence or absence of a CpG island. Identification of such a TF could help improve the accuracy of rules involving TFBS's in the extended promoter by indicating the existence of different classes of promoters. In this chapter we consider all TF's with PWM's in TRANSFAC v7.3 and determine which of these are over-represented in core promoters of genes with and without CpG islands. We then determine whether the distribution of Hg for genes in the class that contain the over-represented TF's is different from the distribution for all genes with the same CpG island status. This region has been studied before, both in general [12, 70] and for particular binding sites, e.g., NF-Y [141] and CREB [48], so our (potential) biological contribution is the identification of TF's that indicate a change in the distribution of entropy. This chapter also marks the first application of the grammar evaluation system. In Chapter 7 we used the grammar parser to search for genes that had a TATA box, Sp1, YY1, or other sites using an unparameterized grammar with score thresholds and position constraints extracted from the literature. In this chapter, we will use the evaluation of rules with one feature and a few parameters to identify over-representation and positional preferences.

6.2 Results

We define the core promoter as the region roughly 100bp up- and down-stream from the transcription start site (TSS). We analyzed sequence covering the region -200 to 150bp so that we could detect whether any positional preference peaked in the core region or was stronger outside the core but still showed enrichment in the core. As shown in the previous chapter, there is a base composition change near the TSS which could produce a dip in the concentration of binding sites for a factor near the TSS and thereby create a spurious apparent peak at the upstream end of the core promoter. We used four positive sets consisting of human or mouse genes with and without start CpG islands using sequence from DBTSS as tabulated in Table 6.1. These sequences were masked for repeats then trimmed to the -200 to 150bp region. The control sets were random sequences sampled from corresponding 3rd order Markov models trained on the trimmed positive sequence sets. We considered all vertebrate PWM's in TRANSFAC v7.3 using rules of the form:

$R_i \longrightarrow F_i[\mathrm{score} \ge x_i,\ \mathrm{start} \ge y_i,\ \mathrm{end} \le z_i];$    (G6.1)

which allowed the evaluation engine to optimize start and ending positions and log-odds scores to maximize the enrichment of matches to the PWM. When we had prior knowledge about orientation sensitivity, e.g., the TATA box, we considered rules of the form:

$R_i \longrightarrow F_i[\mathrm{sense} = w_i,\ \mathrm{score} \ge x_i,\ \mathrm{start} \ge y_i,\ \mathrm{end} \le z_i];$    (G6.2)

with a fixed $w_i$ to select the desired orientation. We rounded the position parameters to 5bp to improve the evaluation speed. Figure 6.1 illustrates the kind of optimization this rule allows. We ranked the PWM's by AUC and performed a visual inspection to make sure the detected peak was above the background noise level and was in the core region. We also evaluated the same rules on 5 sets of 500 positives and 1000 negatives to assess the reproducibility of the optimal positions. A number of A/T-rich motifs, e.g., CDXA and PAX2, localized to the region of the TATA box, but we do not report them. Similarly, there are C/G-rich motifs, e.g., for CAC-binding protein or MAZ, that are nearly indistinguishable from the Sp1 site and follow similar distributions; we do not report these either.

Species        CGI–    CGI+
M. musculus    1587    1768
H. sapiens     1005    2365

Table 6.1: Number of promoters by class and species.
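To make the role of the parameters in rules G6.1 and G6.2 concrete, the sketch below runs a brute-force grid search over score threshold, start, and end, keeping the setting with the largest difference between the match rate in the positive set and in the control set. This is only a stand-in for the rule evaluation engine of Chapter 4, which is not reproduced here; the objective, the hit representation as (score, start, end) tuples, and all names are assumptions.

```python
from itertools import product

def match_rate(sequences, x, y, z):
    """Fraction of sequences with a hit satisfying score >= x, start >= y, end <= z.

    sequences: one list of (score, start, end) hits per sequence.
    """
    if not sequences:
        return 0.0
    matched = sum(
        any(score >= x and start >= y and end <= z for score, start, end in hits)
        for hits in sequences
    )
    return matched / len(sequences)

def optimize_rule(positives, negatives, scores, starts, ends):
    """Grid search returning (gain, x, y, z) with the largest rate difference."""
    best = None
    for x, y, z in product(scores, starts, ends):
        if y >= z:
            continue  # skip empty intervals
        gain = match_rate(positives, x, y, z) - match_rate(negatives, x, y, z)
        if best is None or gain > best[0]:
            best = (gain, x, y, z)
    return best

# Toy hits: positions are relative to the TSS, on a coarse grid.
positives = [[(10.0, -60, -50)], [(9.0, -55, -45)], [(3.0, 100, 110)]]
negatives = [[(10.0, 100, 110)], [(2.0, -60, -50)], []]
best = optimize_rule(positives, negatives,
                     scores=[5.0, 8.0], starts=[-200, -100, -60], ends=[-40, 0, 150])
```

Restricting the candidate starts and ends to multiples of 5bp, as described above, shrinks this grid considerably, which is the speed benefit the rounding provides.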

Frequency of motifs: Using the ROC curve, we estimated the coverage of each sequence set by the TF's that had a positional bias in the core promoter. We computed a p-value for the enrichment of TFBS in CGI– versus CGI+ classes as shown in Table 6.2. Eight (8) of the factors favor CGI– genes, 12 favor CGI+ genes, and 12 show no statistically significant preference. Most of the factors that favor a particular CGI class still show some enrichment in the other class. For example, Sp1 is found in about 0.52 of CGI+ genes but also in about 0.35 of CGI– genes. Considering TF's that are enriched in CGI– genes, STAF, OCT1, TEF, AP1, COUP/DR1, and FOXJ2 were below 0.10 in CGI+ genes. Considering TF's enriched in CGI+ genes, only Myc/Max, NRF1, and E4F were below 0.10 in CGI– genes.

Changes in Hg: Using the average optimal score and positional parameters from the replicate runs, we computed the mean Hg value for genes with a match to a given enriched PWM versus the bulk distribution as shown in Table 6.3. We also performed a visual inspection of the cumulative distributions of Hg. We find that only the TATA box has a large and consistent effect on this distribution. A small effect was found for FoxJ2 in CGI– genes. Thus once we know if a gene has a CpG island, only the TATA box provides any significant further information about its overall specificity.

[Figure 6.1: two panels, (a) Sp1 (mouse CpG+) and (b) NFY (mouse CpG-); x-axis Position [bp] from -200 to 150, y-axis Hits [10bp].]

Figure 6.1: The lines show the number of predicted Sp1 sites in 10bp bins in the mouse CGI+ promoters. Each line represents a different score threshold; the top line is 5.0, each line below represents an additional increment of 1.0. The two bold lines are 10 and 11 which bracket the optimal score of 10.4. The heavy line across the bottom of the graph represents the optimal interval.

Ubiquitous Factors: Nearly all of the factors identified as being enriched in the core promoter are also ubiquitous. The few exceptions are VDR, COUP-DR1, PPARA, TEF, and FOXJ2. A few tissue-specific factors, e.g., HNF1 and HNF3, showed some enrichment but we eliminated these because their peak concentration was at the 5' end of the core interval as mentioned above.

Orientation: Few factors showed a strong preference for a particular orientation. The strongest example of orientation dependence is YY1. In the forward orientation, it is found equally among CGI– and CGI+ genes. In the reverse orientation, it favors CGI+ genes. This result differs (slightly) from our previous results because we are using the TRANSFAC or JASPAR matrices for YY1 which are much less conserved than the PWM we developed in Chapter 5.


Table 6.2: Frequency of transcription factors enriched in the core promoter. TF's are in order of increasing differential enrichment p-value between CGI– and CGI+ genes which roughly corresponds to decreasing average enrichment. P-values reflect a Bonferroni correction for multiple testing (N = 30). The coverage shown is estimated from the ROC curves using the point of largest enrichment as described in the Method section. Similar measurements on subsets of the data suggest an average standard deviation of 0.026 for the coverage estimates. The table ends at PWM's that are found in approximately 10% of genes.

Table 6.2, first page (12 entries; the PWM labels and each column of values are listed separately, column by column):
PWM (Factor): V NFY 01 (NFY); JV YIN YANG (-) (YY1); V NRF1 Q6 (NRF1); V MYCMAX B (Myc/Max); V TATA 01(+) (TATA); V STAT1 02 (Stat1); V SF1 Q6 (SF1); V OCT1 04 (Oct-1); V ETF Q6 (ETF); V EGR Q6 (EGR); V E2F Q2 (E2F); V GC 01 (Sp1)
Coverage, H. sapiens, CGI–: 0.189, 0.249, 0.080, 0.214, 0.047, 0.068, 0.093, 0.116, 0.143, 0.077, 0.146, 0.330
Coverage, H. sapiens, CGI+: 0.077, 0.097, 0.015, 0.106, 0.122, 0.182, 0.185, 0.199, 0.220, 0.257, 0.271, 0.506
Coverage, M. musculus, CGI–: 0.156, 0.221, 0.013, 0.218, 0.049, 0.054, 0.114, 0.103, 0.126, 0.108, 0.210, 0.373
Coverage, M. musculus, CGI+: 0.093, 0.085, 0.179, 0.120, 0.143, 0.174, 0.216, 0.217, 0.243, 0.317, 0.338, 0.534
Average Coverage, CGI–: 0.173, 0.235, 0.046, 0.216, 0.048, 0.061, 0.104, 0.110, 0.135, 0.093, 0.178, 0.352
Average Coverage, CGI+: 0.085, 0.091, 0.097, 0.113, 0.133, 0.178, 0.201, 0.208, 0.232, 0.287, 0.305, 0.520
Differential Enrichment p-value: 6 × 10−5 for each of the 12 entries

Table 6.2, second page (17 entries):
PWM (Factor): V GEN INI3 B (Inr); V TFIII Q6 (TFIII); V ATF B (CREB); V ZIC1 01 (Zic); V RREB1 01 (RREB); V GKLF 01 (GKLF); JV THING1 E47 (Thing1/E47); V E4F1 Q6 (E4F); V ZID 01 (ZID); V GABP B (GABP); JV YIN YANG (+) (YY1); V AP2GAMMA 01 (AP2G); V STAF 01 (STAF); V COUP DR1 Q6 (COUP); V AP1 C (AP1); V FOXJ2 01 (FOXJ2); V TEF Q6 (TEF)
Coverage, H. sapiens, CGI–: 0.220, 0.203, 0.128, 0.079, 0.184, 0.110, 0.161, 0.121, 0.155, 0.159, 0.122, 0.104, 0.108, 0.047, 0.172, 0.194, 0.163
Coverage, H. sapiens, CGI+: 0.275, 0.280, 0.117, 0.101, 0.223, 0.131, 0.099, 0.101, 0.064, 0.030, 0.133, 0.214, 0.117, 0.066, 0.054, 0.045, 0.070
Coverage, M. musculus, CGI–: 0.215, 0.199, 0.121, 0.069, 0.145, 0.099, 0.102, 0.073, 0.131, 0.133, 0.069, 0.083, 0.117, 0.057, 0.092, 0.142, 0.122
Coverage, M. musculus, CGI+: 0.344, 0.198, 0.125, 0.120, 0.200, 0.074, 0.086, 0.151, 0.116, 0.061, 0.143, 0.184, 0.142, 0.098, 0.038, 0.072, 0.073
Average Coverage, CGI–: 0.218, 0.201, 0.125, 0.074, 0.165, 0.105, 0.132, 0.097, 0.143, 0.146, 0.096, 0.094, 0.113, 0.052, 0.132, 0.168, 0.143
Average Coverage, CGI+: 0.310, 0.239, 0.121, 0.111, 0.212, 0.103, 0.093, 0.126, 0.090, 0.046, 0.138, 0.199, 0.130, 0.082, 0.046, 0.059, 0.071
Differential Enrichment p-value (9 values given): 0.03, 0.03, 0.02, 9 × 10−4, 6 × 10−4, 2 × 10−4, 2 × 10−4, 6 × 10−5, 6 × 10−5

Table 6.2, third page (4 entries):
PWM (Factor): V AP4 Q6 (AP4); V PPARA 02 (PPARA); V VDR Q3 (VDR); V VBP 01 (VBP)
Coverage, H. sapiens, CGI–: 0.100, 0.166, 0.123, 0.180
Coverage, H. sapiens, CGI+: 0.117, 0.153, 0.160, 0.119
Coverage, M. musculus, CGI–: 0.105, 0.156, 0.132, 0.177
Coverage, M. musculus, CGI+: 0.097, 0.143, 0.140, 0.120
Average Coverage, CGI–: 0.103, 0.161, 0.128, 0.179
Average Coverage, CGI+: 0.107, 0.148, 0.150, 0.120
Differential Enrichment p-value: none given

                                          Mean Hg
Factor          Range           Score   Hs CGI–   Hs CGI+   Mm CGI–   Mm CGI+
NULL                                    4.18      4.49      4.92      5.16
V GC 01         (-180 .. -5)    10.7    4.13      4.48      4.93      5.14
V TATA 01       (-35 .. -15)    5.3     3.72      4.31      4.58      5.07
V FOXJ2 01      (-200 .. -5)    6.6     4.11      4.49      4.84      5.17
V NFY 01        (-170 .. -20)   10.0    4.26      4.49      4.94      5.15
V OCT1 04       (-200 .. -50)   6.0     4.13      4.51      4.94      5.17
V NRF1 Q6       (-150 .. 0)     11.8    4.45      4.52      5.19      5.19
JV YIN YANG     (10 .. 105)     7.7     4.20      4.51      4.87      5.14
JV YIN YANG     (10 .. 105)     7.7     4.21      4.52      4.89      5.15
JV YIN YANG     (10 .. 105)     7.7     4.21      4.52      4.89      5.15
JV YIN YANG     (-5 .. 5)       7.7     4.09      4.55      4.97      5.16
V GKLF 01       (-150 .. -5)    7.0     4.13      4.49      4.89      5.15
V GABP B        (-100 .. 60)    9.3     4.26      4.54      5.04      5.19
V EGR Q6        (-120 .. -15)   9.3     4.15      4.48      4.98      5.14
V E2F Q6 02     (-70 .. -35)    8.7     4.05      4.48      4.97      5.14
V ETF Q6        (-75 .. -20)    8.3     4.10      4.48      4.92      5.15
V GEN INI3 B    (-5 .. 5)       5.2     4.15      4.49      4.82      5.13
V STAT1 02      (-130 .. 15)    10.0    4.29      4.55      5.04      5.24
V ATF B         (-125 .. 0)     7.5     4.14      4.51      4.97      5.17
V RREB1 01      (-180 .. 0)     6.3     4.10      4.47      4.91      5.16
V STAF 01       (-140 .. 0)     5.5     4.32      4.54      5.07      5.23
V E4F1 Q6       (-100 .. 0)     8.3     4.26      4.51      4.89      5.18
V VBP 01        (-125 .. 0)     5.1     4.16      4.51      4.85      5.15
V TFIII Q6      (-125 .. -25)   6.0     4.18      4.49      4.89      5.17
V AP4 Q6        (10 .. 50)      5.8     4.03      4.48      4.87      5.14
V SF1 Q6        (40 .. 135)     6.2     4.15      4.50      4.91      5.17

Dir: the orientation column contains five '+' entries whose rows cannot be recovered from the source layout.

Table 6.3: Summary of mean conditional entropy values for single TFBS regions, scores, and orientations in human and mouse CGI– and CGI+ core promoters. The 'NULL' row indicates the mean value for all genes. Only genes containing a TATA box (highlighted) show any consistent practical difference from the overall distribution.


6.3 Methods

Sequence Preparation Sequences were obtained, assessed for CpG islands, and masked for repeats as described in Chapter 5.

Scoring Binding Site Models We used all vertebrate PWM’s in TRANSFAC v7.3 (N = 494) and JASPAR (downloaded 2004/07/01) (N = 81) for a total of 575 PWM’s. Given a matrix with observed frequencies $w_{b,i}$, we scored sites using a log-likelihood ratio score

$$L_A(S) = \sum_{1 \le i \le W} \left[ \lg(p_{s_i,i}) - \lg(0.25) \right] \qquad (6.1)$$

where $W$ is the length of the PWM, $s_i$ is the i-th base in the $W$-mer $S$ that is being scored, and $p_{b,i}$ is the probability of base $b$ at position $i$ derived from the frequencies $w_{b,i}$. Because the binding sites used to define each PWM are available for some but not all of the PWM’s, we adopted a global minimum threshold score of $L_A \ge \tau_A = 5$ for all PWM’s. Examination of the PWM’s for which training data was available suggests that this threshold yields at least 90% sensitivity for nearly all sites. If the average number of sites per sequence was more than six, we adjusted $\tau_A$ upward to get an average of six hits per sequence.

Evaluating Rules Rules were evaluated as described in Chapter 4 with the parameters as described above. Scores were rounded to 0.1 and start and end positions were rounded to 5bp.

Estimating the coverage We estimate the coverage, the fraction of positive sequences containing a non-random instance, for PWM’s by applying the formula

$$C = 1 - \frac{1 - r_{TP}}{1 - r_{FP}} \qquad (6.2)$$

derived in Chapter 4 to the individual $r_{TP}$ and $r_{FP}$ that maximize $r_{TP} - r_{FP}$ for each PWM.

Estimating the standard deviation of coverage estimates We evaluated rules on 5 sets of 500 core promoters and 1000 control sequences for each promoter class. We applied the same coverage estimation procedure to the ROC curves from these results and computed the standard deviation of these numbers. These values were consistent across all promoter classes.
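The scoring and coverage formulas above (Equations 6.1 and 6.2) can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the toy count matrix, and in particular the pseudocount used to turn counts into probabilities are assumptions made only for illustration. The uniform background of 0.25 and the default threshold of 5 come from the text.

```python
import math

BASES = "ACGT"

def column_probabilities(counts, pseudocount=1.0):
    """Turn per-position base counts into probabilities.

    The pseudocount is an assumption of this sketch; the text does not say
    how zero counts were handled.
    """
    probs = []
    for col in counts:
        total = sum(col[b] for b in BASES) + 4 * pseudocount
        probs.append({b: (col[b] + pseudocount) / total for b in BASES})
    return probs

def llr_score(probs, site):
    """Equation 6.1: sum over positions of lg(p[s_i, i]) - lg(0.25)."""
    return sum(math.log2(col[base]) - math.log2(0.25)
               for col, base in zip(probs, site))

def scan(probs, sequence, threshold=5.0):
    """Return (start, score) for every W-mer scoring at or above the threshold."""
    w = len(probs)
    hits = []
    for start in range(len(sequence) - w + 1):
        score = llr_score(probs, sequence[start:start + w])
        if score >= threshold:
            hits.append((start, score))
    return hits

def coverage(r_tp, r_fp):
    """Equation 6.2: C = 1 - (1 - r_TP) / (1 - r_FP)."""
    return 1.0 - (1.0 - r_tp) / (1.0 - r_fp)

def best_coverage(roc_points):
    """Apply Equation 6.2 at the ROC point (r_fp, r_tp) that maximizes r_tp - r_fp."""
    r_fp, r_tp = max(roc_points, key=lambda p: p[1] - p[0])
    return coverage(r_tp, r_fp)

# Toy 4-column matrix with consensus TATA (hypothetical counts).
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 8},
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 8},
    {"A": 8, "C": 0, "G": 0, "T": 0},
]
toy_probs = column_probabilities(toy_counts)
toy_hits = scan(toy_probs, "GGTATAGG")
```

In this sketch a sequence with no W-mer at or above the threshold simply yields an empty hit list, which corresponds to a sequence that does not match the single-PWM rule.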

6.4 Discussion

Biology Surprisingly, only the TATA box shows any evidence of an entropy distribution that differs from the corresponding CGI-dependent distribution in all data sets. We had expected some of the C/G-rich motifs or perhaps the CAAT box to show a strong preference for low-specificity genes. In particular, none of the CGI+-favoring factors affect the entropy distribution. They all have a proportional number of examples of low-entropy, high-specificity genes. This might be due to a relatively high rate of false positive background sites that mask the selective effects of the binding site. However, at least in the case of Sp1, this appears not to be the case. We tried a very stringent site score threshold for the PWM V$GC_01 and did not find any change in the entropy distribution. Few if any tissue-specific factors bind directly to the core promoter. Many of the core promoter-binding factors are present in both CGI– and CGI+ genes even if they are more enriched in one class or the other. As the enriched factors are often involved in growth response, cell-cycle, or embryonic expression, we propose that these processes are in some sense orthogonal to tissue specificity, i.e., tissue-specific genes are regulated by these processes in proportion to their overall frequency.

Machine Learning We found that the evaluation system was able to identify known positional preferences, such as those of the TATA box, Sp1, CAAT box, and YY1, as described in Chapter 5. We did not evaluate the variation in positional preferences. The variation will depend on the slope of the site density as well as the size of the data set. Anecdotal experience suggests that the 10% frequency is a reasonable cutoff. Above this threshold the detected peak appeared to be stable and significantly higher than the noise in the surrounding bins.

Limitations of the Data The work described in this chapter was done with promoters for about 20% of all of the genes expected to be in the mouse or human genome; thus better results can probably be had once accurate TSS’s are available for all genes. Such a small fraction, 20%, leaves open the possibility that one or more of the motifs we considered will be shown to have an effect in the larger data set. This would not be so likely if the genes we considered were a random sample of the whole genome, but, as the history of our understanding of the TATA box illustrates, the well-studied genes are not a representative sample of genes. We did not consider alternate promoters. It may be that there are signals that would become clear if we could consider expression from alternate promoters.


Chapter 7

Transcription Factor Binding Sites in Liver-Specific Promoters

7.1 Introduction

Identification of the transcription factors (TF’s) and arrangements of their binding sites (TFBS’s) that control tissue-specific gene expression is still an open problem in post-genomic biology. We apply a grammar formalism to promoters of mouse liver-specific genes that lack a CpG island to identify the TF’s that are predictive of liver expression, then further analyze the promoters to identify the arrangements of these sites that show increased selectivity for liver-specific genes. We limit our grammars to single rules that describe collections of TFBS’s where the order and number of instances of sites may or may not matter. We explore the set of possible rules by considering more complex rules only when all components of the rule contribute to the selectivity of the rule and when the rule is an improvement over all of its predecessors. To prevent over-learning we perform a seven-fold cross-validation and identify the consensus rules. The rules are validated by showing that they select for genes expressed in the human liver but do not select for genes expressed in unrelated tissues such as the cerebellum. Our method identified the core of a previously identified rule and also identifies a number of new, biologically plausible rules. We find that the order of sites is rarely important, but a few successful rules do consider the number of instances of a particular TFBS.

7.2 Background

Liver is a large glandular organ with many important functions including detoxification and filtering of the blood, storing and releasing glucose, processing fat, producing digestive enzymes (exocrine function), and producing serum proteins (endocrine function). It contains five cell types: hepatocytes, which form the bulk of the liver mass, duct cells, Ito cells, endothelial cells, and Kupffer macrophages. The expression data we use was based on whole liver and so largely reflects the gene expression pattern of hepatocytes. Gene expression in the liver is regulated in large part by a set of liver-specific transcription factors recently reviewed by Schrem, Klempnauer, and Borlak [186, 187]: the hepatocyte nuclear factors (HNF1, HNF3, HNF4, and HNF6), CCAAT/enhancer binding protein (C/EBP), and D site-binding protein (DBP), among others.

A logistic regression analysis (LRA) model of the control of gene expression in liver was developed by Krivan and Wasserman [127]. This work used 16 known control regions from 15 genes from different species. Binding sites were taken from the literature. The model was built using weight matrix binding sites for HNF1, HNF3, HNF4, and C/EBP; however, only the HNF1 and HNF4 sites were found to be statistically and practically significant. Thus the LRA model is roughly equivalent to Grammar 7.1.

$$S \xrightarrow{\{250\}} \mathrm{HNF1},\ \mathrm{HNF4}; \qquad (G7.1)$$

Since so little data was used in training the LRA model, it will be interesting to see how rules learned on larger data sets will compare to the LRA model. In addition, we want to know if any factors combine with HNF3 or other TF’s not considered in the LRA model. Recently, large scale ChIP-chip experiments by Odom and coworkers [151] have presented a picture of the binding of HNF1, HNF4, and HNF6 in human liver- and pancreas-expressed genes. In particular HNF4 was found to bind to about 45% of the genes transcribed in liver whereas HNF1 and HNF6 bound to around 10% of the transcribed genes. In Figure 7.2 we plot the distribution of Qg|liver for the human genes identified by Odom. We note that HNF1 is by far the most predictive of liver-specific expression of the three factors considered. This data offers a point of comparison with our findings.

7.3 Results

7.3.1 Identifying Liver-Specific Genes

Liver-specific genes were selected using the expression profiles described in [216]. We applied the conditional specificity statistic Qg|t defined in Chapter 5 to this data and selected 100 CpG– mouse genes with Qg|liver ≤ 7 and promoters in DBTSS [220]. These represent the most liver-specific genes with the most common promoter class in this specificity range. The genes are listed in Appendix B Table B.1. We analyzed the -1000 to 200 region relative to the transcription start site (TSS) after masking repeats.

7.3.2 Identifying TF’s Over-Represented in Liver-Specific Promoters

To identify TF’s over-represented in liver-specific promoters, we considered all positional weight matrix models in TRANSFAC v8.4 [145] and JASPAR [182]. The control set consisted of sequences sampled from a third-order Markov model (MM) of CpG– promoters. Since the control set is randomized sequence, this test has the potential to identify both liver-specific factors and factors that may be less liver-specific in expression but still occur in many liver-specific promoters. A receiver operating characteristic (ROC) graph was generated for each PWM by evaluating a rule of the form:

$$R \rightarrow F_i[\mathrm{score} \ge \theta_i]; \qquad (7.1)$$

where $F_i$ represents a match to the i-th PWM and $\theta_i$ is a PWM-specific log-likelihood ratio score threshold. Using the ROC graph we calculated the area under the curve (AUC) for each PWM.

Absolute Rankings The top 30 unique PWM families resulting from this analysis using all 100 liver-specific promoters and 500 sequences drawn from the CpG– promoter model are shown in Table 7.1. This list agrees well with expectations as many of the known liver-specific factors are highly ranked, e.g., HNF1, HNF3, HNF4, and HNF6. The forkhead TF’s FOXO3, FREAC7, and XFD2 have PWM’s similar to the PWM for HNF3. Additionally, TF’s active in, but not specific to, liver, such as glucocorticoid receptor (GR) and related motifs, were also highly ranked. PBX1 is involved in hematopoiesis in the fetal liver [59]. ARP1 (apolipoprotein AI regulatory protein 1) has a HNF4/COUP-like site and is involved in liver-related processes. NKX6.1 is known as a pancreas marker, so its appearance in liver is unexpected. Evi1 (ectopic viral integration site 1) is a proto-oncogene that is involved in the control of proliferation and differentiation, particularly in hematopoiesis [32]. Evi1 is expressed in other gut organs (among others) such as the large intestine, stomach, bladder, kidney, and uterus, but not in the liver. Its binding site contains a repeated GATA-like motif. A role for Evi1 in the genes of the adult liver is not clear, although a connection is possible.
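The per-PWM ROC analysis for the rule form in (7.1) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the evaluation code of Chapter 4: each sequence is summarized by the score of its best site for the PWM (a hypothetical simplification), the threshold $\theta_i$ is swept over all observed scores, and the AUC is computed with the trapezoidal rule.

```python
def roc_points(pos_scores, neg_scores):
    """ROC points (r_fp, r_tp) for the rule 'best site score >= theta'.

    pos_scores / neg_scores hold one best-site score per positive (promoter)
    or control sequence; theta is swept over every observed score.
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for theta in thresholds:
        r_tp = sum(s >= theta for s in pos_scores) / len(pos_scores)
        r_fp = sum(s >= theta for s in neg_scores) / len(neg_scores)
        points.append((r_fp, r_tp))
    points.append((1.0, 1.0))
    return points

def auc(points):
    """Trapezoidal area under a set of ROC points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical scores for four promoters and four control sequences.
example_auc = auc(roc_points([9, 8, 7, 3], [6, 5, 2, 1]))
```

An AUC of 0.5 corresponds to a PWM that matches promoters and control sequences equally often at every threshold, which is why the values in Table 7.1 are best read relative to that baseline.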


Rank  Factor              AUC    Family
1     HNF1                0.737  HNF1
2     RREB-1              0.670  RREB/LUN1/LYF1
3     HNF3                0.655  HNF3
4     SP1                 0.652  SP1
5     CACBINDINGPROTEIN   0.646  CACC
6     TANTIGEN-B          0.631  TANTIGEN
7     FOXO3               0.626  FREAC/GTAAACA
8     ARP1                0.621  ARP1/LFA1
9     FOX                 0.619  HFH/FOX
10    NKX61               0.619  SOX
11    COUP-TF             0.617  HNF4/TCF4/COUP/PPAR
12    STAF                0.617  STAF
13    GR                  0.616  GR/PR/AR-short
14    HNF3ALPHA           0.615  CAAA/HNF3
15    NF1                 0.613  CAAT
16    FREAC7              0.612  FREAC7/ATAAA
17    OCT1                0.608  OCT1
18    XFD2                0.602  XFD2
19    IRF-2               0.601  IRF
20    EVI1                0.598  EVI1
21    PBX1                0.591  PBX
22    NRSF                0.590  NRS
23    FREAC3              0.588  FREAC/GTAAATA
24    ERR1                0.585  ER-LEFT
25    RSRFC4              0.583  MEF2/RSRFC4
26    HNF6                0.583  GF1/HNF6
27    RFX1                0.580  RFX1/MIF1
28    VBP                 0.580  TTAC/EFC/NCX/VBP
29    TCF11MAFG           0.578  TCF11
30    PR                  0.575  AR/PR/GR-long

Table 7.1: Top 30 TF families in CpG– liver-specific promoters


We also looked at the overlap between the top 25 PWM families enriched in liver, large intestine, or cerebellum as shown in Figure 7.1. We found four factors, RREB1 (ras-responsive element binding protein 1), Sp1 (stimulating protein 1), Oct-1 (octamer factor 1), and IRF-2 (interferon regulatory factor 2), that were enriched in all three sets. The TF’s enriched in both liver-specific and large intestine-specific genes included HNF1, HNF3, FOX, Nkx6.1, HNF4, FREAC7, and Evi1. TF’s common to liver-specific and cerebellum-specific genes were CAC binding protein, T-antigen, and ARP-1. We note that the related site COUP-TF is involved in neurogenesis or central nervous system patterning [79] and sensory ganglia [107], so this is a reasonable overlap. A PWM for RFX-1 was common to both cerebellum- and large intestine-specific promoters. The fact that the largest overlap of tissue-specific TF’s was between developmentally related tissues, whereas the TF’s common to all tissues were ubiquitous, suggests that we are correctly identifying over-represented TF’s.

[Figure 7.1 diagram. Liver and large intestine: HNF1, HNF3, FOX, SOX, HNF4, FREAC7, Evi1. Liver and cerebellum: CAC BindProt, Tantigen, Arp1. All three tissues: RREB, Sp1, Oct-1, IRF. Large intestine and cerebellum: RFX1.]

Figure 7.1: Overlap of tissue-specific TF’s between three tissues. Related tissues liver and large intestine show a larger overlap than unrelated pairs involving cerebellum.

Relative Rankings The AUC value for a PWM is influenced by two factors: its prevalence in the positive set and its information content, which controls its rate of random occurrence as discussed in Chapter 4. A factor with low information content will not be highly rated by the AUC statistic even in the tissue in which it is most active. To further confirm that our method is correctly identifying enriched TF’s we considered the relative enrichment of TF’s. We considered a diverse set of 18 tissues (adrenal gland, amygdala, cerebellum, heart, kidney, large intestine, lung, ovary, prostate, skeletal muscle, small intestine, spinal cord, spleen, testis, thymus, thyroid, and uterus) in turn and selected the TF family members that had their largest or second largest AUC in the tissue of interest. Considering, for example, muscle tissues, we found that muscle-specific factors MYOD, myogenin, and MEF2 were ranked highest in genes specific to skeletal muscle or heart where they appeared in the top 20 TF families in skeletal muscle. Similar results were found for genes specific to cerebellum, where the TF’s DEAF-1 (deformed epidermal autoregulatory factor 1), NRSF (neuron-restrictive silencer factor), and NCX (enteric neuron homeobox) were ranked first or second in the cerebellum as compared to other tissues.
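The relative ranking described above, selecting the TF families whose largest or second largest AUC occurs in the tissue of interest, can be sketched as follows. The data layout, the function name, and the AUC values in the toy dictionary are all invented for illustration and are not results from this work.

```python
def relative_enrichment(auc_by_family, tissue, top_n=2):
    """Families whose AUC in `tissue` is among their top_n AUCs over all tissues.

    auc_by_family: {family: {tissue: auc}} (hypothetical layout).
    """
    selected = []
    for family, aucs in auc_by_family.items():
        ranked_tissues = sorted(aucs, key=aucs.get, reverse=True)
        if tissue in ranked_tissues[:top_n]:
            selected.append(family)
    return sorted(selected)

# Made-up AUC values, used only to exercise the function.
toy_aucs = {
    "MEF2": {"skeletal muscle": 0.66, "heart": 0.64, "liver": 0.52, "cerebellum": 0.50},
    "HNF1": {"skeletal muscle": 0.50, "heart": 0.51, "liver": 0.74, "cerebellum": 0.55},
}
heart_families = relative_enrichment(toy_aucs, "heart")
liver_families = relative_enrichment(toy_aucs, "liver")
```

Ranking each family against itself across tissues, rather than against other families within one tissue, is what lets a low-information PWM be recognized in the tissue where it is most active.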

7.3.3 Combinations and Arrangements of TF’s

We next considered rules describing combinations and arrangements of PWM’s to see which rules yielded more selective identification of liver-specific genes. We used the best representative PWM from the top 20 PWM families and considered rule performance in the 100 mouse liver-specific genes versus 500 sequences sampled from the CpG– promoter Markov model. We also performed a seven-fold cross-validation to control over-learning. We explored the set of possible combinations by following the predecessor relations as described in Chapter 4. We set a limit of six TFBS instances per rule and a maximum size bound of 300bp which is optimized in the evaluation process. We measured the improvement in selectivity between a rule and its predecessors by requiring that the maximum area between the curves (MABC) statistic be at least 0.01.

Figure 7.3 plots average AUC’s for rules found in all seven cross-validation runs for both the cross-validation training (a) and held-out or test (b) data versus the full data set. The training data shows average AUC’s that are very slightly larger than the full data set. The cross-validation training sets were used to identify both rules and parameter value tuples for point performances that lie on the ROC curve. When we evaluate these rules on the held-out (test) data using the learned parameter value tuples, we find their performance is slightly lower than, but well correlated with, the AUC’s from the full data. In many cases the training procedure produces more than one set (or tuple) of parameter values for each point performance, i.e., there may be a few ways to achieve the same performance. In evaluating the held-out AUC’s we have arbitrarily chosen just one tuple per point. Including more tuples per point yields better AUC’s on the held-out data, but this amounts to retraining and a user of the training results has no way of knowing which parameter set is better.

The more complex rules have more free parameters and so are more prone to over-fitting. However, comparing the AUC’s for these rules to the solo features, which are circled in part (b), we see that the more complex rules do not show significantly worse relative performance on average. Thus the seven-fold cross-validation has identified a set of rules that can be consistently identified from the data and we find their performances are well correlated with their performance on the full set.

As shown in Table 7.2 we found 382 rules in the full run and 209 rules in every one of the cross-validation runs. We are interested in the leaf rules that could not be improved upon by other rules since they represent the most specific rules possible. There are a total of 229 leaf rules in the full set and 107 leaf rules in the consensus set. The table also shows the number of rules that the learning algorithm considered and the number of possible rules that it might have considered. By comparing the Considered and Possible columns it is clear that the learning algorithm is able to avoid evaluating a large number of the possible rules.

We now examine the rules to see if the algorithm is learning reasonable rules. We consider just the rules learned in the cross-validation analysis and identify general biological lessons from these rules. We take the known liver-specific factors HNF1, HNF3, and HNF4 as the anchors of the analysis and examine how these factors relate to each other and to the other TF’s in the chosen set of rules.
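The search strategy described above (evaluate a more complex rule only when all of its predecessors were kept, keep it only when it improves on each of them, then take the rules found in every cross-validation run) can be sketched as follows. This is a simplified illustration, not the algorithm of Chapter 4: it handles set rules only, it assumes that the predecessors of a rule are obtained by dropping one feature, and it uses the gain in a single scalar performance value as a stand-in for the MABC statistic. The names `learn_set_rules`, `consensus`, and `evaluate` are hypothetical.

```python
from itertools import combinations

def learn_set_rules(features, evaluate, min_gain=0.01, max_size=3):
    """Level-wise search over set rules.

    evaluate(rule) returns a performance value (for example an AUC) for a
    frozenset of feature names. A rule is evaluated only if every predecessor
    (the rule minus one feature) was kept, and it is kept only if it beats
    each predecessor by at least min_gain.
    """
    kept = {frozenset([f]): evaluate(frozenset([f])) for f in features}
    considered = len(kept)
    for size in range(2, max_size + 1):
        new_rules = {}
        for combo in combinations(sorted(features), size):
            rule = frozenset(combo)
            predecessors = [rule - {f} for f in rule]
            if not all(p in kept for p in predecessors):
                continue  # pruned without being evaluated
            considered += 1
            performance = evaluate(rule)
            if all(performance - kept[p] >= min_gain for p in predecessors):
                new_rules[rule] = performance
        kept.update(new_rules)
    # Leaf rules: kept rules that no kept rule extends.
    leaves = [r for r in kept if not any(r < other for other in kept)]
    return kept, leaves, considered

def consensus(rule_sets_per_fold):
    """Rules found in every cross-validation run."""
    return set.intersection(*map(set, rule_sets_per_fold))
```

The pruning step is what keeps the number of considered rules far below the number of possible rules: a three-feature rule is never scored unless all three of its two-feature predecessors already showed an improvement.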


[Figure 7.2 plot: CDF (0.0 to 1.0) versus Q-liver (0 to 14); curves: All, HNF1, HNF6, HNF4.]

Figure 7.2: The distribution of Qg|liver for targets of HNF factors identified in [151]. HNF1 is the most liver-specific followed by HNF6 and HNF4.

[Figure 7.3 plots: (a) Training, CV Train AUC versus Full Data AUC; (b) Testing, CV Test AUC versus Full Data AUC, with solo features circled. All axes run from 0.50 to 0.80.]

Figure 7.3: A comparison of the average AUC’s of consensus rules from the seven-fold cross-validation. (a) compares the average AUC’s in the training data versus the full data set. (b) compares the average AUC’s from the held-out or testing data versus the full set.

Number of Rules Number of Features 1 2

3

4

Collection Type set bag list set bag list set bag list

CrossValidation 20 144 12 9 35 9

Total Leaf

All Data 20 159 14 29 131 39 5 2 3

209 107

382 229

Considered 20 190 20 330 721 243 846 32 71 84 2537

Possible 20 190 20 400 1140 400 8060 4845 4200 122380 141635

Table 7.2: Count of rule types found in promoter of liver-specific genes by size in both seven-fold cross-validated consensus rules and rules learned from all data. The considered column indicates the number of rules considered in the full run. The possible column indicates the number of rules that can be formed from 20 features.

7.3.4 Two-Feature Rules

We first consider the rules with two feature instances.

Two-Bags: Self pairs These are rules of the form

$$S \xrightarrow{\; n_i \;} F_i : 2; \qquad (G7.2)$$

which describe homodimers that are possibly widely spaced. The HNF1 self pair exhibited only a marginal performance increase over a solo HNF1, with a mean MABC of 0.007 in the cross-validation, which was below our threshold, and thus it was not included in the selected rules. The self pair for the ARP1 form of HNF4 had a similar mean MABC and was also not selected. However, the COUP version of HNF4 did have improved performance, as did both forms of HNF3. Consistent with the HNF3 self pairs, the PWM’s for other forkhead TF’s FOX, FOXO3, FREAC7, and XFD2 also formed improved self pairs. PWM’s for the ubiquitous factors CAC-binding protein, Sp1, and NF1 did not form self pairs, nor did PWM’s for STAF, RREB, and T-antigen.

Two-Sets These are rules of the form

$$S \xrightarrow{\{n_{ij}\}} F_i,\ F_j; \qquad (G7.3)$$

which describe heterodimers that are possibly widely spaced. As shown in Table 7.2 most of the possible two-sets were found to show increased performance over their component features. Concentrating on the HNF factors, we find the forkhead PWM’s consistently pair with HNF1. The COUP HNF4 site pairs with HNF1, but not the ARP1 version. Both forms of HNF4 formed improved pairs with most of the forkhead factor PWM’s. Because so many two-sets showed improvement, we instead consider the sets that did not show improvement. Most common among the PWM’s not included in two-sets were T-antigen (15 bad pairs), RREB (14), HNF1 (9), and Sp1 (7). The near absolute lack of pairs including T-antigen and RREB suggests that these PWM’s may be incorrectly identified as enriched by comparison with the Markov model control set. The single site optimal score threshold for T-antigen is about 5.3, i.e., near the minimum we allowed, and the T-antigen PWM has very high information content. These two facts also suggest that the identification of T-antigen was an artifact. The case for RREB is less clear since the single feature optimal threshold score is higher at 8.4. The RREB protein is expressed in all tissues except brain, so it is possible that it is widely acting.

Two-Lists These are rules of the form

$$S \xrightarrow{[n_{ij}]} F_i,\ F_j; \qquad (G7.4)$$

which describe hetero- or homo-dimers that are possibly widely spaced where the order of the factors matters. The learning procedure on the full data set considered 330 2-list rules and found that 29 showed improvement. Of these, nine were found among the consensus rules from the cross-validation. Among the nine two-list rules, four involved NF1 (CCAAT box) in the downstream position, i.e., closer to the TSS, which reflects the known positional bias of NF1 sites [141]. The size bound for these rules was about 270bp indicating that the companion factors of NF1 were occurring much further upstream. None of the list rules were pairs of the HNF family members though individual members did pair with other factors. Both HNF4 sites, COUP and ARP1, were found downstream of EVI1, though with different size bounds of 190bp and 300bp respectively.
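To make the three two-feature rule forms concrete, the following Python sketch tests whether a promoter’s site hits contain an instance of a set, bag, or list rule within a size bound. The semantics are assumptions made for illustration only: the size bound is taken as the maximum distance between the start positions of the first and last site of an instance, a bag such as F:2 is written by repeating the factor name, and a list rule requires its factors in the given order along the sequence. The formal definitions are those of Chapter 4, and the function names here are hypothetical.

```python
from collections import Counter

def _contains_in_order(factors, pattern):
    """True if `pattern` occurs as an ordered subsequence of `factors`."""
    remaining = iter(factors)
    return all(any(f == wanted for f in remaining) for wanted in pattern)

def rule_matches(hits, features, size_bound, kind="set"):
    """Check for an instance of a collection rule among site hits.

    hits:       list of (position, factor) tuples for one sequence
    features:   factor names; repeat a name to require several instances
    size_bound: maximum span in bp between first and last site (assumed meaning)
    kind:       "set" or "bag" ignore order; "list" requires the given order
    """
    hits = sorted(hits)
    needed = Counter(features)
    for i, (start, _) in enumerate(hits):
        # All hits that lie within size_bound of this leftmost candidate site.
        window = [f for p, f in hits[i:] if p - start <= size_bound]
        if kind == "list":
            if _contains_in_order(window, list(features)):
                return True
        else:
            available = Counter(window)
            if all(available[f] >= n for f, n in needed.items()):
                return True
    return False

# Hypothetical hit positions, used only to exercise the function.
toy_hits = [(100, "EVI1"), (250, "COUP"), (900, "HNF1")]
```

With these toy hits, the set rule {HNF1,COUP} fails at a 250bp bound because the two sites are 650bp apart, while the list rule [EVI1,COUP] succeeds at a 190bp bound and its reversal does not.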

Size Bounds We plot the cumulative distribution of the optimal size bounds averaged over the cross-validation runs in Figure 7.4. There is essentially no preferred length, but a very small increase can be observed between 200 and 300bp, which could be due either to the redundancy of forkhead factors or to the imposition of the 300bp limit. In the latter case, rules with an actual optimal size longer than 300bp might tend to pile up near the 300bp limit as the rule matches the shorter members of the family. The curves suggest that this is not a major problem.

The glucocorticoid receptor (GR) forms the smallest two-bag rule; the size bound in the full data set is 40bp and in the seven-fold cross-validation runs it is less than 40bp in three runs, about 120bp in two runs, and 285bp in the remaining two runs. The forkhead factor XFD2 two-bag is also closely spaced with optimal size bounds between 35 and 65bp in all but one cross-validation run. Thus there is no evidence for very closely spaced dimers. The {HNF1,COUP} rule has an average optimal size bound of 221 ± 6bp across the cross-validation runs, which closely mirrors the results of Krivan and Wasserman [127], who selected an optimal size of 250bp. Thus our method was able to automatically recapitulate their finding.

[Figure 7.4 plot: Cumulative Probability (0.0 to 1.0) versus Average Optimal Size [bp] (0 to 300); curves: 2-Feature Rules, 3-Feature Rules.]

Figure 7.4: The cumulative distribution of average optimal size bound for two- and three-feature rules. The curves indicate that the distribution is nearly uniform. There is a very small increase between 200 and 300bp, which could be due either to the redundancy of forkhead factors or to the artificial limit of 300bp.

7.3.5 Three-Feature Rules

There are 35 three-set rules in the cross-validation consensus set. However, a large number of these are redundant combinations of various forms of forkhead PWM’s. The simplified list is shown in Table 7.3. We note that none of these rules involve HNF1 whereas most of them contain a forkhead site and/or an HNF4 site. The tendency of HNF3/forkhead sites to cluster is highlighted in these rules as six of them involve two or more HNF3 sites. The occurrence of GR with HNF3 and HNF4 sites is consistent with GR’s requirements for accessory factors in the promoter of the PEPCK gene [210]. We find Oct-1 to be a companion factor of GR and these two factors have been shown to associate in vivo [167, 43, 85]. Looking ahead to Chapter 8, we note that this combination will be identified on the basis of ChIP-chip data as well.

Rule
{HNF3, EVI1, GR}
{HNF3, EVI1, NKX61}
{HNF3, EVI1, OCT1}
{HNF3, GR, OCT1}
{HNF3, NKX61, OCT1}
{HNF3, HNF4, CAC}
{HNF3, HNF4, GR}
{HNF3, HNF4, OCT1}
{GR, HNF4, IRF}
{GR, HNF4, OCT1}
{GR, IRF, NKX61}
{GR, NF1, SP1}
{GR, EVI1, OCT1}

Table 7.3: The three-feature instance rules found in liver-specific promoters. HNF3 indicates one or more of a variety of forkhead TF’s. The ranking is by component features, not performance.

7.3.6 HNF1 Companions

Since HNF1 appeared in relatively few of the two-feature rules and none of the three-feature rules, we decided to search more widely for companion TF’s for HNF1. We considered HNF1 sites paired with the best PWM’s from each family as assessed in the single feature analysis described above. The few companion factors that show any improvement over HNF1 by itself are listed in Table 7.4. The improvement in the AUC is quite small, less than 0.013, and the average size bound was large at 211bp. This suggests a few hypotheses: HNF1 may act without closely spaced partners, have an as yet unknown partner, or may not require any partners as suggested by the data in [127] which contains a few regulatory regions with just an HNF1 site.

Rule            Size Bound   AUC
HNF1                         0.7373
{GATA1,HNF1}    170          0.7502
{GKLF,HNF1}     210          0.7668
{FOX,HNF1}      215          0.7574
{HNF1,PIT1}     220          0.7663
{COUP,HNF1}     225          0.7508
{FAC1,HNF1}     235          0.7556
{HNF1,VDR}      260          0.7560
{GR,HNF1}       265          0.7612
{AP1,HNF1}      285          0.7574

Table 7.4: Companion factors for HNF1 that show improvement in AUC. There are very few and the size bounds are large. The improvement in AUC over the AUC for HNF1 alone is very small.

7.3.7 Combinations with the TATA Box

We performed a check to see if any of the consensus rules tended to co-occur with a TATA box. Of the 100 genes in our data set, 56 had TATA boxes as measured by the PWM and threshold defined by Bucher [31]. The mean TATA-box fraction is 0.61 for the consensus rules. The lowest rate for a meaningful rule is 24/52 = 0.46 for {COUP-TF,FREAC7}. The highest is {HNF1,OCT1} with 39/53 = 0.74. Visual inspection suggests that many of the TATA-enriched rules contained GR and Oct-1 features. In particular, the fourth most TATA-enriched rule is {GR,OCT1}, which has a 0.72 (41/57) rate of TATA box occurrence. This is significantly ($P = 229 \times (4.2 \times 10^{-5}) \approx 0.01$; hypergeometric) enriched after a Bonferroni correction for multiple testing. This suggests that we may be missing rules relevant to the regulation of TATA-less promoters.
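The kind of test quoted above, a hypergeometric tail probability followed by a Bonferroni correction, can be sketched in Python with the standard library. The sketch assumes a one-sided upper tail and plugs in the counts quoted in the text (100 genes, 56 with a TATA box, 57 genes hit by the rule, 41 of those with a TATA box). Whether these are exactly the parameters behind the reported value is an assumption, so the code makes no claim to reproduce it.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from a
    population containing `successes` marked items."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(k, upper + 1)
    )
    return favourable / comb(population, draws)

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Counts quoted in the text for {GR,OCT1}; the choice of test parameters is assumed.
p_single = hypergeom_upper_tail(41, population=100, successes=56, draws=57)
p_corrected = bonferroni(p_single, n_tests=229)
```

The correction factor of 229 in the text matches the number of leaf rules in the full set, so the corrected value is simply the single-test probability multiplied by the number of rules examined.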

7.3.8 Performance

Figure 7.5 shows an example of the learning process for the rule {EVI1,FOX,HNF3}, which was the most highly rated rule in the cross-validation runs and is the top ROC curve in the plot. The three dashed ROC curves are for the two-set predecessors, {EVI1,FOX}, {EVI1,HNF3}, and {FOX,HNF3}. The three lowest solid lines are for the individual component features. There is a dramatic improvement between the solos and the final rule. Adding more features to three-set rules typically yields very little if any additional area. The optimal parameters on the full data set yield $r_{TP} = 0.62$ and $r_{FP} = 0.16$, which is about a 3.9-fold ratio. However, the plot also indicates there are more stringent parameter settings that will yield, for example, $r_{TP} = 0.40$ at $r_{FP} = 0.042$, a 9.5-fold ratio which may be more useful for identifying candidates for experimental verification. The $r_{FP}$ from the ROC graph is calculated on a 1200bp search per gene. At $r_{FP} = 0.042$ this corresponds to roughly a 1/28000 bp false positive rate. The full evaluation of all parameters in combination with the ROC curve offers the user of the rules the ability to fine tune parameters to get the performance that best suits the application at hand.

In this case there is a list rule, [EVI1,FOX,HNF3], that has better performance in very stringent searches. However, we take the view that if a set rule is an approximation of what is actually a list arrangement, then the list rule should present a dramatically better $r_{FP}$ with essentially the same $r_{TP}$. Converting a three-set rule to a list rule with the same size bound should reduce the $r_{FP}$ by about a factor of six at a given $r_{TP}$. This rule did not exhibit such an improvement. Its MABC was between 0.001 and 0.005 in the cross-validation runs and so was not reported as an improvement.

We can also examine which of the Krivan and Wasserman training set are hit by our rules. We had DBTSS promoters for eight members of the training set. Three of these, PAH, PROC, and SLC2A2, were included in the training set because they had Qg|liver ≤ 7. We find that the {HNF1,HNF4} rule matches two of these and four of the remaining five genes. On the other hand, the rules in the family show a range of behaviors, ranging from missing all of the Krivan and Wasserman set to covering six of the eight. This indicates that the forkhead PWM’s are not exactly equivalent despite their similarity. The rules {ARP1,FOXO3,GR} and {COUP,FOXO3,GR} also hit none of the Krivan and Wasserman set even though they have an AUC of 0.71 and 0.73 in the test set and 0.73 and 0.74 in the full set, which suggests they occur widely in the training set.
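The arithmetic behind the figures quoted in this section can be checked with a few lines of Python. The helper names are hypothetical; the inputs are the $r_{TP}$ and $r_{FP}$ values and the 1200bp search length given in the text, and the factor of six is taken here to be the 3! orderings of three distinct features.

```python
from math import factorial

def fold_ratio(r_tp, r_fp):
    """Ratio of true positive rate to false positive rate."""
    return r_tp / r_fp

def bp_per_false_positive(r_fp, search_bp=1200):
    """Expected number of searched bases per false positive hit."""
    return search_bp / r_fp

def orderings(n_features):
    """Number of orderings of n distinct features: a list rule fixes one of them."""
    return factorial(n_features)

optimal = fold_ratio(0.62, 0.16)             # about 3.9
stringent = fold_ratio(0.40, 0.042)          # about 9.5
bp_per_fp = bp_per_false_positive(0.042)     # about 28,600 bp, i.e. roughly 1/28000
list_reduction = orderings(3)                # 6
```

The 1/28000 bp rate therefore corresponds to the more stringent setting; at the optimal setting of $r_{FP} = 0.16$ the same calculation gives one false positive per 7500 bp searched.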

7.3.9 Liver Selectivity

We have identified rules that are more likely to match liver-specific promoters as compared to random sequence. By using a synthetic sequence control, we have attempted to identify all TF arrangements that are active in liver-specific genes rather than identify those that differentiate liver from other tissues. Consequently we expect that some of the rules identified in the liver promoters may be used by other tissues as they may reflect control by metabolic or signalling processes that are not specific to liver or because they include TFBS’s for factors with paralogs active in other tissues. We now determine whether these rules are more likely to match liver-specific promoters than promoters of genes that are specific to other tissues.

[Figure 7.5 plot: TP (0 to 100) versus FP (0 to 500); curves: {EVI1,FOX,HNF3}, 2-sets, solos, fit.]

Figure 7.5: The ROC curves for {EVI1,FOX,HNF3} and its predecessors. Also shown is the linear estimation of the true TP count for the optimal parameters.

To measure tissue-selectivity, we parsed all of the mouse CpG– promoters with the consensus leaf rules using the optimal parameters as determined by the full data set. We then ranked the genes by Qg|t for tissues in the set of 34 representative tissues used in Chapter 5 that contained only three nervous system tissues. We determined the probability that a promoter with a hit to rule r is specific to tissue t, p(Qg|t < q|r), and the probability that a gene without a hit to rule r is specific to tissue t, p(Qg|t < q|¬r). To measure the tissue-selectivity we use

$$S_{r,t,q} = \frac{p(Q_{g|t} < q \mid r)}{p(Q_{g|t} < q \mid \neg r)}. \qquad (7.2)$$

For example, if a rule hits 500 genes of which 50 are specific to tissue t and misses 400 genes of which 10 are specific to tissue t, then the tissue-selectivity ratio is (50/500)/(10/400) = 4.0.

Figure 7.6 shows an example of the result for two rules and tissues. Figure 7.7 contains scatter plots of Sr,t,7.0 for eight representative tissues versus Sr,liver,7.0 for all consensus rules. Nearly all of the rules are at least slightly selective for liver-specific genes, as the rule selectivities are rarely below 1. In most non-liver tissues the majority of the rules have selectivities below 1, indicating that they select against genes from the tissue. The most selective rule is {COUP,HNF1} at S = 3.45. Higher S ratios can be had by considering more stringent, i.e., smaller, q thresholds. For example, S = 4.0 for {COUP,HNF1} in liver. The best rules that do not involve HNF1 involve combinations of forkhead, COUP, and GR sites.

A few of the rules show high selectivity for tissues other than liver. The most dramatic cases are kidney and small intestine, whose selectivities correlate strongly with the liver selectivity. For example, {HNF1,HNF3} is selective of liver, kidney, and to a lesser extent small intestine. This is not surprising as the hepatic nuclear factors are also active in those tissues. The most selective rules for tissues other than liver usually involve GR, a forkhead, or a COUP/ARP1 site. GR is known to be active in many tissues. The forkhead family encompasses a large number of factors which are active in lung [233], muscle [114] and the immune system [46] among other tissues. Similarly, the COUP factor, which shares a binding site with HNF4, is active in a variety of tissues [160] including brain [226].
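The selectivity ratio of Equation 7.2 can be illustrated with a minimal sketch. The function names and the list-based input format are assumptions made for this illustration and are not part of the system described here.

```python
def selectivity_from_counts(hit_specific, hit_total, miss_specific, miss_total):
    """S = p(specific | hit) / p(specific | no hit), computed from gene counts."""
    if hit_total <= 0 or miss_total <= 0:
        raise ValueError("both gene groups must be non-empty")
    p_hit = hit_specific / hit_total
    p_miss = miss_specific / miss_total
    # An undefined ratio is reported as infinity (an arbitrary choice for this sketch).
    return float("inf") if p_miss == 0 else p_hit / p_miss


def selectivity(q_values, has_hit, q_threshold):
    """Apply Equation 7.2 to per-gene Q values and per-gene rule-hit flags.

    q_values[i] is Q for gene i in the tissue of interest; has_hit[i] is True
    when the rule matches the promoter of gene i (hypothetical input format).
    """
    hit_q = [q for q, h in zip(q_values, has_hit) if h]
    miss_q = [q for q, h in zip(q_values, has_hit) if not h]
    return selectivity_from_counts(
        sum(q < q_threshold for q in hit_q), len(hit_q),
        sum(q < q_threshold for q in miss_q), len(miss_q),
    )


# The worked example from the text: 50 of 500 hit genes and 10 of 400 missed
# genes are tissue specific.
example_ratio = selectivity_from_counts(50, 500, 10, 400)
```

A ratio above 1 indicates that genes matching the rule are more often tissue specific than genes that do not match it.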

7.3.10 Selectivity for Liver-Specific CpG+ Genes

We can also evaluate the selectivity of these rules in liver-specific CpG+ genes. The results are shown in Figure 7.8 for 33 genes with Qg|liver ≤ 7. The four most selective rules for liver CpG– genes, which all involve HNF1, do not select for liver-specific CpG+ genes. Indeed many of the CpG– rules do not select well for CpG+ genes. The top five most selective rules for CpG+ genes are {ARP1,FOXO3,GR}, , [XFD2,NF1], {IRF-2,ARP1,GR}, and {FOX,GR,HNF3}, which are the five right-most points in Figure 7.8. The distribution of points in the plot differs both from plots like liver CpG– versus cerebellum CpG– genes and from plots like liver CpG– versus kidney or small intestine CpG– genes; see Figure 7.7. Examination of AUC’s for solo features in liver-specific CpG+ genes supports these results. A number of forkhead PWM’s are more highly ranked than HNF1. C/EBP PWM’s are also highly placed in this list. This suggests that the CpG+ genes are regulated in a somewhat different manner than CpG– genes.

Figure 7.6: The distribution of Q for genes with (solid) and without (dashed) matches to the indicated rules. (a), (b), and (c) use Qg|liver which is enriched in the hits. (d), (e), and (f) use Qg|cerebellum which is slightly depleted in the hits. [Plots not reproduced. Panels: (a) {HNF1,HNF4}, (b) {HNF1,HNF3}, (c) {HNF3,HNF4}, (d) {HNF1,HNF4}, (e) {HNF1,HNF3}, (f) {HNF3,HNF4}. Axes: Q [bits] (x), Cumulative Prob (y).]

Figure 7.7: The plots show the tissue-specific selectivity ratio for all of the consensus rules. In all frames the Y-coordinate is the liver selectivity ratio. The X-coordinate is the selectivity for the indicated tissues. Kidney (b) and small intestine (e) show a diagonal trend indicating that the rules are identifying genes specific to these tissues as well. In cerebellum (a) and muscle (d) and to a lesser extent testis (f), thymus (g), and thyroid (h) the bulk of the points have x ≤ 1, indicating neutrality or selection against these tissues. [Plots not reproduced. Panels: (a) cerebellum, (b) kidney, (c) lung, (d) muscle, (e) small intestine, (f) testis, (g) thymus, (h) thyroid.]

Synopsis                 Cereb.  Kidney  Liver   Lung    Muscle  SmInt   Testis  Thym.   Thyr.
{COUP_TF,HNF1}           0.483   1.862   3.450   0.915   0.695   2.551   1.180   0.937   0.869
{GR,HNF1}                0.812   2.582   3.400   1.881   0.300   1.907   0.981   0.734   0.516
{HNF1,NF1}               0.497   3.039   3.029   1.694   0.074   2.335   1.073   1.270   0.497
{HNF1,HNF3}              0.701   2.750   3.006   1.446   0.360   2.036   1.157   0.731   0.701
{EVI1,GR,HNF3}           0.513   1.175   2.653   1.158   1.116   1.711   1.375   1.589   0.717
{COUP_TF,HNF3}           0.375   1.216   2.635   0.591   1.234   2.172   0.732   0.811   0.921
[XFD2,NF1]               0.531   2.095   2.623   1.225   0.838   2.322   1.000   0.884   0.780
{FOX,HNF1}               0.761   2.448   2.605   1.379   0.262   2.210   0.972   0.670   0.469
{FREAC7,HNF1}            0.449   2.016   2.596   1.094   0.180   1.785   1.136   0.684   0.726
{ARP1,FOXO3,GR}          0.581   1.339   2.536   1.040   2.048   1.189   1.362   1.479   1.427
{IRF_2,GR,NKX61}         0.877   1.801   2.413   1.077   0.319   1.282   1.161   0.877   0.737
{HNF1,OCT1}              0.392   1.879   2.348   1.308   0.242   1.805   1.459   0.749   0.509
{EVI1,NF1}               0.706   1.512   2.344   1.441   0.137   1.572   0.322   1.647   0.706
{COUP_TF,CACBINDPROT}    0.506   2.153   2.332   0.933   2.692   2.652   1.723   0.517   1.464
{GR,STAF}                0.695   1.358   2.327   0.722   0.876   1.085   0.626   0.626   0.695

Table 7.5: The top 15 most liver-selective rules with selectivities for other tissues. There are frequently good selectivities in related tissues kidney and small intestine, but poor performance in brain, testis, thymus, and thyroid, e.g., {HNF1,HNF3}.

Figure 7.8: Comparison of selectivity in liver-specific CpG– and CpG+ genes. The HNF1-based rules do not select well for liver-specific CpG+ genes. The best rules for liver-specific CpG+ genes are based on forkhead (HNF3) or HNF4 (ARP1) sites. [Plot not reproduced. Axes: Liver CpG+ Selectivity Ratio (x), Liver CpG− Selectivity Ratio (y).]

7.4 Methods

7.4.1 Promoter Sequences

Human and mouse promoter sequences spanning the (-1000,200) region around the transcription start site (TSS) were obtained from DBTSS, which contains TSS’s based on the alignment of full-length enriched clones to the human and mouse genomes. These sequences were repeat masked using RepeatMasker; repeats were replaced with ‘N’s. We used the program newcpgreport in the EMBOSS package [173] with the default parameters to identify CpG islands.

7.4.2 Building and Sampling from Markov Models

We selected the 1587 mouse promoter sequences that did not contain a CpG island and tabulated the frequency of all oligos of length 6 and shorter, including those that end in ‘N’ but not including those with ‘N’s in any other position. The lengths of tracts of ‘N’s from repeat masking were also tabulated. Sequences were sampled from these models as follows. The sequence was started by sampling from the distribution P(b). The second position was drawn from the distribution P(b|x), where x is the previously sampled base. In general, positions were drawn from P(b|x1 x2 x3).
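The sampling scheme can be sketched as follows. This is a simplified illustration under stated assumptions: it conditions on up to three preceding bases as in P(b|x1 x2 x3), it skips ‘N’ positions entirely rather than modelling masked tracts, and it backs off to a shorter context when a context was never observed. The function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_model(sequences, order=3):
    """Count next-base frequencies for every context of length 0..order.

    Simplification for this sketch: positions holding 'N' and contexts that
    contain 'N' are ignored, so masked tracts are not modelled.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, base in enumerate(seq):
            if base == "N":
                continue
            for k in range(order + 1):
                if i - k < 0:
                    break
                context = seq[i - k:i]
                if "N" in context:
                    break
                counts[context][base] += 1
    return counts


def sample_sequence(counts, length, order=3, rng=random):
    """Sample a sequence, using as much left context as is available."""
    if "" not in counts:
        raise ValueError("model has no observations")
    out = []
    for _ in range(length):
        k = min(order, len(out))
        context = "".join(out[-k:]) if k else ""
        # Back off to shorter contexts when this one was never observed
        # (an assumption of this sketch, not a documented step).
        while context not in counts:
            context = context[1:]
        bases, weights = zip(*sorted(counts[context].items()))
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)
```

The first base is drawn from P(b), the second from P(b|x), and so on, until the full context length is reached.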

7.4.3 Scoring Positional Weight Matrices

We used all vertebrate PWM’s in TRANSFAC v8.4 (N = 546) and JASPAR (2005/01/01) (N = 81) for a total of 627 PWM’s. Given a matrix with observed frequencies wb,i, we converted frequencies to probabilities using

$$p_{b,i} = \frac{w_{b,i} + 0.25}{\sum_{b' \in B} w_{b',i} + 1} \qquad (7.3)$$

where B = {A, C, G, T}. We scored sites using a log-likelihood ratio score

$$L_A(S) = \sum_{1 \le i \le W} \left[ \lg(p_{s_i,i}) - \lg(0.25) \right] \qquad (7.4)$$

where W is the length of the PWM and si is the i-th base in the W-mer S that is being scored. Because the binding sites used to define each PWM are available for some but not all of the PWM’s, we adopted a global minimum threshold score of LA ≥ τA = 5 for all PWM’s. Examination of the PWM’s for which training data were available suggests that this threshold yields at least 90% sensitivity for most sites. When a factor had too many hits per sequence on average, we increased this threshold to yield an average of 6 hits per sequence.
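Equations 7.3 and 7.4 can be made concrete with a short sketch. The data layout (one dictionary of base counts per matrix column) and the function names are assumptions made for illustration; the default threshold mirrors the global minimum of 5 described above.

```python
import math

BASES = "ACGT"


def counts_to_probs(columns):
    """Equation 7.3: add a pseudocount of 0.25 per base (1 per column)."""
    probs = []
    for col in columns:
        total = sum(col.get(b, 0) for b in BASES) + 1.0
        probs.append({b: (col.get(b, 0) + 0.25) / total for b in BASES})
    return probs


def llr_score(probs, site):
    """Equation 7.4: sum over positions of lg(p) - lg(0.25)."""
    return sum(math.log2(probs[i][b]) - math.log2(0.25) for i, b in enumerate(site))


def scan(probs, sequence, threshold=5.0):
    """Return (start, score) for every window scoring at or above the threshold.

    Windows containing a masked or ambiguous base are skipped, which is an
    assumption of this sketch.
    """
    width = len(probs)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        if any(b not in BASES for b in site):
            continue
        score = llr_score(probs, site)
        if score >= threshold:
            hits.append((start, score))
    return hits


# A toy three-column matrix (hypothetical counts) that strongly prefers "ACG".
toy_probs = counts_to_probs([{"A": 10}, {"C": 10}, {"G": 10}])
toy_hits = scan(toy_probs, "TTACGTT")
```

Raising the threshold argument for a single PWM corresponds to the per-factor adjustment used when a factor produced too many hits per sequence.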

7.4.4 Definition of PWM Families

Families of PWM’s were initially created using a greedy algorithm that considered the consensus sequence of each PWM. Subsequently, manual curation was used to refine and maintain these clusters.

7.4.5 Definition and Evaluation of ROC Graphs

ROC graphs were generated by varying all free parameters θk in the candidate rule across all rounded observed values. Score parameters were rounded to 0.1 and sizes were rounded to 5bp to yield an approximate parameter vector θ̂. A budget of 1,000,000 parameter points was allocated. If the number of different values of ω̂ was less than the budgeted amount then all values were enumerated and the performance was evaluated at each point. Otherwise a stochastic descent method was applied using 10,000 runs of 100 steps with a neighborhood of ±10. The objective function to be minimized was f(θ) = rFP(θ) − rTP(θ). For each NFP, the parameter values that yield the maximum NTP were recorded. This curve was then made monotonic by replacing each NTP with the maximum of the NTP and all NTP for smaller NFP. The area under the curve (AUC) was computed by simple rectangular rule integration after converting NFP and NTP to rFP and rTP.
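The last three steps (keeping the best NTP for each NFP, making the curve monotonic, and integrating) can be sketched as below. The function names are hypothetical, and the choice of left-hand rectangle heights is an assumption, since the text specifies only a simple rectangular rule.

```python
def roc_envelope(points):
    """Build a monotone ROC curve from (n_fp, n_tp) pairs.

    Each pair is the outcome of one evaluated parameter setting. For every
    n_fp the largest n_tp is kept, then each n_tp is replaced by the maximum
    over all smaller-or-equal n_fp.
    """
    best = {}
    for n_fp, n_tp in points:
        if n_tp > best.get(n_fp, -1):
            best[n_fp] = n_tp
    curve = []
    running_max = 0
    for n_fp in sorted(best):
        running_max = max(running_max, best[n_fp])
        curve.append((n_fp, running_max))
    return curve


def auc_rectangular(curve, n_negatives, n_positives):
    """Integrate rTP over rFP as a step function (left rectangle heights)."""
    area = 0.0
    prev_fp, prev_tp = 0.0, 0.0
    for n_fp, n_tp in curve:
        r_fp, r_tp = n_fp / n_negatives, n_tp / n_positives
        area += (r_fp - prev_fp) * prev_tp
        prev_fp, prev_tp = r_fp, r_tp
    # Extend the last level to rFP = 1.
    area += (1.0 - prev_fp) * prev_tp
    return area


# Hypothetical counts for illustration: 4 positives and 4 negatives.
toy_points = [(0, 2), (1, 1), (1, 3), (2, 3), (3, 2), (4, 4)]
toy_curve = roc_envelope(toy_points)
toy_auc = auc_rectangular(toy_curve, n_negatives=4, n_positives=4)
```

In the toy data the point (3, 2) is raised to (3, 3) by the monotonicity step, because a setting with fewer false positives already achieved three true positives.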

7.4.6 Exploring Combination Rules

Our goal is to be conservative in selecting the features that appear in combined rules. We explore rules using the dynamic strategy defined in Chapter 4. We required that the MABC statistic be at least 0.01.

7.5 Discussion

In this chapter we have automatically learned combinations and arrangements of TFBS’s that are over-represented in, and select for, liver-specific promoters. We have validated earlier work by identifying the {HNF1,HNF4} rule from a larger data set, and have extended this work by identifying other rules involving the forkhead factors and glucocorticoid receptor that also select for liver-specific genes. Our automatically discovered rules are consistent with biological knowledge, suggesting that our method has correctly identified new rules and can be applied to other systems.

7.5.1 Biology

We note with satisfaction that the less biologically plausible factors, T-antigen, STAF, and RREB, appeared in few two-feature rules and in none of the consensus three-feature rules. This suggests that our method has some robustness with respect to the inclusion of spurious features. If true, this is probably due at least in part to the requirement for proximity of sites.

HNF1 shows no evidence of working in close proximity with multiple factors. It is interesting to compare this with the curve plotted in Figure 7.2, which shows the strength of the enrichment of HNF1 sites in liver-specific genes, especially as compared to HNF4. One possible hypothesis is that HNF1 is a global enabler that does not require other factors to play its primary role. It might then play a role in chromatin remodelling. In fact, two papers report such a role for HNF1α. Pontoglio and coworkers [166] identify DNase hypersensitive sites that require a functional HNF1α. Parrizas and coworkers [156] find it regulates histone hyperacetylation (differentially) in liver and pancreas.

We did not include HNF6 in the combination rules as it did not reach the top 20 PWM’s. We anticipate including HNF6 and other liver-specific factors in future work to determine their contributions. In particular we are interested in comparing HNF6 to HNF4 to see how their liver-specificity relates to the data in Figure 7.2.

The largest collections that we found in the seven-fold cross-validation contained three features. The largest collections considered contained four features, but they did not show improvement over their predecessors. In the full run, where there may be some overlearning, a few collections of four features were selected. This suggests that in most cases, CRM’s of this size contain relatively few TF’s. Our set of TF’s was in some sense limited since, by selecting the top 20 TF families, we did not include some known liver TF’s such as HNF6, VBP, PBX1, and C/EBP. Future work will consider these factors in addition to the TF’s that appeared in the selected three-feature rules.

List rules with NF1 suggest a requirement for the CCAAT box. A similar case holds for the TATA box, which is clearly enriched in tissue-specific genes but is not identified as such in an evaluation that only varies the score threshold. The 300bp limit in combination with nonhierarchical rules does not allow us to identify long-range constraints between upstream CRM’s and the core promoter. Such grammars are easy to formulate and could be learned with a small extension to our system.

The application of rules derived from liver-specific CpG– genes to liver-specific CpG+ genes suggests that these genes are regulated in a somewhat different manner than CpG– genes.
These results are preliminary as they are based on just 33 CpG+ genes. It is also possible that the differences derive from the fact that the CpG+ genes are less specific on average than the CpG– genes rather than the CpG status. If this hypothesis were true, then we would expect CpG– and CpG+ genes in the same specificity range, e.g., 7 ≤ Qg|liver ≤ 8 to be regulated in the same or similar ways. We plan on performing experiments to answer this question in the future.

7.5.2 Clustering TFBS Models

We have left the forkhead TF’s in three clusters and the HNF4-like sites, COUP and ARP1, in two separate groups. While this has introduced some redundancy, it has also allowed us to see how much the exact PWM chosen to represent a TF can affect the results. The forkhead results were largely consistent at the level of two-sets, i.e., all of the forkheads paired with HNF1. However, at the level of rule equivalence, the set of genes hit depended on the mix of forkhead PWM’s. The two HNF4-like PWM’s also showed different behaviors; ARP1 did not pair with HNF1 while COUP did. These two PWM’s are significantly different in that the ARP1 PWM has two extra unconserved bases inserted relative to the COUP matrix. Thus, one might expect differences in behavior in this case.

7.5.3 Augmenting the Learning Algorithm

Our algorithm for exploring the set of possible collections focusses on sets, as it considers them first before proceeding to bag or list successors. This, and the requirement that all predecessors show improved selectivity, does put list productions at a disadvantage. Consider a true long list arrangement. The two-set consisting of the 5’- and 3’-most TF’s will probably not show much improvement due to the long gap between them. This pair might therefore be eliminated, preventing the set predecessor of the list from being found. We can suggest a list-centric learning algorithm as follows. First, consider all pairs of factors. Once these are evaluated, begin to join promising lists that have the same feature at their opposite ends, e.g., join [A, C] and [C, B] to form [A, C, B]. The promising lists could be extended again by drawing from the already evaluated two-lists. We will consider this in future work. However, there is some evidence that this approach might not be productive. In the cross-validation runs only a few two-lists were selected, and at least half of these appeared to be based on absolute positional biases rather than relative positions. This suggests that we may not have missed a significant number of list rules after all.

We tried to validate these rules by applying the same algorithm to human liver-specific genes. While we did identify HNF1 and HNF4 among the enriched TF’s, we could not replicate the mouse results. This may be due to several causes. First, we note that the overlap in homologs between the top 100 liver-specific genes in mouse and human is only about 29, i.e., quite low. Second, the amount of repeat-masked sequence is about twice as large in human (25%) as in mouse (10%). While in mouse a Markov model accurately replicates the repeat-masked sequence statistics, it did not do so in human.
The distribution of masked block lengths is not the exponential curve that a Markov model produces, and the Markov model sequences tended to have less masked sequence than the real sequences. This would tend to produce the kinds of defects we saw in the human ROC curves. To better model masked sequence, we tried a generalized Markov model that sampled the length of masked blocks from the observed length distribution. This made a small difference in the ROC curves, but did not alter the results fundamentally. We will investigate this matter further as part of our future work.

The MABC statistic is a fairly conservative statistic for determining improvement. For example, consider a ROC curve that has a steep initial slope, i.e., is very discriminating. It will be very difficult for another curve to gain significant MABC as there is very little room between the initial ROC curve and the Y-axis. This has been noted before, and some researchers consider the log of the rFP as this yields a larger emphasis on the stringent end of the curve. Other statistics could be used as well. One might fix a target rTP rate, say 0.5, and rank rules by the rFP at the parameter values that yield that rTP. Such a method could be easily incorporated into our machine learning algorithm. However, it is best applied if one has some assurance that the target rTP is less than the actual true positive coverage. We have said that the MABC statistic is conservative; however, we note the large drop in selected three-feature rules between the full data set and the seven-fold cross-validation.
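The alternative statistic mentioned above, ranking rules by the rFP reached at a fixed target rTP, might look like the following sketch. The function names, the curve representation as (rFP, rTP) pairs, and the rule names are hypothetical. Rules whose curves never reach the target are dropped, which reflects the caveat that the target should lie below the true positive coverage.

```python
def rfp_at_target_rtp(curve, target_rtp):
    """Smallest rFP among curve points whose rTP reaches the target, else None."""
    candidates = [r_fp for r_fp, r_tp in curve if r_tp >= target_rtp]
    return min(candidates) if candidates else None


def rank_rules(curves, target_rtp=0.5):
    """Order rule names by increasing rFP at the target rTP.

    curves maps a rule name to a list of (r_fp, r_tp) points.
    """
    scored = []
    for name, curve in curves.items():
        r_fp = rfp_at_target_rtp(curve, target_rtp)
        if r_fp is not None:
            scored.append((r_fp, name))
    return [name for _, name in sorted(scored)]


# Hypothetical curves for three made-up rules.
toy_curves = {
    "rule_a": [(0.0, 0.2), (0.1, 0.6), (0.5, 1.0)],
    "rule_b": [(0.0, 0.5), (0.3, 0.9)],
    "rule_c": [(0.2, 0.4)],
}
toy_ranking = rank_rules(toy_curves, target_rtp=0.5)
```

Here rule_c never reaches an rTP of 0.5 and is therefore excluded from the ranking.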


Chapter 8

Identifying Companion Factors of ChIP-Chip Target Transcription Factors

In this chapter we describe the application of bounded collection grammars (BCG’s) to the results of chromatin-immunoprecipitation (ChIP-chip) experiments. The ChIP-chip experiments were performed by members of Dr. Kaestner’s laboratory, who kindly provided me with sets of genes to analyze and biological guidance for the search strategy. The material in this chapter is described in [72] and [130].

8.1 Introduction

8.1.1 ChIP-chip Experiments

Conventional mRNA expression experiments can yield a list of genes that are differentially expressed between a condition of interest and a control condition. When the condition of interest is a mutant with a transcription factor knocked out or over-expressed, the differentially expressed genes are candidates for regulatory targets of the TF of interest. However, the differentially expressed genes could be indirect targets, i.e., genes that are not bound by the TF but are instead regulated by the direct target genes to which the TF actually binds. It is important to distinguish these two classes of genes to fully understand the details of the mechanism of action of the TF of interest. In addition, since the regulatory regions for a gene may be widely scattered, it is important to identify the region where the TF binds, e.g., proximal promoter, intron, or distant enhancer, to identify which other factors may act with the target TF to regulate the target genes.

The purpose of a ChIP-chip experiment is to identify the regions of genomic DNA that are bound by a particular TF of interest. The ‘ChIP’ in ChIP-chip is a chromatin immunoprecipitation. Cells are extracted from a biological sample of interest. Formaldehyde is used as a cross-linking agent to fix all TF’s to the DNA where they are already bound. The genomic DNA is then extracted and sheared to produce short fragments, approximately 1KB in length, with the TF’s still attached. An antibody for the TF of interest is used to select the genomic fragments that are bound to the TF to produce a pool of fragments of DNA that are enriched for functional TFBS’s. All TF’s are then released from the DNA. The DNA may be amplified if necessary to produce the required amount.

The ‘chip’ in ChIP-chip is similar to a conventional microarray experiment using the genomic DNA instead of cDNA made from mRNA. The microarray chip contains amplicons of genomic DNA from the promoter or other regions of genes, or perhaps a tiling of the whole genome. Subsequent processing is similar to what is done in mRNA expression level microarray experiments. The end result is a set of genomic regions that may show enhanced or reduced binding of the TF relative to the control condition. Changes in binding can be confirmed by RT-PCR on individual genes. See [137, 132, 151] for applications of this technology.

In the Kaestner lab ChIP-chip experiments are combined with conventional microarray experiments on mRNA extracted from the same samples to identify genes that are differentially expressed in the condition of interest. The intersection of the genes showing increased TF binding with genes showing differential expression can be considered to be the direct targets of the TF in the experimental conditions.
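The definition of direct targets as an intersection can be written down directly. The function name and the placeholder gene identifiers below are hypothetical and serve only to illustrate the set operation.

```python
def direct_targets(bound_genes, differentially_expressed_genes):
    """Genes that both show increased TF binding and are differentially expressed."""
    return sorted(set(bound_genes) & set(differentially_expressed_genes))


# Placeholder identifiers, not real experimental results.
bound = ["geneA", "geneB", "geneC"]
expressed = ["geneB", "geneC", "geneD"]
targets = direct_targets(bound, expressed)
```

Genes that are differentially expressed but not bound (geneD in this toy input) remain candidates for indirect regulation.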

8.1.2 The Goals of Sequence Analysis

We have several goals for a sequence analysis of the results of a ChIP-chip experiment. The first goal is the identification of the likely binding locations for the target TF. The genomic locations identified by the ChIP-chip experiment usually have a resolution of about 1KB; it is desirable to locate the exact site(s) to which the TF binds. This is in some sense a validation of the experimental results, both computationally and biologically, by identifying sequence for designing primers for RT-PCR verification. It is also a chance to see if the functional sites are similar to or different from known models of the TFBS. The second goal is the identification of other factors that cooperate with the target TF. As discussed in Chapter 4, it is certain that there are many binding sites in other genes for the target TF. What is different about the neighborhood of the sites that are functional in the experimental condition? Our analysis will focus on identifying (the arrangements of) any other known TF’s that may be present with the target TF in the direct target genes.

8.1.3 Microarray Chips

The Kaestner lab has produced a promoter chip containing promoter regions for approximately 3500 mouse genes. This chip has been produced in several versions and is currently undergoing an expansion to more genes. Two of the early versions were used in the work described here. Both chips contain a set of 1KB genomic amplicons or tiles that are located just upstream of RefSeq mRNA transcripts aligned to the mouse genome. The second version of the chip contains an additional set of 2KB tiles that are located just upstream of the 1KB tiles. They, in conjunction with CBIL, have also produced an mRNA microarray chip [113] spotted with a set of over 13,000 mouse cDNA clones that is enriched for genes expressed in pancreatic development and/or related to glucose homeostasis. These two chips were used to produce the data analyzed in this chapter.

8.2 C/EBP-beta Targets in Regenerating Liver

This work is described in [72], from which the first two paragraphs of the Background material are drawn with little modification.

8.2.1 Background

CCAAT enhancer-binding proteins (C/EBPs) constitute a family of basic-leucine zipper (bZIP) transcription factors that are critical for the regulation of numerous biological processes, including differentiation, metabolic homeostasis, proliferation, tumorigenesis, inflammation, and apoptosis. C/EBP proteins are regulated at multiple levels, including gene transcription, translation, and phosphorylation, in response to a variety of stimuli including hormonal, cytokine and growth factor signaling pathways. C/EBP proteins are able to form hetero- and homo-dimer complexes with other C/EBP family members, thereby creating additional diversity in target sequence recognition.

A variety of approaches have been used to identify C/EBP-binding sites, including cell culture systems, C/EBPβ −/− mice, and analyses of promoter sequences. However, several obstacles have limited the identification of direct C/EBP-dependent transcriptional targets in vivo. First, all C/EBP family members with the exception of C/EBPζ possess identical in vitro DNA-binding affinity for C/EBP consensus sequences, suggesting that other C/EBP family members may be able to compensate for the loss of C/EBPβ. Second, the application of computational sequence analysis to identify C/EBP promoter sequences has been impeded by the fact that significant variations from the optimal C/EBP-binding sequence are tolerated, limiting the discriminative power of the C/EBP consensus sequence. Furthermore, the ability of C/EBP to heterodimerize with other basic-leucine zipper (bZIP) and non-bZIP transcription factors is associated with alterations in transactivation and DNA-binding specificity that may not be predicted based on consensus C/EBP-binding sequences.

The liver is an organ that can regenerate itself when damaged. This regenerative response is diminished in C/EBPβ −/− mice; hence, characterizing the direct targets of C/EBPβ in the regenerating liver should help to provide a more complete understanding of this mechanism. From the point of view of this dissertation, we are interested in answering both questions posed in the introduction.

8.2.2 Results

Characterization of the Direct Target Genes

The ChIP-chip experiment yielded a list of 15 direct targets for C/EBPβ, which are listed in Table 8.1. For comparison with the work in Chapter 7 we include the value Qg|liver for the direct targets for which we can identify corresponding probe sets in the GNF1 data [216]. The direct targets are significantly (p-value ≈ 2 × 10^−7) enriched for mildly liver-specific genes with Qg|liver ≤ 8.25. In addition, two other genes, Car3 and Saa1, showed liver-specific expression in the second [217] GNF data set for mouse though they were absent or non-specific in the GNF1 data.

Computational Verification of C/EBP-beta Binding

All eight of the PWMs for C/EBPβ family members contained in TRANSFAC v7.3 were considered to determine if binding sites for C/EBPβ were enriched in the direct targets as compared to 138 non-binders. Figure 8.1 shows a summary of the results from this and subsequent steps. The best C/EBPβ PWM was V$CEBP_02 (AUC=0.703), which achieved statistically significant enrichment at several scoring thresholds with a p-value better than 0.05 after a Bonferroni correction of eight. Thus we conclude that there is computational evidence for enriched C/EBPβ binding in the direct targets and that the sites most closely match the PWM V$CEBP_02.

Identifying the Companion Factors of C/EBP-beta

We considered all 674 PWMs in TRANSFAC and identified the top three, as ranked by AUC, whose factors were also expressed in liver according to the mouse gene expression database GXD [101]. These were D-site-binding protein (DBP, V$DBP_Q6) AUC=0.787, interferon-stimulated response element (ISRE, V$ISRE_01) AUC=0.712, and peroxisome proliferator-activated receptor (PPAR, V$PPAR_DR1_Q2) AUC=0.665. Although not appreciated at the time of publication of [72], the PPAR site is very similar to the COUP and HNF4 sites. We then considered the 2-set, 3-set, and 4-set rules that include the best C/EBPβ PWM and one or more of the top three PWMs. The sites were found to co-occur as listed in Table 8.2 and plotted in Figure 8.1. All combinations were found to be significantly enriched in the direct targets as assessed by two different methods. The first method compared the frequency of occurrence using the optimal parameters in the direct targets with that in the whole promoter chip, as shown in Table 8.2. The second method used their rate of occurrence in 138 non-C/EBPβ-binding promoter tiles. This second method used a very stringent Bonferroni correction which ignored the requirement for liver expression of cooperating TF’s.

Gene Symbol  RefSeq Id   Description
Aldh1a1      NM_013467   aldehyde dehydrogenase family 1, subfamily A1
Bcap37       NM_007531   B-cell receptor-associated protein 37
Car3         NM_007606   carbonic anhydrase 3
Dyn2         NM_007871   dynamin 2
Es1          NM_007954   esterase 1
Facl2        NM_007981   fatty acid Coenzyme A ligase, long chain 2
Fkbp11                   FK506 binding protein 11
Foxa3        NM_008260   forkhead box A3
Grb2         NM_008163   growth factor receptor bound protein 2
Gstt1        NM_008185   glutathione S-transferase, theta 1
Krt1-18      NM_010664   keratin complex 1, acidic, gene 18
RhoB         NM_007483   ras homolog gene family, member B (Arhb)
Rnf19        NM_013923   ring finger protein (C3HC4 type) 19
S100a10      NM_009112   S100 calcium binding protein A10
Saa1         NM_009117   serum amyloid A 1

Qg|liver (mouse) column, values in source order with row alignment lost: 7.62; 10.79, 2.66, 8.24; 7.70, 10.92, 6.66, 10.27, 12.88, 11.03, 11.72, 10.03.

Table 8.1: Direct targets of C/EBPβ including Qg|liver in 42 tissues. The uniform landmark value is 2 lg(42) = 10.78. Two other genes, Car3 and Saa1, showed liver-specific expression in the second GNF data set [217].

8.2.3 Discussion

We have identified three additional companion TF’s that are part of the context that may help define the direct targets of C/EBPβ in regenerating liver. We have not demonstrated co-occupancy or cooperativity between these factors, so the nature of their interaction remains unknown. We found a rather large size bound (∼700 bp) on the rules we considered, so there is little evidence for direct interaction at this time. Therefore, additional biological experiments will have to be done to determine whether there is interaction between C/EBPβ and the co-occurring TF’s. Interestingly, HNF-1 and HNF-3 did not appear to be companions of C/EBPβ. Indeed, the best HNF1 PWM had AUC=0.473, which is slightly worse than random guessing. The best member of the HNF3 cluster had AUC=0.56. The next best, but different, family member behind PPARG was COUP, also at about AUC=0.56. This is perhaps due to the control set, which is rich in gut-specific genes, or to the distribution of liver-specificity of the direct targets, which differs from the data used in Chapter 7. Additional computational experiments could be performed with different control sets to see if other liver-specific factors can be integrated into the C/EBPβ companion set.

Binding Site Combination          Size (bp)  Target Freq. (%)  Chip Freq. (%)  Corrected P-value
{C/EBPβ, DBP}                     342        80.0              16.0            4.0 × 10^−4
{C/EBPβ, ISRE}                    608        80.0              14.0            7.7 × 10^−5
{C/EBPβ, PPAR-DR1}                736        73.0              19.0            0.04
{C/EBPβ, DBP, ISRE}               790        80.0              6.4             1.9 × 10^−6
{C/EBPβ, DBP, PPAR-DR1}           963        80.0              12.0            3.5 × 10^−3
{C/EBPβ, ISRE, PPAR-DR1}          663        80.0              9.3             1.9 × 10^−4
{C/EBPβ, DBP, ISRE, PPAR-DR1}     768        80.0              4.2             2.7 × 10^−6

Table 8.2: Companions of C/EBPβ in direct targets of C/EBPβ. The table shows combinations of companion factors of C/EBPβ and the learned size bound as well as the frequency of occurrence of the combination in the direct targets and in all promoters on both chips using the optimal scoring thresholds (CEBP=8.9 (2-sets), 8.1 (3-sets and 4-set); DBP=9.2; ISRE=9.0; PPAR-DR1=9.0). The p-value reported is corrected for the number of possible rules with the same number of features.

Figure 8.1: ROC graph for C/EBPβ and combinations enriched in direct targets of C/EBPβ during liver regeneration. The scalloped curves are the iso-significance lines (ISL) with Bonferroni corrections of 8, 8T, 8c(T, 2) and 8c(T, 3), where T is the number (574) of vertebrate PWMs in TRANSFAC V7.3 and c(T, n) is the number of n-sets that can be made from T features. A point on a ROC graph is significant if it is above the appropriate ISL. The plot shows the ROC graphs for the best C/EBPβ PWM in the solid dotted line, which should be compared to ISL 8. The dashed lines correspond to the 3-set rules and should be compared to ISL 8c(T, 2). Finally, the plain solid line is the 4-set rule, which should be compared to ISL 8c(T, 3). [Plot not reproduced. Title: C/EBP and Combinations. Axes: FP (x), TP (y). Curve labels: CEBPB, {CEBPB,DBP,ISRE}, {CEBPB,DBP,PPAR_DR1}, {CEBPB,ISRE,PPAR_DR1}, {CEBPB,DBP,ISRE,PPAR_DR1}. ISL labels: 8, 8 T, 8 c(T,2), 8 c(T,3).]

8.2.4

Methods

Enrichment of liver-specific genes:

We could assign a Qg|liver score to 12 of 15 C/EBPβ target genes. Of these, 42% (5) had Qg|liver ≤ 8.25. Of the 7956 genes expressing above 200 AU in the GNF1 mouse data only 5% (393) had Qg|liver ≤ 8.25. A two-sided test using the two-population proportion test in the R package [221] to compare these proportions yielded a p-value of 2 × 10^-7. This value is approximate since the R function prop.test uses a χ² approximation, which may be inaccurate given the small number of C/EBPβ targets with Q values.
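The comparison of proportions above can be sketched in a few lines of Python. This is an illustration only, not the R code that was used: the function name `two_proportion_p_value` is hypothetical, and we assume that a chi-square statistic with one degree of freedom and a Yates continuity correction is an adequate stand-in for what prop.test computes on a 2 × 2 table.

```python
import math

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value for H0: both groups share one success rate.

    Uses a chi-square statistic with one degree of freedom and a Yates
    continuity correction (an assumed stand-in for R's prop.test).
    """
    a, b = x1, n1 - x1          # successes / failures in group 1
    c, d = x2, n2 - x2          # successes / failures in group 2
    n = n1 + n2
    # The correction is capped so the corrected difference is never negative.
    diff = max(abs(a * d - b * c) - n / 2.0, 0.0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

# Counts reported in the text: 5 of 12 targets versus 393 of 7956 genes.
p = two_proportion_p_value(5, 12, 393, 7956)
print(f"p = {p:.1e}")
```

With the counts reported above, this sketch gives a p-value of roughly 2 × 10^-7, in line with the value obtained from R. The same caveat applies: with only 12 targets the χ² approximation is rough.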

Enrichment of C/EBP-beta binding sites:

C/EBPβ enrichment was assessed with rules of the form

$R_i \longrightarrow C_i[\mathrm{score} \ge x_i];$    (G8.1)

where Ci is the i-th PWM for a C/EBPβ family member. The control set consisted of 138 genes that did not bind C/EBPβ and did not show differential expression in the experiment.

Enrichment of C/EBP-beta and companion sites:

Pairs, triples, and quadruples were assessed with rules of the form shown below for three features, where the size bound n has a maximum of 2000 bp but is optimized during the learning.

$R_{ijk} \xrightarrow{\{n\}} C_i[\mathrm{score} \ge x_i],\ F_j[\mathrm{score} \ge x_j],\ F_k[\mathrm{score} \ge x_k];$    (G8.2)

We computed statistical significance in two ways. First, as reported in Table 8.2, we selected the optimal scoring parameters for each rule and counted the number of matching tiles among the direct targets (M = 15) and among all genes on both the promoter chip and the mRNA chip (N = 2122). Since the direct targets are a subset of the second set, we used a hypergeometric distribution to compute the p-value of finding m matches among M examples drawn from a set of N with n successes. These p-values were corrected using a Bonferroni correction for the number of possible rules of a given size. Second, as shown in Figure 8.1, we computed the p-value for enrichment in the target set versus the 138 non-binding, non-differentially expressed genes. Since these sets are distinct we used a two-population proportion test. The figure shows iso-significance lines for α = 0.05 at various levels of multiple-testing correction that over-estimate the number of independent set rules of each size. The Bonferroni correction factors are very conservative as they use the number of vertebrate PWMs (574), which is many more than the number of non-redundant PWMs for factors that are expressed in liver.
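The first of the two tests can be sketched as follows. The function names are hypothetical and the match counts in the example (m = 8 and n = 120) are invented for illustration; only M = 15, N = 2122, T = 574 and the correction factor 8c(T, 2) for 3-set rules are taken from the text and from Figure 8.1.

```python
from math import comb

def hypergeom_upper_tail(m, M, n, N):
    """P(X >= m) when M items are drawn without replacement from N items,
    n of which are successes (here: tiles matching the rule)."""
    upper = min(M, n)
    # Exact integer arithmetic; comb() returns 0 for impossible draws.
    hits = sum(comb(n, k) * comb(N - n, M - k) for k in range(m, upper + 1))
    return hits / comb(N, M)

def bonferroni(p_value, n_rules):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_rules)

# Hypothetical rule: matches 8 of the 15 direct targets and 120 of all 2122 genes.
p_raw = hypergeom_upper_tail(m=8, M=15, n=120, N=2122)
# Correction for 3-set rules: 8 C/EBP PWMs times c(T, 2) companion pairs.
p_corrected = bonferroni(p_raw, 8 * comb(574, 2))
print(p_raw, p_corrected)
```

The upper tail is used because we ask how likely it is to see at least m matching targets by chance.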

8.3

Glucocorticoid Receptor Targets in Fasted Dexamethasone-Injected Mice

This work is described in Le et al. 2005 [130]. As in the previous section, we acknowledge that work as the source of the background material and refer the interested reader there for more detailed citations.

8.3.1

Background

Glucocorticoids are essential steroid hormones which are secreted by the adrenal cortex and affect multiple organ systems. Among these effects are the ability to depress the immune system, repress inflammation, and help mobilize glucose in the fasting state. Glucocorticoids and their synthetic analogs are widely prescribed for adrenocortical insufficiency and as an immune suppressant/anti-inflammatory agent, but their systemic effects can often be debilitating. An understanding of the genes regulated by the glucocorticoid signaling pathway may lead to more targeted therapies, thereby preventing unwanted side-effects. Glucocorticoids act via a signaling pathway that involves the glucocorticoid receptor (GR), a member of the nuclear receptor superfamily of ligand-activated transcription factors [16, 241]. In the absence of glucocorticoids, GR is sequestered in the cytoplasm by a protein complex which includes HSP70 and HSP90. When glucocorticoids are present, they traverse the plasma membrane and bind to GR, allowing GR to dissociate from its chaperone proteins and translocate to the nucleus. Within the nucleus, the ligand-bound GR can bind to DNA as a monomer or as a dimer to palindromic glucocorticoid response elements (GRE’s) and modulate transcription. The mechanisms of action of the ligand-bound GR are fairly complex, including the ability to both activate and repress transcription, and to interact with other transcriptional regulators such as AP-1 and NF-κB (reviewed in McKay and Cidlowski [147]). The net effect of glucocorticoid administration on a particular target gene is likely dependent upon the other transcription factors present on the target gene’s promoter or enhancer(s). Specifically, the integration of multiple signaling pathways can occur at glucocorticoid response units (GRU’s), which consist of a combination of a GRE and other transcription factor binding sites. These include GRU’s in the promoters of the phosphoenolpyruvate carboxykinase (Pck) and carbamoylphosphate synthetase (Cps) genes [95, 210, 185]. Thus, understanding the complete nature of glucocorticoid action requires knowing not only the set of genes bound and regulated by the GR, but also the transcription factors that may interact with the GR, as well as the loci where these interactions occur. The binding site for GR has similar qualities to that of C/EBPβ. Both show much variation from the consensus sequence and hence active sites may score poorly against a particular weight matrix. Also, as before, the companion TF’s of GR play a large role in determining where GR will bind and be active.

8.3.2

Results

The ChIP-chip experiment yielded a list of 318 tiles representing 302 distinct genes that showed increased GR binding. A few of these genes were chosen at random for QPCR verification; 11 of 14 selected showed increased binding in this assay. The list also contained a number of known GR targets. The parallel mRNA expression experiment yielded a list of 498 genes that showed differential expression. Considering just the 2500 genes common to both microarray chips, there were 235 GR-binding genes and 498 differentially expressed genes. Intersecting these last two sets yielded 56 tiles with differential expression and GR binding which we will refer to as the DEB set. Of these tiles 40 are 1KB tiles and 16 are 2KB tiles. A network analysis using the Ingenuity package was performed (by Phil Le of the Kaestner lab) on the set of direct targets and GR. This generated a graph of genes that are known to interact with GR that can serve to provide evidence of potential for interaction between GR and other TF’s.

Characterization of the Direct Target Genes: We associated Qg|liver values with 44 of the GR direct targets and compared this distribution to the distribution for all genes on the promoter chip that express at least 200 AU in at least one tissue. The results are shown in Figure 8.2. The distribution of the direct targets shows a clear enrichment of liver-specific genes. Taking the same threshold that we used with C/EBPβ genes, Qg|liver ≤ 8.25, we find this enrichment is significant (P = 1.1 × 10^-5).

Transcription Factors Enriched in the DEB Set:

We evaluated the ability of PWMs for GR to distinguish the DEB set from randomly selected tiles from the promoter chip. There are three PWMs in TRANSFAC 7.3 that represent a GR binding site. They are slightly enriched in the DEB set as shown in Table 8.3 and the best p-value for a performance point is 0.00021. Thus GR is enriched in the set, but is an extremely weak predictor of DEB set membership. This is not surprising given that known GR sites can vary significantly from the GR PWM consensus. We then examined all other vertebrate TRANSFAC v7.3 and JASPAR matrices (total 574) for enrichment in the DEB set. The top 20 TF groups ranked by AUC are shown in Table 8.4. We now apply a stronger multiple testing correction (0.05/574) and find that only JV$THING1_E47 and V$HNF4ALPHA_Q6 are significantly enriched in all three samples. The GATA-like PWM V$LMO2COM_02 is significant in two samples. There is some evidence for an interaction between GATA family members and GR [152]. DNA binding by E47 and COUP was recently shown [112] to be repressed by GRα.

[Figure 8.2 plot. Title: CDF of Q for GR Direct Targets. x-axis: Q (liver) [bits] (0 to 14); y-axis: CDF (0.0 to 1.0). Legend: All; GR Binders.]

Figure 8.2: The distribution of Qg|liver for GR direct targets shows a statistically significant enrichment for liver-specific genes as compared to the distribution of genes on the promoter chip.

PWM           Site Size   Mean AUC   AUC Range
V$GRE_C       dimer       0.584      0.013
V$GR_Q6       dimer       0.561      0.013
V$GR_Q6_01    monomer     0.567      0.019

Table 8.3: Performance of three TRANSFAC PWMs for GR.

PWM             AUC     Spread   PWM Group
V$ERR1_Q2       0.638   0.007    ER-LEFT
V$YY1_Q6        0.631   0.022    YY1
V$SRF_Q4        0.627   0.021    SRF
JV$THING1_E47   0.625   0.010    THING1_E47
V$LMO2COM_02    0.622   0.019    GATA
V$DR1_Q3        0.616   0.005    HNF4/TCF4/COUP/PPAR
V$CREB_02       0.606   0.002    CREB/ATF
V$SREBP1_02     0.594   0.015    SREBP
V$ER_Q6_01      0.592   0.003    ER
V$SF1_Q6        0.591   0.014    GNCF/SF1
V$GRE_C         0.584   0.013    GR

Table 8.4: Performance of top 20 PWM groups for DEB set. The AUC is the average over three different background samples. The spread is the difference between the minimum and maximum AUC values.

GR-TF Combinations Enriched in the DEB Set: In order to begin to identify components of the sequence context that determines which GR sites are functional in the experiment, we considered 2-set combinations of GR and any other PWM, including those for GR. This yielded 1147 combinations which were evaluated using the standard method with a size bound of 300 bp. Table 8.5 and Table 8.6 list the companions that were significantly enriched at the maximum enrichment point (max(rTP − rFP)) in all three samples. The tables present the average uncorrected p-value for the enrichment, the average size bound, and the standard deviation of the size bound. Many of the rules have a small size bound, e.g., AP1, p53, IRF, C/EBP, and cRel with the GR monomer site, and myc and NFAT with the GR dimer site. All of these factors are known to interact with GR, e.g., YY1 [19], C/EBPβ [185], Oct-1 [168], and p53 [197]. Most are included in the Ingenuity network analysis. Interestingly, AP-1 was found close to the GR monomer much more so than the dimer, which is consistent with published reports [98]. We can add some measure of experimental validation to the pairing with C/EBPβ by considering the data from the C/EBPβ experiment described earlier. Of the genes in this study with a GR, C/EBPβ dimer, 11 had been included in the previous experiment. Of those, 45% (5) showed enrichment in C/EBPβ binding under those different but related conditions.

[Figure 8.3 plot. Title: GR PWMs. x-axis: FP (0 to 1000); y-axis: TP (0 to 50). Iso-significance lines labeled 3 and T. Legend: V$GRE_C; V$GR_Q6; V$GR_Q6_01.]

Figure 8.3: ROC curves for GR PWMs. Only the PWM V$GRE_C achieves statistical significance at some performance point. The thin scalloped lines are the iso-significance lines for α = 0.05 and Bonferroni corrections of 3 and T = 574.

8.3.3

Discussion

The analysis of this experiment posed a challenge as the target factor has a binding site that is not well described by the existing PWMs. In addition, the direct targets included 16 2KB tiles, which have a higher probability of containing a GR site by chance. These two facts together yielded only a very weak confirmation of the enrichment of GR sites by direct computational analysis. Despite these difficulties we have been able to identify a number of statistically significant GR-TF pairs that include a number of known partners of GR. One of these, C/EBPβ, has been verified by comparison with the previous C/EBPβ experiment. Others are known to interact with GR, so we consider them to be strong candidates for further evaluation and an indication that the analysis is correct.

Set Rule       log10 P-value   Avg Size   Std Dev
LMO2COM        -6.852          147        78
GATA           -5.652          50         0
PPARGAMMA      -6.306          270        0
AP1            -5.880          30         0
R              -5.767          150        0
YY1            -5.631          143        106
RORALFA_1      -5.609          140        0
ERR1           -5.320          140        0
OCT1           -5.587          110        0
P53_DECAMER    -5.512          30         0
IRF_2          -5.407          70         0
ICSBP          -4.572          150        0
CEBP           -5.246          30         0
CEBPB          -5.158          60         0
CEBPB          -4.805          187        101
THING1_E47     -5.228          190        26
SOX17          -5.101          187        12
ELK1           -5.094          70         0
C_REL          -4.849          30         0
E2F1DP1        -4.555          40         0

Table 8.5: Companion factors for GR monomer with log10(p-value), estimated optimal spacing, and standard deviation of spacing estimate across three runs. These rules have corrected p-values better than 0.05.

We noted that the direct targets were enriched in liver-specific genes. As the binding sites for the liver-specific factor HNF4 are found to be enriched, HNF4 could be responsible for the liver-specific expression of these genes. Interestingly, the statistical significance of the concentration of HNF4 sites appears to be due to their occurrence in about 16 of the tiles. Mapping these tiles to the GNF1 expression data, 5 of 11 mapped tiles are associated with genes that have Qg|liver ≤ 5.8 for at least one probe set. Comparing these Q values to those displayed in Figure 8.2, it seems that these probably constitute the most liver-specific genes in the set.

Set Rule       log10 P-value   Avg Size   Std Dev
LMO2COM        -8.793          180        0
THING1_E47     -6.650          252        24
E47            -5.264          110        0
MYC            -5.783          60         0
PPARGAMMA      -5.580          276        9
R              -5.293          175        10
YY1            -4.993          210        0
EVI1           -4.985          160        0
IK1            -4.968          193        40
AR             -4.841          150        0
NFAT           -4.778          40         0
PAX5           -4.564          170        0

Table 8.6: Companion factors for GR dimer with log10(p-value), estimated optimal spacing, and standard deviation of spacing estimate across three runs. These rules have corrected p-values better than 0.05.

8.3.4

Methods

Enrichment of Liver-Specific Genes: We could assign a Qg|liver score to 44 of 54 GR target genes. Of these, 25% (11) had Qg|liver ≤ 8.25. Of the 2207 genes on the promoter chip expressing above 200 AU in the GNF1 mouse data only 7% (148) had Qg|liver ≤ 8.25. A two-sided test using the two-population proportion test in the R package [221] to compare these proportions yielded a p-value of 1.1 × 10^-5.

Control set for solo TF and GR-TF analyses: The control set consisted of randomly selected tiles from the remainder of the tiles on the promoter chip. The 56 DEB tiles contained 40 1KB tiles and 16 2KB tiles. Since the probability of a feature occurring in an interval depends on the interval length, we chose a control set that had a mix of 1KB and 2KB tiles in proportion to their frequency in the DEB set. We chose three random samples of 800 1KB tiles and 320 2KB tiles. The numbers presented are averaged over these three runs.
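The length-matched sampling described above can be sketched as follows. The function name, the tile identifiers, the pool sizes, and the seeds are all hypothetical; only the sample sizes (800 and 320, the same 5:2 ratio as the 40 and 16 DEB tiles) and the use of three independent samples come from the text.

```python
import random

def sample_control_tiles(tiles_1kb, tiles_2kb, n_1kb=800, n_2kb=320, seed=None):
    """Draw one control sample with a fixed mix of 1KB and 2KB tiles."""
    rng = random.Random(seed)
    # Sampling is without replacement within each length class.
    return rng.sample(tiles_1kb, n_1kb) + rng.sample(tiles_2kb, n_2kb)

# Hypothetical pools of non-DEB tiles on the promoter chip.
pool_1kb = [f"tile1k_{i}" for i in range(5000)]
pool_2kb = [f"tile2k_{i}" for i in range(2000)]

# Three independent control samples; reported numbers are averaged over them.
control_sets = [sample_control_tiles(pool_1kb, pool_2kb, seed=s) for s in (1, 2, 3)]
print([len(c) for c in control_sets])
```

Sampling each length class separately keeps the chance of a spurious feature match comparable between the DEB set and the controls.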

Solo Factor Evaluation: The evaluation of single factors from TRANSFAC and JASPAR, including PWMs for GR, was done using grammars of the form

$R_i \longrightarrow F_i[\mathrm{score} \ge x_i];$    (G8.3)

The minimum score threshold was 5.0 and scores were rounded to 0.1.
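To make the evaluation of a rule of form G8.3 concrete, the sketch below enumerates score thresholds in steps of 0.1 starting at 5.0 and reports the point that maximizes the difference between the true positive and false positive rates. It is a simplified illustration rather than the evaluation engine itself: the function name is hypothetical, and we assume each sequence has already been summarized by the score of its best match to the PWM.

```python
def best_performance_point(pos_scores, neg_scores, min_score=5.0):
    """Return (threshold, tp_rate, fp_rate) maximizing tp_rate - fp_rate.

    pos_scores / neg_scores: best PWM match score per positive / control
    sequence (an assumed summary of the parser output).
    """
    # Work in integer tenths so that rounding to 0.1 is exact.
    pos = [round(s * 10) for s in pos_scores]
    neg = [round(s * 10) for s in neg_scores]
    lowest = round(min_score * 10)
    highest = max(pos + neg + [lowest])
    best = None
    for t in range(lowest, highest + 1):
        tp_rate = sum(s >= t for s in pos) / len(pos)
        fp_rate = sum(s >= t for s in neg) / len(neg)
        # Strict comparison keeps the lowest threshold among ties.
        if best is None or tp_rate - fp_rate > best[1] - best[2]:
            best = (t / 10.0, tp_rate, fp_rate)
    return best

# Toy scores, invented for illustration.
print(best_performance_point([9.2, 8.7, 7.5, 5.1], [6.0, 5.5, 5.2, 5.0]))
```

Each enumerated threshold corresponds to one point on the ROC graph, so the same loop can be used to trace the whole curve.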

GR-TF Combination Evaluation: The evaluation of GR-combinations was done with set production rules of the form

$R_{ij} \xrightarrow{\{n_{ij}\}} GR_i[\mathrm{score} \ge x_{ij}],\ F_j[\mathrm{score} \ge y_{ij}];$    (G8.4)

where the score thresholds, xij and yij, were rounded to 0.1 with a minimum of 5.0 and the size bound, nij, was rounded to 10 bp to allow for a complete enumeration of all possible values and was limited to 300 bp. In the case where a GR PWM was paired with itself, a bag production of the form

$R_{ii} \longrightarrow GR_i[\mathrm{score} \ge x_{ii}] : 2;$    (G8.5)

was used.

Statistical Significance: The statistical significance for a performance point was computed using a two-sample proportion test. The null hypothesis is that the true positive and false positive rates are the same. The p-value is computed using the prop.test function in the R software. A multiple testing correction is applied by considering a p-value threshold of 0.05/3 (GR PWMs), 0.05/574 (all single PWMs) or 0.05/1147 (GR-2-sets) as appropriate.
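The three corrected thresholds can be written out as a small lookup. The names below are hypothetical, and we assume a strict comparison against the threshold; the example p-value of 0.00021 is the best performance point reported for the GR PWMs in the Results.

```python
import math

ALPHA = 0.05
# Number of tests in each family, as stated in the text.
N_TESTS = {"GR PWMs": 3, "all single PWMs": 574, "GR 2-sets": 1147}

def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test p-value threshold under a Bonferroni correction."""
    return alpha / n_tests

def is_significant(p_value, family):
    """True if p_value passes the corrected threshold for its family."""
    return p_value < bonferroni_threshold(N_TESTS[family])

for family, n in N_TESTS.items():
    threshold = bonferroni_threshold(n)
    print(f"{family}: p < {threshold:.2e} (log10 = {math.log10(threshold):.2f})")

# Best GR PWM performance point from the Results (p = 0.00021).
print(is_significant(0.00021, "GR PWMs"), is_significant(0.00021, "all single PWMs"))
```

Under these thresholds the best GR PWM point passes the correction for three tests but not the correction for 574, which matches the description of GR as enriched but only weakly predictive.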

8.4

Discussion

The application of our technique to these particular sets of ChIP-chip data has provided a valuable counterpart to the tissue-specific gene sets analyzed earlier. C/EBPβ and GR are examples of TF’s that are expressed in a wider variety of tissues and target genes that are less tissue specific. This has allowed us to demonstrate the effectiveness of the technique in a different setting. In both cases, we were able to identify biologically relevant combinations of sites even though the binding sites for these factors are highly variable. Looking to the future, we view our technique as being useful for processing data from ChIP-chip experiments involving multiple TF’s active in the same experimental conditions when their direct targets overlap.


Chapter 9

Discussion

In this chapter we summarize our results and then discuss our findings in a larger context.

9.1

Summary

We have described the following accomplishments in this dissertation. We have defined a grammar formalism tailored to the problem of describing and identifying cis-regulatory modules (CRM’s) in genomic sequence and implemented a parser to find matches to such grammars in genomic sequence. We have developed a machine-learning algorithm to identify arrangements of transcription factor binding sites (TFBS’s) represented by simple expressions in the grammar formalism that are over-represented in a positive set as compared to a negative set of sequences. We have applied entropy and developed a related metric Qg|t to identify tissue-specific genes. We have taken a layered approach to identify the signals involved in tissue-specific gene expression. We started by making gross classifications of promoter features associated with tissue-specificity in general. We then studied the core promoter to identify any other specificity-related TF’s. We followed this by identifying the arrangements of liver-specific and liver-active TF’s that are present in liver-specific promoters. Finally, we analyzed the sequence of promoters identified as direct TF targets by ChIP-chip experiments to identify companion factors that may cooperate with the target TF to define the set of target genes. These steps all constitute advances in either computer science or computational biology.

9.2

Grammar Formalism and Learning System

Our grammar formalism is similar to immediate dominance/linear precedence (ID/LP) [80] and permutation [39] grammars but incorporates the notion of size and context bounding as well as reverse complement orientation and annotation matching. Reverse complement orientation is not novel; it has already appeared in GenLang [192, 61, 195, 193, 194], but it is an essential part of pattern recognition in genetic applications. Size bounding is also widely used, but context bounding is a novel contribution. The use of a grammar formalism allows us to integrate a number of different kinds of models of CRM’s into one framework, which allows for a unified method of learning, evaluating, and comparing these models. In addition, it gives us a powerful tool for encoding expert knowledge of regulation in a database. We have developed a flexible machine-learning system including a learning strategy that reduces the size of the search space with a heuristic that focuses the search on the rules involving features that demonstrate the most evidence of cooperation. The system supports multiple learning strategies that can be selected by the user to suit the biological problem at hand. New strategies can also be developed by programmers. By including free parameters for scoring, orientation, position, and size of features and by providing an evaluation system for optimizing these parameters using receiver-operator characteristic (ROC) curves, we are able to provide an in-depth analysis of the performance of a rule. Although we have only described the learning of simple rules involving a single collection of features, our system is capable of learning more complex non-recursive hierarchical rules as well. This provides an important avenue of expansion for the future and is a novel feature of our work as compared to others (cited below).
When we began to develop this work, we used as starting points systems like GenLang [192], work in yeast on pairs of sites [229], logistic regression (LR) analysis models by Wasserman [234, 127], list-order models in correlated binding sites derived from as little as a single promoter or enhancer by the ModelInspector [123] system, and the preliminary descriptions of work in fly such as [20] that considered unstructured clusters of TFBS’s. We recognized then that constraints on the arrangement and composition of TFBS’s in CRM’s would be complex and would apply at multiple levels. Systems that have appeared since our proposal, such as TOUCAN [3, 4, 2], COMET [74], MSCAN [5], work by Kreiman [126], and work by Terai and Takagi [222], have also used concepts that are identical or nearly identical to one or more of our collection productions, but without the explicit mention of a grammar formalism. Thus our system has anticipated these developments and is able to place them into a larger context using the grammatical framework. This framework allows for the composition of rules and also the enforcement of constraints at multiple levels, which is not possible in these other approaches. Furthermore, using the precedence relations between rules, i.e., the gradual development of more complex rules by the application of basic operations, in the learning framework allows us to structure the space of hypotheses and allows automated learning across a large set of possible rules. We note that most of the systems, e.g., TOUCAN [3], are designed to search for a single optimal solution. Furthermore, this system as well as Kreiman’s work [126] assume that acceptable scoring thresholds for PWM’s are known prior to the search. We found this assumption not to be true in general, so we solved a harder problem by including parameter optimization in the learning process. It is also important to realize that acceptable individual thresholds for PWM’s may not accurately reflect the optimal thresholds when the PWM’s are used in combination. Our evaluation algorithms allow for the optimal parameters to be learned. The trade-off is that we are not able to explore as deeply into the space of possible grammars for the same amount of computing effort. However, we note that our system does not require that scoring or other parameters be optimized during learning and it is entirely possible to use the existing learning engine with slightly altered learning strategies that work with the performance assessments that would result from fixing score and/or orientation parameters prior to learning. Such an approach might lead to more rapid identification of potential rules that could be explored more fully after the initial learning run. There are a few models that we cannot capture exactly in our grammar system, notably LR models and the bounding on spacing between factors in a set context as used by Kreiman [126]. It is not clear how significant a difference there is between Kreiman’s pair-wise bounds and our production-wide bounds, nor how much allowing a weighting of scores (as is done in the LR model) would help in discrimination.
These features could be easily added to our grammar formalism, though this might be considered taking too much of a ‘kitchen sink’ approach.

9.3

Tissue Specificity, the TATA Box, and CpG Islands

We began our biological analysis [189] by applying entropy to rank genes according to their overall tissue specificity. This allowed us to identify the TATA box and CpG islands as the major global features related to tissue specificity. These two features have countervailing effects. The TATA box is correlated with higher maximum expression and with tissue-specific genes. CpG islands, on the other hand, are correlated with higher minimum expression and the least-specific genes as well as embryonic expression. Subsequent preliminary data from analyses not described here indicate that the fraction of tissue-specific CpG– genes depends on the tissue. These two signals are neither mutually exclusive nor exhaustive. About 8-9% of mouse genes have a TATA box and a CpG island and an even larger number, 30%, have neither. We found that the TATA+/CpG– genes coded for a disproportionately high number of proteins that are exported from the cell. The TATA–/CpG– gene products tended to be located in the membrane. Both sets of genes were typically involved in cellular response processes. TATA–/CpG+ gene products were located in the cytosol or nucleus and performed housekeeping functions. These results are consistent with a view that the TATA box facilitates the rapid production of large amounts of mRNA for proteins which will be either exported from the cell, e.g., insulin, or used in cell types that require large amounts of the protein, e.g., the hemoglobins or muscle contractile apparatus. Some of the proteins that end up in the membrane, e.g., receptors or pumps, are tissue-specific but are not needed in such large numbers since the ‘volume’ of the membrane is less than the whole cell volume or the volume of the external space. We also observed positional base compositional biases relative to the transcription start site (TSS), particularly a G-richness of the region downstream of the TSS, that were common to the least specific genes regardless of their CpG island status. Motif finding in this region identified a YY1-like motif that was correlated with ubiquitous expression. This application of entropy Hg and the derived conditional specificity metric Qg|t are novel. Measures used by other workers amount to two approaches, either counting the number of tissues that express a gene to measure overall specificity [165, 106, 230], or considering the relative expression [211] of a gene to measure conditional (or overall) specificity. Counting tissues is a relatively crude approach that we found did not yield interesting insights into the distribution of CpG islands. Similarly poor results were also found in other work, e.g., [165]. Tissue counting also raises the difficulty of choosing a cutoff for determining expression. Our approach avoids that problem by not making a decision about expression versus non-expression.
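As a reminder of how these quantities behave, the following sketch computes both for a single gene. The definitions are not restated in this chapter, so the code assumes the forms given in Chapter 5: the relative expression p(t|g) is the gene's expression in tissue t divided by its total over all tissues, Hg is the Shannon entropy of that distribution in bits, and Qg|t = Hg − log2 p(t|g). The function names and the toy expression values are hypothetical.

```python
import math

def relative_expression(levels):
    """Map tissue -> fraction of the gene's total expression."""
    total = sum(levels.values())
    return {tissue: w / total for tissue, w in levels.items()}

def entropy_bits(levels):
    """Overall tissue specificity Hg in bits (assumed definition)."""
    p = relative_expression(levels)
    return -sum(x * math.log2(x) for x in p.values() if x > 0)

def q_specificity(levels, tissue):
    """Conditional specificity Qg|t = Hg - log2 p(t|g) (assumed definition)."""
    p_t = relative_expression(levels)[tissue]
    if p_t == 0:
        return math.inf  # gene is not expressed in this tissue at all
    return entropy_bits(levels) - math.log2(p_t)

# Toy expression profiles, invented for illustration.
uniform = {"liver": 100, "kidney": 100, "brain": 100, "muscle": 100}
liver_only = {"liver": 400, "kidney": 0, "brain": 0, "muscle": 0}
print(entropy_bits(uniform), q_specificity(uniform, "liver"))
print(entropy_bits(liver_only), q_specificity(liver_only, "liver"))
```

Low values of both quantities indicate specificity: a gene expressed only in liver scores zero on each, while a uniformly expressed gene scores high, and no expression cutoff is needed at any point.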
Relative expression is a reasonable way of measuring conditional-specificity but, as we noted in Chapter 5, it can miss genes for which the tissue of interest is a secondary but important site of expression. Our Qg|t metric corrects that problem. Our findings with regard to the association of CpG islands with housekeeping genes are not new, e.g., [23, 165] among many others; in fact we used it as validation that we were indeed capturing the notion of a housekeeping gene with the entropy metric. Similarly, the association of embryonic expression with CpG islands was also known [165]. The association of TATA boxes with tissue-specific genes is more novel. Early models of promoters emphasized the TATA box, probably because the earliest genes to be sequenced were tissue-specific genes, e.g., the globins and insulin. Indeed the TATA box was an essential part of early promoter finders such as Promoter Scan [169]. (It is interesting to note that recently the CpG island has come to fill that same role, e.g., [109, 93, 63].) What we were able to do is to assess the overall importance of the TATA box, i.e., just 20-30% of genes, and associate the TATA box with tissue-specific genes, especially those that are highly induced. The contemporary work by Gershenzon and Ioshikhes [81] yields an even lower fraction of TATA boxes on the same data set. This is probably due to differences in predicting TATA boxes. Gershenzon and Ioshikhes [81] also chart the gradual reduction in estimates of the prevalence of the TATA box over recent years. They also note the lower rate of TATA boxes in CpG+ genes, but did not include tissue-specificity in their analysis. Furthermore, we could begin to make functional associations with the promoter class, which is a step in the direction of decoding the promoter. Recently it was observed [15] in yeast that the TATA box is associated with stress-related genes, which constitute a reasonable set of analog genes for the class of genes that we identified as having TATA boxes in mouse or human. Thus it appears that our findings may be quite general across a wide range of species. We will be examining this in future work. We then studied the core promoter in detail to see if there were any more global features that might help identify tissue-specific genes. We considered all TF’s with PWM’s to see if any were enriched in the core promoter and furthermore whether they had an effect on the distribution of overall tissue specificity, Hg, after controlling for the presence or absence of a CpG island. We used our machine-learning algorithm to identify about 25 TF’s that are enriched in the core promoter. These typically showed preferences for either CpG– or CpG+ promoters but showed little cooperation with each other. Encouragingly, we identified a number of known TF positional preferences, indicating that the algorithms are effective at identifying relevant patterns. However, none of the TF’s, except the TATA box, had any effect on the entropy distribution. Many of the TF’s we identified as enriched were involved in growth response, cell cycle, or embryonic expression.
Thus it appears these functions are largely independent of, perhaps orthogonal to, tissue-specific expression. Though not reported in detail here, we did not find much evidence of synergy between elements of the core promoter, e.g., TATA box and initiator element (Inr). Gershenzon and Ioshikhes [81] found statistically significant computational evidence of such interactions though the effects are small. This suggests that the associations are not of prime importance in determining the function of the core promoter, or rather may function relatively independently.

9.4

Liver-Specific Genes

We then proceeded to consider the proximal promoter of mouse CpG– liver-specific genes as measured by our metric Qg|t. We compared these sequences to random sequence sampled from a Markov model of CpG– promoters in general. We were able to automatically identify a number of over-represented liver-specific or liver-related TF’s in these promoters. We showed that our method also correctly identified TF’s specific to other tissues such as muscle, small intestine, and brain, in promoters for genes specific to those tissues. We then considered combinations of the top 20 of these liver-specific factors. We recaptured the {HNF1,HNF4} rule previously identified by others [127], but also identified other rules involving forkhead factors, glucocorticoid receptor (GR), and Oct-1. We performed 7-fold cross validation to identify the rules that can be reliably learned from the data. As part of the learning process we identified the optimal scoring and spacing parameters for these rules. Using the optimal parameters, we determined that in large part these rules did in fact select for liver-specific promoters and selected against promoters of genes specific to other unrelated tissues such as cerebellum. There were two kinds of exceptions to this finding of liver-specific promoter selection. First, promoters for genes specific to related organs such as kidney and small intestine were often identified by the liver rules. This was not unexpected as some tissue-specific genes are expressed in all three tissues and some of the liver-specific factors are active in those tissues as well. The other exception is combinations of TF’s that are either widely but not ubiquitously expressed, e.g., GR, or part of large families that are active in other tissues, e.g., the forkhead family, or HNF4 or COUP. Combinations of PWM’s for such factors are likely to be active in other tissues besides liver. We consider this result a success as we would like to identify all CRM’s active in a given tissue, not just the ones required for plain tissue-specific expression. We found that our rules favored TATA+ promoters, suggesting that there may be more combinations to identify that favor the TATA-less promoters. We note that we did not include all potential liver-specific factors because we limited ourselves to the top 20 TF’s. Excluded factors include HNF6, PBX1, and DBP. These will be included in future efforts.
In addition, we will want to try to learn to distinguish between related tissues, e.g., kidney and liver. We found that although HNF1 was part of a number of 2-feature rules, it was difficult to improve on the predictive performance of HNF1 by itself. HNF1 was not part of any 3-feature rules. However, the forkhead factors tend to appear in clusters and will cooperate with HNF1 and HNF4 as well as other liver-related factors. We do not know if these represent the same forkhead factor appearing multiple times or two or three different forkheads cooperating. We found the {HNF1,HNF4} rule was the strongest selector of liver-specific promoters, but that the new rules we discovered were also effective and can select different sets of genes.

Our findings show some agreement with the location analysis of Odom [151], as we determined that their human HNF1 targets are largely liver-specific as measured by Qg|t and we find that HNF1 is the strongest individual predictor. We differ in that we find HNF4 to be a reasonable predictor of liver-specific expression, but the Odom data suggests that HNF4 binds to many non-liver-specific genes. This may reflect a problem with their experimental process, e.g., a less specific HNF4 antibody. This data also raises a question about the different roles a tissue-specific TF may play. In tissue-specific genes, the TF may act to directly enable transcription. In non-specific genes that need to have their expression level slightly altered, either statically or dynamically in response to external or metabolic signals, the TF may act as an anchor for the signal response TF’s. It is also possible that the Odom data suggests that HNF4 plays this second role in a number of genes.

The false positive rate we achieved with our {HNF1,HNF4} rule was not as good as that achieved by the LR model [127]; however, we note that that work was based on just 16 positive examples and may be a case of overlearning. Our thresholds and true and false positive rates should be more accurate as they are based on 100 genes. It is clear, however, that we have not yet formed a complete picture of gene regulation in liver. We plan to consider CpG+ genes to see if they are regulated in the same or a different manner as CpG– genes. Our initial analysis suggests that HNF1 is somewhat less prevalent in these genes. We also want to analyze genes from other species, i.e., human, and in other experiments to try to identify the robust common rules that function across data sets. In addition, we will want to integrate these ‘liver overview’ studies with the type of focussed studies described in the ChIP-chip section.

9.5 ChIP-chip Data

Finally, we have analyzed ChIP-chip data to identify the companion TF’s of the ChIP-chip target TF: either C/EBP [72] or glucocorticoid receptor [130]. The ChIP-chip target genes are not all tissue-specific, but rather encompass a wide range of specificity. As described in [72] and [130], we used our machine-learning algorithm to identify other TF’s that may serve to define the sequence environment that distinguishes the direct targets from other genes. In both cases, but especially in the GR experiment, most of the top candidates were known to interact with GR under some circumstances. Our method identified the most promising partners with the optimal spacing and scoring parameters. This allowed for identification of TF’s to consider in follow-up experiments for validation and further understanding of the mechanism of regulation.

We also found agreement between the liver-specific promoter analysis and the GR ChIP-chip analysis, as both identified Oct-1 as a partner of GR. This is a known interaction [167]. Furthermore, the GR analysis identified C/EBP as a partner, and many of the genes containing this pair were bound by C/EBP in the C/EBP experiment. As GR is conditionally, not constitutively, bound to promoters, the identification of the {GR,Oct-1} rule in the liver overview suggests two hypotheses: either the mice in the tissue-survey experiment were in some condition that stimulated GR binding, or the motif is common to many liver genes and so was identified. Future analyses of more data should help resolve this question.

9.6 The Structure of Regulatory Modules

We placed a 300bp limit on the size of our CRM’s and found that they contained at most three TF’s. TF’s were typically involved in a few different combinations. This raises the possibility that the rules we found were a function of the 300bp limit and that, if the size limit were larger, we might find a larger single collection that subsumes all of the rules we found. In particular, a {forkhead,HNF4,GR,Oct-1} rule appears to be a likely possibility, as all four predecessor 3-set rules were found. However, this rule was not evaluated because the successful predecessor rules involved different forms of either the forkhead PWM or the HNF4-related PWM. We are, however, interested in seeing how CRM’s matching these and other rules may work together. It will be interesting to see whether instances of these related rules overlap or are distinct in promoters. We did not force the CRM’s to be physically distinct, so they may overlap on the promoters. On average, each rule we found hit about 40 to 60% of the genes we trained on, so there must be gene-level, if not base-pair-level, overlap among many of the rules. Together these facts suggest a model in which there are many, potentially overlapping, CRM’s in a promoter. This may be a way of reconciling the dense-cluster-of-sites models, e.g., COMET [74], and our more structured models. The 3-TF limit suggests that if more factors are interacting, they may do so either over longer distances or with factors that we have not yet considered.

This also raises an important point of comparison with LR or the cluster-of-sites approaches. Those two approaches both allow a site for a TF to be effectively absent from an instance of a CRM as long as the other TF’s can provide a strong enough score. This fact highlights the subtle differences between related questions in this area. We can arrange some of these questions as follows, from least to most detailed or stringent.

1. Is this gene specific to tissue X?
2. Which regions regulate the gene’s expression?
3. How do these regions regulate the gene’s expression?

The exact level that an approach addresses is debatable, but the LRA and cluster-of-sites approaches, by allowing missing features, are not bringing as much rigor to question 3 as our approach does. Consider the potential rule {forkhead,HNF4,GR,Oct-1}, which is suggested by, but not found in, our liver results. It may be that these TF’s serve distinct roles in regulation. Thus the specific combination of factors that are active at a gene defines the regulatory program for the gene. A gene that is missing an Oct-1 site will not behave the same way as one with the Oct-1 site. By forcing sites to appear with a strong score, we are more rigorously identifying the exact regulatory program of a gene.

This argument can be refined by several further considerations. First, two or more TF’s may be redundant, i.e., only one or the other needs to be present to implement the regulatory program. That is, either of the rules {A, B, C} or {A, B, D} describes the CRM. The cluster-of-sites approach might recognize clusters of TF’s A, B, C, and D, but does not have the ability to describe the requirement of A and B and (C or D). Such rules were considered in yeast [17] using clusters of genes derived from a large number of expression experiments. LRA is in a similar position, though it might be able to place a stronger weight on TF’s A and B. Currently our approach should identify both the {A, B, C} and {A, B, D} rules but would not relate the two except via their common predecessor {A, B}. Expansion of our learning method to include disjunction or alternate expansions would make it possible to collapse these two rules to {A, B, Z} and Z → C|D. The yeast work [17] also considered rules involving NOTs, i.e., the absence of a TFBS as a requirement. Grammars typically do not handle NOT constructions. We have considered adding NOTs to BCG’s by marking RHS terms as illegal, i.e., as ‘poison pills’ that kill a parse if the RHS term is found.

The rules identified by considering whole tissues or even a single cell type must be refined by comparison with more tightly defined experiments. The whole-tissue rules identify patterns active in a tissue, but do not relate them to particular conditions. We have no way of knowing which of the rules are related to getting the genes expressed as opposed to modulating their expression in response to stimuli. We have performed the simplest refinement by doing the selectivity measurements on other tissues. The ChIP-chip experiments also provide this kind of information by considering the tissue in particular states. In particular this identified the GR-Oct-1 interaction as common to both situations.
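The size-bounded collection, the proposed disjunction Z → C|D, and the ‘poison pill’ idea discussed above can be made concrete with a small sketch. This is not the BCG parser itself; it is an illustrative matcher under stated assumptions: predicted sites are given as simple intervals, each slot names a disjoint set of alternative factors, and a forbidden site anywhere in a candidate window rejects that window (one of several possible readings of a NOT term). The names `Site` and `matches` are hypothetical.

```python
from typing import NamedTuple


class Site(NamedTuple):
    tf: str     # name of the factor whose PWM matched
    start: int  # 0-based start of the predicted site
    end: int    # exclusive end of the predicted site


def matches(sites, slots, forbidden=frozenset(), max_span=300):
    """Return the sites of the first window satisfying the rule, else None.

    slots     -- list of sets; each slot needs one site from its set, so
                 {"C", "D"} encodes the alternation Z -> C | D
    forbidden -- 'poison pill' factors; any such site rejects the window
    max_span  -- size bound on the collection, in base pairs
    """
    ordered = sorted(sites, key=lambda s: s.start)
    for i, left in enumerate(ordered):
        # Candidate window: sites lying fully within max_span bp of `left`.
        limit = left.start + max_span
        window = [s for s in ordered[i:] if s.end <= limit]
        present = {s.tf for s in window}
        if present & set(forbidden):
            continue  # a poison pill kills this window
        if all(present & set(slot) for slot in slots):
            return window
    return None
```

For example, `matches(sites, [{"A"}, {"B"}, {"C", "D"}])` accepts a promoter carrying either {A, B, C} or {A, B, D} within 300bp, which is the single collapsed rule {A, B, Z} rather than two unrelated rules.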
Incorporation of our modeling method into larger sets of data with TF expression data and TF ChIP-chip data should help to gain a more detailed picture of regulation. In light of the complications of analyzing liver-specific promoters, e.g., families of TF’s or TF’s with similar binding sites, the successful application of our technique to ChIP-chip data suggests that an approach integrating expression and binding may be able to resolve some of those ambiguities. We view this application as particularly satisfying as it allows one to significantly extend the usefulness of the ChIP-chip data.

The analysis of the C/EBP and GR ChIP-chip experiments focussed on identifying the distinctions between direct targets and non-targets but did not attempt to integrate these results into a common framework with the whole liver experiments. This is best illustrated by considering Figures 7.2, 7.6, and 8.2. These illustrate that the distribution of assayed or predicted sites or complexes covers not only tissue-specific genes, but also a number of non-specific genes. Understanding the role of the tissue-specific factors in the regulation of the non-specific genes is an important next step. It also highlights the need for further refinement of the notion of tissue-specificity. We noted in Chapter 5 that some of the non-specific TATA+/CpG– genes were probably expressed specifically in certain circumstances that were not assayed in the general tissue survey. Experiments like the mRNA expression component of the Kaestner ChIP-chip experiments should allow us to augment the tissue survey results to identify conditionally tissue-specific genes. Further analysis of the promoters of these genes, especially in comparison with the unconditionally tissue-specific genes, should help refine our models of tissue-specificity.


Chapter 10

Future Work

In this chapter we briefly discuss future extensions to this work.

Biology

We have considered the most liver-specific genes without CpG islands. Even within liver-specific genes, there are other sets of interest, e.g., the liver-specific genes with CpG islands. Preliminary analysis of these suggests that HNF1 might play a smaller role, but that needs to be confirmed. In addition, we have concentrated on the most specific genes. There are a large number of genes with 8 ≤ Qg|liver ≤ 9 that we can analyze. These may or may not share the same rules as the more specific genes, or may begin to exhibit different patterns that correspond more to the liver-specific regulation of less specific genes. Of course, we can consider other tissues as well. In particular, it will be illuminating to try to determine how promoters differ between liver and other gut organs, such as the large intestine, gall bladder, or pancreas, that are also regulated by the HNF factors.

We also will need to examine other, larger, data sets to see how repeatable our results are. We know that gene expression is affected by circadian rhythms, feeding status, and other effects. We have preliminary evidence that some of these effects may be acting in the GNF data sets. We expect broad agreement on the major tissue-specific TF’s, but want to see if the secondary signalling TF’s are affected.

Our sequence analysis has been limited to the proximal promoter. This was done because we chose not to consider sequence conservation when selecting candidate TFBS’s and the proximal promoter is the most likely place to find regulatory modules. To expand our search area, both upstream and into the gene, it will be necessary to include conservation. This can be done simply by filtering the predicted TFBS’s to include only those that appear to be conserved. We still face the possibility of losing a large fraction of combinations due to a high rate of active but unconserved sites.
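The filtering step described above can be sketched as follows. How conservation would be determined is not specified here, so this sketch simply assumes that conserved regions are already available as coordinate intervals on the same sequence as the predicted sites. The function name and the `slack` parameter are hypothetical; `slack` is included only to show how the strict filter could be relaxed to keep sites that lie close to a conserved region.

```python
def conserved_sites(sites, conserved_regions, slack=0):
    """Keep predicted sites lying within `slack` bp of a conserved region.

    sites             -- iterable of (tf, start, end), half-open coordinates
    conserved_regions -- list of (start, end), half-open coordinates
    slack             -- 0 keeps only sites fully inside a conserved region;
                         a positive value also keeps sites that extend at
                         most `slack` bp beyond a region's boundaries
    """
    kept = []
    for tf, start, end in sites:
        for r_start, r_end in conserved_regions:
            if start >= r_start - slack and end <= r_end + slack:
                kept.append((tf, start, end))
                break  # one supporting region is enough
    return kept
```

The filtered list can then be passed to the rule learner in place of the full set of predicted sites, at the cost noted above of discarding active but unconserved sites.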
We should be able to preprocess TFBS’s to include those that are not conserved but are near conserved regions, or are in conserved regions but are not present at the same position. One way to do this would be to run an analysis like the TraFac system, which essentially considers all sites in conserved regions rather than insisting on direct positional conservation. Indeed, we could integrate our method with many other approaches, such as cluster-based methods like CISTER. We could use CISTER to identify regions with dense clusters of TFBS’s, then process them with our system to refine the CRM models. We can also take advantage of ChIP-chip data as more of the genome is covered by affordable microarrays.

The recognition of the importance of microRNA-based regulation is growing. Our system can easily incorporate those patterns as well. The location bounding component of BCG’s can easily be used to look both at promoters and at exons, introns, or UTR’s. Thus one can imagine, and in fact formulate a query now, that includes the effects of TF’s and miRNA’s.

Models

We have considered simple rules consisting of just one collection. An area for expansion is Although the total size of the match was constrained, we did not try to enforce With the current evaluation system we can consider constraining the spacing between features. Perhaps the most exciting direction is to learn hierarchical rules. It is not clear to us whether there is a synergy among all members of large groups of TF’s or rather many relations between the pairs of TF’s in the group. Learning hierarchical rules based on pairs of factors is one way to try to understand that issue. Along these lines, it might be necessary to consider incorporating ID/LP-like capabilities into the BCG system. However, rather than expressing just linear ordering preferences, these relations might indicate distance or relative orientation preferences. We also would like to incorporate TF expression data into the learning procedure.
We argued that many of the TF’s with good individual AUC’s were liver-specific, and so our method was working. We were not able to work deep into that list due to time constraints. By considering expression evidence, we would be able to identify TFBS’s with weaker AUC values that may still play a role in regulation.
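For readers unfamiliar with the individual AUC values referred to above, a minimal sketch of one standard way to compute an AUC is given below. It assumes each promoter has been reduced to a single score for a given PWM (for example its best match score, which is an assumption here) and uses the rank-based definition: the probability that a randomly chosen positive promoter outscores a randomly chosen negative one. The function name is hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count one half; this equals the area under the ROC curve.
    The quadratic loop is kept for clarity, not speed.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

A value near 0.5 means the PWM score does not separate liver-specific promoters from the background set, while values approaching 1.0 indicate a strong individual predictor.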


Bibliography

[1] Naoki Abe and Hiroshi Mamitsuka. Predicting Protein Secondary Structure Using Stochastic Tree Grammars. Machine Learning, 29:275–301, 1997.
[2] S. Aerts, G. Thijs, B. Coessens, M. Staes, Y. Moreau, and B. De Moor. Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res, 31(6):1753–64, 2003.
[3] S. Aerts, P. Van Loo, Y. Moreau, and B. De Moor. A genetic algorithm for the detection of new cis-regulatory modules in sets of coregulated genes. Bioinformatics, 20(12):1974–6, 2004.
[4] S. Aerts, P. Van Loo, G. Thijs, Y. Moreau, and B. De Moor. Computational detection of cis-regulatory modules. Bioinformatics, 19 Suppl 2:II5–II14, 2003.
[5] W. B. Alkema, O. Johansson, J. Lagergren, and W. W. Wasserman. MSCAN: identification of functional clusters of transcription factor binding sites. Nucleic Acids Res, 32(Web Server issue):W195–8, 2004.
[6] Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, 1987.
[7] M. N. Arbeitman, E. E. Furlong, F. Imam, E. Johnson, B. H. Null, B. S. Baker, M. A. Krasnow, M. P. Scott, R. W. Davis, and K. P. White. Gene expression during the life cycle of Drosophila melanogaster. Science, 297(5590):2270–5, 2002.
[8] Robert B. Ash. Information Theory. Dover Publications, 1965.
[9] A. I. Baars, A. Loh, and S. D. Swierstra. Parsing Permutation Phrases. Journal of Functional Programming, 14(Part 6):635–646, 2004.
[10] T. L. Bailey, M. E. Baker, and C. P. Elkan. An artificial intelligence approach to motif discovery in protein sequences: application to steroid dehydrogenases. J Steroid Biochem Mol Biol, 62(1):29–44, 1997.

[11] T. L. Bailey and M. Gribskov. Methods and statistics for combining motif match scores. J Comput Biol, 5(2):211–21, 1998.
[12] V. B. Bajic, V. Choudhary, and C. K. Hock. Content analysis of the core promoter region of human genes. In Silico Biol, 4(2):109–25, 2004.
[13] P. Banerjee, M. Bahlo, J. R. Schwartz, G. G. Loots, K. A. Houston, I. Dubchak, T. P. Speed, and E. M. Rubin. SNPs in putative regulatory regions identified by human mouse comparative sequencing and transcription factor binding site data. Mamm Genome, 13(10):554–7, 2002.
[14] G. Edward Barton. On the Complexity of ID/LP Parsing. Computational Linguistics, 11(4):205–218, 1985.
[15] A. D. Basehoar, S. J. Zanton, and B. F. Pugh. Identification and distinct regulation of yeast TATA box-containing genes. Cell, 116(5):699–709, 2004.
[16] M. Beato, P. Herrlich, and G. Schutz. Steroid hormone receptors: many actors in search of a plot. Cell, 83(6):851–7, 1995.
[17] M. A. Beer and S. Tavazoie. Predicting gene expression from sequence. Cell, 117(2):185–98, 2004.
[18] Otto Berg and Peter von Hippel. Selection of DNA Binding Sites by Regulatory Proteins: Statistical-mechanical Theory and Application to Operators and Promoters. Journal of Molecular Biology, 193:723–750, 1987.
[19] P. L. Bergad, H. C. Towle, and S. A. Berry. Yin-yang 1 and glucocorticoid receptor participate in the Stat5-mediated growth hormone response of the serine protease inhibitor 2.1 gene. J Biol Chem, 275(11):8114–20, 2000.
[20] B. P. Berman, Y. Nibu, B. D. Pfeiffer, P. Tomancak, S. E. Celniker, M. Levine, G. M. Rubin, and M. B. Eisen. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA, 99(2):757–62, 2002.
[21] B. P. Berman, B. D. Pfeiffer, T. R. Laverty, S. L. Salzberg, G. M. Rubin, M. B. Eisen, and S. E. Celniker. Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol, 5(9):R61, 2004.

[22] A. P. Bird. DNA methylation–how important in gene control? Nature, 307(5951):503–4, 1984.
[23] A. P. Bird. DNA methylation versus gene expression. J Embryol Exp Morphol, 83 Suppl:31–40, 1984.
[24] E. Birney, D. Andrews, P. Bevan, M. Caccamo, G. Cameron, Y. Chen, L. Clarke, G. Coates, T. Cox, J. Cuff, V. Curwen, T. Cutts, T. Down, R. Durbin, E. Eyras, X. M. Fernandez-Suarez, P. Gane, B. Gibbins, J. Gilbert, M. Hammond, H. Hotz, V. Iyer, A. Kahari, K. Jekosch, A. Kasprzyk, D. Keefe, S. Keenan, H. Lehvaslaiho, G. McVicker, C. Melsopp, P. Meidl, E. Mongin, R. Pettett, S. Potter, G. Proctor, M. Rae, S. Searle, G. Slater, D. Smedley, J. Smith, W. Spooner, A. Stabenau, J. Stalker, R. Storey, A. Ureta-Vidal, C. Woodwark, M. Clamp, and T. Hubbard. Ensembl 2004. Nucleic Acids Res, 32 Database issue:D468–70, 2004.
[25] N. Bluthgen, S. M. Kielbasa, and H. Herzel. Inferring combinatorial regulation of transcription in silico. Nucleic Acids Res, 33(1):272–9, 2005.
[26] M. S. Boguski, T. M. Lowe, and C. M. Tolstoshev. dbEST–database for "expressed sequence tags". Nat Genet, 4(4):332–3, 1993.
[27] K. R. Boheler and M. D. Stern. The new role of SAGE in gene discovery. Trends Biotechnol, 21(2):55–7; discussion 57–8, 2003.
[28] A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. Predicting gene regulatory elements in silico on a genomic scale. Genome Res, 8(11):1202–15, 1998.
[29] A. Brazma, J. Vilo, E. Ukkonen, and K. Valtonen. Data mining for regulatory elements in yeast genome. Proc Int Conf Intell Syst Mol Biol, 5:65–74, 1997.
[30] M. Brown and C. Wilson. RNA pseudoknot modeling using intersections of stochastic context free grammars with applications to database search. Pac Symp Biocomput, pages 109–25, 1996.
[31] P. Bucher. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol, 212(4):563–78, 1990.
[32] S. Buonamici, S. Chakraborty, V. Senyuk, and G. Nucifora. The role of EVI1 in normal and leukemic cells. Blood Cells Mol Dis, 31(2):206–12, 2003.

[33] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. J Mol Biol, 268(1):78–94, 1997.
[34] T. W. Burke and J. T. Kadonaga. Drosophila TFIID binds to a conserved downstream basal promoter element that is present in many TATA-box-deficient promoters. Genes Dev, 10(6):711–24, 1996.
[35] T. W. Burke and J. T. Kadonaga. The downstream core promoter element, DPE, is conserved from Drosophila to humans and is recognized by TAFII60 of Drosophila. Genes Dev, 11(22):3020–31, 1997.
[36] H. J. Bussemaker, H. Li, and E. D. Siggia. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc Natl Acad Sci USA, 97(18):10096–100, 2000.
[37] H. J. Bussemaker, H. Li, and E. D. Siggia. Regulatory element detection using correlation with expression. Nat Genet, 27(2):167–71, 2001.
[38] L. Cai, R. L. Malmberg, and Y. Wu. Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics, 19 Suppl 1:i66–73, 2003.
[39] Robert D. Cameron. Extending Context-free grammars with permutation phrases. ACM Letters on Programming Languages and Systems (LOPLAS), 2(104):85–94, 1993.
[40] P. Carninci, K. Waki, T. Shiraki, H. Konno, K. Shibata, M. Itoh, K. Aizawa, T. Arakawa, Y. Ishii, D. Sasaki, H. Bono, S. Kondo, Y. Sugahara, R. Saito, N. Osato, S. Fukuda, K. Sato, A. Watahiki, T. Hirozane-Kishikawa, M. Nakamura, Y. Shibata, A. Yasunishi, N. Kikuchi, A. Yoshiki, M. Kusakabe, S. Gustincich, K. Beisel, W. Pavan, V. Aidinis, A. Nakagawara, W. A. Held, H. Iwata, T. Kono, H. Nakauchi, P. Lyons, C. Wells, D. A. Hume, M. Fagiolini, T. K. Hensch, M. Brinkmeier, S. Camper, J. Hirota, P. Mombaerts, M. Muramatsu, Y. Okazaki, J. Kawai, and Y. Hayashizaki. Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia. Genome Res, 13(6B):1273–89, 2003.
[41] R. Carrasco and J. Oncina. Learning stochastic regular grammars by means of a state merging method. In Grammatical Inference and Applications, ICGI’94, volume 862 of Lecture Notes in Artificial Intelligence, pages 139–150. Springer Verlag, 1994.
[42] CBIL. AllGenes: a web site providing access to an integrated database of known and predicted human (release 9.0, 2004) and mouse genes (release 9.0, 2004), 2004.

[43] U. R. Chandran, B. S. Warren, C. T. Baumann, G. L. Hager, and D. B. DeFranco. The glucocorticoid receptor is tethered to DNA-bound Oct-1 at the mouse gonadotropin-releasing hormone distal negative glucocorticoid response element. J Biol Chem, 274(4):2372–8, 1999.
[44] Q. K. Chen, G. Z. Hertz, and G. D. Stormo. MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput Appl Biosci, 11(5):563–6, 1995.
[45] S. F. Chen. Bayesian Grammar Induction for Language Modeling. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 228–235, 1995.
[46] P. J. Coffer and B. M. Burgering. Forkhead-box transcription factors and their role in the immune system. Nat Rev Immunol, 4(11):889–99, 2004.
[47] J. Collado-Vides. Towards a unified grammatical model of sigma 70 and sigma 54 bacterial promoters. Biochimie, 78(5):351–63, 1996.
[48] M. D. Conkright, E. Guzman, L. Flechner, A. I. Su, J. B. Hogenesch, and M. Montminy. Genome-wide analysis of CREB target genes reveals a core promoter requirement for cAMP responsiveness. Mol Cell, 11(4):1101–8, 2003.
[49] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–45, 2004.
[50] T. Cook, B. Gebelein, and R. Urrutia. Sp1 and its likes: biochemical and functional predictions for a growing family of zinc finger transcription factors. Ann NY Acad Sci, 880:94–102, 1999.
[51] M. J. Cunningham, S. Liang, S. Fuhrman, J. J. Seilhamer, and R. Somogyi. Gene expression microarray data analysis for toxicology profiling. Ann NY Acad Sci, 919:52–67, 2000.
[52] W. H. Day and F. R. McMorris. Critical comparison of consensus methods for molecular sequences. Nucleic Acids Res, 20(5):1093–9, 1992.
[53] C. de la Higuera. Characteristic Sets for Polynomial Grammatical Inference. Machine Learning, 27:125–138, 1997.
[54] F. Denis. Learning Regular Languages from simple positive examples. In P. Dupont and L. Chase, editors, Using Symbol Clustering to Improve Probabilistic Automaton Learning. 1998.

[55] G. Dennis Jr., B. T. Sherman, D. A. Hosack, J. Yang, W. Gao, H. C. Lane, and R. A. Lempicki. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol, 4(5):P3, 2003.
[56] C. Dieterich, S. Grossmann, A. Tanzer, S. Ropcke, P. F. Arndt, P. F. Stadler, and M. Vingron. Comparative promoter region analysis powered by CORG. BMC Genomics, 6(1):24, 2005.
[57] C. Dieterich, S. Rahmann, and M. Vingron. Functional inference from non-random distributions of conserved predicted transcription factor binding sites. Bioinformatics, 20 Suppl 1:I109–I115, 2004.
[58] C. Dieterich, H. Wang, K. Rateitschak, H. Luz, and M. Vingron. CORG: a database for COmparative Regulatory Genomics. Nucleic Acids Res, 31(1):55–7, 2003.
[59] J. F. DiMartino, L. Selleri, D. Traver, M. T. Firpo, J. Rhee, R. Warnke, S. O’Gorman, I. L. Weissman, and M. L. Cleary. The Hox cofactor and proto-oncogene Pbx1 is required for maintenance of definitive hematopoiesis in the fetal liver. Blood, 98(3):618–26, 2001.
[60] S. Dohr, A. Klingenhoff, H. Maier, M. Hrabe de Angelis, T. Werner, and R. Schneider. Linking disease-associated genes to regulatory networks via promoter organization. Nucleic Acids Res, 33(3):864–72, 2005.
[61] S. Dong and D. B. Searls. Gene Structure Prediction by Linguistic Methods. Genomics, 23(3):540–551, 1994.
[62] R. D. Dowell, R. M. Jokerst, A. Day, S. R. Eddy, and L. Stein. The distributed annotation system. BMC Bioinformatics, 2(1):7, 2001.
[63] T. A. Down and T. J. Hubbard. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res, 12(3):458–61, 2002.
[64] S. R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–63, 1998.
[65] E. Eisenberg and E. Y. Levanon. Human housekeeping genes are compact. Trends Genet, 19(7):362–5, 2003.

[66] L. Elnitski, R. C. Hardison, J. Li, S. Yang, D. Kolbe, P. Eswara, M. J. O’Connor, S. Schwartz, W. Miller, and F. Chiaromonte. Distinguishing regulatory DNA from neutral sites. Genome Res, 13(1):64–72, 2003.
[67] L. M. Ettwiller, J. Rung, and E. Birney. Discovering novel cis-regulatory motifs using functional networks. Genome Res, 13(5):883–95, 2003.
[68] J. Felsenstein. PHYLIP, 1993.
[69] J. W. Fickett. Coordinate positioning of MEF2 and myogenin binding sites. Gene, 172(1):GC19–32, 1996.
[70] P. C. FitzGerald, A. Shlyakhtenko, A. A. Mir, and C. Vinson. Clustering of DNA sequences in human promoters. Genome Res, 14(8):1562–74, 2004.
[71] K. Frech and T. Werner. Specific modelling of regulatory units in DNA sequences. Pac Symp Biocomput, pages 151–62, 1997.
[72] J. R. Friedman, B. Larris, P. P. Le, T. H. Peiris, A. Arsenlis, J. Schug, J. W. Tobias, K. H. Kaestner, and L. E. Greenbaum. Orthogonal analysis of C/EBPbeta targets in vivo during liver proliferation. Proc Natl Acad Sci USA, 101(35):12986–91, 2004.
[73] M. C. Frith, U. Hansen, and Z. Weng. Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17(10):878–89, 2001.
[74] M. C. Frith, J. L. Spouge, U. Hansen, and Z. Weng. Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res, 30(14):3214–24, 2002.
[75] S. Fuhrman, M. J. Cunningham, X. Wen, G. Zweiger, J. J. Seilhamer, and R. Somogyi. The application of Shannon entropy in the identification of putative drug targets. Biosystems, 55(1-3):5–14, 2000.
[76] V. Gailus-Durner, M. Scherf, and T. Werner. Experimental data of a single promoter can be used for in silico detection of genes with related regulation in the absence of sequence similarity. Mamm Genome, 12(1):67–72, 2001.
[77] F. Gao, B. C. Foat, and H. J. Bussemaker. Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics, 5(1):31, 2004.

[78] Y. Garten, S. Kaplan, and Y. Pilpel. Extraction of transcription regulatory signals from genome-wide DNA-protein interaction data. Nucleic Acids Res, 33(2):605–15, 2005.
[79] D. Gauchat, H. Escriva, M. Miljkovic-Licina, S. Chera, M. C. Langlois, A. Begue, V. Laudet, and B. Galliot. The orphan COUP-TF nuclear receptors are markers for neurogenesis from cnidarians to vertebrates. Dev Biol, 275(1):104–23, 2004.
[80] Gerald Gazdar and Geoffrey K. Pullum. Subcategorization, constituent order and the notion of "head". In M. Moortgat, H. van der Hulst, and T. Hoekstra, editors, The Scope of Lexical Rules, pages 107–123. Foris Publications, Dordrecht, Holland, 1981.
[81] N. I. Gershenzon and I. P. Ioshikhes. Synergy of human Pol II core promoter elements revealed by statistical sequence analysis. Bioinformatics, 21(8):1295–300, 2005.
[82] D. Ghosh. TFD: the transcription factors database. Nucleic Acids Res, 20 Suppl:2091–3, 1992.
[83] Y. Gitton, N. Dahmane, S. Baik, A. Ruiz i Altaba, L. Neidhardt, M. Scholze, B. G. Herrmann, P. Kahlem, A. Benkahla, S. Schrinner, R. Yildirimman, R. Herwig, H. Lehrach, and M. L. Yaspo. A gene expression map of human chromosome 21 orthologues in the mouse. Nature, 420(6915):586–90, 2002.
[84] M. Gold. Complexity of Automaton Identification from Given Data. Information and Control, 37(3):302–320, 1978.
[85] M. I. Gonzalez and D. M. Robins. Oct-1 preferentially interacts with androgen receptor in a DNA-dependent manner that facilitates recruitment of SRC-1. J Biol Chem, 276(9):6420–8, 2001.
[86] N. Grabe. AliBaba2: Context Specific Identification of Transcription Factor Binding Sites. In Silico Biology, 000119, 2000.
[87] L. Grate. Automatic RNA secondary structure determination with stochastic context-free grammars. Proc Int Conf Intell Syst Mol Biol, 3:136–44, 1995.
[88] L. Grate, M. Herbster, R. Hughey, D. Haussler, I. S. Mian, and H. Noller. RNA modeling using Gibbs sampling and stochastic context free grammars. Proc Int Conf Intell Syst Mol Biol, 2:138–46, 1994.

[89] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Clarendon Press, Oxford, second edition, 1992.
[90] W. N. Grundy, T. L. Bailey, C. P. Elkan, and M. E. Baker. Meta-MEME: motif-based hidden Markov models of protein families. Comput Appl Biosci, 13(4):397–406, 1997.
[91] D. GuhaThakurta and G. D. Stormo. Identifying target sites for cooperatively binding factors. Bioinformatics, 17(7):608–21, 2001.
[92] M. S. Halfon, Y. Grad, G. M. Church, and A. M. Michelson. Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res, 12(7):1019–28, 2002.
[93] S. Hannenhalli and S. Levy. Promoter prediction in the human genome. Bioinformatics, 17 Suppl 1:S90–6, 2001.
[94] S. Hannenhalli and S. Levy. Transcriptional regulation of protein complexes and biological pathways. Mamm Genome, 14(9):611–9, 2003.
[95] R. W. Hanson and L. Reshef. Regulation of phosphoenolpyruvate carboxykinase (GTP) gene expression. Annu Rev Biochem, 66:581–611, 1997.
[96] M. A. Harris, J. Clark, A. Ireland, J. Lomax, M. Ashburner, R. Foulger, K. Eilbeck, S. Lewis, B. Marshall, C. Mungall, J. Richter, G. M. Rubin, J. A. Blake, C. Bult, M. Dolan, H. Drabkin, J. T. Eppig, D. P. Hill, L. Ni, M. Ringwald, R. Balakrishnan, J. M. Cherry, K. R. Christie, M. C. Costanzo, S. S. Dwight, S. Engel, D. G. Fisk, J. E. Hirschman, E. L. Hong, R. S. Nash, A. Sethuraman, C. L. Theesfeld, D. Botstein, K. Dolinski, B. Feierbach, T. Berardini, S. Mundodi, S. Y. Rhee, R. Apweiler, D. Barrell, E. Camon, E. Dimmer, V. Lee, R. Chisholm, P. Gaudet, W. Kibbe, R. Kishore, E. M. Schwarz, P. Sternberg, M. Gwinn, L. Hannick, J. Wortman, M. Berriman, V. Wood, N. de la Cruz, P. Tonellato, P. Jaiswal, T. Seigfried, and R. White. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res, 32 Database issue:D258–61, 2004.
[97] Y. Hayashizaki. RIKEN mouse genome encyclopedia. Mech Ageing Dev, 124(1):93–102, 2003.
[98] S. Heck, M. Kullmann, A. Gast, H. Ponta, H. J. Rahmsdorf, P. Herrlich, and A. C. Cato. A distinct modulating domain in glucocorticoid receptor monomers in the repression of activity of the transcription factor AP-1. Embo J, 13(17):4087–95, 1994.
[99] J. Henderson, S. Salzberg, and K. H. Fasman. Finding genes in DNA with a Hidden Markov Model. J Comput Biol, 4(2):127–41, 1997.

[100] G. Z. Hertz, G. W. Hartzell, and G. D. Stormo. Identification of Consensus Patterns in Unaligned DNA-Sequences Known to Be Functionally Related. Computer Applications in the Biosciences, 6(2):81–92, 1990. [101] D. P. Hill, D. A. Begley, J. H. Finger, T. F. Hayamizu, I. J. McCright, C. M. Smith, J. S. Beal, L. E. Corbani, J. A. Blake, J. T. Eppig, J. A. Kadin, J. E. Richardson, and M. Ringwald. The mouse Gene Expression Database (GXD): updates and enhancements. Nucleic Acids Res, 32(Database issue):D568–71, 2004. [102] I. Holmes. Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics, 6(1):73, 2005. [103] D. A. Hosack, Jr. Dennis, G., B. T. Sherman, H. C. Lane, and R. A. Lempicki. Identifying biological themes within lists of genes with EASE. Genome Biol, 4(10):R70, 2003. [104] L. L. Hsiao, F. Dangond, T. Yoshida, R. Hong, R. V. Jensen, J. Misra, W. Dillon, K. F. Lee, K. E. Clark, P. Haverty, Z. Weng, G. L. Mutter, M. P. Frosch, M. E. Macdonald, E. L. Milford, C. P. Crum, R. Bueno, R. E. Pratt, M. Mahadevappa, J. A. Warrington, G. Stephanopoulos, and S. R. Gullans. A compendium of gene expression in normal human tissues. Physiol Genomics, 7(2):97–104, 2001. [105] J. D. Hughes, P. W. Estep, S. Tavazoie, and G. M. Church. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol, 296(5):1205–14, 2000. [106] L. Huminiecki, A. T. Lloyd, and K. H. Wolfe. Congruence of tissue expression profiles from Gene Expression Atlas, SAGEmap and TissueInfo databases. BMC Genomics, 4(1):31, 2003. [107] H. Ichikawa, S. C. Lin, S. Y. Tsai, M. J. Tsai, and T. Sugimoto. Effect of mCOUP-TF1 deficiency on the glossopharyngeal and vagal sensory ganglia. Brain Res, 1014(1-2):247–50, 2004. [108] T. A. Ince and K. W. Scotto. A conserved downstream element defines a new class of RNA polymerase II promoters. J Biol Chem, 270(51):30249–52, 1995. [109] I. P. Ioshikhes and M. 
Q. Zhang. Large-scale human promoter mapping using CpG islands. Nat Genet, 26(1):61–3, 2000. 1061-4036 Journal Article. [110] N. Jareborg, E. Birney, and R. Durbin. Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res, 9(9):815–24, 1999. 179

[111] A. G. Jegga, S. P. Sherwood, J. W. Carman, A. T. Pinski, J. L. Phillips, J. P. Pestian, and B. J. Aronow. Detection and visualization of compositionally similar cis-regulatory element clusters in orthologous and coordinately controlled genes. Genome Res, 12(9):1408–17, 2002. [112] X. Jiang, M. Norman, L. Roth, and X. Li. Protein-DNA array-based identification of transcription factor activities regulated by interaction with the glucocorticoid receptor. J Biol Chem, 279(37):38480–5, 2004. [113] K. H. Kaestner, C. S. Lee, L. M. Scearce, J. E. Brestelli, A. Arsenlis, P. P. Le, K. A. Lantz, J. Crabtree, A. Pizarro, J. Mazzarelli, D. Pinney, S. Fischer, E. Manduchi, Jr. Stoeckert, C. J., G. Gradwohl, S. W. Clifton, J. R. Brown, H. Inoue, C. Cras-Meneur, and M. A. Permutt. Transcriptional program of the endocrine pancreas in mice and humans. Diabetes, 52(7):1604–10, 2003. [114] Y. Kamei, S. Miura, M. Suzuki, Y. Kai, J. Mizukami, T. Taniguchi, K. Mochida, T. Hata, J. Matsuda, H. Aburatani, I. Nishino, and O. Ezaki. Skeletal muscle FOXO1 (FKHR) transgenic mice have less skeletal muscle mass, down-regulated Type I (slow twitch/red muscle) fiber genes, and impaired glycemic control. J Biol Chem, 279(39):41114–23, 2004. [115] D. Karolchik, A. S. Hinrichs, T. S. Furey, K. M. Roskin, C. W. Sugnet, D. Haussler, and W. J. Kent. The UCSC Table Browser data retrieval tool. Nucleic Acids Res, 32 Database issue:D493–6, 2004. [116] Alan F. Karr. Probability. Springer Texts in Statistics. Springer, 1993. [117] M. Kato, N. Hata, N. Banerjee, B. Futcher, and M. Q. Zhang. Identifying combinatorial regulation of transcription factors and binding motifs. Genome Biol, 5(8):R56, 2004. [118] M. Kearns and L. G. Valiant. Cryptographic limitations on learning boolean formulae and finit automata. In 21st Annual ACM Symposium on Theory of Computation, pages 433–444, New York, 1989. ACM. [119] A. Kel, O. Kel-Margoulis, V. Babenko, and E. Wingender. 
Recognition of NFATp/AP-1 composite elements within genes induced upon the activation of immune cells. J Mol Biol, 288(3):353–76, 1999. [120] O. V. Kel-Margoulis, A. E. Kel, I. Reuter, I. V. Deineko, and E. Wingender. TRANSCompel: a database on composite regulatory elements in eukaryotic genes. Nucleic Acids Res, 30(1):332–4, 2002. 180

[121] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at UCSC. Genome Res, 12(6):996–1006, 2002. [122] T. Kin, K. Tsuda, and K. Asai. Marginalized kernels for RNA sequence data analysis. Genome Inform Ser Workshop Genome Inform, 13:112–22, 2002. [123] A. Klingenhoff, K. Frech, K. Quandt, and T. Werner. Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity. Bioinformatics, 15(3):180–6, 1999. [124] Satoshi Kobayashi and Takashi Yokomori. Modeling RNA Secondary Structures Using Tree Grammars. In Genome Informatics Workshop V, Yokohama, Japan, 1994. [125] D. Kolbe, J. Taylor, L. Elnitski, P. Eswara, J. Li, W. Miller, R. Hardison, and F. Chiaromonte. Regulatory potential scores from genome-wide three-way alignments of human, mouse, and rat. Genome Res, 14(4):700–7, 2004. [126] G. Kreiman. Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic Acids Res, 32(9):2889–900, 2004. [127] W. Krivan and W. W. Wasserman. A predictive model for regulatory sequences directing liver-specific transcription. Genome Res, 11(9):1559–66, 2001. [128] A. K. Kutach and J. T. Kadonaga. The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. Mol Cell Biol, 20(13):4754–64, 2000. [129] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262(5131):208–14, 1993. [130] Phillip Phuc Le, Joshua R. Friedman, Jonathan Schug, John Brestelli, Brandon Parker, Irina Bochkis, and Klaus H. Kaestner. Glucocorticoid Receptor Dependent Gene Regulatory Networks. PLoS Genetics, in prep. [131] J. S. Lee, K. M. Galvin, and Y. Shi. Evidence for physical interaction between the zinc-finger transcription factors YY1 and Sp1. 
Proc Natl Acad Sci USA, 90(13):6145–9, 1993. [132] T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon, J. Zeitlinger, E. G. Jennings, H. L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J. B. Tagne, T. L. Volkert, E. Fraenkel, D. K. Gifford, and 181

R. A. Young. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298(5594):799–804, 2002. [133] N. Lehman, M. D. Donne, M. West, and T. G. Dewey. The genotypic landscape during in vitro evolution of a catalytic RNA: implications for phenotypic buffering. J Mol Evol, 50(5):481–90, 2000. [134] S. Leung, C. Mellish, and D. Robertson. Basic Gene Grammars and DNA-Chart Parser for language processing of Escherichia coli promoter DNA sequences. Bioinformatics, 17:226–236, 2001. [135] S. Levy and S. Hannenhalli. Identification of transcription factor binding sites in the human genome sequence. Mamm Genome, 13(9):510–4, 2002. [136] L. Li, S. He, J. M. Sun, and J. R. Davie. Gene regulation by Sp1 and Sp3. Biochem Cell Biol, 82(4):460–71, 2004. [137] Z. Li, S. Van Calcar, C. Qu, W. K. Cavenee, M. Q. Zhang, and B. Ren. A global transcriptional regulatory role for c-Myc in Burkitt’s lymphoma cells. Proc Natl Acad Sci USA, 100(14):8164– 9, 2003. [138] F. Long, H. Liu, C. Hahn, P. Sumazin, M. Q. Zhang, and A. Zilberstein. Genome-wide prediction and analysis of function-specific transcription factor binding sites. In Silico Biol, 4(4):395–410, 2004. [139] G. G. Loots, I. Ovcharenko, L. Pachter, I. Dubchak, and E. M. Rubin. rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res, 12(5):832–9, 2002. [140] H. Mamitsuka and N. Abe. Predicting location and structure of beta-sheet regions using stochastic tree grammars. Ismb, 2:276–84, 1994. [141] R. Mantovani. A survey of 178 NF-Y binding CCAAT boxes. Nucleic Acids Res, 26(5):1135– 43, 1998. [142] L. Marino-Ramirez, J. L. Spouge, G. C. Kanga, and D. Landsman. Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Res, 32(3):949–58, 2004. 182

[143] R. Martone, G. Euskirchen, P. Bertone, S. Hartman, T. E. Royce, N. M. Luscombe, J. L. Rinn, F. K. Nelson, P. Miller, M. Gerstein, S. Weissman, and M. Snyder. Distribution of NFkappaB-binding sites across human chromosome 22. Proc Natl Acad Sci USA, 100(21):12247– 52, 2003. [144] H. Matsui, K. Sato, and Y. Sakakibara. Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Bioinformatics, 2005. [145] V. Matys, E. Fricke, R. Geffers, E. Gossling, M. Haubrock, R. Hehl, K. Hornischer, D. Karas, A. E. Kel, O. V. Kel-Margoulis, D. U. Kloos, S. Land, B. Lewicki-Potapov, H. Michael, R. Munch, I. Reuter, S. Rotert, H. Saxel, M. Scheer, S. Thiele, and E. Wingender. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res, 31(1):374–8, 2003. [146] A. M. McGuire, J. D. Hughes, and G. M. Church. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Res, 10(6):744–57, 2000. [147] L. I. McKay and J. A. Cidlowski. Molecular control of immune/inflammatory responses: interactions between nuclear factor-kappa B and steroid receptor-signaling pathways. Endocr Rev, 20(4):435–59, 1999. [148] V. H. Nagaraj, R. A. O’Flanagan, A. R. Bruning, J. R. Mathias, A. K. Vershon, and A. M. Sengupta. Combined analysis of expression data and transcription factor binding sites in the yeast genome. BMC Genomics, 5(1):59, 2004. [149] S. Natesan and M. Z. Gilman. DNA bending and orientation-dependent function of YY1 in the c-fos promoter. Genes Dev, 7(12B):2497–509, 1993. [150] M. E. Nebel. Identifying good predictions of RNA secondary structure. Pac Symp Biocomput, pages 423–34, 2004. [151] D. T. Odom, N. Zizlsperger, D. B. Gordon, G. W. Bell, N. J. Rinaldi, H. L. Murray, T. L. Volkert, J. Schreiber, P. A. Rolfe, D. K. Gifford, E. Fraenkel, G. I. Bell, and R. A. Young. Control of pancreas and liver gene expression by HNF transcription factors. Science, 303(5662):1378– 81, 2004. [152] T. J. 
Oesterreicher and S. J. Henning. Rapid induction of GATA transcription factors in developing mouse intestine following glucocorticoid administration. Am J Physiol Gastrointest Liver Physiol, 286(6):G947–53, 2004. 183

[153] J. Oncina and P. Garcia. Inferring Regular Languages in Polynomial Updated Time. In Perez de la Blanca, Sanfeliu, and Vidal, editors, Pattern Recognition and Image Analysis. World Scientific, 1992. [154] I. Ovcharenko, G. G. Loots, B. M. Giardine, M. Hou, J. Ma, R. C. Hardison, L. Stubbs, and W. Miller. Mulan: multiple-sequence local alignment and visualization for studying function and evolution. Genome Res, 15(1):184–94, 2005. [155] R. Overbeek. PatScan. [156] M. Parrizas, M. A. Maestro, S. F. Boj, A. Paniagua, R. Casamitjana, R. Gomis, F. Rivera, and J. Ferrer. Hepatic nuclear factor 1-alpha directs nucleosomal hyperacetylation to its tissue-specific transcriptional targets. Mol Cell Biol, 21(9):3234–43, 2001. [157] A. G. Pedersen, P. Baldi, S. Brunak, and Y. Chauvin. Characterization of prokaryotic and eukaryotic promoters using hidden Markov models. Proc Int Conf Intell Syst Mol Biol, 4:182– 91, 1996. [158] J. S. Pedersen, I. M. Meyer, R. Forsberg, P. Simmonds, and J. Hein. A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res, 32(16):4925–36, 2004. [159] F. Pereira. Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London Series a-Mathematical Physical and Engineering Sciences, 358(1769):1239–1253, 2000. [160] F. A. Pereira, Y. Qiu, M. J. Tsai, and S. Y. Tsai. Chicken ovalbumin upstream promoter transcription factor (COUP-TF): expression during mouse embryogenesis. J Steroid Biochem Mol Biol, 53(1-6):503–8, 1995. [161] Vladimir Pericliev. Learning Linear Precedence Rule. In COLING 1996, 16th International Conference on Computational Linguistics, pages 883–888, Copenhagen, Denmark, 1996. [162] R. C. Perier, V. Praz, T. Junier, C. Bonnard, and P. Bucher. The eukaryotic promoter database (EPD). Nucleic Acids Res, 28(1):302–3, 2000. [163] G. Pesole, N. Prunella, S. Liuni, M. Attimonelli, and C. Saccone. 
WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res, 20(11):2871–5, 1992. 184

[164] L. E. Peterson. CLUSFAVOR 5.0: hierarchical cluster and principal-component analysis of microarray-based transcriptional profiles. Genome Biol, 3(7):SOFTWARE0002, 2002. [165] L. Ponger, L. Duret, and D. Mouchiroud. Determinants of CpG islands: expression in early embryo and isochore structure. Genome Res, 11(11):1854–60, 2001. [166] M. Pontoglio, D. M. Faust, A. Doyen, M. Yaniv, and M. C. Weiss. Hepatocyte nuclear factor 1alpha gene inactivation impairs chromatin remodeling and demethylation of the phenylalanine hydroxylase gene. Mol Cell Biol, 17(9):4948–56, 1997. [167] G. G. Prefontaine, M. E. Lemieux, W. Giffin, C. Schild-Poulter, L. Pope, E. LaCasse, P. Walker, and R. J. Hache. Recruitment of octamer transcription factors to DNA by glucocorticoid receptor. Mol Cell Biol, 18(6):3416–30, 1998. [168] G. G. Prefontaine, M. E. Lemieux, W. Giffin, C. Schild-Poulter, L. Pope, E. LaCasse, P. Walker, and R. J. Hache. Recruitment of octamer transcription factors to DNA by glucocorticoid receptor. Mol Cell Biol, 18(6):3416–30, 1998. [169] D. S. Prestridge. Predicting Pol II promoter sequences using transcription factor binding sites. J Mol Biol, 249(5):923–32, 1995. 0022-2836 Journal Article. [170] M. G. Reese, F. H. Eeckman, D. Kulp, and D. Haussler. Improved splice site detection in Genie. J Comput Biol, 4(3):311–23, 1997. [171] Gesine Reinert, Sophie Schbath, and Michael S. Waterman. Probabalistic and Statistical Properties of Words: An Overview. Journal of Computational Biology, 7(1/2):1–46, 2000. [172] A. Reymond, V. Marigo, M. B. Yaylaoglu, A. Leoni, C. Ucla, N. Scamuffa, C. Caccioppoli, E. T. Dermitzakis, R. Lyle, S. Banfi, G. Eichele, S. E. Antonarakis, and A. Ballabio. Human chromosome 21 gene expression atlas in the mouse. Nature, 420(6915):582–6, 2002. [173] P. Rice, I. Longden, and A. Bleasby. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet, 16(6):276–7, 2000. [174] K. J. Riggs, K. T. Merrell, G. Wilson, and K. Calame. 
Common factor 1 is a transcriptional activator which binds in the c-myc promoter, the skeletal alpha-actin promoter, and the immunoglobulin heavy-chain enhancer. Mol Cell Biol, 11(3):1765–9, 1991. [175] K. J. Riggs, S. Saleque, K. K. Wong, K. T. Merrell, J. S. Lee, Y. Shi, and K. Calame. Yin-yang 1 activates the c-myc promoter. Mol Cell Biol, 13(12):7487–95, 1993. 185

[176] L. Ringrose, M. Rehmsmeier, J. M. Dura, and R. Paro. Genome-wide prediction of Polycomb/Trithorax response elements in Drosophila melanogaster. Dev Cell, 5(5):759–71, 2003. [177] E. Rivas. Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinformatics, 6(1):63, 2005. [178] E. Rivas and S. R. Eddy. The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics, 16(4):334–40., 2000. [179] E. Rivas and S. R. Eddy. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics, 2(1):8, 2001. [180] M. Safran, V. Chalifa-Caspi, O. Shmueli, T. Olender, M. Lapidot, N. Rosen, M. Shmoish, Y. Peter, G. Glusman, E. Feldmesser, A. Adato, I. Peter, M. Khen, T. Atarot, Y. Groner, and D. Lancet. Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res, 31(1):142–6, 2003. [181] H. Salgado, A. Santos-Zavaleta, S. Gama-Castro, D. Millan-Zarate, E. Diaz-Peredo, F. Sanchez-Solano, E. Perez-Rueda, C. Bonavides-Martinez, and J. Collado-Vides. RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res, 29(1):72–4., 2001. [182] A. Sandelin, W. Alkema, P. Engstrom, W. W. Wasserman, and B. Lenhard. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res, 32(Database issue):D91–4, 2004. [183] T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res, 18(20):6097–100, 1990. [184] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht. Information content of binding sites on nucleotide sequences. J Mol Biol, 188(3):415–31, 1986. [185] O. J. Schoneveld, I. C. Gaemers, and W. H. Lamers. Mechanisms of glucocorticoid signalling. Biochim Biophys Acta, 1680(2):114–28, 2004. [186] H. Schrem, J. Klempnauer, and J. Borlak. 
Liver-enriched transcription factors in liver function and development. Part I: the hepatocyte nuclear factor network and liver-specific gene expression. Pharmacol Rev, 54(1):129–58, 2002. [187] H. Schrem, J. Klempnauer, and J. Borlak. Liver-enriched transcription factors in liver function and development. Part II: the C/EBPs and D site-binding protein in cell cycle control, 186

carcinogenesis, circadian gene regulation, liver regeneration, apoptosis, and liver-specific gene regulation. Pharmacol Rev, 56(2):291–330, 2004. [188] Jonathan Schug. Using TESS to Predict Transcription Factor Binding Sites in DNA Sequence. In Andreas D. Baxevanis, editor, Current Protocols in Bioinformatics. J. Wiley and Sons, 2003. [189] Jonathan Schug, Winfried-Paul Schuller, Claudia Kappen, J. Michael Salbaum, Maja Bucan, and Jr. Stoeckert, C. J. Promoter Features Related to Tissue Specificity as Measured by Shannon Entropy. Genome Biol, 2005. [190] S. Schwartz, L. Elnitski, M. Li, M. Weirauch, C. Riemer, A. Smit, E. D. Green, R. C. Hardison, and W. Miller. MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res, 31(13):3518–24, 2003. [191] S. Schwartz, Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller. PipMaker–a web server for aligning two genomic DNA sequences. Genome Res, 10(4):577–86, 2000. [192] D. B. Searls. The Linguistics of DNA. American Scientist, 80(6):579–591, 1992. [193] D. B. Searls. Linguistic approaches to biological sequences. Computer Applications in the Biosciences, 13(4):333–344, 1997. [194] D. B. Searls. Languages, automate, and macromolecules. Biophysical Journal, 76(1):A272– A272, 1999. [195] David B. Searls. String Variable Grammar: A Logic Grammar Formalism for the Biological Language of DNA. Journal of Logic Programming, pages 73–102, 1995. [196] E. Segal, R. Yelensky, and D. Koller. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19 Suppl 1:i273–82, 2003. [197] S. Sengupta and B. Wasylyk. Physiological and pathological consequences of the interactions of the p53 tumor suppressor with the glucocorticoid, androgen, and estrogen receptors. Ann NY Acad Sci, 1024:54–71, 2004. [198] E. Seto, Y. Shi, and T. Shenk. 
YY1 is an initiator sequence-binding protein that directs and activates transcription in vitro. Nature, 354(6350):241–5, 1991. [199] Claude Shannon. The mathematical theory of communication. University of Illinois Press, Urbana, 1949. 187

[200] R. Sharan, A. Ben-Hur, G. G. Loots, and I. Ovcharenko. CREME: Cis-Regulatory Module Explorer for the human genome. Nucleic Acids Res, 32(Web Server issue):W253–6, 2004. [201] R. Sharan, I. Ovcharenko, A. Ben-Hur, and R. M. Karp. CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics, 19 Suppl 1:i283– 91, 2003. [202] S. M. Sheiber. Direct Parsing of ID/LP Grammars. Linguistic Philosophy, 7(2):135–154, 1984. [203] G. Sherlock. Analysis of large-scale gene expression data. Brief Bioinform, 2(4):350–62, 2001. [204] Y. Shi, E. Seto, L. S. Chang, and T. Shenk. Transcriptional repression by YY1, a human GLIKruppel-related protein, and relief of repression by adenovirus E1A protein. Cell, 67(2):377– 88, 1991. [205] A. Shrivastava and K. Calame. An analysis of genes regulated by the multi-functional transcriptional regulator Yin Yang-1. Nucleic Acids Res, 22(24):5151–5, 1994. [206] S. Sinha and M. Tompa. YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res, 31(13):3586–8, 2003. [207] S. T. Smale. Transcription initiation from TATA-less promoters within eukaryotic proteincoding genes. Biochim Biophys Acta, 1351(1-2):73–88, 1997. [208] S. T. Smale and D. Baltimore. The ”initiator” as a transcription control element. Cell, 57(1):103–13, 1989. [209] E. L. Sonnhammer, S. R. Eddy, E. Birney, A. Bateman, and R. Durbin. Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res, 26(1):320–2., 1998. [210] J. M. Stafford, J. C. Wilkinson, J. M. Beechem, and D. K. Granner. Accessory factors facilitate the binding of glucocorticoid receptor to the phosphoenolpyruvate carboxykinase gene promoter. J Biol Chem, 276(43):39885–91, 2001. [211] J. A. Stanton, A. B. Macgregor, and D. P. Green. Identifying tissue-enriched gene expression in mouse tissues using the NIH UniGene database. Appl Bioinformatics, 2(3 Suppl):S65–73, 2003. 
[212] A. Stolcke. Bayesian Learning of Probabilistic Language Models. Phd, University of California at Berkely, 1994. 188

[213] G. D. Stormo. Consensus patterns in DNA. Methods Enzymol, 183:211–21, 1990. [214] R. L. Strausberg, E. A. Feingold, R. D. Klausner, and F. S. Collins. The mammalian gene collection. Science, 286(5439):455–7, 1999. [215] J. M. Stuart, E. Segal, D. Koller, and S. K. Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302(5643):249–55, 2003. [216] A. I. Su, M. P. Cooke, K. A. Ching, Y. Hakak, J. R. Walker, T. Wiltshire, A. P. Orth, R. G. Vega, L. M. Sapinoso, A. Moqrich, A. Patapoutian, G. M. Hampton, P. G. Schultz, and J. B. Hogenesch. Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci USA, 99(7):4465–70, 2002. [217] A. I. Su, T. Wiltshire, S. Batalov, H. Lapp, K. A. Ching, D. Block, J. Zhang, R. Soden, M. Hayakawa, G. Kreiman, M. P. Cooke, J. R. Walker, and J. B. Hogenesch. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA, 101(16):6062–7, 2004. [218] S. Subramaniam. The Biology Workbench–a seamless database and analysis environment for the biologist. Proteins, 32(1):1–2, 1998. [219] P. Sumazin, G. Chen, N. Hata, A. D. Smith, T. Zhang, and M. Q. Zhang. DWE: discriminating word enumerator. Bioinformatics, 21(1):31–8, 2005. [220] Y. Suzuki, R. Yamashita, S. Sugano, and K. Nakai. DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nucleic Acids Res, 32 Database issue:D78–81, 2004. [221] R Development Core Team. R: A language and environment for statistical computing, 2004. [222] G. Terai and T. Takagi. Predicting rules on organization of cis-regulatory elements, taking the order of elements into account. Bioinformatics, 20(7):1119–28, 2004. [223] D. Thieffry, D. A. Rosenblueth, A. M. Huerta, H. Salgado, and J. Collado-Vides. Definiteclause grammars for the analysis of cis-regulatory regions in E. coli. Pac Symp Biocomput, pages 441–52., 1997. [224] W. Thompson, M. J. Palumbo, W. W. Wasserman, J. S. Liu, and C. E. Lawrence. 
Decoding human regulatory circuits. Genome Res, 14(10A):1967–74, 2004. [225] B. A. Trakhtenbrot and Ya. M. Barzdin. Finite automata; behavior and synthesis. American Elsevier, New York, 1973. 189

[226] M. Tripodi, A. Filosa, M. Armentano, and M. Studer. The COUP-TF nuclear receptors regulate cell migration in the mammalian basal forebrain. Development, 131(24):6119–29, 2004. [227] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1988. [228] J. van Helden, B. Andre, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol, 281(5):827–42., 1998. [229] J. van Helden, A. F. Rios, and J. Collado-Vides. Discovering regulatory elements in noncoding sequences by analysis of spaced dyads. Nucleic Acids Res, 28(8):1808–18., 2000. [230] A. E. Vinogradov. Isochores and tissue-specificity. Nucleic Acids Res, 31(17):5212–20, 2003. [231] A. Wagner. A computational ”genome walk” technique to identify regulatory interactions in gene networks. Pac Symp Biocomput, pages 264–78, 1998. [232] J. Waldispuhl, B. Behzadi, and J. M. Steyaert. An approximate matching algorithm for finding (sub-)optimal sequences in S-attributed grammars. Bioinformatics, 18 Suppl 2:S250– 9, 2002. [233] H. Wan, Y. Xu, M. Ikegami, M. T. Stahlman, K. H. Kaestner, S. L. Ang, and J. A. Whitsett. Foxa2 is required for transition to air breathing at birth. Proc Natl Acad Sci USA, 101(40):14449–54, 2004. [234] W. W. Wasserman and J. W. Fickett. Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol, 278(1):167–81, 1998. [235] W. W. Wasserman, M. Palumbo, W. Thompson, J. W. Fickett, and C. E. Lawrence. Humanmouse genome comparisons to locate regulatory sites. Nat Genet, 26(2):225–8, 2000. [236] Wyeth W. Wasserman and Albin Sandelin. Applied Bioinformatics for the Identification of Regulatory Elements. Nat Rev Genet, 5(4):276–287, 2004. [237] R. H. Waterston, K. Lindblad-Toh, E. Birney, J. Rogers, J. F. Abril, P. Agarwal, R. Agarwala, R. Ainscough, M. Alexandersson, P. An, S. E. Antonarakis, J. Attwood, R. Baertsch, J. Bailey, K. 
Barlow, S. Beck, E. Berry, B. Birren, T. Bloom, P. Bork, M. Botcherby, N. Bray, M. R. Brent, D. G. Brown, S. D. Brown, C. Bult, J. Burton, J. Butler, R. D. Campbell, P. Carninci, S. Cawley, F. Chiaromonte, A. T. Chinwalla, D. M. Church, M. Clamp, C. Clee, F. S. Collins, 190

L. L. Cook, R. R. Copley, A. Coulson, O. Couronne, J. Cuff, V. Curwen, T. Cutts, M. Daly, R. David, J. Davies, K. D. Delehaunty, J. Deri, E. T. Dermitzakis, C. Dewey, N. J. Dickens, M. Diekhans, S. Dodge, I. Dubchak, D. M. Dunn, S. R. Eddy, L. Elnitski, R. D. Emes, P. Eswara, E. Eyras, A. Felsenfeld, G. A. Fewell, P. Flicek, K. Foley, W. N. Frankel, L. A. Fulton, R. S. Fulton, T. S. Furey, D. Gage, R. A. Gibbs, G. Glusman, S. Gnerre, N. Goldman, L. Goodstadt, D. Grafham, T. A. Graves, E. D. Green, S. Gregory, R. Guigo, M. Guyer, R. C. Hardison, D. Haussler, Y. Hayashizaki, L. W. Hillier, A. Hinrichs, W. Hlavina, T. Holzer, F. Hsu, A. Hua, T. Hubbard, A. Hunt, I. Jackson, D. B. Jaffe, L. S. Johnson, M. Jones, T. A. Jones, A. Joy, M. Kamal, and et al. Karlsson, E. K. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–62, 2002. [238] D. L. Wheeler, D. M. Church, S. Federhen, A. E. Lash, T. L. Madden, J. U. Pontius, G. D. Schuler, L. M. Schriml, E. Sequeira, T. A. Tatusova, and L. Wagner. Database resources of the National Center for Biotechnology. Nucleic Acids Res, 31(1):28–33, 2003. [239] C. T. Workman and G. D. Stormo. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput, pages 467–78, 2000. [240] T. Yada, M. Nakao, Y. Totoki, and K. Nakai. Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models. Bioinformatics, 15(12):987–93., 1999. [241] K. R. Yamamoto. Steroid receptor regulated transcription of specific genes and gene networks. Annu Rev Genet, 19:209–52, 1985. [242] I. Yanai, H. Benjamin, M. Shmoish, V. Chalifa-Caspi, M. Shklar, R. Ophir, A. Bar-Even, S. Horn-Saban, M. Safran, E. Domany, D. Lancet, and O. Shmueli. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics, 2004. [243] C. H. Yuh, H. Bolouri, and E. H. Davidson. 
Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science, 279(5358):1896–902., 1998. [244] M. Zavolan, S. Kondo, C. Schonbach, J. Adachi, D. A. Hume, Y. Hayashizaki, and T. Gaasterland. Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res, 13(6B):1290–300, 2003. [245] M. Zavolan, E. van Nimwegen, and T. Gaasterland. Splice variation in mouse full-length cDNAs identified by mapping to the mouse genome. Genome Res, 12(9):1377–85, 2002. 191

[246] F. Zhao, Z. Xuan, L. Liu, and M. Q. Zhang. TRED: a Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies. Nucleic Acids Res, 33 Database Issue:D103–7, 2005. [247] Q. Zhou and J. S. Liu. Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 20(6):909–16, 2004. [248] J. Zhu, J. S. Liu, and C. E. Lawrence. Bayesian adaptive sequence alignment algorithms. Bioinformatics, 14(1):25–39, 1998.

192

Appendix A

GLE yapp Specification

This is a pretty-printed version of the yapp grammar for the BCG concrete syntax, which is used to parse flat files containing a BCG grammar. Code fragments have been eliminated for a cleaner presentation, and the specification has been broken into sections to aid comprehension. Comments begin with a '#' and continue for the rest of the line. Literal characters are quoted. See GLE::Reader.yp for the complete specification.


# GRAMMAR
# ......................................................................
# A grammar is a commented bunch of statements. A grammar consists of
# three different kinds of statements: use statements, stream
# definitions, and productions.

grammar: required_comment statements
  ;

statements: statement statements
  |
  ;

statement: use_stmt
  | stream_stmt
  | production
  ;


# COMMENTS
# ......................................................................
# Human-readable annotation for the grammar. A comment can be null or
# can consist of some characters in a C-style comment. A comment can
# contain a leading PSTORE tag. C++-style comments are stripped out by
# the tokenizer.

comment: required_comment
  |
  ;

required_comment: COMMENT_BEGIN pstore any_chars COMMENT_END
  ;

pstore: PSTORE
  |
  ;

any_chars: ANY_CHAR any_chars
  |
  ;
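The comment handling described above can be illustrated with a minimal Python sketch of a tokenizer pre-pass. It is not the GLE tokenizer. It assumes that C-style comments are written /* ... */, that C++-style comments run from // to the end of the line, and that quoted literals do not span lines. The function name strip_cpp_comments is hypothetical.

```python
import re

# Assumed lexical shapes (not taken from GLE::Reader.yp):
#   /* ... */   C-style comment, kept because it carries grammar annotation
#   '...' "..." quoted literal, kept verbatim
#   // ...      C++-style comment, removed
_PIECE = re.compile(
    r"""
      (?P<block>/\*.*?\*/)               # C-style comment
    | (?P<quoted>'[^'\n]*'|"[^"\n]*")    # quoted literal on one line
    | (?P<line>//[^\n]*)                 # C++-style comment
    """,
    re.DOTALL | re.VERBOSE,
)


def strip_cpp_comments(text: str) -> str:
    """Remove // comments; leave /* */ comments and quoted literals untouched."""
    def keep_or_drop(match: re.Match) -> str:
        # Only the 'line' alternative is deleted; everything else is copied.
        return "" if match.group("line") is not None else match.group(0)

    return _PIECE.sub(keep_or_drop, text)


if __name__ == "__main__":
    sample = "/* keep // this */ A -> B; // drop\n@s -> '//' ;"
    print(strip_cpp_comments(sample))
```

Matching the block comment and the quoted literal before the // alternative is what keeps a // inside either of them from being treated as the start of a comment.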


# ANNOTATION GUIDE
# ......................................................................
# The guide is a set of key-value pairs that indicate how to convert a
# match of a production or nonterminal to a standard annotation
# feature. They are contained in a pair of curly braces. It is
# optional. Key-value pairs are separated by commas.

annotation_guide: '{' ag_terms '}'
  |
  ;

ag_terms: ag_term ag_term_list
  |
  ;

ag_term_list: ',' ag_terms
  |
  ;

ag_term: ATTR_KW '=' stream_value
  ;

# USE statements
# ......................................................................
# A means to include another grammar into the current grammar.

use_stmt: USE_KW ID
  ;
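As a rough illustration of the annotation guide rules above, the following Python sketch turns a guide string into a dictionary. It mirrors the grammar in accepting an empty guide, empty braces, and a trailing comma. The key names in the usage example are hypothetical, since the ATTR_KW spellings are not given here, and the sketch assumes that values contain no commas or equals signs.

```python
def parse_annotation_guide(text: str) -> dict:
    """Parse '{key=value, key=value}' into a dict (illustrative sketch only)."""
    text = text.strip()
    if not text:
        return {}  # annotation_guide may be empty
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("annotation guide must be wrapped in curly braces")

    guide = {}
    terms = text[1:-1].split(",")
    for position, term in enumerate(terms):
        term = term.strip()
        if not term:
            # ag_terms may be empty only at the very end ('{}' or 'a=1,').
            if position == len(terms) - 1:
                continue
            raise ValueError("empty term before a comma")
        key, separator, value = term.partition("=")
        if not separator or not key.strip() or not value.strip():
            raise ValueError(f"expected key=value, got {term!r}")
        guide[key.strip()] = value.strip()
    return guide


if __name__ == "__main__":
    # 'name' and 'score' are made-up keys used only for this example.
    print(parse_annotation_guide("{name=promoter, score=0.9,}"))
```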


# STREAM declarations
# ......................................................................
# Give the stream a name, a type, and initialization parameters. Can
# be specified either by a forward or backward arrow. Initialize to a
# constant value; these are used only for local scratch space. These
# will be removed if we eliminate the annotation feature.

stream_stmt: comment '@' ID HEAD initing ';'
  ;

initing: '=' stream_value
  | plugin parameters
  ;

plugin: ID
  ;

parameters: parameter parameters
  |
  ;

parameter: PARAM prm_assign
  ;

prm_assign: '=' stream_value
  |
  ;

stream_value: REAL_NUMBER
  | INTEGER
  | ID
  | VARIABLE
  | ANY_LITERAL
  ;


# PRODUCTIONS
# ......................................................................
# A production consists of a LHS, an arrow, and one or more RHSs. A LHS
# is just a nonterminal identifier as defined in the tokenizer. Should
# this allow for a stream id? See the tokenizer code to verify this,
# but TAIL is '-' and HEAD is '->'. Since '-' is used elsewhere in the
# syntax, the tokenizer figures out the context. We want to accept
# '->', '-->', or '--->' as a normal production.

production: comment lhs annotation_guide arrow rhss ';'
          ;
lhs: ID
   ;
arrow: TAIL bound_type HEAD
     | HEAD
     ;
bound_type: '[' bound ']'   # list
          | '{' bound '}'   # set
          | ''              # bag
          | '(' bound ')'   # ordinary
          | TAIL            # ordinary
          |                 # ordinary
          ;
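The arrow and bound_type rules explain why '->', '-->', and '--->' all denote an ordinary production: HEAD alone gives '->', TAIL followed by an empty bound_type and HEAD gives '-->', and TAIL TAIL HEAD gives '--->'. The Python sketch below classifies an arrow accordingly. It is a minimal illustration under stated assumptions, not the actual tokenizer: the function name `classify_arrow` is hypothetical, the arrow is assumed to be written as one string with no whitespace outside the brackets, and the bag form is left out because its delimiters are treated as unknown here.

```python
# Opening delimiter -> (closing delimiter, collection type), following the
# bound_type rule. The bag alternative is deliberately not modeled.
COLLECTION_BY_OPEN = {
    "[": ("]", "list"),
    "{": ("}", "set"),
    "(": (")", "ordinary"),
}


def classify_arrow(arrow):
    """Return (collection_type, bound_text) for an arrow string."""
    if arrow == "->":                      # arrow: HEAD
        return ("ordinary", "")
    if len(arrow) < 3 or not (arrow.startswith("-") and arrow.endswith("->")):
        raise SyntaxError(f"not an arrow: {arrow!r}")
    middle = arrow[1:-2]                   # text between TAIL and HEAD
    if middle in ("", "-"):                # '-->' and '--->'
        return ("ordinary", "")
    close, kind = COLLECTION_BY_OPEN.get(middle[0], (None, None))
    if close is None or len(middle) < 2 or not middle.endswith(close):
        raise SyntaxError(f"malformed bound in arrow: {arrow!r}")
    return (kind, middle[1:-1].strip())


for text in ("->", "-->", "--->", "-[500;;!]->", "-{}->"):
    print(text, classify_arrow(text))
```

The bound text is returned unparsed so that the bound rule can be handled separately.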

bound: size ';' opt_loc ';' loose
     |
     ;
loose: '*'   # need not be minimal
     | '!'   # must be minimal
     |       # not minimal by default
     ;


size: size_value units   # size bound
    |                    # no size bound
    ;
size_value: INTEGER
          | REAL_NUMBER  # best used with KB or MB
          ;
units: UNITS_BP
     | UNITS_KB
     | UNITS_MB
     |
     ;
opt_loc: location        # location bound
       |                 # no location bound
       ;
location: path_expression
        ;
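A bound has three semicolon-separated fields: a size, an optional location, and a minimality flag. The Python sketch below splits a bound into those fields and converts the size to base pairs. Several details are assumptions rather than facts from the grammar: the unit tokens are taken to be spelled "bp", "kb", and "mb"; a missing unit is taken to mean base pairs; 1 kb is taken as 1,000 bp and 1 mb as 1,000,000 bp; and the location is assumed to contain no semicolon. The names `size_in_bp` and `parse_bound` are hypothetical.

```python
import re

# Assumed unit spellings and multipliers (not confirmed by the grammar).
BP_PER_UNIT = {"": 1, "bp": 1, "kb": 1_000, "mb": 1_000_000}
SIZE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*$")


def size_in_bp(text):
    """Convert a size such as '1.5kb' to base pairs; None if no size bound."""
    if not text.strip():
        return None
    m = SIZE.match(text)
    if not m or m.group(2).lower() not in BP_PER_UNIT:
        raise SyntaxError(f"bad size: {text!r}")
    value, unit = m.groups()
    return round(float(value) * BP_PER_UNIT[unit.lower()])


def parse_bound(text):
    """Parse 'size ; opt_loc ; loose' into a dict; an empty bound is allowed."""
    if not text.strip():
        return {"size_bp": None, "location": None, "minimal": False}
    parts = [p.strip() for p in text.split(";")]
    if len(parts) != 3:
        raise SyntaxError("a bound has exactly three ';'-separated fields")
    size, location, loose = parts
    if loose not in ("", "*", "!"):
        raise SyntaxError(f"bad minimality flag: {loose!r}")
    return {
        "size_bp": size_in_bp(size),
        "location": location or None,   # left as an unparsed path expression
        "minimal": loose == "!",        # '*' and empty both mean not minimal
    }


print(parse_bound("1.5kb;;!"))
```

Real-valued sizes are rounded to whole base pairs, which fits the grammar's remark that REAL_NUMBER is best used with KB or MB.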


# PATH EXPRESSIONS
# ......................................................................
# This spec allows for different streams at lower levels of the path.
# Stream derivations will be from a single stream. Run-time derivations
# can be of mixed types, since a RHS reference to another stream will
# cause this. A contained-in relation will also allow switching from
# plug-in streams to the main stream. Thus a plain ID should be
# interpreted as the same stream, which defaults to the main stream at
# the start of the path. Once we've identified the stream, a *selector*
# can be used to filter annotations to get only those that match. Path
# expressions can also be used as RHS terms. In this case either the
# whole interval or just specific points can be selected using the
# *anchor points*.

path_expression: first_path_term other_path_terms anchor_points
               ;
first_path_term: first_id selector
               ;
first_id: STR_ID ID        # a specific stream
        | ANY_STR_ID ID    # any stream
        | MAIN_STR_ID ID   # the main stream
        ;
id: first_id               # full specification
  | ID                     # default to same stream
  ;
other_path_terms: path_relation more_path_bound
                |
                ;
path_relation: '/'         # contained in
             | '%'         # intersects with
             | '.'         # part of
             ;
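The path relations '/' and '%' can be read as predicates over annotation intervals. The Python sketch below is one possible reading, not the dissertation's implementation: annotations are assumed to be half-open (start, end) tuples on a single sequence, the term on the left of a relation is assumed to be the one tested against the term on the right, and the '.' (part of) relation is omitted because the grammar comment gives only its name. The function names are hypothetical.

```python
def contained_in(inner, outer):
    """True if interval `inner` lies entirely within interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersects(a, b):
    """True if half-open intervals `a` and `b` share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


# Relation symbol -> predicate; '.' (part of) is intentionally not modeled.
RELATIONS = {"/": contained_in, "%": intersects}


def select(candidates, relation, targets):
    """Keep the candidates related to at least one target interval."""
    test = RELATIONS[relation]
    return [c for c in candidates if any(test(c, t) for t in targets)]


exons = [(10, 20), (90, 110), (150, 160)]
genes = [(0, 100)]
print(select(exons, "/", genes))   # exons contained in a gene
print(select(exons, "%", genes))   # exons intersecting a gene
```

Under the half-open convention, intervals that merely touch at an endpoint do not intersect.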


more_path_bound: path_term other_path_terms
               ;
path_term: id selector
         ;
selector: '[' selector_terms ']'
        |
        ;
selector_terms: selector_term more_selector_terms
              ;
more_selector_terms: ',' selector_terms
                   |
                   ;
selector_term: ATTR_KW relation stream_value   # full spec
             | INTEGER                         # index (>0) or xedni (
