To Renate
Promotor:
Prof. dr. ir. Wim Van Criekinge
Dean:
Prof. dr. ir. Guido Van Huylenbroeck
Rector:
Prof. dr. Paul Van Cauwenberge
Members of the examination committee: Prof. dr. ir. Jacques Viaene Chairman Department Agricultural Economics Faculty of BioScience Engineering, Ghent University
Prof. dr. Els Van Damme Secretary Department Molecular Biotechnology Faculty of BioScience Engineering, Ghent University
Prof. dr. ir. Wim Van Criekinge Promotor Department Molecular Biotechnology Faculty of BioScience Engineering, Ghent University
Prof. dr. ir. Olivier Thas Department of Applied Mathematics, Biometrics and Process Control Faculty of BioScience Engineering, Ghent University
Dr. Manon van Engeland Department of Pathology Research Institute for Growth and Development, Maastricht University and University Hospital, Maastricht, the Netherlands
Prof. dr. Danny Geelen Department of Plant Production Faculty of BioScience Engineering, Ghent University
Dr. Joost Louwagie Product Development OncoMethylome Sciences, Leuven
Cellular reprogramming
ir. Maté Ongenaert
Promotor Prof. dr. ir. Wim Van Criekinge Lab. for Bioinformatics and Computational Genomics (BioBix) Department of Molecular Biotechnology Faculty of BioScience Engineering Ghent University Thesis submitted in fulfilment of the requirements for the degree of Doctor (PhD) in Applied Biological Sciences: cell‐ and gene biotechnology
Dutch translation of the title
Cellulaire herprogrammatie
Illustration on the cover
The Vitruvian Man (by Leonardo da Vinci, around 1487). It depicts a nude male figure in two superimposed positions with his arms and legs apart, simultaneously inscribed in a circle and a square. Da Vinci based his drawing on hints at correlations between ideal human proportions and geometry in Book III of the treatise De Architectura by the ancient Roman architect Vitruvius, hence its name. Overlaid is a representation of a double‐stranded piece of DNA with some cytosine residues methylated. The representation is so‐called ‘ASCII art’, containing only A, T, C and G, to represent a link to both the ‘computer world’ and the sequence world.
Artwork by Maté Ongenaert, Vitruvian Man image from Wikimedia Commons, photograph by Luc Viatour. ASCII art generated by text‐image.com (by Patrik Roos).
Printing
DCL Signs, Zelzate
ISBN 978‐90‐5989‐274‐3
Maté Ongenaert
Wim Van Criekinge
The author and the promoter give the authorisation to consult and to copy parts of this work for personal use only. Every other use is subject to the copyright laws. Permission to reproduce any material contained in this work should be obtained from the author.
Contents
Abbreviations
Acknowledgments
Introduction
Part 1: Epigenetics, DNA‐methylation, development and disease
Chapter 1: Genetics and epigenetics – introduction
1.1 Situation in molecular biology
1.2 Types of epigenetic modifications
1.2.1 DNA modifications
1.2.2 Histone modifications
1.3 Research objectives
Chapter 2: DNA‐methylation
2.1 Occurrence of DNA‐methylation
2.2 Mechanism of DNA‐methylation
2.3 Influence of nutrition
2.4 Detection of DNA‐methylation
Chapter 3: Functions of DNA‐methylation
3.1 Imprinting
3.2 Diseases caused by abnormal imprinting
3.2.1 Beckwith‐Wiedemann syndrome
3.2.2 Prader‐Willi/Angelman syndromes
3.3 Silencing of the female X‐chromosome
3.4 Silencing of junk DNA
3.5 RNA structures and methylation
Chapter 4: Methylation and influence on transcription
4.1 Interactions with DNA‐methylation
4.2 Protein complexes involved in the link DNA‐methylation – histone modification
4.3 The influence of the Polycomb group of proteins
Chapter 5: DNA‐methylation and cancer
5.1 Development of cancer and the role of DNA‐methylation
5.2 Cancer stem cell hypothesis and epigenetics
5.3 Cancer profiling based on DNA‐methylation
5.4 Uncovering the cancer methylome
5.5 Discovering epigenetic biomarkers
5.6 Early diagnostics
5.7 Stratification and personalized medicine
5.8 Epigenetics and cancer therapy
Part 2: DNA‐methylation, cancer and literature
Chapter 6: Introduction
Chapter 7: DNA‐methylation and literature analysis
Chapter 8: Intermezzo: Biological text mining
8.1 Introduction
8.2 Step 1: Perform automated literature queries
8.3 Step 2: Define what to search for: deal with ontologies, gene and protein lists and thesauruses of chemical compounds and diseases
8.4 Step 3: Identify keywords, annotation lists and concepts in literature results. Deal with textual variants and ambiguities and identify relationships in the results
8.5 Step 4: Rank, summarize and present the results
8.6 Discussion
8.7 Conclusion
Chapter 9: PubMeth: methylation database in cancer
9.1 Introduction
9.2 Filling up the database
9.3 Querying the database
9.3.1 Gene‐centric query
9.3.2 Cancer‐centric query
9.4 Performance of PubMeth, discussion and future
9.5 Acknowledgments
Chapter 10: Conclusion
Part 3: Genome‐wide selection of methylation markers
Chapter 11: Introduction
Chapter 12: Intermezzo: DNA methylatiemerkers helpen vroegtijdige opsporing van cervixcarcinoom (DNA methylation markers help early detection of cervical carcinoma)
Chapter 13: Discovery of methylation markers in cervical cancer, using relaxation ranking
13.1 Introduction
13.2 Material and methods
13.2.1 Primary cervical tissue samples
13.2.2 Cervical cancer cell lines
13.2.3 RNA and DNA isolation
13.2.4 Expression data
13.2.5 Relaxation ranking algorithm
13.2.6 DNA methylation analysis using COBRA and bisulfite sequencing
13.3 Results
13.3.1 The validation of the top 3000 probe‐list selected using relaxing high‐ranking
13.3.2 Validation of the 10 highest ranking candidate genes by COBRA
13.4 Discussion
13.5 Acknowledgments
Chapter 14: Exploring the cancer methylome using genome‐wide promoter analysis
14.1 Introduction
14.2 Materials and methods
14.2.1 Data sources
14.2.2 Broad‐analysis: genome‐wide promoter alignment
14.2.3 Deep analysis: specific binding patterns
14.2.4 Application of both approaches and experimental validation
14.3 Results
14.3.1 Broad‐analysis
14.3.2 Deep‐analysis
14.3.3 Application: Marker identification and experimental validation of proposed markers
14.4 Discussion
Chapter 15: Genome‐wide promoter analysis uncovers portions of the cancer methylome
15.1 Introduction
15.2 Materials and methods
15.2.1 Cell lines
15.2.2 5‐aza‐dC treatment of cells
15.2.3 Biotinylated RNA Probe Preparation and Hybridization
15.2.4 Analysis of Expression Data
15.2.5 BROAD analysis: Genome‐wide Promoter Alignment
15.2.6 DEEP analysis: Specific Binding Patterns
15.2.7 Tissue samples and DNA extraction
15.2.8 Bisulfite Genomic Sequence Analysis, Conventional MSP, QMSP
15.3 Results
15.3.1 Validation of Modified Approach in Cell Lines
15.3.2 Promoter Hypermethylation in Normal and Primary Tumor Tissues
15.3.3 Candidate Cancer Genes
15.3.4 New Targets of Aberrant Methylation in Major Types of Cancer by QMSP
15.4 Discussion
Chapter 16: Transcriptome‐wide promoter hypermethylation profiling in neuroblastoma
16.1 Materials and methods
16.1.1 Neuroblastoma cell lines
16.1.2 Microarray analysis
16.2 Results and discussion
Chapter 17: Predicting platinum response in ovarian cancer, using DNA methylation profiling
17.1 Introduction
17.2 Materials and methods
17.2.1 Samples
17.2.2 5‐aza‐dC treatment of cells
17.2.3 Biotinylated RNA Probe Preparation and Hybridization
17.2.4 Analysis of Expression Data
17.2.5 In‐silico analysis of top‐ranking probes
17.3 Results
17.3.1 Ovarian cancer methylation markers
17.3.2 Platinum resistance methylation markers
17.3.3 Platinum sensitivity methylation markers
17.4 Discussion
Chapter 18: Conclusions
Part 4: Reprogramming of human host cells by viruses
Chapter 19: Introduction
Chapter 20: Cervical cancer and the HPV family of viruses
20.1 Introduction
20.2 Materials and methods
20.2.1 Cell lines
20.2.2 Methylation‐specific digital karyotyping
20.2.3 Tag extraction and mapping
20.2.4 Real‐time MSP platform
20.3 Results
20.3.1 Tags, significantly different between libraries
20.3.2 Real‐time MSP
20.4 Discussion
Chapter 21: Conclusions
Other research projects
Supplementary data
Summary and future perspectives
Samenvatting en toekomstperspectieven (summary and future perspectives, in Dutch)
References
Curriculum vitae
Abbreviations
ACTB   beta‐actin
API    Application Programming Interface
AS     Angelman syndrome
BioC   BioConductor
BLAST  Basic Local Alignment Search Tool
BLAT   BLAST‐like alignment tool
bp     base pairs
BSP    Bisulfite sequencing product
BWS    Beckwith‐Wiedemann syndrome
C      Cytosine
CGI    CpG island
ChIP   Chromatin ImmunoPrecipitation
COBRA  Combined Bisulfite Restriction Analysis
CSC    cancer stem cell
CSS    Cascading Style Sheets
CSV    Comma Separated Values
DAC    5‐aza‐2’‐deoxycytidine
DBTSS  Database of Transcription Start Sites
DNA    Deoxyribonucleic acid
DNMT   DNA methyltransferase
FIGO   International Federation of Gynecology and Obstetrics
G      Guanine
GO     Gene Ontology
HBV    hepatitis B virus
HDAC   Histone deacetylase
HMT    Histone methyltransferase
HPV    Human Papilloma Virus
hrHPV  high‐risk HPV
HSIL   high‐grade squamous intraepithelial lesions
ICR    Imprinting Control Region
IPA    Ingenuity Pathway Analysis
IRAS   imidazoline receptor antisera‐selected
LINE   Long Interspersed Nuclear Elements
LOH    loss of heterozygosity
LOI    loss of imprinting
LSIL   low‐grade squamous intraepithelial lesions
MBD    Methyl‐CpG Binding Domain
MeDIP  Methylated DNA ImmunoPrecipitation
MeSH   Medical Subject Headings
miRNA  microRNA
MSDK   Methylation Specific Digital Karyotyping
MSP    Methylation specific PCR
NCBI   National Center for Biotechnology Information
NCI    National Cancer Institute
ncRNA  non‐coding RNA
O/E    Observed/Expected ratio
PAGE   Polyacrylamide Gel Electrophoresis
PcG    Polycomb Group
PCR    Polymerase Chain Reaction
PWS    Prader‐Willi syndrome
QMSP   quantitative MSP
SAGE   Serial Analysis of Gene Expression
SAH    S‐adenosylhomocysteine
SAM    S‐adenosyl‐L‐methionine
SINE   Short Interspersed Nuclear Elements
TSA    trichostatin A
TSG    Tumor suppressor gene
TSS    Transcription Start Site
XIC    X Inactivation Centre
Acknowledgments
This piece of text is the first in a long series of texts, and perhaps the most important part. Writing these paragraphs may well be a fairly strictly individual affair, but the ink would not be on the paper without a whole lot of people who supported me professionally or personally: without them, this work would simply not exist.
First of all, I owe many thanks to my promotor, Wim Van Criekinge. He is the one who sparked my interest in bioinformatics and always gave me his trust, and his enthusiasm and stream of ideas were very stimulating. Helping to shape the practical sessions of the bioinformatics course as co‐supervisor was an instructive and, above all, enjoyable change of pace; there too I was given his full trust.
I also owe many thanks to my other colleagues from the two islands and our outstation in the basement. Leander, Tim, Gerben, Tom, Peter, Joachim and Sofie made sure the atmosphere always stayed good. Even on less rosy days they kept supporting me; during the hours in the cafeteria there was sometimes some moaning and complaining, but above all pleasant chats. Thank you!
After our small gang of starters had escaped from the basement of Block E in true Prison Break style, with a few dozen cans of Pringles chips as provisions, I ended up in the large office on the ground floor of Block B. Sorry for all the demos, Nijntje sessions and the way the music quiz has been filled in since then… Thank you for the fun moments!
At our home base on the second floor I can always count on the administrative support of Fien and Sofie. To all other colleagues of the department as well: thank you for the pleasant time!
Besides my own colleagues, there are a number of other people who helped me to realise this work:
‐ Jasmien Hoebeeck, Katleen De Preter and Frank Speleman of Medical Genetics Ghent, for the enjoyable collaboration on the neuroblastoma part
‐ Lieselot Vercruysse and Guy Smagghe of the Laboratory of Agrozoology: thank you for the smooth collaboration
‐ Veerle Melotte and Manon van Engeland (Department of Pathology, Maastricht University): thank you for your trust
‐ An Nijs, Jean‐Pierre Renard, Gonda Verpooten, Geert Trooskens and Valérie Deregowski of Oncomethylome Sciences (Leuven), for the help with the practical experiments in Leuven and the pleasant collaboration on a whole range of projects
‐ Bea Schuurs, Ed Schuuring and Ate van der Zee (UMCG Groningen), for the work on cervical cancer
‐ Renske Steenbergen and Peter Snijders (VUMC, Amsterdam), for helping to set up the experiments with the HPV models in cervical cancer
‐ Kornelia Polyak, Min Hu and Noga Qimron (and co‐workers at Dana‐Farber Cancer Institute, Boston) for the execution of the MSDK experiments. It was a pleasure to stay there for two weeks!
‐ Mohammad Hoque, Marianna Brait and David Sidransky (and other colleagues) (Johns Hopkins, Baltimore): it was really a pleasure to work with you on the validation of the methylation markers from our computational approaches. It was really nice to get your feedback on our analysis methods: thank you for your confidence and useful feedback on our methodologies!
Beyond these mostly work‐related contacts, there are dozens more people who gave me the support I needed. Thanks to my parents, who gave me the opportunity to follow the education of my choice and supported me in all my decisions. Without the chances you gave me, this work would not have existed! Thank you.
Last of all, I thank Renate: for her support on difficult days and her patience whenever I wanted to explain the difference between RNA and DNA to her once again; proofreading articles must have given her a headache more than once. Without you I would not have been the same person. Words fall short of describing what you mean to me: I love you and dedicate this work to you.
Introduction
Ever since the discovery of the structure and function of DNA by Watson and Crick in 1953, tens of thousands of researchers all over the world have worked on genetics. About 50 years later, with the publication of the human genome sequence in the same journal, their work was highlighted in the media again.
Our blueprint seems to be composed of DNA strands, relatively simple molecular structures, each containing 3 billion base pairs. Transcription and translation of this DNA give rise to about 25,000 proteins with diverse functions, interacting with each other and controlling the complexity of the entire organism.
However, genetics alone cannot explain all observed events. Very recently, it became clear that what is considered junk DNA (such as miRNAs) might play an important role in controlling genetics. Also important are chemical modifications of the DNA, which seem able to fine‐tune and control genetics as we know it. These so‐called epigenetic modifications of DNA and histones are described in this work.
Part 1 describes this phenomenon, with a focus on DNA‐methylation, its functions, its relationship with gene silencing and its involvement in cancer development and progression.
In part 2, the existing knowledge of DNA‐methylation in cancer is summarized using text mining techniques, and a database of methylation in cancer is developed.
In part 3 we search for novel methylation markers in various cancer types, making use of different genome‐wide experimental techniques. Such experiments produce enormous amounts of data that contain a considerable amount of noise. Selecting the most promising markers from these results thus requires adapted (computational) sorting and selection methodologies. The results of the computational approaches that were developed were validated experimentally on both cell lines and primary samples, revealing novel methylation biomarkers in various cancer types (including cervical, ovarian, neuroblastoma, head and neck cancer, prostate, lung, ...).
Such an identified methylation ‘biomarker’ can be used to detect cancer development in early stages, to classify different patient groups (opening the road to personalized treatment) and to predict response to chemotherapy. Potentially, these markers can also be used as targets for ‘epigenetic therapy’: therapies that are able to alter the epigenetic state of key genes (in a more or less specific way), slowing down the development of cancer or making the patient more sensitive to chemotherapeutic or other therapeutic agents or treatments.
In part 4, a unique viral infection model system is used to investigate the influence of viral infection on the epigenetic state of the human host cells. Because in some cancer types (such as cervical cancer) infection with a virus plays a key role in the development of the tumor, the genes affected by viral infection may be very early diagnostic markers with high precision (both sensitivity and specificity).
I wish you an inspiring journey through the world of DNA‐methylation!
Maté Ongenaert – Lokeren/Ghent, January 2009.
What are epigenetics and DNA‐methylation? What are the functions of DNA‐methylation, and how do epigenetic changes contribute to the development of diseases such as cancer?
Part 1: Epigenetics, DNA‐methylation, development and cancer
Chapter 1: Genetics and epigenetics – introduction
學而不思則罔,思而不學則殆。 (To study and not think is a waste. To think and not study is dangerous) Confucius, Chapter II, The Analects
1.1 Situation in molecular biology
The human genome is composed of roughly 3 billion base pairs, of which less than 2 percent encode proteins (the genes). The central dogma of molecular biology profiled this tiny fraction of our DNA as the carrier of our genetic and heritable material. However, some events cannot be explained by taking only the genes into account. Some diseases are clearly heritable but seem to pick their patients at random; sometimes one of a pair of identical twins becomes ill while the other is not affected. Some cancer types develop because of a change in the activity of a gene that is not mutated. And why do most mammalian clones not survive?
The answer could partially be found in επιgenetics, literally a layer above the DNA. This layer, made of proteins or (simple) chemical compounds, does not alter the DNA sequence, but can make the difference between ill and healthy and controls properties of the organism. This epigenetic layer of information determines processes such as growth, aging and the development of cancer. Epi‐mutations (changes in the epigenetic layer) are believed to have a role in diabetes and schizophrenia.
Epigenetics is the study of mitotically heritable alterations in gene expression potential (i.e. alterations that are maintained when cells divide) that are not mediated by changes in DNA sequence.
1.2 Types of epigenetic modifications
Epigenetic modifications occur at the different levels at which genetic information is stored, ranging from modifications of the DNA strand itself to modifications of the proteins in the nucleosomes (the building blocks of chromatin). A simple overview is given in Figure 1.1.
Modifications mainly occur by addition or removal of simple chemical groups such as methyl and acetyl groups.
The most common changes are the addition of methyl marks to the DNA (DNA‐methylation) and the addition of different chemical residues to the histones, the proteins around which the DNA is wrapped. The latter modifications are mainly located on the histone tails.
[Figure 1.1 panels. DNA Methylation: methyl groups added to specific DNA bases repress gene activity. Histone Modification: many different modifications to histones, including methylation and acetylation, have been identified; these modifications can alter the activity of the DNA wrapped around them.]
Figure 1.1: Overview of the different levels that epigenetic modifications control
1.2.1 DNA modifications
The most described epigenetic change is DNA methylation. A methyl group is added to the DNA by a family of enzymes called the DNA methyltransferases (DNMTs). This epigenetic change influences transcription and other epigenetic changes and is discussed in detail in Chapter 2: DNA methylation.
1.2.2 Histone modifications
Chromatin is the physiological template of our genome. Its fundamental unit, the nucleosome core particle, consists of 146 DNA base pairs organized around an octamer consisting of two copies of each of the highly conserved core histone proteins H2A, H2B, H3 and H4. Dynamic modulation of chromatin structure, chromatin remodelling, is a key component in the regulation of gene expression, apoptosis, DNA replication and repair, and chromosome condensation and segregation. Disruption of these processes is intimately associated with human diseases, including cancer (Wang et al., 2007).
The histones can be modified in different ways: methylation, acetylation and phosphorylation are the most common modifications. These modifications can happen at distinct amino acid residues of the histones, as shown in Figure 1.2 (Inche and La Thangue, 2006).
[Figure 1.2 image: amino acid sequences of the tails of H2A, H2B, H3 and H4, with residues marked as acetylated (Ac), methylated (Me) or phosphorylated (P).]
Figure 1.2: Modifications and their locations on the tails of core histones, H2A, H2B, H3 and H4. Spheres indicate the residue that is modified and the type of modification; methylation of lysines (red), methylation of arginines (blue), acetylation of lysines (orange), phosphorylation of serine or threonine (green) (Inche and La Thangue, 2006)
1.3 Research objectives
The 'broad' title of this thesis (cellular reprogramming) may already give an indication of its wide range of research questions. These research questions are, however, all in the field of epigenetic modifications, and can be roughly divided into four sections, represented in this thesis by four parts.
Part 2 (DNA‐methylation, cancer and literature) focuses on the impact of one of the most described epigenetic changes (DNA‐methylation) on cancer initiation and development. A lot of research is performed in this area, and one of the first objectives is to collect, summarize and present the current knowledge in an easy‐to‐use interface. In this perspective, we have developed a database of methylation in cancer: PubMeth.
Part 3 (Genome‐wide selection of methylation markers) focuses on the careful selection of DNA methylation biomarkers in different cancer types. The initial discovery of DNA‐methylation biomarkers is important, as these initially discovered markers can be validated as being cancer‐specific. Such a marker can be used in the early detection of cancer and to predict whether a patient will benefit from certain therapies. Current technologies allow screening for biomarkers on a genome‐wide scale; however, a high‐performing screening and analysis strategy is needed to increase the success rate in the experimental validation studies. In this perspective, computational approaches are very useful in all stages: they may provide a better initial choice of possible markers, analysis of genome‐wide data, and help with the ranking and selection of possible candidates to validate after initial data analysis.
Part 4 (reprogramming of human host cells by viruses) focuses on the influence of viral infections on DNA methylation of the human host cells. To this end, we make use of keratinocyte cell lines that are transfected with proteins from a high‐risk HPV type (Human Papilloma Virus type 16), which is clearly associated with cervical cancer (> 99.7 % of patients are infected with an HPV type). Using a genome‐wide marker discovery technique (MSDK), we are able to identify markers whose methylation state changes after viral infection. This discovery shows that viruses are able to reprogram their host cells, and the markers identified can help understand this process and may be ideal early diagnosis targets.
Chapter 2: DNA methylation
Nino is late. Amélie can only see two explanations: 1) he didn't get the photo; 2) before he could assemble it, a gang of bank robbers took him hostage. The cops gave chase. They got away... but he caused a crash. When he came to, he'd lost his memory. An ex‐con picked him up, mistook him for a fugitive, and shipped him to Istanbul. There he met some Afghan raiders who took him to steal some Russian warheads. But their truck hit a mine in Tajikistan. He survived, took to the hills, and became a Mujaheddin. [Increasingly angry] Amélie refuses to get upset for a guy who'll eat borscht all his life in a hat like a tea cozy.
Narrator in “Le fabuleux destin d'Amélie Poulain” (2001)
2.1 Occurrence of DNA methylation
DNA‐methylation is a naturally occurring epigenetic change in normal cells. In the human genome, DNA‐methylation almost exclusively happens at cytosine residues within the dinucleotide CG. This symmetric dinucleotide is often indicated as CpG, where p represents the phosphate between the two nucleotides.
The largest part of the CpG dinucleotides (70 %) is methylated. This accounts for 0.75 – 1 % of all DNA bases (Bestor, 2000). Methylated cytosines are spread widely across the genome, with particularly high densities in the promoters of retroviruses and transposons that have accumulated in the genome. Unmethylated CpG sites are usually found in DNA regions with a high frequency of CpGs, the so‐called CpG islands. These CpG islands (about 29000 in the human genome) are distributed in a non‐random way, with a preference for the promoter and first exon regions of genes. This is illustrated in Figure 2.1.
Most CpG islands remain free of methylation and are associated with transcriptionally active genes. Some CpG islands are methylated and are associated with imprinted genes and genes on the inactivated X‐chromosome (Worm and Guldberg, 2002).
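The notion of "a high frequency of CpGs" can be made concrete with a small calculation. The sketch below is only an illustration: it computes the GC content and the observed/expected CpG ratio of a sequence and applies commonly used island thresholds (length of at least 200 bp, GC content of at least 50 %, ratio of at least 0.6). These thresholds and the function names are assumptions made for this example; they are not taken from this thesis.

```python
def cpg_stats(seq):
    """Return length, GC content and CpG observed/expected ratio of a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so a plain count is enough
    gc_content = (c + g) / n if n else 0.0
    # Expected number of CpGs if C and G occurred independently: (C * G) / N
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return {"length": n, "gc_content": gc_content, "cpg_obs_exp": obs_exp}


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the (assumed) thresholds to a single candidate region."""
    s = cpg_stats(seq)
    return (s["length"] >= min_len
            and s["gc_content"] >= min_gc
            and s["cpg_obs_exp"] >= min_obs_exp)


if __name__ == "__main__":
    print(cpg_stats("CGCGATATCGCG"))
    print(looks_like_cpg_island("CG" * 100))  # CpG-rich toy sequence
    print(looks_like_cpg_island("AT" * 100))  # no C or G at all
```

A genome‐wide scan would slide such a window along the sequence; that step is left out here to keep the example short.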
[Figure 2.1 labels: normal cell; promoter region; transcription start site (TSS); exons 1, 2 and 3; DNMT.]
Figure 2.1: Methylation in normal somatic cells: CG dinucleotides occur rarely in the genome and are in most cases methylated. However, in some regions (CpG islands) they occur in clusters, mostly in promoter regions. The CG dinucleotides in CpG islands are in most cases not methylated. The DNMT1 enzyme maintains methylation (Herman and Baylin, 2003)
2.2 Mechanism of DNA methylation
There are two different forms of DNA‐methylation: de novo methylation and maintenance methylation. In both situations, enzymes called DNA methyltransferases catalyze the reaction. In maintenance methylation (where the methylation of the newly synthesized strand is a copy of that of the parental strand), the enzyme involved is DNMT1. Other methyltransferases (DNMT3A and DNMT3B) catalyze de novo methylation (the addition of a methyl group to previously unmethylated DNA) (Bestor, 2000). S‐adenosyl‐L‐methionine (SAM) is used as the methyl‐donating substrate. This is illustrated in Figure 2.2.
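The difference between the two forms can be illustrated with a toy model. In the sketch below, each CpG site carries a pair of flags (parental strand, newly synthesized strand). The data structure and the function names are hypothetical and chosen only for this example; real enzymes are of course not this deterministic.

```python
def replicate(parental):
    """After replication, every CpG pairs the parental mark with an unmethylated new strand."""
    return {pos: (meth, False) for pos, meth in parental.items()}


def maintenance(duplex):
    """DNMT1-like step: methylate the new strand only where the parental strand is methylated."""
    return {pos: (old, new or old) for pos, (old, new) in duplex.items()}


def de_novo(duplex, targets):
    """DNMT3A/3B-like step: methylate both strands at the chosen sites, whatever their state."""
    return {pos: ((True, True) if pos in targets else state)
            for pos, state in duplex.items()}


if __name__ == "__main__":
    parental = {10: True, 25: False, 40: True}  # CpG position -> methylated?
    hemi = replicate(parental)                  # hemimethylated at positions 10 and 40
    copied = maintenance(hemi)                  # parental pattern restored on the new strand
    print(copied)
    print(de_novo(copied, {25}))                # previously unmethylated site 25 gains a mark
```

The toy model shows why maintenance methylation preserves an existing pattern through cell division, while de novo methylation is needed to create a new one.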
2.3 Influence of nutrition
During DNA‐methylation, the methyl group of S‐adenosylmethionine is transferred to cytosine residues. Mechanisms (such as biochemical pathways) that influence or control the amount and supply of methyl groups could therefore also influence DNA‐methylation. Indeed, there are interactions between dietary factors and DNA‐methylation. Such nutrients include folate, vitamin B6 and B12, methionine and choline.
As shown in Figure 2.3, folate has a central role in one‐carbon metabolism. Normally, a carbon unit from serine or glycine is transferred to tetrahydrofolate to form 5,10‐methylenetetrahydrofolate. Vitamin B6 is a necessary co‐factor of the enzyme in this reaction. This folate form can be used for the synthesis of purines, or it can be reduced and used to methylate homocysteine to form methionine. This latter reaction is catalyzed by a vitamin B12‐dependent enzyme.
[Figure 2.2 image: cytosine is converted to 5‐methylcytosine by DNMTs, with SAM as methyl donor.]
Figure 2.2: Methylation of the cytosine‐residue in the DNA
Methionine is then converted to S‐adenosylmethionine (SAM). SAM donates its labile methyl groups to more than 80 biological methylation reactions, such as the methylation of DNA, RNA and protein.
However, when the supply of folate is limited, the levels of homocysteine increase. An alternative reaction for methionine synthesis from homocysteine exists, but this pathway is not sufficient to compensate for diminishing SAM pools. The cellular levels of S‐adenosylhomocysteine (SAH) increase, as the equilibrium of the SAH‐homocysteine interconversion is in favour of SAH synthesis. Therefore, when homocysteine metabolism is inhibited (as in folate deficiency), cellular SAH will be increased. Increased SAH inhibits methyltransferase activity and, consequently, DNA methylation reactions. This inhibition of DNA methylation associated with inadequate dietary folate has also been associated with increased cancer susceptibility (Davis and Uthus, 2004).
The effects of folate deficiency on DNA methylation are highly complex: they appear to depend on cell type, target organ and stage of transformation, and are gene and site specific.
[Figure 2.3 diagram: dietary inputs feed the folate cycle (DHF, THF, 5,10‐CH2 THF, methyl‐THF; Ser and Gly; thymidylate and purine synthesis) and the methionine cycle (methionine, SAM, SAH, homocysteine; methylation of DNA etc.), with branches to choline, betaine and DMG and to cystathionine, Cys and GSH. Enzymes shown: GHMT, MTHFR, MS, BHMT, MAT, DNMT, SAHH, CBS. Cofactors shown: B2, B6, B12, Zn.]
Figure 2.3: Dietary factors, enzymes, and substrates involved in methyl metabolism. Enzymes are shown in italics with a box around them. These include glycine hydroxymethyltransferase (GHMT; EC 2.1.2.1); methylenetetrahydrofolate reductase (MTHFR; EC 1.5.1.20); 5‐methyltetrahydrofolate:homocysteine S‐methyltransferase (methionine synthase or MS; EC 2.1.1.13); betaine‐homocysteine S‐methyltransferase (BHMT; EC 2.1.1.5); methionine adenosyltransferase (MAT; EC 2.5.1.6); DNA methyltransferase (DNMT; EC 2.1.1.37); S‐adenosylhomocysteine hydrolase (SAHH; EC 3.3.1.1); and cystathionine‐ß‐synthase (CBS; EC 4.2.1.22) (Davis and Uthus, 2004).
Abbreviations: DHF, dihydrofolate; Ser, serine; Gly, glycine; Cys, cysteine; THF, tetrahydrofolate; B6, vitamin B6 or pyridoxine; B12, vitamin B12 or cobalamin; B2, vitamin B2 or riboflavin; 5,10‐CH2 THF, 5,10‐methyltetrahydrofolate; 5,10‐THF, 5,10‐methylenetetrahydrofolate; methyl‐THF, 5‐methyltetrahydrofolate; Zn, zinc; DMG, dimethylglycine; SAM, S‐adenosylmethionine; SAH, S‐adenosylhomocysteine; GSH, glutathione
2.4 Detection of DNA methylation
As with other detection methods in molecular biotechnology, there is no single method that is suited to every possible application. Do we want to detect methylation of only one gene or on a genome‐wide scale? Must the detection be highly sensitive or quantitative? Depending on the application, different detection methodologies are used. Most of the detection methodologies for DNA‐methylation are derived from standard molecular techniques, such as the use of restriction enzymes, PCR technologies and sequencing. Figure 2.4 shows which methodology is suitable, depending on the application.
[Figure 2.4 decision tree]
Global or locus‐specific?
‐ Global: cytosine extension; bisulfite sequencing of repetitive elements; HPLC
‐ Locus‐specific: genome‐wide or candidate gene?
  ‐ Genome‐wide: array‐based or not?
    ‐ Array‐based: antibody or 5mC binding; methylation‐sensitive restriction enzyme; bisulfite modification
    ‐ Not array‐based: RLGS; digital karyotyping; library and sequencing
  ‐ Candidate gene: quantitative or sensitive?
    ‐ Quantitative: allele‐specific or not?
      ‐ Allele‐specific: bisulfite cloning and sequencing
      ‐ Not allele‐specific: direct bisulfite sequencing (pyrosequencing, manual sequencing, mass array)
    ‐ Sensitive: MethyLight; MSP
Figure 2.4: Overview of the different detection methodologies for DNA‐methylation, according to the application (Shen and Waterland, 2007)
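The choices in Figure 2.4 can also be written down as a small lookup function. This is only a sketch of the tree as far as it can be read from the figure; the function name, argument names and string values are made up for this example.

```python
def suggest_methods(scope, scale=None, array_based=None, goal=None, allele_specific=None):
    """Walk the decision tree of Figure 2.4 and return candidate detection methods."""
    if scope == "global":
        return ["cytosine extension",
                "bisulfite sequencing of repetitive elements",
                "HPLC"]
    if scope != "locus-specific":
        raise ValueError("scope must be 'global' or 'locus-specific'")

    if scale == "genome-wide":
        if array_based is None:
            raise ValueError("array_based must be True or False")
        if array_based:
            return ["antibody or 5mC binding",
                    "methylation-sensitive restriction enzyme",
                    "bisulfite modification"]
        return ["RLGS", "digital karyotyping", "library and sequencing"]

    if scale == "candidate gene":
        if goal == "sensitive":
            return ["MethyLight", "MSP"]
        if goal == "quantitative":
            if allele_specific is None:
                raise ValueError("allele_specific must be True or False")
            if allele_specific:
                return ["bisulfite cloning and sequencing"]
            return ["direct bisulfite sequencing "
                    "(pyrosequencing, manual sequencing, mass array)"]
        raise ValueError("goal must be 'quantitative' or 'sensitive'")

    raise ValueError("scale must be 'genome-wide' or 'candidate gene'")


if __name__ == "__main__":
    print(suggest_methods("global"))
    print(suggest_methods("locus-specific", scale="candidate gene", goal="sensitive"))
```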
To be able to detect DNA‐methylation, some findings were crucial and are now applied in almost all detection techniques used. These findings are the use of methylation‐sensitive enzymes and bisulfite treatment of DNA.
Methylation‐sensitive restriction enzymes can be used to distinguish methylated and unmethylated sequences. The restriction activity of these enzymes depends on the methylation state of their restriction site. If the fragments are then separated by length, the difference between methylated and unmethylated DNA can be visualized.
Bisulfite modification (Hayatsu, 1976) of DNA deaminates all unmethylated cytosine residues to uracil, which is read as thymine after PCR amplification; methylated cytosines are not converted. Sequencing (bisulfite sequencing) or designing specific primers for the methylated (not converted) and unmethylated (converted) treated sequences can distinguish between methylated and unmethylated sequences. Bisulfite treatment combined with such specific primers is referred to as MSP (methylation‐specific PCR) (Herman et al., 1996).
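The effect of bisulfite treatment on a sequence can be mimicked in a few lines. The sketch below is an in silico illustration only: it treats one strand, writes converted cytosines directly as T (as they are read after PCR), and takes the methylated positions as a given set. The function name and the toy sequence are assumptions made for this example.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Convert every unmethylated C to T; methylated C (0-based positions) stays C."""
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


if __name__ == "__main__":
    seq = "ACGTCCGA"  # the CpG cytosines sit at positions 1 and 5
    methylated = bisulfite_convert(seq, {1, 5})
    unmethylated = bisulfite_convert(seq)
    print(methylated)    # ACGTTCGA
    print(unmethylated)  # ATGTTTGA
    # Positions where the two converted templates differ; MSP primers
    # are placed so that they cover such positions.
    print([i for i, (m, u) in enumerate(zip(methylated, unmethylated)) if m != u])
```

The two converted templates differ only at the CpG cytosines, which is what allows primers specific for the methylated or the unmethylated version to be designed.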
Other techniques used in this PhD thesis are COBRA (COmbined Bisulfite Restriction Analysis) and MSDK (Methylation Specific Digital Karyotyping).
COBRA (Xiong and Laird, 1997) is a technique based on bisulfite treatment, PCR and restriction digestion. First, the DNA is treated with sodium bisulfite. Next, a PCR reaction amplifies the region of interest. The PCR product is then digested with a restriction enzyme whose recognition site is lost when the original sequence was unmethylated. The fragments are separated using PAGE (PolyAcrylamide Gel Electrophoresis), transferred to a membrane via electroblotting and detected by hybridization with labeled oligonucleotides. Based on the signal intensities, the methylation degree can be quantified.
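As an illustration of the last step, the methylation degree can be estimated as the share of the total signal that is found in the digested fragments, since only templates that were methylated keep the restriction site. The sketch below assumes background‐corrected band intensities in arbitrary units; the function name and the numbers are made up for this example.

```python
def cobra_methylation_fraction(digested, undigested):
    """Estimate the methylated fraction from COBRA band intensities.

    digested:   intensities of the fragments produced by the restriction enzyme
    undigested: intensity of the uncut PCR product
    """
    cut_signal = sum(digested)
    total = cut_signal + undigested
    if total <= 0:
        raise ValueError("total signal must be positive")
    return cut_signal / total


if __name__ == "__main__":
    # Two digested bands and one undigested band (hypothetical values)
    fraction = cobra_methylation_fraction([30.0, 30.0], 40.0)
    print(f"{fraction:.0%} methylated")
```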
MSDK (Methylation Specific Digital Karyotyping) (Hu et al., 2006) is a genome‐wide methylation detection technology, similar to SAGE (Serial Analysis of Gene Expression). Genomic DNA is digested with a methylation‐sensitive mapping enzyme (such as AscI) and ligated to a biotinylated linker. Next, a fragmenting enzyme (such as NlaIII), a frequent cutter, is used. Unmethylated fragments are captured with streptavidin‐coated magnetic beads. Adapters that bind to the NlaIII overhang are ligated. The tagging enzyme MmeI (which cuts outside its recognition site, which is located in the adapter) is used to cleave 17 bp tags. The next steps in the analysis are:
‐ ligate ditags
‐ amplify by PCR
‐ release ditags from adapters using NlaIII
‐ ligate to form concatemers
‐ clone into a vector and E. coli bacteria using electroporation
‐ sequence the plasmid vector to obtain tag sequences
‐ map onto the genome and apply statistics: identify tags that are present in one library (not methylated) vs. less present (methylated) in another library
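The final step, comparing how often a tag occurs in two libraries, can be sketched with a simple count‐based test. The text does not say which statistic is applied, so the two‐sided Fisher exact test below is an assumption made for this illustration, as are the function names and the counts.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    The p-value is the summed probability of all tables with the same
    margins that are at most as likely as the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def tag_pvalue(count_a, total_a, count_b, total_b):
    """Compare the count of one tag between library A and library B."""
    return fisher_exact_two_sided(count_a, total_a - count_a,
                                  count_b, total_b - count_b)


if __name__ == "__main__":
    # Hypothetical tag seen 40 times in 20000 tags of library A
    # and 5 times in 20000 tags of library B
    print(tag_pvalue(40, 20000, 5, 20000))
```

In practice many tags are tested at once, so a correction for multiple testing would be needed on top of this; that is outside the scope of the sketch.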
[Figure 2.5 steps, shown for unmethylated and methylated DNA: digest with methylation‐sensitive mapping enzyme AscI; ligate to biotinylated linker; cleave with fragmenting enzyme NlaIII (CATG overhang); capture with streptavidin beads; ligate to LS adapters A and B; release tags with type IIS enzyme MmeI; ligate to form ditags; PCR amplification and NlaIII digestion; ligate to form concatemers.]
Figure 2.5: Overview of MSDK library generation (Hu et al., 2006)
Chapter 3: Functions of DNA methylation
There is a difference between knowing the path and walking the path
Morpheus in “The Matrix” (1999)
3.1 Imprinting
A subset of genes is expressed from only one of the two chromosome homologues. For a given imprinted gene, it is always either the maternal copy or the paternal copy that is expressed. This process is called genomic imprinting; to date about 70 imprinted genes have been identified. While there are a number of lone imprinted genes, the majority of identified imprinted sites are found in clusters. The clusters seem to contain at least one non‐coding RNA (ncRNA) gene. Each cluster is controlled by a single major cis‐acting element: the Imprinting Control Region (ICR). ICRs acquire differential methylation between the maternal and paternal copy in the germ cells and are able to control the imprinting of all genes within the cluster. The clusters can be divided into two distinct types: those whose ICR is methylated during oogenesis (on the maternally imprinted chromosome), and those whose ICR is methylated during spermatogenesis (Edwards and Ferguson‐Smith, 2007).
Methylation plays a crucial role in imprinting. The mechanism involved in methylation of the ICR is not completely resolved yet. However, DNMT3a (a de novo methylation enzyme) and DNMT3l (which is similar to DNMT3a) seem to be essential for methylation. How the difference arises between the paternal and maternal lineages is not fully understood. Researchers think that methylation as well as protection from methylation is involved.
Once the methylation mark is established, it is mainly maintained by the maintenance methyltransferase DNMT1. After fertilization, a genome‐wide reprogramming event occurs and DNA methylation is actively and immediately lost from the paternal pronucleus and progressively lost from maternally inherited chromosomes. However, DNA‐methylation imprints on both parental genomes are resistant to these events. Some factors that protect imprinted genes from demethylation have been identified, but the entire mechanism has not yet been revealed.
3.2 Diseases caused by abnormal imprinting
3.2.1 Beckwith‐Wiedemann syndrome
The Beckwith‐Wiedemann syndrome (BWS) is characterized by prenatal overgrowth, midline abdominal wall defects, ear creases or pits, neonatal hypoglycemia, and a high frequency of Wilms and other embryonal tumors, such as rhabdomyosarcoma and hepatoblastoma. It is the first disease where it became clear that, next to genetic factors, epigenetic factors were clearly involved. The risk of each of the clinical stigmata of BWS could be determined with respect to the molecular defects. The first of these is loss of imprinting (LOI) of the insulin‐like growth factor‐II gene (IGF2), an imprinted growth factor gene normally expressed only from the paternally inherited allele but in BWS expressed from both paternal and maternal copies (Feinberg, 2008).
3.2.2 Prader‐Willi/Angelman syndromes
The importance of parent‐specific imprinting becomes clear in the Prader‐Willi (PWS) and Angelman syndromes (AS). The two diseases are distinct neurodevelopmental disorders, each caused by several genetic and epigenetic mechanisms involving the proximal long arm of chromosome 15. Lack of a functional paternal copy of 15q11‐q13 causes PWS; lack of a functional maternal copy of UBE3A, a gene within 15q11‐q13, causes AS (Horsthemke and Wagstaff, 2008).
3.3 Silencing of the female X‐chromosome
In somatic cells of mammalian females, only one copy of the X‐chromosome is active; the other copy is silenced. During development, an initial choice is made in each cell as to which copy of the X chromosome will be inactivated, and this chromosome is stably silenced through the subsequent mitotic divisions. Thus, females are mosaics: some clusters of cells have the paternal X chromosome active, others have the maternal X chromosome active.
The process of X‐inactivation is complex and occurs in different stages. The region where the initial marking happens is called the X Inactivation Centre (XIC). The most important gene in this region is the X inactive Specific Transcript gene (Xist). The region includes elements that are required for the marking of an active X chromosome and the stable expression and localization of Xist from the inactive X chromosome. Prior to inactivation, Xist expression is detected as a small pinpoint of expression from both X chromosomes, until the transcripts accumulate and localize on the future inactive X, mediated at least in part by stabilization of the transcript. The puzzle of how one of two apparently equivalent X chromosomes can be chosen to express Xist, and thus be inactivated, remains to be solved. It is clear, however, that components of the XIC are involved, and it has been suggested that the levels of Xist RNA may influence which copy of X undergoes inactivation. DNA methylation has been implicated in the regulation of Xist in differentiated cells, since the promoter region of the transcriptionally active allele on the inactive X chromosome is unmethylated, whereas that of the transcriptionally inactive allele on the active X chromosome is methylated.
The coating of the inactive X‐chromosome with Xist is only the start of a whole cascade of reactions that change the structure of the chromatin and block transcription. These reactions include histone H3 lysine 9 methylation and hypoacetylation, H4 hypoacetylation and DNA‐methylation (Chang et al., 2006).
3.4 Silencing of junk DNA
Junk DNA comprises the parts of the genome that seem to have no function. On the contrary, they could be harmful if expressed. It consists of the remnants of viruses that were able to integrate into the genome and of genes that have duplicated and mutated. Transposable elements (which account for over 30 % of the human genome), such as transposons and retrotransposons, are examples. The latter include Long Interspersed Nuclear Elements (LINEs) and Short Interspersed Nuclear Elements (SINEs). Expression of these elements leads to genetic instability, as the elements are able to copy themselves and integrate into other parts of the genome. Therefore, it is important that these elements remain transcriptionally silenced. DNA‐methylation plays a crucial role in this regard.
The concept of methylation as a genome defence system assumes that retrotransposable elements are inherently detrimental to the genome. Protection and conservation of the integrity and fidelity of an organism's DNA serves as the overriding goal. Therefore, an important aspect of DNA methylation is its connection to the host‐defence system, which acts to offset the threats from these largely parasitic sequences by maintaining them in a methylated, transcriptionally silent state (Carnell and Goodman, 2003).
3.5 RNA structures and methylation
miRNAs (microRNAs) are short (around 22 nucleotides) RNA molecules encoded in the genome. They are transcribed into primary miRNAs and processed in the nucleus by the RNase III enzyme Drosha and DGCR8. The resulting precursor miRNAs form imperfect stem‐loop structures that are exported to the cytoplasm by Exportin‐5, where they are further processed by the RNase III enzyme Dicer into the mature and functional miRNAs. These miRNAs have the ability to bind to their target mRNA sequences with complete complementarity, which can lead to degradation of this target.
miRNAs are linked with cancer: they are reported to act either as oncogenes or as tumor suppressor genes (Chuang and Jones, 2007).
miRNAs can also be involved in establishing DNA methylation. In Arabidopsis, two miRNAs (miR‐165 and miR‐166) are involved in the methylation of the PHB gene and change the chromatin structure. It is also predicted that the main DNMTs are potential targets of miRNAs. miRNAs may regulate chromatin structure as well by regulating key histone modifiers. This complex interplay between epigenetics and miRNA is schematized in Figure 3.1.
[Figure 3.1 labels: DNA‐methylation (DNMTs; 5‐Aza‐CdR); miRNAs; translational suppression; HDACs, e.g. HDAC 4 (PBA); chromatin remodeling.]
Figure 3.1: The interplay of epigenetics and miRNAs (Chuang and Jones, 2007)
Chapter 4: Methylation and influence on transcription
Is that what your little note says? It must be hard living your life off a couple of scraps of paper. You mix your laundry list with your grocery list you'll end up eating your underwear for breakfast.
Natalie in “Memento” (2000)
4.1 Interactions with DNA methylation
Initially it was proposed that methylation could alter or interfere with the correct binding of nuclear factors (such as transcription factors) to their targets. In a number of cases (mostly solitary CGs that become methylated) this mechanism is indeed involved. An alternative mechanism is one whereby transcriptional repressors selectively recognize methylated CpGs. In this context, researchers were able to identify proteins that specifically bind to symmetrically methylated CpGs. This family of proteins contains a Methyl‐CpG Binding Domain (MBD). This group of proteins also seemed to be involved in gene silencing through changes in the histone modification pattern. Previously, it had already been shown that DNA methylation patterns are mechanistically linked to gene silencing through changes in the histone code. The MBD‐containing proteins link these two epigenetic changes (Ballestar and Esteller, 2005). A number of MBD‐containing proteins have been identified, such as MECP2, MBD1‐4, BAZ2A and BAZ2B.
The ability of MBD proteins to repress transcription is fundamental. As mentioned earlier, methylated DNA is associated with transcriptional repression and inactive chromatin. An initial hypothesis was that the binding of these factors could alter the chromatin structure, thereby denying access to the transcriptional machinery. Further studies have demonstrated that MeCP2‐dependent repression is mediated through the recruitment of histone deacetylases and histone lysine methyltransferases, as shown in Figure 4.1 (Worm and Guldberg, 2002).
Histone deacetylases (HDACs) deacetylate the histones, causing the chromatin to condense and become inaccessible to the transcription machinery. This chromatin remodelling causes transcriptional silencing.
Figure 4.1: Linking DNA‐methylation and inactive chromatin. Methyl‐binding domains bind onto methylated DNA and recruit histone deacetylases, turning active chromatin into condensed chromatin (Worm and Guldberg, 2002)
4.2 Protein complexes involved in the link DNA methylation – histone modification
The interaction between DNA‐methylation and histone modifications through the MBD‐containing proteins is very complicated, as different protein complexes are involved.
One such complex (Mi‐2/NuRD histone deacetylase) interacts with MBD2 (also known as the MeCP1 complex). This complex contains the histone deacetylase complex but also ATP‐dependent chromatin‐remodelling subunits. In addition, MBD3 is also incorporated in the complex. The interplay between DNA‐methylation and histone modification through protein complexes is illustrated in Figure 4.2.
Figure 4.2: The role of methyl‐CpG‐binding domains (MBDs) in silencing methylated tumor‐suppressor genes. In cancer cells, many tumor‐suppressor genes undergo aberrant hypermethylation at their CpG islands, and many different elements can be recruited: the 4 MBD proteins involved in transcriptional repression, recruitment of histone deacetylases (HDAC) and histone methyltransferases (HMT), binding to both methyl‐CpG‐rich and ‐poor sequences, as well as modulation of the binding by post‐translational modification. Changes in the chromatin of these genes lead to transcriptional silencing (Ballestar and Esteller, 2005)
4.3 The influence of the Polycomb group of proteins
The establishment and maintenance of epigenetic gene silencing is fundamental to cell determination and function. Apart from DNA methylation systems, a group of proteins, the Polycomb Group (PcG), forms a conserved system to establish gene silencing.
These proteins, first discovered in Drosophila, are part of multiprotein complexes that include HDACs (histone deacetylases) and HMTs (histone methyltransferases). The PRC2 complex, containing EZH2, is capable of methylating the histone H3 tail at lysine‐9 (K9) and more prominently at K27. Both histone marks cause gene silencing (Lund and van Lohuizen, 2004).
In addition, the Polycomb protein EZH2 has been shown to associate with DNMTs and is required to establish DNA methylation in a subset of target genes (Vire et al., 2006). This shows that the Polycomb proteins may serve as recruiters for DNMTs, involved in the hypermethylation of tumor suppressor genes, highlighting another connection between various epigenetic silencing mechanisms. A summary of the networks in which Polycomb plays a role is given in Figure 4.3 (Sparmann and van Lohuizen, 2006).
[Figure 4.3 labels: PRC2 (EZH2) methylates H3K27 (Me) on the histone tails of a nucleosome; PRC1 binds the mark; H2AK119 ubiquitylation (Ub) by PRC1; inhibition of transcription (Pol II); chromatin compaction; recruitment of DNMTs by PRC2.]
Figure 4.3: Binding of the PRC2 (Polycomb repressive complex 2) initiation complex to the Polycomb group (PcG) target genes induces enhancer of zeste homologue 2 (EZH2)‐mediated methylation (me) of histone proteins, primarily at lysine 27 of histone H3 (H3K27). PRC1 is able to recognize the trimethylated H3K27 (H3K27me3) mark through the chromodomain of Polycomb (PC). This interaction might bring neighbouring nucleosomes into the proximity of the PRC2 complex to facilitate widespread methylation over extended chromosomal regions. Although the precise mechanisms for PRC‐mediated stable gene silencing are still poorly understood, they are proposed to involve direct inhibition of the transcriptional machinery, PRC1‐mediated ubiquitylation (Ub) of H2AK119, chromatin compaction and recruitment of DNA methyltransferases (DNMTs) to target gene loci by PRC2. Pol II, RNA polymerase II (Sparmann and van Lohuizen, 2006)
Chapter 5: DNA methylation and cancer
[running] Okay, so what am I doing? [sees man also running] I'm chasing this guy. [man shoots] Nope. He's chasing me.
Leonard Shelby in “Memento” (2000)
5.1 Development of cancer and the role of DNA-methylation

Cancer results from the uncontrolled growth of cells. The cell division process of a normal cell is strictly controlled at different stages. Thus, to become a progenitor cancer cell, a cell has to bypass these control mechanisms, which include control of growth and cell division and control of programmed cell death (apoptosis).

The alteration of normal cell functions is caused by both genetic and epigenetic failures and defects that either activate proto-oncogenes or inactivate tumour suppressor genes. Proto-oncogenes are genes that, in normal cells, act as components of growth-promoting signalling pathways (growth factors, growth factor receptors, intracellular signalling molecules and transcription factors) or are anti-apoptotic genes, angiogenesis-promoting genes, telomerase (TERT) and invasion- and metastasis-promoting genes. If these genes are activated more strongly than intended, this can promote tumour growth and stimulate angiogenesis and metastasis. This activation can be caused by single base pair changes (SNPs: Single Nucleotide Polymorphisms) that influence the expression or activation. Other possibilities are chromosomal translocations, whereby the gene is inserted in a more active region, or the formation of an oncogenic fusion protein.

Silencing of tumour suppressor genes (TSGs) is caused by mutations, chromosomal aberrations and DNA-methylation. TSGs are genes involved in important control pathways that prevent a somatic cell from turning into a cancer cell. Their functions are cell cycle control, apoptosis, and control of cell adherence and communication. To silence a TSG, two consecutive alterations are required, as the copies of the TSG on both chromosomes must be targeted. The silencing of TSGs by two genetic or epigenetic events is known as the Knudson two-hit hypothesis, shown in Figure 5.1.

In hereditary cancer types, one of the hits is genetic and is passed on to the offspring, so only one other hit is required. In sporadic cancers, two consecutive hits are required. The first hit is either a genetic one (such as a point mutation) or an epigenetic one. The second hit is often a chromosomal defect (such as Loss Of Heterozygosity, LOH) or is caused by DNA-methylation. The number of cancer-related genes affected by epigenetic inactivation equals or exceeds the number that are inactivated by mutation. Many genes modified by promoter hypermethylation in cancers have tumour-suppressor function. Some important pathways whose function is disabled by DNA-methylation of important genes in the pathway are listed in Table 5.1 (Esteller, 2007b). An overview of the genes that have been described as hypermethylated in cancer is given in Figure 5.2.
Figure 5.1: Overview of the Knudson two-hit hypothesis. The first hit in non-heritable cancer types can either be a mutation or a methylation; the second hit can be a methylation or a chromosomal defect such as Loss Of Heterozygosity (LOH). Resulting combinations shown in the figure: mutation + LOH, mutation + methylation, methylation + LOH, biallelic methylation.
Table 5.1: Important pathways that are altered during cancer initiation or development by DNA-hypermethylation of genes in the pathway (Esteller, 2007a)

Pathways                      Representative hypermethylated genes
DNA repair                    hMLH1, MGMT, WRN, BRCA1
Hormone response              Receptors of estrogen, progesterone, androgen, prolactin and thyroid-stimulating hormones
Vitamin response              RARB2, CRBP1
Ras signaling                 RASSF1, NORE1A
Cell cycle                    p16, p15, Rb
P53 network                   P14, p73, HIC1
Cell adherence and invasion   E-cadherin, H-cadherin, FAT cadherin, EXT1, SLIT2, EMP3
Apoptosis                     TMS1, DAPK1, WIF1, SFRP1
Wnt signaling                 APC, DKK1, IGFBP3
Tyrosine kinase cascades      SOCS1, SOCS3, SYK
Transcription factors         GATA4, GATA5, ID4
Homeobox genes                PAX6, HOXA9
Other pathways                GSTP1, LKB1, THBS14, COX2, SRBC, RIZA, TPEF, SLC5A8, Lamin A/C
microRNAs                     miR-127 (target: BCL6), miR-124a (target: CDK6)
Figure 5.2: Chromosomal location of genes whose promoter region is described as hypermethylated in different cancer types (Esteller, 2007a)
5.2 Cancer stem cell hypothesis and epigenetics

The multistep model of carcinogenesis (Knudson two-hit hypothesis, as shown above) requires a long-living cell in which multiple genetic or epigenetic hits occur. As an alternative, it might be possible that the progenitors of stem cells, which normally undergo a limited number of cell divisions, acquire the capacity to self-renew. These so-called cancer stem cells (CSCs) subsequently become the long-living targets, acquiring (epi-)genetic lesions. Like normal adult stem cells, CSCs can divide indefinitely and give rise to both more CSCs and progeny that can differentiate into the different cell types in a tumour (Tan et al., 2006).

The remaining question is how a normal stem cell acquires unlimited division capability and becomes a cancer stem cell. Recently, genes have been identified that are targeted for transcriptional repression in human embryonic stem (ES) cells. This repression is caused by the PcG (Polycomb group) proteins SUppressor of Zeste 12 (SUZ12) and Embryonic Ectoderm Development (EED), which form the Polycomb repressive complex 2 (PRC2) and which are associated with nucleosomes that are trimethylated at Lys27 of histone H3 (H3K27).

It seems that genes targeted by Polycomb in ES cells have a high chance of being cancer-specifically methylated. The predisposition of ES cell PRC2 targets to cancer-specific DNA-hypermethylation suggests crosstalk between PRC2 and de novo DNA methyltransferases in an early precursor cell with a PRC2 distribution similar to that of ES cells. The precise developmental stage and type of cell in which such crosstalk might occur are unknown, and this cell might not be an embryonic stem cell. This crosstalk between Polycomb repression and aberrant DNA-methylation is shown in Figure 5.3 (Widschwendter et al., 2007).
Figure 5.3: A model for the progression of epigenetic marks from reversible repression in ES cells to aberrant DNA methylation in cancer precursor cells and persistent gene silencing in cancer cells (Widschwendter et al., 2007)
5.3 Cancer profiling based on DNA-methylation

Important tumour suppressor genes can be silenced by DNA-hypermethylation. However, this happens in a non-random way: in a given cancer type, DNA-hypermethylation in the promoter regions of the same genes can be observed in many patients. The DNA hypermethylome (the set of genes that are methylated) seems to depend on the cancer type, the subtype and, in some cancer types, even the stage of cancer development. These cancer-specific methylation patterns have been used in some studies to classify cancer (sub)types: DNA-methylation information can be used to profile the different cancer types. This is illustrated in Figure 5.4 (Paz et al., 2003) where, based on methylation analysis in the promoter region of 15 genes, cell lines of different cancer types can be correctly classified.

Figure 5.4: Hierarchical clustering of human cancer cell lines by CpG island promoter hypermethylation. In the top and left parts, the genes and cell lines analyzed are indicated, respectively. In the panel, red indicates hypermethylated CpG island, green indicates unmethylated CpG island, and black indicates homozygous deletion. Different cell types are indicated by colors: colon (blue), breast and prostate (dark green), lung (pink), renal (gray), head and neck (light green), leukemia (light blue), melanoma (yellow), bladder (light violet), glioma (dark violet), lymphoma (magenta), and nontransformed cell lines (red) (Paz et al., 2003)

There are large differences in the methylation frequencies of the main hypermethylated genes across various cancer types. Combining this information, one can determine a typical methylation profile for each cancer type (Esteller, 2007b).
Figure 5.5: A CpG island hypermethylation profile of human cancer. Y-axis, frequency of hypermethylation for each gene in each primary cancer type (Esteller, 2007a)

5.4 Uncovering the cancer methylome

With the availability of genome-wide screening methods, these methods have been adopted for the detection of DNA-methylation. The purpose of these large-scale genome-wide analyses is to discover which genes are methylated in the investigated cancer types and in which patient groups and cancer stages this occurs. Analyses of this kind can uncover large portions of the so-called cancer methylome at once and give insight into the biology of the tumours and their progression. Different patient groups can be identified as well, which could be useful for future investigations towards personalized medicine.

Different methodologies have been used to detect methylation in a genome-wide manner. Commonly used are array-based techniques (such as Differential Methylation Hybridisation, DMH, which uses different probes against which the different products after bisulfite treatment can hybridize). Other studies rely on the use of antibodies (such as ChIP (Chromatin ImmunoPrecipitation), which recognizes histones, and MeDIP (MEthylated DNA ImmunoPrecipitation), which directly recognizes methylated cytosine). Using high-resolution tiling arrays, the binding sites of the antibodies can be detected.

Also commonly used are DAC (5-aza-2'-deoxycytidine) and TSA (trichostatin A) treatments. DAC is a nucleoside analogue that is incorporated into the DNA instead of cytosine. The main difference is that DAC cannot be methylated, so the initial methylation signal will be lost. TSA is an inhibitor of HDACs (Histone DeACetylases) and prevents the histones from being deacetylated and thus the chromatin from condensing.

The difference in expression of the genes is measured (e.g. using an expression micro-array) before and after treatment. If a gene was initially not expressed and is reactivated after treatment with DAC and/or TSA, its silencing could be caused by DNA-methylation.
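The selection logic of such a before/after treatment experiment can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the gene names, the expression values and the detection threshold of 1.0 are hypothetical placeholders, not data or parameters from the studies cited.

```python
def reactivated_genes(before, after, threshold=1.0):
    """Return genes that are silent before treatment and expressed afterwards.

    `before` and `after` map gene names to expression levels; `threshold`
    is an assumed cut-off above which a gene counts as expressed.
    """
    return sorted(
        gene
        for gene, level in before.items()
        if level < threshold and after.get(gene, 0.0) >= threshold
    )


# Hypothetical expression values before and after DAC and/or TSA treatment.
untreated = {"GENE_A": 0.1, "GENE_B": 5.0, "GENE_C": 0.2}
treated = {"GENE_A": 4.0, "GENE_B": 5.5, "GENE_C": 0.3}

candidates = reactivated_genes(untreated, treated)
```

Genes returned by such a filter are only candidates: reactivation suggests, but does not prove, that the silencing was caused by DNA-methylation.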
In a genome-wide sequencing of cancer genes, Sjoblom et al. (2006) observed that newly discovered gene mutations in colon and breast cancers generally had a low incidence, with 90% of the genes identified harbouring a mutation frequency of less than 10%. An epigenetic study (Schuebel et al., 2007) shows that about half of these mutated genes are methylated at a much higher frequency, as shown in Figure 5.6. Also, a much higher number of candidate hypermethylated genes was found in comparison with mutated genes.

Both observations show that this epigenetic change might provide an alternative inactivating route to mutations for many tumour suppressor genes.

Figure 5.6: Relationship between methylation status, analyzed by MSP, and mutation for 13 genes (Sjoblom et al., 2006)
5.5 Discovering epigenetic biomarkers

According to the World Health Organisation (WHO), a biomarker is a cellular or molecular indicator of exposure, health effects, or susceptibility. Biomarkers can be used to measure internal dose, biologically effective dose, early biological response, altered structure or function, susceptibility. From the epigenetic perspective, a biomarker could thus be the methylation state of the DNA at certain residues, or the chemical modification of the histone tails. Depending on the existing knowledge and the purpose of the biomarker discovery study, the different attempts in cancer can be divided into a limited number of main categories.

At first, biomarker discovery is focused on finding the difference between cancer and normal samples. A good biomarker is methylated in the cancer samples but not in normal samples, and is preferably detectable from the early stages on. Some studies try to compose a panel of biomarkers to increase the sensitivity and/or selectivity. These initial candidates need further investigation and validation on a larger number of samples (possibly including different cancer stages) in order to identify candidates for early diagnosis, survival analysis or stratification.

In later stages, the influence of DNA-methylation profiles on therapies and survival can be determined. In addition, possible targets for epigenetic therapy can be identified, for which clinical trials should be designed, including large patient and control groups.
5.6 Early diagnostics

As very sensitive methodologies exist to detect DNA-methylation of specific gene promoters, a few cells with a changed methylation profile of these genes can be detected among thousands of normal cells. This makes DNA-methylation an ideal screening tool to detect the development of cancer before any clinical symptom can be observed. This early detection methodology is especially useful if non-invasive samples containing tumour DNA can be obtained. These samples can include blood or other body fluids, semen, urine or faeces from patients (Paluszczak and Baer-Dubowska, 2006).

Two parameters are important in this perspective: sensitivity and selectivity. The sensitivity is the ability to detect the true positives (cancer samples in this case), while the selectivity (often called specificity) is the ability to correctly identify normal tissues as negative, so that cancer samples and normal tissues are distinguished. Sensitivity and selectivity should be as high as possible with a minimal set of genes whose methylation status has to be tested. When this number is relatively low, the cost of analyzing the samples will be low and the test will be applicable to screening high-risk patient populations.
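To make the two parameters concrete, the following Python sketch computes them for a marker panel. It assumes, purely for illustration, that a sample is called positive as soon as any gene in the panel is methylated; the gene names and sample calls are hypothetical, and "selectivity" is computed as what is usually called specificity.

```python
def panel_positive(sample, panel):
    """A sample tests positive if any marker gene in the panel is methylated."""
    return any(sample.get(gene, False) for gene in panel)


def evaluate(cancer_samples, normal_samples, panel):
    """Return (sensitivity, selectivity) of a marker panel.

    sensitivity = true positives / all cancer samples
    selectivity = true negatives / all normal samples
    """
    true_pos = sum(panel_positive(s, panel) for s in cancer_samples)
    true_neg = sum(not panel_positive(s, panel) for s in normal_samples)
    return true_pos / len(cancer_samples), true_neg / len(normal_samples)


# Hypothetical methylation calls (True = methylated) for two marker genes.
cancer = [
    {"GENE_A": True, "GENE_B": False},
    {"GENE_A": False, "GENE_B": True},
    {"GENE_A": False, "GENE_B": False},
    {"GENE_A": True, "GENE_B": True},
]
normal = [
    {"GENE_A": False, "GENE_B": False},
    {"GENE_A": False, "GENE_B": True},
]

single_marker = evaluate(cancer, normal, ["GENE_A"])
two_markers = evaluate(cancer, normal, ["GENE_A", "GENE_B"])
```

In this toy data set, adding the second marker detects more cancer samples but also flags a normal sample, which illustrates why sensitivity and selectivity have to be weighed against each other when composing a panel.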
5.7 Stratification and personalized medicine

Within one tumour (sub)type, the identified methylation biomarkers often reveal distinct patient population groups. These groups, as found by clustering methodologies, often correspond to discrete histological cancer subtypes.

DNA methylation profile assessment in cancer allows subtype discrimination, which is often connected with risk and prognosis estimation. In addition to this patient stratification, methylation profiles can also be used to predict response to treatment (such as chemotherapy). This could be used to treat patients based on their (epi-)genetic profile, often referred to as personalized medicine. Currently, truly individualized medicine is not applied, but different patient groups (as identified by their methylation profile) receive different treatments, or one decides that some groups benefit from adding or omitting certain treatment strategies. An example is the application of radiation in combination with chemotherapy.

Both early detection and patient stratification are of great benefit, and a good biomarker combination provides both. Applying the treatment with the highest chance of success at a very early stage is crucial in stopping cancer development, metastasis and recurrence.

Also in the development of novel therapies, the discovery of patient groups that respond differently to the active components may speed up the entire trial and registration process. In the different clinical trial stages, the therapy can be applied directly to patient groups with an epigenetic profile that seems to benefit from the treatment. In these patient groups, it is easier to demonstrate the potential of the novel therapy: fewer patients would be needed to meet the statistical thresholds.

5.8 Epigenetics and cancer therapy

Epigenetic events such as DNA-methylation and histone modifications seem to be highly involved in the development of cancer. Therefore, (targeted) therapies could be used to affect these DNA-methylation and histone changes. There are a number of treatments in a variety of cancer types, both solid tumours and hematological malignancies.

The epigenetic drugs currently available are divided into two separate classes: DNMT inhibitors (DNA demethylating drugs) and HDAC inhibitors (chromatin remodelling drugs). In both classes, different chemical groups have been discovered, as summarized in Table 5.2 (Peedicayil, 2006).

DNMT inhibitors prevent the DNA methyltransferases from methylating the DNA. This causes demethylation after cell divisions. If this causes the silenced tumour suppressor genes to be activated again, this would be very beneficial in stopping tumour progression. Depending on their chemical structures, the DNMT inhibitors are divided into three subgroups.

The nucleoside analogues are incorporated into the DNA instead of cytosine and cannot be methylated. These are very potent inhibitors of DNMT, although the most potent ones (such as 5-aza-CR) are also very cytotoxic and are administered in very low doses. Recently, Zebularine was shown to be less cytotoxic.

In the search for alternatives to the cytotoxic nucleoside analogues, other (non-nucleoside) inhibitors of DNMT have been found, such as procaine. A completely different strategy is to use antisense constructs to silence the DNMTs. These short oligonucleotides hybridize with the mRNA, making it inactive.

HDAC inhibitors inhibit the histone deacetylases, which along with histone acetyltransferases (HATs) help maintain the acetylation status of the histones. Various (small) compounds have been shown to have HDAC inhibitory effects.

Table 5.2: Classification of epigenetic drugs with therapeutic potential (Peedicayil, 2006)

DNMT inhibitors
  Nucleoside analogues:
    • 5-azacytidine (5-aza-CR)
    • Decitabine (5-aza-CdR)
    • Zebularine
  Non-nucleoside analogues:
    • Procainamide
    • Procaine
    • Epigallocatechin-3-gallate (EGCG)
  Antisense oligonucleotides:
    • DNMT1 ASO

HDAC inhibitors
  Hydroxamates:
    • Trichostatin A
    • Suberoylanilide hydroxamic acid (SAHA)
  Cyclic tetrapeptides:
    • Depsipeptide
    • Apicidin
  Aliphatic acids:
    • Valproic acid
    • Phenyl butyrate
  Benzamides:
    • MS-275
    • CI-994
  Electrophilic ketones:
    • Trifluoromethyl ketones
    • α-Ketoamides
Part 2: DNA‐methylation, cancer and literature
Chapter 6: Introduction

"Six pints of bitter," said Ford Prefect to the barman of the Horse and Groom. "And quickly please, the world's about to end."
Douglas Adams, "The Hitchhiker's Guide to the Galaxy"

DNA methylation represents a modification of DNA by addition of a methyl group to a cytosine; the resulting methylcytosine is also referred to as the fifth base (Doerfler et al., 1990). This epigenetic change does not alter the primary DNA sequence and might contribute to overall genetic stability and maintenance of chromosomal integrity and consequently facilitate organization of the genome into active and inactive regions with respect to gene transcription (Robertson, 2002). Genes with CpG islands in the promoter region are generally unmethylated in normal tissues. Upon DNA hypermethylation, transcription of the affected genes may be blocked, resulting in gene silencing. In neoplasia, abnormal patterns of DNA methylation have been recognized and hypermethylation is now considered one of the important mechanisms resulting in silencing expression of tumour suppressor genes, i.e. genes responsible for control of normal cell differentiation and/or inhibition of cell growth. In the last few years, new hypermethylated biomarkers have been used in cancer research and diagnostics (Esteller, 2003).

The detection of DNA hypermethylation was revolutionized by two discoveries. First, bisulfite treatment converts cytosine residues into uracil, except for the protected methylcytosine residues (Hayatsu, 1976). Second, based on the sequence differences after bisulfite treatment, methylation-specific PCR (MSP) can distinguish methylated DNA from unmethylated DNA (Herman et al., 1996). In many cancers, various markers have been reported to be hypermethylated (Paluszczak and Baer-Dubowska, 2006).

As discussed in Chapter 5: DNA-methylation and cancer, DNA-methylation plays a crucial role in cancer and thousands of publications describe DNA-methylation of hundreds of genes in a whole variety of cancer types. Being able to explore and combine this existing knowledge may lead to novel insights into the underlying mechanisms and raise new research questions.
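The sequence difference that bisulfite treatment creates, and on which MSP relies, can be illustrated with a small in-silico conversion. The Python sketch below is a simplified model under stated assumptions: the seven-base sequence is a made-up toy example, methylated cytosines are given by their 0-based positions, and conversion is assumed to be complete.

```python
def bisulfite_convert(sequence, methylated_positions=frozenset()):
    """Convert unmethylated C to U; methylated C (0-based index) stays C."""
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("U")
        else:
            converted.append(base)
    return "".join(converted)


def after_pcr(converted):
    """During PCR amplification, uracil is read and copied as thymine."""
    return converted.replace("U", "T")


toy_sequence = "ACGTCGA"  # hypothetical fragment with two CpG sites

unmethylated = after_pcr(bisulfite_convert(toy_sequence))
methylated = after_pcr(bisulfite_convert(toy_sequence, {1, 4}))
```

The two resulting sequences differ at every originally methylated cytosine, which is what allows primers to be designed that anneal to only one of the two versions.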
Chapter 7: DNA methylation and literature analysis

Ainsi presque tout est imitation. L’idée des Lettres persanes est prise de celle de l’Espion turc. Le Boiardo a imité le Pulci, l’Arioste a imité le Boiardo. Les esprits les plus originaux empruntent les uns des autres. (Almost everything is imitation... The most original writers borrowed from one another)
Voltaire, "Lettre XII: sur M. Pope et quelques autres poètes fameux", Lettres philosophiques (1733)

As each month new publications describe the hypermethylation of genes in different cancer types, it would be an advantage to be able to keep track of this information. If one is able to speed up or simplify literature searches, this information can be used, for instance, for selecting positive controls and for gene-prioritizing purposes.

Most abstracts of publications related to methylation are stored in the PubMed database, hosted by the NCBI (National Center for Biotechnology Information). The information on a publication (abstract, authors, keywords,…) can be accessed through the web-interface NCBI provides, as well as through a retrieval system called E-Fetch. This is in fact a so-called ‘API’: an Application Programming Interface. This API allows access to and retrieval from PubMed (as well as other NCBI databases) using a programming language (such as Perl). This enables us to access PubMed records without having to use the web-interface of PubMed. This offers perspectives to:
• Automatically query PubMed using many (combinations of) search terms at the same time
• Subsequently process the results: highlighting, counting and sorting of specific content of interest

In the (epi-)genomic field, applications of these two possibilities include:
• Computer-generated searching of different aliases of a gene at the same time. A human gene is commonly identified by different aliases (symbols, identifiers in genetic databases, variants of names and descriptions). This makes it extremely difficult to search for a specific gene in the literature, as often more than 10 such aliases are associated with one single gene. And what about genes that share aliases, and textual variants of all these different aliases (e.g. BRCA1, BRCA-1, BRCA 1, BRCA-I)? However, using computer programs, it is feasible to download all aliases for a gene from different databases, generate textual variants and search PubMed with all these variants at the same time, generating one single results file within seconds to minutes.
• Downloading, highlighting and sorting all abstracts related to one particular area of interest. With a list of keywords related to one particular area of interest, all related abstracts can be retrieved. These keywords can be highlighted, as well as all genes (and their aliases and textual variants), sentences with both an alias and a keyword,… At the same time, the different words of interest can be counted and a scoring scheme can be applied. Based on the counting, one could then rank the abstracts or select only the abstracts in which a particular gene is mentioned, while the highlighting enables fast screening of the abstract.

The next chapters will give insight into biological text mining in general and into how we applied these methodologies to generate a methylation database in cancer: PubMeth.
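The two applications listed above can be sketched in Python as follows. This is a minimal illustration and not the code used for PubMeth: the variant rules (separator and Roman-numeral forms), the use of the PubMed `[Title/Abstract]` field tag and the simple count-based score are all assumptions made for the example, and the abstract text is invented.

```python
import re

# Roman numerals for small trailing numbers (assumed to be enough for a sketch).
ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V",
         6: "VI", 7: "VII", 8: "VIII", 9: "IX", 10: "X"}


def alias_variants(symbol):
    """Generate textual variants of a symbol such as BRCA1."""
    match = re.fullmatch(r"([A-Za-z]+)[- ]?(\d+)", symbol)
    if not match:
        return [symbol]
    stem, digits = match.group(1), match.group(2)
    variants = [f"{stem}{sep}{digits}" for sep in ("", "-", " ")]
    if int(digits) in ROMAN:
        variants += [f"{stem}{sep}{ROMAN[int(digits)]}" for sep in ("-", " ")]
    return variants


def pubmed_query(variants, field="Title/Abstract"):
    """Combine all variants into one OR query restricted to a search field."""
    return " OR ".join(f'"{variant}"[{field}]' for variant in variants)


def count_hits(text, terms):
    """Count whole-word, case-insensitive occurrences of the terms in text."""
    total = 0
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def score_abstract(text, aliases, keywords):
    """Return (gene hits, keyword hits); abstracts can be ranked on these."""
    return count_hits(text, aliases), count_hits(text, keywords)


# Invented abstract fragment, for illustration only.
abstract = "BRCA1 promoter hypermethylation; BRCA-1 methylation in breast cancer."
score = score_abstract(abstract, alias_variants("BRCA1"),
                       ["hypermethylation", "methylation"])
```

Whole-word matching is a deliberate choice here: it prevents a short alias from matching inside a longer word, at the cost of treating "methylation" and "hypermethylation" as separate keywords.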
Chapter 8: Intermezzo: Biological Text Mining

Paper 1: Digging into biomedical literature: a guide to biological text mining
Ongenaert M, Van Criekinge W. In preparation

Life-science researchers make use of enormous amounts of data, presented in biomedical literature. Researchers would benefit enormously from methodologies that are able to perform literature queries, analyze and filter the results and present the answer to specific research questions, and the interactions and relationships among them, in a summarized overview in an automated way. This brings us to the field of Biological Text Mining: the use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature today. These data come from various independent research groups with different angles of view and focus points, and it is extremely useful to be able to compare and combine their datasets to gain new insights and come to hypotheses that could not be generated using only one data source. We discuss how these literature sources can be automatically queried and the results annotated. To this end, we make use of existing web-based tools and discuss their strengths and limits. In addition, custom Perl scripts give insight into the mechanisms that most text mining approaches share.
8.1 Introduction

Scientists and researchers in the life sciences who plan an experiment, perform it or discuss results consult biomedical literature in all of these stages. For this, they can use different literature databases (PubMed (Wheeler et al., 2005), Web of Science, Scopus, Google Scholar) to simplify their search and to cover as many trusted sources as possible. However, with currently more than 18 million abstracts in PubMed, the results of the search queries they perform can become too numerous to analyze and to filter for the needed information. In addition, creating an accurate and complete literature search query may not be easy.

Therefore, researchers would benefit enormously from methodologies that are able to perform literature queries, analyze and filter the results and present the answer to specific research questions, and the interactions and relationships among them, in a summarized overview in an automated way. This brings us to the field of Biological Text Mining: the use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature today. These data come from various independent research groups with different angles of view and focus points, and it is extremely useful to be able to compare and combine their datasets to gain new insights and come to hypotheses that could not be generated using only one data source.

Biological text mining must be able to deal with enormous amounts of data and with biological classifications such as ontologies and controlled vocabularies. The algorithms used should take into account different gene aliases, names and descriptions and variants of them. The analysis should be able to make use of lists of disease phenotypes and symptoms, and to identify chemical compounds and drugs or analysis methods and their abbreviations. It becomes clear that the ideal text mining application does not exist and that the choice of which techniques to use will depend on the research domain and the defined goals of the literature search.

In this review, we discuss which basic techniques exist to gather relevant literature and how to analyze it, taking ontologies, gene lists and clinical symptoms into account. The discussed methodologies are illustrated with real-life examples. The purpose is to give researchers insight into the possibilities and to show examples of how to use existing methodologies in an easy way, as fast and with as little additional work as possible. This review is constructed from a practical point of view: four steps, which are more or less prominently present in any biological text mining effort, are discussed.

We will show application examples of tools already available on the Internet as a web service (application examples) and we will demonstrate basic custom Perl scripts to give more insight into the underlying mechanisms (insight examples). The insight examples are simplified versions of scripts used to create the PubMeth database (Ongenaert et al., 2008).
8.2 Step 1: Perform automated literature queries

In any case, the first step in biomedical text mining is to actually get the biomedical texts (abstracts or full text), based on search queries. This seems trivial, but it may be the most crucial step as it is at the beginning of the pipeline. If the query is incomplete or inaccurate, the performance (recall and precision) will be significantly affected. To get biomedical texts, the researcher needs to decide which database to use and which queries to perform, unless the whole database is screened.

Most literature databases (and text mining interfaces) have a web interface, and the generation of the query is seamlessly integrated in this interface. However, one should always check whether the generated query reflects the research question.
For example, submitting the query “text mining” through the web-interface of NCBI PubMed actually generates “text[All Fields] AND ("mining"[MeSH Terms] OR "mining"[All Fields])” (this can be seen in the Details tab of PubMed). Mining is apparently identified as a MeSH term. MeSH (Medical Subject Headings) is the National Library of Medicine's controlled vocabulary thesaurus (Kim and Wilbur, 2005), which consists of sets of terms and naming descriptors in a hierarchical structure that permits searching at various levels of specificity.

So PubMed does not merely search for text strings, but is able to recognize some specific scientific and medical context. This normally improves the quality and accuracy of the search and is enabled by default. However, in some situations it is undesirable. In addition, PubMed allows adding limits to the query, such as publication year, topics covered and age groups of subjects (‘Limits’ tab).

Once a query has been carefully generated, adjusted and fine-tuned, it can be executed. However, if one has a whole list of such queries and thus a lot of search results, the web-interface is no longer sufficient to deal with these results and to use them further in the analysis pipeline. Therefore, an automated system to query the literature database and to store the results in a structured format (e.g. XML) or in a database will be required. Fortunately, all main literature databases provide ways to perform automated queries and save the results in a structured format.

Such a file format has a defined and documented structure, e.g. an XML file has different ‘tags’, indicating at which level the different features are situated. This defined layout makes it relatively easy and fast to extract the required individual elements of the data from the file using a programming language, although XML files are not very readable for humans. This data operation is often called ‘parsing’.

In this review, the different steps are described separately from each other, while in practice they are intertwined. In this step, keyword lists or organized thesauruses (as described in step 2) are often used to generate the search query. For example, a tool that takes a chromosomal region as input and links the genes in this region to a certain disease or phenotype will use the different gene symbols and aliases, as well as a controlled vocabulary (synonyms, symptoms, different subtypes etc.) of diseases to generate the initial query.

Application example: PubNet

PubNet (Douglas et al., 2005) is a web service that extracts several types of relationships returned by PubMed queries and maps them into networks, allowing graphical visualization, textual navigation, and topological analysis. Based on user search terms, PubMed is queried and the results are gathered as XML files. As these XML files have a defined and documented structure, it is relatively easy to parse them. In the example of PubNet, depending on the relationship the user wants to visualize, the required features (such as authors or MeSH terms) are parsed out of the XML files and passed to the visualization technologies.

For instance, suppose one is interested to see which authors are experts in the field of DNA-methylation in colorectal cancer. PubNet will send the generated query to PubMed and retrieve the results as an XML file. Per publication, the authors are parsed out of this XML file, an index of all authors is made and it is determined how many times each of these authors was co-author with others in the list. Based on this analysis, the graphical summary is generated. This view shows authors who frequently publish together as groups (with line width indicating the frequency). It also becomes clear which groups cooperate, and which persons connect the different groups and are probably experts in the field. In this way, the relationships between the authors of thousands of abstracts can be visualized and interpreted within minutes. The process of data fetching, parsing and a visualization example is given in Figure 8.1.
[Figure 8.1 (image not reproduced). PubNet status output: Fetching XML 1; Parsing XML; Compiling (nodes: 3513, edges: 21182); Generating output (svg, ps, pdf, png).]
Figure 8.1: PubNet scheme for obtaining literature data from PubMed (performing the query, getting the XML files and parsing them) and the visualization of co‐authors in the field of colorectal cancer DNA methylation
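The co‐author counting described for PubNet can be illustrated with a minimal Python sketch. This is not PubNet's implementation: the function names and the two‐article sample are assumptions made for this example, and the author names in it are invented. Only the element names (PubmedArticle, Author, LastName, Initials) follow the PubMed XML layout.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET


def authors_per_article(pubmed_xml):
    """Return one list of author names per <PubmedArticle> element."""
    root = ET.fromstring(pubmed_xml)
    result = []
    for article in root.iter("PubmedArticle"):
        names = []
        for author in article.iter("Author"):
            last = author.findtext("LastName")
            initials = author.findtext("Initials") or ""
            if last:  # collective (group) authors have no LastName; skip them
                names.append(f"{last} {initials}".strip())
        result.append(names)
    return result


def coauthor_edges(author_lists):
    """Count how often each unordered pair of authors shares a publication."""
    edges = Counter()
    for names in author_lists:
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


# Invented two-article sample in the PubMed XML layout (names are made up).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article><AuthorList>
    <Author><LastName>Alpha</LastName><Initials>A</Initials></Author>
    <Author><LastName>Beta</LastName><Initials>B</Initials></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article><AuthorList>
    <Author><LastName>Alpha</LastName><Initials>A</Initials></Author>
    <Author><LastName>Beta</LastName><Initials>B</Initials></Author>
    <Author><LastName>Gamma</LastName><Initials>C</Initials></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

edges = coauthor_edges(authors_per_article(SAMPLE))
for (a, b), weight in sorted(edges.items()):
    print(a, "--", b, "weight", weight)
```

The pair counts play the role of the line widths in the network view: the more often two authors publish together, the heavier their edge.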
Insight example: automated querying of PubMed
PubMed (and other NCBI databases) can be queried using E‐Utils. This makes it possible to pass queries to dedicated NCBI servers from any programming language and get the results back. The system works as follows:
‐ In a first stage, the queries are passed to the NCBI search system (E‐Search). The servers execute the request and return a list of primary IDs as the result, in this case PubMed IDs.
‐ Afterwards, the details of these PubMed IDs are requested from the NCBI servers (E‐Fetch) and the results are returned in the requested supported format (e.g. XML, plain text).
Perl‐script 1 illustrates this process: it retrieves all PubMed records related to DNA methylation and epigenetics and stores the results in a single XML file, allowing further processing.
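As a rough illustration of the two‐stage E‐Search / E‐Fetch exchange, a minimal Python sketch (not the Perl script referred to above) might look as follows. The endpoint paths and the `db`, `term`, `retmax`, `id` and `retmode` parameters are those of the public NCBI E‐utilities; the function names, the search term and the output file name are assumptions made for this example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=100):
    """Stage 1: build the E-Search request that returns matching PubMed IDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS + "/esearch.fcgi?" + urllib.parse.urlencode(params)


def efetch_url(pmids):
    """Stage 2: build the E-Fetch request for the full records as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EUTILS + "/efetch.fcgi?" + urllib.parse.urlencode(params)


def parse_id_list(esearch_xml):
    """Extract the PubMed IDs from an E-Search XML response."""
    root = ET.fromstring(esearch_xml)
    return [element.text for element in root.findall("./IdList/Id")]


def fetch_records(term, out_path, retmax=100):
    """Run both stages and store the fetched records in one XML file."""
    with urllib.request.urlopen(esearch_url(term, retmax)) as response:
        pmids = parse_id_list(response.read())
    with urllib.request.urlopen(efetch_url(pmids)) as response:
        with open(out_path, "wb") as out:
            out.write(response.read())
    return pmids


# Example call (performs network requests, so it is left commented out):
# fetch_records("DNA methylation OR epigenetics", "methylation_records.xml")
```

For result sets of the size discussed here, fetching in batches would be needed; the E‐utilities offer a history mechanism for this, which the sketch leaves out for brevity.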
8.3 Step 2: Define what to search for: deal with ontologies, gene and protein lists and thesauruses of chemical compounds and diseases
Once the literature results are stored in a structured way, e.g. by using database technologies, the challenge is to identify certain interesting concepts in the results. As we are dealing with biomedical literature, we might want to identify gene or protein symbols, aliases or names; cancer types or other diseases and their symptoms; Gene Ontology terms (Gene Ontology Consortium, 2008); chemical compounds (Singh et al., 2003; Hoffmann, 2007; Wild and Hur, 2008) and pharmaceutical compounds such as drugs.

To identify gene‐related, medical, chemical or drug‐related terminologies, a variety of databases is available. Some of these databases are hierarchically organized, others are unstructured. Examples of structured sets of keywords are Gene Ontology terms and MeSH terms. Gene symbols and names are examples of unstructured keyword sets. Some databases are in between: they cover for instance synonyms and symptoms of a disease, but there are no relations between the different diseases covered.

Some of these databases and lists (also known as thesauruses) are themselves generated by using text‐mining approaches (Jin et al., 2006; Kim et al., 2008). Some databases actually are a so‐called ‘mashup’ (Cheung et al., 2008; Belleau et al., 2008) of other data sources: they try to combine different data sources in one single interface. An example of such a database is GeneCards (Safran et al., 2002) for human gene information.

In order to use these different lists, they also have to be parsed to get the individual entities, taking into account synonyms and the hierarchical structure.

Insight example: creating a list of human genes, their aliases and symbols
We demonstrate in script 2 how to parse GeneCards to retrieve a list of all aliases and symbols of human genes. In a later phase, we search for all these gene symbols in the entire literature result set.
‐ First, we make use of Ensembl (Flicek et al., 2008) / BioMart to retrieve a list of all human genes associated with an Ensembl gene ID (BioMart: human genes – no filters – output: Ensembl gene ID – present only unique results – save as CSV file). This initial step can also be automated, using the Ensembl API.
‐ Second, the corresponding GeneCards records are retrieved. All gene symbols and aliases, names and descriptions are parsed out of these records and stored locally.
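The GeneCards record layout is not described here, so the following Python sketch starts one step later: it assumes that the parsed result is already available as (Ensembl gene ID, symbol, aliases) records and shows one possible way to organise them for the later matching step. The column names, the gene IDs, the symbols and the aliases are invented placeholders, and the function names are assumptions rather than part of script 2.

```python
import csv
import io
from collections import defaultdict

# Placeholder records standing in for locally stored, parsed gene data.
# Column names and all values are invented for illustration only.
PARSED_CSV = """ensembl_id,symbol,aliases
ENSG_PLACEHOLDER_1,GENEA,GA1;ALPHA
ENSG_PLACEHOLDER_2,GENEB,GB7;ALPHA
"""


def build_alias_index(handle):
    """Map every symbol and alias (upper-cased) to the gene IDs that use it."""
    index = defaultdict(set)
    for row in csv.DictReader(handle):
        names = [row["symbol"]] + [a for a in row["aliases"].split(";") if a]
        for name in names:
            index[name.upper()].add(row["ensembl_id"])
    return index


def ambiguous_aliases(index):
    """Return the names that point to more than one gene."""
    return sorted(name for name, genes in index.items() if len(genes) > 1)


alias_index = build_alias_index(io.StringIO(PARSED_CSV))
shared = ambiguous_aliases(alias_index)
print("aliases shared between genes:", shared)
```

An inverted index of this kind makes shared aliases visible before the literature is searched, which matters for the ambiguity problems discussed in step 3.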
8.4 Step 3: Identify key words, annotation lists and concepts in literature results. Deal with textual variants and ambiguities and identify relationships in the results
Next in the analysis pipeline is matching the different annotation and keyword lists from step 2 with the literature results retrieved in step 1. Based on the identifications made in this step, the literature references are ranked and the information is filtered, sorted, highlighted and summarized in order to present the analysis results, or passed to advanced machine learning classifiers.
The previous steps were straightforward, not prone to errors and computationally relatively easy. This step, however, is more difficult, as ambiguities, biases and errors are introduced (Tanabe and Wilbur, 2002; Tuason et al., 2004; Chen et al., 2005; Fundel and Zimmer, 2006; Yang et al., 2008) that have an impact on the rest of the downstream analysis. Genes can share aliases, and gene symbols may be existing English words or may also be used as the identifier of a cell line (Sehgal and Srinivasan, 2006). Symbols and names can also be written in different ways, the so‐called textual variants (e.g. BRCA1, BRCA‐1, BRCA/1, BRCA 1, BRCA‐I). Many abstracts use abbreviations for chemical compounds or diseases, making it even harder to identify these abbreviations correctly (Liu et al., 2002).

After identification of the different entities, relationships between all these entities can be detected. For instance: are the genes in a certain chromosomal region associated with symptoms of a certain disease? Or, given a certain disease, which genes are related to this disease and in which pathways can they be situated? If the mechanisms for the identification of relationships are sufficiently sophisticated, they could even be used to identify indirect relationships and help to define novel research hypotheses.
The different relationship identification strategies range from co‐occurrence to statistical and machine‐learning based techniques. Some efforts use linguistic models to improve the detection. Three basic techniques (co‐occurrence, rule‐based and knowledge‐based) are reviewed in (Cohen and Hunter, 2008).

For hierarchically structured lists, the different levels of relationships can be taken into account. A Gene Ontology term detected at a certain level can be associated with its synonyms, its parents and its children. In the analysis of relationships, these different levels can be used to cross the level borders.

To compare the performance of different text mining efforts, the F‐value (Chen et al., 2006) is often used. This value is calculated based on both precision (P) and recall (R), as their harmonic mean:

$$F = \frac{2 \cdot P \cdot R}{P + R}$$

For example, P = 0.8 and R = 0.6 give F ≈ 0.69. To calculate the F‐value, (manually) annotated reference datasets are available for testing and comparing text mining efforts (Jimeno et al., 2008).

Application examples: Whatizit, iHOP, PolySearch
Different web‐based services that allow easy identification of various keyword lists have been created. One example is Whatizit (Rebholz‐Schuhmann et al., 2008). Given plain text or a list of PubMed IDs, this tool can be used to identify 23 different terminologies, ranging from proteins to drugs.

A frequently used tool is iHOP (Fernandez et al., 2007), illustrated in Figure 8.2. iHOP is able to detect a whole range of terms and relationships in one single view and is pre‐indexed: the results appear almost immediately. A disadvantage is that iHOP only shows results for one single gene and cannot be used for other queries. It does speed up manual literature exploration of a single gene, as different entities are highlighted, and the results are more accurate as iHOP deals with different aliases and synonyms. The interface displays key sentences: sentences that most probably contain valuable information. Relationships are indicated in these sentences as well: if the gene of interest is regulated by another gene, not only are both genes highlighted, but the keywords indicating the interaction between them are emphasized as well.
[Figure 8.2 (screenshot not reproduced): iHOP key sentences for BRCA1, in which co‐mentioned genes such as BRCA2 and RAD51 are highlighted.]
Figure 8.2: iHOP result for the BRCA1 gene. Different annotation lists are used (diseases, chemical compounds, Gene Ontology terms) and indicated. The web interface makes use of hover‐over effects and all information is hyperlinked
On the other hand, the interface does not help in giving a complete overview or in prioritizing the relationships identified. The hierarchical structure of some keyword lists is not reflected either. This tool gives a very fast impression and performs well in identifying interesting text phrases and individual elements, but it is not designed to generate data summaries or to prioritize.
PolySearch (Cheng et al., 2008) (example in Figure 8.3) is not restricted to a single gene and users can create their own query, but only one of the predefined relationship identification schemes can be used at a time (e.g. given a disease, list all genes associated with it and rank them). PolySearch is able to rank the results but does not create summary views.
[Figure 8.3 (screenshot not reproduced): PolySearch result table with the columns Relevancy Score, PubMed ID, Full details and Key Sentences, and a color code for query, gene/protein, disease, drug, metabolite and association word.]
Figure 8.3: PolySearch results interface: investigating the connection between E‐cadherin and cervical cancer. Different identified entities get different colors, indicating the query terms more prominently. Note the ranking and scoring information
Novel possible relationships can be detected using Chilibot (Chen and Sharp, 2004), where in the most complex version two lists (of either genes or keywords) are searched for relationships within a group or between terms in the groups. These two approaches cannot use indexation in the relationship detection and are therefore much slower, but they are able to reveal more complex relationships. Chilibot also creates a schematic summary figure of the different relationships identified (co‐occurrence, stimulative, inhibitory), with the possibility to color the gene nodes in this graph according to gene expression data.

Which tool to use mainly depends on the research question. The perfect tool for every research question simply does not exist, as all tools have their specific properties: indexing of data, availability and completeness of keyword lists, detection and identification methodologies, techniques for identifying relationships, incorporation of linguistic knowledge and text corpuses, speed, and the way of ordering and representation. The most sophisticated and accurate text mining applications must be fed with positive and negative sets or trained, which often takes too long. In any case, making use of the described techniques and tools is faster and/or more accurate and complete than manual literature searches and could open possibilities to uncover hidden associations that could not easily be discovered by humans.

Insight example: matching genes and keywords and presenting abstracts (result illustrated in Figure 8.4)
‐ First of all, based on the list of aliases generated in step 2, a list of textual variants is generated in script 3a. This script basically generates regular expressions, suited for immediate use in Perl.
‐ In script 3b, the XML file with all abstracts is parsed and analyzed:
  o Author information and publication details are stored.
  o The title and abstract are searched for keywords; cancer types; detection methodologies in methylation research; and gene symbols, aliases and their textual variants. Sentences with both a gene alias and a keyword, or a gene alias and a cancer type, are identified.
  o The results of the identification of single occurrences and of sentences with both an alias and a keyword are stored in a relational MySQL database to enable fast querying and sorting afterwards. This database could for instance be used to query which genes were mentioned in combination with a certain cancer type.
  o Per abstract, an HTML file with JavaScript (for the hover‐over effect) and keyword highlighting is created, in order to speed up human review. The different colors and the highlighting of sentences with both a gene symbol and a keyword drastically improve review times and accuracy. The use of hyperlinks (to the original abstract in PubMed, to other publications of any of the authors, to GeneCards) dynamically links all these different information sources and allows intuitive navigation.
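The two core operations of scripts 3a and 3b, generating a regular expression for the textual variants of an alias and finding sentences in which an alias and a keyword co‐occur, can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the Perl scripts themselves: the function names are hypothetical, only aliases of the form letters plus number are expanded, the Roman numeral variant is handled for the digit 1 only, the sentence splitter is naive, and the example abstract text is invented.

```python
import re


def variant_pattern(alias):
    """Build a regex matching textual variants of a gene alias.

    The letter part and the trailing number may be joined by nothing,
    a space, a hyphen or a slash; the digit 1 may also be written as
    the Roman numeral I (as in BRCA-I).
    """
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", alias)
    if not match:
        return re.compile(r"\b" + re.escape(alias) + r"\b")
    letters, number = match.groups()
    number_part = "(?:1|I)" if number == "1" else re.escape(number)
    return re.compile(r"\b" + re.escape(letters) + r"[ \-/]?" + number_part + r"\b")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurring_sentences(text, alias, keywords):
    """Return the sentences that mention both the alias and any keyword."""
    gene = variant_pattern(alias)
    keyword_re = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [s for s in split_sentences(text) if gene.search(s) and keyword_re.search(s)]


# Invented example text, not taken from a real abstract.
ABSTRACT = ("BRCA-1 promoter methylation was examined in tumour samples. "
            "Expression of BRCA 1 was also measured. "
            "Methylation of BRCA2 was rare.")

hits = cooccurring_sentences(ABSTRACT, "BRCA1", ["methylation", "methylated"])
print(hits)
```

A pattern this permissive also illustrates why step 3 is error‐prone: the same rule that finds BRCA‐I can match unrelated text, which is one reason the results are reviewed manually afterwards.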
[Figure 8.4 (screenshot not reproduced): annotated abstract of PubMed record 16820927, ‘Promoter methylation status of the MGMT, hMLH1, and CDKN2A/p16 genes in non‐neoplastic mucosa of patients with and without colorectal adenomas’ (Ye C. et al., Oncol Rep, 2006), with a hover‐over box for MGMT (O‐6‐methylguanine‐DNA methyltransferase, 10q26.3).]
Figure 8.4: Example of identification of human genes and their symbols, cancer types, methylation‐related keywords and detection methodologies. Note the errors in identification: ACF and CI are detected as genes while they are abbreviations
8.5 Step 4: Rank, summarize and present the results
The last step, crucial for usability, is to order and summarize the data and present the results to the user. The visualization of the text analysis must be easy to understand and navigate, yet provide sufficient data, available within a few mouse clicks. The key concepts or the results with the highest support must be listed first. The interface must be self‐explanatory, also for first‐time users.
It can be a real challenge to present data in a structured form, giving enough details but without losing the overview and navigation aspects. Often, graphic representations of data are excellent: they may give overview and summarizations, indicate connections and may be scalable. The result presentation may also largely benefit from using representations that researchers are familiar with. For example, it may be a good idea to present results related to pathways in a cellular representation indicating membranes, nucleus and organelles.
Some commercial packages (such as Ingenuity Pathway Analysis) implement such attractive visualization strategies, as this enables biological researchers to gain insight into the ongoing processes and to formulate new hypotheses more effectively.

Sortable tables, colors, and highlighted and hyperlinked data are often used to visualize the results in a web environment, as users are familiar with this. With newer (web 2.0) technological developments such as AJAX, the visualization can be improved even further: hover‐over effects, expandable sections with details or additional filters without having to refresh the page.

In addition to literature searches, other public data sources could be used to prioritize genes. A nice example is Endeavour (Tranchevent et al., 2008) (example in Figure 8.5): this prioritization tool uses, in addition to text sources, biological pathways, sequence motifs, protein interactions, regulatory modules and expression data. All these data are used to create a statistical model that is able to rank genes in each of these aspects. In the end, a global ranking is presented.
Figure 8.5: Candidate gene prioritizing visualization in Endeavour. A single candidate gene (given a unique background color) is ranked according to different data sources, including literature data. A global ranking order is then determined: this is the prioritized candidate gene list
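The idea of ranking candidates per data source and then combining these into one global order can be illustrated with a deliberately simple Python sketch. It uses the mean of the per‐source ranks, which is a stand‐in chosen for brevity and not the statistical model used by Endeavour. The gene names, data source names and scores are invented, and the function names are assumptions.

```python
def ranks(scores):
    """Map each gene to its rank (1 = best) within one data source."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: position for position, gene in enumerate(ordered, start=1)}


def global_ranking(per_source_scores):
    """Order genes by their mean rank over all data sources."""
    per_source_ranks = [ranks(s) for s in per_source_scores.values()]
    genes = set.intersection(*(set(r) for r in per_source_ranks))
    mean_rank = {
        gene: sum(r[gene] for r in per_source_ranks) / len(per_source_ranks)
        for gene in genes
    }
    # Ties are broken alphabetically so that the output is deterministic.
    return sorted(genes, key=lambda gene: (mean_rank[gene], gene))


# Invented scores: higher means stronger support in that data source.
SCORES = {
    "literature": {"gene1": 0.1, "gene2": 0.4, "gene3": 0.9},
    "expression": {"gene1": 0.5, "gene2": 0.8, "gene3": 0.2},
    "pathways": {"gene1": 0.7, "gene2": 0.6, "gene3": 0.3},
}

prioritized = global_ranking(SCORES)
print(prioritized)
```

In this toy input no gene is ranked first in more than one source, yet gene2 comes out on top because it is never ranked last; this is the kind of compromise a global ranking makes visible.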
8.6 Discussion
In this review we discussed four of the major mechanisms that most biological text mining systems share. Most applications are either completely focused on one particular theme in the literature or applicable in broader biological and medical contexts. The more specialized the application, the better the identification or training methodologies can be implemented, while the cost of generalization is a loss in recall or specificity. This is probably the reason why so many applications and databases are created that rely on text mining efforts in a single (more or less narrow) research area (Fang et al., 2008; Shtatland et al., 2007; Lee et al., 2008; Gajendran et al., 2007). Most of these databases are created by people with a biological background, rather than by researchers with computational or linguistic expertise, and thus use rather simple text mining approaches instead of taking complicated models into account. Despite the lack of these technologies, these databases perform well, as the knowledge of experts in the field greatly adds to recall and specificity. Instead of having to train statistical models, the database is annotated and complemented by experts, who mainly benefit from simple keyword highlighting and improved navigation through the abstracts and the hyperlinked information.
The perfect application for every research question most probably does not exist, but the available web‐based tools often provide a very good basis for analysis, certainly when used by researchers who would otherwise perform a manual literature search. Apart from the querying, identification and ordering mechanisms in the background, the presentation of the results is a key feature. The use of dynamic web technologies with carefully chosen colors and intuitive navigation, enabling both general overviews and detailed information, can make the generated data and knowledge accessible.

This review shows some of the advantages and the power of text mining approaches and of the visualization of the results. However, techniques to automatically analyze full text articles and to distil information out of tables and figures are still in development and would increase the available data even further. Full text versions of articles are not accessible in a standardized way: there is no single location where they are stored and some journals require registration. The recent move to open access articles and the implementation of the DOI system (Digital Object Identifier) might speed up full text searches.
8.7 Conclusion
Biological text‐mining is becoming necessary in order to analyze literature results. With the availability of diverse biological databases and web tools, it has become feasible to automatically mine biological texts. Setting up a tailored text‐mining approach in order to answer biological questions in a certain context can help to speed up gaining insight into existing knowledge and the generation of novel research hypotheses. Different web‐based text‐mining tools are available: some are highly focused on one particular area while others can be applied in broad biological contexts. Some tools only use co‐occurrence and other relatively simple detection principles, while others implement linguistic knowledge; the visualization of the results ranges from simple color highlighting to web 2.0 enabled diagrams. The ideal text‐mining application for all specific research questions probably does not exist, but with some modifications, the results can be obtained quickly and accurately.
Chapter 9: PubMeth: methylation database in cancer
Paper 2: PubMeth: a cancer methylation database combining text mining and expert annotation. Ongenaert M, Van Neste L, De Meyer T, Menschaert G, Bekaert S, Van Criekinge W. As published in Nucleic Acids Research; 36 (Database issue): D842‐6. (Open Access)
Epigenetics, and more specifically DNA methylation, is a fast‐evolving research area. In almost every cancer type, each month new publications confirm the differentiated regulation of specific genes due to methylation and mention the discovery of novel methylation markers. Therefore, it would be extremely useful to have an annotated, reviewed, sorted and summarized overview of all available data. PubMeth is a cancer methylation database that includes genes that are reported to be methylated in various cancer types. A query can be based either on genes (to check in which cancer types the genes are reported as being methylated) or on cancer types (which genes are reported to be methylated in the cancer (sub)types of interest). The database is freely accessible at http://www.pubmeth.org. PubMeth is based on text‐mining of Medline/PubMed abstracts, combined with manual reading and annotation of preselected abstracts. The text‐mining approach results in increased speed and selectivity (as for instance many different aliases of a gene are searched at once), while the manual screening significantly raises the specificity and quality of the database. The summarized overview of the results is very useful in case more genes or cancer types are searched at the same time.
9.1 Introduction
DNA methylation represents a modification of DNA by addition of a methyl group to a cytosine, also referred to as the fifth base (Doerfler et al., 1990). This reaction uses S‐adenosyl‐methionine as a methyl donor and is catalyzed by a group of enzymes, the DNA methyltransferases (DNMTs). In humans and other mammals, this epigenetic modification is almost exclusively imposed on cytosines that precede a guanosine in the primary DNA sequence (often called a CpG dinucleotide). The frequency of these CpGs in the genome is much lower than would be expected, as a methylated cytosine is often subject to deamination, thereby forming thymine. However, in some regions, dense clusters of CpGs can be identified: these regions are referred to as CpG islands (Herman and Baylin, 2003).

DNA methylation is an epigenetic change: it does not alter the primary DNA sequence and might contribute to overall genetic stability and maintenance of chromosomal integrity. Consequently, it facilitates the organization of the genome into active and inactive regions with respect to gene transcription (Robertson, 2002). Genes with CpG islands in their promoter region are generally unmethylated in normal tissues. Upon DNA hypermethylation, transcription of the affected genes may be blocked, resulting in gene silencing. In neoplasia, abnormal patterns of DNA methylation have been recognized. Hypermethylation is now considered one of the important mechanisms resulting in silencing of the expression of tumour suppressor genes, i.e. genes responsible for control of normal cell differentiation and/or inhibition of cell growth. In the last few years, new hypermethylated biomarkers have been used in cancer research and diagnostics (Esteller, 2003).

MethDB (Amoreira et al., 2003), one of the few databases that focus on DNA methylation, is general and sample oriented. It is not optimized for cancer‐related queries, because this type of query requires a summarized overview, whereas in MethDB querying multiple genes or cancer types is not supported and data is always handled as a separate sample. Another database, MethPrimerDB (Pattyn et al., 2006), focuses on detection methodologies (e.g. MSP primer design). Both databases depend on submissions by administrators or users, which guarantees the required quality of the databases, but consequently they are not always complete and up to date. Neither database is designed to rank and summarize cancer‐related information (genes and cancer (sub)types involved), although this is crucial in applied methylation research in the cancer field.
Here we present PubMeth, a database that combines a text‐mining approach (fast, intelligent enough to search multiple aliases and textual variants of these aliases, querying multiple keyword lists at once) with a manual reviewing and annotation step. The latter drastically improves specificity and annotation quality. The interface is able to rank, summarize and represent data, making the information the database contains easily accessible. The reviewing step also heavily depends on the text‐mining step, which sorts abstracts, highlights terms and provides links to different sources. This way, the reviewing step can be done fast and accurately enough to process all abstracts electronically published in PubMed until now. In addition, using this approach, an update strategy can be implemented more easily.

DNA methylation in cancer research has evolved into a mainstream research topic. Methylation profiles are successfully used in early detection and personalized treatment. However, more and more data is becoming available, especially with the availability of large‐scale screening techniques. All the information taken together determines the knowledge of the ‘cancer methylome’. Ultimately, the epigenome of all cancer tissues, including those of different stage and grade, could be mapped out. Epigenetic states differ widely among tissues, and changes are far more varied and much more frequent per tumor than DNA mutations. "Each differentiated cell has a different epigenome," said Jones (Garber, 2006). From this perspective, it is very useful to extract from the literature which genes have already been reported in which cancer types. This information might be used as positive controls, to check the same genes in other (related) cancer types, to screen for markers that could be used for early diagnosis or in the context of personalized medicine, and to deepen the knowledge of the mechanisms of methylation.

PubMeth aims to contain and summarize as much of the available literature data as possible and presents it in an easy‐to‐use graphical interface. It speeds up the process of searching relevant literature, many aliases and keywords are searched at the same time, and the results are reliable as they are manually reviewed, as one would do when performing a manual literature search.
9.2 Filling up the database

Abstracts related to epigenetics and methylation are downloaded in XML format through NCBI E-Utils (E-Fetch) using more than 15 methylation-related keywords (such as methylation, DNA-methylation, methylated, epigenetic and a range of variants, as well as detection technologies). The aliases, symbols and descriptions of human genes associated with an Ensembl ID are obtained using a Perl script. This script queries the GeneCards database (Rebhan et al., 1998), which already combines different genetic databases such as Ensembl and Entrez Gene. Different textual variations of all aliases are generated to be as complete as possible (e.g. variants for BRCA1 include BRCA 1, BRCA-1, BRCAI, BRCA I and BRCA-I). To avoid counting and highlighting aliases that are also common English words, an alias is rejected if it retrieves more than 100,000 PubMed abstracts. A list of frequently used English words (http://www.wordcount.org) is checked at the same time.

Cancer-related keywords were obtained from a list of the National Cancer Institute (http://www.cancer.gov/cancertopics/alphalist), and keywords related to detection methodologies were compiled manually.

One by one, abstracts are searched for aliases and their variants, for methylation-related keywords, and for sentences containing both an alias and such a keyword. In addition, terms related to cancer and detection methodologies are highlighted and counted. This information is stored in a MySQL 5 relational database using Perl-DBI.

Based on the information in this database, abstracts are ranked. The ranking takes into account a large number of parameters, such as the number of aliases, the number of different genes, the number of different aliases per gene, the number of sentences with both an alias and a methylation-related keyword, and the presence of detection-methodology and cancer-related keywords.

Abstracts are then manually reviewed in ranked order, aided by highlighting of the different keyword lists, aliases and sentences with both an alias and a methylation-related keyword in different colours. Aliases are linked with gene information using hover-over effects generated with JavaScript and CSS. After manual review, the information in the database only has to be minimally updated or corrected. A schematic overview of the complete process is given in Figure 9.1. This process is still in progress; thanks to the ranking system, the most important publications are already in the database. The remaining abstracts will be reviewed soon, and an accurate update strategy will be developed.
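The alias handling described in this section can be illustrated with a minimal Python sketch. It is not the Perl script used for PubMeth: the function names, the Roman-numeral table and the `pubmed_hits` argument are assumptions made for illustration. Only the BRCA1 variants and the 100,000-abstract cut-off come from the text.

```python
import re

# Roman numerals for small trailing numbers; the covered range is an assumption.
ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V",
         6: "VI", 7: "VII", 8: "VIII", 9: "IX", 10: "X"}


def alias_variants(alias):
    """Return textual variants of an alias made of letters followed by a number."""
    match = re.fullmatch(r"([A-Za-z]+)[ -]?(\d+)", alias)
    if match is None:
        return [alias]  # no trailing number: keep the alias unchanged
    prefix, number = match.group(1), int(match.group(2))
    numerals = [str(number)]
    if number in ROMAN:
        numerals.append(ROMAN[number])
    # Combine each numeral with each separator: none, space, hyphen.
    return [prefix + separator + numeral
            for numeral in numerals
            for separator in ("", " ", "-")]


def keep_alias(alias, pubmed_hits, common_words, max_hits=100_000):
    """Reject aliases that are common English words or retrieve too many abstracts."""
    if alias.lower() in common_words:
        return False
    return pubmed_hits <= max_hits


print(alias_variants("BRCA1"))
# ['BRCA1', 'BRCA 1', 'BRCA-1', 'BRCAI', 'BRCA I', 'BRCA-I']
```

In this sketch the hit count is passed in by the caller; in practice it would have to come from a PubMed query for the alias.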
[Figure 9.1 (scheme). Inputs: PubMed abstracts retrieved through NCBI E-Utils (E-Fetch), associated with methylation-related keywords or textual variants; gene aliases, symbols, names, descriptions and textual variants from GeneCards; methylation-related, cancer-related (from NCI) and detection-related keyword lists with textual variants. Steps: highlighting and annotation; counting, storing and sorting in a MySQL database (counters for aliases, different genes, different keyword types, sentences with both an alias and a keyword, ...); sorting and manual review of the highlighted abstracts. Example abstract shown: PubMed 12679904, "Mutation and methylation of hMLH1 in gastric carcinomas with microsatellite instability" (Fang DC et al., 2003), which reports hMLH1 promoter methylation in 11 of 68 (16.2 %) sporadic gastric carcinomas. The hover-over box for hMLH1 reads "mutL homolog 1, colon cancer, nonpolyposis type 2 (E. coli), 3p22.3".]
Figure 9.1: Scheme illustrating the initial filling of the database using text mining. Gene aliases and the different keyword lists (methylation-, cancer- and detection-related) are highlighted in the abstract. At the same time, different parameters are counted and stored in a MySQL relational database. Afterwards, the data are ranked and manually reviewed
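The counting and ranking step of Figure 9.1 can be sketched as follows. This is a simplified Python illustration, not the PubMeth implementation: the weights in `score` are invented (the text names the parameters but gives no formula), the abstract with ID 1 is made up, and the first abstract is a shortened version of the example in Figure 9.1.

```python
import re


def occurs(term, text):
    """Whole-word, case-insensitive match of a term in an already lowercased text."""
    return re.search(r"\b" + re.escape(term.lower()) + r"\b", text) is not None


def count_features(abstract, gene_aliases, methylation_kw, cancer_kw, detection_kw):
    """Count the ranking parameters named in the text for one abstract.

    gene_aliases maps a gene symbol to its list of aliases (hypothetical input).
    """
    text = abstract.lower()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    hits = {gene: [a for a in aliases if occurs(a, text)]
            for gene, aliases in gene_aliases.items()}
    hits = {gene: found for gene, found in hits.items() if found}
    all_aliases = [a for found in hits.values() for a in found]
    # Sentences that contain both an alias and a methylation keyword.
    co_sentences = sum(
        1 for s in sentences
        if any(occurs(a, s) for a in all_aliases)
        and any(occurs(k, s) for k in methylation_kw))
    return {
        "aliases": len(all_aliases),
        "genes": len(hits),
        "co_sentences": co_sentences,
        "cancer": any(occurs(k, text) for k in cancer_kw),
        "detection": any(occurs(k, text) for k in detection_kw),
    }


def score(features):
    """Illustrative weights only; the actual ranking formula is not given."""
    return (features["aliases"] + 2 * features["genes"]
            + 3 * features["co_sentences"]
            + int(features["cancer"]) + int(features["detection"]))


gene_aliases = {"MLH1": ["hMLH1", "MLH1"]}
abstracts = {
    12679904: ("Methylation of hMLH1 promoter was measured with methylation "
               "specific PCR. Methylation of hMLH1 promoter was detected in "
               "11 (16.2 %) gastric cancer."),
    1: "A general review of epigenetics.",  # invented abstract
}


def features_for(pmid):
    return count_features(abstracts[pmid], gene_aliases, ["methylation"],
                          ["gastric cancer"], ["methylation specific PCR"])


ranked = sorted(abstracts, key=lambda pmid: -score(features_for(pmid)))
print(ranked)  # [12679904, 1]
```

Whole-word matching is used so that the alias MLH1 is not counted a second time inside hMLH1. Whether PubMeth resolves such overlaps the same way is not stated in the text.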
9.3 Querying the database

A record in the database contains information about the source publication, the gene, the cancer type and, if specified, its subtypes. It includes the number of primary cancer samples in which methylation was analyzed, as well as the number of analyzed cell lines and normal tissues. For each of these three categories, the methylation frequency (the percentage of samples that show methylation) is also available. Other information includes the detection technologies used and an 'evidence sentence' from which most of the information in the record was derived.
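As an illustration of the record described above, the following Python sketch models one record and its methylation frequency. The class and field names are assumptions, not the actual MySQL schema. The example values are taken from the abstract shown in Figure 9.1 (PubMed 12679904), which reports hMLH1 promoter methylation in 11 of 68 gastric carcinomas, detected with methylation-specific PCR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SampleGroup:
    analysed: int = 0    # number of samples analysed
    methylated: int = 0  # number of samples that show methylation

    def frequency(self) -> Optional[float]:
        """Methylation frequency in percent, or None if nothing was analysed."""
        if self.analysed == 0:
            return None
        return 100.0 * self.methylated / self.analysed


@dataclass
class MethylationRecord:
    # Hypothetical field names; one record per gene, cancer type and publication.
    pubmed_id: int
    gene: str
    cancer_type: str
    subtypes: List[str] = field(default_factory=list)
    primary: SampleGroup = field(default_factory=SampleGroup)
    cell_lines: SampleGroup = field(default_factory=SampleGroup)
    normals: SampleGroup = field(default_factory=SampleGroup)
    technologies: List[str] = field(default_factory=list)
    evidence_sentence: str = ""


record = MethylationRecord(
    pubmed_id=12679904,
    gene="MLH1",
    cancer_type="gastric",
    primary=SampleGroup(analysed=68, methylated=11),
    technologies=["MSP"],
    evidence_sentence="Methylation of hMLH1 promoter was detected in 11 (16.2 %) gastric cancer",
)
print(round(record.primary.frequency(), 1))  # 16.2
print(record.cell_lines.frequency())         # None: no cell lines analysed
```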
PubMeth can be queried using the web interface at http://www.pubmeth.org in two ways, depending on the researcher's focus:
• Gene-related: in which cancer types (and subtypes) are the genes of interest reported to be methylated?
• Cancer-related: which genes are reported to be methylated in the cancer types/subtypes of interest?
9.3.1 Gene-centric query
A query is created in two easy steps. In the first step, the user provides a list of genes (different identifiers are accepted: gene symbol or name, RefSeq, Ensembl ID, ...). The query is analyzed using local symbol/alias lists generated from GeneCards, and suggestions are presented to the user. In the second step, the user reviews the selections made (thanks to intelligent sorting in the background, the pre-selected genes are most likely correct) and submits the final choices.

At this point, the results are generated and the main result page is presented to the user. This page ranks the genes by the number of references to each gene in the database. A graphical summary of the number of references, the number of primary samples and the mean methylation frequency within different cancer types is also given (example in Figure 9.2).

The summary is very useful if multiple genes are searched at once; this feature is what distinguishes this database from previous efforts. One practical usage example: using a pharmacologic demethylation approach in cell lines, 50 candidate genes are selected. The question then is which genes to sub-select for verification in primary cancer samples, a choice often based on time-consuming literature searches. This selection is facilitated by the summary view of PubMeth.

From the main page, one can go to detailed pages focusing on a selected gene in a certain cancer type. On such a detailed page, graphical representations of the number of references in the database, the total number of samples and the mean methylation frequency are displayed for the different cancer types and their subtypes. The complete individual records, linked to their original PubMed record, are shown.

Users can also choose to browse a pre-computed gene list. The advantage is that the user can browse all genes in PubMeth without having to query the database, which is significantly faster. However, the summary view is not available in this mode.
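The summary on the main result page (genes ranked by number of references, with the total number of primary samples and the mean methylation frequency per cancer type) could be computed along the following lines. This Python sketch works under several assumptions: the rows are invented and are not PubMeth data, the mean is taken as an unweighted average over records, and the bin boundaries follow the legend of Figure 9.2 with upper bounds treated as inclusive.

```python
from collections import defaultdict

# (gene, cancer type, PubMed ID, primary samples, methylation frequency in %)
# Invented illustrative rows, not PubMeth data.
records = [
    ("GENE_A", "breast", 1, 50, 40.0),
    ("GENE_A", "breast", 2, 30, 20.0),
    ("GENE_A", "gastric", 3, 68, 16.2),
    ("GENE_B", "breast", 4, 20, 75.0),
]


def summarise(rows):
    """Rank genes by reference count and summarise each gene/cancer-type cell."""
    references = defaultdict(set)
    cells = defaultdict(lambda: {"samples": 0, "freqs": []})
    for gene, cancer, pmid, n_samples, freq in rows:
        references[gene].add(pmid)
        cell = cells[(gene, cancer)]
        cell["samples"] += n_samples
        cell["freqs"].append(freq)
    # Most-referenced genes first; ties are broken alphabetically.
    ranking = sorted(references, key=lambda g: (-len(references[g]), g))
    summary = {key: (cell["samples"], sum(cell["freqs"]) / len(cell["freqs"]))
               for key, cell in cells.items()}
    return ranking, summary


def colour_bin(freq):
    """Map a frequency (in %) to a legend bin; the boundaries are an assumption."""
    if not 0 <= freq <= 100:
        raise ValueError("frequency must be a percentage")
    if freq == 0:
        return "0"
    for upper in (20, 40, 60, 80, 100):
        if freq <= upper:
            return f"{upper - 20}-{upper} %"


ranking, summary = summarise(records)
print(ranking)                        # ['GENE_A', 'GENE_B']
print(summary[("GENE_A", "breast")])  # (80, 30.0)
print(colour_bin(30.0))               # 20-40 %
```

A mean weighted by sample count would be an equally plausible reading of "mean methylation frequency"; the text does not say which one is used.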
[Figure 9.2 (image): summary grid of a gene-centric query. Cancer types shown include breast, cervical, cardial, oesophageal, liver, salivary gland, gall bladder, oral, ovarian, brain, neuroblastoma, neuroendocrine, mucoepidermoid carcinoma, lymphoma, gastric, endocrine, kidney, pancreas, colorectal, leukaemia, mesothelioma, nasopharyngeal, skin, small bowel, bone, larynx, Wilms tumour, prostate and bile duct. Each cell is labelled with a number of primary samples and coloured by methylation frequency: 0, 0-20 %, 20-40 %, 40-60 %, 60-80 %, 80-100 %.]
Figure 9.2: Summary page of a gene‐centric query. The different colors represent the frequency of methylation of the gene in the different cancer types (what percentage of the samples showed methylation), while the numbers indicate the total number of primary samples tested for methylation
9.3.2 Cancer-centric query
A cancer-centric query is executed in one easy step: the user selects cancer types and/or subtypes, up to three levels deep (e.g. lymphoma - non-Hodgkin lymphoma - B-cell lymphoma - diffuse large B-cell). The result is an overview, in the same style as for the gene-centric search, of the genes most commonly described as methylated in the selected cancer types, together with the total number of samples and the mean methylation frequency. From this summary page, navigating to the detailed pages is intuitive.

This type of search is meant to give a quick overview of the genes reported in a methylation context in the cancer (sub)types of interest, and at which frequency. It can be used to explore methylation in these cancer types, to compare with experimental results or, as a next step, to start a gene-centric search on these genes for full details in all cancer (sub)types.

A screencast that shows how to query PubMeth is available on the PubMeth website.
9.4 Performance of PubMeth, discussion and future

To evaluate the performance of PubMeth, we tested how well the database performed in comparison with a careful manual literature search. To this end, we selected a very recent review focusing on DNA methylation in breast cancer (Agrawal et al., 2007). This article contains a table in which the authors list 39 genes known to be hypermethylated in breast cancer, with their literature references. The genes in this list were entered into the gene-centric search of PubMeth: 27 genes are listed in PubMeth, 11 are not listed and 1 gene could not be associated with a gene symbol. Of the 27 genes listed in PubMeth, 20 are described in breast cancer. Breast cancer is listed first on the results page due to the background sorting mechanisms. However, 18 genes from the review (the 11 that are not listed plus the 7 that are listed only for other cancer types) are not associated with breast cancer in PubMeth.

On the other hand, the review article lists 39 different genes, while a cancer-centric search for breast cancer returns 94 genes. Importantly, the genes found both in PubMeth and in the review (the shared group) are associated with a high number of primary samples in PubMeth. If all 94 genes are ranked by decreasing number of primary samples, most members of the shared group are at the top of the ranking; almost the complete top 10 belongs to the shared group (all except numbers 7 and 8).

This example clearly shows the power of PubMeth as well as its weaknesses. First of all, doing such a literature search manually usually takes multiple hours, while PubMeth presents its summary within minutes. Thanks to the different keyword and alias lists, PubMeth is in most cases able to find more references than a manual search would. On the other hand, abstracts often do not mention any of the genes in question, and such abstracts are not considered in PubMeth. Examples are large-scale studies with multiple genes, or reviews. These articles are generally easily found by manual searches but not by our text-mining approach, which can only screen abstracts. As long as there is no universal or centralized system to screen full-text articles, or a mainstream open access strategy, one solution would be to drop the restriction that the abstract has to contain a gene and to do more general searches. Another possibility is to allow users to enter their suggestions for inclusion into PubMeth; such a submission system would combine the power of user submissions with that of an automated text-mining approach that has proven very powerful in dealing with different keyword lists and gene name variants. Such a submission system is available on the PubMeth website: articles related to DNA methylation in cancer that could not be found with PubMeth can be suggested for inclusion.

Currently, PubMeth focuses only on hypermethylation; however, the inclusion of hypomethylated genes would be useful for some users as well. In a next update, keywords related to hypomethylation will be added. Other future database updates should take into account the different tissues of origin (for example, clearly separating primary cancer tissue from serum) and the different types of normals (surrounding tissue in a tumor patient versus samples from a healthy person). However, different articles use different terminologies, and this information is often not easily extractable.

Improvements to the interface should reflect the separation of samples described above. Currently, only the mean methylation frequency in primary cancers is given. This could be extended to give an idea of the degree of variation between the different experiments and methods, the difference between cancer and normal tissue, and the frequency in cell lines. However, it is a real challenge to present all this useful information in a clear interface that is easy to overview and browse.
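The bookkeeping behind the breast cancer comparison in Section 9.4 can be reproduced with simple set operations. In the Python sketch below, the gene identifiers are synthetic placeholders (R0, R1, ..., P0, P1, ...) chosen only to match the counts reported in the text; they are not the genes from the review or from PubMeth. The "recall" figure is an illustrative measure that is not reported in the text.

```python
def compare(review_genes, pubmeth_genes, pubmeth_breast_genes):
    """Compare a review's gene list with PubMeth content using set operations."""
    shared = review_genes & pubmeth_breast_genes
    return {
        "listed": len(review_genes & pubmeth_genes),
        "not_listed": len(review_genes - pubmeth_genes),
        "shared": len(shared),
        "not_breast": len(review_genes - pubmeth_breast_genes),
        "pubmeth_only": len(pubmeth_breast_genes - review_genes),
        "recall": len(shared) / len(review_genes),
    }


# 38 review genes with a gene symbol (39 minus the one that could not be mapped).
review = {f"R{i}" for i in range(38)}
# 94 genes returned for breast cancer: 20 shared with the review, 74 others.
pubmeth_breast = {f"R{i}" for i in range(20)} | {f"P{i}" for i in range(74)}
# 27 review genes are listed in PubMeth for at least one cancer type.
pubmeth_all = {f"R{i}" for i in range(27)} | pubmeth_breast

result = compare(review, pubmeth_all, pubmeth_breast)
print(result["listed"], result["not_listed"], result["shared"])  # 27 11 20
print(result["not_breast"], result["pubmeth_only"])              # 18 74
print(round(100 * result["recall"], 1))                          # 52.6
```

Under these counts, PubMeth associates 20 of the 38 mappable review genes with breast cancer and returns 74 breast cancer genes that the review does not list.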
9.5 Acknowledgments

The authors would like to thank all initial test users of PubMeth for their detailed comments and suggestions for improvement. Many thanks to all the people who helped to correct and improve the manuscript.
Chapter 10: Conclusion

- [The phone is ringing. Roy is drinking coffee and licking doughnut sugar from his fingers, purposefully delaying for as long as possible before he answers it]
- Roy: [answers phone] Hello IT. Have you tried turning it off and on again? ... OK, well, the button on the side. Is it glowing? ... Yeah, you need to turn it on. Err, the button turns it on. [Moss enters and tosses Roy a muffin]
- Roy: Yeah, you do know how a button works, don't you? No, not on clothes. [Moss's phone rings. He answers it.]
- Moss: Hello IT. Yuhuh. Have you tried forcing an unexpected reboot?
- Roy: No, there you go, I just heard it come on. No, that's the music you hear when it comes on. No, that's the music you hear when... I'm sorry, are you from the past?
- Moss: You see the driver hooks a function by patching the system call table so it's not safe to unload it unless another thread is about to jump in there and do its stuff. And you don't want to end up in the middle of invalid memory. [laughs] Hello?
Situation in "The IT Crowd" (2006)
DNA methylation, and epigenetics in general, is a 'hot topic' in the scientific literature. It is now clear that in some cancer types epigenetic changes are much more common than mutations and other genetic alterations. However, literature data are not accessible in a standardized way. One way to overcome this problem is to use text recognition and mining tools in combination with publicly available thesauruses (lists of words, synonyms, gene ontology terms, cancer-related keywords, ...). Being able to search these datasets automatically gives quick insight into the current state of research, allows better and faster formulation of new hypotheses to test, avoids repeating tests that have already been performed and allows confirmed data to be used as positive controls.

With PubMeth, we were able to put methylation literature data into a searchable database and to present the results in a handy interface that is easy to overview. This knowledge base can be used to speed up routine literature searches and to drive research hypotheses forward (positive/negative controls, enrichment and clustering analysis, functional insights).
Genome-wide selection and discovery of DNA methylation markers. DNA methylation markers are so-called biomarkers that can be used to discover cancer cells at an early stage, can be detected using non-invasive methods or can be used to predict response to therapies. Careful initial experimental setup and selection procedures can increase the success rate of the experimental validation and select markers with better specificity and selectivity.
Part 3: Genome‐wide selection of methylation markers
Chapter 11: Introduction

Dieu me pardonnera, c'est son métier [God will forgive me, it is his job]
Last words of Heinrich Heine (1856)

A biomarker in cancer research is a feature that can distinguish cancer samples from normal samples, can stage or grade the cancer, or can be used to predict response to treatment. The earlier in cancer progression this marker can be identified, the better. Epigenetics, and DNA methylation in particular, opens new perspectives for finding such biomarkers.

Finding candidate methylation biomarkers has become possible with the rise of high-throughput methodologies. However, some restrictions apply in the search for methylation biomarkers:
• Testing many promoter regions for methylation requires a lot of sample material. Primary cancer samples are often limited, so samples of cancer cell lines are used instead. However, about half of the effects seen in a cell line may be due solely to the fact that it is a cell line
• The effect of demethylating treatments (such as treatment with DAC) can only be identified in cell lines
• Array-based techniques have limited sensitivity, as they are limited by diffusion

The problems described above indicate that high-throughput screening methodologies can only be used to list possible marker candidates. These candidates need further validation on primary cancer samples, using more sensitive techniques (such as PCR-based technologies). Also, the sensitivity and selectivity of the methylation marker have to be determined in order to use it appropriately and to avoid overfitting.

To limit the candidate list to a reasonable number of genes to validate and to improve the success rate of this selection, the following strategies can be applied:
• Perform replicates, or perform the experiment with multiple cancer cell lines
• Try to incorporate primary cancer samples in the initial high-throughput screening steps
• Combine different sources of data: expression results of untreated samples and of samples with different demethylation treatments. Make use of publicly available literature and other data sources
• Find a way to prioritize the candidate list after the initial screening in order to increase the success rate of the validation study

These strategies have been applied in the different case studies in this part. The main focus is on the intelligent setup of large-scale screening studies and on ways to rank and prioritize possible markers for the validation studies. The case studies clearly demonstrate that both the design of the study and the analysis can be optimized to obtain an acceptable success rate in the subsequent validation studies.

As cervical cancer is prominently present in both this part and the following, a brief introduction is given in an intermezzo (originally published in Dutch). Next, several analysis methods are described. The methodologies can mainly be divided into three different strategies:
• Relaxation ranking, applied both in vivo and in silico in cervical cancer
• Genome-wide promoter analysis strategies, shown in silico and applied to various cancer types, including lung, prostate, breast and neuroblastoma
• Ranking methodologies to identify markers that can be used to predict treatment response, applied to platinum therapy in ovarian cancer
Chapter 12: Intermezzo: DNA methylation markers aid early detection of cervical carcinoma

Paper 3: DNA methylatiemerkers helpen vroegtijdige opsporing van cervixcarcinoom [DNA methylation markers aid early detection of cervical carcinoma]
Van Criekinge W, Ongenaert M, van der Zee AGJ, Wisman GBA, Kridelka F.
As published in De agenda Gynaecologie - oncologie, October 2008, p. 12-14.

Since cervical cancer (cervical carcinoma) is mentioned several times in this thesis, and from different perspectives, it is useful to provide some background on this type of cancer. Cervical cancer was recently in the news because several vaccines have been developed against a virus that is strongly involved (HPV).

Cervical carcinoma is the second most common cancer in women worldwide and even the most frequent one in developing countries. Regular population screening has led to a strong decrease in mortality in the Western world. Screening consists of a cytological examination of the cervix, the classical smear or Pap test, which aims to detect asymptomatic premalignant abnormalities of the cervix: the so-called low-grade squamous intraepithelial lesions (LSILs) and high-grade SILs (HSILs). However, the sensitivity of this test for HSIL and cervical cancer (30%-87%) is far from optimal (Nanda et al., 2000). As a result, some HSILs and cervical carcinomas will be missed and therefore not treated. The relatively slow carcinogenesis, combined with regular screening, largely compensates for this problem. Nevertheless, cervical carcinomas are regularly diagnosed in patients whose preceding cervical smear was scored as normal.
Because there is a strong association between the human papillomavirus (HPV) and the development of cervical carcinoma (Bosch et al., 2002), several countries are currently investigating whether an HPV test, with or without a smear, is worthwhile (Bulkmans et al., 2007). There are more than 100 different types of HPV, 15 of which are characterized as high-risk HPV (hr-HPV), meaning that these are the types that can cause cervical cancer. Several studies have shown that testing for hr-HPV DNA gives a considerably higher sensitivity than cytology, namely more than 95%. This does come with a lower specificity, however. Most women become infected with HPV during their lifetime but can also clear the virus spontaneously. Screening for hr-HPV is therefore not specific, because women will test positive who have no (pre)malignant abnormality of the cervix and will not develop one. The more young women the screened population contains, the lower this specificity becomes, owing to the well-known high incidence of hr-HPV in young, sexually active women (Kulasingam et al., 2002).

Recently, two HPV vaccines have become available, which will probably be applied on a large scale in Europe in the short term (Harper et al., 2004). Since the target group for vaccination initially consists of girls between 10 and 13 years of age, the first effects on the occurrence of (pre)malignant abnormalities of the cervix will not appear for 20-25 years, which also means that until then there are no consequences for screening.

Even after that, screening for (pre)malignant abnormalities will very probably remain necessary, because the two vaccines currently only protect against hr-HPV 16 and 18, which together are responsible for only 75% of cervical cancers in Europe (Smith et al., 2007). Given the continuing need for cervical carcinoma screening and the shortcomings of both the Pap test and the hr-HPV test, the development of new biomarkers with very high sensitivity and specificity could add value to population screening for cervical cancer. The shortcoming of the highly sensitive hr-HPV screening test could also be overcome by combining this test, when positive, with a highly specific triage test.

The analysis of DNA methylation, an epigenetic process, is possibly better suited to predicting the risk of cervical carcinoma. Methylation is the coupling of a methyl group (CH3) to the cytosine (C) in the DNA sequence when it is followed by a guanine (G). When many CG dinucleotides occur close together in the sequence, this is called a CpG island. Methylation of CpG islands in the promoter region of a gene (about half of all genes have a CpG island) switches off RNA transcription, which is called "gene silencing".

[Figure 12.1 (image). Normal: promoter and gene region, protein expression. Cancer: methylated (M) promoter and gene region, blocked protein expression.]
Figure 12.1: Methylation of a gene promoter blocks gene expression and protein production

The gene will then no longer be transcribed and thus no longer translated into protein. This often concerns the inactivation of tumour suppressor genes, resulting in deregulated cell division. Methylated gene promoters have been demonstrated in many different tumours, as well as in precursor stages of tumours. DNA methylation changes are also tissue- and tumour-type specific. DNA methylation is therefore an extremely interesting biomarker for the (early) detection of (pre)malignant abnormalities.

Several cervix-specific methylated gene promoters have already been described in the literature and can be detected with the methylation-specific PCR (MSP) method in smears of patients with (pre)malignant abnormalities of the cervix.

In a first preliminary study, in which smears were collected from healthy women who underwent a hysterectomy (control smears) and from cervical carcinoma patients, four methylation markers showed a sensitivity for the detection of cancer of 89% at a defined specificity of 100% (Wisman et al., 2006). This sensitivity was comparable to that of hr-HPV detection and cytomorphology.
[Figure 12.2 (image). Template: CGACGCGCGCCGC. Step 1, chemical treatment: with methylated CpGs the template becomes CGACGCGCGUCGU; with unmethylated CpGs it becomes UGAUGUGUGUUGU. Step 2, PCR with methylation-specific primers: PCR product for the methylated template, no PCR product for the unmethylated template.]
Figure 12.2: Methylation-specific PCR (MSP) method: DNA is first chemically modified (using bisulfite), whereby an unmethylated cytosine (C) is converted to uracil (U) while a methylated C remains unchanged. The methylation profile is then detected by a PCR reaction with primers specific for the methylated sequence
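The bisulfite conversion step of Figure 12.2 can be mimicked in a few lines of Python. This is a toy model for illustration only: it assumes complete conversion, treats every CpG as either fully methylated or fully unmethylated, and reduces primer annealing to an exact substring match. The primer target CGACGCG is a hypothetical choice, and the function names are not from the original work. The template sequence is the one shown in the figure.

```python
def cpg_positions(seq):
    """Positions of cytosines that are directly followed by a guanine (CpG)."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def bisulfite_convert(seq, methylated_positions=()):
    """Convert every unmethylated C to U; methylated C positions stay C."""
    methylated = set(methylated_positions)
    return "".join("U" if base == "C" and i not in methylated else base
                   for i, base in enumerate(seq))


def msp_product(converted, primer_target):
    """Crude stand-in for MSP: a product forms only if the target is present."""
    return primer_target in converted


template = "CGACGCGCGCCGC"  # sequence shown in Figure 12.2
methylated = bisulfite_convert(template, cpg_positions(template))
unmethylated = bisulfite_convert(template)

print(methylated)    # CGACGCGCGUCGU
print(unmethylated)  # UGAUGUGUGUUGU
print(msp_product(methylated, "CGACGCG"))    # True
print(msp_product(unmethylated, "CGACGCG"))  # False
```

Cytosines outside a CpG context are converted in both cases, which is why the methylated template also ends up with two U residues.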
Nevertheless, this four-marker combination still did not detect all cervical carcinomas. Therefore, a number of newly identified cancer-specific methylation markers are currently being studied further. The strategy in our research used cervical carcinoma cell lines, either untreated or treated with demethylating agents. After treatment with these demethylating agents, genes that were transcriptionally inactive due to methylation are expressed again (re-expression). The difference in RNA expression was determined by microarray analysis, coupled to a biostatistical analysis to identify methylation markers. These markers were functionally evaluated on a screening platform and ranked in a methylation table according to their differential methylation pattern between cancer and normal cervical tissue.

Work is currently under way to verify and validate these methylated genes by determining the diagnostic value of detecting them in large numbers of patients who were referred to the gynaecology outpatient clinic of the UMCG (Groningen, the Netherlands) and of the ULg (Liège, Belgium) because of an abnormal smear.

If the results of this analysis are satisfactory, much larger series of smears, collected in the context of cervical carcinoma screening, will subsequently be analysed. In this way, a sensitive and specific test based on methylation markers will eventually be developed, which can be used as a screening method for (pre)malignant abnormalities of the cervix or as a triage test for hr-HPV-positive smears.
[Figure 12.3 (flow scheme). Microarray experiments: cervical carcinoma cell lines, re-expression data → Biostatistical analysis + literature: selection of 232 genes → Assay design tool: 424 MSP assay designs → High-throughput MSP platform: 79170 MSP results → Data analysis and interpretation: ranking of the 424 markers, methylation table.]
Figure 12.3: Scheme for identifying candidate methylation markers
[Figure 12.4 (image): ranked assays along the X-axis; cancer samples (top) and normal samples (bottom) along the Y-axis.]
Figure 12.4: Methylation table. 87 methylation profiles of cervical cancer samples (top) versus 114 normal cervix samples (bottom). Samples are shown along the Y-axis; the methylation profile over the 424 markers (X-axis) of each individual sample is shown in the horizontal rows. Markers that discriminate best between cancers and normals (high sensitivity, high specificity) are placed leftmost, with discriminating power decreasing to the right. Red squares indicate methylated results, green squares unmethylated results and white squares invalid results
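The ordering of markers in such a methylation table can be sketched in Python as follows. The caption states only that markers with high sensitivity and high specificity are placed first; the exact criterion is not given. Combining the two as Youden's J (sensitivity + specificity - 1) is therefore an assumption, as are the function names and the toy calls, which are not data from the study.

```python
def marker_stats(cancer_calls, normal_calls):
    """Sensitivity and specificity of one marker.

    Each call is True (methylated), False (unmethylated) or None (invalid);
    invalid results are excluded before the fractions are computed.
    """
    cancer = [c for c in cancer_calls if c is not None]
    normal = [c for c in normal_calls if c is not None]
    sensitivity = cancer.count(True) / len(cancer) if cancer else 0.0
    specificity = normal.count(False) / len(normal) if normal else 0.0
    return sensitivity, specificity


def rank_markers(table):
    """Sort markers from most to least discriminating (assumed: Youden's J)."""
    def youden(marker):
        sensitivity, specificity = marker_stats(*table[marker])
        return sensitivity + specificity - 1
    return sorted(table, key=youden, reverse=True)


# Toy calls, invented for illustration: marker -> (cancer calls, normal calls).
table = {
    "M1": ([True, True, False, None], [False, False, True]),
    "M2": ([True, True, True, True], [False, False, False]),
    "M3": ([False, False, True, None], [False, True, True]),
}
print(rank_markers(table))  # ['M2', 'M1', 'M3']
```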
Part of the biostatistical analysis referred to in this intermezzo is based on Chapter 13: Discovery of methylation markers in cervical cancer, using relaxation ranking (methylation biomarkers in cervical cancer). The methodologies are also described further in Chapter 15: Genome-wide promoter analysis uncovers portions of the cancer methylome and Chapter 20: Cervical cancer and the HPV family of viruses.
Chapter 13: Discovery of methylation markers in cervical cancer, using relaxation ranking

Paper 4: Discovery of methylation markers in cervical cancer, using relaxation ranking
Ongenaert M, Wisman GBA, Volders HH, Koning AJ, van der Zee AGJ, Van Criekinge W, Schuuring E.
As published in BMC Medical Genomics, 1:57. (Open Access)

Background: To discover cancer-specific DNA methylation markers, large-scale screening methods are widely used. The pharmacological unmasking expression microarray approach is an elegant method to enrich for genes that are silenced and re-expressed during functional reversal of DNA methylation upon treatment with demethylation agents. However, such experiments are performed in in vitro (cancer) cell lines, mostly with poor relevance when extrapolating to primary cancers. To overcome this problem, we incorporated data from primary cancer samples in the experimental design. A strategy to combine and rank data from these different data sources is essential to minimize the experimental work in the validation steps.

Aim: To apply a new relaxation ranking algorithm to enrich DNA methylation markers in cervical cancer.

Results: The application of a new sorting methodology allowed us to sort high-throughput microarray data from both cervical cancer cell lines and primary cervical cancer samples. The performance of the sorting was analyzed in silico. Pathway and gene ontology analysis was performed on the top selection and gives a strong indication that the ranking methodology is able to enrich towards genes that might be methylated. Terms like regulation of progression through cell cycle, positive regulation of programmed cell death as well as organ development and embryonic development are overrepresented. Combined with the highly enriched number of imprinted and X-chromosome located genes, and increased prevalence of known methylation markers selected from cervical (the highest-ranking known gene is CCNA1) as well as from other cancer types, the use of the ranking algorithm seems to be powerful in enriching towards methylated genes. Verification of the DNA methylation state of the 10 highest-ranking genes revealed that 7/9 (78%) gene promoters showed DNA methylation in cervical carcinomas. Of these 7 genes, 3 (SST, HTRA3 and NPTX1) are not methylated in normal cervix tissue.

Conclusion: The application of this new relaxation methodology allowed us to significantly enrich towards methylation genes in cancer. This enrichment is shown both in silico and by experimental validation, and revealed novel methylation markers as proof-of-concept that might be useful in early cancer detection in cervical scrapings.
13.1 Introduction

DNA methylation represents a modification of DNA by addition of a methyl group to a cytosine, also referred to as the fifth base (Doerfler et al., 1990). This epigenetic change does not alter the primary DNA sequence and might contribute to overall genetic stability and maintenance of chromosomal integrity. Consequently, it facilitates the organization of the genome into active and inactive regions with respect to gene transcription (Robertson, 2002). Genes with CpG islands in the promoter region are generally unmethylated in normal tissues. Upon DNA hypermethylation, transcription of the affected genes may be blocked, resulting in gene silencing. In neoplasia, hypermethylation is now considered one of the important mechanisms resulting in silenced expression of tumour suppressor genes, i.e. genes responsible for control of normal cell differentiation and/or inhibition of cell growth (Serman et al., 2006). In many cancers, various markers have been reported to be hypermethylated (Paluszczak and Baer-Dubowska, 2006).

The detection of DNA hypermethylation was revolutionized by two discoveries. Bisulfite treatment results in the conversion of cytosine residues into uracil, except the protected methylcytosine residues (Hayatsu, 1976). Based on the sequence differences after bisulfite treatment, methylated DNA can be distinguished from unmethylated DNA, using methylation-specific PCR (MSP) (Herman et al., 1996).

In the last few years, hypermethylated biomarkers have been used in cancer research and diagnostics (Esteller, 2003; Esteller, 2007a; Herman, 2005). Presently, DNA hypermethylation of only a few markers is of clinical relevance (Esteller, 2007a). Two classical examples are hypermethylation of MGMT in the prediction of treatment response to temozolomide in glioblastoma (Hegi et al., 2005) and DNA hypermethylation of GSTP1 in the early detection of prostate cancer (Hoque et al., 2005b). The search for markers that are hypermethylated in specific cancer types resulted in a large list of genes, but more recent evidence revealed that many of these markers are methylated in normal tissues as well (Dammann et al., 2005; Wisman et al., 2006; Wisman et al., 2006).

To discover novel markers that are specific for certain stages of cancer with a high specificity and sensitivity, large-scale screening methods were developed, such as Restriction Landmark Genomic Scanning (Costello et al., 2000), Differential Methylation Hybridization (Yan et al., 2000; Strathdee and Brown, 2002; Huang et al., 1999), Illumina GoldenGate® Methylation, microarray-based Integrated Analysis of Methylation by Isoschizomers (MIAMI) (Hatada et al.,
2006) and MeDIP in combination with methylation-specific oligonucleotide microarray (Shi et al., 2003). These approaches demonstrated that large-scale screening has a large potential to find novel methylation targets in a whole range of cancers.

Pharmacological unmasking expression microarray approaches are also suited to identifying cancer-related hypermethylated genes (Sova et al., 2006; Tokumaru et al., 2004; Yamashita et al., 2002). In this approach, the re-activation of gene expression is studied by microarray analysis during functional reversal of DNA methylation and histone acetylation in cancer cell lines, using demethylating agents and histone deacetylase inhibitors. This methodology generally results in a list of several hundreds of candidate genes. Although analysis of the promoter (e.g. screening for dense CpG islands) is used to narrow down the number of candidate genes, the list is still too large. This methodology has proven relevant, as its application resulted in the identification of new potential methylated genes (Guo et al., 2004; Yamashita et al., 2006).

However, the initial large-scale screening approach will also detect many genes that are not direct targets themselves but become re-activated due to the re-expression of, for instance, transcription factors (Cameron et al., 1999). Furthermore, in most studies only re-expression data after demethylation in cell lines were used. Smiraglia and coworkers (Smiraglia et al., 2001) calculated that more than 57% of the loci methylated in cell lines were never methylated in 114 primary cancers of different malignancy types. The small number of cell lines used to identify methylated genes does not allow conclusions to be drawn on the relevance of such cancer-specific genes without testing a large series of primary tumors, which is not done in most studies.

Finally, the completion of the sequence of the human genome provided information on genes, promoter gene structure, CG-content and chromosomal localization. These data are useful to define criteria for the candidate genes to act as appropriate targets for DNA methylation.

To identify genes that are downregulated due to promoter hypermethylation and to enrich for those genes that are most frequently involved in cervical cancer, we performed the following experiments:
• Affymetrix expression microarray analysis on a panel of frozen tissue samples from 39 human primary cervical cancers, to identify cancer-specific down-regulated genes
• To select those genes that are hypermethylated in cervical cancer, Affymetrix expression microarray analysis on a panel of 4 different cervical cancer cell lines in which the expression of (hyper)methylated genes was re-activated upon treatment with 5-aza-2'deoxycytidine (DAC) (blocking DNA methylation), and/or trichostatin A (TSA) (inhibiting histone deacetylase, HDAC)
• Data from both approaches were combined, and a novel non-parametric ranking and selection method was applied to identify and rank candidate genes
• Using in silico promoter analysis, we restricted the analysis to those candidate genes that carry CpG islands

To validate whether our new approach resulted in a significant enrichment of hypermethylated genes, we compared the first 3000 high-ranking candidate probes with lists of imprinted genes, X-chromosome located genes and known methylation markers. In addition, to investigate whether the promoters of these selected gene probes are hypermethylated and whether this methylation is present in cancer and not in normal tissue, we determined the hypermethylation status of the 10 highest-ranking candidate genes in both cervical cancers and normal cervices using COBRA (COmbined Bisulfite Restriction Analysis). These data revealed a highly significant enrichment of methylated genes.
13.2 Material and methods

13.2.1 Primary cervical tissue samples

For the expression microarray analysis, tissues from 39 early-stage frozen cervical cancer samples were used, taken from a collection of primary tumors surgically removed between 1993 and 2003. All patients were asked to participate in our study during their initial visit to the outpatient clinic of the University Medical Center Groningen (UMCG, Groningen, The Netherlands). Gynecological examination under general anesthesia was performed in all cervical cancer patients for staging in accordance with the International Federation of Gynecology and Obstetrics (FIGO) criteria (Finan et al., 1996). Tumor samples were collected after surgery and stored at -80 °C. The stage of cervical cancer patients included 33 FIGO stage IB (85%) and 6 FIGO stage IIA (15%). The median age of the cervical cancer patients was 46 years (interquartile range 35-52 years).

For COBRA and BSP (Bisulfite Sequencing PCR), 10 (of the 39) primary cervical cancers and 5 controls (normal cervix) were used. The age-matched normal cervical controls were women without a history of abnormal Pap smears or any form of cancer who were planned to undergo a hysterectomy for benign reasons during the same period. Normal cervices were collected after surgery and histologically confirmed.

Informed consent was obtained from all patients participating in this study. The study was approved by the ethics committee of the UMCG.
13.2.2 Cervical cancer cell lines

Four cervical carcinoma cell lines were used: HeLa (cervical adenocarcinoma, HPV18), SiHa (cervical squamous cell carcinoma, HPV16), CSCC-7 (nonkeratinizing large cell cervical squamous cell carcinoma, HPV16) and CC-8 (cervical adenosquamous carcinoma, HPV45). HeLa and SiHa were obtained from the American Type Culture Collection (ATCC). CSCC-7 and CC-8 (Koopman et al., 1999) were a kind gift of Prof. GJ Fleuren (Leiden University Medical Center, Leiden, The Netherlands). All cell lines were cultured in DMEM/Ham's F12 supplemented with 10% fetal calf serum.

Cell lines were treated for 3 days with low to high dose (200 nM, 1 μM or 5 μM) 5-aza-2'deoxycytidine (DAC), with 200 nM DAC followed by 300 nM trichostatin A (TSA) after 48 hours, or were left untreated. Cells were split to low density 24 hours before treatment. DAC was refreshed every 24 hours. After 72 hours, cells were collected for RNA isolation.
13.2.3 RNA and DNA isolation

From the frozen biopsies, four 10-µm-thick sections were cut and used for standard RNA and DNA isolation. After cutting, a 3-µm-thick section was stained with haematoxylin/eosin for histological examination, and only tissues with >80% tumor cells were included. Macrodissection was performed to enrich for epithelial cells in all normal cervices.

For DNA isolation, cells and tissue sections were dissolved in lysis buffer and incubated overnight at 55 °C. DNA was extracted using standard salt-chloroform extraction and ethanol precipitation for high-molecular-weight DNA and dissolved in 250 µl TE-4 buffer (10 mM Tris; 1 mM EDTA (pH 8.0)). For quality control, genomic DNA was amplified in a multiplex PCR containing a control gene primer set resulting in products of 100, 200, 300, 400 and 600 bp according to the BIOMED-2 protocol (van Dongen et al., 2003).

RNA was isolated with TRIzol reagent (Invitrogen, Breda, The Netherlands) according to the manufacturer's protocol. RNA was treated with DNase and purified using the RNeasy mini-kit (Qiagen, Westburg, Leusden, The Netherlands). The quality and quantity of the RNA were determined by Agilent Lab-on-Chip analysis (ServiceXS, Leiden, The Netherlands, www.serviceXS.com).
13.2.4 Expression data

Gene expression profiling of 39 primary cancers and 20 cell line samples was performed using the Affymetrix HG-U133 Plus 2.0 array, with 54,675 probes for analysis of over 47,000 human transcripts. The labeling of the RNA, the quality control, the microarray hybridization and the scanning were performed by ServiceXS according to Affymetrix standards. For labeling, ten micrograms of total RNA were amplified by in vitro transcription using T7 RNA polymerase.

Quality of the microarray data was checked using histograms, boxplots and an RNA degradation plot. One cell line sample was omitted because of poor quality. Using BioConductor (Gentleman et al., 2004), present (P), absent (A) or marginal (M) calls were determined with the MAS5 algorithm. To produce the detection call for each probe set, MAS5 uses a non-parametric statistical test (Wilcoxon signed rank test) that assesses whether the perfect-match probes show significantly more hybridization signal than their corresponding mismatch probes (Liu et al., 2002). The relaxation ranking approach relied only on P-calls. Some samples were analyzed in duplicate, and their profiles of P-calls were highly similar (93-95% of the probesets had an identical P/M/A call).
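As an illustration of how the detection calls feed into the ranking, the following minimal Python sketch tallies the number of P-calls per probe set within each sample group and computes the fraction of identical calls between two duplicate runs. It is not the authors' code (their analysis used BioConductor and R scripts); the function names, the dictionary layout and the group labels ("primary", "untreated", "treated") are assumptions made for this example only.

```python
from collections import Counter


def count_p_calls(detection_calls, sample_groups):
    """Count P-calls per probe set and per sample group.

    detection_calls: probe id -> {sample id: 'P', 'M' or 'A'}  (hypothetical layout)
    sample_groups:   sample id -> group label, e.g. 'primary', 'untreated', 'treated'
    Returns probe id -> Counter mapping group label to its number of P-calls.
    """
    counts = {}
    for probe, calls in detection_calls.items():
        # Only present (P) calls are counted; M and A calls are ignored.
        counts[probe] = Counter(
            sample_groups[sample] for sample, call in calls.items() if call == "P"
        )
    return counts


def call_concordance(calls_a, calls_b):
    """Fraction of probe sets with an identical P/M/A call in two duplicate runs."""
    if len(calls_a) != len(calls_b):
        raise ValueError("duplicate runs must cover the same probe sets")
    identical = sum(a == b for a, b in zip(calls_a, calls_b))
    return identical / len(calls_a)


# Toy data, invented for illustration only.
toy_calls = {
    "probe1": {"tumor1": "P", "tumor2": "A", "hela_dac": "P"},
    "probe2": {"tumor1": "A", "tumor2": "M", "hela_dac": "A"},
}
toy_groups = {"tumor1": "primary", "tumor2": "primary", "hela_dac": "treated"}
toy_counts = count_p_calls(toy_calls, toy_groups)
```

A `Counter` returns 0 for groups without any P-call, so `toy_counts["probe2"]["primary"]` can be read directly as a P-call count of zero.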
13.2.5 Relaxation ranking algorithm

In order to identify the most promising markers that are methylated in cervical cancer, we assumed that such markers should be silenced in cancer cells and upregulated upon re-activation after DAC/TSA treatment. Therefore, the best methylation markers will be genes represented by probes with:
• no expression in primary cervical cancers: P-calls=0 out of 39 cancers
• no expression in (untreated) cervical cancer cell lines: P-calls=0 out of 4 cell lines
• expression in cervical cancer cell lines treated with DAC (or DAC in combination with TSA): P-calls=15 out of 15 treated cell lines

To select those gene probes that would be the best candidate hypermethylated genes in cervical cancer, we present the relaxation ranking algorithm. Probesets were not ranked primarily on the number of P-calls, which would require explicitly set thresholds, but were ranked primarily by the number of probesets that would be picked up under given selection criteria (the number of P-calls in primary cancers, untreated and treated cell lines). The stricter these selection criteria (e.g. P-calls: 0 - 0 - 15), the lower the number of probes that meet them; as the conditions become more and more relaxed (higher number of P-calls in primary cancers and untreated cell lines, and lower number of P-calls in treated cell lines), more probes comply. In the end, using P-calls: 39 - 4 - 0 as criteria, all probe sets were returned. This way, there was no need to define an a priori threshold for the number of P-calls.

The following sorting method was applied (R-scripts are presented in the supplementary data):
1) All possible conditions were generated and the number of probes that were picked up under these conditions was calculated:
   a. the number of samples with expression (P) of a certain probe in
      i. primary cervical cancer samples is called x_sample
      ii. untreated cervical cancer cell lines is called y_sample
      iii. treated cervical cancer cell lines is called z_sample
   b. all combinations of x, y and z are made
      i. x (the number of P-calls in primary cancers) varies from 0 to 39
      ii. y (the number of P-calls in untreated cell lines) from 0 to 4
      iii. z (the number of P-calls in treated cell lines) from 0 to 15
      iv. in total, 3200 combinations of x, y and z can be made (40 × 5 × 16)
   c. a probeset was found under each of these generated conditions x, y and z if:
      i. x_sample ≤ x (number of P-calls for probe in primary cancers smaller than or equal to condition) AND
      ii. y_sample ≤ y (number of P-calls for probe in untreated cell lines smaller than or equal to condition) AND
      iii. z_sample ≥ z (number of P-calls for probe in treated cell lines larger than or equal to condition)
   d. under very strict conditions (x=0, y=0, z=15) no probes were found, while under the most relaxed conditions (x=39, y=4, z=0) all probes were returned. For all combinations of x, y and z, the number of probes that complied (w) was stored
2) The data was sorted with w as primary criterion (ascending), followed by x (ascending), y (ascending) and z (descending)
3) This sorted dataset was analyzed row by row. In row i, the w_i probes retrieved with criteria x_i, y_i, z_i were compared with the list of probes already picked up in rows 1 to i-1. If a probe did not occur in this list, it was added to the list
4) This process continued until there were m (user-defined) probes in the list
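The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R scripts from the supplementary data: the function name, the input layout (probe id mapped to its three P-call counts) and the toy data are invented for this example. The text does not specify the order of probes that enter the list within the same row, nor whether a row is cut off once m probes are reached; the sketch keeps input order within a row and stops as soon as m probes are collected, both of which are assumptions.

```python
from itertools import product


def relaxation_ranking(p_calls, n_primary, n_untreated, n_treated, m):
    """Rank probes by relaxation ranking and return the first m of them.

    p_calls: probe id -> (x_sample, y_sample, z_sample), the number of P-calls
             in primary cancers, untreated cell lines and treated cell lines.
    """
    # Step 1: generate every condition (x, y, z) and collect the probes that
    # comply: x_sample <= x, y_sample <= y and z_sample >= z.
    rows = []
    for x, y, z in product(
        range(n_primary + 1), range(n_untreated + 1), range(n_treated + 1)
    ):
        hits = [
            probe
            for probe, (xs, ys, zs) in p_calls.items()
            if xs <= x and ys <= y and zs >= z
        ]
        # w = len(hits); z is stored negated so that an ascending sort on the
        # key yields z in descending order.
        rows.append((len(hits), x, y, -z, hits))

    # Step 2: sort on w (ascending), then x (ascending), y (ascending),
    # z (descending). Only the first four fields are used as the sort key.
    rows.sort(key=lambda row: row[:4])

    # Steps 3 and 4: walk the sorted rows and append probes not seen before,
    # until m probes have been collected.
    ranked, seen = [], set()
    for _w, _x, _y, _neg_z, hits in rows:
        for probe in hits:
            if probe not in seen:
                seen.add(probe)
                ranked.append(probe)
                if len(ranked) == m:
                    return ranked
    return ranked


# Toy example (invented data): 2 primary cancers, 1 untreated and 2 treated
# cell lines instead of the 39, 4 and 15 samples used in the study.
toy = {"A": (0, 0, 2), "B": (1, 0, 2), "C": (2, 1, 0), "D": (0, 0, 1)}
top3 = relaxation_ranking(toy, n_primary=2, n_untreated=1, n_treated=2, m=3)
```

In the toy example, probe A matches the ideal profile and is ranked first. The condition sets that retrieve two probes come next, and among them the one with the lowest x is processed first, so D (never expressed in primary cancers) enters before B. With the sample sizes of the study (39, 4 and 15) the loop enumerates the 3200 conditions mentioned in the text; scanning all 54,675 probes for each condition in this naive way is slow, and a practical implementation would precompute cumulative counts instead.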
13.2.6 DNA methylation analysis using COBRA and bisulfite sequencing

To validate the (hyper)methylated status of candidate gene probes, DNA extracted from 10 cervical cancers and 5 normal cervices was analyzed using BSP and COBRA. Bisulfite modification of genomic DNA was performed using the EZ DNA methylation kit (Zymogen, BaseClear, Leiden, The Netherlands). The 5' promoter region of the tested gene was amplified using bisulfite-treated DNA. PCR primers for amplification of specific target sequences are listed in Supplementary Table 1. COBRA was performed directly on the BSP products as described by Xiong and Laird (1997), using digestions with BstUI, TaqI and/or HinfI according to the manufacturer's protocol (New England Biolabs Inc., Beverly, MA). For sequence analysis, the BSP products were purified (Qiagen) and subjected to direct sequencing (BaseClear, Leiden, The Netherlands). Leukocyte DNA collected from anonymous healthy volunteers and in vitro CpG-methylated DNA, prepared with SssI (CpG) methyltransferase (New England Biolabs Inc.), were used as negative and positive control, respectively.
13.3 Results

To identify novel markers that are methylated in cervical cancer, we applied a multistep approach that combines re-expression of silenced hypermethylated genes in cervical cancer cell lines (using DAC and DAC/TSA), downregulated expression in 39 cervical cancers, and selection of candidate markers using a relaxation ranking algorithm. The best profile of a candidate marker would be: no expression in any of the 39 cervical primary cancers and 4 untreated cancer cell lines, but re-activation of expression after demethylation and/or blocking of histone deacetylation in all 15 cell lines treated with various combinations of DAC/TSA (P-calls: 0 - 0 - 15). However, none of the probe sets showed this ideal profile. To generate a list of candidate genes, a relaxation ranking algorithm was applied.
Figure 13.1: The number of probes (w) that is retrieved using parameters x (number of P‐calls in primary cancers for probe), y (number of P‐calls in untreated cell‐lines for probe) and z (number of P‐calls in treated cell‐lines for probe)
The only variable used in the relaxation ranking is the number of probes we would like to retrieve. As shown in Figure 13.1, the number of probes retrieved (w) with parameters x, y and z (the number of P-calls in primary tumor samples, untreated cell lines and treated cell lines, respectively) follows a complex profile which consists not only of additive elements, but also of interactions between the parameters. In general, the number of P-calls in primary cancer samples (x) has the largest influence on w. The sorting methodology has the advantage that no cut-off values have to be chosen for x, y and z, and therefore there is no need to implicitly link a relative weight factor to the parameters.

To determine the optimal number of potentially hypermethylated candidate markers for further analysis, we estimated this number based on known (i.e. described in literature) methylation markers in cervical cancer. Forty-five known methylation markers were found by text mining, using GeneCards (Rebhan et al., 1997) for aliases/symbols to query PubMed through NCBI E-Utils (Supplementary Table 2). The position of the markers after ranking ("observed") was determined as shown in the step plot in Figure 13.2. If the markers were randomly distributed in the ranking, the profile would be similar to the curve marked 'expected'. This 'expected' curve is not a straight line, but is calculated based on whether a probe could be assigned a gene symbol, taking into account probes that are associated with a gene that is already associated with an earlier selected probe. The number of observed methylation markers has in general the same slope as expected. However, up to about 3000 probes, the slope of the number of observed markers versus the number of selected probes (in dashed lines) is much steeper than can be explained by a random distribution of the markers. When selecting more than 3000 probes, the slope suddenly decreases to a level that is close to random distribution. This enrichment can also be demonstrated statistically (see further). Therefore, we selected the first 3000 probes in the ranking, referred to as TOP3000, for further analysis. In this TOP3000 list, 2135 probes are associated with a gene symbol, of which 1904 are unique.
[Plot: number of methylation markers in cervical cancer (y-axis) versus number of selected probes (x-axis); series: Expected, Observed]

Figure 13.2: Step-plot to determine the optimal number of probes for further analysis. Step-plot of the number of retrieved known markers (45 published hypermethylation markers in cervical cancer, see Supplementary Table 2) as a function of the position after relaxation ranking (this is the number of selected probes after ranking). The step plot shows the actual (observed) number of markers. If the markers were randomly distributed, one would expect the profile marked 'expected' (details in the text). The trend of the observed markers versus the number of selected probes is indicated with dashed lines
13.3.1 Validation of the top 3000 probe list selected using relaxation ranking

To validate whether the TOP3000 contains potential hypermethylated genes, we determined the occurrence of various gene sets that are known to be hypermethylated, such as imprinted genes, chromosome-X genes, cervical cancer-related hypermethylated genes and genes reported to be methylated frequently in cancers other than cervical cancer.

Enrichment for imprinted genes

Imprinting is a genetic mechanism by which genes are selectively expressed from the maternal or paternal homologue of a chromosome. As methylation is one of the regulatory mechanisms controlling the allele-specific expression of imprinted genes (Holmes and Soloway, 2006), it is expected that known imprinted genes are enriched in the TOP3000 selection. According to the Imprinted Gene Catalogue (Morison et al., 2005), this TOP3000 list contains 16 imprinted (or parent-specific expressed) genes (Supplementary Table 3). On the whole Affymetrix array, a total of 74 imprinted genes could be assigned a probe. Taking into account duplicate probes and probes that are not associated with a gene symbol, 8.76 imprinted genes could be expected in the first 3000 probes if the imprinted genes were randomly distributed, indicating a 1.83-fold [16/8.76] enrichment in the TOP3000 (Χ²=5.904; p=0.0151). The enrichment towards imprinted genes is even more significant in the TOP100 candidate genes (3 versus only 0.31 expected; Χ²=14.9; p