ECD Master Thesis Report
Inference of Gene Interaction Networks from DNA Microarray. Application to Meta-analysis with R

NGUYEN Hoai-Tuong, 2008
Supervision: Gérard Ramstein
Location: Polytechnic School of Nantes University
Duration: February 04 - July 31 2008
In partial fulfillment of the requirements of the Master 2 ECD Program

Abstract: Data mining is one of the ten emerging technologies of the 21st century. It has been developed in diverse fields, from the supermarket to the factory, from the hospital to biology laboratories. By analyzing the relationships among the objects of each field, researchers can extract intelligent and useful information. In the last decades, numerous data mining studies have made it possible to visualize gene interactions in microarray data. Our work builds on this development by combining two well-known approaches, namely meta-analysis and Bayesian networks, to reconstruct the gene interaction network. This report presents three key points: first, meta-analysis for combining multiple microarray studies in a single common analysis; second, discretization of the continuous microarray data as input for the Bayesian network (BN); and finally, inference of gene regulation using the BN approach.

Résumé (translated from the French): Data mining is one of the ten emerging technologies of the 21st century. It has been developed in diverse fields, from the supermarket to agribusiness, from the hospital to biology laboratories. By analyzing the relationships among the objects of each field, researchers can extract useful information. This is a fundamental task at the heart of many data mining problems. Indeed, several data mining studies make it possible to visualize gene interactions in DNA microarray (biochip) data. Our work builds on this development by combining two well-known approaches, namely meta-analysis and Bayesian networks (BNs), to reconstruct the gene interaction network. This report presents three main points: first, meta-analysis for combining multiple microarray studies; second, discretization of the microarray data as input for the BN; and finally, reconstruction of the gene regulation network with the help of BNs.
Contents

1 Acknowledgments
2 Introduction
  2.1 Motivation behind the thesis
  2.2 Contribution of the thesis
  2.3 Organization of the thesis
3 Review and Background of DNA Microarray Technology
  3.1 From the cell to the Gene expression
  3.2 What is a microarray?
  3.3 How does a microarray work?
4 Meta-analysis for microarray
  4.1 What is Meta-analysis?
  4.2 Why using meta-analysis for microarray?
  4.3 Meta-analysis from Microarray Studies
  4.4 Normalization
5 Discretization
  5.1 An introductive example
  5.2 To discretize or not to discretize for gene expression data: an important decision
6 Bayesian networks for microarray
  6.1 What is a Bayesian network (BN)?
    6.1.1 Graphical model (GM)
    6.1.2 Definition of Bayesian networks
  6.2 Why use BN for microarray?
  6.3 Learning and Validating Bayesian Networks for Inferring Gene Regulation
    6.3.1 Bayesian Networks in Microarray Technology
    6.3.2 Learning Bayesian Networks from Microarray Data
7 Experiments and Results
  7.1 GEO and meta-analysis
  7.2 Identification of a robust gene
  7.3 Missing values imputation
  7.4 Normalization with R
  7.5 Threshold choice: the art of discretization for gene expression data
  7.6 The first "brick" of the future "house"
  7.7 Deal R package for Bayesian Gene Interaction Networks
8 Conclusion and Perspective
A Appendix

List of Figures

1 A organism under the microscope (cell in red and DNA in green)
2 From to cell to DNA
3 From DNA to genome
4 From genome to Gene expression
5 DNA microarrays Process
6 The DNA Array Analysis Pipeline
7 A simple example of Bayesian networks
8 Gene expression during the yeast cell cycle. The genes correspond to the rows, and the experiment are the columns.
9 Experimental results for each dataset (study)
10 Experimental results for the combining dataset (cross-study)
11 Orange nodes is the highly connected nodes in the graph

List of Tables

1 Some works focused on identifying genes interactions
2 Comparison of the proposed methods for the gene regulation using Bayesian networks
3 Variables in the our testing data set
4 Gene expression matrix before discretization
5 Gene expression matrix after discretization
6 A part of our data set
7 Table probability distribution for gene BIRC5
1 Acknowledgments
I would like to express my sincere gratitude to Mr. Henri BRIAND and Mr. Xuan-Hiep HUYNH for giving me the wonderful opportunity to pursue my Master's degree in France. I am grateful to my supervisor, Mr. Gérard RAMSTEIN, for his continuous encouragement, support and guidance throughout this internship. I would like to thank Mr. Philippe LERAY for his assistance with general questions on Bayesian networks and for the invitation to JFRB'08 (4èmes Journées Francophones sur les Réseaux Bayésiens). Thanks also to Mr. William CLAYTON for his useful suggestions on the manuscript. Finally, this work was supported by a training grant from Lyon 2 University (France) and the Polytechnic School of Nantes University (France).
Nantes, June 23rd, 2008
(signature) Nguyen Hoai Tuong
2 Introduction

2.1 Motivation behind the thesis
This thesis offers two things to the reader. (1) We would like to provide a text for students and scientists who venture into the field of DNA array data analysis for the first time. Methods are introduced by simple examples and citations of relevant literature. The information in this report will prompt questions such as: When can I say that a certain gene is up-regulated? What do I do with the thousands of genes that show some regulation at the same time? How much information can I extract from this? (2) By answering these questions with meta-analysis and Bayesian network approaches, we hope to increase the reproducibility, reliability and efficiency of biological experiments based on gene expression data, and to offer a promising way to survey direct interactions in gene regulation.

In recent years, by simultaneously measuring the transcriptional activity of a large number of genes, microarray technology has created many new opportunities to study human disease. Since then, on the one hand, many biologists have started performing and analyzing their own microarray experiments; on the other hand, modern data mining techniques can help them do this more accurately and quickly. The term "microarray" is now ubiquitous in bioinformatics research. First, the number of experimental microarray studies carried out in biological laboratories is increasing exponentially. Second, the reproducibility of these studies is generally low, mainly due to study biases, small sample sizes and the highly multivariate nature of microarrays. Discovering knowledge from this kind of data has therefore become an interesting subject not only for the bioinformatician but also for the biologist.
In the literature, there are many ways of analyzing gene expression, of which three are the main ones: (1) discriminant analysis seeks to identify genes which sort the cellular snapshots into previously defined classes; (2) cluster analysis seeks to identify genes which vary together, thus identifying new classes; (3) network modeling, e.g. with Bayesian networks, seeks to identify the causal relationships among gene expression levels. Compared with discriminant and cluster analysis, causal network modeling has the advantage of uncovering conditional independence among genes, which provides a promising way to survey direct interactions in gene regulation. Moreover, by using statistical evaluation approaches, we can examine features of the induced high-scoring networks, e.g. the confidence in the existence of an edge. Highly confident features thus give us a potential way to mine (learn) significant sub-networks from the observed data.

On the other hand, to increase the reproducibility, reliability and efficiency of biological experiments, it is essential to combine data from multiple studies in a single common analysis. When doing so, however, the variation in the measured gene expression levels is caused not only by the biological differences of interest and natural biological differences, but also by the technological and laboratory-based differences between studies. To cope with this problem, meta-analysis applies to a collection of studies the same methodological rigor and statistical precision ordinarily found in primary research. In a meta-analysis, the collection of studies tests the same conceptual hypothesis, but may do so using a wide variety of methods, measures, samples, and settings.

Our approach addresses these problems by using meta-analysis and Bayesian networks for DNA microarray data. More precisely, we focus on data normalization and on Bayesian networks for constructing gene interaction networks. In particular, we treat continuous data with the discretization methods most used in the literature for gene expression data.
2.2 Contribution of the thesis
Our methodology may offer several advantages in dealing with the above problems of microarray technology: (1) it is based on a larger training set built from the results of diverse studies, and hence is expected to cope with the problem of having too few samples to completely determine a network; (2) it is based on robust methods of discretization for gene expression data, using the mean and standard deviation of the data, and on the commonly used techniques of structure learning with scoring functions, and hence can help find the most relevant possible gene interaction networks for the biologist; (3) our models can be useful for subsequent research because of their originality with respect to the literature in the field, helping later studies make efficient use of the results from different studies.
2.3 Organization of the thesis
This report is organized as follows. Section 3 briefly reviews DNA microarray technology. Section 4 presents meta-analysis for combining data from multiple microarray studies, and Section 5 discusses discretization. Section 6 presents Bayesian networks for inferring gene regulation, together with the current research directions and related approaches. Section 7 reports our experiments and results, before the concluding remarks in Section 8.
3 Review and Background of DNA Microarray Technology
This section provides the background information necessary for the subsequent discussion: a brief review of the important components, from basic biology to microarray technology. The formal definition of the meta-analysis technique, its advantages and disadvantages, and a concise overview of inference using graphical models are presented in later sections.
3.1 From the cell to the Gene expression
From the cell to DNA

To understand what a microarray is, we invite the reader on a tour from the cell to gene expression, a very important element of a microarray experiment. We start with a cell. The cell is the structural and functional unit of all known living organisms. It is the smallest unit of an organism that is classified as living (see Figure 1).

Figure 1: An organism under the microscope (cell in red and DNA in green)

Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms and some viruses. The main role of DNA molecules is the long-term storage of information. DNA is often compared to a set of blueprints, since it contains the instructions needed to construct other components of cells, such as proteins and RNA molecules. The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes or are involved in regulating the use of this genetic information. DNA is the most important component for research in biology.
Figure 2: From the cell to DNA (Image source from )
Normally, the genetic information is contained in the nucleus, where the DNA resides. DNA is a molecule that contains millions of nucleotides (adenine - A, cytosine - C, guanine - G, thymine - T). According to Watson and Crick, the nucleotides pair up, A with T and C with G, and organize DNA into a double helix, so that each strand has a sequence complementary to the other. This complementarity is the basis for the majority of analysis techniques in molecular biology.

From DNA to genome

In biology, the genome of an organism is its whole hereditary information, encoded in the DNA (or, for some viruses, RNA). The word is a portmanteau of gene and chromosome. The genome includes both the genes and the non-coding sequences of the DNA.
Figure 3: From DNA to genome. This stylized diagram shows a gene in relation to the double helix structure of DNA and to a chromosome (right). Introns are regions often found in eukaryote genes that are removed in the splicing process (after the DNA is transcribed into RNA): only the exons encode the protein. (Image source from Wikipedia.com)
From genome to gene expression

In all organisms, there are two (2) major steps separating a protein-coding gene from its protein: first, the DNA on which the gene resides must be transcribed from DNA to messenger RNA (mRNA), and second, it must be translated from mRNA to protein. RNA-coding genes must still go through the first step, but are not translated into protein. The process of producing a biologically functional molecule of either RNA or protein is called gene expression, and the resulting molecule itself is called a gene product.
Figure 4: From genome to gene expression. Diagram of the "typical" eukaryotic protein-coding gene. Promoters (some genes have "strong" promoters that bind the transcription machinery well, and others have "weak" promoters that bind poorly) and enhancers (which can compensate for a weak promoter) determine what portions of the DNA will be transcribed into the precursor mRNA (pre-mRNA). The pre-mRNA is then spliced into messenger RNA (mRNA), which is later translated into protein. (Image source from Wikipedia.com)
In cells, a gene consists of a long strand of DNA that contains a regulatory region called the promoter, which is found in almost all genes, together with coding and non-coding sequence. The promoter controls the activity of a gene and provides a position that is recognized by the transcription machinery when a gene is about to be transcribed and expressed. Coding sequence determines what the gene produces, while non-coding sequence can regulate the conditions of gene expression. When a gene is active, the coding and non-coding sequence is copied in a process called transcription, producing an RNA copy of the gene's information (RNA is a second type of nucleic acid that is very similar to DNA, but whose monomers contain the sugar ribose rather than deoxyribose). This RNA can then direct the synthesis of proteins via the genetic code.
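The two steps of gene expression described above, transcription and translation, can be illustrated with a toy sketch. This is only an illustration under simplifying assumptions: the function names, the example sequence and the deliberately partial codon table are ours, and splicing, promoters and regulation are ignored.

```python
# Toy illustration of transcription (DNA -> mRNA) and translation
# (mRNA -> protein). The codon table is deliberately partial: it holds a
# handful of standard codons, whereas a real translation needs all 64.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "UAA": None,   # stop codon
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read the mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in the partial table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # stop codon: the protein is complete
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("ATGTTTGGCTAA")   # hypothetical 12-base coding sequence
print(mrna)                         # AUGUUUGGCUAA
print(translate(mrna))              # ['Met', 'Phe', 'Gly']
```

The gene product here is the short amino-acid chain; for an RNA-coding gene, the process would stop after `transcribe`.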
3.2 What is a microarray?
Several different kinds of biological assays are called microarrays: DNA microarrays (such as cDNA microarrays and oligonucleotide microarrays), protein microarrays, transfection microarrays (also called cell microarrays), tissue microarrays, chemical compound microarrays, and antibody microarrays. The most successful among them are DNA microarrays, which have revolutionized biology. Instead of studying one gene or one protein at a time, scientists can now study many simultaneously and monitor the genome-wide expression levels of these genes in a given organism. This global approach has created many new opportunities to study human disease [25]. What, then, is a DNA microarray? A DNA microarray (hereafter simply "microarray") is typically a glass slide on which DNA molecules are fixed in an orderly manner at specific locations called spots (or features); see [4] and references therein for more information about microarrays. A microarray may contain thousands of spots, and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene (Figure 5A).
Figure 5: DNA microarrays Process. (A) A microarray may contain thousands of spots. Each spot contains many copies of the same DNA sequence that uniquely represents a gene from an organism. Spots are arranged in an orderly fashion into pen-groups. (B) Schematic of the experimental protocol to study differential expression of genes. The organism is grown in two (2) different conditions (a reference condition and a test condition). RNA is extracted from the cells grown in each condition and is labeled with different dyes (red and green) during the synthesis of cDNA by reverse transcriptase. Following this step, cDNA is hybridized onto the microarray slide, where each cDNA molecule representing a gene will bind to the spot containing its complementary DNA sequence. The microarray slide is then excited with a laser at suitable wavelengths to detect the red and green dyes. The final image is stored as a file for further analysis. Color figure at: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/
3.3 How does a microarray work?
Microarrays may be used to measure gene expression in many ways but, as noted in the Introduction, there are three main ways of analyzing gene expression data: (1) discriminant analysis seeks to identify genes which sort the cellular snapshots into previously defined classes; (2) cluster analysis seeks to identify genes which vary together, thus identifying new classes; (3) network modeling seeks to identify the causal relationships among gene expression levels.
One of the most common principles is to compare the expression of a set of genes from a cell maintained in a particular condition (condition A) to the same set of genes from a reference cell maintained under normal conditions (condition B). Figure 5B gives a general picture of the experimental steps involved. First, RNA is extracted from the cells. Next, the RNA molecules in the extract are reverse transcribed into cDNA using the enzyme reverse transcriptase and nucleotides labeled with different fluorescent dyes. For example, cDNA from cells grown in condition A may be labeled with a red dye and cDNA from cells grown in condition B with a green dye. Once the samples have been differentially labeled, they are allowed to hybridize onto the same glass slide. At this point, any cDNA sequence in the sample will hybridize to the specific spots on the glass slide that contain its complementary sequence. The amount of cDNA bound to a spot will be directly proportional to the initial number of RNA molecules present for that gene in both samples. Following the hybridization step, the spots in the hybridized microarray are excited by a laser and scanned at suitable wavelengths to detect the red and green dyes. The amount of fluorescence emitted upon excitation corresponds to the amount of bound nucleic acid. For instance, if cDNA from condition A for a particular gene was in greater abundance than that from condition B, one would find the spot to be red. If it was the other way around, the spot would be green. If the gene was expressed to the same extent in both conditions, one would find the spot to be yellow, and if the gene was expressed in neither condition, the spot would be black. Thus, what is seen at the end of the experimental step is an image of the microarray, in which each spot that corresponds to a gene has an associated fluorescence value representing the relative expression level of that gene.
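The colour reading described above can be sketched in code as a log ratio of the two channel intensities. This is a minimal sketch and not the procedure used in this thesis: the function name `spot_call`, the background level of 100 and the two-fold cut-off are illustrative assumptions.

```python
import math


def spot_call(red: float, green: float,
              background: float = 100.0, fold: float = 2.0) -> str:
    """Classify one spot from its red (condition A) and green (condition B)
    intensities. `background` and `fold` are illustrative thresholds."""
    if red <= background and green <= background:
        return "black"            # expressed in neither condition
    # Floor both channels at the background level so that the ratio stays
    # finite when only one channel carries signal.
    log_ratio = math.log2(max(red, background) / max(green, background))
    if log_ratio >= math.log2(fold):
        return "red"              # more abundant in condition A
    if log_ratio <= -math.log2(fold):
        return "green"            # more abundant in condition B
    return "yellow"               # similar expression in both conditions


# Hypothetical (red, green) intensities for four spots.
for red, green in [(5000, 600), (400, 4200), (2500, 2400), (40, 60)]:
    print(red, green, spot_call(red, green))
```

Working with the base-2 logarithm of the ratio makes over- and under-expression symmetric: a two-fold increase gives +1 and a two-fold decrease gives -1.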
4 Meta-analysis for microarray

4.1 What is Meta-analysis?
Data analysis is the process of examining and summarizing data in order to extract useful information and support decisions. There are two (2) main points of view in data analysis: statistics and data mining. Data mining is a particular kind of data analysis: it tends to focus on larger data sets, with less emphasis on making inference, and often uses data that was originally collected for a different purpose. Statistically, there are three kinds of analysis: (1) Primary analysis: the analysis of data from a single study to test the hypotheses originally formulated; (2) Secondary analysis: the re-analysis of data from a single study to test new hypotheses; (3) Meta-analysis ("analysis of analyses"): the application of statistical procedures to examine tests of a common hypothesis from more than one study. Meta-analysis attempts to apply to a collection of studies the same methodological rigor and statistical precision ordinarily found in primary research. In a meta-analysis, the collection of studies tests the same conceptual hypothesis, but may do so using a wide variety of methods, measures, samples, and settings. The challenge that meta-analysis must meet is to choose a way of combining these seemingly disparate studies so as to provide a convincing overall test of the hypothesis and to explore its moderators.
4.2 Why using meta-analysis for microarray?
Many bioinformaticians have developed methods and algorithms for combining diverse sources of data. However, most of these studies involved microarray data from different platforms, which is much more complicated than a single platform with one type of data. To make it easier to understand how a bioinformatics system works in the context of this internship, we concentrate on the analysis of a single platform and one type of cancer. Meta-analysis makes it possible to improve on the results obtained from different studies. Why meta-analysis is used for microarrays has a clear answer: the quality of microarray data has often been limited by cost, sample size, inadequate analysis methods, and heterogeneous nomenclatures. As the quality of array platforms improves, prices fall, and experimental design and analytical methods standardize, microarrays will have a transforming effect on the way biomedical research is conducted.
In agreement with our bioinformatician colleagues on this project, we concentrate on the analytical methods for normalization. We report herein the state-of-the-art meta-analysis methods that combine gene expression data analyses to infer transcriptional regulatory genes. There are two (2) general approaches to integrating microarray studies: comprehensive re-analysis of the primary data by merging data from multiple studies [32], or comparative analysis of the published results [31] (i.e. gene lists). An example data set is the Affymetrix GeneChip (U133A) profiling of 220 mature aggressive B-cell lymphomas; this is the type of our dataset. Our work focuses exclusively on comprehensive re-analysis on the GPL96 platform, an approach which may become increasingly useful as the sheer volume of microarray data expands.
4.3 Meta-analysis from Microarray Studies
Microarrays simultaneously measure the transcriptional activity of a large number of genes. This is an area of intense development, and there are many types of analysis for microarray technology (see Figure 6 and Table 1).
Figure 6: The DNA Array Analysis Pipeline

To increase the reliability and efficiency of biological experiments, it can be critical to combine data from multiple studies in a single analysis. Combining data from several studies increases statistical power and enables novel discoveries; however, the variation in the measured gene expression levels is then caused not only by the biological differences of interest and natural biological differences, but also by the technological and laboratory-based differences between studies. That means there is no direct correspondence between samples from different studies. The most important difficulties are, first, the presence of both absolute and relative expression measurements depending on the technology, and second, the challenges associated with cross-referencing measurements made by different technologies to the genome and to each other ([26]; [60]). Recent attempts are based on meta-analysis for inter-study variation ([46], [16]). An approach has also been proposed to translate results across microarray platforms (a.k.a. cross-platform) ([54]). However, these methods have two (2) major limitations. First, the information on correlation, and hence the ability to explore the differential expression of the same gene across studies, is ignored: if gene X is frequently over-expressed in study A, there is no correlation information that allows us to infer whether gene X is also frequently over-expressed in the other studies. Second, if we want to explore this, the conditions (samples) have to be the same across studies.

Author               Publication                        Key idea
Daniel R. Rhodes     PNAS (2004) [47]                   Meta-signature
Patrick Warnat       BMC Bioinformatics (2005) [52]     Improvement of classification
Xiang Jasmine Zhou   Nature Biotechnology (2005) [61]   Regulatory network
Homin K. Lee         Genome Research (2006) [33]        Coexpressed genes

Table 1: Some works focused on identifying genes interactions

To tackle these problems, the authors of [45] proposed a mixed model, namely multivariate mixed-model equations (MME), allowing for a general covariance structure. This model incorporates the simultaneous information from seemingly independent experiments by allowing a non-zero correlation among gene expressions across experiments, while imposing a null residual covariance. In particular, MME allows full use of the information available, with multiple factors and a hierarchy of sources of variation. These authors have illustrated the feasibility of MME as a natural and promising mechanism for the combination of multiple studies.

There are several simpler approaches for meta-analysis of microarrays: approaches originally devised to integrate published results in meta-analysis ([19]), approaches that combine p-values ([46]: a case study in prostate cancer), and approaches that focus on the more efficient strategy of combining effect sizes ([16]: a case study in prostate cancer; [51]). Another proposed method is normalization, which directly considers the sample-level measurements within each study and merges them into a single data set, to which standard single-study analysis can be applied ([55]: a case study in Saccharomyces cerevisiae; [49]: a case study in breast cancer). For our experimental application, we are interested in this last kind of approach.

In conclusion, most statistical work on combining microarray studies has focused on identifying genes that exhibit differential expression (activation or inhibition) across experimental conditions or phenotypes. Meta-analysis, commonly used in statistics, is now used effectively to combine microarray studies.
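As an illustration of the p-value combination strategy mentioned above, the sketch below uses Fisher's method. This is our choice of a classical combination rule for illustration; the text does not state which rule the cited studies used, and the function name and example values are hypothetical. The method assumes that the studies are independent.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values for one gene with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For an even
    number of degrees of freedom the upper tail has a closed form, so no
    statistics library is needed.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    # P(chi2 with 2k d.o.f. >= X) = exp(-X/2) * sum_{i<k} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical p-values for the same gene in three independent studies.
print(fisher_combine([0.04, 0.10, 0.03]))
```

A gene that is only moderately significant in each study can become clearly significant once the evidence is pooled, which is the gain in statistical power that motivates combining studies.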
This advantage comes with some challenges:
• Measurements are not directly comparable across different arrays: (1) lab protocols vary; (2) measurement technology varies; (3) probe sequences vary.
• Annotations of various studies are not easily comparable: (1) different terminologies have been used; (2) the type of sampling varies.
This is an important part of our work, and we return to this problem in the following sections.
4.4 Normalization
This section discusses what normalization means in the context of microarrays, why it is used, and how it can be carried out on microarray data. Microarrays are usually applied to the comparison of gene expression profiles under different conditions, so we have to make sure that what we compare is really comparable. Normalization makes it possible to recognize the biological information in the data, to compare data from one array to another, and to compare data from one microarray platform to another. It is an important step for obtaining data that are reliable and usable for subsequent analysis. Over the years, normalization methods for microarray data have steadily improved. The commonly used normalization procedures are total signal normalization [20], Lowess [39], normalization using background signals [58], spatial normalization [40], and mixed models of Lowess and spatial normalization [27]. To understand how normalization is done, we first identify the technical errors (systematic bias), because normalization is an attempt to correct for systematic bias in data: it attempts to remove the impact of non-biological influences on biological data. Sources of systematic bias include scanner "malfunction", printing problems, the labeling and detection efficiencies of the fluorescent labels, and the quantity of initial RNA.

The normalization procedure consists of two (2) steps:
1. Determine the likely conserved gene set that will be used to normalize the DNA microarray data set in step 2.
2. Perform the actual DNA microarray normalization using the likely conserved gene set created in step 1. This normalization is an adaptation of one of the methods presented above (footnote 1).

More technically, with Lowess (used in our work), we compute a regression function over all points of our data (the likely conserved gene set) by evaluating a local polynomial at the explanatory variable value of each data point. That is, at each point in the data set a low-degree polynomial is fitted to a subset of the data whose explanatory variable values lie near the point whose response is being estimated. The subsets of data used for each weighted least squares fit in Lowess are determined by a nearest neighbors algorithm. As input, we can choose the "bandwidth" or "smoothing parameter", which determines how much of the data is used to fit each local polynomial. The output is the list of fitted regression values for our data. This result is the normalized input for the next step of our analysis (see a simple introduction with a clear example, footnote 2).

In our work, Lowess was chosen because it is one of the most commonly used normalization techniques and is freely available in the statistical software package R [23] and in many commercial microarray analysis packages. Moreover, several other freely available microarray data handling packages have incorporated this normalization technique. This is very important for our meta-analysis application, which uses many different data sources.
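The local regression described above can be sketched in a few lines. This is a simplified illustration, not the R implementation used in this work: it fits a local straight line with tricube weights, omits the robustness iterations of the full Lowess procedure, and fits on all points rather than on a likely conserved gene set. Applying the fit to the log ratio M as a function of the mean log intensity A is a common convention that we assume here; the function names and the parameter `frac` are ours.

```python
import math


def lowess_fit(x, y, frac=0.5):
    """Local linear regression with tricube weights (no robustness
    iterations). Returns the fitted value at every x[i]."""
    n = len(x)
    k = max(2, int(round(frac * n)))        # points used in each local fit
    fitted = []
    for xi in x:
        # k nearest neighbours of xi along the explanatory variable
        dists = sorted((abs(xj - xi), j) for j, xj in enumerate(x))[:k]
        h = dists[-1][0] or 1.0             # local bandwidth
        w = [(1 - (d / h) ** 3) ** 3 for d, _ in dists]
        idx = [j for _, j in dists]
        # weighted least squares fit of y = a + b * x on the neighbours
        sw = sum(w)
        mx = sum(wi * x[j] for wi, j in zip(w, idx)) / sw
        my = sum(wi * y[j] for wi, j in zip(w, idx)) / sw
        sxx = sum(wi * (x[j] - mx) ** 2 for wi, j in zip(w, idx))
        sxy = sum(wi * (x[j] - mx) * (y[j] - my) for wi, j in zip(w, idx))
        b = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + b * (xi - mx))
    return fitted


def normalize_ma(red, green, frac=0.5):
    """Subtract the intensity-dependent trend from the log ratios."""
    m = [math.log2(r / g) for r, g in zip(red, green)]        # log ratio M
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]  # mean log intensity A
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Hypothetical intensities: the red channel is uniformly twice as bright,
# a purely technical bias that the normalization should remove.
green = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0]
red = [2.0 * g for g in green]
print(normalize_ma(red, green))
```

Here `frac` plays the role of the bandwidth (smoothing parameter): a larger value uses more of the data in each local fit and gives a smoother trend.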
5 Discretization
5.1 An introductory example
In mathematics, discretization is the process of transferring continuous models and equations into discrete counterparts [Wikipedia]. In microarray analysis, this process divides the range of values of a variable into a set of mutually exclusive and exhaustive intervals by means of a vector of thresholds, called the discretization sequence of the variable. An extremely simple example is a variable "age" that takes continuous values from 0 to 200 (perhaps less!). In data analysis, in order to facilitate decision making, one often discretizes this variable into three intervals that form a new categorical variable: "children" represents ages from 0 to 11, "youth" represents ages from 12 to 25, and "adult" represents ages from 26 to 200.
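The age example can be written as a short sketch. The thresholds 12 and 26 form the discretization sequence implied by the three intervals above; the Python names are chosen for this illustration only.

```python
from bisect import bisect_right

THRESHOLDS = [12, 26]                       # discretization sequence
LABELS = ["children", "youth", "adult"]     # one label per interval


def discretize_age(age):
    """Map a continuous age to the label of the interval it falls in."""
    return LABELS[bisect_right(THRESHOLDS, age)]


ages = [3, 11.5, 12, 25, 26, 80]
categories = [discretize_age(age) for age in ages]
```

The intervals are mutually exclusive and exhaustive because every value falls on exactly one side of each threshold.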
5.2 To discretize or not to discretize for gene expression data: an important decision
Discretization is an important step in the analysis of gene expression data because, to determine the relationship between two genes in the interaction network, the expression level of each gene should be represented as a variable with two or three values, such as {"over-expressed", "under-expressed"} or {"over-expressed", "normal", "under-expressed"}. However, "we cannot have it both ways": we can have a "mooncake" but cannot have the "moon and cake" (a mooncake is a special cake for Vietnamese children at the Mid-Autumn Festival). Discretization helps us to determine the relationships between two genes easily, but every discretization operation also loses information. There are three general approaches for learning network structure with continuous data ([14]: a very useful document on the discretization problem for Bayesian network structure learning):
1. Pre-discretization methods: the data are discretized prior to application of the learning algorithm. Because these methods are relevant to most machine learning algorithms, a great deal of research has focused on this area. Unfortunately, many of these methods focus on supervised classification and are not applicable to Bayesian network structure learning, but we can still use those that are not tailored to classification. We chose this approach for the discretization of our test data. It is presented in more detail in the Experiments and Results section.
2. Integrated methods: the learning of the variable discretization and of the structure can be integrated. These methods follow greedy, iterative procedures: they start with an initial discretization, learn a model based on the discretized data, and re-discretize given the learned model. These steps repeat until a termination condition is met. The approaches output a discretization of the input variables.
3. Direct methods: learning can be done directly with continuous data without committing to a specific discretization of the variables.
Footnote 1: total intensity normalization, Lowess normalization, mean centering, ratio statistics, standard deviation regularization
Footnote 2: An example of Lowess: http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd144.htm
6 Bayesian networks for microarray
6.1 What is a Bayesian network (BN)?
A complete understanding of Bayesian networks is the subject of this next important section of the report.
6.1.1 Graphical model (GM)
We want to construct a gene expression network using the BN approach. BNs are based on the idea of the graphical model (GM). What is a GM? A GM is a marriage between probability theory and graph theory: it is a graph in which nodes represent random variables and the absence of edges represents conditional independence assumptions. There are two types of GM: undirected and directed. An undirected GM (a.k.a. Markov random field) has a simple definition of independence: two (sets of) nodes A and B are conditionally independent given a third set C if all paths between the nodes in A and B are separated by a node in C. In contrast, a directed GM (which cannot have directed cycles), also called a directed acyclic graph (DAG) model, has a more complicated notion of independence that takes the direction of the edges into account. This is the model used by Bayesian networks. A small note: despite the name, BNs do not necessarily imply a commitment to Bayesian methods; rather, they are so called because they use Bayes' rule (also called Bayes' theorem or Bayes' law) for inference (see the example in the next section).
6.1.2 Definition of Bayesian networks
A Bayesian network is a directed acyclic graph (a directed model) in which each node represents a real-life feature such as "human" or "food". Each discrete node has mutually exclusive and exhaustive values; for example, the values of "human" could be man and woman, and the values of "food" could be chocolate and tartare. Arcs between the nodes represent the direction of dependence. Together, the nodes and the arcs define the structure of the BN. A Bayesian network can be thought of as a family tree, in which there are also "children" and "parents" ("father", "spouse"). A node is called a child if it is directly dependent on another node in the network. A node that has children is called a "parent". The idea behind BNs is to decompose the joint probability distribution into a product of conditional distributions and then to take advantage of conditional independences. For example, node D is not directly dependent on A or C, so P(D|A, B, C) reduces to P(D|B) (see Figure 7). Accordingly, the arcs coming into a node can be thought of as representing the conditional probability distribution of the node given its parents. More technically, a static BN (resp. a dynamic BN in the context of time-series data) consists of two parts: the structure G of the BN, which is a directed acyclic graph, and a set of conditional probability distributions Θ. Almost all BN studies discuss these two parts.
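The decomposition described above can be checked numerically on a toy network. The sketch below is an assumption-laden illustration: it uses a hypothetical four-node structure A -> B, A -> C, B -> D in which B is the only parent of D, and all probability values are invented for the example; neither the arcs other than B -> D nor the numbers are read off Figure 7.

```python
from itertools import product

# Hypothetical structure: A -> B, A -> C, B -> D (D's only parent is B).
p_a = {True: 0.3, False: 0.7}             # P(A)
p_b_given_a = {True: 0.8, False: 0.1}     # P(B=True | A)
p_c_given_a = {True: 0.6, False: 0.2}     # P(C=True | A)
p_d_given_b = {True: 0.9, False: 0.05}    # P(D=True | B)


def bern(p_true, value):
    """Probability of a binary value given P(value=True)."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d):
    """P(A,B,C,D) as the product of each node's conditional distribution."""
    return (p_a[a]
            * bern(p_b_given_a[a], b)
            * bern(p_c_given_a[a], c)
            * bern(p_d_given_b[b], d))


def p_d_given_abc(a, b, c):
    """P(D=True | A, B, C) computed from the joint distribution."""
    return joint(a, b, c, True) / (joint(a, b, c, True) + joint(a, b, c, False))


# The factorized joint sums to 1 over all 16 configurations.
total = sum(joint(*values) for values in product([True, False], repeat=4))

# Conditioning on A and C in addition to B does not change P(D | B).
reductions_hold = all(
    abs(p_d_given_abc(a, b, c) - p_d_given_b[b]) < 1e-12
    for a, b, c in product([True, False], repeat=3)
)
```

Only 1 + 2 + 2 + 2 = 7 numbers are needed here, instead of the 15 required for an unrestricted joint distribution over four binary variables, which is the practical benefit of exploiting conditional independence.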
Figure 7: A simple example of Bayesian networks
The first part, the structure G of the BN, is a directed acyclic graph (DAG), as presented above, whose vertices correspond to a set of random variables $\chi = \{X_1, \dots, X_n\}$. Why a DAG? Because in a BN the edges are directed and there are no cycles: starting from any given node and following the direction of the edges, there is no way to return to the original node. In this graphical representation, each variable $X_i$ is independent of its non-descendants given its parents $Pa(X_i)$. Once $Pa(X_i)$ has been determined, we can move on to the second part. The second part is a set of conditional probability distributions $\Theta$ (the parameters of the BN), which describe the conditional probability $P(X_i \mid Pa(X_i))$ of each variable $X_i$ given its parents $Pa(X_i)$ in the graph G. Thus, a BN specifies a unique joint probability distribution over $\chi$, given by:
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$$
To make this more concrete, we present a simple example of decision making in medical diagnosis using Bayes' rule, the foundation of a BN. Bayes' rule expresses the relation between the conditional and marginal probabilities of two random events. It is often used to compute posterior probabilities from observations and prior probabilities:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
where:
1. P(A) is the prior probability or marginal probability of A.
2. P(A|B) is the posterior probability (a.k.a. conditional probability) of A given B.
3. P(B|A) is the conditional probability of B given A.
4. P(B) is the prior or marginal probability of B.
For example, suppose that a gene X is active in (causes) a particular disease 50% of the time. Suppose that the prevalence of gene X is 1/1000 and the prevalence of the disease is 1%. From this, we want to compute the probability of having gene X given that you have the disease. We could use this information to try to produce a joint probability for the two events, having the gene and having the disease. Alternatively, now that we know Bayes' rule, we can use it to solve the problem with P(Gene) = 1/1000, P(Disease) = 1/100 and P(Disease|Gene) = 0.5.
Note that what we are doing here is inverting the probability P(Disease|Gene) to calculate P(Gene|Disease) in light of the prior knowledge that P(Gene) = 1/1000 and P(Disease) = 1/100. P(Disease) can be taken to be the marginal probability of having the disease in the population, across those with and those without gene X.
$$P(\text{Gene} \mid \text{Disease}) = \frac{P(\text{Disease} \mid \text{Gene})\, P(\text{Gene})}{P(\text{Disease})}$$
Solving this is as simple as plugging in the numbers:
$$P(\text{Gene} \mid \text{Disease}) = \frac{0.5 \times (1/1000)}{1/100} = 0.05$$
This result is interpreted as follows: if you have the disease (the condition), the probability that you have the gene is 5%. Note that this is very different from the statement we started with, namely that if you have the gene, there is a 50% probability that you will have the disease.
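The same calculation can be reproduced with exact fractions. The function name `bayes` is introduced for this illustration only; the three input probabilities are those of the example above.

```python
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


p_disease_given_gene = Fraction(1, 2)   # P(Disease | Gene) = 0.5
p_gene = Fraction(1, 1000)              # prevalence of gene X
p_disease = Fraction(1, 100)            # prevalence of the disease

p_gene_given_disease = bayes(p_disease_given_gene, p_gene, p_disease)
```

The result is the fraction 1/20, that is, the 5% obtained above.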
6.2 Why use BN for microarray?
The most commonly used computational method for analyzing gene expression data is clustering. For example, consider a microarray data matrix A whose rows $A_{iJ}$ correspond to genes and whose columns $A_{Ij}$ correspond to samples (tissue samples, experiments). Based on a correlation measure between the row vectors $A_{iJ}$, genes are partitioned into clusters. In the example below, the dendrogram on the left side of Figure 8 shows the structure of the clusters.
Figure 8: Gene expression during the yeast cell cycle. The genes correspond to the rows, and the experiments to the columns. Figure copyrighted by Stanford University (http://genome-www.stanford.edu/cellcycle/figures/figure1B.html)
However, clustering also suffers from the following shortcomings:
• Clustering is based on a global correlation measure. This obscures relationships that exist over only a (local) subset of the data.
• Clustering fails to detect interactions between genes other than linear correlation (which are very common in the context of regulation between genes).
• It is impossible to incorporate additional types of information, such as clinical data or experimental details (e.g. does gene A inhibit gene B?).
Therefore, we use BNs for microarray data because they can address these shortcomings:
• Genes mediate the interactions within a cluster of genes or between clusters.
• These interactions can be modeled by probabilistic functions.
• The nature of the interaction between genes (activation or inhibition) can be represented in the network.
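The second shortcoming can be illustrated with a minimal sketch: a gene whose expression is a deterministic but non-linear function of another gene can have a linear correlation of exactly zero. The expression values below are hypothetical, and the quadratic relation is chosen only because it makes the effect easy to see.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression profiles over seven experiments.
gene_a = [-3, -2, -1, 0, 1, 2, 3]
gene_b = [v ** 2 for v in gene_a]    # fully determined by gene_a
gene_c = [2 * v + 1 for v in gene_a]  # linear in gene_a

r_nonlinear = pearson(gene_a, gene_b)
r_linear = pearson(gene_a, gene_c)
```

A correlation-based clustering would place `gene_a` and `gene_c` together but would see no relation between `gene_a` and `gene_b`, although the latter is completely determined by the former.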
6.3 Learning and Validating Bayesian Networks for Inferring Gene Regulation
6.3.1 Bayesian Networks in Microarray Technology
Recently developed technologies allow parallel measurement of the expression level of thousands of genes/proteins. This allows biologists to view the cell as a complete system. To do this, we must confront a big challenge: extracting meaningful information from the expression data and planning suitable experiments. As argued in the Introduction, Bayesian networks have many advantages over other approaches for representing gene expression. Most of the computational approaches based on them aim to model the gene regulatory network, a very useful network for understanding the regulation of gene expression. In order to understand the underlying function of organisms, it is necessary to study the behavior of genes in this network. By modeling the gene regulatory network, in which collections of genes interact with one another and with other substances in a cell, we can find out which genes activate others. The literature now contains some remarkable BN approaches for assessing differential expression across two or more experimental conditions within a single study, including parametric [17], semiparametric [41] and non-parametric [12] models.
6.3.2 Learning Bayesian Networks from Microarray Data
In the introduction to BNs, we presented the two important parts of a BN. Here, their importance appears once more in the form of the two main learning tasks for BNs. The problem of learning a BN from a collection of observed data can be divided into the following two tasks.
Parameter learning: given the graph G (proposed by experts), the parameter learning task infers the parameters from the data. These are the probability distributions for each node. This is the case of our example of Bayes' rule above.
Structure learning: given a data set D = {Y1, Y2, Y3, ..., Yn} over a set of random variables, the structure learning task finds the most probable graph G for explaining the data contained in D. There are three approaches to this problem:
1. Constraint-based learning (also called conditional independence tests) [9], [11], which tests independencies and adds edges according to the tests;
2. Searching and scoring, which defines a selection criterion that measures the goodness of a graph;
3. Mixed (hybrid) methods (recent) [3], which test almost all independencies and then search and score according to the possible tests.
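Parameter learning for discrete data can be sketched as simple counting. The example below is a minimal illustration, assuming that the graph is already given, that all variables are discretized, and that the parameters are estimated by relative frequencies (maximum likelihood) without any prior; the gene names `G1` and `G2`, the structure G1 -> G2 and the records are hypothetical.

```python
from collections import Counter


def learn_cpt(records, child, parents):
    """Estimate P(child | parents) by relative frequencies.

    records is a list of dicts mapping variable names to discrete states.
    Returns {(parent_states, child_state): probability}.
    """
    joint_counts = Counter()
    parent_counts = Counter()
    for record in records:
        parent_states = tuple(record[p] for p in parents)
        joint_counts[(parent_states, record[child])] += 1
        parent_counts[parent_states] += 1
    return {
        key: count / parent_counts[key[0]]
        for key, count in joint_counts.items()
    }


# Hypothetical discretized expression data (1 = over-expressed, 0 = normal).
records = (
    [{"G1": 1, "G2": 1}] * 3
    + [{"G1": 1, "G2": 0}] * 1
    + [{"G1": 0, "G2": 0}] * 4
)

cpt_g2 = learn_cpt(records, child="G2", parents=["G1"])  # P(G2 | G1)
cpt_g1 = learn_cpt(records, child="G1", parents=[])      # P(G1)
```

Here `cpt_g2[((1,), 1)]` is the estimated probability that G2 is over-expressed when G1 is over-expressed. Configurations that never occur in the data simply have no entry, which is one reason why priors are used in practice.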
Searching and scoring is one of the most commonly used approaches to this problem. It introduces a score that evaluates how well each graph G explains the data in D; structure learning based on searching and scoring then looks for a graph with a high score given the observed data. This score is calculated from an estimate of the probabilities between the nodes in the network, and almost all proposed scoring approaches are based on this principle. To find the highest-scoring graph, a particular search method is needed; most works on BNs prefer greedy search for this task. For a more formal bibliographic review of BN structure learning, we refer the reader to our technical report, Nguyen et al. [50]. Efficient algorithms for learning BNs have to reveal the underlying structure of the domain and the direct relations between variables, find causal influences, and discover hidden variables [34]. For example, with microarray data, we can apply methods for learning BNs to analyze gene expression data ([38]: a very useful bibliography on learning causal networks of gene interactions, last updated March 12, 2007). More formally, to measure the expression level of each gene, we must observe all possible data sets (with the random variables) that can be used to measure the external stimuli, the environmental parameters (such as temperature, nutrients, pH) and the biological factors. We present here some of the remarkable works in the literature on BN structure learning for microarray data. They are presented in a table, together with our evaluation of how they address the challenges usually encountered with microarrays: (1) discretization of the continuous variables; (2) dimensionality problems, namely a massive number of variables (thousands) and a small number of samples (dozens); (3) BN structure learning for microarray data. These are crucial aspects of our work.
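The search-and-score idea described at the start of this passage can be sketched as follows. This is a simplified illustration and not one of the methods of the cited works: the score is a BIC-style penalized log-likelihood (an assumption made here for brevity), the "search" is an exhaustive comparison of two candidate graphs instead of a greedy search, and the gene names and records are hypothetical.

```python
import math
from collections import Counter


def bic_score(records, structure):
    """BIC-style score of a structure given as {child: tuple_of_parents}.

    For every node: maximized log-likelihood of the node given its parents,
    minus a penalty of 0.5 * log(n) per free parameter (counted over the
    parent configurations observed in the data).
    """
    n = len(records)
    score = 0.0
    for child, parents in structure.items():
        joint = Counter(
            (tuple(r[p] for p in parents), r[child]) for r in records
        )
        parent = Counter(tuple(r[p] for p in parents) for r in records)
        loglik = sum(
            count * math.log(count / parent[pa])
            for (pa, _), count in joint.items()
        )
        n_states = len({r[child] for r in records})
        n_params = (n_states - 1) * len(parent)
        score += loglik - 0.5 * n_params * math.log(n)
    return score


# Hypothetical discretized data in which G2 mostly follows G1.
records = (
    [{"G1": 1, "G2": 1}] * 10
    + [{"G1": 0, "G2": 0}] * 10
    + [{"G1": 1, "G2": 0}]
    + [{"G1": 0, "G2": 1}]
)

candidates = {
    "no edge": {"G1": (), "G2": ()},
    "G1 -> G2": {"G1": (), "G2": ("G1",)},
}
scores = {name: bic_score(records, s) for name, s in candidates.items()}
best = max(scores, key=scores.get)
```

The score decomposes into one term per node, so a greedy search that adds, removes or reverses a single edge only needs to recompute the terms of the nodes whose parent sets change.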
We list in Table 2 the current research on this problem.
Star ∗∗∗ ∗∗ ∗∗∗ ∗∗ ∗∗∗ ∗∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗ ∗∗ ∗∗ ∗∗∗ ∗∗ ∗∗ ∗ ∗ ∗∗ ∗∗ ∗ ∗ ∗∗ ∗ ∗ ∗∗ ∗∗∗ ∗∗∗
Author                 Ref.
Li et al. 2007         [35]
Schlitt et al. 2007    [48]
Huang et al. 2007      [57]
Zan et al. 2007        [22]
Isabel et al. 2007     [24]
Kim et al. 2007        [28]
Geier et al. 2007      [15]
Chen et al. 2006       [8]
Adriano et al. 2006    [53]
Peña et al. 2006       [42]
Harri et al. 2006      [30]
Li et al. 2006         [36]
Matthew et al. 2005    [6]
Fu et al. 2005         [14]
Min et al. 2005        [63]
Tiefei et al. 2005     [37]
Reza et al. 2004       [44]
Xiaobo et al. 2004     [62]
Yu et al. 2002         [59]
Discret. + + + + + + + + + + + + + + + +
Scoring +
MI + +
+ + +
+ + +
Cancer + + +
+ + +
+ +
+ +
+ +
+ + + + + + +
T ime + + + + + + +
+ +
+ + + + +
+ +
+
Table 2: Comparison of the proposed methods for gene regulation using Bayesian networks. Star: Evaluation level; Author: Author; Ref.: References; Discret.: Discretization; Scoring: Structure learning using a scoring function; MI: Structure learning using mutual information; Time: Dealing with the time-series problem; Cancer: Cancer database
From this table, we can see that discretization and structure learning from data are important steps in the construction of BNs.
7 Experiments and Results
7.1 GEO and meta-analysis
We selected the platform GPL96 and a combination of the data sets GDS266, GDS946, GDS1063, and GDS2643 as the input data for our application. What are these data, and why do we use them? The answers can be found in the following paragraphs. GEO (Gene Expression Omnibus; see footnote 3) is a standard repository at the National Center for Biotechnology Information (NCBI). We use this standard repository in order to build a meta-analysis application for microarray data. NCBI is well known to biologists and bioinformaticians. It is considered a huge, reliable public repository, alongside others such as the DNA Data Bank of Japan (DDBJ), the EMBL-European Bioinformatics Institute (EBI), and the Microarray Gene Expression Data (MGED) society. GEO archives and freely disseminates microarray and other forms of high-throughput data generated by the scientific community. The GEO database complies with the Minimum Information About a Microarray Experiment (MIAME). MIAME is a data content standard developed by the MGED society to outline what information should be provided when describing a microarray experiment. In fact, according to the recommendation of MGED [5], in order to increase the reliability, the reproducibility, and the accuracy of microarray analyses, scientific journals should require that all primary microarray data be submitted to one of the public repositories, such as GEO, ArrayExpress (footnote 4), and CIBEX (footnote 5), in a format that complies with the MIAME guidelines. Over 35 other repositories for microarray data are listed here: http://mybio.wikia.com/wiki/Microarray_databases. The choice of microarray database depends on the purpose (type of data, human, mouse, rice, cancer, tumor...) and the budget of each project. For a large project, the free databases in the public repositories may not be sufficient for a highly accurate analysis.
For the test data in our work, GEO was chosen because it is one of the most commonly used repositories and because it was recommended by our colleagues, who have worked as bioinformaticians in this field for many years. GEO currently stores approximately half a billion individual gene expression measurements, derived from over 100 organisms and addressing a wide range of biological issues. These huge volumes of data can be effectively explored, queried, and visualized using user-friendly Web-based tools. Submitters can supply, and users can download, gene expression data in five sections:
1. Platform - GEO Platform (GPL): describes the list of elements on the array (cDNAs, oligonucleotides) or the list of elements that may be detected and quantified in that experiment (SAGE tags). We used GPL96 for our test application.
2. Sample - GEO Sample (GSM): describes the conditions under which an mRNA source was handled, and the abundance measurement of each element derived from it.
3. Series - GEO Series (GSE): defines a set of related Samples considered to be part of an experiment, and how the Samples are related.
4. Datasets - GEO DataSets (GDS): Sample data are assembled into biologically meaningful and comparable sets. Several data deposit options and formats are supported, including web forms, spreadsheets, XML and Simple Omnibus Format in Text (SOFT). GDS records provide a coherent synopsis of an experiment and form the basis of GEO's data display and analysis tools. This is the section we are interested in. Our database consists of the combination of GDS266, GDS946, GDS1063, and GDS2643, which are B-cell lymphoma data sets.
5. Raw data: original microarray scan images or raw quantification data.
Besides these principal sections, the GEO website also offers two useful tools for querying data, which use data mining and visualization methods to supply a first global view and estimate of the data:
• Entrez GEO-DataSets provides an experiment-centric view of the data in GEO.
• Entrez GEO-Profiles provides a gene-centric view of the data in GEO.
Footnote 3: GEO: http://www.ncbi.nlm.nih.gov/geo
Footnote 4: ArrayExpress: http://www.ebi.ac.uk/arrayexpress
Footnote 5: CIBEX: http://cibex.nig.ac.jp
7.2 Identification of a robust gene
The application of expression microarray profiling technology promises not only to change our understanding of biology and clinical practice, but also to lead to better diagnosis, prognosis, and ultimately treatment of cancer. Recently, Korkola et al. [29] recognized that the most important contribution of microarrays to biology research has been the identification of gene sets that are the best predictors of disease. The goal of their project was to identify a robust, extensively validated gene set that may have clinical utility for outcome prediction in breast cancer patients. Using clustering as the classification method and leave-one-out as the validation stage, they detected their robust gene sets and tested their performance on a diverse set of cancers, including lymphoma (results of Alizadeh et al. [1], [2]), prostate cancer, and breast cancer. Their results were compared with many other published predictive gene sets in order to find a subset of genes common to these studies. In [2], a smaller set of 21 genes was developed, based on a combination of genes selected from published microarray work and traditional clinical markers. Of these 21 genes, 16 have predictive utility, while the remaining 5 are reference genes. The robust gene sets of Korkola et al. had 7 genes in common with the 16 predictive genes, namely STK15, BIRC5, MYBL2, MMP11, ERBB2, GSTM1 and ESR1.
Gene     Description
BIRC5    baculoviral IAP repeat-containing 5 (survivin)
MYBL2    v-myb myeloblastosis viral oncogene homolog (avian)-like 2
MMP11    matrix metallopeptidase 11 (stromelysin 3)
ERBB2    v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (avian)
GSTM1    glutathione S-transferase M1
ESR1     estrogen receptor 1
Table 3: Variables in our test data set
7.3 Missing values imputation
For simplicity, the data set D has been assumed to be complete up to now, i.e., a value is assigned to every random variable in every case contained in the given data set. In many situations, however, and especially with biological data, the available data are incomplete. This means that in a data set D consisting of a set of data vectors D = {d1; d2; ...; dn}, where each data vector consists of i components according to the number of variables, some individual components of these data vectors may be undefined: "NA" (a.k.a. N/A: not available) or "Inf" (infinite). If the proportion of cases with missing data is low, the simplest solution is to discard those cases from the data set, provided that the missing values are independent of the data. In this case the values are missing at random (MAR). When the proportion of missing cases is too high, it becomes important to make full use of the information that is potentially available from the incomplete patterns. A popular method for handling missing gene expression data is K-NN (k-nearest neighbors) [56]. The details of the missing value imputation procedure by K-NN, as implemented in the impute R package, are presented in the Appendix.
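The principle of K-NN imputation can be sketched as follows. This is a simplified illustration and not the algorithm of the impute R package: it assumes that rows are genes and columns are experiments, that missing entries are marked `None`, that similarity is the root mean squared difference over the columns observed in both genes, and that the imputed value is the plain average of the k nearest genes. The function name and the data are hypothetical.

```python
import math


def knn_impute(matrix, k=2):
    """Return a copy of matrix in which each None is replaced by the mean
    of the same column in the k most similar rows that observe it."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(matrix):
                if other_i == i or other[j] is None:
                    continue
                # Compare the two genes on the columns both have observed.
                shared = [
                    (a, b) for a, b in zip(row, other)
                    if a is not None and b is not None
                ]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical expression matrix: gene 0 lacks its value in experiment 3.
expression = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [5.0, 6.0, 9.0],
]
imputed = knn_impute(expression, k=2)
```

The missing value is filled from the two genes whose observed profiles are closest to gene 0, while the distant fourth gene is ignored.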
7.4 Normalization with R
After the data have been selected, a normalization phase can be applied to them. The reason for this phase is given in the normalization section above. A powerful tool for this purpose is presented here.
R implements a dialect of the S language; the citation for John Chambers' 1998 Association for Computing Machinery Software Award stated that "S has forever altered how people analyze, visualize and manipulate data". The initial version of R was developed in 1996 by Ross Ihaka and Robert Gentleman, both from the University of Auckland, New Zealand. The goal of the project is to replace commercial statistical software such as SAS, SPSS, Stata, Statistica, and S-Plus. Like Linux, R is an "open source" system. Versions of R and a variety of official and contributed documentation are freely available for any operating system (Windows, Unix, Linux, Macintosh) through the Comprehensive R Archive Network (CRAN; see footnote 6). R is package-based, which means that each computational module of R is developed as a package. Packages that do not come with the base distribution must be downloaded and installed separately. For example, if we want to carry out a normalization with R, we have either to develop a normalization package ourselves or to find an available normalization package. In practice, one often prefers a flexible approach: taking some functions of an existing package as the basic functions of one's own package. For our experimental application, we took the function of the available "stats" package that performs the computations for the Lowess smoother, which uses locally weighted polynomial regression, and developed a new package that answers almost all our questions. For a more technical description, we refer the reader to our implementation instructions in the Appendix.
7.5 Threshold choice: the art of discretization for gene expression data
We now have normalized data in hand. In order to view the interactions between genes in the future network, we have to know which genes are over-expressed (normal or under-expressed) in our data before the data are used to learn the BNs, because many data mining models work with discretized versions of the expression data. However, such studies may discretize the data "arbitrarily" and, consequently, may lose useful information. Therefore, we discuss here how to choose the best threshold for discretization. In fact, the effectiveness of a discretization method is assessed by evaluating the biological relevance of the results. With this in mind, we discretized our data using the average and the standard deviation. Put simply, this method discretizes the gene expression matrix using the average expression value, or the average combined with the standard deviation of the expression values. The limit values between bins can be computed using all the values in the expression matrix, that is, the overall average expression level and its standard deviation. Alternatively, the average and the standard deviation can be computed for each row or column of the matrix. The goal is to discretize the matrix into a matrix using three symbols (for instance, -1, 0 and 1, meaning under-expressed, normal and over-expressed, respectively). Let A be a gene expression matrix with n rows and m columns, where $A_{ij}$ represents the expression level of gene i under condition (experiment) j. The matrix A is defined by its set of rows, I, and its set of columns, J. Moreover, let $\mu_{IJ}$ denote the average value in the expression matrix A, and let $\mu_{iJ}$ and $\mu_{Ij}$ denote the mean of row i and of column (condition) j, respectively. Finally, let p be a parameter used to tune the desired deviation from the average, and let $\sigma_{IJ}$, $\sigma_{iJ}$ and $\sigma_{Ij}$ be the standard deviations of all the values in the matrix, of row i, and of column j, respectively.
The discretization is then performed using one of the following equations (the results are assigned to the matrix $A'$):
$$A'_{ij} = \begin{cases} -1, & \text{if } A_{ij} < \mu_{IJ} - p\,\sigma_{IJ} \\ 1, & \text{if } A_{ij} > \mu_{IJ} + p\,\sigma_{IJ} \\ 0, & \text{otherwise} \end{cases}$$
$$A'_{ij} = \begin{cases} -1, & \text{if } A_{ij} < \mu_{iJ} - p\,\sigma_{iJ} \\ 1, & \text{if } A_{ij} > \mu_{iJ} + p\,\sigma_{iJ} \\ 0, & \text{otherwise} \end{cases}$$
$$A'_{ij} = \begin{cases} -1, & \text{if } A_{ij} < \mu_{Ij} - p\,\sigma_{Ij} \\ 1, & \text{if } A_{ij} > \mu_{Ij} + p\,\sigma_{Ij} \\ 0, & \text{otherwise} \end{cases}$$
This discretization technique was used by Becquet et al. [7]; Pensa et al. [43] assessed such discretization techniques for relevant pattern discovery from gene expression data. An example of part of the discretization of our data set is presented below. It begins with the original matrix A; after the discretization, a matrix $A'$ is produced. To obtain the new value of each cell ($A'_{ij} \in \{-1, 0, 1\}$), the values of $\mu$ and $\sigma$ must be calculated. A default value was chosen for the parameter p; the appropriate value depends on the nature of the input data. Many works have used classification and validation methods to estimate the performance of different values of p on different data sets. The best p may be chosen according to the kind of input data (see [21] for a detailed discussion). This is also an interesting topic in the discretization literature. In our case, for gene expression data, 2 is usually preferred as the default discretization parameter. For instance, we choose the first equation and consider the case of $A'_{11}$: $A_{11} = 2.56$, $\mu = 0.385$, $\sigma = 1.491256$, $p = 2$. Since $\mu - p\,\sigma = -2.597512 \le 2.56 \le \mu + p\,\sigma = 3.367512$, we obtain $A'_{11} = 0$. We continue in this way until the end.
A   Gene    Expt.1   Expt.2   Expt.3   Expt.4   Expt.5   Expt.6
1   1D12A   2.56     1.44     1.99     −0.17    −0.31    −1.04
2   6H9A    −1.96    −0.69    2.41     −0.46    1.36     −0.51
Table 4: Gene expression matrix before discretization. A: Original matrix; Gene: Name of gene; Expt.: Experiment (sample, patient, etc.)
A'  Gene    Expt.1   Expt.2   Expt.3   Expt.4   Expt.5   Expt.6
1   1D12A   0        0        0        −1       −1       −1
2   6H9A    −1       0        1        1        1        1
Table 5: Gene expression matrix after discretization. A': Discretized matrix; Gene: Name of gene; Expt.: Experiment (sample, patient, etc.)
Footnote 6: http://cran.r-project.org
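The first discretization rule (overall mean and standard deviation) can be sketched as follows, using the twelve values of Table 4 as input. The sketch is in Python rather than R and makes two assumptions: that the standard deviation is the sample standard deviation (divisor N - 1), which is what reproduces the value 1.491256 quoted above, and that values inside the band around the mean are coded 0. The function name is introduced for this illustration only.

```python
from statistics import mean, stdev


def discretize_overall(matrix, p=2.0):
    """Discretize a matrix with the overall mean and standard deviation.

    Returns (discretized_matrix, mu, sigma), where each cell is
    -1 below mu - p*sigma, 1 above mu + p*sigma, and 0 otherwise.
    """
    values = [v for row in matrix for v in row]
    mu, sigma = mean(values), stdev(values)   # sample standard deviation
    low, high = mu - p * sigma, mu + p * sigma

    def code(v):
        if v < low:
            return -1
        if v > high:
            return 1
        return 0

    return [[code(v) for v in row] for row in matrix], mu, sigma


# Values of Table 4 (genes 1D12A and 6H9A, experiments 1 to 6).
A = [
    [2.56, 1.44, 1.99, -0.17, -0.31, -1.04],
    [-1.96, -0.69, 2.41, -0.46, 1.36, -0.51],
]
A_prime, mu, sigma = discretize_overall(A, p=2.0)
```

The row-wise and column-wise variants apply the same rule with the mean and standard deviation of row i or column j instead of the overall values.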
7.6 The first "brick" of the future "house"
Our "house" is built from graphs, whose "bricks" are the nodes and edges. The nodes represent objects of interest. The edges represent relationships between the nodes. The flexibility of node-and-edge graphs makes them an ideal representation for the high-throughput data common in systems biology research. Nodes are often used to represent genes/proteins, and edges are used to represent many different pairwise relationships, for example coexpression, interactions, etc. In our project, the essential tool for this "construction" is the Rgraphviz R package. Rgraphviz uses the most general and flexible graph class, graphNEL (graph with Node and Edge Lists). This class extends graph, a virtual graph class that all other graph classes should extend. Rgraphviz is simple to use. With simple commands we can create or remove a node or an edge between nodes, obtain the weights of a graph, and highlight the highly connected nodes of a graph in a distinct color. This package is used in conjunction with deal (presented in the next section). More useful commands can be found in the Appendix.
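The node-and-edge-list representation described above can be sketched with a small class. This is an illustrative Python analogue of the idea, not the graphNEL class or the Rgraphviz API; the class and method names are invented, the graph is undirected for simplicity, and the edges and weights between the six genes are hypothetical.

```python
class GeneGraph:
    """Undirected weighted graph stored as a dict of neighbour dicts."""

    def __init__(self):
        self.adj = {}

    def add_node(self, node):
        self.adj.setdefault(node, {})

    def add_edge(self, u, v, weight=1.0):
        self.add_node(u)
        self.add_node(v)
        self.adj[u][v] = weight
        self.adj[v][u] = weight

    def remove_edge(self, u, v):
        self.adj[u].pop(v, None)
        self.adj[v].pop(u, None)

    def degree(self, node):
        return len(self.adj[node])

    def hubs(self, min_degree):
        """Nodes with at least min_degree neighbours, in sorted order."""
        return sorted(n for n in self.adj if self.degree(n) >= min_degree)


g = GeneGraph()
for gene in ["BIRC5", "MYBL2", "MMP11", "ERBB2", "GSTM1", "ESR1"]:
    g.add_node(gene)

# Hypothetical relationships, for illustration only.
g.add_edge("BIRC5", "MYBL2", weight=0.8)
g.add_edge("BIRC5", "ESR1", weight=0.4)
g.add_edge("BIRC5", "MMP11", weight=0.6)
g.add_edge("ERBB2", "ESR1", weight=0.7)
g.remove_edge("BIRC5", "MMP11")

highly_connected = g.hubs(min_degree=2)
```

The highly connected nodes returned by `hubs` are the ones a drawing tool would typically render in a distinct color.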
7.7 Deal R package for Bayesian Gene Interaction Networks
deal is a software package freely available for use with R. deal may be used in conjunction with other statistical methods available in R for analysing data. It includes several methods for analysing data using BNs with mixed variables (discrete/continuous), but in its current version deal is restricted to conditionally Gaussian networks. This means that deal does not allow continuous parents of discrete nodes, so such a relation cannot be described. This is not a problem for us, because all our variables were discretized.
In deal, a Bayesian network is represented as an object of class network. A network is generated from a data frame in which the variables are specified as factors (categorical variables). The primary attribute of a network is its list of nodes. Each entry in the list is an object of class node representing a node in the graph, together with the information associated with that node (such as its label and type).
To learn a Bayesian network, the user needs to supply a training data set and to represent any available prior knowledge as a Bayesian network (for example, the prior probability of each variable). deal uses this prior information to deduce prior distributions for all parameters in the model, and then combines them with the training data to yield posterior distributions of the parameters. From these learned parameters, deal learns the structure of the network: a network score is calculated, and a search strategy is employed to find the network with the highest score. This network gives the best representation of the data, and we call it the posterior network. The criterion for comparing the different network structures in deal is the BDe score. The structure learning methods of deal follow Heckerman et al. [18]. We are working on an implementation of the greedy equivalence search (GES) algorithm proposed by Chickering [10].
Gene     BIRC5  MYBL2  MMP11  ERBB2  GSTM1  ESR1  ...
Expt.1   0      0      0      0      0      0     ...
Expt.2   0      0      0      0      0      0     ...
Expt.3   1      0      0      0      1      0     ...
Expt.4   0      0      0      0      0      0     ...
Expt.5   0      0      1      1      0      1     ...
Expt.6   1      1      1      0      1      1     ...
...      ...    ...    ...    ...    ...    ...   ...
Table 6: A part of our data set. The variables were normalized with the Lowess function and discretized with the mean and standard deviation approach; their missing values were imputed with the K-NN algorithm.
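The mean and standard deviation discretization mentioned in the caption can be sketched as follows. This is an illustration in Python, not the code used for the thesis; the cut-offs at mean ± k·sd with k = 1, the use of the population standard deviation and the function name `discretize` are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def discretize(values: List[float], k: float = 1.0) -> List[int]:
    """Map each value to -1 (under-expressed), 0 (normal) or 1 (over-expressed)
    using the cut-offs mean - k*sd and mean + k*sd (k is an assumed parameter)."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (assumption)
    low, high = m - k * s, m + k * s
    return [1 if v > high else -1 if v < low else 0 for v in values]


# Hypothetical normalized log-ratios of one gene across six experiments.
expression = [0.1, -0.2, 0.0, 2.5, -2.4, 0.3]
print(discretize(expression))  # [0, 0, 0, 1, -1, 0]
```

A larger k sends more values to the "normal" class, so the choice of k controls how much of the signal is kept.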
To evaluate the results, we used different data sets as training data to construct the network. We tested with GDS266, GDS946, GDS1063 and GDS2643 from GEO, and with a data set combining these four. The network scores produced by the application make it possible to compare the quality of the networks trained on each single data set (single study) with that of the network trained on the combined data set (cross-study). We applied our experiment to several small subnetworks in order to look for a correspondence with the clusters (groups of the most correlated genes) produced by clustering methods on the same data. Here, we present one of these subnetworks. Figures 9 and 10 show the networks obtained with the different training data. The nodes of these networks are the same; each node is a gene involved in B-cell lymphoma.
To make this easier to follow, we describe here the significance of B cells in a gene interaction network. B-cell biology and its role in cancer is an area of intensive research. B cells are lymphocytes that play a large role in the humoral immune response (as opposed to the cell-mediated immune response, which is governed by T cells). The principal function of B cells is to produce antibodies against soluble antigens. B cells are an essential component of the adaptive immune system. However, like most cells in the body, B cells can become cancerous. Identifying the antigens found on the surface of cancerous B cells is very important: it makes it possible to detect these cancerous cells and to generate antibodies against these proteins, thereby marking the cells for destruction by the body's own immune system. Finding the probabilistic relations between genes makes it possible to predict which genes will be over-expressed when some others are over-expressed. In this test, we simulated the interaction of six genes of B-cell lymphoma: BIRC5, MYBL2, MMP11, ERBB2, GSTM1 and ESR1 (see Figure 9 and Figure 10).
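The network scores compared in this experiment are BDe scores. As an illustration only, the sketch below computes the BDeu score in Python, that is, the special case of BDe with a uniform prior. It is not deal's implementation: the equivalent sample size `ess = 1`, the function names and the toy data are assumptions, and the prior that deal derives from a user-supplied network may differ.

```python
from collections import Counter
from math import lgamma
from typing import Dict, List, Sequence

Sample = Dict[str, int]


def bdeu_node_score(data: List[Sample], node: str, parents: Sequence[str],
                    levels: Sequence[int] = (-1, 0, 1), ess: float = 1.0) -> float:
    """Log BDeu score of one node given its parents (all variables share `levels`)."""
    r = len(levels)                 # number of states of the node
    q = r ** len(parents)           # number of parent configurations
    a_ij, a_ijk = ess / q, ess / (q * r)
    n_ij = Counter(tuple(s[p] for p in parents) for s in data)
    n_ijk = Counter((tuple(s[p] for p in parents), s[node]) for s in data)
    # Unobserved configurations contribute zero, so only observed counts are summed.
    score = sum(lgamma(a_ij) - lgamma(a_ij + n) for n in n_ij.values())
    score += sum(lgamma(a_ijk + n) - lgamma(a_ijk) for n in n_ijk.values())
    return score


def bdeu_network_score(data: List[Sample],
                       structure: Dict[str, Sequence[str]]) -> float:
    """Sum of node scores; `structure` maps each node to its list of parents."""
    return sum(bdeu_node_score(data, node, parents)
               for node, parents in structure.items())


# Hypothetical toy data in which BIRC5 always follows MYBL2.
toy = [{"MYBL2": v, "BIRC5": v} for v in (-1, 0, 1) for _ in range(4)]
with_edge = bdeu_network_score(toy, {"MYBL2": [], "BIRC5": ["MYBL2"]})
without_edge = bdeu_network_score(toy, {"MYBL2": [], "BIRC5": []})
print(with_edge, without_edge)
```

On this toy data the structure with the edge MYBL2 → BIRC5 obtains the higher (less negative) score, which is the comparison a search strategy relies on when both structures are scored on the same data.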
With the result generated by the application, we may decide to keep the network structure with the best score. In the figure, we can see that the network trained on the combined data set has the best score. Moreover, we would like to comment on the number of samples (conditions, patients) in each data set: the magnitude of the score grows with the number of samples in the training data. For example, GDS266 has 29 samples and its score is −95.89306, GDS946 has 24 samples and its score is −81.16829, GDS1063 has 21 samples and its score is −78.40007, GDS2643 has 55 samples and its score is −137.6083, and GDS-meta has 29 + 24 + 21 + 55 = 129 samples and its score is −336.5945. The robust edges, which are coloured in the illustration, are present in all four tests (for example, the edges MYBL2 → BIRC5, ERBB2 → BIRC5 and GSTM1 → BIRC5).
Figure 9: Experimental results for each dataset (study)
P(BIRC5 | MYBL2)   BIRC5 = −1   BIRC5 = 0   BIRC5 = 1
MYBL2 = −1         0.32096      0.31387     0.32167
MYBL2 = 0          0.27086      0.41247     0.34920
MYBL2 = 1          0.40818      0.27366     0.32913
Table 7: Probability distribution table for gene BIRC5
From these results, we can validate the networks against the testing data. The prediction method is as follows. We choose BIRC5 as the target gene. We want to know the expression level of BIRC5 when its parents (MYBL2, ERBB2, GSTM1) are over-expressed, normal or under-expressed. This expression information is given by the testing data. For example, suppose that in one sample of the testing data MYBL2 is over-expressed (value "1" in the data), ERBB2 is over-expressed (value "1") and GSTM1 is normal (value "0"). To answer the question above, we calculate P(BIRC5 | MYBL2, ERBB2, GSTM1). This probability is calculated automatically by deal, so we can obtain its value simply by querying the program. We proceed in the same way for all other samples.
Given a threshold t ∈ [0, 1], we say a prediction is "present" if P(prediction = present | d) > t, and otherwise we say it is "absent". If the network predicts "present" when the testing data says "absent", we call it a False Positive; if the network predicts "absent" when the testing data says "present", we call it a False Negative. As t increases, the number of False Negatives increases whereas the number of False Positives decreases. This method was proposed by Friedman et al. (2002) [13], who used it to evaluate the error rate of prediction. The results can be summarized by a ROC curve in order to compare different networks. Because of time limitations, we have not yet implemented this function.
Figure 10: Experimental results for the combining dataset (cross-study)
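The prediction and thresholding procedure described above can be sketched as follows. This is a hedged Python illustration, not the function that remains to be implemented in our framework: the conditional probabilities are estimated from counts with add-one smoothing (an assumption), "present" is taken to mean BIRC5 = 1, and the training and testing samples are invented for the example.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Dict[str, int]
PARENTS = ("MYBL2", "ERBB2", "GSTM1")


def fit_cpt(train: List[Sample], target: str, parents: Sequence[str],
            levels: Sequence[int] = (-1, 0, 1),
            alpha: float = 1.0) -> Callable[[Tuple[int, ...], int], float]:
    """Estimate P(target | parents) from counts with add-alpha smoothing."""
    joint = Counter((tuple(s[p] for p in parents), s[target]) for s in train)
    marg = Counter(tuple(s[p] for p in parents) for s in train)

    def prob(config: Tuple[int, ...], value: int) -> float:
        return (joint[(config, value)] + alpha) / (marg[config] + alpha * len(levels))

    return prob


def error_counts(probs: List[float], truth: List[bool], t: float) -> Tuple[int, int]:
    """Return (false positives, false negatives) at threshold t."""
    fp = sum(1 for p, y in zip(probs, truth) if p > t and not y)
    fn = sum(1 for p, y in zip(probs, truth) if p <= t and y)
    return fp, fn


def row(mybl2: int, erbb2: int, gstm1: int, birc5: int) -> Sample:
    return {"MYBL2": mybl2, "ERBB2": erbb2, "GSTM1": gstm1, "BIRC5": birc5}


# Hypothetical discretized samples (not taken from the GEO data sets).
train = [row(1, 1, 0, 1), row(1, 1, 0, 1), row(1, 1, 0, 0),
         row(0, 0, 0, 0), row(0, 0, 0, 0), row(-1, 0, 1, -1)]
test = [row(1, 1, 0, 1), row(0, 0, 0, 0), row(-1, 0, 1, 1), row(1, 1, 0, 0)]

prob = fit_cpt(train, "BIRC5", PARENTS)
probs = [prob(tuple(s[p] for p in PARENTS), 1) for s in test]
truth = [s["BIRC5"] == 1 for s in test]

for t in (0.1, 0.3, 0.6):
    fp, fn = error_counts(probs, truth, t)
    print(f"t={t}: FP={fp}, FN={fn}")
```

On this toy example the false positives go from 2 to 1 to 0 and the false negatives from 0 to 1 to 2 as t increases; plotting such pairs over a grid of thresholds gives the points of a ROC curve.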
8 Conclusion and Perspective
In this thesis, Bayesian network learning was applied to microarray data sets with the goal of revealing the regulatory mechanisms among genes in the gene interaction network. The implemented framework was applied to several small subnetworks. On the one hand, the experiments showed that the results are best in the case of meta-analysis. On the other hand, they showed that networks learned with Bayesian networks have probabilistic semantics, which better fit the stochastic nature of both the biological processes and the noisy experiments, normalized here with a commonly used normalization method. The resulting networks, which are visualized graphically, are easy to interpret biologically and could be used by other scientists for further exploration. Finally, we chose a discretization method well suited to microarray data, which effectively reduced the arbitrary information. This is one of the major problems addressed in this thesis.
deal is a toolbox that adds functionality to R so that BNs may be used in conjunction with the other statistical methods available in R for analysing data. However, deal has some limitations. Its methods are only applicable to complete data sets. The criterion for comparing the different network structures in deal is the BDe criterion. This is not the best score for structure learning on microarray data, because our gene expression data were discretized with a three-level discretization method (a three-class problem). Moreover, this method is unsupervised by nature, so the evaluation phase for our prediction requires the special approach described in the previous section, proposed by Friedman et al. (2002) [13]. These points are left for future work. Since one of the major advantages of Bayesian networks is the ability to combine prior knowledge with information extracted from data, integrating prior knowledge into the framework may improve the results.
On the other hand, meta-analysis is a promising approach for improving performance by bringing different studies into a common analysis. Moreover, in the context of meta-analysis, exploring Gene Ontology functions is of interest to us for future work. We hope to address these perspectives in the near future.
References
[1] Alizadeh A, Eisen M, Davis RE, Ma C, Sabet H, Tran T, Powell JI, Yang L, Marti GE, Moore DT, Hudson JR Jr., Chan WC, Greiner T, Weisenburger D, Armitage JO, Lossos I, Levy R, Botstein D, Brown PO, and Staudt LM. The lymphochip: a specialized cDNA microarray for the genomic-scale analysis of gene expression in normal and malignant lymphocytes. Cold Spring Harb Symp Quant Biol, 64:71–78, 1999.
[2] Alizadeh A, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr., Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, and Staudt LM. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503–511, 2000.
[3] S. Acid and L. M. de Campos. Searching for Bayesian network structures in the space of restricted acyclic partially directed graphs. Journal of Artificial Intelligence Research, 2003.
[4] M. Madan Babu. An Introduction to Microarray Data Analysis. unknown, 2004.
[5] Catherine Ball, Alison Brazma, Helen Causton, and Steve Chervitz. Microarray data standards: An open letter. Environmental Health Perspectives, 112:A666–A667, 2004.
[6] Matthew J. Beal, Francesco Falciani, Zoubin Ghahramani, Claudia Rangel, and David L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21(3):349–356, 2005.
[7] C. Becquet, S. Blachon, B. Jeudy, J-F. Boulicaut, and O. Gandrillon. Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data. Genome Biology, 3(12), 2002.
[8] Xiaohui Chen, Ming Chen, and Kaida Ning. BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network. Bioinformatics, 22(23):2952–2954, 2006.
[9] J. Cheng, R. Greiner, J. Kelly, D. A. Bell, and W. Liu. Learning Bayesian networks from data: an information-theory based approach. Artificial Intelligence, 2002.
[10] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.
[11] L. M. de Campos and J. F. Huete. A new approach for learning belief networks using independence criteria. International Journal of Approximate Reasoning, 2000.
[12] K.-A. Do, P. Muller, and F. Tang. A Bayesian mixture model for differential gene expression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):627–644, 2005.
[13] N. Friedman, M. Ninio, I. Pe'er, and T. Pupko. A structural EM algorithm for phylogenetic inference. Journal of Computational Biology, 2002.
[14] Lawrence Dachen Fu. A Comparison of State-of-the-Art Algorithms for Learning Bayesian Network Structure from Continuous Data. PhD thesis, Biomedical Informatics, 2005.
[15] Florian Geier, Jens Timmer, and Christian Fleck. Reconstructing gene-regulatory networks from time series, knock-out data, and prior knowledge. BMC Systems Biology, 1(1):11, 2007.
[16] D. Ghosh, T. R. Barrette, D. Rhodes, and A. M. Chinnaiyan. Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Funct Integr Genomics, 3(4):180–188, 2003.
[17] R. Gottardo, J. Pannucci, and C. Kuske. Statistical analysis of microarray data: a Bayesian approach. Biostatistics, 4:597–620, 2003.
[18] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 1995.
[19] L. V. Hedges and I. Olkin. Statistical Methods for Meta-analysis. Academic Press, 1985.
[20] SJ Hinchliffe, KE Isherwood, RA Stabler, MB Prentice, A Rakin, RA Nichols, PC Oyston, J Hinds, RW Titball, and BW Wren. Application of DNA microarrays to study the evolutionary genomics of Yersinia pestis and Yersinia pseudotuberculosis. Genome Res, 13:2018–2029, 2003.
[21] Chun-Nan Hsu, Hung-Ju Huang, and Tzu-Tsung Wong. Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayesian classifiers. Machine Learning, 53(3):235–263, 2003.
[22] Zan Huang, Jiexun Li, Hua Su, George S. Watts, and Hsinchun Chen. Large-scale regulatory network analysis from microarray data: modified Bayesian network learning and association rule mining. Decis. Support Syst., 43(4):1207–1225, 2007.
[23] Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.
[24] T. Isabel, H. Yufei, Y. Yufang, P. Diego, and M. Carmen. Uncovering gene regulatory networks from time-series microarray data with variational Bayesian structural expectation maximization. EURASIP Journal on Bioinformatics and Systems Biology, page 14, 2007.
[25] Steen Knudsen. A Biologist's Guide to Analysis of DNA Microarray Data. John Wiley and Sons, 2002.
[26] K. Kerr. Extended analysis of benchmark datasets for Agilent two-color microarrays. BMC Bioinformatics, 8(1):371, 2007.
[27] M Khojasteh, WL Lam, RK Ward, and C MacAulay. A stepwise framework for the normalization of array CGH data. BMC Bioinformatics, 6:274, 2005.
[28] Haseong Kim, Jae Lee, and Taesung Park. Boolean networks using the chi-square test for inferring large-scale gene regulatory networks. BMC Bioinformatics, 8(1):37, 2007.
[29] James E. Korkola, Ekaterina Blaveri, Sandy DeVries, Dan H. Moore, E. Shelley Hwang, Yunn-Yi Chen, Anne LH. Estep, Karen L. Chew, Ronald H. Jensen, and Frederic M. Waldman. Identification of a robust gene signature that predicts breast cancer outcome in independent data sets. BMC Cancer, 61(7), 2007.
[30] Harri Lähdesmäki, Sampsa Hautaniemi, Ilya Shmulevich, and Olli Yli-Harja. Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. Signal Process., 86(4):814–834, 2006.
[31] O. Larsson and R. Sandberg. Lack of correct data format and comparability limits future integrative microarray research. Nat Biotechnol, 24(11):1322–1323, 2006.
[32] Ola Larsson, Kristian Wennmalm, and Rickard Sandberg. Comparative microarray analysis. OMICS: A Journal of Integrative Biology, 10(3):381–397, 2006.
[33] Homin K. Lee, Amy K. Hsu, Jon Sajdak, Jie Qin, and Paul Pavlidis. Coexpression Analysis of Human Genes Across Many Microarray Data Sets. Genome Res., 14(6):1085–1094, 2004.
[34] Philippe Leray. Réseaux bayésiens : apprentissage et modélisation de systèmes complexes. Habilitation à diriger des recherches, 2006.
[35] Peng Li, Chaoyang Zhang, Edward Perkins, Ping Gong, and Youping Deng. Comparison of probabilistic Boolean network and dynamic Bayesian network approaches for inferring gene regulatory networks. BMC Bioinformatics, 8(Suppl 7):S13, 2007.
[36] Xia Li, Shaoqi Rao, Wei Jiang, Chuanxing Li, Yun Xiao, Zheng Guo, Qingpu Zhang, Lihong Wang, Lei Du, Jing Li, Li Li, Tianwen Zhang, and Qing Wang. Discovery of time-delayed gene regulatory networks based on temporal gene expression profiling. BMC Bioinformatics, 7(1):26, 2006.
[37] Tiefei Liu. Learning gene network using Bayesian network framework. PhD thesis, National University of Singapore, 2005.
[38] F. Markowetz. A bibliography on learning causal networks of gene interactions. Oxford University Press, pages 349–356, March 2007.
[39] D Molenaar, F Bringel, FH Schuren, WM de Vos, RJ Siezen, and M Kleerebezem. Exploring Lactobacillus plantarum genome diversity by using microarrays. J Bacteriol, 187:6119–6127, 2005.
[40] P Neuvial, P Hupe, I Brito, S Liva, E Manie, C Brennetot, F Radvanyi, A Aurias, and E Barillot. Spatial normalization of array-CGH data. BMC Bioinformatics, 7:264, 2006.
[41] M. Newton, A. Noueiry, and D. Sarkar. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5:155–176, 2004.
[42] J. M. Peña, J. Björkegren, and J. Tegnér. Learning and validating Bayesian network models of genetic regulatory networks. Advances in Bayesian networks, Springer Verlag, 2006.
[43] R. G. Pensa, C. Leschi, J. Besson, and J. Boulicaut. Assessment of discretization techniques for relevant pattern discovery from gene expression data. In 4th Workshop on Data Mining in Bioinformatics, 2004.
[44] Reza Jamasebi. Nattakarn Ratprasartporn. Bayesian networks with applications in bioinformatics. PhD thesis, EECS, 2004.
[45] A. Reverter, Y. H. Wang, K. A. Byrne, S. H. Tan, G. S. Harper, and S. A. Lehnert. Joint analysis of multiple cDNA microarray studies via multivariate mixed models applied to genetic improvement of beef cattle. American Society of Animal Science, 83:3430–3439, 2004.
[46] D. R. Rhodes, T. R. Barrette, and M. A. Rubin. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res, 62(15):4427–33, 2002.
[47] Daniel R. Rhodes, Jianjun Yu, K. Shanker, Nandan Deshpande, Radhika Varambally, Debashis Ghosh, Terrence Barrette, Akhilesh Pandey, and Arul M. Chinnaiyan. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proceedings of the National Academy of Sciences, 101(25):9309–9314, 2004.
[48] Thomas Schlitt and Alvis Brazma. Current approaches to gene regulatory network modelling. BMC Bioinformatics, 8(Suppl 6):S9, 2007.
[49] R. Shen, D. Ghosh, and A. Chinnaiyan. Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics, 5(1):94, 2004.
[50] Nguyen H. Tuong and Philippe Leray. Scoring function for Bayesian network structure learning. Technical report, Polytechnic School of Nantes University, http://nhtuong.wordpress.com, 2008.
[51] J. Wang, K. R. Coombes, and W. E. Highsmith. Differences in gene expression between B-cell chronic lymphocytic leukemia and normal B cells: a meta-analysis of three microarray studies. Bioinformatics, 20(17):3166–3178, 2004.
[52] Patrick Warnat, Roland Eils, and Benedikt Brors. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, 6(1):265, 2005.
[53] Adriano V. Werhli, Marco Grzegorczyk, and Dirk Husmeier. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics, 22(20):2523–2531, 2006.
[54] G. Wright, B. Tan, A. Rosenwald, H. E. Hurt, A. Wiestner, and L. M. Staudt. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc. Natl. Acad. Sci. USA, 100:9991–9996, 2003.
[55] L. F. Wu, T. R. Hughes, and A. P. Davierwala. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat Genet, 31(3):255–265, 2002.
[56] Qian Xiang, Xianhua Dai, Yangyang Deng, Caisheng He, Jiang Wang, Jihua Feng, and Zhiming Dai. Missing value imputation for microarray gene expression data using histone acetylation information. BMC Bioinformatics, 9:252, May 2008.
[57] Y. Huang, J. Wang, J. Zhang, M. Sanchez, and Y. Wang. Bayesian inference of genetic regulatory networks from time series microarray data using dynamic Bayesian networks. Bioinformatics, 2:46–56, 2007.
[58] D Yoon, SG Yi, JH Kim, and T Park. Two-stage normalization using background intensities in cDNA microarray data. BMC Bioinformatics, 5:97, 2004.
[59] J. Yu, V. Smith, P. Wang, A. Hartemink, and E. Jarvis. Using Bayesian network inference algorithms to recover molecular genetic regulatory networks. International Conference on Systems Biology, 2002.
[60] X. Zhong, L. Marchionni, and L. Cope. Optimized cross-study analysis of microarray based predictors. Technical Report, Johns Hopkins University Department of Biostatistics, 129, 2007.
[61] XJ Zhou, MJ Kao, H Huang, A Wong, J Nunez-Iglesias, M Primig, OM Aparicio, CE Finch, TE Morgan, and WH Wong. Functional annotation and network reconstruction through cross-platform integration of microarray data. Nature Biotechnology, pages 238–43, 2005.
[62] Xiaobo Zhou, Xiaodong Wang, Ranadip Pal, Ivan Ivanov, Michael Bittner, and Edward R. Dougherty. A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks. Bioinformatics, 20(17):2918–2927, 2004.
[63] Min Zou and Suzanne D. Conzen. A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics, 21(1):71–79, 2005.
APPENDICES
A Appendix
Downloading and querying data from GEO: The data set you are going to use is available from the Gene Expression Omnibus (GEO) database, hosted by NCBI. First, go to the website http://www.ncbi.nlm.nih.gov/geo/. Next, query DataSets for the data set you have been allocated (for example GDS266, GDS946, GDS1063, GDS2643). On the first result, click the heat map icon on the right-hand side. About halfway down the new page, to the left of the heat map icon, choose:
Data->Download->DataSet SOFT file
Save the file (normally GDS266.soft.gz or similar) to your computer, making a note of where you save it. Also, from the same menu, pick:
Data->Download->Annotation SOFT file
For querying, we used GEOquery:
> library(GEOquery)
> gpl96
> # Or, open an existing GDS file (even if it is compressed):
> gds946
> eset946
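For readers who want to inspect a downloaded DataSet SOFT file without GEOquery, the following Python sketch reads its expression table. It is not the GEOquery API. It assumes that the table is tab-separated, sits between the lines `!dataset_table_begin` and `!dataset_table_end`, starts with the columns ID_REF and IDENTIFIER, and marks missing values with a non-numeric token such as `null`; the function names and the data set name GDS000 in the demonstration are hypothetical.

```python
import gzip
import io
from typing import Dict, List, Optional, TextIO


def open_soft(path: str) -> TextIO:
    """Open a SOFT file as text, whether or not it is gzip-compressed."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_gds_table(handle: TextIO) -> Dict[str, Dict[str, Optional[float]]]:
    """Return {gene identifier: {sample id: value or None}}.
    If several probes share an identifier, the last one read is kept."""
    table: Dict[str, Dict[str, Optional[float]]] = {}
    header: Optional[List[str]] = None
    in_table = False
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith("!dataset_table_begin"):
            in_table = True
            continue
        if line.startswith("!dataset_table_end"):
            break
        if not in_table:
            continue  # metadata lines before the table
        fields = line.split("\t")
        if header is None:
            header = fields  # ID_REF, IDENTIFIER, then one column per sample
            continue
        values: Dict[str, Optional[float]] = {}
        for sample, raw in zip(header[2:], fields[2:]):
            try:
                values[sample] = float(raw)
            except ValueError:
                values[sample] = None  # missing value, to be imputed later
        table[fields[1]] = values
    return table


# Tiny in-memory stand-in for a real file such as GDS266.soft.gz.
demo = ("^DATASET = GDS000\n!dataset_table_begin\n"
        "ID_REF\tIDENTIFIER\tGSM1\tGSM2\n"
        "1_at\tBIRC5\t1.5\tnull\n"
        "2_at\tMYBL2\t0.2\t0.7\n"
        "!dataset_table_end\n")
table = read_gds_table(io.StringIO(demo))
print(table["BIRC5"])
```

For a file on disk, `read_gds_table(open_soft("GDS266.soft.gz"))` would be the corresponding call, with the path adjusted to where the file was saved.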
Graph with Rgraphviz: Using Rgraphviz is simple. To create a graph with one node, we use:
> g
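Rgraphviz renders graphs through Graphviz. As a language-neutral illustration, the Python sketch below writes a network in the Graphviz DOT format and colours the edges shared by all networks, in the spirit of the robust edges discussed earlier. The function names are assumptions, and the two edges between ESR1 and MMP11 are invented so that the example networks differ.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def robust_edges(networks: List[Set[Edge]]) -> Set[Edge]:
    """Edges present in every network."""
    return set.intersection(*networks) if networks else set()


def to_dot(edges: Iterable[Edge], highlighted: Iterable[Edge] = ()) -> str:
    """Serialize a directed graph as DOT text, colouring highlighted edges red."""
    marked = set(highlighted)
    lines = ["digraph genes {"]
    for src, dst in sorted(edges):
        attr = " [color=red]" if (src, dst) in marked else ""
        lines.append(f'  "{src}" -> "{dst}"{attr};')
    lines.append("}")
    return "\n".join(lines)


shared = {("MYBL2", "BIRC5"), ("ERBB2", "BIRC5"), ("GSTM1", "BIRC5")}
net_a = shared | {("ESR1", "MMP11")}   # hypothetical extra edge
net_b = shared | {("MMP11", "ESR1")}   # hypothetical extra edge

robust = robust_edges([net_a, net_b])
dot = to_dot(net_a, highlighted=robust)
print(dot)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` command line tool.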
In particular, the following code detects the highly connected nodes in a graph:
HIGHLY CONNECTED NODES
hcs
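One simple reading of "highly connected nodes" is nodes whose degree reaches a chosen cut-off. The Python sketch below implements only that reading; it is not necessarily what the original R code computed, and the cut-off `min_degree = 3`, the function name and the edge between ESR1 and MMP11 are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def highly_connected(edges: Iterable[Tuple[str, str]],
                     min_degree: int = 3) -> List[str]:
    """Return, in alphabetical order, the nodes touching at least
    `min_degree` edges (incoming and outgoing edges both count)."""
    degree: Counter = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return sorted(node for node, d in degree.items() if d >= min_degree)


edges = [("MYBL2", "BIRC5"), ("ERBB2", "BIRC5"), ("GSTM1", "BIRC5"),
         ("ESR1", "MMP11")]  # last edge is hypothetical
print(highly_connected(edges))  # ['BIRC5']
```

In the subnetwork studied here, BIRC5 is the node that stands out, since the three robust edges all point to it.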