EURASIP Journal on Bioinformatics and Systems Biology
Applications of Signal Processing Techniques to Bioinformatics, Genomics, and Proteomics Guest Editors: Erchin Serpedin, Javier Garcia-Frias, Yufei Huang, and Ulisses Braga-Neto
Copyright © 2009 Hindawi Publishing Corporation. All rights reserved. This is a special issue published in volume 2009 of “EURASIP Journal on Bioinformatics and Systems Biology.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Editor-in-Chief
Ioan Tabus, Tampere University of Technology, Finland

Associate Editors
Jaakko T. Astola, Finland; Junior Barrera, Brazil; Michael L. Bittner, USA; Michael R. Brent, USA; Yidong Chen, USA; Paul Cristea, Romania; Aniruddha Datta, USA; Bart De Moor, Belgium; Edward R. Dougherty, USA; Javier Garcia-Frias, USA; Debashis Ghosh, USA; John Goutsias, USA; Roderic Guigo, Spain; Yufei Huang, USA; Seungchan Kim, USA; John Quackenbush, USA; Jorma Rissanen, Finland; Stéphane Robin, France; Paola Sebastiani, USA; Erchin Serpedin, USA; Ilya Shmulevich, USA; Ahmed Tewfik, USA; Sabine Van Huffel, Belgium; Z. Jane Wang, Canada; Yue Wang, USA
Contents

Applications of Signal Processing Techniques to Bioinformatics, Genomics, and Proteomics, Erchin Serpedin, Javier Garcia-Frias, Yufei Huang, and Ulisses Braga-Neto. Volume 2009, Article ID 250306, 2 pages
Compressive Sensing DNA Microarrays, Wei Dai, Mona A. Sheikh, Olgica Milenkovic, and Richard G. Baraniuk. Volume 2009, Article ID 162824, 12 pages
How to Improve Postgenomic Knowledge Discovery Using Imputation, Muhammad Shoaib B. Sehgal, Iqbal Gondal, Laurence S. Dooley, and Ross Coppel. Volume 2009, Article ID 717136, 14 pages
Efficient Alignment of RNAs with Pseudoknots Using Sequence Alignment Constraints, Byung-Jun Yoon. Volume 2009, Article ID 491074, 13 pages
A Hybrid Technique for the Periodicity Characterization of Genomic Sequence Data, Julien Epps. Volume 2009, Article ID 924601, 8 pages
Identifying Genes Involved in Cyclic Processes by Combining Gene Expression Analysis and Prior Knowledge, Wentao Zhao, Erchin Serpedin, and Edward R. Dougherty. Volume 2009, Article ID 683463, 9 pages
Clustering of Gene Expression Data Based on Shape Similarity, Travis J. Hestilow and Yufei Huang. Volume 2009, Article ID 195712, 12 pages
Spectral Preprocessing for Clustering Time-Series Gene Expressions, Wentao Zhao, Erchin Serpedin, and Edward R. Dougherty. Volume 2009, Article ID 713248, 10 pages
Is Bagging Effective in the Classification of Small-Sample Genomic and Proteomic Data?, T. T. Vu and U. M. Braga-Neto. Volume 2009, Article ID 15836, 10 pages
Intervention in Context-Sensitive Probabilistic Boolean Networks Revisited, Babak Faryabi, Golnaz Vahedi, Jean-Francois Chamberland, Aniruddha Datta, and Edward R. Dougherty. Volume 2009, Article ID 360864, 13 pages
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 250306, 2 pages doi:10.1155/2009/250306
Editorial

Applications of Signal Processing Techniques to Bioinformatics, Genomics, and Proteomics

Erchin Serpedin (EURASIP Member),1 Javier Garcia-Frias,2 Yufei Huang,3 and Ulisses Braga-Neto1

1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
2 Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19716-7501, USA
3 Department of Electrical and Computer Engineering, University of Texas at San Antonio, TX 78249-2057, USA

Correspondence should be addressed to Erchin Serpedin, [email protected]

Received 17 February 2009; Accepted 17 February 2009

Copyright © 2009 Erchin Serpedin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The recent development of high-throughput molecular genetics technologies has had a major impact on bioinformatics and systems biology. These technologies have made it possible to measure the expression profiles of genes and proteins in a highly parallel and integrated fashion. The examination of the huge amounts of genomic and proteomic data holds the promise of understanding the complex interactions between genes and proteins, the functional processes of a cell, and the impact of various factors on a cell, and, ultimately, of enabling the design of new technologies for the intelligent management of diseases. This special issue focuses on the modeling and processing of data arising in bioinformatics, genomics, and proteomics using signal processing methods. Signal processing techniques play a central role in extracting, processing, and interpreting the information contained in genomic and proteomic data. It is our hope that signal processing methods will lead to new advances and insights in uncovering the structure, functioning, and evolution of biological systems. The special issue consists of nine papers that span a wide range of problems and applications in bioinformatics, genomics, and proteomics: the design of compressive sensing microarrays; the analysis of missing values in microarray data and the effect of imputation techniques on postgenomic inference methods; RNA sequence alignment; the detection of periodicity in genomic sequences and gene expression profiles; the clustering and classification of gene and protein expression data; and intervention in probabilistic Boolean networks.
Next, we briefly introduce the papers in this special issue. W. Dai et al. analyze how to design a microarray that is fit for compressive sensing and that also captures the biochemistry of probe-target DNA hybridization. Algorithms and design results are reported for determining probe sequences that satisfy the binding requirements and for estimating the target concentrations. M. S. B. Sehgal et al. address the general problem of improving postgenomic knowledge discovery procedures, such as the selection of the most significant genes and the inference of gene regulatory networks, using missing microarray data imputation techniques. It is shown that instead of neglecting missing data, recycling microarray data via robust imputation techniques can yield substantial performance improvements in the subsequent postgenomic discovery procedures. B.-J. Yoon develops a novel, efficient, and robust approach for fast and accurate structural alignment of RNAs, including pseudoknots. The proposed method accelerates the dynamic programming algorithm for family-specific models such as profile-csHMMs and CMs, and is robust to small changes in the parameters of the model used to predict the constraint. The paper by J. Epps explains in detail the origins of ambiguity in period estimation for symbolic sequences, and proposes a novel hybrid autocorrelation-IPDFT technique for the periodicity characterization of sequences. W. Zhao et al. develop a novel algorithm for the identification of genes involved in cyclic processes by combining gene expression analysis and prior knowledge.
The proposed cyclic-genes detection algorithm is validated on data sets corresponding to Saccharomyces cerevisiae and Drosophila melanogaster, and is shown to represent a valuable technique for unveiling pathways related to cyclic processes. T. J. Hestilow and Y. Huang propose a novel method for gene clustering using the shape information of gene expression profiles. The shape information, represented in terms of normalized and time-scaled forward first-order differences, is exploited by a variational Bayes clustering approach and a non-Bayesian (Silhouette) cluster statistic, and is shown to yield promising results in clustering time-series microarray data. The paper by W. Zhao et al. proposes a new clustering approach that combines traditional clustering methods with power spectral analysis of time-series gene expression measurements. Simulation results confirm that the proposed clustering approach provides superior performance relative to hierarchical, K-means, and self-organizing-map clustering, and yields additional information about temporally regulated genetic processes such as the cell cycle. T. T. Vu and U. M. Braga-Neto address the important problem of assessing the effectiveness of bagging in the classification of small-sample genomic and proteomic data sets. Representative experimental results are presented and discussed. Finally, the paper by B. Faryabi et al. studies the effects of a reduction in the values of the model parameters on intervention performance in probabilistic Boolean networks.
Acknowledgments

The authors would like to thank the Editor-in-Chief, Dr. Ioan Tabus, for the opportunity to prepare this special issue, and the reviewers for their help and constructive criticism.

Erchin Serpedin
Javier Garcia-Frias
Yufei Huang
Ulisses Braga-Neto
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 162824, 12 pages doi:10.1155/2009/162824
Research Article

Compressive Sensing DNA Microarrays

Wei Dai,1 Mona A. Sheikh,2 Olgica Milenkovic,1 and Richard G. Baraniuk2

1 Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
2 Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA

Correspondence should be addressed to Wei Dai, [email protected], and Olgica Milenkovic, [email protected]

Received 30 July 2008; Accepted 23 October 2008

Recommended by Ulisses Braga-Neto

Compressive sensing microarrays (CSMs) are DNA-based sensors that operate using group testing and compressive sensing (CS) principles. In contrast to conventional DNA microarrays, in which each genetic sensor is designed to respond to a single target, in a CSM each sensor responds to a set of targets. We study the problem of designing CSMs that simultaneously account for both the constraints of CS theory and the biochemistry of probe-target DNA hybridization. An appropriate cross-hybridization model is proposed for CSMs, and several methods are developed for probe design and CS signal recovery based on the new model. Lab experiments suggest that in order to achieve accurate hybridization profiling, consensus probe sequences are required to have sequence homology of at least 80% with all targets to be detected. Furthermore, out-of-equilibrium datasets are usually as accurate as those obtained under equilibrium conditions. Consequently, one can use CSMs in applications in which only short hybridization times are allowed.

Copyright © 2009 Wei Dai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction

Accurate identification of large numbers of genetic sequences in an environment is an important and challenging research problem. DNA microarrays are a frequently applied solution for microbe DNA detection and classification [1]. The array consists of genetic sensors, or spots, containing a large number of single-stranded DNA sequences termed probes. A DNA strand in a test sample, referred to as a target, tends to bind, or "hybridize," with its complementary probe on a microarray to form a stable duplex structure. The DNA samples to be identified are fluorescently tagged before being flushed against the microarray. The excess DNA strands are washed away, and only the hybridized DNA strands are left on the array. The fluorescent illumination pattern of the array spots is then used to infer the genetic makeup of the test sample.

1.1. Concerns in Classical DNA Microarrays. In traditional microarray designs, each spot has a DNA subsequence that serves as a unique identifier of only one organism in the target set. However, there may be other probes in the array with similar base sequences for identifying other organisms. Because spots may carry probes with similar base sequences, both specific and nonspecific hybridization events occur; the latter effect leads to errors in the array readout. Furthermore, the unique-sequence design approach severely restricts the number of organisms that can be identified. In typical biosensing applications, an extremely large number of organisms must be identified. For example, there are more than 1000 known harmful microbes, many with significantly more than 100 strains [2]. A large number of DNA targets requires microarrays with a large number of spots. The implementation cost and the speed of microarray data processing are directly related to the number of spots, which represents a significant problem for the commercial deployment of hand-held microarray-based biosensors.

1.2. Compressive Sensing. Compressive sensing (CS) is a recently developed sampling theory for sparse signals [3]. The main result of CS, introduced by Candès and Tao [3] and Donoho [4], is that a length-N signal x that is K-sparse in some basis can be recovered exactly in polynomial time from just M = O(K log(N/K)) linear measurements of the signal. In this paper, we choose the canonical basis; hence x has K ≪ N nonzero entries and N − K zero entries.
In matrix notation, we measure y = Φx, where x is the N × 1 sparse signal vector we aim to sense, y is an M × 1 measurement vector, and the measurement matrix Φ is an M × N matrix. Since M < N, recovery of the signal x from the measurements y is in general ill posed. However, the additional assumption of signal sparsity makes recovery possible. In the presence of measurement noise, the model becomes y = Φx + w, where w stands for i.i.d. additive white Gaussian noise with zero mean. The two critical conditions for realizing CS are that (i) the vector x to be sensed is sufficiently sparse, and (ii) the rows of Φ are sufficiently incoherent with the signal sparsity basis. Incoherence is achieved if Φ satisfies the so-called restricted isometry property (RIP) [3]. For example, random matrices built from Gaussian and Bernoulli distributions satisfy the RIP with high probability. Φ can also be sparse, with only L nonzero entries per row (L can vary from row to row) [5]. Various methods have been developed to recover a sparse x from the measurements y [3, 5–7]. When Φ itself is sparse, belief propagation and related graphical inference algorithms can also be applied for fast signal reconstruction [5]. An important property of CS is its information scalability: CS measurements can be used for a wide range of statistical inference tasks besides signal reconstruction, including estimation, detection, and classification.

1.3. Compressive Sensing Meets Microarrays. The setting for microbial DNA sensing naturally lends itself to CS: although the number of potential agents that a hostile adversary can use is large, not all agents are expected to be present in a significant concentration at a given time and location, or even in an air/water/soil sample to be tested in a laboratory. In traditional microarrays, this results in many inactive probes during sensing. On the other hand, there will always be minute quantities of certain harmful biological agents that may be of interest to us. Therefore, it is important not just to detect the presence of agents in a sample, but also to estimate the concentrations with which they are present. Mathematically, one can represent the DNA concentration of each organism as an element in a vector x. As per the assumption that only a few agents are present, this vector x is sparse, that is, it contains only a few significant entries. This suggests designing a microarray along the lines of the CS measurement process, where each measurement yi is a linear combination of the entries in the x vector, and where the sparse vector x can be reconstructed from y via CS decoding methods. In our proposed microarrays, the readout of each probe represents a probabilistic combination of all the targets in the test sample. The probabilities reflect each probe's affinity for its targets, that is, how likely the target and probe are to hybridize. We explain our model for probe-target hybridization in Section 2.2. In particular, the cross-hybridization property of a DNA probe with several targets, not just one, is the key to applying CS principles. Figure 1 describes the sensing process algebraically.

Figure 1: Structure of the sensing matrix in relation to the number of spots and target agents. The M rows of Φ correspond to spots and the N columns to target agents:

$$\Phi = \begin{bmatrix} \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1N} \\ \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2N} \\ \vdots & \vdots & & \vdots \\ \varphi_{M1} & \varphi_{M2} & \cdots & \varphi_{MN} \end{bmatrix}.$$

Formally, assume that there are N possible targets in total, but that at most K of them are simultaneously
present in a significant concentration, with K ≪ N. Let M be the number of measurements required for robust reconstruction according to CS theory. For 1 ≤ i ≤ M and 1 ≤ j ≤ N, the probe at spot i hybridizes to target j with probability ϕi,j. The target j occurs in the test DNA sample with concentration xj. The measured microarray signal intensity vector y = {yi}, i = 1, ..., M, equals

y = Φx + w.    (1)
Here, Φ is the sensing matrix, and w denotes a vector of i.i.d. additive white Gaussian noise samples with zero mean. We note that this probabilistic combination is assumed to be linear for the purposes of microarray design. However, in reality, there is a nonlinear saturation effect when excessive targets are present (see Section 2.4 for details). We take this into account on the reconstruction side, as part of the CS decoding techniques used to decipher the combinatorial sensor readout. Therefore, by using the CS principle, the number of spots in the microarray can be made much smaller than the number of target organisms. With fewer "intelligently chosen" DNA probes, the microarray can also be more easily miniaturized [8–10]. We refer to a microarray designed this way as a CS microarray (CSM). The CS principle is similar to the concept of group testing [8–11], which also relies on the sparsity observed in the DNA target signals. The chief advantage of a CS-based approach over direct group testing is its information scalability. With a reduced number of measurements, we are able not just to detect, but also to estimate the target signal. This is important because pathogens in the environment are often harmful to us only in large concentrations. Furthermore, we are able to use CS recovery methods, such as belief propagation, that decode x while accounting for experimental noise and measurement nonlinearities due to excessive target molecules [12]. It is also worth pointing out the substantial difference between CSMs and the "composite microarrays" designed to reduce measurement variability [13]. In the latter approach, the microarray readouts are linear combinations of input signal components and can therefore be expressed in the form given by (1). However, the Φ matrix of [13] typically does not satisfy the CS design principles. As a result, the number of required measurements/spots is significantly larger than that of CSMs. On the other hand, the use of the CS principle allows for both robust measurements and a significant reduction in the number of spots on the array [14].
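To make the sensing model concrete, the following is a minimal numerical sketch of the pipeline just described: a sparse nonnegative concentration vector x, a binary sensing matrix Φ with a few active probes per spot, and recovery from y = Φx + w. The dimensions, noise level, and the use of orthogonal matching pursuit (a generic sparse decoder standing in for the belief-propagation decoder discussed later) are all illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): N candidate targets,
# K of them actually present, M microarray spots.
N, K, M = 200, 5, 60

# Binary sensing matrix: each spot (row) cross-hybridizes with a small
# random subset of targets; entries are 0 or a positive constant c.
c, L = 1.0, 10
Phi = np.zeros((M, N))
for i in range(M):
    Phi[i, rng.choice(N, size=L, replace=False)] = c

# Sparse nonnegative concentration vector x and noisy readout y = Phi x + w.
x = np.zeros(N)
x[rng.choice(N, size=K, replace=False)] = rng.uniform(1.0, 3.0, size=K)
y = Phi @ x + 0.01 * rng.standard_normal(M)

def omp(Phi, y, k):
    """Orthogonal matching pursuit: greedily build the support of a
    k-sparse solution, refitting by least squares at each step."""
    resid, support = y.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(Phi.T @ resid))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        resid = y - Phi[:, support] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coef
    return x_hat

x_hat = omp(Phi, y, K)
print("relative L2 error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```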
Figure 2: Block diagram showing a grouping of organisms, their proteins, and COGs.
1.4. Clusters of Orthologous Groups. Note that searching the whole genomes of large sets of organisms can be computationally very expensive. As a remedy for classifying the genetic similarity of these organisms, we use the NIH database of clusters of orthologous groups (COGs) of proteins. The COGs database groups the proteins, and the corresponding DNA sequences, of 66 unicellular organisms into groups ("clusters") based on the similarity of their protein sequences, obtained by aligning matching bases in them (see Figure 2 for an illustration). The COGs classification is a phylogenetic classification: the underlying premise is that organisms of the same ancestral families will demonstrate sequence similarity in the genes that produce proteins of similar function. Since protein sequences can be translated back to the DNA sequences that produced them, a classification of similar proteins is also a classification of DNA similarity. The COGs database consists of 192,987 proteins from 66 unicellular organisms classified into 4872 clusters. We use these clusters as a guideline for grouping targets together. Targets with similar DNA sequences belong to the same group and can be more easily identified with a single probe. When designing probes, it is important to make sure that each chosen probe aligns minimally with organisms that do not belong to its group (the "nontargets"). We can use the COGs database, with its exhaustive classification, to this end, since the DNA sequences of an organism whose proteins do not belong to a certain COG will have minimal alignment with the DNA sequences of the organisms in that COG. This significantly reduces the computational complexity of the search for good probe sequences. One limitation of using COGs is that it constrains the design of the Φ matrix. For instance, if we were to choose a set of 10 organisms of interest for microarray detection, there is only a finite number of COGs (groups) to which these 10 organisms belong. We would have to carefully sift through these groups to find the ones that best satisfy the CS requirements on Φ, making sure for each choice that it is dissimilar enough from the other groups chosen. So on the one hand, using COGs guides our target grouping strategy; on the other hand, it is possible that we might not be able to find enough Φ-suitable COGs to identify all members of the group. Using only a COGs-based approach, we may have to resort to a Φ that is not the best from a CS perspective but simply what nature gives us. Here, however, we only consider an approach using COGs. A toy illustration of this grouping idea is sketched below.
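The sketch screens a hypothetical COG membership table for clusters whose indicator vectors could serve as rows of Φ. The organism names, cluster IDs, and the pairwise-overlap screening rule are invented placeholders, not the actual NIH data or the authors' selection procedure.

```python
# Hypothetical COG membership table: cluster id -> organisms whose proteins
# fall in that cluster. Real data would come from the NIH COGs database.
cogs = {
    "COG0001": {"org1", "org2", "org5"},
    "COG0002": {"org2", "org3", "org6"},
    "COG0003": {"org1", "org3", "org4"},
    "COG0004": {"org4", "org5", "org6"},
}
organisms = ["org1", "org2", "org3", "org4", "org5", "org6"]

# Each COG induces a candidate binary row of Phi: probe i responds to
# target j iff organism j belongs to the COG used to design probe i.
rows = {cid: [int(o in members) for o in organisms]
        for cid, members in cogs.items()}

def overlap(r1, r2):
    """Number of targets shared by two candidate rows."""
    return sum(a & b for a, b in zip(r1, r2))

# Screen candidates greedily: require any two chosen rows to overlap in at
# most one target (a simple stand-in for the CS/RIP-style requirements).
chosen = []
for cid, r in rows.items():
    if all(overlap(r, rows[p]) <= 1 for p in chosen):
        chosen.append(cid)
print("selected COGs:", chosen)
print("Phi rows:", [rows[c] for c in chosen])
```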
A second limitation of COGs is the fact that it is a classification of organisms based on alignments between the sections of their DNA that encode proteins, not their entire sequences. Therefore, a point for future exploration would be to work with values from alignments between the entire DNA sequences of organisms. Probes selected using such an alignment would better reflect the actual probe-target hybridization that takes place in a biosensing device. However, we are fortunate that prokaryotes, such as unicellular bacteria, typically have a larger percentage of coding than noncoding DNA; therefore, as long as we are interested in the detection of unicellular bacteria, which are prokaryotes, using a COGs-based probe selection is not much of an issue. On the other hand, eukaryotes have large amounts of noncoding regions in their DNA. This phenomenon is known as the C-value enigma [15]: more complex organisms often have more noncoding DNA in their genomes.

1.5. CSM Design Consideration. To design a CSM, we start with a given set of N targets and a valid CS matrix Φ ∈ ℝ^(M×N). The design goal is to find M DNA probe sequences such that the hybridization affinity between the ith probe and the jth target can be approximated by the value of ϕi,j. For this purpose, we go row by row in Φ, and for each row find a probe sequence such that the hybridization affinities between the probe and the N targets mimic the entries in this row. For simplicity, we assume that the CS matrix Φ is binary, that is, its entries are either zero or equal to some positive constant, say c. An entry of positive value refers to the case where the corresponding target and probe DNA strands bind together with sufficient strength that the fluorescence from the target strand adhered to the probe is visible during the microarray readout process. A zero-valued entry indicates that no such hybridization affinity exists. How to construct a binary CS matrix Φ is discussed in many papers, including [16, 17], but is beyond the scope of this paper. Henceforth, we assume that we know the Φ we want to approximate. The CSM design process is then reduced to answering two questions. Given a probe and target sequence pair, how does one predict the corresponding microarray readout intensity? Given N targets and the desired binding pattern, how does one find a probe DNA sequence such that the binding pattern is satisfied? The first question is answered by a two-step translation of a probe-target pair to the spot intensity. First, we need a hybridization model that uses features of the probe and target sequences to predict the cross-hybridization affinity between them. Since the CS matrix that we want to approximate is binary, the desired hybridization affinities can be roughly categorized into two levels, "high" and "low," corresponding to one and zero entries in Φ, respectively. The affinities in each category should be roughly uniform, while those belonging to different categories must differ significantly. With these design requirements in mind, we develop a simplified hybridization model in Section 2.2 and verify its accuracy via laboratory experiments, the results of which
are presented in Section 2.3. As the second step, we need to translate the hybridization values to microarray spot intensities using a model that includes physical parameters of the experiment, such as background noise. This issue is discussed in Section 2.4. To answer the second question, we propose a probe design algorithm that uses a "sequence voting mechanism" and a randomization mechanism. The algorithm is presented in Section 3.1. An example of its practical implementation is given in Section 3.2.

Table 1: 12 parameters used in [18] for predicting hybridization affinities between DNA sequence pairs.

X1, X3: Probe sequence length, target sequence length
X2, X4: Probe GC content, target GC content
X5: Smith-Waterman score, computed from the scoring system used in the SW alignment
X6: E-value, the probability that the SW score occurred by chance
X7: Percent identity, the percentage of matched bases in the aligned region after SW alignment
X8: Length of the SW alignment
X9: Gibbs free energy for probe DNA folding
X10: Hamming distance between probe and target
X11: Length of the longest contiguous matched segment in a SW alignment
X12: GC content in the longest contiguous segment
2. Hybridization Model

2.1. Classical Models. The task of accurately modeling the hybridization affinity between a given probe-target sequence pair is extremely challenging. There are many parameters influencing the hybridization affinity. In [18], twelve such sequence parameters are presented, as listed in Table 1. Many of these parameters (X5–X8) are based on the Smith-Waterman (SW) local alignment, computed using dynamic programming techniques [19]. The SW alignment identifies the most similar local region between two nucleotide sequences. It compares segments of all possible lengths, calculates the corresponding sequence similarity according to some scoring system, and outputs the optimal local alignment and the optimal similarity score. For example, for the two sequences 5′-CCCTGGCT-3′ and 5′-GTAAGGGA-3′, the SW alignment, which ignores prefix and suffix gaps, outputs the best local alignment

3′-T C C C-5′
   | | | |
5′-A G G G-3′.
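For reference, a compact dynamic-programming implementation of the SW score (parameter X5) is sketched below; the scoring values are illustrative, and the probe is aligned against the reverse complement of the target so that hybridizing (antiparallel, complementary) regions appear as matches.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score (parameter X5) via dynamic
    programming; the scoring values here are illustrative."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

def revcomp(s):
    """Reverse complement, so that a hybridizing (antiparallel,
    complementary) region shows up as a run of matches."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[ch] for ch in reversed(s))

# The example pair from the text: the aligned region 3'-TCCC-5' / 5'-AGGG-3'
# appears as a 4-base match once the target is reverse-complemented.
print(smith_waterman("CCCTGGCT", revcomp("GTAAGGGA")))  # 8 = 4 matches x 2
```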
Another important parameter for assessing hybridization affinity is X11 , the length of contiguous matched base pairs. It has been shown in [18, 20] that long contiguous base pairs imply strong affinity between the probe and target.
Usually, one requires at least 10 bases in oligo DNA probes to ensure sufficiently strong hybridization affinity. Besides the large number of parameters that potentially influence hybridization affinity, there are many theories about which features most influence hybridization and how they affect the process [18, 21, 22]. A third-order polynomial model using the percent identity X7 as the single parameter was developed in [21]. More recently, three multivariate models, based on third-order polynomial regression, regression trees, and artificial neural networks, respectively, were studied in [18].

2.2. Our Model for CSM. Unlike the above approaches, which aim at identifying the exact affinity value, the binary nature of our CS matrix allows for simplifications. As discussed in Section 1.5, we only need to predict whether the affinity between a probe-target pair is "high" or "low." For this purpose, two sets of rules, designed for deciding "high" and "low" affinities, respectively, are developed in this section. We propose the notion of the best matched substring pair, defined as follows, for our hybridization model.

Definition 1. Let {xi}, i = 1, ..., n, be a DNA sequence. A substring of {xi} is a sequence of the form xi, xi+1, ..., xs, where 1 ≤ i ≤ s ≤ n. Consider a given sequence pair {xi} and {yj}, 1 ≤ i ≤ n and 1 ≤ j ≤ m. Let L be a positive integer at most min(n, m). A pair of substrings of length L, one of which is part of {xi} and the other part of {yj}, will be denoted by xi, xi+1, ..., xi+L−1 and yj, yj+1, ..., yj+L−1, where 1 ≤ i ≤ n − L + 1, 1 ≤ j ≤ m − L + 1. For a given substring pair of length L, the corresponding substring percent identity PI is defined as

$$P_I = \frac{\bigl|\bigl\{0 \le k \le L-1 : x_{i+k} = \bar{y}_{j+L-1-k}\bigr\}\bigr|}{L}, \qquad (2)$$

where $\bar{y}_{j+L-1-k}$ denotes the Watson-Crick complement of y_(j+L−1−k), and |·| denotes the cardinality of the underlying set. The best matched substring pair of length L is the substring pair with the largest PI among all possible substring pairs of length L from the pair {xi} and {yj}. For a given L, the largest substring percent identity PI∗(L) is the PI of the best matched substring pair of length L. For a given PI value, the corresponding best matched length L∗(PI) is defined as

$$L^*(P_I) := \max\bigl\{L : P_I^*(L) \ge P_I\bigr\}. \qquad (3)$$
Remark 1. For a given L, the best matched substring pair is not necessarily unique, while the PI∗ (L) value is unique. Our definition is motivated by the following observations. (1) For hybridization prediction, the parameter percent identity X7 should be used together with the alignment length X8 . Although the significance of the single-parameter model based on X7 was demonstrated in [21], we observed that using the X7 parameter as the sole affinity indicator
is sometimes misleading. As an illustration, consider the example in Figure 3. For the sequence pair A, the SW alignment gives X7 = 1.00 and X8 = 6. For the sequence pair B, the SW alignment gives X7 = 0.80 and X8 = 20. Though pair B exhibits a smaller X7, it obviously has a stronger binding affinity than pair A, since the aligned part of pair A is merely a part of the aligned region of pair B. The same principle holds for the sequence pairs B and C as well. This example shows that besides the percent identity, the alignment length is important.

(2) The pair of X7 and X8 is not sufficient to predict hybridization affinity. Consider the sequence pairs C and D in Figure 3. Both exhibit the same values of the X7 and X8 parameters. However, the hybridization affinities of these two pairs are different. To see this, let us refer to Figure 4, which depicts the PI∗(L) values of sequence pairs C and D for different lengths L. It can be observed that for any given 1 ≤ L ≤ 30, the PI∗(L) value of sequence pair C is larger than that of sequence pair D. In other words, the sequences in the former pair match each other uniformly better than the sequences in the latter pair. Sequence pair C has a larger chance of hybridizing than pair D does. With the same values of the parameters X7 and X8, the difference in hybridization affinity comes from the distribution of matched bases in the aligned region.

Sequence pair A (X7 = 1.00, X8 = 6, X11 = 6):
3′-CCTTTTAACTACGACT-5′
5′-GGAAAAGACGACACAG-3′

Sequence pair B (X7 = 0.80, X8 = 20, X11 = 6):
3′-CCTTTTTTTGCAAACGAACCTCTACCGATAGAC-5′
5′-GGAAAATAAAGTCTGCCTGGTATGATGGCCGGA-3′

Sequence pair C (X7 = 0.71, X8 = 28, X11 = 6):
3′-CCTTTTTTTGCAAACGAACCTTTACCGCTAGAC-5′
5′-GGAAAATAAAGTCTGCCTGGTATTAGGGCCGGA-3′

Sequence pair D (X7 = 0.71, X8 = 28, X11 = 3):
3′-CCTCTTTTTGCAAACAGACCTTTACCGCTAGAC-5′
5′-GGAAAATAAAGTCTGCCTTGACATAGCGCCGGA-3′

Figure 3: Aligned sequence pairs from the SW alignment.

Figure 4: The PI∗(L)s of sequence pairs C and D in Figure 3 (largest substring percent identity versus length L, 1 ≤ L ≤ 30).

The advantage of using the largest substring percent identities for hybridization prediction is now apparent. The PI∗(L)s include all the information contained in the previously discussed X7, X8, and X11 parameters; it can be verified that PI∗(X8) = X7 and that X11 is one of the values of L such that PI∗(L) = 1.00. Of course, a list of PI∗(L) provides more detailed information, since it gives both local and global matching information. Based on the notion of best matched substrings, we propose a set of criteria for CSM probe-target hybridization prediction. A positive-valued entry in the CS matrix suggests that the corresponding probe-target pair satisfies the following two criteria.
(C1) There exists a best matched substring pair of length at least Lhy,1 such that the corresponding substring percent identity satisfies PI ≥ PI,hy; alternatively, ∃L ≥ Lhy,1 such that PI∗(L) ≥ PI,hy. Here, both Lhy,1 and PI,hy are judiciously chosen parameters.

(C2) Among all the best matched substring pairs with PI ≥ PI,hy, there should be no pair of length longer than Lhy,2; that is, it should hold that PI∗(L) < PI,hy for all L > Lhy,2. Again, Lhy,2 has to be chosen properly.

Criterion (C1) guarantees that there is a significantly long substring pair with high percent identity, which ensures strong hybridization affinity. Although criterion (C2) may seem counterintuitive at first glance, it ensures that no single target can dominantly hybridize with the consensus probe, that is, that the binding affinities between probe-target pairs are roughly uniform. The probe-target pair associated with a zero entry in the CS matrix satisfies the following two criteria.

(C3) Among all the best matched substring pairs with percent identity at least PI,no, there should be no pair of length longer than Lno,1; that is, ∀L > Lno,1, PI∗(L) < PI,no.

(C4) Among all the substring pairs matched perfectly (with PI = 1.00), there should be no pair of length greater than Lno,2; that is, PI∗(L) < 1.00 for all L > Lno,2.

Criterion (C3) asserts that there should be no substring pair that has both long length and high percent identity. The last criterion, (C4), prevents the existence of a long contiguous matched substring pair, which would suggest large binding affinity. Again, PI,no, Lno,1, and Lno,2 have to be chosen appropriately.
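A direct, if brute-force, rendition of Definition 1 and the criteria (C1)-(C4) is sketched below, with the parameter values taken from Table 3; the exhaustive enumeration per length L is for clarity, not efficiency, and the test strings in the demo are made up.

```python
def pi_star(x, y, L):
    """Largest substring percent identity PI*(L) of Definition 1: the best
    fraction of Watson-Crick complementary (antiparallel) positions over
    all substring pairs of length L. Brute force, for clarity only."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    best = 0.0
    for i in range(len(x) - L + 1):
        for j in range(len(y) - L + 1):
            hits = sum(x[i + k] == comp[y[j + L - 1 - k]] for k in range(L))
            best = max(best, hits / L)
    return best

def affinity_class(probe, target,
                   PI_hy=0.80, L_hy1=20, L_hy2=25,   # (C1)-(C2), Table 3
                   PI_no=0.75, L_no1=16, L_no2=7):   # (C3)-(C4), Table 3
    """Classify a probe-target pair as 'high', 'low', or 'undetermined'
    according to criteria (C1)-(C4)."""
    n = min(len(probe), len(target))
    c1 = any(pi_star(probe, target, L) >= PI_hy for L in range(L_hy1, n + 1))
    c2 = all(pi_star(probe, target, L) < PI_hy for L in range(L_hy2 + 1, n + 1))
    c3 = all(pi_star(probe, target, L) < PI_no for L in range(L_no1 + 1, n + 1))
    c4 = all(pi_star(probe, target, L) < 1.0 for L in range(L_no2 + 1, n + 1))
    if c1 and c2:
        return "high"
    if c3 and c4:
        return "low"
    return "undetermined"

# Tiny sanity check on made-up strings (not sequences from the paper):
print(pi_star("ACGTACGTACGT", "TACGTACGTACG", 4))  # 1.0: a perfect 4-mer duplex
```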
Figure 5: Microarray readouts. The readouts (a), (b), and (c) correspond to the targets A, B, and C, respectively, with sixteen-hour incubation, while the readout (d) corresponds to the target A with four-hour incubation.
This model may seem an oversimplification for accurate hybridization affinity prediction. However, in our practical experience with small binary CS matrices (Section 1.5), this model functions properly (see Section 2.3). The model error can be formulated mathematically as follows. Let us denote the actual affinity matrix by A, where the entry ai,j is the affinity between the ith probe and the jth target, 1 ≤ i ≤ M and 1 ≤ j ≤ N. Then the entries of the affinity matrix A are approximations of the entries of the binary CS matrix Φ, of the form

ai,j = ϕi,j + εi,j,    (4)

where ϕi,j is either zero-valued or equal to c, and εi,j is the approximation error, which is assumed to take only small values. The physical interpretation of c is given in (9). The values of the ai,j can be calibrated via lab experiments. Furthermore, the reconstruction algorithm can be designed to be robust to the approximation error.

Remark 2. This model can be further refined by introducing weighting factors in the definition of PI. More precisely, the number of positionally matched base pairs can be replaced by a weighted sum, where C-G and A-T pairs are assigned different values. More accurate models, taking into account nearest-neighbor interactions, can be considered as well [23, 24]. These extensions will be considered elsewhere.

2.3. Experimental Calibration of Parameters. Lab experiments were performed to verify our translation criteria (C1)–(C4) and to choose appropriate values for the involved parameters. The microarray chip employed contains 70 spots distributed within seven rows, each row containing 10 identical
spots for the purpose of providing more accurate readouts. The probe DNA sequences in the first six rows, denoted by probes A, B, ..., and F, respectively, are

5′-CCAGCATGTACTTTTTTTCCGGACCTTCCTGGATTTCGCCCGATTTCAAGTTCTCCCCCCATTTTACCTC-3′,
5′-CAGTTCCAGTACCAGATAGCCATCTCCAAGCAAACGTTTTTTTCCTCCTACCTTTTTCCCAACCAGCATG-3′,
5′-TGAAGCATTAGAACGAGAAGAGTTCGGGACACAGCAAGTAATAGAGAGGGTCAGACCATAAGGGAAAACG-3′,
5′-CTCTGGCTGGTTGAAGAAGTAGGAGA-3′,
5′-CAGTAATTCTCCTGTGCCCCGTCCTG-3′,
5′-AGCATGGAGGTTTTCGAGGAGGGAAA-3′.

The last row is a control row, which always gives the maximum fluorescent readout. Here, probes of different lengths are used to test the influence of length on hybridization affinity. The target sequences used in our experiments are

Target A: 5′-ACTTCTTCTGACCCTCCTCGAAAACCAAAAAGAGGGGAGAACTTGAAGGCGATAGAGCTT-3′,
Target B: 5′-GGAAAATAAAGTCTGCCTGGTATGATGGCCGGAGAATTCCTACTCCTTCACAGGGGAATT-3′,
Target C: 5′-GGAGTGTATGAAATCGGCCGAAATCTTATGGTCTGACCCTAAAAATCACGCGCGG-3′.

The probe and target sequences were synthesized by Invitrogen, with the first three probes purified using the PAG
EURASIP Journal on Bioinformatics and Systems Biology (polyacrylamide gel electrophoresis) method, while all other sequences were purified using the high-performance liquid chromatography method (HPLC). The fluorescent tags of the targets are Alexa 532. The experiments proceeded as follows. The first step was to prehybridize our microarray slide. The prehybridization buffer was composed of 49.2 mL TRIS, 300 μL Ethanolamin, and 500 μL SDS. The printed microarray slide was incubated in the prehybridization buffer at 42o C for 20 minutes. In the hybridization step, we used 1× hybridization buffer (50% formamide, 5X SSC, and 0.1% SDS). We dissolved 1 ng target into 22 μL hybridization buffer, and then heated the target liquid to 95o C for two minutes to denature. All 22 μL target liquid was applied to the prehybridized microarray slide. Then the slide was incubated in a 42o C water bath for 16 hours. In the washing step, we needed three wash buffers: a low-stringency wash buffer containing 1× SSC and 0.2% SDS, a high-stringency wash buffer containing 0.1× SSC and 0.2% SDS, and a 0.1× SSC wash buffer. After the incubation, we washed the slide (with coverslip removed) with the low-stringency wash buffer (preheated to 42o C), the high-stringency wash buffer, and the SSC wash buffer successively, by submerging the slide into each buffer and agitating for five minutes. Finally, we dried the slide and read it using an Axon 4000B scanner. The same procedure was repeated for each target. The microarray readouts are depicted in Figure 5. A readout associated with target A with shorten incubation time (four hours) is also included (Figure 4(d)). We study the relationship between these binding patterns and the substring matches. For each probe-target pair, we calculated the corresponding PI∗ (L) for each valid L ∈ Z+ , and the L∗ (PI )s for different PI values. Here, we omit most of these results and only list the most important ones in Table 2. We have the following observations. (1) For all sequence pairs exhibiting significant hybridization level, one must have PI∗ (20) ≥ 0.80. (2) For all sequence pairs of which the microarray readout is weak, we have PI∗ (20) ≤ 0.75. (For the pair of probe A and Target B, PI∗ (20) = 0.75, but the corresponding microarray readout is week.) Consequently, PI∗ (20) may be a critical parameter for deciding whether a probe-target pair hybridizes or not. (3) Among all sequence pairs with weak microarray readouts, the length of the longest contiguous segment is 10 (the pair of probe C and target A). This fact implies that the probe-target pair may not hybridize even when they have a contiguous matched substring of length 10. Based on the above observations, we choose the values of the parameters in the criteria (C1)–(C4) as in Table 3. Here, the values are chosen to allow certain safeguard region. The chosen values are used in our probe-search algorithm (see Sections 3.1 and 3.2). These choices are based on limited experiments, and further experimental calibration/testing is needed to fully verify these parameter choices.
Interestingly, when we reduced the incubation time to four hours, such that full equilibrium had not been achieved, the microarray still gave an accurate readout (see Figure 5(d)). We expect that one can use CSMs in applications for which only short hybridization times are allowed.

2.4. Translating Hybridization Affinity into Microarray Spot Intensity. The hybridization affinity values need to be converted into a form that is physically meaningful and reflective of the spot intensities we observe in an experiment. In the case of a one-spot, one-target scenario, the sensing function takes the form

$$y = \frac{\gamma \alpha x}{\alpha x + \beta} + b + w, \qquad (5)$$

where y is the actual spot intensity we measure for given experimental conditions, γ and β are positive hybridization constants, α is the hybridization affinity, x is the target concentration, b represents the mean background noise, and w denotes the measurement noise, which is often assumed to be Gaussian distributed with mean zero and variance σw² [25, 26]. This model mimics the well-known Langmuir model, with background noise taken into consideration [26, 27]. For the probe-target pairs corresponding to zero entries of Φ (i.e., α is close to zero), the measured intensity can be approximated by

y ≈ b + w.    (6)

Consider the probe-target pairs exhibiting "high" affinities. If the target concentration is small or moderately large, then the microarray readout is approximately

y ≈ (γ/β)αx + b + w.    (7)

When the target concentration is extremely large, the saturation effect becomes dominant and one has

y ≈ γ.    (8)

As a result, in the linear region, the affinity between the ith probe and the jth target is given by

ai,j = c + εi,j ≈ (γi,j/βi,j)αi,j for high affinity;  ai,j ≈ 0 for low affinity.    (9)
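A small numerical sketch of the Langmuir-like response (5) and its two regimes (7)-(8) follows; the constants γ, β, b, and σw below are illustrative placeholders rather than calibrated values.

```python
import numpy as np

def spot_intensity(x, alpha, gamma=1000.0, beta=25.0, b=50.0, sigma_w=2.0,
                   rng=np.random.default_rng(1)):
    """Langmuir-like readout of (5): y = gamma*alpha*x/(alpha*x + beta) + b + w.
    The constants gamma, beta, b, sigma_w are illustrative, not calibrated."""
    return gamma * alpha * x / (alpha * x + beta) + b + rng.normal(0.0, sigma_w)

# Small/moderate concentration: near the linear regime (7),
# y ~ (gamma/beta)*alpha*x + b = 4 + 50 for x = 0.1.
print(spot_intensity(x=0.1, alpha=1.0))
# Extremely large concentration: the saturation regime (8), y -> gamma
# (plus the background b in this parameterization).
print(spot_intensity(x=1e6, alpha=1.0))
```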
Table 2: Best match substring data. The values in the parentheses, from left to right, are L∗(1.00), PI∗(16), and PI∗(20). The probe-target pairs corresponding to the bold-font entries exhibit significant microarray readout.

Target \ Probe    A                 B                 C                 D                 E                 F
A                 (14, 0.94, 0.90)  (06, 0.69, 0.60)  (10, 0.69, 0.60)  (08, 0.63, 0.60)  (06, 0.56, 0.45)  (15, 0.94, 0.80)
B                 (06, 0.75, 0.75)  (06, 0.81, 0.80)  (05, 0.63, 0.60)  (07, 0.75, 0.65)  (08, 0.69, 0.60)  (05, 0.56, 0.45)
C                 (09, 0.94, 0.80)  (05, 0.63, 0.55)  (16, 1.00, 0.80)  (04, 0.56, 0.45)  (04, 0.50, 0.45)  (05, 0.56, 0.50)

Table 3: Chosen values of the parameters in the criteria (C1)–(C4).

Parameter    PI,hy    Lhy,1    Lhy,2    PI,no    Lno,1    Lno,2
Value        0.80     20       25       0.75     16       7

3. Search for Appropriate Probes

3.1. Probe Design Algorithm. We next describe an iterative algorithm for finding probe sequences that satisfy a predefined set of binding patterns, that is, sequences that can serve as CS probes. The design problem is illustrated by the following example. Suppose that we are dealing with three targets, labeled T1, T2, and T3, and that the binding pattern of the probe and targets is such that the probe is supposed to bind
with targets T1 and T2, but not with target T3. Assume next that the hybridization affinities between a candidate probe and targets T1 and T2 are too small, while the hybridization affinity between the probe and target T3 is too large. In order to meet the desired binding pattern, we need to change some nucleotide bases of the probe sequence. For example, consider a particular aligned position of the probe and the targets, where the corresponding bases of the probe and of targets T1, T2, and T3 equal "T," "T," "A," and "A," respectively. In this case, from the perspective of target T1, the base "T" of the probe should be changed to "A," while from the perspective of target T3, this "T" base should be changed to any base other than "T." On the other hand, for target T2 to exhibit strong hybridization affinity with the probe, the identity of the corresponding probe base should be kept intact. As different preferences arise from the perspectives of different targets, it is not clear whether the base under consideration should be changed or not. We address this problem by using a sequence voting mechanism. For each position in the probe sequence, one has four base choices: "A," "T," "C," and "G." Each target is allowed to "cast its vote" for its preferred base choice. The final decision is made by counting the votes from all targets. More specifically, we propose a design parameter, termed the preference value (PV), to implement our voting mechanism. For a given pair of probe and target sequences, a unique PV is assigned to each base choice at each position of the probe. We design four rules for PV assignment.

(1) If the target "prefers" the current probe base left unchanged, a positive PV is assigned to the corresponding base choice.

(2) If, from the perspective of the target, the current probe base should be changed to another specific base, then the original base choice is assigned a negative PV while the intended base choice is assigned a positive PV.

(3) If the current base should be changed to any other base, then the corresponding base choice is assigned a negative PV while the other base choices are assigned a zero PV.

(4) Finally, if a base choice is not covered by the above three rules, a zero PV is assigned to it.
The specific magnitudes of the nonzero PVs are chosen according to the significance of the potential impact on the hybridization affinity between the considered target and probe. The details of this PV assignment are highly technical and therefore omitted; the interested reader is referred to our software tool [28] for a detailed implementation of the PV computation algorithm. After PV assignment, we calculate the so-called accumulated PV (APV). For a given base choice at a given position of the probe, the corresponding APV is the sum of all the PVs associated with this choice. The APV is used in our algorithm as an indicator of the influence of a base change: bases associated with negative APVs are deemed undesirable and should therefore be changed; if the current base of the probe is associated with a positive APV, one would like to leave this base unchanged; and if a base choice different from the current base of the probe has a positive APV, one should change the current base to this new choice. It is worth pointing out the "partly" random nature of the algorithm. In step 5 of our algorithm, whether the current base at a given position is changed, and which base it is changed to, are decided randomly. The probabilities with which the current base is changed, and with which a specific base is selected to replace it, are related to the magnitudes of the associated APVs. The implementation details behind this randomization mechanism are omitted, but can be found in [28]. This random component helps in avoiding "dead traps" that may occur in deterministic algorithms. As an illustrative example, suppose that the intended binding pattern between a probe and all targets except target 1 is satisfied in a given iteration. From the perspective of target 1, the first base of the probe should be changed from "T" to "C." In a deterministic approach, a base replacement must be performed following this preference exactly. However, this base change breaks the desired hybridization pattern between the probe and target 2. In the next iteration, according to the perspective of target 2, the first base of the probe has to be changed back to "T." As a result, this probe base "oscillates" between the two choices "T" and "C," and the algorithm falls into a "dead trap." In contrast, due to the randomization mechanism in our algorithm, there is a certain probability that a base change does not follow exactly what seems necessary. Dead traps can thus be prevented, or escaped from once they happen. The algorithm is repeated as many times as the number of probes. A toy sketch of the voting step is given below.
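The following toy sketch mimics the voting mechanism at a single probe position: per-target PVs are accumulated into APVs, and a replacement base is drawn with probability increasing in its APV. The PV magnitudes and the softmax-style sampling rule are our own illustrative choices; the actual assignment and randomization rules are implemented in the authors' software tool [28].

```python
import math
import random

BASES = "ATCG"

def accumulate_pvs(pv_votes):
    """Sum per-target preference values (PVs) into accumulated PVs (APVs)
    for the four base choices at one probe position."""
    apv = {b: 0.0 for b in BASES}
    for vote in pv_votes:              # one dict of PVs per target
        for base, pv in vote.items():
            apv[base] += pv
    return apv

def sample_base(apv, temperature=1.0, rng=random.Random(0)):
    """Randomized base choice: options with larger APV are made more
    probable (a softmax-style weighting; the rule in [28] may differ)."""
    weights = [math.exp(apv[b] / temperature) for b in BASES]
    return rng.choices(BASES, weights=weights, k=1)[0]

# The scenario from the text: probe base "T"; target 1 wants "T" -> "A",
# target 2 wants "T" kept, target 3 wants "T" changed to anything else.
# The +1/-1 PV magnitudes are illustrative.
votes = [
    {"T": -1.0, "A": +1.0},   # target 1
    {"T": +1.0},              # target 2
    {"T": -1.0},              # target 3
]
apv = accumulate_pvs(votes)
print(apv, "->", sample_base(apv))
```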
3.2. Toy Probe Design Example for Φ3×7. We describe a proof-of-concept small-scale CSM example. In this example, we have seven target sequences of length 55, listed in Table 4. Also listed are the seven unicellular organisms from which the target sequences are spliced, and the specific genome positions of the targets. Here, we follow the notation convention used by the Kyoto Encyclopedia of Genes and Genomes (KEGG). Given the targets, our goal is to design a CSM with three probes that mimics a [3, 4, 7] Hamming code. The corresponding CS matrix is given as

$$\Phi = c\begin{bmatrix} 1 & 0 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \end{bmatrix}. \qquad (10)$$
In the probe-design process, we use the criteria (C1)– (C4) to decide whether a probe-target pair satisfies the corresponding hybridization requirements encoded in the CS matrix (10). The parameters are set according to Table 3. The probe design algorithm (Algorithm 1) for probe selection produced the following outcomes. Probe 1: 5 -AAGAATCTGGCCACTCTCCGTAGATAACAG GAAGCTCTCTTGCCACCATTACCGCTCCTCCTCCGTATAT-3 , Probe 2: 5 -TCACCGCCCCGCTGGTCGATTCTGGCATAG CACTGAGTCCTGAAGCAGGCTTTCTCTCTCATCAATAAAA-3 , Probe 3: 5 -GAGGAAGTGTGTGGGCTTGCCTTCTTGCCG TCTCTTACCGCCCCAGGGCCGCTTATTTTCAGATAATTAT-3 . The GC contents for these three probes are 50%, 51.4%, and 51.4%, respectively. The GC contents of the sequences should be of similar value to ensure similar melting temperatures for the duplexes. The secondary structures of these probes can be predicted by using the m-fold package [29] and are depicted in Figure 6. As one can see, all folds have sufficiently long unmatched regions that can hybridize to the targets. A list of the best matched lengths of the probes and targets is listed in Table 5. According to this table, all probetarget pairs corresponding to entries one of matrix (10) satisfy criteria (C1) and (C2), while all probe-target pairs corresponding to entries zero of matrix (10) satisfy criteria (C3) and (C4). The designed CSM mimics the binary CS matrix (10).
4. CSM Signal Recovery

The final step of a CSM process is to estimate the target concentrations from the microarray readout. Recalling the signal acquisition model in (5), a signal recovery algorithm specifically designed for CSMs has to take into account the measurement nonlinearity. Compared to other CS signal recovery methods, belief propagation (BP) is the best suited to incorporating nonlinear measurements. It has been shown that a CS measurement matrix Φ can be represented as a bipartite graph of signal coefficient nodes xj and measurement nodes yi [5, 12].
Input: The N target sequences, and the row of the intended binding matrix Φ corresponding to the chosen probe.
Initialization: Randomly generate multiple candidates for the probe under consideration. For each candidate, perform the following iterative sequence update procedure.
Iteration:
(1) Check the probe's GC content. If the GC content is too low, randomly change some "A" or "T" bases to "G" or "C" bases, and vice versa. The GC content after the base changes must satisfy the GC content requirement.
(2) Check whether the probe sequence satisfies the intended binding pattern. If yes, quit the iterations. If not, go to the next step.
(3) If an appropriate probe has not been found after a large number of iterations, report a failure and quit the iterations.
(4) For each of the N targets, calculate the PV associated with each of the base choices at each position of the probe. Then calculate the APVs.
(5) Randomly change some bases of the probe sequence, so that a potential change associated with a larger APV increment is made more probable.
(6) Go back to Step 1.
Completion: Check the loop information in the secondary structures of all the surviving probe candidates. Choose the probe with the fewest loops. If more than one such probe exists, randomly choose one of the probes with the shortest loop length.
Output: The probe sequence.

Algorithm 1: Probe design for CSMs.
When Φ is sparse enough, BP can be applied, and we are able to approximate the marginal distributions of each of the xj coefficients conditioned on the observed data. (Note that the Hamming code matrix Φ is not sparse. Still, one can use simple "sparsifying" techniques to modify Φ for decoding purposes only [30].) We can then compute the MLE, MMSE, and MAP estimates of the coefficients from their distributions (we refer to [5, 12] for details). In the context of DNA array decoding, we are given the measurement intensities of the spots in the CS microarray, and want to recover the target concentrations xj in our test sample. If we abstract the nonlinearity as T(·), and the linear combination of gene concentrations as L[·], we can represent the ith spot intensity as
yi = T(L[x1, ..., xn]) + wi,    (11)

where wi ∼ N(0, σw²) is the Gaussian distributed measurement noise. To tailor CS decoding by BP to the nonlinear case, we account for the nonlinearity T(·) through additional variable nodes, and for the measurement noise through noise constraint nodes. The factor graph in Figure 7 represents the relationship between the signal coefficients and measurements in the CS decoding problem for nonlinear measurement intensities T(L[x]) in the presence of measurement noise.
Table 4: The target nucleotide sequences.

Target 1: 5′-GATATGAAATGGGCGGACCAGAGTTTATAGTTATCTACGGGAGAAGGAGAGTGGG-3′
  From Methanothermobacter thermautotrophicus (Mth); genome position: complement(142033···142087)
Target 2: 5′-GATGCTGTGATGGAGGGACTGTTTCAAGATGGAGTGCTATGCAAATAGGGATGAG-3′
  From Methanococcus jannaschii (Mja); genome position: (77481···77535)
Target 3: 5′-AGCTTTCCCTCCTCGAAAACCTCCATGCTGAAGGCAAGCCCAAACTGATCCTCCT-3′
  From Methanosarcina acetivorans str. C2A (Mac); genome position: (59910···59964)
Target 4: 5′-AGGGATCTATCTGTTAGCTGAGGAGAGTGAAACCGTTCTTGAGGACTTCTCTGAG-3′
  From Pyrococcus horikoshii (Pab); genome position: complement(1122252···1122306)
Target 5: 5′-TGTTCACGAAGTTGACAATCTGAGGGAAACTACCTACGGGGCGGTGAGAGACGAG-3′
  From Archaeoglobus fulgidus (Afu); genome position: complement(365030···365084)
Target 6: 5′-TATTTCAAGGACTTTCGCAAATACGCGGAGCTGGAGCGGTTGTGGTCGCAGTACG-3′
  From Methanopyrus kandleri AV19 (Mka); genome position: complement(1007480···1007534)
Target 7: 5′-AGGCAAAAGATGGCAAGAAAGCCTCCCCACATACTCATTACCACGCCAGAATCAT-3′
  From Thermoplasma volcanium (Tvo); genome position: (636571···636625)
Figure 6: Secondary structures of the three probes in the toy example. The predicted structures, from left to right, correspond to probes 1, 2, and 3 (dG = −0.85, −2.3, and −2.57, respectively).

Figure 7: Factor graph depicting the relationship between the variables involved in CS decoding of the nonlinear intensities, yi = T(L[x]) + noise. Variable nodes are black and the constraint nodes are white.
4.1. Extracting the Signal from Nonlinear Measurements. Due to saturation effects in the intensity response of the microarray, the nonlinearity acts on L[x] so that the recorded measurements never exceed y = γ. We note that, because of the measurement noise, the solution is not as simple as inverting the nonlinearity and then applying BP for CS reconstruction. Our goal is to determine the probability distribution of L[x] at all values that the true signal x_i can take on a grid of sample points, using the measurement intensities y_1, ..., y_m as constraints. The problem then reduces to solving the regular CS signal recovery problem using BP [5]. Instead of inverse-mapping T to find P[L[x]], we can calculate the equivalent probabilities of the transformed distribution P[T(L[x]) = y'] by mapping the required sample points of the x distribution to transformed points y'. At the ith measurement node y_i, T(L[x]) = y_i − w_i; the latter probability masses can be picked out at the desired y' points. By construction, none of the values of y_i − w_i is evaluated at y' values that exceed γ. The inverse function is then well defined, and we can calculate the probability masses of L[x] from those of T(L[x]). The problem thus reduces to the regular BP solution for CS reconstruction. This procedure is repeated at each constraint node y_i.
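A minimal sketch of this mass-mapping step at a single constraint node, assuming a scalar grid for L[x] and Gaussian noise of standard deviation σ_w, might read:

    import numpy as np

    def masses_of_L(y_obs, grid_L, T, sigma_w):
        # Mass-mapping at one constraint node y_i: transform the grid for
        # L[x] to y' = T(grid_L), evaluate the masses of y_i - w_i (a
        # Gaussian centred at the observed intensity) at those y' points,
        # and read them back as masses of L[x] on grid_L.
        y_prime = T(grid_L)
        masses = np.exp(-0.5 * ((y_obs - y_prime) / sigma_w) ** 2)
        return masses / masses.sum()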
In summary, to "invert" the nonlinearity:
(1) Transform the sample points x by applying T(L[·]) to get y'.
(2) At the ith measurement node y_i, obtain the probability distribution of T(L[x]), which is equivalent to the distribution of y_i − w_i.
(3) Evaluate the probability masses of y_i − w_i at the sample grid points y'.
(4) Calculate the probability masses of L[x] from those of T(L[x]) by applying the inverse function T^{-1}.
(5) Apply BP for CS decoding as in [5].

Table 5: The best matched lengths of the probes and targets. The three integers in parentheses are, from left to right, L*(0.8), L*(0.75), and L*(1.00), respectively. The probe-target pairs corresponding to the bold-font entries are designed to have large affinities.

         Target 1      Target 2      Target 3      Target 4      Target 5      Target 6      Target 7
Probe 1  (21, 24, 11)  (11, 13, 05)  (10, 10, 06)  (20, 29, 08)  (11, 13, 06)  (25, 30, 08)  (21, 24, 08)
Probe 2  (08, 09, 06)  (20, 28, 10)  (10, 12, 05)  (25, 30, 06)  (22, 24, 11)  (08, 09, 06)  (21, 22, 09)
Probe 3  (11, 13, 06)  (10, 12, 05)  (25, 26, 13)  (10, 10, 06)  (20, 21, 08)  (22, 25, 05)  (21, 34, 08)

4.2. Numerical Results. Since the experimental data is currently of relatively small scale, we apply the designed BP algorithm to a set of synthetic data to test the proposed concept. In the computer simulations, we assume that the sparsity of the target concentration signal is 10%. Figure 8 shows the normalized L2 reconstruction error of the signal against the number of measurements (i.e., DNA spots) for our nonlinearly modified BP algorithm and for the regular BP decoding algorithm that ignores the nonlinearity. By taking the nonlinearity into account and reversing it during the decoding process, as our modified algorithm does, the L2 decoding error converges to a smaller value than if we had ignored it. It is important to note that BP appears to be the only CS reconstruction technique that not only meets the requirements of speed in decoding but can also incorporate the nonlinearity in the measurement prior with ease.

Figure 8: Normalized L2 reconstruction error ||x − x_recon||_2 / ||x||_2 versus number of measurements (M) for nonlinear BP decoding and for BP that ignores the nonlinearity. Number of signal coefficients N = 200; α = β = 25; σ_y = 2.

5. Conclusion

We study how to design a microarray suitable for compressive sensing. A hybridization model is proposed to predict whether given CS probes mimic the behavior of a binary CS matrix, and algorithms are designed, respectively, to find probe sequences satisfying the binding requirements and to compute the target concentrations from the measurement intensities. Laboratory calibration of the model and a small-scale CSM design result are presented.
Acknowledgments This work was supported by NSF Grants CCF 0821910 and CCF 0809895. The authors also gratefully acknowledge many useful discussions with Xiaorong Wu from the University of Colorado at Denver School of Medicine.
References
[1] Affymetrix microarrays, http://www.affymetrix.com/products/arrays/specific/cexpress.affx.
[2] J. W. Taylor, E. Turner, J. P. Townsend, J. R. Dettman, and D. Jacobson, "Eukaryotic microbes, species recognition and the geographic limits of species: examples from the kingdom Fungi," Philosophical Transactions of the Royal Society B, vol. 361, no. 1475, pp. 1947–1963, 2006.
[3] E. J. Candès and T. Tao, "Decoding by linear programming," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[4] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[5] S. Sarvotham, D. Baron, and R. Baraniuk, "Compressed sensing reconstruction via belief propagation," preprint, 2006, http://www.dsp.ece.rice.edu/cs/csbpTR07142006.pdf.
[6] J. A. Tropp, "Greed is good: algorithmic results for sparse approximation," IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
[7] W. Dai and O. Milenkovic, "Subspace pursuit for compressive sensing: closing the gap between performance and complexity," submitted to IEEE Transactions on Information Theory, http://arxiv.org/abs/0803.0811.
[8] D. Wang, A. Urisman, Y.-T. Liu, et al., "Viral discovery and sequence recovery using DNA microarrays," PLoS Biology, vol. 1, no. 2, article e2, pp. 1–4, 2003.
[9] A. Schliep, D. C. Torney, and S. Rahmann, "Group testing with DNA chips: generating designs and decoding experiments," in Proceedings of the Computational Systems Bioinformatics Conference (CSB '03), vol. 2, pp. 84–91, Stanford, Calif, USA, August 2003.
[10] A. J. Macula, A. Schliep, M. A. Bishop, and T. E. Renz, "New, improved, and practical k-stem sequence similarity measures for probe design," Journal of Computational Biology, vol. 15, no. 5, pp. 525–534, 2008.
[11] D. Z. Du and F. K. Hwang, Combinatorial Group Testing and Its Applications, World Scientific, Singapore, 2000.
[12] M. A. Sheikh, S. Sarvotham, O. Milenkovic, and R. G. Baraniuk, "DNA array decoding from nonlinear measurements by belief propagation," in Proceedings of the 14th IEEE/SP Workshop on Statistical Signal Processing (SSP '07), pp. 215–219, Madison, Wis, USA, August 2007.
[13] I. Shmulevich, J. Astola, D. Cogdell, S. R. Hamilton, and W. Zhang, "Data extraction from composite oligonucleotide microarrays," Nucleic Acids Research, vol. 31, no. 7, article e36, pp. 1–5, 2003.
[14] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[15] T. R. Gregory, "Macroevolution, hierarchy theory, and the C-value enigma," Paleobiology, vol. 30, no. 2, pp. 179–202, 2004.
[16] R. A. DeVore, "Deterministic constructions of compressed sensing matrices," Journal of Complexity, vol. 23, no. 4–6, pp. 918–925, 2007.
[17] R. Berinde and P. Indyk, "Sparse recovery using sparse random matrices," preprint, 2008, http://people.csail.mit.edu/indyk/report.pdf.
[18] Y. A. Chen, C.-C. Chou, X. Lu, et al., "A multivariate prediction model for microarray cross-hybridization," BMC Bioinformatics, vol. 7, article 101, pp. 1–12, 2006.
[19] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
[20] Matlab Bioinformatics Toolbox—Exploring Primer Design Demo, http://www.mathworks.com/applications/compbio/demos.html?file=/products/demos/shipping/bioinfo/primerdemo.html.
[21] W. Xu, S. Bak, A. Decker, S. M. Paquette, R. Feyereisen, and D. W. Galbraith, "Microarray-based analysis of gene expression in very large gene families: the cytochrome P450 gene superfamily of Arabidopsis thaliana," Gene, vol. 272, no. 1-2, pp. 61–74, 2001.
[22] E. Khomyakova, M. A. Livshits, M.-C. Steinhauser, et al., "On-chip hybridization kinetics for optimization of gene expression experiments," BioTechniques, vol. 44, no. 1, pp. 109–117, 2008.
[23] K. J. Breslauer, R. Frank, H. Blocker, and L. A. Marky, "Predicting DNA duplex stability from the base sequence," Proceedings of the National Academy of Sciences of the United States of America, vol. 83, no. 11, pp. 3746–3750, 1986.
[24] O. Milenkovic and N. Kashyap, "DNA codes that avoid secondary structures," in Proceedings of the IEEE International Symposium on Information Theory (ISIT '05), pp. 288–292, Adelaide, Australia, September 2005.
[25] B. P. Durbin, J. S. Hardin, D. M. Hawkins, and D. M. Rocke, "A variance-stabilizing transformation for gene-expression microarray data," Bioinformatics, vol. 18, supplement 1, pp. S105–S110, 2002.
[26] D. Hekstra, A. R. Taussig, M. Magnasco, and F. Naef, "Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays," Nucleic Acids Research, vol. 31, no. 7, pp. 1962–1968, 2003.
[27] M. D. Kane, T. A. Jatkoe, C. R. Stumpf, J. Lu, J. D. Thomas, and S. J. Madore, "Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays," Nucleic Acids Research, vol. 28, no. 22, pp. 4552–4557, 2000.
[28] Matlab codes for probe design in CSMs.
[29] M. Zuker, "Mfold web server for nucleic acid folding and hybridization prediction," Nucleic Acids Research, vol. 31, no. 13, pp. 3406–3415, 2003.
[30] V. Kumar and O. Milenkovic, "On graphical representations of algebraic codes suitable for iterative decoding," IEEE Communications Letters, vol. 9, no. 8, pp. 729–731, 2005.
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 717136, 14 pages doi:10.1155/2009/717136
Research Article How to Improve Postgenomic Knowledge Discovery Using Imputation Muhammad Shoaib B. Sehgal,1, 2 Iqbal Gondal,3 Laurence S. Dooley,4 and Ross Coppel2, 5 1 ARC
Centre of Excellence in Bioinformatics, Institute for Molecular Bioscience (IMB), University of Queensland, St Lucia, QLD 4067, Australia 2 Victorian Bioinformatics Consortium, Monash University, VIC 3800, Australia 3 Gippsland School of Information Technology (GSIT), Faculty of Information Technology, Monash University, Churchill, VIC 3842, Australia 4 Faculty of Mathematics, Computing and Technology, The Open University, Milton Keynes MK7 6BJ, UK 5 Department of Microbiology, Monash University, VIC 3800, Australia Correspondence should be addressed to Iqbal Gondal,
[email protected] Received 28 February 2008; Revised 8 September 2008; Accepted 4 November 2008 Recommended by Erchin Serpedin While microarrays make it feasible to rapidly investigate many complex biological problems, their multistep fabrication has the proclivity for error at every stage. The standard tactic has been to either ignore or regard erroneous gene readings as missing values, though this assumption can exert a major influence upon postgenomic knowledge discovery methods like gene selection and gene regulatory network (GRN) reconstruction. This has been the catalyst for a raft of new flexible imputation algorithms including local least square impute and the recent heuristic collateral missing value imputation, which exploit the biological transactional behaviour of functionally correlated genes to afford accurate missing value estimation. This paper examines the influence of missing value imputation techniques upon postgenomic knowledge inference methods with results for various algorithms consistently corroborating that instead of ignoring missing values, recycling microarray data by flexible and robust imputation can provide substantial performance benefits for subsequent downstream procedures. Copyright © 2009 Muhammad Shoaib B. Sehgal et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
The study of genes and their transactional relationships with other genes can be modelled using machine learning algorithms in a diverse range of applications, from disease analysis [1] and drug progression for target diseases [2] through evolutionary study [3] and comparative genomics [4], all of which are characterised by the use of microarray gene expression data. The statistical analysis of microarray datasets depends highly upon the accuracy of the gene expression methods. Microarray production is a complex process, whereby samples are prepared for differential expression in a series of stages involving the laying of specimens on the slides by a robotic arm, imaging of the slides, and finally determining the numerical gene expression values. Each step inevitably exhibits a propensity for error [5]; a corollary is the presence of erroneous gene expression values for certain genes, which are popularly referred to as missing values. While microarray technology is continually being refined, there is an enormous amount of public domain gene expression data available that frequently contains at least 5% erroneous spots. Indeed, in many datasets, at least 60% of genes have one or more missing values [6], which can seriously impact subsequent data analysis involving, for example, significant gene selection, gene regulatory network (GRN) reconstruction, and clustering algorithms [7, 8]. The simplest ways to address this problem are to either repeat the experiment, though this is often not feasible for economic reasons, or ignore those samples containing missing values, which is also not recommended because of the limited number of available samples. Alternative strategies include row average/median imputation (substitution
by the corresponding row average/median value) and the ubiquitous ZeroImpute, where missing values are replaced by zero. Both approaches have high variance, with neither exploiting the underlying data correlations, which can lead to higher estimation errors [9]. The prevailing wisdom is to accurately estimate missing values by exploiting the latent correlation structure of the microarray data [8, 10], as manifested by the development of numerous microarray imputation techniques, including collateral missing value estimation (CMVE) [11], singular value decomposition impute (SVDImpute) [9], K-nearest neighbour (KNN) [9], least square impute (LSImpute) [10], local LSImpute (LLSImpute) [8], Bayesian principal component analysis (BPCA) [12], a set-theoretic framework based on projection onto convex sets (POCSImpute) [13], and, most recently, heuristic collateral missing value imputation (HCMVI) [14]. In addition, other methods which use contextual information include gene ontology-based imputation (GOImpute) [15] and a metadata-based imputation technique [16]. This paper will investigate the gene expression correlation assumption by empirically analysing different postgenomic knowledge discovery methods, including gene selection and GRN reconstruction techniques, in the presence of missing values, specifically for the breast and ovarian cancer datasets of Hedenfalk et al. [17] and Jazaeri et al. [18], respectively. The rationale for choosing these two datasets is that cancerous data [19] generally lacks molecular homogeneity in tumour tissues, which makes missing value estimation far more challenging. Additionally, breast cancer is the second leading cause of cancer death in women today (following lung cancer), with 1 in 11 Australian women being diagnosed with the disease before the age of 75, and the number of breast cancer patients increasing every day as diagnosis methods improve [20]. Ovarian cancer is the fourth most common cause of cancer-related deaths in American women of all ages, as well as being the most prevalent cause of death from gynaecologic malignancies in the United States [21]. Figure 1 displays a generic postgenomic knowledge inference framework, with the DNA sample being firstly converted to expression values prior to any knowledge inference being undertaken. As highlighted earlier, this phase (STEP 1 in Figure 1) can introduce several erroneous (missing) values that can significantly impact upon any subsequent analysis. Unfortunately, while there have been many propitious imputation algorithmic contributions (STEP 2), there is still the pervading fallacy that either new data analysis methods will successfully manage missing values or, more seriously, that missing values in fact do not impact appreciably upon downstream analysis [22]. Interestingly, even though there have been some attempts to test the impact of imputation on clustering methods [23, 24], no comprehensive single study has been undertaken to date to analyse the impact missing values can have on different postgenomic knowledge discovery methods like gene selection, class prediction, clustering of functionally related genes, and GRN reconstruction (STEP 4).
Figure 1: A schematic representation of the postgenomic knowledge discovery framework (Step 1: microarray glass slide to gene expression data; Step 2: imputation; Step 3: imputed matrix; Step 4: knowledge discovery methods, namely gene selection, clustering, class prediction, and GRN reconstruction).
This paper cogently argues that imputation is both an integral and indeed mandatory preprocessing step (STEP 2) prior to applying any knowledge discovery method (STEP 4). This judgement is justified by analysing various results which consistently reveal improved estimation accuracy when missing values are approximated by more flexible approaches such as HCMVI and LLSImpute (STEP 3), because of their innate ability to preserve the variance of the data compared to other popular, if simpler, high-variance methods. Aside from the obvious numerical relevance of missing value estimation, another key driver is the biological significance of imputation, particularly algorithmic performance in estimating significant genes in microarray data that may be erroneously affected. Plakophilin 2 (PKP2), for example, is present in breast carcinoma cell lines [25] and is significant as it serves as a marker for the identification and characterisation of carcinomas derived from or corresponding to simple and complex epithelia [26]. As will be witnessed in Section 6, PKP2 is often not selected by gene selection methods when missing values are present and so would generally be either ignored or replaced when conventional estimation methods are applied. By judiciously employing a flexible imputation strategy such as HCMVI, however, the probability that these genes are correctly selected can be significantly enhanced. Similarly, GRN reconstruction performance may be significantly influenced by missing values, with a substantial number of vital coregulation links being neglected when imputing by traditional and contemporary methods (Sections 3 and 4). The interaction in breast cancer data between ADP-ribosylation factor 3 and estrogen sulfotransferase (EST), which is similar to the NSAP1 protein, is, for instance, consistently overlooked when missing values are introduced, though it has been successfully reconstructed using flexible imputation methods
(Section 5). In both scenarios, accurate imputation crucially eliminates the need to repeat an experiment, which can be costly and may be pragmatically infeasible. This paper presents a treatise on existing imputation methods by examining their performance in managing microarray dataset missing values to improve postgenomic knowledge discovery. Concomitant with analysing the numerical accuracy of imputation, the biological significance of two proteins, namely, KIAA1025 and MHC, from the breast and ovarian cancer datasets, respectively, is analysed because of their acknowledged importance in diagnosing the different cancer types [27–29]. The remainder of the paper is organised as follows. After formally defining the nomenclature, Sections 3, 4, and 5, respectively, review the gamut of traditional, contemporary, and flexible microarray missing value imputation algorithms together with their particular merits and limitations. A reflective analysis of a series of experiments performed on various breast and ovarian cancer microarray datasets, including both statistical and biological significance interpretations, is then presented in Section 6, while conclusions are provided in Section 7.
2. Nomenclature
The convention adopted in all the imputation strategies is to assume that the gene expression matrix Y has m rows and n columns, where the rows and columns represent genes and samples, respectively, as in (1). A missing value in the gene expression data Y for gene i and sample j is formally expressed as Y_ij, where

Y = [ g_11  g_12  g_13  ···  g_1n
      g_21  g_22  g_23  ···  g_2n
        ·     ·     ·   ···    ·
      g_m1  g_m2  g_m3  ···  g_mn ]  ∈ R^{m×n}.    (1)
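As a small illustration of this convention, missing entries Y_ij can be encoded as NaN in a numerical array:

    import numpy as np

    # Toy gene expression matrix Y in R^{m x n} (m genes, n samples),
    # with missing values Y_ij encoded as NaN, following (1).
    Y = np.array([[ 0.21,   0.12, np.nan],
                  [-0.15,   0.02,  0.03],
                  [ 0.01, np.nan,  0.07]])
    genes_with_missing = np.where(np.isnan(Y).any(axis=1))[0]
    print(genes_with_missing)  # indices of genes containing missing values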
Imputation strategies have been broadly classified into three categories: traditional, contemporary, and flexible techniques. Original imputation approaches, which replace a missing value by either zero or row/column mean, are designated as traditional, as they are simple and computationally efficient, but do not take advantage of any latent correlation within the data. Contemporary techniques subsequently evolved to improve the estimation accuracy by using inherent data correlations, usually under the assumption that the causal correlation structure is either localised or globalised. They are also characterised by using a fixed number of predictor genes in the estimation which limits the flexibility to fully exploit any data correlations. This was the incentive for the most recent family of flexible imputation methods which are able to freely adapt to the data distribution by automatically determining the optimal number of predictor genes, thereby minimising the impact of missing values on subsequent biological analysis. In the following sections, these three imputation categories are, respectively, reviewed.
3. Traditional Imputation Techniques for Microarray Data
These are broadly characterised by replacing the expression values of those genes that possess missing values by zero, by their gene/sample mean or median, or, in certain cases, by using the well-known KNN method. The advantages and disadvantages of these popular approaches are now discussed.
3.1. ZeroImpute and Mean/Median Imputation. In these methods, missing values are, respectively, replaced either by zero (ZeroImpute) or by the gene/sample average [30] and/or median. The attraction is their simplicity and computational efficiency, though none take advantage of the underlying correlation structure of the data, with the consequence that the data variance is generally high. This means that when there are a large number of missing values present in the microarray data, these imputation strategies can significantly compromise subsequent postgenomic analysis. The impact, however, can be reduced by adapting the estimation parameters to the underlying correlation structure of the data, with the following sections examining some well-established methods.
3.2. Singular Value Decomposition-Based Imputation (SVDImpute). This uses the combination of singular value decomposition (SVD) [9] and expectation maximization (EM) [31] to estimate the missing values by calculating mutually orthogonal expression patterns often referred to as Eigen genes. As SVD calculations require the entire matrix, missing values are replaced by their row mean prior to the k most effective Eigen genes being selected according to their corresponding Eigen values. The imputed missing value estimate for Y_ij is then calculated by regressing g_i against the k most effective Eigen genes, with expression values from sample j, which contains the missing value, being ignored. SVDImpute reduces imputation errors by recursively estimating the missing values using the EM algorithm until the change in the matrices becomes less than an empirically determined threshold, nominally 0.01 [9]. The technique performs best when 20% of the Eigen genes are used for estimation, and while it is a better strategy than high-variance approaches like ZeroImpute, it has the drawbacks of both being highly sensitive to noise and only considering global data correlations, which inevitably leads to higher estimation errors in locally correlated datasets.
3.3. K-Nearest Neighbour (KNN) Estimation. KNN [9] estimates missing values by searching for the k nearest genes, normally by applying the Euclidean distance, and then taking the weighted average of these k genes. The k genes whose expression vectors are most similar to the genetic expression values in all samples, except the sample which contains the missing value, are selected. The similarity measure between gene g_i and the other genes is then determined by the Euclidean distance over the observed components in sample j, and the missing value is estimated as the weighted average of the
corresponding entries in the selected k expression vectors, where the contribution of every gene is scaled by the similarity of its expression to g_i. While KNN is flexible in terms of the choice of similarity measure, the performance of a specific metric is data dependent. Troyanskaya et al. [9] demonstrated that the Euclidean distance performs better than other similarity measures for microarray data, and though it is highly sensitive to microarray data outliers, log-transforming the data can significantly reduce their effect in determining gene similarity. The choice of an appropriate k value especially influences imputation performance. Experimental results have established that for small datasets k = 10 is the best choice [7], while Troyanskaya et al. [9] observed that KNN is insensitive to values of k in the range 10 to 20. The key point to emphasise is that, regardless of the underlying structure of the microarray data, a preset value of k is employed, which clearly does not fully harness the capability of an imputation method. A much more creative strategy is to endeavour to automatically determine the best k value from the data correlation structure, which is the fundamental premise of the two flexible imputation techniques described in Section 5. Summarising, while traditional algorithms have been widely adopted, the inherently high data variance has a major impact on downstream analysis methods like significant gene selection, class prediction, and GRN reconstruction. To relax this restriction, more robust techniques have evolved in an attempt to garner superior performance in terms of estimation accuracy, although, as will be witnessed, they still exhibit some limitations, most notably from a biological significance perspective. Section 4 focuses on some of the most well-established contemporary imputation approaches.
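A bare-bones sketch of KNN imputation along these lines, weighting neighbours by inverse Euclidean distance (one common convention, not necessarily the exact weighting of [9]), is:

    import numpy as np

    def knn_impute(Y, i, j, k=10):
        # Estimate Y[i, j] (NaN) from the k nearest genes that are
        # observed in sample j; distances use the columns where both
        # genes are observed, excluding sample j itself.
        cols = [c for c in range(Y.shape[1]) if c != j]
        target = Y[i, cols]
        dists, vals = [], []
        for g in range(Y.shape[0]):
            if g == i or np.isnan(Y[g, j]):
                continue
            row = Y[g, cols]
            mask = ~np.isnan(target) & ~np.isnan(row)
            if not mask.any():
                continue
            dists.append(np.sqrt(np.sum((target[mask] - row[mask]) ** 2)))
            vals.append(Y[g, j])
        order = np.argsort(dists)[:k]
        w = 1.0 / (np.array(dists)[order] + 1e-12)  # inverse-distance weights
        return np.sum(w * np.array(vals)[order]) / np.sum(w)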
4. Contemporary Imputation Techniques for Microarray Data
This category embraces those methods that implicitly attempt to lower the data variance of missing value estimates by seeking to exploit the underlying localised or global correlation structure of the microarray data. Some of the most popular algorithms, together with their relative merits and demerits, will now be investigated.
4.1. Least Square Impute (LSImpute) Estimation. This is a regression-based method that exploits the correlation between genes. There are three variants of the LSImpute [10] algorithm, namely, LSImpute-Gene, LSImpute-Array, and LSImpute-Adaptive. LSImpute-Gene estimates missing values using the correlation between the genes (intrasample), LSImpute-Array exploits intersample correlation, and LSImpute-Adaptive combines both techniques using a bootstrapping approach [32]. The communal features of all three LSImpute variants will now be delineated. To estimate missing value Y_ij in (1), the k most-correlated genes are firstly selected, whose expression vectors
are similar to gene i from Y in all samples except j, where the correlated genes do not contain any missing values. As LSImpute-Gene is based upon regression, it mandates that the number of model parameters be lower than the number of observations, though in general for microarray data the number of genes is usually much greater than the number of samples. The algorithm then computes regressive estimates for each selected gene, and the missing value estimate is obtained from their weighted average. While LSImpute-Gene affords greater accuracy than traditional imputation methods like KNN and SVDImpute (Section 3), it still has the same fundamental limitation of using a preset k value. Bø et al. [10], for example, empirically determined k = 10 as the most suitable value for their particular dataset, though crucially this finding is data dependent and not generic. It was also demonstrated that this imputation approach works better if missing values are initially approximated by LSImpute-Gene and then refined with LSImpute-Array. This lowers the imputation error, though it commensurately increases the computational overhead, and since LSImpute-Gene is still employed prior to any estimation, the value of k remains fixed. LSImpute-Adaptive combines the strengths of both LSImpute-Gene and LSImpute-Array by fusing their respective imputation results. It modifies the weights for each imputation using a bootstrapping process [32], with empirical results [10] endorsing that this strategy performs better than when either variant is separately applied. With the flexibility to adjust the number of predictor genes in the regression, LSImpute performs best when data exhibits a strong local correlation structure, though the comparative prediction accuracy is still inferior to that achieved by the new flexible imputation algorithms, which dynamically determine k directly from the data (Section 5).
4.2. Bayesian Principal Component Analysis (BPCA) Estimation [12]. BPCA estimates missing values using Bayesian estimation theory with a variational algorithm [33] to calculate the model parameters and ultimately the imputed value Y_ij. The posterior distribution p(Y_ij) of the missing value and the posterior distribution p(θ) of the model parameter θ are firstly computed from gene values having no missing values; since this distribution calculation requires the complete matrix, missing values are replaced by their corresponding gene averages. The model parameters p(θ) are then used to compute the current posterior distribution, with the maximum likelihood [32] parameters being iteratively updated using the current posterior distribution of model parameters and missing values, until convergence is reached. By considering only global correlations within a dataset, BPCA has a distinct advantage in terms of prediction speed compared with all the other imputation techniques analysed, but its performance is highly dependent on either a strong underlying global correlation within the data or a very high number of samples. This is offset by the likelihood of high imputation errors when the dataset is either locally correlated or comprises a small number of samples.
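The following sketch captures the spirit of the LSImpute-Gene procedure described in Section 4.1, under simplifying assumptions: gene i is observed everywhere except sample j, candidate predictor genes are complete, and the regression estimates are averaged with squared-correlation weights (which only approximates the published weighting scheme).

    import numpy as np

    def ls_impute_gene(Y, i, j, k=10):
        # Select the k genes most correlated with gene i (over samples
        # other than j), regress gene i on each of them, and average the
        # per-gene regression estimates weighted by squared correlation.
        cols = [c for c in range(Y.shape[1]) if c != j]
        target = Y[i, cols]
        scores = []
        for g in range(Y.shape[0]):
            if g == i or np.isnan(Y[g]).any():
                continue
            r = np.corrcoef(target, Y[g, cols])[0, 1]
            scores.append((abs(r), g))
        estimates, weights = [], []
        for r_abs, g in sorted(scores, reverse=True)[:k]:
            slope, intercept = np.polyfit(Y[g, cols], target, 1)  # LS fit
            estimates.append(slope * Y[g, j] + intercept)
            weights.append(r_abs ** 2)
        return np.average(estimates, weights=weights)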
Figure 2: Gene selection accuracy for 50 significant genes in breast cancer (accuracy (%) versus missing values (%) for HCMVI, CMVE, BPCA, LLSImpute, KNN, and ZeroImpute).
4.3. Collateral Missing Value Estimation (CMVE) [11]. This algorithm is unique among contemporary missing value imputation techniques in using multiple estimates. Like LSImpute, it firstly estimates the missing value Y_ij by identifying the k most-correlated genes, with either a covariance or Pearson correlation matrix being employed, depending upon the data distribution, to find these correlated genes. LS regression and two variants of the nonnegative LS (NNLS) algorithm are then applied to compute three separate estimates for Y_ij, which are then linearly fused as

Y_ij = ρ·Φ_1 + Δ·Φ_2 + Λ·Φ_3,    (2)

where ρ, Δ, and Λ are the weights assigned to each constituent imputation estimate. CMVE uses LS regression of the k correlated genes for the first missing value estimate Φ_1, while NNLS and linear programming compute the other two estimates Φ_2 and Φ_3. The rationale for including NNLS is that unnormalised microarray data has only positive values, so NNLS takes advantage of exploiting the positive search space. If the data is either normalized or log-transformed, it will contain some negative values, so LS regression is used for this particular estimation. Since both the Pearson correlation and the covariance functions necessitate complete imputation matrices, CMVE firstly replaces all missing values by gene averages. Once the initial missing value estimate is generated, the new estimated value is used in all future predictions, which is a distinctive feature of this particular imputation strategy. CMVE has been proven to perform best for locally correlated data, providing consistently superior imputation quality compared to all the aforementioned techniques, by virtue of the property of recycling estimated values in future predictions [34]. It is also more robust, as witnessed by its performance in the presence of high numbers of missing values. The main drawback of CMVE, just like all the other
contemporary algorithms, is the preset value of k, which means that it does not fully adapt to the correlation structure of the data and compromises performance when the data has a global structure. In summarising the imputation methods reviewed so far, the main assumption relates to the underlying correlation structure of the dataset, where KNN, LSImpute, and CMVE perform better when data is locally correlated, while SVDImpute and BPCA are more apposite for missing value estimation in globally correlated datasets. From a postgenomic knowledge inference viewpoint, however, any estimation strategy must adapt to the correlation structure of the data so that imputation performs equally well for both types of correlated data. The next section presents two recent flexible imputation methods that exhibit this propitious property, automatically adapting to the data correlation structure to produce minimal imputation error.
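A sketch of CMVE's fusion step (2) is given below. The LS estimate Φ1 uses ordinary least squares and Φ2, Φ3 use two NNLS-flavoured fits; the second NNLS variant (a damped fit in place of the linear-programming estimate) and the equal fusion weights are purely illustrative choices, not the published ones.

    import numpy as np
    from scipy.optimize import nnls

    def cmve_estimate(X, target, x_new, rho=1/3, delta=1/3, lam=1/3):
        # X: (samples x k) values of the k most-correlated genes,
        # target: the gene with the missing entry in those samples,
        # x_new: the k genes' values in the sample with the missing entry.
        beta_ls, *_ = np.linalg.lstsq(X, target, rcond=None)
        phi1 = x_new @ beta_ls                         # LS regression estimate
        beta_nn, _ = nnls(X, target)
        phi2 = x_new @ beta_nn                         # NNLS estimate
        beta_nn2, _ = nnls(np.vstack([X, 0.1 * np.eye(X.shape[1])]),
                           np.concatenate([target, np.zeros(X.shape[1])]))
        phi3 = x_new @ beta_nn2                        # damped NNLS variant
        return rho * phi1 + delta * phi2 + lam * phi3  # fusion as in (2)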
5. Flexible Imputation Techniques for Microarray Data
Flexible imputation techniques use, to some extent, core building blocks developed for their contemporary estimation counterparts in Section 4, and are characterised by automatically selecting, a priori, the optimal number of estimator genes from the data correlation structure. This avoids the problem that, if the data is globally correlated, a small number of predictor genes (low k value) may ignore genes that are strongly correlated to the gene having the missing value. Conversely, when an unnecessarily large number of genes (high k value) is used, genes which have little or no correlation to the gene with missing values can be introduced into the prediction. Two techniques are reviewed in this category.
5.1. Local Least Square Impute (LLSImpute) [8]. This is similar to LSImpute in that it estimates missing values by constructing a linear combination of correlated genes using LS principles. The crucial difference is that, in estimating Y_ij, the number of predictor genes k is heuristically determined directly from the dataset. To determine the optimum k, LLSImpute artificially removes a known value from the most correlated gene g_i before iteratively estimating it over a range of k values, with the k that produces the minimum estimation error then being used for imputation. Kim et al. [8] employed the L2 norm as well as the Pearson correlation to identify the most correlated genes, with the L2 norm reported to perform slightly better than the Pearson correlation for the chosen experimental data, although the difference in prediction accuracy between the two approaches was statistically insignificant. In comparison with the various traditional and contemporary approaches, LLSImpute adapts to the underlying correlated data structure, with the corollary being superior imputation performance, and while it incurs a considerably higher computational cost, from a microarray data perspective, missing value estimation accuracy always has a greater priority than computational complexity.
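This k-selection heuristic can be sketched as follows, assuming the matrix is complete apart from the single entry Y[i, j] and that impute_fn is any imputation routine parameterised by k:

    import numpy as np

    def choose_k(Y, i, j, impute_fn, k_grid=range(2, 21)):
        # Hide one known value of the gene most correlated with gene i,
        # re-estimate it for each candidate k, and return the k with the
        # smallest estimation error.
        cols = [c for c in range(Y.shape[1]) if c != j]
        corrs = [(abs(np.corrcoef(Y[i, cols], Y[g, cols])[0, 1]), g)
                 for g in range(Y.shape[0])
                 if g != i and not np.isnan(Y[g]).any()]
        _, g_best = max(corrs)
        c = cols[0]
        truth, Y[g_best, c] = Y[g_best, c], np.nan   # hide a known value
        errors = {k: abs(impute_fn(Y, g_best, c, k) - truth) for k in k_grid}
        Y[g_best, c] = truth                         # restore it
        return min(errors, key=errors.get)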
5.2. Heuristic Collateral Missing Value Imputation (HCMVI) [14]. This uses the multiestimate CMVE algorithm [11] detailed in Section 4 as its kernel building block to formulate the final imputation of missing value Y_ij. It is analogous to LLSImpute in that it also automatically determines the optimal number of predictor genes k, in this case by using Monte Carlo (MC) simulation [35]. It selects multiple matrices with known gene expression values, with each matrix [36] having a selection probability of .05 in the MC simulation. HCMVI then identifies the most-correlated matrix from the Pearson correlation [37] between each selected matrix and the gene expression data Y. These known values are then estimated by CMVE for a range of k values, with the optimal k being the one that generates the minimum estimation error. HCMVI retains all the enhanced imputation performance characteristics and advantages of the original CMVE algorithm, while crucially adapting automatically to the underlying correlation structure of the microarray data, though, as with LLSImpute, it incurs an additional computational overhead.
6. Discussion of Results
This section will rigorously examine the influence the aforementioned imputation strategies have in improving missing-value estimation accuracy for postgenomic knowledge discovery methods such as significant gene selection [38], allied with the biological significance of the imputation. Six different microarray datasets for breast and ovarian cancer tissues are used, with the data being log-transformed and normalized so that the mean x̄ = 0 and variance σ² = 1, in order to remove all experimental variations. The breast cancer dataset [17] contained 7, 7, and 8 samples of BRCA1, BRCA2, and sporadic mutations (neither BRCA1 nor BRCA2), respectively, while the ovarian cancer dataset [18] contained 16, 16, and 18 samples, respectively, of BRCA1, BRCA2, and sporadic mutations. Each breast cancer sample contained microarray data for 3226 genes, and there were 6445 genetic expressions per sample for the ovarian dataset. It is worth noting that the number of probes in the breast and ovarian cancer datasets differs; the data were generated by different labs under different experimental conditions and thus represent experimental variations. To equitably evaluate the performance of the traditional and contemporary imputation algorithms on downstream biological analysis methods, the number of predictor genes was fixed at k = 10 in all experiments. In contrast, the two flexible imputation methods (LLSImpute and HCMVI) automatically determine k by adapting to the correlation structure of the data. Also, in this empirical analysis, the LLSImpute variant based upon the L2 norm is applied due to its superior performance [8]. In the next section, the influence of imputation on both significant gene selection and GRN reconstruction (STEP 4 in Figure 1) is investigated.

Figure 3: Gene selection accuracy for 1000 significant genes in breast cancer.

Figure 4: Gene selection accuracy for 50 significant genes in ovarian cancer.

6.1. Imputation and Biological Significance of Selected Genes. To explore the impact of each estimation algorithm upon significant gene selection, a set of genes (Gorg) has been chosen from the original dataset using the between sum of squares to within sum of squares (BSS/WSS) method [35], which identifies genes that concomitantly have large interclass and small intraclass variations. The main reason for adopting this particular method is its proven superior capability to select significant genes compared with other popular methods such as the t-test [39]. To assess the effect of missing values on gene selection, experiments were performed across a missing value probability range from .01 to .2, with values being iteratively removed from the original gene expression matrix in (1). These were then estimated using ZeroImpute, KNN, LLSImpute, BPCA, CMVE, and HCMVI, respectively, to form Yest, prior to selecting sets of p genes using BSS/WSS for each respective estimation matrix.
Figure 5: Gene selection accuracy for 1000 significant genes in ovarian cancer.
The selected genes were then compared with Gorg to obtain the true positive percentage accuracy (%Accuracy) metric, providing a dispassionate measure of the estimation performance of each algorithm. To eliminate performance variations with respect to the number of selected genes in the BSS/WSS method, each imputation technique was tested for 50 and 1000 significant genes, with the results in Figures 2, 3, 4, and 5 displaying the respective gene selection performance for both the breast and ovarian cancer datasets. These clearly reveal that the flexible imputation methods (LLSImpute and HCMVI) consistently produce superior performance for both cancer datasets, with HCMVI providing the highest %Accuracy in the experiments. In contrast, contemporary imputation algorithms like CMVE and BPCA were unable to maintain their performance across both datasets, though interestingly, CMVE performed better than LLSImpute as well as all the other contemporary imputation methods for the breast cancer dataset, which has a predominantly localised data correlation structure. This was not, however, maintained for the more globally correlated ovarian cancer dataset, where BPCA performed better, though it correspondingly failed to sustain the improved estimation accuracy for the breast cancer data. Not surprisingly, the high-variance traditional imputation approaches such as ZeroImpute and KNN exhibit the poorest performance in Figures 2–5 for both cancer datasets, confirming the judgement that incorrectly imputed missing values can have a significant impact upon overall gene selection performance. Imputation algorithm performance has normally only been assessed numerically, with considerable debate within the research community about the suitability of standard evaluation measures, such as the normalised RMS error (NRMSE). Interpreting the results from a biological significance perspective has not received the same attention, though the impact of missing values on selected genes in postgenomic knowledge discovery is clearly a major factor in algorithmic performance assessment.
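For reference, the BSS/WSS score that drives the gene selection above can be sketched as:

    import numpy as np

    def bss_wss(expr, labels):
        # expr: (genes x samples), labels: class label per sample.
        # Returns the per-gene ratio of between-class to within-class
        # sum of squares; large values flag genes with large inter-class
        # and small intra-class variation.
        labels = np.asarray(labels)
        overall = expr.mean(axis=1, keepdims=True)
        bss = np.zeros(expr.shape[0])
        wss = np.zeros(expr.shape[0])
        for cls in np.unique(labels):
            cols = labels == cls
            mu = expr[:, cols].mean(axis=1, keepdims=True)
            bss += cols.sum() * ((mu - overall) ** 2).ravel()
            wss += ((expr[:, cols] - mu) ** 2).sum(axis=1)
        return bss / wss

    # e.g. the 50 top-ranked genes: np.argsort(bss_wss(Y, labels))[::-1][:50]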
6.2. Biological Significance of Imputation. While the primary focus is on the estimation accuracy of an imputation method, it is equally important, when evaluating the impact of missing values on gene selection, to investigate the biological significance of certain selected genes in the respective datasets. Indeed, it is constructive to ascertain whether a particular imputation technique assists the gene selection methods in identifying known and novel genes for a given sample. This may provide not only valuable information for the design of basic mechanistic, diagnostic, and biomarker studies, but also valuable data for use in the construction of gene networks and pathways involved in processes like oncogenesis and resistance to tumour induction. In examining the results for both the breast and ovarian cancer datasets, a number of genes were overlooked using traditional methods when missing values were introduced and processed, which independent experiments [40] have confirmed alter expression in tumor lines and so can be very important in oncogenesis. This set of genes has not only been selected by the BSS/WSS algorithm, but has been revalidated using the modified t-test with greedy pairs method [41], which minimizes the bias of the gene selection strategy towards either a particular imputation technique or a set of genes. As the results for the various gene selection algorithms in Table 1 reveal, the KIAA1025 protein was not always correctly selected when missing values were imputed using KNN, BPCA, CMVE, and LLSImpute, but was consistently identified by HCMVI. This is a vital protein which is coregulated with estrogen receptors in both in vivo and clinical data; such receptors are expressed in more than 66% of human breast tumors [29]. Another gene always selected by HCMVI across the range of missing values is plakophilin 2 (PKP2), which is a common protein and exhibits a dual role, appearing as both a constitutive karyoplasmic protein and a desmosomal plaque component for all the desmosome-possessing tissues and cell culture lines. The gene is found in breast carcinoma cell lines [25] and, furthermore, because of its significance, it can serve as a marker for the identification and characterisation of carcinomas derived from or corresponding to simple or complex epithelia [26]. Similar observations can be drawn from the study of significant genes in the ovarian cancer dataset in Table 2. For instance, the MHC Class II = DQ alpha (MHCα) and MHC Class II = DQ beta (MHCβ) genes are linked to the immune system and have been shown to be downregulated for ovary syndrome [27]. The allele gene is also present at a higher frequency in patients with malignant melanoma than in Caucasian controls. These genes help in particular to diagnose melanoma patients in the relatively advanced stages of the disease and/or patients who are more likely to have a recurrence [28]. The results confirm that these genes have been correctly identified by the flexible HCMVI method, while being consistently overlooked by the other techniques, most notably by all traditional imputation algorithms, for missing value probabilities greater than .05. Interestingly, for both cancer datasets, across the full missing value range from 1% to 20%, these regulated genes
Table 1: KIAA1025 and plakophilin2 selection in the breast cancer dataset across the range of missing values.

% MV   HCMVI                    CMVE                     LLSImpute   BPCA        KNN         ZeroImpute
1      KIAA1025, Plakophilin2   KIAA1025, Plakophilin2   KIAA1025    KIAA1025    KIAA1025    KIAA1025
5      KIAA1025, Plakophilin2   KIAA1025, Plakophilin2   -           -           -           -
10     KIAA1025, Plakophilin2   KIAA1025, Plakophilin2   -           -           -           -
15     KIAA1025, Plakophilin2   KIAA1025, Plakophilin2   -           -           -           -
20     KIAA1025, Plakophilin2   -                        -           -           -           -
Table 2: MHC Class II = DQ alpha (MHCα) and MHC Class II = DQ beta (MHCβ) selection in ovarian cancer across the range of missing values.

% MV   HCMVI        CMVE         LLSImpute   BPCA    KNN     ZeroImpute
1      MHCα, MHCβ   MHCα, MHCβ   MHCα        MHCα    MHCα    MHCα
5      MHCα, MHCβ   -            -           -       -       -
10     MHCα, MHCβ   -            -           -       -       -
15     MHCα, MHCβ   -            -           -       -       -
20     MHCα, MHCβ   -            -           -       -       -
have been correctly identified when gene selection is preceded by HCMVI imputation, as confirmed in Tables 1 and 2. This highlights that consideration of the biological significance of any imputation is extremely important and underscores the need for accurate estimation prior to gene selection, particularly in the presence of higher numbers of missing values. As alluded to earlier, existing GRN reconstruction methods conventionally replace missing values by either ZeroImpute or the gene average [30, 42], despite both inevitably impacting upon subsequent GRN reconstruction, as will now be more fully examined.
6.3. Impact of Missing Values on Gene Regulatory Network Reconstruction. To evaluate the influence of missing values, the algorithm for the reconstruction of accurate cellular networks (ARACNE) [43] has been employed, because it affords better performance than alternative approaches like Bayesian networks [44] and has been tested for mammalian gene network reconstruction and compared with other techniques that are normally applied to simple eukaryotes such as Saccharomyces cerevisiae [45]. ARACNE firstly computes the statistically significant gene-gene coregulation using mutual information before applying
a data processing inequality to prune indirect relationships, that is, genes which are coregulated by one or more intermediate genes. To comparatively evaluate the respective imputation performances on GRN reconstruction, the number of conserved links is determined, which represents whether a particular coregulation link is present in both GRNorg and GRNimputed. The gene network GRNorg is initially constructed using ARACNE from the original data Y with no missing values. As in the previous experiments, up to 20% missing values are randomly introduced and then estimated using the traditional, contemporary, and flexible imputation methods (Sections 3–5, resp.). The corresponding gene networks GRNimputed are then constructed from the imputed data, and GRNorg and GRNimputed are compared to ascertain the conserved links. Figures 6, 7, 8, and 9 show that the ARACNE method, which has been reported to be robust [46] for GRN construction, does not maintain its performance in the presence of missing values, especially for ZeroImpute. In contrast, when a flexible imputation method like HCMVI is applied, ARACNE conserves the number of links even at higher missing value probabilities.
Figure 6: Accuracy of conserved links (TP accuracy (%) versus missing values (%)) in BRCA1-breast cancer data.

Figure 7: Accuracy of conserved links in sporadic-breast cancer data.

Figure 8: Accuracy of conserved links in BRCA1-ovarian cancer data.

Figure 9: Accuracy of conserved links in BRCA2-ovarian cancer data.
For example, in the BRCA1 breast cancer data, the transcriptional link between ADP-ribosylation factor 3 (ARF3) and general transcription factor II, i, pseudogene 1 (GTF2IP1) was overlooked when missing values were imputed by all traditional and contemporary methods, but was correctly inferred when values were imputed by both HCMVI and LLSImpute. Similarly, the link between HS1 binding protein and mitogen-activated protein kinase 3 in the BRCA2 breast cancer data was reconstructed when values were imputed using HCMVI, but was neglected by all other techniques. The results for the breast cancer sporadic data revealed similar observations, with, for example, the interaction between ADP-ribosylation factor 3 and EST, which is very similar to the NSAP1 protein, being identified when data was imputed using flexible methods, while being missed by the other strategies, so corroborating the importance of accurate imputation in improving GRN reconstruction performance.
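The conserved-links metric itself is straightforward; a sketch over hypothetical undirected edge lists is:

    def conserved_link_accuracy(grn_org, grn_imputed):
        # grn_org / grn_imputed: iterables of coregulation links (gene
        # pairs) from ARACNE on the original and imputed data; returns
        # the percentage of original links preserved after imputation.
        links_org = {frozenset(e) for e in grn_org}
        links_imp = {frozenset(e) for e in grn_imputed}
        return 100.0 * len(links_org & links_imp) / len(links_org)

    # Usage with hypothetical edges:
    print(conserved_link_accuracy([("ARF3", "GTF2IP1"), ("A", "B")],
                                  [("GTF2IP1", "ARF3")]))  # 50.0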
In the ovarian cancer dataset, the interaction link between the Ro ribonucleoprotein autoantigen (Ro/SS-A) = autoantigen calreticulin and Glutathione S-transferase theta 1 was not identified in the BRCA1 data when missing values were introduced, but was regenerated when these missing values were imputed using HCMVI. Similarly, the coregulation between Inhibitor of DNA binding 3, dominant negative helix-loop-helix protein, and p53 in the BRCA2 ovarian cancer dataset was also missed, but the link was reconstructed when HCMVI imputation was applied across the range of missing values. In the sporadic ovarian cancer dataset, transcriptional links between CD97 and RAB-10 were again only successfully reconstructed using HCMVI, while they were overlooked by all other estimation methods, again underpinning the significance of accurate missing value imputation prior to GRN reconstruction. The impact of missing values on GRN reconstruction was further investigated on artificially created networks. Two artificial expression datasets and networks by Bansal et al. [47] were
used for this purpose. Each expression dataset had 100 probes with 100 samples per probe. The networks were constructed using ARACNE with no imputation and compared against the artificial networks to compute a reference area under the receiver operating characteristic (ROC) curve. Then, 20% missing values were introduced and imputed using HCMVI, which was followed by network reconstruction using ARACNE under the same experimental setup to compute the area under the ROC curve. Figure 10 shows the average ROC curves over 10 runs with and without imputation. The areas under the ROC curve for networks 1 and 2 were 0.6653 and 0.5979, respectively, when the networks were constructed from the complete dataset. The average areas under the ROC curve were 0.6653 and 0.5901, respectively, when the networks were constructed after randomly introducing 20% missing values and estimating them using HCMVI. Again, the result shows that network inference performance is upheld if accurate imputation is used prior to constructing the networks.

Figure 10: ROC plots of artificial networks: (a) ROC convex hull curve without missing values; (b) ROC convex hull curve with 20% missing values (true positive rate versus false positive rate).

6.4. Significance Test Results. For completeness, the statistical significance and variance stability of all the various imputation methods have been analysed using the two-sided Wilcoxon rank sum statistical significance test. The impetus for applying this test is that it does not assume that the data comes from the same distribution, which is particularly important given that the data variance can be appreciably disturbed by erroneous estimation, as, for instance, in ZeroImpute. To test the hypothesis H0: Y = Yest, where Y and Yest are the actual and estimated matrices, respectively, the P-value of the hypothesis is determined as

P-value = 1 − 2 Pr(R ≤ y_r),    (3)

where y_r is the sum of the ranks of the observations for Y and R is the corresponding random variable. The corresponding results, shown in the box plots in Figures 11, 12, 13, 14, 15, and 16, demonstrate that traditional approaches tend to degrade rapidly at higher numbers of missing values, while both contemporary and flexible imputation techniques maintain a far more consistent performance across the range of missing values; see notably Figures 12 and 14. A box plot displays the smallest observation, lower quartile, median, upper quartile, and largest observation, and also shows whether any value is an outlier. This corroborates the fundamental hypothesis that a suitably accurate imputation strategy should always be employed for microarray data before any downstream biological analysis is undertaken.

Figure 11: Significance test results for BRCA1-breast cancer data.
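A minimal sketch of the test in (3), using the rank-sum implementation in SciPy (scipy.stats.ranksums, which returns the two-sided P-value), is:

    import numpy as np
    from scipy.stats import ranksums

    def imputation_p_value(Y, Y_est):
        # Two-sided Wilcoxon rank sum test of H0: Y = Yest; a large
        # P-value means the imputed data is statistically
        # indistinguishable from the original.
        stat, p = ranksums(np.ravel(Y), np.ravel(Y_est))
        return p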
Figure 12: Significance test results for BRCA2-breast cancer data.

Figure 13: Significance test results for sporadic-breast cancer data.

Figure 14: Significance test results for BRCA1-ovarian cancer data.

Figure 15: Significance test results for BRCA2-ovarian cancer data.

Figure 16: Significance test results for sporadic-ovarian cancer data.
6.5. Normalized Root Mean Square Error. For completeness, the estimation performance of HCMVI and the comparative imputation methods was also analysed using the traditional parametric normalised root mean square (NRMS) error measure, despite its limitations in reflecting the true impact of missing values on subsequent biological analysis. The NRMS error is defined as
Θ = RMS(Y − Y_est) / RMS(Y),    (4)
where Y is the original data matrix and Y_est is the matrix estimated using HCMVI, CMVE, BPCA, LLSImpute, or KNN, respectively. This particular measure has been used by Sehgal et al. [11], Ouyang et al. [48], and Tuikkala et al. [6] for error estimation because Θ = 1 for zero imputation.
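A direct transcription of (4) is straightforward; in the sketch below, the optional boolean mask restricting the error to the artificially removed entries is an assumption of this sketch, not part of the definition.

import numpy as np

def nrms_error(Y, Y_est, missing=None):
    """Theta of (4). If `missing` (a boolean mask of the removed entries)
    is given, the error is evaluated over those entries only. Note that
    with ZeroImpute (Y_est = 0 everywhere), this returns exactly 1."""
    sel = np.ones(Y.shape, dtype=bool) if missing is None else missing
    rms = lambda a: np.sqrt(np.mean(np.square(a)))
    return rms(Y[sel] - Y_est[sel]) / rms(Y[sel])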
Figure 17: NRMS error in BRCA1-breast cancer data.

Figure 18: NRMS error in BRCA2-breast cancer data.

Figure 19: NRMS error in sporadic-breast cancer data.

Figure 20: NRMS error in BRCA1-ovarian cancer data.

Figure 21: NRMS error in BRCA2-ovarian cancer data.

Figure 22: NRMS error in sporadic-ovarian cancer data.
Figures 17, 18, 19, 20, 21, and 22 show box plots of the NRMS error for the different imputation algorithms (see Supplementary Material available online at doi:10.1155/2009/717136 for the rest of the results). They again confirm the better performance of HCMVI (see notably Figure 19) and reiterate the value of accurately exploiting information about the underlying correlation structure of the data instead of using a preset value. Interestingly, LLSImpute exhibited similar performance to HCMVI, justifying the merit of using other metrics to dispassionately compare the performance of different imputation strategies.
7. Conclusion

This paper has pragmatically argued that imputation can be effectively applied to recycle microarray data and, in doing so, provide many potential benefits ranging from cost savings to performance enhancements in postgenomic knowledge discovery. While it is acknowledged that ZeroImpute and other traditional missing value imputation strategies are straightforward to implement, the newer flexible methods have been shown to exhibit much superior accuracy and performance from both statistical and biological significance perspectives, by virtue of their innate ability to exploit any underlying data correlation structures. A comprehensive study of missing values in microarray data has been presented, and their subsequent impact upon postgenomic knowledge discovery methods, including significant gene selection and gene regulatory network reconstruction, has been investigated. Empirical analysis has consistently shown that, rather than merely ignoring missing values, which has been the preferred approach to this problem, flexible and robust imputation algorithms afford considerable performance benefits and so should, wherever possible, be mandated prior to any knowledge inference process using microarray data.
References

[1] P. D. Sutphin, S. Raychaudhuri, N. C. Denko, R. B. Altman, and A. J. Giaccia, “Application of supervised machine learning to identify genes associated with the hypoxia response,” Nature Genetics, vol. 27, p. 90, 2001.
[2] D. Schmatz and S. Friend, “A simple recipe for drug interaction networks earns its stars,” Nature Genetics, vol. 38, no. 4, pp. 405–406, 2006.
[3] M. Joron, C. D. Jiggins, A. Papanicolaou, and W. O. McMillan, “Heliconius wing patterns: an evo-devo model for understanding phenotypic diversity,” Heredity, vol. 97, no. 3, pp. 157–167, 2006.
[4] I. P. Ioshikhes, I. Albert, S. J. Zanton, and B. F. Pugh, “Nucleosome positions predicted through comparative genomics,” Nature Genetics, vol. 38, no. 10, pp. 1210–1215, 2006.
[5] A. Brazma, P. Hingamp, J. Quackenbush, et al., “Minimum information about a microarray experiment (MIAME)—toward standards for microarray data,” Nature Genetics, vol. 29, no. 4, pp. 365–371, 2001.
[6] J. Tuikkala, L. Elo, O. S. Nevalainen, and T. Aittokallio, “Improving missing value estimation in microarray data with gene ontology,” Bioinformatics, vol. 22, no. 5, pp. 566–572, 2006.
[7] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect in the classifier accuracy,” in Classification, Clustering and Data Mining Applications, pp. 639–648, Springer, Berlin, Germany, 2004.
[8] H. Kim, G. H. Golub, and H. Park, “Missing value estimation for DNA microarray gene expression data: local least squares imputation,” Bioinformatics, vol. 21, no. 2, pp. 187–198, 2005.
[9] O. Troyanskaya, M. Cantor, G. Sherlock, et al., “Missing value estimation methods for DNA microarrays,” Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001.
[10] T. H. Bø, B. Dysvik, and I. Jonassen, “LSimpute: accurate estimation of missing values in microarray data with least squares methods,” Nucleic Acids Research, vol. 32, no. 3, p. e34, 2004.
[11] M. S. B. Sehgal, I. Gondal, and L. S. Dooley, “Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data,” Bioinformatics, vol. 21, no. 10, pp. 2417–2423, 2005.
[12] S. Oba, M.-A. Sato, I. Takemasa, M. Monden, K.-I. Matsubara, and S. Ishii, “A Bayesian missing value estimation method for gene expression profile data,” Bioinformatics, vol. 19, no. 16, pp. 2088–2096, 2003.
[13] X. Gan, A. W.-C. Liew, and H. Yan, “Microarray missing data imputation based on a set theoretic framework and biological knowledge,” Nucleic Acids Research, vol. 34, no. 5, pp. 1608–1619, 2006.
[14] M. S. B. Sehgal, I. Gondal, L. S. Dooley, and R. Coppel, “Heuristic non parametric collateral missing value imputation: a step towards robust post-genomic knowledge discovery,” in Pattern Recognition in Bioinformatics, Lecture Notes in Computer Science, pp. 373–387, Springer, Berlin, Germany, 2008.
[15] J. Tuikkala, L. Elo, O. S. Nevalainen, and T. Aittokallio, “Improving missing value estimation in microarray data with gene ontology,” Bioinformatics, vol. 22, no. 5, pp. 566–572, 2006.
[16] R. Jörnsten, M. Ouyang, and H.-Y. Wang, “A meta-data based method for DNA microarray imputation,” BMC Bioinformatics, vol. 8, article 109, pp. 1–10, 2007.
[17] I. Hedenfalk, D. Duggan, Y. Chen, et al., “Gene-expression profiles in hereditary breast cancer,” The New England Journal of Medicine, vol. 344, no. 8, pp. 539–548, 2001.
[18] A. A. Jazaeri, C. J. Yee, C. Sotiriou, K. R. Brantley, J. Boyd, and E. T. Liu, “Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers,” Journal of the National Cancer Institute, vol. 94, no. 13, pp. 990–1000, 2002.
[19] R. Jörnsten, H.-Y. Wang, W. J. Welsh, and M. Ouyang, “DNA microarray data imputation and significance analysis of differential expression,” Bioinformatics, vol. 21, no. 22, pp. 4155–4161, 2005.
[20] J. Laurier, “Alarming increase in cancer rates,” WHO report, 2003.
[21] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.
[22] E. Keedwell and A. Narayanan, Intelligent Bioinformatics: The Application of Artificial Intelligence Techniques to Bioinformatics, John Wiley & Sons, New York, NY, USA, 2005.
[23] L. P. Brás and J. C. Menezes, “Improving cluster-based missing value estimation of DNA microarray data,” Biomolecular Engineering, vol. 24, no. 2, pp. 273–282, 2007.
[24] D. S. V. Wong, F. K. Wong, and G. R. Wood, “A multi-stage approach to clustering and imputation of gene expression profiles,” Bioinformatics, vol. 23, no. 8, pp. 998–1005, 2007.
[25] C. Mertens, C. Kuhn, and W. W. Franke, “Plakophilins 2a and 2b: constitutive proteins of dual location in the karyoplasm and the desmosomal plaque,” Journal of Cell Biology, vol. 135, no. 4, pp. 1009–1025, 1996.
[26] C. Mertens, C. Kuhn, R. Moll, I. Schwetlick, and W. W. Franke, “Desmosomal plakophilin 2 as a differentiation marker in normal and malignant tissues,” Differentiation, vol. 64, no. 5, pp. 277–290, 1999.
[27] E. Jansen, J. S. E. Laven, H. B. R. Dommerholt, et al., “Abnormal gene expression profiles in human ovaries from polycystic ovary syndrome patients,” Molecular Endocrinology, vol. 18, no. 12, pp. 3050–3063, 2004.
[28] M. Lu, W. A. Thompson, D. A. Lawlor, J. D. Reveille, and J. E. Lee, “Rapid direct determination of HLA-DQB1*0301 in the whole blood of normal individuals and cancer patients by specific polymerase chain reaction amplification,” Journal of Immunological Methods, vol. 199, no. 1, pp. 61–68, 1996.
[29] D. M. E. Harvell, J. K. Richer, D. C. Allred, C. A. Sartorius, and K. B. Horwitz, “Estradiol regulates different genes in human breast tumor xenografts compared with the identical cells in culture,” Endocrinology, vol. 147, no. 2, pp. 700–713, 2006.
[30] H. Xu, P. Wu, C. F. J. Wu, C. Tidwell, and Y. Wang, “A smooth response surface algorithm for constructing a gene regulatory network,” Physiological Genomics, vol. 11, no. 1, pp. 11–20, 2003.
[31] P. Whittle, Probability via Expectation, Springer, Berlin, Germany, 3rd edition, 1992.
[32] C. Z. Mooney and R. D. Duval, Bootstrapping: A Nonparametric Approach to Statistical Inference, Sage, Thousand Oaks, Calif, USA, 1993.
[33] G. McLachlan and D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics, Wiley-Interscience, New York, NY, USA, 2000.
[34] K.-Y. Kim, B.-J. Kim, and G.-S. Yi, “Reuse of imputed data in microarray analysis increases imputation efficiency,” BMC Bioinformatics, vol. 5, article 160, pp. 1–9, 2004.
[35] G. Casella and C. P. Robert, Monte Carlo Statistical Methods, Springer, Berlin, Germany, 2005.
[36] R. P. Abelson, Statistics as Principled Argument, Lawrence Erlbaum, Mahwah, NJ, USA, 1995.
[37] R. R. Wilcox, Fundamentals of Modern Statistical Methods, Springer, Berlin, Germany, 2001.
[38] B. Scholkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, MIT Press, Cambridge, Mass, USA, 2004.
[39] S. Dudoit, J. Fridlyand, and T. P. Speed, “Comparison of discrimination methods for the classification of tumors using gene expression data,” Journal of the American Statistical Association, vol. 97, no. 457, pp. 77–86, 2002.
[40] S. Salceda, C. Drumright, A. DiEgidio, et al., “Identification of differentially expressed genes in breast cancer,” Nature Genetics, vol. 27, pp. 83–84, 2001.
[41] T. H. Bø and I. Jonassen, “New feature subset selection procedures for classification of expression profiles,” Genome Biology, vol. 3, no. 4, pp. research0017.1–research0017.11, 2002.
[42] J. K. Choi, U. Yu, O. J. Yoo, and S. Kim, “Differential coexpression analysis using microarray data and its application to human cancer,” Bioinformatics, vol. 21, no. 24, pp. 4348–4355, 2005.
[43] K. Basso, A. A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, and A. Califano, “Reverse engineering of regulatory networks in human B cells,” Nature Genetics, vol. 37, no. 4, pp. 382–390, 2005.
[44] F. V. Jensen, Bayesian Networks and Decision Graphs, Springer, Berlin, Germany, 2nd edition, 2002.
[45] J. Ihmels, R. Levy, and N. Barkai, “Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae,” Nature Biotechnology, vol. 22, no. 1, pp. 86–92, 2004.
[46] A. A. Margolin, I. Nemenman, K. Basso, et al., “ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context,” BMC Bioinformatics, vol. 7, supplement 1, article S7, pp. 1–15, 2006.
[47] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. Di Bernardo, “How to infer gene networks from expression profiles,” Molecular Systems Biology, vol. 3, article 78, pp. 1–10, 2007.
[48] M. Ouyang, W. J. Welsh, and P. Georgopoulos, “Gaussian mixture clustering and imputation of microarray data,” Bioinformatics, vol. 20, no. 6, pp. 917–923, 2004.
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 491074, 13 pages doi:10.1155/2009/491074
Research Article
Efficient Alignment of RNAs with Pseudoknots Using Sequence Alignment Constraints

Byung-Jun Yoon
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
Correspondence should be addressed to Byung-Jun Yoon, [email protected]

Received 18 June 2008; Revised 13 November 2008; Accepted 16 January 2009
Recommended by Javier Garcia-Frias

When aligning RNAs, it is important to consider both the secondary structure similarity and primary sequence similarity to find an accurate alignment. However, algorithms that can handle RNA secondary structures typically have high computational complexity that limits their utility. For this reason, there have been a number of attempts to find useful alignment constraints that can reduce the computations without sacrificing the alignment accuracy. In this paper, we propose a new method for finding effective alignment constraints for fast and accurate structural alignment of RNAs, including pseudoknots. In the proposed method, we use a profile-HMM to identify the “seed” regions that can be aligned with high confidence. We also estimate the position range of the aligned bases that are located outside the seed regions. The location of the seed regions and the estimated range of the alignment positions are then used to establish the sequence alignment constraints. We incorporated the proposed constraints into the profile context-sensitive HMM (profile-csHMM) based RNA structural alignment algorithm. Experiments indicate that the proposed method can make the alignment speed up to 11 times faster without degrading the accuracy of the RNA alignment.

Copyright © 2009 Byung-Jun Yoon. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction

Sequence alignment lies at the heart of various computational methods that are used for analyzing biological sequences, such as RNAs and proteins. Alignment algorithms have been extensively used for comparing sequences to identify homologues, predict their structures, and infer their biological functions. Many functional noncoding RNAs (ncRNAs) are known to conserve their base-paired secondary structure as well as their primary sequence [1]. For this reason, when aligning RNAs, it is important to consider both structure and sequence similarities in order to find an accurate alignment that is biologically meaningful. For a similar reason, it is expedient to employ scoring schemes that can reasonably combine contributions from the secondary structure similarity as well as the primary sequence similarity when performing an RNA similarity search, as this can significantly reduce the number of false-positive predictions [2]. Conservation of the secondary structure gives rise to complicated symbol correlations between the pairing bases in the RNA sequence. Therefore, in order to take structural
similarity into account, RNA alignment and search algorithms need to handle these base correlations in a principled manner. Until now, a number of probabilistic models have been proposed for this purpose [2, 3], where stochastic context-free grammars (SCFGs) and their variants have been especially popular. A typical problem of these models and the relevant algorithms is their high computational complexity. For example, the Cocke-Younger-Kasami (CYK) algorithm used in SCFG-based alignment and search has a complexity of O(L^3), where L is the length of the RNA that is to be aligned. Algorithms for simultaneous folding and alignment of RNAs (typically referred to as the Sankoff algorithm) have an even higher complexity, requiring O(L^{3N}) computations for aligning N RNAs of length L [4]. The aforementioned algorithms do not consider pseudoknots, which are RNA secondary structures with crossing base pairs, and there will be a steep increase in complexity if we begin to consider such pseudoknots. Pseudoknots are often ignored by many algorithms since they significantly increase the computational complexity. The high computational cost of RNA alignment and search algorithms limits their utility in practical applications,
especially when the RNA of interest is long. To cope with this problem, there have been extensive research efforts to develop heuristic methods that can make these algorithms faster without degrading the accuracy. For example, let us consider the simultaneous folding and alignment algorithm. Its computational complexity is already O(L^6) for aligning just two RNAs, making it practically unusable for a larger number of RNAs. Even for pairwise alignments, the algorithm becomes quickly infeasible as the RNAs get longer. Therefore, in order to utilize these algorithms in practical applications, it is essential that we first reduce their computations. For this reason, most of the pairwise RNA alignment algorithms adopt various tricks to minimize the alignment time [5–11]. Similarly, a number of methods have been proposed to make RNA similarity searching faster, where the prescreening approach is a good example [12–15]. The prescreening approach uses a simple model, such as a profile hidden Markov model (profile-HMM), to identify the regions that have a reasonable amount of (sequence) similarity. Only these regions will be passed to a more complex model, such as a covariance model (CM; profile-SCFG) [3] or a profile context-sensitive HMM (profile-csHMM) [16, 17], for further inspection. In fact, these are just a few examples, and there also exist other approaches for making RNA alignment and RNA search algorithms faster [18, 19]. Recently, we proposed an efficient RNA structural alignment algorithm based on profile-csHMMs, which can also be used for aligning RNAs that contain pseudoknots [17]. This algorithm finds the optimal alignment between a structured reference RNA and an unstructured target RNA, by taking both structure and sequence similarities into account. It was demonstrated that the profile-csHMM algorithm can find accurate alignment of RNA pseudoknots [17]. In this paper, we propose a novel method for finding effective sequence alignment constraints that can improve the computational efficiency of the profile-csHMM structural alignment algorithm. The overall organization of the paper is as follows. In Section 2, we describe the concept of constrained alignment and briefly review some of the existing methods for finding the alignment constraints. After the review, we propose a new method for estimating the alignment constraints in Section 3. Finally, Section 4 describes how these constraints can be incorporated into the profile-csHMM based structural alignment method. Experimental results will be presented at the end of Section 4, which demonstrate the effectiveness of the proposed approach.
Figure 1: (a), (b) Examples of sequence alignment. The black squares denote the aligned positions. (c) Illustration of constrained alignment. The final sequence alignment (shown in black) should lie inside the constrained region (shown in gray).

2. Constrained Sequence Alignment

Let us assume that we want to find the alignment of two RNAs X = x_1 x_2 ··· x_{L1} (RNA-1) and Y = y_1 y_2 ··· y_{L2} (RNA-2). The predicted sequence alignment can be uniquely represented by the set of aligned bases (x_m, y_n) in a matrix. For example, let us consider the RNA alignment in Figure 1(a). The matrix shows the positions of the aligned bases, where a black square at (m, n) indicates that the bases x_m and y_n are aligned to each other. As we can see in Figure 1(a), the sequence alignment can be represented by
a “path” of aligned base positions (m, n) in the alignment matrix. Another example is shown in Figure 1(b) for a
slightly different alignment. Without any prior constraints, any base in one RNA can be aligned to any base in the other RNA, hence the “alignment path” can be located anywhere in the matrix. Now, assume that we are given some prior information about the region where the aligned bases should be located. This is illustrated in Figure 1(c), where the gray area depicts the possible positions (m, n) of the aligned bases x_m and y_n. Knowing that all the aligned bases should be located inside the given region, we only have to consider the alignment paths that are contained in this region when finding the RNA alignment. Consequently, the search space for finding the optimal alignment is reduced, bringing down the overall computational cost. The reduction in time achieved by restricting the alignment space is typically much larger than the reduction in the alignment space itself, since the computational complexity of most RNA alignment algorithms is a high-order polynomial of the RNA length. It has to be noted that the alignment accuracy will not be affected as long as the optimal alignment (the black path in Figure 1(c)) is contained in the constraining region (the gray area in Figure 1(c)). As illustrated in the previous example, using appropriate sequence alignment constraints can greatly enhance the efficiency of the alignment algorithm. So, the natural question is how we can predict good alignment constraints that can minimize the alignment time without degrading the alignment accuracy. Until now, various methods have been proposed for restricting the alignment space to improve the efficiency of diverse RNA alignment and search algorithms [5–9, 11, 17, 18]. For example, the query-dependent banding (QDB) method [18] is used to make CM-based RNA alignment algorithms faster, by excluding the regions in the dynamic programming matrix that have insignificant probability. These regions can be precomputed based on the given CM and do not depend on the target database. Foldalign [7], an algorithm for simultaneous RNA folding and alignment, limits the maximum length of the RNA motif as well as the maximum length difference between the subsequences that are being compared. A recent implementation of Foldalign [8] adopts a heuristic that prunes the dynamic programming matrix in order to reduce the overall time and memory requirements. Another RNA alignment and structure prediction algorithm called Stemloc [9] constrains the solution space by using “fold envelopes” and “alignment envelopes.” The fold envelopes are used to restrict the search over secondary structures, and the alignment envelopes are used to restrict the possible alignments between the given sequences. A recent implementation of Dynalign [11], a joint alignment and secondary structure prediction algorithm for two RNAs, assumes that the aligned bases in the respective RNAs should be located within a certain distance. To be more precise, the mth base x_m in RNA-1 (X = x_1 x_2 ··· x_{L1}) can be aligned to the nth base y_n in RNA-2 (Y = y_1 y_2 ··· y_{L2}) only if the following condition is satisfied:

|(L2/L1) m − n| ≤ M,    (1)

for a given M. For convenience, we refer to this constraint as the M-constraint. The parameter M is used to specify the maximum distance between the alignable bases. By imposing this constraint, we are restricting the number of insertions and deletions in homologous sequences, which is a reasonable assumption for real biological sequences. The constrained alignment space is band-shaped, as depicted in Figure 2(a). Despite its simplicity, it has been shown that this constraint works reasonably well [6, 11]. The latest implementation of Dynalign [6] takes a more principled approach for estimating the alignment region. In the new approach, a hidden Markov model (HMM) is used to estimate the set Q of aligned positions (m, n) whose coincidence probability P(x_m ↔ y_n | X, Y) is larger than a reasonably low threshold p_t (≈ 0):
Q = {(m, n) | P(x_m ↔ y_n | X, Y) > p_t}.    (2)
Two bases x_m and y_n are said to be coincident if (i) they are aligned to each other, or (ii) x_m is inserted in the region that immediately follows x_{m1} (m1 < m), which is aligned to y_n, or vice versa [6]. The estimated set Q is used to constrain the final alignment. It was demonstrated that this technique can provide significant savings in computational time as well as a small improvement in alignment accuracy [6]. Another pairwise folding and alignment algorithm, called Consan [5], first finds the confidently aligned base positions, referred to as “pins,” and constrains the RNA alignment by fixing these positions. This is illustrated in Figure 2(b). The set of pins P is estimated using a pair-HMM, by looking for base positions (m, n) whose alignment probability P(x_m ⇔ y_n | X, Y) exceeds some threshold p_h, which is close to unity. This set P can be written as follows:
P = {(m, n) | P(x_m ⇔ y_n | X, Y) ≥ p_h}.    (3)
For every predicted pin (m, n) ∈ P , the bases xm and yn are forced to be aligned to each other in the final alignment. While Dynalign [6] finds the set of all possible alignment positions, Consan [5] tries to find only a small set of alignment positions that must be included in the final alignment. Although the previous alignment constraints [5, 6, 11] were mainly used to speed up Sankoff-style joint alignment and folding algorithms, similar ideas can be used to expedite dynamic programming alignment algorithms such as the CYK algorithm [3] for CMs and the sequential component adjoining (SCA) algorithm [16, 17] for profile-csHMMs. In the following section, we propose a new method for finding effective sequence alignment constraints that are especially useful for making these algorithms faster.
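Schematically, the constraint styles in (1)–(3) reduce to simple membership tests. The sketch below assumes that an (L1 × L2) matrix `post` of posterior probabilities (coincidence probabilities for (2), alignment probabilities for (3)) has already been computed by a forward-backward pass over a pair-HMM; indices here are 0-based, and the function names are illustrative.

import numpy as np

def m_constraint(L1, L2, M):
    """All (m, n) allowed by the M-constraint of (1)."""
    return {(m, n)
            for m in range(L1) for n in range(L2)
            if abs((m + 1) * L2 / L1 - (n + 1)) <= M}

def dynalign_Q(post, p_t=1e-4):
    """(2): every position whose coincidence probability exceeds a low threshold."""
    return set(zip(*np.nonzero(post > p_t)))

def consan_pins(post, p_h=0.99):
    """(3): only near-certain positions, the 'pins'."""
    return set(zip(*np.nonzero(post >= p_h)))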
Figure 2: (a) Alignment constraint in Dynalign [11]. The maximum distance between the aligned bases is restricted. (b) Consan [5] constrains the alignment space by fixing the “pins,” or confidently aligned base positions.

3. Alignment Constraints for RNA Family-Specific Models

Let us assume that we have a reference RNA whose structure is known. This can be either the consensus sequence of an RNA family or simply a single RNA sequence. Also assume that we are given a target RNA with an unknown structure, which might be a putative member of the same family. We want to find the optimal alignment between these RNAs by considering both their sequence and structural similarities. This structural alignment can be used for predicting the secondary structure of a new homologue [17, 20] or performing an RNA similarity search to identify new members in the same RNA family [3]. In order to find the structural alignment, we first construct a stochastic model (such as a profile-csHMM or a CM) that can closely represent the reference RNA. Then we use
a dynamic programming alignment algorithm to find the best alignment between the reference RNA (represented by the constructed model) and the target RNA. Although the computational complexity of these algorithms is generally lower than that of Sankoff-style algorithms, it still ranges between O(L^3) and O(L^6) for a target RNA of length L. For RNAs without pseudoknots, the computational complexity of the alignment algorithm will be O(L^3). The complexity for aligning pseudoknotted RNAs is at least O(L^4), and it can become higher as the structure gets more complex. This renders the dynamic programming algorithms impractical for aligning long RNAs or scanning a large database, and using effective alignment constraints can be greatly helpful in relieving this problem.

3.1. Motivation for Estimating Constraints Based on Predicted Alignment Positions. When we are interested in a specific RNA family, it is more appropriate to establish the alignment constraints based on the member sequences in the given family. Therefore, it is more desirable to use a family-specific model for finding the constraints, rather than a general model that applies to all RNAs as in [5, 6]. However, in many practical situations, we may not have a sufficient number of sequences in the given family to reliably estimate the model parameters. As we can see in (2) and (3), the alignment constraints in Dynalign [6] and Consan [5] depend strongly on the estimated alignment probabilities. Although the alignment constraints used in these methods are expected to work well when we have a large number of training sequences, they are not suitable when only a handful of RNAs are available for training the model. So, how can we find efficient alignment constraints for a family-specific model when we have only a limited number of sequences in the reference RNA family? In order to answer this question, let us consider the pair-HMMs shown in Figure 3. Both pair-HMMs have three hidden states, ALN, I_X, and I_Y, for base alignment, base insertion in X (RNA-1), and base insertion in Y (RNA-2), respectively. The state ALN emits a pair of aligned bases x_m ∈ X and y_n ∈ Y. The insert state I_X emits an unaligned base x_m in X, and similarly, the state I_Y emits an unaligned base y_n in Y. These pair-HMMs can be used for finding a sequence-based alignment between two RNAs, and for estimating the base alignment probabilities. It is called a sequence-based alignment, since the alignment is obtained based on sequence similarity alone. Similar models have been used to find the alignment constraints (2) and (3) in Dynalign [6] and Consan [5], respectively. In this example, the transition probabilities of the pair-HMMs are shown along the arrows. We assume that the probability of entering a state in the beginning is identical for all three states. The emission probability e(x, y | ALN) of a pair of aligned bases (x, y) is shown inside the box below the respective HMMs in Figure 3. Finally, the emission probability at an insert state is specified as follows:

e(x | s) = 1/4,  ∀x ∈ {A, C, G, U}, ∀s ∈ {I_X, I_Y}.    (4)
Figure 3: Two pair-HMMs with slightly different parameters. Both pair-HMMs have three states, ALN, I_X, and I_Y, which represent base alignment, base insertion in RNA-1, and base insertion in RNA-2, respectively. (a) HMM-1, with e(x, y | ALN) = 0.22 if x = y and 0.01 otherwise; the estimated base alignment probabilities for the example alignment are 0.90, 0.93, and 0.88. (b) HMM-2, with e(x, y | ALN) = 0.16 if x = y and 0.03 otherwise; the corresponding estimates are 0.46, 0.50, and 0.42.
Now, let us assume that we want to find the sequence-based alignment of the following RNAs:

X = AACUG,  Y = CUGAA,    (5)

using the pair-HMM shown in Figure 3(a). Using the Viterbi algorithm, we can get the following alignment:

X: A A C U G – –
Y: – – C U G A A.    (6)
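The alignment in (6) can be reproduced with a compact Viterbi implementation. In the sketch below, the pairwise emission probabilities follow HMM-1 of Figure 3(a) and the insert-state emission follows (4), while the transition probabilities are illustrative assumptions of this sketch (the actual values appear only in the figure).

import math

STATES = ("ALN", "IX", "IY")
INIT = {s: 1.0 / 3.0 for s in STATES}   # equal initial probabilities, as in the text
# Transition probabilities: illustrative placeholders, not the figure's values.
TRANS = {s: {"ALN": 0.8, "IX": 0.1, "IY": 0.1} for s in STATES}
E_INS = 0.25                            # insert-state emission e(x | s) = 1/4, see (4)

def e_aln(a, b):                        # e(x, y | ALN) of HMM-1 in Figure 3(a)
    return 0.22 if a == b else 0.01

def viterbi(X, Y):
    """Most probable state path of the 3-state pair-HMM, as two gapped strings."""
    L1, L2, NEG = len(X), len(Y), float("-inf")
    V = {s: [[NEG] * (L2 + 1) for _ in range(L1 + 1)] for s in STATES}
    back = {}
    def emit(s, i, j):                  # log emission when entering state s at (i, j)
        return math.log(e_aln(X[i - 1], Y[j - 1])) if s == "ALN" else math.log(E_INS)
    for i in range(L1 + 1):
        for j in range(L2 + 1):
            for s in STATES:
                di, dj = {"ALN": (1, 1), "IX": (1, 0), "IY": (0, 1)}[s]
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                if pi == 0 and pj == 0:          # s emits the very first symbol(s)
                    V[s][i][j] = math.log(INIT[s]) + emit(s, i, j)
                    back[(s, i, j)] = None
                    continue
                score, prev = max((V[t][pi][pj] + math.log(TRANS[t][s]), t)
                                  for t in STATES)
                if score > NEG:
                    V[s][i][j] = score + emit(s, i, j)
                    back[(s, i, j)] = prev
    s, i, j, a1, a2 = max(STATES, key=lambda t: V[t][L1][L2]), L1, L2, [], []
    while s is not None:                # trace the best path back to the start
        prev = back[(s, i, j)]
        if s == "ALN":
            a1.append(X[i - 1]); a2.append(Y[j - 1]); i -= 1; j -= 1
        elif s == "IX":
            a1.append(X[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(Y[j - 1]); j -= 1
        s = prev
    return "".join(reversed(a1)), "".join(reversed(a2))

print(viterbi("AACUG", "CUGAA"))        # -> ('AACUG--', '--CUGAA') with these parameters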
For each aligned pair (xm , yn ), we can compute the alignment probability P(xm ⇔ yn | X, Y) using the forward-backward algorithm [21]. The estimated base alignment probabilities are shown in Figure 3(a), below the RNA alignment. The estimated alignment probabilities are close to unity, indicating that we can be more or less confident about the predicted base alignments. Now, let us repeat this process using the pair-HMM shown in Figure 3(b), which has slightly different parameters. As we can see in Figure 3(b), HMM-2 finds the same alignment as HMM-1, but the estimated alignment probabilities are significantly different from the previous estimates. This example clearly shows that the estimation of the base alignment probability P(xm ⇔ yn | X, Y) can be very sensitive to small changes in the model parameters. This implies that the alignment constraint in (3), which depends on P(xm ⇔ yn | X, Y), may not be reliable when we do not have enough training data to accurately estimate the HMM parameters. Compared to this, the alignment constraint in (2) might be more reliable, as the coincidence probability P(xm ↔ yn | X, Y) used to estimate the constraint also includes the insertion probabilities. However, the predicted
constraint will nevertheless depend on the parameters of the HMM to a considerable extent. However, one thing we can notice by comparing Figures 3(a) and 3(b) is that, despite the large difference in the estimated alignment probabilities, the resulting sequence alignments are identical. In fact, the alignment positions in an optimal sequence alignment are not very sensitive to small parameter changes, and as a result, HMMs with reasonably similar parameters often yield almost identical alignment results. This motivates us to exploit the aligned base positions (x_m, y_n) for establishing the alignment constraints, instead of using the base alignment probabilities P(x_m ⇔ y_n | X, Y).

3.2. Finding Seed Regions Using a Profile-HMM. Based on the previous observation, we propose a new method that utilizes the predicted alignment positions to find the alignment constraints. As before, let us denote the structured reference RNA as X (RNA-1) and the unstructured target RNA as Y (RNA-2). Ultimately, we want to find the structural alignment of these RNAs. However, since the dynamic programming algorithm for finding the structural alignment is computationally expensive, we want to come up with effective alignment constraints that can speed up the alignment. For this purpose, we first build a profile-HMM [3] based on the reference RNA family. This model is used to find the sequence-based alignment between the reference RNA (represented by the profile-HMM) and the target RNA. Secondly, we identify the regions that consist of multiple consecutive base alignments, or base matches. Although a single base match may not be meaningful by itself, having a region of consecutive matches often indicates that the alignment in the given region is reasonably accurate. This
is especially true for those matches that are located in the middle of the region. For example, we can see in both Figures 3(a) and 3(b) that the alignment probability P(x_m ⇔ y_n | X, Y) is the largest for the alignment between x_4 = U and y_2 = U, which is located between the matches (x_3, y_1) and (x_5, y_3). Therefore, we exclude the matches near the ends and keep the remainder to obtain a set of reliable base alignments. The set of reliable contiguous matches is referred to as the seed region. The procedure for finding the seed regions can be summarized as follows (a short code sketch of this procedure is given after (8) below):

(1) Find a sequence-based alignment between the RNAs.
(2) Identify all maximal regions that consist of consecutive matches. Let N_match be the number of consecutive matches in a given region. Keep only those regions with N_match ≥ Γ.
(3) In each region, exclude the first γ matches at the left end and the last γ matches at the right end.
(4) The region that consists of the N_match − 2γ remaining matches is defined as a seed region.

The integer parameters Γ and γ (< Γ/2) define the seed regions during this process. In general, using a larger Γ will identify a smaller number of seed regions, and a larger γ makes the seed regions contain fewer but more reliable base matches. Figure 4(a) illustrates an example alignment with three seed regions.

Figure 4: Illustration of the proposed method. (a) Seed regions are identified from the sequence-based alignment. (b) Example of a seed region, which consists of consecutive base matches. (c) Example of a nonseed region.

3.3. Constraints in a Seed Region. Assume that we have identified K seed regions according to the procedure described in Section 3.2. Since the base alignments in these regions are relatively reliable, we keep the alignment space in these regions small. Let us consider the kth seed region illustrated in Figure 4(b). We denote the begin index and the end index of the kth seed in RNA-1 as σ_1^ℓ(k) and σ_1^r(k), respectively. The superscripts ℓ and r in σ_1^ℓ(k) and σ_1^r(k) stand for “left” and “right,” respectively. Similarly, the begin and end indices of the kth seed in RNA-2 are denoted as σ_2^ℓ(k) and σ_2^r(k), respectively. Since a seed region consists of consecutive base matches, we have

σ_1^r(k) − σ_1^ℓ(k) + 1 = σ_2^r(k) − σ_2^ℓ(k) + 1 = L_k,    (7)

where L_k is the length of the kth seed. For convenience, we define D_k as the position difference between the aligned bases in the kth seed:

σ_2^ℓ(k) − σ_1^ℓ(k) = σ_2^r(k) − σ_1^r(k) = D_k.    (8)
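Under the assumption that the sequence-based alignment is available as a list of matched position pairs (m, n) sorted by m, the seed-finding procedure above can be sketched as follows; the function name is illustrative.

def find_seeds(matches, Gamma=9, gamma=4):
    """Seed regions per Section 3.2: runs of >= Gamma consecutive matches,
    trimmed by gamma matches at each end. 'Consecutive' means that both
    position indices advance by exactly one."""
    seeds, run = [], []
    for m, n in matches:
        if run and (m, n) == (run[-1][0] + 1, run[-1][1] + 1):
            run.append((m, n))                       # the run continues
        else:
            if len(run) >= Gamma:
                seeds.append(run[gamma:len(run) - gamma])
            run = [(m, n)]                           # start a new run
    if len(run) >= Gamma:
        seeds.append(run[gamma:len(run) - gamma])
    return seeds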
Based on the kth seed, we define the set of allowed alignment positions (m, n) as follows:
S(k) = {(m, n) | σ_1^ℓ(k) ≤ m ≤ σ_1^r(k), |n − m − D_k| ≤ Δ}.    (9)

For a base x_m that is contained in the kth seed region of RNA-1, the parameter Δ restricts the distance between the base y_{n*} (n* = m + D_k), to which x_m is aligned in the sequence-based alignment, and the base y_n, to which x_m will be aligned in the final structural alignment. As the base alignments in the seed regions are reliable, Δ can typically be set to a small number. We find the set S(k) for all k = 1, 2, ..., K, and these sets will be combined later to establish the final alignment constraints.

3.4. Constraints in a Nonseed Region. The predicted base alignments in the nonseed regions are generally less reliable than those inside the seed regions. Therefore, we define different alignment constraints for the bases contained in the nonseed regions, and make these constraints less stringent than (9). Let us consider the kth nonseed region illustrated in Figure 4(c). The begin and end indices of the kth nonseed region in RNA-1 are denoted by ν_1^ℓ(k) and ν_1^r(k), respectively. Similarly, ν_2^ℓ(k) and ν_2^r(k) denote the begin and end indices of the corresponding nonseed region in RNA-2. Now, we define the set A(k), which contains (i) all aligned base positions (m, n) in the kth nonseed region, as well as (ii) the first and last positions (ν_1^ℓ(k), ν_2^ℓ(k)) and (ν_1^r(k), ν_2^r(k)) in this region:
A(k) = {(m, n) | ν_1^ℓ(k) ≤ m ≤ ν_1^r(k), ν_2^ℓ(k) ≤ n ≤ ν_2^r(k), x_m ⇔ y_n} ∪ {(ν_1^ℓ(k), ν_2^ℓ(k)), (ν_1^r(k), ν_2^r(k))}.    (10)
In practice, it is possible that there may be no aligned bases xm ⇔ yn in the given nonseed region. Including the terminal positions (ν1 (k), ν2 (k)) and (νr1 (k), νr2 (k)) of the kth nonseed region in A(k) ensures that the set A(k) will never be empty.
EURASIP Journal on Bioinformatics and Systems Biology For the position pairs (m, n) ∈ A(k), we estimate the range of the position difference (n − m) as follows: Δmin (k) = Δmax (k) =
min
(n − m),
max
(n − m).
(m,n)∈A(k) (m,n)∈A(k)
(11)
Based on these values, we define the following set:
N (k) = (m, n)ν1 (k) ≤ m ≤ νr1 (k), m + Δmin (k) − Δ ≤ n ≤ m + Δmax (k) + Δ ,
(12)
which contains the alignable base positions (m, n) in the kth nonseed region. Note that we use the same Δ in (9) and (12). Therefore, a larger Δ will relax the alignment constraints for both seed and nonseed regions, and a smaller Δ will make both constraints more stringent.

3.5. Overall Alignment Constraints. In Sections 3.3 and 3.4, we defined the alignment constraints in the seed regions as well as the constraints in the nonseed regions. Finally, we combine (9) and (12) to obtain the overall alignment constraint set C as follows:
C = ( ⋃_{∀k1} S(k_1) ) ∪ ( ⋃_{∀k2} N(k_2) ).    (13)
This set C can be used to constrain the alignment space of the dynamic programming algorithm for finding the structural alignment of the given RNAs. When finding the RNA alignment, we allow a base xm in RNA-1 to be aligned to a base yn in RNA-2 only if the pair (m, n) is included in this set C.
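The constraint sets (9)–(13) can likewise be assembled directly from the seed and nonseed regions. The sketch below is a schematic reading of the equations, not the author's implementation; the nonseed regions are assumed to be described by their terminal index pairs `left` and `right` plus any matches they contain.

def seed_constraints(seed, delta):
    """S(k) of (9): stay within delta of the seed's diagonal offset D_k."""
    D_k = seed[0][1] - seed[0][0]                  # offset D_k from (8)
    return {(m, n)
            for m, _ in seed
            for n in range(m + D_k - delta, m + D_k + delta + 1)}

def nonseed_constraints(left, right, inner_matches, delta):
    """N(k) of (12); A(k) of (10) always contains the region's endpoints."""
    A = set(inner_matches) | {left, right}
    d_min = min(n - m for m, n in A)               # Delta_min(k), see (11)
    d_max = max(n - m for m, n in A)               # Delta_max(k), see (11)
    return {(m, n)
            for m in range(left[0], right[0] + 1)
            for n in range(m + d_min - delta, m + d_max + delta + 1)}

def overall_constraints(seed_regions, nonseed_regions, delta):
    """C of (13): the union over all seed and nonseed constraint sets."""
    C = set()
    for seed in seed_regions:
        C |= seed_constraints(seed, delta)
    for left, right, inner in nonseed_regions:
        C |= nonseed_constraints(left, right, inner, delta)
    return C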
4. Experimental Results

To demonstrate the effectiveness of the proposed method, we applied the new alignment constraints to the profile-csHMM-based structural alignment method [17]. In the following, we provide a brief explanation of the experimental set-up and present the experimental results.

4.1. Profile-csHMM-Based Structural Alignment. Profile-csHMMs are a subclass of context-sensitive HMMs [22] that are especially useful for representing RNA sequence profiles and their secondary structure. In principle, profile-csHMMs can represent RNA secondary structures with any kind of base pairs [16, 17]. As a result, profile-csHMMs can also be used for aligning and predicting the structure of RNAs that contain pseudoknots, which cannot be done using the widely used SCFGs (or CMs). The profile-csHMM-based structural alignment algorithm proposed in [17] proceeds as follows. In the first place, a profile-csHMM is constructed based on a reference RNA sequence with a known structure. In [17], a single reference RNA was used to build the model. This can be used for performing a single RNA homology search, similar to the CM-based search proposed in [23]. Figure 5 illustrates an example, where a profile-csHMM is
constructed based on a reference RNA that has two crossing base pairs. Obviously, we do not have enough training sequences to accurately estimate the model parameters in this case, hence the parameters of the profile-csHMM are chosen according to the scoring scheme proposed by Gorodkin et al. [24]. These scores can be viewed as normalized log-probabilities for observing base substitutions or gaps (insertions and deletions) in homologous RNAs. They have been used in a number of RNA alignment algorithms [20, 24], yielding accurate alignment results. The constructed profile-csHMM can then be used for finding the optimal structural alignment between the reference RNA and an unstructured target RNA, computing their alignment score, and predicting the secondary structure of the target RNA. A dynamic programming alignment algorithm called the sequential component adjoining (SCA) algorithm can be used for this purpose.

4.2. Estimating the Alignment Constraints. In order to estimate the alignment constraints for expediting the profile-csHMM alignment algorithm (or the SCA algorithm), we construct a profile-HMM based on the same reference RNA. Note that unlike the profile-csHMM, the traditional profile-HMM reflects only the sequence characteristics of the reference RNA. Similar to the parameterization of the profile-csHMM described in Section 4.1, the parameters of the profile-HMM are also specified according to the scores in [24]. The resulting profile-HMM is used to estimate the sequence alignment constraints as elaborated in Section 3. We use the estimated constraints to restrict the alignment space of the structural RNA alignment to reduce the overall computational load.

4.3. Choosing the Parameters for Constraint Estimation. Now, one practical question is how we should choose the parameters Γ, γ, and Δ that are used to estimate the alignment constraints in Sections 3.3 and 3.4. Ideally, the predicted alignment constraints should minimize the alignment space without affecting the quality of the final structural alignment. Since the alignment constraints proposed in Section 3 are derived from the predicted seed regions, the alignment accuracy in these regions has a crucial impact on the accuracy of the proposed approach. For this reason, we estimated the average base alignment accuracy in the seed regions for the 5S rRNA and tRNA families in the Rfam database (version 8.1) [25]. We used the RNAs in the seed alignment of the respective family, as they have a relatively reliable secondary structure annotation. For each RNA family, we first chose a reference RNA among its members, and constructed a profile-HMM based on the chosen RNA. Then we aligned the remaining members to the reference RNA using the profile-HMM. For every sequence alignment, the predicted alignment positions were compared to the correct positions in the database to estimate the alignment error rate. In order to get a reliable estimate, we repeated these experiments using every member as the reference RNA. This resulted in 1,182,656 pairwise alignments for tRNAs and 345,156 alignments for 5S rRNAs.
Figure 5: Constructing a profile-csHMM. (a) A reference RNA sequence with a known secondary structure. (b) The profile-csHMM that represents the given reference RNA (single-emission, pairwise-emission, and context-sensitive match states M_k; insert states I_k; delete states D_k).
Figure 6: A region that consists of consecutive base matches.

Let us assume that the profile-HMM predicted that x_m in the reference RNA should be aligned to y_n in the target RNA. We want to estimate the probability of error for this prediction as a function of the following parameters:

(1) L_aln: the number of consecutive base matches in the region containing the alignment (m, n);
(2) d_min: the minimum distance between the given base alignment (m, n) and the terminal alignment positions (m1, n1) and (m2, n2).

See Figure 6 for illustration. Consider the example illustrated in Figure 6. The number of consecutive base matches, or the length, of the region containing (m, n) is

L_aln = m2 − m1 + 1 = n2 − n1 + 1,    (14)

and the minimum distance d_min is defined as

d_min = min{m − m1, m2 − m} = min{n − n1, n2 − n}.    (15)

Based on these definitions, we define

P_e(L, d) = P(x_m is not aligned to y_n in the trusted alignment | L_aln ≥ L, d_min ≥ d).    (16)
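One way to estimate P_e(L, d) empirically, assuming each predicted alignment is available as runs of consecutive matched pairs together with the trusted pairs from the database, is sketched below; the function names are illustrative.

from collections import defaultdict

def tally_errors(runs, trusted, counts=None, errors=None):
    """runs: runs of consecutive (m, n) pairs from one predicted alignment;
    trusted: set of (m, n) pairs in the database (trusted) alignment.
    Each pair is binned by its run length L_aln and end distance d_min."""
    counts = defaultdict(int) if counts is None else counts
    errors = defaultdict(int) if errors is None else errors
    for run in runs:
        L_aln = len(run)
        for idx, pair in enumerate(run):
            d_min = min(idx, L_aln - 1 - idx)      # distance to nearer run end
            counts[(L_aln, d_min)] += 1
            if pair not in trusted:                # predicted pair is incorrect
                errors[(L_aln, d_min)] += 1
    return counts, errors

def p_e(counts, errors, L, d):
    """Empirical P_e(L, d): error rate over bins with L_aln >= L, d_min >= d."""
    tot = sum(c for (la, dm), c in counts.items() if la >= L and dm >= d)
    err = sum(e for (la, dm), e in errors.items() if la >= L and dm >= d)
    return err / tot if tot else float("nan")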
This is the probability that the predicted base alignment (m, n) between x_m and y_n will be incorrect, given that the following hold: (1) the length L_aln of the alignment region containing (m, n) is at least L; (2) there are at least d matches on the left-hand side of (m, n) as well as on the right-hand side.

Figure 7: Alignment error probability P_e(L, d) in the seed regions. (a) Contour plot of P_e(L, d) for 5S rRNAs. Darker shade corresponds to higher P_e(L, d) and lighter shade corresponds to lower P_e(L, d). The misalignment probabilities on the level curves are shown on top of the contours. (b) Misalignment probability for 5S rRNAs with different percentage identities. (c) Contour plot of P_e(L, d) for tRNAs. (d) Misalignment probability for tRNAs with different percentage identities.

Figure 7(a) shows the contour plot of the misalignment probability P_e(L, d) for 5S rRNAs, where the x-axis is for L and the y-axis is for d. On top of each contour curve, we show the corresponding misalignment probability P_e(L, d) for the points (L, d) on the given curve. Darker shaded regions correspond to higher P_e(L, d) and lighter shaded regions correspond to lower P_e(L, d). The diagonal line representing L = 2d + 1 is shown in the plot as a reference. Note that, by definition, we have L ≥ 2d + 1. Therefore, for any (L, d) such that L < 2d + 1, which corresponds to the region above the diagonal line, we will have P_e(L, d) = P_e(L, (L − 1)/2). As we would expect, the misalignment probability P_e(L, d) becomes smaller as L and d get larger. Figure 7(b) shows the misalignment probability P_e(L, d) of 5S rRNAs for L = 2d + 1. The pairwise alignments have been divided into different groups based on the percentage identity (or percent sequence similarity) of the aligned RNAs, and the alignment error probability P_e(L, d) has been computed for the respective groups. As we can see in Figure 7(b), the error probability is generally lower for RNAs with higher percentage identity. This is expected, since the seed regions are predicted from a sequence-based alignment, which will be more accurate if the RNAs have higher sequence similarity. Figures 7(c) and 7(d) show the misalignment probability P_e(L, d) for tRNAs, which exhibit similar trends. We also computed P_e(L, d) for RNAs with 60%–100% identity for different values of (L, d). This is summarized in Table 1. For example, the alignment error probability P_e(L, d) for 5S rRNAs is 1.81% for (L, d) = (9, 4) and 0.79% for (L, d) = (15, 7). This implies that if we choose Γ = 9 and γ = 4 when finding the seed regions (see Section 3.2), more than 98% of the alignments in the predicted seed regions will be correct. In our experiments, we observed that the misaligned bases were typically located within 1–2 base positions of the correct ones. This implies that if we let
Δ = 1 or Δ = 2, most of the correct base alignments (m, n) will be included in the constrained alignment space S(k) in (9). Therefore, imposing these constraints will not degrade the accuracy of the final structural alignment.

Table 1: Misalignment probability in the seed regions of RNAs with 60%–100% identity. The probability P_e(L, d) has been estimated for five different values of (L, d); entries are error percentages.

Family / (L, d)       (7, 3)   (9, 4)   (11, 5)   (13, 6)   (15, 7)
5S ribosomal RNA       2.64     1.81     1.32      1.00      0.79
Transfer RNA           3.60     2.46     1.80      1.44      1.28

Table 2: Basic properties of the RNA families used in the experiments.

Family                Number of seed sequences   Average length   Average percentage identity
Transfer RNA          1088                       72.7             45
5S ribosomal RNA      602                        116.8            61
Corona pk3            14                         62.5             70
HDV ribozyme          15                         88.8             95
Tombus 3 IV           18                         64.5             94
Flavi pk3             14                         95.4             69

4.4. RNA Structural Alignment with the Proposed Constraints. As mentioned earlier, we applied the proposed constraints to the profile-csHMM-based structural alignment algorithm [17]. For our experiments, we chose six RNA families from the Rfam database [25]. The Flavi pk3 RNAs were obtained from Rfam 7.0; they are now part of the Flavi CRE family in Rfam 8.1. For the other families, we obtained the RNAs from Rfam 8.1. Among the six families, two (tRNAs and 5S rRNAs) do not contain pseudoknots in their secondary structures, while the other four (Corona pk3, HDV ribozyme, Tombus 3 IV, and Flavi pk3) contain pseudoknots. The basic properties of these RNA families, such as the number of RNAs in the Rfam seed alignment, the average length of the member sequences, and their average percentage (sequence) identity, are shown in Table 2. For each RNA family, we performed the following experiment:

(1) Choose a reference RNA from the seed alignment.
(2) Construct a profile-HMM and a profile-csHMM based on the reference RNA.
(3) Choose a different target RNA from the seed alignment.
(4) Estimate the alignment constraint using the profile-HMM.
(5) Apply the alignment constraint and find the structural alignment using the profile-csHMM.
(6) Repeat steps 3 to 5 for different target RNAs.
(7) Repeat steps 1 to 6 for different reference RNAs.

In order to measure the quality of the structural alignment, we predicted the secondary structure of the target RNA based on the structural alignment, and compared it to the trusted structure in the Rfam database. Then we counted the number of correctly predicted base pairs (TP; true positives), the number of incorrectly predicted base pairs (FP; false positives), and the number of true base pairs that could not be predicted (FN; false negatives). Based on these numbers, we estimated the sensitivity (SN) and the positive predictive value (PPV) as follows:

SN = TP / (TP + FN),  PPV = TP / (TP + FP).    (17)
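A direct transcription of (17), assuming the predicted and trusted secondary structures are represented as sets of base-pair tuples (i, j):

def sensitivity_ppv(predicted, trusted):
    """SN and PPV of (17) from predicted vs. trusted base-pair sets."""
    tp = len(predicted & trusted)        # correctly predicted base pairs
    fp = len(predicted - trusted)        # predicted pairs absent from the trusted structure
    fn = len(trusted - predicted)        # true base pairs that were missed
    sn = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return sn, ppv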
The sensitivity is defined as the fraction of base pairs in the trusted structure that could be predicted by the algorithm, and the positive predictive value is defined as the fraction of predicted base pairs that were correct. We first tested the performance of the proposed approach on RNA families that do not contain pseudoknots. In order to compare the effectiveness of different alignment constraints, we repeated the above experiment for the following methods:

(1) profile-csHMM + proposed alignment constraint (referred to as “PROPOSED”);
(2) profile-csHMM + M-constraint (referred to as “M-CONSTRAINT”);
(3) profile-csHMM (original implementation in [17]; referred to as “ORIGINAL”).

Table 3 summarizes the average sensitivity (SN), positive predictive value (PPV), and alignment time when using different alignment constraints with the profile-csHMM-based structural alignment method. The CPU time for finding the alignment was measured on a MacPro with two 2.8 GHz quad-core Intel Xeon processors and 4 GB of memory. These results were obtained from one thousand structural alignments of distinct pairs of RNAs chosen from the seed alignment of the respective RNA families. For these experiments, we used (Γ, γ, Δ) = (9, 4, 2) for tRNAs and (Γ, γ, Δ) = (9, 4, 0) for 5S rRNAs. These parameters were chosen based on the analysis in Section 4.3. In general, there will be a tradeoff between alignment accuracy and runtime; these parameters were used as they provide a good balance between the two measures. Further discussion of this tradeoff can be found at the end of Section 4.4. For the M-constraint defined in (1), we used M = 7 as in [6]. As we can see in Table 3, all three methods were able to achieve accurate alignment results that were comparable to each other. However, adopting the proposed alignment constraint improved the average alignment speed significantly: around 7–68 times faster compared to the fixed M-constraint, and up to 2.4 times faster compared to the original implementation in [17], which uses a simple heuristic. In order to test the performance of the proposed method on RNA pseudoknots, we carried out similar experiments
using four pseudoknotted RNA families: Corona pk3, HDV ribozyme, Tombus 3 IV, and Flavi pk3. For these experiments, we used (Γ, γ, Δ) = (9, 4, 0) and M = 3 for all four RNA families. In addition to evaluating the performance of the profile-csHMM method for these families, we evaluated the performance of the PSTAG-based method [20] for comparison. The PSTAG-based structural alignment method is a state-of-the-art pairwise RNA alignment method that uses pair stochastic tree adjoining grammars (PSTAGs). PSTAGs can be used for aligning many known pseudoknots, though not all of them. To the best of our knowledge, the PSTAG-based alignment method [20] is the only grammar-based method, other than the profile-csHMM method, that can find the structural alignment of pseudoknotted RNAs.

Table 3: Average sensitivity (SN), positive predictive value (PPV), and alignment time for RNA families that do not contain pseudoknots.

                      M-constraint                  Profile-csHMM Original        Proposed
Family                SN (%)  PPV (%)  Time (sec)   SN (%)  PPV (%)  Time (sec)   SN (%)  PPV (%)  Time (sec)
Transfer RNA          94.2    95.8     0.0739       94.1    96.0     0.0139       93.6    96.2     0.0108
5S ribosomal RNA      94.8    96.3     0.0676       95.1    97.0     0.0024       95.9    98.5     0.0010

Table 4: Average sensitivity (SN) and positive predictive value (PPV) for RNA families with pseudoknots.

                M-constraint       Profile-csHMM Original   Proposed           PSTAG
Family          SN (%)  PPV (%)    SN (%)  PPV (%)          SN (%)  PPV (%)    SN (%)  PPV (%)
Corona pk3      95.5    95.7       95.7    96.5             94.8    96.0       94.6    95.5
HDV ribozyme    94.5    95.1       94.5    95.3             94.2    95.9       94.1    95.6
Tombus 3 IV     95.9    96.4       95.9    96.4             96.8    97.4       97.4    97.4
Flavi pk3       94.6    96.5       94.5    96.4             94.5    96.8       N/A     N/A

Table 4 shows the average sensitivity and positive predictive value of the different alignment methods. The sensitivity and positive predictive value of the PSTAG-based method were obtained from [20] based on the same test set. As we can see in this table, all four methods achieved high sensitivity and PPV for the Corona pk3, HDV ribozyme, and Tombus 3 IV RNA families. The Flavi pk3 RNAs could not be aligned using PSTAGs, as they have a more complex secondary structure than the other RNA families. Unlike PSTAGs, profile-csHMMs can handle RNAs with any kind of base pairs, hence they could be used for aligning the Flavi pk3 RNAs as well. The current implementation can handle any RNAs in the Rivas and Eddy class [26], which includes nearly all known pseudoknots; RNAs outside the Rivas and Eddy class can also be handled by incorporating additional adjoining rules. See [19] for further discussion of adjoining rules and the descriptive capability of profile-csHMMs. Table 4 shows that all three profile-csHMM-based approaches yielded accurate alignment results for the Flavi pk3 RNAs. By comparing the performance of the profile-csHMM method with different constraints, we can note that incorporating the proposed alignment constraint virtually did not affect the alignment accuracy. As we can see in Table 5, the proposed sequence alignment constraint was able to significantly improve the
alignment speed for pseudoknotted RNAs as well. In fact, by comparing the results in Tables 3 and 5, we can observe that the overall computational gain becomes even larger for RNAs with more complicated secondary structures. The new constraint made the alignment around 40–100 times faster than the fixed M-constraint (using M = 3), and around 3–11 times faster than the original implementation [17], at a comparable prediction accuracy. We can also note that the PSTAG-based alignment takes considerably longer than the profile-csHMM-based alignment. The large difference in alignment speed is mainly due to the fact that the PSTAG algorithm [20] does not incorporate any constraint to restrict the alignment space.

Table 5: Average CPU time for finding the structural alignment of RNAs containing pseudoknots.

                  | M-constraint  | Profile-csHMM Original | Proposed    | PSTAG
                  | Time (sec)    | Time (sec)             | Time (sec)  | Time (sec)
Corona pk3        |   9.37        |   0.71                 |   0.23      |   19.65
HDV ribozyme      |  10.30        |   1.03                 |   0.13      |  158.77
Tombus 3 IV       |   6.99        |   0.35                 |   0.07      |  193.06
Flavi pk3         |  13.31        |   3.96                 |   0.35      |   N/A

It would also be interesting to see how the parameters used for predicting the alignment constraint affect the overall performance. For this purpose, we repeated the previous experiment using different values of Γ and γ. Three pairs of (Γ, γ) were chosen based on the experimental results shown in Figure 7, such that the average misalignment probability does not exceed 5% for both tRNAs and 5S rRNAs. We used Γ = 2γ + 1 in all three cases, so that the minimum length of a seed region is one. Note that if there are regions with more than Γ consecutive matches in the sequence-based alignment, the lengths of the corresponding seed regions will be longer than this minimum. In all three experiments, the parameter Δ was set to zero. Table 6 shows the sensitivity, PPV, and alignment time for the different pairs of (Γ, γ). In general, small Γ and γ tend to increase the fraction of bases included in the seed regions, thereby reducing the overall alignment space. As a consequence, the alignment time becomes smaller, as we can see in Table 6. However, if these values are made too small, the resulting alignment space can become too restricted, hence degrading the alignment accuracy. This phenomenon could be observed when aligning the Corona pk3 RNAs with Γ = 7 and γ = 3.

Table 6: Performance of the proposed approach for different parameters (Profile-csHMM, Proposed).

                  |       Γ = 7, γ = 3           |       Γ = 9, γ = 4           |       Γ = 11, γ = 5
                  | SN (%)  PPV (%)  Time (sec)  | SN (%)  PPV (%)  Time (sec)  | SN (%)  PPV (%)  Time (sec)
Corona pk3        |  92.9    94.4     0.139      |  94.8    96.0     0.232      |  95.3    96.2     0.278
HDV ribozyme      |  93.2    95.7     0.131      |  94.2    95.9     0.133      |  94.5    95.5     0.147
Tombus 3 IV       |  96.5    97.1     0.065      |  96.8    97.4     0.068      |  96.6    97.3     0.069
Flavi pk3         |  94.8    97.2     0.329      |  94.5    96.8     0.351      |  94.5    96.8     0.362

In Tables 3 and 5, we have shown that the proposed alignment constraint can significantly reduce the average computational requirement for finding the RNA structural alignments. Since the proposed method estimates the constraint based on the sequence alignment of the given RNAs, the actual reduction in complexity depends on the degree of sequence similarity between the RNAs. Suppose we have a reference RNA of length $L_r$ and a target RNA of length $L_t$. In the best case, when these RNAs are perfectly aligned, the overall computational cost is dominated by the constraint estimation step, hence the resulting complexity is $O(L_r L_t)$. In the worst case, the complexity is identical to that of an unconstrained profile-csHMM alignment, which is $O(L_r L_t^3)$ for RNAs without pseudoknots, $O(L_r L_t^4)$ for typical RNA pseudoknots (including Corona pk3, HDV ribozyme, Tombus 3 IV, and Flavi pk3 used in our experiments), and $O(L_r L_t^6)$ for RNAs with the most complicated secondary structure in the Rivas and Eddy work [26]. In general, the maximum distance between alignable bases is limited by the constraint (12). If we define D as

$$D = 2\Delta + \max_{k} \left[ \Delta_{\max}(k) - \Delta_{\min}(k) \right], \qquad (18)$$

the computational complexity of the profile-csHMM alignment method with the proposed constraint is $O(L_r L_t + L_r D^3)$ for RNAs that do not contain pseudoknots. For pseudoknotted RNAs in the Rivas and Eddy class, the complexity ranges between $O(L_r L_t + L_r D^4)$ and $O(L_r L_t + L_r D^6)$.
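As a rough, hypothetical illustration of how the band width D of (18) drives these complexity bounds, the following Python sketch computes D from per-position bounds and compares the cost estimates. All values below (the bound arrays, the slack, the sequence lengths) are invented for the example and are not taken from the paper:

```python
import numpy as np

def band_width(delta_min, delta_max, slack):
    """D = 2*Delta + max_k [Delta_max(k) - Delta_min(k)], cf. (18)."""
    return 2 * slack + int(np.max(np.asarray(delta_max) - np.asarray(delta_min)))

# Hypothetical per-position bounds, as would be estimated from a sequence alignment.
delta_min = [0, 0, 1, 1, 2, 2]
delta_max = [1, 2, 2, 3, 4, 4]
D = band_width(delta_min, delta_max, slack=2)

L_r, L_t = 120, 130    # made-up reference/target lengths
cost_unconstrained = L_r * L_t ** 4            # typical pseudoknot, O(Lr Lt^4)
cost_constrained = L_r * L_t + L_r * D ** 4    # with the constraint, O(Lr Lt + Lr D^4)
print(D, cost_unconstrained // cost_constrained)   # rough speed-up factor
```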
5. Concluding Remarks

In this paper, we proposed a new method for finding an effective alignment constraint for fast and accurate structural alignment of RNAs. The proposed method is especially useful for accelerating the dynamic programming alignment algorithms of family-specific models, such as profile-csHMMs or CMs. The alignment constraint proposed in this paper is not very sensitive to small parameter changes in the model that is used to predict the constraint. Therefore, it can be
especially useful when we do not have enough sequences in the reference RNA family for training the model. We applied the new constraint to the profile-csHMM-based structural alignment method [17] and evaluated its performance using several RNA families containing pseudoknots. Experimental results showed that the proposed alignment constraint can significantly reduce the alignment time without any loss of alignment accuracy. Although we have mainly focused on incorporating the proposed constraint into the profile-csHMM-based method, these constraints can certainly be used to expedite other alignment methods based on CMs [3, 23] or PSTAGs [20].
References [1] S. R. Eddy, “Non-coding RNA genes and the modern RNA world,” Nature Reviews Genetics, vol. 2, no. 12, pp. 919–929, 2001. [2] B.-J. Yoon and P. P. Vaidyanathan, “Computational identification and analysis of noncoding RNAs—unearthing the buried treasures in the genome,” IEEE Signal Processing Magazine, vol. 24, no. 1, pp. 64–74, 2007. [3] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, Cambridge, UK, 1998. [4] D. Sankoff, “Simultaneous solution of the RNA folding, alignment, and protosequence problems,” SIAM Journal on Applied Mathematics, vol. 45, no. 5, pp. 810–825, 1985. [5] R. D. Dowell and S. R. Eddy, “Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints,” BMC Bioinformatics, vol. 7, article 400, 2006. [6] A. O. Harmanci, G. Sharma, and D. H. Mathews, “Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign,” BMC Bioinformatics, vol. 8, article 130, 2007. [7] J. H. Havgaard, R. B. Lyngsø, G. D. Stormo, and J. Gorodkin, “Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%,” Bioinformatics, vol. 21, no. 9, pp. 1815–1824, 2005.
[8] J. H. Havgaard, E. Torarinsson, and J. Gorodkin, “Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix,” PLoS Computational Biology, vol. 3, no. 10, p. e193, 2007. [9] I. Holmes, “Accelerated probabilistic inference of RNA structure evolution,” BMC Bioinformatics, vol. 6, article 73, 2005. [10] D. H. Mathews, “Predicting a set of minimal free energy RNA secondary structures common to two sequences,” Bioinformatics, vol. 21, no. 10, pp. 2246–2253, 2005. [11] A. V. Uzilov, J. M. Keegan, and D. H. Mathews, “Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change,” BMC Bioinformatics, vol. 7, article 173, 2006. [12] Z. Weinberg and W. L. Ruzzo, “Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy,” Bioinformatics, vol. 20, supplement 1, pp. i334–i341, 2004. [13] Z. Weinberg and W. L. Ruzzo, “Sequence-based heuristics for faster annotation of non-coding RNA families,” Bioinformatics, vol. 22, no. 1, pp. 35–39, 2006. [14] B.-J. Yoon and P. P. Vaidyanathan, “Fast search of sequences with complex symbol correlations using profile context-sensitive HMMs and pre-screening filters,” in Proceedings of the 32nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’07), vol. 1, pp. I345–I348, Honolulu, Hawaii, USA, April 2007. [15] B.-J. Yoon and P. P. Vaidyanathan, “Fast structural similarity search of noncoding RNAs based on matched filtering of stem patterns,” in Proceedings of the 41st Asilomar Conference on Signals, Systems and Computers (ACSSC ’07), pp. 44–48, Pacific Grove, Calif, USA, November 2007. [16] B.-J. Yoon and P. P. Vaidyanathan, “Profile context-sensitive HMMs for probabilistic modeling of sequences with complex correlations,” in Proceedings of the 31st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’06), vol. 3, pp. 317–320, Toulouse, France, May 2006. [17] B.-J. Yoon and P. P. Vaidyanathan, “Structural alignment of RNAs using profile-csHMMs and its application to RNA homology search: overview and new results,” IEEE Transactions on Automatic Control, vol. 53, pp. 10–25, 2008. [18] E. P. Nawrocki and S. R. Eddy, “Query-dependent banding (QDB) for faster RNA similarity searches,” PLoS Computational Biology, vol. 3, no. 3, p. e56, 2007. [19] B.-J. Yoon and P. P. Vaidyanathan, “Fast structural alignment of RNAs by optimizing the adjoining order of profile-csHMMs,” IEEE Journal on Selected Topics in Signal Processing, vol. 2, no. 3, pp. 400–411, 2008. [20] H. Matsui, K. Sato, and Y. Sakakibara, “Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures,” Bioinformatics, vol. 21, no. 11, pp. 2611–2617, 2005. [21] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989. [22] B.-J. Yoon and P. P. Vaidyanathan, “Context-sensitive hidden Markov models for modeling long-range dependencies in symbol sequences,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4169–4184, 2006. [23] R. J. Klein and S. R. Eddy, “RSEARCH: finding homologs of single structured RNA sequences,” BMC Bioinformatics, vol. 4, article 44, 2003. [24] J. Gorodkin, L. J. Heyer, and G. D. Stormo, “Finding the most significant common sequence and structure motifs in a set of RNA sequences,” Nucleic Acids Research, vol. 25, no. 18, pp. 3724–3732, 1997. [25] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman, “Rfam: annotating non-coding RNAs in complete genomes,” Nucleic Acids Research, vol. 33, database issue, pp. D121–D124, 2005. [26] E. Rivas and S. R. Eddy, “The language of RNA: a formal grammar that includes pseudoknots,” Bioinformatics, vol. 16, no. 4, pp. 334–340, 2000.
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 924601, 8 pages doi:10.1155/2009/924601
Research Article
A Hybrid Technique for the Periodicity Characterization of Genomic Sequence Data
Julien Epps1,2
1 School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney NSW 2052, Australia
2 National Information Communication Technology Australia (NICTA), Australian Technology Park, Eveleigh 1430, Australia
Correspondence should be addressed to Julien Epps, [email protected]

Received 29 May 2008; Revised 13 October 2008; Accepted 21 January 2009

Recommended by Ulisses Braga-Neto

Many studies of biological sequence data have examined sequence structure in terms of periodicity, and various methods for measuring periodicity have been suggested for this purpose. This paper compares two such methods, autocorrelation and the Fourier transform, using synthetic periodic sequences, and explains the differences in the periodicity estimates produced by each. A hybrid autocorrelation-integer period discrete Fourier transform is proposed that combines the advantages of both techniques. Collectively, this representation and a recently proposed variant on the discrete Fourier transform offer alternatives to the widely used autocorrelation for the periodicity characterization of sequence data. Finally, these methods are compared for various tetramers of interest in C. elegans chromosome I.

Copyright © 2009 Julien Epps. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction

The detection of structure within the DNA sequence has long captivated the interest of the research community. Among the various statistical characterizations of sequence data, one measure of structure within sequences is the degree of correlation or periodicity at various displacements along the sequence. Periodicity characterization of sequence data provides a compact and informative representation that has been used in many studies of structure within genomic sequences, including DNA sequence analysis [1], gene and exon detection [2], tandem repeat detection [3], and DNA sequence search and retrieval [4]. To measure such periodicity, autocorrelation has been widely employed [1, 5–11]. Similarly, Fourier analysis and its variants have been used for periodicity characterization of sequences [4, 9, 12–24]. In some cases [25, 26], the Fourier transform of the autocorrelation sequence has also been computed; however, using existing symbolic-numeric mappings such as binary indicator sequences [27], this transform can also be calculated without first determining the autocorrelation. Other recent promising approaches to periodicity characterization for biological sequences include the periodicity transform [28], the exactly periodic subspace decomposition [3], and maximum-likelihood statistical periodicity [29]; however, these techniques have yet to be adopted by biologists for the purposes of sequence structure characterization.

Studies of structure within sequences, such as those referenced above, have tended to use either the autocorrelation or the Fourier transform, and to the author's knowledge, the limitations of each have not been compared in this context. In this paper, the limitations of both approaches are investigated using synthetic symbolic sequences, and caveats to their characterization of sequence data are discussed. A hybrid approach to the periodicity characterization of symbolic sequence data is introduced, and its use is illustrated in a comparative manner on a study of tetramers in C. elegans.
2. Periodicity Measures for Symbolic Sequence Characterization

2.1. Definition of Periodicity. Perhaps the most common definition of exact periodicity in a general sequence s[n] is

$$s[n + p] = s[n] \quad \forall n \in \mathbb{Z}, \qquad (1)$$

for some $p \in \mathbb{Z}^+$. Assuming s[n] can be represented numerically as x[n], this definition admits the following decomposition:

$$x[n] = \sum_{k=-\infty}^{\infty} x_p[k]\,\delta_p[k - n], \qquad (2)$$

where

$$x_p[n] = \begin{cases} x[n] & 0 \le n < p, \\ 0 & \text{elsewhere}, \end{cases} \qquad (3)$$

is the numerical representation of a repeated symbol or pattern, and $\delta_p[n]$ is a periodic binary impulse train:

$$\delta_p[n] = \delta[n - kp] \quad \forall k \in \mathbb{Z}. \qquad (4)$$

While this expression of x[n] in terms of a binary impulse train is perhaps not so common in signal processing of numerical sequences, the reverse is true for DNA sequences, which have been represented numerically using binary indicator sequences [27] in many studies (e.g., [13, 19, 23, 24, 30]).
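As a small illustration of the decomposition in (2)–(4), the following Python sketch (NumPy assumed; all names are invented for the example) builds the periodic binary impulse train and the periodic extension of a repeated pattern:

```python
import numpy as np

def impulse_train(p, N):
    """Periodic binary impulse train delta_p[n]: 1 at n = k*p, else 0 -- cf. (4)."""
    d = np.zeros(N, dtype=int)
    d[::p] = 1
    return d

def periodic_extension(x_p, N):
    """Sequence x[n] built by repeating the length-p pattern x_p -- cf. (2)-(3)."""
    x_p = np.asarray(x_p)
    reps = -(-N // len(x_p))          # ceil(N / p)
    return np.tile(x_p, reps)[:N]

# Toy numeric pattern of period 4, extended to length 12.
print(periodic_extension([1.0, 0.5, -0.5, 0.0], N=12))
print(impulse_train(4, 12))
```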
2.2. Autocorrelation. The autocorrelation of a finite-length numerical sequence x[n] is defined as

$$r_{xx}[\rho] = \sum_{n=0}^{N-1} x[n]\,x[(n - \rho) \bmod N], \qquad (5)$$

where n is the sequence index, ρ is the lag, and N is the length of the sequence. The application of the autocorrelation as defined in (5) to a symbolic sequence s[n] requires a numerical representation x[n]. The binary indicator sequences [27], which are sufficiently general as to form the basis for many different representations of DNA sequences, are employed in this analysis to represent s[n] in terms of M binary signals:

$$b_m[n] = \begin{cases} 1 & \text{if } s[n] = S_m,\ m = 1, 2, \ldots, M, \\ 0 & \text{otherwise}, \end{cases} \qquad (6)$$

where M is the number of symbols (or patterns of symbols, such as a polynucleotide) $S_1, \ldots, S_M$, to which the numerical values $a_1, \ldots, a_M$ are assigned, respectively, resulting in M components $x_m[n] = a_m b_m[n]$. Assuming $a_1 \ne a_2 \ne \cdots \ne a_M$, the numerical representation can thus be unambiguously expressed as

$$x[n] = \sum_{m=1}^{M} x_m[n] = \sum_{m=1}^{M} a_m b_m[n]. \qquad (7)$$

Note that applying the decomposition in (2) to an exactly periodic sequence results in $x_p[n]$ comprising a sequence of the numerical values $a_m$ that correspond to the repeated pattern of symbols. Alternatively, the autocorrelation can be defined directly on a symbolic sequence s[n], as used in [20]:

$$r_{ss}[\rho] = \begin{cases} 1 & \text{if } s[n] = s[n - \rho], \\ 0 & \text{otherwise}, \end{cases} \qquad (8)$$

so that the autocorrelation at a lag, or period, $p \in \mathbb{Z}^+$ for a symbol (or pattern of symbols) is simply the count of the number of instances of that symbol at a spacing of ρ. Consider now a sequence containing a symbol (or pattern of symbols) $S_m$ that repeats with exactly period p, so that the numerical representation of the sequence has a component $x_m[n] = a_m b_m[n] = a_m \delta_p[n]$. The autocorrelation of this component $x_m[n]$, for a segment of finite length N, has the following expression:

$$r_{x_m x_m}[\rho] = \sum_{n=0}^{N-1} a_m \delta_p[n]\,a_m \delta_p[(n - \rho) \bmod N] = a_m^2\, E_{\delta_p}\, \delta_p[\rho], \qquad (9)$$
where $E_{\delta_p} = N/p$ is the energy of $\delta_p[n]$ over a segment of finite length N. Thus a shortcoming of the autocorrelation for sequence characterization is that an exactly p-periodic sequence will show not only a peak at ρ = p, but also peaks at values of ρ that are integer multiples of p (an example is given in Figure 1(a)). Note that similar artifacts can be found in other periodicity detection methods (e.g., [29]).

2.3. Fourier Interpretation of Periodicity. In many applications, including sequence analysis, the discrete Fourier transform has been used to determine the periodic component(s) of a numerical sequence x[n]. The discrete Fourier transform (DFT) of a numerical sequence x[n] is defined as

$$X[k] = \sum_{n=0}^{N-1} x[n] \exp\left( -j \frac{2\pi n k}{N} \right), \quad k = 0, 1, \ldots, N - 1, \qquad (10)$$

where k is the discrete frequency index. Since the DFT has sinusoidal basis functions, the notion of periodicity in the Fourier sense is described in terms of the frequencies of those basis functions onto which the projections of x[n] are the largest in magnitude. That is, the magnitude of the DFT at a frequency k, |X[k]|, is often taken as an estimate of the relative amount of that frequency component occurring in x[n] [13, 19, 23, 24], from which the relative contribution of a particular period p = N/k can be estimated. Assuming a numerical representation x[n] of the kind shown in (7), the linearity property of the DFT means that the DFT of a symbolic sequence s[n] can be determined as

$$X[k] = \sum_{m=1}^{M} a_m B_m[k], \qquad (11)$$

where the $B_m[k]$ are determined according to (10). For the purposes of characterizing sequence data using periodicity, it can be noted that positive integer periods are generally of most interest. This means firstly that N and k need to be carefully chosen to allow fast Fourier transform-based calculation of S[k] for periods ρ = 1, 2, . . . , P, where P is the longest period to be estimated. Secondly, calculating the DFT at other frequencies k ≠ N/ρ is unnecessary. For these reasons, the integer period DFT (IPDFT) was proposed as an alternative to the DFT [19]:

$$X[\rho] = \sum_{n=0}^{N-1} x[n] \exp\left( -j \frac{2\pi n}{\rho} \right), \quad \rho = 1, 2, \ldots, P \le N. \qquad (12)$$
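A minimal Python sketch of the IPDFT in (12), applied to an exactly periodic impulse train, makes the divisor-peak behaviour discussed below easy to reproduce. The function and variable names are illustrative, and a much shorter N than the paper's 10000 is used for speed:

```python
import numpy as np

def ipdft(x, P):
    """Integer period DFT of (12): X[rho] = sum_n x[n] exp(-j 2 pi n / rho)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    return np.array([np.sum(x * np.exp(-2j * np.pi * n / rho))
                     for rho in range(1, P + 1)])

# Exact period-12 impulse train.
N, p = 240, 12
x = np.zeros(N)
x[::p] = 1.0
mag = np.abs(ipdft(x, P=36))
# The largest magnitudes occur at rho = 12 and its integer divisors (6, 4, 3, 2, 1),
# while integer multiples of 12 (rho = 24, 36) stay small.
print(sorted(np.argsort(mag)[-6:] + 1))
```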
Using a similar process to that described above in (10) and (11), the numerical representation of a symbolic sequence x[n] can also be transformed using the IPDFT to produce a spectrum X[ρ] that is linear in period (ρ) rather than in frequency (k). For the periodicity characterization of sequences, usually the magnitude |X[ρ]| is of greatest interest. Some care is needed in the interpretation of the IPDFT, since for a binary periodic sequence such as $\delta_p[n]$ of fixed length N, |X[ρ]| will decrease for longer periods due to the fact that the energy of $\delta_p[n]$ is N/p. Consider now the effect of representing an exactly periodic sequence component $x_m[n]$ using the IPDFT. From (2) and the convolution theorem, $X_m[\rho] = X_{m_p}[\rho]\,\Delta_p[\rho]$, where $\Delta_p[\rho]$ is the IPDFT of $\delta_p[n]$. In particular, if $x_{m_p}[n]$ is assumed to be aperiodic, consider the IPDFT of $\delta_p[n]$:

$$\Delta_p[\rho] = \sum_{n=0}^{N-1} \begin{cases} 1 \cdot \exp\left( -j \dfrac{2\pi n}{\rho} \right) & n = kp,\ k \in \mathbb{Z} \\ 0 & \text{otherwise} \end{cases} = \sum_{k=0}^{\lfloor (N-1)/p \rfloor} \exp\left( -j \frac{2\pi k p}{\rho} \right) = \begin{cases} \left\lfloor \dfrac{N-1}{p} \right\rfloor & \rho = \dfrac{p}{l},\ \text{for } l \in \mathbb{Z}^+ \\ \displaystyle\sum_{k=k_0}^{\lfloor (N-1)/p \rfloor} \exp\left( -j \frac{2\pi k p}{\rho} \right) & \text{otherwise}, \end{cases} \qquad (13)$$

where $k_0 = \lceil \lfloor (N-1)/p \rfloor / \rho \rceil \rho$. That is, $|\Delta_p[\rho]|$ is relatively large for ρ = p/l, and relatively small for ρ ≠ p/l. From this, we see that a shortcoming of Fourier transform approaches such as the IPDFT for sequence characterization by periodicity is that they produce not only a peak at ρ = p, but also peaks at values of ρ that are integer divisors of the period p (see example in Figure 1(b)). For the DFT, this effect is also seen, but instead for indices whose value is k = Nl/p ∈ {0, 1, . . . , N − 1} (i.e., harmonics of the frequency 2π/p with integer frequency indices).

2.4. Periodicity of a Synthetic Sequence Using Autocorrelation and DFT. To illustrate the shortcomings of the autocorrelation and DFT discussed in Sections 2.2 and 2.3, consider the periodicity characterization of an example signal $x_E[n] = \delta_p[n]$ (i.e., exact monomer periodicity $x_p[n] = \delta[n]$), where p = 12 and N = 10000. The autocorrelation and IPDFT are shown in Figures 1(a) and 1(b), respectively, from which the ambiguities in period estimate discussed in Sections 2.2 and 2.3 can be clearly seen.

Figure 1: Periodicity characterization of the period-12 synthetic signal $x_E[n]$ using (a) autocorrelation, (b) integer period DFT, and (c) hybrid autocorrelation-IPDFT.

3. Hybrid Autocorrelation-IPDFT Periodicity Estimation

3.1. Hybrid Autocorrelation-IPDFT. From Figure 1, it is apparent that the autocorrelation and IPDFT are complementary, and that their combination can improve periodicity estimation. This is the motivation for the hybrid autocorrelation-IPDFT period estimate:

$$H_x[\rho] = r_{xx}[\rho]\,|X[\rho]|. \qquad (14)$$

For the simple example signal $x_E[n]$ from Section 2.4, the calculation of $H_x[\rho]$ results in a single, unambiguous periodicity estimate, as seen in Figure 1(c). An alternative, more flexible formulation is

$$H_x[\rho] = \left( r_{xx}[\rho] \right)^{1-\alpha} |X[\rho]|^{\alpha}, \qquad (15)$$
where α ∈ [0, 1], which may be helpful for biologists who have conventionally used either the autocorrelation (α = 0) or the Fourier transform (α = 1). For the purpose of sequence periodicity visualization, for example, α could be represented as a parameter available for real-time control, so that a biologist viewing a periodicity characterization of a sequence might subjectively assign a relative weight to each of the autocorrelation and Fourier transform components. Care is needed, however, with the application of (15), since $(r_{xx}[\rho])^{1-\alpha}$ is only well defined for $r_{xx}[\rho] \ge 0$ for all ρ. Note that this is satisfied by the autocorrelation defined in (8), in addition to a number of DNA numerical representations (several example representations are discussed in [30]). It is further noted that (14) and (15) do not have a straightforward physical interpretation, in contrast to $r_{xx}[\rho]$ and $|X[\rho]|$.
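The hybrid estimates (14)–(15) can be sketched in a few lines of Python (NumPy assumed; names invented here); this is a hedged illustration rather than the author's code. The clipping step reflects the caveat above that $(r_{xx}[\rho])^{1-\alpha}$ is only well defined for nonnegative autocorrelations:

```python
import numpy as np

def circular_autocorrelation(x):
    """r_xx[rho] = sum_n x[n] x[(n - rho) mod N], cf. (5)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.sum(x * np.roll(x, rho)) for rho in range(len(x))])

def hybrid(x, P, alpha=0.5):
    """H_x[rho] = (r_xx[rho])^(1-alpha) * |X[rho]|^alpha, cf. (15)."""
    n = np.arange(len(x))
    r = circular_autocorrelation(x)[1:P + 1]                 # lags rho = 1..P
    X = np.array([abs(np.sum(x * np.exp(-2j * np.pi * n / rho)))
                  for rho in range(1, P + 1)])               # IPDFT magnitudes (12)
    r = np.clip(r, 0.0, None)    # (r_xx)^(1-alpha) requires r_xx >= 0 for all rho
    return r ** (1.0 - alpha) * X ** alpha

# Period-12 impulse train: the hybrid suppresses both the multiples favoured by the
# autocorrelation and the divisors favoured by the IPDFT, leaving a peak at rho = 12.
N = 240
x = np.zeros(N)
x[::12] = 1.0
H = hybrid(x, P=36)
print(2 + int(np.argmax(H[1:])))   # arg max over rho > 1, as later used in (16)
```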
Figure 2: Periodicity characterization of a period-7, 10 and 12 synthetic signal using (a) autocorrelation, (b) integer period DFT, and (c) hybrid autocorrelation-IPDFT.
Applying the hybrid autocorrelation-IPDFT period estimate to another example synthetic signal, one with multiple exact periodic components (N = 10000), further illustrates the shortcomings of the autocorrelation and IPDFT, and suggests the hybrid approach as suitable for periodicity analyses, as seen in Figure 2.

3.2. Evaluation of Periodicity Estimation in Noise. In the absence of an obvious objective evaluation metric for periodicity characterization approaches, one limited approach is to compare their accuracies for the problem of estimating a single periodic component that has been obscured by noise. Specifically, suppose a periodic binary impulse train $\delta_p[n]$ is degraded by random binary noise, simulating the effect of the DNA substitution process, to produce a binary pseudo-periodic signal x[n]. Then estimates of the signal periodicity using each of the autocorrelation, integer period DFT, and hybrid autocorrelation-IPDFT can be calculated, respectively, as

$$p_A = \arg\max_{\rho > 1} \left( r_{xx}[\rho] \right), \quad p_I = \arg\max_{\rho > 1} \left( |X[\rho]| \right), \quad p_H = \arg\max_{\rho > 1} \left( H_x[\rho] \right), \qquad (16)$$

where $H_x[\rho]$ is calculated using (14) throughout both this section and Section 4. A comparison of the periodicity estimates was conducted by generating synthetic periodic signals of length N = 10000, introducing various amounts of substitution (noise), and estimating $p_A$, $p_I$, and $p_H$. This process was repeated 100 times for each combination of period and substitution rate tested. The resulting average period error rates are shown as a function of substitution rate for three example values of period p in Figure 3 (p small, p larger and prime, and p larger and highly composite), and as a function of the period in Figure 4. These results confirm earlier observations that the IPDFT provides more robust period estimates for prime periods than the autocorrelation, while the reverse is true for highly composite periods. The results also show that the hybrid technique is often able to provide a lower period error rate than either the autocorrelation or the IPDFT. Exceptions to this occur for some prime periods (see Figure 4), where the poorer performance of the autocorrelation seems to slightly adversely affect the hybrid estimate $p_H$ relative to the IPDFT-only estimate $p_I$.

Figure 3: Error rate versus substitutions averaged over 100 instances of sequences of length 10000 with (a) p = 7, (b) p = 23, (c) p = 24, for period estimates using autocorrelation (. . .), integer period DFT (- - -), and hybrid autocorrelation-IPDFT (—).

Figure 4: Error rate versus period averaged over 100 instances of sequences of length 10000 with a substitution rate of 30%, for period estimates using autocorrelation (. . .), integer period DFT (- - -), and hybrid autocorrelation-IPDFT (—).

3.3. Evaluation of Multiple Periodicity Estimation. For periodicity characterization, a more relevant evaluation criterion is the extent to which all periodicities present can be detected correctly. Since an exhaustive evaluation is impractical, in this work, synthetic sequences comprising three randomly chosen integer periodic components $p_1, p_2, p_3 \in \{2, 3, \ldots, 40\}$ with $p_1 \ne p_2 \ne p_3$ were constructed, and the frequency with which all three periods were correctly detected was measured. When multiple perfectly periodic components are present in a binary signal, the shorter periods will be favoured during estimation, as a result of their greater occurrence in a fixed-length signal. Hence, when combining three periodic components, the shorter period components were randomly eroded to give an equal occurrence between all periods. In the general case of multiple periodicities, some periodic components will be stronger than others. To simulate this, the $p_2$-periodic component was further randomly eroded by γ% and the $p_3$-periodic component by 2γ%; that is, larger values of γ correspond to a more dominant $p_1$ component. Erosions of greater than about 20% were experimentally found to degrade the accuracy of all three period estimates, using all methods. Finally, the percentage of instances for which the periods $p_1$, $p_2$, and $p_3$ were correctly estimated in the correct order of strength, according to the 3-best period estimates calculated similarly to (16), was determined. The results, shown in Figure 5, strongly support the validity of the proposed hybrid autocorrelation-IPDFT technique relative to the autocorrelation and IPDFT.

Figure 5: Percentage of sequence instances for which all three periods were correctly estimated in order of strength versus erosion γ, over 500 instances of sequences of length 10000 with three randomly chosen integer periodic components, estimated using autocorrelation (. . .), integer period DFT (- - -), and hybrid autocorrelation-IPDFT (—).

It is noted that the signal processing literature includes examples of methods for detecting multiple periodic signal components, such as the MUSIC algorithm [31]. For comparative purposes, the above experiment was repeated employing MUSIC to estimate the strengths of the periodic components. Results indicated that MUSIC was unable to consistently estimate either the periods or the relative strengths of the three components, returning no instances of all three periods correct and in the correct order. The dominant period estimate often contained the common factors of two or more of the true periodic components, an artifact attributable to the superposition of harmonic spectra reinforcing multiples of the individual component fundamentals that coincide in frequency. Two assumptions of MUSIC are not valid for this application: (i) the periodic components are not sinusoidal (although they can be represented as a harmonic series of sinusoids); (ii) the periodic components and noise may not be uncorrelated.

Figure 6: (a) Autocorrelation from [1], (b) integer period DFT magnitude, and (c) hybrid autocorrelation-IPDFT of TATA tetramers from C. elegans chromosome I.
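Before turning to real data, the noise-evaluation protocol of Section 3.2 can be compressed into a short Python sketch (NumPy assumed; the length N, trial count, and substitution model below are reduced toy choices rather than the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def substitute(x, rate):
    """Replace a fraction `rate` of positions with random binary values
    (a simple stand-in for the DNA substitution process)."""
    y = x.copy()
    hit = rng.random(len(x)) < rate
    y[hit] = rng.integers(0, 2, hit.sum())
    return y

def hybrid_period(x, P):
    """p_H of (16) with H_x[rho] = r_xx[rho] * |X[rho]| as in (14)."""
    n = np.arange(len(x))
    r = np.array([np.sum(x * np.roll(x, rho)) for rho in range(1, P + 1)])
    X = np.array([abs(np.sum(x * np.exp(-2j * np.pi * n / rho))) for rho in range(1, P + 1)])
    H = r * X
    return 2 + int(np.argmax(H[1:]))          # arg max over rho > 1

# Toy error-rate estimate for p = 7 at a 30% substitution rate.
N, p, trials = 2000, 7, 20
x0 = np.zeros(N)
x0[::p] = 1.0
errors = sum(hybrid_period(substitute(x0, 0.30), P=50) != p for _ in range(trials))
print(errors / trials)
```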
4. Application to DNA Sequence Data

Having discussed the differences between the autocorrelation and DFT for synthetic sequences, we now investigate the effect of using the IPDFT and hybrid autocorrelation-IPDFT in place of the autocorrelation on real sequence data. Numerous researchers have used autocorrelation [1, 5–10, 32]; here we compare with examples from the study of tetramer periodicity in the C. elegans genome using autocorrelation by Kumar et al. [1]. In the investigation of TATA tetramers, particular mention was made of the strong period-2 component [1], which features prominently in the estimates by all three techniques, as seen in Figure 6. In the autocorrelation estimate (Figure 6(a)), the period-10 component appears to have been virtually completely masked by the period-2 component.
Figure 7: (a) Autocorrelation from [1], (b) integer period DFT magnitude, and (c) hybrid autocorrelation-IPDFT of TGCC tetramers from C. elegans chromosome I.
Figure 8: (a) Autocorrelation from [1], (b) integer period DFT magnitude, and (c) hybrid autocorrelation-IPDFT of AGAA tetramers from C. elegans chromosome I.
In contrast, the period-10 component features strongly in the IPDFT (Figure 6(b)) and hybrid (Figure 6(c)) estimates. Although this period-10 component was not mentioned in the analysis of TATA tetramers specifically, it was found to be characteristic of all other C. elegans tetramers analyzed in [1]. Note also that the IPDFT reveals a strong period-25 component, not at all evident in the autocorrelation. This surprising result was verified by constructing a synthetic sequence with perfect periodic components at p = 2 and p = 25, and examining its autocorrelation and IPDFT. The autocorrelation of the sequence did not visually display any significant peak at p = 25 until the period-2 component had been eroded by at least 80%. In contrast, the IPDFT showed a clear peak at p = 25 with no period-2 erosion at all. The period-25 component has rarely been noted in previous literature; however, in [11] a filtered distribution of distances between TA dinucleotides shows a strong peak at p = 25, which Salih et al. attribute to a 5-base periodicity associated with the period-10 consensus sequence structure for C. elegans.

In the investigation of TGCC tetramers (see Figure 7), the periodic components at 8 and 35 bp were noted in [1]. The proposed hybrid technique also produces peaks at these periods (mainly due to the autocorrelation in this instance); however, it additionally finds period-12 and period-39 components. Note that the IPDFT produces a strong peak at a 6 bp period (presumably due to its being an integer divisor of 12); however, in the hybrid result this is effectively suppressed by the autocorrelation.

In [1], mention is made of the period-10 and 11 behaviour of AGAA tetramers. As seen in Figure 8, the autocorrelation finds a dominant peak at 9 bp, while the hybrid technique is more convincing in revealing period-10 behaviour. Note that, as previously, the period-5 IPDFT component (presumably due to the 10 bp periodicity) is effectively attenuated in the hybrid result.

Figure 9: (a) Autocorrelation from [1], (b) integer period DFT magnitude, and (c) hybrid autocorrelation-IPDFT of WWWW tetramers from C. elegans chromosome I.
In the investigation of WWWW tetramers (where W represents either A or T), the autocorrelation (Figure 9(a)), as in [1], is dominated by the period-10 component. A very similar characteristic is observed in the distribution of TT to TT dinucleotide distances in [11], and in the distribution of AAAA to AAAA tetramer distances in [33], suggesting a strong influence by these motifs. While the dominance of the period-10 component is similar for the IPDFT, it also detects a relatively strong period-25 component, perhaps due to TA dinucleotide periodicity, as discussed above for TATA tetramers. In this example, the hybrid autocorrelation-IPDFT result is biased towards the IPDFT, as a result of the IPDFT having a larger dynamic range than the autocorrelation. Here, the effect is not detrimental, since it suppresses the spurious peaks at periods 20, 30, and 40; however, in other applications it may be desirable to offset the autocorrelation and/or IPDFT to produce a minimum value of zero prior to calculating the hybrid autocorrelation-IPDFT period estimate.
5. Conclusion

This paper has made two contributions to the periodicity characterization of sequence data. Firstly, the origins of ambiguities in period estimates for symbolic sequences, due to multiples or submultiples of the true period in the autocorrelation and Fourier transform methods, respectively, were explained. This is significant because these two methods account for perhaps the majority of the periodicity analysis seen in the biology literature, and yet, to the author's knowledge, their limitations have not been discussed in this context. Secondly, a hybrid autocorrelation-IPDFT technique for the periodicity characterization of sequences has been proposed. This technique has been shown to provide improved accuracy relative to the autocorrelation and IPDFT for period estimation in noise and for multiple periodicity estimation, on synthetic sequence data. Comparative results from a preliminary investigation of tetramers in C. elegans chromosome I suggest that the proposed approach yields estimates that are consistently less prone to attribute significance to integer multiples or divisors of the true period(s). Thus, the hybrid autocorrelation-IPDFT is putatively advanced as a useful tool for biologists in their quest to reveal and explain structure within biological sequences. Future work will include studies of different types of periodicity in sequence data from other organisms, using IPDFT-based and hybrid techniques.
Acknowledgments The author would like to thank two anonymous reviewers for a number of helpful suggestions, which have certainly improved the quality of this paper. Thanks are also due to Professor Eliathamby Ambikairajah for helpful discussions. This research was supported by a University of New South Wales Faculty of Engineering Early Career Research Grant for genomic signal processing, 2009.
References [1] L. Kumar, M. Futschik, and H. Herzel, “DNA motifs and sequence periodicities,” In Silico Biology, vol. 6, no. 1-2, pp. 71–78, 2006. [2] E. N. Trifonov, “3-, 10.5-, 200- and 400-base periodicities in genome sequences,” Physica A, vol. 249, no. 1–4, pp. 511–516, 1998. [3] D. D. Muresan and T. W. Parks, “Orthogonal, exactly periodic subspace decomposition,” IEEE Transactions on Signal Processing, vol. 51, no. 9, pp. 2270–2279, 2003. [4] E. Santo and N. Dimitrova, “Improvement of spectral analysis as a genomic analysis tool,” in Proceedings of the 5th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS ’07), Tuusula, Finland, June 2007. [5] P. Bernaola-Galván, P. Carpena, R. Román-Roldán, and J. L. Oliver, “Study of statistical correlations in DNA sequences,” Gene, vol. 300, no. 1-2, pp. 105–115, 2002. [6] N. Chakravarthy, A. Spanias, L. D. Iasemidis, and K. Tsakalis, “Autoregressive modeling and feature analysis of DNA sequences,” EURASIP Journal on Applied Signal Processing, vol. 2004, no. 1, pp. 13–28, 2004. [7] H. Herzel, E. N. Trifonov, O. Weiss, and I. Große, “Interpreting correlations in biosequences,” Physica A, vol. 249, no. 1–4, pp. 449–459, 1998. [8] W. Li, “The study of correlation structures of DNA sequences: a critical review,” Computers and Chemistry, vol. 21, no. 4, pp. 257–271, 1997. [9] A. D. McLachlan, “Multichannel Fourier analysis of patterns in protein sequences,” The Journal of Physical Chemistry, vol. 97, no. 12, pp. 3000–3006, 1993. [10] C.-K. Peng, S. V. Buldyrev, A. L. Goldberger, et al., “Long-range correlations in nucleotide sequences,” Nature, vol. 356, no. 6365, pp. 168–170, 1992. [11] F. Salih, B. Salih, and E. N. Trifonov, “Sequence structure of hidden 10.4-base repeat in the nucleosomes of C. elegans,” Journal of Biomolecular Structure and Dynamics, vol. 26, no. 3, pp. 273–281, 2008. [12] V. Afreixo, P. J. S. G. Ferreira, and D. Santos, “Fourier analysis of symbolic data: a brief review,” Digital Signal Processing, vol. 14, no. 6, pp. 523–530, 2004. [13] D. Anastassiou, “Genomic signal processing,” IEEE Signal Processing Magazine, vol. 18, no. 4, pp. 8–20, 2001. [14] J. A. Berger, S. K. Mitra, and J. Astola, “Power spectrum analysis for DNA sequences,” in Proceedings of the 7th International Symposium on Signal Processing and Its Applications (ISSPA ’03), vol. 2, pp. 29–32, Paris, France, July 2003. [15] E. Coward, “Equivalence of two Fourier methods for biological sequences,” Journal of Mathematical Biology, vol. 36, no. 1, pp. 64–70, 1997. [16] S. Datta and A. Asif, “A fast DFT based gene prediction algorithm for identification of protein coding regions,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’05), vol. 5, pp. 653–656, Philadelphia, Pa, USA, March 2005. [17] G. Dodin, P. Vandergheynst, P. Levoir, C. Cordier, and L. Marcourt, “Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences,” Journal of Theoretical Biology, vol. 206, no. 3, pp. 323–326, 2000. [18] V. A. Emanuele II, T. T. Tran, and G. T. Zhou, “A Fourier product method for detecting approximate tandem repeats in DNA,” in Proceedings of the 13th IEEE/SP Workshop on Statistical Signal Processing (SSP ’05), pp. 1390–1395, Bordeaux, France, July 2005.
[19] J. Epps, E. Ambikairajah, and M. Akhtar, “An integer period DFT for biological sequence processing,” in Proceedings of the 6th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS ’08), pp. 1–4, Phoenix, Ariz, USA, June 2008. [20] B. Issac, H. Singh, H. Kaur, and G. P. S. Raghava, “Locating probable genes using Fourier transform approach,” Bioinformatics, vol. 18, no. 1, pp. 196–197, 2002. [21] V. Ju. Makeev and V. G. Tumanyan, “Search of periodicities in primary structure of biopolymers: a general Fourier approach,” Computer Applications in the Biosciences, vol. 12, no. 1, pp. 49–54, 1996. [22] B. D. Silverman and R. Linsker, “A measure of DNA periodicity,” Journal of Theoretical Biology, vol. 118, no. 3, pp. 295–300, 1986. [23] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, and R. Ramaswamy, “Prediction of probable genes by Fourier analysis of genomic sequences,” Computer Applications in the Biosciences, vol. 13, no. 3, pp. 263–270, 1997. [24] W. Wang and D. H. Johnson, “Computing linear transforms of symbolic signals,” IEEE Transactions on Signal Processing, vol. 50, no. 3, pp. 628–634, 2002. [25] S. Hosid, E. N. Trifonov, and A. Bolshoy, “Sequence periodicity of Escherichia coli is concentrated in intergenic regions,” BMC Molecular Biology, vol. 5, article 14, pp. 1–7, 2004. [26] P. Worning, L. J. Jensen, K. E. Nelson, S. Brunak, and D. W. Ussery, “Structural analysis of DNA sequence: evidence for lateral gene transfer in Thermotoga maritima,” Nucleic Acids Research, vol. 28, no. 3, pp. 706–709, 2000. [27] R. F. Voss, “Evolution of long-range fractal correlations and 1/f noise in DNA base sequences,” Physical Review Letters, vol. 68, no. 25, pp. 3805–3808, 1992. [28] W. A. Sethares and T. W. Staley, “Periodicity transforms,” IEEE Transactions on Signal Processing, vol. 47, no. 11, pp. 2953–2964, 1999. [29] R. Arora and W. A. Sethares, “Detection of periodicities in gene sequences: a maximum likelihood approach,” in Proceedings of the 5th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS ’07), Tuusula, Finland, June 2007. [30] M. Akhtar, J. Epps, and E. Ambikairajah, “Signal processing in sequence analysis: advances in eukaryotic gene prediction,” IEEE Journal on Selected Topics in Signal Processing, vol. 2, no. 3, pp. 310–321, 2008. [31] R. O. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986. [32] W. Li, T. G. Marr, and K. Kaneko, “Understanding long-range correlations in DNA sequences,” Physica D, vol. 75, no. 1–3, pp. 392–416, 1994. [33] A. Fire, R. Alcazar, and F. Tan, “Unusual DNA structures associated with germline genetic activity in Caenorhabditis elegans,” Genetics, vol. 173, no. 3, pp. 1259–1273, 2006.
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 683463, 9 pages doi:10.1155/2009/683463
Research Article
Identifying Genes Involved in Cyclic Processes by Combining Gene Expression Analysis and Prior Knowledge
Wentao Zhao,1 Erchin Serpedin (EURASIP Member),1 and Edward R. Dougherty1,2
1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
2 Computational Biology Division, Translational Genomics Research Institute, 400 North Fifth Street, Suite 1600, Phoenix, AZ 85004, USA

Correspondence should be addressed to Erchin Serpedin, [email protected]

Received 9 July 2008; Revised 24 December 2008; Accepted 26 January 2009

Recommended by Yufei Huang

Based on time series gene expressions, cyclic genes can be recognized via spectral analysis and statistical periodicity detection tests. These cyclic genes are usually associated with cyclic biological processes, for example, the cell cycle and the circadian rhythm. The power of a detection scheme is measured in practice by comparing the detected periodically expressed genes with experimentally verified genes participating in a cyclic process. However, in the above-mentioned procedure the valuable prior knowledge serves only as an evaluation benchmark and is not fully exploited in the implementation of the algorithm. In addition, some data sets are disregarded altogether due to their nonstationarity. This paper proposes a novel algorithm that identifies cyclic-process-involved genes by integrating prior knowledge with gene expression analysis. The proposed algorithm is applied to data sets corresponding to Saccharomyces cerevisiae and Drosophila melanogaster, respectively. Biological evidence is found to validate the roles of the discovered genes in the cell cycle and the circadian rhythm. Dendrograms are presented to cluster the identified genes and to reveal expression patterns. It is corroborated that the proposed identification scheme provides a valuable technique for unveiling pathways related to cyclic processes.

Copyright © 2009 Wentao Zhao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction

The eukaryotic cell hosts several cyclic molecular processes, for example, the cell cycle and the circadian rhythm. The transcriptional events in these processes can be quantitatively observed by measuring the concentration of messenger RNA (mRNA), which is transcribed from DNA and serves as the template for synthesizing the corresponding protein. To achieve this goal, microarray experiments exploit high-throughput gene chips to snapshot genome-wide gene expressions sequentially at discrete time points. The sampled time series data present three main characteristics. First, most data sets have a small sample size, for example, no more than 50 data points. Obtaining data sets with a large sample size is not financially affordable; besides, in the long run the cell culture loses synchronization, and the data become meaningless if sampled much later on. Second, the data might not be evenly sampled, and many time points could be missing. In order to capture critical events
with minimal cost, biologists usually conduct microarray experiments and make measurements when these events happen. Third, the data are highly corrupted by experimental noise, and a robust stochastic analysis is desired. Based on time series data, various approaches have been proposed to identify periodically expressed genes, which are sometimes believed to be involved in the cell cycle. Assuming the cell cycle signal to be a simple sinusoid, Spellman et al. [1] and Whitfield et al. [2] performed Fourier transformations on the data sampled with different synchronization methods, Wichert et al. [3] applied the traditional periodogram and Fisher's test, while Ahdesmäki et al. [4] implemented a robust periodicity test assuming non-Gaussian noise. In [5], Giurcăneanu explored the stochastic complexity of detecting periodically expressed genes by means of generalized Gaussian distributions. Alternatively, Luan and Li [6] employed guide genes and constructed cubic B-spline-based periodic functions for modeling, while Lu et al. [7] employed up to third harmonics to fit the data and proposed a periodic
normal mixture model. De Lichtenberg et al. [8] compared the approaches [1, 6, 7] and proposed a new score combining periodicity and regulation magnitude. Interestingly, the mathematically more advanced methods do not seem to achieve better performance than Spellman's original method, which relies on the fast Fourier transform (FFT). As an important observation, notice that the majority of these works deal only with evenly sampled data. When data points are missing, the vacancies are usually filled by interpolation in the time domain for all genes, or genes are disregarded if more than 30% of their data samples are missing. Biological experiments generally output unevenly spaced measurements. The change of sampling frequency can be attributed to missing data. Besides, the measurements are usually event-driven; that is, more observations are recorded when certain biological events happen, and the observational process is slowed down when the cell remains quiet or no event of interest occurs. Therefore, analysis based on unevenly sampled data sets is practically more desirable and technically more challenging. Notice that in the case of uneven sampling, the harmonics exploited in the discrete Fourier transform (DFT) are no longer orthogonal. Lomb [9] and Scargle [10] demonstrated that a phase shift suffices to make the sine and cosine terms orthogonal again, and consequently a spectral estimator can be designed in the presence of uneven sampling. The Lomb-Scargle scheme has been exploited by Glynn et al. [11] in analyzing the budding yeast data set. Notice also that a number of alternative schemes were proposed recently to cope with missing and/or irregularly spaced data samples. Stoica and Sandgren [12] updated the traditional Capon method to cope with irregularly sampled data. Wang et al. [13] designed the missing-data amplitude and phase estimation (MAPES) approach, which estimates the missing data and spectra iteratively through the Expectation Maximization (EM) algorithm. Although the Capon and MAPES methods aim to achieve a better spectral resolution than the Lomb-Scargle periodogram, for small sample sizes the simpler Lomb-Scargle scheme appears to possess better performance in the presence of realistic biological data [14]. Most of the algorithms proposed in the literature identify cyclic genes by exploiting mathematical models to explain a gene's time series pattern. Employing these models and statistical tests, periodically expressed genes are normally identified. Finally, the detected genes are compared with the genes that had been experimentally discovered to participate in specific processes like the cell cycle. Notice that these experimentally verified cycle-involved genes only serve as a golden benchmark to evaluate the performance of the proposed identification algorithms; they are not fully exploited in the implementation of the identification algorithm. Notice also that most of the existing algorithms fail to utilize all the available data information. For example, the elutriation data provided in [1] was usually discarded when performing the spectral analysis. In other experiments, some data sets were also disregarded due to either loss of synchronization or nonstationarity. Herein, we propose a novel algorithm to detect periodically expressed genes by integrating the
gene expression analysis with the valuable prior knowledge offered by all available data. The prior knowledge can consist of two data sets, that is, the set of genes involved in a cyclic process and the set of noncycle-involved genes recognized in biological experiments. The cycle-involved genes are used to initialize the proposed algorithm, and the noncycle-involved genes are employed to control the false positives. The expression analysis is composed of the spectral estimation technique and the computation of gene expression distance. The underlying approach relies on the assumption that genes expressing similarly to genes of a process of interest are also likely to participate in that process. This same assumption is what justifies applying clustering schemes to microarray measurements in order to partition genes into different functional groups. The proposed algorithm identifies potential cyclic-process-involved genes and guarantees that the verified cycle genes are included with 100% certainty in the output gene set, while the verified noncycle-involved genes are removed from the derived set with 100% certainty. Although most of the existing power-spectra-based algorithms can be crafted into the proposed algorithm seamlessly, herein we use the Lomb-Scargle periodogram due to its simplicity and good performance. The proposed algorithm also lays the groundwork for subsequent research on cyclic pathways.
2. Methods

The proposed algorithm is composed of a spectral density analysis and a gene distance computation based on the time series microarray data. All existing spectral analysis schemes can be incorporated into the proposed algorithm. However, the Lomb-Scargle periodogram is recommended here due to its convenience of implementation and excellent performance for small sample sizes. The nonparametric Spearman's correlation coefficient is adopted to construct the measure of distance between two genes.

2.1. Lomb-Scargle Periodogram and Periodicity Detection. Microarray measurements usually have a large portion of missing data points. Besides, the sampling frequency is tuned to adapt to nonuniformly occurring events. The Lomb-Scargle periodogram appears as an excellent candidate for analyzing these irregular data [14]. Given m time-series observations $(t_l, x_l)$, $l = 0, \ldots, m - 1$, where t stands for the time tag and x denotes the sampled expression of a specific gene, the normalized Lomb-Scargle periodogram at angular frequency ω is defined as follows:

$$\Phi_{LS}(\omega) = \frac{1}{2\sigma^2} \left\{ \frac{\left[ \sum_{l=0}^{m-1} (x_l - \bar{x}) \cos[\omega(t_l - \tau)] \right]^2}{\sum_{l=0}^{m-1} \cos^2[\omega(t_l - \tau)]} + \frac{\left[ \sum_{l=0}^{m-1} (x_l - \bar{x}) \sin[\omega(t_l - \tau)] \right]^2}{\sum_{l=0}^{m-1} \sin^2[\omega(t_l - \tau)]} \right\}, \qquad (1)$$
1: Input gene expression measurements, all sampled genes (referred to as Ω), experimentally verified cycle-involved genes (denoted as G), noncycle-involved genes (represented as F), and the prior frequency range [ω1, ω2];
2: Perform power spectral analysis on the gene expression data;
3: Perform statistical tests so that the periodically expressed genes are recognized and stored in set C;
4: for each xi ∈ C do
5:    if ωΦmax ∉ [ω1, ω2] then
6:       C ← C − {xi}
7:    end
8: end
9: G′ ← G ∪ C, F′ ← F, specify the distance threshold t;
10: repeat   /* iterative accumulation */
11:    G ← G′;
12:    for each xi ∈ Ω, gi ∈ G do
13:       if d(xi, gi) < t then
14:          G′ ← G′ ∪ {xi};
15:       end
16:    end
17: until G′ = G;
18: repeat   /* false positive control */
19:    F ← F′;
20:    for each xi ∈ Ω, fj ∈ F do
21:       if d(xi, fj) < t then
22:          F′ ← F′ ∪ {xi};
23:       end
24:    end
25: until F′ = F;
26: G′ ← G′ − F′;
27: Output G′;

Algorithm 1: Identifying cyclic-process-involved genes.
where $\bar{x}$ and $\sigma^2$ stand for the mean and variance of the sampled data, respectively, and τ is defined as follows:

$$\tau = \frac{1}{2\omega} \arctan\left( \frac{\sum_{l=0}^{m-1} \sin(2\omega t_l)}{\sum_{l=0}^{m-1} \cos(2\omega t_l)} \right). \qquad (2)$$
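A direct transcription of (1)–(2) into Python (NumPy assumed; function and variable names are illustrative) might look as follows; it scans a grid of candidate angular frequencies and returns the normalized periodogram:

```python
import numpy as np

def lomb_scargle(t, x, omegas):
    """Normalized Lomb-Scargle periodogram of (1), with tau as in (2)."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    power = np.empty(len(omegas))
    for i, w in enumerate(omegas):
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = (np.sum(xc * c) ** 2 / np.sum(c ** 2)
                    + np.sum(xc * s) ** 2 / np.sum(s ** 2)) / (2 * x.var())
    return power

# Unevenly sampled toy series with a period of 8 time units.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 80, 40))
x = np.sin(2 * np.pi * t / 8) + 0.3 * rng.standard_normal(40)
omegas = 2 * np.pi / np.linspace(2.0, 20.0, 200)   # candidate periods 2..20
best = 2 * np.pi / omegas[np.argmax(lomb_scargle(t, x, omegas))]
print(best)   # close to 8
```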
Let δ be the greatest common divisor (gcd) of all intervals $t_k - t_l$ ($k \ne l$); Eyer and Bartholdi [15] proved that the highest frequency that should be searched is given by

$$f_{\max} = \frac{\omega_{\max}}{2\pi} = \frac{1}{2\delta}. \qquad (3)$$

Based on the obtained power spectral density, each gene is to be classified as either cyclic or noncyclic. The null hypothesis is usually formed to assume that the measurements are generated by a Gaussian noise stochastic process. For the Lomb-Scargle periodogram, $\Phi_{LS}(\omega)$ was shown to be exponentially distributed under the null hypothesis [10], a result which was also exploited in [11]. However, recently Schwarzenberg-Czerny reported in [16] that a beta distribution is more appropriate for small sample sizes, and the P-value for detecting the largest peak $\Phi_{\max}$ is given by

$$P(T > t) = 1 - \left( 1 - \left( 1 - \frac{2\Phi_{\max}}{m} \right)^{m/2} \right)^m. \qquad (4)$$
A rejection of the null hypothesis based on a P-value threshold implies that the power spectral density contains a frequency whose magnitude is substantially greater than the average value. This indicates that the time-series data contain a periodic signal and that the corresponding gene is cyclically expressed. In order to prevent the false positives from overwhelming the true positives, a multiple testing correction is performed to control the q-value, which is defined as

$$q_k = \min_{k \le j \le n} \frac{p_{(j)}\, n}{j}, \tag{5}$$

where n stands for the number of measured genes and p_(j) represents the sorted P-values in ascending order. The quantity being minimized is an estimate of the False Discovery Rate (FDR). Given a q-value threshold θ, the number of genes to preserve can then be derived as

$$k = \max\{\, j : q_j \le \theta,\ 1 \le j \le n \,\}. \tag{6}$$
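The correction of (5)-(6) reduces to a running minimum over the sorted P-values. A short sketch (our own naming), assuming NumPy:

```python
import numpy as np

def fdr_qvalues(pvalues):
    """q-values per (5): q_k = min_{k<=j<=n} p_(j) * n / j, computed on
    the ascending-sorted P-values via a running minimum from the tail."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    fdr = p * n / np.arange(1, n + 1)
    return np.minimum.accumulate(fdr[::-1])[::-1]

def genes_to_preserve(qvalues, theta):
    """k = max{ j : q_j <= theta } per (6); returns 0 if none qualify."""
    hits = np.nonzero(qvalues <= theta)[0]
    return int(hits.max() + 1) if hits.size else 0
```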
2.2. Gene Distance Measure. A gene is identified as a cyclic gene if it satisfies either of two conditions: it passes the periodicity test performed on the gene expression measurements, or it lies within a small distance of the verified cyclic-process-involved genes. Various distance metrics
have been proposed in the clustering literature to capture the distance between genes, including Pearson's correlation, Euclidean distance, city block distance, and mutual information. Because biological samples are generally highly corrupted, and because rank statistics, as nonparametric methods, usually behave better when extreme observations exist, we adopt Spearman's correlation coefficient as the core of our distance measure. For two genes x and y, this distance is computed over their expressions across all available experiments as follows:

$$d(x, y) = 1 - \left|\, 1 - \frac{6\sum_{i=1}^{m}(x_i - y_i)^2}{m(m^2 - 1)} \,\right|, \tag{7}$$
where (x_i, y_i) stands for the rank pair of the measurements of genes x and y, and m counts the number of sampling points at which both gene x and gene y have available observations. This distance measure always assumes values between 0 and 1.

2.3. Algorithm Formulation. The proposed algorithm is formulated as Algorithm 1. Lines 1 to 9 accept the inputs and initialize the target cyclic gene set with the spectral analysis results and the prior cycle-involved genes. Within them, lines 4 to 8 exclude genes whose peak periodicity, ω_Φmax, contradicts the prior knowledge of the frequency range [ω1, ω2] of the studied phenomenon. Lines 10 to 17 form the iterative accumulation part: they iteratively insert into the potential cyclic gene set those genes expressed similarly to the genes already within that set. Lines 18 to 25 form the false positive control part, which iteratively constructs the control set to suppress potential false positives by using the prior knowledge. Line 26 subtracts the control set from the established target set and finalizes the cyclic gene set. The simulation results on the yeast data set showed that the iterative accumulation part controls the false positives quite well.

The algorithm is guaranteed to converge to a set. In each iteration of the accumulation and false positive control parts, members can only be added to the target gene sets; the number of set members is nondecreasing, and the set in one iteration is a subset of the set in the next. This growth is upper-bounded by the full gene set containing all measured genes. Therefore, both the iterative accumulation part and the false positive control part converge, and so does the proposed algorithm.

Usually some general idea about the phenomenon of interest can be used to determine the two bounds ω1 and ω2 of the frequency range. For example, the circadian rhythm has a periodicity around 24 hours, which can be somewhat compressed or prolonged by experimental protocols. If no prior knowledge exists, the range (0, ∞) can be used.

Two further thresholds must be specified. The first is the threshold for the periodicity test. To effectively control the false alarm rate, multiple testing correction can be applied and a q-value threshold θ specified; in practice, θ can be chosen around 0.15. This threshold can also be decided by comparing the spectral analysis results with the prior knowledge, an approach that is more attractive when the proposed algorithm is combined with other periodicity detection methods. We are inclined to use a more stringent threshold, which also represents a trade-off between the number of conserved genes and the number of experimentally verified genes. The second threshold is the distance threshold t, which decreases over the iterations. For example, the initial value is set to 0.25, which corresponds to a high correlation according to Cohen's rule of thumb [17]. Each iteration decreases this threshold by 0.05 until it reaches 0.1, after which it remains constant at 0.1. In practice, this schedule helps prevent the amplification of false positives; a sketch of the distance computation and accumulation loop follows below.
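Here is a minimal sketch of the distance measure (7) and of the accumulation loop with this threshold schedule; the data layout, function names, and the NaN convention for missing points are our assumptions, not the authors' code:

```python
import numpy as np
from scipy.stats import rankdata

def gene_distance(x, y):
    """d(x, y) of (7): 1 - |Spearman correlation|, computed over the
    sampling points observed in both genes (NaN marks a missing point).
    Ties are rank-averaged; the closed form assumes distinct ranks."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    ok = ~(np.isnan(x) | np.isnan(y))
    rx, ry = rankdata(x[ok]), rankdata(y[ok])
    m = int(ok.sum())
    rho = 1.0 - 6.0 * np.sum((rx - ry) ** 2) / (m * (m ** 2 - 1))
    return 1.0 - abs(rho)

def accumulate(expr, seed, t0=0.25, t_min=0.10, step=0.05):
    """Iterative accumulation (lines 10-17 of Algorithm 1) with the
    decreasing threshold schedule described above.

    expr: dict of gene name -> expression vector; seed: verified plus
    spectrally detected cyclic genes. Returns the grown gene set."""
    grown, t = set(seed), t0
    while True:
        previous = set(grown)
        for gene, x in expr.items():
            if gene not in grown and any(
                    gene_distance(x, expr[g]) < t for g in previous):
                grown.add(gene)
        if grown == previous:            # fixed point: no new genes added
            return grown
        t = max(t - step, t_min)         # 0.25 -> 0.20 -> ... -> 0.10
```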
3. Results

The proposed algorithm was applied to the data sets provided by the unicellular Saccharomyces cerevisiae (budding yeast) and the multicellular Drosophila melanogaster (fruit fly), respectively. The in silico results are discussed briefly here; the full list of identified potential cell cycle genes is presented in the additional files.

3.1. Case Study 1: Saccharomyces cerevisiae. Although various time-series data sets are available, including experiments on human cells [2], the yeast data set published by Spellman et al. [1] remains among the most popular research targets and benchmarks in computational biology, since it excels in its large sample size and the simplicity of the genome. The mRNA concentrations of nearly 6200 Open Reading Frames (ORFs) were measured for yeast strains synchronized using four different methods, that is, α factor, cdc15, cdc28, and elutriation. The data set contained in total 73 sampling points for all genes, although some experiments had missing observations. The detected periodicity matched the yeast cell cycle. Our prior knowledge was derived from two sources: Spellman et al. [1] compiled 104 cell cycle genes that were verified in previous biological experiments, while de Lichtenberg et al. [18] summarized 105 genes that are not involved in the cell cycle.

Spellman et al. [1] designed a periodicity metric, namely the CDC score, based on the Fast Fourier Transform (FFT) of three experiments: α factor, cdc15, and cdc28. The observations from elutriation were discarded due to a computational obstacle. Although a number of other methods were later proposed to identify the cell cycle genes, for example, [3, 6, 7], de Lichtenberg found that Spellman's FFT-based method still excelled in testing power and detected the most verified cell cycle genes [8]. However, as admitted in [1], the selection of the number of conserved genes was fairly arbitrary. As Figure 1 illustrates, when the number of conserved genes increases, the number of verified genes increases at a decreasing rate; after 400 genes have been identified, the curve becomes relatively flat. Therefore, we conserved the 400 genes with the top CDC scores as the initialization set in the proposed algorithm. This amounts to a more stringent test threshold for the spectral analysis part. Figure 2 compares the simulation results with the 800 genes identified by Spellman et al. [1]. Before running
Figure 1: Performance of Spellman et al.'s CDC score on the Saccharomyces cerevisiae data. A specified number of genes are conserved as periodically expressed genes; these genes are compared with the published 104 cell-cycle-involved genes, and the matched genes are counted. Most experimentally discovered cell cycle genes possess high periodicity scores. When the number of conserved genes exceeds 400, the identification ability of Spellman et al.'s method degenerates, as shown by the flat tail of the curve.
the false positive control, the proposed algorithm identified 725 genes, of which 104 came from the prior experimental knowledge and 400 came from Spellman et al.'s spectral analysis method; these two sets overlapped in 84 genes. We identified 199 genes that were neither identified by Spellman et al.'s method nor reported in the prior knowledge of the 104 genes. The false positive control removed 3 genes and left 722 genes marked as potential cell-cycle-involved genes. The identified genes are provided in the additional files in MS Excel format.

As an example of a gene detected by the proposed algorithm, Figures 3(a)–3(d) plot the time series data for the two genes CWP2 (YKL096W-A) and CCW12 (YLR110C). These two genes showed strongly correlated expression across all four experiments, with a distance measure of 0.19. Both genes are annotated to encode cell wall mannoproteins. CWP2 is cell-cycle regulated at the S/G2 phase [19]. It was assigned a CDC score of 2.031, which ranked 478th among all ORFs; therefore, it was selected in Spellman et al.'s 800 genes. A more stringent CDC score threshold, for example, 2.37, which conserves 400 genes, would exclude CWP2 from the cell cycle genes. CCW12 was not selected in Spellman et al.'s 800 genes because its CDC score was 0.297, very low and ranked 4092nd among all genes. It has been found that the cell wall accounts for around 30% of the cell dry weight and that its construction is tightly coordinated with the cell cycle [20]. Smits et al. [21] noted that among the 43 then-discovered cell wall protein encoding genes, which did not yet include CCW12, more than half were verified to be cell-cycle regulated. In other words, cell wall proteins are highly likely to be involved in the cell
Figure 2: Venn diagram of identified Saccharomyces cerevisiae genes, showing: the 725 genes identified by the proposed algorithm before the false positive control procedure; the 800 genes identified by Spellman et al.; the 400 periodic genes used in initialization of the proposed algorithm; and the 104 genes verified in previous experiments. The proposed algorithm identified 722 genes as potential cell cycle genes; false positive control removed 3 genes, which are marked within parentheses. The various sets are differentiated by their colors.
proliferation process. Based on the similarity between the expressions of CWP2 and CCW12 in the cell cycle arrest experiments, we hypothesize that CCW12 is also cell cycle regulated at phase S/G2.

All 722 detected genes are hierarchically clustered in Figure 4. Hierarchical clustering was selected mainly because it is convenient for visualization and avoids specifying the number of desired clusters. It is worth noting that more advanced methods, for example, the self-organizing map (SOM) [22], could achieve better clustering performance. Most clusters indicate a strong periodicity pattern, as can be discerned from the alternately positioned red and green regions. There is one exotic cluster, which exhibits fast oscillation in the cdc15 experiments. This cluster contains 130 genes, illustrated in Figure 5. By examining the existing annotations for these genes, we found that most of them either encode nucleolar proteins or are involved in ribosome biogenesis. It has been verified that ribosome biogenesis consumes up to 80% of a proliferating cell's energy and that it is linked to the cell cycle in metazoan cells. However, in yeast, ribosome biogenesis is not regulated by the cell cycle in the same manner as in higher organisms, due to the closed mitosis of the yeast [23]. Defects in nucleolar genes halt the cell at the Start checkpoint [24]. Ribosome biogenesis controls size growth and inhibits the cell cycle until the cell has reached a satisfactory size [25].

In order to measure valid time series samples, the cell culture has to be synchronized; in other words, all cells within the culture should be homogeneous in all aspects, for example, cell size, DNA, RNA, protein, and other
[Figure 3 panels: expression versus time (min) for YKL096W-A and YLR110C in (a) the alpha data set; (b) the cdc15 data set; (c) the cdc28 data set; (d) the elutriation data set.]
Figure 3: YKL096W-A (CWP2) and YLR110C (CCW12) time-series expressions in the four data sets. Both CWP2 and CCW12 are cell wall protein encoding genes. CWP2 has been verified to be cell-cycle regulated.
cellular contents. Cooper [26, 27] argued that ideal synchronization is an impossible mission because different dimensions, such as cell size and DNA content, cannot be controlled at the same time. Therefore, currently popular synchronization methods, such as serum starvation and thymidine blocking, are only one-dimensional synchronization methods and fail to achieve complete synchronization. It is entirely possible that a discovered periodicity is caused purely by chance or by the specific synchronization method. Based on Spellman et al.'s spectral analysis with CDC scores, it is obvious that most experimentally verified cell cycle genes exhibit top CDC scores; hence, the spectral analysis is still highly valuable. However, due to the loss of synchronization and nonstationarity, the choice of threshold for the periodicity test has to be much more stringent in order to suppress false positives. When the cell culture is not ideally synchronized or stationary, the spectral analysis may fail for some data sets, such as the elutriation data set. However, the proposed algorithm is still capable of identifying a set of genes closely correlated to the verified cell cycle genes based on all the available data. The exploitation of the prior knowledge, consisting of experimentally verified cell cycle genes and non-cell-cycle genes, can help to improve the detection accuracy and combat the negative effects induced by the loss of synchronization and nonstationarity.

3.2. Case Study 2: Drosophila melanogaster. The multicellular Drosophila melanogaster serves as a good prototype for the
[Figure 4 heatmap columns: Alpha, cdc15, cdc28, Elu. Figure 6 heatmap columns: Embryo, Larva, Pupal, Male, Female.]
Figure 4: Clustering analysis of identified Saccharomyces cerevisiae genes. Gene expression levels are indicated by the heatmap. There are 722 genes identified by the proposed algorithm to participate in the cell cycle. Most genes exhibit strong periodicity, as indicated by alternately positioned red and green regions.
Figure 5: The exotic cluster of identified Saccharomyces cerevisiae genes. Gene expression levels are indicated by the heatmap. This cluster contains 130 genes. The gene expressions in the cdc15 experiment oscillate between low and high levels. Most of these genes are nucleolar genes.
research of mammalian diseases because it has only 4 pairs of chromosomes, on which abundant genes with mammalian analogs are located. Our in silico experiments were performed on the Drosophila melanogaster data set published by Arbeitman et al. [28]. Using cDNA microarrays, the RNA expression levels of 4028 genes were measured, representing about one-third of all known fruit fly genes. Synchronization of the cell culture was achieved by the Cryonics method. In Arbeitman et al.'s experiments, 75 sequential sampling points were observed, starting right after fertilization and continuing through embryonic, larval, pupal, and early
Figure 6: Clustering analysis of identified Drosophila melanogaster genes. Gene expression levels are indicated by the heatmap. There are 344 genes identified by the proposed algorithm to be involved in the circadian rhythm. The dendrogram can be split into top and bottom groups, which are complementary in their expressions.
days of adulthood. There were 134 experimentally verified cycling circadian genes [29]; among these, 52 were measured in Arbeitman's experiment [28]. We did not locate a set of non-cell-cycle genes in the Drosophila literature; therefore, the false positive control procedure was not performed. The smallest time interval between any two sampling points was 30 minutes, which was much larger than the Drosophila cell cycle period. However, the pupal data set had sufficient sampling points to provide insights into the circadian rhythm.

The spectral analysis was accomplished by applying the Lomb-Scargle periodogram to the nonuniformly sampled pupal data. We found that the cyclic genes concentrate most of their power spectral density in the frequency band with periods of tens of hours. By imposing a q-value threshold of 0.1, 50 genes were preserved for the initialization of the proposed algorithm. The proposed algorithm then identified 344 genes. A dendrogram for these genes is illustrated in Figure 6. The top and bottom parts constitute two complementary groups. Most of the experimentally verified genes (46 out of 52) are located in the bottom part, exhibiting a transition from the repressed level to the induced level around 11 hours after fertilization.

The two most extensively studied genes involved in the Drosophila circadian rhythm are per and clk. In Arbeitman's experiment, clk showed relatively prominent periodicity in the pupal stage; however, the period was prolonged to more than 24 hours because the synchronization method slowed down the biological process. Unfortunately, per was not measured in the experiment. A large portion of the identified genes have been verified to participate in metabolism, a process closely controlled by the circadian rhythm.

Cross-species knowledge might be valuable. However, special precautions must be taken when the two organisms are too different, like the yeast and the fly: the yeast is a unicellular organism with closed mitosis, while the fly is multicellular with open mitosis. The difference between multicellular organisms is less prominent. Therefore, we hypothesize that the prior knowledge of Drosophila might be valuable for identification in more advanced species, for example, Homo sapiens. The complete list of identified genes is provided in the supplementary materials [30].
4. Conclusions

A novel algorithm is proposed to identify cyclic-process-involved genes by combining microarray data analysis with prior knowledge of genes participating in the cyclic process. In silico experiments were conducted on data sets corresponding to the unicellular Saccharomyces cerevisiae and the multicellular Drosophila melanogaster. The potential cell cycle and circadian rhythmic genes were identified and compared with existing computational results. It is corroborated that the proposed algorithm is capable of exploiting all the available data and proposing potential cycle-involved genes.
References

[1] P. T. Spellman, G. Sherlock, M. Q. Zhang, et al., "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[2] M. L. Whitfield, G. Sherlock, A. J. Saldanha, et al., "Identification of genes periodically expressed in the human cell cycle and their expression in tumors," Molecular Biology of the Cell, vol. 13, no. 6, pp. 1977–2000, 2002.
[3] S. Wichert, K. Fokianos, and K. Strimmer, "Identifying periodically expressed transcripts in microarray time series data," Bioinformatics, vol. 20, no. 1, pp. 5–20, 2004.
[4] M. Ahdesmäki, H. Lähdesmäki, R. Pearson, H. Huttunen, and O. Yli-Harja, "Robust detection of periodic time series measured from biological systems," BMC Bioinformatics, vol. 6, article 117, pp. 1–18, 2005.
[5] C. D. Giurcăneanu, "Stochastic complexity for the detection of periodically expressed genes," in Proceedings of the 5th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '07), pp. 1–4, Tuusula, Finland, June 2007.
[6] Y. Luan and H. Li, "Model-based methods for identifying periodically expressed genes based on time course microarray gene expression data," Bioinformatics, vol. 20, no. 3, pp. 332–339, 2004.
[7] X. Lu, W. Zhang, Z. S. Qin, K. E. Kwast, and J. S. Liu, "Statistical resynchronization and Bayesian detection of periodically expressed genes," Nucleic Acids Research, vol. 32, no. 2, pp. 447–455, 2004.
[8] U. de Lichtenberg, L. J. Jensen, A. Fausbøll, T. S. Jensen, P. Bork, and S. Brunak, "Comparison of computational methods for the identification of cell cycle-regulated genes," Bioinformatics, vol. 21, no. 7, pp. 1164–1171, 2005.
[9] N. R. Lomb, "Least-squares frequency analysis of unequally spaced data," Astrophysics and Space Science, vol. 39, no. 2, pp. 447–462, 1976.
[10] J. D. Scargle, "Studies in astronomical time series analysis—II. Statistical aspects of spectral analysis of unevenly spaced data," The Astrophysical Journal, vol. 263, pp. 835–853, 1982.
[11] E. F. Glynn, J. Chen, and A. R. Mushegian, "Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms," Bioinformatics, vol. 22, no. 3, pp. 310–316, 2006.
[12] P. Stoica and N. Sandgren, "Spectral analysis of irregularly-sampled data: paralleling the regularly-sampled data approaches," Digital Signal Processing, vol. 16, no. 6, pp. 712–734, 2006.
[13] Y. Wang, P. Stoica, J. Li, and T. L. Marzetta, "Nonparametric spectral analysis with missing data via the EM algorithm," Digital Signal Processing, vol. 15, no. 2, pp. 191–206, 2005.
[14] W. Zhao, K. Agyepong, E. Serpedin, and E. R. Dougherty, "Detecting periodic genes from irregularly sampled gene expressions: a comparison study," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2008, Article ID 769293, 8 pages, 2008.
[15] L. Eyer and P. Bartholdi, "Variable stars: which Nyquist frequency?" Astronomy and Astrophysics Supplement Series, vol. 135, no. 1, pp. 1–3, 1999.
[16] A. Schwarzenberg-Czerny, "The distribution of empirical periodograms: Lomb-Scargle and PDM spectra," Monthly Notices of the Royal Astronomical Society, vol. 301, no. 3, pp. 831–840, 1998.
[17] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, Lawrence Erlbaum, Hillsdale, NJ, USA, 2nd edition, 1988.
[18] U. de Lichtenberg, R. Wernersson, T. S. Jensen, et al., "New weakly expressed cell cycle-regulated genes in yeast," Yeast, vol. 22, no. 15, pp. 1191–1201, 2005.
[19] L. H. P. Caro, G. J. Smits, P. van Egmond, J. W. Chapman, and F. M. Klis, "Transcription of multiple cell wall protein-encoding genes in Saccharomyces cerevisiae is differentially regulated during the cell cycle," FEMS Microbiology Letters, vol. 161, no. 2, pp. 345–349, 1998.
[20] F. M. Klis, A. Boorsma, and P. W. J. De Groot, "Cell wall construction in Saccharomyces cerevisiae," Yeast, vol. 23, no. 3, pp. 185–202, 2006.
[21] G. J. Smits, J. C. Kapteyn, H. van den Ende, and F. M. Klis, "Cell wall dynamics in yeast," Current Opinion in Microbiology, vol. 2, no. 4, pp. 348–352, 1999.
[22] P. Tamayo, D. Slonim, J. Mesirov, et al., "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 6, pp. 2907–2912, 1999.
[23] K. A. Bernstein and S. J. Baserga, "The small subunit processome is required for cell cycle progression at G1," Molecular Biology of the Cell, vol. 15, no. 11, pp. 5038–5046, 2004.
[24] K. A. Bernstein, F. Bleichert, J. M. Bean, F. R. Cross, and S. J. Baserga, "Ribosome biogenesis is sensed at the Start cell cycle checkpoint," Molecular Biology of the Cell, vol. 18, no. 3, pp. 953–964, 2007.
[25] G. Thomas, "An encore for ribosome biogenesis in the control of cell proliferation," Nature Cell Biology, vol. 2, no. 5, pp. E71–E72, 2000.
[26] S. Cooper, "Rethinking synchronization of mammalian cells for cell cycle analysis," Cellular and Molecular Life Sciences, vol. 60, no. 6, pp. 1099–1106, 2003.
[27] S. Cooper, "Rejoinder: whole-culture synchronization cannot, and does not, synchronize cells," Trends in Biotechnology, vol. 22, no. 6, pp. 274–276, 2004.
[28] M. N. Arbeitman, E. E. M. Furlong, F. Imam, et al., "Gene expression during the life cycle of Drosophila melanogaster," Science, vol. 297, no. 5590, pp. 2270–2275, 2002.
[29] M. J. McDonald and M. Rosbash, "Microarray analysis and organization of circadian gene expression in Drosophila," Cell, vol. 107, no. 5, pp. 567–578, 2001.
[30] Supplementary Materials, http://www.ece.tamu.edu/~wtzhao/FlyCellCycleGenes.xls.
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 195712, 12 pages doi:10.1155/2009/195712
Research Article
Clustering of Gene Expression Data Based on Shape Similarity

Travis J. Hestilow¹ and Yufei Huang¹,²

¹ Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX 78249, USA
² Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, TX 78229, USA

Correspondence should be addressed to Yufei Huang, [email protected]

Received 4 August 2008; Revised 8 January 2009; Accepted 27 January 2009

Recommended by Erchin Serpedin

A method for gene clustering from expression profiles using shape information is presented. Conventional clustering approaches such as K-means assume that genes with similar functions have similar expression levels and hence allocate genes with similar expression levels into the same cluster. However, genes with similar function often exhibit similarity in signal shape even though their expression magnitudes can be far apart. Therefore, this investigation studies clustering according to signal shape similarity. This shape information is captured in the form of normalized and time-scaled forward first differences, which are then subjected to variational Bayes clustering plus a non-Bayesian (Silhouette) cluster statistic. The statistic shows an improved ability to identify the correct number of clusters and assign the cluster members. Based on initial results for both generated test data and Escherichia coli microarray expression data, and initial validation of the Escherichia coli results, it is shown that the method has promise in being able to better cluster time-series microarray data according to shape similarity.

Copyright © 2009 T. J. Hestilow and Y. Huang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction

Investigating the genetic structure and metabolic functions of organisms is an important yet demanding task. Genetic actions and interactions, how genes control and are controlled, are determined and/or inferred from data from many sources. One of these sources is time-series microarray data, which measure the dynamic expression of genes across an entire organism. Many methods of analyzing these data have been presented and used. One popular method, especially for time-series data, is gene-based profile clustering [1]. This method groups genes with similar expression profiles in order to find genes with similar functions, or to relate genes with dissimilar functions across different pathways occurring simultaneously. There has been much work on clustering time-series data, and clustering can be done based on either similarity of expression magnitude or the shape of expression dynamics. Clustering methods include hierarchical and partitional types (such as K-means, fuzzy K-means, and mixture modeling) [2]. Each method has its strengths and weaknesses. Hierarchical techniques do not produce clusters per se; rather, they produce trees or dendrograms. Clusters
can be built from these structures by later cutting the output structure at various levels. Hierarchical techniques can be computationally expensive, require relatively smooth data, and/or be unable to "recover" from a poor guess; that is, the method is unable to reverse itself and recalculate from a prior clustering set. They also often require manual intervention in order to properly delineate the clusters. Finally, the clusters themselves must be well defined: noisy data resulting in ill-defined boundaries between clusters usually result in a poor cluster set.

Partitional clustering techniques strive to group data vectors (in this case, gene expression profiles) into clusters such that the data in a particular cluster are more similar to each other than to data in other clusters. Partitional clustering can be done on the data itself or on spline representations of the data [3, 4]. In either case, square-error techniques such as K-means are often used. K-means is computationally efficient and always converges, although possibly only to a local minimum of the variance. However, it must know the number of clusters in advance; there is no provision for determining an unknown number of clusters other than repeatedly testing the algorithm with different cluster numbers, which for large datasets can be very time consuming. Further, as is the case with hierarchical methods, K-means is best suited for clusters which are compact and well separated; it performs poorly with overlapping clusters. Finally, it is sensitive to noise and has no provision for accounting for such noise through a probabilistic model or the like.

A related technique, fuzzy K-means, attempts to mimic the idea of posterior cluster membership probability through a concept of "degree of membership." However, this method is not computationally efficient and requires at least an a priori estimate of the degree of membership for each data point. Also, the number of clusters must be supplied a priori, or a separate algorithm must be used in order to determine the optimum number of clusters. Another similar method is agglomerative clustering [5]. Model-based techniques go beyond fuzzy K-means and actually attempt to model the underlying distributions of the data; these methods maximize the likelihood of the data given the proposed model [4, 6].

More recently, much study has been devoted to clustering based on expression profile shape (or trajectory) rather than absolute levels. Kim et al. [7] show that genes with similar function often exhibit similarity in signal shape even though the expression magnitudes can be far apart; therefore, expression shape is a more important indication of similar gene function than expression magnitude. The same clustering methods mentioned above can be used based on shape similarity. An excellent example of a tree-based algorithm using shape similarity as a criterion can be found in [8]. While the results of that investigation proved fruitful, it should be noted that the data used in the study resulted in well-defined clusters; further, the clustering was done manually once the dendrogram was created. Möller-Levet et al. [9] used fuzzy K-means to cluster time-series microarray data using shape similarity as a criterion. However, the number of clusters was known beforehand; no separate optimization method was used to find the proper number of clusters. Balasubramaniyan et al. [10] used a similarity measure over time-shifted profiles to find local (short time scale) similarities. Phang et al. [11] used a simple (+/0/−) shape decomposition and a nonparametric Kruskal-Wallis test to group the trajectories. Finally, Tjaden [12] used a K-means-related method with error information included intrinsically in the algorithm.

A common difficulty with these approaches is determining the optimal number of clusters. There have been numerous studies and surveys over the years aimed at finding optimal methods for unsupervised clustering of data, for example, [13–20]. Different methods achieve different results, and no single method appears to be optimal in a global sense. The problem is essentially a model selection problem. It is well known that Bayesian methods provide the optimal framework for selecting models, though a complete treatment is analytically intractable for most cases. In this paper, a Bayesian approach based on the Variational Bayes Expectation Maximization (VBEM) algorithm is proposed to determine the number of clusters, and better performance than the MDL and BIC criteria has been demonstrated.

In this study, the goal was to find clusters of genes with similar functions, that is, coregulated genes, using time-series microarray data. As a result, we choose to cluster genes based on signal shape information. In particular, signal shape information is derived from the normalized, time-scaled forward first differences of the time-sequence data. This information is then forwarded to a Variational Bayes Expectation Maximization algorithm (VBEM, [21]), which performs the clustering. Unlike K-means, VBEM is a probabilistic method derived within the Bayesian statistical framework and has been shown to provide better performance. Further, when paired with an external clustering statistic such as the Silhouette statistic [22], the VBEM algorithm can also determine the optimal number of clusters.

The rest of the paper is organized as follows. In Section 2 the problem is discussed in more detail, the underlying model is developed, and the algorithm is presented. In Section 3 the results of our evaluation of the algorithm against both simulated and real time-series data are shown; also presented are comparisons between the algorithm and K-means clustering, both methods using several different criteria for making clustering decisions. Conclusions are summarized in Section 4. Finally, Appendices A, B, and C present a more detailed derivation of the algorithm.
2. Method

2.1. Problem Statement and Method. Given the microarray datasets of G genes, x_g ∈ R^{N×1} for g = 1, 2, 3, ..., G, where N is the number of time points, that is, the columns in the microarray, it is desired to cluster the gene expressions based on signal shape. The clustering is not known a priori; therefore, not only must individual genes be assigned to relevant clusters, but the number of clusters itself must also be determined. The clustering is based on expression-level shape rather than magnitude. The shape information is captured by the first-order time difference. However, since the gene expression profiles are obscured by the widely varying levels manifested in the data, the time difference must be computed on expression levels with the same scale and dynamic range. Motivated by these observations, the proposed algorithm has three steps. In the first step, the expression data is rescaled. In the second step, the signal shape information is captured by calculating the first-order time difference. In the last step, clustering is performed on the time-difference data using a Variational Bayes Expectation Maximization (VBEM) algorithm. In the following, each step is discussed in detail.

2.2. Initial Data Transformation. Each gene sequence was rescaled by subtracting the mean value of the sequence from each individual point, resulting in sequences with zero mean. This operation was intended to mitigate the widely different magnitudes and slopes in the profile data. By resetting all genes to zero-mean sequences, the overall shape of each sequence could be better identified without the complication of comparing genes with different magnitudes.
Figure 1: Dissimilar expression levels with similar shape.
Figure 2: Normalized differences: the same two sequences after transformations.
After this, the resulting sequences were normalized such that the maximum absolute value of each sequence was 1. Gene expression between related genes can result in a large change or a small one; if two genes are related, that relationship should be recoverable regardless of the amplitude of change. By renormalizing the data in this manner, the amplitudes of both large-change and small-change genes were placed on the same order of magnitude. Mathematically, the above operation can be expressed by

$$z_g = \frac{x_g - \mu_{x_g}}{\max\left(\mathrm{abs}\left(x_g - \mu_{x_g}\right)\right)}, \tag{1}$$

where μ_{x_g} represents the mean of x_g.

2.3. Extraction of Shape Information and Time Scaling. To extract shape information of time-varying gene expression, the derivative of the expression trajectory is considered. Since we are dealing with discrete sequences, differences must be used rather than analytical derivatives. To characterize the shape of each sequence, a simple first-difference scheme was used: the magnitude difference between the succeeding point and the point under consideration, divided by the time difference between those points. The data was taken nonuniformly over a period of approximately 100 minutes, with sample intervals varying from 7 to 50 minutes. As the transformation in (1) already scales the data to a range of [−1, 1], further compressing that scale by nearly two orders of magnitude over some time stretches was deemed neither prudent nor necessary. Therefore, the time difference was scaled in hours to prevent this unneeded range compression. The resulting sequences were used as data for clustering. Mathematically, this operation can be written as

$$y_{g,k} = \frac{z_{g,k+1} - z_{g,k}}{t_{g,k+1} - t_{g,k}}, \quad k = 1, \ldots, N - 1, \tag{2}$$

where t_g is the length-N vector of time points associated with gene g, z_g is the vector of transformed time-series data (from (1)) associated with gene g, and y_g is the resulting vector of first differences associated with gene g.

Figure 1 shows an example pair of sequences using contrived data. These two sequences are visually related in shape, but their mean values are greatly different. A K-means clustering would place these two sequences in different clusters. By transforming the data, the similarity of the two sequences is enhanced, and the clustering algorithm can then place them in the same cluster. Figure 2 shows the same two sequences after data transformation.
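Both transformations are only a few lines of NumPy. The sketch below (our naming, not the authors' code) assumes the sample times are already expressed in hours:

```python
import numpy as np

def shape_features(x, t):
    """Transformations (1)-(2): zero-mean, max-abs-normalized expression,
    followed by forward first differences over time.

    x: expression values; t: sample times in hours; both length N.
    Returns the length N-1 difference sequence y_g used for clustering."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    z = x - x.mean()                   # zero-mean sequence
    z = z / np.max(np.abs(z))          # eq. (1): scale to [-1, 1]
    return np.diff(z) / np.diff(t)     # eq. (2): scaled first differences
```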
2.4. Clustering. Once the sequence of first differences was calculated for each gene, clustering was performed on y, the first-order difference. To this end, a VBEM algorithm was developed. Before presenting that development, a general discussion of VBEM is in order. An important problem in Bayesian inference is determining the best model for a set of data from many competing models. The problem itself can be stated fairly compactly. Given a set of data y, the marginal likelihood of that data given a particular model m can be expressed as
$$p(y \mid m) = \int p(y, x, \theta \mid m)\, dx\, d\theta, \tag{3}$$
where x and θ are, respectively, the latent variables and the model parameters. The integration is taken over both variables and parameters in order to prevent overfitting, as a model with many parameters would naturally be able to fit a wider variety of datasets than a model with few parameters. Unfortunately, this integral is not easily solved. The VBEM method approximates this by introducing a free distribution, q(x, θ), and taking the logarithm of the above integral. If q(x, θ) has support everywhere that p(x, θ | y, m) does, we can construct a lower bound to the integral using Jensen’s inequality:
$$\ln p(y \mid m) = \ln \int p(y, x, \theta \mid m)\, dx\, d\theta = \ln \int q(x, \theta)\, \frac{p(y, x, \theta \mid m)}{q(x, \theta)}\, dx\, d\theta \ge \int q(x, \theta) \ln \frac{p(y, x, \theta \mid m)}{q(x, \theta)}\, dx\, d\theta. \tag{4}$$
Maximizing this lower bound with respect to the free distribution q(x, θ) results in q(x, θ) = p(x, θ | y, m), the joint posterior. Since the normalizing constant is not known, this posterior cannot be calculated exactly. Therefore another simplification is made. The free distribution q(x, θ)
is assumed to be factorable, that is, q(x, θ) = q(x)q(θ). The inequality then becomes
$$\ln p(y \mid m) \ge \int q(x)\, q(\theta) \ln \frac{p(y, x, \theta \mid m)}{q(x)\, q(\theta)}\, dx\, d\theta = \mathcal{F}(q(x), q(\theta)). \tag{5}$$
Maximizing this functional F is equivalent to minimizing the KL distance between q(x)q(θ) and p(x, θ | y, m). The distributions q(x) and q(θ) are coupled and must be iterated until they converge.

With the above discussion in mind, we now develop the model that our VBEM algorithm is based on. Given K clusters in total, let C_g ∈ {1, 2, ..., K} denote the cluster number of gene g. Then we assume that, given C_g = k, the expression level for gene g follows a Gaussian distribution, that is,

$$p\left(y_g \mid C_g = k, m_{1:K}, s_{1:K}^{2}\right) = \mathcal{N}\left(m_k, \operatorname{diag}\left(s_k^{2}\right)\right), \tag{6}$$

where m_k = [m_{k1}, m_{k2}, ..., m_{kN}]^T is the mean and s_k² = [s²_{k1}, s²_{k2}, ..., s²_{kN}]^T is the variance of the kth Gaussian cluster. Since both m_k and s_k² are unknown parameters, a Normal-Inverse-Gamma prior distribution is assigned as
$$p\left(m_k, s_k^{2}\right) = \prod_{j=1}^{N} \mathcal{N}\!\left(m_{k,j} \,\Big|\, 0, \frac{s_{k,j}^{2}}{\kappa}\right) \mathrm{IG}\!\left(s_{k,j}^{2} \,\Big|\, \frac{a_0}{2}, \frac{b_0}{2}\right), \tag{7}$$

where κ, a₀, and b₀ are the known parameters of the prior distribution. Furthermore, a multinomial prior is assigned for the cluster number C_g as

$$p(C_g = k \mid L) = L_k, \tag{8}$$

where L_k is the prior probability that gene g belongs to the kth cluster and $\sum_{k=1}^{K} L_k = 1$. L further assumes a priori the Dirichlet distribution

$$p(L_1, L_2, \ldots, L_K) = \mathrm{Dir}(a_1, \ldots, a_K), \tag{9}$$

where a₁, ..., a_K are the known parameters of the distribution. Given the transformed expressions of G genes, y = [y₁, y₂, ..., y_G]^T, the stated two tasks are equivalent to estimating K, the total number of clusters, and C_g for all G genes. A Bayesian framework is adopted for estimating both K and C_g, which are calculated by the maximum a posteriori criterion as

$$K_{\max} = \arg\max_{K} p(y \mid H = K), \qquad C_{g,\max} = \arg\max_{k} p(C_g = k \mid y), \quad k \in \{1, \ldots, K_{\max}\}, \tag{10}$$

where p(y | H = K) is the marginal likelihood given that the model H has K clusters, and p(C_g = k | y) is the a posteriori probability of C_g when the total number of clusters is K.

Unfortunately, there are now multiple unknown nuisance parameters at this point: m_k, s_k², a, b, κ, and L all still need to be found. To do so requires a marginalization procedure over all the unknowns, which is intractable for unknown cluster id C_g. Therefore, a VBEM scheme is adopted for estimating the necessary distributions.

2.5. VBEM Algorithm. Given the development above, p(y | H = K) can be expressed as

$$p(y \mid H = K) = \int \sum_{C_g} p(y \mid C_g, \theta)\, p(C_g)\, p(\theta)\, d\theta, \tag{11}$$

where θ is the vector of unknown parameters m_k, s_k², a, b, κ, and L. Notice that the summation in (11) is NP-hard; its complexity increases exponentially with the number of genes. We therefore resort to approximating this integration by variational EM. First, a lower bound is constructed for the expression in (11); the ultimate aim is to maximize this lower bound. The lower bound can be written as

$$\ln p(y \mid H = K) = \ln \int \sum_{C_g} p(y \mid C_g, \theta)\, p(C_g)\, p(\theta)\, d\theta \ge \sum_{C_g} \int q(C_g)\, q(\theta) \left[ \ln \frac{p(y, C_g \mid \theta)\, p(\theta)}{q(\theta)} + \ln \frac{1}{q(C_g)} \right] d\theta, \tag{12}$$

where, as above, the inequality derives from Jensen's inequality. The free distributions q(C_g) and q(θ) are introduced as approximations to the unknown distributions p(C_g | y) and p(θ | y). The q(·) distributions are chosen so as to maximize the lower bound. Using variational derivatives and an iterative coordinate ascent procedure, we find

VBE Step:
$$q^{(j+1)}(C_g) = \frac{1}{Z_{C_g}} \exp\left\{ \int q^{(j)}(\theta) \ln p(C_g, y \mid \theta)\, d\theta \right\}; \tag{13}$$

VBM Step:
$$q^{(j+1)}(\theta) = \frac{1}{Z_{\theta}} \exp\left\{ \sum_{C_g} q^{(j+1)}(C_g) \ln p(C_g, y \mid \theta) \right\}, \tag{14}$$
where j and j + 1 are iteration indices and the Z(·) are normalizing constants to be determined. Because of the integration in (13), q(θ) must be chosen carefully in order to have an analytic expression; by choosing q(θ) as a member of the exponential family, this condition is satisfied. Note that q(θ) is an approximation to the posterior distribution p(θ | y) and therefore can be used to obtain the estimate of θ.
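The coordinate ascent of (13) and (14) has a simple generic skeleton. The sketch below abstracts the actual update equations (the paper's Appendices A-C) behind assumed callables e_step, m_step, and lower_bound; it illustrates the iteration structure and is not the authors' implementation:

```python
def vbem(y, e_step, m_step, lower_bound, q_theta_init,
         max_iter=200, tol=5e-2):
    """Iterate the VBE step (13) and VBM step (14) until the lower
    bound F of (12) stops improving by more than tol."""
    q_theta = q_theta_init
    q_c = None
    f_old = float("-inf")
    for _ in range(max_iter):
        q_c = e_step(y, q_theta)        # VBE: update q(C_g) given q(theta)
        q_theta = m_step(y, q_c)        # VBM: update q(theta) given q(C_g)
        f_new = lower_bound(y, q_c, q_theta)
        if f_new - f_old < tol:         # converged (or bound decreased)
            break
        f_old = f_new
    return q_c, q_theta
```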
2.6. Summary of VBEM Algorithm. The VBEM algorithm is summarized as follows.

(1) Initialization:
(i) initialize m_k, s_k², a, b, κ, and L.

Iterate until the lower bound converges:

(2) VBE step:
(i) for k = 1 : K, g = 1 : G,
(ii) calculate q(C_g = k) using (A.1) in Appendix A,
(iii) end g, k.

(3) VBM step:
(i) for k = 1 : K,
(ii) calculate q(θ) using (B.1) in Appendix B,
(iii) end k.

(4) Lower bound:
(i) calculate F(q(C_g), q(θ)) using (C.1) in Appendix C.

End iteration.

2.7. Choice of the Optimum Number of Clusters. The Bayesian formulation of (11) suggests using the number of clusters that maximizes the marginal likelihood or, in the context of VBEM, the lower bound F(·). Instead of basing the determination of the number of clusters solely on F(·), four different criteria are investigated in this work: (a) the lower bound F(·) used within the VBEM algorithm (labelled KL), (b) the Bayes Information Criterion [23], (c) the Silhouette statistic computed on clusters built from transformed data, and (d) the Silhouette statistic computed on clusters built from raw data. The VBEM lower bound F(·) is discussed above; the BIC and Silhouette criteria are discussed below.

2.8. Bayes Information Criterion (BIC). The Bayes Information Criterion (BIC, [23]) is an asymptotic approximation to the Bayes factor, which itself is an average likelihood ratio similar to the maximum likelihood ratio. As the Bayes factor is often difficult to calculate, the BIC offers a less-intensive approximation. Subject to the assumptions of large data size and exponential-family prior distributions, maximizing the BIC is equivalent to maximizing the integrated likelihood function. The BIC can be written as

$$\mathrm{BIC} = 2 \ln p(x \mid \theta) - k \ln(n), \tag{15}$$

where p(x | θ) is the likelihood function of data x given parameters θ, k is the size (dimensionality) of the parameter set θ, and n is the sample size. The term −k ln(n) is a penalty discouraging more complex models.

2.9. Silhouette Statistic. The Silhouette statistic (Sil, [22]) uses the squared difference between a data vector and all other data vectors in all clusters. For any particular data vector v belonging to cluster A, let a_v be the average squared difference between data vector v and all other vectors in cluster A, and let b_v be the minimum average squared distance between data vector v and all other vectors of any cluster B, B ≠ A. Then the Silhouette statistic for data vector v is

$$\mathrm{Sil}(v) = \frac{b_v - a_v}{\max(a_v, b_v)}. \tag{16}$$

It is quickly seen that the range of this statistic is [−1, 1]. A value close to 1 means the data vector is very probably assigned to the correct cluster, while a value close to −1 means the data vector is very probably assigned to the wrong cluster. A value near 0 is a neutral evaluation.
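A direct (if unoptimized) rendering of (16) with squared Euclidean distances might look as follows; this is our sketch, not the MATLAB built-in the authors used, and it assumes at least two clusters:

```python
import numpy as np

def silhouette(data, labels):
    """Silhouette statistic of (16) for each data vector.

    data: (n, d) array; labels: length-n cluster assignments."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    sq = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    sil = np.empty(len(data))
    for v in range(len(data)):
        same = labels == labels[v]
        same[v] = False                       # exclude v itself from a_v
        a_v = sq[v, same].mean() if same.any() else 0.0
        b_v = min(sq[v, labels == c].mean()
                  for c in np.unique(labels) if c != labels[v])
        sil[v] = (b_v - a_v) / max(a_v, b_v)
    return sil                                # mean(sil) scores a clustering
```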
3. Results

We illustrate the method using simulated expression data and with microarray data available online.
3.1. Simulation Study. In order to test the ability of VBEM to properly cluster data of similar shape but dissimilar mean level and scale, several datasets were constructed. These datasets were intended to appear as would a set of time-series microarray data. Each consisted of 5 data points in a vector, corresponding to what might be seen from a microarray for a single gene over five time samples. Identical assumptions were used to produce these datasets: namely, that the inherent clusters within the data were based upon a mean vector of values for a particular cluster, that each cluster may have subclusters exhibiting a mean shift and/or a scale change from the mean vector, and that the data within a cluster randomly varied about that mean vector (plus any mean shift and scale change). All sets of sample data shared the characteristics shown in Table 1. For example, a test "gene" of subcluster "dms" would be a random length-5 vector, drawn from a Gaussian distribution with a mean of [2.0, −2.0, 0.0, 0.0, 0.0] and a particular standard deviation (defined below); this random vector would then be scaled by 0.25 and shifted in value by −1.25. The datasets constructed from these basis vectors differed in the number of data vectors per subcluster (and thus the total number of data vectors) and in the standard deviation used to vary the individual vector values about their corresponding basis vectors. Generally speaking, the standard deviation vectors were constructed to be approximately 25% of the mean vector for the "low-noise" sets, and approximately 50% of the mean vector for the "high-noise" sets.

3.2. "Low-Noise" Test Datasets. Two datasets were constructed using standard deviation vectors approximately 25% of the relevant mean vector. Table 2 shows the standard deviation vectors used. Each subcluster in Table 1 was replicated several times, randomly varying about the mean vector in a Gaussian distribution with a standard deviation as shown in Table 2. Test set 1 had 5 replicates per subcluster (e.g., a1–a5, cs1–cs5), resulting in a total set of N = 55 data vectors. Test set 2 had 99 replicates per subcluster, resulting in a total set of N = 1089 data vectors. A sketch of this data generation appears below.
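To make the construction concrete, the following sketch (our naming; NumPy assumed) draws one subcluster exactly as described, using subcluster "dms" with its low-noise standard deviation vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_subcluster(mean_vec, shift, scale, sd_vec, n):
    """Draw n length-5 test 'genes': Gaussian variation about the cluster
    mean vector, then scaled and mean-shifted per Table 1."""
    base = rng.normal(loc=mean_vec, scale=sd_vec, size=(n, len(mean_vec)))
    return scale * base + shift

# Subcluster "dms": cluster d's mean, shift -1.25, scale 0.25,
# low-noise standard deviations (Table 2), 5 replicates (test set 1).
dms = make_subcluster([2.0, -2.0, 0.0, 0.0, 0.0], -1.25, 0.25,
                      [0.5, 0.5, 0.1, 0.1, 0.1], n=5)
```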
Table 1: Basis vectors for clusters in sample datasets.

Cluster  Subcluster  Mean vector                    Mean shift  Scale factor
a        a           (0.5, 0.5, 0.5, 0.5, 2.0)      0           1
b        b           (0.5, 2.0, 2.0, −2.0, −2.0)    0           1
b        bm          (0.5, 2.0, 2.0, −2.0, −2.0)    −1.25       1
c        c           (0.0, 2.0, 0.0, 2.0, 0.0)      0           1
c        cs          (0.0, 2.0, 0.0, 2.0, 0.0)      0           0.25
d        d           (2.0, −2.0, 0.0, 0.0, 0.0)     0           1
d        dms         (2.0, −2.0, 0.0, 0.0, 0.0)     −1.25       0.25
e        e           (−2.0, 0.0, 0.0, 0.0, −2.0)    0           1
e        em          (−2.0, 0.0, 0.0, 0.0, −2.0)    −1.25       1
e        es          (−2.0, 0.0, 0.0, 0.0, −2.0)    0           0.25
e        ems         (−2.0, 0.0, 0.0, 0.0, −2.0)    −1.25       0.25
Table 2: Standard deviation vectors for clusters in "low-noise" sample datasets.

Cluster  Standard deviation vector
a        (0.1, 0.1, 0.1, 0.1, 0.5)
b        (0.1, 0.5, 0.5, 0.5, 0.5)
c        (0.1, 0.5, 0.1, 0.5, 0.1)
d        (0.5, 0.5, 0.1, 0.1, 0.1)
e        (0.5, 0.1, 0.1, 0.1, 0.5)

3.3. "High-Noise" Test Datasets. Because of the need to test the robustness of the clustering and prediction algorithms in the presence of higher amounts of noise, six datasets were constructed using standard deviation vectors approximately 50% of the relevant mean vector. Table 3 shows the standard deviation vectors used. As with the "low-noise" sets, each subcluster in Table 1 was replicated several times, randomly varying about the mean vector in a Gaussian distribution, this time with a standard deviation as shown in Table 3. Table 4 shows the number of replicates produced for each dataset. For the test data, an added transformation step was performed that would normally not be applied to actual data: since the test data was produced in already clustered form, the vectors (rows) were randomly shuffled to break up this clustering.

Table 3: Standard deviation vectors for clusters in "high-noise" sample datasets.

Cluster  Standard deviation vector
a        (0.2, 0.2, 0.2, 0.2, 1.0)
b        (0.2, 1.0, 1.0, 1.0, 1.0)
c        (0.2, 1.0, 0.2, 1.0, 0.2)
d        (1.0, 1.0, 0.2, 0.2, 0.2)
e        (1.0, 0.2, 0.2, 0.2, 1.0)

Table 4: Subcluster replicates and total vector sizes for "high-noise" datasets.

Test set  Total replicates  Total N
3         5                 55
4         9                 99
5         30                330
6         50                550
7         70                770
8         99                1089
clusters was assumed to be known and passed to the algorithm. Second, the algorithm was tested in an “uncontrolled” fashion; that is, the number of clusters was unknown, and the algorithm had to predict the number of clusters given the data. During the uncontrolled tests, a K-means algorithm was also run against the data as a comparison. The VBEM algorithm as currently implemented requires an initial (random) probability matrix for the distribution of genes to clusters, given a value for K. Therefore, for each dataset, 55 trials were conducted, each trial having a different initial matrix. Also, each trial begins with an initial clustering of genes. As currently implemented, this initialization is performed using a K-means algorithm. The algorithm attempts to cluster the data such that the sum of squared differences between data within a cluster is minimized. Depending on the initial starting position, this clustering may change. In MATLAB, the built-in K-means algorithm has several options available to include how many different trials (from different starting points) are conducted to produce a “minimum” sum-squared distance, how many iterations are allowed per trial to reach a stable clustering, and how clusters that become “empty” during the clustering process are handled. For these tests, the K-means algorithm conducted 100 trials of its own per initial probability matrix (and output the clustering with the smallest sum-squared distance), had a limit of 100 iterations, and created a “singleton” cluster when a cluster became empty. As mentioned above, the choice of optimum K was conducted using four different calculations. The first used
Figure 3: Misclassification rate versus N, high-noise data, K fixed.

Figure 4: K(pred) versus N, high-noise data.
the estimate for the VBEM lower bound; the second used the BIC equation. In both cases, the optimum K for a particular trial was that which showed a decrease in value when K was increased. This does not mean the values used to determine the optimum K were the absolute maxima for the parameter within that trial; in fact, they usually were not. The overall optimum K for a particular choice of parameter was the maximum value over the number of trials. The third and fourth criteria made use of the Silhouette statistic, one using the clusters of transformed data and one using the corresponding clusters of raw data. We used the built-in Silhouette function contained within MATLAB for our calculations. To find the optimum K, the mean Silhouette value for all data vectors in a clustering was calculated for each value of K; the value of K for which the mean value was maximized was chosen as the optimum K.

To evaluate the actual clustering, a misclassification rate was calculated for each trial cluster. Since the "ground-truth" clustering was known a priori, this rate can be calculated as a sum of probabilities derived from the original data and the clustering results:

$$R_{mi} = \sum_{j=1}^{K} \sum_{k=1}^{K} p\left(C_j \mid C_k\right) p\left(C_k\right), \tag{17}$$
where p(C j | Ck ) is the probability that computed cluster C j belongs to a priori cluster Ck given that Ck is in fact the correct cluster, and p(Ck ) is the probability of a priori cluster Ck occurring. Rmi refers to the misclassification rate using statistic m (KL, BIC, both Silhouette) for trial i. This rate is in the range [0, 1] and is equal to 1 only when the number of clusters is properly predicted and those calculated clusters match the a priori clusters. Thus, both under- and overprediction of clusters were penalized. For the “controlled” test sequences, the combinations of VBEM + KL (V/KL), VBEM + BIC (V/BIC), VBEM + Silhouette (transformed data) (V/SilT), and VBEM + Silhouette (raw data) (V/SilR) all properly chose the optimum clustering for the two “low-noise” datasets, in all
cases with no misclassification. For the six "high-noise" sets, V/KL and V/BIC were completely unable to choose the optimum clustering (lowest misclassification rate). In the case of V/SilT, the algorithm-chosen optimum was rarely the true optimum (2 out of 6 datasets); however, the chosen optimum was always very nearly optimal. Finally, V/SilR chose the optimum clustering for 5 out of 6 datasets. The algorithm-chosen optimal clustering for both V/SilT and V/SilR showed a misclassification rate of 6 percent or less, while the misclassification rates for V/KL and V/BIC were often in the range of 15–35 percent. Figure 3 summarizes this data.

For the "uncontrolled" tests, the above four algorithms were tested with the number of clusters unknown. Further, K-means clustering with the Silhouette statistic (KM/SilT and KM/SilR) was also conducted for comparison. The results for the six "high-noise" datasets are summarized below. Figure 4 shows a summary plot of the predicted number of clusters K versus dataset size N for all combinations. Note that V/SilR correctly identified K = 5 for all datasets. Also note that KM/SilT, KM/SilR, and V/SilT predicted K = 5 or K = 6 for all datasets except test set 3 (N = 55). However, even though V/SilR correctly identified K = 5 for this dataset, it had equivalent optimum values for K = 7, 8, 10, and 15. Given the poor performance of all combinations for this dataset, this suggests that for high-noise data such as this, N = 55 is insufficient to give good results. V/KL and V/BIC both performed poorly with all datasets, in most cases overpredicting the number of clusters. As can be seen in Figure 4, this overprediction tended to increase with dataset size N; V/BIC resulted in lower overprediction than V/KL.

Figure 5 shows a summary plot of misclassification rate versus dataset size N for the VBEM versus K-means comparison using Silhouette statistics only (both raw and difference). This plot shows the superior performance of V/SilR even more dramatically. While the misclassification rates for KM/SilT, KM/SilR, and V/SilT were generally on the order of 10–20%, V/SilR was very stable, generally between 3–4%.
Figure 5: Misclassification rate versus N, high-noise data, K unknown (curves: KM/SilT, KM/SilR, V/SilT, and V/SilR).

3.5. Test Results Conclusion. The VBEM algorithm can correctly cluster shape-based data even in the presence of fairly high amounts of noise when paired with the Silhouette statistic computed on the raw data clusters (V/SilR). Further, V/SilR is robust in correctly predicting the number of clusters in noise. Its misclassification rate is superior to that of K-means using Silhouette statistics, as well as to VBEM using all other statistics. Because of this, V/SilR was expected to be the algorithm of choice for the experimental microarray data; however, to maintain the comparison, all four VBEM/statistic algorithms were tested.

3.6. Experimental E. coli Expression Data. The proposed approach for gene clustering on shape similarity was tested using time-series data from the University of Oklahoma E. coli Gene Expression Database resident at their Bioinformatics Core Facility (OUBCF) [24]. The exploration concentrated on the wild-type MG1655 strain during exponential growth on glucose. The data available consisted of 5 time-series log-ratio samples of 4389 genes. The initial tests were run against genes identified as being from metabolic categories; specifically, genes identified in the E. coli K-12 Entrez Genome database at the National Center for Biotechnology Information, US National Library of Medicine, National Institutes of Health (NIH; http://www.ncbi.nlm.nih.gov/) [25] as being in categories C, G, E, F, H, I, and/or Q were chosen. Because of the short sequence lengths, any gene with even a single invalid data point was removed from the set: with only 5 time samples to work with in each gene sequence, even a single missing point would have significant ramifications in the final output. The final set of genes used for testing numbered 1309.

In implementing the VBEM algorithm, the initial values were a0 = b0 = 0.0002. The algorithm was set to iterate until the change in lower bound decreased below 5 × 10−2 or became negative (in which case the prior iteration was taken as the end value), or until 200 iterations, whichever came first. The optimal number of clusters was arrived at by multiple runs of the algorithm at values of K, the predefined number of clusters, varying from 3 to 15; K was chosen in the same manner as in the test data sequences. Figure 6 shows a summary of the final result of the algorithm. Each subfigure shows the mean shapes clustered by the particular algorithm/statistic. As can be seen from the figure, V/KL resulted in an overclassification of structure in the data, while the other three algorithms gave more consistent results. As a result, the V/KL clusters were removed from further analysis.
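The gene-filtering rule just described — any gene with even a single invalid point among its 5 time samples is dropped — can be sketched as follows, assuming a hypothetical (genes × 5) array with NaN marking invalid log-ratio values; the input names are illustrative, not the authors' data format.

```python
import numpy as np

def drop_incomplete_genes(data, gene_names):
    """Keep only genes whose 5-point time series contains no invalid values."""
    data = np.asarray(data, dtype=float)
    keep = ~np.isnan(data).any(axis=1)     # True where all 5 samples are valid
    return data[keep], [g for g, k in zip(gene_names, keep) if k]
```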
3.7. Validation of E. coli Expression Data Results. We validated the results of our tests using Gene Ontology (GO) enrichment analysis. To this end, the genes used in the analysis were tagged with their respective GO categories and analyzed within each cluster for overrepresentation of certain categories versus the "background" level of the population (in this case, the entire set of metabolic genes used). Again, the Entrez Genome database at NIH was used for the GO annotation information. As most of the enriched entries were from the Biological Process portion of the ontology, the analysis was restricted to those terms.

To perform the analysis, the software package Cytoscape (http://www.cytoscape.org/) [26] was used. Cytoscape offers access to a wide variety of plug-in analysis packages, including a GO enrichment analysis tool, BiNGO (Biological Network Gene Ontology; http://www.psb.ugent.be/cbd/papers/BiNGO/) [27]. To evaluate the clusters, we modified an approach used by Yuan and Li [28] to score the clusters based on the information content and the likelihood of enrichment (P-value < .05). Unlike [28], however, a distance metric was not included in the calculations: because of the large cluster sizes involved, such distance calculations would have exacted a high computational overhead. Rather, the simpler approach of forming subclusters of adjacent enriched terms was chosen; that is, if two GO terms had a relationship to each other and were both enriched, they were placed in the same subcluster and their scores multiplied by the number of terms in the subcluster. Also, a large portion of the score of any term shared across more than one cluster was subtracted. This method rewarded large subclusters while penalizing numerous small subclusters and overlapping terms. The scoring equation for a cluster C consisting of k subclusters, each of size n_j, is given as

$$\mathrm{Score}_C = \sum_{j=1}^{k} \left(n_j - 1\right) \sum_{i=1}^{n_j} \log \Pr\left(t_{ij}\right) \log p_{ij} \;-\; \frac{n-1}{n} \sum_{t_k \in C_i \cap C_j \cap \cdots \cap C_n} \log \Pr\left(t_k\right) \log p_k, \tag{18}$$
where Pr(t_ij) is the probability of GO term t_ij being selected, log(Pr(t_ij)) is the negative of the information content of the GO term, and p_ij is the P-value (P < .05) of GO term t_ij. Large subclusters are rewarded by larger values of n_j; subtracting 1 from n_j compensates for the "baseline" score value, that is, the score a cluster would achieve if no terms were connected. The final term in the equation is the devaluation of any GO term shared by n clusters.
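A minimal sketch of the scoring rule (18), assuming the enrichment results have already been grouped into subclusters; the (Pr, p) pair and (Pr, p, n) triple input formats are hypothetical conveniences, not the authors' data structures.

```python
import math

def cluster_score(subclusters, shared_terms):
    """Score of (18). `subclusters`: list of subclusters, each a list of
    (Pr, p) pairs for its enriched GO terms. `shared_terms`: list of
    (Pr, p, n) triples for terms shared across n clusters."""
    score = 0.0
    for terms in subclusters:
        n_j = len(terms)
        # both logs are negative (Pr < 1, p < .05), so each product is positive
        score += (n_j - 1) * sum(math.log(Pr) * math.log(p) for Pr, p in terms)
    for Pr, p, n in shared_terms:
        score -= (n - 1) / n * math.log(Pr) * math.log(p)
    return score
```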
Figure 6: Mean data shapes. (a) V/KL, (b) V/BIC, (c) V/SilT, (d) V/SilR.
Figure 7: GO clusters resulting from V/SilR.
Figure 8: GO clusters resulting from V/SilT.
Figure 9: GO clusters resulting from V/BIC.

Table 5: Summary scores from E. coli data analysis.

Cluster          V/SilR       V/SilT      V/BIC
1                153.14       405.73      4.42
2                2004.55      3.10        422.42
3                22129.80     82.95       513.70
4                —            7343.89     44.64
5                —            —           11196.16
Total score      24287.48     7835.67     12181.33
Average score    8095.83      1958.92     2436.27

Given that the algorithm was expected to group related functions together, the expectation for the GO analysis was the creation of large, highly connected subclusters within each main gene cluster. Ideally, one such subcluster would subsume the entire cluster; however, a small number of large subclusters within each cluster would validate the algorithm. The scoring equation (18) greatly rewards large, highly connected subclusters; in fact, given a cluster, the score is maximized by having all GO terms within that cluster be connected within a single subcluster. Figures 7, 8, and 9 show the results of the clustering using the three algorithms. Subclusters have been outlined for ease of identification, and in some instances nonenriched GO terms (colored white) have been removed for clarity. Visually, V/SilR is the better choice of the three: it has fewer overall clusters, and each cluster generally has fewer subclusters than those of V/SilT or V/BIC.
The clusters were scored using (18); Table 5 shows a summary of this analysis. As can be seen, V/SilR (3 clusters) far outscored both V/SilT (4 clusters) and V/BIC (5 clusters), in both aggregate and average cluster scores. Therefore, the conclusion is that V/SilR provides the best clustering performance.
4. Conclusion

Four combinations of the VBEM algorithm and cluster statistics were tested. One of these, VBEM combined with the Silhouette statistic computed on the raw data clusters, clearly outperformed the other three in both simulated and real data tests. This method shows promise for clustering time-series microarray data according to profile shape.
Appendices

A. Calculation of VBE Step

Let us assume we are on iteration $j+1$ and have both $q^{(j)}(C_g = k)$ and $q^{(j)}(\theta)$ available from iteration $j$. Then,

$$q^{(j+1)}\left(C_g = k\right) = \frac{\xi_g(k)}{\sum_{k=1}^{K} \xi_g(k)}, \tag{A.1}$$

where

$$\ln \xi_g(k) = \Psi\left(\gamma_k\right) + \frac{1}{2}\sum_{n=1}^{N}\left[\Psi\!\left(\frac{\alpha_{k,n}}{2}\right) - \ln\frac{\beta_{k,n}}{2} - \frac{\alpha_{k,n}}{\beta_{k,n}}\left(y_{g,n} - \mu_{k,n}\right)^2 - \frac{1}{K_{k,n}}\right], \tag{A.2}$$

with $N$ the number of time samples, $G$ the number of genes (index $g$), and $\Psi(\cdot)$ the digamma function; all other parameters are calculated from the VBM step.

B. Calculation of VBM Step

Now we assume we have $q^{(j+1)}(C_g = k)$ from the prior VBE step. Then,

$$q^{(j+1)}\left(\mu_{n,k}, \sigma_{n,k}^2\right) = \mathrm{NIG}\!\left(\mu_{n,k}, \sigma_{n,k}^2 \,\Big|\, K_{k,n},\, \mu_{k,n},\, \frac{\alpha_{k,n}}{2},\, \frac{\beta_{k,n}}{2}\right), \qquad q^{(j+1)}(L) = \mathrm{Dir}\left(\dot{\gamma}_1, \dot{\gamma}_2, \ldots, \dot{\gamma}_K\right), \tag{B.1}$$

where $K_{k,n} = \sum_{g=1}^{G} q^{(j)}(C_g = k) + K$; $\mu_{k,n} = K_{k,n}^{-1}\big[K\mu_{k,n,0} + \sum_{g=1}^{G} q^{(j)}(C_g = k)\, y_{g,n}\big]$; $\alpha_{k,n} = \alpha_{k0} + \sum_{g=1}^{G} q^{(j)}(C_g = k)$; $\beta_{k,n} = \beta_{k0} + t(\mathbf{y})$; $t(\mathbf{y}) = \sum_{g=1}^{G} q^{(j)}(C_g = k)\, y_{g,n}^2 + K\mu_{k,n,0}^2 + \mu_{k,n}$; $\dot{\gamma}_k = \gamma_k + \sum_{g=1}^{G} q^{(j)}(C_g = k)$; $\mathrm{NIG}(\cdot)$: Normal-Inverse-Gamma distribution; $\mathrm{Dir}(\cdot)$: Dirichlet distribution.

C. Calculation of Lower Bound F(q(C_g), q(θ))

Once $q^{(j+1)}(C_g = k)$ and $q^{(j+1)}(\theta)$ have been calculated, we calculate the lower bound using

$$F\left(q\left(C_g\right), q(\theta)\right) = -\sum_{n=1}^{N}\sum_{k=1}^{K} \mathrm{KL}\!\left(q\left(\mu_{n,k}, \sigma_{n,k}^2\right) \,\big\|\, p\left(\mu_{n,k}, \sigma_{n,k}^2\right)\right) - \mathrm{KL}\big(q(L) \,\big\|\, p(L)\big) + \ln Z, \tag{C.1}$$

where

$$\begin{aligned} \mathrm{KL}\!\left(q\left(\mu_{n,k}, \sigma_{n,k}^2\right) \,\big\|\, p\left(\mu_{n,k}, \sigma_{n,k}^2\right)\right) = {}& \frac{1}{2}\left(-\ln\frac{K}{K_{n,k}} + \frac{K}{K_{n,k}} - 1\right) + \frac{\alpha_{k,n}}{2}\left(\frac{\beta_0}{\beta_{k,n}} - 1\right) + \frac{\alpha_{k,n} - \alpha_0}{2}\,\Psi\!\left(\frac{\alpha_{k,n}}{2}\right) \\ &+ \frac{\alpha_0}{2}\left(\ln\frac{\beta_{k,n}}{2} - \ln\frac{\beta_0}{2}\right) + \frac{\alpha_{k,n} K}{2\beta_{k,n}}\left(\mu_{n,k} - \mu_0\right)^2 - \ln\frac{\Gamma\left(\alpha_{k,n}/2\right)}{\Gamma\left(\alpha_0/2\right)}, \end{aligned} \tag{C.2}$$

$$\mathrm{KL}\big(q(L) \,\big\|\, p(L)\big) = \ln\frac{\Gamma\left(\dot{\gamma}_0\right)}{\Gamma\left(\gamma_0\right)} + \sum_{k=1}^{K}\left[\ln\frac{\Gamma\left(\gamma_k\right)}{\Gamma\left(\dot{\gamma}_k\right)} + \left(\dot{\gamma}_k - \gamma_k\right)\left(\Psi\left(\dot{\gamma}_k\right) - \Psi\left(\dot{\gamma}_0\right)\right)\right], \tag{C.3}$$

$$\ln Z = \sum_{g=1}^{G} \ln Z_g, \tag{C.4}$$

where $Z_g = \xi \sum_{k=1}^{K} \xi_g(k)$ and $\ln \xi = -(N/2)\ln 2\pi - \Psi\left(\gamma_0\right)$.
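Under the reconstruction of (A.1)–(A.2) above, the VBE step can be sketched as a log-domain softmax over clusters; the array shapes and parameter names below are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.special import digamma

def vbe_step(Y, gamma, K_mat, mu, alpha, beta):
    """One VBE step per (A.1)-(A.2): Y is (G, N); gamma is (K,);
    K_mat, mu, alpha, beta are (K, N) arrays from the VBM step."""
    term = (digamma(alpha / 2) - np.log(beta / 2))[None, :, :] \
         - (alpha / beta)[None, :, :] * (Y[:, None, :] - mu[None, :, :]) ** 2 \
         - (1.0 / K_mat)[None, :, :]
    ln_xi = digamma(gamma)[None, :] + 0.5 * term.sum(axis=-1)   # (G, K)
    ln_xi -= ln_xi.max(axis=1, keepdims=True)                   # stabilize the exponential
    q = np.exp(ln_xi)
    return q / q.sum(axis=1, keepdims=True)                     # q^{(j+1)}(C_g = k)
```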
Acknowledgment

This work is supported in part by NSF Grant CCF-0546345. Dr. Tim Lilburn has been instrumental with his assistance and guidance.

References
[1] D. Jiang, C. Tang, and A. Zhang, "Cluster analysis for gene expression data: a survey," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370–1386, 2004.
[2] M. H. Asyali, D. Colak, O. Demirkaya, and M. S. Inan, "Gene expression profile classification: a review," Current Bioinformatics, vol. 1, no. 1, pp. 55–73, 2006.
[3] Z. Bar-Joseph, G. K. Gerber, D. K. Gifford, T. S. Jaakkola, and I. Simon, "Continuous representations of time-series gene expression data," Journal of Computational Biology, vol. 10, no. 3-4, pp. 341–356, 2003.
[4] P. Ma, C. I. Castillo-Davis, W. Zhong, and J. S. Liu, "A data-driven clustering method for time course gene expression data," Nucleic Acids Research, vol. 34, no. 4, pp. 1261–1269, 2006.
[5] L. Rueda, A. Bari, and A. Ngom, "Clustering time-series gene expression data with unequal time intervals," in Transactions on Computational Systems Biology X, vol. 5410 of Lecture Notes in Computer Science, pp. 100–123, Springer, Berlin, Germany, 2008.
[6] Y. Yuan and C.-T. Li, "Unsupervised clustering of gene expression time series with conditional random fields," in Proceedings of the Inaugural IEEE International Conference on Digital EcoSystems and Technologies (DEST '07), pp. 571–576, Cairns, Australia, February 2007.
[7] K. Kim, S. Zhang, K. Jiang, et al., "Measuring similarities between gene expression profiles through new data transformations," BMC Bioinformatics, vol. 8, article 29, pp. 1–14, 2007.
[8] X. Wen, S. Fuhrman, G. S. Michaels, et al., "Large-scale temporal gene expression mapping of central nervous system development," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 1, pp. 334–339, 1998.
[9] C. S. Möller-Levet, F. Klawonn, K.-H. Cho, H. Yin, and O. Wolkenhauer, "Clustering of unevenly sampled gene expression time-series data," Fuzzy Sets and Systems, vol. 152, no. 1, pp. 49–66, 2005.
[10] R. Balasubramaniyan, E. Hüllermeier, N. Weskamp, and J. Kämper, "Clustering of gene expression data using a local shape-based similarity measure," Bioinformatics, vol. 21, no. 7, pp. 1069–1077, 2005.
[11] T. L. Phang, M. C. Neville, M. Rudolph, and L. Hunter, "Trajectory clustering: a non-parametric method for grouping gene expression time courses, with applications to mammary development," in Proceedings of the 8th Pacific Symposium on Biocomputing (PSB '03), pp. 351–362, Lihue, Hawaii, USA, January 2003.
[12] B. Tjaden, "An approach for clustering gene expression data with error information," BMC Bioinformatics, vol. 7, article 17, pp. 1–15, 2006.
[13] A. Ben-Hur, A. Elisseeff, and I. Guyon, "A stability based method for discovering structure in clustered data," in Proceedings of the 7th Pacific Symposium on Biocomputing (PSB '02), pp. 6–17, Lihue, Hawaii, USA, January 2002.
[14] E. Dimitriadou, S. Dolničar, and A. Weingessel, "An examination of indexes for determining the number of clusters in binary data sets," Psychometrika, vol. 67, no. 1, pp. 137–159, 2002.
[15] S. Dudoit and J. Fridlyand, "A prediction-based resampling method for estimating the number of clusters in a dataset," Genome Biology, vol. 3, no. 7, pp. 1–21, 2002.
[16] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society. Series B, vol. 63, no. 2, pp. 411–423, 2001.
[17] H. Sun and M. Sun, "Trail-and-error approach for determining the number of clusters," in Proceedings of the 4th International Conference on Machine Learning and Cybernetics (ICMLC '05), vol. 3930 of Lecture Notes in Computer Science, pp. 229–238, Guangzhou, China, August 2006.
[18] D. L. Wild, C. E. Rasmussen, and Z. Ghahramani, "A Bayesian approach to modeling uncertainty in gene expression clusters," in Proceedings of the 3rd International Conference on Systems Biology (ICSB '02), Stockholm, Sweden, December 2002.
[19] Y. Xu, V. Olman, and D. Xu, "Minimum spanning trees for gene expression data clustering," Genome Informatics, vol. 12, pp. 24–33, 2001.
[20] M. Yan and K. Ye, "Determining the number of clusters using the weighted gap statistic," Biometrics, vol. 63, no. 4, pp. 1031–1037, 2007.
[21] M. J. Beal and Z. Ghahramani, "The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures," in Proceedings of the 7th Valencia International Meeting on Bayesian Statistics, vol. 7, pp. 453–464, Tenerife, Spain, June 2003.
[22] P. J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[23] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[24] The University of Oklahoma's E. coli Gene Expression Database, http://chase.ou.edu/oubcf/.
[25] The Entrez Genome Database, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, "Escherichia coli K-12 data," http://www.ncbi.nlm.nih.gov/.
[26] P. Shannon, A. Markiel, O. Ozier, et al., "Cytoscape: a software environment for integrated models of biomolecular interaction networks," Genome Research, vol. 13, no. 11, pp. 2498–2504, 2003.
[27] S. Maere, K. Heymans, and M. Kuiper, "BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks," Bioinformatics, vol. 21, no. 16, pp. 3448–3449, 2005.
[28] Y. Yuan and C.-T. Li, "Probabilistic framework for gene expression clustering validation based on gene ontology and graph theory," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '08), pp. 625–628, Las Vegas, Nev, USA, March-April 2008.
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 713248, 10 pages doi:10.1155/2009/713248
Research Article
Spectral Preprocessing for Clustering Time-Series Gene Expressions
Wentao Zhao,1 Erchin Serpedin (EURASIP Member),1 and Edward R. Dougherty2
1 Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843, USA
2 Translational Genomics Research Institute, 400 North Fifth Street, Suite 1600, Phoenix, AZ 85004, USA
Correspondence should be addressed to Erchin Serpedin, [email protected]
Received 31 July 2008; Accepted 19 January 2009
Recommended by Yufei Huang
Based on gene expression profiles, genes can be partitioned into clusters, which might be associated with biological processes or functions, for example, cell cycle, circadian rhythm, and so forth. This paper proposes a novel clustering preprocessing strategy which combines clustering with spectral estimation techniques so that the time information present in time series gene expressions is fully exploited. By comparing the clustering results with a set of biologically annotated yeast cell-cycle genes, the proposed clustering strategy is corroborated to yield significantly different clusters from those created by the traditional expression-based schemes. The proposed technique is especially helpful in grouping genes participating in time-regulated processes.
Copyright © 2009 Wentao Zhao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction

A cell is the basic unit of life, and each cell contains the instructions necessary for its proper functioning. These instructions are encoded in the form of DNA, which is replicated and transmitted to the cell's progeny when the cell divides. mRNAs are intermediate products in this process: they are transcribed from DNA segments (genes) and serve as the templates for protein translation. This conduit of information constitutes the central dogma of molecular biology. The fast-evolving gene microarray technology has enabled simultaneous measurement of genome-wide gene expression in terms of mRNA concentrations. There are two types of microarray data: time series and steady state. Time-series data are obtained by sequential measurements in temporal experiments, while steady-state data are produced by recording gene expressions from independent sources, for example, different individuals, tissues, experiments, and so forth. High costs, ethical concerns, and implementation issues prevent the collection of large time-series data sets. Therefore, about 70% of the available data sets are steady state [1], and most time-series data sets contain only a few time points, generally fewer than 20 samples. Based on microarray measurements, clustering methods have been exploited to partition genes into subsets. Members
in each subset are assumed to share a specific biological function or participate in the same molecular-level process. They are termed coexpressed genes and are expected to be located close to one another in the underlying genetic regulatory networks. Eisen et al. [2] applied hierarchical clustering to partition yeast genes, Tamayo et al. [3] exploited the self-organizing map (SOM), and Tavazoie et al. [4] employed K-means clustering to group gene expressions and then searched upstream DNA sequence motifs that contribute to the coexpression of genes. Besides the above-mentioned successful applications, Zhou et al. [5] designed a clustering strategy that minimizes the mutual information between clusters, combining bootstrap techniques with heuristic search to solve the underlying optimization problem. Also, Giurcăneanu et al. [6] exploited the minimum description length (MDL) principle to determine the number of clusters. Whether technically advanced schemes represent better solutions for real biological data is still under debate; however, different schemes usually provide valuable alternatives and insights to each other. Therefore, it has been recommended that several clustering schemes be applied to the same real data set [7] so that the differences between clusterings capture patterns that would otherwise be neglected by running only one method.
A straightforward application of clustering schemes causes the loss of the temporal information inherent in time-series measurements. This shortcoming has been noticed in the literature. Ramoni et al. [8] designed a model-based Bayesian method to cluster time-series data and specify the number of clusters intelligently, Tabus and Astola [9] proposed to fit the data by linear dynamic systems, and Ernst et al. [10] presented an algorithm especially suited for short time series. In these models, genes in the same cluster are assumed to share a similar time-domain profile. Temporal relationships have also been explored via more complex models, that is, genetic regulatory networks, which can be constructed via more computationally demanding algorithms, for example, Zhao et al. [11] and Liang et al. [12]. However, in general, the network inference schemes deal only with relatively small-scale networks consisting of at most a few hundred genes; genome-wide analysis is beyond the computational capability of these inference algorithms. Therefore, clustering methods are usually exploited to partition genes, the obtained subsets of genes serve as further research targets, and more accurate maps of real biological processes can then be recovered.

Based on time-series data, modern spectral density estimation methods have been exploited to identify periodically expressed genes. Assuming the cell-cycle signal to be a single sinusoid, Spellman et al. [13] and Whitfield et al. [14] performed a Fourier transformation on data sampled with different synchronization methods, Wichert et al. [15] applied the traditional periodogram and Fisher's test, while Ahdesmäki et al. [16] implemented a robust periodicity test procedure assuming non-Gaussian noise. The majority of these works dealt with evenly sampled data; missing data points were usually filled by interpolation in the time domain, or genes were disregarded if there were too many vacancies.

Biological experiments generally output unequally spaced measurements. The change of sampling frequency is due to missing data and the fact that the measurements are usually event driven; that is, more observations are taken when certain biological events occur, and the measurement process is slowed down when the cell remains quiet. Therefore, an analysis based on unevenly sampled data is practically desirable and technically more challenging. The harmonics exploited in the discrete Fourier transform (DFT) are no longer orthogonal in the presence of uneven sampling. Lomb [17] and Scargle [18] demonstrated that a phase shift suffices to make the sine and cosine terms orthogonal again. The Lomb-Scargle scheme has been exploited in analyzing the budding yeast data set by Glynn et al. [19]. Stoica and Sandgren [20] updated the traditional Capon method to cope with irregularly sampled data. Notice also that Wang et al. [21] designed the missing-data amplitude and phase estimation (MAPES) approach, which estimates the missing data and the spectrum iteratively through the Expectation-Maximization (EM) algorithm. Although the Capon and MAPES methods aim to achieve a better spectral resolution than the Lomb-Scargle periodogram, for small sample sizes the simpler Lomb-Scargle periodogram appears to possess higher accuracy on real biological data sets [22].
This paper proposes a novel clustering preprocessing procedure which combines power spectral density analysis with clustering schemes. Given a set of microarray measurements, the power spectral density of each gene is first computed, and the spectral information is then fed into the clustering schemes. Members within the same cluster share similar spectral information and are therefore expected to participate in the same temporally regulated biological process. The assumptions underlying this statement rely on the following facts: if two genes X and Y are in the same cluster, their spectral densities are very close to each other, while in the time domain their expressions may differ only in phase. The phases are usually modeled to correspond to different stages of the same biological process, for example, the cell cycle or circadian rhythms. The proposed spectral density-based clustering actually differentiates the following two cases.
(1) Gene X's expression and gene Y's expression are uncorrelated in both the time and frequency domains.
(2) Gene X's and gene Y's expressions are uncorrelated in the time domain, but gene X's expression is a time-shifted version of gene Y's expression.
In the traditional clustering schemes, the distances are the same (both assuming large values) for the above two cases; in the proposed algorithm, the second case is favored and presents a lower distance. Therefore, by exploiting the proposed algorithm, genes participating in the same biological process are more likely to be grouped into the same cluster.

The Lomb-Scargle periodogram serves as the spectral density estimation tool since it is computationally simple and possesses higher accuracy in the presence of unevenly measured, small-size gene expression data sets. The appropriate clustering method is determined based on intensive computer simulations: three major clustering methods, hierarchical, K-means, and self-organizing map (SOM) schemes, are tested with different configurations. The spectra- and expression-based clusterings are compared with respect to their ability to group cell-cycle genes that have been experimentally verified, and the differences between clusterings are recorded and compared in terms of information-theoretic quantities.
2. Methods

This section explains how to apply the Lomb-Scargle periodogram to time-series gene expressions. Next, the three clustering schemes, hierarchical, K-means, and self-organizing map (SOM), are briefly formulated. Afterward, we discuss how to validate the clusterings and make comparisons between them. The notational convention is as follows: matrices and vectors are in boldface, and scalars are represented in regular font.

Table 1: Distance metric between two genes' measurements x and y.
  Euclidean:   $d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})(\mathbf{x} - \mathbf{y})^T}$. ($T$ denotes the matrix transpose.)
  City block:  $d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{M} |x_i - y_i|$. ($M$ represents the sample size, and $i$ indexes a specific sample.)
  Cosine:      $d(\mathbf{x}, \mathbf{y}) = 1 - \mathbf{x}\mathbf{y}^T / \big((\mathbf{x}\mathbf{x}^T)^{1/2}(\mathbf{y}\mathbf{y}^T)^{1/2}\big)$.
  Correlation: $d(\mathbf{x}, \mathbf{y}) = 1 - (\mathbf{x} - \bar{\mathbf{x}})(\mathbf{y} - \bar{\mathbf{y}})^T / \big(((\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^T)^{1/2}((\mathbf{y} - \bar{\mathbf{y}})(\mathbf{y} - \bar{\mathbf{y}})^T)^{1/2}\big)$. ($\bar{\mathbf{x}}$, $\bar{\mathbf{y}}$ are the means of vectors $\mathbf{x}$ and $\mathbf{y}$, respectively.)

Table 2: Distance metric between two clusters $C_i$ and $C_j$. ($d(\mathbf{x}, \mathbf{y})$ is defined in Table 1; $|\cdot|$ denotes the size of a cluster.)
  Single:   $d(C_i, C_j) = \min d(\mathbf{x}, \mathbf{y}),\ \mathbf{x} \in C_i,\ \mathbf{y} \in C_j$.
  Complete: $d(C_i, C_j) = \max d(\mathbf{x}, \mathbf{y}),\ \mathbf{x} \in C_i,\ \mathbf{y} \in C_j$.
  Average:  $d(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$.

2.1. Lomb-Scargle Periodogram. Most spectral analysis methods, for example, the Fourier transform and the traditional periodogram employed in Spellman et al. [13] and Wichert et al. [15], rely on evenly sampled data, which are projected on orthogonal sine and cosine harmonics. However, real microarray measurements are not evenly observed, due to missing data points and changing sampling frequency, and the uneven sampling ruins the orthogonality of the data projection. Lomb [17] found that a phase shift of the sine and cosine functions restores the orthogonality among the harmonics, and Scargle [18] complemented Lomb's periodogram by deriving its distribution. Since then, the established Lomb-Scargle periodogram has been exploited in numerous fields and applications, including bioinformatics and genomics (see, e.g., Glynn et al. [19]).

Given M time-series observations $(t_l, x_l)$, $l = 0, \ldots, M - 1$, where $t$ stands for the time tag and $x$ denotes the sampled expression of a specific gene, the normalized Lomb-Scargle periodogram for that gene expression at angular frequency $\omega$ is defined as

$$\Phi_{LS}(\omega) = \frac{1}{2\sigma^2}\left\{\frac{\left[\sum_{l=0}^{M-1}\left(x_l - \bar{x}\right)\cos\left[\omega\left(t_l - \tau\right)\right]\right]^2}{\sum_{l=0}^{M-1}\cos^2\left[\omega\left(t_l - \tau\right)\right]} + \frac{\left[\sum_{l=0}^{M-1}\left(x_l - \bar{x}\right)\sin\left[\omega\left(t_l - \tau\right)\right]\right]^2}{\sum_{l=0}^{M-1}\sin^2\left[\omega\left(t_l - \tau\right)\right]}\right\}, \tag{1}$$

where $\bar{x}$ and $\sigma^2$ stand for the mean and variance of the sampled data, respectively, and $\tau$ is defined as

$$\tau = \frac{1}{2\omega}\arctan\left[\frac{\sum_{l=0}^{M-1}\sin\left(2\omega t_l\right)}{\sum_{l=0}^{M-1}\cos\left(2\omega t_l\right)}\right]. \tag{2}$$

Let $\delta$ be the greatest common divisor (gcd) of all intervals $t_k - t_l$ ($k \neq l$). Eyer and Bartholdi [23] proved that the highest frequency to be searched is given by

$$f_{\max} = \frac{\omega_{\max}}{2\pi} = \frac{1}{2\delta}. \tag{3}$$

The number of probing frequencies is denoted by

$$\widetilde{M} = \frac{t_{M-1} - t_0}{\delta}, \tag{4}$$

and the frequency grid can be defined in terms of the following equation:

$$\omega_l \delta = \frac{2\pi}{\widetilde{M}}\, l, \quad l = 0, \ldots, \widetilde{M} - 1. \tag{5}$$

Notice further that the spectra at the front and rear halves of the frequency grid are symmetric since the microarray experiments output real values. The Lomb-Scargle periodogram represents an efficient solution for estimating the spectra of unevenly sampled data sets, and simulation results verify its superior performance for biological data with small sample sizes and various uneven sampling patterns [22].
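To make Section 2.1 concrete, the following is a minimal NumPy sketch of (1)–(5). It assumes integer-valued time tags (so that the gcd defining δ is well defined) and is an illustration, not the code used in the paper.

```python
import numpy as np

def lomb_scargle(t, x, omegas):
    """Normalized Lomb-Scargle periodogram, following (1)-(2)."""
    t, x = np.asarray(t, float), np.asarray(x, float)
    xm = x - x.mean()
    var = x.var(ddof=1)                                     # sample variance
    phi = np.empty(len(omegas))
    for i, w in enumerate(omegas):
        tau = np.arctan2(np.sin(2*w*t).sum(), np.cos(2*w*t).sum()) / (2*w)
        c, s = np.cos(w*(t - tau)), np.sin(w*(t - tau))
        phi[i] = (xm @ c)**2 / (c @ c) + (xm @ s)**2 / (s @ s)
    return phi / (2*var)

def frequency_grid(t):
    """Probing frequencies per (3)-(5), assuming integer time tags."""
    t = np.asarray(t, int)
    gaps = np.abs(t[:, None] - t[None, :])
    delta = np.gcd.reduce(gaps[gaps > 0])                   # gcd of all intervals
    M_tilde = (t.max() - t.min()) // delta
    ells = np.arange(1, M_tilde)                            # skip the degenerate DC term
    return 2*np.pi*ells / (M_tilde*delta)                   # omega_l, per (5)
```

For a single gene one would compute `lomb_scargle(t, x, frequency_grid(t))`; since the spectrum is symmetric for real-valued data, only the first half of the grid carries independent information.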
2.2. Clustering. The obtained Lomb-Scargle power spectral density is used as input to the clustering schemes as an alternative to the original gene expression measurements. Three clustering schemes, hierarchical, K-means, and self-organizing map (SOM), are used to test this substitution.

2.2.1. Hierarchical Clustering. Hierarchical clustering is a partitioning procedure that assumes the form of a tree, also known as a dendrogram. The bottom-up algorithm starts by treating each gene as a cluster; then, at each higher level, a new cluster is generated by joining the two closest clusters at the lower level. In order to quantify the distance between two gene profiles, different metrics have been proposed in the literature, as enumerated in Table 1.
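For reference, the four metrics of Table 1 can be written directly for a pair of expression (or spectral density) vectors; this is an illustrative sketch assuming equal-length 1-D NumPy arrays, not tied to any particular library.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.dot(x - y, x - y))

def city_block(x, y):
    return np.abs(x - y).sum()

def cosine(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return 1.0 - np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
```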
1: Input n genes with their expressions or spectral densities;
2: Initialize k ⇐ n, C_i ⇐ {x_i};
3: while k > 1 do
4:   {i, j} = arg min_{i,j} d(C_i, C_j);
5:   Insert C_i ∪ C_j; delete C_i and C_j;
6:   Label all existing clusters with integers 1, 2, …, (k − 1);
7:   k ⇐ k − 1
8: end while

Algorithm 1: Hierarchical clustering algorithm.
1: Input gene expressions or spectral densities, and the desired number of clusters K;
2: Randomly create centroids µ_1, …, µ_K;
3: Assign each gene x to the cluster i = arg min_{j=1,…,K} d(µ_j, x);
4: while members in some clusters change do
5:   Compute centroids µ_1, …, µ_K;
6:   Assign gene x to cluster i = arg min_j d(x, µ_j);
7: end while

Algorithm 2: K-means clustering algorithm.
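Below is a runnable sketch of Algorithm 2 above, using the squared Euclidean distance and multiple random restarts (the simulations in this paper report the best of 5 runs); any metric from Table 1 could be substituted in the distance computation. It is a minimal sketch, not a definitive implementation.

```python
import numpy as np

def kmeans(X, K, n_restarts=5, n_iter=100, seed=None):
    """K-means with random restarts; returns labels and centroids of the
    run with the lowest within-cluster sum of squared distances."""
    rng = np.random.default_rng(seed)
    best_cost, best_labels, best_mu = np.inf, None, None
    for _ in range(n_restarts):
        mu = X[rng.choice(len(X), size=K, replace=False)]        # random initial centroids
        for _ in range(n_iter):
            d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared Euclidean distances
            labels = d.argmin(axis=1)
            new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                               else mu[k] for k in range(K)])    # keep empty-cluster centroids
            if np.allclose(new_mu, mu):
                break
            mu = new_mu
        cost = d[np.arange(len(X)), labels].sum()
        if cost < best_cost:
            best_cost, best_labels, best_mu = cost, labels, mu
    return best_labels, best_mu
```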
The correlation distance is the most popular metric and was exploited in Eisen's work [2]. Based on the distances between gene expressions, we can further define distances between two gene clusters, that is, linkage methods, as illustrated by Table 2. The single linkage method actually constructs a minimal spanning tree, and it sometimes builds an undesirably long chain. The complete linkage method discourages this chaining effect and in each step increases the cluster diameter as little as possible; however, it assumes that the true clusters are compact. Alternatively, the average linkage method makes a compromise and is usually the preferred method since it poses no assumption on the structure of the clusters. The selection of the distance metric and linkage method depends on the nature of the real data, and it has been proposed that several clustering schemes be tested at the same time so that each can capture different aspects of the data. The hierarchical clustering scheme can be formulated in terms of the pseudocode depicted in Algorithm 1. If a specific number of clusters c is desired, only line 3 needs to be changed, by substituting k > c for k > 1.

2.2.2. K-means Clustering. K-means clustering divides the genes into K predetermined clusters. It iteratively updates the centroid of each cluster and reassigns each gene to the cluster with the nearest centroid. The different distance metrics listed in Table 1 can also be exploited in the K-means clustering scheme, and in each iteration the new centroid might be the median or the mean of the cluster members. The K-means clustering can be formulated as Algorithm 2. One of the problems associated with K-means clustering is that the iterations may converge to a locally suboptimal solution; therefore, in our simulations we ran the algorithm 5 times and reported the run with the best performance. The K-means clustering method was exploited by Tavazoie et al. [4], who combined clustering with the motif-finding problem.

2.2.3. Self-Organizing Map (SOM) Clustering. The self-organizing map method is in essence based on a one-layer neural network and was exploited in [3]. Each cluster centroid maps to a node in a two-dimensional lattice. The method iteratively updates the centroid of each cluster through competitive learning: at iteration t, a randomly selected gene's expression vector x is fed to the learning system, and the centroid closest to the incoming gene's expression vector is denoted by µ_i. Then each centroid is updated via
$$\boldsymbol{\mu}_j^{t+1} = \boldsymbol{\mu}_j^{t} + g(d(i, j), t)\left(\mathbf{x} - \boldsymbol{\mu}_j^{t}\right), \quad j = 1, \ldots, K, \tag{6}$$
where the function d(i, j) defines the distance between the two nodes indexed by i and j in the two-dimensional lattice; it can be set to 1 if node j is within the neighborhood of node i, and to 0 otherwise. The function g(·, ·) represents the learning rate, and it is monotonically decreasing in both t and d(i, j). The SOM clustering algorithm can be formulated as Algorithm 3.

1: Input gene expressions or spectral densities, the desired number of clusters K, and the maximum number of iterations T;
2: Randomly create centroids µ_1, …, µ_K;
3: Assign each gene x to the cluster i = arg min_{j=1,…,K} d(µ_j, x);
4: for t = 1 to T do
5:   Randomly select a gene expression x;
6:   Find the node i = arg min_{j=1,…,K} d(µ_j, x);
7:   Update centroids µ_1, …, µ_K based on (6);
8: end for
9: Assign each gene x to cluster i = arg min_{j=1,…,K} d(x, µ_j);

Algorithm 3: SOM clustering algorithm.

2.3. Performance Evaluation Metric. The three clustering schemes, with inputs of either gene expressions or spectral densities, are evaluated in two ways: how well they group time-regulated genes, and whether they are significantly different from each other. The corresponding criteria are defined based on information-theoretic quantities.

2.3.1. Validation of Clustering Scheme. Given N genes with their expression or spectral density information
$\{x_1, x_2, \ldots, x_N\} = \Omega$, suppose the clustering scheme creates a partition of the genes containing K clusters, $C = \{C_1, C_2, \ldots, C_K\}$, where any two clusters $C_i$ and $C_j$ are mutually exclusive ($C_i \cap C_j = \emptyset$) and all clusters together constitute the measured gene expressions ($\cup_{i=1}^{K} C_i = \Omega$). Then the entropy of the clustering can be exploited to measure the information of the clustering:

$$H(C) = -\sum_{i=1}^{K} \frac{|C_i|}{N} \log \frac{|C_i|}{N}, \tag{7}$$
where | · | measures the size of a cluster. Genes cooperate by participating in the same biological processes; in other words, singleton clusters are not expected to occur frequently in the clustering. Therefore, for a given K, the sizes of the clusters should be balanced, and the higher the entropy of the clustering, the better the clustering scheme.

The clustering schemes can be validated by their ability to group genes that have been annotated to share similar biological functions or to participate in the same biological process. One of the most explored processes is the yeast cell cycle, for which the genes have been mostly identified and their interactions recorded in a public database [24]. Assume a set of genes, denoted G, has been verified to participate in a specific process; the joint entropy of the clustering and the known set can be represented by

$$H(C, G) = -\sum_{i=1}^{K} \frac{|C_i \cap G|}{N} \log \frac{|C_i \cap G|}{N}. \tag{8}$$
It is desirable that genes with the same functions be integrated into as small a number of clusters as possible; therefore, the smaller the joint entropy, the better the clustering. A straightforward performance metric combining both the clustering entropy and the joint entropy is the mutual information

$$I(C, G) = H(C) + H(G) - H(C, G), \tag{9}$$

where H(G) is defined similarly to (7) and is constant across different clustering schemes. This metric is actually consistent with that proposed by Gibbons and Roth [25], whereby multiple gene attributes were considered. Higher mutual information between the clustering C and the prespecified set G stands for a balanced clustering of all genes in which the genes of G are more concentrated; in other words, it indicates better performance.
2.3.2. Difference between Two Clusterings. Two clustering schemes create two different partitions of all the observed genes. A measure of the distance between two clusterings is highly valuable when the two schemes do not show a significant difference in their performance. Various metrics have been proposed to evaluate the difference between two clusterings, for example, Fowlkes and Mallows [26], Rand [27], and, more recently, Meilă [28]. We adopt Meilă's variation of information (VI) metric because it is more discriminative, makes no assumption on the clustering structure, requires no rescaling, and does not depend on the sample size. Assume two different schemes produce two clusterings $C = \{C_1, \ldots, C_K\}$ and $C' = \{C'_1, \ldots, C'_{K'}\}$, respectively; then the mutual information between these two clusterings is represented by

$$I(C, C') = \sum_{i=1}^{K} \sum_{j=1}^{K'} \frac{|C_i \cap C'_j|}{N} \log \frac{N \cdot |C_i \cap C'_j|}{|C_i| \cdot |C'_j|}. \tag{10}$$
Then, the variation of information (VI) is defined as

$$VI(C, C') = H(C) + H(C') - 2I(C, C'). \tag{11}$$

VI is upper bounded by 2 log K. It is zero if and only if the two clusterings are exactly the same; the greater the variation of information, the larger the difference between the two clusterings.
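A direct sketch of (10) and (11) for two labelings of the same N genes; base-2 logarithms are assumed here so that VI is measured in bits, as in Figures 4 and 5.

```python
import numpy as np
from collections import Counter

def variation_of_information(a, b):
    """VI of (11) between two labelings a and b of the same N genes."""
    N = len(a)
    H = lambda labels: -sum((c / N) * np.log2(c / N) for c in Counter(labels).values())
    joint, ca, cb = Counter(zip(a, b)), Counter(a), Counter(b)
    # mutual information of (10), summed over nonempty intersections
    I = sum((n / N) * np.log2(N * n / (ca[i] * cb[j])) for (i, j), n in joint.items())
    return H(a) + H(b) - 2 * I
```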
3. Results

The performance of the proposed power spectrum-based scheme is illustrated through comparisons with three traditional expression-based clustering schemes: hierarchical, K-means, and self-organizing map (SOM). The comparisons are divided into two parts. In the first part, we evaluate their ability to group the genes involved in the cell cycle, while the second part illustrates that the proposed scheme constructs clusters significantly different from those created by the traditional schemes.

3.1. Clustering Performance Evaluation. These simulations were performed on the cdc15 data set published by Spellman et al. [13], which contains 24 time-series expression measurements of 6178 yeast genes. The hierarchical, K-means,
and self-organizing map (SOM) clustering schemes were simulated having as inputs both the computed spectral densities and the original expression data. The hierarchical and K-means clusterings were configured with the different distance and linkage methods defined in Tables 1 and 2, respectively. The simulations were executed until up to 200 clusters were created.

Figure 1: Performance of hierarchical clustering: (a) single linkage, (b) complete linkage, and (c) average linkage. The solid curves represent the clusterings based on original gene expressions while the dotted curves stand for clusterings based on spectral densities (expression and spectral variants of the Euclidean, city-block, cosine, and correlation distances).

The cell cycle has served as a research target in molecular biology for a long time since it plays a crucial role in cell division, and medically it underlies the development
of cancer. Experimentally, 109 genes have been verified to participate in the cell-cycle process, and their interactions are recorded in the public KEGG database [24]. Among them, 104 genes are present in Spellman's data set. The simulations tested how these genes were clustered with other genes: intuitively, the more tightly these 104 genes are grouped, the better the clustering scheme. On the other hand, it is hoped that the cluster sizes remain relatively balanced and that there are not many singleton clusters (clusters containing only one gene). The clustering performance is represented by an information-theoretic quantity, the mutual information, defined between the obtained partition of all measured genes and the set of 104 genes. Higher mutual information indicates that the 104 cell-cycle genes are closely integrated into only a few clusters and that most clusters are balanced in size. In other words, for the same number of clusters, the higher the mutual information, the better the performance. The proposed strategy is certainly not constrained to detecting cell-cycle genes; however, we confine the discussion to the cell cycle here because the available data set was generated for cell-cycle research, and the cell-cycle genes have been identified for a relatively long time with high confidence.

Figure 2: Performance of K-means clustering. The solid curves represent the clusterings based on original gene expressions while the dotted curves stand for clusterings based on spectral densities.

The simulation results for hierarchical clustering are illustrated in Figure 1; each subplot is associated with a linkage method. Figure 1(a) demonstrates the performance of the single linkage method. The dotted curves represent
schemes clustering spectral densities, while the solid curves denote schemes clustering the original gene expressions. The mutual information goes up nearly linearly as the number of clusters increases; indeed, when we delved into the generated clusters, we found that most clusters were singletons. The chaining effect took place, so the single linkage method is not a good candidate for clustering gene expression measurements. The spectral density-based methods were all better than their traditional counterparts, which performed clustering on the original gene expression data; among all, the Euclidean method clustering spectral densities achieved the best performance.

Figure 1(b) shows the results for the complete linkage method of the hierarchical clustering. Each cluster actually represents a complete subgraph, and the complete linkage method discourages the chaining effect that occurs with the single linkage method. The performance of the spectral density-based clusterings is lower bounded by the worst performances of the traditional gene expression-based clusterings. For the gene expression-based clustering, the correlation and cosine approaches are better than the Euclidean and city-block approaches, while for the spectral density clustering, the Euclidean and city-block approaches exhibit the best performance.

Figure 3: Performance of hierarchical, K-means, and SOM. The comparison is performed across the complete linkage of hierarchical, K-means, and SOM. The solid curves represent the clustering based on original gene expression data while the dotted curves stand for clustering based on spectral data.
Figure 4: Distance between two clusterings created by different methods with the same input. Only the complete linkage for the hierarchical clustering is considered. The solid curves represent the clusterings based on original gene expression data while the dotted curves stand for clusterings based on spectral densities. Curves: hier exp euc versus hier exp cor, hier psd euc versus hier psd cor, hier exp cor versus kmeans exp cor, hier psd euc versus kmeans psd euc, hier exp cor versus som exp, hier psd euc versus som psd, kmeans exp cor versus som exp, and kmeans psd euc versus som psd. Abbreviations: hier (hierarchical clustering), euc (Euclidean), cor (correlation), psd (power spectral density), exp (expression data).
Figure 1(c) plots the results for the average linkage method of the hierarchical clustering. The average linkage is the most widely deployed method since it makes a compromise between the single and complete methods and does not assume any structure in the underlying data; however, in the presence of real gene expression data, it is not as good as the complete linkage method. The different distance metrics differ in their ability to group the involved cell-cycle genes: for clustering expression data, the cosine and correlation approaches still achieve the best performance, but they perform worse than the spectra-based Euclidean and city-block methods.

Configured also with the various distance metrics, the K-means algorithm was applied to both the spectral and the original gene expression data. To avoid convergence to locally suboptimal solutions, all K-means clustering schemes were executed 5 times, and the best performance was reported. For clustering expression data, the correlation and cosine approaches are still the best choices, while for the spectra-based schemes, the Euclidean and city-block approaches again exceed the other schemes (see Figure 2).
Figure 5: Distance between two clusterings created by the same method assuming different inputs. The comparison is performed across the complete linkage of hierarchical, K-means, and SOM. The dashed curve is provided for reference. Curves: hier exp euc versus hier psd euc, hier exp cor versus hier psd cor, kmeans exp euc versus kmeans psd euc, kmeans exp cor versus kmeans psd cor, som exp versus som psd, and hier exp cor versus som exp (reference). Abbreviations: hier (hierarchical clustering), euc (Euclidean), cor (correlation), psd (power spectral density), exp (expression data).
Figure 3 compares the performance of the hierarchical and K-means clustering schemes with that of SOM; the best schemes of hierarchical and K-means are displayed. It turns out that SOM is the best performing scheme, K-means ranks in the middle, and hierarchical clustering is the worst, although the discrepancy does not appear significant. Among all schemes, the spectral density-based SOM achieves the best performance. Although the discrepancy between the best spectral-based clustering and the best gene expression-based clustering is not obvious, they actually create significantly different clusters; this difference can be captured by a distance metric between clusterings.

The inferior performance of the correlation and cosine metrics with spectra as input is partially due to the flat spectra of those genes with no time-regulated patterns: a flat spectrum in the denominator causes these distance metrics to be highly biased. It is also worthwhile to note that other distance metrics have been proposed in the literature, for example, coherence [29] and mutual information [30]. However, these metrics involve the estimation of a joint distribution, which usually requires large sample sizes, a requirement that cannot generally be satisfied by microarray experiments. Extra normalization of the spectrum can be
performed, but simulations show that it does not provide a significant or consistent improvement.

3.2. Distance between Clusterings. Testing the distance between the spectra-based and gene expression-based clusterings also reveals the value of the proposed scheme. The variation of information metric proposed by Meilă [28] is exploited to measure the difference between two clusterings; the basic principle is that the higher the variation of information, the greater the difference. Figure 4 demonstrates the distance between two clusterings with the same input, either the computed spectral densities or the measured gene expressions. For the hierarchical clustering, only the complete linkage method is considered since it possesses the best performance in terms of grouping the known cell-cycle genes. The complete set of distances between any two schemes is depicted in the additional File 1 [31]; Figure 4 conserves only the salient general patterns for conciseness. For hierarchical clustering of gene expression data, the correlation and Euclidean schemes differ more, and the distance between these two is the highest curve when the number of clusters is greater than 120. The distance between the correlation and Euclidean hierarchical clusterings is even much larger than the distance between the clusterings created by the hierarchical scheme and K-means or SOM. However, when clustering spectral densities, all schemes display quite similar patterns and closely matched performance. This means that clustering spectral densities is stable across different clustering schemes.

Figure 5 compares the same clustering methods assuming different inputs. Compared with the scale of Figure 4, the distance between different clusterings with the same input is much smaller than the distance between clusterings that assume different input types: the distance between any two schemes with the same input is below 7 bits when the number of clusters ranges from 0 to 200, as shown in Figure 4 and by the dashed curve in Figure 5, while the distance between the clusterings created by the same scheme with two different input types is above 8 bits when the number of clusters ranges from 100 to 200. This shows that changing the input type from gene expression to spectral density produces a significantly different clustering. For the complete plots of the distances between clusterings produced by the various schemes assuming different input types, please refer to the additional File 2 [31].
4. Conclusion

A novel clustering preprocessing strategy is proposed that combines the traditional clustering schemes with power spectral analysis of time-series gene expression measurements. The simulation results corroborate that the proposed approach achieves a better clustering for hierarchical, K-means, and self-organizing map (SOM) schemes in most cases; moreover, it constructs a significantly different partition relative to the traditional clustering strategies. When deploying the hierarchical or K-means clustering methods based on the spectral density, the Euclidean and city-block distance metrics appear more appealing than the cosine or correlation distance metrics. The proposed algorithm is valuable since it provides additional information about temporally regulated genetic processes, for example, the cell cycle.
Acknowledgments

This work was supported by the National Cancer Institute (CA-90301) and the National Science Foundation (ECS-0355227 and CCF-0514644).
References

[1] I. Simon, Z. Siegfried, J. Ernst, and Z. Bar-Joseph, "Combined static and dynamic analysis for determining the quality of time-series expression profiles," Nature Biotechnology, vol. 23, no. 12, pp. 1503–1508, 2005.
[2] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 25, pp. 14863–14868, 1998.
[3] P. Tamayo, D. Slonim, J. Mesirov, et al., "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 6, pp. 2907–2912, 1999.
[4] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church, "Systematic determination of genetic network architecture," Nature Genetics, vol. 22, no. 3, pp. 281–285, 1999.
[5] X. Zhou, X. Wang, E. R. Dougherty, D. Russ, and E. Suh, "Gene clustering based on clusterwide mutual information," Journal of Computational Biology, vol. 11, no. 1, pp. 147–161, 2004.
[6] C. D. Giurcăneanu, I. Tăbuş, J. Astola, J. Ollila, and M. Vihinen, "Fast iterative gene clustering based on information theoretic criteria for selecting the cluster structure," Journal of Computational Biology, vol. 11, no. 4, pp. 660–682, 2004.
[7] P. D'Haeseleer, "How does gene expression clustering work?" Nature Biotechnology, vol. 23, no. 12, pp. 1499–1501, 2005.
[8] M. F. Ramoni, P. Sebastiani, and I. S. Kohane, "Cluster analysis of gene expression dynamics," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 14, pp. 9121–9126, 2002.
[9] I. Tabus and J. Astola, "Clustering the non-uniformly sampled time series of gene expression data," in Proceedings of the International Symposium on Signal Processing and Applications (ISSPA '03), vol. 2, pp. 61–64, Paris, France, July 2003.
[10] J. Ernst, G. J. Nau, and Z. Bar-Joseph, "Clustering short time series gene expression data," Bioinformatics, vol. 21, supplement 1, pp. i159–i168, 2005.
[11] W. Zhao, E. Serpedin, and E. R. Dougherty, "Inferring gene regulatory networks from time series data using the minimum description length principle," Bioinformatics, vol. 22, no. 17, pp. 2129–2135, 2006.
[12] S. Liang, S. Fuhrman, and R. Somogyi, "Reveal, a general reverse engineering algorithm for inference of genetic network architectures," in Proceedings of the Pacific Symposium on Biocomputing, vol. 3, pp. 18–29, Maui, Hawaii, USA, January 1998.
[13] P. T. Spellman, G. Sherlock, M. Q. Zhang, et al., "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[14] M. L. Whitfield, G. Sherlock, A. J. Saldanha, et al., "Identification of genes periodically expressed in the human cell cycle and their expression in tumors," Molecular Biology of the Cell, vol. 13, no. 6, pp. 1977–2000, 2002.
[15] S. Wichert, K. Fokianos, and K. Strimmer, "Identifying periodically expressed transcripts in microarray time series data," Bioinformatics, vol. 20, no. 1, pp. 5–20, 2004.
[16] M. Ahdesmäki, H. Lähdesmäki, R. Pearson, H. Huttunen, and O. Yli-Harja, "Robust detection of periodic time series measured from biological systems," BMC Bioinformatics, vol. 6, article 117, pp. 1–18, 2005.
[17] N. R. Lomb, "Least-squares frequency analysis of unequally spaced data," Astrophysics and Space Science, vol. 39, no. 2, pp. 447–462, 1976.
[18] J. D. Scargle, "Studies in astronomical time series analysis—II. Statistical aspects of spectral analysis of unevenly spaced data," The Astrophysical Journal, vol. 263, pp. 835–853, 1982.
[19] E. F. Glynn, J. Chen, and A. R. Mushegian, "Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms," Bioinformatics, vol. 22, no. 3, pp. 310–316, 2006.
[20] P. Stoica and N. Sandgren, "Spectral analysis of irregularly-sampled data: paralleling the regularly-sampled data approaches," Digital Signal Processing, vol. 16, no. 6, pp. 712–734, 2006.
[21] Y. Wang, P. Stoica, J. Li, and T. L. Marzetta, "Nonparametric spectral analysis with missing data via the EM algorithm," Digital Signal Processing, vol. 15, no. 2, pp. 191–206, 2005.
[22] W. Zhao, K. Agyepong, E. Serpedin, and E. R. Dougherty, "Detecting periodic genes from irregularly sampled gene expressions: a comparison study," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2008, Article ID 769293, 8 pages, 2008.
[23] L. Eyer and P. Bartholdi, "Variable stars: which Nyquist frequency?" Astronomy and Astrophysics, vol. 135, no. 1, pp. 1–3, 1999.
[24] "KEGG Yeast Cell Cycle Pathway," http://www.genome.ad.jp/kegg/pathway/sce/sce04111.html.
[25] F. D. Gibbons and F. P. Roth, "Judging the quality of gene expression-based clustering methods using gene annotation," Genome Research, vol. 12, no. 10, pp. 1574–1581, 2002.
[26] E. Fowlkes and C. Mallows, "A method for comparing two hierarchical clusterings," Journal of the American Statistical Association, vol. 78, no. 383, pp. 553–569, 1983.
[27] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[28] M. Meilă, "Comparing clusterings—an information based distance," Journal of Multivariate Analysis, vol. 98, no. 5, pp. 873–895, 2007.
[29] A. J. Butte, L. Bao, B. Y. Reis, T. W. Watkins, and I. S. Kohane, "Comparing the similarity of time-series gene expression using signal processing metrics," Journal of Biomedical Informatics, vol. 34, no. 6, pp. 396–405, 2001.
[30] D. R. Brillinger, "Second-order moments and mutual information in the analysis of time series," in Recent Advances in Statistical Methods, pp. 64–76, Imperial College Press, London, UK, 2002.
[31] "Supplementary Materials," http://www.ece.tamu.edu/~wtzhao/EurasipBSBClutering.htm.
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 158368, 10 pages doi:10.1155/2009/158368
Research Article Is Bagging Effective in the Classification of Small-Sample Genomic and Proteomic Data? T. T. Vu and U. M. Braga-Neto Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA Correspondence should be addressed to U. M. Braga-Neto,
[email protected] Received 1 August 2008; Revised 4 December 2008; Accepted 19 January 2009 Recommended by Yufei Huang There has been considerable interest recently in the application of bagging in the classification of both gene-expression data and protein-abundance mass spectrometry data. The approach is often justified by the improvement it produces on the performance of unstable, overfitting classification rules under small-sample situations. However, the question of real practical interest is whether the ensemble scheme will improve performance of those classifiers sufficiently to beat the performance of single stable, nonoverfitting classifiers, in the case of small-sample genomic and proteomic data sets. To investigate that question, we conducted a detailed empirical study, using publicly-available data sets from published genomic and proteomic studies. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, nonoverfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, as expected, the ensemble method did not improve the performance of these classifiers significantly. Representative experimental results are presented and discussed in this work. Copyright © 2009 T. T. Vu and U. M. Braga-Neto. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction Randomized ensemble methods for classifier design combine the decision of an ensemble of classifiers designed on randomly perturbed versions of the available data [1–5]. The combination is often done by means of majority voting among the individual classifier decisions [4–6], whereas the data perturbation usually employs the bootstrap resampling approach, which corresponds to sampling uniformly with replacement from the original data [7, 8]. The combination of bootstrap resampling and majority voting is known as bootstrap aggregation or bagging [4, 5]. There has been considerable interest recently in the application of bagging in the classification of both geneexpression data [9–12] and protein-abundance mass spectrometry data [13–18]. However, there is scant theoretical justification for the use of this heuristic, other than the expectation that combining the decision of several classifiers will regularize and improve the performance of unstable overfitting classification rules, such as unpruned decision
trees, provided one uses a large enough number of classifiers in the ensemble [4, 5]. It is also claimed that ensemble rules “do not overfit,” meaning that the classification error converges as the number of component classifiers tends to infinity [5]. However, the main performance issue is not whether the ensemble scheme improves the classification error of a single unstable, overfitting classifier, or whether its classification error converges to a fixed limit. These are important questions, which have been studied in the literature, in particular when the component classifiers are decision trees [5, 19–23]. The question of main practical interest, rather, is whether the ensemble scheme will improve the performance of unstable, overfitting classifiers sufficiently to beat the performance of single stable, nonoverfitting classifiers, particularly in small-sample settings. Therefore, there is a pressing need to examine rigorously the suitability and validity of the ensemble approach in the classification of small-sample genomic and proteomic data. In this paper, we present results from a comprehensive empirical study concerning the effect of bagging on the performance of several classification rules,
including diagonal and plain linear discriminant analysis, 3-nearest neighbors, CART decision trees, and neural networks, using real data from published microarray and mass spectrometry studies. Here we are concerned exclusively with performance in terms of the true classification error, and therefore we employ filter-based feature selection and holdout estimation based on large samples in order to allow accurate classification error estimation. Similar studies recently published [11, 12] rely on small-sample wrapper feature selection and small-sample error estimation methods, which obscure the issue of how bagging really affects the true classification error. In particular, there is evidence that filter-based feature selection outperforms wrapper feature selection in small-sample settings [24]. In our experiments, we employ the one-tailed paired t-test to assess whether the expected true classification error is significantly smaller for the bagged classifier as opposed to the original base classifier, under different sample sizes, dimensionalities, and numbers of classifiers in the ensemble. Clearly, the heuristic is beneficial for a particular classification rule if and only if there is a significant decrease in expected classification error; otherwise, the procedure is to be avoided. However, the magnitude of improvement is also a factor: a small improvement in performance may not be worth the extra computation required (which is roughly m times larger for the bagged classifier, where m is the number of classifiers in the ensemble). The full results of the empirical study are available on a companion website http://www.ece.tamu.edu/∼ulisses/bagging/index.html.
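For reference, the significance test used throughout the study can be sketched as follows; the error values below are illustrative placeholders rather than the study's data, and the alternative argument assumes SciPy 1.6 or later.

```python
# Sketch of the one-tailed paired t-test used to compare a bagged
# classifier against its base classifier over repeated training sets.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
err_base = rng.normal(0.25, 0.02, size=1000)                # base-classifier errors (placeholder)
err_bagged = err_base - rng.normal(0.005, 0.01, size=1000)  # paired bagged errors (placeholder)

# H1: expected error of the bagged classifier is smaller, that is,
# mean(err_base - err_bagged) > 0.
t_stat, p_value = stats.ttest_rel(err_base, err_bagged, alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.4g}, significant at 99%: {p_value < 0.01}")
```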
2. Randomized Ensemble Classification Rules
Classification involves a feature vector X in a feature space V, a label Y ∈ {0, 1}, and a classifier ψ : V → {0, 1}, such that ψ(x) attempts to predict the value of Y for a given observation X = x. The joint feature-label distribution F of the pair (X, Y) completely characterizes the stochastic properties of the classification problem. In practice, a classification rule is used to design a classifier based on sample training data. Working formally, a classification rule is a mapping Ψ_n : [V × {0, 1}]^n → {0, 1}^V, which takes an i.i.d. sample S_n = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} of feature-label pairs drawn from the feature-label distribution to a designed classifier ψ_n = Ψ_n(S_n). The classification error is the probability that classification is erroneous given the sample data, that is, ε_n = P(ψ_n(X) ≠ Y | S_n). Note that the classification error is random only through the training data S_n. The expected classification error E[ε_n] is the average classification error over all possible sample data sets; it is a fixed parameter of the classification rule and feature-label distribution, and it is used as the measure of performance of the former given the latter.

Randomization approaches based on resampling can be seen as drawing i.i.d. samples S_k^* = {(X_1^*, Y_1^*), (X_2^*, Y_2^*), ..., (X_k^*, Y_k^*)} from a surrogate joint feature-label distribution F^*, which is a function of the original training data S_n. In the bootstrap resampling approach, one has k = n, and the randomized sample S_n^* corresponds to sampling uniformly n training points from S_n with replacement. This corresponds to using the empirical distribution of the data S_n as the surrogate joint feature-label distribution F^*; the empirical distribution assigns discrete probability mass 1/n to each observed data point in S_n. Some of the original training points may appear multiple times, whereas others may not appear at all in the bootstrap sample S_n^*. Note that, given S_n, the bootstrap sample S_n^* is conditionally independent of the original feature-label distribution F.

In aggregation by majority voting, a classifier is obtained based on majority voting among individual classifiers designed on the randomized samples S_k^* using the original classification rule Ψ_n. This leads to an ensemble classification rule Ψ_n^R, such that

ψ_n^R(x) = Ψ_n^R(S_n)(x) = { 1, if E[Ψ_n(S_k^*)(x) | S_n] > 1/2; 0, otherwise },  (1)

for x ∈ V, where the expectation is with respect to the random mechanism F^*, fixed at the observed value of S_n. For bootstrap majority voting, or bagging, the expectation in (1) usually has to be approximated by Monte Carlo sampling, which leads to the “bagged” classifier:

ψ_{n,m}^B(x) = { 1, if (1/m) Σ_{j=1}^{m} ψ_n^{*(j)}(x) > 1/2; 0, otherwise },  (2)

where the classifiers ψ_n^{*(j)} are designed by the original classification rule Ψ_n on bootstrap samples S_n^{*(j)}, for j = 1, ..., m, for large enough m (notice the parallel with the development in [25], particularly equations (2.8)–(2.10) and the accompanying discussion). The issue of how large m has to be so that (2) is a good Monte Carlo approximation is critical in the application of bagging. Note that m represents the number of classifiers that must be designed to be part of the ensemble, so a computational problem may emerge if m is made too large. In addition, even if a suitable m is found, the performance of the ensemble must be compared to that of the base classification rule, to see if there is significant improvement. Even more importantly, the performance of the ensemble has to be compared to that of other classification rules; that the ensemble improves the performance of an unstable, overfitting classifier is of small value if it can be bested by a single stable, nonoverfitting classifier. In the next section, we present a comprehensive empirical study that addresses these questions.
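A minimal sketch of the bagged classifier of equation (2) is given below; the decision-tree base learner from scikit-learn is a stand-in for the various base rules considered in this paper, and the function is illustrative rather than the authors' implementation.

```python
# Sketch of equation (2): fit m base classifiers on bootstrap samples
# of the training data and combine their 0/1 predictions by majority
# vote. An odd m avoids tie-breaking issues.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, m=51, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = np.zeros(len(X_test))
    for _ in range(m):
        idx = rng.integers(0, n, size=n)            # bootstrap sample S_n^{*(j)}
        clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes += clf.predict(X_test)                # labels assumed to be 0/1
    return (votes / m > 0.5).astype(int)            # majority vote
```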
3. Experimental Study
In this section, we report the results of a large simulation study based on publicly-available patient data from genomic and proteomic studies, which measured the performance of the bagging heuristic through the expected classification error, for varying numbers of component classifiers, sample sizes, and dimensionalities.
3.1. Methods. We considered in our experiments several classification rules, listed here in order of complexity: diagonal linear discriminant analysis (DLDA), linear discriminant analysis (LDA), 3-nearest neighbors (3NN), decision trees (CART), and neural networks (NNET) [26, 27]. DLDA is a variant of LDA in which only the diagonal elements (the variances) of the covariance matrix are estimated, while the off-diagonal elements (the covariances) are assumed to be zero. Bagging is applied to each of these base classification rules and its performance recorded for a varying number of individual classifiers. The neural network consists of one hidden layer with 4 nodes and standard sigmoids as nonlinearities. The network is trained by Levenberg-Marquardt optimization with a maximum of 30 iterations. CART is applied with a stopping criterion: splitting is stopped when there are fewer than 3 points in a given node. This is distinct from the approach advocated in [5] for random forests, where unpruned, fully grown trees are used instead; the reason for this is that we did not attempt to implement the approach in [5] (which involves concepts such as random node splitting and is thus specific to decision trees), but rather to study the behavior of bagging, which is the centerpiece of such ensemble methods, across different classification rules. Resampling is done by means of balanced bootstrapping, where all sample points are made to appear exactly the same number of times in the computation [28]. We selected data sets with a large number N of samples (see below) in order to be able to estimate the true error accurately using held-out testing data. In each case, 1000 training data sets of size n = 20, 40, and 60 were drawn uniformly and independently from the total pool of N samples. The training data are drawn in a stratified fashion, following the approximate proportion of each class in the original data. Based on the training data, a filter-based gene selection step is employed to select the top p discriminating genes; we considered in this study p = 2, 3, 5, 8. The univariate feature selection methods used in the filter step are the Welch two-sample t-test [29] and the RELIEF method [30]; in the latter case, we employ the 1-nearest neighbor method when searching for hits and misses. After classifier design, the true classification error for each data set of size n is approximated by a holdout estimator, whereby the N − n sample points not drawn are used as the test set (a good approximation to the classification error, given that N ≫ n). The expected classification error is then estimated as the sample mean of the classification error over the 1000 training data sets. The sample size n is kept small, as we are interested in the small-sample properties of bagging. Note that we must also have N ≫ n in order to provide large enough testing sets, as well as to make sure that consecutive training sets do not significantly overlap, so that the expected classification error can be accurately approximated. As can be easily verified (each point of one training sample appears in another independently drawn sample with probability n/N), the expected ratio of overlapping sample points between two samples of size n from a population of size N is given simply by n/N. In all cases considered here, the expected overlap is around 20% or less, which we consider acceptable, except in the case of the lung cancer data set with n = 60; this latter case is therefore not included in our results. The one-tailed paired t-test is employed to assess whether the ensemble classifier has an expected error that is significantly smaller than that of the corresponding individual classifier.
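The balanced bootstrap described above admits a simple construction: concatenate m copies of the training indices, permute them randomly, and cut the result into m blocks of size n, so that every point appears exactly m times in total. A minimal sketch (illustrative, not the study's code):

```python
# Sketch of balanced bootstrap resampling: across the m bootstrap
# samples, each of the n training points appears exactly m times in
# total, although its count within any single sample still varies.
import numpy as np

def balanced_bootstrap_indices(n, m, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.tile(np.arange(n), m)   # each index repeated m times
    rng.shuffle(pooled)
    return pooled.reshape(m, n)         # row j indexes bootstrap sample j
```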
3.2. Data Sets. We utilized the following publicly-available data sets from published studies in order to study the performance of bagging in the context of genomics and proteomics applications.

3.2.1. Breast Cancer Gene Expression Data. These data come from the breast cancer classification study in [31], which analyzed N = 295 gene-expression microarrays containing a total of 25760 transcripts each. Filter-based feature selection was performed on a 70-gene prognosis profile, previously published by the same authors in [32]. Classification is between the good-prognosis class (115 samples) and the poor-prognosis class (180 samples), where prognosis is determined retrospectively in terms of survivability [31].

3.2.2. Lung Cancer Gene Expression Data. We employed here the data set “A” from the study in [33] on non-small-cell lung carcinomas (NSCLC), which analyzed N = 186 gene-expression microarrays containing a total of 12600 transcripts each. NSCLC is subclassified into adenocarcinomas, squamous cell carcinomas, and large-cell carcinomas; adenocarcinomas are the most common subtype, and it is of interest to distinguish them from the other subtypes of NSCLC. Classification is thus between adenocarcinomas (139 samples) and non-adenocarcinomas (47 samples).

3.2.3. Prostate Cancer Protein Abundance Data. Given the recent keen interest in deriving serum-based proteomic biomarkers for the diagnosis of cancer [34], we also included in this study data from a proteomic study of prostate cancer reported in [35]. It consists of SELDI-TOF mass spectrometry of N = 326 samples, which yields mass spectra over 45000 m/z (mass over charge) values. Filter-based feature selection is employed to find the top discriminatory m/z values to be used in the experiment. Classification is between prostate cancer patients (167 samples) and noncancer patients, including benign prostatic hyperplasia and healthy patients (159 samples). We use the raw spectral values, without baseline subtraction, as we found that this leads to better classification rates.

3.3. Results and Discussion. We present results for sample sizes n = 20 and n = 40 and dimensionalities p = 2 and p = 5, which are representative of the full set of results, available on the companion website http://www.ece.tamu.edu/∼ulisses/bagging/index.html. The case p = 2 is displayed in Tables 1, 2, and 3, each of which corresponds to a different data set. Each table displays the expected classification error as a function of the number m of classifiers used in the ensemble, for different base classification rules, feature selection methods, and sample sizes. We used in all cases an odd number m of classifiers in the ensembles, to avoid tie-breaking issues. Errors that are smaller for the ensemble classifier as compared to a single classifier at a 99% significance level, according to the one-tailed paired t-test, are indicated by bold-face type.
Table 1: Expected classification error of selected experiments for breast cancer gene-expression data under two different feature selection methods (t-test and RELIEF) for p = 2. Bold-face type indicates the values that are smaller for the ensemble classifier as compared to a single classifier at a 99% significance level, according to the one-tailed paired t-test.

Rule  FS      n   Single  m=5    m=11   m=15   m=21   m=25   m=31   m=35   m=41   m=45   m=51
DLDA  t-test  20  0.202   0.215  0.208  0.206  0.204  0.204  0.204  0.204  0.203  0.204  0.203
DLDA  t-test  40  0.198   0.205  0.201  0.200  0.200  0.199  0.199  0.199  0.199  0.199  0.199
DLDA  RELIEF  20  0.202   0.215  0.207  0.206  0.204  0.204  0.204  0.203  0.203  0.203  0.203
DLDA  RELIEF  40  0.198   0.206  0.201  0.201  0.200  0.200  0.199  0.199  0.199  0.199  0.199
LDA   t-test  20  0.212   0.237  0.224  0.220  0.217  0.217  0.216  0.216  0.215  0.215  0.214
LDA   t-test  40  0.204   0.217  0.209  0.208  0.207  0.206  0.206  0.206  0.205  0.205  0.205
LDA   RELIEF  20  0.213   0.239  0.225  0.222  0.219  0.218  0.218  0.217  0.216  0.216  0.216
LDA   RELIEF  40  0.203   0.218  0.210  0.207  0.206  0.206  0.205  0.205  0.205  0.205  0.205
3NN   t-test  20  0.230   0.281  0.246  0.241  0.235  0.234  0.231  0.231  0.230  0.229  0.229
3NN   t-test  40  0.228   0.274  0.241  0.235  0.231  0.229  0.228  0.227  0.226  0.226  0.225
3NN   RELIEF  20  0.234   0.282  0.248  0.242  0.238  0.236  0.234  0.234  0.233  0.233  0.232
3NN   RELIEF  40  0.227   0.271  0.241  0.235  0.231  0.229  0.227  0.227  0.226  0.225  0.225
CART  t-test  20  0.259   0.297  0.263  0.256  0.250  0.247  0.246  0.244  0.243  0.242  0.242
CART  t-test  40  0.257   0.294  0.258  0.252  0.245  0.244  0.242  0.240  0.239  0.239  0.237
CART  RELIEF  20  0.263   0.299  0.265  0.258  0.253  0.250  0.247  0.247  0.245  0.245  0.244
CART  RELIEF  40  0.256   0.293  0.260  0.253  0.245  0.244  0.241  0.240  0.239  0.239  0.238
NNET  t-test  20  0.252   0.293  0.246  0.240  0.230  0.230  0.225  0.224  0.223  0.222  0.221
NNET  t-test  40  0.226   0.256  0.225  0.219  0.215  0.213  0.212  0.210  0.210  0.209  0.209
NNET  RELIEF  20  0.255   0.298  0.248  0.240  0.233  0.232  0.229  0.228  0.226  0.225  0.224
NNET  RELIEF  40  0.230   0.260  0.227  0.220  0.216  0.213  0.213  0.212  0.211  0.210  0.209
Table 2: Expected classification error of selected experiments for lung cancer gene-expression data under two different feature selection methods (t-test and RELIEF) for p = 2. Bold-face type indicates the values that are smaller for the ensemble classifier as compared to a single classifier at a 99% significance level, according to the one-tailed paired t-test.

Rule  FS      n   Single  m=5    m=11   m=15   m=21   m=25   m=31   m=35   m=41   m=45   m=51
DLDA  t-test  20  0.190   0.191  0.190  0.190  0.189  0.190  0.189  0.189  0.190  0.190  0.190
DLDA  t-test  40  0.186   0.187  0.186  0.186  0.186  0.186  0.186  0.186  0.186  0.186  0.186
DLDA  RELIEF  20  0.235   0.253  0.238  0.239  0.235  0.236  0.233  0.233  0.234  0.232  0.233
DLDA  RELIEF  40  0.207   0.212  0.209  0.208  0.207  0.207  0.207  0.207  0.207  0.207  0.206
LDA   t-test  20  0.201   0.206  0.203  0.203  0.202  0.202  0.203  0.202  0.202  0.202  0.203
LDA   t-test  40  0.192   0.194  0.193  0.193  0.193  0.193  0.192  0.192  0.193  0.192  0.192
LDA   RELIEF  20  0.262   0.295  0.274  0.271  0.265  0.265  0.263  0.263  0.260  0.261  0.261
LDA   RELIEF  40  0.208   0.223  0.214  0.213  0.212  0.212  0.210  0.211  0.210  0.210  0.208
3NN   t-test  20  0.122   0.151  0.130  0.126  0.124  0.123  0.122  0.121  0.121  0.121  0.120
3NN   t-test  40  0.123   0.147  0.129  0.127  0.125  0.124  0.123  0.123  0.122  0.122  0.121
3NN   RELIEF  20  0.247   0.334  0.265  0.258  0.249  0.248  0.246  0.247  0.244  0.244  0.243
3NN   RELIEF  40  0.232   0.317  0.252  0.243  0.238  0.235  0.234  0.233  0.232  0.231  0.230
CART  t-test  20  0.160   0.182  0.161  0.155  0.152  0.151  0.150  0.149  0.148  0.148  0.147
CART  t-test  40  0.156   0.177  0.155  0.150  0.146  0.145  0.144  0.143  0.142  0.142  0.142
CART  RELIEF  20  0.297   0.302  0.280  0.274  0.269  0.267  0.266  0.264  0.263  0.262  0.263
CART  RELIEF  40  0.297   0.297  0.273  0.268  0.263  0.261  0.260  0.258  0.257  0.257  0.256
NNET  t-test  20  0.216   0.244  0.235  0.232  0.231  0.229  0.228  0.228  0.227  0.227  0.226
NNET  t-test  40  0.195   0.232  0.215  0.212  0.208  0.207  0.205  0.204  0.203  0.202  0.202
NNET  RELIEF  20  0.239   0.257  0.247  0.247  0.244  0.242  0.242  0.241  0.242  0.242  0.241
NNET  RELIEF  40  0.231   0.252  0.242  0.241  0.238  0.236  0.235  0.234  0.234  0.235  0.233
Table 3: Expected classification error of selected experiments for prostate cancer protein-abundance data under two different feature selection methods (t-test and RELIEF) for p = 2. Bold-face type indicates the values that are smaller for the ensemble classifier as compared to a single classifier at a 99% significance level, according to the one-tailed paired t-test.

Rule  FS      n   Single  m=5    m=11   m=15   m=21   m=25   m=31   m=35   m=41   m=45   m=51
DLDA  t-test  20  0.188   0.211  0.199  0.196  0.194  0.194  0.193  0.192  0.191  0.191  0.191
DLDA  t-test  40  0.187   0.207  0.196  0.194  0.192  0.191  0.191  0.191  0.190  0.190  0.189
DLDA  RELIEF  20  0.468   0.523  0.492  0.484  0.477  0.475  0.471  0.472  0.469  0.467  0.466
DLDA  RELIEF  40  0.458   0.502  0.477  0.474  0.465  0.469  0.465  0.463  0.462  0.463  0.460
LDA   t-test  20  0.212   0.241  0.225  0.222  0.219  0.218  0.216  0.216  0.215  0.215  0.215
LDA   t-test  40  0.198   0.224  0.210  0.208  0.205  0.204  0.203  0.202  0.202  0.202  0.201
LDA   RELIEF  20  0.422   0.492  0.449  0.435  0.426  0.426  0.422  0.419  0.417  0.415  0.410
LDA   RELIEF  40  0.416   0.479  0.440  0.433  0.426  0.421  0.420  0.418  0.416  0.415  0.413
3NN   t-test  20  0.187   0.251  0.203  0.195  0.192  0.189  0.187  0.187  0.186  0.185  0.185
3NN   t-test  40  0.153   0.208  0.168  0.162  0.158  0.156  0.154  0.153  0.152  0.152  0.151
3NN   RELIEF  20  0.268   0.355  0.307  0.299  0.287  0.284  0.280  0.278  0.277  0.276  0.275
3NN   RELIEF  40  0.222   0.283  0.248  0.239  0.233  0.231  0.229  0.228  0.226  0.226  0.224
CART  t-test  20  0.232   0.247  0.223  0.218  0.213  0.210  0.209  0.209  0.208  0.209  0.208
CART  t-test  40  0.213   0.219  0.198  0.194  0.189  0.189  0.187  0.185  0.185  0.185  0.184
CART  RELIEF  20  0.244   0.284  0.259  0.256  0.251  0.249  0.247  0.245  0.244  0.244  0.243
CART  RELIEF  40  0.222   0.250  0.233  0.229  0.226  0.225  0.224  0.223  0.223  0.223  0.221
NNET  t-test  20  0.297   0.300  0.271  0.266  0.260  0.259  0.256  0.256  0.254  0.254  0.253
NNET  t-test  40  0.277   0.274  0.254  0.248  0.244  0.244  0.240  0.241  0.239  0.239  0.239
NNET  RELIEF  20  0.345   0.382  0.337  0.324  0.318  0.314  0.313  0.313  0.312  0.309  0.307
NNET  RELIEF  40  0.329   0.348  0.312  0.303  0.295  0.294  0.290  0.290  0.290  0.288  0.289

This allows one to observe immediately that bagging is able to improve the performance of the unstable, overfitting CART and NNET classifiers; in most cases, only a small ensemble is required, and the improvement in performance is substantial. In contrast, bagging does not improve the performance of the stable, nonoverfitting DLDA, LDA, and 3NN classifiers, except via a large ensemble; and even so the improvement in magnitude is quite small and certainly does not justify the extra computational cost (note that in the case of the simplest classification rule, DLDA, there is no improvement at all). This is in agreement with what is known about the ensemble approach (e.g., see [5]). However, of greater interest here is the performance of the ensemble against a single instance of the stable, nonoverfitting classifiers. This can be better visualized in the plots of Figures 1, 2, and 3, which display the expected classification errors as a function of the number of component classifiers in the ensemble, for the case p = 5. The error of a single classifier is indicated by a horizontal dashed line. Marks indicate the values that are smaller for the ensemble classifier as compared to a single component classifier at a 99% significance level, according to the one-tailed paired t-test. One observes that as the ensemble size increases, the classification error decreases and tends to converge to a fixed value (in agreement with [5]), but we can also see that the error is usually larger at very small ensemble sizes, as compared to the error of the individual classifier.
We can again observe that, in most cases, bagging is able to improve the performance of CART and NNET, but not significantly so, or not at all, for DLDA, LDA, and 3NN. More importantly, we can see that the improvement in the performance of CART and NNET is not sufficient to beat the performance of single DLDA, LDA, or 3NN classifiers (with the exception of the prostate cancer data with RELIEF feature selection, which we comment on below). As we can see in Figures 1–3, the breast cancer gene-expression data produces linear features that favor single DLDA and LDA classifiers (the latter does not perform as well at n = 20, owing to the difficulty of estimating the entire covariance matrix at this sample size, an issue that affects DLDA less), while the lung cancer gene-expression data produces nonlinear features, in which case, according to the results, the best option overall is a single 3NN classifier, followed closely by a bagged NNET under t-test feature selection and a bagged CART under RELIEF feature selection. The case of the prostate cancer proteomic data is peculiar in that it presents the only instance where the best option was not a DLDA, LDA, or 3NN classifier but a single CART classifier, namely, the case n = 20 (with either p = 2 or p = 5) under RELIEF feature selection (the results for t-test feature selection, on the other hand, are very similar to those obtained for the lung cancer data set). Note that, in this case, the best performance is achieved by a single CART classifier rather than by the ensemble CART scheme. We also point out that the classification errors obtained with t-test feature selection are smaller than those obtained with RELIEF feature selection, indicating that RELIEF is not a good option in this case due to the very small sample size (in fact, there is evidence that t-test filter-based feature selection may be the method of choice in small-sample cases [24]).
[Figure 1 appears here: four panels, (a) n = 20, t-test, p = 5; (b) n = 40, t-test, p = 5; (c) n = 20, RELIEF, p = 5; (d) n = 40, RELIEF, p = 5, each plotting expected error against the number of classifiers for DLDA, LDA, 3NN, CART, and NNET.]
Figure 1: Expected classification error as a function of the number of classifiers in the ensemble for selected experiments with the breast cancer gene expression data (full results available on the companion website). The error of a single classifier is indicated by a horizontal dashed line. Marks indicate the values that are smaller for the ensemble classifier as compared to a single classifier at a 99% significance level, according to the one-tailed paired t-test.
In the case n = 40, the difference between 3NN and CART essentially disappears. It is also interesting that, in the case n = 20 and p = 5 with RELIEF feature selection, bagging is able to improve the performance of LDA by a good margin for the prostate cancer data. This is due to the fact that the combination of LDA and RELIEF feature selection produces an unstable, overfitting classification rule in this acute small-sample scenario.
[Figure 2 appears here: four panels, (a) n = 20, t-test, p = 5; (b) n = 40, t-test, p = 5; (c) n = 20, RELIEF, p = 5; (d) n = 40, RELIEF, p = 5, each plotting expected error against the number of classifiers for DLDA, LDA, 3NN, CART, and NNET.]
Figure 2: Expected classification error as a function of the number of classifiers in the ensemble for selected experiments with the lung cancer gene expression data (full results available on the companion website). The error of a single classifier is indicated by a horizontal dashed line. Marks indicate the values that are smaller for the ensemble classifier as compared to a single classifier at a 99% significance level, according to the one-tailed paired t-test.
The results obtained with t-test feature selection are consistent across all data sets. When using RELIEF feature selection, there is a degree of contrast between the results for the prostate cancer protein-abundance data set and those for the gene-expression data sets, which may be attributed to differences in technology, as well as to the fact that we do not employ baseline subtraction for the proteomics data in order to achieve better classification rates.
[Figure 3 appears here: four panels, (a) n = 20, t-test, p = 5; (b) n = 40, t-test, p = 5; (c) n = 20, RELIEF, p = 5; (d) n = 40, RELIEF, p = 5, each plotting expected error against the number of classifiers for DLDA, LDA, 3NN, CART, and NNET.]
Figure 3: Expected classification error as a function of the number of classifiers in the ensemble for selected experiments with the prostate cancer protein abundance data (full results available on the companion website). The error of a single classifier is indicated by a horizontal dashed line. Marks indicate the values that are smaller for the ensemble classifier as compared to a single classifier at a 99% significance level, according to the one-tailed paired t-test.
We remark that the results are not expected to change much if ensemble sizes are increased further (beyond m = 51), as can be seen from the convergence of the expected classification error curves in Figures 1–3.
4. Conclusion
In this paper, we conducted a detailed empirical study of the ensemble approach to the classification of small-sample genomic and proteomic data. The main performance issue is not whether the ensemble scheme improves the classification error of an unstable, overfitting classifier (e.g., CART, NNET), or whether its classification error converges to a fixed limit, but rather whether the ensemble scheme improves the performance of the unstable, overfitting classifier sufficiently to beat the performance of single stable, nonoverfitting classifiers (e.g., DLDA, LDA, and 3NN). We observed that this was never the case for any of the data sets and experimental conditions considered here, except in the case of the proteomics data set with RELIEF feature selection in acute small-sample cases, where nevertheless the performance of a single unstable, overfitting classifier (in this case, CART) was better than or comparable to the corresponding ensemble classifier. We observed that in most cases bagging does a good (sometimes admirable) job of improving the performance of unstable, overfitting classifiers, but that improvement was not enough to beat the performance of single stable, nonoverfitting classifiers. The main message to be gleaned from this study by practitioners is that the use of bagging in the classification of small-sample genomic and proteomic data increases computational cost but is not likely to improve overall classification accuracy over other, simpler approaches. The solution we recommend is to use simple classification rules and avoid bagging in these scenarios. It is important to stress that we do not give a definitive recommendation on the use of the random forest method for small-sample genomic and proteomic data; however, we do think that this study provides a step in that direction, since the random forest method depends partly, if not significantly, on the effectiveness of bagging for its success. Further research is needed to investigate this question.
References
[1] R. E. Schapire, “The strength of weak learnability,” Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[2] Y. Freund, “Boosting a weak learning algorithm by majority,” in Proceedings of the 3rd Annual Workshop on Computational Learning Theory (COLT ’90), pp. 202–216, Rochester, NY, USA, August 1990.
[3] L. Xu, A. Krzyzak, and C. Y. Suen, “Methods of combining multiple classifiers and their applications to handwriting recognition,” IEEE Transactions on Systems, Man and Cybernetics, vol. 22, no. 3, pp. 418–435, 1992.
[4] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[5] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[6] L. Lam and C. Y. Suen, “Application of majority voting to pattern recognition: an analysis of its behavior and performance,” IEEE Transactions on Systems, Man and Cybernetics Part A, vol. 27, no. 5, pp. 553–568, 1997.
[7] B. Efron, “Bootstrap methods: another look at the jackknife,” Annals of Statistics, vol. 7, pp. 1–26, 1979.
[8] B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, vol. 38 of CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, Philadelphia, Pa, USA, 1982.
[9] S. Alvarez, R. Diaz-Uriarte, A. Osorio, et al., “A predictor based on the somatic genomic changes of the BRCA1/BRCA2 breast cancer tumors identifies the non-BRCA1/BRCA2 tumors with BRCA1 promoter hypermethylation,” Clinical Cancer Research, vol. 11, no. 3, pp. 1146–1153, 2005.
[10] E. C. Gunther, D. J. Stone, R. W. Gerwien, P. Bento, and M. P. Heyes, “Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 16, pp. 9608–9613, 2003.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, “Gene selection and classification of microarray data using random forest,” BMC Bioinformatics, vol. 7, article 3, pp. 1–13, 2006.
[12] A. Statnikov, L. Wang, and C. F. Aliferis, “A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification,” BMC Bioinformatics, vol. 9, article 319, pp. 1–10, 2008.
[13] G. Izmirlian, “Application of the random forest classification algorithm to a SELDI-TOF proteomics study in the setting of a cancer prevention trial,” Annals of the New York Academy of Sciences, vol. 1020, pp. 154–174, 2004.
[14] B. Wu, T. Abbott, D. Fishman, et al., “Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data,” Bioinformatics, vol. 19, no. 13, pp. 1636–1643, 2003.
[15] P. Geurts, M. Fillet, D. de Seny, et al., “Proteomic mass spectra classification using decision tree based ensemble methods,” Bioinformatics, vol. 21, no. 14, pp. 3138–3145, 2005.
[16] B. Zhang, T. D. Pham, and Y. Zhang, “Bagging support vector machine for classification of SELDI-TOF mass spectra of ovarian cancer serum samples,” in Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AI ’07), vol. 4830 of Lecture Notes in Computer Science, pp. 820–826, Gold Coast, Australia, December 2007.
[17] A. Assareh, M. Moradi, and V. Esmaeili, “A novel ensemble strategy for classification of prostate cancer protein mass spectra,” in Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS ’07), pp. 5987–5990, Lyon, France, August 2007.
[18] W. Tong, Q. Xie, H. Hong, et al., “Using decision forest to classify prostate cancer samples on the basis of SELDI-TOF MS data: assessing chance correlation and prediction confidence,” Environmental Health Perspectives, vol. 112, no. 16, pp. 1622–1627, 2004.
[19] T. G. Dietterich, “An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization,” Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[20] S. Dudoit, J. Fridlyand, and T. P. Speed, “Comparison of discrimination methods for the classification of tumors using gene expression data,” Journal of the American Statistical Association, vol. 97, no. 457, pp. 77–87, 2002.
[21] L. Hansen and P. Salamon, “Neural network ensembles,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993–1001, 1990.
[22] E. Bauer and R. Kohavi, “An empirical comparison of voting classification algorithms: bagging, boosting, and variants,” Machine Learning, vol. 36, no. 1, pp. 105–139, 1999.
[23] S. Y. Sohn and H. W. Shin, “Experimental study for the comparison of classifier combination methods,” Pattern Recognition, vol. 40, no. 1, pp. 33–40, 2007.
[24] J. Hua, W. D. Tembe, and E. R. Dougherty, “Performance of feature-selection methods in the classification of high-dimension data,” Pattern Recognition, vol. 42, no. 3, pp. 409–424, 2009.
[25] B. Efron, “Estimating the error rate of a prediction rule: improvement on cross-validation,” Journal of the American Statistical Association, vol. 78, no. 382, pp. 316–331, 1983.
[26] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York, NY, USA, 1996.
[27] U. M. Braga-Neto and E. Dougherty, “Classification,” in Genomic Signal Processing and Statistics, E. Dougherty, I. Shmulevich, J. Chen, and Z. J. Wang, Eds., EURASIP Book Series on Signal Processing and Communication, pp. 93–128, Hindawi, New York, NY, USA, 2005.
[28] M. Chernick, Bootstrap Methods: A Practitioner’s Guide, John Wiley & Sons, New York, NY, USA, 1999.
[29] E. Lehmann and J. Romano, Testing Statistical Hypotheses, Springer, New York, NY, USA, 2005.
[30] K. Kira and L. A. Rendell, “The feature selection problem: traditional methods and a new algorithm,” in Proceedings of the 10th National Conference on Artificial Intelligence (AAAI ’92), pp. 129–134, San Jose, Calif, USA, July 1992.
[31] M. J. van de Vijver, Y. D. He, L. J. van’t Veer, et al., “A gene-expression signature as a predictor of survival in breast cancer,” The New England Journal of Medicine, vol. 347, no. 25, pp. 1999–2009, 2002.
[32] L. J. van’t Veer, H. Dai, M. J. van de Vijver, et al., “Gene expression profiling predicts clinical outcome of breast cancer,” Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[33] A. Bhattacharjee, W. G. Richards, J. Staunton, et al., “Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 24, pp. 13790–13795, 2001.
[34] H. J. Issaq, T. D. Veenstra, T. P. Conrads, and D. Felschow, “The SELDI-TOF MS approach to proteomics: protein profiling and biomarker identification,” Biochemical and Biophysical Research Communications, vol. 292, no. 3, pp. 587–592, 2002.
[35] B.-L. Adam, Y. Qu, J. W. Davis, et al., “Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men,” Cancer Research, vol. 62, no. 13, pp. 3609–3614, 2002.
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2009, Article ID 360864, 13 pages doi:10.1155/2009/360864
Research Article
Intervention in Context-Sensitive Probabilistic Boolean Networks Revisited
Babak Faryabi,1 Golnaz Vahedi,1 Jean-Francois Chamberland,1 Aniruddha Datta,1 and Edward R. Dougherty1,2
1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
2 Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA
Correspondence should be addressed to Babak Faryabi,
[email protected] Received 25 August 2008; Revised 17 November 2008; Accepted 16 January 2009 Recommended by Javier Garcia-Frias An approximate representation for the state space of a context-sensitive probabilistic Boolean network has previously been proposed and utilized to devise therapeutic intervention strategies. Whereas the full state of a context-sensitive probabilistic Boolean network is specified by an ordered pair composed of a network context and a gene-activity profile, this approximate representation collapses the state space onto the gene-activity profiles alone. This reduction yields an approximate transition probability matrix, absent of context, for the Markov chain associated with the context-sensitive probabilistic Boolean network. As with many approximation methods, a price must be paid for using a reduced model representation, namely, some loss of optimality relative to using the full state space. This paper examines the effects on intervention performance caused by the reduction with respect to various values of the model parameters. This task is performed using a new derivation for the transition probability matrix of the context-sensitive probabilistic Boolean network. This expression of transition probability distributions is in concert with the original definition of context-sensitive probabilistic Boolean network. The performance of optimal and approximate therapeutic strategies is compared for both synthetic networks and a real case study. It is observed that the approximate representation describes the dynamics of the context-sensitive probabilistic Boolean network through the instantaneously random probabilistic Boolean network with similar parameters. Copyright © 2009 Babak Faryabi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
In biology, there are numerous examples where the (in)activation of one gene or protein can lead to a certain cellular functional state or phenotype. For instance, in a stable cancer cell line, the reproductive cell cycle is repeated, and cancerous cells proliferate with time in the absence of intervention. One can use the p53 gene if the intervention goal is to push the cells into apoptosis (programmed cell death) or to arrest the cell cycle. The p53 gene is the most well-known tumor suppressor gene, encoding a protein that regulates the expression of several genes, such as Bax and Fas/APO1, whose function is to promote apoptosis [1, 2]. In cultured cells, extensive experimental results indicate that when p53 is activated, for example, in response to radiation, it leads to cell growth inhibition or cell death [3]. The p53 gene is also used in gene therapy, where the target gene
(p53 in this case) is cloned into a viral vector. The modified virus serves as a vehicle to transport the p53 gene into tumor cells to generate intervention [4, 5]. As this and many other examples suggest, it is prudent to use gene regulatory models to design therapeutic interventions that expediently modify the cell’s dynamics via external signals. These system-based intervention methods can be useful in identifying potential drug targets and discovering treatments to disrupt or mitigate the aberrant gene functions contributing to the pathology of a disease. The main objective of intervention is to reduce the likelihood of encountering the undesirable gene-activity profiles associated with aberrant cellular functions. Probabilistic Boolean networks (PBNs), a class of discrete-time, discrete-space Markovian gene regulatory networks, have been used to derive such therapeutic strategies [6]. These models, which allow the incorporation of uncertainty into
the inter-gene relationships, are probabilistic generalizations of the standard Boolean networks introduced by Kauffman [7–9]. In a PBN model, gene values are quantized into some finite range. The values are updated synchronously at each time step according to regulatory functions. Stochastic properties are introduced into the model by allowing several possible regulatory functions for each gene and by allowing random modification of the activity factors. If the regulatory functions are allowed to change at every time point, then the PBN is said to be instantaneously random [10]. On the other hand, in a context-sensitive PBN, function updating only occurs at time points selected by a binary switching random process [11, 12]. This framework incorporates the effect of latent variables outside the model, whose behaviors influence regulation within the system. In essence, a PBN is composed of a collection of networks; between switches, it acts like one of the constituent networks, each being referred to as a context. The switching frequency of the context differentiates the instantaneously random PBN from the context-sensitive PBN. The context switching that occurs at every time step in an instantaneously random PBN corresponds to changing the wiring diagram of the system at every instant. In contrast, context-sensitive PBNs can better represent the stability of biological systems by capturing the periods of sojourning in constituent networks [11]. Hence, this class of models is more suitable for the analysis of gene regulation and the design of intervention methods. To formulate the problem of intervention in a context-sensitive PBN, a transition probability matrix must be derived. This transition matrix acts on the possible states of the system. Once this is accomplished, the task of finding the most effective intervention strategy can be formulated as a classical sequential decision making problem. For a predefined cost of intervention and a cost-per-stage function that discriminates between the states of the system, the objective of the decision maker becomes minimizing the expected total cost associated with the progression of the system. That is, given the state of the system, an effective intervention strategy identifies which action to take so as to minimize the overall cost. Consequently, the devised intervention strategy can be used as a therapeutic strategy that alters the dynamics of aberrant cells to reduce the long-run likelihood of undesirable gene-activity profiles favorable to the disease. It is evident that the intervention strategy specified by the sequential decision maker is directly affected by the form of the transition probability matrix associated with a context-sensitive PBN. For an instantaneously random PBN, the state consists of a gene-activity profile, while for a context-sensitive PBN, the state includes a gene-activity profile and a context. The effectiveness of an intervention strategy depends, in part, on how accurately the model represents the reality of the underlying pathological cellular functions. It is therefore important to adopt a model that captures the subtleties of the biological system of interest. In the framework of context-sensitive PBNs, this entails defining a transition probability matrix that is an accurate representation of the system dynamics. Context-sensitive PBN models have been considered in
[13, 14]. In [13], an intervention strategy is devised for a limited window of observations. Although this method lowers the likelihood of undesirable gene-activity profiles within the control window, it may not alter the probability of visiting these gene expression profiles in the long run. To address this issue, [14] derives a stationary strategy that affects the long-run behavior of gene-activity profiles. Common to [13, 14] is the assumption that the active constituent network of the context-sensitive PBN is not observable at each instant. This means that decisions must be made without explicit knowledge of the context, and therefore without full knowledge of the system state, which is composed of a context and a gene-activity profile. As such, the authors of [13, 14] elect to proceed using a transition probability matrix in which the context is removed from the state space and system dynamics. This reduction is accomplished by computing a weighted sum of the gene-activity profile behaviors over all the possible constituent networks. At every step, the reduced system exhibits an expected behavior obtained by averaging over the various contexts. As such, the gene-activity profile determines the status of the approximate system, and the collapsed transition probability matrix specifies its evolution. The corresponding intervention strategy is based on the approximate transition probability matrix with the collapsed state space. Not only does the reduction eliminate the need to know the context at each time point, but it also reduces the dimensionality of the control problem. For instance, in a binary context-sensitive PBN with n genes and m contexts, the full state space consists of 2^n × m states, whereas the reduced system possesses 2^n states. Consequently, the computational complexity of each iteration of the intervention design algorithm is O(2^{2n} m^2) for the context-sensitive PBN, whereas it is O(2^{2n}) for the reduced system [15]. Although the reduction in [13, 14] has benefits, as with many approximation methods, a price must be paid, and here it arises from the intervention standpoint. Specifically, what is the cost in terms of the effectiveness of the resulting intervention strategy? This issue is not addressed in [13, 14]. Our aim here is to determine, under various network parametric assumptions, the loss of intervention performance resulting from removing the context from the state space of a context-sensitive PBN. This must be done by comparing the performance of the intervention strategies derived from the full state space and the reduced state space when both are individually applied to the full state space. In [13, 14], the strategy devised on the reduced space was never actually applied to the original system; it was only applied to the approximate model. The point is that the performance of the approximate strategy was tested on the reduced model, not on the original one, so the cost of the reduction was never assessed. This is accomplished below. The approximation simplifies the task of finding intervention strategies by describing the dynamics of a context-sensitive PBN via the instantaneously random PBN with similar parameters, and hence it should be expected to be accurate mostly when contexts switch frequently.
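The reduction just described amounts to averaging the per-context transition matrices with the selection probabilities as weights, as in the following sketch (the matrices and probabilities below are hypothetical placeholders, not quantities from the paper):

```python
# Sketch of the state-collapse approximation: the reduced transition
# matrix over gene-activity profiles is the selection-probability-
# weighted sum of the per-context transition matrices.
import numpy as np

def collapsed_tpm(P_list, r):
    # P_list[l][x, y] = P(next GAP = y | current GAP = x, context l);
    # r[l] = selection probability of context l (the r_l sum to one).
    return sum(r_l * P_l for r_l, P_l in zip(r, P_list))
```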
EURASIP Journal on Bioinformatics and Systems Biology In Section 2, we review the definition of a contextsensitive PBN. We briefly explain a method to design strategies for controlling the dynamical behavior of the model in the long run using classical Markov decision processes. A new derivation for the transition probability matrix of a contextsensitive PBN is presented in Section 2.3. In Section 2.4, we review the reduction method proposed in [13, 14] and derive the corresponding approximate transition probability matrix using the results of Section 2.3. We compare the performance of approximate and optimal intervention strategies through extensive numerical studies in Section 3. In this section, we also formulate a seven-gene context-sensitive PBN model for a melanoma case study [16]. The performance of the optimal and approximate intervention strategies for this network is compared under various model parameters.
2. Intervention in Context-Sensitive Probabilistic Boolean Networks
We begin this section with a review of context-sensitive PBNs and then formulate the problem of intervention as an infinite-horizon sequential decision making problem. We derive a new expression for the transition probability matrix that specifies the dynamics of the system based on its regulatory functions. This expression for the transition probability matrix is in concert with the original definition of context-sensitive PBNs [12]. As noted previously, the state space of the associated transition probability matrix is composed of all possible context and gene-activity profile pairs. We conclude this section by presenting an approximate transition probability matrix derived by performing a state collapse over the various contexts. This approximation method was used in [13, 14]. Mathematically, this is a Markov approximation to a hidden-Markov model. In the approximate transition probability matrix, the probability of moving from one gene-activity profile to another is the weighted sum of the transition probabilities between these two states under the various contexts. The coefficients of the weighted sum are the selection probabilities of the contexts.

2.1. Definition. A context-sensitive probabilistic Boolean network consists of a sequence V = {x_i}_{i=1}^n of n nodes, where x_i ∈ {0, ..., d − 1}, and a sequence {f_l}_{l=1}^k of vector-valued functions defining constituent networks. In the framework of gene regulation, each element x_i represents the expression value of a gene. It is common to abuse terminology by referring to x_i as the ith gene. Each vector-valued function f_l = (f_l^1, ..., f_l^n) represents a constituent network of the context-sensitive PBN. The function f_l^i : {0, ..., d − 1}^n → {0, ..., d − 1} is the predictor of gene i whenever context l is selected. The number of quantization levels for gene expression is denoted by d. At each updating epoch, a random variable determines whether the constituent network is switched or not. The switching probability q is a system parameter. If the context remains unchanged, then the context-sensitive PBN behaves like a fixed Boolean network, where the values of all the genes are updated synchronously according to the
3 current constituent network. On the other hand, if a switch occurs, then a constituent network is randomly selected from k {fl }l=1 according to the selection probability distribution {rl }kl=1 . Once the predictor function fl is determined, the values of the genes are updated using the new constituent network, that is, according to the rules defined by fl . Two quantization levels have thus far been used in practice. If d = 2 (binary), then the constituent networks are Boolean networks with 0 or 1 meaning OFF or ON, respectively [10]. The case where d = 3 (ternary) arises when we consider individual genes to be downregulated (0), upregulated (2), or invariant (1). This situation commonly occurs with cDNA microarrays, where a ratio is taken between the expression values on the test channel (red) and the base channel (green) [16]. In this paper, we will develop the methodology for d = 2, so that gene values are either 0 or 1. The methodology can be extended to other finite quantization levels, albeit, at the expense of tedious mathematical expressions. All the binary operations in this section would need to be replaced by case statements, and the perturbation process should be articulated on a case by case basis. We focus on context-sensitive PBNs with perturbations, meaning that each gene may change its value with small probability p at each epoch. If γi (t) is a Bernoulli random variable with parameter p and the random vector γ at instant t is defined as γ(t) = (γ1 (t), γ2 (t), . . . , γn (t)), then the value of gene i is determined at each epoch t by xi (t + 1) = 1(γ(t + 1) = / 0)(xi (t) ⊕ γi (t + 1)) + 1(γ(t + 1) = 0) fli (x1 (t), . . . , xn (t)),
(1)
where operator ⊕ is componentwise addition in modulo two and fli is the predictor of gene i according to the current context of the network l. Such a perturbation captures the realistic situation where the activity of a gene is subject to random alteration. The gene-activity profile (or GAP) is an n-digit binary vector x(t) = (x1 (t), . . . , xn (t)) giving the expression values of the genes at time t, where xi (t) ∈ {0, 1}. We denote the set of all possible GAPs by X. The dynamic behavior of a contextsensitive PBN can be modeled by a Markov chain whose states are ordered pairs consisting of a constituent network κ and a GAP x. The evolution of the context-sensitive PBN can therefore be represented using a stationary discrete-time equation z(t + 1) = f (z(t), w(t)),
t = 0, 1, . . . ,
(2)
where state z(t) is an element of the state space Z = {(κ, x) : κ ∈ {1, . . . , k}, x ∈ X}. The disturbance w(t) is the manifestation of uncertainties in the biological system, due either to context switching or a change in the GAP resulting from a random gene perturbation. It is assumed that both the gene perturbation distribution and the network switching distribution are independent and identically distributed over time. The switching frequency of the context differentiates the instantaneously random PBN from the context-sensitive PBN. If the contexts change at every instant, that is, q = 1, then the PBN is instantaneously random.
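For illustration, a single updating epoch implied by (1) and the switching mechanism can be sketched as follows. This is a minimal sketch under our own naming choices (the container nets and all identifiers are ours, not code from the paper), assuming the binary case $d = 2$:

```python
import numpy as np

def step(x, kappa, nets, q, r, p, rng):
    """One epoch of a binary context-sensitive PBN with perturbation, per (1).

    x     : current GAP, a length-n 0/1 integer array.
    kappa : index of the current constituent network (context).
    nets  : list of k callables; nets[l](x) returns the next GAP under context l.
    q, r, p : switching probability, selection distribution, perturbation probability.
    """
    if rng.random() < q:                    # switching variable is 1:
        kappa = rng.choice(len(nets), p=r)  # reselect a context (possibly the same one,
                                            # per the interpretation adopted here)
    gamma = rng.random(x.size) < p          # per-gene perturbation indicators
    if gamma.any():                         # gamma != 0: flip the perturbed genes
        x = x ^ gamma.astype(int)
    else:                                   # gamma == 0: apply the current predictors
        x = nets[kappa](x)
    return x, kappa
```

Note that the context reselection step deliberately allows the same network to be drawn twice in a row, in keeping with the interpretation of switching adopted later in Section 2.3.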
We note that a bijection can be drawn between the gene-activity profile $\mathbf{x}(t)$ (or the state $\mathbf{z}(t)$) and its decimal representation $x(t)$ (or $z(t)$) based on the binary expansion. The integers $x(t)$ and $z(t)$ take values in $X = \{0, 1, \ldots, d^n - 1\}$ and $Z = \{0, 1, \ldots, k \cdot d^n - 1\}$, respectively. These decimal representations facilitate the depiction of our numerical results in Section 3.
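For illustration, such a bijection might be implemented as follows; the function names are ours, and the binary case $d = 2$ is assumed:

```python
def encode(kappa, x_bits):
    """Map a state z = (kappa, x) to its decimal index in {0, ..., k * 2**n - 1}."""
    x = 0
    for b in x_bits:             # binary expansion of the GAP, most significant gene first
        x = (x << 1) | b
    return kappa * (1 << len(x_bits)) + x

def decode(z, n):
    """Inverse map: recover the context index and the decimal GAP from z."""
    return divmod(z, 1 << n)     # returns (kappa, x)
```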
2.2. Infinite-Horizon Intervention. We can formulate the task of finding the most effective intervention strategy as a sequential decision-making problem when the dynamics of a context-sensitive PBN are expressed according to (2). To this end, we can specify the Markov chain that describes the dynamics of the context-sensitive PBN by defining its transition probability matrix and initial state distribution. In the presence of an external regulator, we suppose that the context-sensitive PBN has a binary intervention input $u_g(t)$ on the control gene $g$. The intervention input $u_g(t)$, which takes values in the set $C = \{0, 1\}$, specifies the action on the control gene $g$, which can be selected from all the genes in the network. If treatment is applied, $u_g(t) = 1$, then the state of the control gene $g$ is toggled; otherwise it remains unchanged. For the case of a single control gene $g$, the system evolution is represented by the stationary discrete-time equation

$$z(t+1) = f\big(z(t), u_g(t), w(t)\big), \quad t = 0, 1, \ldots, \tag{3}$$

where the state $z(t)$ is an element of $Z$ and, as in the context-sensitive PBN without control, $w(t)$ is the manifestation of uncertainties in the model. The transition probability matrix of the controlled context-sensitive PBN can be defined easily once the transition probability matrix of the uncontrolled system is known; we derive an expression for this matrix in Section 2.3. Originating from a state $z_1$, the successor state $z_2$ is selected randomly within the set $Z$ according to the transition probability

$$P(z_1, z_2, u) = \Pr\big(z(t+1) = z_2 \mid z(t) = z_1,\ u_g(t) = u\big) \tag{4}$$

for all $z_1, z_2 \in Z$ and all $u \in C$. To define the problem of intervention in a context-sensitive PBN, we associate a cost-per-stage $c(z_1, z_2, u)$ with each possible event. In general, the cost-per-stage can depend on the origin state $z_1$, the successor state $z_2$, and the control input $u$. We define the average immediate cost in state $z_1$, when control $u$ is selected, by

$$c(z_1, u) = \sum_{z_2 \in Z} P(z_1, z_2, u)\, c(z_1, z_2, u). \tag{5}$$

We consider a discounted formulation of the expected total cost. The discounting factor $\lambda \in (0, 1)$ ensures convergence of the expected total cost over the long run [17]. In the case of cancer therapy, the discounting factor also emphasizes that obtaining treatment at an earlier stage is favored over later stages.

For initial state $i$ and strategy $\pi_g = \{\mu_g(\cdot, 0), \mu_g(\cdot, 1), \ldots\}$, where $\mu_g(\cdot, t) : Z \to C$ denotes the decision rule at epoch $t$, the infinite-horizon expected total discounted cost is defined by

$$J_{\pi_g}(i) = \lim_{N \to \infty} E\left[\sum_{t=0}^{N-1} \lambda^t c\big(z(t), z(t+1), \mu_g(z(t), t)\big)\ \Big|\ z(0) = i\right]. \tag{6}$$

The sequential decision maker must identify an optimal strategy $\pi_g^* = \{\mu_g^*(\cdot, 0), \mu_g^*(\cdot, 1), \ldots\}$ such that $J_{\pi_g}(i)$ is minimized for each state $i \in Z$. Mathematically, an optimal strategy $\pi_g^*$ is a solution of the optimization problem

$$\pi_g^*(i) = \arg\min_{\pi_g} J_{\pi_g}(i), \quad \forall i \in Z. \tag{7}$$

For the specifics of our formulation, an optimal strategy always exists [17]. It is given by the fixed-point solution of the Bellman optimality equation

$$J^*(z_1) = \min_{u \in C}\left[c(z_1, u) + \lambda \sum_{z_2 \in Z} P(z_1, z_2, u)\, J^*(z_2)\right]. \tag{8}$$
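To make the dynamic-programming step concrete, the following is a minimal value-iteration sketch for (8). It is an illustration rather than the implementation used in the study; the arrays P and c and the discount lam are placeholders the caller must supply:

```python
import numpy as np

def value_iteration(P, c, lam=0.9, tol=1e-9):
    """Solve the Bellman optimality equation (8) by value iteration.

    P   : array of shape (|C|, |Z|, |Z|); P[u, z1, z2] = P(z1, z2, u).
    c   : array of shape (|Z|, |C|); c[z1, u] is the average immediate cost (5).
    lam : discount factor in (0, 1).
    Returns the optimal cost J* and a stationary strategy mu* : Z -> C.
    """
    n_actions, n_states, _ = P.shape
    J = np.zeros(n_states)
    while True:
        # Q[u, z1] = c(z1, u) + lam * sum_z2 P(z1, z2, u) * J(z2)
        Q = c.T + lam * np.einsum('uij,j->ui', P, J)
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            break
        J = J_new
    mu = Q.argmin(axis=0)          # stationary optimal strategy
    return J, mu
```

Because $\lambda < 1$, the Bellman operator is a contraction, so the iteration converges to the unique fixed point $J^*$ regardless of initialization.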
Moreover, this optimal strategy is stationary and independent of the initial state $i$ [17]. Standard dynamic programming algorithms can be used to find a fixed point of the Bellman optimality equation. In our model, gene perturbation ensures that all the states communicate with one another. Hence, the Markov decision process associated with any stationary policy is ergodic and has a unique invariant distribution equal to its limiting distribution [18].

2.3. Transition Probability Matrix of a Context-Sensitive Probabilistic Boolean Network. For a given cost of intervention and cost-per-stage, the solution to (8) depends on the transition probability matrix in (4). The latter can be found by observing that two mutually exclusive events may occur at any epoch: the current context of the network remains the same for two consecutive instants, or the context changes to a new one at time instant $t+1$. Moreover, the context may remain unchanged in two mutually exclusive ways: the binary switching variable is 0, which means that no change is possible, or the binary switching variable is 1 and the current network is picked again through random selection [11, 12]. In particular, when the switching variable is 1, a new context is selected independently of the current system state; thus, the same network can be active twice in a row. This interpretation of switching the context in a PBN is in concert with the original definition of context-sensitive PBNs in [12]. Before proceeding, we note that transitioning was defined differently in [13, 14], where it was assumed that, whenever the switching variable is 1, a change of context must occur, so that context selection is conditioned on the current context. While this contrast produces a difference in the transition probabilities, it does not change the underlying issue in this paper, namely, analyzing the effects of the state-space reduction proposed in [13, 14] on the performance of therapeutic interventions.
Letting $z_1 = (\kappa_1, x_1)$ and $z_2 = (\kappa_2, x_2)$ be two states in $Z$, we derive the transition probability

$$P(z_1, z_2) = \Pr\big(z(t+1) = (\kappa_2, x_2) \mid z(t) = (\kappa_1, x_1)\big) \tag{9}$$

from $z_1$ to $z_2$ in the absence of intervention. Note that we can rewrite expression (9) as

$$P(z_1, z_2) = \Pr\big(x(t+1) = x_2,\ \kappa(t+1) = \kappa_2 \mid x(t) = x_1,\ \kappa(t) = \kappa_1\big). \tag{10}$$

Using Bayes' theorem, we get

$$\begin{aligned}
P(z_1, z_2) ={}& \Pr\big(x(t+1) = x_2 \mid \kappa(t+1) = \kappa_2,\ x(t) = x_1,\ \kappa(t) = \kappa_1\big)\\
&\times \Pr\big(\kappa(t+1) = \kappa_2 \mid x(t) = x_1,\ \kappa(t) = \kappa_1\big)\\
={}& \Pr\big(x(t+1) = x_2 \mid \kappa(t+1) = \kappa_2,\ x(t) = x_1,\ \kappa(t) = \kappa_1\big)\\
&\times \Pr\big(\kappa(t+1) = \kappa_2 \mid \kappa(t) = \kappa_1\big)\\
={}& \mathbf{1}(\kappa_2 = \kappa_1)\,\Pr\big(\kappa(t+1) = \kappa_1 \mid \kappa(t) = \kappa_1\big)\\
&\times \Pr\big(x(t+1) = x_2 \mid \kappa(t+1) = \kappa(t) = \kappa_1,\ x(t) = x_1\big)\\
&+ \mathbf{1}(\kappa_2 \neq \kappa_1)\,\Pr\big(\kappa(t+1) = \kappa_2 \mid \kappa(t) = \kappa_1\big)\\
&\times \Pr\big(x(t+1) = x_2 \mid \kappa(t+1) = \kappa_2,\ x(t) = x_1,\ \kappa(t) = \kappa_1\big),
\end{aligned}\tag{11}$$

where $\mathbf{1}(\cdot)$ is the indicator function. Furthermore, we have

$$\Pr\big(\kappa(t+1) = \kappa_1 \mid \kappa(t) = \kappa_1\big) = (1 - q) + q r_{\kappa_1}, \tag{12}$$

and, when $\kappa_1 \neq \kappa_2$, we get

$$\Pr\big(\kappa(t+1) = \kappa_2 \mid \kappa(t) = \kappa_1\big) = q r_{\kappa_2}. \tag{13}$$

Here, $q$ is the probability of switching context, and $r_{\kappa_i}$ is the probability of selecting context $\kappa_i$ during a switch. A transition from GAP $x_1$ to GAP $x_2$ may occur either according to the constituent network at instant $t+1$ or through an appropriate number of random perturbations, but not both. Let us define $F_l$ by

$$F_l(x_1, x_2) = \mathbf{1}\big(f_l(x_1) = x_2\big). \tag{14}$$

Then, we have

$$\Pr\big(x(t+1) = x_2 \mid \kappa(t+1) = \kappa(t) = \kappa_1,\ x(t) = x_1\big) = (1-p)^n F_{\kappa_1}(x_1, x_2) + (1-p)^{\,n - D(x_1, x_2)}\, p^{D(x_1, x_2)}\, \mathbf{1}\big(D(x_1, x_2) \neq 0\big), \tag{15}$$

and, for $\kappa_1 \neq \kappa_2$, we obtain

$$\Pr\big(x(t+1) = x_2 \mid \kappa(t+1) = \kappa_2,\ x(t) = x_1,\ \kappa(t) = \kappa_1\big) = (1-p)^n F_{\kappa_2}(x_1, x_2) + (1-p)^{\,n - D(x_1, x_2)}\, p^{D(x_1, x_2)}\, \mathbf{1}\big(D(x_1, x_2) \neq 0\big), \tag{16}$$

where $D(x_1, x_2)$ is the Hamming distance between the two gene-activity profiles $x_1$ and $x_2$. The first parts of (15) and (16) correspond to the probability of a transition from GAP $x_1$ to GAP $x_2$ according to the predictor functions defined by the constituent network at time instant $t+1$; the remaining terms account for transitions between GAPs that are due to random gene perturbation. By replacing the terms of expression (11) with their equivalents from (12), (13), (15), and (16), it can be shown that the probability of transition from any state $z_1 = (\kappa_1, x_1)$ to $z_2 = (\kappa_2, x_2)$ is given by

$$P(z_1, z_2) = \Big[(1-p)^n F_{\kappa_2}(x_1, x_2) + (1-p)^{\,n - D(x_1, x_2)}\, p^{D(x_1, x_2)}\, \mathbf{1}\big(D(x_1, x_2) \neq 0\big)\Big] \times \Big[\mathbf{1}(\kappa_2 = \kappa_1)\big((1-q) + q r_{\kappa_1}\big) + \mathbf{1}(\kappa_2 \neq \kappa_1)\, q r_{\kappa_2}\Big]. \tag{17}$$

The elements of the transition probability matrix of the controlled context-sensitive PBN, given by (4), can then be expressed through (17). The state after intervention, $\tilde{z}_1 = (\tilde{\kappa}_1, \tilde{x}_1)$, is determined by the status of the control signal and the state prior to the intervention, $z_1 = (\kappa_1, x_1)$. Here, $\tilde{\kappa}_1$ is equated to $\kappa_1$, and the value of the GAP is updated according to the value of the control signal in the devised strategy:

$$\tilde{x}_1 = (x_1 \oplus e_g)\,\mathbf{1}\big(u_g(z_1) = 1\big) + x_1\,\mathbf{1}\big(u_g(z_1) = 0\big). \tag{18}$$

All $n$ elements of the vector $e_g$ are zero except the element at the $g$th position, which is set to one.

2.4. Approximate Transition Probability Matrix of a Context-Sensitive Probabilistic Boolean Network. Following the reduction method proposed in [13, 14], we derive an expression for the approximate transition probability matrix in which the context is removed from the state space of the system. We base our derivation on the stochastic matrix defined by (17). The approximate stochastic model describes the dynamics of the system solely in terms of the GAPs, and its state space takes values from the set $X$. The probability of transition from GAP $x_1$ to GAP $x_2$ in two consecutive epochs is the weighted sum of the actual transition probabilities with respect to the selection probabilities of the contexts. If we denote the probability of transition between two GAPs by

$$P(x_1, x_2) = \Pr\big(x(t+1) = x_2 \mid x(t) = x_1\big), \tag{19}$$

then, under the reduction assumptions, we define

$$P(x_1, x_2) \triangleq \sum_{\kappa_1}\sum_{\kappa_2} r_{\kappa_1} \Pr\big(z(t+1) = (\kappa_2, x_2) \mid z(t) = (\kappa_1, x_1)\big). \tag{20}$$
Moreover, we can expand this expression as

$$\begin{aligned}
P(x_1, x_2) = \sum_{\kappa_1}\sum_{\kappa_2} \Big\{ & r_{\kappa_1}\mathbf{1}(\kappa_2 = \kappa_1)\big((1-q) + q r_{\kappa_1}\big)\\
&\times \Big[(1-p)^n F_{\kappa_1}(x_1, x_2) + (1-p)^{\,n-D(x_1,x_2)} p^{D(x_1,x_2)}\mathbf{1}\big(D(x_1,x_2) \neq 0\big)\Big]\\
&+ r_{\kappa_1}\mathbf{1}(\kappa_2 \neq \kappa_1)\, q r_{\kappa_2}\\
&\times \Big[(1-p)^n F_{\kappa_2}(x_1, x_2) + (1-p)^{\,n-D(x_1,x_2)} p^{D(x_1,x_2)}\mathbf{1}\big(D(x_1,x_2) \neq 0\big)\Big]\Big\},
\end{aligned}\tag{21}$$

which in turn can be presented as

$$P(x_1, x_2) = (1-p)^n \Lambda, \tag{22}$$

where

$$\begin{aligned}
\Lambda ={}& \sum_{\kappa_1} r_{\kappa_1}\big((1-q) + q r_{\kappa_1}\big)\left[F_{\kappa_1}(x_1, x_2) + \left(\frac{p}{1-p}\right)^{D(x_1,x_2)}\mathbf{1}\big(D(x_1,x_2) \neq 0\big)\right]\\
&+ \sum_{\kappa_1} r_{\kappa_1} q\left[\sum_{\kappa_2 \neq \kappa_1} r_{\kappa_2} F_{\kappa_2}(x_1, x_2) + \sum_{\kappa_2 \neq \kappa_1} r_{\kappa_2}\left(\frac{p}{1-p}\right)^{D(x_1,x_2)}\mathbf{1}\big(D(x_1,x_2) \neq 0\big)\right].
\end{aligned}\tag{23}$$

The above expression for $\Lambda$ can be further simplified as

$$\begin{aligned}
\Lambda ={}& \sum_{\kappa_1} F_{\kappa_1}(x_1, x_2)\big((1-q) r_{\kappa_1} + q r_{\kappa_1}^2\big) + q \sum_{\kappa_1}\sum_{\kappa_2 \neq \kappa_1} r_{\kappa_1} r_{\kappa_2} F_{\kappa_2}(x_1, x_2)\\
&+ \left(\frac{p}{1-p}\right)^{D(x_1,x_2)}\mathbf{1}\big(D(x_1,x_2) \neq 0\big)\left[1 - q + q\sum_{\kappa_1} r_{\kappa_1}^2 + q\sum_{\kappa_1} r_{\kappa_1} - q\sum_{\kappa_1} r_{\kappa_1}^2\right].
\end{aligned}\tag{24}$$

Thus, we have

$$\Lambda = \left(\frac{p}{1-p}\right)^{D(x_1,x_2)}\mathbf{1}\big(D(x_1,x_2) \neq 0\big) + (1-q)\sum_{\kappa_1} r_{\kappa_1} F_{\kappa_1}(x_1, x_2) + q\sum_{\kappa_2} r_{\kappa_2} F_{\kappa_2}(x_1, x_2)\sum_{\kappa_1} r_{\kappa_1}. \tag{25}$$

The last expression for $\Lambda$ can be further reduced to

$$\Lambda = \left(\frac{p}{1-p}\right)^{D(x_1,x_2)}\mathbf{1}\big(D(x_1,x_2) \neq 0\big) + \sum_{\kappa_2} r_{\kappa_2} F_{\kappa_2}(x_1, x_2) \tag{26}$$

by setting $\sum_{\kappa_1} r_{\kappa_1} = 1$.

Equations (22) and (26) express the approximate transition probability matrix associated with the reduced model. Although we have started from a different expression for the transition probability matrix of a context-sensitive PBN, owing to a different interpretation of switching contexts, our final expression for the approximate transition probability matrix is similar to the one in [13, 14]. Averaging over the various contexts in (20) reduces the transition probability distributions associated with a context-sensitive PBN to the transition probability distributions arising from the corresponding instantaneously random PBN, a fact that is overlooked in [13, 14]. The transition probability matrix of the corresponding instantaneously random PBN with the same parameters can be obtained from expression (17) by allowing the context to switch at each epoch, that is, by setting $q = 1$. Hence, the optimal and approximate intervention strategies perform similarly whenever the switching probability approaches one. This observation is supported by our numerical studies.
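This observation can also be checked numerically. The sketch below is illustrative rather than the study's implementation: the function names are ours, and F is assumed to be a precomputed 0/1 array with F[l, x1, x2] = 1(f_l(x1) = x2) over decimal GAP encodings.

```python
import numpy as np

def exact_tpm(F, q, r, p):
    """Exact transition matrix (17) over states z = (kappa, x)."""
    k, N, _ = F.shape
    n = int(np.log2(N))
    # Hamming distances between all pairs of decimal GAP encodings
    D = np.array([[bin(x1 ^ x2).count('1') for x2 in range(N)] for x1 in range(N)])
    pert = (1 - p) ** (n - D) * p ** D * (D != 0)        # perturbation term of (15)-(16)
    P = np.zeros((k * N, k * N))
    for k1 in range(k):
        for k2 in range(k):
            ctx = (1 - q) + q * r[k1] if k2 == k1 else q * r[k2]   # (12)-(13)
            gap = (1 - p) ** n * F[k2] + pert                      # (15)-(16)
            P[k1 * N:(k1 + 1) * N, k2 * N:(k2 + 1) * N] = gap * ctx
    return P

def collapsed_tpm(F, p, r):
    """Reduced matrix of (22) and (26): a selection-weighted mixture over contexts."""
    k, N, _ = F.shape
    n = int(np.log2(N))
    D = np.array([[bin(x1 ^ x2).count('1') for x2 in range(N)] for x1 in range(N)])
    pert = (1 - p) ** (n - D) * p ** D * (D != 0)
    return (1 - p) ** n * np.einsum('l,lij->ij', r, F) + pert
```

Marginalizing the context out of exact_tpm(F, 1.0, r, p) reproduces collapsed_tpm(F, p, r), which is precisely the analytical observation made above.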
3. Numerical Results

In this section, we compare the performance of algorithms based on the exact and approximate expressions for the transition probability matrix associated with a context-sensitive PBN. We perform this comparison first through extensive simulations based on randomly generated context-sensitive PBNs. We then compare these methods for a network obtained from a melanoma gene-expression data set, similar to the one used in [13, 14].

3.1. Synthetic Networks. In our numerical studies, we postulate the following cost-per-stage:

$$c(z, u) = \begin{cases} 0 & \text{if } u = 0,\ z \in D,\\ 10 & \text{if } u = 0,\ z \in U,\\ c & \text{if } u = 1,\ z \in D,\\ 10 + c & \text{if } u = 1,\ z \in U, \end{cases}\tag{27}$$
where $c$ is the cost of control, and $U$ and $D$ are the sets of undesirable and desirable states, respectively. We set $c = 1$ so that applying an intervention is more plausible than visiting undesirable states. The target gene is chosen to be the most significant gene in the GAP. We assume that the upregulation of the target gene is undesirable. Consequently, the state space is partitioned into desirable states, $D = \{(\kappa, x) : \kappa \in \{1, \ldots, k\},\ x \in \{0, \ldots, 2^{n-1} - 1\}\}$, and undesirable states, $U = \{(\kappa, x) : \kappa \in \{1, \ldots, k\},\ x \in \{2^{n-1}, \ldots, 2^n - 1\}\}$, where $n$ is the number of genes. We use the natural decimal bijection of the GAP $x$ to facilitate the presentation of our results. We set the number of genes to five; the study of networks with larger numbers of genes would be computationally prohibitive due to the complexity of the corresponding dynamic program. The cost values have been chosen in accordance with an earlier study [14]. Since our objective is to downregulate the target gene, a higher cost is assigned to destination states having an upregulated target gene. Moreover, for a given status of the target gene, a higher cost is assigned when the control is applied than when it is not. In practice, the cost values have to mathematically capture the benefits and costs of intervention and the relative preference of states. They must be set with the help of physicians in accordance with their clinical judgment. Although this is not feasible within the realm of current medical practice, we believe that such an approach will become feasible as engineering approaches are integrated into translational medicine.

We generate synthetic context-sensitive PBNs in the following manner. Each context-sensitive PBN consists of two contexts. Each constituent network is randomly generated with bias equal to 0.5, the bias being the probability that a randomly generated Boolean function takes the value one. To complete the specification of a context-sensitive PBN, we need to specify the selection and switching probability distributions along with its constituent networks. We treat the selection and switching probabilities as parameters in our numerical study. In the first set of experiments, we assume that the constituent networks are selected with equal probabilities and vary the value of the switching probability. In the second set of simulations, the switching probability is fixed but the selection probability varies. We generate one thousand random synthetic context-sensitive PBNs for each scenario. Our objective is to utilize the statistics generated by these synthetic networks to evaluate the effects of removing the context from the state space of a context-sensitive PBN when designing an intervention strategy.

For each context-sensitive PBN, the exact and approximate transition probability matrices are computed according to (17) and (22), respectively. Thereafter, we solve the optimal intervention problems for the original model and its reduced approximation and find the corresponding optimal strategies. The strategy devised for the exact transition probability matrix, $\mu_g^* : Z \to C$, specifies the action that should be taken at each time step. The second policy is based on the reduced stochastic matrix and takes only the GAP as its input. Since the performance of the approximate strategy must be evaluated with respect to the dynamics specified by the original model, we need to extend the approximate strategy to elements of $Z$. This is achieved by simply disregarding the context element of the state $z(t)$ and determining the action based on its GAP element. We denote the resulting intervention strategy obtained through state collapse by $\tilde{\mu}_g : Z \to C$.

In the first set of experiments, we determine the effect of the switching probability $q$ on overall performance. To this end, we generate one thousand context-sensitive PBNs for each value of the switching probability. The selection probability is assumed to have a uniform distribution for all the generated context-sensitive PBNs. We set the perturbation probability to 0.01 for all simulations. For each context-sensitive PBN generated per the above method, we select a random control gene. Then, we use dynamic programming to derive an optimal intervention
strategy $\mu_g^*$ based on the exact transition probability matrix (17). Similarly, an optimal strategy based on the approximate transition probability matrix (22) is derived for the same control gene and is extended to the approximate strategy $\tilde{\mu}_g$. For a context-sensitive PBN, we estimate the average total discounted cost induced by the given optimal strategy $\mu_g^*$. To this end, we generate synthetic time-course data for one thousand time steps from the transition probability matrix of the context-sensitive PBN while intervening based on the optimal strategy $\mu_g^*$. We estimate the total cost by accumulating the discounted cost of each state given the action at that state. This procedure is repeated ten thousand times for random initial states, and the average of the induced total discounted costs is computed. Following a similar procedure, the approximate strategy $\tilde{\mu}_g$ is applied to the system, and the average total discounted cost is computed. Finally, we compute the average total discounted cost for time-course data when no intervention is applied. From here on, we omit the subscript $g$ from the notation of a strategy $\mu_g$ to simplify our notation; since the control gene is selected randomly, this does not affect the following discussions.

The effectiveness of an intervention strategy can be evaluated by computing the difference between its induced cost and the cost accumulated in the absence of intervention. For each set of constituent networks and a given switching probability, we compute the following quantities: $J_{\mu^*}$, $J_{\tilde{\mu}}$, and $J$. These are the average total discounted costs for a given context-sensitive PBN induced by applying the optimal strategy $\mu^*$, the approximate strategy $\tilde{\mu}$, and no intervention, respectively. The preceding procedure is repeated for one thousand random context-sensitive PBNs, thereby yielding one thousand values for each statistic. We compare the effects of these strategies by computing the averages, denoted by $E[J_{\mu^*}]$, $E[J_{\tilde{\mu}}]$, and $E[J]$.

We consider the percentage of reduction in the average total discounted cost as a performance metric. The normalized gain obtained by each intervention strategy is taken as the immediate consequence of the intervention formulation. This metric is defined as the difference between the average discounted cost before and after intervention, normalized by the cost before intervention. The normalized gain corresponding to the optimal strategy $\mu^*$ is

$$\Delta J_E = \frac{E[J] - E[J_{\mu^*}]}{E[J]}, \tag{28}$$

and the normalized gain corresponding to the strategy derived from the approximate method $\tilde{\mu}$ is

$$\Delta J_A = \frac{E[J] - E[J_{\tilde{\mu}}]}{E[J]}. \tag{29}$$
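The cost-estimation loop described above might be sketched as follows; P_ctrl, mapping each action to its controlled transition matrix, and the remaining names are our own illustrative choices, not code from the study:

```python
import numpy as np

def estimate_cost(P_ctrl, c, mu, lam, horizon=1000, runs=10_000, seed=0):
    """Monte-Carlo estimate of the average total discounted cost under strategy mu.

    P_ctrl : array (|C|, |Z|, |Z|); P_ctrl[u] is the transition matrix under action u.
    c      : array (|Z|, |C|) of immediate costs, per (27).
    mu     : array (|Z|,) giving the action prescribed in each state.
    """
    rng = np.random.default_rng(seed)
    n_states = c.shape[0]
    total = 0.0
    for _ in range(runs):
        z = rng.integers(n_states)       # random initial state
        cost, disc = 0.0, 1.0
        for _ in range(horizon):
            u = mu[z]
            cost += disc * c[z, u]       # the average immediate cost (5) has the
            disc *= lam                  # same expectation as c(z, z', u)
            z = rng.choice(n_states, p=P_ctrl[u, z])
        total += cost
    return total / runs
```

Three such estimates, under $\mu^*$, under the extended approximate strategy $\tilde{\mu}$, and with no intervention, yield $E[J_{\mu^*}]$, $E[J_{\tilde{\mu}}]$, and $E[J]$, from which the normalized gains (28) and (29) follow directly; truncating after a finite horizon incurs an error bounded by $\lambda^{\text{horizon}} c_{\max}/(1-\lambda)$.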
Figure 1 depicts the results of the first experiment, where $q$ is the parameter of interest. As $q$ increases to one, the difference between the normalized gains $\Delta J_A$ and $\Delta J_E$ decreases. The approximating method yields close-to-optimal performance when the switching probability is large, which is outside the range of typical values used for context-sensitive PBNs.

Figure 1: $\Delta J_A$ and $\Delta J_E$ computed for 1000 context-sensitive PBNs consisting of two contexts. The switching probability $q$ is the parameter. The selection probability has uniform distribution $r_1 = r_2 = 0.5$.

Figure 2: $\Delta P_E$ and $\Delta P_A$ computed for 1000 context-sensitive PBNs consisting of two contexts. The switching probability $q$ is the parameter. The selection probability has uniform distribution $r_1 = r_2 = 0.5$.
If one cannot obtain context knowledge, or if the number of contexts results in an unacceptable computational burden, the approximate method still provides a strategy at the realistic value $q = 0.01$, albeit with roughly a 30% reduction in performance.

As a byproduct of the intervention formulation, we also consider the effect of an intervention strategy $\mu$ on the amount of change in the steady-state probability of undesirable states before and after the intervention. For each set of constituent networks and for a given switching probability, we compute $\Delta P_{\mu^*}$ and $\Delta P_{\tilde{\mu}}$. These are the normalized reductions in the total probability of visiting undesirable states in the long run for a given context-sensitive PBN when the strategies $\mu^*$ and $\tilde{\mu}$ are applied to the original system, respectively. In other words, we define

$$\Delta P_{\mu^*} = \frac{\sum_{i \in U} \pi(i) - \sum_{i \in U} \pi_{\mu^*}(i)}{\sum_{i \in U} \pi(i)}, \qquad \Delta P_{\tilde{\mu}} = \frac{\sum_{i \in U} \pi(i) - \sum_{i \in U} \pi_{\tilde{\mu}}(i)}{\sum_{i \in U} \pi(i)}, \tag{30}$$

where $\pi_{\mu^*}(i)$ is the probability of being in state $i$ in the long run under the optimal strategy $\mu^*$; $\pi_{\tilde{\mu}}(i)$ is the probability of being in state $i$ in the long run under the approximate strategy $\tilde{\mu}$; and $\pi(i)$ is the probability of being in state $i$ in the long run when no control is applied. The preceding procedure is repeated for one thousand random context-sensitive PBNs, thereby yielding one thousand values for each statistic. We compare the effect of the strategies devised by the exact and approximate transition probability matrices via the empirical averages of each sample sequence, denoted by $\Delta P_E$ and $\Delta P_A$. Figure 2 shows $\Delta P_A$ and $\Delta P_E$ as functions of the switching probability; the trends are similar to those observed for the normalized gains.

In practice, treatment options such as chemotherapy have detrimental side effects, and a large number of interventions can cause collateral damage that reduces a patient's quality of life. To gauge these side effects, we define the quantity $\Gamma_\mu$ as the expected number of interventions when the strategy $\mu$ is applied in the long run. In particular, $\Gamma_{\mu^*}$ and $\Gamma_{\tilde{\mu}}$ are the expected numbers of executed interventions in the long run using the optimal strategy $\mu^*$ and the approximate strategy $\tilde{\mu}$, respectively. We define

$$\Gamma_{\mu^*} = \sum_{i \in Z} \pi_{\mu^*}(i)\,\mathbf{1}\big(\mu^*(i) = 1\big), \qquad \Gamma_{\tilde{\mu}} = \sum_{i \in Z} \pi_{\tilde{\mu}}(i)\,\mathbf{1}\big(\tilde{\mu}(i) = 1\big), \tag{31}$$

where $\pi_{\mu^*}(i)$ and $\pi_{\tilde{\mu}}(i)$ are defined as in (30). The preceding procedure is repeated for one thousand random context-sensitive PBNs, and we compare the expected numbers of executed interventions through the difference in empirical averages, denoted by $\Delta\Gamma = \Gamma_{\tilde{\mu}} - \Gamma_{\mu^*}$. Figure 3 shows the variation in $\Delta\Gamma$ as a function of the switching probability $q$. According to this figure, for small switching probabilities the approximate strategy $\tilde{\mu}$ is likely to cause more detrimental side effects.

We study the effect of the selection probability on the performance of the approximate strategy $\tilde{\mu}$ in a second set of experiments.
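Because perturbation makes every stationary policy ergodic, the long-run quantities in (30) and (31) can be computed from stationary distributions. A minimal sketch follows, under the assumption (ours) that applying action 0 everywhere corresponds to the uncontrolled chain:

```python
import numpy as np

def stationary(P, tol=1e-12):
    """Stationary distribution of an ergodic chain by power iteration."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    while True:
        nxt = pi @ P
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt

def long_run_metrics(P_ctrl, mu, undesirable):
    """Evaluate the reduction in (30) and the expected interventions in (31)."""
    # Closed-loop chain: row z follows the action that mu prescribes in state z.
    P_mu = P_ctrl[mu, np.arange(len(mu))]
    pi_mu = stationary(P_mu)
    pi_0 = stationary(P_ctrl[0])                 # action 0 everywhere: no control
    mass_0 = pi_0[undesirable].sum()
    delta_P = (mass_0 - pi_mu[undesirable].sum()) / mass_0   # as in (30)
    gamma = pi_mu[mu == 1].sum()                             # as in (31)
    return delta_P, gamma
```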
Figure 3: $\Delta\Gamma$ computed for 1000 context-sensitive PBNs consisting of two contexts. The switching probability $q$ is the parameter. The selection probability has uniform distribution $r_1 = r_2 = 0.5$.
We follow the same procedure as before, except that we set $q = 0.01$ and vary the probability of selecting each constituent network. We consider two constituent networks, so that the selection probabilities are a function of $r_1$, the probability of selecting the first context. From Figure 4, as $|r_1 - r_2|$ gets smaller, the difference between the performance of the strategies $\mu^*$ and $\tilde{\mu}$ diminishes. Figure 5 compares the steady-state measures $\Delta P_E$ and $\Delta P_A$ for the optimal and approximate strategies, respectively. The most interesting observation is that, whereas $\Delta J_E - \Delta J_A$ decreases as $|r_1 - r_2|$ decreases, $\Delta P_E - \Delta P_A$ increases. These different behaviors are not contradictory, since the intervention strategy is designed to minimize the total cost; the improvement in the steady-state behavior is a side effect of this goal. We observe that both $\Delta J_E$ and $\Delta P_E$ are stable across parameters, whereas the metrics $\Delta J_A$ and $\Delta P_A$ vary considerably: the context-removal approximation affects both $\Delta J_A$ and $\Delta P_A$, while the exact transition probability matrix is unaffected. We suspect that a mathematical analysis of this effect is complicated, since it involves the interaction between the optimization and the reduction. Finally, $\Delta\Gamma$ is plotted as a function of the selection probability in Figure 6. Here, we observe that $\Delta\Gamma$ increases as $|r_1 - r_2|$ decreases.

Figure 4: $\Delta J_A$ and $\Delta J_E$ computed for 1000 context-sensitive PBNs consisting of two contexts. The switching probability $q$ is 0.01. The selection probability of the first constituent network, $r_1$, is varied.

3.2. A Melanoma Case Study. In this section, we compare the performance of the optimal and approximate strategies in the context of a gene regulatory network developed from steady-state data. This data was collected in a profiling study of metastatic melanoma in which a high abundance of messenger RNA for the gene WNT5A was found to be
highly discriminating between cells with properties typically associated with high metastatic competence and those with low metastatic competence [19]. Seven genes were considered in [13, 14]: WNT5A, pirin, S100P, RET1, MART1, HADHB, and STC3. We apply the design procedure proposed in [20] to generate a context-sensitive PBN possessing four constituent networks. The method of [20] generates Boolean networks with given attractor structures, and the overall context-sensitive PBN is designed so that the data points, which are assumed to come from the steady-state distribution of the network, are attractors in the resulting network. The regulatory graphs of these constituent networks can be found in [14]. This approach is reasonable because our interest is in controlling the long-run behavior of the network.

The intervention objective for this seven-gene network is to downregulate WNT5A, since WNT5A ceasing to be downregulated is strongly predictive of the onset of metastasis. A number of other intervention studies based on the same data have aimed to downregulate WNT5A, and this model has been used since the discovery of the relation between WNT5A and metastasis. The binary nature of up- or downregulation suits our binary model. A state is desirable, that is, belongs to $D$, if WNT5A $= 0$, and undesirable, that is, belongs to $U$, if WNT5A $= 1$. As mentioned earlier, the application of intervention requires the designation of desirable and undesirable states, which depends upon the existence of relevant biological knowledge. The use of WNT5A is one such example, where the knowledge of practitioners is incorporated into a theoretical framework.
Figure 5: $\Delta P_E$ and $\Delta P_A$ computed for 1000 context-sensitive PBNs consisting of two contexts. The switching probability $q$ is 0.01. The selection probability of the first constituent network, $r_1$, is varied.

Figure 6: $\Delta\Gamma$ computed for 1000 context-sensitive PBNs consisting of two contexts. The switching probability $q$ is 0.01. The selection probability of the first constituent network, $r_1$, is varied.
Based on our objective, the cost of control is assumed to be one, and the states are assigned penalties according to the cost-per-stage (27). This is the same cost structure as in [14]. Since our objective is to downregulate WNT5A, a higher penalty is assigned for states having WNT5A upregulated.
Also, for a given WNT5A status, a higher penalty is assigned when the control signal is active than when it is not. The optimal and approximate intervention strategies are found for the melanoma-related context-sensitive PBN when different genes in the network (except WNT5A itself) are employed as control genes. Figure 7 depicts the normalized gains when the optimal and approximate strategies for each control gene are used to intervene in the context-sensitive PBN. To compute the normalized gains, we computed the costs of ten thousand trajectories of length two hundred thousand. As expected, the optimal strategy significantly outperforms the approximate strategy for all the control genes. Moreover, for the best control gene, S100P, the difference between the two strategies is the greatest. Figure 8 depicts the effects of the optimal and approximate strategies on the normalized reduction in the aggregated long-run probability of visiting undesirable states, $\Delta P_E$ and $\Delta P_A$, respectively. Here, the strategy based on S100P outperforms the strategies devised for the other control genes. Note that the performance differences are not significant for most of the control genes. In particular, one should not draw any conclusions from the fact that $\Delta P_E$ is slightly less than $\Delta P_A$ in a couple of cases; the intervention strategy is designed to minimize the total cost, and the improvement in the steady-state behavior is a side effect of our method. Lastly, Figure 9 shows the difference between the expected number of executed interventions for the optimal strategy and the one derived from the approximate representation of the system. Note that the approximate strategy based on the most effective control gene applies 35% more interventions compared to the optimal one, while its performance is still worse.

Figure 7: $\Delta J_A$ and $\Delta J_E$ computed for the WNT5A network for various control genes. The switching probability is $q = 0.01$, and the constituent networks are selected with equal probabilities.

Figure 8: $\Delta P_E$ and $\Delta P_A$ computed for the WNT5A network for various control genes. The switching probability is $q = 0.01$, and the constituent networks are selected with equal probabilities.
Figure 9: $\Delta\Gamma$ computed for the WNT5A network for various control genes. The switching probability is $q = 0.01$, and the constituent networks are selected with equal probabilities.

4. Conclusion

We have evaluated the effects on intervention performance of the reduction proposed in [13, 14], relative to various criteria and values of the parameters of a context-sensitive PBN. We have analytically demonstrated that the reduction method collapses the transition probability matrix of a context-sensitive PBN to that of the instantaneously random PBN with identical parameters, a fact that is overlooked in [13, 14]. This observation has also been demonstrated through extensive numerical studies. We have further studied the relative effectiveness of the devised approximate strategy using several performance criteria: (1) the average normalized gains achieved by the optimal and approximate strategies, as indicators of intervention effectiveness; (2) the normalized reduction in the aggregated probability of visiting undesirable states in the long run, as a byproduct of the intervention formulation; and (3) the expected number of executed interventions for each strategy. Performance metrics have been compared as functions of both the switching and selection probabilities. In addition, we have compared the optimal and approximate strategies in the framework of a much-studied melanoma-related context-sensitive PBN. The common trend throughout the experiments is that the difference between the performance of the optimal and approximate intervention strategies is small for large switching probabilities. The performance of strategies devised by the reduction method degrades for smaller switching probabilities, which include the range of typical values used for context-sensitive PBNs. It is certainly preferable to design interventions based on the context-sensitive PBN; nevertheless, the approximate model still yields therapeutic benefits in situations where it is impractical to utilize the exact model.
Appendix

We apply the design procedure proposed in [20] to generate a context-sensitive PBN with four constituent networks. The data used in this inference was collected in a profiling study of metastatic melanoma. To generate the context-sensitive PBN based on the inferred Boolean networks, we set both the switching and perturbation probabilities to 0.01. The selection probability distribution is assumed to be uniform, $\{r_l = 0.25\}_{l=1}^4$. The constituent networks $\{f_l\}_{l=1}^4$ are reported in Tables 1, 2, 3, and 4, respectively.

Each of Tables 1 to 4 has $2^{\mathrm{pred}} + n$ rows and $n$ columns, where $\mathrm{pred}$ denotes the maximum number of predictors for each of the $n$ genes in the network. We set $\mathrm{pred} = 3$ in this study. The top $2^{\mathrm{pred}}$ rows depict the predictor functions of the genes; we separate the top part of each table from its lower part with a horizontal line to increase readability. The lower $n$ rows of each table provide the predictors for the genes in the Boolean network. For example, genes 3, 5, and 7 are the predictors of gene 1 in constituent network 1, according to the 9th row of Table 1. Hence, $f_{11}(x_3, x_5, x_7)$, the predictor function of gene 1, can be specified by its 8 possible outcomes, enumerated in the first column of Table 1.
Table 1: Constituent network $f_1$.

Table 2: Constituent network $f_2$.

Table 3: Constituent network $f_3$.

Table 4: Constituent network $f_4$.
Whenever the number of predictors is less than $\mathrm{pred} = 3$, the outcomes of the predictor function can be enumerated with fewer than $2^{\mathrm{pred}}$ values. For instance, gene 3 in Table 1 has two predictors (refer to row $2^3 + 3$ of Table 1), so its predictor function $f_{13}(x_3, x_1)$ can be fully specified with 4 values. According to the upper part of the third column of Table 1, the value of gene 3 is set to 0 when the values of $x_3(t)$ and $x_1(t)$ are 0 and 0, respectively.
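The lookup just described amounts to indexing a truth-table column by the binary word formed by the predictor genes. A small hypothetical sketch, where the bit-ordering convention is our assumption:

```python
def predict(truth_column, predictors, x):
    """Evaluate a predictor function stored as a truth-table column.

    truth_column : the 2**len(predictors) output bits from the top rows of a table.
    predictors   : 1-based indices of the predictor genes, e.g. (3, 5, 7)
                   for gene 1 in constituent network 1.
    x            : current GAP as a 0/1 sequence, x[0] being gene 1.
    """
    row = 0
    for g in predictors:               # build the row index from the predictor values,
        row = (row << 1) | x[g - 1]    # most significant bit first (assumed ordering)
    return truth_column[row]
```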
Acknowledgments

This work was supported in part by the National Science Foundation (ECS-0355227, CCF-0514644, and ECS-0701531), the National Cancer Institute (R01 CA-104620 and CA-90301), and the Translational Genomics Research Institute.
References

[1] T. Miyashita and J. C. Reed, “Tumor suppressor p53 is a direct transcriptional activator of the human bax gene,” Cell, vol. 80, no. 2, pp. 293–299, 1995.
[2] L. B. Owen-Schaub, W. Zhang, J. C. Cusack, et al., “Wild-type human p53 and a temperature-sensitive mutant induce Fas/APO-1 expression,” Molecular and Cellular Biology, vol. 15, no. 6, pp. 3032–3040, 1995.
[3] W. S. El-Deiry, T. Tokino, V. E. Velculescu, et al., “WAF1, a potential mediator of p53 tumor suppression,” Cell, vol. 75, no. 4, pp. 817–825, 1993.
[4] S. G. Swisher, J. A. Roth, J. Nemunaitis, et al., “Adenovirus-mediated p53 gene transfer in advanced non-small-cell lung cancer,” Journal of the National Cancer Institute, vol. 91, no. 9, pp. 763–771, 1999.
[5] M. Bouvet, R. J. Bold, J. Lee, et al., “Adenovirus-mediated wild-type p53 tumor suppressor gene therapy induces apoptosis and suppresses growth of human pancreatic cancer,” Annals of Surgical Oncology, vol. 5, no. 8, pp. 681–688, 1998.
[6] I. Shmulevich and E. R. Dougherty, Genomic Signal Processing, Princeton University Press, Princeton, NJ, USA, 2007.
[7] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” Journal of Theoretical Biology, vol. 22, no. 3, pp. 437–467, 1969.
[8] S. A. Kauffman, The Origins of Order: Self-Organization and Selection in Evolution, Oxford University Press, New York, NY, USA, 1993.
[9] S. Kauffman and S. Levin, “Towards a general theory of adaptive walks on rugged landscapes,” Journal of Theoretical Biology, vol. 128, no. 1, pp. 11–45, 1987.
[10] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks,” Bioinformatics, vol. 18, no. 2, pp. 261–274, 2002.
[11] I. Shmulevich, E. R. Dougherty, and W. Zhang, “From Boolean to probabilistic Boolean networks as models of genetic regulatory networks,” Proceedings of the IEEE, vol. 90, no. 11, pp. 1778–1792, 2002.
[12] M. Brun, E. R. Dougherty, and I. Shmulevich, “Steady-state probabilities for attractors in probabilistic Boolean networks,” Signal Processing, vol. 85, no. 10, pp. 1993–2013, 2005.
[13] R. Pal, A. Datta, M. L. Bittner, and E. R. Dougherty, “Intervention in context-sensitive probabilistic Boolean networks,” Bioinformatics, vol. 21, no. 7, pp. 1211–1218, 2005.
[14] R. Pal, A. Datta, and E. R. Dougherty, “Optimal infinite-horizon control for probabilistic Boolean networks,” IEEE Transactions on Signal Processing, vol. 54, no. 6, part 2, pp. 2375–2387, 2006.
[15] B. Faryabi, A. Datta, and E. R. Dougherty, “On approximate stochastic control in genetic regulatory networks,” IET Systems Biology, vol. 1, no. 6, pp. 361–368, 2007.
[16] S. Kim, H. Li, E. R. Dougherty, et al., “Can Markov chain models mimic biological regulation?” Journal of Biological Systems, vol. 10, no. 4, pp. 337–357, 2002.
[17] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Mass, USA, 2007.
[18] J. R. Norris, Markov Chains, Cambridge University Press, Cambridge, UK, 1997.
[19] M. Bittner, P. Meltzer, Y. Chen, et al., “Molecular classification of cutaneous malignant melanoma by gene expression profiling,” Nature, vol. 406, no. 6795, pp. 536–540, 2000.
[20] R. Pal, I. Ivanov, A. Datta, M. L. Bittner, and E. R. Dougherty, “Generating Boolean networks with a prescribed attractor structure,” Bioinformatics, vol. 21, no. 21, pp. 4021–4025, 2005.