Robust Genetic Network Modeling by Adding Noisy Data E.P. van Someren
L.F.A. Wessels
M.J.T. Reinders
E. Backer
Information and Communication Theory Group, Control Laboratory Faculty of Information Technology and Systems, Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands
[email protected]
Keywords: Genetic Networks, Robust Modeling, Tikhonov Regularization, Ridge Regression
Abstract

The most fundamental problem in genetic network modeling is generally known as the dimensionality problem. Typical gene expression matrices contain measurements of thousands of genes taken over fewer than twenty time-steps. A large dynamic network cannot be learned from data with such a limited number of time-steps without the use of additional constraints, preferably derived from biological knowledge. In this paper, we present an approach that can find rough estimates of the underlying genetic network from limited time-course gene expression data by exploiting the fact that gene expression measurements are relatively noisy and genetic networks are thought to be robust. The method expands the data-set by adding noisy duplicates, thereby simultaneously tackling the dimensionality problem and making the solutions more robust against the (already large) noise in the data. This simple concept is similar to adding a Tikhonov regularization term in the optimization process. In the case of linear models, the addition of noisy duplicates is equivalent to ridge regression, i.e. the sum of the squared weights is minimized along with the prediction error. In the limiting case, it even becomes equivalent to applying the Moore-Penrose pseudo-inverse to the original data. The strength of the proposed concept of adding noisy duplicates lies in the fact that it can be applied to all modeling approaches, including non-linear models.
1
Introduction
Current micro-array technology has caused a significant increase in the number of genes whose expression can be measured simultaneously on a single array. However, the number of measurements that are taken in a time-course experiment has not increased in a similar fashion. As a result, typical gene expression data-sets consist of relatively few time-points (generally fewer than 20) with respect to the number of genes (thousands). This so-called dimensionality problem and the fact that measurements contain a substantial amount of measurement noise are two of the most fundamental problems in genetic network modeling [1]. (This work has also been submitted to ISMB'01.) Genetic network modeling is the field of research that tries to find the underlying network of gene-gene interactions from
the measured set of gene expressions. To date, several different modeling approaches have been suggested, such as Boolean networks [2], Bayesian networks [3, 4], (quasi-)linear networks [5, 6], neural networks [7, 8] and differential equations [9]. In these approaches, genetic interactions are represented by parameters in a parametric model which need to be inferred from the measured gene expressions over time. Generally, when learning parameters of genetic network models from ill-conditioned data (many genes, few time samples), the solutions become arbitrary. Comparative studies [1, 10] have recently reported that at least a number of the currently proposed models suffer from poor inferential power. This is partly caused by their limited use of imposed criteria that improve robustness, consistency and stability of the solutions. In this paper, we employ the concept of artificially expanding the original measured data-set with a set of noisy duplicates. We show that training models on this expanded set not only makes the models more robust against noise but also tackles the dimensionality problem. In [11] it has been shown that adding noise to the training set is equivalent to Tikhonov regularization, provided the noise amplitude is kept small. In addition, it is generally known that the class of Tikhonov regularizers is especially suited for ill-conditioned data [12]. The concept of adding noise can, however, be applied to non-linear models as well. First, we present the fundamental problems of genetic network modeling and review currently known methodologies to overcome these problems. Then, we introduce our principal motivation that leads to the basic idea of expanding the training set and show its potential success. After the introduction of the basic idea, the method of adding noise is explained and illustrated by applying it to a linear genetic network model.
As part of this example, we briefly cover the equivalence between adding noise and regularization, as well as its relation to ridge regression and the Moore-Penrose pseudo-inverse in the case of a linear model. A detailed experimental study shows how to set the parameters involved in the proposed method. Further, these studies show under which conditions the inferred networks are closer to the true ones. This improved performance is also shown with respect to other models currently proposed in the literature.
2
Current Approaches to Tackling the Dimensionality Problem
In order to capture the combinatorial nature of genetic relations [13], gene regulation should be modeled as a network of genetic interactions rather than by methods that are based solely on pair-wise comparisons. The inference of genetic network models is, however, especially hampered by the dimensionality problem. Fortunately, biological knowledge about the general properties of genetic networks can be employed to alleviate some of the data requirements. In particular, true genetic networks are assumed to be 1) sparsely connected, because genes are only influenced by a limited number of other genes [14], 2) robust against noise, because small changes in expression (which are inevitable in a biological system) may not lead to large differences in expression [15], 3) redundant, because genes are known to share functionality [16], 4) stable, because there is only a limited number of molecules in a cell, and 5) smooth, in the sense that the activity levels of genes are assumed to behave smoothly over time [6]. This knowledge can be used to alleviate the dimensionality problem in the modeling process either by a pre-processing step that modifies the data-set prior to the inference process or by directly controlling the inference process through regularization. With a pre-processing step, the dimensionality problem can be reduced either by artificially reducing the number of genes or by artificially increasing the number of time-points. Reduction of the number of genes can be achieved by thresholding and/or clustering the data [8, 1, 17]. Thresholding is based on the fact that small signals can be neglected because of their strong corruption by measurement noise. Clustering solves the ambiguity problem by grouping similar signals and is biologically inspired by the redundancy assumption. Unfortunately, the use of thresholding and clustering is limited to what is biologically plausible and can therefore only partly reduce the dimensionality problem.
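As a rough illustration of this pre-processing idea, the sketch below implements thresholding (drop genes whose signal never exceeds an assumed noise floor) and a deliberately naive greedy grouping by correlation as a stand-in for a real clustering algorithm. All sizes, thresholds and names (e.g. the 0.95 correlation cut-off) are ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))          # 100 genes, 10 time-points (hypothetical)

# Thresholding: drop genes whose signal never rises above the noise floor
noise_floor = 1.5                           # assumed noise amplitude
keep = np.max(np.abs(X), axis=1) >= noise_floor
X_kept = X[keep]

# Clustering (redundancy assumption): merge genes with nearly identical profiles.
# Greedy single-pass grouping by correlation, a stand-in for a real clustering step.
prototypes = []
for row in X_kept:
    r = [np.corrcoef(row, p)[0, 1] for p in prototypes]
    if not r or max(r) < 0.95:
        prototypes.append(row)
X_reduced = np.array(prototypes)

print(X.shape[0], X_kept.shape[0], X_reduced.shape[0])  # progressively fewer "genes"
```

Both steps shrink the number of rows of the gene expression matrix, which is exactly what eases the dimensionality problem, but only as far as the data biologically permits.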
By exploiting the smoothness of gene expression, the number of time-points can be increased by interpolation [6]. However, our experience shows that interpolation tends to reduce the dimensionality problem only marginally, regardless of the number of time-points added. The use of biological constraints to regulate the inference process directly has only been explored in a few earlier papers. In previous work [5], we used the limited connectivity of genetic networks to search directly for sparse linear networks. Weaver [7] also exploits sparsity, but only briefly describes an approach that iteratively sets small weights in the gene regulation matrix to zero. Thus far, none of the reported modeling approaches have incorporated methodologies that explicitly increase the robustness or stability of the obtained gene regulation matrices. In the next section, we motivate why an approach that imposes robustness might prove to be very successful.
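The interpolation-based expansion mentioned above can be sketched as follows. This is an illustrative reconstruction only: we use simple piecewise-linear interpolation with numpy rather than the smoother scheme of [6], and the matrix sizes and the function name `interpolate_gem` are ours:

```python
import numpy as np

def interpolate_gem(X, factor=4):
    """Expand a gene expression matrix (genes x time-points) by
    piecewise-linear interpolation along the time axis."""
    n_genes, n_time = X.shape
    t = np.arange(n_time)                                  # original sampling times
    t_fine = np.linspace(0, n_time - 1, factor * (n_time - 1) + 1)
    return np.vstack([np.interp(t_fine, t, row) for row in X])

X = np.random.rand(5, 8)       # 5 genes, 8 time-points (hypothetical)
X_fine = interpolate_gem(X)
print(X_fine.shape)            # (5, 29)
```

Note that the interpolated columns carry no new information beyond the smoothness assumption itself, which is consistent with the observation that interpolation reduces the dimensionality problem only marginally.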
3
Improving Performance by Imposing Robustness
The performance of genetic network models can be characterized by the following evaluation criteria [1, 10]: (1) inferential power; (2) predictive power; (3) robustness; (4) consistency; (5) stability; and (6) computational costs.

Figure 1: The relation between data-sets and gene regulation matrices in the case of four hypothetical models of differing robustness and consistency. Arrows indicate the inference as a mapping from the universe of data-sets to the universe of gene regulation matrices.

The principal goal of genetic network modeling is characterized by the inferential power. This criterion expresses the difference between the true genetic interactions and the relationships inferred by the model. The criteria are, however, related. The inferential power is increased if the robustness as well as the consistency increases. When a model is not robust, it is sensitive to noise in the data, i.e. two similar data-sets may result in two very different solutions (in this paper, the gene regulation matrices are the solutions of the inference process). A model is said to be inconsistent when multiple solutions can be inferred from the same data-set. Inconsistency is a direct result of the dimensionality problem, and occurs only in (over-dimensioned) models that lack parameter regularization. The two criteria are graphically depicted in Figure 1, where the two depicted data-sets represent the results of two hypothetical experiments carried out under the same conditions. Hence, their difference results mainly from measurement noise. In the over-determined case, we can, for both data-sets, infer the corresponding gene regulation matrices. In the under-determined case, we can only infer sets of gene regulation matrices. Now assume that in practice only a single experiment is carried out. Due to the substantial amount of measurement noise, both data-sets are equally likely to be the result of that single experiment. A high inferential power now implies that both solutions are close to the original. Although no original is indicated in the figure, both solutions need to be close to each other in order to achieve such a high inferential power. From Figure 1 it is clear that only a robust and consistent model guarantees that both solutions are close to each other. Although a robust and consistent model does not
guarantee a good inferential power (robustness and consistency do not constitute a sufficient condition), it is a necessary condition. We believe that imposing this necessary condition will improve the general performance of inferred (genetic) network models. In machine learning terms, this means that the bias will remain the same, but the variance is effectively controlled. The basic idea now is that robustness can be imposed by requiring that, when small distortions are introduced to a data-set (the distorted versions depicted in Figure 2), both the distorted data-sets and the original data-set result in the same inferred gene regulation matrix. The small distortions to the data-set can be viewed as adding noise to each of the measured activity levels in the data-set. When generating multiple distorted data-sets (and requiring the same output matrix for all of them), we also implicitly impose consistency, i.e. many data-sets are now available to do the inference. By taking the number of distorted copies large enough, the system becomes over-determined. Hence, by expanding the measured data-set with multiple noise-distorted copies, the necessary condition is imposed, leading to an improved inferential power.

Figure 2: Robustness is imposed by requiring distorted data-sets to infer the same solution (gene regulation matrix).

4

Adding Noise

A typical time-course gene expression data-set reflects how one state of gene expression levels is followed by a consecutive state of gene expression. Such a data-set is represented as a gene expression matrix (hereafter GEM), X, consisting of N rows of genes and T columns of time-points. A single element in this matrix, x_{i,t}, represents the activity level of gene i at time-point t. Alternatively, a GEM can also be represented as a training set, Omega, consisting of state transitions, (x_k, t_k), labeled by the index k:

    Omega = { (x_k, t_k) = (X_{.,k}, X_{.,k+1}) | k = 1, 2, ..., T-1 }    (1)

Here, x_k is the input state, being the k-th column of X, t_k is the target state, being the (k+1)-th column of X, and we use the notation X_{.,k} for the k-th column of X. A genetic network model is a representation of the genetic interactions such that for a given state of gene expression it can predict the consecutive state(s). In general, a genetic network model represents a parameterized non-linear mapping from an N-dimensional input vector into an N-dimensional output vector. One of the most simple continuous models is a linear genetic network model, where it is assumed that the gene expression level of each gene is the result of a weighted sum of all other gene expression levels at the previous time-point:

    x_{i,k+1} = sum_{j=1}^{N} w_{i,j} x_{j,k}    (2)

The interaction parameter, w_{i,j}, represents the existence (w_{i,j} != 0) or absence (w_{i,j} = 0) of a controlling action of gene j on gene i, whether it is activating (w_{i,j} > 0) or inhibiting (w_{i,j} < 0), as well as the strength (|w_{i,j}|) of the relation. The complete matrix of interactions, W, is called the gene regulation matrix (hereafter GRM). The weighted sum that makes up the linear model (and thus also the GRM) forms the basis of all continuous genetic network models that are currently proposed [10]. For a known GRM, the linear model will map each gene expression state x_k in the training set into a prediction of the consecutive state, t̂_k:

    t̂_k = W x_k,    k = 1, 2, ..., T-1    (3)
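For the linear model, the claimed equivalence between adding noisy duplicates and ridge regression can be checked numerically. The sketch below is illustrative (hypothetical sizes, variable names are ours): plain least squares on a training set expanded with C noise-distorted copies of the inputs approaches the closed-form ridge solution W = B Aᵀ(A Aᵀ + λI)⁻¹ with λ = (T−1)σ², where A and B collect the input and target states as columns; and as σ → 0 this ridge solution tends to the Moore-Penrose pseudo-inverse solution B A⁺ mentioned in the abstract:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_time = 8, 5                 # hypothetical: more genes than time-points
K = n_time - 1                         # number of state transitions
sigma = 0.5                            # noise amplitude of the duplicates (assumed)
C = 20000                              # number of noisy duplicates

X = rng.standard_normal((n_genes, n_time))
A, B = X[:, :-1], X[:, 1:]             # inputs / targets of the K transitions

# Expand the training set with C noise-distorted copies of the inputs
A_big = np.hstack([A + sigma * rng.standard_normal(A.shape) for _ in range(C)])
B_big = np.hstack([B] * C)

# Plain least squares on the expanded, now over-determined, system W @ A_big ~= B_big
W_noisy = np.linalg.lstsq(A_big.T, B_big.T, rcond=None)[0].T

# Closed-form ridge regression with lambda = K * sigma^2
lam = K * sigma**2
W_ridge = B @ A.T @ np.linalg.inv(A @ A.T + lam * np.eye(n_genes))

# As sigma -> 0, W_ridge approaches the minimum-norm solution B @ np.linalg.pinv(A)
rel_err = np.linalg.norm(W_noisy - W_ridge) / np.linalg.norm(W_ridge)
print(rel_err)   # approaches 0 as C grows
```

The value λ = (T−1)σ² follows from taking the expectation of the expanded squared-error objective: each noisy input contributes an extra σ²‖W‖_F² term per transition, so the noise amplitude directly plays the role of the regularization parameter.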