COMPUTATIONAL SYSTEMIC BIOLOGY
EXPLICIT INTEGRAL METHOD FOR NONLINEAR DYNAMIC MATHEMATICAL MODELS IDENTIFICATION
Lashin S.A.1*, Likhoshvai V.A.1, 2
1 Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia; 2 Ugra Research Institute of Information Technologies, Khanty-Mansyisk, Russia
* Corresponding author: e-mail: [email protected]
Keywords: explicit method, parameter identification, inverse problem
Summary
Motivation: One of the basic stages in constructing mathematical models of biological processes is the identification of their parameters. Most methods for this problem compute the model (numerically integrate the system) as part of the search, which greatly increases the computational cost. We propose an explicit method of parameter identification that dispenses with integration of the system and therefore requires less computation. This makes the problem a topical one for numerical analysis and mathematical biology.
Results: An explicit method is developed for the identification of high-dimensional dynamic models, that is, for finding the parameter values at which the model calculations agree plausibly with the experimental data. The method is based on an integral transformation of the original system of differential equations, approximation of the model functions at discrete points using the experimental data, and the subsequent solution of systems of linear equations. A software implementation of the method, including a parallel version based on the MPI standard, has been developed.
Introduction
Consider a class of nonlinear dynamic models represented by a system of differential equations in explicit Cauchy form:
$$\frac{dX_i(t)}{dt} = f_i(C, X), \tag{1}$$
where $X_i(t)$ ($i = 1, \dots, N$) are the dynamic variables of the phenomenon under study and $C$ is the vector of model parameters. The problem of determining the parameters of a mathematical model from its observed behavior is called identification of the model parameters (the inverse problem).
All methods of parameter identification can be divided into two classes. The first computes the model (solves the direct problem) at every iteration; the second dispenses with this step. The complexity of first-class methods grows exponentially, especially as the dimension of the system increases, because the direct problem (integration of the system) must be solved at each iteration and itself becomes more expensive. Algorithms of the second class are based on calculations that avoid numerical integration of the system. According to how this idea is realized, they fall into two principal groups: differential and integral (Ermakova, 1989).
Differential methods approximate the values of the derivatives $dX_i/dt$ and of the right-hand side $f_i(C, X)$ from the experimental data. The identification problem is then reduced to a set of systems of algebraic equations (Karnaukhov, Karnaukhova, 2003).
Integral methods are an alternative to differential methods. They are based on integral transformations of the original system of differential equations (1), followed by approximate calculation of the integrals. Differential methods suffer from the fact that approximate differentiation is an ill-conditioned problem; integral methods work with much smaller errors, since approximate integration is a well-conditioned problem.
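The difference in conditioning between the two groups can be illustrated with a minimal sketch. It is not taken from the paper: the test signal sin(t), the noise amplitude `eps`, the grid size and the fixed random seed are all assumptions chosen only for illustration. The same noisy samples are differentiated by forward differences and integrated by the trapezoidal rule, and each estimate is compared with the exact value.

```python
import math
import random

def forward_differences(values, h):
    """First-order forward differences on a uniform grid with step h."""
    return [(values[i + 1] - values[i]) / h for i in range(len(values) - 1)]

def trapezoid(values, h):
    """Composite trapezoidal rule on a uniform grid with step h."""
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))

# Hypothetical test signal: X(t) = sin(t) sampled on [0, pi].
n = 200
h = math.pi / n
t = [i * h for i in range(n + 1)]

# Assumed measurement noise: uniform in [-eps, eps], fixed seed for repeatability.
eps = 1e-3
rng = random.Random(0)
noisy = [math.sin(ti) + rng.uniform(-eps, eps) for ti in t]

# Exact derivative is cos(t); exact integral of sin(t) over [0, pi] is 2.
derivative_error = max(
    abs(d - math.cos(ti)) for d, ti in zip(forward_differences(noisy, h), t)
)
integral_error = abs(trapezoid(noisy, h) - 2.0)

print(f"max derivative error: {derivative_error:.2e}")
print(f"integral error:       {integral_error:.2e}")
```

In the derivative the noise is divided by the step h, so refining the grid makes the estimate worse; in the integral the noise contribution is bounded by eps times the length of the interval, whatever the step.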
BGRS 2004
The method developed here is an integral one. It is designed for the identification of systems of a particular kind (systems with a quadratically-linear right-hand side) and reduces the original problem to the solution of a set of systems of linear algebraic equations.
Algorithm
Consider a class of nonlinear dynamic models narrower than (1), namely the class of systems of differential equations with a quadratically-linear right-hand side:
$$\frac{dX_i(t)}{dt} = \sum_{k \le l}^{N} c^i_{kl} X_k(t) X_l(t), \tag{2}$$
where $X_i(t)$ ($i = 1, \dots, N$) are the dynamic variables of the phenomenon under study and $c^i_{kl}$ are the parameters (constants) of the model ($k \le l = 1, \dots, N$). Integrating system (2) from $t_0$ to $t_m$ ($m = 1, \dots, M$) gives
$$X_i(t_m) - X_i(t_0) = \int_{t_0}^{t_m} \left( \sum_{k \le l}^{N} c^i_{kl} X_k(t) X_l(t) \right) dt = \sum_{k \le l}^{N} c^i_{kl} \int_{t_0}^{t_m} X_k(t) X_l(t)\,dt. \tag{3}$$
Thus, given enough experimental data, we can calculate the values
$$d^i_m = X_i(t_m) - X_i(t_0), \tag{4}$$
and
$$A^i_{klm} = \int_{t_0}^{t_m} X_k(t) X_l(t)\,dt. \tag{5}$$
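As an illustration of (4) and (5), the following sketch computes both quantities from sampled trajectories. The paper does not say which quadrature rule is used, so the trapezoidal rule here is an assumption, as are the function names `cumulative_trapezoid` and `build_d_and_A` and the toy data. Indices are zero-based, and the entry for m = 0 is the trivial integral from t_0 to t_0.

```python
def cumulative_trapezoid(t, y):
    """Running trapezoidal integral of y from t[0] to each t[m]."""
    out = [0.0]
    for m in range(1, len(t)):
        out.append(out[-1] + 0.5 * (t[m] - t[m - 1]) * (y[m] + y[m - 1]))
    return out

def build_d_and_A(t, X):
    """X[i][m] holds the sample X_i(t_m).

    Returns d[i][m] = X_i(t_m) - X_i(t_0), as in (4), and
    A[(k, l)][m] ~ integral of X_k * X_l from t_0 to t_m, as in (5), for k <= l.
    """
    n = len(X)
    d = [[X[i][m] - X[i][0] for m in range(len(t))] for i in range(n)]
    A = {}
    for k in range(n):
        for l in range(k, n):
            product = [X[k][m] * X[l][m] for m in range(len(t))]
            A[(k, l)] = cumulative_trapezoid(t, product)
    return d, A

# Toy data (hypothetical): X_1(t) = t and X_2(t) = 1 sampled at three points.
t = [0.0, 0.5, 1.0]
X = [[0.0, 0.5, 1.0], [1.0, 1.0, 1.0]]
d, A = build_d_and_A(t, X)
print(d[0], A[(0, 1)])
```

Since the integrals in (5) depend only on the pair (k, l) and on m, the sketch stores them once and shares them between all N equations.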
Using notation (4) and (5), expression (3) can be rewritten as
$$\sum_{k \le l}^{N} c^i_{kl} A^i_{klm} = d^i_m, \quad m = 1, \dots, M. \tag{6}$$
Expression (6) can be reduced to the standard form of a system of linear equations. To do this, we renumber each index pair $kl$ with a single index $j$: $k \otimes l \leftrightarrow j$ ($k \le l = 1, \dots, N \leftrightarrow j = 1, \dots, N(N+1)/2$). After renumbering we obtain a system of linear algebraic equations in standard form,
$$\sum_{j=1}^{N(N+1)/2} c^i_j A^i_{jm} = d^i_m, \quad m = 1, \dots, M, \tag{7}$$
where the parameters $c^i_j$ are the unknowns. If $M = N(N+1)/2$, the matrix $A$ is square and the linear system (7) can be solved by various numerical methods. Solving it yields the parameters of the i-th equation of system (2). Thus, having solved N such systems, we find all parameters of system (2), and the identification problem is solved.
Realization and Results
The algorithm described above is implemented in C++ using the MPI parallel computing library (Snir et al., 1996; Korneev, 2003) and the linear algebra libraries LAPACK (Blackford et al., 1997) and BLACS (Dongarra, Whaley, 1997; Whaley, 1997).
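The whole algorithm can be sketched in a few lines of single-process Python. This is an illustration under stated assumptions, not the authors' C++/MPI implementation: the helper names (`pair_index`, `integrate_to`, `solve_linear`, `identify`), the trapezoidal quadrature and the Gaussian elimination solver are all choices made here. The test model is a logistic equation written in form (2) by taking a constant second variable X_2 = 1 to carry the linear term, which is also an assumption about how such terms would be encoded.

```python
import math

def pair_index(n):
    """Renumber pairs (k, l), k <= l, into a single zero-based index j."""
    return [(k, l) for k in range(n) for l in range(k, n)]

def integrate_to(t, y, stop):
    """Trapezoidal integral of y from t[0] to t[stop]."""
    return sum(0.5 * (t[s] - t[s - 1]) * (y[s] + y[s - 1])
               for s in range(1, stop + 1))

def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a square system."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def identify(t, X, nodes):
    """Recover the parameters of every equation of system (2).

    X[i][s] is the sample X_i(t[s]); nodes are the sample indices of
    t_1..t_M, with M = N(N+1)/2 so that the matrix of system (7) is square.
    """
    pairs = pair_index(len(X))
    products = [[X[k][s] * X[l][s] for s in range(len(t))] for k, l in pairs]
    # Row m, column j of system (7). The integrals do not depend on i,
    # so one matrix serves all N right-hand sides.
    matrix = [[integrate_to(t, p, m) for p in products] for m in nodes]
    params = [solve_linear(matrix, [Xi[m] - Xi[0] for m in nodes]) for Xi in X]
    return pairs, params

# Hypothetical test model: dX1/dt = -X1^2 + X1*X2, dX2/dt = 0, X2 = 1,
# i.e. a logistic curve with a known analytical solution.
steps = 3000
t = [3.0 * s / steps for s in range(steps + 1)]
x0 = 0.1
X1 = [x0 * math.exp(ts) / (1.0 - x0 + x0 * math.exp(ts)) for ts in t]
X2 = [1.0] * len(t)

pairs, c = identify(t, [X1, X2], nodes=[1000, 2000, 3000])
for (k, l), value in zip(pairs, c[0]):
    print(f"c^1_{k + 1}{l + 1} ~ {value:.4f}")
```

For this model the recovered coefficients of the first equation should be close to -1, 1 and 0 for the pairs (1,1), (1,2) and (2,2). The N systems are independent of one another, which is what makes the method straightforward to distribute across processes.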
The program also allows the method to be applied to the identification of mathematical models with a more complex right-hand side. Because it is oriented toward parallel computation, a significant speed-up is achieved in solving the problem. In our opinion, the main advantage of the suggested algorithm is that it is not redundant and comes close to an analytical solution of the identification problem. Given enough data, the method can be applied to identifying the structural and functional organization of gene networks.
Acknowledgements
The work was supported by the leading scientific schools grant No. 311.2003.1 of the President of Russia, by grants Nos. 04-01-00458, 03-07-96833, 03-04-48506, 03-01-00328, 02-07-90359, 02-04-48802 of the RFBR, by the interdisciplinary grant No. 119, and by Project No. 10.4 of the RAS Presidium Program "Molecular and Cellular Biology" (Program for Basic Research of the Presidium of the RAS, contract No. 10002-251/П-25/155-270/200404-082). The authors are indebted to A.V. Ratushny for technical assistance.
References
Blackford L.S., Choi J., Cleary A., D'Azevedo E., Demmel J., Dhillon I., Dongarra J., Hammarling S., Henry G., Petitet A., Stanley K., Walker D., Whaley R.C. ScaLAPACK Users' Guide. 1997.
Dongarra J., Whaley R.C. A User's Guide to the BLACS v1.1. 1997.
Ermakova A. New complex of numerical methods for identification and analysis of kinetic models problem // Mathematical modeling of catalytic reactors. Novosibirsk: Science. Sib. Branch, 1989. P. 120–150.
Karnaukhov A.V., Karnaukhova E.V. Use of new identification method for nonlinear dynamic systems for biochemistry problems // Biochemistry. 2003. V. 68(3). P. 309–317. (Russ.).
Korneev V.D. Parallel programming in MPI / Eds. V.E. Malyshkin, O.L. Bandman. Institute of computer research, Moscow-Izhevsk, 2003. (Russ.).
Snir M., Otto S., Huss-Lederman S., Walker D., Dongarra J. MPI: The Complete Reference. MIT Press, Boston, 1996.
Whaley R.C. Outstanding Issues in the MPIBLACS. 1997.
EFFICIENT ALGORITHM FOR GENE SELECTION USING PLS-RLSC
Li Shen*, Eng Chong Tan
School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore
* Corresponding author: e-mail: [email protected]
Keywords: cancer classification, gene selection, DNA microarray, support vector machines, partial least squares, regularized least squares classification, recursive feature elimination
Summary
Motivation: Accurate cancer diagnosis is very important for the treatment of cancer patients. Gene selection is crucial to classifier design for cancer classification using microarray data. Efficient and effective algorithms for cancer classification and gene selection are needed in this area.
Results: A new method called PLS-RLSC for cancer classification and gene selection is proposed. It uses partial least squares (PLS) for dimension reduction, followed by regularized least-squares classification (RLSC). The new method performed empirically better than the support vector machine (SVM) on the publicly available colon cancer dataset and required much less time. It is also combined with the recursive feature elimination (RFE) algorithm to select a six-gene subset that achieves the minimum testing error. The testing accuracy is as high as 98%.
Availability: The MATLAB source codes are available on request.
Introduction
The objective of cancer classification is to design a classifier that categorizes tissue samples into pre-defined classes (e.g. tumor and normal) using the gene expression levels produced by microarray techniques. Since the data dimension is very large, SVMs have been found to be very useful for this classification problem [2]. Apart from the classification task, it is also important to eliminate the irrelevant genes from the dataset and select a small subset of marker genes that discriminate between the different types of tissue samples. Techniques such as RFE have been proposed by other researchers to accomplish this task [5]. RLSC has been shown to be as good as SVM on several benchmark datasets [6]. We, however, combined this method with the dimension reduction method known as PLS.
Because PLS can be executed very efficiently and RLSC can also be sped up by using the orthogonal components generated by PLS as inputs, the new algorithm is computationally efficient and performs as well as or even better than SVM. We also used RFE with this new algorithm to select a small subset of marker genes, and the results are very satisfactory.
Methods
Consider a microarray dataset containing n samples, with each sample represented by the expression levels of m genes. PLS is a technique for modeling a linear relationship between a set of input variables $\{x_i\}_{i=1}^{n} \in R^m$ and a set of output variables $\{b_i\}_{i=1}^{n} \in R$. Only one-dimensional output is considered here, so $b_i = 1$ or $-1$ corresponds to the i-th sample belonging to class 1 or $-1$. Furthermore, we assume centered input and output variables. Let $X = [x_1, x_2, \dots, x_n]^T$