
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 6, JUNE 2006

Combining Feature Selection and DTW for Time-Varying Functional Genomics

Cesare Furlanello, Member, IEEE, Stefano Merler, and Giuseppe Jurman

Abstract—Given temporal high-throughput data defining a two-class functional genomic process, feature selection algorithms may be applied to extract a panel of discriminating gene time series. We aim to identify the main trends of activity through time. A reconstruction method based on stagewise boosting is endowed with a similarity measure based on the dynamic time warping (DTW) algorithm, defining a ranked set of time-series components contributing most to the reconstruction. The approach is applied to synthetic and public microarray data. On the Cardiogenomics PGA Mouse Model of Myocardial Infarction, the approach allows the identification of a time-varying molecular profile of the ventricular remodeling process.

Index Terms—Clustering, genetics, pattern classification, time series.

I. INTRODUCTION

MOST of the challenges for functional genomics regard complex time-varying biological processes. Gene expression microarrays and other high-throughput technologies are now available to trace such processes simultaneously along thousands of potentially interesting signals. By examining the transcriptome at consecutive time points, researchers can focus on changes in gene expression patterns associated with disease progression. This approach aims at discovering patterns that indicate responsiveness to therapy or at identifying new targets for therapy. In November 2005, the NCBI GEO DataSets collection included 214 data sets of time-course expression profiling studies that cover animal models, cell culture systems, and normal and pathological human tissues. The studies in NCBI GEO range from understanding cerebellar development in mice to identifying estrogen-regulated genes in a breast cancer human cell line with a mutant estrogen receptor. In this paper, we introduce a method for time-varying functional genomics based on a combination of several statistical machine learning and signal processing procedures. The main components of the method are feature selection, stagewise boosting regression, dynamic time warping (DTW), and clustering. The method implements the molecular profiling of time-course functional genomics studies with two-class experimental design, such as the search for groups of coregulated

Manuscript received May 1, 2005; revised January 7, 2006. The work of G. Jurman is supported by the FUPAT postgraduate project “Algorithms and software environments for microarray gene expression experiments.” The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jaakko Astola. The authors are with ITC-irst, I-28050 Trento, Italy (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TSP.2006.873715

genes responsible for phenotypic changes in pathological samples during disease progression with respect to normal samples. In order to discover the main temporal patterns in thousands of expression series, we first need an appropriate measure of similarity between gene expression series. The core idea is to endow the classification and regression methods with a distance derived from DTW in order to privilege temporal trend similarity. Specifically, we embedded a DTW procedure within a time-series boosting method originally developed to forecast marketing time series [17], [29]. The solution is inspired by the interpretation of Freund and Schapire’s AdaBoost algorithm [8] proposed by Friedman and other authors [9], [15]. In Friedman’s analysis, boosting is viewed as a forward stagewise procedure that implements additive regression with respect to an appropriate loss function. In [29], given a population of time series and a target time series that is functionally dependent on the population (e.g., the sum of all series), boosting is used to identify a subset (also called a panel) of prototype series whose linear superposition may approximate the target function. Boosting the regression estimate directly from the time-series population avoids introducing structural hypotheses or ad hoc synthetic basis functions. The prototypes are chosen from the population according to their contribution to the reconstruction of the target function. The panel can be used for understanding the main component trends. The panel selection problem is made complex in time-course gene expression studies by large variations in the amplitude of the signals and by population sizes typically many times the number of timestamps [7]. To enhance the selection of prototype time series in different amplitude regimes, a DTW-based distance was introduced within time-series boosting in [10]. Here, we extend the method by constraining the search space for the basis functions.
The functions involved in the reconstruction are chosen through the feature selection process associated with an underlying classification problem. To avoid selection-bias effects, feature selection is realized with a complete validation scheme [2], [11], [27]. The use of a subset of the best ranked series according to the unbiased classification scheme is exploited to define the target function. This step highlights the main trend of the biological process by sensibly reducing the noise. The time-boosting process is then applied to the reduced family of best ranked time series to identify the genes whose transcription best contributes to the reconstruction of the biological process. Additional knowledge can be used in the procedure by introducing constraints on the possible candidate series at each stage of the selection, e.g., by choosing candidates from groups of coregulated genes or of known functions. Finally, process-specific coregulated genes may be identified by

1053-587X/$20.00 © 2006 IEEE

FURLANELLO et al.: COMBINING FEATURE SELECTION AND DTW FOR TIME-VARYING FUNCTIONAL GENOMICS

clustering the subset of the selected time series, again according to the DTW-based measure: the reconstruction thus becomes a semi-supervised procedure.

Related studies on clustering time-varying gene expression data investigated correlation and model-based methods. A review of such methods is included in [30]. Clustering with local shape-based similarity using Spearman rank correlation is introduced in [3]. An analysis of the advantages and disadvantages of correlation and metric methods in this context is provided in [19]. Dynamic model-based clustering is introduced in [21]. A version with reduced computational complexity is proposed in [31]. However, Ernst et al. [7] have recently pointed out that problems may arise in practice. Most time-series data sets are short (fewer than ten points), and a relatively small set of temporal profiles can be defined for such data. In this study, the identification of basic temporal profiles is automated by feature selection and boosting directly from the population, and it is demonstrated on the Cardiogenomics Mouse Model of Myocardial Infarction data set with six time points and 12 488 genes (cf. [5] and Section V).

II. MATHEMATICAL SETUP

In this section, we provide notations for the basic reconstruction problem. Given a time-course functional genomics study associated with a two-class experimental design, the data set may be defined as

representing the gene time series at the sampled time points for each of the samples, which are partitioned into two classes with labels 1 and −1. In problems such as the study of cardiac remodeling after myocardial infarction in a mouse model described in Section V-A, we may define a target function related to the evolution of the underlying biological process in the pathological tissues as a sum of expression series taken over a subset of the genes and over the samples of the positive class. In the mouse model of myocardial infarction, we may consider expressions from the tissue close to the infarcted area (class 1) and from control tissues (class −1). If we could choose as the subset the genes belonging to the cardiac remodeling pathway, the series of the positive samples would describe the evolution of the expressions of the genes potentially responsible for the process, and the resulting target function would describe the overall activity of the pathway. Averaging expressions over the positive samples, we may consider the averaged series and the corresponding simplified target function.


The aim of this paper is to provide methods for first identifying the subset (by feature selection, see Section III) and then for approximating the resulting target function as a linear superposition of a smaller subset of the base functions (see Section IV). This expansion is referred to as (1) in the following. The basis functions selected in (1) can then be investigated as prototype elements.
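The displayed formulas of this section are not available in this copy, so the block below sketches one notation that is consistent with the prose. Every symbol in it (N, G, T, S, K, the coefficients c_k, and the per-sample array layout) is an assumption introduced for illustration, not the authors' original notation.

```latex
% Hypothetical notation: N samples, G genes, T time points,
% S the subset of selected genes, N_+ the number of class +1 samples.
\begin{align*}
\mathcal{D} &= \{\, (x_i, y_i) : x_i \in \mathbb{R}^{G \times T},\;
               y_i \in \{+1, -1\},\; i = 1, \dots, N \,\} \\
\bar{x}_g(t) &= \frac{1}{N_{+}} \sum_{i \,:\, y_i = +1} x_{i,g}(t),
               \qquad F(t) = \sum_{g \in S} \bar{x}_g(t) \\
F(t) &\approx \sum_{k=1}^{K} c_k \, \bar{x}_{g_k}(t),
               \qquad g_k \in S,\; K \le |S|
\end{align*}
```

The last line plays the role of the expansion referred to as (1): a target function written as a linear superposition of a few series drawn from the selected subset.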

III. FEATURE SELECTION

As the first step of our method, a predictive classification and feature ranking scheme is applied to the data set in order to detect the optimal gene panel. It is now a consolidated result that, in functional genomics, the ranking of features can be even more important than building the classifier itself. It is also a shared notion, among methodologists, that the scheme must ensure that no selection bias effect is contaminating the experiment [2], [26]. The accuracy estimates must be completely validated. In practice, two nested partitioning schemes are used: one external for estimating the prediction error of an algorithm for different training-test splits, and one internal for model tuning and feature ranking. In addition, it is good practice to verify the results by replicating the same experimental setup with randomized class labels. Complete validation has been advocated because the predictivity and gene list stability of several major microarray studies have been questioned on closer analysis [18], [22]. The cost of this caution is high computational complexity. In a binary classification problem with 50 cases and 20 000 genes, a practitioner willing to implement this scheme will have to develop about base models. Our classification scheme is based on a complete validation system developed for support vector machine (SVM) classifiers [24]. Called E-RFE, the scheme provides feature ranking with an accelerated version of the Recursive Feature Elimination procedure (RFE). E-RFE uses an entropy indicator function of the feature importance weights [11]. However, the approach we are introducing does not depend on a particular classification method, provided that accuracy estimation and feature set stability are correctly assessed. Alternatively, different classifiers and new less demanding approaches, such as bolstered error estimation [25], might thus be applied.
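The key point of complete validation is that feature ranking must only ever see the training part of each split. The sketch below illustrates that idea in plain Python. It is not E-RFE or an SVM: the ranking rule (absolute difference of class means), the nearest-centroid classifier, and all function names are stand-ins assumed for illustration, and the internal tuning loop is omitted.

```python
import random


def rank_features(X, y):
    """Rank feature indexes by absolute difference of the class means.

    Stand-in for the ranking produced by E-RFE (not reproduced here).
    Assumes both classes (labels 1 and -1) occur in X.
    """
    n_feat = len(X[0])
    scores = []
    for j in range(n_feat):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == -1]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_feat), key=lambda j: -scores[j])


def nearest_mean_predict(X_train, y_train, x, feats):
    """Assign x to the class whose centroid is closer on the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in (1, -1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = 0.0
        for j in feats:
            centroid = sum(r[j] for r in rows) / len(rows)
            dist += (x[j] - centroid) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def validated_error(X, y, n_splits=20, test_frac=0.25, n_feats=2, seed=0):
    """External loop over random train/test splits.

    Ranking and model fitting use the training part only, so the test
    error is not contaminated by the feature selection step.
    """
    rng = random.Random(seed)
    idx = list(range(len(X)))
    errors = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        n_test = max(1, int(test_frac * len(idx)))
        test, train = idx[:n_test], idx[n_test:]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        feats = rank_features(X_tr, y_tr)[:n_feats]  # ranking inside the split
        wrong = sum(
            nearest_mean_predict(X_tr, y_tr, X[i], feats) != y[i] for i in test
        )
        errors.append(wrong / n_test)
    return sum(errors) / len(errors)
```

Running the same function after shuffling `y` gives the randomized-label check mentioned above: with labels that carry no information, the estimated error should move toward chance level.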
Given a classification and ranking setup, we first consider the time-course data as two-class vectors in terms of the genomics features. In the case of time-course gene expression studies, we initially discard the additional time coordinate and develop the predictive models in terms of the phenotypic labels (e.g., the classification of developing disease versus control). Possible outliers for the classification task may also be detected and discarded at this point [13]. Also obtained is an ordered list of features, whose ranking algorithm is integrated within the classification task. This avoids ranking by filtering with external criteria, a practice that may eliminate discriminating genes of low


Fig. 1. Time-series boosting algorithm.

Fig. 2. Example of profile reconstruction for the Cardiogenomics data discussed in Section V (log scale on time axis). (Top) First time series selected by the algorithm is plotted together with the target function, the approximation, and the residual. (Bottom) Series selected at Steps 1 to 5 for the approximation of the target function.

average dynamics. Systems such as E-RFE also provide an estimate of the optimal number of features. In summary, the classification and ranking step provides a set of discriminating features that will be used by the time-series boosting procedure on the non-outlier data.

IV. PROFILE RECONSTRUCTION

A. Boosting Algorithm

Let a custom similarity distance between time series be available. The definition of an appropriate distance based on the DTW algorithm will be discussed later (see Sections IV-B and IV-C). For the time being, we need only define how the distance is integrated in a stagewise regression procedure in order to fit a target time series with a superposition of selected elements from a population of time series. The approximation in (1) is realized through an adapted boosting algorithm [9], detailed in Fig. 1. At the beginning of the algorithm, three sets are introduced: the set keeping track of the indexes of the series not contributing to the reconstruction, the set including the indexes of the series used for the reconstruction, and its complement in the population. At each step of the algorithm, candidates from the time-series population are matched against the current target function. At the first step, the target function is the original one, and the best candidate is determined together with its combination coefficient by a greedy search within the set of candidate series and the interval of admissible coefficients. The resulting term is then subtracted from the target function. The residual is the new target function for the stagewise procedure, which is iterated as long as a series with admissible coefficients is found.

In practice, the coefficients must be tuned considering biological constraints within an interval. In the Cardiogenomics application discussed in Section V-A, only positive and relatively small values of the coefficients were considered to be biologically relevant by an expert investigator. The effect of the DTW-based distance is a relatively rapid screening of the most interesting candidates having a profile similar to the function being matched. Ultimately, the target function is approximated by the expansion as in (1). The panel of selected series may be investigated as prototype elements, where the index ranks such series according to their relative importance in the profile reconstruction. In [9], Friedman proposed several numerical procedures for optimizing the choice of the elements and their coefficients. Our approach combines the use of a mixed DTW-Euclidean distance (described in Section IV-C) as the metric strategy with a minimization procedure. First, a grid search is performed to estimate a rough minimum to be used as a pivot for a Brent search. If no minimum in the admissible range is found, the series is marked as not available for the approximation algorithm. Otherwise, the corresponding residual and its norm are computed. The best series for the current step is then selected as the one yielding the residual with minimum norm.

An example of profile reconstruction on an unevenly spaced short time series is given in Fig. 2. The series are plotted on a


log scale at the six steps 1 h, 4 h, 24 h, 48 h, 168 h, and 1344 h. As a diagnostic indicator of convergence for the procedure, we consider the indicator (2). The main stopping criterion consists in the existence of available series. A second stopping criterion is defined by monitoring the evolution of the indicator defined in (2). By constraining the approximation on a subset of the timestamps, reconstruction can be forced on a defined interval of the series. Time focusing can also be provided considering cyclic intervals, i.e., considering the last time point as connected to the first one. A cyclic interval may be used for the approximation of periodic time series such as the Plasmodium falciparum data sets [4], [28].
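The stagewise procedure described in this subsection can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of Fig. 1: the residual is scored with the plain Euclidean norm rather than the mixed DTW-Euclidean distance, the coefficient is found by grid search only (no Brent refinement), and the function names, the interval `[c_min, c_max]`, and the default values are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a sequence of numbers."""
    return math.sqrt(sum(a * a for a in v))


def best_coefficient(target, series, c_min, c_max, steps=200):
    """Grid search for c in [c_min, c_max] minimising ||target - c * series||."""
    best_c, best_r = None, float("inf")
    for k in range(steps + 1):
        c = c_min + (c_max - c_min) * k / steps
        r = norm([t - c * s for t, s in zip(target, series)])
        if r < best_r:
            best_c, best_r = c, r
    return best_c, best_r


def stagewise_reconstruct(target, population, c_min=0.0, c_max=5.0,
                          max_steps=5, tol=1e-9):
    """Greedy stagewise fit: at each step pick the series (and coefficient)
    that leaves the smallest residual, subtract it, and continue on the
    residual. Returns the ranked panel [(index, coefficient), ...] and the
    final residual."""
    residual = list(target)
    available = set(range(len(population)))
    panel = []
    for _ in range(max_steps):
        if not available:  # main stopping criterion: no series left
            break
        choice = None
        for j in sorted(available):
            c, r = best_coefficient(residual, population[j], c_min, c_max)
            if choice is None or r < choice[2]:
                choice = (j, c, r)
        j, c, r = choice
        if r >= norm(residual) - tol:  # no admissible coefficient improves the fit
            break
        panel.append((j, c))
        available.discard(j)
        residual = [t - c * s for t, s in zip(residual, population[j])]
    return panel, residual
```

The order of the returned panel is the ranking discussed in the text: the first entry is the series that removes most of the target on its own, and later entries fit what is left. Restricting the lists to a subset of time indexes before calling the function gives a simple form of the time focusing mentioned above.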

Fig. 3. Dynamic time warping algorithm.

B. Dynamic Time Warping

The alignment of temporal patterns by DTW has traditionally been used in the recognition of speech signals [23]. This method is a generalization of standard algorithms for string comparison [14] and for the alignment of time series data. Its application to expression studies was proposed by Aach and Church [1]. Since no general-purpose similarity rules can be assumed in such different domains [16], [17], our implementation allows for a variety of alignment configurations. The solution is inspired by an existing DTW recognition system developed at ITC-irst, Trento, Italy, for speech data preprocessing. Given two series (each with its own number of time points) as inputs, DTW selects the best possible alignment between them by minimizing a local distance (usually the Euclidean or the city-block distance) between the series points. Although defined by means of a recursive formula, the alignment is more efficiently computed by dynamic programming, whose first step consists in the construction of a table (a matrix). When the table is completed, a multiple of its last computed value gives the DTW distance between the series. A further step is required to explicitly derive the correct alignment path. Usually known as the backtrack part of the DTW, this step is not required in this context. A unique parameter vector has to be set for DTW, i.e., the weight configuration, which gives the penalization rule for horizontal (H), diagonal (D), and vertical (V) time distortions. As far as the distance computation is involved, the algorithm pseudocode is detailed in Fig. 3 for the symmetric weight configurations used in this problem.

Applying DTW to time series on the expression data first required an adaptation of the internal weights of the algorithm. The weights establish the elementary costs in the alignment of two temporal patterns along an optimal path.
The weight configuration 1/1/1 (costs for horizontal, diagonal, and vertical displacements) was chosen on the basis of experiments on synthetic data, and then on data from the DeRisi P. Falciparum data set. The configuration allows for moderate time shifts, which may have importance in biological processes. To test the behavior of the DTW distance, experiments on a synthetic data set have been conducted. The data consist of four time series on 41 time

Fig. 4. DTW alignment examples on synthetic data: series s2, s3, and s4 (from top to bottom) are aligned to reference series s1. The point-to-point correspondences are reported on the left, while in the right panels, the corresponding alignment paths are shown.

points. Three of the series are defined by a common analytic expression whose parameter takes the values 13, 0, and 5, respectively. The remaining series is a Gaussian defined over the same range of time points. The resulting alignments are displayed in Fig. 4.
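The table construction without the backtrack step can be sketched in a few lines of Python. This is a minimal sketch, not the pseudocode of Fig. 3: the absolute difference is assumed as local distance, the names `w_h`, `w_d`, `w_v` are hypothetical, which table direction counts as "horizontal" is a convention chosen here, and the normalising multiple applied to the final value in the original system is not reproduced.

```python
def dtw_distance(x, y, w_h=1.0, w_d=1.0, w_v=1.0):
    """DTW distance between two numeric sequences by dynamic programming.

    w_h, w_d, w_v weight horizontal, diagonal, and vertical moves; the
    defaults correspond to the 1/1/1 configuration quoted in the text.
    Only the last table value is returned, so no backtracking is needed.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # table[i][j]: cost of the best alignment of x[:i] with y[:j]
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])  # assumed local distance
            table[i][j] = min(
                table[i][j - 1] + w_h * d,      # horizontal move
                table[i - 1][j - 1] + w_d * d,  # diagonal move
                table[i - 1][j] + w_v * d,      # vertical move
            )
    return table[n][m]
```

With equal weights, a pattern and a time-shifted copy of it that start and end at the same level align at zero cost, which is the tolerance to moderate time shifts discussed above. Raising `w_h` and `w_v` relative to `w_d` makes such shifts more expensive.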


Fig. 5. Three synthetic series. Series B is closest to series A for the Euclidean distance, while series C is more similar to series A for the DTW and the Euclidean-corrected DTW distances. Series D is closer to series A than series C for the Pearson distance (see Table I).

Fig. 6. Synthetic data set used for testing the reconstruction procedure. Black lines (solid and dashed) represent the two basic shapes. Dark gray and light gray lines represent the series generated from the basic ones by injecting Gaussian noise.

TABLE I COMPARISON OF DTW, EUCLIDEAN, EUCLIDEAN-CORRECTED DTW, AND PEARSON DISTANCES

C. Euclidean Correction for DTW Distance

The DTW distance proves more suitable than the Euclidean metric in curve comparison because it takes into account the shapes of the curves instead of just evaluating the pointwise distance of the vectors. If we consider the three series A, B, and C displayed in Fig. 5, according to Table I, C is correctly found to be more similar to A than B for the DTW metric. Regardless of the similarity of the shapes of series A and C, the Euclidean distance evaluates series B as closer to series A, whereas the DTW metric selects series C as the closest one. A validation of the DTW alignments on synthetic data and on the real P. falciparum data set suggested that a mixed strategy can be introduced in order to better deal with the high intergene variability. Correct detection of shape similarity is preferred, while maintaining control on the squared residual based on the Euclidean distance. The similarity distance is thus defined as

to obtain a tradeoff between the distance induced by the Euclidean norm and the distance computed by time warping, with a factor that depends on the number of time points. The adoption of the DTW-based component favors the matching of series of similar morphology. It reduces the known disadvantages of the use of the Euclidean norm or of correlation distances in matching and then in clustering time series [6], [19]. In the example described in Table I and Fig. 5, the series D is closer to series A than series C for the Pearson distance, while series C is closer to series A than series D for the Euclidean-corrected distance. In summary, the Euclidean-corrected DTW metric is designed to reasonably take into account coherent evolution (pattern usually captured by correlation), shape (by the DTW term), and amplitude (by the Euclidean distance term).
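A small numerical example helps to see why the two terms are combined. The exact form of the mixed distance is not recoverable from this copy, so the sketch below assumes a plain weighted sum of a DTW term and a Euclidean term; the weight `lam`, the function names, and the three toy series are hypothetical and are not the series of Fig. 5.

```python
import math


def dtw(x, y):
    """Unweighted DTW distance with absolute difference as local distance."""
    inf = float("inf")
    table = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    table[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d = abs(x[i - 1] - y[j - 1])
            table[i][j] = d + min(
                table[i][j - 1], table[i - 1][j - 1], table[i - 1][j]
            )
    return table[len(x)][len(y)]


def euclid(x, y):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mixed(x, y, lam=1.0):
    """Assumed combination: shape term (DTW) plus amplitude term (Euclidean)."""
    return dtw(x, y) + lam * euclid(x, y)


# Toy series: A has a peak, C has the same peak two steps later, B is flat.
A = [0, 0, 0, 4, 0, 0, 0, 0]
B = [1, 1, 1, 1, 1, 1, 1, 1]
C = [0, 0, 0, 0, 0, 4, 0, 0]
```

Here the Euclidean distance ranks the flat series B closer to A than the shifted peak C, while the DTW term and the assumed mixed distance both rank C closer, mirroring the behaviour described for series A, B, and C in the text.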

Fig. 7. Results on the synthetic data set. (left) First reconstructing series (dashed line), scaled by the corresponding factor, is plotted versus the target function (solid line). (right) Second reconstructing series (dashed line), scaled by the corresponding factor, is plotted versus the residual function (solid line).

D. A Synthetic Experiment

To test the reconstruction procedure on unevenly spaced, short time series, we used a synthetic data set mimicking the shape of the gene expression time series of the Cardiogenomics data set (see Section V-A). The data set consists of two sets of 50 time-course series, each built as follows: a basic shape is set, and then 50 series are generated from the basic shape by adding Gaussian noise. The data set is displayed in Fig. 6. As in Section V-C, the target function is defined as the sum of the 100 time series described above. The goal of the reconstruction procedure (Euclidean-corrected DTW with boosting) is to identify a minimal subset of time series that allows the global approximation of the target function. More specifically, we expect that two time series belonging to the two subsets should be sufficient for reconstructing the target function. The procedure selects three time series, accounting for 96.3% of the reconstruction (see (2)). The first two belong to different subsets (Fig. 7).

V. RESULTS ON CARDIOGENOMICS DATA

A. Data Description

We applied the DTW-based gene profiling method to microarray data generated by Project 1 of the Cardiogenomics PGA [5]. In particular, we studied the microarray data of the Mouse Model of Myocardial Infarction (MI) [5]. The goal of the MI model is the elucidation of the global changes occurring in the ventricular architecture as a consequence of myocardial infarction. After a myocardial infarction has occurred, a cardiac remodeling process begins. This consists of an acute inflammatory


phase, followed by a reconstructive phase involving a transformation of the tissues surrounding the necrotic area. We want to identify the genes involved in this dynamic biological process, whose final evolution is the development of heart failure, a critical condition for most survivors of acute myocardial infarction (AMI). It is thus important to define molecular targets for the early identification of the phases of cardiac remodeling. In the MI model, the effect of AMI is emulated by left coronary ligation. We refer to [5] for a full description of the experimental design and the operational procedures in the MI model. The mice that have been operated on are sacrificed at hours 1, 4, 24, 48, 168 (one week), and 1344 (eight weeks) after the procedure. Gene expressions at those six time instants are analyzed with the Affymetrix MG-U74Av.2 arrays for 12 488 time series. Sham-operated mice are considered for control. Further details on chip preparation are available from the Cardiogenomics PGA website [5]. Three classes of samples are considered:
1) noninfarcted region of the left ventricular free wall (NILV);
2) infarcted region of the left ventricular free wall (ILV);
3) left ventricle from control cases (LV).
In this paper, only disease evolution in the NILV tissues versus the LV control tissues was considered. At each time step, three NILV and at least three LV samples were analyzed, for a total of 18 NILV and 23 LV samples.

B. Classification and Feature Selection of NILV-LV

To obtain a target function representative of the progression of the ventricular remodeling process, we first developed a classification model using the ERFE-SVM methodology (details in [12]). Before classification, variables were standardized to mean zero and standard deviation one. The ERFE-SVM model provided high classification accuracy ( 92%) for an estimated optimal subset of genes of size and standard deviation 25.
The method also provided a gene ranking for the NILV versus LV problem. The first 70 ranked genes were then considered.

C. Development of the Target Function

The target function is investigated as representative of the biological process being studied. Two main classes of target functions may be considered:
1) the evolution of an observed phenotypic variable;
2) the evolution of a function of all or a subfamily of data and genes.
We have, in particular, experimented on this second type of target function, aiming at providing a method for discovering prototype genes that characterize the temporal evolution of a biological process, possibly differentiating between conditions. The target function may be obtained by summing or averaging over only the genes that are selected by the classification and ranking process. As a consequence, we expect that the noise injected by the genes that do not contribute to the disease progression will be eliminated from the gene expression budget, thus providing a much clearer global profile. In particular, we may try to fit the sum of genes only for those samples in which the disease is developing (see Fig. 8).
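Building the target function from the selected genes amounts to a restricted sum. The sketch below shows one way to compute it, assuming a nested-list layout `expr[sample][gene][time]` and averaging over the samples of the developing-disease class before summing over genes; the layout, the function name, and the averaging choice are assumptions made for illustration.

```python
def target_function(expr, labels, selected, positive=1):
    """Sum, over the selected genes, of the class-averaged expression series.

    expr[i][g][t] is the expression of gene g at time index t in sample i
    (assumed layout); labels[i] is the class of sample i; selected is the
    list of gene indexes kept by the ranking step.
    """
    pos = [s for s, lab in zip(expr, labels) if lab == positive]
    n_times = len(pos[0][0])
    target = [0.0] * n_times
    for g in selected:
        for t in range(n_times):
            # average over the positive samples, then accumulate over genes
            target[t] += sum(s[g][t] for s in pos) / len(pos)
    return target
```

Calling the function once with the top-ranked gene indexes and once with all gene indexes gives the two kinds of profile compared in Fig. 8.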


Fig. 8. Target function as an average of the indicated number of gene expression time series ranked according to the selection procedure. Solid lines refer to the target function obtained by summing all the time series (12 488) and by summing the 70 series included in the optimal subset of genes.

Fig. 9. First three reconstructing series (dashed line), scaled by the corresponding factor, are plotted versus the target function (solid line). The bottom right panel shows the values for the complete reconstruction with five time series (see Table II).

Other types of stratification according either to additional covariates or to additional constraints on the regulation of the genes are also possible. The target function may then be obtained by summing or averaging over only those preselected genes. The noise due to the genes that do not contribute to disease progression is eliminated from the gene expression set, resulting in a more significant global profile. In this study, the sum expression functions were computed for the two classes, obtaining clearly different nonoverlapping global profiles. Up-regulation is found for NILV cases in the first two time instants, and down-regulation later. The two profiles cannot be distinguished using the sum of all the 12 488 series. This shows that the feature selection phase is effective in filtering the noisy information of genes ranked low by ERFE-SVM.

D. Boosting the Target Function

The DTW-based method was applied to the optimal subset of genes on NILV samples. A limited number of genes was sufficient to warrant highly accurate approximations ( 97% achieved with four genes and six time events), as described in Fig. 9 and Table II. Using information from LocusLink, UniGene, and Swiss-Prot, the functional categories of the best reconstructing genes were identified. In particular, the first gene (Affymetrix 93975_at) is activated in the stem cells. The gene is associated with cellular differentiation, and it is overexpressed in the early phase of the remodeling, in the cicatrization of the tissues surrounding the scar. The two genes Affymetrix 100429_at and 94743_f_at are mouse specific and functionally linked to blood (erythroid differentiation and plasma).


TABLE II RECONSTRUCTION PARAMETERS OF TARGET FUNCTION (NILV SUM FOR BEST RANKED GENES)

TABLE III DTW CLUSTERING AND PROFILE RECONSTRUCTION BY CLUSTER MEANS

Fig. 10. Best genes for time subwindows that correspond to the early phase (left) and late phase (right) of the remodeling process. The first candidate series (dashed lines), scaled by the corresponding factor, are plotted versus the target function (solid lines).

Fig. 11. DTW clustering and profile reconstruction through cluster means. (left) Gene expression time series displayed in different shades according to the clustering. (right) Plots of the mean time series for each cluster. The percentage of reconstruction of the parameter is indicated near each curve (see Table III).

The Affymetrix 93023_f_at is a mouse histone, while the Affymetrix 98493_at is expressed in myocardial tissues.

E. Time Windows

The method was also applied to shorter time windows in order to focus the gene profiling mechanism on specific phases of the remodeling process. The best reconstructing gene for the whole period (see Table II) is also the best one for the first two windows of four time points, while for the late phase (one to eight weeks), the gene Affymetrix 102779_at is the most relevant component (see Fig. 10). This phase consists in the hypertrophy and alterations of ventricular architecture to distribute the increased wall stresses more evenly. Thus, it is conceivable that a gene involved in growth arrest and inducible by DNA damage is underexpressed.

F. Reconstruction From Coregulated Clusters

We have shown that a few genes with large coefficients warrant reconstruction of their target function. The effect may be due to the presence of a set of highly coregulated genes. To investigate this hypothesis, hierarchical clustering using the DTW similarity function was applied to the first 70 ranked series. Cluster means were then used to reconstruct the target function. The hclust function of the R statistical environment (complete method, 0.3 threshold) was applied [20], and six clusters were identified (Fig. 11). They explain about 90% of the target function in terms of the indicator. The parameters of the reconstruction process are listed in Table III.

VI. CONCLUSION

The integration of the DTW similarity within the time-series boosting regression method yielded biologically interesting results on synthetic and real data. In the Cardiogenomics NILV-LV time course study, the DTW-based method allowed us to identify groups of genes associated with profound structural changes within cells of the remodeling myocardium. Preselection (from 12 488 to about 100 features) was based on feature selection with supervised SVM classification. Even genes of limited expression range may be considered and selected according to their contribution to the discrimination of pathological and control samples. While limited differences are found for the target functions defined by aggregating NILV and LV expression separately on all the 12 488 genes, the cardiac remodeling trend is much more evident considering only the best-ranked features. A substantial stability of the reduced gene lists is also obtained, and the identification of the first panel elements has shown biological significance. A first analysis in terms of clusters of coregulated genes has shown that the reconstruction property is stable within clusters, an effect that should be further exploited.

ACKNOWLEDGMENT

A prototype version of the time-series boosting procedure for gene expression data was developed by S. Riccadonna. The authors would like to thank D. Giuliani, M. Serafini, and S. Baldo for their collaboration. The authors also would like to thank P. Jay of the Cardiogenomics PGA for providing array data and helpful comments.

REFERENCES

[1] J. Aach and G. Church, “Aligning gene expression time series with time warping algorithms,” Bioinformatics, vol. 17, no. 6, pp. 495–508, 2001.
[2] C. Ambroise and G. McLachlan, “Selection bias in gene extraction on the basis of microarray gene-expression data,” Proc. Nat. Acad. Sci. USA, vol. 99, no. 10, pp. 6562–6566, 2002.
[3] R. Balasubramaniyan, E. Hüllermeier, N. Weskamp, and J. Kämper, “Clustering of gene expression data using a local shape-based similarity measure,” Bioinformatics, vol. 21, no. 7, pp. 1069–1077, 2005.
[4] Z. Bozdech, M. Llinas, B. Pulliam, E. Wong, J. Zhu, and J. DeRisi, “The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum,” PLOS Biol., vol. 1, no. 1, p. E5, 2003.
[5] CardioGenomics. (2004, Feb.) Genomics of Cardiovascular Development, Adaptation, and Remodeling. NHLBI Program for Genomic Applications, Harvard Medical School, Boston, MA. A Mouse Model of Myocardial Infarction. [Online]. Available: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/mi_home.html


[6] S. Chu, E. Keogh, D. Hart, and M. Pazzani, “Iterative deepening dynamic time warping for time series,” in Proc. 2nd SIAM Int. Conf. Data Mining, R. Grossman, J. Han, V. Kumar, H. Mannila, and R. Motwani, Eds., Arlington, VA, 2002.
[7] J. Ernst, G. Nau, and Z. Bar-Joseph, “Clustering short time series gene expression data,” Bioinformatics, vol. 21, pp. i159–i168, 2005.
[8] Y. Freund and R. Schapire, “Experiments with a new boosting algorithm,” in Proc. Int. Conf. Machine Learning, 1996, pp. 148–156.
[9] J. Friedman, “Greedy function approximation: a gradient boosting machine,” Ann. Stat., vol. 29, no. 5, pp. 1189–1232, 2001.
[10] C. Furlanello, G. Jurman, M. Serafini, S. Riccadonna, D. Giuliani, and S. Merler, “DTW and stagewise regression for the molecular profiling of microarray time series,” presented at the 12th Int. Conf. Intelligent Systems Molecular Biology/3rd Eur. Conf. Comp. Biology (ISMB/ECCB), Glasgow, U.K., Jul. 31–Aug. 4, 2004.
[11] C. Furlanello, M. Serafini, S. Merler, and G. Jurman, “Entropy-based gene ranking without selection bias for the predictive classification of microarray data,” BMC Bioinformatics, no. 4, p. 54, 2003.
[12] C. Furlanello, M. Serafini, S. Merler, and G. Jurman, “Methods for predictive classification and molecular profiling from DNA microarray data,” Italian Heart J., vol. 5, no. Suppl. 1, pp. 199S–202S, 2004.
[13] C. Furlanello, M. Serafini, S. Merler, and G. Jurman, “Semisupervised learning for molecular profiling,” IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 2, no. 2, pp. 110–118, Apr.–Jun. 2005.
[14] D. Gusfield, Algorithms on Strings, Trees and Sequences, 1st ed. Cambridge, MA: Cambridge Univ. Press, 1997.
[15] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. New York: Springer-Verlag, 2001.
[16] E. Keogh and M. Pazzani, “An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback,” in Proc. 4th Int. Conf. Knowledge Discovery Data Mining, G. P.-S. R. Agrawal and P. E. Stolorz, Eds., New York, 1998, pp. 239–241.
[17] E. Keogh and M. Pazzani, “Scaling up dynamic time warping for datamining applications,” Knowl. Discov. Data Min., pp. 285–289, 2000.
[18] S. Michiels, S. Koscielny, and C. Hill, “Prediction of cancer outcome with microarrays: a multiple random validation strategy,” Lancet, vol. 365, pp. 482–488, Feb. 2005.
[19] C. Moeller-Levet and H. Yin, “Modeling and analysis of gene expression time-series based on co-expression,” Int. J. Neural Syst., vol. 15, no. 4, pp. 1–12, 2005.
[20] R Development Core Team, “R: A language and environment for statistical computing,” R Foundation for Statistical Computing, Vienna, Austria, 2004.
[21] M. Ramoni, P. Sebastiani, and I. Kohane, “Cluster analysis of gene expression dynamics,” Proc. Nat. Acad. Sci. USA, vol. 99, no. 14, pp. 9121–9126, 2002.
[22] M. Ruschhaupt, W. Huber, A. Poustka, and U. Mansmann, “A compendium to ensure computational reproducibility in high-dimensional classification tasks,” Statist. Appl. Genetics Molec. Biol., vol. 3, no. 1, Article 37, 2004.
[23] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-26, no. 1, pp. 43–49, Feb. 1978.
[24] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[25] C. Sima, U. Braga-Neto, and E. Dougherty, “Superior feature-set ranking for small samples using bolstered error estimation,” Bioinformatics, vol. 21, no. 7, pp. 1046–1054, Apr. 2005.
[26] R. Simon, M. Radmacher, K. Dobbin, and L. McShane, “Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification,” J. Nat. Cancer Inst., vol. 95, no. 1, pp. 14–18, 2003.


[27] R. Simon, E. Korn, L. McShane, M. Radmacher, G. Wright, and Y. Zhao, Design and Analysis of DNA Microarray Investigations, ser. Statistics for Biology and Health. New York: Springer, 2004.
[28] The PlasmoDB Collaborative, “PlasmoDB: An integrative database of the Plasmodium falciparum genome. Tools for accessing and analyzing finished and unfinished sequence data,” Nucleic Acids Res., vol. 29, no. 1, pp. 66–69, 2001.
[29] M. Visentin and C. Furlanello, “Time series boosting for the automatic selection of panels in marketing studies,” Rend. St. Econ. Quant., vol. 1, pp. 68–78, 2000.
[30] F. Wu, “Computational methods for analysis and modeling of time-course gene expression data,” Ph.D. dissertation, Dept. of Biomedical Eng., Univ. of Saskatchewan, Saskatoon, SK, Canada, 2004.
[31] F. Wu, W. Zhang, and A. Kusalik, “Dynamic model-based clustering for time-course gene expression data,” J. Bioinformatics Comput. Biol., vol. 3, no. 4, pp. 821–836, 2005.

Cesare Furlanello (M’06) received the Mathematics degree from the University of Padua, Italy, in 1986. Currently, he is a Senior Researcher and responsible for the MPBA (Predictive Models for Biological and Environmental Data Analysis) Project at ITC-irst, Center for Scientific and Technological Research, Trento, Italy. He is interested in machine learning methods, with applications to medical and environmental data. He is a founder of the WebValley initiative for the dissemination of scientific and information science culture.

Stefano Merler received the Mathematics degree from the University of Trento, Italy, in 1994. He is currently a member of the MPBA (Predictive Models for Biological and Environmental Data Analysis) Project team at ITC-irst, Center for Scientific and Technological Research, Trento, Italy. His main interests are the development and application of machine learning techniques for bioinformatics and environmental epidemiology.

Giuseppe Jurman received the Ph.D. degree in mathematics from the University of Trento, Italy, in 1998. He is currently a Postdoctoral Member of the MPBA (Predictive Models for Biological and Environmental Data Analysis) Project team at ITC-irst, Center for Scientific and Technological Research, Trento, Italy. He is involved in the study of various mathematical aspects of bioinformatics.
