Biometrics, 1–15
DOI: 000
Functional Data Classification with Kernel-Induced Random Forests
Jiguo Cao1,∗ and Guangzhe Fan2,∗∗
1 Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
2 Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1
∗email: [email protected]
∗∗email: [email protected]
Summary:
Scientists and others today often collect samples of curves and other functional data. Multivariate classification methods cannot be used directly for functional data classification because of the curse of dimensionality and the difficulty of taking into account the correlation and ordering of functional observations. We extend the kernel-induced random forest method to discriminate functional data by defining kernel functions of two curves. The method is demonstrated by classifying the phoneme data and temporal gene expression data. The simulation study and applications show that the kernel-induced random forest method improves classification accuracy over the available methods. The kernel-induced random forest method is easy for naive users to implement. It is also appealing in its flexibility, allowing users to choose different curve estimation methods and appropriate kernel functions.

Key words: Functional principal component analysis; Gene expression; Penalized spline smoothing; Phoneme data; Waveform data.
1. Introduction

Classification and discriminant analysis methods have grown in depth during the past 20 years. Fisher linear discriminant analysis (LDA) is the basic but standard approach. As the structure and dimension of the data become more complex in a wide range of applications, such as functional data (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006), there is a need for more flexible nonparametric classification and discriminant analysis tools, especially when the ratio of the learning sample size to the number of covariates is low, the covariates are highly correlated and the covariance matrix is highly degenerate, or when a large number of covariates are individually weak in predicting the class labels. Another nonparametric approach for multivariate data classification is CART (Classification and Regression Trees), proposed by Breiman et al. (1984) and later developed in Breiman (1996, 2001). The data space is partitioned sequentially using a set of nonparametric splitting rules, and simple classification rules are then estimated within the subspaces defined by the tree nodes. Although a single tree model is still less powerful in many hard classification problems, ensembles of trees such as bagging (Breiman, 1996) and random forests (Breiman, 2001) are powerful nonparametric classification and prediction tools. But the random forests approach cannot be used directly for functional data classification. Some alternative approaches have been proposed to improve LDA for functional data classification, such as the flexible discriminant analysis (Hastie et al., 1994) and the penalized discriminant analysis approach (Hastie et al., 1995). Marx and Eilers (1999) developed a generalized linear regression approach that takes the functional nature of the data into account. Hall et al. (2001) proposed a nonparametric curve discrimination method based on functional principal component analysis. Ferraty and Vieu (2003) proposed a nonparametric curves discrimination (NPCD) method that accounts for both regularity and nonlinearity in the classification. When the observations for individual curves are sparse, such
as the time-course gene expression data, Leng and Müller (2006) used functional principal component analysis to estimate the curves, and then functional logistic regression was used for classification. Fan (2009) proposed the kernel-induced random forest (KIRF) method for the multivariate classification problem. Kernel-induced classification trees are built using kernels of pairs of training observations as candidate splitting rules. The tree algorithm is applied to these kernels recursively to build one large tree. Using this idea, if we apply a random vector in choosing these kernelized splitting rules, as suggested in Breiman (2001), a kernel-induced random forest can also be constructed for classification purposes. Fan (2009) showed that KIRF has some advantages compared with regular random forests. First, nonlinear relationships in the data can be discovered recursively and locally during the tree search. Second, since each observation can be used in the kernel splitting rule, there is potentially more variability among the trees in the forest, and the voting power can be strengthened. Finally, the kernel-induced random forest procedure can be used with various types of data as long as appropriate kernels are defined, whereas traditional random forests are limited in their power in this sense. We extend KIRF to functional data classification by defining kernels for functional data. The curves are estimated by nonparametric smoothing methods, such as the penalized spline smoothing method (Green and Silverman, 1994; Gu, 2002) or functional principal component analysis (Besse et al., 1997; Yao et al., 2005a; Zhou et al., 2008). The KIRF method is then applied to carry out the classification by defining kernel functions of two curves. This method is flexible in allowing users to choose different smoothing methods and appropriate kernel functions. The rest of the article is organized as follows. Section 2 introduces the KIRF method for functional data classification. This method is then demonstrated with the real phoneme data
in Section 3. We compare the KIRF method with nine other methods on the simulated waveform data in Section 4. Conclusions and discussion are given in Section 5.
2. Method

Suppose we have noisy observations for m curves:
\[
y_{ij} = x_i(t_{ij}) + \epsilon_{ij}, \qquad (1)
\]
where x_i(t), i = 1, ..., m, are unknown smooth curves, t_{ij}, j = 1, ..., n_i, are discrete points with observations in the interval [0, T], and ε_{ij} are the measurement errors with mean 0 and variance σ_{ij}^2. Our objective is to classify these m curves using kernel-induced random forests. Before defining the kernel of two curves, the curves have to be estimated from the noisy data using nonparametric estimation methods. Below are two curve estimation methods, for dense and sparse functional data, respectively.
2.1 Curve Estimation from Dense Functional Data

When the number of observations for individual curves is large, each curve x_i(t) can be estimated with the penalized spline smoothing method, which is introduced as follows. The curve x_i(t) is approximated by a linear combination of K_i basis functions:
\[
x_i(t) = \sum_{k=1}^{K_i} c_{ik}\,\phi_{ik}(t) = \mathbf{c}_i^{T} \boldsymbol{\phi}_i(t),
\]
where φ_i(t) = (φ_{i1}(t), ..., φ_{iK_i}(t))^T is a vector of basis functions, and c_i = (c_{i1}, ..., c_{iK_i})^T is the corresponding vector of basis coefficients. Cubic B-spline basis functions (De Boor, 2001) are chosen here; they are piecewise cubic polynomials tied together so that they are continuous in value, as well as in the first and second derivatives. The joining points are known as the knots of the spline. Each cubic B-spline basis function is zero over the whole range except on a short interval where it is positive, which is called the compact support property. This property is essential for computational efficiency, because sparse matrix calculations can be used.
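A minimal sketch, not from the paper, illustrating the compact support of cubic B-spline basis functions with SciPy; the knot grid and the number of basis functions are illustrative choices.

```python
import numpy as np
from scipy.interpolate import BSpline

T = 1.0                                      # observation interval [0, T]
degree = 3                                   # cubic B-splines
interior = np.linspace(0.0, T, 11)           # knots at the measurement times (here a grid)
knots = np.concatenate(([0.0] * degree, interior, [T] * degree))
K = len(knots) - degree - 1                  # number of basis functions K_i

tgrid = np.linspace(0.0, T, 200)
Phi = np.empty((tgrid.size, K))              # basis matrix with Phi[j, k] = phi_k(t_j)
for k in range(K):
    coef = np.zeros(K)
    coef[k] = 1.0                            # pick out the k-th basis function
    Phi[:, k] = BSpline(knots, coef, degree)(tgrid)

# Compact support: each basis function is nonzero on at most 4 adjacent knot
# intervals, so each column of Phi has only a short run of nonzero entries.
print((np.abs(Phi) > 1e-12).sum(axis=0))
```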
The cubic B-splines are determined by the number and locations of the knots, and it is not easy to find the optimal knot sequence (Hastie et al., 2001; Durban et al., 2004). Instead, we use the penalized spline smoothing method, in which one knot is placed at each distinct point with measurements, and a roughness penalty term is used to control the roughness of the fitted function. The basis coefficient vector c_i is estimated by minimizing the penalized sum of squared errors (PENSSE) loss function,
\[
\mathrm{PENSSE}(\mathbf{c}_i) = \sum_{j=1}^{n_i} \left[ y_{ij} - x_i(t_{ij}) \right]^2 + \lambda \int_{t_{i1}}^{t_{i n_i}} \left[ \frac{d^2 x(s)}{ds^2} \right]^2 ds, \qquad (2)
\]
where the second term penalizes the roughness of the fitted function. The smoothing parameter λ determines the trade-off between the fit to the data and the smoothness of the fitted function. Generalized cross-validation (GCV), proposed by Craven and Wahba (1979), is one of the most popular methods for choosing smoothing parameters. The estimate of the basis coefficient vector c_i is obtained by setting the derivative of (2) with respect to c_i to zero and solving for c_i:
\[
\hat{\mathbf{c}}_i = \left( \Phi_i^{T} \Phi_i + \lambda \int_{t_{i1}}^{t_{i n_i}} \frac{d^2 \boldsymbol{\phi}_i(s)}{ds^2}\, \frac{d^2 \boldsymbol{\phi}_i^{T}(s)}{ds^2}\, ds \right)^{-1} \Phi_i^{T} \mathbf{y}_i,
\]
where Φ_i is an n_i by K_i basis matrix whose rows contain the values of the K_i basis functions at t_{ij}, and y_i = (y_{i1}, ..., y_{in_i})^T is the n_i-dimensional observation vector. The fitted curve is
\[
\hat{x}_i(t) = \hat{\mathbf{c}}_i^{T} \boldsymbol{\phi}_i(t). \qquad (3)
\]
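A minimal sketch of the penalized spline estimate in (3), assuming the cubic B-spline basis above, one noisy curve y observed at times tobs, and a smoothing parameter lam fixed by hand (the paper selects it by GCV).

```python
import numpy as np
from scipy.interpolate import BSpline

def basis_matrix(knots, degree, x, deriv=0):
    """Evaluate all B-spline basis functions (or a derivative) at the points x."""
    K = len(knots) - degree - 1
    B = np.empty((x.size, K))
    for k in range(K):
        coef = np.zeros(K)
        coef[k] = 1.0
        spl = BSpline(knots, coef, degree)
        B[:, k] = spl.derivative(deriv)(x) if deriv > 0 else spl(x)
    return B

rng = np.random.default_rng(0)
T, degree = 1.0, 3
tobs = np.linspace(0.0, T, 50)
y = np.sin(2 * np.pi * tobs) + 0.1 * rng.standard_normal(tobs.size)  # toy noisy curve

knots = np.concatenate(([0.0] * degree, tobs, [T] * degree))  # a knot at each time point
Phi = basis_matrix(knots, degree, tobs)                       # n_i x K_i basis matrix

# Roughness penalty matrix R = int phi''(s) phi''(s)^T ds, by a simple Riemann sum.
sgrid = np.linspace(0.0, T, 1000)
ds = sgrid[1] - sgrid[0]
D2 = basis_matrix(knots, degree, sgrid, deriv=2)
R = D2.T @ D2 * ds

lam = 1e-4                                                    # illustrative value
chat = np.linalg.solve(Phi.T @ Phi + lam * R, Phi.T @ y)      # estimate in (3)
xhat = BSpline(knots, chat, degree)                           # fitted curve x_i(t)
```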
We then define the distance, or kernel, of two curves. For example, one kernel function of two curves is
\[
\kappa(x_i(t), x_\ell(t)) = \int \left[ x_i(t) - x_\ell(t) \right]^2 dt. \qquad (4)
\]
In some cases, when the shape of the curves is the feature of interest for classification, the kernel function may be defined in terms of the p-th derivatives of the curves:
\[
\kappa(x_i(t), x_\ell(t)) = \int \left[ \frac{d^p x_i(t)}{dt^p} - \frac{d^p x_\ell(t)}{dt^p} \right]^2 dt.
\]
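A minimal sketch, under the assumptions above, of the kernel (4) and its p-th derivative version, computed by numerical integration of two fitted spline curves; the coefficient vectors chat_i and chat_l are hypothetical outputs of the smoothing step.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.interpolate import BSpline

def curve_kernel(spl_i, spl_l, a, b, p=0, ngrid=500):
    """kappa(x_i, x_l) = int_a^b [x_i^(p)(t) - x_l^(p)(t)]^2 dt."""
    t = np.linspace(a, b, ngrid)
    fi = spl_i.derivative(p)(t) if p > 0 else spl_i(t)
    fl = spl_l.derivative(p)(t) if p > 0 else spl_l(t)
    return trapezoid((fi - fl) ** 2, t)

# Usage with two fitted curves (coefficient vectors chat_i, chat_l, as above):
# k_level = curve_kernel(BSpline(knots, chat_i, 3), BSpline(knots, chat_l, 3), 0.0, 1.0)
# k_shape = curve_kernel(BSpline(knots, chat_i, 3), BSpline(knots, chat_l, 3), 0.0, 1.0, p=2)
```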
2.2 Curve Estimation from Sparse Functional Data

When the number of observations or measurements for individual curves is small, it is hard to estimate each curve with the penalized spline smoothing method using only its own observations. Yao et al. (2005a) proposed to use functional principal component analysis (FPCA) via the Principal Analysis by Conditional Estimation (PACE) algorithm. PACE does not pre-smooth the individual trajectories, which would be problematic when functional data are sparsely sampled. Instead, PACE pools the observations from all curves together to estimate the covariance function, and then obtains the principal components. It has been shown to analyze sparse functional data efficiently (Yao et al., 2005b). FPCA is introduced briefly below. Suppose the m curves are independent realizations of a square integrable stochastic process X(t) on [0, T] with mean E[X(t)] = μ(t) and covariance function cov[X(t), X(s)] = G(s, t). The covariance function has an orthogonal expansion in L^2([0, T]) by Mercer's theorem:
\[
G(s, t) = \sum_{g} \lambda_g \rho_g(s) \rho_g(t),
\]
where the eigenvalues λ_g, g = 1, 2, ..., are ordered by size, λ_1 > λ_2 > ..., and ρ_g(t) are the associated eigenfunctions. We call the ρ_g(t) functional principal components (FPCs). Each individual curve then has the Karhunen-Loève representation
\[
x_i(t) = \mu(t) + \sum_{g} c_{ig} \rho_g(t),
\]
where the c_{ig} are called FPC scores and are calculated as
\[
c_{ig} = \int_0^{T} \left( x_i(t) - \mu(t) \right) \rho_g(t)\, dt.
\]
The mean function μ(t) is estimated with any nonparametric smoothing method by pooling all data together. We usually choose a small number of FPCs, say M, which explain most of the total variation of the curves. Yao et al. (2005a) discussed in detail how to choose the number of FPCs. The covariance function G(s, t) is estimated by two-dimensional
nonparametric smoothing of the empirical covariances
\[
C(s_k, s_l) = \frac{1}{m} \sum_{i=1}^{m} \left[ \left( x_i(s_k) - \hat{\mu}(s_k) \right) \left( x_i(s_l) - \hat{\mu}(s_l) \right) \right],
\]
where s_k, k = 1, ..., n_grid, form a dense grid on [0, T]. A spectral analysis is performed on the estimated covariance matrix, and we obtain the first M eigenvalues and eigenfunctions. The number M can be chosen such that the fraction of variance explained (FVE) exceeds some threshold. FVE is calculated as
\[
\mathrm{FVE}(J) = \frac{\sum_{k=1}^{J} \hat{\lambda}_k}{\sum_{k=1}^{n_{\mathrm{grid}}} \hat{\lambda}_k}, \qquad J = 1, \ldots, n_{\mathrm{grid}}.
\]
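A minimal sketch (illustrative, not the PACE implementation) of the spectral step: given a smoothed covariance matrix Ghat evaluated on a dense grid sgrid, obtain eigenvalues and eigenfunctions and choose M by the fraction of variance explained.

```python
import numpy as np

def choose_fpcs(Ghat, sgrid, threshold=0.95):
    ds = sgrid[1] - sgrid[0]
    evals, evecs = np.linalg.eigh(Ghat)              # symmetric eigendecomposition
    order = np.argsort(evals)[::-1]                  # order eigenvalues by size
    evals = np.clip(evals[order], 0.0, None) * ds    # discretized operator eigenvalues
    efuns = evecs[:, order] / np.sqrt(ds)            # eigenfunctions rho_g on the grid
    fve = np.cumsum(evals) / np.sum(evals)           # FVE(J), J = 1, ..., ngrid
    M = int(np.searchsorted(fve, threshold)) + 1     # smallest J with FVE(J) >= threshold
    return M, evals[:M], efuns[:, :M]
```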
The FPC scores are obtained by numerical quadrature, for example, Simpson's rule (Burden and Douglas, 2000). A Matlab package implementing the PACE algorithm has been developed by Professor Hans-Georg Müller and his collaborators, and can be downloaded from the website http://anson.ucdavis.edu/~wyang/PACE/ . We define the kernel function of two curves as
\[
\kappa(x_i(t), x_\ell(t)) = \sum_{g=1}^{M} \left( c_{ig} - c_{\ell g} \right)^2. \qquad (5)
\]
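A minimal sketch of the FPC-score kernel in (5); the scores are computed here by trapezoid-rule quadrature in place of Simpson's rule, with the curves, mean, and eigenfunctions all assumed to be evaluated on a common grid sgrid.

```python
import numpy as np
from scipy.integrate import trapezoid

def fpc_scores(curves, mu, efuns, sgrid):
    """curves: (m, ngrid); mu: (ngrid,); efuns: (ngrid, M). Returns (m, M) FPC scores."""
    centered = curves - mu
    return np.column_stack([trapezoid(centered * efuns[:, g], sgrid, axis=1)
                            for g in range(efuns.shape[1])])

def score_kernel(c_i, c_l):
    """kappa(x_i, x_l) = sum_g (c_ig - c_lg)^2, as in (5)."""
    return float(np.sum((c_i - c_l) ** 2))
```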
2.3 General Kernel Functions

Besides the kernel functions in the two subsections above, there is a large number of kernel functions to choose from, which are introduced as follows. For a q-dimensional vector x_i = (x_{i1}, ..., x_{iq}) ∈ R^q, Schölkopf and Smola (2002) and Shawe-Taylor and Cristianini (2004) define the inner product in R^q as
\[
\langle \mathbf{x}_i, \mathbf{x}_j \rangle = \sum_{k=1}^{q} x_{ik} x_{jk}.
\]
Now suppose that the x_i are not vectors, but rather functions with values x_i(t). The natural functional counterpart of the above inner product is
\[
\langle x_i(t), x_j(t) \rangle = \int x_i(t)\, x_j(t)\, dt.
\]
Let ν : W_1 → W_2 be a linear or nonlinear mapping from the input space to an inner product feature space, such that x_i(t) ↦ ν(x_i(t)), where W_1 and W_2 are two functional spaces. A kernel is a function κ such that, for all x_i(t) ∈ W_1,
\[
\kappa(x_i(t), x_j(t)) = \langle \nu(x_i(t)), \nu(x_j(t)) \rangle.
\]
One advantage of kernel methods is that the mapped feature function ν(x_i(t)) need not be evaluated explicitly once the kernel function is defined. Here we give a few examples of kernels. A polynomial kernel with degree d > 2 is
\[
\kappa(x_i(t), x_j(t)) = \left( \langle x_i(t), x_j(t) \rangle + c \right)^d,
\]
where c is a tuning parameter. Another popular kernel function is the Gaussian kernel
\[
\kappa(x_i(t), x_j(t)) = \exp\!\left( - \| x_i(t) - x_j(t) \|^2 / (2\sigma^2) \right),
\]
where σ^2 is a tuning parameter, and ||x_i(t) − x_j(t)||^2 = ⟨x_i(t) − x_j(t), x_i(t) − x_j(t)⟩. The Gaussian kernel is also called a radial basis function kernel.
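A minimal sketch of the polynomial and Gaussian kernels for curves represented by their values on a common time grid tgrid; the tuning parameters c, d, and sigma2 are user-chosen and shown here only for illustration.

```python
import numpy as np
from scipy.integrate import trapezoid

def inner(xi, xj, tgrid):
    """<x_i, x_j> = int x_i(t) x_j(t) dt, by the trapezoid rule."""
    return trapezoid(xi * xj, tgrid)

def poly_kernel(xi, xj, tgrid, c=1.0, d=3):
    """Polynomial kernel (<x_i, x_j> + c)^d."""
    return (inner(xi, xj, tgrid) + c) ** d

def gaussian_kernel(xi, xj, tgrid, sigma2=1.0):
    """Gaussian (radial basis function) kernel exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    diff = xi - xj
    return np.exp(-inner(diff, diff, tgrid) / (2.0 * sigma2))
```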
2.4 Kernel-Induced Random Forests

After defining the kernel of two curves, the kernel-induced random forest can be used for functional data classification. Each curve x_i(t) in the learning set can interact with other curves via the kernel function κ(x_i(t), ·). The value of this kernel function is uniquely determined by the specified curve x_i(t) and an input curve z(t). We can view such an observation-based kernel function as a new feature, defined as κ_i(z(t)) = κ(x_i(t), z(t)). If we have m curves in the learning set, then we have m such kernel-induced features, κ_i(·), i = 1, ..., m. The kernel-induced split rule is then “κ(x_i(t), ·)