Computational Statistics & Data Analysis 51 (2007) 4984–4993
Classifying densities using functional regression trees: Applications in oceanology

David Nerini (Centre d'Océanologie de Marseille, UMR LMGEM 6117 CNRS, Campus de Luminy, Case 901, 13288 Marseille Cedex 09, France)
Badih Ghattas (Institut de Mathématiques de Luminy, Campus de Luminy, Case 907, 13288 Marseille Cedex 09, France)
Available online 17 October 2006
Abstract

The problem of building a regression tree is considered when the response variable is a probability density function. Splitting criteria well adapted to measuring the dissimilarity between densities are proposed using Csiszár's f-divergence. The performances of trees constructed with the various criteria are compared through numerical simulations. A tree is then constructed to predict the size distribution of a zooplankton community from a set of explanatory environmental variables. Functional PCA is used to interpret the main modes of variation of the size spectra around the predicted density in each terminal node. Finally, a bagging procedure is used to increase the accuracy of the tree-based model.

Keywords: Functional regression tree; Functional PCA; Zooplankton; Csiszár's f-divergence; Kullback–Leibler divergence; Hellinger distance
1. Introduction

Zooplankton communities are good indicators of environmental changes such as global warming or the fast increase in CO2 partial pressure in the atmosphere (Beaugrand et al., 2002). The motivation of this work comes from the observation that biologists need statistical tools to understand the relations between zooplankton communities and environmental forcing variables such as temperature, salinity, phytoplankton biomass or nutrient concentration. Moreover, a proper method must account for the fact that the response variable is functional, since a zooplankton community is nowadays commonly analyzed through the size structure of its individuals (Fig. 1).

Much substantial recent work has been carried out in functional data analysis, involving various fields of applied statistics (Ramsay and Silverman, 1997, 2003). Particular attention has been devoted to the regression of a real response variable Y given some functional explanatory variables, in both parametric and nonparametric frameworks (Cardot et al., 1999; Ferraty and Vieu, 2004, 2006). The problem we face concerns the forecasting of a density Y given a vector X of explanatory variables whose components may be real, discrete or both. More precisely, our aim is to estimate a regression function where Y is a random function taking values in a space of infinite dimension. The estimation of the regression function is tackled through an extension of regression trees (CART, Breiman et al., 1984; Hastie et al., 2001).
Fig. 1. Zooplankton size spectra (density vs. log(size) (µm)). Each curve is a density which arises from a digital imaging system designed for estimating the size distribution of marine zooplankton net samples (Grosjean et al., 2004).
Regression trees are statistical models concerned with the prediction of a real response variable Y given a set of explanatory variables X. Starting from a set of n i.i.d. observations $\{y_i, x_i\}_{i=1,\ldots,n}$ of $(Y, X) \in \mathbb{R} \times \mathcal{X}$, they are constructed by partitioning the X space into a set of hypercubes and fitting a simple model (a constant) for Y in each of these regions. The model takes the form of a binary tree containing q terminal nodes (the regions) where a predicted value of Y is given. The construction of a tree-based model, the selection of the splits and the measure of accuracy rely on the following criterion, called deviance, defined for each node r of the tree as

$$\hat{R}(r) = \sum_{x_i \in r} (y_i - \bar{y}_r)^2, \qquad (1)$$
where $\bar{y}_r$ is the average of the observations of Y belonging to the region r. These models have been widely studied in machine learning and applied statistics and present many advantages, such as their representation in the form of a binary tree, their ability to work in high dimension, and variable ranking.

The first extension of regression trees to a multivariate response variable was introduced by Segal (1992) for the case of longitudinal data. He considered the observations of the response variable Y, a curve sampled on an equally spaced grid, as a vector $y_i = (y_{i1}, \ldots, y_{im})'$. The generalization of criterion (1) used to select the splits is

$$\hat{R}(r) = \sum_{x_i \in r} (y_i - \bar{y}_r)' \hat{\Sigma}_r^{-1} (y_i - \bar{y}_r),$$

where $\bar{y}_r$ is the $m \times 1$ mean response vector for observations within node r and $\hat{\Sigma}_r$ the covariance matrix of the responses for region r. The main difficulty arising with this extension of the deviance is that we must ensure that, when splitting a node r into two subnodes $r_1$ and $r_2$, the total within-subnode deviance is lower than the parent node deviance, that is

$$\hat{R}(r_1) + \hat{R}(r_2) \leq \hat{R}(r). \qquad (2)$$

To satisfy this constraint, the author assumed that the covariance of the observations within the parent node is the same as the covariance of the observations within both subnodes. Besides this assumption, he also assumed that each realization of Y may be modelled by an autoregressive AR(1) model.
This second assumption, although strong, considerably simplifies the form of the covariance matrix, and hence its estimation within each node.

A second approach, proposed by Yu and Lambert (1999), consists in applying a principal component analysis to the response matrix before constructing the tree. Retaining a suitable number of principal components, a tree model is fitted, taking advantage of the dimensionality reduction and of the filtering step provided by the PCA. In this way, the stability of the predictor is improved. Here, the criterion used to build the tree is

$$\hat{R}(r) = \sum_{x_i \in r} (\xi_i - \bar{\xi}_r)'(\xi_i - \bar{\xi}_r),$$

where $\xi_i$ denotes the vector of principal component scores, whose dimension equals the number of principal components retained for the analysis, and $\bar{\xi}_r$ is the mean score vector for observations belonging to region r. Yu and Lambert (1999) proposed another interesting way to construct trees: each response curve is represented as a linear combination of known basis functions and the tree is grown using the coefficients of this expansion. The criterion is then estimated as

$$\hat{R}(r) = \sum_{x_i \in r} (c_i - \bar{c}_r)' \Phi (c_i - \bar{c}_r), \qquad (3)$$

where $c_i$ is the vector of coefficients of the expansion in the chosen basis, $\bar{c}_r$ the mean vector of the coefficients belonging to region r, and $\Phi$ the matrix of inner products of the basis functions. With a spline basis, this method settles the problem of irregular sampling.
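For illustration, the following is a minimal R sketch of how criterion (3) could be evaluated, assuming the curves are sampled on a common, equally spaced grid and expanded in a B-spline basis; the function names, the basis dimension and the toy data are ours, not a transcription of the authors' code.

```r
## Sketch of criterion (3): deviance of the basis-expansion coefficients within a node.
## Assumes curves sampled on a common, equally spaced grid tg; names are illustrative.
library(splines)

expand_curves <- function(Y, tg, df = 12) {
  B <- bs(tg, df = df, intercept = TRUE)   # B-spline basis evaluated on the grid
  t(qr.solve(B, t(Y)))                     # n x df least-squares coefficients
}

gram_matrix <- function(tg, df = 12) {
  B <- bs(tg, df = df, intercept = TRUE)
  dt <- tg[2] - tg[1]
  crossprod(B) * dt                        # Phi[k, l] ~ int B_k(t) B_l(t) dt
}

coef_deviance <- function(C, Phi) {
  D <- sweep(C, 2, colMeans(C))            # centred coefficients c_i - cbar_r
  sum((D %*% Phi) * D)                     # sum_i (c_i - cbar_r)' Phi (c_i - cbar_r)
}

## Toy usage: 20 shifted Gaussian densities on a common grid
tg <- seq(-4, 4, length.out = 101)
Y <- t(sapply(rnorm(20, 0, 0.3), function(m) dnorm(tg, mean = m)))
coef_deviance(expand_curves(Y, tg), gram_matrix(tg))
```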
We propose in this paper splitting criteria based on Csiszár's f-divergence, which are more suitable for measuring dissimilarities between densities, and we show that they satisfy the right properties for constructing optimal trees. The performances of trees constructed with the different splitting criteria, including $L^2$-distances, are compared through simulated numerical experiments. Next, a case study in oceanology is conducted on the zooplankton size spectra, and functional PCA (Ramsay and Silverman, 1997) is used to support the interpretation of the tree. Finally, we build an aggregated model in order to improve the predictive capabilities of the tree-based model.

2. Functional regression trees

Let $L = \{(y_i, x_i), i = 1, \ldots, n\}$ be a learning sample of n i.i.d. observations such that each $y_i$ is an observation of the functional random response variable Y valued in a Hilbert space on support $[a, b] \subset \mathbb{R}$, $H = L^2([a, b])$. Let $\langle f, g \rangle$ denote the usual inner product of functions f and g in H, defined by

$$\langle f, g \rangle = \int_a^b f(t)\, g(t)\, dt,$$
and let $\|\cdot\|$ denote the norm attached to this inner product. The p-variate vectors $x_i = (x_{i1}, \ldots, x_{ip})'$ are observations of the random vector $X = (X_1, \ldots, X_p)$ that we suppose, for the sake of simplicity, to take values in $\mathbb{R}^p$. The objective is to estimate the value of the response function Y, for an observation of X, by designing a regression function on L,

$$f_L(x) = E(Y \mid X = x) = \sum_{j=1}^{q} f_j\, I(X \in r_j),$$

where $f_j$ is a function belonging to H, the regions $r_j$, $j = 1, \ldots, q$, are $\mathbb{R}^p$-polytopes with boundaries parallel to the axes and form a partition of $\mathbb{R}^p$, q is the number of regions and $I(\cdot)$ is the indicator function. The estimate $\hat{f}_L(\cdot)$ of $f_L(\cdot)$ is computed by recursive partitioning as in CART (Breiman et al., 1984), giving rise to a binary tree T whose terminal nodes are the regions $r_j$, $j = 1, \ldots, q$ (see Fig. 6 for an example). The function $f_j$ is estimated by the mean function $\bar{y}_j = (1/n_j) \sum_{x_i \in r_j} y_i$ of the $n_j$ observations of the response function Y within the region $r_j$.
This leads to

$$\hat{f}_L(X) = \sum_{j=1}^{q} \bar{y}_j\, I(X \in r_j).$$

The accuracy of the tree T is measured by the overall mean squared error

$$\hat{R}(T) = \frac{1}{n} \sum_{j=1}^{q} \hat{R}(r_j), \qquad (4)$$

where

$$\hat{R}(r_j) = \sum_{x_i \in r_j} \|y_i - \bar{y}_j\|^2 \qquad (5)$$

is the estimated deviance within the region $r_j$. This criterion is also used to measure the heterogeneity between curves belonging to any node r when constructing the tree, and it satisfies property (2). The estimate (4) is known to be optimistic and can instead be computed over a test sample or by cross-validation.
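As a brief illustration, the sketch below approximates the norm in (5) by a Riemann sum over a common, equally spaced grid; the variable names are ours and only indicate one possible implementation.

```r
## Sketch of the L2 node deviance (5) for curves evaluated on a common, equally
## spaced grid tg; Y_node is the n_r x length(tg) matrix of curves within node r.
l2_deviance <- function(Y_node, tg) {
  dt <- tg[2] - tg[1]
  ybar <- colMeans(Y_node)               # pointwise mean curve (estimate of f_j)
  sum(sweep(Y_node, 2, ybar)^2) * dt     # sum_i int (y_i(t) - ybar(t))^2 dt
}

## The overall error (4) is then the sum of the node deviances divided by n, e.g.
## R_hat_T <- sum(sapply(nodes, function(idx) l2_deviance(Y[idx, ], tg))) / n
## where "nodes" would be a list of row indices, one per terminal node (hypothetical).
```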
3. Dealing with densities

We suggest here a more suitable criterion for constructing functional trees when the response variable is a density function. Let $(\Omega, \mathcal{T}, \mu)$ be a measure space where $\mu$ is a $\sigma$-finite measure on $(\Omega, \mathcal{T})$. Let $\mu_i$ and $\mu_j$ be two probability measures on $(\Omega, \mathcal{T})$ dominated by $\mu$. Denote the densities $y_i = d\mu_i / d\mu$ and $y_j = d\mu_j / d\mu$. The quantity

$$C_f(y_i, y_j) = \int_{\Omega} y_j\, f\!\left(\frac{y_i}{y_j}\right) d\mu \qquad (6)$$

is called the f-divergence of the probability measures $\mu_i$ and $\mu_j$ (Csiszár, 1967). The quantity $C_f(y_i, y_j)$ is a versatile functional form which provides many types of information divergence measures (Anastassiou, 2005). The most common choices of f are restricted to strictly convex functions from $[0, +\infty[$ to $\mathbb{R}$ such that $f(1) = 0$, so that $C_f(y_i, y_i) = 0$. Convexity ensures that the divergence measure $C_f(y_i, y_j)$ is nonnegative. The f-divergence is in general asymmetric in $\mu_i$ and $\mu_j$. Furthermore, (6) does not depend on the choice of the dominating measure $\mu$. For a node r, we introduce the following splitting criterion:

$$\hat{R}_f(r) = \sum_{x_i \in r} \sum_{x_j \in r} C_f(y_i, y_j). \qquad (7)$$
According to the properties of (6), this criterion is positive and still satisfies (2); it can therefore be used as a deviance for tree construction. We consider here two particular cases of (6) for which the computation is tractable. The Kullback–Leibler divergence is obtained with $f(u) = u \ln u$:

$$K(y_i, y_j) = \int y_i(t) \ln \frac{y_i(t)}{y_j(t)}\, dt.$$

The choice of $f(u) = (\sqrt{u} - 1)^2$ produces the Hellinger distance:

$$H^2(y_i, y_j) = \int \left(\sqrt{y_i(t)} - \sqrt{y_j(t)}\right)^2 dt.$$

As the Hellinger distance fulfills the properties of a Euclidean distance, the criterion (7) can be written as

$$\hat{R}_H(r) = \sum_{x_i \in r} H^2(y_i, \bar{y}_r),$$

where $\bar{y}_r = \left(\dfrac{1}{n_r} \sum_{i=1}^{n_r} \sqrt{y_i}\right)^2$.
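A minimal numerical sketch of (6) and of these two special cases is given below, approximating the integrals by Riemann sums on a common grid and assuming strictly positive densities with a common support; all names are illustrative and do not reproduce the authors' implementation.

```r
## Sketch of the f-divergence (6) and its Kullback-Leibler and Hellinger cases,
## for densities evaluated on a common, equally spaced grid tg. The densities are
## assumed strictly positive on the grid (required for the ratio y_i / y_j).
f_divergence <- function(yi, yj, f, tg) {
  dt <- tg[2] - tg[1]
  sum(yj * f(yi / yj)) * dt                 # C_f(y_i, y_j) = int y_j f(y_i / y_j) dt
}

kl_div <- function(yi, yj, tg) f_divergence(yi, yj, function(u) u * log(u), tg)
hell2  <- function(yi, yj, tg) f_divergence(yi, yj, function(u) (sqrt(u) - 1)^2, tg)

## Hellinger node deviance: distances to the node "mean" density (mean of sqrt(y_i))^2
hellinger_deviance <- function(Y_node, tg) {
  dt <- tg[2] - tg[1]
  ybar <- colMeans(sqrt(Y_node))^2
  sum(apply(Y_node, 1, function(y) sum((sqrt(y) - sqrt(ybar))^2) * dt))
}
```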
A functional regression tree can then be grown using (7) to select the splits. The accuracy of the predictor is still estimated using the overall mean squared error (4). The computation of both the criteria and the measure of accuracy is handled through classical numerical integration methods and is easily tractable when the densities are strictly positive and share the same support. As will be shown in the next section, when the densities are not available, nonparametric estimation can be carried out from the samples at hand.

4. Simulations

Numerical simulations are carried out in order to illustrate the construction of a tree using (7) as splitting criterion. Let $X = (X_1, X_2, X_3, X_4)$ be a random vector in $\mathbb{R}^4$ whose components $X_j$, $j = 1, \ldots, 4$, follow a uniform distribution on $[-1, 1]$. Consider the following theoretical model, which can be displayed in the form of a binary decision tree (Fig. 2):

$$Y \sim \begin{cases} \text{I:} & N(0, 1) & \text{if } X_4 > 0.5 \text{ and } X_3 > 0.5,\\ \text{II:} & N(0, 1.5) & \text{if } X_4 > 0.5 \text{ and } X_3 < 0.5,\\ \text{III:} & \alpha N(1, 0.5) + (1 - \alpha) N(-1, 0.5) & \text{if } X_4 < 0.5 \text{ and } X_3 > 0.5,\\ \text{IV:} & (1 - \alpha) N(1, 0.5) + \alpha N(-1, 0.5) & \text{if } X_4 < 0.5 \text{ and } X_3 < 0.5, \end{cases} \qquad (8)$$

where $N(\mu, \sigma^2)$ is the Gaussian probability density function with mean $\mu$ and variance $\sigma^2$, and $0 < \alpha < 1$ is a mixture coefficient fixed to $\alpha = 2/5$. This model defines four regions (I, II, III and IV), each associated with a response density function. Notice that the model only depends on rules involving $X_3$ and $X_4$.

Suppose that we randomly draw n i.i.d. observations $x_i$, $i = 1, \ldots, n$, of X. Following the rules in (8), each observation falls into one of the regions I, II, III or IV. Suppose now that observation $x_i$ belongs to region I. The corresponding observation $y_i$ of the response variable is obtained by kernel density estimation over a sample of size s randomly drawn from the distribution fixed for region I (in this case, N(0, 1)). The bandwidths are computed by cross-validation, but other choices can be relevant as well (Simonoff, 1996). In our case this choice matters little, since the variability between curves belonging to the same region in the set $y_i$, $i = 1, \ldots, n$, is governed by the choice of s. We thus dispose of a learning sample L constituted of kernel density curves and the four explanatory variables. We fixed n = 200 and s = 150. A tree T is constructed and pruned with 20-fold cross-validation using the Kullback–Leibler divergence as splitting criterion.
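The sketch below gives one possible implementation of this simulation design in R; the grid, the seed and the use of bw.ucv for the cross-validated bandwidth are our choices, not a transcription of the authors' code.

```r
## Sketch of the simulation design (8): draw X on [-1, 1]^4, assign each observation
## to a region, draw s points from that region's density and smooth them by kernel
## density estimation on a common grid. N(mu, sigma^2) is parameterized by variance.
set.seed(1)
n <- 200; s <- 150; alpha <- 2 / 5
tg <- seq(-5, 5, length.out = 201)

rmix <- function(s, p) {                      # p * N(1, 0.5) + (1 - p) * N(-1, 0.5)
  comp <- rbinom(s, 1, p)
  rnorm(s, mean = ifelse(comp == 1, 1, -1), sd = sqrt(0.5))
}

X <- matrix(runif(4 * n, -1, 1), n, 4)
region <- ifelse(X[, 4] > 0.5,
                 ifelse(X[, 3] > 0.5, 1, 2),  # regions I and II
                 ifelse(X[, 3] > 0.5, 3, 4))  # regions III and IV

draw_sample <- function(r) switch(r,
  rnorm(s, 0, 1),                             # I:   N(0, 1)
  rnorm(s, 0, sqrt(1.5)),                     # II:  N(0, 1.5)
  rmix(s, alpha),                             # III: alpha N(1,.5) + (1-alpha) N(-1,.5)
  rmix(s, 1 - alpha))                         # IV:  (1-alpha) N(1,.5) + alpha N(-1,.5)

Y <- t(sapply(region, function(r) {
  smp <- draw_sample(r)
  density(smp, bw = bw.ucv(smp), from = min(tg), to = max(tg), n = length(tg))$y
}))
## Y (n x 201 matrix of density curves) and X (n x 4) together form the sample L.
```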
Fig. 2. Theoretical functional decision tree. Four probability density functions are conditioned by a sequence of nested rules constructed with four explanatory variables. The sequence of binary rules only concerns X3 and X4 .
Fig. 3. Functional regression tree with simulated densities (fitted splits: X4 ≥ 0.5116, X3 ≥ 0.5229, X3 ≥ 0.4973, X4 ≥ −0.2295). The black curves represent the mean densities as predicted functions in each terminal node. In this example, each observation of Y (grey curves) is well classified. As expected, X1 and X2 do not appear in the splitting rules, but the number of terminal nodes is higher than in the theoretical tree.
The optimal tree (Fig. 3) has five terminal nodes and its main splitting rules match the true model (8): the splitting rules are retrieved and the model correctly classifies the entire set of observations. However, the number of terminal nodes is higher than in the true model.

5. Comparing the criteria

In the preceding section, we presented the construction of a tree using a splitting criterion based on the Kullback–Leibler divergence. Still following model (8), we compare here the performances of trees constructed with three different criteria, based respectively on the Hellinger distance, the Kullback–Leibler divergence and the usual $L^2$-distance criterion (5). Moreover, we test the robustness of these trees by increasing the variability between density functions belonging to the same region. One way to proceed is to modify the number s of sampled observations used when constructing the $y_i$'s by kernel density estimation. For a fixed value of s, the following procedure is repeated 20 times:

(1) Generate n = 300 observations from model (8) to form a sample.
(2) Split this sample into two subsets, a learning sample L of size 200 and a test sample $L^{TS}$ of size 100.
(3) For each of the three criteria, construct the pruned tree with 20-fold cross-validation using the observations in L and evaluate its performance with the mean squared error computed on $L^{TS}$.

Once the procedure is completed, the average of the mean squared errors and the mean number of terminal nodes over the 20 iterations are computed. Fig. 4 displays the average of the mean squared errors against the mean number of terminal nodes for values of s ranging from 25 to 1000. The more s decreases, the more the number of terminal nodes and the mean squared errors of the trees increase. This is due to the kernel density estimator, which, as expected, increases the variability between density curves. However, the behavior of the Kullback–Leibler criterion is quite different from that of the two Euclidean criteria (Hellinger and usual $L^2$-distance). With heterogeneous densities (s ≤ 250), both the number of terminal nodes and the performance of the trees are improved by the Kullback–Leibler splitting criterion: trees are less complex and classify the densities better. In fact, the Kullback–Leibler criterion is less sensitive to small variations between densities than the other criteria. When the estimated densities are close to the theoretical model (s > 500), the criteria are equivalent. This example points out that the choice of a proper measure of similarity between two probability distributions when building a regression tree is an important and complex task, and that results will strongly depend on the variability between the density curves of the learning sample. From a practical point of view, each tree has been grown using the R package rpart, with the criteria computed at each step of the splitting process; the computational time is thus about five times greater with the Kullback–Leibler criterion than with the other ones.
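For illustration only, the following sketch evaluates criterion (7) with the Kullback–Leibler divergence for all candidate cuts of a single explanatory variable; it is a plain R rendering of the split search, not the rpart-based implementation used by the authors.

```r
## Sketch of split selection with the Kullback-Leibler criterion (7): precompute the
## pairwise divergences, then scan the candidate cuts of one explanatory variable and
## keep the cut maximizing the decrease in deviance. Names are illustrative.
kl_matrix <- function(Y, tg) {
  dt <- tg[2] - tg[1]
  n <- nrow(Y); D <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n))
    D[i, j] <- sum(Y[i, ] * log(Y[i, ] / Y[j, ])) * dt
  D
}

node_deviance <- function(D, idx) sum(D[idx, idx])    # criterion (7) on a node

best_split <- function(D, x) {
  total <- sum(D)                                     # deviance of the parent node
  v <- sort(unique(x))
  cuts <- (v[-1] + v[-length(v)]) / 2                 # midpoints as candidate cuts
  gain <- sapply(cuts, function(cc) {
    left <- which(x < cc); right <- which(x >= cc)
    total - (node_deviance(D, left) + node_deviance(D, right))
  })
  list(cut = cuts[which.max(gain)], gain = max(gain))
}

## e.g. best_split(kl_matrix(Y, tg), X[, 3]) for the root node of the simulated data
```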
Fig. 4. Comparison between the Kullback–Leibler criterion, the Hellinger distance criterion and the usual L2-distance criterion (mean number of terminal nodes vs. mean squared error ×0.001). Each curve corresponds to the evolution of the mean squared error and the mean number of terminal nodes computed over 20 simulations, for various values of s (from 25 to 1000). Dashed lines connect the same data sets.
6. Case study in oceanology

We dispose of a time series of size n = 395 constituted of Mediterranean zooplankton size spectra sampled at the same location from 1995 to 2003. Each size spectrum is estimated by a digital imaging system which isolates and measures individuals of net-collected zooplankton (Beaugrand et al., 2002). The set of explanatory variables is composed of water temperature watt (°C), salinity sal (SI), solar radiation rad (J/cm²), chlorophyll a concentration chla (µmol/l), nitrate concentration nit (µmol/l), phosphate concentration phos (µmol/l) and wind speed wspe (m/s). The objective is to predict the shape of the zooplankton size spectra using the set of available explanatory environmental variables. The learning set L is then constituted of the size spectra response curves and the 395 × 7 matrix of explanatory variables.

As shown in Fig. 5, three different sources of variability, accounting for 90% of the total variance, are identified with functional PCA computed on the entire set of densities (Kneip and Utikal, 2001). If we denote by $\bar{y}$ the overall mean density of the curves $y_i$, $i = 1, \ldots, 395$, the effect of the eigenfunction $\xi_j$ associated with the eigenvalue $\lambda_j$ of the PCA can be displayed as $\bar{y} \pm \sqrt{\lambda_j}\, \xi_j$ (Ramsay and Silverman, 1997). The first factor, accounting for the main part of the variability (about 51%), involves a shift of the distributions from small sizes to larger ones. The zooplankton size spectra are in fact strongly attached to the season in which sampling was done: flat distributions with a high mean size are encountered in summer, while thin densities with a lower mean value characterize winter samples, where the biggest individuals have disappeared. The second source, accounting for 30% of the variability, identifies kurtosis variations of the densities: the distributions are more or less concentrated around the mean size, with roughly the same variance. The last factor exhibits shape variations around the mode of the densities.

A 20-fold cross-validated tree is grown using the whole set L with the Kullback–Leibler divergence as splitting criterion (Fig. 6). The temperature is the main factor which discriminates the densities, with thin densities on the right part of the tree and flat curves on the left. This branch of the tree gathers observations corresponding to samples obtained at the end of spring and during summer. The solar radiation plays an important role here, since it isolates particular size structures
Fig. 5. Display of the first three principal components of the functional PCA of the entire dataset (panels PC1 (50.35%), PC2 (29.73%) and PC3 (9.05%); density vs. log(size) (µm)). The curves show the mean density (black) and the effects of adding (+) and subtracting (−) a multiple of each eigenfunction. Results show that the size structure of the zooplankton is highly attached to the season.
Fig. 6. A Kullback–Leibler regression tree constructed over the 395 observations of size densities (splits: Watt ≥ 15.29, Chla ≥ 29.31, Watt ≥ 18.32, Wspe ≥ 3.379, Rad ≥ 1874, Rad ≥ 1219). Only four predictive variables appear in the decision rules. A terminal node contains the observations (grey lines) and the predicted density curve (black line). An example of local PCA is displayed for two terminal nodes.
of the zooplankton community. Conversely, on the right part of the tree, the chlorophyll a concentration appears to be the most important factor when the water temperature is low. This period corresponds to fall and to the beginning of spring, when the increase of phytoplankton biomass (chlorophyll a) is the largest of the year. Once the tree has been grown, a local functional PCA is carried out in each terminal node in order to identify the main modes of variation of the densities around the conditional predicted density. PCA results are displayed for two terminal nodes (Fig. 6). It is interesting to note that these nodes present quite the same modes of variation but with different amounts of explained variability.
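One possible way to reproduce this kind of display is sketched below, approximating the functional PCA by an ordinary PCA of the discretized curves; the renormalization by the grid step and the plotting layout are our choices, not the authors' implementation.

```r
## Sketch of the effect plots ybar +/- sqrt(lambda_j) * xi_j, approximating the
## functional PCA by a PCA of the curves discretized on the common grid tg.
fpca_effects <- function(Y, tg, ncomp = 3) {
  dt <- tg[2] - tg[1]
  ybar <- colMeans(Y)
  pca <- prcomp(Y, center = TRUE, scale. = FALSE)
  op <- par(mfrow = c(1, ncomp)); on.exit(par(op))
  for (j in seq_len(ncomp)) {
    xi <- pca$rotation[, j] / sqrt(dt)                # so that int xi_j(t)^2 dt ~ 1
    lambda <- pca$sdev[j]^2 * dt                      # eigenvalue on functional scale
    pct <- round(100 * pca$sdev[j]^2 / sum(pca$sdev^2), 1)
    plot(tg, ybar, type = "l", lwd = 2, xlab = "log(size)", ylab = "density",
         main = paste0("PC", j, " (", pct, "%)"))
    lines(tg, ybar + sqrt(lambda) * xi, lty = 2)      # "+" effect
    lines(tg, ybar - sqrt(lambda) * xi, lty = 3)      # "-" effect
  }
}
```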
Fig. 7. Improvement of the performance of the functional regression model using bagged predictors (mean squared error ×0.001 vs. number of bootstrap replicates). As the number of aggregated predictors increases, the total squared error decreases.
7. Aggregated models

The main drawback of regression trees is that slight modifications of the learning sample can lead to a drastically different model structure. To improve the robustness of the tree structure, we construct an aggregated predictor by combining multiple versions of the model built on bootstrap samples of L (Breiman, 1996). The bootstrap samples $L_b$, $b = 1, \ldots, B$, are replicated data sets, each consisting of n individuals randomly drawn from L with replacement. Over each bootstrap sample we construct a functional predictor $\hat{f}_{L_b}(\cdot)$, and the bagged prediction for a new observation x is

$$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_{L_b}(x).$$
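A minimal sketch of this aggregation step is given below; fit_tree() and predict_tree() are hypothetical placeholders standing in for the functional tree fitting and prediction routines, which are not reproduced here.

```r
## Sketch of bagging: fit the functional tree on B bootstrap samples of L and average
## the predicted density curves pointwise. fit_tree()/predict_tree() are placeholders.
bagged_predict <- function(Y, X, newx, tg, B = 50) {
  n <- nrow(X)
  dt <- tg[2] - tg[1]
  preds <- replicate(B, {
    idx <- sample(n, n, replace = TRUE)             # bootstrap sample L_b
    fit <- fit_tree(Y[idx, , drop = FALSE], X[idx, , drop = FALSE])
    predict_tree(fit, newx)                         # predicted density on the grid tg
  })
  fbag <- rowMeans(preds)                           # pointwise average over the B trees
  fbag / (sum(fbag) * dt)                           # renormalize to a density (safeguard)
}
```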
Before constructing the aggregated predictor, the data set is randomly partitioned into a learning set containing 70% of the data and a test set containing the remaining 30%. The accuracy of the aggregated predictor is evaluated with the mean squared error computed over the test set. Using 50 bootstrap samples (Fig. 7), the prediction error is computed from the bagged predictor over the test sample. For this case study, we gain about 14% in prediction accuracy.

8. Conclusion

In the context of forecasting a density response variable with regression trees, we proposed a general criterion based on Csiszár's f-divergence. This approach covers several well-known divergence measures between probability density functions, such as the Kullback–Leibler divergence and the Hellinger distance. We have illustrated with numerical simulations that the proposed splitting criteria are efficient at capturing the structure of the true model as long as the densities are estimated on a reasonable number of observations. However, we pointed out that the structure of the tree is sensitive to the measure of similarity used to select the splits. In our simulation study, the Kullback–Leibler splitting criterion is the most efficient in terms of the number of terminal nodes and the performance of the tree, even if its computational time is rather long. For the case study, we obtained a satisfactory predictive model constructed with this splitting criterion, which allows one to compare the effects of the available explanatory environmental variables on the size structure of the zooplankton community. The bagged model gave an improvement of prediction accuracy of about 14%, which shows that the single tree is quite unstable to perturbations of the learning sample.
Acknowledgements

The authors wish to thank the ZOOPNEC team for their stimulating remarks and for providing the dataset.

References

Anastassiou, G.A., 2005. Higher order optimal approximation of Csiszár's f-divergence. Nonlinear Anal. 61, 309–339.
Beaugrand, G., Reid, P.C., Ibanez, F., Lindley, J.A., Edwards, M., 2002. Reorganization of North Atlantic marine copepod biodiversity and climate. Science 296, 1692–1694.
Breiman, L., 1996. Bagging predictors. Mach. Learning 24, 123–140.
Breiman, L., Friedman, J.H., Olshen, R., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
Cardot, H., Ferraty, F., Sarda, P., 1999. Functional linear model. Statist. Probab. Lett. 45, 11–22.
Csiszár, I., 1967. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299–318.
Ferraty, F., Vieu, P., 2004. Nonparametric models for functional data, with application in regression, time series and classification. Nonparametric Statist. 16, 111–127.
Ferraty, F., Vieu, P., 2006. Nonparametric Functional Data Analysis. Springer, New York.
Grosjean, P., Picheral, M., Warembourg, C., Gorsky, G., 2004. Enumeration, measurement, and identification of net zooplankton samples using the ZOOSCAN digital imaging system. ICES J. Marine Sci. 61 (4), 518–525.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer, New York.
Kneip, A., Utikal, K.J., 2001. Inference for density families using principal component analysis. J. Amer. Statist. Assoc. 96, 519–542.
Ramsay, J.O., Silverman, B.W., 1997. Functional Data Analysis. Springer, New York.
Ramsay, J.O., Silverman, B.W., 2003. Applied Functional Data Analysis. Springer, New York.
Segal, M.R., 1992. Tree-structured methods for longitudinal data. J. Amer. Statist. Assoc. 87, 407–418.
Simonoff, J.S., 1996. Smoothing Methods in Statistics. Springer, New York.
Yu, Y., Lambert, D., 1999. Fitting trees to functional data, with an application to time-of-day patterns. J. Comput. Graph. Statist. 8, 749–762.