BIOPHYSICAL PARAMETER ESTIMATION WITH ADAPTIVE GAUSSIAN PROCESSES

G. Camps-Valls¹, L. Gómez-Chova¹, J. Muñoz-Marí¹, J. Vila-Francés¹, J. Amorós¹, S. del Valle-Tascon², and J. Calpe-Maravilla¹
¹ Image Processing Laboratory (IPL), ² Dept. Plant Biology, Universitat de València, València, Spain.
[email protected], http://www.uv.es/gcamps

ABSTRACT

We evaluate Gaussian Processes (GPs) for the estimation of biophysical parameters from acquired multispectral data. The standard GP formulation is used, and all hyperparameters (kernel parameters and noise variance) are optimized by maximizing the marginal likelihood. This yields a GP that is fully adaptive to the data characteristics, in terms of both signal and noise properties. The good numerical results in the estimation of oceanic chlorophyll concentration and leaf membrane state confirm GPs as adequate, alternative non-parametric methods for biophysical parameter estimation. GPs are further analyzed by scrutinizing the predictive variance, the estimated noise variance, and the relevance of each feature after optimization.

Index Terms— Biophysical parameter estimation, Gaussian Process, Support Vector Regression, kernel method, Bayesian learning, non-parametric model, chlorophyll concentration, leaf membrane permeability.

1. INTRODUCTION

The estimation of biophysical parameters is of special relevance for better understanding environment dynamics at local and global scales [1]. The suitability of imaging spectroscopy is motivated by the fact that the information contained in the spectra allows us to describe and model the physical processes more accurately. Lately, non-parametric models such as neural networks and support vector machines (SVMs) have demonstrated successful performance in this context [2-6]. Alternative kernel-based prediction models formalized under the Bayesian framework, such as the Relevance Vector Machine (RVM) [7], have also been developed. The application of RVM to remote sensing data was presented in [8], where some appealing properties of RVM were stressed: accurate predictions, sparsity, and the possibility of obtaining confidence intervals for the predictions.
(This paper has been partially supported by the Spanish Ministry for Education and Science under projects TEC2006-13845, AYA2008-05965-C0403, and CONSOLIDER/CSD2007-00018.)

Unfortunately, an important concern about the suitability of such Bayesian algorithms for biophysical parameter estimation was raised: oversparseness was easily obtained due to the use of an improper prior, which led to inaccurate predictions and poor predictive-variance estimates. Recently, the introduction of Gaussian Processes (GPs) has alleviated this problem at the cost of providing non-sparse models [9]. GPs are also a Bayesian approach to non-parametric kernel learning. Very good numerical performance and stability have been reported in remote sensing parameter retrieval [10, 11]. However, the other properties of this family of algorithms (i.e., confidence intervals and feature ranking) have not been analyzed. In this paper, we evaluate GPs in the context of biophysical parameter estimation and analyze these properties in detail. The standard GP formulation is used, and all hyperparameters (kernel parameters and noise variance) are optimized by maximizing the marginal likelihood. On the one hand, the signal characteristics must be encoded in the kernel function. The adoption of traditional kernels, such as the radial basis function (RBF) kernel, is not a good choice since there is no adaptation to the different intrinsic relevance of the spectral channels. In [6, 12] we presented the Mahalanobis kernel to alleviate this problem, but the lengthscales for each feature were not learned from the data; they were fixed ad hoc through the estimated inverse of the covariance matrix of the data. In [8], we proposed an alternative RVM to deal with this problem but, due to the oversparsity problem, only a slight improvement was observed. On the other hand, the standard GP formulation assumes an additive Gaussian noise model with a dedicated hyperparameter. In this paper, we propose to optimize the marginal likelihood in GPs to learn both the proper kernel and the noise variance.
The proposed fully-adaptive GP is tested on oceanic chlorophyll concentration prediction and on the complex problem of estimating leaf membrane state. All results are analyzed and discussed taking into account the nature of the data and the noise. The remainder of the paper is organized as follows. Section 2 reviews the standard formulation of GPs and the alternative optimization of the marginal likelihood. Section 3 shows the experimental results. Section 4 concludes the paper.

2. ADAPTIVE GAUSSIAN PROCESSES

This section reviews the formulation of GPs and the proposed optimization of the model hyperparameters through marginal likelihood maximization.
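As a compact preview of the machinery reviewed below, the following sketch implements GP regression with a scaled anisotropic (ARD) RBF kernel and fits all hyperparameters (signal variance, per-feature lengthscales, and noise level) by maximizing the log marginal likelihood. This is a minimal illustration in numpy/scipy, not the authors' code: it uses a generic numerical optimizer rather than analytic gradients, and all function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def ard_rbf(X1, X2, nu, ell):
    # Scaled anisotropic RBF: nu * exp(-0.5 * sum_d ((x_d - x'_d) / ell_d)^2)
    diff = (X1[:, None, :] - X2[None, :, :]) / ell
    return nu * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def neg_log_marglik(log_theta, X, y):
    # theta = (nu, ell_1..ell_D, sigma_n), optimized in log space for positivity
    log_theta = np.clip(log_theta, -8.0, 8.0)       # keep the search numerically sane
    nu, sn = np.exp(log_theta[0]), np.exp(log_theta[-1])
    ell = np.exp(log_theta[1:-1])
    n = len(X)
    K = ard_rbf(X, X, nu, ell) + (sn ** 2 + 1e-8) * np.eye(n)
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return 1e6                                   # penalize indefinite settings
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log p(y|x,theta) = 0.5 y^T K^-1 y + 0.5 log det K + (n/2) log 2*pi
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2 * np.pi)

def fit_gp(X, y):
    theta0 = np.zeros(X.shape[1] + 2)                # start at nu = ell = sigma_n = 1
    res = minimize(neg_log_marglik, theta0, args=(X, y), method="Nelder-Mead")
    x = np.clip(res.x, -8.0, 8.0)
    return np.exp(x[0]), np.exp(x[1:-1]), np.exp(x[-1])

def gp_predict(X, y, Xs, nu, ell, sn):
    K = ard_rbf(X, X, nu, ell) + (sn ** 2 + 1e-8) * np.eye(len(X))
    Ks = ard_rbf(Xs, X, nu, ell)
    mean = Ks @ np.linalg.solve(K, y)                # k*^T (K + sn^2 I)^-1 y
    var = nu - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)  # k** - k*^T (.)^-1 k*
    return mean, var
```

The inverse of each fitted lengthscale, 1/ell_d, plays the role of the ARD feature relevance discussed below.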
2.1. Introduction to Gaussian Processes

Gaussian processes provide a probabilistic approach to learning regression and classification problems with kernels [9]. A GP for regression defines a distribution over functions f : X → R fully described by a mean m : X → R and a covariance (kernel) function k : X × X → R such that m(x) = E[f(x)] and k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))]. Hereafter we set m to the zero function for the sake of simplicity. Now, given a finite dataset of labeled samples {x_1, ..., x_n}, we first compute its covariance matrix K in the same way as the Gram matrix in the SVM. The covariance matrix defines a distribution over the vector of output values f_x = (f(x_1), ..., f(x_n))^T, such that f_x ~ N(0, K), which is a multivariate Gaussian distribution. Therefore, the specification of the covariance function implies the form of the distribution over the functions. The role of the covariance in GPs is the same as that of kernels in the SVM: both specify the notion of similarity.

For training purposes, we assume that the observed variable is formed by noisy observations of the true underlying function, y = f(x) + ε. Moreover, we assume the noise to be additive, independently and identically Gaussian distributed with zero mean and variance σ_n^2. Let us define the stacked output values y = (y_1, ..., y_n)^T, the covariance terms of the test point k_* = (k(x_*, x_1), ..., k(x_*, x_n))^T, and k_** = k(x_*, x_*). From the previous model assumption, the output values are jointly distributed according to

    [ y ; f(x_*) ] ~ N( 0, [ K + σ_n^2 I , k_* ; k_*^T , k_** ] )    (1)

For prediction purposes, the GP is obtained by computing the conditional distribution f(x_*) | y, {x_1, ..., x_n}, x_*, which can be shown to be Gaussian with predictive mean k_*^T (K + σ_n^2 I)^{-1} y and predictive variance k_** − k_*^T (K + σ_n^2 I)^{-1} k_*. Therefore, two types of hyperparameters must be optimized: those of the kernel K and the noise variance σ_n^2.

Note that the GP mean predictor is exactly the same solution as that obtained for kernel ridge regression, where now the noise variance σ_n^2 plays the role of the regularization factor. Interestingly, we showed this relation in the context of support vector regression [5]. Even more important is the fact that not only a mean prediction is obtained for each sample, but a full distribution over the output values, including an uncertainty for the prediction. We will exploit these capabilities in the experimental section.

2.2. GP model optimization

The optimization of the GP hyperparameters can be done through standard cross-validation techniques. However, an appealing property of the GP framework is the possibility to optimize all involved hyperparameters, θ, iteratively by gradient ascent on the log marginal likelihood, log p(y|x, θ)¹, whose partial derivatives with respect to the hyperparameters are

    ∂/∂θ_j log p(y|x, θ) = (1/2) y^T K^{-1} (∂K/∂θ_j) K^{-1} y − (1/2) tr(K^{-1} ∂K/∂θ_j)
                         = (1/2) tr( (α α^T − K^{-1}) ∂K/∂θ_j ),    (2)
where α = K^{-1} y, which is computed only once. This optimization is done by standard gradient-based methods, resulting in a relatively fast procedure that scales well up to a few thousand training samples [9]. This technique not only avoids heuristic cross-validation, but also makes it possible to optimize very flexible kernel functions and to estimate the noise variance consistently. For example, by optimizing a different lengthscale per feature, the kernel function is no longer isotropic and implements automatic relevance determination (ARD) [13], since the inverse of the lengthscale determines how relevant an input is. By optimizing the noise variance, relevant information about the data complexity can be obtained.

¹ The log marginal likelihood is log p(y|x, θ) = −(1/2) y^T (K + σ_n^2 I)^{-1} y − (1/2) log det(K + σ_n^2 I) − (n/2) log(2π).

3. EXPERIMENTAL RESULTS

In this section we evaluate the performance of GPs in the estimation of two biophysical parameters: chlorophyll concentration and leaf permeability.

3.1. Model development

In all cases, we used a scaled anisotropic RBF kernel, K(x, x') = ν exp(−Σ_{d=1}^{D} 0.5 σ_d^{-2} (x^{(d)} − x'^{(d)})^2) + σ_n^2 δ_{xx'}, where ν is a kernel scaling factor accounting for the signal variance, D is the input dimension (d indexes the dimensions), σ_d is a dedicated lengthscale for feature d, and σ_n is the magnitude of the independent noise component. Before training the models, we standardized the data and scaled the observed variable to lie in the [0, 1] range. We optimized the signal model parameters (ν, σ_d) and the noise variance σ_n with Eq. (2) using the training set only. The inverse of the optimized σ_d represents the relevance of feature d.

3.2. Experiment 1: Ocean chlorophyll concentration

In this experiment we used simulated data of the Medium Resolution Imaging Spectrometer (MERIS) to predict chlorophyll concentration, CC, in subsurface waters. We selected the eight channels in the visible range (412-681 nm), following [2]. The chlorophyll concentration varies from 0.02 to 25 mg/m³. The data were generated according to a fixed (noise-free) model, so no observation noise is present. This fact allows us to test the capabilities of GPs. The total of 5000 pairs of in situ concentrations and received radiances was randomly divided into a training set (500 samples) to build the model and two independent test sets (500 and 4000 samples) to assess performance. Before applying the GP, the concentration was transformed logarithmically, y = log(CC).

A summary of the results is shown in Fig. 1. GPs provide excellent results in both test sets. Interestingly, better results are obtained in the bigger test set (RMSE = 0.1371, r = 0.9610) than in the smaller one (RMSE = 0.1497, r = 0.9535), suggesting good generalization capabilities. The good alignment in the normality plot is also noticeable. In fact, the estimated noise variance is 0.024, which is much lower than the signal variance and confirms that the model captured the noise-free nature of the synthetic data. The trained GP also returns the fitted lengthscale per spectral band: {227.79, 139.20, 9.34, 15.70, 15.35, 21.79, 11.99, 22.03}. The lower the lengthscale, the higher the relevance of the band; that is, the most relevant channels are λ3 = 490 nm and λ7 = 665 nm, which partially agrees with previous studies [14].

[Fig. 1. (a) Actual concentration (black), mean prediction (red) and 95% confidence intervals (gray); (b) predicted versus observed plot; (c) residuals versus predicted plot; and (d) normality plot. All plots are for the test set.]

3.3. Experiment 2: Leaf membrane state estimation

The study of the leaf cell membrane is relevant for assessing plant state.
Stress factors such as dehydration, low temperatures, radiation, and contaminants can alter the physicochemical properties of membranes, leading to a loss of physiological function, membrane leakage, and tissue injury [15, 16]. Traditionally, plant membrane damage has been estimated indirectly by measuring electrolyte leakage. However, this is a destructive and time-consuming technique. Here we propose the use of GPs to infer a model that relates the reflectance spectrum of a leaf to its membrane state. A total of 57 sunflower leaves (coming from 5 plants grown under different water-stress conditions) were analyzed by both electrolyte leakage and spectroscopic techniques. For the spectral analysis, sunflower leaves were illuminated with a stabilized halogen lamp and their reflectance spectrum was acquired with an ASD FieldSpec spectroradiometer (2151 channels in 350-2500 nm, resolution 1.3-2.5 nm). Five reflectance spectra were obtained for each of the 57 sample leaves. For the electrical conductivity analysis, five pieces were cut from each sunflower leaf. The pieces were submerged in deionised water for 15 min, incubated at room temperature in deionised water, and shaken at 80 min⁻¹. The electrical conductivity of the bathing solution was measured after 15 hours (C1) [17]. After the measurements, the tubes were kept at −20 °C for 6 hours and the conductivity was measured again (C2). The membrane stability index (MSI) to be predicted was calculated as MSI = (1 − C1/C2) × 100.

The resulting dataset consists of a few high-dimensional training samples, x_i ∈ R^2151, and the corresponding measured MSI, y_i ∈ R, i = 1, ..., 57. To reduce the high number of features, we performed a principal component analysis (PCA) on the training data and kept the 15 principal components corresponding to the largest eigenvalues. These results match those obtained with the minimum noise fraction (MNF) transformation [18]. Then, we ran the GP in this whitened, reduced feature space. In this case, given the low number of training samples, we report the 3-fold cross-validation error for different methods in Table 1: canonical linear regression, regression trees, support vector regression (SVR), and GP. We show different quality measures: RMSE and mean absolute error (MAE) for accuracy, Pearson's correlation coefficient (r) for goodness-of-fit, and the mean error (ME) representing model bias. It can be noticed that all methods yield low-bias predictions, but GP and SVR achieve better accuracy and goodness-of-fit figures than linear regression and regression trees.
Moreover, the GP shows excellent performance in this experiment. It is worth noting that the estimated noise variance is very high (σ̂_n^2 = 0.45), suggesting high uncertainty in the data. On the other hand, the estimated kernel scaling factor, ν̂ = 0.53, is similar in magnitude to the standard deviation of the (scaled) data, std(y) = 0.26. The discrepancy may be due to the extremely low number of samples.
Table 1. Results for the MSI prediction: mean error (ME), root mean-squared error (RMSE), mean absolute error (MAE), and correlation coefficient (r) of the models in 3-fold cross-validation.

Method             ME     RMSE   MAE    r
Lin. Regression   -0.02   0.13   0.18   0.71
Regression Tree   -0.02   0.11   0.18   0.74
SVR               -0.02   0.07   0.12   0.83
GP                -0.02   0.03   0.10   0.89
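For reference, the quality measures used in Table 1 can be computed with a small helper (the function name is ours):

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """ME (bias), RMSE and MAE (accuracy), and Pearson's r (goodness-of-fit)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    e = y_pred - y_true
    return {
        "ME": e.mean(),                          # mean error: model bias
        "RMSE": np.sqrt((e ** 2).mean()),        # root mean-squared error
        "MAE": np.abs(e).mean(),                 # mean absolute error
        "r": np.corrcoef(y_true, y_pred)[0, 1],  # Pearson correlation
    }
```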
4. CONCLUSION

A GP fully adapted to the signal and noise characteristics is proposed for biophysical parameter estimation. Good results were obtained in the estimation of oceanic chlorophyll concentration and of the leaf membrane stability index. GPs also provided a reasonable noise variance estimate and allowed us to optimize the kernel parameters, thus providing a feature ranking learned directly from the data. Further work will consider the reformulation of GPs to deal with non-Gaussian noise and to alleviate the computational cost, making the model operational when training with several thousands of samples.

5. REFERENCES

[1] D. G. Goodenough, A. S. Bhogall, H. Chen, and A. Dyk, "Comparison of methods for estimation of Kyoto Protocol products of forests from multitemporal LANDSAT," in Proc. of the Intern. Geosc. and Rem. Sens. Symp. IGARSS '01, University of California, Santa Barbara, USA, 2001, vol. 2, pp. 764-767.
[2] P. Cipollini, G. Corsini, M. Diani, and R. Grasso, "Retrieval of sea water optically active parameters from hyperspectral data by means of generalized radial basis function neural networks," IEEE Trans. on Geosc. Rem. Sens., vol. 39, pp. 1508-1524, 2001.
[3] M. De Martino, P. Mantero, S. B. Serpico, E. Carta, G. Corsini, and R. Grasso, "Water quality estimation by neural networks based on remotely sensed data analysis," in Proc. of the International Workshop on Geo-spatial Knowledge Processing for Natural Resource Management, Varese, Italy, 2002, pp. 54-58.
[4] B. Dzwonkowski and X.-H. Yan, "Development and application of a neural network based colour algorithm in coastal waters," International Journal of Remote Sensing, vol. 26, no. 6, pp. 1175-1200, March 2005.
[5] G. Camps-Valls, L. Bruzzone, J. L. Rojo-Álvarez, and F. Melgani, "Robust support vector regression for biophysical variable estimation from remotely sensed images," IEEE Geoscience and Remote Sensing Letters, vol. 3, no. 3, pp. 339-343, 2006.
[6] G. Camps-Valls, J. Muñoz-Marí, L. Gómez-Chova, and J. Calpe-Maravilla, "Semi-supervised support vector biophysical parameter estimation," in IEEE International Geoscience and Remote Sensing Symposium, IGARSS'08, Boston, USA, July 2008.
[7] M. E. Tipping, "Sparse Bayesian learning and the Relevance Vector Machine," J. of Mach. Learn. Res., vol. 1, pp. 211-244, 2001.
[8] G. Camps-Valls, L. Gómez-Chova, J. Vila-Francés, J. Amorós-López, J. Muñoz-Marí, and J. Calpe-Maravilla, "Retrieval of oceanic chlorophyll concentration with relevance vector machines," Remote Sensing of Environment, vol. 105, no. 1, pp. 23-33, Nov. 2006.
[9] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, Cambridge, MA, 2006.
[10] R. Furfaro, R. D. Morris, A. Kottas, M. Taddy, and B. D. Ganapol, "A Gaussian process approach to quantifying the uncertainty of vegetation parameters from remote sensing observations," AGU Fall Meeting Abstracts, pp. A261+, Dec. 2006.
[11] L. Pasolli, F. Melgani, and E. Blanzieri, "Estimating biophysical parameters from remotely sensed imagery with Gaussian Processes," in IEEE International Geoscience and Remote Sensing Symposium, IGARSS'08, Boston, USA, July 2008.
[12] J. Gómez-Sanchis, J. Blasco, and G. Camps-Valls, "Hyperspectral detection of citrus damage with Mahalanobis kernel classifier," IEE Electronics Letters, vol. 43, no. 20, pp. 1082-1084, 2007.
[13] R. M. Neal, Bayesian Learning for Neural Networks (Lecture Notes in Statistics), Springer, 1st edition, August 1996.
[14] S. Dransfeld, A. R. Tatnall, I. S. Robinson, and C. D. Mobley, "Prioritizing ocean colour channels by neural network input reflectance perturbation," International Journal of Remote Sensing, vol. 26, no. 5, pp. 1043-1048, 2005.
[15] D. A. Sims and J. A. Gamon, "Relationships between leaf pigment content and spectral reflectance across a wide range of species, leaf structures and developmental stages," Remote Sensing of Environment, vol. 81, no. 2-3, pp. 337-354, 2002.
[16] G. le Maire, C. François, and E. Dufrêne, "Towards universal broad leaf chlorophyll indices using PROSPECT simulated database and hyperspectral reflectance measurements," Remote Sensing of Environment, vol. 89, no. 1, pp. 1-28, 2004.
[17] L. Guidi, C. Nali, G. Lorenzini, F. Filippi, and G. F. Soldatini, "Effect of chronic ozone fumigation on the photosynthetic process of poplar clones showing different sensitivity," Environmental Pollution, vol. 113, no. 3, pp. 245-254, 2001.
[18] A. A. Green, M. Berman, P. Switzer, and M. D. Craig, "A transformation for ordering multispectral data in terms of image quality with implications for noise removal," IEEE Transactions on Geoscience and Remote Sensing, vol. 26, no. 1, pp. 65-74, 1988.