Relevance Units Latent Variable Model and Nonlinear Dimensionality Reduction

Junbin Gao, Jun Zhang, and David Tien, Member, IEEE
Abstract—A new dimensionality reduction method, called the relevance units latent variable model (RULVM), is proposed in this paper. RULVM has a close link with the framework of the Gaussian process latent variable model (GPLVM) and originates from a recently developed sparse kernel model called the relevance units machine (RUM). RUM follows the idea of the relevance vector machine (RVM) under the Bayesian framework but releases the constraint that relevance vectors (RVs) have to be selected from the input vectors. RUM treats relevance units (RUs) as part of the parameters to be learned from the data. As a result, a RUM maintains all the advantages of the RVM and offers superior sparsity. RULVM inherits the sparseness offered by the RUM, and the experimental results show that the RULVM algorithm possesses considerable computational advantages over the GPLVM algorithm.

Index Terms—Dimensionality reduction, relevance units machine (RUM), relevance vector machine (RVM), Gaussian process latent variable model (GPLVM).
I. INTRODUCTION
DIMENSIONALITY REDUCTION (DR) is one of the important preprocessing steps in many advanced applications such as exploratory data analysis and manifold learning. It has been successfully applied in many areas including robotics, bioinformatics, etc. The goal of DR is mainly to find the corresponding embedding of observed data in a much lower dimensional space without incurring significant information loss. The low-dimensional representation can be used in subsequent procedures such as classification, pattern recognition, and so on. In machine learning, many well-known DR methods that can handle different kinds of data and produce either linear or nonlinear embeddings have been reported [7]. Traditionally, DR was performed using linear techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) in the setting of unsupervised or supervised learning. Those procedures aim to learn a mapping from the high-dimensional space to a space of lower "intrinsic" dimension. A more general setting can be formulated when considering DR problems. The recent work [8] implicitly uses the concept of harmonic measurements between higher and lower spaces, and
two mappings can be introduced. For example, the traditional PCA learns a mapping from the high-dimensional space to the low-dimensional space, while the probabilistic PCA (PPCA) [9] makes an assumption of a linear latent variable model. A recent work [10] further explored such a framework. In the setting of DR, we are given a set of observation samples $\{\mathbf{y}_n\}_{n=1}^{N}$ and we wish to learn their mapped images $\{\mathbf{x}_n\}_{n=1}^{N}$, called latent data hereafter, in a lower dimensional space. We suppose that the latent data is of dimension $q$ and the data is of dimension $D$. Although learning mappings between data and latent data is not a major concern in DR problems, learning a mapping from the data space $\mathcal{Y}$ to the latent space $\mathcal{X}$ or a mapping from $\mathcal{X}$ to $\mathcal{Y}$ is always desirable. A mapping from $\mathcal{Y}$ to $\mathcal{X}$ is useful in resolving out-of-sample problems, while a mapping from $\mathcal{X}$ to $\mathcal{Y}$ helps in extrapolation. When we assume that the latent data $\mathbf{X}$ is known, learning a mapping from $\mathcal{X}$ to $\mathcal{Y}$ (similarly from $\mathcal{Y}$ to $\mathcal{X}$) can be regarded as a conventional regression problem. The general idea of using latent variable models in DR is to utilize a regression model under which the assumed latent variables (in the lower dimensional space) are to be learned along with the regression model parameters. For example, the probabilistic PCA (PPCA) employs a linear regression model $\mathbf{y} = \mathbf{W}\mathbf{x} + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ is assumed to be isotropic Gaussian noise; see [9]. Among those regression methods, kernel machine methods [11] have demonstrated great successes. For example, many kernel machines produce a model function dependent only on a subset of kernel basis functions associated with some of the training samples $\mathbf{x}_n$, as defined below
$$f(\mathbf{x}) = \sum_{n=1}^{N} w_n\, k(\mathbf{x}, \mathbf{x}_n) \qquad (1)$$
where those samples $\mathbf{x}_n$ retained with nonzero weights are called the "support vectors" in support vector machine (SVM) methods. The SVM [12] and kernel machine models (KMMs) [11], [13] have attracted considerable interest. These techniques have been gaining more and more popularity and are regarded as the state-of-the-art techniques for regression and classification problems, with tremendously successful applications in many areas. Generally speaking, an SVM method often learns a parsimonious model that ensures the simplest possible model that explains the data well. Apart from the obvious computational advantage, practice has demonstrated that simple models often generalize better on unseen data. Thus, a sparse model is always preferred in most machine learning algorithms. Additionally, learning a sparse model has deep connections with problems of selecting regressors in regression; see, for example, [14]–[17].
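For concreteness, the following minimal sketch (our own illustration, not code from any cited package) evaluates a sparse kernel expansion of the form (1) with an RBF kernel; the particular weights, retained points, and kernel width are arbitrary placeholders.

```python
import numpy as np

def rbf_kernel(x, z, width=1.0):
    """RBF kernel k(x, z) = exp(-||x - z||^2 / (2 * width^2))."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * width ** 2))

def sparse_kernel_model(x, retained_points, weights, width=1.0):
    """Evaluate f(x) = sum_n w_n * k(x, x_n), as in (1).

    Only the retained points (support/relevance vectors) enter the sum,
    which is what makes the model sparse."""
    return sum(w * rbf_kernel(x, xn, width)
               for w, xn in zip(weights, retained_points))

# Toy usage: three retained points in 2-D with arbitrary weights.
retained_points = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 0.5])]
weights = [0.7, -0.3, 1.2]
print(sparse_kernel_model(np.array([0.2, 0.1]), retained_points, weights))
```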
However, it has been shown that the standard SVM technique is not always able to construct parsimonious models, for example, in system identification [18]. This inadequacy motivates the exploration of new methods for parsimonious models under the framework of both SVM and KMM. Tipping [4]–[6] first introduced the relevance vector machine (RVM) method, which can be viewed from a Bayesian learning framework of kernel machines and produces an identical functional form to the SVM/KMM. The results given by Tipping [6] have demonstrated that the RVM has a generalization performance comparable to the SVM but requires dramatically fewer kernel basis functions or model terms than the SVM. Chen et al. [19] proposed another family of sparse kernel modeling algorithms, based on their orthogonal least squares (OLS) learning algorithm [20]. A common feature shared by all the mentioned approaches is that the sparse model is obtained from a full kernel model defined on the whole data set, and the approach employs an algorithmic procedure to trim off unnecessary kernel basis functions associated with some input data. The retained input data points used in the resulting sparse model are variously called support vectors [11], relevance vectors (RVs) [4], critical vectors [21], etc. Obviously, it is not necessary for these critical vectors to be chosen from the training input data. In one of our earlier papers, we proposed a new sparse kernel modeling regression method, called the relevance units machine (RUM), in which the "critical vectors" are learned from the data [3]. Similar ideas have existed for many years, for example, the direct method for sparse model learning [22] under the SVM framework, the reduced set (RS) method [11], [23] for sparse SVM, and the generalized kernel models (GKMs) with greedy forward selection procedures [24]–[26]. However, in our approach, a Bayesian inference framework is adopted so that all the parameters, including kernel parameters, can be learned from Bayesian inference.

Latent variable models have been used in many DR algorithms, such as factor analysis (FA) [27]–[29], generative topographic mapping (GTM) [30], probabilistic principal component analysis (PPCA) [9], Gaussian process latent variable models (GPLVMs) [1], [2], and kernel PCA (KPCA) [31]. In these methods, either a linear mapping is assumed or a nonsparse kernel-type nonlinear mapping is used. A linear mapping is not necessarily enough to capture nonlinear information contained in the data, while a nonsparse model such as GPLVM may incur a heavy computational cost. It is therefore desirable to develop a parsimonious nonlinear mapping for dimensionality reduction. To do so, we take as the underlying latent regression model the RUM [3] and the GKMs [26] as defined by (1) and consider it in the DR setting, where the latent data are unknown and are to be learned as part of the learning procedure. We call this the relevance units latent variable model (RULVM). The main contribution of this paper lies in the following: 1) instead of using the simple linear latent model as in PPCA or the nonsparse latent Gaussian process model as done in GPLVM, a direct sparse kernel model defined as in (1) is employed in the DR setting; 2) the DR setting is formulated under the Bayesian learning framework; and 3) a Bayesian learning procedure is proposed to learn all the undetermined parameters, including the latent embedding $\mathbf{X}$, the kernel "units" $\mathbf{U}$, the kernel hyperparameters, etc.
The rest of this paper is organized as follows. In Section II, RULVM is described, and the algorithm associated with RULVM is presented in Section III. The experimental results are presented in Section IV, followed by our conclusions in Section V.

II. THE MODEL DESCRIPTION

We consider a latent variable model in the conventional unsupervised learning setting. We are given a set of independent and identically distributed (i.i.d.) training data $\mathbf{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$, where $N$ is the number of samples and each $\mathbf{y}_n$ is a $D$-dimensional column vector. We assume that for each sample $\mathbf{y}_n$ there is an unknown counterpart $\mathbf{x}_n$ in a lower dimensional space $\mathbb{R}^q$ with $q \ll D$, meaning that $q$ is much less than $D$. We assume that the data is generated from its counterpart according to an unknown underlying latent regression model $\mathbf{y} = f(\mathbf{x}) + \boldsymbol{\epsilon}$. Denote $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. Sometimes we call $\mathbf{X}$ the input data and $\mathbb{R}^q$ the input space, as in the normal setting of supervised learning.

A. Sparse Kernel Models

In this paper, we are particularly interested in the so-called sparse kernel models. Many kernel machine learning algorithms result in a kernel machine (such as a kernel classifier) whose output can be calculated as $f(\mathbf{x}) = \sum_{n=1}^{N} \mathbf{w}_n\, k(\mathbf{x}, \mathbf{x}_n)$, where the weights $\mathbf{w}_n$ are $D$-dimensional vectors and $k(\cdot, \cdot)$ is a kernel function defined on $\mathbb{R}^q \times \mathbb{R}^q$. When the input $\mathbf{X}$ is known, the conventional supervised learning task becomes finding suitable weights $\mathbf{w}_n$ according to different criteria. Most criteria, like those used by the SVM [12] and the RVM [6], lead to small values for a large number of the weights so that a sparse model is established, and the criterion used by local regularized orthogonal least squares (LROLS) [13], [19], [26] directly results in a sparse model.

Inspired by the idea used in the direct method for sparse model learning [22], the GKMs [26], and the generalized linear regression model [33], we consider a latent variable model defined as follows:
$$\mathbf{y} = \sum_{m=1}^{M} \mathbf{w}_m\, k(\mathbf{x}, \mathbf{u}_m) + \boldsymbol{\epsilon} \qquad (2)$$
where $\mathbf{u}_1, \ldots, \mathbf{u}_M \in \mathbb{R}^q$ are unknown units for the model, $k(\cdot, \cdot)$ is, as usual, a kernel function defined on $\mathbb{R}^q \times \mathbb{R}^q$, and $\boldsymbol{\epsilon}$ is a $D$-dimensional additive noise vector assumed to be Gaussian with zero mean and an unknown covariance $\sigma^2\mathbf{I}_D$, denoted by $\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_D)$ with the $D$-dimensional identity matrix $\mathbf{I}_D$. The number $M$ controls the sparsity of the model, and we assume that $M$ is known in the modeling process. The sparsity control will be discussed in Section III-B.

In this paper, model (2) is considered as the underlying mechanism explicitly generating each datum $\mathbf{y}_n$ from its corresponding input $\mathbf{x}_n$. Under the model, we say that the latent low-dimensional variable $\mathbf{x}_n$ is a representative of the datum $\mathbf{y}_n$. In this sense, we say that the dimension of $\mathbf{y}_n$ has been reduced to the dimension of $\mathbf{x}_n$. In the proposed model, the unknown parameters to be learned from a learning process are the model weights $\{\mathbf{w}_m\}$, the units $\{\mathbf{u}_m\}$, the latent
variable $\mathbf{X}$, the variance $\sigma^2$, and possible hyperparameters in the kernel function $k$. As $\boldsymbol{\epsilon}$ is assumed to be a zero-mean Gaussian, it can be shown that the estimate of any constant bias term in the model is the mean of the data $\mathbf{Y}$. Without loss of generality, we assume the data have zero mean in the sequel.
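As a purely illustrative sketch of the generative mechanism in (2), one can sample latent points, units, and weights and then generate data; the sizes $N$, $q$, $D$, $M$, the noise variance, and the RBF width below are placeholder choices, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, q, D, M = 200, 2, 10, 20        # samples, latent dim, data dim, number of RUs
sigma2, width = 0.01, 1.0          # noise variance and RBF width (placeholders)

X = rng.standard_normal((N, q))    # latent variables x_n
U = rng.standard_normal((M, q))    # relevance units u_m (learned in RULVM; sampled here)
W = rng.standard_normal((M, D))    # weights w_m (D-dimensional rows)

# Design matrix Phi with Phi[n, m] = k(x_n, u_m) for an RBF kernel.
sq_dists = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)
Phi = np.exp(-sq_dists / (2.0 * width ** 2))

# y_n = sum_m k(x_n, u_m) w_m + Gaussian noise, i.e., Y = Phi W + E.
Y = Phi @ W + np.sqrt(sigma2) * rng.standard_normal((N, D))
print(Y.shape)   # (200, 10)
```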
B. Bayesian RUM Formulation

To develop an approach for learning all the parameters, we propose a two-stage Bayesian learning and inference approach. For this purpose, let us introduce some new notation. Given the data set $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_N]^{\mathrm T}$, denote by $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]^{\mathrm T}$ the corresponding latent input data and let $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_M]^{\mathrm T}$ collect the units. Define an $N \times M$ matrix $\boldsymbol{\Phi} = \boldsymbol{\Phi}(\mathbf{X}, \mathbf{U})$ with elements $\Phi_{nm} = k(\mathbf{x}_n, \mathbf{u}_m)$ and an $M \times D$ weights matrix $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_M]^{\mathrm T}$. According to the assumption of independence of the data points, the likelihood of the complete training data can be written as
$$p(\mathbf{Y} \mid \mathbf{W}, \mathbf{U}, \mathbf{X}, \sigma^2, \boldsymbol{\theta}) = \prod_{n=1}^{N} \mathcal{N}\big(\mathbf{y}_n \mid \mathbf{W}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_n),\, \sigma^2\mathbf{I}_D\big) \qquad (3)$$
where $\boldsymbol{\phi}(\mathbf{x}_n) = [k(\mathbf{x}_n, \mathbf{u}_1), \ldots, k(\mathbf{x}_n, \mathbf{u}_M)]^{\mathrm T}$ and $\boldsymbol{\theta}$ denotes all the kernel hyperparameters.

To make a full Bayesian model, we further specify some prior distributions on the latent variable $\mathbf{X}$, the units $\mathbf{U}$, the weights $\mathbf{W}$, etc. Particularly, the prior for each latent variable $\mathbf{x}_n$ is the $q$-dimensional Gaussian with zero mean and unit covariance, that is, $p(\mathbf{x}_n) = \mathcal{N}(\mathbf{x}_n \mid \mathbf{0}, \mathbf{I}_q)$. Furthermore, we specify a Gaussian prior over the weights as in the RVM approach [6]
$$p(\mathbf{W} \mid \boldsymbol{\alpha}) = \prod_{m=1}^{M} \mathcal{N}\big(\mathbf{w}_m \mid \mathbf{0},\, \alpha_m^{-1}\mathbf{I}_D\big) \qquad (4)$$
where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_M)^{\mathrm T}$ is a vector of hyperparameters which controls the precision of the Gaussian over the weights. Similarly, we specify a Gaussian prior over the units matrix $\mathbf{U}$ by
$$p(\mathbf{U} \mid \beta) = \prod_{m=1}^{M} \mathcal{N}\big(\mathbf{u}_m \mid \mathbf{0},\, \beta^{-1}\mathbf{I}_q\big) \qquad (5)$$
where $\beta$ is the Gaussian precision. Those hyperparameters $\boldsymbol{\alpha}$, $\beta$, and $\sigma^{-2}$ in the Gaussian priors are further empowered with hyperpriors given by Gamma distributions defined on $(0, \infty)$. In the experiments conducted, we used uninformative hyperpriors by fixing the hyperprior parameters to small values. Combining (3)–(5), the joint distribution of the data set $\mathbf{Y}$, the weight matrix $\mathbf{W}$, the units matrix $\mathbf{U}$, and the hyperparameters $\boldsymbol{\alpha}$, $\beta$, and $\sigma^2$ is given as follows:
$$p(\mathbf{Y}, \mathbf{W}, \mathbf{U}, \mathbf{X}, \boldsymbol{\alpha}, \beta, \sigma^2) = p(\mathbf{Y} \mid \mathbf{W}, \mathbf{U}, \mathbf{X}, \sigma^2, \boldsymbol{\theta})\, p(\mathbf{W} \mid \boldsymbol{\alpha})\, p(\mathbf{U} \mid \beta)\, p(\mathbf{X})\, p(\boldsymbol{\alpha})\, p(\beta)\, p(\sigma^2). \qquad (6)$$

We call the probabilistic model defined in (6) the RULVM. RULVM inherits from the so-called RUM [3], in which the latent variable $\mathbf{X}$ was supposed to be known. The RUM is used as a regression approach in which the learning process aims to seek the most suitable relevance units (RUs) $\mathbf{U}$ and results in a sparse regression model. In RULVM, besides learning the RUs $\mathbf{U}$, learning the latent variable $\mathbf{X}$ becomes part of the learning procedure. Thus, the primary purpose of RULVM is to determine appropriate units $\mathbf{U}$ and the latent variable $\mathbf{X}$ in the model.

The Bayesian inference for model (6) involves finding the posterior distribution of all the latent variables $\mathbf{X}$ and $\mathbf{U}$ as well as the parameter variables $\boldsymbol{\alpha}$, $\beta$, and $\sigma^2$ given the observations $\mathbf{Y}$. The problem is intractable, particularly as both $\mathbf{X}$ and $\mathbf{U}$ appear inside the kernel function. Variational Bayesian inference can be applied to model (6) by assuming independent parametric posteriors over $\mathbf{X}$, $\mathbf{U}$, $\mathbf{W}$, $\boldsymbol{\alpha}$, $\beta$, and $\sigma^2$; see [32]. Theoretically, another possible inference is to use the maximum a posteriori (MAP) approach by maximizing the log of model (6) with respect to all the unknown variables. However, the model may not be identifiable due to the large flexibility of (6), but we note that we can remove the weights $\mathbf{W}$ by integrating the joint distribution over $\mathbf{W}$, generating a "collapsed" version of the likelihood. Immediately we can see
$$p(\mathbf{Y} \mid \mathbf{U}, \mathbf{X}, \boldsymbol{\alpha}, \sigma^2, \boldsymbol{\theta}) = \int p(\mathbf{Y} \mid \mathbf{W}, \mathbf{U}, \mathbf{X}, \sigma^2, \boldsymbol{\theta})\, p(\mathbf{W} \mid \boldsymbol{\alpha})\, d\mathbf{W} = \prod_{j=1}^{D} \mathcal{N}\big(\mathbf{y}^{(j)} \mid \mathbf{0},\, \mathbf{C}\big) \qquad (7)$$
where $\mathbf{y}^{(j)}$ is the $j$th column of $\mathbf{Y}$, $\mathbf{C} = \sigma^2\mathbf{I}_N + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathrm T}$, and $\mathbf{A} = \mathrm{diag}(\alpha_1, \ldots, \alpha_M)$.
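For concreteness, a minimal numerical sketch of evaluating the collapsed likelihood for given $\mathbf{X}$, $\mathbf{U}$, $\boldsymbol{\alpha}$, and $\sigma^2$ is given below, assuming the RBF kernel and the Gaussian-marginal form written in (7); all sizes are placeholder values, and this is our illustration rather than the authors' implementation.

```python
import numpy as np

def rbf_design(X, U, width):
    """Phi[n, m] = exp(-||x_n - u_m||^2 / (2 * width^2))."""
    sq = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * width ** 2))

def collapsed_log_likelihood(Y, X, U, alpha, sigma2, width):
    """log p(Y | X, U, alpha, sigma2) with W integrated out:
    a product over the D output dimensions of N(y^(j) | 0, C),
    C = sigma2 * I + Phi diag(1/alpha) Phi^T."""
    N, D = Y.shape
    Phi = rbf_design(X, U, width)
    C = sigma2 * np.eye(N) + (Phi / alpha) @ Phi.T   # Phi A^{-1} Phi^T + sigma2 I
    L = np.linalg.cholesky(C)                        # C = L L^T
    logdet = 2.0 * np.log(np.diag(L)).sum()
    Linv_Y = np.linalg.solve(L, Y)                   # L^{-1} Y
    quad = (Linv_Y ** 2).sum()                       # tr(Y^T C^{-1} Y)
    return -0.5 * (D * (logdet + N * np.log(2 * np.pi)) + quad)

rng = np.random.default_rng(1)
N, q, D, M = 50, 2, 5, 8
Y = rng.standard_normal((N, D))
X = rng.standard_normal((N, q))
U = rng.standard_normal((M, q))
alpha = np.ones(M)
print(collapsed_log_likelihood(Y, X, U, alpha, sigma2=0.1, width=1.0))
```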
The integration over the weights space according to its Gaussian prior means that we take an average over the generalized linear model (2) according to the weights prior. This reduces the uncertainty on the weights, and we then solve for the latent variables $\mathbf{X}$ by incorporating our beliefs under the assumed RULVM model (6). This process is similar to the dual PPCA [1]. In fact, if the kernel bases in (2) are replaced with the components of the latent variable $\mathbf{x}$, then we have the PPCA model. In PPCA, the model integration is carried out with respect to all the latent variables. After the weights are learned in the PPCA maximum-likelihood (ML) framework, a postprocessing procedure is used to obtain all the latent variables based on the learned weights. However, in RULVM, the integration is conducted over the weight space first; thus, a Bayesian learning process is implemented to deal with the resulting model (7). In the second stage, once the latent variable $\mathbf{X}$ and the units $\mathbf{U}$ have been found, model (2) becomes a linear Gaussian model that permits exact inference for the weights $\mathbf{W}$ [33].
C. Relations to Other Works

Note 1: In terms of model format, model (2) is not new, and it has been employed in many supervised learning algorithms such as [3], [22], and [24]–[26] under variant names. In our previous work [3], we call (2) the RUM. Although both the RUM algorithm and the GKM algorithm [24]–[26] result in a model similar to (2), significant differences between the two approaches exist. The GKM algorithm employs a greedy forward selection procedure to select the kernel "centers" or units, while the RUM algorithm concurrently looks for all the units in the learning procedure introduced in Section III. One of the advantages offered by the GKM algorithm is that the sparsity of the model is controlled by selecting an appropriate $M$ in the learning according to the model errors. Another major difference is that the GKM works under the supervised regression/classification setting, while model (2) is employed here in the unsupervised setting for the goal of dimensionality reduction. The third difference is that, due to the use of the collapsed likelihood (7), the actual underlying model used in inferring the latent variables $\mathbf{X}$ and $\mathbf{U}$ is the "average" version of (2).

Note 2: The proposed RULVM method has a deep link with the standard Gaussian process latent variable model (GPLVM) [1], [2]. Both use latent variable models. In RULVM, an explicit parametric generalized linear model (2) is assumed, while in GPLVM a nonparametric model is specified in terms of a Gaussian process (GP). Due to the nature of GP regression, a latent variable model similar to (2) can be derived in which all the latent data are used to define the latent units. Significant computational overheads are thus incurred in GPLVM, although the sparse GP techniques [34] can be applied as done in [1]. Our experiments demonstrated that the computing cost of RULVM is much less than what is needed in GPLVM, due to the sparse models employed in RULVM. GP regression produces a kernel-based model similar to (2) in full size, and the function $k$, i.e., the covariance function of the process, must satisfy the Mercer kernel condition [11]. Such a condition can be removed in RULVM. The $k$ used in (2) can be any symmetric "bivariate" function. (Actually, the symmetry can be removed as well; more generally, the function $k$ can be a member of a function family parameterized by the parameter $\mathbf{u}$.) There is great flexibility in choosing the function $k$. Although the radial basis function (RBF) kernel is widely used in kernel-based modeling, the RBF kernel is quite localized. In some cases, one may prefer a global basis function. For example, a thin-plate spline basis is more useful than a local basis in computer graphics.

Note 3: The above model can be easily extended to cope with outliers in the data $\mathbf{Y}$. For example, an L1 Laplacian noise model as used in [35] can replace the Gaussian noise assumption in (2), with which a robust RULVM can be formulated under the variational Bayesian inference procedure. We leave this for another work.

III. THE RULVM ALGORITHM

Ideally, we would integrate over all the parameters, but this is too computationally expensive. There is, however, a middle ground: we can marginalize out the weights $\mathbf{W}$ without computational penalty, which leads to the collapsed likelihood (7) used below.

A. Algorithm Derivation

We start with (7) by adopting a MAP approach. The learning process can be easily established by denoting by $\mathcal{L}$ the log likelihood as defined by (7). For the sake of convenience, we introduce the new notation $\boldsymbol{\Sigma} = (\sigma^{-2}\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi} + \mathbf{A})^{-1}$ for an $M \times M$ matrix and an $M \times D$ matrix $\boldsymbol{\mu} = \sigma^{-2}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{Y}$ (actually, $\boldsymbol{\mu}$ is the estimate for the weights matrix $\mathbf{W}$). By a tedious calculation process, we can obtain the derivatives of $\mathcal{L}$ with respect to $\boldsymbol{\alpha}$ and $\sigma^2$, given in (8) and (9), in which $\Sigma_{mm}$ is the $m$th diagonal element of $\boldsymbol{\Sigma}$ and $\boldsymbol{\mu}_m$ is the $m$th row of the matrix $\boldsymbol{\mu}$. Similarly, we have the derivative of $\mathcal{L}$ with respect to $\beta$, given in (10). However, computing the derivatives of $\mathcal{L}$ with respect to $\mathbf{X}$, $\mathbf{U}$, and $\boldsymbol{\theta}$ is much more complicated. For clarity, we should work out $\partial\mathcal{L}/\partial\boldsymbol{\Phi}$ first. It can be shown that $\partial\mathcal{L}/\partial\boldsymbol{\Phi}$ has a closed form in terms of $\mathbf{C}^{-1}$, $\mathbf{Y}$, and $\mathbf{A}^{-1}$.
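As a sanity check of this closed form, the gradient of the likelihood term of (7) with respect to $\boldsymbol{\Phi}$ can be compared against finite differences. The explicit expression used below, $(\mathbf{C}^{-1}\mathbf{S}\mathbf{C}^{-1} - D\,\mathbf{C}^{-1})\boldsymbol{\Phi}\mathbf{A}^{-1}$ with $\mathbf{S} = \mathbf{Y}\mathbf{Y}^{\mathrm T}$, is our own derivation under the Gaussian-marginal reading of (7), not a formula quoted from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, D, sigma2 = 12, 4, 3, 0.1
Y = rng.standard_normal((N, D))
Phi = rng.standard_normal((N, M))
A_inv = np.diag(1.0 / rng.uniform(0.5, 2.0, size=M))   # diag(1 / alpha_m)

def loglik(Phi):
    """Likelihood term of (7), up to an additive constant."""
    C = sigma2 * np.eye(N) + Phi @ A_inv @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    Cinv = np.linalg.inv(C)
    return -0.5 * D * logdet - 0.5 * np.trace(Cinv @ (Y @ Y.T))

# Closed-form gradient dL/dPhi = (C^{-1} S C^{-1} - D C^{-1}) Phi A^{-1}, S = Y Y^T.
C = sigma2 * np.eye(N) + Phi @ A_inv @ Phi.T
Cinv = np.linalg.inv(C)
grad = (Cinv @ (Y @ Y.T) @ Cinv - D * Cinv) @ Phi @ A_inv

# Finite-difference check on one entry of Phi.
eps = 1e-6
E = np.zeros_like(Phi); E[3, 1] = eps
numeric = (loglik(Phi + E) - loglik(Phi - E)) / (2 * eps)
print(grad[3, 1], numeric)   # the two values should agree closely
```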
According to the chain rule, we have
$$\frac{\partial \mathcal{L}}{\partial \mathbf{u}_m} = \sum_{n=1}^{N} \frac{\partial \mathcal{L}}{\partial \Phi_{nm}}\, \frac{\partial \Phi_{nm}}{\partial \mathbf{u}_m}. \qquad (11)$$
Similarly, we have the derivatives (12) and (13) of $\mathcal{L}$ with respect to $\mathbf{x}_n$ and the kernel hyperparameters $\boldsymbol{\theta}$, respectively. In fact, $\partial k/\partial\mathbf{x}_n$, $\partial k/\partial\mathbf{u}_m$, and $\partial k/\partial\boldsymbol{\theta}$ depend on the particular kernel function $k$. Once the analytic expression of a kernel function is given, it is easy to find out these derivatives. For example, if we use the RBF kernel function $k(\mathbf{x}, \mathbf{u}) = \exp\big(-\|\mathbf{x} - \mathbf{u}\|^2 / (2\theta^2)\big)$, where $\theta$ is the kernel hyperparameter, then
$$\frac{\partial k(\mathbf{x}_n, \mathbf{u}_m)}{\partial \mathbf{x}_{n'}} = -\delta_{nn'}\,\frac{\mathbf{x}_n - \mathbf{u}_m}{\theta^2}\,\Phi_{nm}, \quad \frac{\partial k(\mathbf{x}_n, \mathbf{u}_m)}{\partial \mathbf{u}_m} = \frac{\mathbf{x}_n - \mathbf{u}_m}{\theta^2}\,\Phi_{nm}, \quad \frac{\partial k(\mathbf{x}_n, \mathbf{u}_m)}{\partial \theta} = \frac{\|\mathbf{x}_n - \mathbf{u}_m\|^2}{\theta^3}\,\Phi_{nm}$$
where $\Phi_{nm} = k(\mathbf{x}_n, \mathbf{u}_m)$ is the $(n, m)$th element of the matrix $\boldsymbol{\Phi}$ and $\delta_{nn'}$ is the Kronecker delta symbol with $\delta_{nn'} = 1$ when $n = n'$; otherwise, $\delta_{nn'} = 0$. With (8)–(13), a gradient-type algorithm for maximizing the joint log likelihood can be constructed. Once all the parameters have been determined with the gradient-type algorithm, the weight matrix $\mathbf{W}$ can be immediately solved via a normal regularized least squares problem. In fact, we have
$$\widehat{\mathbf{W}} = \boldsymbol{\mu} = \big(\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi} + \sigma^2\mathbf{A}\big)^{-1}\boldsymbol{\Phi}^{\mathrm T}\mathbf{Y}$$
and the covariance of the prediction outcome for a new unseen input $\mathbf{x}_*$ can be given by $\sigma_*^2 = \sigma^2 + \boldsymbol{\phi}(\mathbf{x}_*)^{\mathrm T}\boldsymbol{\Sigma}\,\boldsymbol{\phi}(\mathbf{x}_*)$, where $\boldsymbol{\phi}(\mathbf{x}_*) = [k(\mathbf{x}_*, \mathbf{u}_1), \ldots, k(\mathbf{x}_*, \mathbf{u}_M)]^{\mathrm T}$.

B. Several Issues in Practical Implementation of the Algorithm

Note 1: The optimization problem presented in the last section is highly nonlinear, and it may suffer from the local minima problem. The scaled conjugate gradient (SCG) algorithm [36] can be used to compute the optimal $\mathbf{X}$, $\mathbf{U}$, and kernel hyperparameters. The dimension $q$ of the embedded data is usually assigned according to the requirements of the application. For example, in visualization, it is normally 2. As a by-product of this optimization process, the optimal kernel hyperparameters ensure that the kernel is well adjusted for the task at hand. Practical implementation of the above algorithm must address the initialization of all the latent variables and parameters. While, in our experiments, the algorithm converges for any starting values of the variational parameters, poor choices of initialization can lead to local maxima that are far away from the best likelihood. Particularly, we initialize the algorithm by setting the latent embedding $\mathbf{X}$ to the one given by the kernel PCA algorithm [11], the latent units $\mathbf{U}$ to the $M$ cluster centers obtained by $k$-means clustering of the kernel PCA embeddings, and the remaining parameters according to a random selection. We run the algorithm multiple times and choose the final parameter settings that give the best bound on the likelihood.

Note 2: The number $M$ used in model (2) not only controls the sparsity of the underlying model but also determines the computational complexity of the learning procedure. The major computational cost is for the calculation of the matrix $\mathbf{C}^{-1}$, the inverse of a symmetric matrix. Generally speaking, the overall complexity of the procedure with a Cholesky decomposition is at a manageable level of $O(NM^2)$, where $N$ is the number of training data. When $M$ is too large, not only do we have a high complexity issue, but we may also face overfitting issues. We would like to suggest that one should set $M$ in the range $[0.05N, 0.2N]$, where $N$ is the total number of training data, usually larger than 100.

Note 3: While we prefer a smaller $M$, too small an $M$ may result in underfitting problems. Determining an appropriate value of $M$ for a practical problem at hand is still an issue to be explored further. For the supervised learning RUM algorithm, we have investigated a way of determining $M$ based on the Akaike information criterion (AIC) [37]. However, it is much harder to propose such a way in the unsupervised learning case. As the major purpose of this paper is to learn the low-dimensional representation for the observed data $\mathbf{Y}$, model (2) is only considered as a possible relation between the data and its latent counterpart, rather than for its accurate predictive capability as required in regression problems. In this sense, an accurate value for $M$ is less important for RULVM than for RUM. In the next section, we use the one nearest neighbor (1NN) error as a criterion to demonstrate the limited effect of $M$ in the unsupervised setting.

C. Back-Constraint Algorithm
RULVM learns the latent low-dimensional embedding $\mathbf{X}$ for all the training data and a smooth probabilistic mapping from the latent space $\mathcal{X}$ to the data space $\mathcal{Y}$; see (2). Thus, for a given latent datum $\mathbf{x}$, (2) tells its corresponding data point $\mathbf{y}$. However, in the application of dimensionality reduction techniques, it is more important to find a low-dimensional representation $\mathbf{x}_*$ for a given new observation $\mathbf{y}_*$. This is the so-called out-of-sample problem. In addition to the mapping (2), we wish to learn a mapping from the data space $\mathcal{Y}$ to the latent space $\mathcal{X}$. One approach that can be adopted in the above learning procedure is to introduce a back constraint on the latent variable $\mathbf{X}$ as done in [2] and [38]. The back-constraint approach is particularly useful in visualizing test data and is implemented and used in our experiments in the next section. The back-constraint mapping to be used is defined as follows:
$$x_{nj} = g_j(\mathbf{y}_n; \boldsymbol{\Gamma}), \qquad j = 1, \ldots, q \qquad (14)$$
where $x_{nj}$ is the $j$th component of $\mathbf{x}_n$ and $g_j$ is a function parameterized by the parameters $\boldsymbol{\Gamma}$. The back constraints in the general form (14) can be incorporated into the RULVM algorithm by replacing all the $\mathbf{x}_n$ in (7) with $\mathbf{g}(\mathbf{y}_n; \boldsymbol{\Gamma})$, and then we conduct the ML algorithm on $\mathcal{L}$ with respect to the parameters $\boldsymbol{\Gamma}$ instead of $\mathbf{X}$. Once all the parameters $\boldsymbol{\Gamma}$ are determined, all the latent embeddings can be calculated according to (14). Similarly, for any new observation $\mathbf{y}_*$, its latent variable can be computed by (14) as well. The new algorithm is straightforward: by combining (12) and (14), it is easy to get all the required derivatives with respect to $\boldsymbol{\Gamma}$.

In the next section, we have tested two types of back-constraint mappings. First, we tested an RBF neural network enlightened by neuroscaling [39]
$$g_j(\mathbf{y}_n) = \sum_{i=1}^{N} a_{ji}\, k_{\mathrm{bc}}(\mathbf{y}_n, \mathbf{y}_i) \qquad (15)$$
where $k_{\mathrm{bc}}(\cdot, \cdot)$ is an RBF kernel defined on the data space and the $a_{ji}$ are the back-constraint parameters. Second, we tested an MLP neural network defined as follows:
$$g_j(\mathbf{y}_n) = \sum_{i} v_{ji}\, \sigma\big(\mathbf{b}_i^{\mathrm T}\mathbf{y}_n + c_i\big) \qquad (16)$$
where $g_j$ is the $j$th component of $\mathbf{g}$ and $\sigma(\cdot)$ is the sigmoid neuron function [40].

IV. EXPERIMENTS

To investigate the performance of the proposed approach, we conducted experiments on both synthetic and real-world data. The computer used in our experiments runs Microsoft Windows XP Professional Version 2002 SP2 with a 2.13-GHz Intel(R) Core(TM)2 CPU and 3.00-GB RAM, with MATLAB version 7.0.

Fig. 1. Predictions for the simple scalar sinc function modeling problem: dots are the noisy testing data, the solid curve is the underlying function sinc(x), the dashed–dotted curve is the prediction generated by RUM, the dashed curve is the prediction generated by RVM, and the markers indicate the RUs selected by RUM and the RVs selected by RVM, respectively. (a) Predictions produced by the 7-term RVM and the 5-term RUM. (b) Predictions produced by the 7-term RVM and the 6-term RUM. (c) Predictions produced by the 7-term RVM and the 7-term RUM. (d) Predictions produced by the 7-term RVM and the 8-term RUM.

A. Assessing the RUM Algorithm

To assess the ability of the RUM in building sparse regression models and learning kernel hyperparameters, we first use a synthetic data set to compare RUM to Tipping's RVM [6]. The experiment of the RVM part is conducted using Tipping's MATLAB code (available at http://www.miketipping.com/index.php?page=rvm).

1) Synthetic Data Example: In this example, synthetic data were generated from the scalar function $\mathrm{sinc}(x) = \sin(x)/x$. The training and test data sets are generated by drawing the input $x$ from a uniform distribution and adding target noise given by a Gaussian with zero mean and variance 0.113. The targets are quite noisy compared to the maximal target value of 1. The RBF kernel function used in this experiment takes the following form:
$$k(x, x') = \exp\big(-(x - x')^2 / (2\theta^2)\big)$$
where $\theta$ is called the width of the RBF kernel. A full kernel model is defined for the RVM algorithm by all the RBF regressors with centers at each input training datum. In the experiment, the width of the RBF kernel function is set to 2.1213 for the RVM algorithm, where 9.0 was the best value for the kernel variance as chosen for this example in Tipping's MATLAB code. For the RUM algorithm, the width of the RBF kernel function, the only kernel parameter for the RBF kernel, is treated as an unknown parameter which is automatically estimated by the learning procedure. Both algorithms produce sparse models. Their predictions on unseen data are presented in Fig. 1.
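A sketch of how such a synthetic sinc data set and the full RBF design matrix for the RVM could be generated follows; the input range [-10, 10] and the training/test sizes are our own assumptions (a common choice in sinc benchmarks), since the paper specifies only the noise variance 0.113 and the kernel width.

```python
import numpy as np

rng = np.random.default_rng(3)

def sinc(x):
    """sinc(x) = sin(x)/x with sinc(0) = 1."""
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    nz = x != 0
    out[nz] = np.sin(x[nz]) / x[nz]
    return out

def make_sinc_data(n, noise_var=0.113, low=-10.0, high=10.0):
    """Noisy sinc targets; the input range is an assumption (see text)."""
    x = rng.uniform(low, high, size=n)
    t = sinc(x) + rng.normal(0.0, np.sqrt(noise_var), size=n)
    return x, t

x_train, t_train = make_sinc_data(100)    # training size is a placeholder
x_test, t_test = make_sinc_data(1000)     # test size is a placeholder

def rbf(x, z, theta=2.1213):
    """RBF kernel of width theta, k(x, z) = exp(-(x - z)^2 / (2 theta^2))."""
    return np.exp(-(x - z) ** 2 / (2.0 * theta ** 2))

# Full design matrix for the RVM: one RBF regressor centered at every training input.
K = rbf(x_train[:, None], x_train[None, :])
print(K.shape)    # (100, 100)
```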
Fig. 2. Oil flow data set visualized in the 2-D latent space: “stratified” is represented by crosses, “annular” is represented by circles, and “homogeneous” is represented by pluses. (a) The oil flow training data set visualized by RULVM with RBF back constraint. (b) The oil flow training data set visualized by GPLVM with RBF back constraint. (c) The oil flow test data set visualized with the RBF back mapping learned by RULVM. (d) The oil flow test data set visualized with the RBF back mapping learned by GPLVM.
TABLE I MEAN SQUARE ERRORS OF RUM EXPERIMENTS
The model predictions of the 7-term model produced by the RVM and the 5–8-term models produced by the RUM algorithm are shown in Fig. 1(a)–(d), respectively. Table I compares the mean square error (MSE) values over the training and test sets for the models constructed by the RUM and the RVM. The numbers of iterative loops and the RBF kernel widths are also listed. The numbers of chosen regressors are, respectively, 7 (RVM) and 5–8 (RUM). In summary, the results given by the RUM are comparable to the result generated by the RVM algorithm; however, the performance of the RVM algorithm depends on the choice of the value of the RBF width. In the experiment, we also find that the RUM has a lower computational cost than the RVM algorithm. In these experiments, the procedure of the RUM stabilizes after fewer than 60 iterative loops, while the RVM finally terminates after 1000 iterative loops.

B. Assessing the RULVM Algorithm

In the following experiments, we assess the performance of RULVM in DR, particularly in comparison with GPLVM [1] (the MATLAB source code of GPLVM is freely available from http://www.cs.man.ac.uk/~neill/fgplvm/). The remainder of this section is structured as follows. First, in Section IV-B1, we conduct two learning procedures on the oil data. For each algorithm, we explore the quality of the visualization of both training and testing data in terms of the ease
of separation of the classes in the latent space. We also look at 1NN classification errors in the latent space. It can be demonstrated that RULVM is comparable to GPLVM in general. In Section IV-B2, we use a small-size vowel data set to test the robustness of both procedures in terms of visualization and 1NN testing errors. In Section IV-B3, we turn to a much higher dimensional (256) data set of handwritten digits. Again we compare RULVM with the full-size GPLVM algorithm by seeing how well the different digits are separated in the latent space. We find that the full-size GPLVM suffers severe overfitting problems, while the sparse version of GPLVM fails to produce better results.

1) Multiphase Oil Flow Data: In this example, we look at the "multiphase oil flow" data [41]. This is a synthetic data set modeling nonintrusive measurements on a pipeline transporting a mixture of oil, water, and gas. The flow in the pipe takes one out of three possible configurations: horizontally stratified, nested annular, or homogeneous mixture flow. The data live in a 12-dimensional measurement space, but for each configuration there are only two degrees of freedom: the fraction of water and the fraction of oil. The fraction of gas is redundant, since the three fractions must sum to one. Hence, the data live on a number of "sheets" which are locally approximately 2-D. The data set is divided into training, validation, and test sets, each of which comprises 1000 independent data points. We use the 1000 training data and the 1000 test data in our experiment. The base kernel function is a 12-dimensional RBF kernel function whose width parameter is to be learned by the RULVM procedure. The number of RUs is set to 100. Similarly, we also set the number of active points to 100 in the sparse version "fitc" of GPLVM; see [1].

To visualize the test data, we used a back-constraint mapping in both procedures. The two different back mappings tested in the experiment are the MLP neural network and the RBF kernel machine.
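These two back mappings can be sketched as follows. This is an illustration of the general forms (15) and (16) under our assumed parameterizations; in the actual algorithm the parameters A, V, B, and c would be optimized jointly with the RULVM likelihood rather than set by hand.

```python
import numpy as np

def rbf_back_constraint(Y_new, Y_train, A, width=1.0):
    """x_j(y) = sum_i A[j, i] * k(y, y_i): RBF-network back constraint, cf. (15)."""
    sq = ((Y_new[:, None, :] - Y_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * width ** 2))            # kernel between new and training data
    return K @ A.T                                   # shape (n_new, q)

def mlp_back_constraint(Y_new, V, B, c):
    """x(y) = V sigma(B y + c): one-hidden-layer MLP back constraint, cf. (16)."""
    H = 1.0 / (1.0 + np.exp(-(Y_new @ B.T + c)))     # sigmoid hidden units
    return H @ V.T                                   # shape (n_new, q)

# Toy usage with placeholder parameter values.
rng = np.random.default_rng(4)
N, D, q, n_hidden = 30, 12, 2, 10
Y_train = rng.standard_normal((N, D))
Y_new = rng.standard_normal((5, D))
A = rng.standard_normal((q, N))                      # RBF back-constraint weights
V = rng.standard_normal((q, n_hidden))               # MLP output weights
B = rng.standard_normal((n_hidden, D))               # MLP input weights
c = rng.standard_normal(n_hidden)                    # MLP biases
print(rbf_back_constraint(Y_new, Y_train, A).shape,
      mlp_back_constraint(Y_new, V, B, c).shape)
```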
Fig. 3. Oil flow data set visualized in the 2-D latent space: “stratified” is represented by crosses, “annular” is represented by circles, and “homogeneous” is represented by pluses. (a) The oil flow training data set visualized by RULVM with an MLP back constraint. (b) The oil flow training data set visualized by GPLVM with an MLP back constraint. (c) The oil flow test data set visualized with the MLP back mapping learned by RULVM. (d) The oil flow test data set visualized with the MLP back mapping learned by GPLVM.
Fig. 4. Vowel data set visualized in the 2-D latent space, where vowel /a/ is represented by a cross, /ae/ is represented by a circle, /ao/ is represented by a plus, /e/ is represented by a star, /i/ is represented by a square, /ibar/ is represented by a diamond, /o/ is represented by an upside-down triangle, /schwa/ is represented by a triangle, and /u/ is represented by a left triangle. (a) The vowel training data set visualized by RULVM with an RBF back constraint. (b) The vowel training data set visualized by GPLVM with an RBF back constraint. (c) The vowel test data set visualized with the RBF back mapping learned by RULVM. (d) The vowel test data set visualized with the RBF back mapping learned by GPLVM.
Figs. 2(a) and 3(a) show the 2-D visualization of the training data of the oil flow data given by the RULVM model using RBF and MLP back constraints, respectively, while Figs. 2(b) and 3(b) show the 2-D visualization of the training data given by the GPLVM model. Figs. 2(c)–(d) and 3(c)–(d) show the visualization of the test data under the RULVM and GPLVM models using RBF and MLP back constraints, respectively. Tables II and III report the 1NN classification errors, the numbers of iterative loops, and the model training times. From these tables, we note that, using the RBF back constraint, both RULVM and GPLVM have comparable 1NN errors, while
under the MLP constraint, GPLVM has a 1NN error that is only 30% of that of RULVM. However, the model training times consumed by RULVM are less than 11% of those of GPLVM. In summary, we can claim that both procedures have comparable performance; however, the RULVM procedure takes a much lower computational cost than GPLVM. In this case, the GPLVM procedure had to be terminated at a preset maximum of 1000 iterations.

2) Speaker Vowel Data Set: The oil example above has shown the comparable performance of the two procedures. In the following experiment with the vowel data set, we intend to show a slight difference in their performance when the data size is not very large.
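The 1NN criterion used throughout these comparisons can be computed as in the following sketch (our own illustration): each test point embedded in the latent space is assigned the label of its nearest training embedding, and the error is the fraction of mismatches.

```python
import numpy as np

def one_nn_error(X_train, labels_train, X_test, labels_test):
    """1NN classification error in the latent space."""
    # Pairwise squared distances between test and training embeddings.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    predicted = labels_train[nearest]
    return float(np.mean(predicted != labels_test))

# Toy usage with random 2-D embeddings and labels.
rng = np.random.default_rng(5)
X_tr, y_tr = rng.standard_normal((100, 2)), rng.integers(0, 3, size=100)
X_te, y_te = rng.standard_normal((40, 2)), rng.integers(0, 3, size=40)
print(one_nn_error(X_tr, y_tr, X_te, y_te))
```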
Fig. 5. Vowel data set visualized in the 2-D latent space, where vowel /a/ is represented by a cross, /ae/ is represented by a circle, /ao/ is represented by a plus, /e/ is represented by a star, /i/ is represented by a square, /ibar/ is represented by a diamond, /o/ is represented by an upside-down triangle, /schwa/ is represented by a triangle, and /u/ is represented by a left triangle. (a) The vowel training data set visualized by RULVM with an MLP back constraint. (b) The vowel training data set visualized by GPLVM with an MLP back constraint. (c) The vowel test data set visualized with the MLP back mapping learned by RULVM. (d) The vowel test data set visualized with the MLP back mapping learned by GPLVM.
Fig. 6. Handwritten digits training and testing results in the 2-D latent space using RBF kernel. “0” is represented by crosses; “1” is represented by circles; “2” is represented by pluses; “3” is represented by stars; and “4” is represented by squares. (a) Handwritten digits training data visualized by RULVM with RBF back constraint. (b) Handwritten digits training data visualized by GPLVM with RBF back constraint. (c) Handwritten digits test data visualized with the RBF back mapping learned by RULVM. (d) Handwritten digits test data visualized with the RBF back mapping learned by GPLVM.
We use the single-speaker vowel data set (publicly available in preprocessed form in the data set package at http://www.dcs.shef.au.uk/~neil/fgplvm/). The data set consists of the cepstral coefficients and deltas of nine different vowel phonemes. It was acquired as part of a vocal joystick system [2], [42]. The data set includes 2700 examples of different vowels: /a/, /ae/, /ao/, /e/, /i/, /ibar/, /o/, /schwa/, and /u/. Every class of vowels includes 300 samples, and every sample has 38 features. We only use the first 12 and the last 12 features, so what we used is a data set of 24 features of a single speaker performing nine different vowels 300 times per vowel. We use the first 100 samples of each vowel as the training data and the other 200 samples as the test data. Thus, the total number of training data is 900. In our experiment, the number of RUs in RULVM is 90, and the number of active points in the sparse version "fitc" of GPLVM is 100.
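The feature selection and train/test split described above can be sketched as follows, assuming the data are already available as a 2700 x 38 array with a matching label vector; how the raw files are read is not specified here.

```python
import numpy as np

def prepare_vowel_split(features, labels):
    """Select the first 12 and last 12 of the 38 features, then split each
    vowel class into the first 100 samples (training) and remaining 200 (test).

    `features` is assumed to be an array of shape (2700, 38) and `labels`
    an array of 2700 class indices (9 vowels x 300 samples)."""
    X = np.concatenate([features[:, :12], features[:, -12:]], axis=1)  # 24 features
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        train_idx.extend(idx[:100])
        test_idx.extend(idx[100:300])
    return (X[train_idx], labels[train_idx]), (X[test_idx], labels[test_idx])

# Toy usage with random placeholder data of the right shape.
rng = np.random.default_rng(6)
feats = rng.standard_normal((2700, 38))
labs = np.repeat(np.arange(9), 300)
(train_X, train_y), (test_X, test_y) = prepare_vowel_split(feats, labs)
print(train_X.shape, test_X.shape)   # (900, 24), (1800, 24)
```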
Similarly, we conducted experiments with two sets of back-constraint training, with MLP and RBF. The 2-D visualization of the training data is shown in Figs. 4(a) and 5(a) for the RULVM model using RBF and MLP back constraints, respectively, and in Figs. 4(b) and 5(b) for the GPLVM model. Visualizations of the test data given by the constrained RULVM and GPLVM are shown in Figs. 4(c)–(d) and 5(c)–(d), respectively. Based on a visual assessment, we can see that GPLVM has failed on this data set of a smaller size (900 data points used for training). We also see, in Tables IV and V, that the 1NN errors of RULVM are less than 18% of those of GPLVM, while the model training times consumed by RULVM are less than 6% of the time used by GPLVM. In this experiment, we find that RULVM has superior performance in computational cost and fewer 1NN errors than GPLVM.
Fig. 7. Handwritten digits training and testing results in the 2-D latent space using MLP kernel. “0” is represented by crosses; “1” is represented by circles; “2” is represented by pluses; “3” is represented by stars; and “4” is represented by squares. (a) Handwritten digits training data visualized by RULVM with MLP back constraint. (b) Handwritten digits training data visualized by GPLVM with MLP back constraint. (c) Handwritten digits test data visualized with the MLP back mapping learned by RULVM. (d) Handwritten digits test data visualized with the MLP back mapping learned by GPLVM.
Fig. 8. Handwritten digits training and testing results in the 2-D latent space using RBF kernel with full version GPLVM. “0” is represented by crosses; “1” is represented by circles; “2” is represented by pluses; “3” is represented by stars; and “4” is represented by squares. (a) Handwritten digits training data visualized by full GPLVM with RBF back constraint. (b) Handwritten digits test data visualized by RBF back mapping learned with full GPLVM.
Fig. 9. Handwritten digits training time (1000 s) and testing 1NN errors by RULVM with different active percent of units ranged from 0.05 to 0.20. (a) Handwritten digits training time (1000 s) by RULVM. (b) Handwritten digits testing 1NN errors by RULVM.
This example shows that RULVM is more robust than GPLVM in the case of small-size data sets. Note that when increasing the number of training data, the training performance of GPLVM improves. This can be demonstrated by the demo provided with the GPLVM package. However, it runs the risk of overfitting the data, as demonstrated in the following example.
3) Handwritten Digits Image Data: In this experiment, we aim at visualizing handwritten digits from the United States Postal Service (USPS) database, with 7291 digits for training and 2007 digits for testing [43] (available at http://www.cs.toronto.edu/~roweis/data.html). To make a fair comparison with GPLVM, we only use the digits of the five classes 0–4. All images are in grayscale and have a uniform size of 16 × 16 pixels. As a result, the training data lie in a space of dimension 256.
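Selecting the subset used below can be sketched as follows, assuming the USPS images and labels are already loaded as arrays (the loading itself is not shown); the 500-per-class choice follows the 2500-image training subset described later in this subsection.

```python
import numpy as np

def select_digit_subset(images, labels, classes=(0, 1, 2, 3, 4), per_class=500):
    """Pick `per_class` images of each requested digit class.

    `images` is assumed to have shape (n, 256) (16 x 16 grayscale, flattened)
    and `labels` shape (n,)."""
    keep = []
    for c in classes:
        idx = np.where(labels == c)[0][:per_class]
        keep.extend(idx)
    keep = np.array(keep)
    return images[keep], labels[keep]

# Toy usage with random placeholder arrays shaped like the USPS training set.
rng = np.random.default_rng(7)
imgs = rng.random((7291, 256))
labs = rng.integers(0, 10, size=7291)
sub_imgs, sub_labs = select_digit_subset(imgs, labs)
print(sub_imgs.shape)   # up to (2500, 256)
```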
TABLE II DIMENSIONALITY REDUCTION PERFORMANCE OF RULVM AND GPLVM USING THE RBF BACK CONSTRAINT FOR THE OIL FLOW DATA
TABLE VII DIMENSIONALITY REDUCTION PERFORMANCE OF RULVM AND GPLVM USING MLP KERNEL FOR THE DIGIT DATA 0–4
TABLE III DIMENSIONALITY REDUCTION PERFORMANCE OF RULVM AND GPLVM USING THE MLP BACK CONSTRAINT FOR THE OIL FLOW DATA
TABLE VIII DIMENSIONALITY REDUCTION PERFORMANCE OF RULVM WITH THE ACTIVE PERCENT OF UNITS CHANGED FROM 0.05 TO 0.20
TABLE IV DIMENSIONALITY REDUCTION PERFORMANCE OF RULVM AND GPLVM WITH THE RBF BACK-CONSTRAINT KERNEL FOR THE VOWEL DATA
TABLE V DIMENSIONALITY REDUCTION PERFORMANCE OF RULVM AND GPLVM USING THE MLP BACK CONSTRAINT FOR THE VOWEL DATA
TABLE VI DIMENSIONALITY REDUCTION PERFORMANCE OF RULVM AND GPLVM USING RBF KERNEL FOR THE DIGIT DATA 0–4
We made use of RBF and MLP back constraints in both the RULVM and GPLVM models. For our dimensionality reduction and visualization purposes, we set the dimension of the latent variables to 2. In order to compare RULVM with GPLVM under conditions similar to those used in [1], we applied both RULVM and GPLVM to visualize a subset of 2500 images of the digit classes 0–4 (500 images for each digit class) from the 7291 training data. The RBF kernel is also employed in this experiment, and the width of the RBF kernel is to be learned in the process. The visualization given by the two models is presented. For testing purposes, we use the two types of back constraints as done in the two previous examples. To assess the quality of the visualization of the digits, we still use the 1NN classifier in the latent space on all 1187 digits of classes 0–4 from the 2007 test data. The performance benefits associated with the nonlinear visualization are apparent here. RULVM outperforms GPLVM under this criterion. Figs. 6 and 7 show the 2-D visualization of both training and testing data in the 2-D latent space learned by the RULVM and GPLVM models. In all the figures, digit classes are distinguished by using different colored markers. Tables VI and VII show the 1NN classification errors, the numbers of iterative loops, and the model training times for the two procedures. In our experiments, the procedure of RULVM stabilizes before 400 iterative loops, whereas GPLVM is manually terminated after the maximum of 1000 iterative
loops. It can be seen that the 1NN errors of RULVM are less than 20% of those of GPLVM, while the model training times consumed by RULVM are less than 25% of the time used by GPLVM. The poor testing performance of GPLVM is clearly shown in Fig. 6(d). This indicates an overfitting effect suffered by GPLVM. Fig. 6(b) and (d) shows that GPLVM failed with an RBF back-constraint mapping. To see whether the sparsity assumption undermines the performance of GPLVM, we also tested a full version of GPLVM. On this training data set, it took nearly three days to get the results (manually terminated at the 500th iteration). Fig. 8 shows the 2-D visualization of the training and testing data in the latent space using the full GPLVM model under the RBF back constraint. Although the training result is better than the results from the sparse version of the GP, the testing result shows no meaningful information due to overfitting during training.

To check the influence of different numbers $M$ of units used in RULVM, we tested a range of the active percentage of units between 0.05 and 0.20. Fig. 9(a) and (b) and Table VIII show the training times and testing 1NN errors given by RULVM for the different active percentages of units, respectively. The average time used is around 1500 s against 15319.30 s for GPLVM, and the average 1NN error is about 0.1 against 0.4381 for GPLVM. Furthermore, we leave the discussion about a suitable number of units for RULVM to our future work.

V. CONCLUSION

We extended the RUM from its supervised version to its current unsupervised version under the latent variable framework. The proposed RULVM employs the sparse model assumed in the RUM framework and thus offers an efficient procedure for nonlinear DR. Our experiments demonstrated significant improvement over GPLVM in terms of computational cost and robustness to overfitting and underfitting. In addition, the sparse mapping makes the reconstruction from the low-dimensional space to the high-dimensional space much easier. Along with the back-constraint approach, RULVM can also be adopted as a learning process by which the out-of-sample problem in DR can be resolved.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for many excellent comments.
REFERENCES
[1] N. Lawrence, "Probabilistic non-linear principal component analysis with Gaussian process latent variable models," J. Mach. Learn. Res., vol. 6, pp. 1783–1816, 2005.
[2] N. D. Lawrence and J. Quiñonero-Candela, "Local distance preservation in the GP-LVM through back constraints," in Proc. Int. Conf. Mach. Learn., W. Cohen and A. Moore, Eds., San Francisco, CA, 2006, pp. 513–520.
[3] J. Gao and J. Zhang, "Sparse kernel learning and the relevance units machine," in Lecture Notes in Computer Science, T. T., Ed. Berlin, Germany: Springer-Verlag, 2009, vol. 5476, pp. 612–619.
[4] M. Tipping, "The relevance vector machine," in Advances in Neural Information Processing Systems, S. Solla, T. Leen, and K. Müller, Eds. Cambridge, MA: MIT Press, 2000, vol. 12, pp. 652–658.
[5] C. Bishop and M. Tipping, "Variational relevance vector machines," in Uncertainty in Artificial Intelligence 2000, C. Boutilier and M. Goldszmidt, Eds. San Mateo, CA: Morgan Kaufmann, 2000, pp. 46–53.
[6] M. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211–244, 2001.
[7] L. van der Maaten, E. O. Postma, and H. van den Herik, "Dimensionality reduction: A comparative review," Tilburg Univ., Tilburg, The Netherlands, Tech. Rep. TiCC-TR 2009-005, 2009.
[8] Y. Guo, J. B. Gao, and P. W. Kwan, "Twin kernel embedding," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 8, pp. 1490–1495, Aug. 2008.
[9] M. Tipping and C. Bishop, "Probabilistic principal component analysis," J. R. Statist. Soc. B, vol. 61, no. 3, pp. 611–622, 1999.
[10] M. A. Carreira-Perpinan and Z. Lu, "Dimensionality reduction by unsupervised regression," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[11] B. Schölkopf and A. Smola, Learning With Kernels. Cambridge, MA: MIT Press, 2002.
[12] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[13] S. Chen, "Local regularization assisted orthogonal least squares regression," Neurocomputing, vol. 69, no. 4–6, pp. 559–585, 2006.
[14] B. Kruif and T. Vries, "Support-vector-based least squares for learning non-linear dynamics," in Proc. 41st IEEE Conf. Decision Control, Las Vegas, NV, 2002, pp. 10–13.
[15] T. Gestel, M. Espinoza, J. Suykens, C. Brasseur, and B. De Moor, "Bayesian input selection for nonlinear regression with LS-SVMS," in Proc. 13th IFAC Symp. Syst. Identif., Rotterdam, The Netherlands, 2003, pp. 27–29.
[16] J. Valyon and G. Horváth, "A generalized LS-SVM," in Proc. 13th IFAC Symp. Syst. Identif., J. Principe, L. Gile, N. Morgan, and E. Wilson, Eds., Rotterdam, The Netherlands, 2003, pp. 827–832.
[17] J. Suykens, T. van Gestel, J. De Brabanter, and B. De Moor, Least Squares Support Vector Machines. Singapore: World Scientific, 2002.
[18] P. Drezet and R. Harrison, "Support vector machines for system identification," in Proc. UKACC Int. Conf. Control, Swansea, U.K., 1998, pp. 688–692.
[19] S. Chen, X. Hong, and C. Harris, "An orthogonal forward regression technique for sparse kernel density estimation," Neurocomputing, vol. 71, pp. 931–943, 2008.
[20] S. Chen, C. Cowan, and P. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Trans. Neural Netw., vol. 2, no. 2, pp. 302–309, Mar. 1991.
[21] J. Gao, D. Shi, and X. Liu, "Critical vector learning to construct sparse kernel regression modelling," Neural Netw., vol. 20, no. 7, pp. 791–798, 2007.
[22] M. Wu, B. Schölkopf, and G. Bakir, "A direct method for building sparse kernel learning algorithms," J. Mach. Learn. Res., vol. 7, pp. 603–624, 2006.
[23] C. Burges, "Simplified support vector decision rules," in Proc. 13th Int. Conf. Mach. Learn., 1996, pp. 71–77.
[24] X. Wang, S. Chen, D. Lowe, and C. Harris, "Sparse support vector regression based on orthogonal forward selection for the generalized kernel model," Neurocomputing, vol. 70, no. 1–3, pp. 462–474, 2006.
[25] X. Wang, S. Chen, D. Lowe, and C. Harris, "Parsimonious least squares support vector regression using orthogonal forward selection with the generalized kernel model," Int. J. Model. Identif. Control, vol. 1, no. 4, pp. 245–256, 2006.
[26] S. Chen, X. Hong, B. Luk, and C. Harris, "Construction of tunable radial basis function networks using orthogonal forward selection," IEEE Trans. Syst. Man Cybern. B, Cybern., vol. 39, no. 2, pp. 457–466, Apr. 2009.
[27] D. J. Bartholomew, Latent Variable Models and Factor Analysis. London, U.K.: Charles Griffin, 1987.
[28] A. T. Basilevsky, Statistical Factor Analysis and Related Methods: Theory and Applications. New York: Wiley, 1994.
[29] A. Honkela and H. Valpola, "Unsupervised variational Bayesian learning of nonlinear models," in Advances in Neural Information Processing Systems, L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge, MA: MIT Press, 2005, vol. 17, pp. 593–600.
[30] C. Bishop, M. Svensén, and C. Williams, "GTM: The generative topographic mapping," Neural Comput., vol. 10, no. 1, pp. 215–234, 1998.
[31] B. Schölkopf, A. Smola, and K. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299–1319, 1998.
[32] C. Bishop, Pattern Recognition and Machine Learning, ser. Information Science and Statistics. New York: Springer-Verlag, 2006.
[33] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
[34] J. Quiñonero-Candela and C. Rasmussen, "A unifying view of sparse approximate Gaussian process regression," J. Mach. Learn. Res., vol. 6, pp. 1939–1959, 2005.
[35] J. Gao, "Robust L1 principal component analysis and its Bayesian variational inference," Neural Comput., vol. 20, no. 2, pp. 555–572, 2008.
[36] M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Netw., vol. 6, no. 4, pp. 525–533, 1993.
[37] J. Zhang, J. Gao, and J. Tian, "Relevance units machine based on Akaike's information criterion," in Proc. 6th Int. Symp. Multispectral Image Process. Pattern Recognit., M. Ding, B. Bhanu, F. Wahl, and J. Roberts, Eds., Yichang, China, 2009, vol. 7496, pp. 1–8.
[38] Y. Guo, J. Gao, and P. Kwan, "Twin kernel embedding with back constraints," in Proc. Int. Conf. Data Mining, Omaha, NE, 2007, pp. 319–324, DOI: 10.1109/ICDMW.2007.112.
[39] D. Lowe and M. Tipping, "Feed-forward neural networks and topographic mappings for exploratory data analysis," Neural Comput. Appl., vol. 4, no. 2, pp. 83–95, 1996.
[40] C. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Clarendon, 1995.
[41] C. Bishop and G. D. James, "Analysis of multiphase flows using dual-energy gamma densitometry and neural networks," Nuclear Instrum. Methods Phys. Res., vol. A327, pp. 580–593, 1993.
[42] J. A. Bilmes, J. Malkin, and X. Li, "The vocal joystick," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Toulouse, France, 2006, vol. I, pp. 625–628.
[43] G. E. Hinton and S. T. Roweis, "Stochastic neighbor embedding," in Advances in Neural Information Processing Systems, S. Becker, S. Thrun, and K. Obermayer, Eds. Cambridge, MA: MIT Press, 2003, vol. 15, pp. 833–840.
Junbin Gao received the B.Sc. degree in computational mathematics from Huazhong University of Science and Technology (HUST), Wuhan, Hubei, China, in 1982 and the Ph.D. degree in computational mathematics from Dalian University of Technology, Dalian, China, in 1991. In July 2005, he joined the School of Information Technology (now Computing and Mathematics), Charles Sturt University, Bathurst, N.S.W., Australia, as an Associate Professor in Computer Science. He was a Senior Lecturer and a Lecturer in Computer Science from 2001 to 2005 at the University of New England, Australia. From 1982 to 2001, he was an Associate Lecturer, a Lecturer, an Associate Professor, and a Professor in the Department of Mathematics, HUST. His main research interests include machine learning, kernel methods, Bayesian learning and inference, and image analysis.
Jun Zhang received the B.S. degree in mathematics from Shanghai Jiaotong University, Shanghai, China, in 1986 and the M.S. degree in communication and information system and the Ph.D. degree in pattern recognition and intelligent systems from Huazhong University of Science and Technology (HUST), Wuhan, Hubei, in 1999 and 2006, respectively. In 2007, he was a Postdoctoral Researcher at the School of Life Science, HUST. In 2008, he was a Visiting Academic at the School of Computing and Mathematics, Charles Sturt University, Bathurst, N.S.W., Australia. He is currently an Associate Professor at the Institute for Pattern Recognition and Artificial Intelligence, HUST. His current research interests include machine learning, data mining, and computer vision.
David Tien (M’00) received the B.S. degree in computer science from the Chinese Academy of Sciences, Beijing, China, in 1982, the M.S. degree in pure mathematics from the Ohio State University, Columbus, in 1985, and the Ph.D. degree in electrical engineering from the University of Sydney, Sydney, N.S.W., Australia, in 1995. His interests are in the areas of image processing, geographic information system (GIS), biomedical engineering, and modern politics. Currently, he teaches computer science at Charles Sturt University (CSU), Bathurst, N.S.W., Australia. Dr. Tien serves as the Chairman of IEEE NSW Section and is the Secretary of National Tertiary Education Union (NTEU), CSU Branch.