Maximum Entropy Discrimination Denoising Autoencoders


Sotirios P. Chatzis

arXiv:1402.3427v1 [cs.LG] 14 Feb 2014



Abstract—Denoising autoencoders (DAs) are typically applied to relatively large datasets for unsupervised learning of representative data encodings; they rely on the idea of making the learned representations robust to partial corruption of the input pattern, and perform learning by means of stochastic gradient descent. In this paper, we present a fully Bayesian DA architecture that allows for the application of DAs even when data is scarce. Our novel approach formulates the signal encoding problem from a nonparametric Bayesian viewpoint, considering a Gaussian process prior over the latent input encodings generated given the (corrupt) input observations. Subsequently, the decoder modules of our model are formulated as large-margin regression models, treated under the Bayesian inference paradigm, by exploiting the maximum entropy discrimination (MED) framework. We exhibit the effectiveness of our approach using several datasets, dealing with both classification and transfer learning applications.

Index Terms—Autoencoders, maximum entropy discrimination, mean-field, Gaussian process.

1 INTRODUCTION

It has recently become obvious that deep architectures allow for obtaining better modeling and representational capacities in the context of challenging pattern recognition applications [1]. Deep belief networks and stacked autoencoders are characteristic examples of such successful architectures [2], [3], [4], [5]. The stunning performance of these approaches in a multitude of applications is largely attributed to the use of an unsupervised training criterion to perform a layerwise initialization: each layer is at first trained to produce a higher-level (hidden) representation of the observed patterns, based on the representation it receives as input from the layer below, by optimizing a local unsupervised criterion. After initialization, global fine-tuning of the model parameters may be performed by means of an appropriate training criterion. This way of training allows these deep architectures to avoid getting stuck in poor local optima, despite their complexity, contrary to most previous deep neural network architectures (e.g., [6]).

In this work, we focus on autoencoder models [7]; specifically, we are interested in a recent variation of this method, namely denoising autoencoders [8]. Traditional autoencoders (AEs) are unsupervised feature extractors designed to retain at least a certain amount of "information" about their input, while at the same time being constrained to a given form for their output (e.g., a real-valued vector of a given size). Denoising autoencoders (DAs) are based on the observation that humans can recognize objects in partially occluded or corrupted images: this leads to the intuitive concept that an effective extracted representation should be capable of capturing stable patterns in the modeled high-dimensional data from partial observation only. Based on this intuition, DAs are designed to satisfy the additional criterion of robustness of the extracted representation to partial destruction of the input; that is, they are optimized to reconstruct input data from partial and random corruption. DAs can be stacked into deep learning architectures, yielding the so-called stacked DA (SDA) architectures [8], which have been shown to yield excellent data modeling performance in several benchmark applications [8], [7], [9], [5], [3], [2].

Existing DA architectures are essentially based on simple feedforward neural network principles. As such, their training algorithms consist in obtaining point-estimates of the model parameters by minimizing a suitable reconstruction error criterion. A repercussion of this design choice is that DAs and their deep learning counterparts, i.e. stacked DAs (SDAs), require large amounts of data to perform learning [10]. This problem is essentially a result of the fact that obtaining point-estimates of the entailed parameters is a frequentist method [11] that works well only under the condition of training using relatively large datasets. However, we know that humans are able to perform inductive reasoning (equivalent to concept generalization) with only a few examples [12]. As such, it is interesting to consider whether DA models can be reformulated in such a way that allows for performing learning using scarce datasets.

To address this issue, in this paper we consider a fully Bayesian treatment of DA networks. Contrary to frequentist methods that consider model parameters to be fixed, and the training data to be some random sample from an infinite number of existing but unobserved data points, our Bayesian treatment is designed under the assumption of dealing with fixed scarce data and parameters that are random because they are unknowns. By imposing a suitable prior over the model parameters, which encodes prior knowledge or the results of a previous model, and obtaining a corresponding posterior distribution based on the given observed data, our Bayesian inference approach allows for including uncertainty in learning the model and generating much better encodings (when dealing with limited and scarce datasets).

Recently, several researchers have attempted to exploit the merits of Bayesian inference in the context of parametric unsupervised feature extraction methodologies. However, in most cases, these approaches consist in merely merging supervised nonparametric Bayesian prediction techniques with existing parametric unsupervised learning methodologies. For example, [13] use restricted Boltzmann machines (RBMs) to extract features in an unsupervised manner, which are subsequently fed as inputs to the covariance kernel of a Gaussian process (GP) regressor or classifier [14]. On this basis, after initial RBM pretraining, they adjust the RBM parameters by backpropagating gradients from the GP through the neural network. As we observe, in this approach the employed nonparametric Bayesian model is only used for the purposes of inference at test time, and is not utilized to perform initial learning of the generated latent representations, which are still learned using a conventional approach yielding point-estimates. Similarly, [15] consider a classification model that imposes GP priors on the discriminative function that maps the latent encodings into class labels. The result of this choice is a Gaussian process latent variable model (GPLVM) [16] for the predicted class labels. While much more efficient compared to the approach of [13], this method does not utilize the component nonparametric Bayesian model to perform learning of the generated input encodings either. Further, [17] combined autoencoder training with neighborhood component analysis [18], which encourages the model to encode similar latent representations for inputs belonging to the same class. Finally, [19] recently proposed a deep Gaussian process (DGP) model. DGP is essentially a deep belief network (DBN) [2], [3] in which the component models of each layer are Gaussian processes instead of RBMs. Inference is performed using an approximate variational marginalization technique. As such, both latent encodings and classification predictions are obtained in a fully Bayesian fashion.

In this work, we follow a different path; motivated by the aforementioned advantages of (hierarchical) Bayesian model formulations, we present, for the first time in the literature, a fully Bayesian formulation of DA models, designed for performing learning when data is scarce. Specifically, we consider a nonlinear encoder module, obtained by imposing a Gaussian process prior [14] over the latent encodings of the observed (corrupt) data. This formulation of the postulated encoders allows for extracting much more complex underlying patterns from the observed data compared to simpler architectures. Further, we consider a linear decoder module for our model, with appropriate priors imposed over its parameters. Contrary to existing approaches, in our model the used decoders leverage the large-margin principle to learn the function mapping the obtained latent representations (encodings) to the original data presented to our model. Application of the large-margin learning principle allows for obtaining a more discriminative learning technique, making more effective use of our training data during estimation of the postulated latent data representations.

To introduce the large-margin learning principle in the context of our hierarchical Bayesian model, we build upon the maximum entropy discrimination (MED) framework [20], [21]. The MED framework integrates the large-margin principle with Bayesian posterior inference in an elegant and computationally efficient fashion, allowing us to leverage existing high-performance techniques for both hierarchical Bayesian models and support vector machines (SVMs). Adoption of the MED framework in the context of our model is performed by optimizing a composite objective function that takes into consideration both the (negative) log-likelihood of our hierarchical Bayesian model, which measures the goodness of fit to the training data, and a measure of reconstruction error on the training data. On this basis, we derive a mean-field inference algorithm, which obtains a regularized posterior distribution over the latent encodings of the observed data in a feasible space defined by a set of expected margin constraints generalized from the familiar SVM-style margin constraints. We dub our model the maximum entropy discrimination denoising autoencoder (MED²A).

The remainder of this paper is organized as follows: In Section 2, we provide a brief review of the main concept behind traditional autoencoders and their denoising variants. In Section 3, we introduce our method, and derive its inference algorithm. In Section 4, we conduct the experimental evaluation of our approach. Finally, in the last section, we summarize and discuss our results, and conclude this work.

2 AUTOENCODER NEURAL NETWORKS

2.1 Traditional autoencoders

A traditional autoencoder comprises two distinct parts: an encoder and a decoder. The encoder is a deterministic mapping of a d-dimensional input vector x to a hidden representation y. Usually, it is taken as an affine transform followed by a squashing non-linearity, of the form

y ≜ f(x; θ) = σ(W x + b)    (1)

where W ∈ R^{d'×d}, b ∈ R^{d'×1}, and θ = {W, b}. From the resulting representation y, we can reconstruct its corresponding vector in the considered d-dimensional input space by using the decoder of the AE. Specifically, the obtained input reconstruction z is again derived from an affine mapping, optionally followed by a squashing non-linearity, and reads

z ≜ g(y; θ') = σ(W' y + b')    (2)

where W' ∈ R^{d×d'}, b' ∈ R^{d×1}, and θ' = {W', b'}. The obtained reconstructions z are generally regarded as an
approximation of the actual corresponding input vectors x. Typically, they are taken as the mode of a considered distribution p(x|z), which is learnt in an unsupervised way through model training. This, in turn, gives rise to an associated reconstruction error minimized through model training, that reads [8]

L(x, z) ∝ −log p(x|z)    (3)

In cases of real-valued observations x, the distribution p(x|z) is typically taken to be of Gaussian form, reading:

p(x|z) = N(x | z, σ² I)    (4)

which, essentially, gives rise to a squared reconstruction error loss objective function L(x, z). Similarly, in cases of binary observations x, or x ∈ [0, 1]^d, it is usually considered that

p(x|z) = Bernoulli(x | z)    (5)

where Bernoulli(x|z) = ∏_{i=1}^d z_i^{x_i} (1 − z_i)^{1−x_i}. This, in turn, gives rise to a cross-entropy reconstruction error loss L(x, z).

As already mentioned, AE training consists in minimizing the reconstruction loss over the whole observations space. In other words, it consists in minimizing the expected reconstruction loss, E_{q(X)}[L(x, z)], where q(X) is the actual distribution of the observed data. Since this latter quantity is hardly ever known, an approximate solution is obtained by minimizing the empirical expectation given a set of training data D = {x_n}_{n=1}^N. In other words, AE training reduces to the optimization problem

min_{θ,θ'}  ∑_{n=1}^N L( x_n, g(f(x_n; θ); θ') )    (6)

which, effectively, aims at retaining as much of the information that was present in the input as possible.

2.2 Denoising autoencoders

It has been shown that the optimal reconstruction criterion of AEs cannot guarantee the extraction of useful features in several cases. Specifically, it has been shown that it may lead to the solution of just copying the input to the output, or to other trivial solutions that yield very low reconstruction error on the training set combined with extremely poor modeling and generalization performance [8]. Utilizing sparse representation strategies that impose dimensionality constraints on the obtained representation is one solution towards the amelioration of this issue (e.g., [7]). Denoising autoencoders address this overfitting problem in a different way [8]. Instead of imposing constraints, DAs modify the employed reconstruction criterion, stipulating that it also denoises partially corrupted input. This approach is based on the assumption that a higher-level representation should be rather stable and robust under corruptions of the input, and that such noise-robust features should capture useful structure in the input distribution. This intuition leads to the training criterion of DAs, which comprises reconstructing a clean "repaired" input from a corrupted version of it.

For this purpose, initially the inputs x used for DA training get corrupted by means of a stochastic mapping (corruption process) x̃ ∼ q(x̃|x). Subsequently, the corrupted inputs x̃ are mapped to a hidden representation y, as in the case of a traditional AE; that is

y ≜ f(x̃; θ) = σ(W x̃ + b)    (7)

Eventually, DAs try to reconstruct the initial, uncorrupted version x of their training inputs x̃; these reconstructions z are obtained by means of (2). Therefore, in DAs z becomes a deterministic function of the corrupted signals x̃, instead of the original ones x they try to reconstruct.

Corruption process selection. Some simple corruption process models often used in the literature are [8]: (i) additive isotropic Gaussian noise (GS), i.e., q(x̃|x) = N(x̃ | x, σ_ε² I); (ii) masking noise (MN), which forces to zero a randomly selected fraction of the elements in x; and (iii) salt-and-pepper noise (SP), which forces a randomly selected fraction of the elements in x to their minimum or maximum value (0 or 1), according to a fair coin flip.

Using DAs to Build Deep Architectures. Initializing deep architectures by stacking DAs is performed in a way fairly similar to traditional AEs [4], [7]. Training proceeds by training the stacked DAs layer-wise, beginning from the lowest-layer DA (which is fed with the original observations).
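To make the above concrete, the following is a minimal sketch of the DA building blocks just described: the corruption processes q(x̃|x), the affine encoder and decoder with a squashing non-linearity (Eqs. (1), (2), (7)), and the cross-entropy reconstruction loss. It is written in Python/NumPy purely for illustration; the parameter initialization, the toy dimensions, and the absence of a training loop are assumptions of this sketch, not prescriptions of the models reviewed above.

```python
# Minimal sketch of the DA building blocks of Section 2 (Eqs. (1), (2), (7)).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, kind="masking", zeta=0.25, sigma=0.1):
    """Stochastic corruption q(x_tilde | x) for the processes of Section 2.2."""
    if kind == "gaussian":                      # additive isotropic Gaussian noise
        return x + sigma * rng.standard_normal(x.shape)
    mask = rng.random(x.shape) < zeta           # fraction zeta of entries selected
    x_tilde = x.copy()
    if kind == "masking":                       # masking noise: force to zero
        x_tilde[mask] = 0.0
    elif kind == "salt_pepper":                 # salt-and-pepper: 0 or 1, fair coin
        x_tilde[mask] = rng.integers(0, 2, size=mask.sum())
    return x_tilde

def encode(x_tilde, W, b):                      # y = sigma(W x_tilde + b), Eq. (7)
    return sigmoid(W @ x_tilde + b)

def decode(y, W_dec, b_dec):                    # z = sigma(W' y + b'), Eq. (2)
    return sigmoid(W_dec @ y + b_dec)

def cross_entropy(x, z, eps=1e-8):              # L(x, z) for x in [0, 1]^d
    return -np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))

# toy usage: d = 8 inputs, d' = 3 hidden units
d, d_hidden = 8, 3
W, b = 0.1 * rng.standard_normal((d_hidden, d)), np.zeros(d_hidden)
W_dec, b_dec = 0.1 * rng.standard_normal((d, d_hidden)), np.zeros(d)
x = rng.random(d)
z = decode(encode(corrupt(x, "masking"), W, b), W_dec, b_dec)
print("reconstruction loss:", cross_entropy(x, z))
```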

3 PROPOSED APPROACH

Let us assume a DA model with N-dimensional inputs fed to it, and expected to generate C-dimensional outputs. We are seeking a fully Bayesian treatment of this model, capable of obtaining competitive performance when data is scarce. Let us denote as Y = {y_{d·}}_{d=1}^D, where y_{d·} = [y_{dn}]_{n=1}^N, the set of the (original) N-dimensional observations presented to our model, and as X = {x_{d·}}_{d=1}^D, where x_{d·} = [x_{dn}]_{n=1}^N, the set of their corresponding corrupt versions, obtained by transforming the original data by employing a suitable corrupting distribution.

Let us begin with the formulation of the encoder of our model: to obtain this module, we elect to impose a GP prior, with input variables {x_{d·}}_{d=1}^D, over the latent encodings generated by our model, say {z_{d·}}_{d=1}^D, where z_{d·} = [z_{dc}]_{c=1}^C. A drawback of such a modeling approach consists in the fact that performing inference for GP models entails inversion of the Gram matrix of the input data [22], resulting in an O(D³) complexity, which is rather inefficient. To ameliorate this issue, we resort to utilization of a popular sparse pseudo-input Gaussian process (SPGP) modeling approach [23]: We impose a prior over the latent encodings {z_{d·}}_{d=1}^D taking the form of a GP predictive distribution parameterized by a small pseudo-dataset X̄ = {x̄_{m·}}_{m=1}^M, with corresponding latent encoding values {z̄_{m·}}_{m=1}^M (pseudo-encodings). In other words, we begin by considering a set of simple GP priors over the pseudo-encodings

p(z̄_{·c} | X̄) = N(z̄_{·c} | 0, K_M)    ∀ c ∈ {1, ..., C}    (8)

where z̄_{·c} = [z̄_{mc}]_{m=1}^M, and K_M is the Gram matrix of the pseudo-inputs, i.e., K_M = [k(x̄_i, x̄_j)]_{i,j}, with k(·,·) being the employed kernel function. This way, we eventually obtain a prior over the latent encodings {z_{·c}}_{c=1}^C parameterized by the introduced pseudo-data, taking the form of a simple GP predictive density, reading [23]

p(z_{·c} | X, X̄, z̄_{·c}) = N( z_{·c} | K_{DM} K_M^{-1} z̄_{·c}, Λ )    (9)

where K_{DM} = [k(x_i, x̄_j)]_{i,j}, Λ = diag(λ̃_d), and

λ̃_d = k_{dd} − k_d^T K_M^{-1} k_d    (10)

with k_d = [k(x_d, x̄_i)]_i and k_{dd} = k(x_d, x_d).

Having defined the encoder modules of our model, let us now proceed to the definition of the employed decoder modules. We consider a likelihood function of the form

p(y_{d·} | z_{d·}) = ∏_{n=1}^N N( y_{dn} | η_n^T z_{d·}, δ² )    (11)

where δ² is the variance of a simple white noise model. Further, we impose a Gaussian prior with axis-aligned elliptical covariance over the parameter vectors η_n, n ∈ {1, ..., N}, yielding

p(η_n) = N( η_n | 0, diag(ϕ)^{-1} )    ∀ n    (12)

where ϕ = [ϕ_c]_{c=1}^C is a vector of precision hyperparameters. We also impose Gamma hyperpriors over the ϕ_c, yielding

p(ϕ) = ∏_{c=1}^C G( ϕ_c | ω̃_0, ω̂_0 )

The selected form of the prior imposed over the model parameters η_n allows for obtaining a tractable solution to the latent encodings dimensionality inference problem, by means of automatic relevance determination (ARD) [24]. The notion of ARD is to continually create new components while detecting when a component starts to overfit. In the case of our model, the ARD mechanism can be implemented by imposing a hierarchical prior over the model parameters η_n, to discourage large values, with the width along each latent dimension controlled by a Gamma-distributed precision hyperparameter, as illustrated in (12). If one of these precisions ϕ_c tends to infinity, then the outgoing weights η_{nc}, ∀n, will have to be very close to zero in order to maintain a high likelihood under this prior, which in turn leads the model to ignore the cth latent dimension; hence, the corresponding direction in the postulated latent subspace is effectively "switched off."

In addition, we stipulate that the postulated linear regression scheme (11), which connects the obtained latent encodings z_{d·} and the observed data y_{d·}, is subject to the following large-margin constraints:

y_{dn} − E[η_n^T z_{d·}] ≤ ε + ξ_{dn}
−y_{dn} + E[η_n^T z_{d·}] ≤ ε + ξ*_{dn}        ∀ d, n    (13)
ξ_{dn}, ξ*_{dn} ≥ 0

Our imposed constraints are inspired by large-margin approaches, and especially the literature on MED regression models [20, Chapter 4], as they are based on maximization of an expected margin that takes into account the Bayesian nature of our model. Note that, in Eq. (13), ε is a precision parameter, functioning similarly to the precision parameter in support vector regression (SVR), and ξ_{dn}, ξ*_{dn} are slack variables, used in a way similar to SVR [25]. This concludes the formulation of our model.
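Before turning to inference, the following short Python/NumPy sketch illustrates how the quantities entering the sparse GP encoder prior of Eqs. (8)-(10) can be assembled for a given set of corrupt observations and pseudo-inputs. The RBF kernel (as used in the experiments of Section 4), the jitter term, and the toy dimensions are assumptions made here for illustration; the snippet is not part of the inference algorithm derived below.

```python
# Sketch of the SPGP encoder prior quantities of Eqs. (8)-(10).
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def spgp_prior(X, X_bar, z_bar_c, jitter=1e-6):
    D, M = X.shape[0], X_bar.shape[0]
    K_M = rbf_kernel(X_bar, X_bar) + jitter * np.eye(M)   # Gram matrix of pseudo-inputs
    K_DM = rbf_kernel(X, X_bar)                           # cross-covariances k(x_d, x_bar_j)
    K_M_inv = np.linalg.inv(K_M)
    k_dd = np.array([rbf_kernel(X[d:d+1], X[d:d+1])[0, 0] for d in range(D)])
    # lambda_tilde_d = k_dd - k_d^T K_M^{-1} k_d, Eq. (10)
    lam = k_dd - np.einsum('dm,mk,dk->d', K_DM, K_M_inv, K_DM)
    Lambda = np.diag(lam)
    mean = K_DM @ K_M_inv @ z_bar_c                       # mean of the prior in Eq. (9)
    return mean, Lambda, K_M, K_DM

# toy usage with D = 5 corrupt observations, M = 3 pseudo-inputs, input dim 4
rng = np.random.default_rng(1)
X, X_bar = rng.standard_normal((5, 4)), rng.standard_normal((3, 4))
z_bar_c = rng.standard_normal(3)
mean, Lambda, K_M, K_DM = spgp_prior(X, X_bar, z_bar_c)
print(mean.shape, Lambda.shape)   # (5,), (5, 5)
```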

3.1 Inference algorithm

To perform inference for our model, we adopt the MED inference framework, and extend it by introducing an additional (negative) likelihood term into the optimized objective function, which emanates from the assumption (11) of our model. Specifically, conventional MED inference in the context of our model would comprise solution of the following minimization problem:

min_{q(Z,Z̄,H,ϕ), ξ, ξ*, X̄, δ²}   KL( q(Z,Z̄,H,ϕ) || p(Z,Z̄,H,ϕ) ) + γ ∑_{d=1}^D ∑_{n=1}^N (ξ_{dn} + ξ*_{dn})    (14)

under the constraints (13), where Z = {z_{·c}}_{c=1}^C, Z̄ = {z̄_{·c}}_{c=1}^C, H = {η_n}_{n=1}^N, ξ = {ξ_{dn}}_{d,n}, ξ* = {ξ*_{dn}}_{d,n}, and KL(q||p) stands for the Kullback-Leibler (KL) divergence between the (approximate) posterior and the prior of our model. However, in our work, we follow a different approach, inspired by variational Bayesian inference [26] and the inference procedure of the nonparametric Bayesian large-margin classifier of [27]: we elect to optimize a composite objective function that takes into consideration both the expected (negative) log-likelihood of our hierarchical Bayesian model, which measures the goodness of fit to the training data, as well as the quality of the reconstruction of our training data, as is also performed in the context of conventional approaches. This way, inference for our model eventually reduces to solution of the following problem:

min_{q(Z,Z̄,H,ϕ), ξ, ξ*, X̄, δ²}   −∑_{d=1}^D E[log p(y_{d·}|z_{d·})] + KL( q(Z,Z̄,H,ϕ) || p(Z,Z̄,H,ϕ) ) + γ ∑_{d=1}^D ∑_{n=1}^N (ξ_{dn} + ξ*_{dn})
s.t. ∀ d, n:  y_{dn} − E[η_n^T z_{d·}] ≤ ε + ξ_{dn},   −y_{dn} + E[η_n^T z_{d·}] ≤ ε + ξ*_{dn},   ξ_{dn}, ξ*_{dn} ≥ 0    (15)

Note that, in the above expressions, all the expectations E[·] are computed w.r.t. the posterior q(Z,Z̄,H,ϕ). Our inference algorithm proceeds in an iterative fashion. On each iteration, each one of the factors of the sought posterior, i.e., q(Z), q(Z̄), q(H), q(ϕ), as well as the pseudo-inputs X̄, the noise variance δ², the slack variables ξ, ξ*, and the employed kernel hyperparameters (if any) are consecutively optimized, one at a time. It has been shown that such an iterative consecutive updating procedure is guaranteed to monotonically optimize the objective function of our problem [21].

3.1.1 Posterior Distribution

Let us begin with deriving the expression of the posterior distribution of our model. Following the mean-field principle [28], we assume that the sought posterior distribution factorizes over the z_{·c}, z̄_{·c}, and η_n, similar to the imposed prior. Then, optimizing (15) w.r.t. q(η_n), we obtain

q(η_n) = N(η_n | λ_n, Σ)    (16)

where

Σ^{-1} = diag(E[ϕ]) + (1/δ²) ∑_{d=1}^D E[z_{d·} z_{d·}^T]    (17)

and the means λ_n are obtained by solving the primal problem

min_{λ_n, ξ_{·n}, ξ*_{·n}}   (1/2) λ_n^T Σ^{-1} λ_n − λ_n^T ( (1/δ²) ∑_{d=1}^D y_{dn} E[z_{d·}] ) + γ ∑_{d=1}^D (ξ_{dn} + ξ*_{dn})
s.t. ∀ d:  y_{dn} − λ_n^T E[z_{d·}] ≤ ε + ξ_{dn},   −y_{dn} + λ_n^T E[z_{d·}] ≤ ε + ξ*_{dn},   ξ_{dn}, ξ*_{dn} ≥ 0    (18)

where ξ_{·n} = [ξ_{dn}]_{d=1}^D and ξ*_{·n} = [ξ*_{dn}]_{d=1}^D.

Even though the optimization problem (18) may seem computationally cumbersome to solve, it is easy to show that it essentially reduces to a simple SVR problem. Specifically, let us take the Cholesky decomposition of Σ:

Σ^{-1} = U^T U    (19)

Let us also introduce the notation

v_n = (1/δ²) ∑_{d=1}^D y_{dn} E[z_{d·}]    (20)

λ'_n = U (λ_n − Σ v_n)    (21)

y'_{dn} = y_{dn} − v_n^T Σ E[z_{d·}]    (22)

r_d = U^{-T} E[z_{d·}]    (23)

It is a matter of simple algebra to show that, using (19)-(23), the optimization problem (18) reduces to

min_{λ'_n, ξ_{·n}, ξ*_{·n}}   (1/2) ||λ'_n||² + γ ∑_{d=1}^D (ξ_{dn} + ξ*_{dn})
s.t. ∀ d:  y'_{dn} − (λ'_n)^T r_d ≤ ε + ξ_{dn}      ← µ_{dn}
           −y'_{dn} + (λ'_n)^T r_d ≤ ε + ξ*_{dn}     ← µ*_{dn}
           ξ_{dn} ≥ 0                                ← ν_{dn}
           ξ*_{dn} ≥ 0                               ← ν*_{dn}    (24)

where the {µ_{dn}, µ*_{dn}, ν_{dn}, ν*_{dn}} are Lagrange multipliers. The problem (24) is essentially the primal form of the learning problem of a simple SVR model with inputs r_d and targets y'_{dn}. Therefore, to compute the λ_n, we can exploit all the highly efficient solvers developed for the standard SVR problem. Indeed, in our work, we use the SVM-light toolbox of [29], which implements a highly scalable working set selection algorithm, and provides both the primal parameters λ'_n (and, hence, the λ_n) and the dual ones, i.e. the µ_{·n} = [µ_{dn}]_{d=1}^D and the µ*_{·n} = [µ*_{dn}]_{d=1}^D. Note also that the N SVR problems implied by (24) can be executed in parallel, by leveraging available parallel processing infrastructure.
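The following sketch illustrates, under stated assumptions, the reduction of the per-dimension primal problem (18) to a standard ε-SVR via the change of variables (19)-(23). The paper solves the resulting problems with the SVM-light toolbox [29]; scikit-learn's LinearSVR is used below merely as a convenient stand-in, and the posterior statistics (E[z_{d·}], diag(E[ϕ]), the targets) are random placeholders rather than quantities produced by the actual inference algorithm.

```python
# Sketch: collapsing problem (18) to an epsilon-SVR via Eqs. (19)-(23).
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(2)
D, C_lat = 40, 3                      # D training points, C latent dimensions
EZ = rng.standard_normal((D, C_lat))  # placeholder E[z_d.] for each training point
y_n = rng.standard_normal(D)          # placeholder targets y_{dn} for one output n
delta2, gamma, eps = 0.5, 1.0, 0.1

# Eq. (17), with identity standing in for diag(E[phi]) and E[z]E[z]^T for E[z z^T]
Sigma_inv = np.eye(C_lat) + (EZ.T @ EZ) / delta2
Sigma = np.linalg.inv(Sigma_inv)

# Eq. (19): Cholesky Sigma^{-1} = U^T U  (take U = L^T with L lower-triangular)
L = np.linalg.cholesky(Sigma_inv)
U = L.T

v_n = (y_n @ EZ) / delta2              # Eq. (20): v_n = (1/delta^2) sum_d y_dn E[z_d.]
r = np.linalg.solve(L, EZ.T).T         # Eq. (23): r_d = U^{-T} E[z_d.]
y_prime = y_n - EZ @ (Sigma @ v_n)     # Eq. (22): y'_dn = y_dn - v_n^T Sigma E[z_d.]

# Problem (24): a plain epsilon-SVR with inputs r_d, targets y'_dn, no intercept
svr = LinearSVR(epsilon=eps, C=gamma, fit_intercept=False, max_iter=10000)
svr.fit(r, y_prime)
lam_prime = svr.coef_

# Invert Eq. (21): lambda_n = Sigma v_n + U^{-1} lambda'_n
lam_n = Sigma @ v_n + np.linalg.solve(U, lam_prime)
print("posterior mean lambda_n:", lam_n)
```

Since each output dimension n yields an independent problem of this form, the N fits can be run in parallel, as noted above.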

Further, regarding the posteriors over the encodings of the training data q(z_{·c}), we have

q(z_{·c}) = N(z_{·c} | m_c, S_c)    (25)

where

S_c^{-1} = Λ^{-1} + (1/δ²) ∑_{n=1}^N E[η_{nc}²] I    (26)

m_c = S_c ( Λ^{-1} K_{DM} K_M^{-1} E[z̄_{·c}] + (1/δ²) ∑_{n=1}^N y_{·n} E[η_{nc}] − (1/δ²) ∑_{n=1}^N ∑_{c'≠c} E[η_{nc} η_{nc'}] E[z_{·c'}] + ∑_{n=1}^N (µ_{·n} − µ*_{·n}) E[η_{nc}] )    (27)

and y_{·n} = [y_{dn}]_{d=1}^D.

Similarly, regarding the posteriors over the pseudo-encodings z̄_{·c}, we have

q(z̄_{·c}) = N(z̄_{·c} | m̄_c, S̄_c)    (28)

where

S̄_c = K_M Q_M^{-1} K_M    (29)

Q_M = K_M + K_{DM}^T Λ^{-1} K_{DM}    (30)

and

m̄_c = K_M Q_M^{-1} K_{DM}^T Λ^{-1} E[z_{·c}]    (31)

Finally, the precision hyperparameters ϕ yield the following hyperposterior:

q(ϕ) = ∏_{c=1}^C G( ϕ_c | ω̃_c, ω̂_c )    (32)

where

ω̃_c = ω̃_0 + N/2    (33)

and

ω̂_c = ω̂_0 + (1/2) ∑_{n=1}^N λ_{nc}²    (34)
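The ARD mechanism implied by the hyperposterior updates (33)-(34) can be summarized in a few lines: latent dimensions whose expected precision E[ϕ_c] = ω̃_c / ω̂_c grows very large have their outgoing weights η_{nc} pushed towards zero and are effectively switched off. The pruning threshold and the toy numbers below are illustrative assumptions, not values used in the paper.

```python
# Sketch of the ARD-style hyperposterior update of Eqs. (33)-(34).
import numpy as np

def ard_update(Lam, omega_tilde_0=1e-3, omega_hat_0=1e-3, threshold=1e3):
    """Lam is the N x C matrix of posterior means lambda_{nc} of the eta_n."""
    N, C = Lam.shape
    omega_tilde = omega_tilde_0 + N / 2.0                    # Eq. (33), same for every c
    omega_hat = omega_hat_0 + 0.5 * (Lam ** 2).sum(axis=0)   # Eq. (34), one per latent dim c
    E_phi = omega_tilde / omega_hat                          # mean of G(phi_c | ., .)
    active = E_phi < threshold                               # dimensions the model still uses
    return E_phi, active

Lam = np.array([[0.8, 1e-4, -0.5],
                [1.1, 2e-4,  0.7],
                [0.9, 1e-4, -0.6]])
E_phi, active = ard_update(Lam)
print(E_phi, active)   # the second latent dimension gets a huge precision -> pruned
```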

3.1.2 Pseudo-inputs selection

Further, we need to obtain the pseudo-inputs set X̄. It is easy to show that solving (15) w.r.t. the set X̄ reduces to the optimization problem

max_{X̄}   E[ ∑_{c=1}^C log p(z_{·c} | X, X̄, z̄_{·c}) ] + E[ ∑_{c=1}^C log p(z̄_{·c} | X̄) ]    (35)

As the maximization problem (35) does not yield closed-form solutions, one has to resort to some iterative algorithm to obtain the sought estimates. For this purpose, the limited-memory variant of the BFGS algorithm (L-BFGS) [30] is a suitable candidate. However, this procedure entails a prohibitive combinatorial search, especially when the observations space is high-dimensional, i.e. for large values of N. Alternatively, to be conservative, in this work we avoid learning the pseudo-inputs (which can potentially greatly increase the training algorithm complexity and runtime). Instead, we use as our pseudo-inputs a subset of the actual (corrupt) observations X. Selection of this subset is performed by means of an iterative greedy selection process, which on each iteration selects the data point x ∈ X, x ∉ X̄, whose addition to the set X̄ yields the maximum value of

E[ ∑_{c=1}^C log p(z_{·c} | X, X̄, z̄_{·c}) ] + E[ ∑_{c=1}^C log p(z̄_{·c} | X̄) ]

and proceeds until we reach the desired number M of pseudo-inputs (a schematic sketch of this selection loop is given at the end of this section).

3.1.3 Hyperparameter estimation

Finally, let us now turn to the estimates of the hyperparameters of our model over which we did not impose a prior distribution. We begin with the noise variance δ²; solving (15) w.r.t. δ², we obtain

δ² = (1/(ND)) ∑_{d=1}^D ∑_{n=1}^N ( y_{dn}² − 2 y_{dn} E[η_n^T z_{d·}] + E[η_n^T z_{d·} z_{d·}^T η_n] )    (36)

Further, we also consider the case where the employed kernel functions of the imposed GP priors also entail some hyperparameters. To obtain their estimates, we optimize (15) over them. This essentially reduces to the optimization problem

max_θ   E[ ∑_{c=1}^C log p(z_{·c} | X, X̄, z̄_{·c}) ] + E[ ∑_{c=1}^C log p(z̄_{·c} | X̄) ]    (37)

where θ are the sought hyperparameters. As the maximization problem (37) does not yield closed-form solutions, we may resort to an iterative algorithm to obtain the sought estimates, namely L-BFGS.
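As a companion to the greedy pseudo-input selection of Section 3.1.2, the sketch below shows the shape of that selection loop. The criterion of Eq. (35) depends on the current posterior expectations, so it is left abstract here: expected_log_prior is a placeholder callable supplied by the caller, and the dummy criterion in the usage example is purely illustrative.

```python
# Sketch of the greedy pseudo-input selection loop of Section 3.1.2.
import numpy as np

def greedy_pseudo_inputs(X, M, expected_log_prior):
    """X: (D, d) corrupt observations; M: desired number of pseudo-inputs;
    expected_log_prior(X_bar): scalar value of the criterion in Eq. (35)."""
    D = X.shape[0]
    selected = []
    for _ in range(M):
        remaining = [i for i in range(D) if i not in selected]
        scores = [expected_log_prior(X[selected + [i]]) for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return X[selected], selected

# toy usage with a dummy criterion that simply rewards spread-out pseudo-inputs
rng = np.random.default_rng(3)
X = rng.standard_normal((20, 4))
dummy_score = lambda X_bar: float(np.linalg.det(X_bar @ X_bar.T + 1e-3 * np.eye(len(X_bar))))
X_bar, idx = greedy_pseudo_inputs(X, M=5, expected_log_prior=dummy_score)
print(idx)
```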

4 EXPERIMENTS

Here, we conduct the experimental evaluation of our approach. First, we consider an image classification experiment using benchmark data well-known in the literature of AE neural networks. Our model is used to extract salient features from these images, which are subsequently presented to simple linear SVM classifiers to perform the classification task. Further, we consider a video mining application, namely sports video mining in football videos. In this case, we use our model to extract salient features from raw video data, and present them to a simple linear SVM classifier to perform the classification task. Finally, we assess the performance of our method in a transfer learning experiment. In this case, our model is used to extract features from data pertaining to different domains; subsequently, we train a simple binary classifier (linear SVM) with the extracted features pertaining to data from one domain, and evaluate the resulting model using data from different domains.

Apart from our method, in our experiments we also evaluate the mSDA model [31], SDAs [8], as well as the related DGP method of [19]. In all our experiments, we use masking noise (MN) with the percentage of zeroed features, ζ, selected so as to optimize the performance of each model. In all cases, we evaluate SDA for a set of alternative selections of the number of latent features, ranging from 100 to an overcomplete representation matching the dimensionality of the observations space. Our approach is initialized considering an overcomplete representation. mSDA generates an overcomplete representation by default, as described in [31]. In all cases, we report results for optimal model configuration, as experimentally determined. Finally, evaluation of our model is performed using RBF kernels for the postulated GPs (omitting kernel hyperparameters optimization). All our source codes were implemented in MATLAB.

4.1 Image classification benchmarks

In this experiment, we consider three benchmarks dealing with the problem of content-based image classification, commonly used in the deep learning literature. Specifically, we experiment with:


(i) The standard USPS handwritten digit classification dataset. From this dataset, we retain only 10 examples per class (digit) to perform training. This way, we make it possible to evaluate whether our approach achieves the goal of obtaining high modeling performance by using scarce data, and how it compares to the competition.

(ii) The small-NORB dataset [32], which comprises stereo image pairs of fifty toys belonging to five generic categories. Each toy was imaged under six lighting conditions, nine elevations, and eighteen azimuths. Originally, the objects were divided evenly into test and training sets, yielding 24,300 examples each. From this dataset, we use only 10 images from each class to perform training, so as to assess whether our approach can obtain high modeling performance by using scarce data.

(iii) The Convex/Nonconvex dataset, adopted from [33]. From this dataset, we retain only 50 examples per class to perform training, so as to evaluate model performance under the restrictions of limited data availability; we use more data from each category than in the previous two cases, since we are here dealing with only two categories (convex, nonconvex).

We repeat our experiments multiple times to alleviate the effect of random data selection on the obtained performances. In all cases, the mSDA, SDA, and DGP models yielded optimal performance when evaluated with 4 layers. For the USPS and small-NORB datasets, optimal performance was obtained for the proportion ζ of zeroed features equal to 10%; in the case of the Convex/Nonconvex dataset, a proportion equal to 75% yielded the optimal performance. In all cases, SDA yielded optimal performance for 100 latent features per layer. Our obtained optimal results (for various numbers of pseudo-inputs M in the case of our method) are illustrated in Tables 1-3. These results comprise both the obtained recognition rates and the entailed computational costs of the evaluated algorithms (means over the conducted experiment repetitions). As we observe, our method obtains the best recognition performance results in all cases. We also observe that the computational costs of our approach are comparable to most of its competitors. mSDA is a notable exception to this rule; although obtaining the worst modeling performance among the considered methods, it takes two orders of magnitude less time than all the other approaches.

Table 1
Small-NORB Dataset: Recognition rate (%) and computational costs of the evaluated methods.
Method             Rate (%)   Time (sec.)
MED²A (M = 10)     40.25      19.84
MED²A (M = 30)     44.62      19.93
MED²A (M = 50)     48.55      19.98
MED²A (M = 100)    50.63      20.54
mSDA               34.8       0.15
SDA                35.2       18.17
DGP                44.5       42.37

Table 2
USPS Dataset: Recognition rate (%) and computational costs of the evaluated methods.
Method             Rate (%)   Time (sec.)
MED²A (M = 10)     84.28      10.94
MED²A (M = 50)     83.79      11.46
MED²A (M = 100)    83.57      11.55
MED²A (M = 200)    85.23      13.29
mSDA               63         0.24
SDA                71         11.39
DGP                80.31      26.41

Table 3
Convex/Nonconvex Dataset: Recognition rate (%) and computational costs of the evaluated methods.
Method             Rate (%)   Time (sec.)
MED²A (M = 10)     61.12      19.38
MED²A (M = 50)     61.12      19.56
MED²A (M = 100)    61.12      19.87
mSDA               55.0       0.18
SDA                56.8       19.54
DGP                58.1       36.85

4.2 Sports Video Mining

In this experiment, we consider the problem of sports video mining in football videos. Specifically, we are interested in detecting four camera view classes: central, left, right, and end-zone, and four play types: long play, short play, kick, and field goal play. To perform evaluations, we use a database comprising 30-min NFL American football games. The frames of the videos were downsampled to 45 × 36 resolution, and were preprocessed so as to remove commercials and replays. We retain only 100 examples from each category (i.e., play type or camera view) to perform training, and use another 1000 to perform evaluations. In Fig. 1, we provide some example frames from our videos.

Apart from the related methods, i.e. SDA, mSDA, and DGP, in these experiments we also compare to a baseline approach, namely the iM²EDM method of [34]. The latter method is not presented with the raw video signals, but with appropriate feature vectors extracted as described in [34]. Specifically, to capture camera view information, we use the color distribution and the yard line angle [35]. For this purpose, we estimate the spatial color distribution, and perform edge detection using the Canny algorithm, which we combine with the Hough transform to detect the yard lines and to compute their angles. Regarding play type information extraction, we utilize for this purpose camera motion information (panning and tilting), as this information is sufficient to characterize different play types: strong panning is usually associated with a long play, while a weak panning effect is usually associated with short plays. To compute the two kinds of camera motion, we choose the optical flow-based method of [36].

We repeated our experiments 10 times to alleviate the effect of random data selection on the obtained performances. mSDA, SDA, and DGP yielded their best performance for architectures with 6 layers. Optimal performance was obtained for the proportion ζ of zeroed features equal to 10% in all cases. SDA yielded optimal performance for 300 latent features per layer. In Table 4, we provide the obtained optimal performances of the evaluated algorithms (average results per detected class over the conducted 10 repetitions of our experiments). As we observe, our method works clearly better than the related approaches, yielding performance close to iM²EDM, which relies on sophisticated extraction of appropriate descriptors, contrary to our approach.

Table 4
Sports Video Mining: Recognition rate (%) of the evaluated methods.
Method             Rate (%)
MED²A (M = 100)    51.04
MED²A (M = 1000)   54.36
mSDA               38.1
SDA                43.6
DGP                49.3
iM²EDM             56.87

4.3 Transfer learning

In this experiment, we assess the capacity of our model to learn robust feature representations that capture salient high-level structure from data belonging to different domains. Inspired by the recent work of [37], we use the encodings generated by our model as inputs to linear SVM classifiers, and evaluate the resulting classifiers on the basis of their capacity to effectively perform classification of data from a given domain while being trained on data from different domains. Specifically, similar to [37], we consider a sentiment analysis application across different product domains. Our datasets are derived from the Amazon reviews benchmark [38]. This benchmark contains more than 340,000 reviews from 25 different types of products from Amazon.com. In this dataset, different domains vary substantially in terms of the number of instances and class distribution. Some domains comprise hundreds of thousands of reviews, while others have only a few hundred. To ensure that our results will not be negatively affected by these imbalances, in our experiments we retain only four domains with similar numbers of samples, namely Books (B), DVDs (D), Electronics (E), and Kitchen appliances (K), similar to [38]. This yields a set of twelve transfer learning tasks between (pairs of) domains with well-balanced classes, which is more amenable to learning by means of a general purpose classifier. From these data, we use only 200 samples from each class for training, and the rest for testing (i.e.,

we train with scarce data). We repeat our experiment multiple times with different splits of the data into training and test sets to alleviate the effect of random data selection on the obtained performances. Similar to the work of [37], we formulate the recognition task as a binary classification problem: the trained models are required to recognize whether a review is positive (higher than 3 stars) or negative (3 stars or lower). Each review text is treated as a bag-of-words and transformed into binary vectors encoding the presence/absence of unigrams and bigrams, similar to [37]. For computational reasons, only the 500 most frequent terms of the vocabulary of unigrams and bigrams are kept in the feature set, similar to [31].

Evaluation metrics. Let us denote as e(S, T) the transfer error between source domain S and transfer domain T; the transfer error is defined as the test error of a classifier trained on the source domain S and evaluated on the target domain T. Based on this measure, we can define the in-domain error e(T, T) as the error of a classifier trained and evaluated in the same domain. Let us also denote as e_b(T, T) the in-domain error of a baseline classifier; in our experiments, the used baseline classifiers are linear SVMs on the raw dataset features. Using these definitions, we can introduce a metric suitable for evaluating the performance of a transfer learning algorithm, expressing whether (and how much of) a performance deterioration is incurred when the transfer error is equal to e(S, T). Specifically, the introduced metric is the relative transfer loss t(S, T), given by

t(S, T) = ( e(S, T) − e_b(T, T) ) / e_b(T, T)    (38)
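For completeness, Eq. (38) amounts to the following one-liner; the error values used in the example are hypothetical.

```python
# Relative transfer loss of Eq. (38); inputs are error rates in [0, 1].
def relative_transfer_loss(e_transfer, e_baseline_in_domain):
    """t(S, T) = (e(S, T) - e_b(T, T)) / e_b(T, T)."""
    return (e_transfer - e_baseline_in_domain) / e_baseline_in_domain

# example: transfer error of 0.24 against an in-domain baseline error of 0.20
print(relative_transfer_loss(0.24, 0.20))   # 0.2, i.e. a 20% relative loss
```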

Results. We compare to related state-of-the-art methods, namely SDA and mSDA. (Application of DGP to a transfer learning scenario is not straightforward, given the model formulation of [19].) In this experiment, the SDA and mSDA models comprise 5 layers; this configuration was shown in [31] to yield the optimal result (we have confirmed this finding through our experiments). For both SDA and mSDA, and our approach, the proportion ζ of zeroed features of the employed masking noise yields an optimal value equal to 50%. SDA yielded its best performance for 500 latent features. In Fig. 2, we illustrate the obtained relative transfer loss for all the evaluated methods and for all source-target domain pairs, setting the number of pseudo-inputs of our method equal to the total number of available data points ("full MED²A"), as well as equal to 100 and 300. We observe that our method yields significantly better performance than the competition in all the considered transfer scenarios. We also observe that selecting M = 300 resulted in an optimal trade-off between computational complexity and model performance, since increasing M over 300 did not seem to yield performance gains worth the extra complexity. Finally, we would like to emphasize that the computational costs


Figure 1. Sports Video Mining: A few example frames.

Figure 2. Transfer Learning Experiment: Obtained relative transfer loss values of the evaluated methods (SDA, mSDA, MED²A with M = 100, MED²A with M = 300, and full MED²A) for all twelve source→target domain pairs among Books (B), DVDs (D), Electronics (E), and Kitchen appliances (K).

of our method are comparable to SDA: for M = 300, our method took 3.94 h to finish, while SDA took 4.16 h, i.e. slightly more than our approach. Note that mSDA takes two orders of magnitude less time (similar to the previous experiment), as also reported in [31].

5 CONCLUSIONS

DA neural networks are known to yield exceptional results when trained using large datasets; however, their performance deteriorates when data is scarce. This behavior of DAs contrasts with the capacity of humans to perform learning of complex abstract concepts from very limited observations. Under this motivation, in this paper we examined whether DA neural networks can be reformulated in such a way that allows them to perform better when data is scarce. We assumed that the reason why DA performance deteriorates when decreasing the volume of training data is the frequentist nature of their formulation. Based on this assumption, we postulated a novel, fully Bayesian treatment of DAs, utilizing concepts from Bayesian nonparametrics and large-margin principles. Specifically, our approach formulates the encoder modules by considering a Gaussian process prior over the latent input encodings generated given the (corrupt) input observations. Subsequently, the decoder modules of our model are formulated as large-margin regression models, treated under the Bayesian inference paradigm, by exploiting the maximum entropy discrimination (MED) framework. We provided an efficient variational algorithm for model inference, under the mean-field paradigm.

We evaluated the efficacy of our method by considering a number of experiments dealing with high-dimensional observations. As we observed, our method yielded a clear improvement over the competition, while imposing computational costs comparable to most alternatives. Our future research goals are aimed at improving the computational complexity of our approach without undermining its modeling effectiveness. Borrowing ideas from the mSDA method of [31] might be a good starting point for this purpose.

REFERENCES

[1] Y. Bengio and Y. LeCun, "Scaling learning algorithms towards AI," in Large Scale Kernel Machines, L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, Eds. MIT Press, 2007.
[2] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, pp. 504–507, 2006.
[3] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527–1554, 2006.
[4] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems, vol. 19, 2007, pp. 153–160.
[5] H. Lee, C. Ekanadham, and A. Ng, "Sparse deep belief net model for visual area V2," in Advances in Neural Information Processing Systems, vol. 20, 2008.
[6] G. Hinton, "Connectionist learning procedures," Artificial Intelligence, vol. 40, pp. 185–234, 1989.
[7] M. Ranzato, Y. Boureau, and Y. LeCun, "Sparse feature learning for deep belief networks," in Advances in Neural Information Processing Systems (NIPS'07), vol. 20, 2007, pp. 1185–1192.
[8] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[9] H. Lee, Y. Largman, P. Pham, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009.
[10] Y. Bengio, "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[11] G. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley Series in Probability and Statistics, 2000.
[12] J. B. Tenenbaum, C. Kemp, and P. Shafto, "Theory-based Bayesian models of inductive learning and reasoning," in Trends in Cognitive Sciences, 2006, pp. 309–318.
[13] R. Salakhutdinov and G. Hinton, "Using deep belief nets to learn covariance kernels for Gaussian processes," in Proc. Neural Information Processing Systems, 2008.
[14] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[15] J. Snoek, R. P. Adams, and H. Larochelle, "Nonparametric guidance of autoencoder representations using label information," Journal of Machine Learning Research, vol. 13, 2012.
[16] N. D. Lawrence, "Probabilistic non-linear principal component analysis with Gaussian process latent variable models," Journal of Machine Learning Research, 2005.
[17] R. Salakhutdinov and G. Hinton, "Learning a nonlinear embedding by preserving class neighbourhood structure," in Proc. Artificial Intelligence and Statistics, 2007.
[18] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," in Proc. Neural Information Processing Systems, 2004.
[19] A. C. Damianou and N. D. Lawrence, "Deep Gaussian processes," in Proc. AISTATS, 2013.
[20] T. Jebara, "Discriminative, generative and imitative learning," Ph.D. dissertation, MIT, 2002.
[21] T. Jaakkola, M. Meila, and T. Jebara, "Maximum entropy discrimination," in Proc. NIPS, 1999.
[22] N. D. Lawrence, J. C. Platt, and M. I. Jordan, "Extensions of the informative vector machine," in Deterministic and Statistical Methods in Machine Learning, J. Winkler, N. D. Lawrence, and M. Niranjan, Eds. Berlin: Springer-Verlag, 2005, pp. 56–87.
[23] E. Snelson and Z. Ghahramani, "Sparse Gaussian processes using pseudo-inputs," in Advances in Neural Information Processing Systems, Y. Weiss, B. Schölkopf, and J. Platt, Eds., vol. 18. Cambridge, MA: The MIT Press, 2006, pp. 1259–1266.
[24] E. Fokoue, "Stochastic determination of the intrinsic structure in Bayesian factor analysis," Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC, Tech. Rep. TR2004-17, 2004.
[25] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[26] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An introduction to variational methods for graphical models," in Learning in Graphical Models, M. Jordan, Ed. Dordrecht: Kluwer, 1998, pp. 105–162.
[27] J. Zhu, N. Chen, and E. P. Xing, "Infinite SVM: a Dirichlet process mixture of large-margin kernel machines," in Proc. ICML, 2011.
[28] T. Jaakkola and M. Jordan, "Bayesian parameter estimation via variational methods," Statistics and Computing, vol. 10, pp. 25–37, 2000.
[29] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. MIT Press, 1999.
[30] D. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming B, vol. 45, no. 3, pp. 503–528, 1989.
[31] M. Chen, Z. E. Xu, K. Q. Weinberger, and F. Sha, "Marginalized denoising autoencoders for domain adaptation," in Proc. ICML, 2012.
[32] Y. LeCun, F. J. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," in Proc. CVPR, 2004.
[33] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proc. ICML, 2007, pp. 473–480.
[34] S. P. Chatzis, "Infinite Markov-switching maximum entropy discrimination machines," in Proc. ICML, 2013.
[35] Y. Ding and G. Fan, "Segmental hidden Markov models for view-based sport video analysis," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2007.
[36] M. Srinivasan, S. Venkatesh, and R. Hosie, "Qualitative estimation of camera motion parameters from video sequences," Pattern Recognit., vol. 30, pp. 593–606, 1997.
[37] X. Glorot, A. Bordes, and Y. Bengio, "Domain adaptation for large-scale sentiment classification: A deep learning approach," in Proc. ICML, 2011.
[38] J. Blitzer, R. McDonald, and F. Pereira, "Domain adaptation with structural correspondence learning," in Proc. EMNLP, 2006, pp. 120–128.