International Journal of Neural Systems, Vol. 0, No. 0 (2005) 1-11
© World Scientific Publishing Company
Learning Feature Representations with a Cost-Relevant Sparse Autoencoder

Martin Längkvist
School of Science and Technology, Applied Autonomous Sensor Systems, Örebro University, SE-701 82, Örebro, Sweden
Email: [email protected]
http://aass.oru.se/~mlt

Amy Loutfi
School of Science and Technology, Applied Autonomous Sensor Systems, Örebro University, SE-701 82, Örebro, Sweden
Email: amy.loutfi@oru.se

There is an increasing interest in the machine learning community to automatically learn feature representations directly from the (unlabeled) data instead of using hand-designed features. The auto-encoder is one method that can be used for this purpose. However, for data sets with a high degree of noise, a large amount of the representational capacity in the auto-encoder is used to minimize the reconstruction error for these noisy inputs. This paper proposes a method that improves the feature learning process by focusing on the task-relevant information in the data. This selective attention is achieved by weighting the reconstruction error and reducing the influence of noisy inputs during the learning process. The proposed model is trained on a number of publicly available image data sets and the test error rate is compared to a standard sparse auto-encoder and other methods, such as the denoising auto-encoder and contractive auto-encoder.

Keywords: sparse auto-encoder; unsupervised feature learning; weighted cost function
1. Introduction

The choice of data representation (or features) used in machine learning is known to heavily influence the performance of the method. Representation learning methods [5, 3, 8] attempt to circumvent the need to hand-design features and instead use learning to automatically acquire representations of the data. For such algorithms, the challenge is to learn relevant, robust, and invariant feature representations. The auto-encoder [25, 4] is one method that can be used for unsupervised feature learning. However, a shortcoming of the auto-encoder is that all inputs are treated equally and the representational capacity of the algorithm is used to minimize the reconstruction error of each input. Therefore, in the presence of noisy data, an unnecessary amount of the limited representational capacity can be spent to model the noisy inputs. Furthermore, the auto-encoder is unable to distinguish between task-relevant and task-irrelevant information without knowledge of the correct labels.

This paper aims to improve the learning of feature representations in an auto-encoder by both reducing the influence of noise and utilizing knowledge of the task-relevant information in the raw data. This is done by weighting the reconstruction error of every input so that the learned features are more focused on the task-relevant information in the input data.

Several strategies for guiding the learning of feature representations have been proposed in the literature. For example, sparse feature representations are achieved by adding a sparsity constraint [19, 20]. Robustness is achieved in
a contractive auto-encoder [26] by penalizing changes in the feature representation for small changes in the input, and in a denoising auto-encoder [31, 30] by reconstructing clean input from corrupted input. Noise injection in the training examples has also been used to improve generalization [13, 21, 24]. The use of dropout [11] also guides the features to learn meaningful representations independently and reduces the co-adaptation between the hidden units by randomly dropping half of the hidden units for each training example. While these methods show promise for guiding the learning of features, they mainly focus on the unsupervised setting where the distinction between task-relevant and task-irrelevant patterns is still a challenge.

A method that attempts to solve the problem of finding task-relevant patterns from scratch with the help of labels is a modification of the RBM [10, 12, 19] named the point-wise gated Boltzmann machine (PGBM) [28]. The PGBM uses a feature selection algorithm to divide the learned features from a pre-trained RBM into a group of task-relevant features and a group of task-irrelevant features. A binary switching unit for each input pixel is used to determine which pixels are task-relevant. The classification is only performed on the task-relevant group of features. The disadvantage of the PGBM is that it requires a pre-trained network before evaluating the inputs and the learned features, and it still spends half of the model's capacity on modeling the task-irrelevant patterns. The proposed method in this work aims to make every learned feature task-relevant without the need for pre-training.

The proposed method is also related to cost-sensitive learning [7, 9, 15]. Cost-sensitive learning is used during supervised learning and addresses the issue that the classification error for each category should sometimes not be treated as equal. For example, in medical diagnosis a false negative should be considered more costly than a false positive because the former could result in the loss of a life. Cost-sensitive learning introduces a cost matrix C(i, j) that weights the cost of classifying the predicted class j as actual class i. The cost for a correct classification is C(i, i) = 0, and if the costs are equal the off-diagonal values are C(i, j) = 1. The main difference between cost-sensitive learning and our method is that our method weights the reconstruction error of each input unit during the unsupervised phase of learning the auto-encoder, instead of weighting the classification error of each category during supervised learning.

The proposed model presented in this work is designed for data sets that contain many factors of variation and reduces the cost of the reconstruction error for irrelevant inputs. Weighting of the reconstruction error has also been done by adding a penalty term for the squared error cost between an optimal hidden unit activation and the actual hidden unit activation [20]. The added penalty term has the effect of reducing the allowed parameter space by training the model to have a high reconstruction error for inactive inputs. The optimal hidden unit activation is obtained by fixing the model parameters, and then the model parameters are updated with the obtained optimal hidden layer activation. Finally, weighting the error term has also been done to avoid poor local minima by preventing early saturation of the hidden units [32].

The goal of the proposed method is to learn more meaningful feature representations from data that has a high amount of irrelevant patterns. The method is trained on a number of visual recognition benchmark data sets and evaluated by examining the learned features and comparing the test error rates to other unsupervised feature learning algorithms.

This paper is organized as follows. Section 2 gives an introduction to the sparse auto-encoder. The proposed model is presented in Section 3. The experiments are presented in Section 4, and finally a conclusion and future directions are given in Section 5.
2. Sparse Auto-Encoder
The model that is used in this work is the auto-encoder [4, 1]. The auto-encoder consists of one input layer, one or more hidden layers, and one output layer. The goal of an auto-encoder is to reconstruct the input data via an encoder and a decoder. The feed-forward activation in the encoder from the visible units v_i to the hidden units h_j is expressed as:

$$h_j = \sigma_f\Big(\sum_i W_{ji} v_i + b_j\Big) \qquad (1)$$

where σ_f(·) is the activation function for the encoder. In this work the sigmoid function $\sigma_f(x) = \frac{1}{1+e^{-x}}$ is used as the encoder activation function. In the decoder phase, the last hidden layer is decoded back to reconstructions of the previous input layers. One pass of
a decoder with tied weights is calculated as:

$$\hat{v}_i = \sigma_g\Big(\sum_j W_{ij} h_j + b_i\Big) \qquad (2)$$
The activation function in the last decoder depends on the nature of the input in the first visible layer. For a visible layer with an assumed Gaussian distribution, a linear activation function σ_g(x) = x can be used.

For a probabilistic interpretation of the auto-encoder, the cost function is defined as the negative log-likelihood of the probability distribution of the visible layer given the hidden layer. If we assume this distribution is Gaussian with mean equal to the reconstruction of the visible layer, v̂, and identity covariance, i.e., P(v|h) = N(v̂, I), we get:

$$L = -\log P(v|h) = \frac{1}{2N}\sum_{n=1}^{N}\sum_i (v_i^n - \hat{v}_i^n)^2 \qquad (3)$$

which aims to minimize the average squared reconstruction error over a training batch of N training examples.

A regularized auto-encoder restricts the allowed parameter space and is obtained by adding regularization terms. A common choice is to add an L2 weight decay term, $\frac{\lambda}{2}\sum_l\sum_i\sum_j (W_{ij}^{(l)})^2$, which keeps the weight matrices in each layer l small, and a sparsity penalty term, $\beta \sum_j \mathrm{KL}(\rho\,\|\,p_j)$, where $\mathrm{KL}(\rho\,\|\,p_j) = \rho \log\frac{\rho}{p_j} + (1-\rho)\log\frac{1-\rho}{1-p_j}$ is the Kullback-Leibler (KL) divergence and p_j is the mean activation of hidden unit j over all training examples in the current mini-batch. The inclusion of these regularization terms prevents the trivial learning of a 1-to-1 mapping of the input. Each regularization term comes with one or more hyperparameters (λ, β, ρ) that can be set with a full grid search, random grid search [2], or hyperparameter optimization [6].
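To make the quantities in this section concrete, the following NumPy sketch evaluates Eqs. (1)-(3) together with the L2 weight decay and KL sparsity penalties for a single hidden layer with tied weights. The function name, default hyperparameter values, and mini-batch layout are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_ae_cost(W, b_enc, b_dec, V, lam=1e-4, beta=3.0, rho=0.05):
    """Cost of a one-hidden-layer sparse auto-encoder with tied weights.

    V is a mini-batch of shape (N, d_v); W has shape (d_h, d_v).
    Returns the squared reconstruction error (Eq. 3) plus the L2 weight
    decay and KL sparsity penalties described in Section 2.
    """
    N = V.shape[0]
    H = sigmoid(V @ W.T + b_enc)        # encoder, Eq. (1)
    V_hat = H @ W + b_dec               # linear (Gaussian) decoder, Eq. (2)

    recon = np.sum((V - V_hat) ** 2) / (2.0 * N)      # Eq. (3)
    weight_decay = (lam / 2.0) * np.sum(W ** 2)       # L2 term
    p = H.mean(axis=0)                                # mean activation per hidden unit
    p = np.clip(p, 1e-8, 1 - 1e-8)                    # numerical safety
    kl = rho * np.log(rho / p) + (1 - rho) * np.log((1 - rho) / (1 - p))
    sparsity = beta * np.sum(kl)

    return recon + weight_decay + sparsity

# toy usage: random data and parameters
rng = np.random.default_rng(0)
V = rng.normal(size=(100, 784))
W = 0.01 * rng.normal(size=(500, 784))
cost = sparse_ae_cost(W, np.zeros(500), np.zeros(784), V)
```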
3. Cost-Relevant Sparse Auto-Encoder
An issue with training sparse auto-encoders using the cost function in Eq. (3) is that the reconstruction error of each input unit is treated equally. This comes from the assumption that each input unit, given the hidden units, has a Gaussian distribution with unit variance. If we instead set the precision (inverse variance) as a learnable parameter, i.e., P(v|h) = N(v̂, α^{-1}), we get the following cost function:

$$L = -\log P(v|h) = \frac{1}{2N}\sum_{n=1}^{N}\sum_i (v_i^n - \hat{v}_i^n)^2 \alpha_i - \sum_i \log \alpha_i \qquad (4)$$
The changes from a regular sparse auto-encoder are the element-wise multiplication of the reconstruction error for the i-th unit with the weighting vector α and the added weighting vector penalty term. If α_i = 1 for all i, the model reduces to a regular sparse auto-encoder. If α_i is small, then the reconstruction error for the i-th unit is decreased and the amount by which the weights and biases in the previous layers are changed during back-propagation is decreased compared to when α_i = 1. Similarly, for α_i > 1 the reconstruction error and the size of the parameter update are increased. Therefore, the weighting vector α determines which input units should be focused on and have a larger impact on the parameter updates, and which input units should have a lesser impact.

A crucial aspect of the proposed model is how to determine the values of the weighting vector. In this work we explore two methods, namely using a feature selection method to pre-calculate the values (fixed weighting vector), and learning the values together with the model parameters (adaptive weighting vector). The fixed weighting vector method is designed as a supervised method where knowledge of the correct labels is utilized, while the adaptive weighting vector method is purely unsupervised.
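The effect of the weighting vector can be seen directly from the gradient of Eq. (4). The NumPy sketch below computes the weighted reconstruction term and its gradient with respect to the reconstruction; the error signal of unit i is scaled by α_i, which is what shrinks (or enlarges) the resulting parameter updates. The function name and array layout are illustrative assumptions.

```python
import numpy as np

def weighted_recon_cost(V, V_hat, alpha):
    """Reconstruction part of Eq. (4): per-unit squared error scaled by alpha,
    plus the -sum(log alpha) penalty. V and V_hat have shape (N, d_v),
    alpha has shape (d_v,)."""
    N = V.shape[0]
    sq_err = (V - V_hat) ** 2                                   # (N, d_v)
    cost = np.sum(sq_err * alpha) / (2.0 * N) - np.sum(np.log(alpha))
    # Gradient w.r.t. the reconstruction: each unit's error signal is scaled
    # by alpha_i, so units with small alpha_i contribute less to the
    # back-propagated parameter updates.
    dV_hat = -(V - V_hat) * alpha / N
    return cost, dV_hat

# example: halving alpha for the first five inputs halves their error signal
rng = np.random.default_rng(1)
V, V_hat = rng.normal(size=(8, 10)), rng.normal(size=(8, 10))
alpha = np.ones(10)
alpha[:5] = 0.5
cost, dV_hat = weighted_recon_cost(V, V_hat, alpha)
```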
3.1. Fixed weighting vector
In this setting, α_i is set to a fixed value for each input unit i. Since the values of α_i do not change, the weighting vector penalty term in Eq. (4) is removed. The values can be calculated in a number of ways depending on the data set or on whether prior knowledge about the data can be utilized. In this work, a feature selection algorithm, namely the score from the t-test algorithm on the input data, is used to set the values of the weighting vector. The scores are normalized to values between 0 and 1 by:

$$\alpha_{\mathrm{norm}} = \frac{\alpha - \min \alpha}{\max \alpha - \min \alpha} \qquad (5)$$

and then normalized so that $\sum_i \alpha_i = d_v$, where d_v is the number of input units. This reduces the difference between the reconstruction error and the other regularization terms so that the hyperparameters do not need to be re-adjusted.
Furthermore, since this method requires knowledge of the correct labels, a separate weighting vector α_i^{(k)} is used for each category k. This allows the model to change focus depending on the category that the current input data belongs to.
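A minimal sketch of how fixed, per-class weighting vectors could be computed is shown below. The paper states only that the t-test score on the input data is used and normalized with Eq. (5); the one-vs-rest comparison, the use of the absolute t-value, and the SciPy call are our assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

def fixed_weighting_vectors(X, y):
    """Per-class weighting vectors from t-test scores (one-vs-rest).

    X: (N, d_v) training inputs, y: (N,) integer labels.
    Returns an array of shape (num_classes, d_v) where each row is
    min-max normalized (Eq. 5) and rescaled to sum to d_v.
    """
    d_v = X.shape[1]
    alphas = []
    for k in np.unique(y):
        # two-sample t-test per input unit: class k vs. the rest
        t, _ = ttest_ind(X[y == k], X[y != k], axis=0, equal_var=False)
        score = np.nan_to_num(np.abs(t))                 # magnitude of the t-score
        score = (score - score.min()) / (score.max() - score.min() + 1e-12)  # Eq. (5)
        alphas.append(score * d_v / score.sum())         # rescale so sum_i alpha_i = d_v
    return np.stack(alphas)

# toy usage on random "images"
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 784))
y = rng.integers(0, 10, size=200)
alpha_per_class = fixed_weighting_vectors(X, y)          # shape (10, 784)
```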
3.2. Adaptive weighting vector
When using an adaptive weighting vector, the values of the weighting vector are learned together with the model parameters. The weighting vector is parametrized as α = log(1 + exp(a)), where a is the learnable parameter. This parametrization is done in order to avoid numerical problems, i.e., keeping α > 0. If α_i decreases from 1, the squared error term will decrease and the weighting vector penalty term will increase. However, for values of α_i above 1 the reconstruction error will increase but the weighting vector penalty term becomes negative. A hyperparameter γ is introduced on the weighting vector penalty term in Eq. (4) in order to control the trade-off between the reconstruction error and the deviation of the weighting vector values from 1, and to prevent the values from blowing up. During learning, the algorithm determines a weighting vector that minimizes the total cost function.
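A sketch of the adaptive weighting vector under the softplus parametrization α = log(1 + exp(a)) is given below, with γ applied to the penalty term as described above. The initialization of a and the way this gradient is combined with the usual auto-encoder gradients are our assumptions.

```python
import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))                   # alpha = log(1 + exp(a)) > 0

def adaptive_cost_and_grad(V, V_hat, a, gamma=0.1):
    """Weighted cost of Eq. (4) with a learnable parameter a and a
    gamma-scaled penalty term, as described in Section 3.2.
    Returns the cost and the gradient w.r.t. a (used alongside the
    usual auto-encoder parameter gradients)."""
    N = V.shape[0]
    alpha = softplus(a)                                       # (d_v,)
    sq_err = np.sum((V - V_hat) ** 2, axis=0) / (2.0 * N)     # per-unit error
    cost = np.sum(sq_err * alpha) - gamma * np.sum(np.log(alpha))
    # d cost / d a = (d cost / d alpha) * sigmoid(a), since d softplus / d a = sigmoid(a)
    dalpha = sq_err - gamma / alpha
    da = dalpha * (1.0 / (1.0 + np.exp(-a)))
    return cost, da

# usage: one gradient step on a for a random mini-batch
rng = np.random.default_rng(3)
V, V_hat = rng.normal(size=(32, 784)), rng.normal(size=(32, 784))
a = np.zeros(784)                                # alpha starts near log(2)
cost, da = adaptive_cost_and_grad(V, V_hat, a)
a -= 0.01 * da
```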
4. Experiments

This section is divided into two parts, where the experiments are performed with a fixed or an adaptive weighting vector. Early stopping is used in all experiments to prevent overfitting in both pre-training and fine-tuning: training is stopped if the accuracy on a held-out validation set has not improved in the last 20 epochs. Mini-batch stochastic gradient descent with a decaying learning rate and momentum is used as the optimization method. The hyperparameters are set with random grid search and the combination of hyperparameters that gives the best accuracy on the validation set is used. The weight decay λ ranges between 10^-3 and 10^-5, the sparsity parameter β ranges between 3 and 0.003, the weighting vector penalty parameter γ ranges between 1 and 0.01 (this parameter is not used for the fixed weighting vector), and the learning rate ranges between 0.1 and 0.001. The number of hidden units is set to 500, which is a relatively small size compared to other approaches [17]. Therefore, our aim is to achieve comparable results but with a smaller network, since the capacity of the model is not spent on irrelevant patterns.
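A possible implementation of the random search over these ranges is sketched below; log-uniform sampling and the dictionary keys are our assumptions, since the paper only specifies the ranges.

```python
import numpy as np

def sample_hyperparameters(rng):
    """Draw one random configuration from the ranges stated above
    (log-uniform sampling is an assumption; the paper only gives the ranges)."""
    def log_uniform(lo, hi):
        return float(np.exp(rng.uniform(np.log(lo), np.log(hi))))
    return {
        "weight_decay":  log_uniform(1e-5, 1e-3),   # lambda
        "sparsity":      log_uniform(0.003, 3.0),   # beta
        "alpha_penalty": log_uniform(0.01, 1.0),    # gamma (adaptive alpha only)
        "learning_rate": log_uniform(0.001, 0.1),
    }

rng = np.random.default_rng(4)
trials = [sample_hyperparameters(rng) for _ in range(20)]
# each configuration would be trained with early stopping and the one with
# the best validation accuracy kept
```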
4.1. Fixed weighting vector
The use of a fixed weighting vector is evaluated using two data sets, namely MNIST variations and Jittered-Cluttered NORB.
4.1.1. MNIST variations

The MNIST variations data set consists of 4 variations of the standard MNIST handwritten digits [16]: added background noise (mnist-back-rand), added background images (mnist-back-images), rotated digits (mnist-rot), and rotated digits with background images (mnist-rot-back-images). Each data set consists of 10000 training examples, 2000 validation examples, and 50000 test examples. The input size is 28 × 28. The input data is shown in Fig. 2(a).

Fig. 1 shows the values of the weighting vector for all four data sets when it is calculated using the t-test feature selection algorithm on the input data. Each column is one of the four data sets. The top row shows the mean of the scores over all classes and the bottom row shows the raw scores for each class. The use of a separate weighting vector for each class further indicates which input units should be focused on for each class. The first two data sets assign focus to the center of the image, while the last two data sets focus on rings with different widths and radii. The white sections in the second row indicate that high pixel intensities in the input data play a role for that class, and the black parts indicate that the absence of a pixel intensity is significant. Gray sections indicate class-insignificant input pixels. The weighting vector for each data set and class is set to the maximum score of the t-test algorithm for each category. The values are normalized according to Eq. (5).
Fig. 1. Raw scores from the t-test algorithm on the training images. Each column from left to right corresponds to the data sets mnist-back-images, mnist-back-rand, mnist-rot, and mnist-rot-back-images. The top row shows the mean of the t-test score over all classes. The bottom row shows the scores for each class. Values are plotted with switched sign for a more intuitive visualization.
A cost-relevant sparse auto-encoder with a fixed weighting vector is trained on the input data. The number of hidden units is set to 500 and the other hyperparameters were set as described at the beginning of this section. Fig. 2 shows the reconstructions and learned features of a regular auto-encoder and the cost-relevant auto-encoder trained on mnist-back-images. The reconstructions from the cost-relevant auto-encoder resemble a depth-of-field effect where the background is more blurred out compared to the reconstructions of a regular auto-encoder. The learned features of the cost-relevant auto-encoder capture little information from the outer part of the image and instead focus on pen-strokes.

Fig. 2. (a) Input data, (b) reconstruction of a regular auto-encoder, (c) learned features of a regular auto-encoder, (d) reconstruction of a cost-relevant auto-encoder with fixed weighting vector, and (e) learned features of a cost-relevant auto-encoder with fixed weighting vector. Each row from top to bottom is the data set mnist-back-images, mnist-back-rand, mnist-rot, and mnist-rot-back-images.

The test error rate for a sparse auto-encoder (SAE) and a cost-relevant sparse auto-encoder (CR-SAE) with a fixed weighting vector is presented and compared to previous works in Table 1.^a The goal of our approach is to achieve comparable results to other models while using a single layer with a small number of hidden units. Our model uses 500 hidden units while the other works use model sizes between 500 and up to 3000 (6000 for SAA-3) per layer. The CR-SAE improved the classification result for all data sets compared to a SAE, NNet, and SVM (except for mnist-rot). For mnist-back-images the CR-SAE also achieved a lower classification error than a 3-layered stacked auto-associator. For mnist-back-rand, comparable results to SdA-3 and CAE-2 were achieved and a lower error than SAA-3 and CAE-1. For mnist-rot, CR-SAE also performed better than DBN-1 but not better than any of the deep networks or CAE, probably because mnist-rot does not contain any added noise, so there is little gain in paying less attention to a subset of the input. For mnist-rot-back-images a lower classification error is also achieved compared to SAA-3, DBN-1, and CAE-1. The supervised PGBM [28] and CAE-2 [27] achieved a lower test error rate on all 4 data sets than the proposed method. The CAE-2 uses two layers of hidden units with 1000 units in each layer while our method only uses one layer with 500 hidden units. One possible reason for the lower test error of the supervised PGBM is that it uses generative feature selection both at the learned features from a pre-trained RBM (high-level) and at the visible units by blocking task-irrelevant units (low-level), while the proposed method with a fixed weighting vector only uses feature selection on the low-level input units before training.

^a For the re-calculated results from some of the models see: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007

4.1.2. Jittered-Cluttered NORB

The Jittered-Cluttered NORB data set consists of images from the NORB data set [18] where the central object has been jittered and disruptive background clutter and a distractor object have been added in order to produce a more challenging data set. The images are of five small toy figures (animal, human figure, airplane, truck, and car) taken at 6 lighting conditions, 9 elevations, and 18 azimuths using a stereoscopic camera. The input size is 108 × 108 × 2. Some examples can be seen in Fig. 4(a). There is also a sixth category with no toy figure present in the image and only background clutter and a distractor object. The validation set is created by randomly drawing 19.4% of the images from the training set, which resulted in a training set of 234446 images, a validation set of 57154 images, and a test set of 58320 images. The sets are pre-processed in the same way as a previous work [22], that is, each image is downsampled to 32 × 32 × 2 and normalized by subtracting the mean of the image and dividing by the average standard deviation of all training pixels.

The training set is used to train a sparse auto-encoder (SAE) using 500 hidden units and a cost-relevant sparse auto-encoder (CR-SAE) with a fixed weighting vector using 500 and 2000 hidden units. The same hyperparameter search and optimization method previously used in Section 4.1.1 are also applied here. The networks were first pre-trained and then fine-tuned with a softmax layer attached on top of the hidden layer. Due to the size and nature of the training set, the values of the weighting vector are updated with the t-test algorithm for each training mini-batch. Fig. 3 shows the values of the weighting vector for three different training mini-batches.
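The per-image normalization described above can be written compactly as in the sketch below, which assumes the images have already been downsampled to 32 × 32 × 2 and flattened; reading "average standard deviation of all training pixels" as the mean of the per-pixel standard deviations is our interpretation.

```python
import numpy as np

def normalize_norb(train, test):
    """Subtract each image's own mean, then divide by the average standard
    deviation of the training pixels (computed once on the training set)."""
    train = train - train.mean(axis=1, keepdims=True)   # per-image mean removal
    test = test - test.mean(axis=1, keepdims=True)
    avg_std = train.std(axis=0).mean()                  # mean of the per-pixel stds
    return train / avg_std, test / avg_std

# toy usage with flattened 32*32*2 = 2048-dimensional images
rng = np.random.default_rng(5)
train = rng.normal(size=(1000, 2048))
test = rng.normal(size=(200, 2048))
train_n, test_n = normalize_norb(train, test)
```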
Table 1. Classification errors [%] with 95% confidence intervals on MNIST variations using the cost-relevant sparse auto-encoder (CR-SAE) with fixed weighting vector, sparse auto-encoder (SAE), supervised neural net (NNet), SVM with Gaussian kernel, 3-layered stacked auto-associator (SAA-3), 1 and 2-layered contractive auto-encoder (CAE-1, CAE-2), 3-layered stacked denoising auto-encoder (SdA-3), deep belief network (DBN-1), and a supervised point-wise gated Boltzmann machine (supervised PGBM).

Model                | mnist-back-images | mnist-back-rand | mnist-rot    | mnist-rot-back-images
---------------------|-------------------|-----------------|--------------|----------------------
CR-SAE               | 18.59 ± 0.35      | 10.98 ± 0.28    | 12.86 ± 0.30 | 47.68 ± 0.45
SAE                  | 19.97 ± 0.36      | 13.60 ± 0.31    | 14.23 ± 0.38 | 53.56 ± 0.45
NNet [17]            | 27.41 ± 0.39      | 20.04 ± 0.35    | 18.11 ± 0.34 | 62.16 ± 0.43
SVM_rbf [17]         | 22.61 ± 0.37      | 14.58 ± 0.31    | 11.11 ± 0.28 | 55.18 ± 0.44
SAA-3 [17]           | 23.00 ± 0.37      | 11.28 ± 0.28    | 10.30 ± 0.27 | 51.93 ± 0.44
CAE-1 [27]           | 16.70 ± 0.33      | 13.57 ± 0.30    | 11.59 ± 0.28 | 48.10 ± 0.44
CAE-2 [27]           | 15.50 ± 0.32      | 10.90 ± 0.27    | 9.66 ± 0.26  | 45.23 ± 0.44
SdA-3 [30]           | 16.68 ± 0.33      | 10.38 ± 0.27    | 10.29 ± 0.27 | 44.49 ± 0.44
DBN-1 [17]           | 16.15 ± 0.32      | 9.80 ± 0.26     | 14.69 ± 0.31 | 52.21 ± 0.44
supervised PGBM [28] | 12.85 ± 0.30      | 6.87 ± 0.23     | −            | 44.67 ± 0.44
Fig. 3. Fixed weighting vector for each class of animal, human, airplane, truck, car, and none (columns) for three training mini-batches (rows).

Fig. 4 shows the input data, the reconstructions, and some of the learned features in no particular order. The values of the weighting vector for the "none" category are mostly high around the edges, while the airplane, car, and truck categories mostly have a high focus in the middle. The animal and human figure categories have a more scattered distribution of the weighting vector between training mini-batches.

Table 2 shows the test error rate for our models and previous works. The CR-SAE with 500 hidden units achieves a lower test error rate compared to a SAE but not lower than an RBM with 4000 hidden units or a multi-layered convolutional net. When the number of hidden units is increased to 2000, the CR-SAE achieved a lower test error rate than an RBM with twice the number of hidden units with rectified linear units and a 6-layered convolutional net.

Table 2. Classification errors on the Jittered-Cluttered NORB data set with the proposed method and previous works.

Model             | Hidden units | Test error rate [%]
------------------|--------------|--------------------
SAE               | 500          | 26.42
CR-SAE            | 500          | 20.04
RBM (binary) [29] | 4000         | 18.7
RBM (NReLU) [29]  | 4000         | 16.5
Conv Net [18]     | 8-24-100     | 16.7
CR-SAE            | 2000         | 14.48

Table 3 shows the confusion matrix on the Jittered-Cluttered NORB data set. The largest confusion is cars being classified as trucks.

Table 3. Confusion matrix [%] of Jittered-Cluttered NORB from a cost-relevant sparse auto-encoder with 2000 hidden units.

         | animal | human | airplane | truck | car   | none
---------|--------|-------|----------|-------|-------|------
animal   | 84.77  | 6.38  | 4.92     | 0.76  | 1.34  | 1.83
human    | 5.51   | 87.96 | 0.80     | 0.28  | 1.94  | 3.42
airplane | 6.47   | 2.52  | 81.40    | 2.91  | 6.23  | 0.46
truck    | 5.80   | 0.30  | 0.33     | 87.70 | 5.70  | 0.17
car      | 1.78   | 0.42  | 1.04     | 23.40 | 72.82 | 0.55
none     | 0.32   | 0.74  | 0.09     | 0.13  | 0.27  | 98.45

4.2. Adaptive weighting vector
An adaptive weighting vector can be used instead of a fixed weighting vector when the training data is unlabeled and there is no obvious way of setting the weighting vector. Another advantage of an adaptive weighting vector is when the quality of a subset of the input changes over time. The use of an adaptive weighting vector is tested by comparing the classification accuracy on the Small NORB data set with partially corrupted input using a sparse auto-encoder with an adaptive weighting vector and a regular sparse auto-encoder.
Fig. 4. (a) Input data for Jittered-Cluttered NORB where each row is one category, (b) reconstruction of the input data with a CR-SAE, and (c) a subset of the learned features of the CR-SAE.
4.2.1. Small NORB
The Small NORB data set [18] consists of images of size 96 × 96 of 5 categories (animal, human figure, airplane, truck, and car) taken at 6 lighting conditions, 9 elevations, and 18 azimuths. The training and test sets consist of 24300 images each. The images are downsampled to 48 × 48 and normalized with local contrast normalization [23, 14] with patch size 9 × 9, resulting in images of size 40 × 40. The 25 × 25 center patch is selected to further reduce the dimensionality. The training data is locally corrupted with Gaussian noise with standard deviation 0.1 or 1 within a patch of varying size. The location of the patch is set randomly and is the same for all training images. The inputs outside the corruption patch are unaffected.

Fig. 5 shows the classification accuracy on the Small NORB data set with varying corruption patch size. When no noise is added to the training data, the test accuracy using a CR-SAE with adaptive weighting vector is similar to a SAE. For small amounts of noise (standard deviation = 0.1), the difference in classification accuracy between the two models is unnoticeable. When the training data is affected by a large amount of noise (standard deviation = 1), the classification accuracy using a CR-SAE with adaptive weighting vector is significantly higher than a SAE.
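A sketch of a local contrast normalization step with a 9 × 9 neighborhood is shown below. It uses a uniform (box) window, whereas the cited references use Gaussian-weighted windows, so it should be read as an approximation of the pre-processing rather than the exact procedure.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_contrast_normalization(img, patch=9, eps=1e-8):
    """Subtractive and divisive local contrast normalization over a
    patch x patch neighborhood (a common variant with a box window)."""
    local_mean = uniform_filter(img, size=patch, mode="constant")
    centered = img - local_mean
    local_var = uniform_filter(centered ** 2, size=patch, mode="constant")
    local_std = np.sqrt(local_var)
    # divide by the local std, but never amplify low-contrast regions
    return centered / np.maximum(local_std, max(local_std.mean(), eps))

# toy usage on one 48 x 48 image
rng = np.random.default_rng(6)
img = rng.normal(size=(48, 48))
img_lcn = local_contrast_normalization(img)
img_lcn = img_lcn[4:-4, 4:-4]   # crop the 4-pixel border touched by the 9x9 window -> 40 x 40
```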
Fig. 5. Classification accuracy on Small NORB for a CR-SAE with adaptive weighting vector and a SAE with varying amounts of corrupted training data (noise standard deviation 0.1 and 1).
In the next experiment, a moving corruption patch of size 8 × 8 or 16 × 16 is added to the training images to examine how the weighting vector adapts over time. The noise patch moves randomly by one pixel after each training example. A CR-SAE with adaptive weighting vector and a SAE are trained on the clean Small NORB data set and with the added local noise. Fig. 6 shows the corrupted NORB input data at different times and the values of the weighting vector. Fig. 7 shows the classification accuracy for the moving corruption patch. The difference in accuracy between the two models is small for clean input data (a 0-by-0 noise patch). The classification accuracy is higher with a CR-SAE with adaptive weighting vector when a medium-sized 8 × 8 patch is corrupting the input data, and the difference is even
Fig. 6. Top: Corrupted input data. The 8-by-8 corruption patch moves 1 step in a random direction after each training sequence. Bottom: Values of α_i at pixel i. Black means α_i = 0 and white means α_i = 1. The values of α_i are lower around where the corruption patch is (and recently has been) but later increase as the model recognizes that the data becomes easier to reconstruct.
more noticeable for a noise patch of size 16 × 16. The classification accuracy did not improve when using an adaptive weighting vector on noise-free input data.
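The moving corruption patch described above can be generated as in the following sketch; how the random walk is realized (one step per axis) and how the image boundary is handled are our assumptions.

```python
import numpy as np

def corrupt_with_moving_patch(images, patch=8, sigma=1.0, seed=0):
    """Add a Gaussian-noise patch whose top-left corner performs a random
    walk of one pixel per training example, as in the experiment above.
    images: (N, H, W) array; returns a corrupted copy."""
    rng = np.random.default_rng(seed)
    N, H, W = images.shape
    out = images.copy()
    r = rng.integers(0, H - patch)       # initial random location
    c = rng.integers(0, W - patch)
    for n in range(N):
        out[n, r:r + patch, c:c + patch] += sigma * rng.standard_normal((patch, patch))
        # move one pixel in a random direction, staying inside the image
        r = int(np.clip(r + rng.integers(-1, 2), 0, H - patch))
        c = int(np.clip(c + rng.integers(-1, 2), 0, W - patch))
    return out

# usage on the 25 x 25 center patches used in this experiment
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 25, 25))
X_corrupted = corrupt_with_moving_patch(X, patch=8, sigma=1.0)
```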
Fig. 7. Classification accuracy for varying amounts of Gaussian noise with standard deviation 1 added to the Small NORB data set, with fixed and adaptive weighting vector. The adaptive method achieves a higher classification accuracy than the fixed method when part of the input is corrupted.
Fig. 8. Final values of the weighting vector when using an adaptive weighting vector on the four MNIST variations data sets: (top-left) mnist-back-images, (top-right) mnist-back-rand, (bottom-left) mnist-rot, and (bottom-right) mnist-rot-back-images.
4.2.2. MNIST variations

The method of using an adaptive weighting vector is tested on the four MNIST variations data sets. Fig. 8 shows the final values of the weighting vector. For mnist-back-images (top-left), the weighting vector assigns low values to the background and focuses on the center area where the digit is located and the input is more varying. For mnist-back-rand (top-right), the opposite effect is achieved, since the digit is much easier to reconstruct than random noise. For the rotated digits, the focus becomes a ball-shaped area at the center of the image.

5. Conclusion
This paper has shown that category-specific selective attention can be achieved by weighting the reconstruction error, which improves the learning of feature representations in the unsupervised pre-training phase for data sets with irrelevant patterns. The method achieves a lower test error rate compared to a standard sparse auto-encoder and achieves comparable results to other methods that use larger models. A feature selection technique that utilizes knowledge of the correct labels has been explored
for setting the selective input, and an unsupervised method for automatically learning the selective input has been proposed. Future work includes further exploring methods for setting the weighting vector without using knowledge of the labels. The possibility of combining the use of a weighting vector with other feature learning techniques [31, 20, 27, 11], and the usefulness of the method with an RBM or in domains such as convolutional settings and time-series data, are other interesting future directions.
Acknowledgements
We greatly appreciate the anonymous reviewers for their valuable comments that helped in improving the quality of this paper.

References
1. Y. Bengio, Learning deep architectures for AI, Tech. Report 1312, Dept. IRO, Universite de Montreal (2007).
2. Y. Bengio, Practical recommendations for gradient-based training of deep architectures, Tech. Report arXiv:1206.5533, U. Montreal, Lecture Notes in Computer Science Volume 7700, Neural Networks: Tricks of the Trade Second Edition, Editors: Grégoire Montavon, Geneviève B. Orr, Klaus-Robert Müller (2012).
3. Y. Bengio, A. Courville and P. Vincent, Unsupervised feature learning and deep learning: A review and new perspectives, Tech. Report arXiv:1206.5538, U. Montreal (2012).
4. Y. Bengio, P. Lamblin, D. Popovici and H. Larochelle, Greedy layer-wise training of deep networks, Advances in Neural Information Processing Systems 19 (2007) p. 153.
5. Y. Bengio and Y. LeCun, Scaling learning algorithms towards AI, Large-Scale Kernel Machines, eds. L. Bottou, O. Chapelle, D. DeCoste and J. Weston (MIT Press, 2007).
6. J. Bergstra, D. Yamins and D. D. Cox, Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures, Proc. 30th International Conference on Machine Learning (ICML), 2013.
7. C. Elkan, The foundations of cost-sensitive learning, International Joint Conference on Artificial Intelligence, 17(1), Citeseer, 2001, pp. 973-978.
8. D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent and S. Bengio, Why does unsupervised pre-training help deep learning?, Journal of Machine Learning Research 11 (February 2010) 625-660.
9. H. He and E. A. Garcia, Learning from imbalanced data, Knowledge and Data Engineering, IEEE Transactions on 21(9) (2009) 1263-1284.
10. G. E. Hinton, S. Osindero and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (2006) 1527-1554.
11. G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, http://arxiv.org/abs/1207.0580 (2012).
12. G. Hinton and R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313(5786) (July 2006) 504-507.
13. L. Holmstrom and P. Koistinen, Using additive noise in back-propagation training, Neural Networks, IEEE Transactions on 3(1) (1992) 24-38.
14. K. Jarrett, K. Kavukcuoglu, M. Ranzato and Y. LeCun, What is the best multi-stage architecture for object recognition?, Computer Vision, 2009 IEEE 12th International Conference on, IEEE, 2009, pp. 2146-2153.
15. M. Kukar, I. Kononenko et al., Cost-sensitive learning with neural networks, ECAI, 1998, pp. 445-449.
16. H. Larochelle, D. Erhan, A. Courville, J. Bergstra and Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), 2007.
17. H. Larochelle, D. Erhan, A. Courville, J. Bergstra and Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 473-480.
18. Y. LeCun, F. J. Huang and L. Bottou, Learning methods for generic object recognition with invariance to pose and lighting, Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, 2, IEEE, 2004, pp. II-97.
19. H. Lee, C. Ekanadham and A. Y. Ng, Sparse deep belief net model for visual area V2, Advances in Neural Information Processing Systems 20, 2008, pp. 873-880.
20. M. Ranzato, Y. L. Boureau and Y. LeCun, Sparse feature learning for deep belief networks, Advances in Neural Information Processing Systems 20 (2007) 1185-1192.
21. K. Matsuoka, Noise injection into inputs in back-propagation learning, Systems, Man and Cybernetics, IEEE Transactions on 22(3) (1992) 436-440.
22. V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
23. N. Pinto, D. D. Cox and J. J. DiCarlo, Why is real-world visual object recognition hard?, PLoS Computational Biology 4(1) (2008) p. e27.
24. D. C. Plaut, S. J. Nowlan and G. E. Hinton, Experiments on learning by back propagation (1986).
25. M. Ranzato, C. Poultney, S. Chopra and Y. LeCun, Efficient learning of sparse representations with an energy-based model, Advances in Neural Information Processing Systems (NIPS 2006), ed. J. P. et al. (MIT Press, 2006).
26. S. Rifai, P. Vincent, X. Muller, X. Glorot and Y. Bengio, Contracting auto-encoders: Explicit invariance during feature extraction, Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML), 2011.
27. S. Rifai, P. Vincent, X. Muller, X. Glorot and Y. Bengio, Contractive auto-encoders: Explicit invariance during feature extraction, Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 833-840.
28. K. Sohn, G. Zhou, C. Lee and H. Lee, Learning and selecting features jointly with point-wise gated Boltzmann machines, Proceedings of The 30th International Conference on Machine Learning, 2013, pp. 217-225.
29. G. L. Tarr, Multi-layered feedforward neural networks for image segmentation, Tech. Report, DTIC Document (1991).
30. P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio and P. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research 11 (2010) 3371-3408.
31. P. Vincent, H. Larochelle, Y. Bengio and P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096-1103.
32. X. Wang, Z. Tang, H. Tamura and M. Ishii, A modified error function for the backpropagation algorithm, Neurocomputing 57 (2004) 477-484.