An Unsupervised Deep-Learning Architecture that Can Reconstruct Paired Images

Ti Wang, Mohammed Shameer Iqbal, and Daniel L. Silver

Jodrey School of Computer Science, Acadia University
Wolfville, NS, Canada B4P 2R6
[email protected]

Abstract. This paper presents an unsupervised learning system that develops an associative memory structure combining two or more channels of input/output such that input on one channel will correctly generate the associated response at the other channel and vice versa. A deep learning architecture is described that can reconstruct an image of an MNIST handwritten digit from another paired handwritten digit image. In this way, the system develops a kind of supervised classification model meant to simulate aspects of human associative memory. The system uses stacked layers of unsupervised Restricted Boltzmann Machines connected by a hybrid associative-supervised top layer to ensure the development of a set of high-level features that can reconstruct one image given another in either direction. Experimentation shows that the system reconstructs accurate matching paired images that compare favourably to a back-propagation network solution.

1

Introduction

Humans learn knowledge by experiencing the world through their senses. Raw data is received at one or more sensory organs, such as the eyes and ears, and related signals are passed to the nervous system. The exact mechanism by which these experiences affect the structure of the human nervous system and how new memory is formed is not well understood [5]. Understanding it is a primary goal of research in neuroscience and artificial intelligence, particularly for those working in the area of computational learning. Deep learning architectures (DLAs) provide an exciting new substrate upon which to explore possible computational and representational models of how knowledge is acquired, consolidated and used [1]. Prior work has investigated the use of DLAs and unsupervised learning methods to develop models for a variety of purposes, including auto-associative memory, pattern completion, and clustering, as well as generalization and classification [3]. Our long-term research objective is to create a system that is capable of "showing us what it hears and telling us what it sees" using a DLA. This will require an architecture that can work with three sensory and motor modalities: audio, optical, and vocal. This program of study is meant to accomplish several objectives. Chief among these is the investigation of unsupervised learning methods that can create a model capable of generalization and classification from

one input modality to another (e.g., from optical to vocal). We are interested in how this can be done without resorting to any form of supervised learning. We are also interested in the abstract layers of features generated in a DLA for one modality channel and at the intersection of two or more channels: how do these features compare to what we know of the human nervous system? Finally, we are interested in knowledge transfer in a DLA using unsupervised methods for learning new tasks and new modalities. In this paper we take a first step by examining a DLA that is capable of learning paired-associate images at two input channels. The DLA must reconstruct the matching image at channel A when it observes a paired image at channel B, and vice versa. By doing so, the system uses unsupervised learning to develop an associative memory model that performs a form of classification from one channel to another. The system uses layers of Restricted Boltzmann Machines (RBMs) stacked into a DLA. We will show that such a DLA can work quite well when assisted by supervised learning at only the highest-level representation. Experimentation shows qualitatively and quantitatively that the system generates reasonably accurate matching images, as compared to a traditional Back-Propagation (BP) network solution.

2

Background

Artificial Neural Networks (ANNs) are among the most commonly used machine learning techniques. Although a variety of ANNs are used to model highly complex tasks such as image recognition, many do not work in the same fashion as the human nervous system. For example, supervised BP ANNs are good at modeling complex mapping relations between input and output domains, but are poor at recalling input patterns. Humans have the capability of recovering complete information from partial information using associative memory. When a child learns the characteristics of a cat, he or she learns both the appearance of the cat and the sound it makes. Later, on seeing a picture of a cat, the child can recall the sound a cat makes [5]. An associative ANN simulates aspects of how collections of neurons store and recall associative memories. Geoffrey Hinton, University of Toronto, advocates using Boltzmann Machine associative networks to simulate human brain structure [3]. After a Boltzmann Machine has been trained on a set of patterns, it has the ability to reconstruct one of those patterns from a partial or noisy version of the pattern.

Boltzmann Machines: A Boltzmann Machine (BM) is a stochastic neural network of binary neurons that is capable of reconstructing a stored pattern from a partial pattern [2]. A BM is made up of two layers of binary neurons, or units, that are either visible or hidden. All the neurons in the visible and hidden layers are inter-connected, forming a complete graph. Given some input on its visible units, a BM will settle into an equilibrium state with energy E = Σ_i E_i, where E_i = −Σ_{j≠i} s_i s_j w_ij − b_i s_i, s_i and s_j are the states of two neurons i and j, w_ij is the weight of the connection between them, and b_i is the bias weight for neuron i [2]. After being trained, the BM will settle into the memory state at equilibrium closest to the initial state of the neurons [2].
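The energy function and stochastic update just described can be sketched in a few lines of NumPy; the function and variable names (`bm_energy`, `gibbs_step`, `W`, `b`) and the tiny 3-unit machine are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bm_energy(s, W, b):
    """Global BM energy; the 0.5 factor counts each symmetric pair
    w_ij s_i s_j once when summing over the full weight matrix."""
    return -0.5 * s @ W @ s - b @ s

def gibbs_step(s, W, b, T, rng):
    """Update one randomly chosen neuron: it turns on with probability
    p = 1 / (1 + exp(-dE_i / T)), where dE_i is its energy gap."""
    i = rng.integers(len(s))
    dE = W[i] @ s + b[i]                     # energy gap for unit i being on
    p_on = 1.0 / (1.0 + np.exp(-dE / T))
    s[i] = 1 if rng.random() < p_on else 0
    return s

# A tiny 3-unit machine (symmetric weights, zero diagonal)
W = np.array([[0.0, 1.0, -1.0], [1.0, 0.0, 0.5], [-1.0, 0.5, 0.0]])
b = np.zeros(3)
s = np.array([1, 0, 1])
print(bm_energy(s, W, b))  # 1.0 for this state
s = gibbs_step(s, W, b, T=2.0, rng=np.random.default_rng(0))
```

Repeating `gibbs_step` while lowering T toward 1 is the annealed settling process described above.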
Fig. 1. RBM Training Process

Fig. 2. Stacking Multi-level RBMs

The activation function of a BM converts a weighted input and a temperature parameter, T, into a probability given by p_{i=on} = 1 / (1 + exp(−ΔE_i / T)) [2]. The neuron only comes on if its probability is greater than a random value. The energy E of the BM is affected by the global temperature T, which declines from a maximum value to 1 based on a predetermined schedule [2]. This technique helps keep the system from getting stuck in a local minimum during the early stages of recall. As the temperature reduces to T = 1, the system moves towards a state of equilibrium, which reconstructs the nearest stored pattern. Learning is slow in BMs that have many hidden nodes. This is because the weight update equation requires sampling each neuron i for each training example, and then sampling the states of all other neurons j in order to compute E_i. The algorithm continues until the network reaches a state of equilibrium where its change in state is below a threshold.

Restricted Boltzmann Machines (RBM): An RBM is a variant of a BM that is meant to overcome the problem of long training times by limiting the number of connections in its network and using an approximate weight update algorithm. RBMs have both visible and hidden layers of neurons just like BMs; however, all intra-layer connections are removed [3]. When training data x_i is given to the visible neurons v_i, the RBM temporarily clamps their states and frees the states of the hidden binary neurons h_j. Node h_j turns on with probability p_j = 1 / (1 + exp(−b_j − Σ_i w_ij v_i)). The visible units are then unclamped, and node v_i turns on with probability p_i = 1 / (1 + exp(−b_i − Σ_j w_ij h_j)). The system computes the overall energy E = −Σ_i b_i v_i − Σ_j b_j h_j − Σ_i Σ_j v_i h_j w_ij, where b_i and b_j are the bias terms for their respective nodes [2]. The RBM computes the mean squared error (MSE) between the reconstructed input value x'_i and the original input value x_i and reduces it with a gradient-descent algorithm that changes the weights w_ij. The state h_j of hidden neuron j keeps changing with probability p_j during training, and the weights w_ij are updated until either the global energy E or the probability p_i crosses a threshold. At any point in time, with probability p_i, neuron i will reconstruct the input data x_i. As shown in Figure 1, the weights are updated as per the formula Δw_ij = ε(⟨v_i h_j⟩^0 − ⟨v_i h_j⟩^1), where ε is the learning rate; this is an approximation of the gradient of the log likelihood [3]. This method of weight update is called contrastive divergence (CD). In practice, the CD algorithm rarely becomes trapped in a poor local minimum. The system is trained until the hidden layer is capable of reconstructing the original input pattern at the visible units to the desired level of accuracy. After training, the hidden layer weights of the RBM have learned the feature distribution of the input space; that is, w_ij gives the probability of feature h_j given input v_i.

Deep Learning Architectures: Most objects are made up of several smaller parts or features. For example, a car is a combination of smaller features like wheels and a frame. Breaking it down further, a wheel is made up of smaller features like tires and rims. The higher-level abstraction is the car, whereas the lower-level abstraction is a tire. Deep learning methods aim at learning feature hierarchies, with features at higher levels of the hierarchy formed by the composition of lower-level features [1, 7]. One of the advantages of RBMs is that they can be stacked as layers to learn high-level features of the input data. As shown in Figure 2, the hidden layer of one RBM can be used as the input layer for a second RBM [1]. This second RBM layer will learn the feature distribution of the hidden layer of the first RBM. As layers are stacked, the network learns increasingly complex combinations of features of the original data. These systems are capable of unsupervised clustering of unlabeled data based on a hierarchy of features; hence the name deep learning, or deep feature learning. Neuroscience studies have shown that the mammalian brain has a deep architecture with multiple levels of abstraction corresponding to different areas of the neocortex [6]. Many feel that RBM deep learning architectures develop a hierarchy of features in a fashion similar to the mammalian brain. Hinton has presented research on recognizing handwritten digit images, which simulates human vision, using stacked RBMs [3].
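As a concrete illustration, a one-step contrastive divergence (CD-1) update for a binary RBM can be sketched as below; the toy patterns, layer sizes, and learning rate are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_v, b_h, rng, lr=0.1):
    """One CD-1 update for a binary RBM. v0 is a batch of binary
    visible vectors (n x n_visible); updates W, b_v, b_h in place."""
    # Positive phase: p(h=1 | v0) and a binary sample of h
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one-step reconstruction of v, then h
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # dW = lr * (<v h>^0 - <v h>^1), averaged over the batch
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return ((v0 - p_v1) ** 2).mean()   # reconstruction MSE

# Train a tiny RBM (6 visible, 4 hidden) to memorize two patterns
rng = np.random.default_rng(0)
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
W = 0.01 * rng.standard_normal((6, 4))
b_v, b_h = np.zeros(6), np.zeros(4)
for _ in range(2000):
    mse = cd1_step(data, W, b_v, b_h, rng)
print(round(mse, 3))   # reconstruction MSE shrinks as training proceeds
```

Using the reconstruction probabilities `p_v1` rather than binary samples in the negative phase is a common variance-reduction choice for CD-1 sketches like this one.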

3

Theory

The objective of this research is to develop a learning system that can memorize and recall knowledge using an associative memory network. The learning system should be able to recall the pattern from the associative network on one sensory modality given data on another sensory modality. The network will be trained such that when it is given an image, it will generate an associated image and in this way indicate the classification of the first image. To achieve this goal, instead of using traditional labeled datasets, two or more unlabeled datasets are used to support unsupervised feature generation. The deep learning architecture (DLA) of the learning system is composed of two major parts, a hybrid associative-supervised memory network and two or more associative sensory channel networks (see Figure 3). The sensory channel networks are designed for the reconstruction of incoming sensory data. The hybrid associative-supervised memory network, which ties the sensory channel networks together, can be modeled with an RBM associative network [4]. The associative memory at the top of the DLA shown in Figure 3 simulates a human’s long-term memory that combines separate channel features. The DLA will be given a variety of paired-associate handwritten digit images to learn.

Fig. 3. Two-channel DLA

Fig. 4. BP ANN used in Experiment 1

The challenge for our DLA at the top level is to create the features of the digit images for one channel when presented with only the features of the other channel [4]. To develop a more accurate model, we currently untie the associative memory weights and use the BP algorithm to fine-tune them using the posterior probabilities gathered from hidden layers 2 and 2'. When training to generate the features of channel 2, the BP algorithm uses the posteriors at hidden layer 2 as the inputs and the posteriors at hidden layer 2' as the supervised signal, and vice versa.
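This fine-tuning step amounts to learning a mapping between the two sets of posteriors by gradient descent. The sketch below uses a single sigmoid layer and made-up toy posteriors; the names (`finetune_mapping`, `h2`, `h2p`) and all settings are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_mapping(h2, h2p, n_steps=500, lr=0.5, seed=0):
    """Learn a sigmoid mapping from channel-1 posteriors (h2) to
    channel-2 posteriors (h2p) by gradient descent, standing in for
    BP fine-tuning of the untied associative weights."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((h2.shape[1], h2p.shape[1]))
    b = np.zeros(h2p.shape[1])
    for _ in range(n_steps):
        pred = sigmoid(h2 @ W + b)
        err = pred - h2p                  # cross-entropy gradient w.r.t. logits
        W -= lr * h2.T @ err / len(h2)
        b -= lr * err.mean(axis=0)
    return W, b

# Toy posteriors: two "digit classes" with distinct channel-2 targets
h2  = np.array([[0.9, 0.1, 0.1], [0.1, 0.9, 0.9]])
h2p = np.array([[0.8, 0.2], [0.2, 0.8]])
W, b = finetune_mapping(h2, h2p)
print(np.round(sigmoid(h2 @ W + b), 1))  # approaches the channel-2 targets
```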

4

Experiment 1

Two empirical studies were carried out using two different data sets. The first experiment used the MNIST handwritten digit dataset. The second experiment used a synthetic dataset of handwritten digits. In both experiments, five pairs of odd and even digits were associated with each other: 1-2, 3-4, 5-6, 7-8, and 9-0. A model using the DLA architecture described in Section 3 and two standard BP networks were trained and compared. One BP network is used for mapping from odd to even digits, and the other for mapping from even to odd digits. Both methods were challenged to reconstruct the image of one digit from its paired-associate image. Objective: The objective of this experiment is to compare the unsupervised DLA with a supervised BP ANN approach to learning paired-associate images. As shown in Figure 3, each learning system is trained such that when a handwritten digit image is provided, the system will generate its paired digit image. Material and Methods: This experiment uses a dataset of paired 28 x 28 gray-scale images of handwritten digits from the MNIST database [3], as described above. A dataset of 5000 examples is used to train the learning system. The training process stops when the maximum number of iterations (300) is reached or the MSE falls below a pre-set threshold. We use another set of 1000 examples as a validation set to monitor the BP fine-tuning and avoid under-fitting and over-fitting. An independent set of 1000 examples is used as a test set. The odd digit image of a test example is used to test the reconstruction of its corresponding even digit image, and vice versa.
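Constructing a paired-associate training set from labeled MNIST-style arrays might look as follows; the choice of a random partner image and all the names here (`make_pairs`, `ODD_TO_EVEN`) are illustrative assumptions, since the paper does not specify how partner images were selected.

```python
import numpy as np

ODD_TO_EVEN = {1: 2, 3: 4, 5: 6, 7: 8, 9: 0}

def make_pairs(images, labels, seed=0):
    """Pair each odd-digit image with a randomly chosen image of its
    associated even digit (1-2, 3-4, 5-6, 7-8, 9-0). `images`/`labels`
    are assumed MNIST-style arrays, one flattened 28x28 image per row."""
    rng = np.random.default_rng(seed)
    by_digit = {d: np.flatnonzero(labels == d) for d in range(10)}
    pairs = []
    for odd, even in ODD_TO_EVEN.items():
        for i in by_digit[odd]:
            j = rng.choice(by_digit[even])   # random partner of the paired class
            pairs.append((images[i], images[j]))
    return pairs

# Smoke test with stand-in data: two examples of each digit 0-9
imgs = np.zeros((20, 784))
lbls = np.tile(np.arange(10), 2)
print(len(make_pairs(imgs, lbls)))  # 10: two pairs per odd digit
```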

Table 1. Percent accuracy of test set reconstruction

Algorithm   1→2    2→1    3→4    4→3    5→6    6→5    7→8    8→7    9→0    0→9   Average
DLA         93.5   98.0   75.5   96.5   90.0   87.0   88.5   91.0   92.0   92.0   90.4
BP ANN     100.0   43.5   89.5   92.0   88.5   96.0   88.5   88.0   91.5   94.0   81.6

Fig. 5. Examples of reconstruction results from DLA and BP ANN

A deep learning architecture of RBMs is used to develop an unsupervised learning model for the problem. The architecture is in accord with Figure 3. A channel network is composed of two RBM layers, each of which contains 500 hidden neurons. Successively, hidden layers 1 and 1' and then layers 2 and 2' develop more abstract features of the original images [3]. Hidden layers 1 and 2 learn a generative DLA representation of the odd digits. Hidden layers 1' and 2' learn a generative DLA representation of the even digits. The associative top layer contains 3000 neurons. It brings together the features of layers 2 and 2' to create mapping functions that can reconstruct an image on one channel from the image on the other. We developed two BP networks to learn the same paired-associate mapping. One network is trained to map odd digit images to even digit images, the other vice versa. Both BP networks use the architecture shown in Figure 4. The BP networks use the same training set, validation set and test set as the DLA. The accuracy of reconstruction is measured by testing the output images using Hinton's DLA handwritten digit classification software. This software is known to classify the MNIST dataset of handwritten digits with only 1.15% error [3]. We passed the input images and the reconstructed images through Hinton's software to determine their accuracy. We note that the accuracy of Hinton's classification software is high because it was developed by using the BP algorithm to fine-tune all the weights of a DLA to classify an image. Our work is focused on generating paired images with little or no supervised learning. Results and Discussion: Using Hinton's software, we tested reconstruction on the test set. The results are shown in Table 1. On average, the DLA model generated images that were 90.4% accurate, and the BP ANN generated images that were 81.6% accurate. Figure 5 shows examples of reconstructions done by the DLA and BP ANN.
One can see that the images generated by the DLA are clearer than those generated by the BP ANN. We conjecture that the DLA is able to better differentiate features from noise as compared to the BP network. We designed Experiment 2 to investigate this further.
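The evaluation procedure can be expressed as a small helper. Here `classify` stands in for Hinton's pre-trained digit classifier, which is external software, so the callable, the stub, and the function name are all hypothetical sketch elements.

```python
import numpy as np

# Paired-associate mapping used in both experiments
PAIR = {1: 2, 2: 1, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7, 9: 0, 0: 9}

def reconstruction_accuracy(recon_images, source_labels, classify):
    """Fraction of reconstructions that an external digit classifier
    labels as the correct paired digit. `classify` is a hypothetical
    stand-in for Hinton's pre-trained DLA classifier."""
    hits = sum(classify(img) == PAIR[lbl]
               for img, lbl in zip(recon_images, source_labels))
    return hits / len(source_labels)

# Smoke test with a stub classifier that always answers "2"
stub = lambda img: 2
imgs = [np.zeros((28, 28))] * 4
print(reconstruction_accuracy(imgs, [1, 1, 3, 5], stub))  # 0.5: only the two 1->2 pairs match
```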

Fig. 6. The templates for each digit.

Fig. 7. Examples of digits with 10% noise.

5

Experiment 2

Objective: The objective of this experiment is to compare the DLA method to the BP ANN in overcoming noise injected into synthetic training examples. The DLA in this study uses only the unsupervised CD algorithm to train its model. Material and Methods: This experiment uses a synthetic dataset that contains five different categories of 10 x 5 paired images, similar to those used in Experiment 1 and shown in Figure 6. To create a variety of examples, such as those shown in Figure 7, 10% random noise was added to each template image to produce 20 instances of each digit, or 200 in total. The first 100 of these images are used as a training set, while the remaining 100 are used as a test set. A deep learning architecture of RBMs, in accord with Figure 3, is used to develop an unsupervised learning model. To achieve our goal of using a purely unsupervised DLA, we stack a 3-layer RBM to model the associative memory network instead of using a hybrid associative-supervised RBM. Each of these layers contains 100 hidden neurons. The training process stops when the maximum number of iterations (100) is reached. As in Experiment 1, we developed two BP networks to learn the same paired-associate mapping. Both BP networks used the architecture shown in Figure 4, with 40 neurons in layers 1 and 3 and 20 neurons in layer 2. The BP networks use the same training set and test set as the DLA, and 30 of the 100 examples from the training set are used as a validation set. The accuracy of reconstruction was measured by comparing the similarity between the reconstructed images and their corresponding target template images. We compute the pixel root mean square error (RMSE) between a generated image and its corresponding template (without noise). The RMSE gives the average difference between corresponding pixels in these two images. Results and Discussion: The RMSE of the reconstructed images is shown in Table 2.
The DLA out-performs the BP network in generating the images in the presence of noise. Figures 8 and 9 show a set of example digit images reconstructed by the DLA.
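The noise injection and RMSE evaluation described above can be sketched as follows; the template shape matches the paper's 10 x 5 images, but the blob pattern and function names (`add_noise`, `pixel_rmse`) are illustrative assumptions.

```python
import numpy as np

def add_noise(template, frac=0.10, seed=0):
    """Flip a random `frac` of the pixels in a binary template image."""
    rng = np.random.default_rng(seed)
    img = template.copy()
    n_flip = int(round(frac * img.size))
    idx = rng.choice(img.size, size=n_flip, replace=False)
    img.flat[idx] = 1 - img.flat[idx]
    return img

def pixel_rmse(generated, template):
    """Root mean square error per pixel against the clean template."""
    return np.sqrt(((generated - template) ** 2).mean())

template = np.zeros((10, 5))       # 10 x 5 image, as in Experiment 2
template[2:8, 1:4] = 1             # a crude digit-like blob
noisy = add_noise(template, 0.10)
print(round(pixel_rmse(noisy, template), 3))  # 0.316 = sqrt(0.10) when 10% of pixels flip
```

Because each flipped binary pixel contributes a squared error of exactly 1, flipping 10% of the pixels always yields an RMSE of sqrt(0.10), regardless of which pixels are flipped.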

6

Conclusion

We have presented work on an unsupervised learning system that is able to develop an associative memory structure that combines two or more channels of input or output. Our desire is to have the input on one channel correctly generate the associated response at the other channel and vice versa. Our long-term goal is to develop learning systems that are able to learn the relationships between different sensory input and/or motor output modalities similar to humans.

Table 2. Percent accuracy of test set reconstruction

Algorithm   1→2     2→1     3→4     4→3     5→6     6→5     7→8     8→7     9→0     0→9    Average
DLA         95.30   93.52   94.15   93.65   94.01   94.99   94.86   87.96   94.50   94.00   93.70
BP ANN      74.61   78.33   73.59   82.06   76.73   77.07   70.31   77.98   79.45   71.18   75.75

Fig. 8. Reconstruction of even digits

Fig. 9. Reconstruction of odd digits

In this paper we present a deep learning architecture (DLA) that can reconstruct an image of a MNIST handwritten digit from another paired handwritten digit and vice versa. In this way, the system develops a kind of supervised classification model meant to simulate aspects of human associative memory. The system uses stacked layers of unsupervised Restricted Boltzmann Machines (RBM) connected by a hybrid associative-supervised top layer to ensure the development of a set of high-level features that can reconstruct one image when given another in either direction. Experimentation shows qualitatively (by viewing the generated images) and quantitatively (test set statistics) that the system reconstructs reasonably accurate matching images that compare favourably to a back-propagation network solution. In future work, a full Boltzmann Machine will be used as the top-level associative memory replacing the BP fine-tuning of the current RBM top layer weights. In the long term, the DLA will be expanded to generate sound when provided an image or conversely generate an image when it hears a sound.

References

1. Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
2. G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 282–317. MIT Press, Cambridge, MA, USA, 1986.
3. Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006.
4. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In ICML'11, pages 689–696, 2011.
5. Mark R. Rosenzweig. Experience, memory, and the brain. American Psychologist, 39(4), April 1984.
6. Thomas Serre, Gabriel Kreiman, Minjoon Kouh, Charles Cadieu, Ulf Knoblich, and Tomaso Poggio. A quantitative theory of immediate visual recognition. Progress in Brain Research, pages 33–56, 2007.
7. Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems 25, pages 2231–2239, 2012.
