Landmark Recognition with Deep Learning

PROJECT LABORATORY

submitted by Filippo Galli

NEUROSCIENTIFIC SYSTEM THEORY
Technische Universität München
Prof. Dr. Jörg Conradt

Supervisor: Marcello Mulas, PhD
Final Submission: 18.01.2016

Technische Universität München

Neurowissenschaftliche Systemtheorie
Project Practical
Filippo Galli 3672394

Landmark recognition with deep learning
05-Oct-2015

Problem description:
Autonomous robotic navigation is based on the development of simultaneous localization and mapping (SLAM) strategies. In general, SLAM algorithms require the integration of odometric information with location-specific sensory information. In comparison with human performance, the recognition of visual landmarks is a problem that still lacks a satisfying solution. However, recent developments in machine learning [1] seem promising for improving robotic recognition skills; in fact, deep learning techniques are currently being applied successfully to several visual classification problems.

Task:
The primary goal of the students is to use an existing deep learning toolbox [2, 3] to recognize landmarks in an indoor environment. The images will be recorded by an on-board camera mounted on top of a mobile robot. In order to reach this goal the students shall:

• identify potential landmarks to recognize
• record a training set of images of the landmarks
• train a deep neural network using an available toolbox (Caffe [2] or Theano [3])
• classify visible landmarks in a video recorded by a mobile robot while exploring the indoor environment
• compare and evaluate the performance of different deep learning networks
• write a report that includes a description of the work done and a summary of the most relevant obtained results.

Bibliography:
[1] Hinton, G. E., Osindero, S. & Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comp. 18, 1527–1554 (2006)
[2] Caffe, http://caffe.berkeleyvision.org
[3] Theano, https://github.com/Theano/Theano

Supervisor:

Marcello Mulas

(Jörg Conradt) Professor

Abstract

In autonomous robot navigation, Simultaneous Localization and Mapping is achieved through the integration of odometric and location-specific information. Among the latter, the recognition of specific landmarks is a promising option, given the latest advancements in machine learning techniques. This report presents a possible solution for visual landmark recognition, based on the application of deep learning techniques to video-recorded images. In particular, convolutional networks have been used for the detection and classification of images containing multiple objects, with the help of open source machine learning libraries.


Contents

1 Introduction . . . 4

2 Theory . . . 5
  2.1 Convolutional Neural Networks . . . 5
  2.2 Logistic Regression Layer . . . 6

3 Implementation . . . 8
  3.1 Network architecture . . . 8
  3.2 Database . . . 9
  3.3 Image pre-processing . . . 10

4 Results . . . 13
  4.1 Single object testing . . . 13
  4.2 Multiple objects with manual framing . . . 15
  4.3 Multiple objects with automatic framing . . . 15

5 Conclusion . . . 17

List of Figures . . . 18

Bibliography . . . 19


Chapter 1 Introduction

The ability to recognize landmarks in space could provide a way of retrieving location-specific information, thus opening new possibilities for Simultaneous Localization and Mapping (SLAM) in the field of autonomous robot navigation. Given recent progress in deep learning, a promising approach consists in using neural networks for image recognition, in particular Convolutional Neural Networks [3]. The idea is to train a network to classify the objects depicted in images; in this project this is achieved by analyzing frames streamed by a webcam mounted on top of a mobile robot. In the following chapters the problem is tackled as follows:
1. Theory: an overview of what Convolutional Neural Networks are, how they work, and how they differ from other kinds of neural networks
2. Implementation: the available toolboxes for machine learning algorithms, the chosen network architecture, how the database was built and how images were pre-processed to make the project work efficiently
3. Results: what was achieved in terms of image recognition
4. Conclusion: an analysis of what did not work and where to concentrate future efforts.


Chapter 2 Theory

This chapter shortly introduces the reader to the theory behind the adopted techniques. As mentioned, one of the highest-performing network architectures for image recognition and classification is the convolutional network. As a member of the family of deep learning algorithms, its greatest strength lies in its ability to abstract characteristic features of its inputs. Unlike the Single or Multi Layer Perceptron (MLP), Convolutional Neural Networks (CNN) exploit spatial correlation: units are tiled in such a way that a neuron in a given layer reacts not to all the units of the previous layer, but only to a subset of them. This choice bears a closer resemblance to the way the animal brain processes images. In fact, cells in our visual cortex are sensitive to small sub-regions of the visual field, called receptive fields, and they act like filters processing the input space, i.e. what we actually see, exploiting the strong local spatial correlation present in natural images.

2.1 Convolutional Neural Networks

The key point of a CNN is therefore that information is retrieved not only from the unit itself, but also from its position among the others. This is achieved by building a network based on the following concepts:
• Local receptive fields: in MLPs, the units of every layer are fully connected with the units of the previous layer; in a CNN, each unit is assigned a subset of them, its local receptive field, as shown in Fig. 2.1.
• Shared weights: in a CNN, the weights and biases of a layer are the same for all neurons, meaning that within a single layer every neuron assigns the same set of weights to the units of its receptive field.


A possible interpretation of this is that every neuron in a single layer is excited by, and detects, the same feature. Such a layer is therefore also called a feature map and constitutes the convolutional layer. Usually several feature maps working on the same input are organized in the network, in order to detect different features.
• Pooling: pooling layers usually follow convolutional layers. One of the most common kinds is the max-pooling layer, which condenses the information of each feature map into a smaller layer. For instance, a max-pooling unit may keep only the maximum activation within a 2x2 subregion of the feature map and output that single value. In this way computational effort is greatly reduced without a great loss of information, since the approximate position of the most evident feature in that region is preserved.

Figure 2.1: Local receptive field: how neurons are connected in different layers [4]
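As a minimal illustration of the max-pooling operation just described, the following sketch condenses a single 2D feature map with a 2x2 pool size; it assumes NumPy and is not code taken from the project:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Condense a 2D feature map by keeping the maximum of each 2x2 block."""
    h, w = feature_map.shape
    # Trim odd rows/columns so the map splits evenly into 2x2 blocks.
    trimmed = feature_map[:h // 2 * 2, :w // 2 * 2]
    # Group the map into 2x2 blocks and take the maximum of each block.
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 2],
               [2, 2, 1, 3]], dtype=float)
print(max_pool_2x2(fm))   # [[4. 2.] [2. 5.]]
```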

2.2 Logistic Regression Layer

All the information coming from the convolutional layers, condensed by the max-pooling layers, needs to be processed to obtain an output, i.e. the class to which the object represented in the input image belongs. A convenient way to do this is to add a logistic regression layer. The logistic regressor is a probabilistic linear classifier in which all units are fully connected with the units of the previous layer, meaning that there is no local receptive field or subregion:

7

2.2. LOGISTIC REGRESSION LAYER

each unit's output is a function of the outputs of all the units in the previous layer and of the corresponding weights. In detail, the probability that an input vector x belongs to class i, an outcome of the stochastic variable Y, can be written as:

$$P(Y = i \mid x, W, b) = \mathrm{softmax}_i(Wx + b) = \frac{\exp(W_i x + b_i)}{\sum_j \exp(W_j x + b_j)} \qquad (2.1)$$

where W is the weight matrix and b is the bias vector. The model prediction is then the class whose probability is maximal:

$$y_{\mathrm{pred}} = \operatorname*{argmax}_i \; P(Y = i \mid x, W, b) \qquad (2.2)$$
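The two equations above can be read directly as code. The following sketch, using NumPy with random weights purely for illustration, computes the class probabilities of Eq. 2.1 and the prediction of Eq. 2.2:

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(x, W, b):
    """Return the class probabilities (Eq. 2.1) and the predicted class (Eq. 2.2)."""
    p = softmax(W.dot(x) + b)
    return p, int(np.argmax(p))

# Toy example: 3 classes, 4-dimensional input, random weights for illustration only.
rng = np.random.RandomState(0)
W = rng.randn(3, 4)
b = np.zeros(3)
x = rng.randn(4)
probs, y_pred = predict(x, W, b)
print(probs, y_pred)
```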


Chapter 3 Implementation

Of the two toolboxes suggested for the implementation, Theano ([1] and [2], http://deeplearning.net/software/theano/) and Caffe (http://caffe.berkeleyvision.org/), the former was chosen. On the one hand, Caffe is allegedly the fastest library for implementing machine learning algorithms and is written in C++. Theano, on the other hand, is a Python library that allows faster prototyping and testing at the cost of lower computational speed, which was in any case not the main concern of the project. Moreover, the Theano framework was handled through Lasagne, a lightweight Python library for building and training neural networks.

3.1 Network architecture

The tested architecture of the convolutional network was built in the following way (a code sketch of this architecture is given after the list):
• Input layer: the input layer is a "square" of 80x80 units, which is the size of the input images in pixels. Moreover, these images are coded as single-channel inputs, meaning that they are described by greyscale levels instead of RGB values. These choices are motivated in the following paragraphs.
• Hidden convolutional layer: this convolutional layer defines 32 feature maps, whose local receptive field is a square of 5x5 pixels. Right after the feature maps a max-pooling layer is placed, with a pool size of 2x2 pixels.
• Hidden convolutional layer: another convolutional layer is added with the same parameters: 32 feature maps with a 5x5 local receptive field, followed by a max-pooling layer with a 2x2 pool size.



• Fully connected hidden layer: a layer of 256 units, fully connected with the previous layer.
• Logistic regression output layer: a fully connected logistic regressor with 6 units, one for each class, providing the output of the network.

Figure 3.1: Four classes. From top to bottom, from left to right: watering can, plant, bookcase, table
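The architecture listed above could be expressed in Lasagne roughly as follows. This is a sketch rather than the project's exact code, and the choice of the rectifier nonlinearity for the hidden layers is an assumption:

```python
import lasagne
from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer
from lasagne.nonlinearities import rectify, softmax

def build_cnn(input_var=None):
    # 80x80 single-channel (greyscale) input images.
    net = InputLayer(shape=(None, 1, 80, 80), input_var=input_var)
    # First convolutional layer: 32 feature maps with a 5x5 receptive field,
    # followed by 2x2 max-pooling.
    net = Conv2DLayer(net, num_filters=32, filter_size=(5, 5), nonlinearity=rectify)
    net = MaxPool2DLayer(net, pool_size=(2, 2))
    # Second convolutional layer with the same parameters.
    net = Conv2DLayer(net, num_filters=32, filter_size=(5, 5), nonlinearity=rectify)
    net = MaxPool2DLayer(net, pool_size=(2, 2))
    # Fully connected hidden layer with 256 units.
    net = DenseLayer(net, num_units=256, nonlinearity=rectify)
    # Logistic regression output layer: one unit per class (6 classes).
    net = DenseLayer(net, num_units=6, nonlinearity=softmax)
    return net
```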

3.2 Database

Since the system was tested in the corridor of an academic department, the trained network should eventually recognize some typical landmarks characteristic of that environment. Although the choice was arbitrary, some technical constraints arose:
• Camera constraints: the video camera placed on the mobile robot captures images of 640x480 pixels from almost ground level, thus restricting the choice of landmarks to small, floor-level objects.
• Class information content: choosing, for instance, a white wall as a landmark would have led to a class with a very limited amount of information, since a white wall was in most cases also the background of the other object images.
The choice then fell on the following object classes: table, fire extinguisher, bookcase, watering can, plant and trash bin. Fig. 3.1 shows some examples of the raw images.
In order to train a network able to recognize objects under conditions different from those of the database, images were taken with different background conditions and different object orientations.


Figure 3.2: Different images of the same object: trash bin

Figure 3.3: Multiple object images

Fig. 3.2 shows how the attempt to train the network to focus only on the invariant features of an object is reflected in the database. The database, containing 3000 images, is then subdivided into three subsets for training, validation and testing. Some additional images containing multiple objects were taken for demonstration purposes and are shown in Fig. 3.3.
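A minimal sketch of how the 3000-image database could be shuffled and divided into the three subsets; the split sizes and the random seed are assumptions, not values taken from the project, and images and labels are assumed to be NumPy arrays:

```python
import numpy as np

def split_dataset(images, labels, n_train=2400, n_val=300, seed=0):
    """Shuffle the database and split it into training, validation and test subsets."""
    idx = np.random.RandomState(seed).permutation(len(images))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return ((images[train], labels[train]),
            (images[val], labels[val]),
            (images[test], labels[test]))
```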

3.3 Image pre-processing

Images recorded by the camera on the mobile robot are RGB files of 640x480 pixels. Pre-processing was needed to avoid the following problems:


• Memory overload: since in Theano and Lasagne all data have to be loaded at the same time during training, and the database consists of 3000 images, the machine on which the project was carried out ran out of memory, making it impossible to complete the training of the network.
• Time consumption: using raw images, the CNN would have required much more time to complete training, thus slowing down the testing of the code, with no actual benefit, since much of the information contained in the images is of no help for the classification problem.
The proposed solution to these problems requires two steps of image manipulation. First, the object contained in the image was cropped and only a subset of pixels was kept for training: from the original image a square of variable size (400x400 pixels for the table class, 300x300 for the others) was extracted and then resized to 80x80 pixels. Secondly, the image was converted from RGB to greyscale, reducing by a factor of three the effort required by the network for training while keeping a vast amount of information: as shown in Fig. 3.4, most of it is contained in the luminance rather than in the chrominance. The result of the two steps of image manipulation can be seen in Fig. 3.5.

Figure 3.4: Comparison of the information contained in the luminance plane (top right) and in the chrominance planes (bottom row) with respect to the original image (top left).


Figure 3.5: Fire extinguisher after applying cropping, resizing, and greyscaling
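The pre-processing described above (cropping, resizing to 80x80 pixels and greyscale conversion) can be sketched as follows, assuming the Pillow library. For simplicity the crop is centered on the image, whereas in the project the crop was placed around the object, and the file paths are purely hypothetical:

```python
from PIL import Image

def preprocess(path, crop_size=300, out_size=80):
    """Crop a square region (here around the image centre), resize it to
    80x80 pixels and convert it to greyscale (crop_size is 400 for the table class)."""
    img = Image.open(path)
    w, h = img.size
    left = (w - crop_size) // 2
    top = (h - crop_size) // 2
    img = img.crop((left, top, left + crop_size, top + crop_size))
    img = img.resize((out_size, out_size))
    return img.convert('L')   # single-channel greyscale

# Hypothetical usage on one raw 640x480 frame.
# preprocess('raw/extinguisher_001.jpg').save('processed/extinguisher_001.png')
```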


Chapter 4 Results

Training the network for 100 epochs produced the following results. Fig. 4.1 shows the trend of the validation accuracy, i.e. the percentage of samples in the validation set assigned to the correct class: it rises rapidly to 99.83 % and then saturates, never reaching 100 % (a sketch of a typical training and evaluation setup is given after the following list). Further tests were carried out to assess the network performance:
• single object testing
• multiple objects with manual framing
• multiple objects with automatic framing
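As a reference, this is how a training function and the validation accuracy of Fig. 4.1 are typically set up in Theano/Lasagne. The update rule and its hyperparameters are assumptions and not necessarily those used in the project; build_cnn refers to the architecture sketch of Chapter 3:

```python
import theano
import theano.tensor as T
import lasagne

# Symbolic inputs: a batch of 80x80 greyscale images and their integer labels.
input_var = T.tensor4('inputs')
target_var = T.ivector('targets')
network = build_cnn(input_var)   # architecture sketched in Section 3.1

# Training loss and parameter updates (update rule and rates are assumptions).
prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params,
                                            learning_rate=0.01, momentum=0.9)

# Validation accuracy: fraction of samples assigned to the correct class.
test_prediction = lasagne.layers.get_output(network, deterministic=True)
test_acc = T.mean(T.eq(T.argmax(test_prediction, axis=1), target_var),
                  dtype=theano.config.floatX)

train_fn = theano.function([input_var, target_var], loss, updates=updates)
val_fn = theano.function([input_var, target_var], test_acc)
```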

4.1 Single object testing

A manual test of the actual performance of the CNN consisted of asking the model for predictions on single images, after training and saving the model parameters. The input images are shown in Fig. 4.2 and the results in Fig. 4.3.

Figure 4.1: Validation accuracy (%) for every epoch during training


Figure 4.2: Images fed as input for "manual" testing. From left to right, top to bottom: '405.jpg', '966.jpg', '1426.jpg', '1910.jpg', '2436.jpg', '2593.jpg'

Figure 4.3: Results of classification for images represented in Fig. 4.2


Figure 4.4: Single objects cropped from the top row images of Fig. 3.3:’1.jpg’, ’2.jpg’, ’3.jpg’, ’4.jpg’

Figure 4.5: Classification results of images in Fig.4.4.

4.2 Multiple objects with manual framing

From the top row images shown in Fig. 3.3, which contain multiple objects, single object images have been extracted; they are shown in Fig. 4.4 and the classification results in Fig. 4.5. Three out of four images were classified correctly. The CNN failed on the image of the plant, where most of the plant is actually cut off in the photo, so the prediction was effectively based only on the vase, which was mistaken for the trash bin.

4.3 Multiple objects with automatic framing

This test required the trained CNN to scan an image containing multiple objects in order to find and label them. The task was approached by sliding a square frame across the image and saving, for each class, the highest confidence level obtained among all frames. After this process, if a class shows a confidence level higher than an a-priori threshold, set to 99.4 %, the corresponding object is framed and labeled accordingly. Since tuning this heuristic efficiently by hand would have required a lot of time, this test did not reach great results. Moreover, objects standing at the sides of the image clearly introduce a further complication, since they cannot be centered in the frame, as the network would expect from the training dataset.
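A sketch of the sliding-frame heuristic described above; predict_probs stands for the trained CNN applied to a single cropped and pre-processed frame, and the frame size, stride and threshold shown here are illustrative rather than the project's exact values:

```python
import numpy as np

def detect(image, predict_probs, frame=300, stride=60, threshold=0.994):
    """Slide a square frame over the image and keep, for each class, the highest
    confidence and the frame position where it was found; report only classes
    whose best confidence exceeds the a-priori threshold."""
    h, w = image.shape[:2]
    best = {}   # class index -> (confidence, (x, y))
    for y in range(0, h - frame + 1, stride):
        for x in range(0, w - frame + 1, stride):
            probs = predict_probs(image[y:y + frame, x:x + frame])
            c = int(np.argmax(probs))
            if c not in best or probs[c] > best[c][0]:
                best[c] = (probs[c], (x, y))
    return {c: pos for c, (conf, pos) in best.items() if conf > threshold}
```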


Figure 4.6: Multiple object on-image labeling with automatic framing: detected objects are framed and labeled at the top left corner of the square

Nevertheless, on the bottom row image of Fig. 3.3 a satisfying result was obtained, shown in Fig. 4.6.


Chapter 5 Conclusion

As seen in Chapter 4, object classification through a CNN produced some interesting results. In fact, once the model is trained, making predictions on new images does not require much computational power and at the same time shows, under some constraints, good performance. One of the trickiest issues turned out to be that when dealing with multiple objects contained in the same image, which is the actual real-world scenario, automatic framing can visibly lower performance, for the following reasons:
• Objects cannot always be centered in a square frame, for instance when they stand at the sides of the picture.
• As it stands, the algorithm used for automatic framing is not scale invariant, meaning that the size of the object matters and may lead to failed classification or recognition.
Moreover, for some objects, such as the table, it is difficult to extract orientation-invariant features, and this complicates the task of recognizing the object from different points of view. Hence, the results make it evident that great emphasis should be placed on building the database and on pre-processing, in order to solve the current issues and enhance performance. Finally, since the vast majority of the time spent on the project was dedicated to getting the CNN to work, there was no opportunity to adopt and try different architectures.


List of Figures

2.1 Local receptive field: how neurons are connected in different layers . . . 6

3.1 Four classes. From top to bottom, from left to right: watering can, plant, bookcase, table . . . 9
3.2 Different images of the same object: trash bin . . . 10
3.3 Multiple object images . . . 10
3.4 Comparison of the information contained in the luminance plane (top right) and in the chrominance planes (bottom row) with respect to the original image (top left) . . . 11
3.5 Fire extinguisher after applying cropping, resizing, and greyscaling . . . 12

4.1 Validation accuracy (%) for every epoch during training . . . 13
4.2 Images fed as input for "manual" testing. From left to right, top to bottom: '405.jpg', '966.jpg', '1426.jpg', '1910.jpg', '2436.jpg', '2593.jpg' . . . 14
4.3 Results of classification for images represented in Fig. 4.2 . . . 14
4.4 Single objects cropped from the top row images of Fig. 3.3: '1.jpg', '2.jpg', '3.jpg', '4.jpg' . . . 15
4.5 Classification results of images in Fig. 4.4 . . . 15
4.6 Multiple object on-image labeling with automatic framing: detected objects are framed and labeled at the top left corner of the square . . . 16


Bibliography

[1] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[2] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler, June 2010. Oral Presentation.
[3] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), 1995.
[4] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.


License This work is licensed under the Creative Commons Attribution 3.0 Germany License. To view a copy of this license, visit http://creativecommons.org or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California 94105, USA.