Chair of Computer Science with a Focus on Digital Libraries and Web Information Systems
Deep Learning and Distributional Semantics for Automated Style Recognition
Master's thesis by Bernhard Bermeitinger
First examiner: Prof. Dr. Siegfried Handschuh
Second examiner: Prof. Dr. Malte Rehbein
22 December 2015
Contents
1 Introduction
  1.1 Motivation
    1.1.1 Digital Humanities
    1.1.2 Object detection in images
  1.2 Goal
  1.3 Structure

2 Background and Related Work
  2.1 Distributional Semantics
    2.1.1 Distributional Semantics for Words
    2.1.2 Distributional Semantics for Images
    2.1.3 SIFT for feature extraction
    2.1.4 Distributional Semantics for multiple representations
  2.2 Neural Networks
    2.2.1 Description
    2.2.2 Mathematical definition
    2.2.3 Convolutional Neural Networks
    2.2.4 Recent Development

3 Methodology
  3.1 Overview
    3.1.1 Training phase
    3.1.2 Prediction phase
  3.2 Multi-Label Classification
    3.2.1 Standard Multi-Class Logarithmic Loss Function
    3.2.2 Problems with logarithmic loss for Multi-Label Classification
    3.2.3 Multi-Label Loss Function
    3.2.4 Image augmentation

4 Experiments
  4.1 Overview of data sets
    4.1.1 MNIST
    4.1.2 Mirflickr-25k
    4.1.3 Neoclassica
  4.2 Implementation
    4.2.1 Data preparation
    4.2.2 Convolutional Neural Net
    4.2.3 Configuring layer arrangement
    4.2.4 Loss function
  4.3 Results of Multi-Class Classification
    4.3.1 Testing with MNIST
    4.3.2 Results for Neoclassica

5 Future Work and Conclusion
  5.1 Future work
  5.2 Conclusion

Appendix A Code
Appendix B Data sets
  B.1 Accessing data sets
  B.2 Pretraining with ImageNet

Bibliography
Eidesstattliche Erklärung
Abstract
This thesis uses methods from Deep Learning (DL), namely Convolutional Neural Networks (CNNs), and examines their performance in recognizing art styles. The focus lies on artistic styles from the epoch of Neoclassicism and on the styles of different pieces of furniture like chairs, tables and chests of drawers. Conceptually, the approach includes Distributional Semantics (DS) for linking stylistic image features derived from the CNN with natural language features coming from the images’ labels. The theoretical part for the CNN is accompanied by a proof-of-concept framework that provides methods for image feature creation. By applying a pretraining method with specific images from ImageNet1, an F1 score of 0.442 and an accuracy of 44% are achieved. On average, an accuracy improvement of 40% can be expected with pretraining.
1
http://image-net.org/
List of Figures
1.1 A console table from England from around 1830
2.1 Overview of a simple feedforward neural network
2.2 Overview of activation functions
2.3 Overview of the convolutional stage
2.4 Examples of maxpooling
3.1 General overview of the approach during the training phase
3.2 Overview over the prediction phase
3.3 Overview of training a Neural Network for a Multi-Label Classification task
4.1 Example images of the MNIST data set
4.2 Three example images mirflickr-25k data set
4.3 Box and violin plot of images per artifact for neoclassica
4.4 Box and violin plot of labels per image for neoclassica
4.5 Plot for loss over time and accuracy improvement of Net 19b and Net 26
4.6 Box and violin plot for images per class in Neoclassica
4.7 Bar plot of mean F1 scores over all classes for different configuration on ImageNet
4.8 Bar plot over all F1 scores per class for ImageNet data used for pretraining
4.9 Confusion matrix for Net 26 in configuration 3Y on ImageNet pretraining data
4.10 F1 scores over all Neoclassica classes with pretraining
4.11 Mean F1 scores for different experiments on Neoclassica with and without pretraining
List of Tables
4.1 Example tags for images 3, 94 and 233 of the mirflickr-25k data set
4.2 Layouts of two different Neural Networks
4.3 Table of results for Multi-Class Classification for MNIST
4.4 Overview over four different setups in the experiments
4.5 Training progress of Net 26 in different configurations for Neoclassica
4.6 Table of accuracy and F1 score of different configuration using ImageNet data
4.7 Comparison of F1 scores and accuracy between different configurations with and without pretraining
List of Code Snippets
4.1 Abstract DataProvider class for accessing a data set
4.2 Function to read an image as two-dimensional matrix
4.3 Function to read an image as three-dimensional matrix
4.4 Function to batch-read images and automatically resize them
4.5 Simple construct to create a neural network with nolearn
4.6 Exemplified implementation of a convolutional neural network
4.7 Implementation of a loss function for Multi-Label Classification in Theano
1 Introduction

“The idea behind digital computers may be explained by saying that these machines are intended to carry out any operations which could be done by a human computer.”
Alan Turing, 1950
1.1 Motivation

In recent years, the research process in the humanities has started to transform fundamentally. It has increasingly been mediated through digital technology. We witness the rise of “the use of modern information technologies or instruments stemming from computer sciences to reach findings that either would not have been possible without the application of these instruments or only on a much lower level of intersubjective verifiability” (Thaller [Tha2014] quoted after Rehbein [Reh2016, p. 16f], translated from German). This new kind of research is generating and accumulating all kinds of genuinely digital or digitized research data such as databases with transcribed text, images and other media like video and sound. This data distinguishes itself from conservative data by the amount of data available and its serial nature.
Whereas conservative data is time-consuming to comb through to find what the researcher is looking for, digital data is easily searchable and the storage possibilities are virtually unlimited. Traditional ways of knowledge acquisition such as transcribing and annotating texts, working with images and, more recently, audiovisuals can be supported by exploiting methods from Machine Learning (ML). ML is very good at finding patterns in huge amounts of data. The found patterns are then used to classify the findings in a supervised or unsupervised way. Supervised ML needs annotated input data from which it knows the ground truth of the data to analyze. In the unsupervised way, the output of the applied ML method is not determined beforehand and is left to automatic classification.
1.1.1 Digital Humanities

By introducing Machine Learning (ML) to research fields in the Digital Humanities (DH), the gates to many new fields of research can be opened and old ones productively transformed. With the power of Big Data in ML, all kinds of data from past ages like textual documents or images can be processed digitally. Kohle [Koh2013] points out that organizing a database of images and its search functionality can be enriched with “visual identifiers” ([Koh2013, p. 48]). With manual annotations and ordinary databases this does not require ML per se, but crawling all these annotations and finding patterns in them can be done faster with ML and, more importantly, in an unsupervised way to find clusters that were previously unknown. The sub-category classification in ML enables automatic detection of, e.g., the century in which a given text was written and possibly by whom. Automated authorship attribution has a wide range of applications. It helped clear up controversies concerning the authorship of works by William Shakespeare in Lowe and Matthews [Low1995].
Replicating cognitive tasks such as the analysis of visual art has long been a challenge for computers. By employing supervised and, more importantly, unsupervised ML methods ([Sha2012]), computers have recently been able to create classifications for schools and influences among painters that show remarkable resemblance to those of human experts. A particularly promising field for applying ML are repetitive features like aesthetic forms. That makes it especially suited for analyzing artistic styles, as in the case of Shamir and Tarakhovsky [Sha2012]. But material artifacts like architecture or furnishings also consist of such features, as pointed out by Prown [Pro1980]. While art historians have been pointing out for decades that style is a time- and observer-dependent cultural construct, recent accounts emphasize that it remains a productive heuristic instrument due to its capability for delineating and sorting ([Krü1997, p. 82]). For this thesis, style features from Neoclassicism (German: „Klassizismus“) have been chosen because a set of relevant and well-suited images is available at the adviser’s chair for the analysis in this work. Also, the artistic movement of Neoclassicism was selected because of its widespread application all over Europe ([Pal2011, p. 1f]).
1.1.2 Object detection in images

Image recognition and object detection is a wide field in computer science. Many applications and services have been created to address this problem. Most notably, Google Photos1 was released a short time ago. A registered user can upload all her photographs free of charge2 and Google classifies what is seen on the photos. The number of classes is not publicly known, but they contain entities like “waterfalls”, “palaces” and “cats” as well as activities like “sailing”, “driving” and “cooking”. Apart from this, object recognition in images and classification have other use cases as well.

1 https://photos.google.com
2 up to a certain resolution

Figure 1.1: A console table from England from around 1830
1.2 Goal

This thesis approaches automated style recognition in images of specific furniture from the era of Neoclassicism. The images are limited to photographs of specific furniture: chairs, chests of drawers, secretaries and the like. It makes use of Neural Networks (NNs) for ML to determine the applied stylistic forms of the depicted object in the image. This is done via Multi-Label Classification (MLC).
Common classification tasks like multi-class classification sort the given image into one of multiple different buckets (= classes). The term multi-class denotes the next step after single-class classification. A single-class classifier has a binary output: either the input matches the class or it does not. This approach is extended from the binary choice to multiple classes, and from these target classes exactly one is chosen as output. In contrast, Multi-Label Classification (MLC) does not put one image into one bucket but into many buckets. One bucket might be labeled with “console”, another with “table”, “1830” and another one with “England”. They are all equally important and significant for the style determination of Figure 1.1. For the NN approach, this thesis proceeds on the assumption that by using multiple tags—rather than one single class—the object in the image can be better attributed to a certain style.

Alongside the images there are textual annotations, mainly in the form of written natural language. Labeling an image automatically with annotations aims at finding stylistic forms inside the image that are linked to the labels. Naturally, the classifier has no semantic understanding of the words it learns and returns. The approach in this thesis for enriching the image features with semantic meaning is to use Distributional Semantics (DS) ([Har1981]). Instead of predicting meaningless words (meaningless in the sense of the NN), the NN works with vectors coming from DS that describe the word in a way such that a similarity can be computed, hence adding semantic relationships to them.
1.3 Structure

The thesis is structured as follows: Chapter 1 gave an overview of the perspective taken on the problem at hand. In Chapter 2, the theoretical background is explained for the two main concepts used throughout this thesis: Distributional Semantics and Neural Nets. The theoretical background is enriched with related work. Chapter 4 is structured like a report about the experiments done to support the methodology from Chapter 3. It contains an overview of three different data sets, examples of a proof-of-concept implementation in Python and ends with a summary about the quality of its results. The thesis ends with a conclusion in Chapter 5, where a recapitulation of the used methods is given, as well as a small outlook into future work in Section 5.1.
2 Background and Related Work
Different Machine Learning (ML) approaches are a common and interesting research topic. In the following, two possible ML approaches for the task at hand are described: Distributional Semantics (DS) and Neural Networks (NNs). The proof-of-concept implementation in Chapter 4 will only contain the NN approach.
2.1 Distributional Semantics

Distributional Semantics (DS) derives its meaning from the distributional hypothesis by Harris [Har1981]. The hypothesis comes mostly from a psychological and linguistic interpretation of text. It is based on the theory that a word’s meaning can be found by analyzing the surrounding words in a certain context window. The meaning of the word is described in a Distributional Semantic Model (DSM). DS is not limited to Natural Language Processing (NLP) applications, but that is where its origin lies. Sections 2.1.1 and 2.1.2 describe the application when working with words or images, respectively.
2.1.1 Distributional Semantics for Words

When the words are represented as vectors, the DSM is a Semantic Space Model. All words are covered by semantic vectors. Formally, following Lowe [Low2001], such a DSM is defined as a matrix whose rows consist of these semantic vectors. The entries in a vector are functions of the occurrence count of the word in a predefined context. There are different possibilities for how to define this context: it can be a frame of fixed length around the word, the whole document or more complicated definitions. The similarity between two words can then be evaluated by a vector distance which calculates a numerical similarity score from the two semantic vectors of the two words. The quality of the similarity function is usually evaluated by correlating its output for a set of words with human-created similarity scores. The correlation factor gives an estimate of how well the given similarity function for the DSM at hand can compute the semantic relation of a pair of words. A next step is to sort the words into classes. The classes are concepts in the sense of Murphy [Mur2004], where words are grouped by their semantic relatedness.

Semantic vectors and DS in general are useful for a variety of applications. Because of the possibility to compute semantic relatedness and to have a numeric score for the similarity of words, they can be used in document classification and dynamic question answering systems. For the latter, the similarity of words is important to know, so that the question can be asked in many different ways rather than with a few hard-coded keywords. The example question “How tall is the Empire State Building?” is equal to the question “What is the height of the Empire State Building?”. Putting Named Entity Recognition and the basic understanding of the question aside, the core words are “tall” and “height”.
In a traditional hard-coded if-else cascade, someone would have had to define somewhere that these two words mean the same thing. But by using a DSM with a well-trained similarity function, the equality of this pair is already given by the model.
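To make this concrete, the following minimal sketch builds a small co-occurrence based DSM from a toy corpus and compares word vectors with cosine similarity. The toy sentences and the choice of the whole sentence as context window are assumptions made only for this illustration; it is not the DSM construction used later in this thesis.

import numpy as np

corpus = [
    "the tower is tall",
    "the building is tall",
    "the tower has a great height",
    "the building has a great height",
    "she eats a tasty cake",
    "he eats a sweet cake",
]

# vocabulary and word-to-index mapping
vocab = sorted({w for sentence in corpus for w in sentence.split()})
index = {w: i for i, w in enumerate(vocab)}

# co-occurrence counts: the context of a word is every other word in the sentence
cooc = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
for sentence in corpus:
    words = sentence.split()
    for pos, word in enumerate(words):
        for ctx_pos, ctx in enumerate(words):
            if ctx_pos != pos:
                cooc[index[word], index[ctx]] += 1

def cosine(u, v):
    # numerical similarity score between two semantic vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# the semantic vector of a word is its row in the co-occurrence matrix
print(cosine(cooc[index["tall"]], cooc[index["height"]]))   # similar contexts, higher score
print(cosine(cooc[index["tall"]], cooc[index["cake"]]))     # disjoint contexts, score 0.0

Words that appear in similar contexts end up with similar rows of the matrix, so their cosine similarity is higher than for words that never share a context.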
2.1.2 Distributional Semantics for Images

When referring to DS, one usually means the application in NLP. But creating semantic vectors is not limited to words in natural language. When abstracting from natural language, words are just features of a document that are used to convey information. With this in mind, so-called visual words ([Siv2003; Bos2007]) can also be seen as basic features of an image that transport the semantic meaning of the whole structure. In analogy to the bag-of-words technique from information retrieval, this is called a bag-of-visual-words (BoVW) technique. By counting co-occurrences of visual words in an image (= visual document) and creating a matrix, the same mathematical methods can be applied to define a similarity function and find the semantic relatedness of the visual words.

The bag-of-words approach is based on a dictionary method. The dictionary contains all possible words, whereas the bag for a document contains a set (= “bag”) of words from the dictionary. Extending this idea to visual words, Bruni et al. [Bru2014] create a bag of visual words that is a set of regions and appearances inside the visual document.

Compared to prominent keywords describing a textual document, a visual document has prominent keypoints. Their generation is not trivial. They need to be consistent between images, so that a similarity function can calculate the relatedness between two images.
In their work, Bruni et al. [Bru2014] use the SIFT descriptors from Section 2.1.3, generated from HSV-encoded images. Instead of SIFT feature vectors with 128 components per grayscale image, SIFT descriptors are generated for each of the three dimensions of the image. This results in three feature vectors with 128 components per dimension; the new feature vector then has 3 × 128 components. A feature space is created by generating the SIFT descriptors for a large corpus of example images. The images should be representative of the goal, so that there is a possibility that descriptors from one of the stored images match some descriptors from a newly analyzed image in order to tell the similarity.
2.1.3 SIFT for feature extraction

Scale Invariant Feature Transform (SIFT) is a feature generation algorithm developed by Lowe [Low1999] mainly for object recognition in images. The following descriptions and formulas are mostly taken from Lowe [Low1999], Bruni et al. [Bru2013], and Bruni et al. [Bru2014]. The main idea is to generate a set of feature vectors for one image that are invariant to changes in scale, rotation and translation of the same image. Rotation invariance is achieved by selecting those locations in the image where extrema of the one-dimensional Gaussian function (Equation (2.1)) occur. The one-dimensional function can be applied because the two-dimensional function, which would be required in a two-dimensional scale space, is separable.
g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{\frac{-x^2}{2\sigma^2}} \qquad (2.1)

For all smoothing operations, Lowe uses the value σ = √2, which simplifies the equation to Equation (2.2):

g(x) = \frac{1}{\sqrt{2\pi}\cdot\sqrt{2}} \cdot e^{\frac{-x^2}{4}} \qquad (2.2)
Prior to keypoint extraction, the raw image is converted into grayscale, thus reducing the image to a two-dimensional matrix rather than the three-dimensional matrix needed when handling a colored image. Lowe suggests using color in a later development and concentrates on grayscale images. The grayscale image is scaled up by a factor of 2 using bilinear interpolation, then smoothed two times successively with a Gaussian function using again σ = √2. The next step is resampling the image with a pixel spacing of 1.5 to achieve a constant linear combination of four neighboring pixels. The modified images can be considered a pyramid scheme where each modification is the next upper layer. Each image patch of three-by-three pixels is examined for its maximum and minimum on the same pyramid layer. If the pixel in the middle is a maximum or minimum of this slice, the closest location is determined one layer beneath. If it is still a maximum or minimum, the test is repeated with the patch one layer above. For an input image of square shape with a side length of 512 pixels, usually about 1 000 keypoints are found.
For stability towards rotation and varying illumination such as contrast or lighting changes, each key location is given a canonical orientation. This is determined by converting the weights of a Gaussian-weighted histogram into an orientation histogram by multiplying them with the gradient values for the locations R_{x,y} in Equation (2.3), with x, y being the coordinates of the image I.
R_{x,y} = \operatorname{atan2}\left(I_{x,y} - I_{x+1,y},\; I_{x,y+1} - I_{x,y}\right) \qquad (2.3)
The keypoints can then be used as SIFT descriptors for the image.
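As a small illustration of Equation (2.3), the following sketch computes the orientation value R_{x,y} for every pixel that has a right and a lower neighbour, using NumPy. The random test image is only a stand-in, and this is not the full SIFT keypoint pipeline.

import numpy as np

def gradient_orientations(image):
    # R_{x,y} = atan2(I_{x,y} - I_{x+1,y}, I_{x,y+1} - I_{x,y});
    # the first array axis is taken as x, the second as y
    I = image.astype(np.float64)
    d1 = I[:-1, :-1] - I[1:, :-1]    # I_{x,y} - I_{x+1,y}
    d2 = I[:-1, 1:] - I[:-1, :-1]    # I_{x,y+1} - I_{x,y}
    return np.arctan2(d1, d2)

image = np.random.randint(0, 256, size=(8, 8))   # toy grayscale image
R = gradient_orientations(image)
print(R.shape)   # (7, 7), orientations in radians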
2.1.4 Distributional Semantics for multiple representations

Considering the mathematical similarity between semantic vectors representing words and those representing visual words, their combination can be exploited in a multimodal manner. The easiest combination at hand is the merging of the vectors for words and visual words. When the underlying corpus offers images as well as textual descriptions for these images, a DSM can be created that combines DS for text and images. The keywords and keypoints are merged and create a vector space where the similarity function no longer distinguishes between word vectors and visual-word vectors. In Bruni et al. [Bru2014], the visual words are generated by clustering all descriptors from all example images. The clustering is done with the k-means clustering algorithm. The optimal value for k varies from corpus to corpus and should be set appropriately. Then, the k clusters are translated to k visual words by relating the clusters—and therefore the descriptors—to the textual words given in the corpus.
Each descriptor corresponds to one textual word. The relation between textual words and visual words is improved by splitting the image into multiple parts prior to generating their descriptors. For this, the underlying corpus should give coordinates or bounding boxes indicating which textual words describe the objects in this spatial region. The descriptors are finer-grained and the visual words correspond more accurately to the actual representation of the textual descriptions. Bruni et al. [Bru2011] introduce the idea of concatenating the vectors into one multimodal vector. The vectors can be seen as features for the entity, so the concatenation is called Feature Level Fusion. The result is a correlation factor of 0.52 using WordSim353 by Finkelstein et al. [Fin2001] for textual vectors and the ESP-Game collection by Ahn [Ahn2006] for visual-word vectors.
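A minimal sketch of feature-level fusion under simplifying assumptions: the visual vocabulary is obtained by k-means clustering of SIFT-like descriptors (here random placeholders), an image is represented by its histogram over these visual words, and this bag-of-visual-words vector is concatenated with a textual semantic vector. The descriptor data, the choice of k and the textual vector are all stand-ins for this illustration, not the setup of Bruni et al.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# stand-in for SIFT descriptors collected from a large image corpus (n x 128)
corpus_descriptors = rng.rand(1000, 128)

# visual vocabulary: each cluster centre is one "visual word"
k = 50
kmeans = KMeans(n_clusters=k, random_state=0).fit(corpus_descriptors)

# descriptors of one new image, again just random placeholders
image_descriptors = rng.rand(120, 128)

# bag of visual words: histogram over the visual word each descriptor belongs to
assignments = kmeans.predict(image_descriptors)
visual_vector = np.bincount(assignments, minlength=k).astype(np.float64)
visual_vector /= visual_vector.sum()

# textual semantic vector for the same entity (placeholder values)
text_vector = rng.rand(300)
text_vector /= np.linalg.norm(text_vector)

# feature level fusion: concatenate both modalities into one multimodal vector
multimodal_vector = np.concatenate([text_vector, visual_vector])
print(multimodal_vector.shape)   # (350,)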
2.2 Neural Networks

The basic mathematical principles used in Neural Networks (NNs) first surfaced as early as the 1800s, when Legendre [Leg1805] and Gauss [Gau1809; Gau1821] mentioned methods of linear regression which were used in supervised learning for NNs in later years. The most important difference between supervised NNs and unsupervised NNs is that in the first case, the training data contains correct labels or classes (= gold data) for the input data. Supervised learning concentrates on predicting the given target values for given input data. Unsupervised learning assigns labels on its own without being given gold data. A hybrid approach is semi-supervised learning, which is used throughout this thesis. It is not unsupervised because the labels are a finite set of predefined values. On the other hand, it is not fully supervised as the features for each label are created automatically.
Physiological research on the visual cortex of cats by Wiesel and Hubel [Wie1959] and Hubel and Wiesel [Hub1962] showed that different brain cells of cats react differently to different visual input. This discovery laid the foundation for the basic concept of NNs: linking multiple cells which are organized in layers with intermediate output to eventually get an interpretation from the last layer. A NN is a combination of nodes hierarchically arranged in different layers. They are interconnected with edges and have a graph structure. The first layer is called the input layer. The last layer is called the output layer. When using a NN, these two layers are the only ones where outer interaction takes place: the input layer is given something, then the output layer is observed for what it returns. The layers between input and output are hidden layers. The number of hidden layers determines whether a NN is a Deep NN or a Shallow NN. All nodes from one layer are connected to all nodes from the two surrounding layers. The behavior of a NN is determined by weights that are real-valued numbers. In the case of supervised learning, the NN is used to classify the input to a predetermined output value. The main objective is then to adapt the weights to values such that a loss function is minimized. The loss function usually returns a number which gives the error rate of the current setup. By minimizing this loss function, the error is minimized and the predictions become more exact.
2.2.1 Description

A simple feedforward neural network is designed to find two functions:

1. one defining the features in a non-linear way,
2. one mapping the features to an output which can be defined as classes.

A simple feedforward Deep Neural Network (DNN) has at least three layers, as shown in Figure 2.1. Each circle in each layer represents one unit (or node, cell). All units from the input layer are connected to each unit in the hidden layer. All units from the hidden layer are connected to the output layer. For visual reasons, the individual arrows between them are omitted and replaced by one larger arrow. Note that this arrow represents all links from all lower units to the upper units rather than one combined value. The output is given for each unit in the output layer as o_1 to o_5. Depending on the linear function of the output units, the output is usually a vector of real-valued numbers. Deeper networks have more hidden layers where each lower layer is fully connected to the layer above. It is not usual to have multiple input or output layers. This architecture is also called a (single-hidden-layer) Multilayer Perceptron.

Figure 2.1: Overview of a simple feedforward neural network
2.2.2 Mathematical definition

A Multilayer Perceptron can be formalized in matrix notation as the function f(x) in Equation (2.4).
f: \mathbb{R}^D \to \mathbb{R}^L \quad \text{with} \quad D = \#x, \; L = \#f(x)

f(x) = \varphi\left(b^{(2)} + W^{(2)}\left(\sigma\left(b^{(1)} + W^{(1)} x\right)\right)\right) \qquad (2.4)

Here, b^(1) and b^(2) are bias vectors, W^(1) and W^(2) are weight matrices, and ϕ and σ are activation functions. The hidden layer is represented by the vector h(x) = Φ(x) = σ(b^(1) + W^(1) x). The weight matrix from the input layer to the hidden layer is defined as W^(1) ∈ R^(D×D_h). The activation function for the input layer is σ, which is usually one of the following functions:

\tanh(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}} \qquad (2.5a)

\operatorname{sigmoid}(a) = \frac{1}{1 + e^{-a}} \qquad (2.5b)

\operatorname{ReLU}(a) = \max(0, a) \qquad (2.5c)
They are applied element-wise to the vectors.

Figure 2.2: Overview of activation functions

The output of the output layer is o(x) = ϕ(b^(2) + W^(2) h(x)). For multi-class classification, softmax is used as the activation function ϕ:
\operatorname{softmax}_i(a) = \frac{e^{a_i}}{\sum_j e^{a_j}} \quad \text{with} \quad \sum_i \operatorname{softmax}_i(a) = 1, \; \operatorname{softmax}_i(a) \ge 0 \qquad (2.6)
This has the advantage that the softmax gives a probability distribution over classes. All parameters are learned using Stochastic Gradient Descent (SGD).
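The forward pass of Equation (2.4) can be written down in a few lines of NumPy; the sketch below uses tanh as σ and softmax as ϕ. The layer sizes and the random weights are arbitrary, and the SGD training step is omitted.

import numpy as np

def softmax(a):
    # subtract the maximum for numerical stability
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    # f(x) = softmax(b2 + W2 @ tanh(b1 + W1 @ x))
    h = np.tanh(b1 + W1 @ x)       # hidden layer h(x)
    return softmax(b2 + W2 @ h)    # output layer o(x)

rng = np.random.RandomState(0)
D, D_h, L = 4, 8, 3                # input, hidden and output dimensions
W1, b1 = rng.randn(D_h, D), np.zeros(D_h)
W2, b2 = rng.randn(L, D_h), np.zeros(L)

x = rng.rand(D)
y = forward(x, W1, b1, W2, b2)
print(y, y.sum())                  # probabilities over L classes, summing to 1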
2.2.3 Convolutional Neural Networks

Deep Neural Networks can be used for object recognition and general ML. There are many different implementations of and approaches to how Deep Neural Networks are constructed. This thesis focuses on Convolutional Neural Networks, a subset of acyclic NNs, meaning that the topology of the NN is not cyclic and it is thus not a Recurrent NN. The name convolution comes from the fact that at least one layer applies a convolution instead of a matrix multiplication.

Photographs are invariant to translation, meaning that the semantic content of the image does not change if the image is shifted a few pixels in any direction. Convolutional Neural Networks (CNNs) are invariant to this kind of translation. The units of the convolutional layer share some parameters across the input. This method also reduces the total number of parameters for the whole network, thus reducing the size and therefore the complexity of the model. For image classification tasks this weight sharing improves the network’s domain knowledge. With this translation invariance an object may be recognized in the upper right corner as well as in the lower right corner. A convolutional layer consists of three inner stages: the convolution stage, the activation stage and the pooling stage.

Figure 2.3: Overview of the convolutional stage
2.2.3.1 Convolution stage
In this stage, the input matrix (e.g. a two-dimensional image) is transformed by an affine kernel method into a smaller matrix.
Figure 2.3 shows how a kernel of size 3 × 3 convolves the input matrix of size 4 × 4 into a matrix of size 2 × 2. Note that only the computations of o_{1,1} and o_{2,2} are shown; the other patches are computed accordingly. The formal definition is given in Equation (2.7) with I as the input matrix, K ∈ R^(m×n) as the kernel and O as the output matrix.
O_{i,j} = (I \times K)_{i,j} = \sum_{m} \sum_{n} I_{i-m,j-n} \cdot K_{m,n} \qquad (2.7)
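A small sketch of the convolution stage from Figure 2.3: a 3 × 3 kernel slides over a 4 × 4 input and produces a 2 × 2 output. For simplicity, the patch-wise sum of element-wise products is computed directly (the cross-correlation form used by most CNN implementations, i.e. without flipping the kernel); the example values are arbitrary.

import numpy as np

def convolve2d_valid(I, K):
    # slide kernel K over input I without padding ("valid" mode)
    kh, kw = K.shape
    out_h = I.shape[0] - kh + 1
    out_w = I.shape[1] - kw + 1
    O = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise product of the current patch and the kernel, then sum
            O[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return O

I = np.arange(16, dtype=np.float64).reshape(4, 4)   # 4 x 4 input matrix
K = np.ones((3, 3)) / 9.0                           # 3 x 3 averaging kernel
print(convolve2d_valid(I, K))                       # 2 x 2 output matrix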
2.2.3.2 Activation stage
The activation stage comes right after the convolution stage. This is where the non-linear activation function of this layer is applied. Typical activation functions are defined in Equation (2.5). The currently most used function is ReLU (Equation (2.5c)), which stands for Rectified Linear Unit. Compared to the other functions, it is computationally faster.
2.2.3.3 Pooling stage
After convolution and activation comes the last stage in a convolutional layer: the pooling stage. The pooling function takes one value from a patch of neighboring outputs. There are several different pooling functions; the most used one is maxpooling. It can be overlapping or non-overlapping. Figure 2.4 depicts an example of a non-overlapping and an overlapping maxpool of size 2 × 2 on a 4 × 4 matrix. The output is a downsized matrix of size 2 × 2, or 3 × 3 with overlapping. The shift that determines where the next patch starts is called the stride. In the overlapping example (2.4b), the stride is 1; in the non-overlapping pooling example (2.4a) it is 2, which is the default in most applications.

Figure 2.4: Examples of maxpooling; (a) maxpooling without overlap (stride = 2), (b) maxpooling with overlap (stride = 1)
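A minimal sketch of 2 × 2 maxpooling on a 4 × 4 matrix with the stride as a parameter: stride 2 yields the non-overlapping 2 × 2 output, stride 1 the overlapping 3 × 3 output. The input values are arbitrary.

import numpy as np

def maxpool(X, size=2, stride=2):
    out_h = (X.shape[0] - size) // stride + 1
    out_w = (X.shape[1] - size) // stride + 1
    O = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # take the maximum of the current size x size patch
            patch = X[i * stride:i * stride + size, j * stride:j * stride + size]
            O[i, j] = patch.max()
    return O

X = np.arange(16).reshape(4, 4)
print(maxpool(X, stride=2))   # 2 x 2 output, non-overlapping
print(maxpool(X, stride=1))   # 3 x 3 output, overlapping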
2.2.4 Recent Development

There are many possibilities for how to arrange the different layers and assign different parameters to each layer. Finding the best combination is an exhaustive task. Additionally, the topology depends on the use case, e.g. voice recognition, object detection or sentiment analysis using natural language. For image classification and object recognition, the task at hand, Convolutional Neural Networks (CNNs) achieve very good results. In the annually organized ImageNet Large-Scale Visual Recognition Challenges (ILSVRCs) by Russakovsky et al. [Rus2015], which started in 2010, methods using SIFT descriptors such as Lin et al. [Lin2011] were the leading approaches in the first two years. The turning point was 2012, when CNNs reappeared for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC).
The first place was taken by Krizhevsky et al. [Kri2012], followed by teams using SIFT features and modified versions thereof. Because of the large gain of CNNs in 2012 and the growing performance of GPUs, most submissions in the ILSVRC-2013 and -2014 used DNNs. The image classification error rate went down from 28.2% in 2010 to 6.7% [Rus2015]. Simonyan and Zisserman [Sim2014] show that Very Deep Convolutional Neural Networks are well suited for large-scale image recognition. By their definition, very deep NNs use, in their maximum configuration, 16 convolutional layers and three fully connected layers, adding up to 19 weight layers. Counting five pooling layers, one input layer and two dropout layers, this adds up to a total of 27 layers. With different combinations of these nets they achieved a validation error of 24.7% [Sim2014].
3 Methodology
This chapter introduces a method to adapt Distributional Semantics (DS) to make use of the current superiority of Convolutional Neural Networks (CNNs) in classification tasks. The method is not fully implemented in the proof-of-concept experiment; mainly the feature creation with a CNN is implemented, whereas the parts involving DS stay theoretical and are missing from the proof-of-concept framework.
3.1 Overview

In this thesis, the keypoint extraction from images using SIFT from Sections 2.1.3 and 2.1.4 is replaced with a Neural Network (NN). The preparation is done in the training phase where the CNN is trained using specific images and their attached labels. Also in this phase, a method of DS is applied to the labels to create a Distributional Semantic Model (DSM). This model gives a semantic vector space, so that the labels predicted for one image in the prediction phase are comparable in their semantics to all labels.
Figure 3.1: General overview of the approach during the training phase
3.1.1 Training phase

The structural setup is shown in Figure 3.1. Training data consists of images and multiple tags per image. A tag is a word feature for the image. Usually a tag is one word, but it could also be a combination of several words. All images must be labeled with tags. Note that if all images had exactly one tag, this would imply a Multi-Class Classification (MCC) task, whereas here it is intended to use multiple labels per image. The task at hand is then a Multi-Label Classification (MLC), more thoroughly explained in Section 3.2. The training images and corresponding tags are given as training data to a NN. It aims to find a mathematical function and optimizes it in a way that links an image to the corresponding tags. After training, an operational NN is created that is used later when predicting tags for unknown images. The output of the NN is a set of tags describing the input.
Figure 3.2: Overview over the prediction phase Instead of using the tags as plain natural language, they can be first converted into semantic vectors with the help of DS by creating a DSM before training. The advantage is that the output of the NN are not words anymore but semantic vectors.
3.1.2 Prediction phase

The goal is to find tags for unseen images that do not have tags attached. Figure 3.2 shows the process of predicting tags for images. The DSM and the trained NN from the training phase are used here. First, the previously unseen image is converted into a format that the NN can work with: a matrix. Considering only grayscale images, an image is an n × m matrix with n rows and m columns. Each entry (or cell) is an integer between 0 and 255 describing the color intensity. When working with RGB-encoded colored images, the image is given as three matrices.
For internal computational reasons, each cell is divided by 255, thus normalizing the matrix to floating point values between 0.0 and 1.0. After converting, the image is fed to the trained network to predict tags. The output is a probability distribution over all tags. Binarizing this distribution and converting it back into the words for the tags gives a standard result for a MLC task. But the tags are then further processed and matched to semantic vectors using the DSM from the training phase.
Outlook: One possible advantage is that the multimodality of this approach makes it possible to give a semantic relatedness and similarity score for images that are not labeled. Question answering systems could understand questions that are asked using not just words but also images. Or, the other way around, the answers could be given as an image, if appropriate.
3.2 Multi-Label Classification

The raw output of the NN in this approach is a vector with its components ranging from 0.0 to 1.0. Each entry represents a probability, hence the name probability distribution. Usually, a hypothesis is accepted if the probability for its correctness exceeds 0.5. Thus, all values above 0.5 are considered to be true. More than one value can meet this criterion, therefore the classification task is a Multi-Label Classification (MLC) task for images. Compared to Multi-Class Classification (MCC) for images, the outcome is not a single class that describes the given input image but a list of labels or descriptions that match the given input image.
Figure 3.3: Overview of training a Neural Network for a Multi-Label Classification task In Figure 3.3, during the training phase the NN receives the images from the training data and, as previously mentioned, the labels for each image. Internally, the labels are binarized using a Multi-Label Binarizer. It creates vectors out of each labels for each image. The length of the vector is the amount of distinct labels in the whole set of labels. The components of the vector are either 0 or 1, this is where the name “Binarizer” comes from. The binarizer stores which component of the vector implies the binary indicator for which label. For example, when looking at the whole set of labels containing {“chair”, “blue”, “lion”, “sun”}, one possible binarized vector is v = (1, 1, 0, 0). Translating back to labels, it gives the set {“chair”, “blue”} of target values. When applied as an evaluation function for the vector of output of the NN for one image, an example output of o = (0.2, 0.8, 0.4, 0.6) would result in the labels {“blue”, “sun”} for the image.
3.2.1 Standard Multi-Class Logarithmic Loss Function

In a Multi-Class Classification (MCC) setting, usually the logarithmic loss function from Equation (3.1) is used. The following definitions are used in Equation (3.1):

• Let I be a list of images.
• Let L be a list of labels.
• Let N = |I| be the number of images.
• Let M = |L| be the number of different labels.
• Let Y be an N × M matrix with y_{i,j} ∈ Y for 1 ≤ i ≤ N, 1 ≤ j ≤ M being the target value out of the set {0, 1}. A 0 means that the image I_i does not match label l_j, whereas 1 means that the image matches label l_j.
• Let P be an N × M matrix with p_{i,j} ∈ P for 1 ≤ i ≤ N, 1 ≤ j ≤ M being the predicted probability that the input image I_i matches label l_j.
\operatorname{logloss}(y, p) = -\frac{1}{N} \cdot \sum_{i=1}^{N} \sum_{j=1}^{M} \left(y_{i,j} \cdot \ln p_{i,j}\right) \qquad (3.1)

3.2.2 Problems with logarithmic loss for Multi-Label Classification

The formula in Equation (3.1) has problems in the corner cases where the predictions are equal to 0.0 or 1.0. Per definition of the formula, both cases are not defined. For
these two cases, the prediction is modified with a small constant factor ε = 1/n with n ∈ {10^2, 10^3, 10^4, …} as follows:

p_{i,j} = \begin{cases} 1 - \varepsilon & \text{if } p_{i,j} = 1 \\ \varepsilon & \text{if } p_{i,j} = 0 \\ p_{i,j} & \text{otherwise} \end{cases}
The modification defines the function in Equation (3.1) for the full interval p_{i,j} ∈ [0, 1]. Still, the logarithmic loss function gives logically invalid results when all predictions are close to 1:
\lim_{\substack{p_{i,j} \to 1 \\ 1 \le i \le N,\, 1 \le j \le M}} \operatorname{logloss}(y, p)
= -\frac{1}{N} \cdot \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \cdot \underbrace{\ln p_{i,j}}_{\to 0}
= -\frac{1}{N} \cdot \sum_{i=1}^{N} \sum_{j=1}^{M} (y_{i,j} \cdot 0)
= -\frac{1}{N} \cdot \sum_{i=1}^{N} \sum_{j=1}^{M} 0
= -\frac{1}{N} \cdot \sum_{i=1}^{N} 0
= -\frac{1}{N} \cdot 0
= 0
As the result suggests, there appears to be no error in the predictions (logloss = 0), no matter what the ground truth was, as long as the predictions are all 1 (asymptotically speaking).
3.2.3 Multi-Label Loss Function

The problem above can be approached with a different definition for the logarithmic loss function in Equation (3.2).
\operatorname{logloss}(y, p) = -\frac{1}{N} \cdot \sum_{i=1}^{N} \sum_{j=1}^{M} \ln\left(1 - |y_{i,j} - p_{i,j}|\right) \qquad (3.2)
Once again, when looking at the corner cases, the following four different possibilities can occur:

1. y_1 = (0, …, 0) and p_1 = (0, …, 0)
2. y_2 = (0, …, 0) and p_2 = (1, …, 1)
3. y_3 = (1, …, 1) and p_3 = (0, …, 0)
4. y_4 = (1, …, 1) and p_4 = (1, …, 1)

The first two cases 1. and 2. are purely theoretical and cannot occur. By design, all images have at least one label attached, so a target vector consisting only of zeros is not possible. This leaves the other cases with y_3 and y_4. Equation (3.2) gives logloss(y_3, p_3) = ∞ and logloss(y_4, p_4) = 0. Both results are correct: if all predictions are equal to the targets, all predictions are correct and therefore there is no error. If no prediction is correct and each prediction lies at the opposite end of the possible range from its target, the error correctly explodes to ∞. In order to overcome the mathematical issue that if one summand in the sum is −∞ the whole result would be ∞, all prediction values are clipped with the ε from Section 3.2.2.
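A small NumPy sketch of Equation (3.2) with the ε-clipping from Section 3.2.2; this is only an illustration, the Theano implementation actually used in the experiments appears later as Code Snippet 4.7.

import numpy as np

def multi_label_logloss(y, p, eps=1e-4):
    # Equation (3.2): -1/N * sum_ij ln(1 - |y_ij - p_ij|), with the predictions
    # clipped to [eps, 1 - eps] so that no summand becomes -infinity
    y = np.asarray(y, dtype=np.float64)
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    return -np.sum(np.log(1.0 - np.abs(y - p))) / y.shape[0]

targets = np.array([[1, 0, 1, 0],
                    [0, 1, 1, 0]])
good = np.array([[0.9, 0.1, 0.8, 0.2],
                 [0.2, 0.7, 0.9, 0.1]])
print(multi_label_logloss(targets, good))        # small loss
print(multi_label_logloss(targets, 1.0 - good))  # much larger loss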
3.2.4 Image augmentation

Large corpora are rare; for special classification tasks like style recognition there are none available. For some special tasks there are smaller corpora, but the experiments with them show that NNs tend to overfit quickly when they are always given the same input images. They learn the training images by heart and are then unable to identify unseen images. To overcome the sparsity of tagged images, the available images can be augmented to virtually increase the number of images for training. The proposed NN can make use of two different methods to augment training data:

1. generate random patches from the image
2. randomly mirror and rotate the images
3.2.4.1 Random patches
The approach of using random patches for training is based on the hypothesis that the important and distinctive features of the image are not near the edges of the image. For almost all images this should be true. During training time, the image is loaded as a two-dimensional matrix. If colored images are used, the matrix would be three-dimensional. For the sake of simplicity, only the two-dimensional case is described, but the three-dimensional case works accordingly. The image’s matrix is of size N × M.
The function that creates a patch needs the size of the patch it should return. For this, two custom patch factors c_x, c_y ∈ (0, 1] are introduced1, with arbitrary suggested values of c_x = 0.9 and c_y = 0.9. The two factors, interpreted as percentage values, indicate how much of the original image should be kept, leading to a patch size of p_x = ⌊N · c_x⌋ and p_y = ⌊M · c_y⌋. Cropping alone does not augment the image in a way that makes it different on each iteration; it would return a constant image every time. By adding randomness in the choice of which patch of the image to use, the number of possible results increases greatly. From the remaining lengths of the image that are not inside the patch, r_x = N − p_x and r_y = M − p_y, two random integer numbers s_x ∈ {i ∈ N_0 : 0 ≤ i ≤ r_x} and s_y ∈ {i ∈ N_0 : 0 ≤ i ≤ r_y} are chosen. The two random numbers give the offset from the left and from the top where the patch starts. The end coordinates are computed from the offsets and the patch sizes as e_x = s_x + p_x and e_y = s_y + p_y. The resulting image is of size p_x × p_y. The input layer of the NN uses a fixed input shape of N × M matrices. In order to fulfill this requirement, the patch is then rescaled to this shape using bilinear interpolation. When working with colored images, thus having three N × M matrices, the previous steps are applied to all three matrices. Naturally, the random numbers s_x and s_y are only generated once and used for all three matrices.
1 Setting a patch factor to 0 would result in cropping the affected dimension to 0 pixels. Whereas a patch factor of 1 is theoretically possible, it would not crop the affected dimension and would thus negate the whole purpose of this operation.
3.2.4.2 Mirroring and rotating
After cropping the image, further augmentation methods can be applied: mirroring and rotating. With a probability of 0.5 the image is mirrored on the vertical axis and, again with a probability of 0.5, mirrored on the horizontal axis. This results in A_m = 4 different images: not mirrored, horizontally flipped, vertically flipped, or horizontally and vertically flipped. After the probabilistic mirroring, the image is rotated by either 0°, 90°, 180° or 270°, each chosen with a probability of 0.25. Rotating by 90° and 270° is only possible for a quadratic image with the same width and height.
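The following sketch combines the two augmentation methods for a grayscale image: a random patch of relative size c_x × c_y is cut out and rescaled back to the original shape, then the image is randomly mirrored and, if quadratic, rotated by a random multiple of 90°. The rescaling uses skimage.transform.resize as a stand-in for the bilinear interpolation mentioned above; the thesis framework may implement these steps differently.

import numpy as np
import skimage.transform

def augment(image, cx=0.9, cy=0.9, rng=np.random):
    N, M = image.shape
    # random patch of size (px, py) with offset (sx, sy) from the top-left corner
    px, py = int(N * cx), int(M * cy)
    sx = rng.randint(0, N - px + 1)
    sy = rng.randint(0, M - py + 1)
    patch = image[sx:sx + px, sy:sy + py]
    # rescale back to the fixed input shape expected by the network
    patch = skimage.transform.resize(patch, (N, M))
    # mirror on the vertical / horizontal axis, each with probability 0.5
    if rng.rand() < 0.5:
        patch = patch[:, ::-1]
    if rng.rand() < 0.5:
        patch = patch[::-1, :]
    # rotate by a random multiple of 90 degrees (only for quadratic images)
    if N == M:
        patch = np.rot90(patch, k=rng.randint(0, 4))
    return patch

image = np.random.rand(200, 200)
print(augment(image).shape)   # (200, 200)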
3.2.4.3 Gain by cropping
The number of possible images is given by every possible combination of s_x and s_y, which are random integers between 0 and the corresponding r_x or r_y. Thus, there are A_c = (r_x + 1) · (r_y + 1) possible outcomes. Replacing them with the original values gives Equation (3.3):
A_c = ((N − ⌊N · c_x⌋) + 1) · ((M − ⌊M · c_y⌋) + 1)   (3.3)

When using an example image size of N × M = 200 × 200 and a crop factor of c_x = c_y = 0.9, the number of possible outcomes is:

A_{200×200, 0.9} = ((200 − ⌊200 · 0.9⌋) + 1) · ((200 − ⌊200 · 0.9⌋) + 1) = 441   (3.4)
The exemplary result in Equation (3.4) means that one input image possibly leads to 441 images when applying random cropping.
3.2.4.4 Gain by mirroring and rotating
Mirroring the images gives A_m = 4 different images. The number of possible rotated images depends on the shape of the input image. If the image is quadratic, the image count is A_rq = 4; if not, only A_r = 2 rotations are possible. The maximum number of different images obtained by mirroring and rotating is A_mrq = A_m · A_rq = 16 when a quadratic image is given, or A_mr = A_m · A_r = 8 when not. Note that a rotation by 0° does not increase the number of possible outcomes.
3.2.4.5 Total gain
By combining cropping, mirroring and rotating, the number of different images generated from one image is:
A_q = ((N − ⌊N · c_x⌋) + 1) · ((M − ⌊M · c_y⌋) + 1) · A_mrq = ((N − ⌊N · c_x⌋) + 1) · ((M − ⌊M · c_y⌋) + 1) · 16   (3.5)

A = ((N − ⌊N · c_x⌋) + 1) · ((M − ⌊M · c_y⌋) + 1) · A_mr = ((N − ⌊N · c_x⌋) + 1) · ((M − ⌊M · c_y⌋) + 1) · 8   (3.6)
Expanding the example from Equation (3.4) to mirroring and rotating, the number of training images goes up from 1 to A_{200×200, 0.9} = 441 · 16 = 7 056.
Possible drawback: On paper this huge number may sound good. But the implementation requires many checks, random number generations and, most of all, matrix operations. Usually, CPUs and GPUs can do these operations fast. But when the images are changed dynamically for each run again and again, the time spent in these operations adds up to an amount that should not be underestimated. As storage space is cheaper and easier to upgrade than a processing unit, the augmentation could be done in a preprocessing step, with all the changed images then stored on disk.
4 Experiments
This chapter explains several experiments which make use of the previously described methodology to create a proof-of-concept framework for the Neural Network (NN) part. The first section introduces three data sets which are used in the experiments: MNIST [LeC1998], the most used data set in image classification; mirflickr-25k [Hui2008], compiled for experiments in Multi-Label Classification (MLC); and a small custom data set, Neoclassica, created from photographs of furniture in the design of the age of Neoclassicism. The experiments in Section 4.3 are simplified versions. Due to the lack of a proper loss function and time constraints, MLC experiments are not executed. Instead, the data sets were slightly changed to match a Multi-Class Classification (MCC) setting.
4.1 Overview of data sets

Training NNs is a very demanding task. Experimenting with different combinations of layers and hyperparameters is time-consuming. Generally, a NN needs many examples to make predictions with high confidence. This thesis uses the mirflickr-25k data set [Hui2008]. It contains a considerably large number of images and corresponding tag files that contain multiple tags for each image. This results in a MLC task.
Consequently, the output is not one single class for each image but a varying number of tags per image. In regard to future style recognition this would give more insight than a multi-class classification task because styles are often mixed.

Figure 4.1: Example images of the MNIST data set
4.1.1 MNIST

The most prominent data set for image recognition tasks is the MNIST database of handwritten digits by LeCun et al. [LeC1998]. It consists of 70 000 images. Their resolution is 28 × 28 pixels and they are grayscale, meaning that each image consists of a two-dimensional matrix with integer values between 0 and 255. All of them show one single handwritten digit from 0 to 9. Figure 4.1 shows six example images of the data set. The data set comes with one label for each image, setting the gold value to the correct number written in the image. The task for this data set is a multi-class classification because the target classes are the numbers 0 to 9 and no mix between them. Therefore, it is not a multi-label classification and not exactly applicable to the approach discussed here. Nonetheless, a multi-class classification task is a special case of a multi-label classification task: the target vector has a fixed size of ten entries, corresponding to the classes, and exactly one entry has the value 1 whereas the other entries are 0.
4 Experiments
Figure 4.2: Three example images mirflickr-25k data set Image. Tags (separated by commas) cake
chocolate, cake, chocolateganachebuttercream, shamsd
lion
boston, zoo, franklinparkzoo, lion, animal, wildlife, k10d, outdoors, bostonzoofranklin, simba, aslan, mufasa, coreyleopold, challengeyouwinner
landscape
on, the, edge, field, lonely, tree, lone, karawanken, austria, österreich, carinthia, kärnten, rosental, drau, karawanks, mountain, hill, rock, slovenia, border, lines, sky, cloud, blue, green, forest, nohdr, nocpl, panasonic, fz50, myd300batterieswereempty
Table 4.1: Example tags for images 3, 94 and 233 of the mirflickr-25k data set
4.1.2 Mirflickr-25k

The mirflickr-25k data set consists of 25 000 images. They are not concentrated on one subject but cover very different topics. The tags for each image are stored in a single file per image. Figure 4.2 shows three completely different classes of images: a cake, a lion and a landscape. The corresponding tags manually assigned by the photographer1 are shown in Table 4.1. As previously mentioned, the data set contains 25 000 images and 25 000 tag files. There are 223 537 tags shared among all images, which gives an average of 8.9 tags per image.

1 or uploader at Flickr
The number of unique tags is 69 099, which corresponds to an average of 2.8 unique tags per image. 49 006 tags occur only once. Additionally, there are 2 128 images with no tags attached. The most frequent tag, with a count of 1 483, is explore, followed by sky (845 times) and nikon (805 times). Training on images without tags makes no sense. Moreover, tags that occur scarcely or tags that appear too often may have a negative impact. Tags that appear fewer than 25 times or more often than 800 times are removed. This method removes 68 089 tags and leaves 1 010 tags. Removing these tags from the images again leaves 2 985 images with no valid tags; they are ignored as well. Filtering the data set removed 5 113 images. The final number of images is 19 887, which results in an average number of tags per image of 19.7. Additionally, to further reduce the number of tags, only those tags are used whose number of occurrences lies between 100 and 500. After this reduction 188 tags remain. Again, only images that have at least one tag of these 188 are employed, which consequently reduces the image count to 15 339. This is an additional reduction of 4 548 images.
Note on meta tags: Some tags do not describe the scenery of the image or objects in the image but meta data. The tags canon and nikon are examples of that. Most of the time, these tags are given because the image was taken with a Canon (respectively Nikon) camera, not because the image is a photograph of such a camera. These meta tags are currently not filtered out but should be taken into consideration.
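A sketch of the frequency-based tag filtering described above, under the assumption that the tags have already been read into one list of tags per image; the thresholds are the ones from the text, and the toy input only illustrates the mechanics.

from collections import Counter

def filter_by_frequency(tags_per_image, min_count=25, max_count=800):
    # keep only tags whose total number of occurrences lies inside the bounds,
    # and drop images that are left without any valid tag
    counts = Counter(tag for tags in tags_per_image for tag in tags)
    valid = {tag for tag, n in counts.items() if min_count <= n <= max_count}
    filtered = [[t for t in tags if t in valid] for tags in tags_per_image]
    return [tags for tags in filtered if tags], valid

# toy example; the real input would be the 25 000 tag lists of mirflickr-25k
tags_per_image = [["sky", "cloud"], ["nikon", "sky"], ["explore"]]
images, vocabulary = filter_by_frequency(tags_per_image, min_count=2, max_count=2)
print(vocabulary)   # {'sky'}
print(images)       # [['sky'], ['sky']]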
4.1.3 Neoclassica

Neoclassica is a special data set2 adapted for this thesis. The number of images and corresponding labels should be considered insufficient for real-world applications. This data set should be treated as a proof of concept, showing that the classifiers can be used for recognizing styles from the era of Neoclassicism. An example image can be seen in Figure 1.1. All images have a commercial background and cannot be distributed. The data set consists of 300 folders. Each folder stands for one artifact and contains images of the same physical entity. An artifact is a combination of different labels, e.g. the artifact “Kommode Empire Directoire Frankreich Paris Mahagoni” represents one physical entity that matches all the given labels. Furthermore, some folders contain additional information in the form of a text file with additional descriptions about the artifact. These files are ignored because their content varies between different languages, formatting and overall meaning for the artifact. The images and their artifacts are preprocessed in the following way: all images are renamed to a unique identifier and moved into a single folder; the labels for creating the artifacts are split on whitespace, and one text file for each image is created that contains one label per line. Further analysis showed that the encoding and file types of the images vary: some are JPEGs, others PNGs, and the color encoding differs. In order to make the image handling inside the NN easier, all images are converted into the same JPEG file format with RGB-encoded color values. The conversion from PNG to JPEG is not lossless, but all images are digital photographs, so the loss of precision is negligible for further analysis. The plots in Figure 4.3 give an overview of the distribution of images per label.
The data set is created by Simon Donig at the University of Passau.
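The preprocessing itself is not listed in the thesis. A minimal sketch, under the assumptions that the folder names carry the whitespace-separated labels and that scikit-image is used for the conversion (the function name preprocess_artifacts, its arguments and the color-handling details are assumptions), could look like this:

import os
import uuid
import skimage.io

def preprocess_artifacts(source_dir, image_dir, tag_dir):
    # every folder is one artifact whose name is the whitespace-separated label list
    for artifact in sorted(os.listdir(source_dir)):
        artifact_dir = os.path.join(source_dir, artifact)
        if not os.path.isdir(artifact_dir):
            continue
        labels = artifact.split()
        for file_name in os.listdir(artifact_dir):
            if not file_name.lower().endswith(('.jpg', '.jpeg', '.png')):
                continue  # additional description text files are ignored
            image = skimage.io.imread(os.path.join(artifact_dir, file_name))
            if image.ndim == 3 and image.shape[2] == 4:
                image = image[:, :, :3]  # drop a possible alpha channel
            # rename to a unique identifier and store as JPEG
            identifier = uuid.uuid4().hex
            skimage.io.imsave(os.path.join(image_dir, identifier + '.jpg'), image)
            # one text file per image with one label per line
            with open(os.path.join(tag_dir, identifier + '.txt'), 'w') as tag_file:
                tag_file.write('\n'.join(labels))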
Figure 4.3: Box and violin plot of images per artifact for Neoclassica

Following the box plot in Figure 4.3, the values range from 0 to 48 and most of the artifacts have between 4 and 9 images. The violin plot in Figure 4.3 shows that some artifacts are also represented by up to 15 images. Training any classifier with just two representations does not give good results, or any result for that matter: a train/test split may separate them, so the classifier has only one image to find the distinctive features, which are probably not as representative in the test image. When there is no image in the test split, the validation for the label of this image cannot take place. Conversely, when there is no training data, the classifier has absolutely no chance of a successful classification. Nevertheless, a manual train/test split is not done here, but it is highly recommended for future work.
Figure 4.4: Box and violin plot of labels per image for neoclassica
4.2 Implementation

The implementation of the proof-of-concept framework is done in Python 3.5. Throughout all code examples and snippets, the following two modules are to be considered as always imported:

• “import numpy as np” by Walt et al. [Wal2011]
• “import theano” and “import theano.tensor as T” by Bergstra et al. [Ber2010] and Bastien et al. [Bas2012]
4.2.1 Data preparation

The neural network needs its two kinds of input data in a very specific format: the images must be given as matrices and the target values as an array of vectors. Each data set defines its own structural layout, so the framework provides the abstract class DataProvider from code snippet 4.1, which must be inherited by a custom provider that overrides the abstract methods get_image_paths(), get_tags_for_images() and get_all_tags().
import abc
import os


class DataProvider(object, metaclass=abc.ABCMeta):
    def __init__(self, data_dir):
        # the provider expects the data in the given directory
        if not os.path.isdir(data_dir):
            raise ValueError("The data dir '{}' does not exist".format(data_dir))

    @abc.abstractmethod
    def get_image_paths(self):
        """Returns all paths for all images of the data set."""
        pass

    @abc.abstractmethod
    def get_tags_for_images(self):
        """Returns a list of lists where the n-th sublist contains all tags
        for the n-th image."""
        pass

    @abc.abstractmethod
    def get_all_tags(self):
        """Returns a set of unique tags in the data set."""
        pass
Code Snippet 4.1: Abstract DataProvider class that every data set specific provider must implement
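As an illustration of the intended inheritance, a hypothetical provider for a folder that contains JPEG images and one tag file per image could be sketched as follows. The class name FolderDataProvider and the assumed file layout are not part of the original framework; the sketch builds on code snippet 4.1 and the always-imported modules:

class FolderDataProvider(DataProvider):
    def __init__(self, data_dir):
        super().__init__(data_dir)
        self.data_dir = data_dir

    def get_image_paths(self):
        return sorted(os.path.join(self.data_dir, name)
                      for name in os.listdir(self.data_dir)
                      if name.endswith('.jpg'))

    def get_tags_for_images(self):
        tags_per_image = []
        for path in self.get_image_paths():
            # the tag file shares the base name of the image
            with open(os.path.splitext(path)[0] + '.txt') as tag_file:
                tags_per_image.append([line.strip() for line in tag_file if line.strip()])
        return tags_per_image

    def get_all_tags(self):
        return {tag for tags in self.get_tags_for_images() for tag in tags}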
4.2.1.1 Reading images
For the framework to be independent of the input image format and to make prototyping more convenient, there are two different functions to read an image into a two-dimensional (code snippet 4.2) or a three-dimensional (code snippet 4.3) matrix. Another third-party module is used here: scikit-image by Walt et al. [Wal2014]. It is imported with “import skimage.io”.
def read_gray_image(path_to_image):
    if not os.path.isfile(path_to_image):
        raise ValueError('not a file: {}'.format(path_to_image))
    # read image from file, automatically convert to grayscale
    image = skimage.io.imread(path_to_image, as_grey=True)
    # convert the numpy array to float32 values
    image = image.astype(np.float32)
    # normalize values to the interval [0.0, 1.0]
    image /= 255.0
    # for internal reasons, we need a 1xHxW image
    image = image.reshape((1, image.shape[0], image.shape[1]))
    return image
Code Snippet 4.2: Function to read an image as a two-dimensional matrix
def read_color_image(path_to_image):
    if not os.path.isfile(path_to_image):
        raise ValueError('not a file: {}'.format(path_to_image))
    image = skimage.io.imread(path_to_image)
    image = image.astype(np.float32)
    image /= 255.0
    # for internal reasons, we need a 3xHxW image instead of HxWx3,
    # so the color axis is moved to the front
    image = np.transpose(image, (2, 0, 1))
    return image
Code Snippet 4.3: Function to read an image as a three-dimensional matrix

The neural network needs all images to be of the same size. The dimensions are user-defined
and the images are automatically resized during reading in the corresponding batch-reading function in code snippet 4.4. Reading grayscale images works accordingly.
def read_color_images(paths_to_images, rescale=None):
    # read all images
    images = [read_color_image(img) for img in paths_to_images]
    # rescale all images to the given scale
    if rescale is not None:
        images = [resize(img, rescale) for img in images]
    return images
Code Snippet 4.4: Function to batch-read images and automatically resize them

The resize function in line 6 of code snippet 4.4 is a small wrapper around the function skimage.transform.resize() from the scikit-image module.
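The wrapper itself is not printed in the thesis. A minimal sketch, assuming rescale is a (height, width) tuple and the images are already in channel-first layout (the exact behavior of the original wrapper may differ), could be:

import skimage.transform

def resize(image, rescale):
    # keep the number of color channels and only change height and width
    output_shape = (image.shape[0], rescale[0], rescale[1])
    resized = skimage.transform.resize(image, output_shape)
    # skimage returns float64, convert back to float32 for the network
    return resized.astype(np.float32)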
4.2.1.2 Reading target values
The format of the target values differs from data set to data set. The MNIST data set is originally designed for multi-class classification with one class per image, so the target file is a single text file: the n-th line contains exactly one number from 0 to 9 corresponding to the n-th image. Reading from this text file is straightforward and done with inbuilt modules of Python.

Reading target values for mirflickr-25k and Neoclassica turns out to be more complicated. They are designed to have multiple labels per image: each image has a corresponding text file in which each line contains exactly one label. The analysis of the data set mirflickr-25k in Section 4.1.2 shows that the data set contains images that have no tags at all and tags that are only given to one image. These tags and images are filtered before the data is given to the neural network. On initialization, the two parameters min_amount and max_amount can be modified to only consider tags
that appear at least min_amount times and are not shared with more than max_amount images. The rationale behind this is that tags which appear too rarely might not be useful for other images, while the expressiveness of too frequently used tags for a single image is reduced. The limits are arbitrarily set to the default values 100 and 500.

It must be noted that the filtering of tags takes a considerable amount of time. The lists must be traversed multiple times and a lot of time is spent in equality checks of strings. Each tag of each image must be counted; when the count undercuts or exceeds the two limits, the tag is removed from the tags of each image. Then another check is needed whether the image still has any tags; if not, it is removed as well. The filtering continues as long as needed until the parameters are satisfied.

The advantage of inheritance sticks out with this implementation. The DataProvider class should apply this filter dynamically on initialization. With this setup, it is encapsulated from the learning process. Also, the provider can and should be pickled (using the inbuilt module “import pickle” or a more advanced strategy with the module “import joblib”) and therefore stored on disk for later access.
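The filtering loop described above is not listed in the thesis. A sketch of it, assuming the tag lists come from get_tags_for_images() and using collections.Counter (the function name filter_tags and its exact signature are assumptions), could look like this:

from collections import Counter

def filter_tags(image_paths, tags_per_image, min_amount, max_amount):
    while True:
        counts = Counter(tag for tags in tags_per_image for tag in tags)
        valid = {tag for tag, count in counts.items()
                 if min_amount <= count <= max_amount}
        if valid == set(counts):
            # every remaining tag satisfies both limits
            return image_paths, tags_per_image
        # drop tags outside the limits ...
        tags_per_image = [[tag for tag in tags if tag in valid]
                          for tags in tags_per_image]
        # ... and afterwards drop images without any valid tag left
        kept = [i for i, tags in enumerate(tags_per_image) if tags]
        image_paths = [image_paths[i] for i in kept]
        tags_per_image = [tags_per_image[i] for i in kept]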
4.2.2 Convolutional Neural Net

The setup code for the convolutional neural network is in the file pyvsem/neuralnet/neuralnet.py. It receives several configurable parameters, so that on each run the data set and the predefined network can be set. Also, the program loads the data set and makes a train/test split. All data is stored as Python pickles, so that the results can be verified later. The possible parameters are:
• net_number: Identification number of the net structure, must be given as an integer. Currently possible values are 1–15.
• colors: Number of colors in the image: 1 means grayscale, 3 the full color range. Other values are not possible.
• width: The width and height of the images to which they are resized prior to giving them to the neural net. Theoretically, it is possible to give height and width separately, but here it is limited to square images.
• data_set: Identification string of the data set on which to operate. A folder with the same name must be present in the data/ directory and there must be a DataProvider implemented that preprocesses the images. Currently possible values are mnist, mirflickr-25k and neoclassica.
• aug: This boolean flag, given as the string True or False, instructs the neural net to apply the augmentation techniques from Section 3.2.4.
4.2.3 Configuring layer arrangement

Different convolutional neural networks are set up in the files pyvsem/neuralnets/playground/net_X.py, where X stands for a number that is considered the identifier of the net in the parameter net_number from above. The wrapper from Section 4.2.2 calls create_net() and expects an object that implements at least a fit() function, so the net can be trained.

The exemplary implementation uses nolearn by Nouri [Nou2015] as a framework that provides easy access to training and validating neural networks. It uses the neural network creation framework Lasagne by Dieleman et al. [Die2015], which internally
uses Theano by Bergstra et al. [Ber2010] and Bastien et al. [Bas2012] for nearly all mathematical demands. The create_net() function takes four parameters:

• input_shape: Shape of the array that the neural network should work on. Currently it is set to (colors, width, width) from the possible parameters in Section 4.2.2 above.
• y: A set of all unique tags that should be considered during the runtime of the neural network. It should contain the values returned by data_provider.get_all_tags().
• train_test_iterator: The instance of an iterator which is able to iterate batch-wise through the training data. It also does the augmentation, if configured.
• max_epochs: Maximum number of epochs the neural network should train.

All provided playground neural networks instantiate the network in the same way, shown in code snippet 4.5. For example, the playground network net14 has the layer arrangement defined in code snippet 4.6; it belongs at the placeholder in line 8 of code snippet 4.5. The three stages of convolutional networks introduced in Section 2.2.3 are in lines 4–12 of code snippet 4.6, where the first stage is the Conv2DLayer in line 4, the second stage is the nonlinear activation function ReLU in line 7, and the pooling stage is defined in a MaxPool2DLayer in line 9. The convolutional layer is configured to use a kernel size of 3 × 3, the pooling layer uses a kernel size of 2 × 2 and a stride of 2.
from nolearn.lasagne import NeuralNet
import lasagne
import theano
# ...
def create_net(input_shape, y, train_test_iterator, max_epochs=200):
    net = NeuralNet(
        layers=[
            # layers organized here
        ],
        regression=True,  # Multi-Label => regression
        update_learning_rate=theano.shared(0.03),
        update_momentum=theano.shared(0.9),
        objective_loss_function=multi_label_loss_function,
        batch_iterator_train=train_test_iterator,
        batch_iterator_test=train_test_iterator,
        max_epochs=max_epochs
    )
    return net
Code Snippet 4.5: Simple construct to create a neural network with nolearn

Additionally, after the convolutional stages, a Dropout layer is added in line 13 of code snippet 4.6, which sets the output of the nodes from the layer above to 0 with a probability of p = 0.5. The last layer in line 20 is the output layer with a sigmoid activation function to receive a real-valued probability for each target value.
4.2.4 Loss function

The loss function is implemented in Theano. It is organized in the file pyvsem.neuralnets.options as the function multi_label_log_loss() shown in code snippet 4.7.
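For reference, the loss that the snippet computes for a target vector t and a prediction vector p of length N (reconstructed here from the inline comment in code snippet 4.7) is loss(t, p) = −(1/N) Σ_{j=1}^{N} ln(1 − |t_j − p_j|), i.e. the clipped predictions are compared element-wise against the binary targets.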
(lasagne.layers.InputLayer, {
    'shape': (None, input_shape[0], input_shape[1], input_shape[2])
}),
(lasagne.layers.Conv2DLayer, {
    'num_filters': 32,
    'filter_size': (3, 3),
    'nonlinearity': lasagne.nonlinearities.rectify
}),
(lasagne.layers.MaxPool2DLayer, {
    'pool_size': (2, 2),
    'stride': 2
}),
(lasagne.layers.DropoutLayer, {
    'p': 0.5
}),
# ...
(lasagne.layers.DenseLayer, {
    'num_units': 1500
}),
(lasagne.layers.DenseLayer, {
    'num_units': len(y),
    'nonlinearity': lasagne.nonlinearities.sigmoid
})
Code Snippet 4.6: Exemplary implementation of a convolutional neural network
4.3 Results of Multi-Class Classification

There is a manifold of different Convolutional Neural Networks (CNNs) that could be implemented and tested. For the following results, two differently configured CNNs are in use: a very simple CNN (internal name: Net 19b) and the deep neural network VGG-16 from Simonyan and Zisserman [Sim2014] (internal name: Net 26). Their layer layout is shown in Table 4.2. In this table, the first number after each convolutional layer gives the number of filters, the second entry the size of each filter. The size of the pooling function in each pooling layer is shown in the brackets. The number in the brackets after each fully connected layer stands for the number of units of that layer. The activation function for each convolutional and fully connected layer is the ReLU activation function. The dropout probability is p = 0.5 for both dropout layers of Net 26.
ONE = theano.shared(np.float32(1.0))         # shared variable 1.0
EPSILON = theano.shared(np.float32(1.0e-7))  # shared variable ε


def multi_label_log_loss(predictions, targets):
    # clip values, so log(1) and log(0) never happen
    predictions = T.clip(predictions, EPSILON, ONE - EPSILON)
    # loss(t, p) = -1/N * sum_{j=1}^{N} ln(1 - |t_j - p_j|)
    return - T.sum(T.log(ONE - abs(targets - predictions))) / targets.shape[0]

Code Snippet 4.7: Implementation of the loss function for Multi-Label Classification in Theano
Setup:
All experiments are run on a server with two Intel Xeon E5-2637 processors
running at 3.6 GHz, 64 GB of RAM and one NVIDIA Tesla K40c graphics card with 12 GB of RAM. The run time of the experiments is given as seconds per epoch. The duration is given automatically by the nolearn framework. The numbers are rounded and are not to be considered as fully accurate measurements. They are meant to give a ballpark figure on how complex the computation is and how long one can estimate the duration of the experiments.
4.3.1 Testing with MNIST

The MNIST data set is inherently an MCC task. All images are upscaled to a size of 120 × 120 pixels, meaning they are increased from their original size of 28 × 28 pixels. The results from Table 4.3 indicate that the number of layers and their complexity massively increases the run time per epoch. The learning progress of both nets can be seen in Figure 4.5; these plots draw the epoch number on the x-axis and loss respectively accuracy on the y-axis. The plots also show that the more complex Net 26 decreases its loss faster than the simpler Net 19b.
Net 19b:
    input layer
    convolutional layer (32, 3 × 3)
    max pooling layer (3 × 3)
    fully connected layer (100)
    output layer

Net 26:
    input layer
    convolutional layer (64, 3 × 3)
    max pooling layer (2 × 2)
    convolutional layer (128, 3 × 3)
    max pooling layer (2 × 2)
    convolutional layer (256, 3 × 3)
    convolutional layer (256, 3 × 3)
    max pooling layer (2 × 2)
    convolutional layer (512, 3 × 3)
    convolutional layer (512, 3 × 3)
    max pooling layer (2 × 2)
    fully connected layer (4096)
    dropout layer
    fully connected layer (4096)
    dropout layer
    output layer

Table 4.2: Layouts of two different Neural Networks
Net    epochs    accuracy    run time per epoch
19b    25        95.4%       65 s
26     25        99.1%       520 s
Table 4.3: Results of Multi-Class Classification for MNIST

Judging the accuracy of 99.1% from Net 26 against Net 19b's 95.4% after 25 epochs also shows that a complex NN usually performs better than a simpler version; the most limiting factor is time. Also, Figure 4.5a shows an increasing validation loss although the training loss decreases. This happens when the net is overfitting, meaning it can replicate the training data very well but has problems with unknown data. The most probable cause for this is the low number of 100 units in the fully connected layer. Net 26 shows a different pattern: the validation loss in Figure 4.5b stagnates, meaning the network does not overfit to the same extent as Net 19b, but it also does not improve for unknown data over time.
4.3.2 Results for Neoclassica

The data set Neoclassica can also be used as a data set for MCC. The labels are ordered in a way such that the first label of each image describes the artifact in the most abstract way. The data set then has 42 main classes, e.g. “bed” or “sideboard”. The distribution of images per class can be seen in the plots in Figure 4.6. The average number of images per class in this setup is 52, which seems like a lot, but the median of 15 shows that the average is biased by a few outliers. The smallest class, “bed”, has 2 members. The class with the most representatives is “chest of drawers” with 319 images, followed by “table” with 280 images.
(a) Net 19b: Training/validation loss and accuracy
(b) Net 26: Training/validation loss and accuracy

Figure 4.5: Plot of loss over time and accuracy improvement for Net 19b and Net 26
Figure 4.6: Box and violin plot for images per class in Neoclassica

Considering the high number of 20 classes with less than the median of 15 images per class, the results for those classes are to be taken with a grain of salt.

The different experiments are run for a maximum of 100 epochs, with an early stopping mechanism that stops training after at least 50 epochs if the validation rate does not improve. All these experiments use the same images, where a train/test split is done manually beforehand, so the training data and the testing data are equal for all twelve experiments. For Neoclassica the same split factor is applied, resulting in 1 733 images for training and 434 images for testing. The split is done randomly, so it does not take the different class distributions into account; this is left for future work. All following experiments are done with Net 26, as it promised better results, as indicated by the experiment with the MNIST data set in Section 4.3.1.

There are different setups used. They have in common that the image size is 120 × 120 and the batch size is 256. They differ in the decision whether to use gray scale images or
the images as RGB, as well as in the choice of using augmentation techniques or not. This results in four different setups:
color dimensions    augmentation    Name
1                   No              1N
1                   Yes             1Y
3                   No              3N
3                   Yes             3Y

Table 4.4: Overview of the four different setups in the experiments
4.3.2.1 No pretraining
Each configuration is fit on the same plain training set from Neoclassica; the results during training are given in Table 4.5. The networks were not pretrained. The validation accuracy numbers from Table 4.5 indicate that the best result was achieved with augmentation techniques and colored images, with a validation accuracy of 0.37557. The configurations are set up so that they stop after 100 epochs. The early stopping mechanism did not halt the training before the 100 epochs, with the exception of configuration 3N, where it stopped after 94 epochs, so further improvements and different outcomes might be produced afterwards. Surprisingly, the average run time per epoch went down when using three-dimensional matrices instead of two-dimensional image matrices.
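The early stopping mechanism itself is not part of the listings. With nolearn it can be realized through the on_epoch_finished hook of NeuralNet; the following is a sketch under the assumption that the validation loss is monitored (the thesis speaks of the validation rate, so the actual criterion and the chosen patience may differ):

class EarlyStopping(object):
    def __init__(self, min_epochs=50, patience=50):
        self.min_epochs = min_epochs
        self.patience = patience
        self.best_valid = np.inf
        self.best_epoch = 0

    def __call__(self, nn, train_history):
        epoch = train_history[-1]['epoch']
        current_valid = train_history[-1]['valid_loss']
        if current_valid < self.best_valid:
            self.best_valid = current_valid
            self.best_epoch = epoch
        elif epoch >= self.min_epochs and epoch - self.best_epoch >= self.patience:
            # no improvement for `patience` epochs: end the training loop
            raise StopIteration()

An instance of such a class would be passed as on_epoch_finished=[EarlyStopping()] to NeuralNet; nolearn ends fit() when one of these handlers raises StopIteration.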
Validation accuracy

Epoch                  1N        1Y        3N        3Y
1                      0.13849   0.17290   0.13849   0.10819
2                      0.13849   0.13327   0.13849   0.10819
3                      0.13327   0.13849   0.13327   0.07244
4                      0.13849   0.13327   0.13849   0.07244
5                      0.14759   0.13849   0.13849   0.12281
10                     0.13849   0.13849   0.13849   0.10819
15                     0.13849   0.13849   0.13849   0.10819
20                     0.13849   0.13849   0.13849   0.10819
25                     0.13849   0.13849   0.16516   0.13190
30                     0.13849   0.15408   0.14435   0.13435
40                     0.13849   0.16836   0.24119   0.14294
50                     0.13849   0.18395   0.22433   0.18736
60                     0.17486   0.18590   0.34137   0.22914
70                     0.18139   0.22102   0.32056   0.27102
80                     0.20092   0.23402   0.31861   0.32003
90                     0.26403   0.26918   0.35238   0.36640
100                    0.26793   0.26662   —         0.37557

Avg. run time/epoch    274 s     259 s     146 s     149 s

Table 4.5: Training progress of Net 26 in different configurations for Neoclassica

4.3.2.2 Training on ImageNet data
Pretraining is a usual method for NNs when approaching new data with few examples. Applying pretraining with the full data set of ImageNet (http://image-net.org/download) is computationally infeasible for this thesis. With regard to the future application of style recognition, using a challenge set for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
from Russakovsky et al. [Rus2015] might also not be appropriate. These challenges have 1 000 classes and a whole range of very different concepts, like different dog breeds, fish, houses, clouds and people. But within ImageNet there are also types of furniture included, from which a subset of 29 different classes is used for pretraining. The classes are listed in Appendix B.2. The number of images in these classes sums up to 35 207, with an average of 1 214 images per class. The median is 1 247 images per class and the distribution has a standard deviation of 414. The least populated class has 134 images, whereas the class with the most images has 2 021 images. The train/test split is done with a factor of 0.8, resulting in 28 165 training and 7 042 testing images. Net 26 is unchanged except for the output layer, where 29 units are created to correspond to the 29 different classes. The different configurations were also trained for at least 50 epochs and at maximum 100 epochs.

The results using a manual testing set created prior to training are shown in Table 4.6 and Figure 4.7. The best performance is observed with a mean F1 score over all classes of 0.543 and an average accuracy over all classes of 0.539, achieved by using augmentation and colored images. Noticeable here is the indication that colored images have a negative impact on F1 score and accuracy when not using augmentation techniques, compared to gray scale images. Figure 4.8 shows the F1 score per class. This bar plot shows that using augmentation outperforms its counterpart every time, most noticeably for the class “fauteuil”. The confusion matrix for the best performance is shown in Figure 4.9. Most wrong classifications occur between different subclasses of tables, e.g. a “coffee table” is often wrongfully classified as a “tea table”, “trestle table” or “console table”.
Configuration    mean F1 score    mean accuracy
1N               0.478            0.475
1Y               0.533            0.528
3N               0.453            0.450
3Y               0.543            0.539

Table 4.6: Accuracy and F1 score of the different configurations using ImageNet data
Figure 4.7: Bar plot of mean F1 scores over all classes for the different configurations on ImageNet

Figure 4.8: Bar plot of all F1 scores per class for the ImageNet data used for pretraining
Figure 4.9: Confusion matrix for Net 26 in configuration 3Y on ImageNet pretraining data

4.3.2.3 Applying ImageNet pretraining to Neoclassica
After pretraining the different configurations on the manually created ImageNet data set, four new instances of Net 26 are created, this time with the previous number of 42 output nodes. The parameters and weights are exported from the pretrained nets and imported into the new nets using nolearn's included methods.
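nolearn's NeuralNet offers save_params_to() and load_params_from() for this purpose. A sketch of the transfer, assuming a freely chosen file name (pretrained_net26.params is hypothetical) and leaving open how parameters of the differently sized output layers are handled, could be:

def transfer_pretrained(pretrained_net, input_shape, y_neoclassica, train_test_iterator):
    # store the learned parameters of the pretraining run ...
    pretrained_net.save_params_to('pretrained_net26.params')
    # ... and load them into a fresh net with the Neoclassica output size
    new_net = create_net(input_shape, y_neoclassica, train_test_iterator, max_epochs=100)
    new_net.load_params_from('pretrained_net26.params')
    return new_net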
Figure 4.10: F1 scores over all Neoclassica classes with pretraining

Again, like in the other experiments, all nets are retrained for at least 50 and up to 100 epochs on the train split from before. Evaluation is done with the same manual testing set as in the previous Section 4.3.2.1. The F1 scores over all classes are shown in Figure 4.10. In this plot, the low number of class representatives can be seen: 21 classes have no F1 measure, meaning that there are not enough images for those classes to have a meaningful metric. 8 classes only have valid F1 scores when applying augmentation. All experiments give valid scores for 12 classes, and one class only gives valid scores for three experiments, leaving 3N out of the scores. No scores are given if the number of supporting images for the class is 0–2. 3–10 images give results in 7 out of 12 cases, but not for all experiments, except for one class. Reliable scores for all configurations start from 13 representatives, with two exceptions where 16 or 17 images were not enough to produce values for the configurations without augmentation.

The bar plot in Figure 4.11b shows in the bolder parts of the bars the averaged F1 scores of all classes including the zero values.
Metric      Configuration    no pretraining    with pretraining    Improvement
F1          1N               0.330             0.347               5.15%
F1          1Y               0.320             0.400               25.00%
F1          3N               0.206             0.369               79.13%
F1          3Y               0.333             0.442               32.73%
Accuracy    1N               0.323             0.407               26.01%
Accuracy    1Y               0.322             0.409               27.02%
Accuracy    3N               0.218             0.416               90.83%
Accuracy    3Y               0.368             0.438               19.02%
Table 4.7: Comparison of F1 scores and accuracy between the different configurations with and without pretraining

When removing the invalid F1 scores, the results are shown as the less saturated parts of the bars. Also, the difference between applying pretraining and doing the classification without pretraining can directly be seen by comparing the two bar plots in Figure 4.11a and Figure 4.11b. Detailed numbers for the different F1 scores and accuracies are given in Table 4.7. The improvement for the best performance of 3Y is, for the F1 score over the valid results, from 0.333 to 0.442, meaning an improvement of 32.73%. Using pretraining improves the F1 score on average by 35.50% and the accuracy by 40.72%. The best improvement is observed when comparing colored images without augmentation with regard to pretraining, whereas the best F1 score is achieved by using colored images and augmentation.
Figure 4.11: Mean F1 scores for different experiments on Neoclassica with and without pretraining
5 Future Work and Conclusion
5.1 Future work

There will always be new ways of arranging layers in Neural Networks (NNs) and of setting them up; tuning each hyperparameter is exhausting and leaves much space for future work. But here the question arises whether the validation rate will ever be good enough. As of now, we are far from perfection, so future implementations can and will improve the results from this thesis. The implementation part of the approach of utilizing Distributional Semantics (DS) for the proof-of-concept framework is also left for future work.

The data set Neoclassica from Section 4.1.3 has too few images with too few labels to train a proper Convolutional Neural Network (CNN). Obtaining new images of this specific furniture is easy, as there are many auction houses and catalogs that offer access to their image databases. More complicated is finding labels for all the images. Experts in this field are rare and the manual labeling is laborious. Manual labeling could be distributed and outsourced, for example using Amazon's Mechanical Turk (https://www.mturk.com/mturk/welcome) or
an equivalent application. The major drawback is that the workers there are most certainly not experts in the field of classifying furniture from the era of Neoclassicism.

Another task is left for future implementation: speeding up the training of the CNN. It is currently implemented in Python and Theano, and the run time of an interpreted programming language takes its toll. Directly using a more machine-oriented language like C++ can have a major impact on the runtime of the program. So the suggestion for the future is to port the setup to MXNET (https://github.com/dmlc/mxnet and https://mxnet.readthedocs.org/en/latest/). The advantage of MXNET is its implementation in C++; it also has a Python module. Additionally, the usage of multiple graphics cards comes directly and nearly transparently to the programmer.
5.2 Conclusion

The work in this thesis implemented a widely known layout for a Convolutional Neural Network, namely VGG-16 from Simonyan and Zisserman [Sim2014], for a Multi-Class Classification task on the data set Neoclassica that was created for this thesis. There are four different network configurations: using gray scale or RGB images, and applying augmentation during training or not. The experiments show that using augmentation techniques always leads to better F1 scores. Moreover, another experiment examined the application of pretraining with manually selected images from ImageNet that match the classes from Neoclassica. This improved the F1 score for all four configurations, on average by 31.31% from 0.297 to 0.390. The highest averaged F1 score for Neoclassica of 0.442 is achieved by using RGB images and augmentation. The highest average accuracy is also achieved by this configuration and yields a success rate of 43.8%.
A Code
The provided open-source framework is available at: https://gitlab.com/bernhard.bermeitinger/pyvsem

It is recommended to run the code in a new virtual environment under Python 3.5 in the 64-bit version. The operating system is not limited to a Linux distribution like Fedora (https://getfedora.org/) or Debian (https://www.debian.org/), but the code is not tested on any other system. The Python environment is Anaconda (https://www.continuum.io/downloads). It should work with python-pip (https://pypi.python.org/pypi) accordingly, although all development was done in a virtual environment created by Anaconda. The project has several third-party dependencies, which must be installed prior to running the software:
backports_abc, cython, ipython, joblib, jupyter, matplotlib, nltk, notebook, numpy, openblas, python-dateutil, pytz, requests, scikit-image, scikit-learn, scipy, six, traitlets
The following packages are installed by checking out the newest version (branch master) and installing them manually:
• Theano: https://github.com/Theano/Theano.git
• Lasagne: https://github.com/Lasagne/Lasagne.git
• nolearn: https://github.com/dnouri/nolearn.git
• pylearn2: https://github.com/lisa-lab/pylearn2.git

Although the code should run on a CPU, it is recommended to run it on a CUDA-enabled graphics card. Usually, Theano should automatically choose to run on a graphics card, but to be sure, the file ~/.theanorc should exist and contain the following content:

[global]
device = gpu0
floatX = float32
warn_float_64 = warn

Running on a graphics card needs the CUDA-Toolkit from NVIDIA (https://developer.nvidia.com/accelerated-computing-toolkit) properly set up and also the special cuDNN library (https://developer.nvidia.com/cudnn) for the convolutional layers.
B Data sets
B.1 Accessing data sets

Due to copyright issues and licensing ambiguity, the three data sets are not included. MNIST can be downloaded with the Python module scikit-learn. The mirflickr-25k data can be downloaded from: http://press.liacs.nl/mirflickr/

The Neoclassica data set is not publicly available because of the images' copyright and their usage in commercial products. The data sets must be placed in a folder in the data/ directory whose name is the name of the data set. The Neoclassica data set must be processed before using it. This is done in the IPython notebook (https://ipython.org/notebook.html), now Jupyter notebook (http://jupyter.org/), “create tags.ipynb” in the folder data/neoclassica/.
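For MNIST, a minimal sketch of the download via scikit-learn, using the fetch_mldata helper that was current at the time of writing (the exact loading code of the MNIST DataProvider is not shown in the thesis), is:

from sklearn.datasets import fetch_mldata

# downloads MNIST into scikit-learn's data home on first use
mnist = fetch_mldata('MNIST original')
print(mnist.data.shape)    # (70000, 784): flattened 28x28 gray scale images
print(mnist.target.shape)  # (70000,): digit labels 0-9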
B.2 Pretraining with ImageNet

The following 29 classes from ImageNet were used for pretraining. The synset IDs corresponding to the labels are given in brackets:
• “armoire” (n02739550)
• “buffet, counter, sideboard” (n02912065)
• “candlestick, candle holder” (n02948557)
• “captain's chair” (n02957862)
• “card table” (n02964075)
• “chandelier, pendant, pendent” (n03005285)
• “chesterfield” (n03015149)
• “chiffonier, commode” (n03016953)
• “clock” (n03046257)
• “clock pendulum” (n03046802)
• “coffee table, cocktail table” (n03063968)
• “console table, console” (n03092883)
• “davenport” (n03164722)
• “daybed, divan bed” (n03165096)
• “dressing table, dresser, vanity, toilet table” (n03238586)
• “fauteuil” (n03325403)
• “hand glass, hand mirror” (n03485198)
• “Morris chair” (n03786621)
• “pedestal table” (n03904060)
• “picture frame” (n03931765)
• “pier table” (n03934565)
• “secretary, writing table, escritoire, secretaire” (n04164868)
• “settee” (n04177755)
• “shelf” (n04190052)
• “single bed” (n04222210)
• “tea table” (n04398951)
• “trestle table” (n04480033)
• “wing chair” (n04593077)
• “writing desk” (n04608329)
Bibliography
[Ahn2006] L. von Ahn. “Games with a purpose”. In: Computer 39.6 (June 2006), pp. 92–94. issn: 0018-9162. doi: 10.1109/MC.2006.196 (cit. on p. 13).
[Bas2012]
Frédéric Bastien et al. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 2012 (cit. on pp. 41, 47).
[Ber2010]
James Bergstra et al. “Theano: a CPU and GPU Math Expression Compiler”. In: Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation. Austin, TX, June 2010 (cit. on pp. 41, 47).
[Bos2007]
A. Bosch, A. Zisserman, and X. Muñoz. “Image Classification using Random Forests and Ferns”. In: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. Oct. 2007, pp. 1–8. doi: 10.1109/ICCV.2007.4409066 (cit. on p. 9).
[Bru2014]
Elia Bruni, Nam Khanh Tran, and Marco Baroni. “Multimodal Distributional Semantics”. In: J. Artif. Int. Res. 49.1 (Jan. 2014), pp. 1–47. issn: 1076-9757. url: http://dl.acm.org/citation.cfm?id=2655713.2655714 (cit. on pp. 9, 10, 12).
[Bru2011]
Elia Bruni, Giang Binh Tran, and Marco Baroni. “Distributional semantics from text and images”. In: Proceedings of the EMNLP Geometrical Models for Natural Language Semantics Workshop. 2011 (cit. on p. 13).
[Bru2013]
E. Bruni et al. “VSEM: An open library for visual semantics representation”. In: 2013. url: http://clic.cimec.unitn.it/vsem/ (cit. on p. 10).
[Die2015]
Sander Dieleman et al. Lasagne: First release. Aug. 2015. doi: 10.5281/zenodo.27878. url: http://dx.doi.org/10.5281/zenodo.27878 (cit. on p. 46).
[Fin2001]
Lev Finkelstein et al. Placing Search in Context: The Concept Revisited. 2001 (cit. on p. 13).
[Gau1821] C. F. Gauss. Theoria combinationis observationum erroribus minimis obnoxiae (Theory of the combination of observations least subject to error). 1821 (cit. on p. 13).
[Gau1809] Carl Friedrich Gauss. Theoria motus corporum coelestium in sectionibus conicis solem ambientium. 1809 (cit. on p. 13).
[Har1981]
Zellig S. Harris. “Distributional Structure”. English. In: Papers on Syntax. Ed. by Henry Hiż. Vol. 14. Synthese Language Library. Springer Netherlands, 1981, pp. 3–22. isbn: 978-90-277-1267-7. doi: 10.1007/978-94-009-8467-7_1. url: http://dx.doi.org/10.1007/978-94-009-8467-7_1 (cit. on pp. 5, 7).
[Hub1962] D. H. Hubel and T.N. Wiesel. “Receptive Fields, Binocular Interaction, and Functional Architecture in the Cat’s Visual Cortex”. In: Journal of Physiology (London) 160 (1962), pp. 106–154 (cit. on p. 14).
[Hui2008]
Mark J. Huiskes and Michael S. Lew. “The MIR Flickr Retrieval Evaluation”. In: MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. Vancouver, Canada: ACM, 2008 (cit. on p. 35).
[Koh2013]
Hubertus Kohle. Digitale Bildwissenschaft. Werner Hülsbusch, June 21, 2013. 192 pp. isbn: 978-3-86488-036-0. urn: urn:nbn:de:bsz:16-artdok-21857. url: http://archiv.ub.uni-heidelberg.de/artdok/volltexte/2013/2185 (cit. on p. 2).
[Kri2012]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. url: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf (cit. on p. 21).
[Krü1997]
Klaus Krüger. “Geschichtlichkeit und Autonomie. Die Ästhetik des Bildes als Gegenstand historischer Erfahrung”. In: Der Blick auf die Bilder. Kunstgeschichte und Geschichte im Gespräch (1997). Ed. by Otto Gerhard Oexle, pp. 53–86 (cit. on p. 3).
[LeC1998]
Y. LeCun et al. “Gradient-Based Learning Applied to Document Recognition”. In: Proceedings of the IEEE 86.11 (Nov. 1998), pp. 2278–2324 (cit. on pp. 35, 36).
[Leg1805]
Adrien Marie Legendre. Nouvelles méthodes pour la détermination des orbites des cometes. F. Didot, 1805 (cit. on p. 13).
[Lin2011]
Yuanqing Lin et al. “Large-scale image classification: fast feature extraction and svm training”. In: In IEEE Conference on Computer Vision and Pattern Recognition. 2011 (cit. on p. 20).
[Low1999]
David G Lowe. “Object recognition from local scale-invariant features”. In: International Conference on Computer Vision, 1999 (1999), pp. 1150– 1157 (cit. on p. 10).
[Low1995]
David Lowe and Robert Matthews. “Shakespeare vs. Fletcher: A stylometric analysis by radial basis functions”. In: Computers and the Humanities 29.6 (1995), pp. 449–461. issn: 1572-8412. doi: 10.1007/BF01829876. url: http://dx.doi.org/10.1007/BF01829876 (cit. on p. 2).
[Low2001]
Will Lowe. “Towards a theory of semantic space”. In: Proceedings of the 23rd Conference of the Cognitive Science Society. Ed. by Johanna T. Moore and Keith Stenning. 2001, pp. 576–581. url: http://www.nottingham.ac.uk/mdi/people/will/pubs.php (cit. on p. 8).
[Mur2004] G. Murphy. The Big Book of Concepts. MIT Press, 2004. isbn: 9780262250061. url: https://books.google.de/books?id=t2jldRsNkgsC (cit. on p. 8).
[Nou2015]
Daniel Nouri. nolearn. 2015. url: https://pythonhosted.org/nolearn/ (cit. on p. 46).
[Pal2011]
Allison Lee Palmer. Historical Dictionary of Neoclassical Art and Architecture. Scarecrow Press, 2011 (cit. on p. 3).
[Pro1980]
Jules David Prown. “Style as Evidence”. In: Winterthur Portfolio 15.3 (1980), pp. 197–210. issn: 00840416, 15456927. url: http://www.jstor.org/stable/1180742 (cit. on p. 3).
[Reh2016]
Malte Rehbein. “Was sind Digital Humanities? (to appear)”. In: Akademie Aktuell 55.1 (Feb. 2016). to appear, pp. 15–19 (cit. on p. 1).
[Rus2015]
Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252. doi: 10.1007/s11263-015-0816-y (cit. on pp. 20, 21, 57).
[Sha2012]
Lior Shamir and Jane A. Tarakhovsky. “Computer Analysis of Art”. In: J. Comput. Cult. Herit. 5.2 (Aug. 2012), 7:1–7:11. issn: 1556-4673. doi: 10.1145/2307723.2307726. url: http://doi.acm.org/10.1145/2307723.2307726 (cit. on p. 3).
[Sim2014]
Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: CoRR abs/1409.1556 (2014). url: http://arxiv.org/abs/1409.1556 (cit. on pp. 21, 49, 64).
[Siv2003]
J. Sivic and A. Zisserman. “Video Google: a text retrieval approach to object matching in videos”. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. Oct. 2003, 1470–1477 vol.2. doi: 10.1109/ICCV.2003.1238663 (cit. on p. 9).
[Tha2014]
Manfred Thaller. Grenzen und Gemeinsamkeiten: Die Beziehung zwischen der Computerlinguistik und den Digital Humanities. deutsch. 2014. url: https://video.uni-passau.de/video/Grenzen-und-GemeinsamkeitenDie-Beziehung-zwischen-der-Computerlinguistik-und-den-DigitalHumanities/64f068d148cf066c98beb024c76664e6 (cit. on p. 1).
[Wal2011]
Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. “The NumPy Array: A Structure for Efficient Numerical Computation”. In: Computing in Science & Engineering 13.2 (2011), pp. 22–30. doi: 10.1109/MCSE.2011.37. url: http://scitation.aip.org/content/aip/journal/cise/13/2/10.1109/MCSE.2011.37 (cit. on p. 41).
[Wal2014]
Stéfan van der Walt et al. “scikit-image: image processing in Python”. In: PeerJ 2 (June 2014), e453. issn: 2167-8359. doi: 10.7717/peerj.453. url: http://dx.doi.org/10.7717/peerj.453 (cit. on p. 43).
[Wie1959]
D. H. Hubel and T. N. Wiesel. “Receptive fields of single neurones in the cat's striate cortex”. In: J. Physiol. 148 (1959), pp. 574–591 (cit. on p. 14).
Eidesstattliche Erklärung
I hereby declare that I have written this master's thesis independently and without the use of sources and aids other than those stated, that all passages taken over verbatim or in spirit have been marked as such, and that I have not submitted this master's thesis in the same or a similar form to any other examination authority.
Passau, 22. Dezember 2015
Bernhard Bermeitinger