TEL-AVIV UNIVERSITY The Iby and Aladar Fleischman Faculty of Engineering

TRADE-OFF BETWEEN RECOGNITION AND RECONSTRUCTION: APPLICATION OF NEURAL NETWORKS TO ROBOTIC VISION

Thesis submitted for the degree "Doctor of Philosophy" by INNA STAINVAS

Submitted to the Senate of Tel-Aviv University 1999


This work was carried out under the supervision of Doctor Nathan Intrator and Doctor Amiram Moshaiov

This work is dedicated to my family

Acknowledgment

I would like to thank my husband, daughter and parents for their tolerance and moral support during the completion of this thesis. I am greatly indebted to my first advisor, Dr. Amiram Moshaiov, who gave me the chance to start as a Ph.D. student at the Engineering Faculty of Tel-Aviv University when I had been in Israel for only two months. I am very grateful to him for proposing that I work in Neural Networks and Computer Vision, and for allowing me freedom in my research. I have been pleasantly surprised by the flexibility of the educational system of Tel-Aviv University in allowing me to attend and participate in courses at different faculties, such as Engineering, Computer Science and Foreign Languages. While taking courses in Neural Networks, I met Dr. Nathan Intrator, who became my main supervisor and collaborator for more than five years. He opened to me the new world of Neural Networks, and I have learned much from him, not only about technical aspects but also about scientific research methodology. Without him, this thesis would never have appeared. I am grateful to him for his tolerance, endless support and guidance. It is impossible to thank all the people who helped me, but I would like to mention the system administrator of the Engineering Faculty, Udi Mottelo; the department secretary, Ariella Regev; the secretary of the Emigration Support department, Ahuva; my friends; and the people of the Neural Computation Group of the Computer Science Faculty: Yair Shimshoni, Nurit Vatnick and Natalie Japkowich. This work was supported by grants from the Rich Foundation, the Don and Sara Marejn Scholarship Fund, and by a grant from the Ministry of Science to Dr. Nathan Intrator.

Inna Stainvas
March 8, 1999

Abstract

Autonomous and efficient action of robots requires a robust robot vision system that can cope with variable lighting and viewing conditions. These include partial occlusion, blur and, mainly, large differences in object scale due to the variable distance to the objects. This change in scale leads to reduced resolution for objects seen from a distance. One of the most important tasks for the robot's visual system is object recognition. This task is also affected by orientation and background changes. These real-world conditions require the development of specific object recognition methods.

This work is devoted to robotic object recognition. We develop recognition methods based on training that incorporates prior knowledge about the problem. The prior knowledge is incorporated via learning constraints during training (parameter estimation). A significant part of the work is devoted to the study of reconstruction constraints. In general, there is a trade-off between the prior-knowledge constraints and the constraints emerging from the classification or regression task at hand. To avoid an additional estimation of the optimal trade-off between these two constraints, we consider this trade-off as a hyper-parameter (under a Bayesian framework) and integrate over a certain (discrete) distribution.

We also study various constraints resulting from information theory considerations. Experimental results on two face data-sets are presented. Significant improvement in face recognition is achieved for various image degradations, such as various forms of image blur, partial occlusion and noise. Additional improvement in recognition performance is achieved when preprocessing the degraded images via state of the art image restoration techniques.

Contents

1 Introduction
  1.1 General motivation
    1.1.1 Robotic vision
    1.1.2 Internal data representation
    1.1.3 Data compression
    1.1.4 Face recognition
  1.2 Overview of the thesis

2 Statistical formulation of the problem
  2.1 Bias-Variance error decomposition for a single predictor
  2.2 Variance control without imposing a learning bias
  2.3 Variance control by imposing a learning bias
    2.3.1 Smoothness constraints
    2.3.2 Invariance bias constraints
    2.3.3 Specific bias constraints
  2.4 Reconstruction bias constraints
  2.5 Minimum Description Length (MDL) Principle
    2.5.1 Minimum description length
  2.6 Bayesian framework
  2.7 MDL in the feed-forward NN
    2.7.1 MDL and EPP bias constraints
  2.8 Appendix to Chapter 2: Regularization problem

3 Imposing bias via reconstruction constraints
  3.1 Introduction
    3.1.1 Principal Component Analysis (PCA)
    3.1.2 Autoencoder network and MDL
    3.1.3 Reconstruction and generative models
    3.1.4 Classification via reconstruction
    3.1.5 Other applications of reconstruction
  3.2 Imposing reconstruction constraints
    3.2.1 Reconstruction as a bias imposing mechanism
    3.2.2 Hybrid classification/reconstruction network
    3.2.3 Hybrid network and MDL
    3.2.4 Hybrid network as a generative probabilistic model
    3.2.5 Hybrid Neural Network architecture
    3.2.6 Network learning rule
    3.2.7 Hybrid learning rule

4 Imposing bias via unsupervised learning constraints
  4.1 Introduction
  4.2 Information principles for sensory processing
  4.3 Mathematical background
    4.3.1 Entropy maximization (ME)
    4.3.2 Minimization of the output mutual information (MMI)
    4.3.3 Relation to Exploratory Projection Pursuit
    4.3.4 BCM
    4.3.5 Sum of entropies of the hidden units
    4.3.6 Nonlinear PCA
    4.3.7 Reconstruction issue
  4.4 Imposing unsupervised constraints
  4.5 Imposing unsupervised and reconstruction constraints

5 Real world recognition
  5.1 Introduction
    5.1.1 Face recognition
  5.2 Methodology
    5.2.1 Different architecture constraints
    5.2.2 Regularization
    5.2.3 Neural Network Ensembles
    5.2.4 Face data-sets
    5.2.5 Face normalization
    5.2.6 Learning parameters
  5.3 Type of image degradations
  5.4 Experimental results
    5.4.1 Different architecture constraints and regularization ensembles
  5.5 Saliency detection
    5.5.1 Saliency map
  5.6 Conclusions
  5.7 Appendix to Chapter 5: Hidden representation exploration

6 Blurred image recognition
  6.1 Methodology
    6.1.1 Experimental design
  6.2 Image degradation
    6.2.1 Main filters
    6.2.2 Other types of degradation
  6.3 Image restoration
    6.3.1 MSE minimization and regularization
    6.3.2 Image restoration in the frequency domain
    6.3.3 Denoising
  6.4 Results
    6.4.1 Image filtering
    6.4.2 Classification of noisy data
    6.4.3 Gaussian blur
    6.4.4 Motion blur
    6.4.5 Blind deconvolution
    6.4.6 All training schemes
  6.5 Conclusions

7 Summary and future work
  7.1 Summary
  7.2 Directions for future work

List of Figures

2.1 Supervised feed-forward network
2.2 Hybrid network with EPP constraints
3.1 Autoencoder network architecture
3.2 Eigenspaces extracted by PCA
3.3 Combined recognition/reconstruction network
3.4 Hybrid network with reconstruction and EPP constraints
3.5 Detailed architecture of the recognition/reconstruction network
4.1 Feed-forward network for independent component extraction
4.2 Pdf graphs for a family of exponential density functions
4.3 Exploratory projection pursuit network
5.1 Misclassification rate time evolution
5.2 MSE (mean-squared) recognition error time evolution
5.3 Classification based regularization
5.4 "Caricature" faces in three resolutions
5.5 Image degradation and reconstruction (TAU data-set)
5.6 Summary of different networks and different image degradations
5.7 Saliency map construction
5.8 Hidden unit activities vs. classes, for an unconstrained network
5.9 Hidden unit activities vs. classes, for a reconstruction network
5.10 Pdf's of the hidden unit activities
5.11 Hidden weight representation
6.1 Experimental design schemes
6.2 Training scheme C
6.3 Degraded images
6.4 Noisy images
6.5 Gaussian blur and restoration
6.6 Motion blur and deblur
6.7 Blind deconvolution
6.8 Recognition of blurred images via schemes A–C
6.9 Reconstruction of Gaussian blurred images

List of Tables

4.1 Unsupervised constraints
5.1 Classification results for Pentland data-set
5.2 Different ensemble types (Pentland data-set)
5.3 Different ensemble types (TAU data-set)
5.4 Recognition using saliency map (Pentland data-set)
5.5 Recognition using saliency map (TAU data-set)
6.1 Classification results for filtered data
6.2 Noise and restoration
6.3 Gaussian blur and restoration
6.4 Motion blur and restoration
6.5 Blind deconvolution
6.6 Blurred image recognition via joined ensembles
7.1 Classification error for reconstructed images

Chapter 1
Introduction

1.1 General motivation

1.1.1 Robotic vision

Nowadays, robots that can move and operate autonomously in a real-world environment are in high demand. One of the main perception tasks that has to be addressed in this context is recognition. The recognition task in a real-world environment is challenging, as it has to address data variability such as orientation, changing background, partial occlusion and blur. As an illustration, consider a vision-guided robot helicopter which has to navigate autonomously using only on-board sensors and computing power (Chopper, 1997). One of the basic difficulties in recognizing images taken by the helicopter cameras during an operation is the significant difference between these images and the images the robot became acquainted with under ideal flight conditions. Usually, the images taken during operation contain a large amount of degradation caused by diverse factors, such as changing illumination, bad weather conditions, relative motion between the cameras and the object of interest in the scene, shadows, and the low resolution of the cameras. Some of these factors cause images to look blurred and foggy; others lead to noise and partial occlusion. All these factors are crucial for recognition performance and require special care. Among the possible approaches to improve recognition performance on degraded images is an endeavor to recover the images using state of the art restoration techniques as preprocessing before the recognition stage. This preprocessing requires estimation of the degradation process, e.g. the type and parameters of the blur operation. Another approach is to directly address the variability in the recognition system.

It is well known that for a restoration process to be successful, the degradation process has to be accurately modeled. However, in many cases exact modeling is impractical, and the restored images remain partially degraded and contain artifacts. Furthermore, restoration methods are often computationally expensive and require a-priori knowledge or human interaction. It follows that efforts have to be concentrated on the development of recognition methods that are more robust to image degradations.
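The degradation model implicit in this discussion, an observed image formed by blurring the true image and adding noise (g = h * f + n), can be sketched in a few lines. This is an illustrative example with assumed kernel sizes and noise levels, not code from the thesis:

```python
import numpy as np

def gaussian_kernel(size=7, sigma=1.5):
    """Build a normalized 2-D Gaussian point-spread function h."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def degrade(image, kernel, noise_std=0.01, seed=0):
    """Blur by 2-D convolution (g = h * f) and add Gaussian noise n."""
    ks = kernel.shape[0]
    pad = ks // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros_like(image, dtype=float)
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + ks, j:j + ks] * kernel)
    rng = np.random.default_rng(seed)
    return out + rng.normal(0.0, noise_std, image.shape)

# Example: degrade a synthetic 16x16 "image" containing a sharp edge.
img = np.zeros((16, 16))
img[:, 8:] = 1.0
blurred = degrade(img, gaussian_kernel())
```

After degradation the sharp step edge becomes a smooth ramp, imitating the defocus and motion effects described above; restoration methods must invert (deconvolve) this process, which is ill-posed once noise is present.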

1.1.2 Internal data representation

An important aspect of robust recognition methods is the construction of an internal data representation (feature extraction) that captures the significant structure of the data. According to D. Marr (1982), finding an internal representation is an inherent component of the vision process.

Feature based representation

Many recognition methods include grouping or perceptual organization as a first stage of visual processing. In this stage, objects are represented as models containing the essential features and the logical relations between them needed for recognition. Some methods extract "anchor points" (Ullman, 1989; Brunelli and Poggio, 1992); others consider edge segments as interesting feature elements (Bhanu and Ming, 1987; Liu and Srinath, 1984). Relatively new approaches are deformable template matching (Grenander, 1978; Brunelli and Poggio, 1993; Jain et al., 1996) and the use of generalized splines for object classification (Lai, 1994). These methods attempt to extract salient features locally, in the low level stage of visual processing, according to the subjective understanding of an investigator. Therefore, finding an internal representation based on the extraction of object features and the relations between them may be limited.

Learning internal representations via Neural Networks

A radical alternative is to use all the available intensity information for finding an internal representation. Principal Component Analysis (PCA) (Fukunaga, 1990) is a non neural network example of this approach, where the internal representation space is spanned by the largest eigenvectors of the data covariance matrix. These eigenvectors are macro-features extracted implicitly from the images. When fed with intensity images, Neural Networks, similarly to PCA, extract an internal representation in the space of hidden unit activities.

Processing an image as a whole is a high dimensional recognition task that leads to the curse of dimensionality (Bellman, 1961), which means that there is not enough data to robustly train a classifier in a high dimensional space. As an example, a network with a single hidden unit and input images of 60 × 60 pixels has 3600 weight parameters that have to be estimated. Thus, the main issue is finding an intrinsic low dimensional representation of the images. As was pointed out by Geman et al. (1992), a way to avoid the curse of dimensionality in Neural Networks is to prewire the important generalizations by purposefully introducing a learning bias. The work presented in this thesis is specifically devoted to this issue. We develop image recognition techniques using hybrid feed-forward Neural Networks, obtained by introducing a learning bias. In particular, we investigate the influence of novel reconstruction learning constraints on the recognition performance of feed-forward Neural Networks. In addition, we propose other learning constraints based on information theory, and compare their efficiency with that of the reconstruction constraints. We demonstrate that hybrid Neural Networks are robust to real-world degradation of the input visual data, and show that their performance can be further enhanced when state of the art restoration (deblur) techniques are also incorporated.
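The PCA-style internal representation described above can be sketched concretely: the eigenvectors of the data covariance matrix serve as macro-features, and projecting onto the largest ones yields a low dimensional code. The data, dimensions and component count below are arbitrary toy choices, not those of the thesis experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "images": 100 samples of 8x8 = 64 pixels with low-rank structure.
basis = rng.normal(size=(5, 64))             # 5 latent macro-features
coeffs = rng.normal(size=(100, 5))
X = coeffs @ basis + 0.01 * rng.normal(size=(100, 64))

# PCA: eigenvectors of the data covariance matrix span the representation.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
top = eigvecs[:, -5:]                        # the 5 largest eigenvectors

Z = Xc @ top                                 # 5-D internal representation
X_rec = Z @ top.T + X.mean(axis=0)           # reconstruction from the code
err = np.mean((X - X_rec) ** 2)              # small: 5 components suffice
```

Here a 64-dimensional input is summarized by 5 numbers with negligible reconstruction error, illustrating the intrinsic low dimensional representation the text refers to; the difficulty addressed in later chapters is that such task-independent compression is not necessarily good for recognition.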

1.1.3 Data compression

Often, the compression goal is defined as finding a compact data representation that leads to good data reconstruction. Principal Component Analysis (PCA), the Discrete Fourier Transform (DFT) and its generalization, the Wavelet Transform, and advanced best basis representations (Coifman and Wickerhauser, 1992) are examples of compression techniques. Compression may also be realized via an autoencoder network (Cottrell et al., 1987). The autoencoder is a multi layer perceptron (MLP) type of network whose output layer coincides with the input layer and whose hidden layer is of a small size. Recently, a novel type of autoencoder network has been proposed by Zemel (1993). Its hidden layer is allowed to have a large number of hidden units, but different constraints are placed on the developed hidden representation. The network is simultaneously trained to accurately reconstruct the input and to find a succinct representation in the hidden layer, assuming sparse or population code formation in the autoencoder hidden layer.

When the main task is recognition, the compressed data representation has been used instead of the original (high-dimensional) data (Kirby and Sirovich, 1990; Turk and Pentland, 1991; Murase and Nayar, 1993; Bartlett et al., 1998). Recognition from this representation is faster and may have better generalization performance. However, it is clear that such compression is task-independent and may be inappropriate for a specific recognition task (Huber, 1985; Turk and Pentland, 1993). We seek a compact data description that is task-dependent and good for recognition; thus, the quality of the compression scheme is judged by its generalization property. Often, a separate low-dimensional representation is created for every specific task at hand. Another strategy is to discover a hidden representation that is suitable for several potential visual tasks (Intrator and Edelman, 1996). We show that a good task-dependent compression is obtained when the data representation is constructed not only to minimize the mean-squared recognition error, but also to maintain data fidelity and/or to extract good statistical properties. These good properties may be the independence of the hidden neurons, maximum information transfer in the hidden layer, or a multi-modal distribution of the hidden unit activities. In this case, therefore, compression is task-dependent and assisted by a-priori knowledge.

In summary, we investigate lossy compression techniques based on two visual tasks: image recognition and reconstruction. Our goal is to find a hidden representation that optimizes recognition using hints from the reconstruction task.
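A minimal bottleneck autoencoder of the kind discussed above can be sketched as follows. The layer sizes, learning rate and toy data are assumptions for illustration, not the networks used in this thesis:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))            # toy high-dimensional inputs
X = X @ rng.normal(size=(16, 16)) * 0.1   # introduce correlations

n_in, n_hid = 16, 4                       # 4-unit bottleneck hidden layer
W1 = rng.normal(scale=0.1, size=(n_in, n_hid))
W2 = rng.normal(scale=0.1, size=(n_hid, n_in))

lr = 0.5
losses = []
for epoch in range(200):
    H = np.tanh(X @ W1)                   # hidden code (compressed data)
    Y = H @ W2                            # output layer reproduces the input
    diff = Y - X
    losses.append(np.mean(diff ** 2))     # reconstruction MSE
    # Backpropagate the reconstruction error through both layers.
    gY = 2.0 * diff / X.size
    gW2 = H.T @ gY
    gH = (gY @ W2.T) * (1.0 - H ** 2)     # tanh derivative
    gW1 = X.T @ gH
    W1 -= lr * gW1
    W2 -= lr * gW2
```

The decreasing loss shows the 4-unit hidden layer learning a compact code from which the 16-dimensional input can be reproduced; this is exactly the task-independent compression criticized in the text, which the hybrid networks of later chapters augment with a recognition objective.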

1.1.4 Face recognition

The performance of the proposed recognition schemes is examined on two facial data sets. Face recognition has gained much attention in recent years due to a variety of commercial applications, such as video conferencing, security, human communication and robotics. It has recently attracted special attention from human robotics groups that work intensively on creating personal adaptive robots to assist frail, elderly and blind people, and on mobile robots for delivery assistance (Hirukawa, 1997; Connolly, 1997). This recognition task is a very difficult one (Chellapa et al., 1995), since it is a high dimensional classification problem leading to the "curse of dimensionality". It is further complicated by the large variability of facial data sets due to:
• viewpoint dependence
• nonrigidity of the faces
• variable lighting conditions
• motion
Face recognition is a particular case of learning in which the variability of the data describing the same class is comparable to the similarity between different classes. Other important recognition tasks of the same category are the recognition of different kinds of tanks, ships, planes, cars, etc.

1.2 Overview of the thesis

The thesis focuses on developing Neural Network techniques that improve recognition performance. A key aspect of this work is finding data representations that lead to better generalization. We show that networks which are trained to recognize and reconstruct images simultaneously extract features that improve recognition. Improved performance is also achieved when networks are trained to find other statistical structures in the data. The thesis is organized as follows:

Chapter 2: Formulates the recognition task in the framework of the "bias-variance" dilemma. We show that for good generalization ability the variance portion of the generalization error has to be properly controlled. We discuss different methods to control this variance and present two main approaches: reducing the variance via ensemble averaging, and introducing a learning bias. We review different types of learning bias constraints and, finally, propose reconstruction constraints as a novel type of bias constraint in the context of feed-forward networks. Starting from Section 2.5, we discuss the relation between the "bias-variance" dilemma in statistics, the MDL principle and the Bayesian framework. We show that the introduction of a learning bias corresponds to a model-cost in the description length, which has to be minimized along with an error-cost under the MDL principle. At the same time, under the Bayesian framework, the model-cost corresponds to prior knowledge about the weights and hidden representation distributions.

Chapter 3: Introduces a hybrid feed-forward network architecture that uses reconstruction constraints as a bias imposing mechanism for the recognition task. This network, which can be interpreted under the MDL and Bayesian frameworks, modifies the low dimensional representation by concurrently minimizing the mean squared error (MSE) of the reconstruction and classification outputs. In other words, it attempts to improve the quality of the hidden layer representation by imposing a feature selection useful for both tasks, classification and reconstruction. The significance of each task is controlled by a trade-off parameter λ, which is interpreted as a hyper-parameter in the Bayesian framework. Finally, this chapter presents technical details of the network architecture and its learning rule.

Chapter 4: Discusses various information theory principles as constraints for the classification task. We introduce a hybrid neural network whose hidden representation has some useful properties, such as independence between the hidden layer neurons or maximum information transfer in the hidden layer.

Chapter 5: Discusses the face recognition task. We review different Neural Network methods used for face recognition and apply the hybrid networks introduced in Chapters 3–4. This chapter contains technical details related to face normalization and learning procedures. It is shown that the best regularized network is impractical for degraded image recognition, and that integration over different regularization parameters and different initial weights is preferable. This integration is roughly approximated by averaging over network ensembles. We consider three ensemble types: the unconstrained ensemble, corresponding to integration over initial weights with a fixed trade-off parameter λ = 0, i.e. a hidden representation based on the recognition task alone; the reconstruction ensemble, corresponding to integration over different values of the trade-off parameter λ with fixed initial weights; and the joined ensemble, corresponding to integration over both the trade-off parameter λ and the initial weights, obtained by merging the unconstrained and reconstruction ensembles. Classification results on degraded images, such as noisy, partially occluded and blurred images, are presented. We show that the joined ensemble is superior to the reconstruction ensemble, which in turn is superior to the unconstrained ensemble. Finally, we conclude that reconstruction constraints improve generalization, especially under image degradations. In addition, we show that via saliency maps (Baluja, 1996) reconstruction can deemphasize degraded regions of the input, leading to classification improvement under "Salt and Pepper" noise.

Chapter 6: Addresses recognition of blurred and noisy images. In practice, images appear blurred due to motion, weather conditions and camera defocusing.
Several methods that address recognition of blurred images are proposed: (i) expanding the training set with Gaussian blurred images; (ii) constraining the reconstruction of blurred images to the original images during training; (iii) using state of the art restoration methods as preprocessing for the degraded images. Three types of joined ensembles were considered and compared: an ensemble of networks trained on the original training data only, and ensembles trained on the training set expanded with Gaussian blurred images and with reconstruction constraints of two types, where the first is a simple duplication of the input in the output and the second is as described in (ii) above. It was shown that training with blurred images leads to robust classification under different types of blur operations and is more important than the restoration methods.

Chapter 7: Summarizes our research and gives some perspective on its future development, such as:
• Testing the hybrid architecture performance on non-face data sets of similar object images, such as military, medical and astronomical images
• Ensemble interpretation
• Using a recurrent network architecture
• Weighted network ensemble averaging based on the different error types between the input and output reconstruction layers
• Using invariance constraints (tangent prop like, see Chapter 2) as regularization terms for different types of blur operations, for both the recognition and reconstruction tasks
• Generalization of the proposed hybrid network to other types of generative (reconstruction) models constrained by the classification task
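The recognition/reconstruction trade-off that recurs throughout these chapters can be summarized as a single penalized objective blending the two MSE terms through the parameter λ. The following sketch uses hypothetical names and a plain MSE blend; it is a schematic of the idea, not the thesis implementation:

```python
import numpy as np

def hybrid_loss(class_out, targets, recon_out, inputs, lam):
    """Weighted blend of classification and reconstruction MSE.

    lam = 0 gives a purely supervised network; lam = 1 a pure
    autoencoder. Intermediate values impose the reconstruction
    constraint with strength lam (the trade-off hyper-parameter).
    """
    e_class = np.mean((class_out - targets) ** 2)
    e_recon = np.mean((recon_out - inputs) ** 2)
    return (1.0 - lam) * e_class + lam * e_recon

# Dummy activations showing how the blend behaves at several lambdas;
# an "ensemble over lambda" averages networks trained at each value
# instead of estimating one optimal trade-off.
x = np.ones((4, 8))          # inputs
t = np.zeros((4, 3))         # class targets
c = np.full((4, 3), 0.5)     # classification outputs
r = np.full((4, 8), 0.9)     # reconstruction outputs
losses = {lam: hybrid_loss(c, t, r, x, lam) for lam in (0.0, 0.5, 1.0)}
```

Integrating predictions over several λ values, as the joined ensembles do, sidesteps the need to search for a single best trade-off.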

Chapter 2 Statistical formulation of the problem Images as input to Neural Networks are a very high dimensional data with the size equal to the number of pixels in the image. In this case, the number of the network weight parameters is considerably larger than the size of the training set. This leads to the curse of dimensionality (Bellman, 1961), which means that there is not enough data to robustly train a classifier in a high dimensional space. Until recently, estimation in such cases sounded unrealistic, but it is now accepted that such estimation is possible if the actual dimensionality of the input data is much smaller. In other words, a true, intrinsic dimensionality reduction is possible. A simple dimensionality reduction solely via a bottleneck network architecture does not cope with the problem, since a network continues to be an over-parameterized model (i.e. the number of free weight parameters remains large). It is well known that an estimation error is composed of two portions, bias and variance (Geman et al., 1992). The over-parameterized models usually have a small bias (unless they are incorrect), but have high variance, since the available data is always small compared to the number of the free parameters and this leads to a high sensitivity to noise in the training data. To robustify the estimator, the variance portion of the error has to be controlled. One of the ways to control variance is via averaging single estimators trained on the same task. The other method controls variance by introducing a learning bias as constraints on the network architecture. Different types of smoothing constraints are widely spread (Wahba, 1990; Murray and Edwards, 1993; Raviv and Intrator, 1996; Munro, 1997). However, as has been pointed out by Geman et al. (Geman et al., 1992) to solve the bias/variance dilemma innovative bias constraints have to be used. 
Introduction of these constraints into the network model leads naturally to a true dimensionality reduction (Intrator, 1999).


Below, we present the bias-variance dilemma and review methods to control the variance and bias portions of the prediction error. We then propose to use image reconstruction as an innovative bias constraint for image classification. We conclude with a discussion of the relation between the "bias-variance" dilemma in statistics, the MDL principle, and the Bayesian framework.

2.1 Bias-Variance error decomposition for a single predictor

The basic objective of the estimation problem is to find a function f_D(x) = f(x; D), given a finite training set D composed of n input/output pairs, D = \{(x_\mu, y_\mu)\}_{\mu=1}^{n}, x \in R^d, y \in R^1, drawn independently according to an unknown distribution P(x, y), which "best" approximates the "target" function y (Geman et al., 1992). The performance of the estimator is usually evaluated via the mean squared error, taking the expectation with respect to the conditional probability P(y|x):

    E(x; D) \equiv E[(y - f_D(x))^2 \,|\, x, D]
           = \underbrace{E[(y - E[y|x])^2 \,|\, x, D]}_{Var(y|x)}
           + E[(f_D(x) - E[y|x])^2 \,|\, x, D]
           + \underbrace{2\,E[(y - E[y|x])(f_D(x) - E[y|x]) \,|\, x, D]}_{=0}     (2.1.1)

The third term in the sum is equal to zero, since (f_D(x) - E[y|x]) does not depend on y given x and D, and factors out of the expectation, while E[(y - E[y|x]) \,|\, x, D] is equal to zero. The first term does not depend on the predictor f and measures the variability of y given x (in a model with additive independent noise, y = f(x) + \eta(x), this term measures the noise variance at x). The contribution of the second term can be reduced by optimizing f. This term measures the squared distance between the estimator f_D(x) and the mean of y given x, E[y|x]. A good estimator has to generalize well to new sets drawn from the same distribution P(y, x). A natural measure of the estimator's effectiveness is the average error E(x) \equiv E_D[E(x; D)] = E_D[E[(y - f_D(x))^2 \,|\, x, D]] over all possible training sets D of fixed size:

    E(x) = \underbrace{Var(y|x)}_{intrinsic\ error}
         + \underbrace{(E_D[f_D(x)] - E[y|x])^2}_{squared\ bias\ b^2(f|x)}
         + \underbrace{E_D[(f_D(x) - E_D[f_D(x)])^2]}_{variance\ var(f|x)}     (2.1.2)

The first term is the intrinsic error, which cannot be altered. If, on average, fD(x) is different from E[y|x], then fD(x) is biased. As we can see, an unbiased estimator may still have a large mean squared error if the variance is large. Thus, either bias or variance can contribute to poor performance (Geman et al., 1992). When training with a fixed


training set D, reducing the bias with respect to this set may increase the variance of the estimator and contribute to poor generalization performance. This is known as the tradeoff between variance and bias.
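The decomposition (2.1.2) lends itself to a direct numerical check. The sketch below is an illustrative toy, not from the thesis: the target is a sine function, the estimators are polynomials fit with `np.polyfit`, and the outer expectation E_D is approximated by Monte Carlo over many independently drawn training sets.

```python
# Toy Monte Carlo illustration of the bias-variance decomposition (2.1.2).
# Assumed setup: y = sin(x) + Gaussian noise; polynomial estimators.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)

def draw_training_set(n=30, noise_std=0.3):
    x = rng.uniform(-np.pi, np.pi, n)
    return x, true_f(x) + rng.normal(0.0, noise_std, n)

def bias_variance_at(x0, degree, n_sets=500):
    # Approximate E_D[.] by averaging predictions over many training sets D.
    preds = np.array([np.polyval(np.polyfit(*draw_training_set(), degree), x0)
                      for _ in range(n_sets)])
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # (E_D[f_D(x0)] - E[y|x0])^2
    variance = preds.var()                      # E_D[(f_D(x0) - E_D[f_D(x0)])^2]
    return bias_sq, variance

b_low, v_low = bias_variance_at(1.0, degree=1)    # rigid model: larger bias
b_high, v_high = bias_variance_at(1.0, degree=9)  # flexible model: larger variance
print(b_low, v_low, b_high, v_high)
```

The rigid (degree-1) model shows a large squared bias and a small variance; the flexible (degree-9) model shows the opposite, which is exactly the tradeoff discussed above.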

2.2 Variance control without imposing a learning bias

The variance portion of the prediction error can sometimes be reduced without introducing a bias, by ensemble averaging. An ensemble (committee) is a combination of single predictors trained on the same task. For example, in neural networks, an ensemble is a combination of individual networks that are trained separately and whose predictions are then combined. The combination is done by majority or plurality rules in classification (Hansen and Salamon, 1990), or by a weighted linear combination of predictors in regression (Meir, 1994; Naftaly et al., 1997). Under the plurality rule, the ensemble decision is the class that receives the most votes. Under the majority rule, a decision is accepted only when more than half of the networks agree; otherwise the ensemble refuses to classify and an error is reported. The most general method to create an ensemble has been presented by Wolpert (Wolpert, 1992). The method is called stacked generalization: a non-linear network learns how to combine the network outputs with weights that vary over the feature space. It is well known that an ensemble is useful if its individual predictors are independent in their errors, or disagree on some inputs. Thus, the main question is how to find network candidates that achieve this independence. One widespread method to create neural network ensembles is based on the fact that neural networks are non-identifiable models, i.e. the selection of the weights is an optimization problem with many local minima. Thus, a network ensemble is created by varying the set of initial random weights (Perrone, 1993). Another way is to use different types of predictors, such as a mixture of networks with different topology and complexity, or a mixture of networks with completely different types of learning rules (Jacobs, 1997). Yet another way is to train the networks on different training sets.
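The two classification combination rules mentioned above can be stated in a few lines. The sketch below assumes each network outputs a hard class label; the function names are illustrative.

```python
# Plurality vs. majority combination rules for a classification ensemble.
from collections import Counter

def plurality_vote(labels):
    """Return the label receiving the most votes."""
    return Counter(labels).most_common(1)[0][0]

def majority_vote(labels):
    """Return the label agreed on by more than half of the networks,
    or None (rejection) when no label reaches a strict majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

votes = ['cat', 'dog', 'cat', 'bird', 'cat']
print(plurality_vote(votes))                  # 'cat' (most votes)
print(majority_vote(votes))                   # 'cat' (3 of 5 is more than half)
print(majority_vote(['cat', 'dog', 'bird']))  # None: the ensemble rejects
```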
Below, a bias-variance error decomposition for a weighted linear combination of predictors is presented (Raviv, 1998; Tesauro et al., 1995). Let us consider M predictors f_i(x, D_i), each trained on a training set D_i. All training sets have the same size and are drawn from the same joint distribution P(y, x). Consider the ensemble based on a linear combination of the predictors:

    f_{ens}(x) = \sum_i a_i f_i(x, D_i), \qquad \sum_i a_i = 1, \quad a_i \ge 0, \quad i = 1, 2, \ldots, M.     (2.2.3)

The normalization condition \sum_i a_i = 1 is imposed to make the ensemble unbiased when each individual estimator f_i is unbiased. Let us consider the error (2.1.2) for this ensemble:

    E_{ens}(x) = Var(y|x) + b^2(f_{ens}|x) + var(f_{ens}|x),     (2.2.4)

where the bias b(f_{ens}|x) is given by:

    b(f_{ens}|x) = E_{D_1, D_2, \ldots, D_M}\Big[\sum_i a_i f_i(x, D_i) - E[y|x]\Big]
                 = \sum_i a_i E_{D_i}\big[f_i(x, D_i) - E[y|x]\big]
                 = \sum_i a_i b(f_i|x).     (2.2.5)

Thus the bias of the ensemble is the same linear combination of the biases of the individual estimators. Expanding the ensemble variance term we get:

    var(f_{ens}|x) = E_{D_1, \ldots, D_M}\Big[\Big\{\sum_i a_i f_i(x, D_i) - E_{D_1, \ldots, D_M}\Big[\sum_i a_i f_i(x, D_i)\Big]\Big\}^2\Big]
      = E_{D_1, \ldots, D_M}\Big[\Big(\sum_i a_i f_i(x, D_i) - \sum_i a_i E_{D_i}[f_i(x, D_i)]\Big)^2\Big]
      = E_{D_1, \ldots, D_M}\Big[\Big(\sum_i a_i \big(f_i(x, D_i) - E_{D_i}[f_i(x, D_i)]\big)\Big)^2\Big]
      = E_{D_1, \ldots, D_M}\Big[\sum_i a_i^2 \big(f_i(x, D_i) - E_{D_i}[f_i(x, D_i)]\big)^2
          + 2 \sum_{i>j} a_i a_j \big(f_i(x, D_i) - E_{D_i}[f_i(x, D_i)]\big)\big(f_j(x, D_j) - E_{D_j}[f_j(x, D_j)]\big)\Big]
      = \sum_i a_i^2 var(f_i|x) + 2 \sum_{i>j} a_i a_j E_{D_i, D_j}\big[(f_i - E_{D_i}[f_i])(f_j - E_{D_j}[f_j])\big]

Finally, we get the following expression for the ensemble error:

    E_{ens}(x) = Var(y|x) + \Big(\sum_i a_i b(f_i|x)\Big)^2 + \sum_i a_i^2 var(f_i|x)
               + 2 \sum_{i>j} a_i a_j E_{D_i, D_j}\big[(f_i - E_{D_i}[f_i])(f_j - E_{D_j}[f_j])\big]     (2.2.6)

If all estimators are unbiased, uncorrelated and have identical variances, simple averaging with equal weights a_i = 1/M leads to the following ensemble error (Raviv, 1998):

    E(x) = Var(y|x) + b^2(f|x) + \frac{1}{M} var(f|x).
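Both the 1/M variance reduction and the benefit of inverse-variance weighting can be checked numerically. The sketch below uses toy Gaussian "predictors" that are unbiased and uncorrelated by construction; it is an illustration, not an experiment from the thesis.

```python
# Numerical check of the 1/M variance reduction for equal weights a_i = 1/M,
# and of the advantage of weights a_i proportional to 1/var(f_i|x).
import numpy as np

rng = np.random.default_rng(1)
M, trials = 10, 200_000
var_single = 1.0

# Each row holds M independent, unbiased predictions f_i(x); the ensemble
# with equal weights is simply their mean.
preds = rng.normal(0.0, np.sqrt(var_single), size=(trials, M))
ens = preds.mean(axis=1)
print(preds[:, 0].var())  # ~ var(f|x) = 1
print(ens.var())          # ~ var(f|x) / M = 0.1

# Unequal variances: inverse-variance weights beat uniform averaging.
sigmas = np.array([0.5, 1.0, 2.0])
preds2 = rng.normal(0.0, sigmas, size=(trials, 3))
w = 1.0 / sigmas**2
w /= w.sum()
uniform_var = preds2.mean(axis=1).var()
weighted_var = (preds2 * w).sum(axis=1).var()
print(weighted_var < uniform_var)  # uncertain predictors get smaller weights
```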

2.3 Variance control by imposing a learning bias

The regression function E[y|x] is the best estimator. In order to find an unbiased estimator, the family of candidate estimators has to be rich. In MLP (multi-layer perceptron) networks, this may be attained by growing the network architecture. This eliminates bias, but increases variance unless the training data is infinite. In practice, the training data is finite, and the main question is how to make both bias and variance "small" using finite training sets (Geman et al., 1992). Geman et al. point out that under this limitation the learning task is to generalize in a very nontrivial sense, since the training data will never "cover" the space of possible inputs. Such extrapolation is possible if the important generalizations are prewired into the learning algorithm by purposefully introducing a bias. The most general and weakest a-priori constraints assume that the mapping is smooth. Other, stronger a-priori constraints may be expressed as invariance of the mapping to some group of transformations, or as an assumption about the class of possible mappings. Another type of specific bias constraint appears when a supervised task is learned in parallel with other related tasks. One way to categorize different types of constraints into two groups, variance and bias constraints, has been proposed in (Intrator, 1999). Both types of constraints serve to reduce the variance portion of the generalization error; however, they have a different effect on the bias portion of the error. Variance constraints always result in an increase of the bias portion of the error. In contrast, bias constraints assist in learning and may even reduce the bias portion of the error. When networks are trained to satisfy the constraints alone, bias constraints lead to a meaningful hidden representation, capturing the structure of the input domain, while a hidden representation extracted via variance constraints is less interesting.

2.3.1 Smoothness constraints

The easiest way to smooth the mapping approximated by a neural network is by controlling network structure parameters such as the number of hidden units and hidden layers. The larger the number of network units, the larger the number of weight fitting parameters. Over-parameterized models are highly flexible and reduce bias; however, they are sensitive to noise, which leads to a large variance and a large generalization error. Another way to control smoothness in neural networks, borrowed from spline theory (Wahba, 1990), is weight decay. This involves adding a penalty term controlling the weight norm to the network cost function E = \sum_i \| y_i - f(x_i, \omega) \|^2 (other forms of cost functions are presented in (Bishop, 1995a)):

    E_\lambda = E + \lambda \| \omega \|^2,

where x_i and y_i are the suitably scaled input and output samples (\| z \| is the norm in the space of the element z). A tightly related approach is to constrain the range of the weights to some middle values. This method is called weight elimination, and its regularization term has the form \lambda \sum_i \omega_i^2 / (\omega_i^2 + \omega_{i0}^2). A direct approach is to consider a regularizer which penalizes curvature explicitly:

    E_\lambda = E + \lambda \| P f \|^2,

where P is a differential operator. Another way to control smoothness is to inject noise during learning. The noise is usually added to the training data (Bishop, 1995a; Raviv and Intrator, 1996), but may also be added to the hidden units (Munro, 1997) or the weights (Murray and Edwards, 1993). It has been shown (Bishop, 1995b) that learning with input noise is equivalent to Tikhonov (direct curvature) regularization. Though smoothness constraints bias toward smooth models, they are essentially variance constraints.
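As a minimal sketch of the weight-decay penalty E_\lambda = E + \lambda \|\omega\|^2, the following fits a linear model by gradient descent on the penalized cost. The data, dimensions and names (`fit_ridge`, `lam`) are illustrative assumptions, not from the thesis.

```python
# Weight decay on a linear model: gradient descent on ||y - Xw||^2 + lam*||w||^2.
import numpy as np

rng = np.random.default_rng(2)

def grad(w, X, y, lam):
    # Gradient of the penalized cost E_lambda = ||y - Xw||^2 + lam * ||w||^2.
    return -2 * X.T @ (y - X @ w) + 2 * lam * w

def fit_ridge(X, y, lam, lr=0.005, steps=4000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * grad(w, X, y, lam)
    return w

X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(0, 0.1, 50)

w_free = fit_ridge(X, y, lam=0.0)    # plain least squares
w_decay = fit_ridge(X, y, lam=10.0)  # weight decay shrinks the weight norm
print(np.linalg.norm(w_free), np.linalg.norm(w_decay))
```

For the quadratic cost this converges to the closed-form ridge solution (X^T X + \lambda I) w = X^T y, so the effect of the penalty can be verified exactly.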

2.3.2 Invariance bias constraints

Given infinite training data and unlimited training time, a network can learn the regression function. However, in practice the data is rather limited, and this limitation may be overcome by imposing a bias in the form of invariance constraints. One way to implement this regularization is by training the system with additional data, obtained by distorting (translating, rotating, etc.) the original patterns (Baird, 1990; Baluja, 1996) while leaving the corresponding targets unchanged. This procedure, called the distortion model, has two drawbacks. First, the magnitude of the distortion and the number of artificially degraded patterns have to be defined. Second, the generated data is correlated with the original training data. This type of regularization is referred to as data driven regularization (Raviv, 1998). An alternative way is to impose invariance constraints by adding a regularization term to the mean squared error E (Simard et al., 1992). The regularization term penalizes changes in the output when the input is transformed under the invariance group. Let x be an input, y = f(x, w) the input-output function of the network, and s(\alpha, x) a transformation parameterized by some parameter \alpha, such that s(0, x) = x. The invariance condition for every pattern x_\mu is written as:

    f(s(\alpha, x_\mu), w) - f(s(0, x_\mu), w) = 0     (2.3.7)

For an infinitesimal \alpha, the latter constraint may be rewritten as:

    \frac{\partial f(s(\alpha, x_\mu), w)}{\partial \alpha}\Big|_{\alpha=0} = 0, \qquad \text{or} \qquad
    f_x(x_\mu, w) \cdot t_\mu = 0, \qquad t_\mu = \frac{\partial s(\alpha, x_\mu)}{\partial \alpha}\Big|_{\alpha=0},     (2.3.8)

where f_x is the Jacobian (matrix) of the estimator f for a pattern x_\mu, and t_\mu is the tangent vector associated with the transformation s. The penalty term is written as \Omega(f, w) = \sum_\mu \| f_x \cdot t_\mu \|^2, and the penalized function is E_\lambda = E + \lambda \Omega(f, w). This regularization term states that the function f should have zero derivatives in the directions defined by the invariance group; the method is called tangent prop. Tangent prop is an infinitesimal form of the invariance "hint" proposed by Abu-Mostafa (Abu-Mostafa, 1993). The conditions for equivalence between adding distorted examples and a regularized cost function are presented in (Leen, 1995). In particular, it is shown there that smoothing regularizers may be obtained as a special case of a random shifting invariance group, s(x, \alpha) = x + \alpha, where \alpha is a Gaussian variable with a spherical covariance matrix. Obviously, non-trivial invariance constraints belong to the bias type of constraints.
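The penalty \Omega(f, w) = \sum_\mu \| f_x \cdot t_\mu \|^2 can be sketched for a toy one-unit "network" under a one-sample circular shift group; the tangent vector t_\mu is approximated by a finite difference. All names and the setup are illustrative assumptions, not the thesis implementation.

```python
# Tangent prop penalty for shift invariance, with a finite-difference tangent.
import numpy as np

rng = np.random.default_rng(3)

def f(x, W):
    """Toy network output: a single linear unit through tanh."""
    return np.tanh(W @ x)

def shift(x, alpha):
    """Transformation group s(alpha, x): circular shift by alpha samples."""
    return np.roll(x, int(alpha))

def tangent_vector(x, eps=1):
    # t_mu = d s(alpha, x)/d alpha at alpha = 0, finite-difference approximation.
    return (shift(x, eps) - x) / eps

def tangent_prop_penalty(X, W):
    penalty = 0.0
    for x in X:
        t = tangent_vector(x)
        # Jacobian of tanh(W x) w.r.t. x is (1 - tanh^2(Wx)) W; project on t.
        jac_t = (1 - np.tanh(W @ x) ** 2) * (W @ t)
        penalty += jac_t ** 2
    return penalty

X = rng.normal(size=(20, 16))
W_random = rng.normal(size=16)
W_invariant = np.ones(16)  # constant weights: the output ignores circular shifts
print(tangent_prop_penalty(X, W_invariant))  # ~0: f is shift-invariant
print(tangent_prop_penalty(X, W_random))     # > 0: f changes under shifts
```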

2.3.3 Specific bias constraints

These constraints express our a-priori heuristic knowledge about the problem. A combination of the Exploratory Projection Pursuit (EPP) method with Projection Pursuit Regression (PPR) in feed-forward neural networks (Intrator, 1993a; Intrator et al., 1996; Intrator, 1999), and the multi-task learning (MTL) method (Caruana, 1995), are examples of this type of bias constraint.

Hybrid EPP/PPR neural networks

PPR is a method that performs dimensionality reduction by approximating the desired function as a composition of lower dimensional smooth functions that act on linear projections of the input data (Friedman, 1987). In other words, PPR tries to approximate the best estimator, that is, the regression function f(x) = E[Y|X = x], from observations D = \{(x_\mu, y_\mu)\}_{\mu=1}^{n} by a sum of ridge functions g_j (functions that are constant along lines):

    f(x) \approx \sum_{j=1}^{m} g_j(a_j \cdot x), \qquad j = 1, \ldots, m.     (2.3.9)

In feed-forward neural networks, the ridge functions are set in advance (as logistic sigmoidal, for example) and the output is approximated as

    f(x) \approx \sum_{j=1}^{m} \beta_j \sigma(a_j \cdot x), \qquad j = 1, \ldots, m, \qquad x, a_j \in R^d,     (2.3.10)

where the input vector x is usually extended by adding an additional component equal to 1. Thus, in neural networks only the projection directions a_j and the coefficients \beta_j have to be estimated. However, when the input is high-dimensional, even dimensionality reduction neural networks (m \ll d) are over-parameterized models that require additional regularization constraints. The already considered smoothness constraint is one way to reduce the variance of the network. Another way, imposing bias constraints related to the data structure, has been proposed by Intrator (Intrator, 1993a). The idea is to train a network (via a back-propagation algorithm) to fit the desired output and, simultaneously, to extract a low-dimensional structure of the data using EPP (Friedman, 1987). EPP is an unsupervised method that searches the high dimensional space for directions with good clustering properties, characterized by projection indices. An example of combining supervised and unsupervised learning using a BCM (Bienenstock, Cooper and Munro) neuron (Bienenstock et al., 1982; Intrator and Cooper, 1992) has been proposed in (Intrator, 1993b). This neuron is trained by minimizing a specific projection index that emphasizes the multimodality in the data. Computationally, the EPP constraints are expressed as minimization of a function \rho(w) measuring the quality of the input after projection and a possible nonlinear transformation \phi: \rho(w) \equiv E[H(\phi(w \cdot x))], where \phi(w \cdot x) is a hidden representation A of the network, H is a function measuring the quality of the hidden representation, and averaging takes place over an ensemble of the input. The EPP constraints are introduced by modifying the synaptic weight learning rule:

    \frac{\partial w_{ij}}{\partial t} = -\Big[\frac{\partial E(w, x)}{\partial w_{ij}} + \frac{\partial \rho(w)}{\partial w_{ij}} + C\Big],     (2.3.11)

where C is an additional complexity penalty term, such as smoothness constraints or the number of learning parameters.

Multi-task learning (MTL)

Another attractive and intuitive way to conceive different types of bias constraints is MTL, a widespread method in machine learning. It proposes to learn additional tasks, defined on the same data domain as the main task, in order to improve the generalization ability of the latter. Though the MTL idea is borrowed from the observation that humans
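The ridge-function expansion (2.3.10), shared by PPR and one-hidden-layer networks, can be written out directly. Below is a minimal sketch with sigmoidal ridge functions and random illustrative parameters.

```python
# f(x) ~= sum_j beta_j * sigma(a_j . x): a sum of m sigmoidal ridge functions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge_expansion(x, A, beta):
    """Rows of A are the projection directions a_j; beta holds the coefficients."""
    return beta @ sigmoid(A @ x)

rng = np.random.default_rng(4)
d, m = 5, 3
A = rng.normal(size=(m, d + 1))         # directions a_j (bias column included)
beta = rng.normal(size=m)               # output coefficients beta_j
x = np.append(rng.normal(size=d), 1.0)  # input extended by a component equal to 1

y_out = ridge_expansion(x, A, beta)
print(y_out)  # a scalar: the sum of m ridge functions of linear projections
```

Only A (the projections a_j) and beta are free parameters here, exactly as stated after (2.3.10).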

2.4 Reconstruction bias constraints

As shown above in Section 2.3.3, feed-forward neural networks, which require estimation of many parameters, are subject to the bias/variance dilemma. We have also seen, in Sections 2.2–2.3, that different ways to control the bias and variance portions of the predictor error exist. However, when the dimensionality of the input is very high, innovative ways to reduce the variance portion of the error, as well as methods to impose (reasonable) bias, are required. In this thesis, continuing the previous line of study, we propose a new kind of specific bias constraint for image classification feed-forward networks, in the form of image reconstruction. We also consider new information theory constraints, seeking diverse structure in the data, and compare the effect of the different constraints on the generalization performance of the classification neural network. Below, we discuss the Bayesian and minimum description length (MDL) frameworks for learning in neural networks. We show that the bias-variance dilemma can be naturally reformulated in the MDL framework, where learning constraints emerge as a model-cost that has to be minimized along with an error-cost, represented as the mean squared error (MSE) on the main learning task.
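Schematically, the proposed reconstruction constraint amounts to a network with one shared hidden layer and two output heads, trained on the sum of the recognition and reconstruction MSE terms. The sketch below is an illustrative toy (the dimensions, initialization and weighting `lam` are assumptions), not the hybrid architecture implemented in later chapters.

```python
# A two-headed network sharing one hidden layer: recognition + reconstruction.
import numpy as np

rng = np.random.default_rng(5)
d, h, c = 8, 3, 2                            # input dim, hidden units, classes
W_h = rng.normal(scale=0.1, size=(h, d))     # shared input-to-hidden weights
W_cls = rng.normal(scale=0.1, size=(c, h))   # recognition head
W_rec = rng.normal(scale=0.1, size=(d, h))   # reconstruction head

def forward(x):
    a = np.tanh(W_h @ x)                 # shared hidden representation
    return W_cls @ a, W_rec @ a, a       # class scores, reconstruction, hidden

def hybrid_loss(x, y_onehot, lam=1.0):
    y_hat, x_hat, _ = forward(x)
    recog = np.sum((y_onehot - y_hat) ** 2)  # recognition MSE (error-cost)
    recon = np.sum((x - x_hat) ** 2)         # reconstruction MSE (bias constraint)
    return recog + lam * recon

x = rng.normal(size=d)
y = np.array([1.0, 0.0])
print(hybrid_loss(x, y, lam=0.0))  # pure recognition loss
print(hybrid_loss(x, y, lam=1.0))  # recognition plus reconstruction constraint
```

Gradients of this combined loss with respect to W_h would mix both tasks, which is the bias-imposing mechanism described in the MTL discussion above.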

2.5 Minimum Description Length (MDL) Principle

In the MDL formulation, one searches for a model that allows the shortest encoding of the data, together with a description of the model itself (Rissanen, 1985). One of the first perspectives for applying the MDL principle in neural networks was pointed out by Nowlan and Hinton (1992) for supervised learning. In supervised learning, the output y is predicted from the input x, which is presented at the input layer. The network model is defined by the weight parameters. Thus, to specify the desired output y given x, the weights and the errors in the output layer have to be described. If the output errors are assumed to be Gaussian, the number of bits needed to describe the errors is proportional to the mean-squared recognition error. The weights are encoded using different weight probability models, and their description length is the negative log of the weight probabilities. The weight description length is equivalent to different complexity terms, and the MDL principle leads to a regularization approach in neural networks. For example, a Gaussian probabilistic model leads to the weight decay regularization term (see Section 2.7). A more sophisticated form of weight decay is obtained when the weights are encoded as a mixture of Gaussians (Nowlan and Hinton, 1992). Later, the MDL principle was applied to unsupervised learning, in particular to autoencoder networks (Zemel, 1993) (see also Section 3.1.2). The autoencoder network is a feed-forward network which reproduces the observed input in its output layer. The autoencoder network has a natural interpretation in the MDL framework (Hinton and Zemel, 1994): it discovers an efficient way to communicate data to a receiver. A sender uses a set of input-to-hidden weights and, in general, non-linear activation functions to convert the input into a compact hidden representation. This representation has to be communicated to the receiver along with the reconstruction errors and the hidden-to-top weights.
Receiving the hidden-to-top weights, the receiver reconstructs the input from this abstract representation and the communicated errors. The description length in this case consists of three parts:
1. The set of activities A of the representation units. These are the codes that the net assigns to each training input sample. Encoding the activities of the representation (hidden) units makes it possible to avoid communicating the input-to-hidden weights and does not require knowledge of the input data X. However, the sender and the receiver have to agree on the a-priori distribution of the internal representation. This part of the message corresponds to the representation-cost.
2. The set of hidden-to-output weights W. This part of the message is represented by the weight-cost.


3. The reconstruction error, which is the disagreement between the desired and predicted outputs. This part of the message is represented by the reconstruction- or error-cost. In order to evaluate the latter, the sender and receiver have to agree on the probability of the desired output of the network given its actual output.

In the standard autoencoder, the weight cost is neglected and the representation cost is considered to be small and proportional to the number of network hidden units, since it is assumed that all units participate equally in the data representation. Instead of a direct evaluation of the representation code, the autoencoder with a bottleneck in the hidden layer is trained to minimize the MSE reconstruction error. In contrast, in the nonstandard versions of autoencoders (Zemel, 1993), the representation cost is evaluated explicitly, and its minimization encourages a sparse distributed representation in which only a few neurons are active, those responsible for the presence of specific features in the patterns. The main difference between the MDL principle for supervised learning and for the unsupervised learning proposed by Zemel may be understood by considering an unlimited number of training samples. When the number of patterns is infinite, the model cost in supervised learning, which is the cost of the weights, is negligible. In contrast, in unsupervised learning the model cost never vanishes, and the MDL principle is applied per sample to minimize the representation cost and to maintain data fidelity. In this thesis, we combine supervised and unsupervised learning in a hybrid reconstruction/recognition network and formulate the MDL principle for this case (see Section 3.2.3). It turns out that this interpretation is three-fold, depending on what is defined as the main task:
1. When the main task is reconstruction (Gluck and Myers, 1993, a hippocampus model), the reconstruction MSE is an error cost and the recognition MSE is a model cost (or a representation cost, since the MSE recognition error depends on the hidden layer representation and on the recognition top weights, which should not affect the description length). Thus, the network maintains data fidelity and encourages a representation with good discriminative properties.
2. When the main task is recognition and it is assumed that the sender observes both the input and the output, while the receiver sees only the input, the recognition MSE is an error cost, as in supervised learning, and the reconstruction MSE is a model cost (or a representation cost). However, in contrast to standard supervised learning, the representation cost never vanishes.


3. When the main task is recognition, but the receiver sees neither x nor y, the receiver has to reconstruct x and predict y in parallel. Thus, the sender encodes x, taking into account also the dependence of y on x, and sends the encoded data together with the errors of the recognition and reconstruction outputs, since in supervised learning the task is to predict y for a given x. In this case, both the recognition and the reconstruction MSE stand for error costs, and the representation cost is kept small by restricting the number of hidden units.

2.5.1 Minimum description length

MDL can be formulated in terms of an imaginary communication game, in which a sender observes the data D and communicates it to a receiver. Having observed the data, the sender discovers that the data has some regularity that can be captured by a model M. This encourages the sender to encode the data using the model, instead of sending the data as it is. Due to noise, there are always aspects of the data which are unpredicted by the model; these can be seen as errors. Both the errors and the model have to be conveyed to the receiver to enable him to reproduce the data. The goal of the sender is to encode the data so that it can be transmitted as accurately and compactly as possible. Clearly, complex models achieve high accuracy, but their description is expensive; models which are too simple, or wrong, are unable to extract the data regularity. Intuitively, such a communication game can be thought of as a tradeoff between the compactness of the model and its accuracy. To transmit the data, the sender composes a message consisting of two parts. The first part, of length L(M), specifies the model, and the second, of length L(D|M), describes the data D with respect to the model M. The goal of the sender is to find a model that minimizes the length of this encoded message L(M, D), called the description length:

    L(M, D) = L(D|M) + L(M).     (2.5.12)

According to Shannon's theory (Shannon, 1948; Cover and Thomas, 1991), to encode a random variable X with known distribution p(X) using the minimum number of bits, a realization x has to be encoded by -\log p(x) bits. Thus the description length (2.5.12) can be represented as:

    L(M, D) = -\log p(D|M) - \log p(M),     (2.5.13)

where p(D|M) is the probability of the output data given the model, and p(M) is an a-priori model probability. The MDL principle requires searching for a model M^\star that minimizes the description length (2.5.13):

    M^\star = \arg\min_M \big(-\log p(D|M) - \log p(M)\big).     (2.5.14)

As we have seen in Section 2.1, in supervised learning the problem is to find a model that describes the output y as a function of the input x, based on the available input/output pairs D = \{(x_\mu, y_\mu)\}_{\mu=1}^{n}. In a standard application of MDL to supervised learning, the output y is treated as the data D that has to be communicated between the sender and the receiver, while the input data X is assumed to be known to both of them. Therefore, all the probabilities in formula (2.5.13) are conditioned on the input data, i.e. p(M) \equiv p(M|X) and p(D|M) \equiv p(D|M, X). However, to simplify the notation we omit X in these expressions. The connection between MDL and the Bayesian theory for neural networks is demonstrated in the next section.
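A toy two-part calculation in the spirit of (2.5.13) makes the tradeoff concrete: choose a polynomial degree for noisy data by minimizing an error-cost plus a model-cost, in bits. The Gaussian residual code and the 16-bits-per-parameter model cost below are arbitrary illustrative assumptions, not choices made in the thesis.

```python
# Two-part MDL model selection: -log2 p(D|M) (Gaussian residual code, in bits)
# plus a crude L(M) charging a fixed number of bits per polynomial coefficient.
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(-1, 1, 60)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.1, x.size)  # true degree: 2

def description_length(degree, bits_per_param=16.0):
    w = np.polyfit(x, y, degree)
    resid = y - np.polyval(w, x)
    sigma2 = max(resid.var(), 1e-12)
    # Gaussian code length of the residuals (bits, up to a fixed quantization)
    error_cost = 0.5 * x.size * np.log2(2 * np.pi * np.e * sigma2)
    # Model cost L(M): bits_per_param bits for each of the degree+1 coefficients
    model_cost = bits_per_param * (degree + 1)
    return error_cost + model_cost

lengths = {deg: description_length(deg) for deg in range(9)}
best = min(lengths, key=lengths.get)
print(best)  # expected to be near the true degree 2
```

Degrees that are too low pay heavily in the error part; degrees that are too high pay in the model part, mirroring the bias-variance discussion of Section 2.1.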

2.6 Bayesian framework

In the Bayesian framework, one seeks a model that maximizes the posterior probability of the model M given the observed input/output data (X, D):

    p(M|D, X) = \frac{p(D|M, X)\, p(M|X)}{p(D|X)}.     (2.6.15)

Usually, in feed-forward networks trained by supervised learning, the distribution of the input data p(x) is not modeled (in Section 3.2.3 we will consider the effect of such modelling). Thus, in (2.6.15), X always appears as a conditioning variable, which we omit to simplify the notation (similarly to the convention accepted for the description length evaluation):

    p(M|D) = \frac{p(D|M)\, p(M)}{p(D)}.     (2.6.16)

Since p(D) does not depend on the model, and the most plausible model M^\star has to minimize the negative logarithm of the posterior probability, we get:

    M^\star = \arg\min_M \big[-\log p(D|M) - \log p(M)\big].     (2.6.17)

Usually, to apply both the MDL and Bayesian frameworks, one decides in advance on a class of parameterized models and then searches within this class of parameters to optimize the corresponding criterion. The probability of the data, given a model parameterized by w, can be computed by integrating over the model parameter distribution:

    p(D|M) = \int p(D|M, w)\, p(w|M)\, dw.     (2.6.18)


Using the Bayesian formula we get:

    p(w|M, D) = \frac{p(w, D|M)}{p(D|M)} = \frac{p(D|M, w)\, p(w|M)}{p(D|M)},     (2.6.19)

which shows that the posterior probability of the weights p(w|M, D) is proportional to p(D|M, w) p(w|M). It is usually assumed that the posterior probability of the weights p(w|M, D) is highly peaked at the most plausible parameter w^\star, so that the integral (2.6.18) may be approximated by the height of the peak of the integrand p(D|M, w) p(w|M) times the width of this distribution, \Delta w|_{M,D} (MacKay, 1992):

    p(D|M) \approx \underbrace{p(D|w^\star, M)}_{best\ fit\ likelihood} \times \underbrace{p(w^\star|M)\, \Delta w|_{M,D}}_{Occam\ factor}     (2.6.20)

The quantity \Delta w|_{M,D} is the posterior uncertainty in w. Assuming that the prior p(w^\star|M) is uniform on some large interval \Delta_0 w, representing the range of values of w that the model M admits before seeing the data D, p(w^\star|M) simplifies to

    p(w^\star|M) \approx \frac{1}{\Delta_0 w}, \qquad \text{and} \qquad \text{Occam factor} = \frac{\Delta w}{\Delta_0 w}.     (2.6.21)

Thus the Occam factor is the ratio of the posterior accessible volume of the model parameter space to the prior accessible volume. Typically, a complex model with many parameters has a larger prior weight uncertainty \Delta_0 w. Thus, the Occam factor is smaller and it penalizes the complex model more strongly (MacKay, 1992). Another interpretation of the Occam factor is obtained by viewing the model M as composed of a certain number of equivalent sub-models. When the data arrive, only one sub-model survives, and thus the Occam factor appears to be inversely proportional to the number of sub-models; -\log(\text{Occam factor}) is then the maximal number of bits required to indicate this remaining sub-model. Using the Occam factor (2.6.21), the condition (2.6.17) states that the most plausible model has to minimize the description length:

    L(M, D) = \underbrace{-\log p(D|w^\star, M)}_{inaccuracy\ for\ the\ best\ parameters} \; \underbrace{-\log p(M) - \log(\text{Occam factor})}_{model\ complexity}     (2.6.22)

The first term in (2.6.22) is the ideal shortest message that encodes the data D using w? and characterizes inaccuracy of the model prediction for the best parameters. The second term characterizes the complexity of the model. The more complex the model is, the less is the discrepancy between the data and their prediction, but this accuracy is achieved at the expense of the model description. This relationship between a model accuracy and complexity is tightly related to the bias-variance dilemma considered in

Chapter 2: MDL and Bayesian principles


the previous section. We have seen that the introduction of many parameters leads to a better accuracy (decreases bias), but incurs high variance. Thus MDL and the Bayesian approach offer a natural way to resolve the dilemma by seeking a model with a good generalization ability. Another MDL interpretation of (2.6.20) is straightforward:

L(D, M) = − log p(D|w*, M) − log p(w*|M) − log ∆w|M,D − log p(M),    (2.6.23)

where the first term is the error-cost, the second the weight-cost and the third the precision-cost.

The first term in (2.6.23) is the length of the ideal shortest message that encodes the data D using the best parameters w*. The second term is the number of bits required to encode the best model parameters. In addition, the negative logarithm of the uncertainty about the parameters after observing the data (− log ∆w|M,D) penalizes models which have to be described with a high precision to fit the data. Usually, the third component is neglected, since model parameters are communicated only once, while the data arrive one after another. A way to take the third component into account in neural networks, while neglecting the second term (which describes the a-priori knowledge about the model parameters), has been considered in (Hochreiter and Schmidhuber, 1997).
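As a toy numerical illustration of Eqs. (2.6.20)-(2.6.22) — all widths and likelihood values below are made up for illustration, not taken from any experiment — the evidence comparison can be sketched as:

```python
import math

def log_evidence(log_best_fit_likelihood, posterior_width, prior_width):
    """log p(D|M) ~ log(best-fit likelihood) + log(Occam factor), Eqs. (2.6.20)-(2.6.21)."""
    occam_factor = posterior_width / prior_width   # Delta_w / Delta_0_w
    return log_best_fit_likelihood + math.log(occam_factor)

# A complex model fits slightly better, but admits a much wider prior range
# of parameters, hence a smaller Occam factor and a lower evidence.
log_ev_simple = log_evidence(-50.0, posterior_width=0.1, prior_width=1.0)
log_ev_complex = log_evidence(-48.0, posterior_width=0.1, prior_width=100.0)
assert log_ev_simple > log_ev_complex

# -log2(Occam factor): bits needed to indicate the surviving sub-model
bits_complex = -math.log2(0.1 / 100.0)
```

The simple model wins despite its worse fit, because its Occam factor is a thousand times larger.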

2.7 MDL in the feed-forward NN

A feed-forward neural network is an example of a parameterized model; it is represented graphically as a feed-forward diagram of several layers of activation units, connected by so-called synaptic weights that represent the model parameters. The network architecture allows the output data to be evaluated as a function of the input data. The network is supplied with the input data, presented in the lowest (input) layer. The input is successively propagated forward through the hidden layers, using the weights and the units' activation functions, to produce the output data D in the top (output) layer. The network weights, the number of hidden units and the activation functions are the main parameters that define the network complexity. It is often assumed that the network architecture is already defined and the main problem is to find the weight parameters. Implementing the MDL principle in neural networks is easy. For simplicity we consider training a single-hidden-layer feed-forward neural network (Figure 2.1). Neglecting the third term in the description length (2.6.23) and assuming that the models have the same


Figure 2.1: Feed-forward supervised network (input X → hidden weights w → hidden representation A → top weights W → output). A single arrow between two layers indicates that the units of both layers are fully connected.

a-priori probabilities p(M), an optimal weight vector has to minimize²:

L(M, D) = − log p(D|w, W, M) − log p(w, W|M) + const.    (2.7.24)

The first term in this expression is the error-cost of specifying the data for the given weights, i.e. the cost of specifying the errors between the true outputs and those predicted by the model with the given weights. The second term is the model-cost. To evaluate the error-cost, the receiver and the sender have to agree on the specific form of the conditional distribution of the output t ∈ R^n. Under the assumption of independent additive Gaussian noise with zero mean in the output layer, the posterior probability of the output is given by:

p(t|x, w, W) = (1 / C^n(λ)) exp(−(λ/2) ‖t̂(x, w, W) − t‖²),    (2.7.25)

where C(λ) = √(2π/λ) and the parameter λ is inversely proportional to the Gaussian variance (λ = 1/σ²). Provided the samples are drawn independently from the distribution (2.7.25) we get:

p(D|w, W, M) = ∏_{i=1}^{r} p(t_i | x_i, w, W),    (2.7.26)

² We have omitted the * super-index for convenience.

Chapter 2: MDL and Bayesian principles

24

where r is the number of training samples. The assumptions (2.7.25) and (2.7.26) produce

p(D|w, W, M) = (1 / C^{nr}(λ)) exp(−(λ/2) E_D),   where   E_D = Σ_{i=1}^{r} ‖t̂(x_i, w, W) − t_i‖².    (2.7.27)
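The identity between the product of Gaussian densities (2.7.25)-(2.7.26) and the sum-of-squares form (2.7.27) can be checked numerically; a small sketch, assuming NumPy and SciPy are available (the sizes and the value of λ are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
lam, n, r = 4.0, 3, 5                    # lambda = 1/sigma^2; output dim; sample count
t_hat = rng.normal(size=(r, n))          # predicted outputs t_hat(x_i, w, W)
t = t_hat + rng.normal(scale=lam ** -0.5, size=(r, n))   # targets with Gaussian noise

# left-hand side: log of the product of per-sample Gaussian densities (2.7.26)
log_p = norm.logpdf(t, loc=t_hat, scale=lam ** -0.5).sum()

# right-hand side: -(lambda/2) E_D - n r log C(lambda), from (2.7.27)
E_D = np.sum((t_hat - t) ** 2)
C = np.sqrt(2 * np.pi / lam)
assert np.isclose(log_p, -0.5 * lam * E_D - n * r * np.log(C))
```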

When the weight probability distribution is Gaussian and the hidden weights w and top weights W are independent, we get:

p(w, W|M) = p(w|M) p(W|M),
p(w|M) = (1 / C^{N_w}(γ_w)) exp(−(γ_w/2) ‖w − m_w‖²),
p(W|M) = (1 / C^{N_W}(γ_W)) exp(−(γ_W/2) ‖W − m_W‖²),    (2.7.28)

where N_w, N_W are the numbers of hidden and top weights, the coefficients γ_w, γ_W are inversely proportional to the corresponding Gaussian variances, and m_w, m_W are the mean values of the hidden and top weights, respectively. Assumptions (2.7.25, 2.7.28) lead to the following expression for the description length (2.7.24):

L(M, D) = (λ/2) E_D + (γ_w/2) ‖w − m_w‖² + (γ_W/2) ‖W − m_W‖²
          + N_w log C(γ_w) + N_W log C(γ_W) + nr log C(λ) + const.    (2.7.29)

The first term may be recognized as an error term and the next two as a modified weight-decay term. The remaining terms are constant for a chosen net architecture. Thus, the weight-decay term controls the network complexity by imposing smoothness constraints. Another form of the weight-decay term has been obtained by modelling the weights as a mixture of Gaussians (Nowlan and Hinton, 1992). There is a deep relationship linking the MDL approach and regularization techniques. The intuitive idea is that complex models can fit the training data better, but are not robust to small variations in the data. This relationship between the generalization ability of a model and its complexity is related to the bias-variance dilemma in statistics (Geman et al., 1992): over-parameterized models have high variance, while restricting the model parameters incurs a high bias in the generalization error. The MDL formulation allows bias and variance to be controlled in a natural way.
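The effect of the weight-decay term in (2.7.29) can be sketched for the simplest case of a linear "network" with zero prior means; this is a toy example on synthetic data, and the particular values of λ and γ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

def fit(lam, gamma, steps=2000, lr=0.01):
    """Gradient descent on (lam/2) E_D + (gamma/2) ||w||^2  (m_w = 0)."""
    w = np.zeros(5)
    for _ in range(steps):
        grad = lam * X.T @ (X @ w - y) + gamma * w
        w -= lr * grad / len(y)
    return w

w_plain = fit(lam=1.0, gamma=0.0)
w_decay = fit(lam=1.0, gamma=50.0)

# the decay term shrinks the weights toward the prior mean (zero here),
# trading a slightly larger error for a shorter model description
assert np.linalg.norm(w_decay) < np.linalg.norm(w_plain)
```

Raising γ relative to λ strengthens the prior and shrinks the weights further, exactly the bias-variance trade controlled by (2.7.29).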

2.7.1 MDL and EPP bias constraints

Let us assume again that the network architecture, such as the number of hidden units and the nonlinear activation functions, is fixed. Nevertheless, does there exist another way to


control the complexity of the network? It turns out that this can be done by imposing bias constraints on the supervised neural network.

Figure 2.2: A hybrid feed-forward network with exploratory projection pursuit (EPP) constraints (input X → hidden weights w → hidden representation A, subject to the bias constraints → top weights W → output). A single arrow between two layers indicates that the units of both layers are fully connected.

A general framework for imposing EPP bias constraints in neural networks (Figure 2.2) has been considered in Section 2.3.3. We have seen that computationally these constraints are expressed as the minimization of some function H, measuring the quality of the hidden layer representation A, averaged over an ensemble of the input. In other words, EPP constraints are constraints on the specific form of the hidden representation that are known a-priori. Thus, the projection index ρ(w) is a complex function depending on the hidden weights via the hidden representation A: ρ(w) ≡ E[H(A)], where A = f(w, x) and H measures the quality of the hidden representation. This form of constraints may easily be wired into the MDL framework by assuming a particular form of the a-priori probabilities of the hidden weights:

p(w|M) = C_H(µ) exp(−(µ/2) E[H(f(w, x))]),    (2.7.30)

where C_H(µ) is a normalization constant. The a-priori probability p(w|M) (2.7.30) does not depend on the input x explicitly; implicitly, however, it does, since in the Bayesian formulation (2.6.16) all the probabilities have to be conditioned on the input data X. Assuming

Chapter 2: MDL and Bayesian principles

26

independence of the hidden and top weights, we get:

L(M, D) = (λ/2) E_D + (µ/2) E[H(A)] − log p(W|M) + const,    (2.7.31)

where the first term is the error-cost, the second the representation-cost and the third the weight-cost.

The expression for the description length (2.7.31) gives a deeper level of description to the data communication and is close (though not equivalent) to Zemel's interpretation of MDL (Zemel, 1993). Zemel's interpretation gives a more realistic picture of the communication game, where the real communication takes place between the hidden layer with internal representation A and the top layer. The receiver requires three items in order to be able to recover the desired output:

1. The set of activities A of the representation units; these are the codes that the net assigns to each training input sample. Encoding the activities of the representation (hidden) units avoids communication of the hidden weights and does not require knowledge of the input data X. However, the sender and the receiver have to agree on the a-priori distribution of the internal representation. This part of the message corresponds to the representation-cost.

2. The set of hidden-to-output weights W. This part of the message is represented by the weight-cost.

3. The reconstruction error, which is the misfit between the desired and predicted outputs. This part of the message is represented by the reconstruction- or error-cost. In order to evaluate the latter, the sender and receiver have to agree on the probability of the desired output of the network given its actual output.

Usually, the weight-cost, i.e. the number of bits required to communicate the hidden-to-top weights, is not taken into account, since it has to be communicated only once, while the representation-cost and error-cost have to be sent for every sample. Thus, the main communication tradeoff takes place between the representation and error costs. By reducing the dimensionality of the data in the hidden layer, i.e. compressing the data, a shorter description is obtained, but at the same time the errors become larger. The MDL principle is a tool for achieving a good data representation that is compact and accurate.
We see that, similarly to Zemel's interpretation of MDL, imposing EPP constraints leads to a description length (Eq. 2.7.31) that consists of three parts. It requires the same agreement between the sender and receiver on the probabilities of the hidden representation and of the errors as described above. However, the representation cost in (Eq. 2.7.31) is taken only


once for all samples, while in Zemel's interpretation it is permanent and is assigned to each training input sample. When the number of input patterns is infinite, the representation cost induced by the EPP constraints is negligible. Thus, in a manner similar to supervised learning, EPP constraints lead to a model in which the model cost vanishes as the number of input patterns becomes infinite. We postpone the consideration of the hybrid autoencoder network with reconstruction constraints and its MDL interpretation to the next section, where the reconstruction task and its applications are considered.
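As an illustration, the description length (2.7.31) can be assembled for a one-hidden-layer network. Here H is an arbitrary stand-in penalty on the hidden activities, chosen only so the sketch runs; a real EPP projection index (e.g. a skewness- or kurtosis-based one) would be plugged in instead:

```python
import numpy as np

def description_length(w, W, X, T, lam=1.0, mu=0.1):
    """Eq. (2.7.31) without the constant: error-cost + representation-cost.
    The weight-cost -log p(W|M) is dropped, as it is paid only once."""
    A = np.tanh(X @ w)                    # hidden representation A = f(w, x)
    E_D = np.sum((A @ W - T) ** 2)        # sum-of-squares error
    H = np.mean(np.sum(A ** 2, axis=1))   # E[H(A)] over the input ensemble (stand-in H)
    return 0.5 * lam * E_D + 0.5 * mu * H

rng = np.random.default_rng(2)
X, T = rng.normal(size=(20, 4)), rng.normal(size=(20, 2))
w, W = rng.normal(size=(4, 6)), rng.normal(size=(6, 2))

# the constraint strength mu trades error-cost against representation-cost
assert description_length(w, W, X, T, mu=0.5) >= description_length(w, W, X, T, mu=0.0)
```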


2.8 Appendix to Chapter 2: Regularization problem

Regularization may be expressed as a minimization problem with a goal function that is a penalized cost function:

E_λ = E + λΩ(f, w),    E = Σ_i ‖y_i − f(x_i, w)‖².

A large value of the regularization parameter λ leads to a network with a large bias (unless the regularization term captures the underlying structure of the data), while a small value reduces bias but increases variance. The regularization task is then to find an optimal parameter λ* and the corresponding model parameters w_λ* providing the minimal generalization error: E_λ* = E[‖y − f(x, w_λ*)‖²]. This task is computationally very expensive.

Split-sample validation and hold-out method

The simplest way to find the regularization parameter is to use split-sample validation. This process includes the following steps for each tested value of the regularization parameter λ (the process is common to the choice of other regularization parameters, such as the number of hidden units, the choice of the early-stopping moment, etc.):

• The data is randomly split into a training and a validation set. Often 2/3 of the data is used for training and 1/3 for testing.

• The training set is used for estimation of the predictor parameters by minimizing E_λ.

• The validation set is used to test the prediction error (E). The validation set must not be used in any way during training.

• The predictor with the smallest prediction error corresponds to the optimal regularization parameter λ.

The generalization error of the best predictor is in general too optimistic. The prediction error on a third, separately kept data set, called the test set, is more realistic and is often reported as the predictor's accuracy. This method is called the hold-out method. The disadvantage of split-sample validation and the hold-out method is that they reduce the amount of data available for both training and validation. Two methods that


overcome this drawback are cross-validation and bootstrapping (Efron and Tibshirani, 1993; Bishop, 1995a).

Cross-validation

In k-fold cross-validation, the data is divided into k subsets of (approximately) equal size. A network is trained k times, each time leaving out one of the subsets from the training set and using the omitted subset as a validation set to compute an error. If k equals the sample set size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a more elaborate and expensive version of cross-validation that involves leaving out all possible subsets of v cases. The generalization error is then measured as the average performance over all possible validation tests. Cross-validation is an improvement on split-sample validation.

Bootstrapping

In many cases, the bootstrap seems to be better than cross-validation (Efron and Tibshirani, 1993). In the simplest form of bootstrapping, the training data is bootstrapped, instead of repeatedly analyzing subsets of the data as in cross-validation. Given a data set of size n, a bootstrap sample is created by sampling n instances uniformly from the data with replacement. The probability of an instance remaining in the test set is then (1 − 1/n)^n ≈ e^−1 ≈ 0.368, and of being in the training data 0.632. Given a number b of bootstrap samples, the average performance is evaluated as a weighted sum of the training (E_i^training) and testing (E_i^testing) errors:

E = (1/b) Σ_{i=1}^{b} (0.368 E_i^training + 0.632 E_i^testing).    (2.8.32)

Usually the recommended number of bootstrap samples is between 200 and 2000 (Kohavi, 1995). Cross-validation and bootstrapping require many runs, which may be computationally prohibitive, especially for the most interesting perception tasks, where the input dimensionality is very high. Both cross-validation and bootstrapping work well for continuous error functions, such as the mean squared error, but they may perform poorly for non-continuous error functions, such as the misclassification rate.
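A minimal sketch of the bootstrap estimate (2.8.32), with an ordinary least-squares fit standing in for network training (the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=50)

def mse(w, Xs, ys):
    return np.mean((Xs @ w - ys) ** 2)

b, errors = 200, []
for _ in range(b):
    idx = rng.integers(0, len(X), size=len(X))      # n draws with replacement
    out = np.setdiff1d(np.arange(len(X)), idx)      # ~36.8% of instances left out
    if len(out) == 0:
        continue
    w = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    errors.append(0.368 * mse(w, X[idx], y[idx]) + 0.632 * mse(w, X[out], y[out]))

E = np.mean(errors)    # Eq. (2.8.32)
```

The weighted sum compensates for the optimism of the training error and the pessimism of the leave-out error.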

Chapter 3

Imposing bias via reconstruction constraints

3.1 Introduction

Reconstruction is one of the important tasks of complex visual processing. It is the process of reproducing the input via some reasonably well-chosen model. It is commonly assumed that there is compression via a bottleneck model, and thus the input is reproduced from a reduced internal representation. The oldest and most widespread reconstruction method is Principal Component Analysis (PCA). PCA is an optimal linear compression method, based on minimization of the mean squared error between the input and its reconstruction. A simple generalization of PCA to the nonlinear case is the nonlinear autoencoder. Below, we present both these models and discuss their relationship to the MDL principle. We then proceed with a more general notion of reconstruction via a generative model and reexamine diverse applications of the reconstruction models. Finally, we introduce a novel method that uses reconstruction as a bias constraint for a supervised classification task.

3.1.1 Principal Component Analysis (PCA)

PCA is widely used in multivariate analysis (Duda and Hart, 1973). PCA, also known as the Karhunen-Loève transformation (Oja, 1982; Fukunaga, 1990), is a process of mapping the original data into a more efficient representation, using an orthonormal linear transformation that minimizes the mean squared error between the data and its reconstructed version. It is well known that the optimal orthogonal basis of the data space is formed by the eigenvectors of the covariance matrix of the data. The new data representation is obtained


by projecting the data onto this new optimal basis. The eigenvectors corresponding to the largest eigenvalues are the most significant (accounting for most of the variance in the data). Thus, discarding coordinates in these directions leads to the largest error in the mean-squared sense. Therefore, the coordinates corresponding to the small eigenvalues should be deleted first when compression is performed. Different PCA algorithms using neural networks have been reported (see Haykin, 1994, for a review). The first PCA network, proposed by Oja (1982), uses a Hebbian learning rule to find the first eigenvector, corresponding to the maximal eigenvalue. Its generalized version, the generalized Hebbian network (GHA) (Sanger, 1989), extracts the first successive eigenvectors and uses feed-forward connections only. A modification of GHA, the adaptive principal component extraction (APEX) algorithm (Kung and Diamantaras, 1990), uses additional lateral connections to decorrelate the network outputs. GHA and APEX are examples of the reestimation and decorrelating types of PCA algorithms, respectively. PCA using Hebbian networks has been considered a first principle of perceptual processing (Miller, 1995; Atick and Redlich, 1992; Hancock et al., 1992; Field, 1994). The main goal of these studies is to explore the similarities between the PCA eigenvectors and the receptive fields of cells in the visual pathway. It may be shown (Fukunaga, 1990; Gonzalez and Wintz, 1993; Field, 1994) that for stationary and ergodic processes, PCA is approximately equivalent to the Fourier transform. Natural images are not stationary, however, and their covariance matrix does not completely describe the data distribution. It has recently been shown (Hancock et al., 1992) that the first 3-4 eigenvectors extracted from Gaussian-smoothed natural images resemble "Gabor functions", which provide good models of cortical receptive fields.
However, the subsequent eigenvectors no longer look like cortical receptive fields. Moreover, PCA extracts a fully distributed representation: although only the few components that carry most of the variance are kept, all components of the observation vector participate in its projection into the eigenspace. Below, we present the autoencoder network, which is tightly related to PCA, and discuss its interpretation in the MDL framework.
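The basic PCA computation described above can be sketched directly from the covariance matrix (a toy example on synthetic correlated data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated data
Xc = X - X.mean(axis=0)                                     # center the data

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))      # ascending eigenvalues
order = np.argsort(eigvals)[::-1]                           # largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

p = 3
U = eigvecs[:, :p]                 # keep the p most significant eigenvectors
recon = (Xc @ U) @ U.T             # project into the eigenspace and reconstruct

# discarding the small-variance directions gives a far smaller MSE than
# keeping them and discarding the large-variance directions instead
err_keep_large = np.mean((Xc - recon) ** 2)
V = eigvecs[:, -p:]
err_keep_small = np.mean((Xc - (Xc @ V) @ V.T) ** 2)
assert err_keep_large < err_keep_small
```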

3.1.2 Autoencoder network and MDL

An autoencoder network (Figure 3.1) is a feed-forward multi-layer perceptron (MLP) network whose output layer coincides with the input layer. Usually it contains a single hidden layer, though variants with additional hidden layers have also been considered (Kramer, 1991). The number of hidden units is assumed to be much smaller than the dimensionality of the input. Therefore, the network reduces the dimensionality of the input, extracting the


so-called internal representation in the hidden layer.

Figure 3.1: Autoencoder network architecture (W - hidden-to-top weights, w - hidden weights). Reconstruction of the inputs is done from the hidden layer representation.

The autoencoder network has a natural interpretation in the MDL framework (Hinton and Zemel, 1994). It discovers an efficient way to communicate data to a receiver. A sender uses a set of input-to-hidden weights and, in general, non-linear activation functions to convert the input into a compact hidden representation. This representation has to be communicated to the receiver along with the reconstruction errors and the hidden-to-top weights. Knowing the hidden-to-top weights, the receiver reconstructs the input from this abstract representation and the communicated errors. From Eq. 2.7.24, the description length is composed of the error-cost and the model-cost. Assuming that the errors are encoded using a zero-mean Gaussian with the same predetermined variance for each output unit, the error-cost is given by the sum of the squared errors. Since in the autoencoder the hidden units are always active, the model cost may be approximated by the size of the hidden layer. Often the model cost is ignored, and the MDL principle leads to a simple minimization of the sum of squared errors via a network with a bottleneck structure. Thus, the autoencoder learns a compact representation of the input. In addition, the bottleneck structure forces the network to learn prominent features of the input distribution which are useful for generalization. The network is robust to noise and may be used for pattern completion when part of the input is corrupted or absent. A linear one-hidden-layer autoencoder is closely related to PCA, since its hidden weights span the same subspace as that found by the principal eigenvectors (Bourlard and Kamp, 1988). However, contrary to PCA, the hidden weights are not forced to be orthogonal and do not coincide with the hidden-to-top weights. The analytical solution of the

Chapter 3: Reconstruction constraints

33

optimization problem imposed by the linear autoencoder is given by:

W = U T^{−1},    w = T U^t,    (3.1.1)

where T ∈ R^{p×p} is an arbitrary nonsingular scaling matrix; U ∈ R^{n×p} (p ≤ n) is the matrix of the principal eigenvectors stacked as columns; W and w are the hidden-to-top and hidden weights, respectively; n and p are the numbers of units in the input and hidden layers, respectively. However, since learning in the autoencoder relies on a gradient descent technique, it can get trapped in local minima. In the nonlinear case, Bourlard and Kamp claim that nonlinear and linear autoencoders are equivalent, since when the norm of the scaling matrix T is infinitely small, sigmoidal activation functions can be approximated arbitrarily closely by linear activation functions. However, their proof is valid only from the reconstruction-error minimization viewpoint, and not in the context of the extracted internal representation.

Their analysis does not take into account a convergence issue. Indeed, to make the nonlinear and linear autoencoder solutions arbitrarily close, the norm of the matrix T has to be arbitrarily small (for example, by introducing some scaling parameter ε → 0). While ε is positive, the linear autoencoder hidden weights span the same space as the principal eigenvectors, but at the same time there is a difference between the hidden weights extracted by the linear and nonlinear autoencoders. This difference disappears only for ε = 0, when the matrix T becomes singular. Thus, it is not obvious that the hidden weights obtained in the limit of this convergence span the space extracted by the principal eigenvectors. It has recently been shown that when the data is whitened (i.e. the data covariance matrix is the unit, spherical matrix) and the non-linear activation functions are adjusted properly, the autoencoder is able to extract the independent components (Oja, 1995a) (i.e. the responses of different hidden neurons are independent; see also Chapter 4), while the PCA solution is not well defined. Thus, the non-linear autoencoder can be made sensitive to higher-order statistics, while PCA is sensitive only to the second-order statistics of the data. The presence of proper nonlinearities in the autoencoder allows a sparse representation to be extracted, while PCA forms a distributed representation. In a distributed representation, all the hidden units participate in the pattern encoding, while in a sparse one only a few are active, namely those responsible for the presence of some specific features in the pattern. PCA forms a distributed representation, since only the few neurons which carry most of the variance are kept for data reconstruction, and they are active for all patterns.


to minimize a sum of the representation (code) and error costs, they do not necessarily have a bottleneck structure and develop interesting, biologically plausible representations.
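The analytical solution (3.1.1) of the linear autoencoder can be verified numerically: for any nonsingular scaling T, the network output reduces to the PCA projection. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 3
X = rng.normal(size=(500, n)) @ rng.normal(size=(n, n))
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
U = eigvecs[:, np.argsort(eigvals)[::-1][:p]]   # principal eigenvectors, by columns

T = rng.normal(size=(p, p))        # arbitrary nonsingular scaling matrix
w = T @ U.T                        # hidden weights,        Eq. (3.1.1)
W = U @ np.linalg.inv(T)           # hidden-to-top weights, Eq. (3.1.1)

# output = W w x = U U^t x, the orthogonal projection onto the eigenspace,
# so the reconstruction error equals the PCA error for any such T
assert np.allclose(W @ w, U @ U.T)
```

This also illustrates why the hidden weights themselves are not unique: any nonsingular T yields the same reconstruction.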

3.1.3 Reconstruction and generative models

There is evidence from several psychological experiments (for example, completion of partially occluded contours (Lesher, 1995)) that humans perceive a reconstructed version of the input instead of the raw ambiguous input. Reconstruction may be a more complex process than simple duplication of the incoming information, including deblurring, denoising, completion of occluded areas, etc. It is often assumed that the observed signals are synthesized by some generative model from an abstract internal representation. Thus, reconstruction is considered to be composed of two phases (Hinton and Ghahramani, 1997). The first phase is a recognition phase, inferring the underlying internal representation of the incoming input; the second, a generative phase, converts the internal representation into an input form (the reconstructed object). From a statistical viewpoint, learning to reconstruct is the problem of maximizing the likelihood of the observed data under a generative model. This estimation is often an ill-posed problem that can be solved using the expectation-maximization (EM) algorithm (Dempster et al., 1977; Neal and Hinton, 1993). This iterative algorithm increases (or does not change) the likelihood at every iteration, which consists of two steps, expectation and maximization. In EM, the recognition phase corresponds to the expectation step (E-step) and the generative phase to the maximization step (M-step). In the E-step, a distribution of the internal representation is estimated from the observed data and the current model parameters. Using this distribution and the observed data, the generative model parameters are updated in the M-step via maximization of the average likelihood. Different generative models and assumptions about the distribution of the internal representation lead to different network models and sensory representations. The inference phase is difficult.
In logistic belief networks (LBN) and the Boltzmann machine (Hinton and Ghahramani, 1997) the hidden state is picked using Gibbs sampling, i.e. each unit is visited one at a time and its new state is stochastically picked from its posterior distribution given the current states of all the other units (see Jordan, 1999, for a comprehensive survey). In the wake-sleep algorithm (Hinton et al., 1995), a model uses separate bottom-up recognition connections to pick binary states for units in one layer, given the already selected binary states of units in the layer below. Both PCA and the autoencoder network may be interpreted as generative models. PCA as a generative model emerges as a constrained case of factor analysis (Roweis and Ghahramani, 1997; Hinton and Ghahramani, 1997). In factor analysis the observation


is a linear transformation of the hidden variables, corrupted by additive Gaussian sensory noise. The linear transformation is realized via a matrix of generative weight vectors; each generative weight vector connects the hidden variables with the corresponding observation variable. The hidden variables are referred to as factors and are assumed to be Gaussian. PCA is obtained when the covariance matrix of the sensory noise is assumed to be a scaled identity matrix εI, with an infinitesimal scaling factor ε → 0. In this limiting case, the posterior distribution of the hidden variables shrinks to a single point, i.e. given the observation, the hidden representation becomes non-random. In PCA the generative weight vectors are forced to be orthogonal, which leads to a simple recognition of the deterministic hidden representation as a linear transformation whose matrix of recognition weight vectors is the transpose of the generative weight matrix. The interpretation of PCA as a generative model disregards the order of the hidden variables, but allows the use of EM for the extraction of the eigenvectors (Roweis, 1997). This method is especially efficient for high-dimensional data, where the covariance matrix is not of full rank and its large size makes simple diagonalization computationally difficult. The transformation from the input to the hidden layer of the autoencoder net is associated with the recognition phase, and that from the hidden layer to the output with the generative phase. Therefore, the hidden weights emerge as recognition weights and the hidden-to-top weights as generative weights.
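A sketch of the EM procedure for PCA in the zero-noise limit, following the scheme described above (Roweis, 1997): the E-step infers the factors from the current generative weights, and the M-step refits the weights. The iteration count and data sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, N = 10, 2, 300
Y = rng.normal(size=(N, n)) @ rng.normal(size=(n, n))
Y = (Y - Y.mean(axis=0)).T                     # n x N, columns are observations

W = rng.normal(size=(n, p))                    # generative weights
for _ in range(100):
    Xf = np.linalg.solve(W.T @ W, W.T @ Y)     # E-step: infer hidden factors
    W = Y @ Xf.T @ np.linalg.inv(Xf @ Xf.T)    # M-step: update generative weights

# span(W) approaches the p-dimensional principal subspace: its reconstruction
# error approaches the (optimal) PCA reconstruction error
P = W @ np.linalg.solve(W.T @ W, W.T)          # projector onto span(W)
eigvals, eigvecs = np.linalg.eigh(Y @ Y.T / N)
U = eigvecs[:, np.argsort(eigvals)[::-1][:p]]
err_em = np.mean(((np.eye(n) - P) @ Y) ** 2)
err_pca = np.mean(((np.eye(n) - U @ U.T) @ Y) ** 2)
assert err_em <= 1.05 * err_pca
```

No covariance matrix is ever formed inside the loop, which is what makes the method attractive for high-dimensional data.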

3.1.4 Classification via reconstruction

As we have shown above, an implicit goal of reconstruction is to find a meaningful internal representation of the data, which can obviously be used for data compression and communication. Interpreted as a set of good features, it may also be applied to further processing and learning. This usage is not entirely apparent, since some information is lost during feature extraction. Below, we consider some examples of using internal representations extracted via reconstruction for recognition.

PCA for classification

PCA was first used as a means of preprocessing for subsequent face recognition in (Kirby and Sirovich, 1990; Turk and Pentland, 1991). Later, PCA was used for man-made object recognition and pose estimation (Murase and Nayar, 1993). PCA proceeds by scanning and representing images as points in a high-dimensional space with dimension equal to the number of image pixels. The eigenvectors of the data covariance matrix, represented as images, are called eigenpictures. The first large


eigenvectors form the basis of a low-dimensional subspace, called the eigenspace. All the sample images and new images of the objects are projected into the eigenspace, and the recognition problem is solved in the reduced-dimensional space by different statistical methods (nearest-neighbor rule, vector quantization, etc.). Though the application of PCA to recognition has been relatively successful, the question of the optimality of PCA for the recognition task has also been addressed (Turk and Pentland, 1993; O'Toole et al., 1993). Experimental studies of Turk et al. (1993) show that the first few eigenfaces primarily capture the effects of changing illumination, and neglecting the first few eigenfaces can lead to a substantial increase in recognition accuracy. This observation has been supported by a different study (O'Toole et al., 1991; O'Toole et al., 1993). It has been shown that a low-dimensional representation of the faces associated with the small eigenvalues is better for face classification and familiarity than a representation associated with the large eigenvalues, when these spaces have the same small dimensionality. The explanation of PCA's utility is based on the fact that the eigenvectors corresponding to the large eigenvalues are the directions with the largest data variability (Figure 3.2a). Thus, it seems reasonable that these directions are good for recognition. However, this assumption fails, as can easily be seen from Figure 3.2b.

Figure 3.2: Two examples of eigenspaces extracted by PCA. The first principal eigenvector e1 is marked with a bold line and the second, e2 (e1 ⊥ e2), with a dashed line. Example (a) demonstrates why PCA can be used for dimensionality reduction before classification: the projection on the e1 direction captures all the information needed for classification. In contrast, example (b) indicates a PCA drawback: classification after projecting the data onto the e1 direction is impossible.

This figure demonstrates the main drawback of the PCA technique, namely a high sensitivity to scaling, which changes the ordering of the eigenvectors. Scaling affects the reduced low-dimensional eigenspace,

Chapter 3: Reconstruction constraints

37

extracted by PCA: while optimal for reconstruction, it may become inappropriate for recognition. When the data is whitened, PCA is not well defined at all, since all orthogonal systems are equivalent from the PCA viewpoint.

Autoencoder network

Autoencoder networks have been successfully used not only for compression (Mougeot et al., 1991; Cottrell et al., 1987), but for classification as well (Elman and Zipser, 1988; Japkowicz et al., 1995; Schwenk and Milgram, 1995). In these works, the classification process is considered to consist of two phases. In the first phase, several autoencoders are trained, each separately on the samples of the corresponding class. The second phase is heuristic and is based on the observation that the reconstruction error is, in general, much lower for examples of the learned class than for the others. In (Japkowicz et al., 1995), classification is restricted to a two-class discrimination task, which is replaced by the dual task of familiarity with a concept. In the first phase, a single autoencoder is trained on the conceptual examples only. In the second phase, the conceptual examples, or examples of both classes, are used to estimate a decision threshold on the reconstruction error (the sum of squared errors between input and output). If the reconstruction error is smaller than the decision threshold, the instance is classified as conceptual; if larger, it is classified as counter-conceptual. Similarly, in (Elman and Zipser, 1988) the autoencoder is trained on segmented sounds, which makes it possible to segment continuous speech on the basis of the mean squared error. In (Schwenk and Milgram, 1995) the basic idea is to use one autoencoder per class and to train it only with examples of the corresponding class. In contrast with the usual autoencoder, a tangent distance is used instead of the squared reconstruction error; the tangent distance allows high-level knowledge about typical input transformations to be incorporated.
Classification is done using the reconstruction errors of the autoencoders as discriminant functions.

“Wake-sleep” network

Another example of classification based on reconstruction has been proposed via the “wake-sleep” network (Hinton et al., 1995). Similarly to the autoencoder schemes, a separate “wake-sleep” network is trained on examples of each digit class. Classification is done by observing which of the networks provides the most economical description of the data.
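The two-phase scheme (train one autoencoder per class, then classify by reconstruction error) can be sketched in a few lines. The following is a hypothetical numpy illustration, not code from this thesis: it uses linear autoencoders (realized here by per-class PCA) and synthetic two-class data; all names and sizes are illustrative.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Phase 1: fit a rank-k linear autoencoder (realized by PCA) on one class."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]                      # class mean and k principal directions

def reconstruction_error(model, x):
    mu, V = model
    z = V @ (x - mu)                       # encode into the k-dim hidden space
    x_hat = mu + V.T @ z                   # decode back to the input space
    return float(np.sum((x - x_hat) ** 2))

def classify(models, x):
    """Phase 2: assign x to the class whose autoencoder reconstructs it best."""
    return int(np.argmin([reconstruction_error(m, x) for m in models]))

rng = np.random.default_rng(0)
# Two synthetic classes lying near different one-dimensional subspaces of R^5.
A = rng.normal(size=(100, 1)) * np.array([1., 1., 0., 0., 0.]) + 0.05 * rng.normal(size=(100, 5))
B = rng.normal(size=(100, 1)) * np.array([0., 0., 0., 1., -1.]) + 0.05 * rng.normal(size=(100, 5))
models = [fit_linear_autoencoder(A, 1), fit_linear_autoencoder(B, 1)]
accuracy = np.mean([classify(models, x) == 0 for x in A] +
                   [classify(models, x) == 1 for x in B])
```

Because each class lives near its own subspace, the "wrong" autoencoder cannot reconstruct the sample, and the reconstruction error acts as a discriminant function.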


3.1.5 Other applications of reconstruction

Reconstruction via a modified autoencoder has been used for input reconstruction reliability estimation (IRRE) for autonomous car navigation (Pomerleau, 1993). In IRRE, a connectionist network is trained simultaneously to produce the correct steering response for car navigation and to reconstruct the input image in the mean squared error sense. After learning, a reliability measure, defined as the correlation between the input and its reconstructed image, is evaluated. This reliability measure may be used to control vehicle speed and location in a priori known confusing situations. Another application of IRRE is the integration of the outputs of multiple networks trained for different driving situations, i.e., the network with the best reliability is used for the navigation task. Another related recurrent network has been used for autonomous vehicle navigation (Baluja and Pomerleau, 1995). Baluja and Pomerleau use prediction of the next input image as a task related to navigation, i.e., the MLP network learns simultaneously to predict the next input image and to produce the correct steering response. Computationally, the hidden weights are updated based on the navigation task only, but from the obtained hidden activities the network is trained to predict. Recurrence arises from propagating the predicted image back to the input layer, refining the next input image by eliminating noise and unpredictable objects. A connectionist network similar to IRRE has been proposed as a hippocampal model (Gluck and Myers, 1993). This model assumes that the hippocampal region develops an internal representation of the stimulus that enhances the discrimination of predictive cues while compressing the representation of redundant cues.
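The IRRE reliability measure (the correlation between the input and its reconstruction) is simple to compute. The following is a hypothetical numpy sketch, with synthetic vectors standing in for the network's input image and its reconstruction:

```python
import numpy as np

def irre_reliability(x, x_hat):
    """IRRE-style reliability: correlation between the (flattened) input and its
    reconstruction. A high value suggests a familiar, reliable situation; a low
    value flags a confusing input the network was not trained on."""
    x, x_hat = np.ravel(x), np.ravel(x_hat)
    return float(np.corrcoef(x, x_hat)[0, 1])

rng = np.random.default_rng(0)
x = rng.normal(size=256)                   # a flattened "input image"
good = x + 0.1 * rng.normal(size=256)      # faithful reconstruction (familiar input)
bad = rng.normal(size=256)                 # unrelated output (unfamiliar input)
```

To integrate several networks trained for different driving situations, one would evaluate this measure for each network and hand control to the one with the highest reliability.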

3.2 Imposing reconstruction constraints

3.2.1 Reconstruction as a bias imposing mechanism

We have shown above that the reconstruction task is related to the classification task, and that two main approaches to classification via reconstruction exist. The first approach uses a common hidden representation, obtained for all the data, as a preprocessing step for subsequent learning (Kirby and Sirovich, 1990; Murase and Nayar, 1993; Moghaddam and Pentland, 1994). In the second approach (Japkowicz et al., 1995; Schwenk and Milgram, 1995; Hinton et al., 1995), reconstruction (generative) networks are used to extract the underlying structure of the data drawn from the same class. The assumption is that an example drawn from another class does not share the already


learned structure and therefore produces a high description length. Thus, the description length may be used as a discriminant function. Though these approaches have been relatively successful, there are cases where they are not appropriate. For example, when the samples belonging to the same class have a multimodal distribution, or when the classes are very similar, the second approach is problematic. As we have shown in Section 3.1.4, PCA is very sensitive to data scaling. These considerations favor the view that each perceptual task needs data preprocessing that cannot be obtained based solely on another related task. In contrast to the approaches considered above, we propose to use reconstruction, realized via a modified autoencoder, as a bias-imposing mechanism in feed-forward networks for improving the classification task. An intuitive way to conceive of imposing reconstruction as a proper bias constraint for classification is via the multi-task learning (MTL) approach (Section 2.3.3). As shown above, recognition and reconstruction are related but distinct tasks of visual processing; in some cases, one has even been substituted for the other. Secondly, it has been shown experimentally (Elman and Zipser, 1988; Cottrell et al., 1987) that reconstruction via an autoencoder extracts a valuable internal representation. Thus, it is reasonable to expect that a hidden representation that serves both the recognition and reconstruction tasks can improve the generalization performance of classification. Such a hidden representation has to capture the prominent (recognition) features of the data, while keeping the most important information needed for reconstruction.
As an illustration, let us assume that we want to discriminate between two individuals, and suppose that one of them has some prominent features in the training images (glasses, hair style, moustache, beard and so on). It then seems plausible that recognition will tend to concentrate on the corresponding areas of the face, and all other information will be redundant for the recognition goal. However, these features may be absent or appear rarely in new images of this person, so failure in the testing phase is likely. In contrast, the addition of the reconstruction task during training forces the system to extract other features which may not be so useful for recognition of the original training images, but may be of use with the novel test set. This motivates our suggestion to add reconstruction constraints during learning of the classification task. A similar approach has been proposed in (Gluck and Myers, 1993) to model hippocampal function, where it is assumed that one of the roles of the hippocampus is to extract a common recognition/reconstruction internal representation of the input stimulus. Though our work is conceptually close to this model, there are notable differences, which are elucidated later on. Below, we present a hybrid classification/reconstruction network.

3.2.2 Hybrid classification/reconstruction network

Figure 3.3 presents the architecture of the combined classification/reconstruction network. This network attempts to improve the low dimensional representation by concurrently minimizing the mean squared errors (MSE) of the reconstruction and classification outputs. In other words, it attempts to improve the quality of the hidden layer representation by imposing a feature selection useful for both tasks, classification and reconstruction. The hidden layer should have a smaller number of units than the input, so as to achieve a bottleneck compression and to allow for generalization. The combined learning


Figure 3.3: A single hidden layer drives the classification layer and the reconstruction layer.

rule for the hidden layer units is a composition of the errors backpropagated from both the reconstruction and recognition layers. The relative influence of each output layer is determined by a constant λ, which represents a tradeoff between reconstruction and classification confidence. Below, we present a rigorous mathematical explanation of the hybrid network in the MDL framework.

3.2.3 Hybrid network and MDL

It is easy to see that the proposed network is a modified autoencoder network. The modified autoencoder shares a common hidden representation with the supervised (classification) network. It finds a compact hidden representation that is good for reconstruction in addition to the task at hand (Figure 3.4). In contrast to the autoencoder (Section 3.1.2)


Hybrid network with reconstruction and EPP constraints: the input X is mapped by the hidden weights w to the hidden representation A, from which the hidden-to-top weights W1 and W2 drive the supervised output and the reconstruction sublayer, respectively; bias constraints act on the hidden representation.

Figure 3.4: The hidden layer drives the reconstruction and classification outputs. In addition, a search for further statistical structure in the data is performed.

and supervised feed-forward network (Section 2.7), the hybrid network is associated with a different communication game, in which the sender uses a compact internal representation to communicate both the observed data and the corresponding desired output (for example, the class labels of the images).

Since this internal representation has to encode efficiently both the observed data and the corresponding output, a cost for communicating the input data X has to be included in the description length (2.7.24), yielding:

L(M, D, X) = − log p(D, X | w, W1, W2, M) − log p(w, W1, W2 | M) + const.

Assuming that, given the input and the network weights, the conditional probabilities of the reconstruction and supervised outputs are independent and Gaussian, we obtain, similarly to (2.7.29), the following expression for the description length:

L(M, D, X) = (1/2)(λ1 ED + λ2 EX) − log p(w, W1, W2 | M) + r1 d log C(λ1) + r2 n log C(λ2) + const.    (3.2.2)

In the expression for the description length (3.2.2), λ1 and λ2 are inversely proportional to the variances of the specific-task and reconstruction outputs, respectively; ED and EX are the sums of the squared errors of the supervised-task and reconstruction outputs, respectively; and r1, r2 are the numbers of training samples for the reconstruction and specific tasks. Assuming the same a priori probability for the hidden weights as in (2.7.30), the entire


description length may be simplified to:

L(M, D, X) = (1/2)(λ1 ED + λ2 EX + µE[H(w, x)]) + r1 d log C(λ1) + r2 n log C(λ2) + log CH(µ) + const,    (3.2.3)

where the first, bracketed group of terms is referred to below as the first part of (3.2.3), and the remaining terms as the second part.

In general, the numbers of training samples for the reconstruction (r1) and specific (r2) tasks may be different, which seems to be a common situation in real-world learning. In the limit, when not enough information is provided by supervised learning, the internal representation is constructed based on unsupervised learning only. Since, in our treatment, the parameters λ1, λ2, µ are assumed to be fixed, the second part of the description length (3.2.3) is a constant, and the description length may be rewritten as:

L(M, D, X) = (1/2)(λ1 ED + λ2 EX + µE[H(w, x)]) + const.    (3.2.4)

Therefore, when one is interested in both tasks, the two scaled sum-squared errors ED and EX represent the error cost, and the third term, µE[H(w, x)], is the model cost or representation cost. This interpretation of the hybrid network in the MDL framework is not unique; indeed, the interpretation depends on how one looks at the hybrid network. When one is mainly interested in reconstruction via the bottleneck hybrid structure, the task may be formulated as a compression problem. This compression has benefits over a conventional autoencoder, since it admits not only a good reconstruction of the data, but also successful handling of the specific task, such as classification. In this formulation, the reconstruction error EX is recognized as the error cost, and the scaled classification error together with the third term as the model cost. The third and last interpretation arises when one is mainly interested in the specific task (for example, classification or control). In this case, the specific error ED is recognized as the error cost, and the scaled reconstruction error together with the third term as the model cost. This interpretation gives a rigorous mathematical way of imposing reconstruction and other unsupervised types of constraints in a supervised network. Below, based on this last interpretation, we explain why the hybrid network may be better than a conventional feed-forward classification network. Let us consider two different melodies. Suppose that the first is embellished with specific ornamental tones, while the second is not arranged at all. Hearing these two melodies, arranged and not, many times, one may conclude that these specific tones are enough for recognizing which


one of the melodies is played. However, the next time the first melody may be played by an unskilled pianist who skips all the beautiful ornaments. Obviously, in this case, the first melody will never be recognized based on the presence of the ornaments alone. This example demonstrates that a bottleneck network for classification, attempting to minimize the description length, has a tendency to throw away salient information from the data; the internal representation extracted based on the classification task alone may be too poor. Reconstruction helps to process the information as a whole: it does not concentrate on particular details, but balances the relationship between the whole and its parts, resulting in better prediction on the supervised specific task.

Bayesian interpretation for the hybrid NN

We have shown that the MDL approach naturally explains and interprets the proposed hybrid classification/reconstruction network. It also states that the most probable network weights have to minimize the following part of the description length (3.2.4):

R(w, W1, W2) = λ1 ED(w, W1) + λ2 EX(w, W2) + λ3 H(w).    (3.2.5)

We recall now that the MDL principle is tightly related to the Bayesian approach, where the parameters λ1, λ2, µ are recognized as hyper-parameters. When the hyper-parameters are unknown, the correct Bayesian treatment (Bishop, 1995a) is to integrate the hyper-parameters out of any prediction P:

p(P | D, X) = ∫ p(P | D, X, λ) p(λ | D, X) dλ,    (3.2.6)

where λ = (λ1, λ2, µ) is the vector of hyper-parameters and p(λ | D, X) is the evidence for the hyper-parameters. This integration is similar to generating an ensemble of networks that depend on the hyper-parameters: instead of evaluating the hyper-parameter evidences (which is impossible analytically), we integrate predictions in the vicinity of the most likely hyper-parameters, assuming equal evidences. Thus, contrary to (Gluck and Myers, 1993; Pomerleau, 1993), we do not consider fixed, manually adjusted parameters, but a class of reconstruction-classification networks depending on the “regularization” parameter, with the subsequent combination of the networks into ensembles.
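Under the equal-evidence assumption, this hyper-parameter ensemble reduces to averaging the member networks' predictions. A minimal illustrative sketch (the class-probability values below are invented, standing in for networks trained with different regularization parameters):

```python
import numpy as np

def ensemble_predict(member_predictions):
    """Approximate the integral (3.2.6) by averaging the predictions of networks
    trained with different hyper-parameter values near the most likely ones,
    assuming (roughly) equal evidences for each ensemble member."""
    return np.mean(member_predictions, axis=0)

# Hypothetical two-class probability outputs of three networks, each trained
# with a different regularization parameter (values are illustrative only).
preds = np.array([[0.7, 0.3],
                  [0.6, 0.4],
                  [0.8, 0.2]])
p_ens = ensemble_predict(preds)
```

Averaging valid probability vectors yields another valid probability vector, so the ensemble output can be used directly as a discriminant function.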

3.2.4 Hybrid network as a generative probabilistic model

Both “recognition” and “generative” phases can be identified in the proposed hybrid model. The “recognition” phase infers an internal/hidden representation of the input data.


The “generative” phase reconstructs the input from the inferred compact representation in the reconstruction output sublayer and, in addition, predicts the specific task output in the corresponding sublayer. From a Bayesian viewpoint, learning in the hybrid network is equivalent to maximization of the joint probability of the input and the specific task output, given the observation and specific constraints on the internal representation. According to Bayesian theory, the best classification is based on the conditional probabilities of the image classes given the input (i.e., the conditional probabilities of the image classes are the best discriminant functions, leading to the minimal classification error). The output of the recognition layer of the hybrid network estimates these conditional probabilities, and the reconstruction sublayer regenerates the input data, implicitly estimating the probability of the input data. The proposed architecture differs from a probability network that has a generative model (reconstruction) and a recognition model in a manner similar to the binary wake/sleep architecture (Hinton et al., 1995) or the Rectified Gaussian Belief Network (Hinton and Ghahramani, 1997). First, it is not a full forward/backward model: there are not two hidden unit representations, one for the top-down and one for the bottom-up pass; instead, a single hidden representation is used for both (Figure 3.3). Second, its learning goal is to minimize the classification error (via the mean squared error) as well as the reconstruction error, as opposed to the goal of constructing a probabilistic model of internal representations. The two goals may coincide for a network with continuous hidden units, but are certainly different for a binary network.

3.2.5 Hybrid Neural Network architecture

The detailed architecture of the network is presented in Figure 3.5. This hybrid network is a modification of the well-known feed-forward network. Images supplied to the input layer are propagated via a hidden layer to the output layer. The output layer consists of two sub-layers: one sub-layer reconstructs the image, and the second serves for classification. The numbers of units in the output reconstruction sub-layer and in the input layer are equal to the number of pixels in the image. The hidden layer has a smaller number of units, because we are looking for aggressive compression to overcome the “curse of dimensionality”. The output classification layer has a number of units equal to the number of image classes. Each image is propagated to the hidden layer in the form:

h_j = Σ_{i=1}^N w_{ji} x_i + w_{j0},    (3.2.7)

y_j = σ(h_j), j = 1, . . . , m,    (3.2.8)


Detailed architecture of the recognition/reconstruction network: the input image X_i (i = 1, . . . , n) feeds the hidden units H_j, Y_j (j = 1, . . . , m) through weights w_{ji}; the hidden layer drives the reconstruction sublayer X̂_i through weights W2_{ij} and the recognition sublayer P_s (s = 1, . . . , K) through weights W1_{sj}.

Figure 3.5: Feed-forward Neural Network with recognition and reconstruction output sub-layers.

where m ≪ N is the number of hidden units, N is the number of pixels in the image, and σ is the sigmoid activation function:

σ(x) = 1 / (1 + exp(−x)).    (3.2.9)

Image reconstruction, based on the hidden layer representation, is given by:

x̂_i = Σ_{j=1}^m W2_{ij} y_j + W2_{i0}, i = 1, . . . , N,    (3.2.10)

and the output of a recognition layer unit is calculated according to the formula:

p_s = σ(Σ_{j=1}^m W1_{sj} y_j + W1_{s0}), s = 1, . . . , K,    (3.2.11)

Chapter 3: Reconstruction constraints

46

where K is the number of individuals (number of classes). Classification is made according to the maximal response of the recognition sub-layer (p_s is interpreted as the probability that the sample belongs to class s).
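A minimal numpy sketch of this forward pass (Eqs. 3.2.7–3.2.11); the bias-in-last-column weight layout and all sizes below are illustrative choices, not taken from the thesis experiments:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))          # Eq. 3.2.9

def forward(x, w, W1, W2):
    """Forward pass of the hybrid network.
    x  : input image, shape (N,)
    w  : hidden weights, shape (m, N+1)  -- last column holds the biases w_{j0}
    W1 : recognition weights, shape (K, m+1)
    W2 : reconstruction weights, shape (N, m+1)
    """
    y = sigmoid(w[:, :-1] @ x + w[:, -1])     # hidden activities (3.2.7-3.2.8)
    x_hat = W2[:, :-1] @ y + W2[:, -1]        # linear reconstruction sublayer (3.2.10)
    p = sigmoid(W1[:, :-1] @ y + W1[:, -1])   # recognition sublayer (3.2.11)
    return y, x_hat, p

rng = np.random.default_rng(1)
N, m, K = 16, 4, 3                            # pixels, hidden units, classes (toy sizes)
w = 0.1 * rng.normal(size=(m, N + 1))
W1 = 0.1 * rng.normal(size=(K, m + 1))
W2 = 0.1 * rng.normal(size=(N, m + 1))
x = rng.normal(size=N)
y, x_hat, p = forward(x, w, W1, W2)
predicted_class = int(np.argmax(p))           # maximal response of the recognition sublayer
```

Note that, matching the text, the reconstruction sublayer is linear while the hidden and recognition units are sigmoidal.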

3.2.6 Network learning rule

Let us consider the error back-propagation learning rule, whose goal is to minimize the cost function, a weighted sum of the scaled recognition and reconstruction errors with coefficients λ1 and λ2, respectively¹:

E(w, W1, W2) = λ1 E1(w, W1)/K + λ2 E2(w, W2)/N.    (3.2.12)

As has been shown in Section 3.2.3, the coefficients λ1 and λ2 are inversely proportional to the noise variances in the recognition and reconstruction channels, respectively. Therefore, the larger the noise in a channel, the smaller the weight of the error cost corresponding to that channel. The recognition error E1 and the reconstruction error E2, summed over all samples, are given by:

E1 = Σ_{µ=1}^M Σ_{s=1}^K (p^µ_s − t^µ_s)²,    (3.2.13)

E2 = Σ_{µ=1}^M Σ_{i=1}^N (x̂^µ_i − t(x^µ_i))².    (3.2.14)

In this expression, t(x^µ_i) is the target of the µ-th sample at reconstruction unit i. The most reasonable choice for t(x^µ_i), demanding no a priori knowledge, is t(x^µ_i) = x^µ_i. Correspondingly, t^µ_s is the recognition target, given by:

t^µ_s = 1 if s coincides with the class of the µ-th sample, and t^µ_s = 0 otherwise.

The weights between the output and hidden layers and between the hidden and input layers are updated according to the gradient descent rule:

∆W1 = −η λ1 ∇_{W1} E1(w, W1),    (3.2.15)

∆W2 = −η λ2 ∇_{W2} E2(w, W2),    (3.2.16)

∆w = −η (λ1 ∇_w E1 + λ2 ∇_w E2).    (3.2.17)

¹ For convenience, the errors E1 and E2 are scaled by the number of image classes K and by the number of pixels in the image N, respectively. This scaling serves to balance the values of the recognition and reconstruction errors in Eq. 3.2.4.

Chapter 3: Reconstruction constraints

47

Specifically, the update of the weights between the output reconstruction and hidden layers is given by:

∆W2_{ij} = η λ2 Σ_{µ=1}^M δ^{µ,2}_i y^µ_j,    (3.2.18)

δ^{µ,2}_i = ∆^{µ,2}_i ≡ (x^µ_i − x̂^µ_i)/N,    (3.2.19)

for i = 1, . . . , N, j = 1, . . . , m, where y^µ_j (3.2.8) is the output of hidden unit j in the feed-forward propagation of input image µ, and δ^{µ,2}_i is the image reconstruction error, scaled by the number of pixels in the image. Similarly, the weights between the output recognition and hidden layers change by:

∆W1_{sj} = η λ1 Σ_{µ=1}^M δ^{µ,1}_s y^µ_j,    (3.2.20)

δ^{µ,1}_s = σ′(Σ_{j=1}^m W1_{sj} y^µ_j + W1_{s0}) ∆^{µ,1}_s,    (3.2.21)

for s = 1, . . . , K, j = 1, . . . , m, where ∆^{µ,1}_s is the recognition (regression) error, scaled by the number of image classes:

∆^{µ,1}_s = (t^µ_s − p^µ_s)/K.    (3.2.22)

We call δ the output error of a layer and ∆ the input error to a layer in the backward propagation. According to the generalized delta rule (Hertz et al., 1991), the change of the deeper embedded weights between the hidden and input layers has the form:

∆w_{ji} = η Σ_{µ=1}^M δ^µ_j x^µ_i,    (3.2.23)

and the output error δ^µ_j of hidden unit j in the backward propagation of the error is given by:

δ^µ_j = σ′(h^µ_j) ∆^µ_j.    (3.2.24)

The input error ∆^µ_j to hidden unit j has the form:

∆^µ_j = λ1 ∆^{µ,1}_j + λ2 ∆^{µ,2}_j,    (3.2.25)

∆^{µ,1}_j = Σ_{s=1}^K W1_{sj} δ^{µ,1}_s,    (3.2.26)

∆^{µ,2}_j = Σ_{i=1}^N W2_{ij} δ^{µ,2}_i.    (3.2.27)

Chapter 3: Reconstruction constraints

48

From (3.2.25)–(3.2.27) it is easy to see that the output error δ^µ_j may be written as the sum of the errors back-propagated concurrently from the reconstruction and recognition sub-layers:

δ^µ_j = λ1 δ^{µ,1}_j + λ2 δ^{µ,2}_j,    (3.2.28)

δ^{µ,1}_j = σ′(h^µ_j) ∆^{µ,1}_j,    (3.2.29)

δ^{µ,2}_j = σ′(h^µ_j) ∆^{µ,2}_j.    (3.2.30)

In general, the input/output errors to any layer of the network are a weighted sum of the input/output errors back-propagated from the lateral sub-layers (a chain rule of derivatives). Thus, in the error back-propagation mode, the hybrid network with lateral sub-layers emerges as a linear superposition of conventional (classical) subnetworks.
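The combined backward pass can be sketched as a single on-line training step. The following hypothetical numpy implementation of Eqs. 3.2.18–3.2.30 for one sample (the bias-in-last-column weight layout, toy sizes, and learning rate are illustrative assumptions) also demonstrates that repeated steps decrease the combined cost:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hybrid_step(x, t, w, W1, W2, lam1, lam2, eta):
    """One on-line step of the combined rule for one sample.
    The last column of each weight matrix holds the bias."""
    N, K = len(x), W1.shape[0]
    y = sigmoid(w[:, :-1] @ x + w[:, -1])          # hidden activities (3.2.8)
    x_hat = W2[:, :-1] @ y + W2[:, -1]             # reconstruction (3.2.10)
    p = sigmoid(W1[:, :-1] @ y + W1[:, -1])        # recognition (3.2.11)
    d2 = (x - x_hat) / N                           # reconstruction delta (3.2.19)
    d1 = p * (1 - p) * (t - p) / K                 # recognition delta (3.2.21-3.2.22)
    # Hidden delta: superposition of errors from both sublayers (3.2.25-3.2.28).
    dj = y * (1 - y) * (lam1 * (W1[:, :-1].T @ d1) + lam2 * (W2[:, :-1].T @ d2))
    W1[:, :-1] += eta * lam1 * np.outer(d1, y); W1[:, -1] += eta * lam1 * d1
    W2[:, :-1] += eta * lam2 * np.outer(d2, y); W2[:, -1] += eta * lam2 * d2
    w[:, :-1] += eta * np.outer(dj, x);         w[:, -1] += eta * dj
    # Combined cost (3.2.12), evaluated before the update.
    return lam1 * np.sum((t - p) ** 2) / K + lam2 * np.sum((x - x_hat) ** 2) / N

rng = np.random.default_rng(0)
N, m, K = 8, 3, 2                                  # toy sizes
w = 0.1 * rng.normal(size=(m, N + 1))
W1 = 0.1 * rng.normal(size=(K, m + 1))
W2 = 0.1 * rng.normal(size=(N, m + 1))
x = rng.normal(size=N)
t = np.array([1.0, 0.0])                           # one-hot class target
losses = [hybrid_step(x, t, w, W1, W2, lam1=1.0, lam2=1.0, eta=1.0) for _ in range(200)]
```

The hidden-layer delta `dj` is computed exactly as the weighted superposition of the two back-propagated error streams, which is the sense in which the hybrid network is a linear superposition of the two conventional subnetworks.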

3.2.7 Hybrid learning rule

We follow the gradient descent algorithm, with the errors back-propagated from the output sub-layers to the hidden layer, using a properly scaled cost function (3.2.12):

E(w, W1, W2) = (1 − λ) E1(w, W1)/K + λ E2(w, W2)/N,    (3.2.31)

where λ ∈ [0, 1] (λ = λ2/(λ1 + λ2)) is a regularization parameter representing a tradeoff between the reconstruction and classification confidences. According to the gradient descent method, the weight-vector update in each iteration has to be done in a direction that has a negative projection onto the gradient direction. This permits us to rescale the learning rule (3.2.15–3.2.17):

∆W1 = −η ∇_{W1} E1(w, W1),

∆W2 = −η ∇_{W2} E2(w, W2),

∆w = −η ((1 − λ) ∇_w E1 + λ ∇_w E2).    (3.2.32)

We emphasize that in our implementation the parameter λ affects only the weights w between the input and hidden layers, i.e., the hidden layer representation. Our rule (3.2.32) may be interpreted as the hidden layer's belief in the performance of the two upper channels, which transfer information backward from the reconstruction and recognition sub-layers. Thus, we take the errors of the reconstruction layer with weight λ, and the errors of the recognition layer with weight 1 − λ. It can be seen that for λ = 0 the hidden representation is built based only on the recognition task, and reconstruction is learned from the hidden layer. This marginal case corresponds to the approach of Baluja and Pomerleau (1995). In contrast, when λ = 1, the hidden representation is based on the


reconstruction task solely, and we attempt to solve the recognition task in the reduced space. This marginal case is equivalent to the first approach to classification via reconstruction (Kirby and Sirovich, 1990; Turk and Pentland, 1991; Murase and Nayar, 1993). This network and its hybrid rule may be interpreted as the parallel, concurrent operation of two separate feed-forward networks for recognition and reconstruction. The hybrid net's hidden weight update is then a linear combination of the gradient directions of both networks in the common hidden weight space. For small λ, our method is a kind of gradient descent that prevents the zig-zags (peculiar to the steepest descent method (Ripley, 1996)) in the search for the optimal weights minimizing the recognition regression error.
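The marginal cases λ = 0 and λ = 1 are easy to check on the rescaled hidden-weight rule (3.2.32); a trivial sketch with toy gradient vectors (all values illustrative):

```python
import numpy as np

def hidden_update(grad_E1, grad_E2, lam, eta=0.1):
    """Hidden-weight update direction of Eq. 3.2.32: the recognition gradient
    enters with weight (1 - lam), the reconstruction gradient with weight lam."""
    return -eta * ((1.0 - lam) * grad_E1 + lam * grad_E2)

g1 = np.array([1.0, 0.0])    # toy gradient of the recognition error E1
g2 = np.array([0.0, 1.0])    # toy gradient of the reconstruction error E2
```

At λ = 0 the update follows the recognition gradient alone, at λ = 1 the reconstruction gradient alone, and intermediate values mix the two directions in the common hidden weight space.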

Chapter 4

Imposing bias via unsupervised learning constraints

4.1 Introduction

Information theory provides some explanation of sensory processing (Rieke et al., 1996). According to these principles, neural cell responses develop by optimizing criteria based on information theory. The first proposed information principles were redundancy reduction (Barlow, 1961) and “infomax” (Linsker, 1988), which are similar and lead to the formation of a factorial code under some conditions (Nadal and Parga, 1994). Recently, with the parallel development of independent component analysis (ICA) (Comon, 1994) in signal processing, new efficient algorithms for factorial code formation have been proposed. Of particular interest are algorithms based on feed-forward networks with no hidden layer (Bell and Sejnowski, 1995; Yang and Amari, 1997). In this chapter, we propose to use information theoretic measures as constraints for the classification task. We introduce a hybrid neural network whose hidden representation is arranged mainly for the classification task and, in addition, has some useful properties, such as independence of the hidden neurons or maximum information transfer in the hidden layer. The chapter is organized as follows. In the first section, the main information principles and their relation to sensory processing are discussed. The second section presents the mathematical background and algorithms for ICA and other related information principles. In the third section, a hybrid neural network with unsupervised constraints is introduced and some algorithmic details are presented.


Chapter 4: Unsupervised learning constraints

4.2 Information principles for sensory processing

Mammals process incomplete and noisy sensory information in an apparently effortless way. This is possible because sensory inputs (images, sounds, etc.) have very specific statistical properties that are efficiently encoded by biological nervous systems. Sensory inputs usually appear smooth over large spatial and temporal regions, which leads to redundancy in the sensory input. The redundancy emerges as a statistical regularity, meaning that many pieces of a signal are a priori predictable from other pieces, and hence by clever recoding it is possible to obtain a more economical representation of the data. In the past, the principle of redundancy reduction (Barlow, 1961) was suggested as a coding strategy in neurons. According to this principle, each neuron should encode features that are as statistically independent as possible from other neurons over a natural ensemble of inputs. The resulting representation is called a factorial code (Redlich, 1993). In the factorial code, the multivariate probability density function (pdf) is factorized as a product of marginal pdfs. This property provides an efficient way of storing statistical knowledge¹ about the input (Barlow, 1989).

One of the earliest attempts to construct a factorial representation via neural networks was proposed by Atick (1992). The underlying computational learning rule is based on the minimization of the sum of the entropies of the hidden units, under the constraint of preserving the input entropy (the total information about the signal). A type of gradient descent algorithm, under the assumption of a Gaussian input signal and linear output, results in a Hebbian-like learning rule and a decorrelated hidden representation. The major limitation of Hebbian-like rules is their dependence on the linear, pairwise correlations among image pixels (second order statistics). Thus, they are not sensitive to the phase changes in the image that are responsible for oriented, localized structures such as lines, edges and corners (Field, 1994). Motivated by the principle of redundancy reduction, Field (1994) contrasts two different coding approaches. Both take advantage of the input redundancy, but in different ways. The first, compact coding, is based on the mean-squared error and uses only the second order statistics of the input. The main goal of this coding is to reduce the dimensionality of the input in the directions with low input variance. PCA and linear auto-associator networks, considered in Chapter 3, are examples of this coding scheme. An alternative, sparse distributed coding, does not necessarily imply a reduction of dimensionality; on the contrary, the dimensionality may even be enlarged. A sparse distributed

For an image description the probability of each possible set of pixel values has to be known. For instance, an image having N pixels with Q intensity quantization levels requires the storing of QN possible probabilities. If the code is factorial the number of the required probabilities reduces to N Q.

Chapter 4: Unsupervised learning constraints


coding approach encourages representations in which only a small subset of hidden units, adaptive to the input, is simultaneously active. Although there is no general tool for forming a sparse code, it has some typical features. A sparse code is characterized by an extremely peaked distribution of the hidden unit activities, which provides a high probability for a neuron to be either silent or active according to its relevance to the input pattern representation. A way to construct sparse coding based on this feature has been proposed in (Olshausen and Field, 1996), by minimizing a cost functional consisting of a mean-squared error and a penalty term on the neuron activities. Peaked distributions are characterized by high kurtosis or low entropy (Oja, 1995b); thus, maximization of kurtosis or minimization of entropy can be used for sparse code formation. At the same time, via minimization of the sum of the entropies, sparse coding is related to factorial coding. It is also known that, under a fixed variance, the Gaussian distribution has the largest entropy (Cover and Thomas, 1991). Thus, hidden unit entropy minimization is tightly related to exploratory projection pursuit (EPP), which tries to find structure in the projected data by seeking directions that are as far from Gaussian as possible (Friedman, 1987). Therefore, a deviation from the Gaussian distribution serves as a good measure of hidden unit independence and can be used as a strategy for sparse code construction. Recently, interest in EPP has been revived, and the formulation of new unsupervised rules based on information theory has been stimulated by the development of independent component analysis (ICA). In the next section, ICA is formulated and some algorithms producing factorial codes are presented.

4.3 Mathematical background

ICA has been developed as a tool for blind source separation. The problem is to recover independent sources from sensory observations which are unknown linear mixtures of the unobserved independent source signals. Let us consider m unknown, mutually independent sources s_i(t), i = 1, ..., m, with no more than one being normally distributed. In general, t is a sampling variable that may be a time variable for signals, a two dimensional spatial variable for images, or the index of a pattern in a data-set. The sources are mixed together linearly by an unknown non-singular matrix A ∈ R^{n×m}:

x(t) = As(t),  s(t) = [s_1(t), ..., s_m(t)]    (4.3.1)

It is assumed that in (4.3.1) the number of sensors x_i(t), i = 1, ..., n is greater than or equal to the number of sources (n ≥ m). The task is to recover the original signals via a linear transform defined by a matrix W ∈ R^{m×n}:

u(t) = Wx(t),  u(t) = [u_1(t), ..., u_m(t)]    (4.3.2)

Since the recovered signals may be permuted and scaled versions of the sources, the de-mixing matrix W has to be a solution of the following linear equation: ΛP = WA, where Λ is a non-singular diagonal matrix and P is a permutation matrix.
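The permutation and scaling ambiguity can be illustrated with a small numerical sketch (not part of the thesis; the matrices A, Λ and P below are arbitrary illustrative choices for the square case n = m):

```python
import numpy as np

# hypothetical 3x3 mixing matrix A, scaling Lambda and permutation P
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.2, 1.0]])
Lam = np.diag([2.0, -1.0, 0.5])          # non-singular diagonal matrix
P = np.eye(3)[[2, 0, 1]]                 # permutation matrix

# any W solving Lam @ P = W @ A is a valid de-mixing matrix:
W = Lam @ P @ np.linalg.inv(A)

s = np.random.default_rng(0).laplace(size=(3, 5))   # sources
x = A @ s                                           # observations (4.3.1)
u = W @ x                                           # recovered signals (4.3.2)

# u equals the sources up to permutation and scaling: u = Lam @ P @ s
assert np.allclose(u, Lam @ P @ s)
```

Any such W recovers the sources only up to order and scale, which is why stronger (statistical) criteria are needed to find W from the observations alone.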

4.3.1 Entropy maximization (ME)

One of the first algorithms extracting the independent components via a neural network has been proposed by Bell and Sejnowski (1995). Assuming that the number of sources is equal to the number of sensors, a fully connected n → n feed-forward network, consisting of an input layer and a nonlinear output layer with the same number of units as the number of sources, has been considered (Figure 4.1). The network has been trained to maximize the joint entropy H(y) of the nonlinear output y:

u = Wx + w_0,  y = g(u),  y ∈ R^n, u ∈ R^n, w_0 ∈ R^n    (4.3.3)

H(y) = −∫ p(y) log p(y) dy

Figure 4.1: A one-layer n → n feed-forward network for independent component extraction: input x, presynaptic output u = Wx (the recovered sources), output y_i = g_i(u_i).

In the case of additive output noise, entropy maximization (ME) is equivalent to maximization of the mutual information between input and output (Nadal and Parga, 1994). As has been shown earlier (Linsker, 1988), the principle of mutual information maximization, called "infomax", leads in the case of a linear neural network to a Hebbian-like learning rule that is sensitive to the second order statistics only; therefore, nonlinearity in the output layer is essential. The joint entropy of the output can be represented as:

H(y) = Σ_i H(y_i) − I(y),    (4.3.4)

where H(y_i) = −∫ p(y_i) log p(y_i) dy_i are the marginal entropies of the outputs and I(y) is their mutual information. The mutual information (MI) of the output y is the Kullback-Leibler measure between the output distribution p(y) and the product of the marginal distributions Π_i p(y_i):

I(y) = ∫ p(y) log [ p(y) / Π_i p(y_i) ] dy    (4.3.5)

Due to the ∩-convexity of the log function, the Kullback-Leibler measure is nonnegative and

attains its minimum value of zero if and only if the outputs y_i are independent almost everywhere. Maximization of the joint entropy consists of maximizing the marginal entropies and minimizing the mutual information. Since the nonlinear functions bound the outputs, the marginal entropies are maximal for a uniform distribution of y_i. The mutual information I(y) is invariant under an invertible component-wise transform (I(y) = I(u)) and achieves its minimum of zero when the presynaptic outputs u (4.3.3) are independent. Thus, if the nonlinear functions g_i have the form of the cumulative density functions (cdfs) of the true source distributions, then the matrix W recovers the independent sources as the presynaptic output u (4.3.3), and this is the single global maximum of the joint entropy H(y), which is a ∩-convex function. As has been rigorously proven (Yang and Amari, 1997), the ME approach leads to the independent components only if the nonlinear activation functions g_i in the output layer coincide with the cumulative density functions (cdfs) of the sources. For zero mean mixtures and functions g_i not equal to the cdfs of the sources, the ME algorithm does not converge to the ICA solution W = ΛPA^{-1}. However, if the initial matrix is the right ICA solution W_0 = ΛPA^{-1}, the algorithm does not update the de-mixing matrix W in the directions of increasing cross-talk. This fact partially explains the success of ME even when the cdfs are not known exactly. In the applications considered by Bell and Sejnowski (1995), the nonlinear activation functions have been chosen ad hoc as logistic sigmoids, which have a highly peaked derivative with long tails. Since sound signals are super-Gaussian^2, this type of nonlinearity appears to be appropriate for the "infomax" principle.

^2 Super-Gaussian signals have pdfs with large tail areas and a sharp peak. In contrast, sub-Gaussian signals have pdfs with small tail areas and a flat peak (see also Appendix A to Chapter 4).


The de-mixing matrix W is found as the synaptic weights of the network, iteratively, using the stochastic gradient ascent method applied to the joint entropy H(y):

∆W = η([W^t]^{-1} + (1 − 2y)x^t),  ∆w_0 = η(1 − 2y),

where 1 ∈ R^n denotes the vector of ones. Amari et al. (1997) have suggested a modification of this rule that utilizes the natural gradient and does not require the inversion of the weight matrix. It proceeds by multiplying the absolute gradient by W^t W, producing (with w_0 assumed to be zero):

∆W = η(I + (1 − 2y)u^t)W    (4.3.6)
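As a minimal numerical illustration of the natural-gradient rule (4.3.6) (a sketch, not code from the thesis; the mixing matrix, learning rate and iteration count are arbitrary assumptions), consider two Laplacian (super-Gaussian) sources unmixed with the logistic nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 2, 10000
s = rng.laplace(size=(n, T))                  # super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])        # unknown mixing matrix
x = A @ s

W = np.eye(n)                                 # de-mixing matrix, w0 = 0
eta = 0.05
for _ in range(2000):
    u = W @ x
    y = 1.0 / (1.0 + np.exp(-u))              # logistic sigmoidal nonlinearity
    # batch-averaged natural-gradient update, Eq. (4.3.6)
    W += eta * (np.eye(n) + (1 - 2 * y) @ u.T / T) @ W

M = W @ A                                     # should approach a scaled permutation
```

If the rule converges, each row of M = WA is dominated by a single entry, i.e., M approaches ΛP.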

4.3.2 Minimization of the output mutual information (MMI)

Another way to derive independent outputs for the blind separation problem has been presented in (Amari et al., 1996). The algorithm minimizes the mutual information (MI) of the linear outputs, I_u(W):

I_u(W) = −H(u) + Σ_{i=1}^n H(u_i),    (4.3.7)
u = Wx,    (4.3.8)

which attains its minimum if and only if the outputs u_i are independent almost everywhere. In order to approximate the marginal entropies H(u_i), a truncated Gram-Charlier expansion (Stuart and Ord, 1994) of the marginal pdfs p(u_i) has been used, under the mild assumption that the original sources have zero mean and variances normalized to 1. A stochastic gradient descent applied to the approximated expression of the mutual information I_u(W) leads to the following equation for the network weight dynamics:

∆W = η([W^t]^{-1} − Φ(u)x^t)    (4.3.9)

where Φ(u) = f(k_3, k_4) ∘ u^2 + g(k_3, k_4) ∘ u^3, with the notation:

f ∘ y = [f_1 y_1, ..., f_n y_n]^t,  u^k = u ∘ u^{k-1}
f(k_3, k_4) = [f(k_3^1, k_4^1), ..., f(k_3^n, k_4^n)]^t,  g(k_3, k_4) = [g(k_3^1, k_4^1), ..., g(k_3^n, k_4^n)]^t
k_3^i = m_3^i = E[u_i^3],  k_4^i = E[u_i^4] − 3(E[u_i^2])^2
f(a, b) = −(1/2)a + (9/4)ab,  g(a, b) = −(1/6)b + (3/2)a^2 + (3/4)b^2


The natural gradient descent for MMI leads to the following algorithm:

∆W = η(t)[I − Φ(u)u^t]W    (4.3.10)

As has been pointed out in (Yang and Amari, 1997), both the ME and MMI algorithms have the same typical form (4.3.10). In ME, Φ depends on the nonlinear activation functions g_i and is given by:

Φ(u) = −( g_1''(u_1)/g_1'(u_1), ..., g_n''(u_n)/g_n'(u_n) )^t    (4.3.11)

Since the g_i should coincide with the cdfs of the unknown original signals, Φ(u) has to be chosen properly. In MMI, the functions Φ depend on the third and fourth order cumulants k_3^i, k_4^i of the linear scalar outputs u_i. These cumulants may be replaced by their instantaneous values or estimated; another possibility is to use a-priori knowledge about the cumulants of the unknown original signals. Therefore, whereas the success of the ME algorithm depends on a-priori knowledge about the data statistics, MMI is more flexible. In (Yang and Amari, 1997) the following types of Φ(u) = (φ(u_1), ..., φ(u_n))^t have been used:

(a) φ(u) = u^3    (4.3.12)
(b) φ(u) = tanh(u)    (4.3.13)
(c) φ(u) = (3/4)u^11 + (15/4)u^9 − (14/3)u^7 − (29/4)u^5 + (29/4)u^3    (4.3.14)

The (a)-(b) forms of Φ(u) correspond to the ME algorithm and assume the pdfs (and, equivalently, g(u))^4 to be proportional to:

(a) p(u) ∝ exp(−u^4/4)    (4.3.15)
(b) p(u) ∝ (cosh(u))^{-1}    (4.3.16)

Therefore, in both cases the distributions are assumed to be symmetrical and sub-Gaussian. The form (4.3.14) of Φ(u) is the instantaneous form of MMI (k_3^i = u_i^3, k_4^i = u_i^4 − 3) and does not assume a particular shape of the source distributions.

The ME and MMI learning rule (4.3.10) has been obtained under the assumption of a square weight matrix W. However, in some applications it may be interesting to separate

^4 Here we use the fact that −g''(u)/g'(u) = φ(u), which leads to g'(u) = exp(−∫_0^u φ(v)dv), and at the same time g'(u) must coincide with the pdfs of the original sources.


only a part of the sources. This may be done via multiplication of the right side of the learning rule (4.3.9) by W^t ΛW, where the block matrix Λ ∈ R^{n×n} is given by:

Λ = ( I    0
      0^t  0_s ).    (4.3.17)

In (4.3.17), I ∈ R^{m×m} is an identity matrix, 0 ∈ R^{m×(n−m)} is a rectangular zero matrix, 0_s ∈ R^{(n−m)×(n−m)} is a square zero matrix, and m < n. The final learning rule for the part W̃ ∈ R^{m×n} of the weight matrix, obtained by deleting the last (n − m) rows of the matrix W, is the same as (4.3.10):

∆W̃ = η(t)[I − Φ(u)u^t]W̃,
W̃ ∈ R^{m×n},  Φ(u) ∈ R^{m×1},  u ∈ R^{m×1},  I ∈ R^{m×m}

The network architecture then implies dimensionality reduction since the number of output units is less than the number of input units. In addition, such a network extracts independent components.
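A sketch of the natural-gradient MMI rule (4.3.10) with the instantaneous cubic nonlinearity (4.3.12), here applied to sub-Gaussian (uniform) sources; all numerical settings are illustrative assumptions, not values from the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 2, 10000
# uniform sources scaled to unit variance (sub-Gaussian)
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, T))
A = np.array([[1.0, 0.5], [0.3, 1.0]])        # unknown mixing matrix
x = A @ s

W = np.eye(n)
eta = 0.02
for _ in range(2000):
    u = W @ x
    phi = u ** 3                              # instantaneous choice (a), Eq. (4.3.12)
    # batch-averaged natural-gradient update, Eq. (4.3.10)
    W += eta * (np.eye(n) - phi @ u.T / T) @ W

M = W @ A                                     # should approach a scaled permutation
```

For sub-Gaussian sources the cubic nonlinearity is the stable choice, matching the discussion of (4.3.15) above.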

4.3.3 Relation to Exploratory Projection Pursuit

MMI has been considered as the starting point for a large family of ICA contrast functions proposed by Hyvarinen (Hyvarinen, 1997a). It has been noted that the MI can be expressed using the negentropies J(u), J(u_i)^5 (Hyvarinen, 1997a; Girolami and Fyfe, 1996):

I_u(W) = J(u) − Σ_i J(u_i) + (1/2) log( Π_i C_ii / det(C) ),    (4.3.18)

where C is the covariance matrix of u and C_ii are its diagonal elements. Since the negentropy J(u) is invariant under invertible linear transformations (J(u) = J(x); note that J(u_i) = J(x_i) holds only when the nonlinear transformation x → u is component-wise, with u_i = f(x_i)), MMI is roughly equivalent to finding directions in which the negentropy is maximized. This equivalence is rigorous when the components u_i are constrained to be uncorrelated (the last term of (4.3.18) is then zero). This means that directions in which the data distribution is as non-Gaussian as possible are preferable. This is the point where EPP and ICA have come into contact. The natural gradient ascent applied to the sum of the marginal negentropies leads to the same learning rule (4.3.10) (Girolami and Fyfe, 1996; Lee et al., 1998). When the

^5 The negentropy of a multivariate random variable u is the difference between the entropy of the multivariate Gaussian distribution with the same covariance matrix as u and the entropy of u: J(u) = H(u_G) − H(u). It measures the deviation of the distribution from the Gaussian and is nonnegative. A valuable property of the negentropy is its invariance under invertible linear transforms.


nonlinearities Φ are taken to be:

φ_i(u_i) = u_i + tanh(u_i)  for a super-Gaussian source
φ_i(u_i) = u_i − tanh(u_i)  for a sub-Gaussian source

the learning rule may be written in the elegant form:

∆W = η(t)[I − K tanh(u)u^t − uu^t]W,    (4.3.19)

where K is a diagonal matrix with elements sign(kur(u_i)) and kur(u_i) is the kurtosis of the i-th source. The advantage of EPP, however, is the possibility to find the independent components recursively, one-by-one, by maximization of the 1-D negentropy. Under the same conditions as in (Yang and Amari, 1997), E[u_i] = 0, E[u_i^2] = 1, the negentropy may be approximated by:

J(u_i) ≈ (1/12)(E[u_i^3])^2 + (1/48)(k_4(u_i))^2    (4.3.20)

When the source distributions are assumed to be symmetrical, the negentropy simplifies to J(u_i) ∝ (k_4(u_i))^2 = kur_i^2, and minimization of the output mutual information is approximately equivalent to maximization of the sum of the squared source kurtoses^6:

F_max(W) = Σ_i kur_i^2    (4.3.21)

In other words, directions in which the signal distribution is highly peaked or extremely flat are considered interesting. In (Hyvarinen, 1997b) a new family of approximate ICA contrast functions has been proposed via the negentropy approximation:

J(u) ∝ (E[G(u)] − E[G(ν)])^2,    (4.3.22)

where ν is a standardized Gaussian variable and the function G fulfills some orthogonality property, is suited to the assumed original source statistics, and is reasonably simple to compute. The simplest proposed choice for the function G is the polynomial G = |u|^α, with α < 2 for super-Gaussian densities and α > 2 for sub-Gaussian densities. This approach finally appears as a generalization of different projection pursuit indices (Blais et al., 1998), where skewness and kurtosis are used explicitly to measure the deviation from the Gaussian distribution. It is also related to the BCM neuron learning rule (Intrator and Cooper, 1992).

^6 See Appendix A to Chapter 4 for the definitions of cumulants and kurtosis.
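The approximation (4.3.20) can be checked numerically on samples (an illustrative sketch; the distributions and sample size are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def negentropy_approx(u):
    # Eq. (4.3.20): J(u) ~ (1/12) E[u^3]^2 + (1/48) k4(u)^2, for E[u]=0, E[u^2]=1
    u = (u - u.mean()) / u.std()
    k3 = np.mean(u ** 3)
    k4 = np.mean(u ** 4) - 3.0
    return k3 ** 2 / 12.0 + k4 ** 2 / 48.0

N = 200_000
J_gauss   = negentropy_approx(rng.normal(size=N))
J_laplace = negentropy_approx(rng.laplace(size=N))     # super-Gaussian
J_uniform = negentropy_approx(rng.uniform(-1, 1, N))   # sub-Gaussian
# the Gaussian direction is the least "interesting" under this index
```

Both the peaked (Laplacian) and flat (uniform) distributions score higher than the Gaussian, in line with the EPP notion of interesting directions.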

4.3.4 BCM

The idea of BCM is to find a direction w that emphasizes data multi-modality by minimizing a specific loss function (a specific projection index):

F(w) = −µ( (1/3)E[u^3] − (1/4)θ^2 ),  u = w^t x,  θ = E[u^2]    (4.3.23)

In order to make this measure robust to outliers, a rectifying nonlinear function is applied to the linear output; thus, in general, y = g(w^t x). Gradient descent yields the following learning rule:

∆w = µ E[φ(y, θ) g'(u) x]    (4.3.24)

where φ(y, θ) = y^2 − yθ and θ = E[y^2].
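The projection index (4.3.23) can be evaluated directly on sample projections. In this illustrative sketch (with µ = 1 and a hypothetical skewed two-cluster data set, not data from the thesis), the multi-modal direction scores lower (better) than a Gaussian one:

```python
import numpy as np

rng = np.random.default_rng(4)

def bcm_index(u, mu=1.0):
    # projection index (4.3.23): F = -mu * ( E[u^3]/3 - E[u^2]^2 / 4 )
    return -mu * (np.mean(u ** 3) / 3.0 - np.mean(u ** 2) ** 2 / 4.0)

# 2-D toy data: a skewed two-cluster structure along the first axis,
# plain Gaussian noise along the second
T = 20000
cluster = np.where(rng.random(T) < 0.5, 0.0, 2.0) + 0.05 * rng.normal(size=T)
noise = rng.normal(size=T)
X = np.vstack([cluster, noise])

F_multimodal = bcm_index(X[0])   # projection onto the clustered direction
F_gaussian = bcm_index(X[1])     # projection onto the noise direction
```

Minimizing F over w therefore pulls the projection toward multi-modal (clustered) directions rather than Gaussian ones.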

4.3.5 Sum of entropies of the hidden units

Motivated by the wish to obtain a hidden representation in which each neuron contains as much information as possible, we suggest maximizing the sum of the entropies of the output units:

F(W) = Σ_{i=1}^m H(y_i).

The stochastic gradient ascent method leads to the following equation for the weight dynamics (details are given in Appendix B to Chapter 4):

∆W = η( f(u) + g''/g' ) x^t,    (4.3.25)

where f(u) is defined as φ(u) in (4.3.14). Since the nonlinear output functions bound the output values y_i, the entropy is maximized when y_i is uniformly distributed, which leads to the relation p_u(u_i) = dg_i/du_i. This means that the distribution of the presynaptic variables u_i is controlled by the nonlinearities in the learning rule (4.3.25). For logistic sigmoidal activation functions, (4.3.25) simplifies to:

∆W = η( f(u) + (1 − 2y) ) x^t    (4.3.26)

The same rule, but with a negative parameter η, can be used for sparse code formation, as suggested in (Olshausen and Field, 1996; Atick, 1992).^7

^7 When the output is bounded, c < y < d, due to the ∩-convexity of the log function the entropy of the output is upper bounded: H(y) = ∫_c^d p(y) log(1/p(y)) dy ≤ log ∫_c^d p(y) (1/p(y)) dy = log(d − c). Therefore, entropy maximization is properly defined mathematically. At the same time, the lower estimate depends on the distribution: −H(y) = ∫_c^d p(y) log p(y) dy ≤ log ∫_c^d (p(y))^2 dy ≤ 2 log((d − c) max p(y)). It is clear that in practice max p(y) is bounded and, therefore, the problem of minimization of the sum of entropies is also properly defined mathematically.

4.3.6 Nonlinear PCA

Although the nonlinear PCA method has no apparent connection to ME or MMI, it has been shown that it allows separation of whitened linear mixtures of sources (Oja, 1995b; Oja, 1995a). In nonlinear PCA, the input signals are first prewhitened, i.e., the signals are represented as projections onto the eigenspace of the input covariance matrix and are properly scaled. As a result, the prewhitened signal x has zero mean and a unit spherical covariance matrix. The learning rule is an approximate stochastic gradient descent algorithm that minimizes the mean-squared reconstruction error:

E = E[ ||x − W^t y||^2 ],    (4.3.27)

where the weight matrix W and the nonlinear output y are defined as in (4.3.3) and the bias is assumed to be zero (w_0 = 0). An approximate learning rule has the form:

∆W = η y(x^t − y^t W)    (4.3.28)

For separation, odd, twice differentiable nonlinear functions g_i have to be chosen properly to satisfy stability conditions that depend on the data statistics. In particular, it is shown (Oja, 1995a) that a sigmoidal activation function such as g = tanh(βu), β > 0, is feasible for sub-Gaussian original signals, and the polynomial g = u^3 for super-Gaussian densities (in this analysis it was assumed that the sources are statistically identical and have a symmetrical distribution). The MSE for whitened data and nonlinear activation functions of the form g(u) = u^3 or tanh(u) may be approximated as −Σ_i kur_i (Lee et al., 1998). Thus, minimization of the MSE leads to maximization of the sum of the kurtoses:

F_max(W) = Σ_i kur_i    (4.3.29)

The latter expression is equivalent to (4.3.21) for super-Gaussian original sources. This evaluation shows that in some cases nonlinear PCA can also be viewed from information-theoretic principles, as a method that approximately minimizes the mutual information of the output.
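The prewhitening step and the rule (4.3.28) can be sketched as follows (an illustrative implementation with assumed settings, not the one used in the thesis):

```python
import numpy as np

rng = np.random.default_rng(6)
n, T = 2, 10000
# sub-Gaussian (uniform, unit-variance) sources, linearly mixed
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, T))
x = np.array([[1.0, 0.7], [0.3, 1.0]]) @ s

# prewhitening: project onto the eigenvectors of the covariance and rescale,
# so that z has zero mean and unit spherical covariance
C = np.cov(x)
d, E = np.linalg.eigh(C)
z = np.diag(d ** -0.5) @ E.T @ (x - x.mean(axis=1, keepdims=True))

W = np.linalg.qr(rng.normal(size=(n, n)))[0]    # random orthogonal start
eta = 0.1
for _ in range(500):
    y = np.tanh(W @ z)                          # g = tanh, the sub-Gaussian choice
    # batch-averaged form of rule (4.3.28): dW = eta * y (z^t - y^t W)
    W += eta * (y @ z.T - y @ y.T @ W) / T
```

The whitening step is what lets the subsequent nonlinear Hebbian dynamics act on a spherical distribution, where only higher-order structure remains.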

4.3.7 Reconstruction issue

Learning in nonlinear PCA and in nonlinear autoencoders is based on the mean-squared reconstruction error. Similarly to the linear case, nonlinear PCA and the nonlinear autoencoder extract different weights. The nonlinear autoencoder with proper activation functions does


not necessarily extract the independent components, as nonlinear PCA does in some cases. However, this consideration sheds light on the relation between reconstruction and unsupervised learning based on information theory. Reconstruction and ICA are also related via the generative model approach (MacKay, 1996; Roweis and Ghahramani, 1997; Lee et al., 1998). ICA, which recovers the independent components (hidden variables) and the de-mixing weight matrix W, is itself the recognition phase of a reconstruction process, with a nonlinear generative model that differs from the generative models underlying PCA and the linear autoencoder. Thus, although ICA (information-theoretic) constraints may also be considered as a type of "generalized" reconstruction constraints with another underlying generative model, we keep the notion of reconstruction constraints for an autoencoder network. ICA, similarly to PCA, has also been used as a preprocessing step for face classification (Bartlett et al., 1998). As will become clear later, the hybrid classification/feature extraction scheme introduced in the next section corresponds to this type of preprocessing when the trade-off parameter λ = 1.

4.4 Imposing unsupervised constraints

The unsupervised learning rules we have used are based on different assumptions about the quality of the low dimensional representation (LDR). These rules are based on statistics of order higher than two, and use low order moments of the distribution and a sigmoidal squashing function for robustness against outliers. The learning rule for the hidden weight modification of the constrained network (Figure 2.2) is described by:

∆w = −η((1 − λ)∇_w E^1 − λh(w, x)),    (4.4.30)

where the term h(w, x) corresponds to the weight update that emerges from the additional unsupervised feature extraction. When h(w, x) is the gradient of some information measure H(w, x)^8, the learning rule (4.4.30) corresponds to minimization of the penalized mean squared recognition error:

F(w, W^1) = (1 − λ)E^1(w, W^1) − λH(w, x).    (4.4.31)

Table 4.1 summarizes the different learning constraints with the corresponding h(w, x) function. The bottom rows of Table 4.1 describe a few variations on the sum of entropies

^8 The term h(w, x) can appear as the gradient of some information measure scaled by a positive definite matrix P(x, w); then, in general, the corresponding H(w, x) may not exist. We use the negative sign before the h(w, x) term in Eq. 4.4.30 for convenience, since most of the feature extraction rules used are formulated as maximization problems.


Unsupervised Constraints

Type of constraints        | h(x, w)
Entropy maximization (ME)  | (Bell and Sejnowski, 1995) with sigmoidal activation function: ∆W = η(I + (1 − 2y)u^t)W  (4.3.6)
BCM                        | (Intrator and Cooper, 1992) with sigmoidal activation function: ∆w_ij = η E[φ(y_i, θ_i) g'(u_i) x_j]  (4.3.24)
Sum of entropies: A        | ∆w_ij = η(f(u_i) + (1 − 2g(u_i))) x_j,  f(u) = u^3
Sum of entropies: B        | f(u) = 2 tanh(u)
Sum of entropies: C        | f(u) = (3/4)u^11 + (15/4)u^9 − (14/3)u^7 − (29/4)u^5 + (29/4)u^3
Sum of entropies: D        | ∆w_ij = −η(f(u_i) + (1 − 2g(u_i))) x_j,  f(u) = (3/4)u^11 + (15/4)u^9 − (14/3)u^7 − (29/4)u^5 + (29/4)u^3
Nonlinear PCA              | ∆W = η y(x^t − y^t W)

Table 4.1: Different learning rules used as unsupervised constraints in addition to reconstruction (see text and Appendix for details).

rules, based on different choices of the function f(u). These functions emphasize different statistical properties of the input distribution and are discussed in (Blais et al., 1998). In particular, variants C and D use the Gram-Charlier approximation to the entropy, which is expressed via moments (Stuart and Ord, 1994). Variant D represents minimization of entropy rather than maximization, as might be suggested by the desire to find distributions that are far from Gaussian. Similarly to the hybrid network with reconstruction constraints, the constrained network with the learning rule (4.4.30) may be interpreted as competitive learning of two nets, one for classification and one for statistical feature extraction. The output layer of the feature extraction network coincides with the hidden layer of the classification network. Thus, the hybrid network learns to classify and to extract useful statistical properties simultaneously.
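The f(u) variants of Table 4.1 are simple scalar functions; they can be collected in code as follows (a sketch: the names f_A, f_B, f_C and the helper below are illustrative, not from the thesis):

```python
import numpy as np

# the f(u) variants of the sum-of-entropies constraints in Table 4.1
def f_A(u):
    return u ** 3

def f_B(u):
    return 2.0 * np.tanh(u)

def f_C(u):
    # Gram-Charlier-based polynomial, cf. Eq. (4.3.14)
    return (3/4) * u**11 + (15/4) * u**9 - (14/3) * u**7 \
           - (29/4) * u**5 + (29/4) * u**3

def h_sum_of_entropies(u, x, g=lambda v: 1.0 / (1.0 + np.exp(-v)), f=f_A):
    # update direction of the sum-of-entropies rules:
    # h_ij = (f(u_i) + (1 - 2 g(u_i))) x_j
    return np.outer(f(u) + (1 - 2 * g(u)), x)
```

Variant D is obtained by negating the returned update, turning entropy maximization into minimization.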

4.5 Imposing unsupervised and reconstruction constraints

Generalizing our approach further, we propose to constrain classification by reconstruction together with other types of unsupervised constraints (see Figure 3.4). The generalized learning rule has the form:

∆W^1 = −η ∇_{W^1} E^1(w, W^1)
∆W^2 = −η ∇_{W^2} E^2(w, W^2)
∆w = −η((1 − λ)∇_w E^1 + λ((1 − µ)∇_w E^2 − µh(w, x))),    (4.5.32)

where we now have two regularization parameters, λ and µ. Thus, the most general network corresponds to the goal function (3.2.4), and its flow-chart is presented in Figure 3.4.
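The blended hidden-weight update of (4.5.32) reduces to the earlier special cases at the extreme values of λ and µ; a small sketch (the function name and the gradient arguments are illustrative, not from the thesis):

```python
import numpy as np

def hidden_update(grad_E1, grad_E2, h, eta, lam, mu):
    """Hidden-weight update of Eq. (4.5.32): the classification gradient,
    the reconstruction gradient and the unsupervised term h are blended
    by the two regularization parameters lam and mu."""
    return -eta * ((1 - lam) * grad_E1 + lam * ((1 - mu) * grad_E2 - mu * h))

g1 = np.array([1.0, 0.0])   # hypothetical classification gradient
g2 = np.array([0.0, 1.0])   # hypothetical reconstruction gradient
h  = np.array([1.0, 1.0])   # hypothetical unsupervised term

# lam = 0 recovers plain classification; lam = 1, mu = 0 recovers the
# purely reconstruction-constrained rule of Chapter 3
assert np.allclose(hidden_update(g1, g2, h, 1.0, 0.0, 0.5), -g1)
assert np.allclose(hidden_update(g1, g2, h, 1.0, 1.0, 0.0), -g2)
```

Intermediate (λ, µ) values trade off recognition, reconstruction and the unsupervised statistical criterion simultaneously.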


Appendix A to Chapter 4: Order statistics

Here we give some definitions and relations between order statistics (see (Stuart and Ord, 1994)).

Definition: The moment of order r about the point a is

µ'_r = ∫_{-∞}^{∞} (x − a)^r dF,    (4.5.33)

where F is a distribution function.

Definition: The characteristic function (c.f.) is

φ(t) = ∫_{-∞}^{∞} exp(itx) dF    (4.5.34)

It may easily be seen that the moments µ'_r about the point 0 are related to the r-th order derivative d^r_t φ(t) of the characteristic function φ(t) via:

µ'_r = (−i)^r [d^r_t φ(t)]_{t=0}    (4.5.35)

Another set of widely used statistical measures are the cumulants.

Definition: The cumulants are defined by the identity

Σ_{r=1}^{∞} k_r (it)^r / r! = log φ(t)    (4.5.36)

Thus, if the moment of order r, µ'_r, is the coefficient of (it)^r/r! in the Taylor series expansion of the characteristic function φ(t), then k_r is the coefficient of (it)^r/r! in the Taylor series expansion of log φ(t). The relations between the first four cumulants and moments are:

k_1 = µ'_1
k_2 = µ'_2 − µ'_1^2
k_3 = µ'_3 − 3µ'_1µ'_2 + 2µ'_1^3
k_4 = µ'_4 − 4µ'_3µ'_1 − 3µ'_2^2 + 12µ'_2µ'_1^2 − 6µ'_1^4

In order to describe some interesting properties of a distribution, further statistical measures have been defined.

Definition: Kurtosis:

kur(u) = µ_4/µ_2^2 − 3 = k_4/k_2^2    (4.5.37)

The kurtosis characterizes the degree of peakedness of the graph of a statistical distribution; it is indicative of the concentration around the mean. A distribution for which the kurtosis equals zero is called mesokurtic. Those with positive kurtosis are called


leptokurtic, and those with negative kurtosis platykurtic. The kurtosis is equal to zero for the Gaussian distribution, negative for sub-Gaussian and positive for super-Gaussian random variables. A super-Gaussian random variable is "sharper" than the Gaussian: its pdf has large tail areas and is more sharply peaked. The pdf of a sub-Gaussian random variable has smaller tail areas and is flatter-topped (see Figure 4.2). For a normally distributed random variable (µ_1 = 0 and µ_2 = 1), the kurtosis coincides with the fourth order cumulant. Consider the following family of density functions:

f_α(x) = C_1 exp(−C_2 |x|^α),    (4.5.38)

where the positive constants C_1, C_2 are normalization constants that ensure that f_α is a probability density with unit variance:

C_1 = (m_2/m_1^3)^{1/2};  C_2 = (m_2/m_1)^{α/2},

where

m_1 = ∫_{-∞}^{∞} exp(−|x|^α) dx = (2/α) Γ(1/α)
m_2 = ∫_{-∞}^{∞} x^2 exp(−|x|^α) dx = (2/α) Γ(3/α)    (4.5.39)

The different values of the positive parameter α produce different shapes of the distribution. The random variable is super-Gaussian for 0.5 < α < 2 and sub-Gaussian for α > 2 (Figure 4.2).

Figure 4.2: Sample graphs for a family of the exponential density functions. This figure demonstrates the typical shapes of the super-Gaussian (α = 0.75) and sub-Gaussian (α = 5) random variables; α = 2 is the Gaussian case.
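The sub/super-Gaussian distinction can be verified numerically via the sample kurtosis (4.5.37) (an illustrative sketch; the chosen distributions and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def kurtosis(x):
    # kur(u) = mu4 / mu2^2 - 3, Eq. (4.5.37), computed on centered samples
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

laplace = rng.laplace(size=N)       # alpha = 1: super-Gaussian (kur = 3)
gauss   = rng.normal(size=N)        # alpha = 2: mesokurtic (kur = 0)
uniform = rng.uniform(-1, 1, N)     # flat-topped: sub-Gaussian (kur = -1.2)
```

The three samples give positive, near-zero and negative kurtosis estimates respectively, matching the leptokurtic/mesokurtic/platykurtic classification above.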

Appendix B to Chapter 4: Derivation of the sum of entropies learning rule

We consider compression of the input x = (x_1, x_2, ..., x_n) via the following nonlinear transformation:

u_i = Σ_{j=1}^n w_ij x_j + w_i0,  i = 1, ..., m,  m < n,
y_i = g(u_i).

u and y are the vectors of pre- and post-synaptic activations of the hidden layer, w_ij are the network weights and w_i0 the network biases; g is a nonlinear, monotone-increasing activation function. The network architecture is presented in Figure 4.3.

Figure 4.3: Exploratory projection pursuit network (input x, u = Wx, output y_i = g_i(u_i)). Feature extraction is achieved via (nonlinear) projection and dimensionality reduction.

As a learning rule we choose to maximize the sum of the entropies of the hidden units:

F(W) = Σ_{i=1}^m H(y_i).

The probability density of the output of a hidden unit, p_y(y_i), can be written as:

p_y(y_i) = p_u(u_i) / y'_i,  where y'_i = dy_i/du_i.

This leads to:

∫ p(y_i) ln(p(u_i)) dy_i = ∫ p(u_i) ln(p(u_i)) du_i,


which implies the following expression for the sum of entropies:

F(W) = E[ Σ_{i=1}^m ln( y'_i / p(u_i) ) ] = Σ_{i=1}^m H(u_i) + E[ Σ_{i=1}^m ln y'_i ] ≡ F_1 + E[F_2]

Thus, our goal consists of two terms. The first, F_1, is the sum of the entropies of the presynaptic activations of the hidden units, and was evaluated by Amari et al. (1996) using a truncated Gram-Charlier expansion to approximate the probability density function (pdf) p_u(u_i). The second, E[F_2], represents the expectation of the sum of the log-terms. The weights W have to be adjusted to maximize F(W). Using a gradient ascent algorithm we obtain:

∆w_ij = η( ∂F_1/∂w_ij + E[∂F_2/∂w_ij] )

Replacing the gradient method by a stochastic method we obtain:

∆w_ij = η( f(u_i) x_j + ∂F_2/∂w_ij ),

where f(u) is defined by Amari et al. (1996) and is the same as the function φ(u_i) in expression (4.3.14). However, in our simulations, similarly to Amari et al., we use f(u) as in (4.3.12, 4.3.13). The second term, for nonlinearities g_i(u_i), may be written as:

∂F_2/∂W = (g''/g') x^t,
g''/g' = ( g_1''(u_1)/g_1'(u_1), ..., g_m''(u_m)/g_m'(u_m) )^t    (4.5.40)

Thus, the learning rule simplifies to:

∆W = η( f(u) + g''/g' ) x^t    (4.5.41)

The second term, for any nonlinear function y = g(u) whose derivative depends only on y itself, y'_u = G(y), can be simplified as follows:

∂ln y'_i/∂w_ij = (1/y'_i) ∂y'_i/∂w_ij = (1/y'_i) (∂y'_i/∂u_i) x_j = (1/y'_i) (∂G(y_i)/∂y_i) y'_i x_j = (∂G(y_i)/∂y_i) x_j

For the logistic sigmoidal activation function g(u) = 1/(1 + exp(−u)), the derivative G(y_i) can be easily evaluated as G(y_i) = y'_i = y_i(1 − y_i). Thus, we obtain:

∆w_ij = η( f(u_i) + (1 − 2y_i) ) x_j.    (4.5.42)

The only difference in the evaluation of ∂F_1/∂w_ij is the presence of the bias w_i0 in (4.5.40). Therefore, we must require that the expectations of the presynaptic activations of the hidden units u_i be zero and that their second moments be m_2^i = E[u_i^2] = 1. This can be achieved by normalizing u before the calculation of f(u_i). Furthermore, the network's input is normalized so that E[x] = 0 and, consequently, omitting the bias altogether (w_0 = 0), the condition E[u] = 0 is automatically satisfied. The second condition, m_2^i = E[u_i^2] = 1, constrains the norm of W. The same rule, but with a negative η, can be used as a goal for sparse coding (Olshausen and Field, 1996; Atick, 1992).

Chapter 5: Real world recognition

5.1 Introduction

Real-world object recognition is impeded by natural climate conditions such as fog, rain or snow, and also by other conditions such as partial occlusion and noise. This is further complicated by changes in illumination and by shadows due to the movement of surrounding objects. Some of these factors cause image blur, and all of them are crucial for recognition performance and have to be properly addressed during training and testing. This chapter addresses face recognition under various image degradations. We compare different regularized recognition networks and different ensembles by testing their performance on degraded images. Results on two data-sets under various resolutions and image degradations are demonstrated. We conclude that a combination that includes ensembles with reconstruction constraints achieves the best performance on the degraded images. In addition, we show that, via saliency maps, reconstruction can de-emphasize degraded regions of the input, leading to classification improvement under "Salt and Pepper" noise.

5.1.1 Face recognition

Face recognition is an active field of research with possible applications in areas such as man-machine interaction, robotics, access control, automatic search in visual databases and low bit-rate compression. The task is challenging, since faces do not appear as fixed image patterns; they can appear anywhere, at any size and orientation, and against varied backgrounds (Chellapa et al., 1995). Thus, face detection and normalization are usually performed to reduce the variability caused by these factors. However, such localization preprocessing is not sufficient, since faces are not rigid and lighting conditions


are not uniform. Different facial expressions, changes in hair-style, eyeglasses and lighting conditions lead to a large amount of face variability. In some applications, normalization may be further complicated by the low quality of the images. For example, systems installed at airports yield foggy, blurred images; cheap cameras, such as those used for robot navigation, produce images with low resolution. Thus, face recognition is a particular case of learning in which the variability of the data describing the same class is comparable with the similarity between different classes (Moses, 1994). Face recognition approaches can be divided into two basic groups: feature-based methods (Samal and Iyengar, 1992, survey) and methods that process images as a whole (Kirby and Sirovich, 1990; Turk and Pentland, 1991; Moghaddam and Pentland, 1994; Valentin et al., 1994, survey). Most of the effort in the feature-based methods is focused on finding individual features (e.g., eyes, mouth, nose, head outline, etc.) and measuring statistical parameters that describe these features and their relationships. Different methods for feature extraction have been proposed, such as template matching (Baron, 1981), deformable templates (Yuille et al., 1989), a combination of perceptual organization and Bayesian networks (Yow and Cipolla, 1996), and methods using facial symmetry and elementary knowledge of faces (Reisfeld et al., 1990; Tankus, 1996). However, selecting a set of features that captures the information required for face recognition is not easy, and there is no completely satisfactory solution to this problem. An alternative approach, inspired by the Gestalt school of perception (Hochberg, 1974; Kanizsa and Gaetano, 1970), is to process faces as a whole. One method representing this approach is PCA, which has been used for face recognition (Kirby and Sirovich, 1990; Turk and Pentland, 1991, see Section 3.1.4 for description). Another way is to process images via Neural Networks.
Under this processing, faces are presented as pixel intensity images, and the extraction of geometrical relationships, texture and subtle facial details is realized implicitly. Recognition from intensity images is also sensitive to substantial variations in lighting conditions, head orientation and size. In order to avoid these problems, automatic preprocessing of the faces (i.e., normalization for size and position) is required. Although this normalization stage is also based on feature extraction, it is rather constrained and is completed by locating the eyes and the mouth or nose. Among the first network models proposed for face recognition are autoassociative networks and autoencoders (Valentin et al., 1994, survey). Although these network models were proposed for recognition, they are trained to reconstruct faces. In autoassociative networks, the recognition task is reduced to a face familiarity task: the cosine between every face and its reconstructed version is evaluated and thresholded to decide whether the face is familiar (O’Toole et al., 1991). In autoencoders, the hidden representation


has been used as input to back-propagation sex and identity networks without a hidden layer (Cottrell and Fleming, 1990). Radial basis function (RBF) networks were first applied to face recognition by Edelman et al. (1992). The well-known data-set (Turk and Pentland, 1991), described below in Section 5.2.4, was used in their experiments. The faces were normalized by the same procedure as described below in Section 5.2.5, to reduce variability due to viewpoint and illumination direction. A set of Gaussian receptive fields (RFs) of different sizes and elongations was applied to reduce the dimensionality of the input. These RFs were applied at different locations, inspired by the observed RFs of simple cells in the primary visual cortex of mammals. Every RBF network was intended for the recognition of a particular person and was trained only on positive examples, for which a single output neuron had a desired value of 1. A face was considered recognized by an individual RBF network if its output exceeded some threshold. Once training of the individual RBF networks had ended, their outputs were used as inputs to a new RBF network with one output unit per person. The desired activity was 1 for the neuron responsible for a given input image and 0 for the others. This new RBF network achieved a misclassification rate of 9%, vs. 22% for the individual networks. Recently, interest in RBF networks as a tool for face recognition has revived, and several novel variants of RBF network schemes have been proposed (Howell, 1997; Sato et al., 1998; Gutta et al., 1996). In (Howell, 1997), a hyper RBF network, which has as many hidden units as training samples and is trained on the images of all persons, is reorganized into a group of smaller face recognition unit networks.
Each face recognition unit network is intended for recognizing a particular person and has two output units. The first unit is responsible for the presence of that person, and the second has to be active when an “anti”-person is presented. The network uses views of the given person as positive examples and selected ambiguous images of other people as negative ones. Although this approach increases complexity, as more networks need to be trained, it reduces the dimensionality of each unit network and adapts easily to the addition of a new person: only one additional unit network has to be trained, and perhaps a small number of ambiguous unit networks needs to be retrained. A way to combine the standard RBF network with face unit networks, based on their confidences, was also proposed. Ensembles of standard RBF networks for face recognition have been proposed in (Gutta et al., 1996). Two ensemble variants, defined in terms of their specific topology (connections and RBF nodes) and the data they are trained on, were considered.


In the first variant (ERBF1), three groups of networks were combined: one trained on the original data, and two trained on the same data corrupted either by Gaussian noise or by some degree of geometrical distortion. Within each group, three networks with different topologies were taken. The decision is based on averaging the network outputs (see Section 2.2) and is made only if the maximal response exceeds some threshold. In the second variant (ERBF2), three RBF networks with different topologies were trained on an extended data-set consisting of the original data and their corrupted versions. These ensembles were later combined with inductive decision-tree classifiers. Sato et al. (1998) use partial face images, such as ears, eyes and nose, cropped by hand, as input to RBF networks. The network is trained with sub-images of known and unknown faces, taken under uniform lighting conditions and at a fixed distance between the camera and the subjects. Each output unit of the RBF network corresponds to a certain person. The input is recognized according to the unit with the maximal output response, if the latter is larger than some threshold. This threshold is set by hand, based on the separability of the maximal responses of known and unknown sub-images; thus, the network is also able to reject unknown faces. A variant of a hybrid supervised/unsupervised network for automatic face recognition has been proposed by Intrator et al. (1996). The network is trained using a hybrid method that combines unsupervised (exploratory) methods for finding structure (extracting features) and supervised methods for reducing classification error. The unsupervised training is based on the biologically motivated BCM neuron (Intrator and Cooper, 1995) and is aimed at finding hidden units with a multi-modal distribution of their activities.
The supervised portion is aimed at finding features (in the network hidden units) that minimize the classification error on the training set. The same data-set and normalization as in (Edelman et al., 1992) were used. The classification rate for the averaged output of five hybrid BCM/recognition networks was 99.38%, which is better than that obtained with RBF networks (Edelman et al., 1992). A new approach to face recognition using Support Vector Machines (SVM) has been proposed by Phillips (1998). SVM is a binary classification method that finds the optimal linear decision surface based on the concept of structural risk minimization (Vapnik, 1995). Since face classification is a multi-class problem, the task is first reformulated as a two-class recognition problem in a difference space (the space of differences between face images). In other words, the multi-class problem is replaced by the problem of discriminating between the within-class differences set (differences of faces of the same person) and the between-class differences set (differences of faces of different persons). An extension of SVM to nonlinear decision surfaces has been used and slightly adapted by


introducing a threshold parameter ∆ into the decision surface parameterization. When the task is recognition of an unknown probe face x, it is converted into a set of difference faces x − x_g, where the x_g are faces of known individuals, called the gallery set. For each difference face, a similarity score δ_g, which depends on the decision surface parameters (but does not include ∆), is evaluated. The probe face is identified as the person whose gallery face x_g has the minimal similarity score δ_g⋆ satisfying δ_g⋆ < ∆; otherwise the probe face is declared unfamiliar. When the probe is verified rather than identified, the task is simplified, since the difference images are constructed only between the probe face and the faces of the person under verification. Results on the FERET database (Phillips et al., 1996; Phillips et al., 1997) are reported, such as a 77%−78% classification rate. Although these results are not impressive, it should be noted that only two images each of the 50 most difficult persons were used for training. Another approach, to face recognition from live video, has recently been proposed by Atick et al. (1997). Their scheme, called FaceIt, is based on the construction of a factorial code, transforming facial images into a large set of simpler, statistically independent elements. The recognition task then consists of estimating the probability that a scene contains any pattern that was processed previously. A different scheme, which attempts to find a new, good representation for face recognition, has been proposed in (Bartlett et al., 1998). Bartlett et al. used ICA (see Section 4) on a reduced face representation extracted using PCA. Classification from the extracted independent components is improved compared to classification from principal components.
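The difference-space construction underlying the SVM formulation of Phillips (1998) — replacing the multi-class problem by within-class vs. between-class differences — can be sketched as follows (a sketch of the data construction only; function and variable names are ours):

```python
import numpy as np
from itertools import combinations

def difference_sets(faces, labels):
    """Build the two classes of the difference space: within-class
    differences (same person) and between-class differences (different
    persons).  A two-class SVM would then be trained on these sets."""
    within, between = [], []
    for (xi, li), (xj, lj) in combinations(zip(faces, labels), 2):
        (within if li == lj else between).append(xi - xj)
    return np.array(within), np.array(between)
```

For example, two persons with two images each yield 6 pairs: 2 within-class and 4 between-class differences.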
An advanced feature-based method for face recognition using Hidden Markov Models (HMM) has been proposed by Samaria et al. (1993). HMMs whose states correspond to five facial features (forehead, eyes, nose, mouth and chin) are built, and the HMM parameters are estimated separately for the face images of each person. To identify an unknown face, its conditional probabilities given the parameters of the different HMM models are evaluated, and the face is assigned the label of the model with the highest conditional probability. Another advanced feature-based method is the dynamic link approach (Wiskott and von der Malsburg, 1993; Wiskott et al., 1997). The method proceeds by applying Gabor filters of 5 different frequencies and 8 orientations at a set of fiducial points (the pupils, the corners of the mouth, the tip of the nose, the top and bottom of the ears, etc.). The responses obtained at each point compose the so-called Gabor jet. Subsequently, every known face is represented as a labeled graph of these fiducial points and the edges


between them. The nodes are labeled with their jets, and the edges are labeled with the vectors between the nodes they connect. The geometrical structure of a graph without its jet labels is called a grid. It is assumed that different known faces have the same grid, and the correspondence between the graph nodes of their models is set by hand. The face models corresponding to the same orientation are joined into a face bunch graph (FBG), which has the average geometrical structure and combines the bunch jets of all its models. The FBG is therefore a representation of all faces with the same orientation. Given an unknown image, the fiducial point locations that maximize the similarity between the unknown image graph and the FBG are searched for. The similarity measure between a face graph and the FBG is defined as the sum of a jet similarity measure and a geometrical similarity measure, controlled by a trade-off parameter. The optimization task is simplified by constraining the group of possible geometrical transformations of the FBG to translation, scale, aspect ratio and local distortions. Subsequently, the similarity measures between the found image graph and the image graphs of all FBG faces are evaluated, and recognition picks the known face with the highest similarity measure. The similarity measure between image graphs is defined as the average similarity between corresponding jets. Throughout, it is assumed that the unknown face is normalized, i.e., its position is estimated before the recognition procedure. In this chapter, we apply the hybrid networks presented in Chapters 3–4 to face recognition. Our approach is a successor of the hybrid supervised/unsupervised network approach (Intrator et al., 1996), with a novel type of unsupervised constraint. The different types of bias constraints are given below in Section 5.2, where the regularization procedure is also presented. The regularization procedure is completed by the creation of various hybrid network ensembles.
These ensembles are tested on degraded facial data-sets. The image degradations simulated in our experiments are briefly described in Section 5.3, and recognition results are presented in Section 5.4. In particular, for the same data and normalization as in (Edelman et al., 1992; Intrator et al., 1996) (see also Section 5.2.5), we achieve a misclassification rate of 0.5%, despite using a smaller training set and a larger testing set.

5.2 Methodology

The face recognition problem requires extrapolation from the training set, since its distribution may be rather different from that of the testing set. This problem therefore requires an efficient use of a-priori knowledge, which can be introduced in the form of bias constraints during training (Section 2.3).

5.2.1 Different architecture constraints

In Chapter 3, reconstruction constraints were suggested as a learning bias and the hybrid recognition/reconstruction network was introduced (Figure 3.3). This hybrid network attempts to improve the low dimensional representation by concurrently minimizing the mean squared errors (MSE) of the reconstruction and classification outputs. The proposed reconstruction/classification network is controlled by a trade-off parameter λ and includes the conventional classification network for λ = 0. We refer to the networks corresponding to λ = 0 as unconstrained networks or conventional classification networks. In the special case λ = 1, we get a nonlinear autoencoder for nonlinear activation functions and a linear autoencoder for linear ones (see Section 3.1.2). As discussed in Section 3.1.2, the hidden weights of the linear autoencoder span the PCA eigenspace; we refer to the network obtained in this case as a PCA network. All networks corresponding to a trade-off parameter inside the interval [0, 1] are called reconstruction networks. In Chapter 4, unsupervised constraints were introduced as statistical feature-extraction constraints on the hybrid network. We considered a hybrid neural network whose hidden representation is arranged mainly for the classification task and, in addition, has some useful statistical properties, such as independence of the hidden neurons or maximum information transfer in the hidden layer. The proposed unsupervised/classification networks are also controlled by a trade-off parameter λ and include the conventional classification network for λ = 0. We consider several types of unsupervised constraints (see also Table 4.1):

• Entropy maximization constraint, which maximizes the joint entropy of the hidden layer (Section 4.3.1).

• BCM constraints, which emphasize data multi-modality by minimizing a specific loss function (Section 4.3.4).

• Sum-of-entropies constraints on the hidden units. We consider four variants of these constraints (see Table 4.1): variants A–C maximize the information carried by each hidden neuron (Section 4.3.5), while variant D corresponds to minimizing the sum of entropies.

• Nonlinear PCA constraints, which extract nonlinear principal components in the hidden layer (Section 4.3.6).

In the general case, the bias constraints are a composition of reconstruction and unsupervised constraints (see Section 4.5). For simplicity we take these constraints with the same


strength, i.e., the parameter µ in Eq. 4.5.32 is set to 0.5, and only the trade-off parameter λ is varied. In particular, we consider the combination of reconstruction and entropy maximization constraints; we refer to the corresponding hybrid networks as reconstruction with entropy maximization networks. Thus, independent of the applied constraints, the networks are controlled by a trade-off parameter λ, and regularization is required.
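The composite objective just described can be sketched as follows. This is a sketch only: the exact convex-combination form is our assumption, based on the λ = 0 and λ = 1 limiting cases stated in the text; see Eq. 4.5.32 for the precise expression.

```python
def hybrid_loss(class_mse, recon_mse, unsup_cost, lam, mu=0.5):
    """Composite objective (sketch): (1 - lambda) times the
    classification MSE plus lambda times an equally weighted (mu = 0.5)
    mix of the reconstruction and unsupervised costs.  lam = 0 recovers
    the conventional classification network; lam = 1 leaves only the
    constraint terms."""
    return (1.0 - lam) * class_mse + lam * (mu * recon_mse + (1.0 - mu) * unsup_cost)
```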

5.2.2 Regularization

The regularization task is to find an optimal parameter λ and corresponding synaptic weights ω_λ which provide the minimal misclassification rate. The optimal parameter can be chosen by hold-out, cross-validation or bootstrap methods (see Appendix 2.8). We have not used cross-validation and bootstrap methods, as they are computationally demanding. Our regularization scheme is a variant of the split-sample validation method: we split the data into approximately equal training and validation sets. Finding the optimal weights ω_λ depends on the stopping time of the training stage, which is set by observing the behavior of the misclassification rate on the validation set. Our regularization method includes the following steps (see Figures 5.1, 5.2, 5.3):

1. For every λ, train the corresponding network until a minimum misclassification rate is achieved on the validation set within a predefined number of epochs.

2. Since the misclassification rate is a stepwise function, choose a stopping time that corresponds to the minimum misclassification rate together with a minimal recognition MSE on the validation set.

3. The λ-value providing the minimum misclassification rate on the validation set is the optimal one.

4. Choose an ensemble of networks around the optimal λ value; this ensemble is later combined with a zero-λ ensemble.

In order to study solely the effect of the trade-off parameter λ on the classification performance, we fixed all other training conditions, such as the initial weights and the learning rate. The initial weights were chosen at random from a uniform distribution on the interval [0, µ]. The learning rate was taken small enough to ensure convergence. From a practical viewpoint, choosing the single best network is not reasonable, since it depends on the degradation, which is unknown a-priori. Instead of searching for the optimal λ, we average over several regularization values, which is roughly equivalent to


Misclassification rate time evolution
[Four log-scale panels, λ = 0, 0.1, 0.2, 0.3: misclassification rate on the validation set vs. training epochs (0–4000).]

Figure 5.1: Validation set results vs. the regularization parameter λ. Regularization with λ > 0.3 provides a larger error than λ = 0.3 (see also the top graph of Figure 5.3).


MSE recognition error time evolution
[Four log-scale panels, λ = 0, 0.1, 0.2, 0.3: recognition MSE on the validation set vs. training epochs (0–4000).]

Figure 5.2: Validation set recognition MSE scaled per sample vs. the regularization parameter λ.


Classification based regularization.
[Three panels vs. λ ∈ [0, 0.3]: misclassification count, recognition MSE (≈ 4.417–6.383 × 10⁻³) and reconstruction MSE (≈ 0.0127–0.0205) on the validation set.]

Figure 5.3: Classification-based regularization (Pentland data-set, intermediate resolution 32 × 32): the upper graph shows the minimal number of misclassified faces in the validation set versus λ; the middle graph shows the minimal mean squared recognition error corresponding to the misclassification level of the upper graph; the bottom graph shows the corresponding mean squared reconstruction error. All errors are calculated on the validation set per sample.


integrating over a uniform regularization distribution between those values. Such averaging is equivalent to the Bayesian approach (see Section 3.2.3) for combining neural networks having the same evidence over the chosen interval of the hyper-parameter λ. We found experimentally that training several networks with different λ values around the optimal values, found once, and then averaging the different network results yields a performance that is close to that of the optimal (a posteriori) λ and sometimes even better (see Section 5.4). Thus, we do not regard the need to estimate an appropriate λ as problematic. In the results described below, we refer to the optimal λ as the one which gives the best test results under degradation. This is thus the upper limit of performance under this scheme, and this limit can be attained, and sometimes surpassed, by the simple method of averaging over several λ values.
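Steps 1–3 of the regularization procedure — choosing a stopping time per λ by validation misclassification, with ties broken by recognition MSE, and then the best λ — can be sketched as follows (a sketch; `histories` and its layout are hypothetical):

```python
def select_model(histories):
    """Pick the stopping epoch and lambda per the text's procedure.
    `histories` maps lambda -> list of (misclassification, recognition
    MSE) measured on the validation set at each epoch.  The stopping
    epoch minimizes misclassification, ties broken by minimal MSE; the
    optimal lambda minimizes the resulting validation misclassification."""
    best = {}
    for lam, hist in histories.items():
        stop = min(range(len(hist)), key=lambda t: (hist[t][0], hist[t][1]))
        best[lam] = (stop, hist[stop])
    lam_opt = min(best, key=lambda l: (best[l][1][0], best[l][1][1]))
    return lam_opt, best[lam_opt][0]
```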

5.2.3 Neural Network Ensembles

An ensemble of experts is capable of improving the performance of single experts (Section 2.2). We have used two types of ensemble classification prediction. The first is a majority rule over all the experts in the ensemble; we call this a classification ensemble. The second rule is based on averaging the real-valued outputs of all the ensemble members and then producing a decision by the Bayesian classification rule; we call this a regression ensemble. It was shown (Section 2.2) that the largest reduction in the variance portion of the error is achieved when the predictors are independent, and this may be achieved by combining networks with different initial weights. We generate such an ensemble of unconstrained nets (λ = 0) and use it as a baseline for ensemble performance comparison. It turns out that by averaging (in either way) over ensemble members that have been trained with different values of the trade-off parameter λ (see Section 3.2.3), some additional independence is achieved, leading to a useful collective decision. We call these ensembles regularization ensembles and classify them further according to the constraints used during training of the ensemble networks: an ensemble of networks constrained by the reconstruction task is called a reconstruction ensemble, by BCM a BCM ensemble, etc. Different ensembles are further combined with each other to generate more powerful predictors; the additional variance reduction comes from the different constraints used for network training, which make the networks more independent. In particular, we have considered the combination of the reconstruction and unconstrained (λ = 0) ensembles, and the combination of the reconstruction and reconstruction-with-entropy-maximization ensembles. We refer to the latter as the reconstruction and entropy maximization ensemble.
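The two ensemble decision rules described above can be sketched as follows (a sketch; `outputs` is assumed to be an array of shape (n_experts, n_samples, n_classes)):

```python
import numpy as np

def classification_ensemble(outputs):
    """Majority vote over each expert's argmax class."""
    votes = np.argmax(outputs, axis=2)              # (experts, samples)
    n_classes = outputs.shape[2]
    return np.array([np.bincount(v, minlength=n_classes).argmax()
                     for v in votes.T])

def regression_ensemble(outputs):
    """Average the experts' real-valued outputs, then take the
    maximum-output (Bayes) decision."""
    return np.argmax(outputs.mean(axis=0), axis=1)
```

The two rules can disagree: a class may win the vote while another class has the larger average output.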

5.2.4 Face data-sets

The widely available facial data-set of Turk and Pentland (1991), as well as a face data-set collected locally by the Tel-Aviv University Computer Vision Group (Tankus, 1996), were used in our simulations. While there have been many successful classification approaches to the Turk/Pentland data, we demonstrate that when the images are given in low resolution, or are degraded by blur or partial occlusion, classification performance deteriorates dramatically. The Turk/Pentland data-set contains 27 images of each of 15 male faces (we omitted the single bearded person). From each face, we randomly chose 14 training images and 13 validation images (a total of 210 training and 195 validation images). Preprocessing details and previous results studying the effect of background and illumination, and a comparison with PCA, are given in (Intrator et al., 1996). The preprocessing partially removes the variability due to viewpoint by (automatically) setting the eyes and the tip of the mouth to the same position in all images (see Section 5.2.5). Further preprocessing takes the difference between each image and the average over all training patterns, leading to the so-called “caricature” images (Kirby and Sirovich, 1990). Three resolutions were used: high (64 × 64), intermediate (32 × 32) and low (16 × 16) pixels. Examples of a face in the three resolutions are shown in Figure 5.4.
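The “caricature” preprocessing can be sketched as follows (a sketch; images are assumed flattened into rows, and the test set uses the training mean):

```python
import numpy as np

def caricatures(train, test):
    """'Caricature' preprocessing (Kirby and Sirovich, 1990): subtract
    the mean of the training images from every image, so that the
    caricature training set has zero mean."""
    mean_face = train.mean(axis=0)
    return train - mean_face, test - mean_face
```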

[“Caricature” faces shown at the three resolutions: 16 × 16, 32 × 32 and 64 × 64.]

Figure 5.4: Pixel resolutions used in the classification results (Pentland data-set).

The second data-set contains images of 37 male and female faces, with 10 pictures of each person, in high resolution (84 × 56). We split the data into 6 training and 4 validation images per person and used similar preprocessing as described above, except that only the eye locations were fixed.

5.2.5 Face normalization

This section describes the face normalization used for the facial data-sets. The normalization is based on finding anchor points (eyes, nose or mouth) and then warping the face images so that these points fall at predefined locations. The anchor points are identified using the Generalized Symmetry Transform (Reisfeld, 1993; Tankus et al., 1997).


The method starts from an edge map and assigns a symmetry measure to each point, producing a “symmetry map” of the image. The symmetry measure for each point and direction is defined as follows. Let p_k = (x_k, y_k) be an image point and ∇I(p_k) = (∂I/∂x, ∂I/∂y)|_(x,y)=p_k be the intensity gradient at p_k. The gradient is considered on a logarithmic scale, i.e., a vector v_k = (r_k, θ_k) is associated with each point p_k, where r_k = log(1 + ‖∇I(p_k)‖) and θ_k = arctan(∂I/∂y / ∂I/∂x)|_(x,y)=p_k. For any two points p_i and p_j, let l be the line passing through them and α_ij the counterclockwise angle between l and the horizontal. The set Γ1(p, ψ), the set Γσ(p), a distance weight function Dσ(i, j) and a phase weight function P(i, j) are defined by:

Γ1(p, ψ) = {(i, j) | (p_i + p_j)/2 = p, α_ij = ψ}
Γσ(p) = {(i, j) | (p_i + p_j)/2 = p, ‖p_i − p_j‖ < 3σ}
Dσ(i, j) = (1/√(2πσ)) exp(−‖p_i − p_j‖ / (2σ))
P(i, j) = (1 − cos(θ_i + θ_j − 2α_ij))(1 − cos(θ_i − θ_j))

The first factor of P(i, j) peaks when the gradients at p_i and p_j are oriented towards each other, while the second factor suppresses P(i, j) when θ_i = θ_j, which occurs for points lying on a straight line. The directional symmetry measure Sσ(p, ψ) of each point p in direction ψ, and the radial symmetry measure M(p), are defined as:

Sσ(p, ψ) = Σ_{(i,j)∈Γ1(p,ψ)} Dσ(i, j) P(i, j) r_i r_j
M(p) = Σ_{(i,j)∈Γσ(p)} Dσ(i, j) P(i, j) r_i r_j sin²((θ_i + θ_j)/2 − α(p)),

where α(p) = (θ_i⋆ + θ_j⋆)/2 and (i⋆, j⋆) = argmax_{(i,j)∈Γσ(p)} Dσ(i, j) P(i, j) r_i r_j. The maps produced by these operators are then searched for their highest peaks. The geometrical relationships among these peaks, together with the location of the midline, are used to infer the position of the face, the eyes and the mouth in the image. The midline of the face image is found as a peak of the autocorrelation function of the edge map, using common knowledge such as the assumption that the eyes should lie on both sides of the midline and the mouth should intersect it.
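The two weight functions of the transform can be written directly from the definitions above (a sketch):

```python
import numpy as np

def phase_weight(theta_i, theta_j, alpha_ij):
    """P(i,j): peaks (value 4) when the gradients at the two points are
    oriented towards each other; vanishes when theta_i == theta_j, as
    for points on a straight edge."""
    return (1 - np.cos(theta_i + theta_j - 2 * alpha_ij)) * \
           (1 - np.cos(theta_i - theta_j))

def distance_weight(p_i, p_j, sigma):
    """D_sigma(i,j): decays with the distance ||p_i - p_j||, favoring
    nearby point pairs."""
    d = np.linalg.norm(np.subtract(p_i, p_j))
    return np.exp(-d / (2 * sigma)) / np.sqrt(2 * np.pi * sigma)
```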

5.2.6 Learning parameters

We used a hidden layer of 10 units for both data-sets; this number was chosen by trial. The parameter µ, which places the initial weights in a small vicinity of the origin of weight space, was set to µ⋆ = 0.001 for the experiments with the Pentland data-set at the intermediate resolution (32 × 32), and to µ = 4µ⋆ = 0.004 and µ = 0.25µ⋆ = 0.00025 for the low and high resolutions, respectively, in order to obtain consistent results in all three resolutions. The predefined number of training steps was set ad hoc to 5000 epochs for the intermediate resolution and 3000 epochs for the high and low resolutions. The learning rate η was adjusted according to the bias constraints; in the experiments with reconstruction constraints it was 0.2. For the TAU data-set, µ⋆ was 0.001, the learning rate η was set to 0.05, and about 10,000 epochs were used.

5.3 Types of image degradation

For the Pentland data-set, we performed experiments in three resolutions: low (16 × 16), intermediate (32 × 32) and high (64 × 64). The test images were obtained by simulating degradation on the validation set only, i.e., all results are based on networks that were trained on “clean” data and tested with either clean or degraded validation data. A few examples of degraded faces and their reconstructions by different networks are shown in Figure 5.5. Below, we briefly describe the types of degradation used; for a comprehensive treatment of degradation, see Chapter 6.

“Clean” data: The original test set without any image degradation.

Blurring with a Gaussian filter: Blurring with a Gaussian filter is one of the simplest types of image degradation. We used Gaussian blurring with standard deviation σ = 2. At this scale of smoothing, high resolution images retain many of the details needed for human perceptual recognition, but at intermediate and low resolutions many details around the eyes and mouth appear to be lost.

Blurring with a DOG filter: The Difference of Gaussians (DOG) filter, which produces a Mexican-hat type receptive field, is a form of image preprocessing known to be present in early mammalian vision (center-surround cells) (Marr, 1982; Kandel and Schwartz, 1991) (see also Section 6.2.1). The standard deviations of the on and off centers (positive and negative Gaussians) were 1 and 2, respectively. This type of preprocessing is known to enhance edges.

Chapter 5: Real world recognition

84

Image degradation and reconstruction (TAU data-set)

Figure 5.5: Reconstruction is done using an architecture with reconstruction constraints. The faces in each row, from left to right, represent: a "clean" face; a corresponding "caricature"; a degraded version; a reconstruction of the degraded version obtained by the first 10 principal components; a reconstruction by a single unconstrained network with λ = 0; reconstructions by a network ensemble with reconstruction constraints and trade-off parameters λ = 0.04, 0.3. Degraded faces, from top to bottom: upper row: "Salt and Pepper" noise with 20% degradation; middle row: nose area replaced by the average intensity in that area; bottom row: DOG blur with on- and off-center deviations equal to 1 and 3.

Partial occlusion: This is achieved by replacing the pixel values in a rectangular area of arbitrary size in any part of the face by the average intensity of the pixels in that rectangle.

"Salt and Pepper" noise: This degradation replaces pixel intensities by either the maximum or minimum grey-level value at random locations covering a certain percentage of the image (Rosenfeld and Kak, 1982). Results presented here used 10% and 20% replacement.
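These last two degradations can be sketched as follows — a hedged NumPy illustration; names and signatures are ours.

```python
import numpy as np

def salt_and_pepper(img, density=0.2, rng=None):
    """Replace a fraction `density` of pixels, chosen at random, by either the
    maximum or minimum grey level (each with probability 1/2)."""
    rng = np.random.default_rng(rng)
    out = img.copy()
    mask = rng.random(img.shape) < density
    salt = rng.random(img.shape) < 0.5
    out[mask & salt] = img.max()
    out[mask & ~salt] = img.min()
    return out

def occlude(img, top, left, height, width):
    """Replace a rectangular area by the average intensity inside that rectangle."""
    out = img.copy()
    patch = out[top:top + height, left:left + width]
    patch[...] = patch.mean()
    return out
```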

5.4 Experimental results

Table 5.1 presents results on classification schemes generated by networks with reconstruction constraints and their combination into ensembles. The results are in three image resolutions with different image degradations. They show that constrained networks, which may not show a significant performance difference on tests with the original, undegraded test-set, do show a significant improvement when tested with degraded images. Below, we highlight some consequences of Table 5.1.

Classification results for Pentland data-set (percent misclassification)

"Clean" testing set:
Classification scheme      Low 16×16   Intermediate 32×32   High 64×64
λ = 0                      3.1         2.6                  1.5
λopt                       3.1         1.5                  1.0
classification ensemble    2.6         0.5                  1.0
regression ensemble        2.6         1.0                  0.5
PCA                        15.9        13.8                 17.9

DOG-blurred images (σ1 = 1, σ2 = 2):
λ = 0                      5.1         3.6                  1.5
λopt                       3.6         2.6                  1.5
classification ensemble    4.1         1.5                  1.5
regression ensemble        4.1         1.5                  0.5
PCA                        16.4        14.8                 17.9

Partial occlusion around the nose:
λ = 0                      5.1         12.3                 16.9
λopt                       4.6         7.2                  11.8
classification ensemble    4.6         8.2                  13.3
regression ensemble        4.6         8.2                  10.3
PCA                        22.1        35.4                 50.8

"Salt and Pepper" noise over 20% of the image:
λ = 0                      36.4        13.8                 5.6
λopt                       34.4        11.7                 3.6
classification ensemble    33.8        14.9                 4.1
regression ensemble        32.3        13.3                 2.6
PCA                        46.7        33.3                 26.7

Table 5.1: Percent misclassification rate for the Turk-Pentland data-set in three resolutions: the "clean" testing set; images blurred with a DOG filter with σ1 = 1, σ2 = 2; partial occlusion around the nose; and "Salt and Pepper" noise over 20% of the image. For the 32 × 32 resolution, the single unconstrained net with λ = 0 and the reconstruction ensemble correspond to initial "weights B" of Table 5.2. PCA stands for the PCA network.

Single PCA network: When λ = 1 and the activation functions of the hidden and output units are linear, the hidden weights of the network span the space of principal eigenvectors (Section 3.1.2). Classification results for the PCA network representation are presented in Table 5.1 (PCA rows). These results are inferior to the other methods and demonstrate that the first few principal components may be inefficient for classification¹.

Ensemble combination: A classification ensemble, or voting, is quite common in computational learning theory (Section 2.2). We find that a regression ensemble is superior to a classification ensemble, especially at higher image resolutions. We note that for a regression ensemble, variance reduction by averaging is achieved when the errors of the different classifiers are independent. It appears that the use of different λ values leads to some independence in misclassification and thus the regression ensemble produces better results.

¹It is known, however, that a larger number of principal components produces improved results (Kirby and Sirovich, 1990).

Different image resolutions: Generally, the results at the 16 × 16 resolution are only slightly worse than the results at higher resolutions. This resolution is less sensitive to difference-of-Gaussians blur, but very sensitive to "Salt and Pepper" noise, which produces significantly worse results. This is a strong indication of the usefulness of multi-resolution detection as a means to improve performance under various image degradations.

In short, Table 5.1 indicates that reconstruction constraints combined with a regression ensemble produce more robust results. In the following set of experiments, we consider other network constraints.
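The two combination rules compared above can be sketched as follows. This is an illustrative sketch in which the ensemble members are stand-ins returning class-score vectors; function names are ours.

```python
import numpy as np

def regression_ensemble(score_list):
    """Average the members' real-valued outputs, then pick the best class."""
    return int(np.argmax(np.mean(score_list, axis=0)))

def classification_ensemble(score_list):
    """Majority vote over the members' individual class decisions."""
    votes = [int(np.argmax(s)) for s in score_list]
    return int(np.bincount(votes).argmax())

# Three hypothetical member networks scoring one input over 3 classes.
# Voting follows the majority of hard decisions; averaging can pick a
# different class when one member is very confident.
members = [np.array([0.90, 0.10, 0.00]),
           np.array([0.40, 0.60, 0.00]),
           np.array([0.45, 0.55, 0.00])]
```

Here two members vote for class 1, so the classification ensemble returns 1, while the averaged scores (0.583, 0.417, 0.0) make the regression ensemble return 0 — the two rules genuinely differ.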

5.4.1 Different architecture constraints and regularization ensembles

Table 5.2 presents results on different classification schemes generated by various network constraints and regression ensemble combinations. All results were obtained on the Turk-Pentland data-set at the intermediate (32 × 32) resolution, using networks trained with two sets of initial random weights, A and B.

Unconstrained ensembles: The first two rows in Table 5.2 represent two single (conventional) unconstrained networks, corresponding to training with the different initial weights A and B. This serves as a base-line comparison and demonstrates the increased sensitivity of single networks to image degradation, in particular to blur. Before concentrating on the effects of additional constraints, we note that the ensemble without additional constraints (1st numbered row of Table 5.2) is already significantly better than a single network. Similar results for the TAU data-set are presented in the second row of Table 5.3.

Reconstruction networks and their ensembles: The next two rows (numbered 2 and 3 in Table 5.2) show the variability of the reconstruction ensemble results due to the different initial sets of weights (A and B). Classification results of the reconstruction ensemble for the TAU data-set are shown in the 3rd row of Table 5.3. For this (more difficult) data-set, the ensemble of unconstrained networks is always inferior to the ensemble with reconstruction constraints. The largest difference between the unconstrained and reconstruction ensembles is observed for blurred images.

The 4th row of Table 5.2 represents the reconstruction ensemble composed of the networks of the two reconstruction ensembles with weights A and B. The main observation

Chapter 5: Real world recognition

87

Different Ensemble Types (Pentland data-set)

Type of regularization constraints                            Opt.  Regr.  Gauss  DOG        Occl.  Occl.      S&P    S&P
                                                              NN    ens.   σ=2    σ1=1,σ2=2  nose   half face  d=0.1  d=0.2
Single unconstrained net, initial "weights A"                 1.0   *      10.3   8.2        1.0    7.2        4.6    10.8
Single unconstrained net, initial "weights B"                 2.6   *      12.8   12.3       3.6    8.7        6.2    13.8
1. Ensemble for λ = 0 and different initial weights           1.0   0.5    6.7    7.7        0.5    5.6        2.1    7.2
2. Reconstruction, "weights A", λ: 0.05 0.1 0.3 0.35          2.1   2.1    8.2    4.1        1.5    5.6        6.7    12.8
3. Reconstruction, "weights B", λ: 0.1–0.3, step 0.05         1.5   1.0    8.7    8.2        1.5    6.2        3.1    13.3
4. Reconstruction ensemble, initial "weights A,B"             1.5   1.5    6.2    4.6        2.1    4.6        4.1    9.7
5. Reconstruction (A+B) and λ = 0 ensembles                   1.0   0.5    5.6    4.6        0.5    4.6        2.6    6.7
6. Reconstruction with entropy max., λ: 0.05–0.3, step 0.05   1.5   2.1    7.2    4.6        2.6    4.6        4.1    8.7
7. Reconstruction and entropy maximization ensembles          1.0   1.5    5.6    3.1        1.5    4.6        4.1    7.2
8. Entropy maximization, λ: 0–0.4, step 0.05                  0.5   1.5    8.7    4.1        2.1    6.7        3.6    11.3
9. BCM, λ: 0.05–0.3, step 0.05                                2.1   2.6    11.3   5.1        2.1    8.2        4.1    11.8
10. Sum of entropies A, λ: 0.05 0.1 0.2 0.25 0.3              1.5   2.1    8.2    7.7        2.6    3.6        4.1    10.8
11. Sum of entropies B, λ: 0.05 0.1 0.2 0.25 0.3              1.5   2.1    8.2    7.7        2.6    3.6        3.6    13.3
12. Sum of entropies C, λ: 0.05 0.1 0.2 0.25 0.3              0.5   2.1    7.7    7.7        2.1    6.7        2.1    9.7
13. Sum of entropies D, λ: 0.05–0.3, step 0.05                1.0   2.1    8.2    6.7        2.1    5.6        3.6    8.7
14. Nonlinear PCA, λ: 0–0.3, step 0.1                         2.6   3.6    20     21.5       7.6    26.2       51.3   74.4

Table 5.2: Percent classification error for different image corruptions on the Turk-Pentland data-set at intermediate (32 × 32) resolution. All results are for an ensemble of networks that includes the indicated λ values. The column "Opt. NN" refers to the single best λ-network in the ensemble; "Regr. ens." is the regression ensemble on the testing set. In the "Salt and Pepper" experiments, either 10% or 20% of the image was corrupted. Information that is not relevant for single networks is marked with *.


is that this combined reconstruction ensemble is better than the unconstrained and the reconstruction ensembles with either weights A or B in classification of Gaussian- and DOG-blurred images. As can be seen, the unconstrained (λ = 0) ensemble is slightly better than this combined reconstruction ensemble when the distortion is small. This motivated us to combine the unconstrained ensemble with the combined reconstruction ensemble (5th row of Table 5.2). We note that this joined ensemble leads to robust results and is superior to the other ensembles. Similar results for the TAU data-set, with the reconstruction ensemble, are presented in the 3rd and 4th rows of Table 5.3.

Different Ensemble Types (TAU data-set)

Type of regularization constraints                     Best    Regr.  Gauss  DOG        Occl.  Occl.      S&P    S&P
                                                       single  ens.   σ=2    σ1=1,σ2=2  nose   half face  d=0.1  d=0.2
Best single unconstrained net, λ = 0                   15.5    *      23.0   44.6       29.1   35.1       24.3   31.1
Ensemble for λ = 0 and different initial weights       15.5    12.8   19.6   31.8       18.2   20.9       16.2   22.3
Reconstruction ensemble, λ = 0, 0.04, 0.1, 0.2, 0.3    15.5    12.8   16.2   26.4       18.2   26.4       16.2   14.9
Joined reconstruction and unconstrained ensemble       15.5    13.5   16.9   23.6       16.9   20.9       14.2   16.2
Entropy maximization, λ = 0–0.4, step 0.05             20.3    12.8   18.2   32.4       16.9   23.0       13.5   20.3

Table 5.3: Misclassification error (in percent) for various ensembles and the joined reconstruction and unconstrained ensembles, on the TAU data-set with different image corruptions. In the "Salt and Pepper" experiments, 10% or 20% of the image was corrupted. Information that is not relevant for single networks is marked with *.

Ensembles with unsupervised constraints: In a manner similar to networks with reconstruction constraints, we generated different families of networks with unsupervised feature-extraction constraints (see Section 5.2.1). The entropy maximization constraint is superior (on the TAU data-set) to an unconstrained ensemble under image occlusions and all types of image degradation (the last row of Table 5.3). The BCM constraint (9th row of Table 5.2) and the sum-of-entropies constraints were useful under image degradation with a DOG filter.

Ensembles with reconstruction and unsupervised constraints: The reconstruction ensemble with entropy maximization (6th row of Table 5.2) is better than the unconstrained ensemble and the reconstruction-constraint ensemble with "weights B" in classification of DOG-blurred images. Joining this ensemble with both reconstruction ensembles attains results that are better, for DOG-blurred images, than those of the joined reconstruction and no-constraints ensemble (7th row of Table 5.2). In general, however, merging ensembles with reconstruction constraints and with no constraints (λ = 0) leads to more robust results and is superior to the joined reconstruction and entropy constraints ensemble.

Figure 5.6 summarizes most results of Table 5.2 and compares different ensemble-averaging schemes and different learning constraints on the original and degraded images. It shows that the "joined reconstruction ensemble" (pink, fifth bar) performs better than each reconstruction ensemble from which it is composed. Additional merging with the unconstrained (λ = 0) ensemble (black, seventh bar) gives a better performance in most cases. The same useful property of the reconstruction ensemble can be observed when merging reconstruction and entropy-maximization ensembles (yellow, sixth bar). This ensemble is superior on DOG-blurred images.

5.5 Saliency detection

This section presents a way to improve recognition of corrupted images using the network's ability to generalize in reconstruction. Due to the bottleneck structure of the network, reconstruction is efficient even when images contain a large amount of noise or are partially occluded by contrasting objects. Reconstructed images, which we call prototypes, are able to recover partially degraded or occluded areas of the input. However, it is difficult to find these degraded areas in the input or, more generally, to define the relevance (confidence) of image areas. This task is common in artificial intelligence and robotic vision, where it is referred to as saliency detection or intelligent filtering (Baluja, 1996). The definition of relevance depends on the desired task and the learning algorithm. For example, for autonomous vehicle navigation (Baluja and Pomerleau, 1995), a saliency map derived from a specific neural network representation (see Section 3.1.5) was designed to highlight significant (salient) regions of the input and deemphasize unimportant regions. Their saliency map is based on the difference between an input image and its prediction by the network from the previous video frame. Below, we present a saliency map construction for still images that is suitable for the classification task and uses the hybrid network introduced in Chapter 3.

5.5.1 Saliency map

After training, a difference map (image) xd = abs(x − xp), given by the absolute difference between the input image x and its prototype xp, can be used for extracting unreliable areas (areas with large noise or unexpected objects) in the input image. Due to the


[Figure 5.6: bar charts of misclassification rate (%) in four panels — a: "clean" data, b: DOG blurring, c: "Salt and Pepper" noise, d: half face crop — with bar groups "Av", "Clas" and "Reg" in each panel.]

Figure 5.6: Misclassification rate (%) for different network ensembles and different types of image degradation, for the Pentland data-set at 32 × 32. "Av" is the average performance of the networks in an ensemble, "Clas" the performance of the classification ensembles, and "Reg" the performance of the regression ensembles. The bars in each group correspond, from left to right, to the following ensembles of neural networks: 1. ensemble of unconstrained networks with λ = 0 and different initial weights; 2. reconstruction with initial weights A; 3. reconstruction with initial weights B; 4. reconstruction with entropy maximization; 5. joined reconstruction ensemble (merging ensembles 2 and 3); 6. joined reconstruction and entropy maximization ensemble (merging ensembles 2, 3 and 4); 7. joined reconstruction and unconstrained λ = 0 ensemble (merging ensembles 1, 2 and 3). For comparison, the classification errors of single networks with initial weights A are shown by dashed horizontal lines.


bottleneck structure of the network, the output of the reconstruction layer should be better for recognition than the original signal in areas where xd is large, i.e., where the original signal x is corrupted. Thus, we propose to replace, before recognition, the original image x by the image xn obtained using a saliency map Φ(xd):

xn = Φ(xd) x + (1 − Φ(xd)) xp,          (5.5.1)

where all operations are pixel-wise. We constrain the saliency map Φ(xd) to be a decreasing function with Φ(0) = 1, and consider two types of saliency maps. The first type is given by Φ(x) = exp(−µx²), with the parameter µ tuned to µ = 0.9. The second saliency map is

Φ(x) = 1 if x < x̄, and 0.5 otherwise,

where the threshold x̄ was adjusted to 0.3. Figure 5.7 shows examples of the xn images obtained using the two saliency maps. Classification was improved for some types of the

[Figure 5.7 (Saliency map construction), panels left to right: Input, Reconstruction, Difference, Saliency map-1, Reconstruction with map-1, Saliency map-2, Reconstruction with map-2.]

Figure 5.7: Reconstruction using saliency maps for a network with reconstruction constraints and trade-off parameter λ = 0.04 (TAU data-set). The white pixels of the first map (map-1) correspond to intensity equal to 1 and black to zero intensity. In map-2 the white pixels have intensity equal to 0.5 and black 0.

degradation process, especially for "Salt and Pepper" noise (Tables 5.4–5.5).

For other


Recognition using saliency maps (Pentland data-set)

Type of degradation                       Unconstrained  Reconstruction A  Reconstruction B  Joined
"Salt and Pepper", d = 0.1   input        1.5            2.1               3.1               1.0
                             map-1        1.5            3.1               3.1               1.0
                             map-2        1.5            3.1               4.1               1.5
"Salt and Pepper", d = 0.2   input        7.2            11.3              11.3              6.7
                             map-1        2.6            4.6               4.1               3.6
                             map-2        2.6            4.6               5.6               3.6
"Salt and Pepper", d = 0.3   input        25.1           26.2              30.8              23.6
                             map-1        11.3           13.3              13.3              11.8
                             map-2        12.8           14.9              15.9              12.3
"Right eye", ν = 3           input        3.1            1.5               2.1               2.1
                             map-1        3.1            1.5               2.1               0.5
                             map-2        2.6            2.1               2.6               1.0
"Half face", ν = 3           input        15.9           26.2              22.1              18.5
                             map-1        16.9           26.2              22.1              16.9
                             map-2        16.9           25.6              19.5              17.4
"DoG 1-2"                    input        7.7            4.1               9.2               4.6
                             map-1        7.7            3.6               9.7               4.1
                             map-2        8.2            3.6               8.7               5.6

Table 5.4: Percent misclassification for images obtained using the two types of saliency maps (Pentland data-set at 32 × 32 resolution). All columns are regression ensembles; reconstruction ensembles A and B correspond to training with weights A and B. In contrast with the reconstruction ensembles A–B of Table 5.2 (rows 2–3), they also contain one unconstrained network (λ = 0). Rows marked "input" stand for the degraded input images.


Recognition using saliency maps (TAU data-set)

Type of degradation                       Unconstrained  Reconstruction  Joined
"Salt and Pepper", d = 0.1   input        16.2           16.9            13.5
                             prototype    23.0           16.2            14.9
                             map-1        14.2           14.2            12.8
                             map-2        14.2           14.9            12.2
"Salt and Pepper", d = 0.2   input        25.0           20.3            20.3
                             prototype    31.1           18.2            20.3
                             map-1        18.2           14.2            12.8
                             map-2        18.2           15.5            12.8
"Salt and Pepper", d = 0.3   input        37.8           31.8            31.1
                             prototype    43.2           31.1            32.4
                             map-1        25.7           18.9            20.3
                             map-2        21.6           17.6            18.9
"Right eye", ν = 3           input        14.2           15.5            13.5
                             prototype    14.9           14.9            13.5
                             map-1        14.2           15.5            13.5
                             map-2        13.5           15.5            13.5
"Half face", ν = 3           input        43.9           36.5            34.5
                             prototype    43.2           41.9            36.5
                             map-1        42.6           36.5            34.5
                             map-2        41.9           35.8            36.5
"DoG 1-2"                    input        31.8           26.4            23.6
                             prototype    33.1           26.4            27.0
                             map-1        32.4           27.7            24.3
                             map-2        33.1           27.0            25.0

Table 5.5: Percent misclassification for images obtained using the two types of saliency maps (TAU data-set). All columns are regression ensembles. Rows marked "prototype" stand for the reconstructed images.


types of image degradation, classification improvement was not significant. To increase the efficiency of the saliency map in the experiments with partially occluded images, the occluded region was enhanced by multiplying the average intensity over the occluded area by a factor ν.
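The saliency-based replacement of Equation (5.5.1), with the two maps described above, can be sketched as follows — an illustrative NumPy implementation; function names are ours.

```python
import numpy as np

def saliency_exp(xd, mu=0.9):
    """First map: Phi(x) = exp(-mu * x^2), decreasing with Phi(0) = 1."""
    return np.exp(-mu * xd ** 2)

def saliency_threshold(xd, xbar=0.3):
    """Second map: 1 where the difference is small, 0.5 otherwise."""
    return np.where(xd < xbar, 1.0, 0.5)

def merge(x, xp, phi):
    """Eq. (5.5.1): xn = Phi(xd) * x + (1 - Phi(xd)) * xp, pixel-wise.
    Keeps the input where it is reliable, trusts the prototype elsewhere."""
    xd = np.abs(x - xp)
    w = phi(xd)
    return w * x + (1.0 - w) * xp
```

The resulting `xn` is then fed to the classifier in place of the degraded input.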

5.6 Conclusions

We have shown that constraints on the properties of the low-dimensional internal representation of the images, such as entropy maximization, BCM and the sum of entropies, are useful and can be considered in conjunction with reconstruction constraints to improve generalization for classification. It was further shown that averaging neural networks with different constraint strengths is preferable to simply choosing the optimal regularized network parameters. The best classification results were obtained by merging the ensemble with reconstruction constraints and the unconstrained (λ = 0) ensemble. Reconstruction constraints significantly improve classification results under partial occlusion, lossy compression, "Salt and Pepper" noise and some image blur operations. In addition, we have shown that, via saliency maps, reconstruction can deemphasize degraded regions of the input, thus leading to classification improvement under "Salt and Pepper" noise. In the next chapter, we investigate the influence of the reconstruction constraints on image recognition under a wide family of image blur and subsequent deblur operations.

5.7 Appendix to Chapter 5: Hidden representation exploration

Image recognition improvement is based on the extraction of a good hidden data representation. Although recognition performance is the single reliable measure by which to judge the quality of the hidden representation, it is interesting to consider some statistics of the hidden-layer units. Statistics of the hidden-unit activities characterize the data distribution after projection onto the hidden weight directions. Some properties of the hidden representation are presented below. In Figures 5.8–5.9, the hidden-unit activities per class are shown for different bias constraints. As can be seen, in both networks images of the same class excite similar activation patterns in the hidden space, while at the same time there is a large difference between patterns corresponding to different classes. Such a representation should be good for recognition. However, from these observations it is difficult to decide which type of constraint is preferable. The pdfs of the hidden-unit activities are presented in Figure 5.10. As can be seen, they are multi-modal for the unconstrained network, and multi-modal or super-Gaussian for the reconstruction network. Both these properties are useful for recognition (Chapter 4). Another way to get an impression of the hidden-layer structure is to look at the hidden weights as images (Figure 5.11). We note, however, that the hidden representation of a network ensemble is not well defined.
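Super-Gaussianity of hidden-unit activities can be checked numerically via the excess kurtosis, which is positive for heavy-tailed, sharply peaked densities and zero for a Gaussian. A small sketch, using synthetic data in place of actual hidden activities (the function name is ours):

```python
import numpy as np

def excess_kurtosis(a):
    """Excess kurtosis E[z^4] - 3 of a sample: > 0 suggests a super-Gaussian
    distribution, 0 a Gaussian, < 0 a sub-Gaussian one."""
    a = np.asarray(a, dtype=float)
    z = (a - a.mean()) / a.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
gauss = rng.normal(size=100_000)     # baseline: excess kurtosis near 0
laplace = rng.laplace(size=100_000)  # a classic super-Gaussian density
```

Applied to the activities of each hidden unit, this gives a one-number summary of the peakedness that Figure 5.10 displays graphically.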


[Figure 5.8: hidden-unit activities vs. classes (class-1 through class-15, neurons 1–10) for an unconstrained network, λ = 0.]

Figure 5.8: Results on "clean" Pentland data set at intermediate resolution 32 × 32. Each square area represents a neuron's activity magnitude vs. different inputs (a representation similar to Hinton diagrams for network weights). The color indicates the sign of the magnitude: red for negative and green for positive values (on non-color printers, red appears darker than green).


[Figure 5.9: hidden-unit activities vs. classes (class-1 through class-15, neurons 1–10) for a reconstruction network, λ = 0.3.]

Figure 5.9: Results on "clean" Pentland data set at intermediate resolution 32 × 32.


[Figure 5.10: pdfs of hidden-unit activities (neurons 1–10) for the unconstrained network, λ = 0, and the reconstruction network, λ = 0.3.]

Figure 5.10: Hidden-unit activity pdfs for the unconstrained (λ = 0) and reconstruction (λ = 0.3) networks, on the "clean" Pentland data set at intermediate resolution 32 × 32.

[Figure 5.11: hidden weights displayed as images, for the unconstrained network, λ = 0, and the reconstruction network, λ = 0.3.]

Figure 5.11: Hidden weight representation; Pentland data set at intermediate resolution 32 × 32.

Chapter 6

Blurred image recognition

This chapter studies a case where the required generalization is to data that may be "far" from the training set, namely data with a different distribution than that of the training set. In the previous chapter, we considered unsupervised and, in particular, reconstruction constraints as a mechanism to impose a useful bias during training. We showed that these constraints improve generalization performance under various image degradations, such as "Salt and Pepper" noise, low resolution and partial occlusion. However, sensitivity to image blur remained too high. This chapter is devoted to performance improvement under various types of image blur.

6.1 Methodology

Recognition of blurred images requires a substantial amount of training data processed by different blur operators. Unfortunately, such data is not available, and an alternative is to impose a priori information about the possible degradation transformations. For example, in the character recognition problem, the possible transformations are geometrical, such as shift, rotation and scaling (Simard et al., 1992; Baird, 1990). Regularization there appears as invariance (tangent-prop) constraints in the form of a penalty term added to the cost function, or via a distortion model, i.e., by data-driven regularization (Section 2.3.2). We choose to add Gaussian-blurred images to the training set as a representative of all blur operations, while recognition is tested on a wide variety of blur operations. We further propose to enforce reconstruction of blurred images to either a copy of themselves or to the original non-blurred images. Such training causes the hidden representation to become insensitive to the blur operation. Another obvious way to improve classification of blurred images is to restore the

Chapter 6: Recognition of blurred images

101

blurred images beforehand. In this case, before testing the recognition system on blurred images, their degradation is reduced via image restoration techniques.
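The data expansion just described — Gaussian-blurred copies whose reconstruction target is either the clean original or the blurred input itself — can be sketched as follows. This is a minimal NumPy/SciPy illustration; the function name and its `target` parameter are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_training_pairs(clean_images, target="clean", sigma=2.0):
    """Expand the training set with Gaussian-blurred copies (sigma = 2).
    target="clean": reconstruct the clean original from the blurred input;
    target="copy":  reconstruct a plain copy of whatever the input is."""
    inputs, recon_targets = [], []
    for img in clean_images:
        img = img.astype(float)
        blurred = gaussian_filter(img, sigma=sigma)
        for x in (img, blurred):
            inputs.append(x)
            recon_targets.append(img if target == "clean" else x)
    return np.array(inputs), np.array(recon_targets)
```

The `target="clean"` variant adds an explicit blur-invariance constraint on the reconstruction task; `target="copy"` is plain input duplication.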

6.1.1 Experimental design

Training schemes: In Chapter 5, hybrid networks were trained to classify and reconstruct "clean" images (Figure 6.1 A, training stage), i.e., a copy of the input was reconstructed at the output layer. Below, we refer to this training scheme as scheme A. This training encourages an internal representation in which patterns of the same class are clustered together (due to the reconstruction part of learning), while the distance between patterns of different classes is stretched (due to the discriminative/classification part of learning) (Gluck and Myers, 1993). As a result, classification in this hidden space is simpler and more robust to various forms of degradation. To further improve recognition of degraded images, we added Gaussian-blurred images (with standard deviation σ = 2) during training. This data-expansion procedure gives two additional types of training procedure with reconstruction constraints. The first, training scheme B, enforces reconstruction of the original "clean" images from the blurred inputs (Figure 6.1 B, training stage); the second, scheme C, is a simple duplication of the inputs at the output (Figure 6.2). Both training schemes B and C encourage the internal representation to be more robust to blurring, but scheme B introduces additional invariance constraints on the image reconstruction task. As in Chapter 5, three types of ensembles are studied for each of the training schemes A–C: unconstrained, with reconstruction constraints, and joined. The number of networks in the unconstrained ensemble of each scheme A–C is 6. The ensembles with reconstruction constraints of schemes A–C were composed of networks with the trade-off parameter λ ranging from 0 to 0.3 with an increment of 0.05.

Testing schemes: Two testing schemes were used to evaluate the generalization ability of the networks and their ensembles.
The first testing scheme, A, is the same as in Chapter 5, i.e., various image degradations are simulated and the misclassification rate of the different ensembles is evaluated (Figure 6.1 A, testing stage). In testing scheme B, the degraded images are first preprocessed using several restoration methods and only then is classification carried out (Figure 6.1 B, testing stage). Our experiments consist of several groups, which differ in the simulated degradation types and the applied restoration techniques. In the next section,


we review the image degradation operations and restoration methods that we apply.

[Figure 6.1: experimental design schemes A and B; each panel shows a training stage (classification and reconstruction branches) and a testing stage.]

Figure 6.1: (A): In the training stage, networks are trained to classify and reconstruct "clean" images; in testing stage A, the ability to classify artificially degraded images is tested. (B): Artificially blurred images are added in the training stage; networks are trained to classify images and reconstruct their "clean" prototypes. In testing stage B, restoration preprocessing is applied before recognition.

6.2 Image degradation

The degradation process is usually modeled as a space-invariant blurring with a convolution operator h together with corruption by additive noise n:

g = h ∗ f + n,          (6.2.1)

where f is the original image. The major known causes of image blur are misfocus, camera jitter, object motion and atmospheric turbulence. These types of blur amount to a low-pass operation on the image. Of particular interest is the difference of Gaussians (DOG) filter, a band-pass filter known to be present in early mammalian vision (Kandel and Schwartz, 1991); this operator is equivalent to simultaneous image smoothing and enhancement. A third family of image filters is the high-pass filters, which lead to image sharpening; such filters are common in medical imaging, industrial inspection and military applications.

The presence of noise in images is inevitable. It may result from image generation, recording, transmission, etc. Noise corruption complicates image acquisition, and even a small amount of it is harmful for the restoration of blurred images. We consider two types of additive noise: Gaussian white noise and pulse noise. We limit ourselves to Gaussian noise that acts independently on each pixel, with zero mean and some variance σ. Pulse noise (otherwise called "Salt and Pepper" noise) replaces pixel intensities by either the maximum or minimum grey-level value with some probability (Rosenfeld and Kak, 1982), producing isolated high-contrast black-and-white points. This noise is common in video transmission.

[Figure 6.2: training scheme C, training stage, with classification and reconstruction branches for clean and blurred inputs.]

Figure 6.2: In the training stage, the network is trained to classify and reconstruct "clean" and blurred images. The reconstruction target is a copy of the input at the output sublayer.
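The degradation model of Eq. (6.2.1) can be simulated directly; a minimal sketch with a Gaussian blur kernel standing in for h and white Gaussian noise for n (NumPy/SciPy; the function name is ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(f, sigma_blur=2.0, sigma_noise=0.05, rng=None):
    """Space-invariant degradation model of Eq. (6.2.1): g = h * f + n,
    with h a Gaussian blur kernel and n zero-mean white Gaussian noise."""
    rng = np.random.default_rng(rng)
    g = gaussian_filter(f.astype(float), sigma=sigma_blur)    # h * f
    g = g + rng.normal(0.0, sigma_noise, size=f.shape)        # + n
    return g
```

Any other blur (motion, out-of-focus, DOG) fits the same template by swapping the convolution kernel h.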

6.2.1

Main filters

Filtering may be done in either the frequency or the spatial domain. Convolution in the spatial domain is equivalent to multiplication of the Fourier transforms of the image and the filter in the frequency domain. In each particular case we indicate in which domain filtering is done and specify the point spread function or its Fourier transform (referred to as the transfer function) as required. Examples of images with various degradations are shown in Figure 6.3.

Ideal filters

Ideal filters represent a class of frequency-domain filters that are easy to simulate. Their transfer functions are radially symmetric about the origin; though they are not physically realizable, they are widely used in image processing for comparing the behavior of different types of filters. The name ideal indicates that some specified



frequencies are completely eliminated. Depending on which frequencies are eliminated, ideal low-, band- and high-pass filters are distinguished (Gonzalez and Wintz, 1993). The transfer function of an ideal filter in the frequency domain (u, v) is:

    H(u, v) = 1 if (u, v) ∈ D, and 0 otherwise,

where the area D of unchanged frequencies is:

    a) √(u² + v²) ≤ W₁,  b) √(u² + v²) ≥ W₂,  c) W₁ < √(u² + v²) < W₂,

for the low-, high- and band-pass filters respectively; W₁ and W₂ are called cutoff frequencies.

Figure 6.3: a) Gaussian noise with σ = 2; b) pulse noise with density 20%; c) replacement of the nose area by the average intensity over that area; d) root filter with α = 0.6; e) out-of-focus filter with blur radius R = 5; f) motion blur with propagation over 7 pixels; g) Gaussian blur with σ = 2; h) DOG filter with on and off centers σ₁ = 1 and σ₂ = 2; i) ideal high-pass filter with cutoff w = 3.

Motion blur

Motion blur is a form of image degradation that may degrade recognition performance (Figure 6.3f). It is due to relative motion between the camera and the object. Assuming that the relative camera motion is horizontal and uniform, and that the total displacement during the exposure time T is a, the transfer function H(u, v) is given by (Gonzalez and Wintz, 1993):

    H(u, v) = (T / (πua)) sin(πua) exp(−πiua).    (6.2.2)


H vanishes at values of u given by u = n/a, where n is a nonzero integer. In general, the amplitude of H(u, v) is characterized by periodic lines of zeros, orthogonal to the direction of motion and spaced at intervals of 1/a on both sides of the frequency plane.

Out-of-focus blur

The point spread function (PSF) of a defocused lens with a circular aperture is approximated by a cylinder whose radius R depends on the extent of the focus defect (Cannon, 1976):

    h(x, y) = 1/(πR²) if x² + y² ≤ R², and 0 otherwise,

where R is the "blur radius", proportional to the extent of defocusing. The Fourier transform of h(x, y) in this case is H(u, v) = J₁(πRr)/(πRr), where J₁ is the first-order Bessel function; H is characterized by "almost-periodic" circles of zero-valued H(u, v), which occur for r satisfying 2πRr = 3.83, 7.02, 10.2, 13.3, 16.5, . . . The well-defined structure of the zeros of H(u, v) in the case of motion and misfocus blur is used to identify the blur parameter for the purpose of image restoration (Cannon, 1976; Fabian and Malah, 1991). However, these methods are sensitive to noise; to overcome this drawback, a preprocessing stage for noise reduction and estimation has been used (Fabian and Malah, 1991). An example of a misfocused image with blur radius R = 5 is shown in Figure 6.3e.

Gaussian blur

Gaussian blur may be caused by atmospheric and optical blur; the eye's lens is known to cause such blur, and computer tomography images also suffer from it (Kimia and Zucker, 1993). The Gaussian convolution filter, written in polar coordinates h(r, φ) in the spatial domain, is given by:

    h(r, φ) = C σ⁻² exp(−r² / (2σ²)),    (6.2.3)

where C is a normalization constant. The lack of zero crossings of the Gaussian filter in the frequency domain makes its identification very difficult. Moreover, Gaussian deblurring is numerically unstable (Humel et al., 1987; Kimia and Zucker, 1993). An example of an image blurred by this filter with σ = 2 is shown in Figure 6.3g.


DOG filter

The difference of Gaussians (DOG) filter is a good approximation to the circularly symmetric, Mexican-hat type (center-surround) receptive fields found in early mammalian vision (Marr, 1982; Kandel and Schwartz, 1991). It performs band-pass filtering that approximates the result of applying the Laplacian operator ∇² to an image blurred with a Gaussian filter. The zero-crossings of the resulting convolved image are commonly used for edge detection and segmentation. The DOG filter written in polar coordinates is:

    h(r, φ) = C σ₁⁻² exp(−r² / (2σ₁²)) − C σ₂⁻² exp(−r² / (2σ₂²)),    (6.2.4)

where σ₁ < σ₂ are the standard deviations of the on and off centers (the positive and negative Gaussians). An image blurred with a DOG filter is shown in Figure 6.3h.

Root filter

The root filter is commonly used for image enhancement and deblurring (Jain, 1989). It affects the magnitude of the frequency response V of an image as ‖V̂‖ = ‖V‖^α. For small values α < 1, it acts as a high-pass filter, increasing the ratio between amplitudes at high and low frequencies. An image enhanced with a root filter (α = 0.6) is shown in Figure 6.3d.
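For illustration, the transfer functions above are easy to sketch with NumPy. This is our sketch, not the thesis code; the frequency units (cycles per pixel), the choice T = 1 for the motion filter, and all function names are our assumptions:

```python
import numpy as np

def freq_grid(shape):
    """Frequency coordinates (u, v) for an image of the given shape."""
    u = np.fft.fftfreq(shape[0])[:, None]
    v = np.fft.fftfreq(shape[1])[None, :]
    return u, v

def gaussian_tf(shape, sigma):
    """Gaussian blur in the frequency domain (Fourier transform of Eq. 6.2.3)."""
    u, v = freq_grid(shape)
    return np.exp(-2 * np.pi**2 * sigma**2 * (u**2 + v**2))

def motion_tf(shape, a):
    """Horizontal motion blur, Eq. (6.2.2) with T = 1; np.sinc(x) = sin(pi x)/(pi x),
    so the zeros fall at u = n/a as described in the text."""
    u, v = freq_grid(shape)
    return np.sinc(u * a) * np.exp(-1j * np.pi * u * a) * np.ones_like(v)

def ideal_lowpass_tf(shape, cutoff):
    """Ideal low-pass: keep frequencies with sqrt(u^2 + v^2) <= cutoff."""
    u, v = freq_grid(shape)
    return (np.sqrt(u**2 + v**2) <= cutoff).astype(float)

def apply_filter(img, H):
    """Filter in the frequency domain: multiply the spectra (Eq. 6.2.1, no noise)."""
    return np.real(np.fft.ifft2(np.fft.fft2(img) * H))
```

All three transfer functions equal 1 at zero frequency, so they preserve the mean intensity of the image, as a pure blur should.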

6.2.2

Other types of degradation

Noise

We consider two types of additive noise: Gaussian white noise and pulse noise. Gaussian white noise is commonly used to model sensor noise and the quantization process. We limit ourselves to Gaussian noise that acts independently on each pixel, with zero mean and some variance σ² (Figure 6.3a). Pulse noise replaces pixel intensities by either the maximum or minimum grey-level value with some probability (Rosenfeld and Kak, 1982), producing isolated high-contrast black-and-white points; this is why pulse noise is also called "Salt and Pepper" noise. Pulse noise often appears during TV image transmission (Figure 6.3b).

Occlusion

Occlusion occurs as a result of motion, when two or more objects touch or overlap one another. Another cause of occlusion in 2D images is a change of viewpoint, when part of an object becomes hidden behind another one. We simulate occlusion by replacing pixel


intensities in a certain rectangular area, placed anywhere in the image, by some constant intensity (Figure 6.3c). The level of occlusion is characterized by a factor ν applied to the average intensity of the occluded area.
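These degradations are straightforward to simulate. The following sketch is illustrative only; treating images as float arrays with grey levels in [0, 255], equal salt/pepper probability, and the function names are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma):
    """Additive zero-mean Gaussian noise acting independently on each pixel."""
    return img + rng.normal(0.0, sigma, img.shape)

def add_pulse_noise(img, density, lo=0.0, hi=255.0):
    """'Salt and Pepper' noise: replace a fraction `density` of the pixels
    by the minimum or maximum grey level, chosen with equal probability."""
    out = img.copy()
    mask = rng.random(img.shape) < density
    out[mask] = rng.choice([lo, hi], size=mask.sum())
    return out

def occlude(img, top, left, height, width, nu=1.0):
    """Replace a rectangle by nu times the average intensity of that area."""
    out = img.copy()
    patch = out[top:top + height, left:left + width]
    patch[...] = nu * patch.mean()
    return out
```

With nu = 1 the rectangle is filled with its own average intensity, matching the degradation shown in Figure 6.3c.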

6.3

Image restoration

Image restoration refers to the problem of recovering an image from its blurred and noisy version, using some a priori knowledge of the degradation phenomenon and the image nature. It is well-known that the restoration problem is an ill-posed problem (Gonzalez and Wintz, 1993; Jain, 1989; Stark, 1987), i.e. a small noise in the observed image results in an unbounded perturbation in the solution. This instability is often addressed by a regularization approach (Tikhonov and Arsenin, 1977; Katsaggelos, 1989; Sezan and Tekalp, 1990; Rudin et al., 1992; You and Kaveh, 1996) that includes restricting the set of admissible solutions and introducing some a priori knowledge about the image and the degradation model.

6.3.1

MSE minimization and regularization

Assuming the blur operator H is known, a natural criterion for estimating the original pixel image f from an observed pixel image g, in the absence of any knowledge about the noise, is to minimize the difference between the observed image and a blurred version of the restored image:

    min_f M(f) = min_f ‖g − Hf‖².    (6.3.5)

Often, gradient or conjugate gradient descent methods are used to minimize M(f) (Katsaggelos, 1989; Sezan and Tekalp, 1990). Applying the gradient method to the minimization problem (6.3.5) produces the iterative scheme:

    f_{k+1} = f_k + β(H^t g − H^t H f_k),  f_0 = 0.    (6.3.6)

When the blur matrix H is nonsingular and β is sufficiently small, the iterative scheme converges to f̂ = H⁻¹g. This solution is known as the inverse filter method. In the frequency domain, it corresponds to the following estimate of the ideal image frequency response:

    F̂(u, v) = G(u, v) / H(u, v).    (6.3.7)
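A minimal sketch of the iterative scheme (6.3.6), assuming a circular-convolution blur so that H is diagonal in the frequency domain (H^t then becomes the complex conjugate of H); the step size, iteration count and function name are our choices:

```python
import numpy as np

def landweber_deblur(g, h_psf, beta=1.0, iters=200):
    """Iterative scheme f_{k+1} = f_k + beta (H^t g - H^t H f_k), f_0 = 0,
    implemented in the frequency domain for a circular-convolution blur."""
    H = np.fft.fft2(h_psf, s=g.shape)   # blur transfer function
    G = np.fft.fft2(g)
    F = np.zeros_like(G)
    for _ in range(iters):
        # H^t becomes conj(H) when the iteration is diagonalized by the FFT
        F = F + beta * np.conj(H) * (G - H * F)
    return np.real(np.fft.ifft2(F))
```

Convergence requires beta < 2/max|H|²; where H(u, v) = 0 the corresponding frequency of F simply stays at zero, which is exactly the minimum-norm behavior discussed next.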

As mentioned before, blur such as motion or defocusing leads to a singular H matrix. In this case, the above optimization method yields an iterative scheme that converges to the


minimum norm least squares solution H⁺g of Eq. 6.3.5 (Katsaggelos, 1989; Jain, 1989), where H⁺ is the generalized inverse of the matrix H. In the presence of noise, the iterative algorithm converges to H⁺g_b + H⁺n (where g_b is the blurred image without noise interference) and thus contains noise filtered by the pseudo-inverse matrix. Since H is usually a low-pass filter, the noise is amplified and the obtained solution may be very far from the desired one. To overcome this sensitivity to noise, some a priori information about the noise or the ideal image is often introduced as a quantitative constraint that replaces the ill-posed problem by a well-posed one. This approach is called regularization. The best-known regularization methods (Tikhonov and Arsenin, 1977; Sezan and Tekalp, 1990) have the general formulation of minimizing the function:

    L(f) = ‖Hf − g‖² + α‖Cf‖²,

where the regularization operator C is chosen to suppress the energy of the restored image at high frequencies, which is equivalent to an assumption about the smoothness of the original image in the spatial domain. Since the filter H is usually a low-pass filter, C is commonly taken to be the Laplacian ∇·∇f, where ∇ is a differential operator. The regularization parameter α may be known a priori or estimated; theoretically, it is inversely proportional to the signal-to-noise ratio (SNR). Although regularization of the MSE criterion with the smoothness constraint ‖Cf‖ is the basis for most work in image restoration, it often leads to unacceptable ringing artifacts around sharp intensity transitions. This effect is due to image blurring around lines and edges. One solution to this problem is given by the following functional minimization (Katsaggelos, 1989):

    L(f) = Σ_{x∈Ω} [g(x) − h(x) ∗ f(x)]² + λ Σ_{x∈Ω} ω(x)[c(x) ∗ f(x)]².    (6.3.8)

The first term in (6.3.8) represents the fidelity of the restored image with respect to the observation, and the second represents a smoothness constraint; ∗ is the convolution operator. Spatial adaptivity is achieved through the weight function ω, which is set small around edge areas and larger in smooth areas; in practice it is usually taken as the inverse of the local variance of the image. The space-adaptivity approach has been extended to the case of an unknown blur operator (You and Kaveh, 1996; Chan and Wong, 1997). The method incorporates a priori knowledge about the image and the point spread function (PSF) simultaneously. It proceeds by minimizing a cost function that consists of a restoration error measure


and two regularization terms for the image and the blurring kernel, under constraints on the blur filter energy. You et al. (You and Kaveh, 1996) formulate the problem as minimization of a function of the discrete image and filter values (the 2D image and filter functions are quantized on a grid):

    L(f, h) = Σ_{x∈Ω} ω(x)[g(x) − h(x) ∗ f(x)]² + λ₁ Σ_{x∈Ω} ω₁(x)[c₁(x) ∗ f(x)]² + λ₂ Σ_{x∈Ω} ω₂(x)[c₂(x) ∗ h(x)]².    (6.3.9)

In (6.3.9), the first term is responsible for image fidelity, and the second and third terms represent smoothing constraints on the image and the blur filter, respectively. Smoothness is introduced adaptively via the weights ω₁(x) and ω₂(x). Though the gradient descent method is commonly applied for such minimization, an alternating minimization (AM) algorithm is used instead; it is a particular realization of the coordinate descent method (Luenberger, 1989). The filter and the image are considered as dual variables. The algorithm alternately minimizes the cost function by descending with respect to the filter or the image while fixing the dual variable. In every alternating step, a quadratic cost function L(f, h|f̂) or L(f, h|ĥ) is minimized by the conjugate gradient method. We note that this formulation is equivalent to minimization of the functional:

    L(f, h) = ‖ω(h ∗ f − g)‖²_{L2} + λ₁‖√ω₁ (C₁ ∗ f)‖²_{L2} + λ₂‖√ω₂ (C₂ ∗ h)‖²_{L2},

where f and h are the image and blur kernel 2D real functions and ‖·‖_{L2} is the L2 norm. Regularization with another form of constraint has been considered in (Chan and Wong, 1997), where the problem is formulated as minimization of the functional:

    L(f, h) = ‖h ∗ f − g‖²_{L2} + α₁ ∫_Ω |∇f| dx + α₂ ∫_Ω |∇h| dx.    (6.3.10)

The proposed method is called total variation (TV) blind deconvolution. In Eq. (6.3.10) the regularization terms have the form ∫_Ω |∇f| dx, called the total variation norm (Rudin et al., 1992). It follows the idea that the image consists of smooth patches, instead of being smooth everywhere, thus providing better recovery of image edges.
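The alternating idea can be sketched in a strongly simplified form. This is not the algorithm of You and Kaveh (which uses adaptive weights and a conjugate-gradient inner solver) nor TV regularization: we substitute a quadratic Laplacian penalty, so that each alternating half-step has a closed-form minimizer in the frequency domain. All parameter values and names are our assumptions:

```python
import numpy as np

def am_blind_deconv(g, lam1=1e-3, lam2=1e-3, outer=20):
    """Alternating minimization for blind deconvolution (simplified sketch).
    Each half-step exactly minimizes the quadratic functional
    L = |H F - G|^2 + lam1 |Lap F|^2 + lam2 |Lap H|^2 in the frequency
    domain, with a Laplacian smoothness term standing in for the adaptive /
    total-variation regularizers of the cited papers."""
    shape = g.shape
    G = np.fft.fft2(g)
    lap = np.zeros(shape)                     # discrete Laplacian kernel
    lap[0, 0] = -4
    lap[0, 1] = lap[1, 0] = lap[0, -1] = lap[-1, 0] = 1
    P = np.abs(np.fft.fft2(lap))**2           # penalty spectrum |Lap|^2
    h0 = np.zeros(shape); h0[0, 0] = 1.0      # initial blur guess: delta function
    H = np.fft.fft2(h0)
    for _ in range(outer):
        F = np.conj(H) * G / (np.abs(H)**2 + lam1 * P)  # minimize over f, h fixed
        H = np.conj(F) * G / (np.abs(F)**2 + lam2 * P)  # minimize over h, f fixed
    return np.real(np.fft.ifft2(F)), np.real(np.fft.ifft2(H))
```

The blur/image ambiguity inherent in blind deconvolution is visible here: only regularization and constraints (e.g. the kernel-energy constraint mentioned above) pin down which factor absorbs the blur.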

6.3.2

Image restoration in the frequency domain

All the restoration methods considered up to this point were derived in the spatial domain, though historically the first methods were designed in the frequency domain. Here we briefly survey the most widespread frequency-domain restoration methods.


Wiener filter

A fundamental result of filtering theory commonly used for image restoration is the Wiener filter. Wiener filtering has been successfully used to restore images corrupted by both noise and blurring. This filter gives the best estimate of the object from the observations in the MSE sense. The Wiener filter frequency response is given by (Jain, 1989):

    H_W = H* S_ff / (|H|² S_ff + S_ηη) = (S_gg − S_ηη) / (H S_gg),    (6.3.11)

where H* denotes the complex conjugate of H.

In the case where only one observation is available, S_ff and S_gg are the power spectra of the ideal and observed images, respectively, and S_ηη is the power spectrum of the noise. Since the phase of the Wiener filter coincides with the phase of the inverse filter, it does not compensate for phase distortions due to noise in the observations. In the absence of blurring, the Wiener filter becomes:

    H_W = S_ff / (S_ff + S_ηη) = snr / (snr + 1),    (6.3.12)

where snr = S_ff / S_ηη is the signal-to-noise ratio. In practice, snr is defined as the ratio between the variances of the blurred image and the noise (or 10 log₁₀ snr, if the signal-to-noise ratio is measured in dB). The filter (6.3.12) is called the Wiener smoothing filter. It suppresses all frequency components for which the signal-to-noise ratio is small and leaves the frequency components unchanged when snr is large (snr ≫ 1). For images, S_ff is usually very small at high frequencies, so the noise smoothing filter is a low-pass filter. The other marginal case is the absence of noise, in which the Wiener filter coincides with the inverse filter H_W = H⁻¹. Since the blurring process is usually a low-pass filter, the Wiener filter acts in this case as a high-pass filter. In the presence of both noise and blur, the Wiener filter achieves a compromise between low-pass noise smoothing and high-pass inverse filtering, resulting in a band-limited filter. It is clear, nevertheless, that the Wiener filter is unstable (like the inverse filter) wherever the frequency response H is zero or close to it.

Inverse and pseudo-inverse filters

As already mentioned, in the absence of noise the Wiener filter becomes an inverse filter and requires stabilization. A standard stabilized version of the inverse filter is:

    H⁻¹(w₁, w₂) = 1/H(w₁, w₂) if |H(w₁, w₂)| ≥ ε₁, and 0 otherwise.


Instead, we have used the following version of the pseudo-inverse filter in our simulations:

    H⁻¹(w₁, w₂) = 1/H(w₁, w₂) if |H(w₁, w₂)| ≥ ε₁, and 1/(H(w₁, w₂) + ε₂) otherwise.

The choice of the ε₁ and ε₂ parameters defines the quality of the deblurred image. In our simulations, they have been chosen by trial once for the whole data set. It is known that great care must be taken to obtain approximate solutions that achieve the proper balance between accuracy and stability (Stark, 1987). Another nonlinear deblurring filter is the root filter (see Section 6.2.1), which is also used for image enhancement.
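Both frequency-domain restorers can be sketched as follows, assuming a known PSF and a constant snr approximating S_ff / S_ηη in (6.3.11); the ε parameter values and function names are illustrative, not the thesis settings:

```python
import numpy as np

def wiener_deblur(g, h_psf, snr):
    """Wiener filter, Eq. (6.3.11), with S_ff/S_nn approximated by a constant snr."""
    H = np.fft.fft2(h_psf, s=g.shape)
    W = np.conj(H) / (np.abs(H)**2 + 1.0 / snr)
    return np.real(np.fft.ifft2(W * np.fft.fft2(g)))

def pseudo_inverse_deblur(g, h_psf, eps1=0.1, eps2=0.1):
    """Stabilized pseudo-inverse: 1/H where |H| >= eps1, 1/(H + eps2) elsewhere."""
    H = np.fft.fft2(h_psf, s=g.shape)
    safe_H = np.where(H == 0, 1, H)     # guard the unused branch of np.where
    Hinv = np.where(np.abs(H) >= eps1, 1.0 / safe_H, 1.0 / (H + eps2))
    return np.real(np.fft.ifft2(Hinv * np.fft.fft2(g)))
```

The 1/snr term in the Wiener denominator is exactly what keeps the filter bounded near the zeros of H, while the pseudo-inverse achieves the same boundedness with the ε thresholds.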

6.3.3

Denoising

Denoising may be considered a particular restoration problem in which the PSF of the blur operator is a delta function. Thus, some of the methods described above are appropriate for denoising (Rudin et al., 1992; You and Kaveh, 1996). We also consider two examples of rank algorithms (Yaroslavsky and Eden, 1996). Rank algorithms are designed specifically for noise reduction. They are based on statistics extracted from the variational row, that is, the sequence formed by a central pixel and its neighbors, ranked in increasing order of their intensities. Different definitions of the neighborhood and of the variational rank statistics lead to diverse rank algorithms. Rank statistics may also be obtained from local histograms and are computationally efficient when applied recursively. The main advantage of rank algorithms is local adaptivity, and different denoising algorithms may also be applied in cascade. First, we consider an averaging technique called peer group averaging (PGA), in which the central pixel intensity is replaced by the average intensity of those pixels in a predefined neighborhood that are closest to it in intensity. The number of pixels over which averaging is performed is called the peer group size, and it controls the amount of smoothing. The second method, the median filter, replaces the grey-level intensity of each pixel by the median of its neighboring pixel intensities. This method is particularly effective when the noise is spike-like; it is nonlinear, very robust, and preserves edge sharpness.
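Both rank-type filters can be sketched directly. This is a naive per-pixel loop for clarity (not an efficient histogram-based implementation); clipping the window at the image borders is our choice:

```python
import numpy as np

def _neighborhood(img, y, x, half):
    """Square window around (y, x), clipped at the image borders."""
    return img[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1].ravel()

def median_filter(img, size=3):
    """Replace each pixel by the median of its size x size neighborhood."""
    half = size // 2
    out = np.empty_like(img, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.median(_neighborhood(img, y, x, half))
    return out

def peer_group_average(img, size=3, ng=5):
    """Peer group averaging: average the ng window pixels closest in intensity
    to the central pixel (the peer group size ng controls the smoothing)."""
    half = size // 2
    out = np.empty_like(img, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            w = _neighborhood(img, y, x, half)
            order = np.argsort(np.abs(w - img[y, x]))
            out[y, x] = w[order[:ng]].mean()
    return out
```

On a pixel hit by "Salt and Pepper" noise, the extreme value is an outlier in its window, so the median simply discards it; this is why the median filter works so well on pulse noise in the experiments below.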

6.4

Results

Our experiments have shown that training with schemes B and C (see Section 6.1.1) leads to improved recognition compared with training scheme A. We have also observed that scheme B is superior to scheme C, though the difference between them is insignificant. Therefore, below we concentrate on ensembles obtained using the two training


schemes A and B, and postpone the summary comparison of all three schemes until Section 6.4.6. All experiments are carried out on the TAU facial data set.

6.4.1

Image filtering

In the first group of experiments, we compared the abilities of different ensembles to classify images processed by ideal filters and by typical low-, band- and high-pass filters. Classification results are presented in Table 6.1 and some degraded images in Figure 6.3.

    Types of corruption                  Training scheme A             Training scheme B
                                         Unconstr.  Reconstr. Joined   Unconstr.  Reconstr. Joined
                                         (λ = 0)    ensemble  ensemble (λ = 0)    ensemble  ensemble
    "Clean" data                         12.8       12.8      13.5     9.5        10.8      8.8
    Ideal low-pass with cutoff w = 10    15.5       14.2      13.5     9.5        10.1      9.5
    Gaussian blur with σ = 2             19.6       16.2      16.9     14.2       11.5      10.8
    Out-of-focus blur with r = 5         20.9       20.9      17.6     16.2       10.8      11.5
    Motion blur, diagonal, d = 5         32.4       26.4      26.4     29.7       24.3      19.6
    Motion blur, horizontal, d = 5       21.6       24.3      19.6     16.9       14.9      12.2
    Ideal band-pass, cutoffs 3 and 10    41.2       41.2      35.8     39.2       31.8      28.4
    DOG filter, σ1 = 1, σ2 = 2           31.8       26.4      23.6     23.0       26.4      20.9
    Ideal high-pass with cutoff w = 3    39.2       35.1      32.4     33.8       31.8      27.7
    Root filter with α = 0.6             16.9       17.6      12.8     10.1       10.8      8.1
    Root filter with α = 0.8             12.8       13.5      12.8     9.5        9.5       8.1

Table 6.1: Percent classification error for filtered data (TAU data set). Training scheme B uses extra blurred images.

Low-pass filtering

We have considered the ideal low-pass filter with cutoff w = 10, Gaussian blur with standard deviation σ = 2, motion blur in the diagonal and horizontal directions, and out-of-focus blur, all with blur propagation over 5 pixels. We note that for each of the training schemes A and B, the unconstrained (λ = 0) ensemble is inferior to the reconstruction and joined ensembles in blurred image recognition. In turn, the reconstruction ensembles are superior to the unconstrained ensembles. For example, for Gaussian blurred images the unconstrained ensemble of training scheme A yields a misclassification rate of 19.6%, while the reconstruction ensemble produces


16.2%. For ensembles trained with scheme B, the misclassification rate falls from 14.2% for the unconstrained ensemble to 11.5% for the reconstruction ensemble. Merging the unconstrained and reconstruction ensembles improves classification further. For example, for out-of-focus images the joined ensemble of training scheme A has a misclassification rate of 17.6%, while the reconstruction ensemble produces 20.9%. For diagonal motion, the joined ensemble of training scheme B has a misclassification rate of 19.6%, compared with 24.3% for the reconstruction ensemble. In short, reconstruction ensembles often give better classification results than unconstrained ensembles, and joined ensembles improve classification further.

Band-pass filtering

Band-pass filtering is represented by the DOG filter, with on and off receptive field sizes equal to 1 and 2 pixels respectively, and by ideal band-pass filtering with inner and outer cutoff radii equal to 3 and 10, respectively. Our experiments show that joined ensembles are better than reconstruction ensembles, which in most cases are better than unconstrained (λ = 0) ensembles. Thus, for training scheme A tested on DOG-filtered images, the misclassification rate falls from 31.8% for the unconstrained ensemble to 26.4% for the reconstruction ensemble, and then to 23.6% for the joined ensemble. For training scheme B the reconstruction ensemble is inferior to the unconstrained ensemble, but the joined ensemble is superior: its misclassification rate is 2.1 percentage points lower than that of the unconstrained λ = 0 ensemble. Finally, the joined ensemble of scheme B improves the results by 10.9 percentage points in comparison with the classical unconstrained ensemble of training scheme A.

High-pass filtering

High-pass filtering is represented by the ideal high-pass filter with cutoff w = 3 and by the root filter. Though images degraded with the high-pass filter bear a resemblance to the original images (Figure 6.3i), they are difficult to recognize. The smallest misclassification rate on these data is achieved by the joined ensemble of training scheme B (27.7%). When the degradation is milder, recognition improves and the filtering may even be useful: classification results on root-filtered images are slightly better than the results for "clean" images. Interestingly, humans also recognize slightly enhanced images better than the originals. Remarkably, joined ensembles are best at recognizing differently degraded images.

6.4.2

Classification of noisy data

In the following section, we test the performance of our scheme under realistic noise and blur degradations. We first test the performance under various noise operations on non-blurred objects, in order to have a baseline for comparison with the blurred results. Results of an ensemble of networks on noisy and restored images are presented in Table 6.2. Two kinds of noise, "Salt and Pepper" and Gaussian, at small and large levels, are considered. "Salt and Pepper" noise is applied with density parameters d = 0.2 and d = 0.6; Gaussian noise corresponds to snr = 10 and snr = 1. A median filter with window size 3 × 3 is used to denoise images corrupted with "Salt and Pepper" noise. To denoise images degraded with Gaussian noise, peer group averaging (PGA) has been used, with window size 3 × 3 and group size ng = 5 for snr = 10 and ng = 6 for snr = 1.

    Types of corruption                  Training scheme A             Training scheme B
                                         Unconstr.  Reconstr. Joined   Unconstr.  Reconstr. Joined
                                         (λ = 0)    ensemble  ensemble (λ = 0)    ensemble  ensemble
    "Clean" data                         12.8       12.8      13.5     9.5        10.8      8.8
    "Salt and Pepper" noise, d = 0.2     25.0       20.3      20.3     20.3       18.2      14.2
    Median filter denoising              13.5       12.8      12.8     9.5        8.8       8.8
    "Salt and Pepper" noise, d = 0.6     70.3       66.9      69.6     81.8       76.4      74.3
    Median filter denoising              25.0       20.3      21.6     20.9       20.9      14.9
    Gaussian noise, snr = 10             13.5       13.5      12.8     8.1        10.8      8.1
    PGA denoising, ng = 5                13.5       14.9      13.5     8.8        10.8      8.8
    Gaussian noise, snr = 1              15.5       16.9      12.8     10.1       10.8      8.1
    PGA denoising, ng = 6                14.9       15.5      12.8     10.8       12.8      8.8

Table 6.2: Percent classification error for noisy data (TAU data set).

Examples of noisy and restored images are presented in Figure 6.4. We note that classification is more sensitive to "Salt and Pepper" noise than to Gaussian noise, which may be explained by the quasi-linear nature of the MLP network transformations. For "Salt and Pepper" noise of density d = 0.6, 60% of the image pixel intensities are replaced by marginal intensity values, which leads to a very high misclassification rate. Additional preprocessing by a median filter significantly improves classification and gives a mild misclassification rate of 14.9% for the best joined ensemble of training scheme B. The sensitivity of the network ensembles to Gaussian noise is small. Moreover, the joined


ensembles of both schemes A and B are insensitive to Gaussian noise; denoising carried out beforehand even slightly degrades the classification results.

Figure 6.4: a) An image contaminated with "Salt and Pepper" noise at 20% corruption. b) Result of median smoothing in a window of size 3 × 3. c) An image contaminated with Gaussian noise with snr = 1. d) Result of peer group averaging in a window of size 3 × 3 with a peer group of size ng = 6.

6.4.3

Gaussian blur

The classification results for Gaussian blurred images without noise interference, and for their restored versions, are presented in Table 6.3. The Gaussian operator has standard deviation σ = 2.

    Types of corruption                  Training scheme A             Training scheme B
                                         Unconstr.  Reconstr. Joined   Unconstr.  Reconstr. Joined
                                         (λ = 0)    ensemble  ensemble (λ = 0)    ensemble  ensemble
    "Clean" data                         12.8       12.8      13.5     9.5        10.8      8.8
    Gaussian blur with σ = 2             19.6       16.2      16.9     14.2       11.5      10.8
    Pseudo-inverse filter with σ̂ = 1.5   15.5       13.5      14.2     8.1        10.1      8.8
    Pseudo-inverse filter with σ̂ = 2.0   13.5       13.5      12.8     9.5        10.8      8.8
    Pseudo-inverse filter with σ̂ = 2.5   15.5       15.5      12.8     9.5        10.1      7.4
    Root filter with α = 0.6             12.8       13.5      12.8     14.9       12.8      10.8
    Root filter with α = 0.8             13.5       14.2      14.2     12.2       12.2      8.8

Table 6.3: Percent classification error for deblurred data (TAU data set).

The ensemble most sensitive to Gaussian blur is the unconstrained λ = 0 ensemble of training scheme A, and the best is the joined ensemble of training scheme B. For deblurring, pseudo-inverse and root filters have been used. In the pseudo-inverse filter, the standard deviation of the Gaussian kernel is assumed to be known only approximately. The inverse Gaussian operator with an approximated standard deviation σ̂ in


the frequency domain is given by:

    H⁻¹_σ̂(w) = exp(2π² σ̂² w²).    (6.4.13)

Thus two main cases exist. In the first case, the guessed value is less than the original (σ̂ < σ) and the image remains partially blurred with a Gaussian filter. In the second case, the guessed value exceeds the original (σ̂ > σ), which corresponds to filtering with a high-pass filter given in the frequency domain by:

    H_β(w) = exp(2π² β² w²),  β = √(σ̂² − σ²).    (6.4.14)

This analysis does not consider the computational problems connected with the asymptotic behavior of H⁻¹_σ̂(w) as w tends to infinity. Classification results for pseudo-inverse filtered images are presented in Table 6.3 in the rows marked "Pseudo-inverse filter", and restored images are shown in Figure 6.5 (d-f). The pseudo-inverse filter has been applied three times, with approximated standard deviations σ̂ = 1.5, 2, 2.5.

Figure 6.5: a) Image blurred with a Gaussian filter with standard deviation σ = 2; b) enhancement with the root filter with α = 0.8; c) enhancement with the root filter with α = 0.6; d) pseudo-inverse filter with guessed σ̂ = 1.5; e) pseudo-inverse filter with guessed σ̂ = 2; f) pseudo-inverse filter with guessed σ̂ = 2.5.

As expected, deblurring improves the classification results, and the best results are obtained for the joined ensemble trained with scheme B. We note that both


joined ensembles classify pseudo-inverse deblurred images with σ̂ = 2.5 slightly better than "clean" data. We have observed similar behavior for high-pass filtered data. Simple enhancement with the root filter also improves the classification results.

6.4.4

Motion blur

    Types of corruption                        Training scheme A             Training scheme B
                                               Unconstr.  Reconstr. Joined   Unconstr.  Reconstr. Joined
                                               (λ = 0)    ensemble  ensemble (λ = 0)    ensemble  ensemble
    "Clean" data                               12.8       12.8      13.5     9.5        10.8      8.8
    d = 5 pixels, snr = inf (no noise)         21.6       24.3      19.6     16.9       14.9      12.2
    deblurring                                 12.8       12.8      12.8     9.5        10.8      8.8
    d = 5 pixels, Gaussian noise, snr = 100    20.9       24.3      19.6     16.2       15.5      12.2
    deblurring                                 13.5       14.2      13.5     9.5        10.8      8.8
    d = 5 pixels, Gaussian noise, snr = 10     21.6       23.6      19.6     16.9       15.5      12.2
    smoothing and deblurring                   14.9       14.9      14.2     9.5        10.8      9.5
    d = 7 pixels, snr = inf (no noise)         27.0       29.1      23.6     20.3       23.0      16.2
    blind deconvolution                        13.5       15.5      12.8     10.8       11.5      9.5

Table 6.4: Percent misclassification rate for motion blurred and restored images. Motion takes place in the horizontal direction and Gaussian noise is added. Motion propagation is given by the parameter d; the noise level is indicated by the signal-to-noise ratio snr (snr = inf when noise is absent). MSE minimization with adaptive Tikhonov regularization is used for restoration. Rows marked "deblurring" stand for deblurring with a known blur operator.

Table 6.4 presents classification results for images degraded by horizontal motion and additive Gaussian noise (Figure 6.6). As expected, classification declines as the blur propagation increases. As we have already seen, the influence of noise is less dramatic, in particular for the joined ensembles. The negative role of the noise is instead revealed during image restoration: the blur propagation may be estimated from the well-defined periodic structure of the zero-crossing line locations of the motion filter in the frequency domain, but this method is highly sensitive to noise. For restoration, MSE minimization with Tikhonov adaptive regularization is used. In all experiments with motion propagation over d = 5 pixels, the motion filter is assumed to be known. For noise degradation with snr = 10 (10 dB), simple smoothing (averaging) in a window of size 3 × 3 pixels is carried out before restoration. Classification after deblurring of images degraded with small noise, snr = 100 (20 dB), is the same as for "clean"

Chapter 6: Recognition of blurred images

a

Motion blur and deblur b c

118

d

Figure 6.6: a) Motion blur with propagation on 5 pixels and Gaussian noise with snr = 10. b) Motion deblur using the constrained regularization method with the known blur filter and with the simple averaging in the window 3 × 3 before its application. c) Motion blur with blur propagation equal to 7 pixels. d) Blindly restored image. images for both joined ensembles. For larger noise with snr = 10 (10 Db) classification is slightly worse. To restore the images blurred as a result of motion with blur propagation parameter d = 7 pixels in the absence of noise, the Tikhonov regularization for both image and filter is applied. Since the direction of motion blur can be easier estimated than the motion propagation parameter, it is assumed to be known. The kernel support of the blur filter is taken to be 9 pixels in the motion direction. Initial guesses are the observed blurred image for an image and a delta function for a blurring operator. The results of this experiment are presented in the two last rows of Table 6.4. Though deblurred images differ slightly visually from the “clean” data, their classification is the same as for “clean” data. The joined ensemble obtained using the training scheme B is the best in classification of motion blurred and restored images. The classification results for images, blurred with Gaussian filter, and contaminated with Gaussian noise, along with their deblur using blind deconvolution are presented below.

6.4.5 Blind deconvolution

This section presents classification results for blindly deconvolved images. The blurred images are obtained by spatial convolution of the original images with a Gaussian kernel with standard deviation σ = 2, pruned to a support of 7 × 7 pixels. Blind deconvolution is carried out using the regularization approach to image identification and restoration (You and Kaveh, 1996). The filter and the image are assumed to be positive, and the kernel support is taken to be 15 × 15 pixels. The sum of the filter kernel coefficients and the total image intensity are normalized to 1. The initial guess for the image is the degraded face, and we start from a delta-function filter; no symmetry constraints (Chan and Wong, 1997) are used. The regularization parameters are set by hand from visual appearance, once and for all images.

An image blurred with a truncated Gaussian filter and contaminated with Gaussian noise of snr = 100, together with its blind deconvolution, is presented in Figure 6.7.

Figure 6.7: a) Image blurred with a Gaussian filter with standard deviation σ = 2, pruned to a support area of 7 × 7, and Gaussian noise with snr = 100. b) Blind deblurring of the degraded image. c) Original filter. d) Found filter, pruned to the same support as the original filter.

Classification results for the two cases, with and without noise interference, are presented in Table 6.5.

                                     Training scheme A                         Training scheme B (with extra blurred images)
Types of corruption            Unconstrained  Reconstruction  Joined     Unconstrained  Reconstruction  Joined
                               λ=0            ensemble        ensemble   λ=0            ensemble        ensemble
”Clean data”                   12.8           12.8            13.5        9.5           10.8             8.8
Blur with pruned
Gaussian filter                18.2           16.2            17.6       10.1           10.1            10.8
  blind deconvolution          12.8           13.5            12.8        9.5           10.8             8.1
Blur with pruned Gaussian
filter and Gaussian noise,
snr = 100                      18.9           16.2            18.2       10.8           10.1            10.8
  blind deconvolution          12.8           14.2            12.8       10.1           10.1             8.8

Table 6.5: Percent misclassification rate for blurred and blindly deblurred images. The images are blurred with a pruned Gaussian filter.

We note that, among the ensembles obtained with training scheme A, the reconstruction ensemble is the least sensitive to blurring and noise. Ensembles obtained with training scheme B are less sensitive to noise and blur overall, and the joined ensemble obtained with training scheme B has the best classification performance.
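The blind identification/restoration loop used above can be sketched as a simplified alternating minimization, a rough frequency-domain stand-in for the method of You and Kaveh (1996), with the positivity and normalization constraints from the text. The kernel-support restriction is omitted for brevity, and all parameter names and values are illustrative.

```python
import numpy as np

def blind_deconv(g, n_iter=25, alpha_f=1e-3, alpha_h=1e-1):
    """Alternating Tikhonov-regularized estimates of image f and filter h.

    Holding h fixed, f is a regularized least-squares solution in the
    frequency domain, and vice versa. Positivity is enforced by clipping;
    the filter is renormalized so its coefficients sum to 1.
    """
    G = np.fft.fft2(g)
    f = g.copy()                          # initial guess: the degraded image
    h = np.zeros_like(g); h[0, 0] = 1.0   # initial guess: a delta function
    for _ in range(n_iter):
        H = np.fft.fft2(h)
        f = np.real(np.fft.ifft2(np.conj(H) * G / (np.abs(H) ** 2 + alpha_f)))
        f = np.clip(f, 0.0, None)         # image positivity
        F = np.fft.fft2(f)
        h = np.real(np.fft.ifft2(np.conj(F) * G / (np.abs(F) ** 2 + alpha_h)))
        h = np.clip(h, 0.0, None)         # filter positivity
        h /= max(h.sum(), 1e-12)          # kernel coefficients sum to 1
    return f, h
```

In practice the two regularization weights play the role of the hand-tuned parameters mentioned above: stronger regularization on h keeps the recovered filter smooth and compact.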

6.4.6 All training schemes

Figure 6.8: Percent classification error bar graph for reconstructed images. Regression ensembles A–C correspond to joined ensembles obtained using training schemes A–C, respectively. Heights of the bars marked Ad–Cd show the misclassification of ensembles A–C, respectively, on restored images. See the corresponding Table 6.6 for the description of degradation types a–f.

Summary classification results for joined ensembles corresponding to all training schemes A–C are presented in Figure 6.8 and Table 6.6. First, we observe that ensembles of networks trained using the expanded training data-set are superior to the joined ensemble trained without blurred images. Secondly, we note that ensembles B and C have about the same classification performance, with ensemble B slightly better. This may be explained by the drastic compression rate, which causes reconstructed images to look blurred in both cases and makes the two types of reconstruction constraints similar (see Figure 6.9). The third and most important observation is that training with blurred images appears to be more important than restoration preprocessing: recognition of restored images using scheme A (column and bars marked Ad) is inferior to degraded-image recognition using schemes B–C. However, as already noted, the use of image preprocessing before the recognition schemes leads to improved classification results.

                                                              Joined ensembles
Image degradation and deblur type                   Bd      B      Cd      C      Ad      A
a) clean, root filter α = 0.8                       8.1     8.8    9.5     8.1    12.8    13.5
b) Gaussian blur σ = 2, root filter α = 0.8         8.8    10.8    8.8    10.8    14.2    16.9
c) truncated Gaussian blur σ = 2, Gaussian noise
   snr = 20 dB, blind deconvolution                 8.8    10.8    8.1    10.1    12.8    18.2
d) motion blur d = 7, blind deconvolution           9.5    16.2   10.8    16.9    12.8    23.6
e) out-of-focus blur, a = 5                          *     11.5     *     12.8      *     17.6
f) DoG filter, σ1 = 1 and σ2 = 2                     *     20.9     *     20.9      *     23.6

Table 6.6: Percent classification error for reconstructed images. Regression ensembles A–C correspond to joined ensembles obtained using training schemes A–C, respectively. Columns marked Ad–Cd show the misclassification of ensembles A–C, respectively, on restored images. Entries where the restoration process was not done are marked with *. Experiments with the TAU data-set.

6.5 Conclusions

Two ways to improve the challenging problem of blurred image recognition were proposed: (i) preprocess the blurred images using blind deconvolution methods before recognition; (ii) apply our regularized reconstruction-constraints technique (Chapter 3) to a training set that has been expanded with blurred images of some form. The latter forces the reconstruction operator that is estimated during training to become less sensitive to the blur operation; without the reconstruction constraints, training on the expanded set alone does not improve results. Two training schemes, with and without blurred images, have been compared, and different network ensembles have been considered. The best classification scheme is the one that includes both the hybrid recognition/reconstruction architecture and the use of blurred images. The best network ensemble is the joined ensemble, obtained by merging the unconstrained and reconstruction ensembles trained with blurred images.


We have shown that the combination of the two approaches, restoration and regularized classification, is superior to either one separately. Since restoration techniques are very sensitive to noise and require a priori knowledge or human visual interaction, it is important that the hybrid classification/reconstruction architecture is less sensitive to the restoration parameters.



Figure 6.9: Reconstruction of Gaussian blurred images by Neural Networks obtained using training schemes B–C. Images in the top row from left to right are an original image, its “caricature” image and Gaussian blurred image. In the middle row, images reconstructed by Neural Networks with λ = 0.05, 0.15, 0.25 and with reconstruction defined by the training scheme B are presented. In the bottom row, images reconstructed by Neural Networks with λ = 0.05, 0.15, 0.25 and with reconstruction defined by the training scheme C are presented. Note that though images in the middle row are sharper than images in the bottom row, they nevertheless look blurred.

Chapter 7

Summary and future work

In this final chapter, we summarize the main contributions of the thesis and present several possible directions for future work.

7.1 Summary

Our primary goal in this thesis was to improve the performance of a high-dimensional image recognition task by extracting a good hidden representation of the image data. We developed several approaches to achieve good generalization in image recognition.

First, we developed a novel hybrid feed-forward reconstruction/recognition network architecture, with two output sublayers for reconstruction and recognition and one common hidden layer shared by both tasks (Chapter 3). The network was trained to concurrently minimize the MSE of the reconstruction and recognition output sublayers. Though a similar architecture had been used previously (see Section 3.1.5), we were the first to use it for improving image recognition, and we gave a new interpretation of the hybrid network as a tool to control bias by imposing a novel type of reconstruction bias constraints. In addition, we introduced a trade-off parameter λ that defines the influence of each of the tasks and is unknown a priori. We considered networks with different values of λ, instead of only a single value, as had been proposed previously.

In addition, the network and its learning rule were interpreted in the MDL and Bayesian frameworks. In the Bayesian formulation, the network is trained to maximize the conditional joint probability of the reconstructed image and its class label given the observed image. In the proposed architecture, the reconstructed image and its class label are independent given the observed image, and under the assumption of a Gaussian distribution of the errors this maximization leads to the proposed learning rule. The trade-off parameter λ emerges as a hyper-parameter, and according to Bayesian theory the right approach is to integrate predictors over this parameter. If the initial weights w0 of the feed-forward network are also considered as hyper-parameters, then the predictor f is given by:

$$ f(x) = \iint f_{\lambda, w_0}(x)\, p(\lambda, w_0 \mid \mathcal{X})\, dw_0\, d\lambda \qquad (7.1.1) $$
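The concurrent minimization of the two MSE terms described above can be written as a single weighted objective. The following is a minimal sketch only; the function name and the exact weighting convention are illustrative and may differ from the learning rule derived in Chapter 3.

```python
import numpy as np

def hybrid_loss(y_class, t_class, x_recon, x_input, lam):
    """Weighted sum of the recognition and reconstruction MSE terms.

    lam trades off the two tasks: lam = 0 recovers a purely supervised
    (unconstrained) network, while larger lam emphasizes reconstruction.
    """
    e_rec = np.mean((x_recon - x_input) ** 2)   # reconstruction sublayer MSE
    e_cls = np.mean((y_class - t_class) ** 2)   # recognition sublayer MSE
    return (1.0 - lam) * e_cls + lam * e_rec
```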

This interpretation led us to the second approach to improving image recognition. We proposed to replace the integration in Eq. 7.1.1 by a rough approximation via ensemble network averaging. Networks with good recognition performance were included in the ensemble, and their posterior probabilities p(λ, w0 |X ) were assumed to be equal. It is well known that ensemble averaging can reduce the variance portion of the prediction error. We considered three ensemble types (Chapter 5):

• Unconstrained ensemble, which corresponds to integration over w0 for λ = 0

• Reconstruction ensemble, which corresponds to integration over λ for fixed w0

• Joined ensemble, which is a combination of the unconstrained and reconstruction ensembles and corresponds to integration over both parameters

We have shown that the joined ensemble is superior to the reconstruction ensemble, which in turn is superior to the unconstrained ensemble, in the recognition of images degraded by Gaussian and pulse noise as well as by partial occlusion or image blur.

Our third contribution concerns improving the recognition of blurred images in particular. It is well known that in many practical recognition tasks images appear blurred due to motion, bad weather conditions, or defocused cameras. Three ways were proposed for improving blurred image recognition:

1. Expanding the training set with Gaussian blurred images

2. During training, constraining reconstruction of the blurred images to the original clean images

3. Applying state-of-the-art restoration methods to the blurred images before using the hybrid architecture

The first two ways led to two additional joined ensembles that we trained with extra Gaussian-blurred images and reconstruction constraints; reconstruction was either to the blurred image or to the clean (non-blurred) image (Chapter 6). We have shown that ensembles trained on the extra (blurred) images had improved recognition performance on different image degradation types. In addition, we have shown that training with extra images combined with restoration techniques achieved robust and best recognition performance under a wide range of blur operators and parameters.

An additional contribution of the thesis is the development of hybrid networks with unsupervised learning constraints (Chapter 5), which were mainly used for comparison with reconstruction constraints. We have shown that these constraints can also be used for improving recognition performance instead of, or in parallel with, reconstruction constraints. In addition, we addressed the issue of network interpretability by investigating the network hidden representation and hidden weights (Appendix 5.7) and by saliency map construction (Section 5.5). In contrast to explicitly understanding what information is encoded in the hidden space, the saliency map allows one to decide which features of the input are more important. We showed that usage of the saliency maps further improves recognition of images degraded with “Salt and Pepper” noise.
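The equal-posterior ensemble approximation of Eq. 7.1.1 amounts to a plain average of network outputs. A minimal sketch, in which each network is any callable returning class scores (the names here are illustrative):

```python
import numpy as np

def ensemble_predict(nets, x):
    """Equal-weight average of network outputs, a rough stand-in for
    integrating predictors over (lambda, w0) as in Eq. 7.1.1."""
    outputs = np.stack([net(x) for net in nets])  # (n_nets, ..., n_classes)
    return outputs.mean(axis=0).argmax(axis=-1)   # average, then decide
```

Replacing `mean` with a weighted average would correspond to the heuristically postulated posterior discussed in Section 7.2.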

7.2 Directions for future work

Non face data sets
We have tested the proposed hybrid system on facial data sets. Faces, however, are a special type of stimuli where all pixels are important (Biederman and Kalocsai, 1997). It would be interesting to test the hybrid architecture's performance on data sets of similar objects, such as military images (different kinds of tanks, ships, cars, etc.), medical images (different kinds of tumor cells) and astronomical images (images of different stars and galaxies).

Ensemble interpretation
In Appendix 5.7, the hidden representations of single Neural Networks with reconstruction constraints were investigated. In addition, it was noted that the hidden representation of a network ensemble is not well defined. However, another form of interpretation, using the mean derivative (over networks and images) with respect to the inputs for each of the classes (Intrator and Intrator, 1993), may be very interesting.

Recurrent network architecture
Images reconstructed by Neural Networks (which we called prototypes, see Section 5.5) are reduced representations of the original images, since drastic compression occurs via the bottleneck architecture (see Figure 5.5). However, as can be seen, prototypes corresponding to the same class look similar, while prototypes corresponding to different classes look different. It is also clear that a good reconstruction/recognition network has to be able to recognize its own prototype images. Table 7.1 presents the recognition performance of the unconstrained and reconstruction ensembles (see Chapter 3) when they are tested on the prototype images.

Classification error for reconstructed images (Regression Ensembles)
Types of degradation                Unconstrained  Reconstruction A  Reconstruction B
”Clean data”          input         0.5            1.5               1.0
                      prototype     1.5            2.1               2.6
“Salt and Pepper”     input         7.2            12.8              13.3
noise, d = 0.2        prototype     7.2            11.8              12.8
“nose” occlusion      input         0.5            1.5               1.5
                      prototype     1.5            2.1               2.1
“half face”           input         5.6            5.6               6.2
occlusion             prototype     6.2            6.7               7.2
”DOG 1-2”             input         7.7            4.1               8.2
                      prototype     8.2            4.1               8.2

Table 7.1: Errors are given in percent (Pentland data-set).

These results show that the networks are better at recognizing the original images than their own prototypes. This can be corrected by propagating reconstructed images back to the input layer during learning. In other words, we propose to extend the training set during learning with extra images $x_e$, which are a linear combination of the input $x$ and its prototype image $x_p$:

$$ x_e = \rho(t)\, x + (1 - \rho(t))\, x_p, \qquad \rho(t) \in [0, 1], $$

where ρ(t) is a non-increasing function of the training epoch number t, equal to 1 at the beginning of training and tending to 0 as t → ∞. This procedure may give better results and should be tested by simulation.

Network ensembles
We considered ensembles corresponding to the simplest version of the integration (7.1.1), with equal posterior probabilities p(λ, w0 |X ). Though it is impossible to find the posterior probability p(λ, w0 |X ) analytically, it may be postulated heuristically, and the integration (7.1.1) may then be replaced by weighted network ensemble averaging. We tried weights based on different error measures between the input and the output reconstruction layer, such as the Euclidean metric or a correlation measure, and their soft versions using the exp(−x) function. However, our preliminary experiments did not show significant recognition improvement. Since the hybrid networks solve both recognition and reconstruction tasks, it is also reasonable to use the ensemble of hybrid networks for reconstruction; the obtained prototype may then be used for recognition by all the networks.
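The training-set extension x_e = ρ(t) x + (1 − ρ(t)) x_p proposed under "Recurrent network architecture" above can be sketched as follows; the linear schedule for ρ(t) is only one assumed choice of non-increasing function, and the names and t_max value are illustrative.

```python
import numpy as np

def blend_schedule(t, t_max):
    """A simple non-increasing rho(t): 1 at t = 0, 0 from t_max onward."""
    return max(0.0, 1.0 - t / t_max)

def extended_batch(x, x_proto, t, t_max=100):
    """Extra training images x_e = rho(t) * x + (1 - rho(t)) * x_p."""
    rho = blend_schedule(t, t_max)
    return rho * x + (1.0 - rho) * x_proto
```

Early in training the network sees the original inputs; as ρ(t) decays, its own prototypes are gradually mixed in, so the network learns to recognize them as well.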


Degradation invariance constraints
We considered the simplest version of invariance constraints: expanding the data with Gaussian blurred images. Another type of invariance constraint is the tangent-prop constraint, which was used for a group of geometrical transformations (see Section 2.3.2). This type of constraint may be adapted to different types of blur operations, for both the recognition and reconstruction tasks.

Generalization
It would be interesting to generalize the hybrid architecture in the direction taken by other generative models (Hinton and Ghahramani, 1997; Ullman, 1995).

Bibliography

Abu-Mostafa, Y. (1993). A method for learning from hints. In Touretzky, D. S. and Lippmann, R. P., editors, Advances in Neural Information Processing Systems, volume 5, pages 73–80. Morgan Kaufmann, San Mateo, CA.

Abu-Mostafa, Y. (1994). Learning from hints. J. of Complexity, 10:165–178.

Amari, S., Cichocki, A., and Yang, H. H. (1996). A new learning algorithm for blind signal separation. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems, volume 8, pages 757–763. MIT Press.

Atick, J. (1992). Entropy minimization: A design principle for sensory perception. International Journal of Neural Systems. Proceedings of the Neural Networks: From Biology to High Energy Physics, 3:81–90.

Atick, J. J., Griffin, P. A., and Redlich, A. N. (1995). Face recognition from live video for real-world applications. Advanced Imaging.

Atick, J. J. and Redlich, N. (1992). What does the retina know about natural scenes. Neural Computation, 4:196–210.

Baird, H. (1990). Document image defect models. In IAPR Workshop on Syntactic and Structural Pattern Recognition, pages 38–46. Murray Hill, NJ.

Baluja, S. (1996). Expectation-based selective attention. PhD thesis, School of Computer Science, CMU.

Baluja, S. and Pomerleau, D. A. (1995). Using the representation in a neural network’s hidden layer for task-specific focus of attention. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, Canada.

Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages. In Rosenblith, W., editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA.


Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1(3):295–311. Baron, R. J. (1981). Mechanisms of human facial recognition. The international Journal of Robotics research, 15:137–178. Bartlett, Stewart, M., Lades, Martin, H., and Sejnowski (1998). Independent component representations for face recognition. Proceedings of the SPIE Symposium on Electronic Imaging: Science and Technology. Conference on Human Vision and Electronic Imaging III. Bell, A. J. and Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159. Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press, Princeton, NJ. Bhanu, B. and Ming, J. C. (1987). Recognition of occluded objects: A cluster-structure algorithm. Pattern Recognition, 20(2):199–211. Biederman, I. and Kalocsai, P. (1997). Neurocomputational bases of object and face recognition. Philosophical Transactions of the Royal Society of London B, 352:1203– 1219. Comparison of face and other object recognition. Bienenstock, E. L., Cooper, L. N., and Munro, P. W. (1982). Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal Neuroscience, 2:32–48. Bishop, C. M. (1995a). Neural Networks for Pattern Recognition. Oxford University Press. Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116. Blais, B. S., Intrator, N., Shouval, H., and Cooper, L. N. (1998). Receptive field formation in natural scene environments: comparison of single cell learning rules. Neural Computation, 10(7):1797–1813. Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59:291–294. Brunelli, R. and Poggio, T. (1992). Face recognition through geometrical features. Proceedings of the 2nd European Conference on Computer Vision.


Brunelli, R. and Poggio, T. (1993). Face recognition: Features versus templates. IEEE PAMI, 15:1042–1052.

Cannon, M. (1976). Blind deconvolution of spatially invariant image blurs with phase. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24:58–63.

Caruana, R. (1995). Learning many related tasks at the same time with backpropagation. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems, volume 7, pages 657–664. Morgan Kaufmann, San Mateo, CA.

Chan, T. F. and Wong, C. K. (1997). Total variation blind deconvolution. Technical report, UCLA.

Chellappa, R., Wilson, C. L., and Sirohey, S. (1995). Human and machine recognition of faces: A survey. Proceedings of the IEEE, pages 704–740.

Chopper (1997). Autonomous helicopter project. CMU.

Coifman, R. R. and Wickerhauser, M. (1992). Entropy-based algorithms for best basis selection. IEEE Trans. Info. Theory, 38(2):713–719.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36:287–314.

Connolly (1997). Robotics internet resources page. UMass Laboratory for Perceptual Robotics.

Cottrell, G. W. and Fleming, M. K. (1990). Categorization of faces using unsupervised feature extraction. Proc. Int. Conf. on Neural Networks, 2:65–70.

Cottrell, G., Munro, P., and Zipser, D. (1987). Image compression by back propagation: An example of extensional programming. Advances in Cognitive Science, 3.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience, New York.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Proceedings of the Royal Statistical Society, B-39:1–38.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley, New York.


Edelman, S., Reisfeld, D., and Yeshurun, Y. (1992). Learning to recognize faces from examples. In Proceedings of the 2nd European Conference on Computer Vision, pages 787–791, Santa Margherita Ligure, Italy. Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York. Elman, J. L. and Zipser, D. (1988). Learning the hidden structure of speech. Journal of the Acoustical Society of America, 4(83):1615–1626. Fabian, R. and Malah, D. (1991). Robust identification of motion and out-of-focus blur parameters from blurred and noisy images. cvgip, 53(5):403–412. Field, D. J. (1994). What is the goal of sensory coding. Neural Computation, 6:559–601. Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82:249–266. Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic press, London. Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the biasvariance dilemma. Neural Computation, 4:1–58. Girolami, M. and Fyfe, C. (1996). Negentropy and kurtosis as projection pursuit indices provide generalised ica algorithms. Preprint. Gluck, M. A. and Myers, C. E. (1993). Hippocampal mediation of stimulus representation: A computational theory. Hippocampus, 3(4):491–516. Gonzalez, R. C. and Wintz, P. (1993). Digital Image Processing. Addison-Wesley Publishing Company. Grenander, U. (1978). Pattern analysis. Springer-Verlag, New York. Gutta, S., Huang, J., Imam, I. F., and Wechsler, H. (1996). Face and hand gesture recognition using hybrid classifiers. In Proceedings of the second IEEE International Conference on Automatic Face and Gesture Recognition, pages 164–169, Washington, Brussels, Tokyo. Hancock, J. B., Baddeley, R. J., and Smith, L. S. (1992). The principal components of natural images. In Network: Computation in Neural Systems, 3(1):61–70.


Hansen, L. K. and Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.

Haykin, S. (1994). Neural Networks. Macmillan, New York.

Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, New York.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The ”Wake-Sleep” algorithm for unsupervised neural networks. Science, 268:1158–1161.

Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352:1177–1190.

Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and helmholtz free energy. Advances in Neural Information Processing Systems. Hirukawa (1997). Etl robotics research group. Japan. Hochberg, J. (1974). Organization and the gestalt tradition. In Carterette, C. and Friedman, M., editors, Handbook of Perception, New York. Academic Press. Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1):1–42. Howell, A. J. (1997). Automatic Face Recognition using Radial Basis Function Networks. PhD thesis, University of Sussex. Huber, P. J. (1985). Projection pursuit. (with discussion). The Annals of Statistics, 13:435–475. Humel, R. A., Kimia, B. B., and Zucker, S. W. (1987). Deblurring Gaussian blur. cvgip, 38(1):66–80. Hyvarinen (1997a). Independent component analysis by minimization of mutual information. Technical report a46, Laboratory of Computer and Information Science, Helsinki University of Technology. Hyvarinen (1997b). New approximations of differential entropy for independent component analysis and projection pursuit. Technical report a46, Laboratory of Computer and Information Science, Helsinki University of Technology.


Intrator, N. (1993a). Combining exploratory projection pursuit and projection pursuit regression with application to neural networks. Neural Computation, 5(3):443–455. Intrator, N. (1993b). On the combination of supervised and unsupervised learning. Physica A, pages 655–661. Intrator, N. (1999). Robust prediction in many parameter models: Specific control of variance and bias. In Kay, J. W. and Titterington, D. M., editors, Statistics and Neural Networks: Advances at the Interface. Oxford University Press. To appear. Intrator, N. and Cooper, L. N. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5:3–17. Intrator, N. and Cooper, L. N. (1995). BCM theory of visual cortical plasticity. In Arbib, M., editor, The Handbook of Brain Theory and Neural Networks, pages 153–157. MIT Press. Intrator, N. and Edelman, S. (1996). Making a low-dimensional representation suitable for diverse tasks. Connection Science, Special issue on Reuse of Neural Networks Through Transfer, 8(2):205–224. Also in Learning to Learn, S. Thrun and L. Pratt (ed., Kluwer press.). Intrator, N., Reisfeld, D., and Yeshurun, Y. (1996). Face recognition using a hybrid supervised/unsupervised neural network. Pattern Recognition Letters, 17:67–76. Intrator, O. and Intrator, N. (1993). Interpreting neural-network models. Preprint. Jacobs, R. (1997). Bias/variance analyses of mixtures-o-experts architectures. Neural Computation, pages 369–383. Jain, A., Zheng, Y., and Lakshmanan, S. (1996). Object matching using deformable templates. T-PAMI, 18:267–278. Jain, A. K. (1989). Fundamentals of Digital Image Processing. Prentice Hall, London. Japkowicz, N., Myers, C., and Cluck, M. (1995). A novelty detection approach to classification. In the proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 518–523. Jordan, M., editor (1999). Learning in Graphical Models. 
The MIT Press, Cambridge, Massachusetts, London, England.


Kandel, E. R. and Schwartz, J. H. (1991). Principles of Neural Science. Elsevier, New York, third edition. Kanizsa and Gaetano (1970). Organization in Vision: Essays on Gestalt Perception. Praeger, New York. Katsaggelos, A. K. (1989). Iterative image restoration algorithms. Optical Engineering, 28(7):735–748. Kimia, B. B. and Zucker, S. W. (1993). Analytic inverse of discrete Gaussian blur. Optical Engineering, 32(1):166–176. Kirby, M. and Sirovich, L. (1990). Application of the Karhunen-Lo`eve procedure for characterization of human faces. PAMI, 12(1):103–108. Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence. Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChe Journal, 37(2):233–243. Kung, S. Y. and Diamantaras, C. I. (1990). A neural network learning algorithm for adaptive principal component extraction (apex). International Conference on Acoustics, Speech and Signal Processing, 2:861–864. Lai, K. F. (1994). Deformable Contours: Modeling, Extraction , Detection and Classification, Phd Thesis. PhD thesis, University of Wisconsin-Madison,Electrical Engineering. Lee, T., Girolami, M., Bell, A. J., and Sejnowski, T. J. (1998). A unifying informationtheoretic framework for independent component analysis. International Journal on Mathematical and Computer Modeling. in press. Leen, T. (1995). From data distribution to regularization in invariant learning. In Touretzky, D. S. and Lippmann, R. P., editors, Advances in Neural Information Processing Systems, volume 7, pages 223–230. Morgan Kaufmann, San Mateo, CA. Lesher, G. W. (1995). Illusory contours: Toward a neurally based perceptual theory. Psychonomic Bulletin abd Review, 2(3). Linsker, R. (1988). Self-organization in a perceptual network. IEEE. Computer, 88:105– 117.


Liu, H.-C. and Srinath, M. D. (1984). Partial shape classification using contour matching in distance transformation. IEEE Transactions on PAMI, 12(11):1072–1078. Luenberger, D. G. (1989). Linear and nonlinear programming. ADDISON-WESLEY, READING, MASS. second edition. MacKay, D. (1992). Bayesian intertpolation. Neural Computation, 4(3):415–447. MacKay, D. (1996). Maximum likelihood and covariant algorithms for independent component analysis. Technical report draft 3.7, Cavendish laboratory, University of Cambridge. Marr, D. (1982). Vision. Imprint FREEMAN, New York. Meir, R. (1994). Bias, variance and the combination of estimators: The case of linear least squares. ftp://archive.cis.ohio-state.edu/pub/neuroprose/meir.bias-variance.ps.Z. Miller, K. D. (1995). Receptive fields and maps in the visual cortex: Models of ocular dominance and orientation columns. In Domany, E., van Hemmen, J. L., and Schulten, K., editors, Models of Neural Networks 3, pages 55–78. Springer-Verlag, New York. Moghaddam, B. and Pentland., A. (1994). Face recognition using view-based and modular eigenspaces. Automatic Systems for the identification and Inspection of Humans,SPIE, 2227. Moses, Y. (1994). Face recognition: the problem of compensating for illumination changes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):721–732. Mougeot, M., Azencott, R., and Angeniol, B. (1991). Image compression with back propagation: Improvement of the visual restoration using different cost functions. Neural Networks, 4:467–476. Munro, P. (1997). Noisy encoders. NIPS*97 Workshop: Advances in AutoencoderAutoassociator Based Computations. Murase, H. and Nayar, S. K. (1993). Learning object models from appearance. Proceedings of the Eleventh National Conference on Artificial Intelligence. Murray, A. F. and Edwards, P. J. (1993). Synaptic weight noise during multilayer perceptron training: Fault tolerance and training improvements. IEEE Transactions on Neural Networks, 4(4):722–725.


Nadal, J.-P. and Parga, N. (1994). Nonlinear neurons in the low-noise limit: a factorial code maximizes information transfer. Network, 5:565–581.
Naftaly, U., Intrator, N., and Horn, D. (1997). Optimal ensemble averaging of neural networks. Network, 8(3):283–296.
Neal, R. M. and Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. Submitted to Biometrika.
Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4:473–493.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biology, 15:267–273.
Oja, E. (1995a). The nonlinear PCA learning rule and signal separation – mathematical analysis. Technical Report A26, Helsinki University of Technology, CS and Inf. Sci. Lab.
Oja, E. (1995b). Principal and independent components in neural networks – recent developments. In Proc. VII Italian Workshop on Neural Nets WIRN'95, Vietri sul Mare, Italy.
Olshausen, B. A. and Field, D. J. (1996). Natural image statistics and efficient coding. Network, 7:333–339.
O'Toole, A. J., Valentin, D., and Abdi, H. (1991). Categorization and identification of human face images by neural networks: A review of the linear autoassociative and principal component approaches. Journal of Biological Systems, 2(3):413–430.
O'Toole, A. J., Valentin, D., and Abdi, H. (1993). A low dimensional representation of faces in the higher dimensions of the space. Journal of the Optical Society of America, Series A, 10:405–411.
Perrone, M. P. (1993). Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, Institute for Brain and Neural Systems.
Phillips, P. J. (1998). Support vector machines applied to face recognition. In Touretzky, D. S. and Lippmann, R. P., editors, Advances in Neural Information Processing Systems, pages 1–7. Morgan Kaufmann, San Mateo, CA. To appear.


Phillips, P. J., Moon, H., Rauss, P. J., and Der, S. Z. (1997). The FERET September 1996 database and evaluation procedure. In The First International Conference on Audio- and Video-based Biometric Person Authentication, Crans-Montana, Switzerland.
Phillips, P. J., Rauss, P. J., and Der, S. Z. (1996). FERET (Face Recognition Technology) recognition algorithm development and test results. Technical Report ARL-TR-995, US Army Research Laboratory.
Pomerleau, D. A. (1993). Input reconstruction reliability estimation. In Giles, C. L., Hanson, S. J., and Cowan, J. D., editors, Advances in Neural Information Processing Systems, volume 5, pages 279–286. Morgan Kaufmann.
Raviv, Y. (1998). Bootstrapping with noise ensemblening. Master's thesis, The Department of Computer Science, The Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University.
Raviv, Y. and Intrator, N. (1996). Bootstrapping with noise: An effective regularization technique. Connection Science, Special Issue on Combining Estimators, 8:356–372.
Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Computation, 5:289–304.
Reisfeld, D. (1993). Generalized Symmetry Transforms: Attentional Mechanisms and Face Recognition. PhD thesis, Tel-Aviv University.
Reisfeld, D., Wolfson, H., and Yeshurun, Y. (1990). Detection of interest points using symmetry. In Third International Conference on Computer Vision, pages 62–65, Osaka, Japan.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., and Bialek, W. (1996). Spikes: Exploring the Neural Code (Computational Neuroscience). The MIT Press, London.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Rissanen, J. (1985). Minimum description length principle. Encyclopedia of Statistical Sciences, pages 523–527.
Rosenfeld, A. and Kak, A. C. (1982). Digital Picture Processing. Academic Press, New York.
Roweis, S. (1997). EM algorithms for PCA and SPCA. NIPS*97.


Roweis, S. and Ghahramani, Z. (1997). A unifying review of linear Gaussian models. Submitted for publication.
Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268.
Samal, A. and Iyengar, P. A. (1992). Automatic recognition and analysis of human faces and facial expressions: A survey. Pattern Recognition, 25:65–77.
Samaria, F. and Fallside, F. (1993). Face identification and feature extraction using hidden Markov models. Elsevier.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459–473.
Sato, K., Shah, S., and Aggarwal, J. K. (1998). Partial face recognition using radial basis function networks. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 288–293, Nara, Japan.
Schwenk, H. and Milgram, M. (1995). Transformation invariant autoassociation with application to handwritten character recognition. Neural Information Processing Systems (NIPS 7), pages 991–998.
Sezan, M. I. and Tekalp, A. M. (1990). Survey of recent developments in digital image restoration. Optical Engineering, 29(5):393–404.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423 and 623–656.
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop – a formalism for specifying selected invariances in an adaptive network. In Moody, J., Lippman, R., and Hanson, S. J., editors, Neural Information Processing Systems, volume 4, pages 895–903. Morgan Kaufmann, San Mateo, CA.
Stark, H. (1987). Image Recovery: Theory and Application. Academic Press, San Diego.
Stuart, A. and Ord, J. K. (1994). Kendall's Advanced Theory of Statistics. Edward Arnold.
Taniguchi, M. and Tresp, V. (1997). Averaging regularized estimators. Neural Computation, 9:1163–1178.


Tankus, A. (1996). Automatic face detection and recognition. Master's thesis, Tel-Aviv University.
Tankus, A., Yeshurun, Y., and Intrator, N. (1997). Face detection by direct convexity estimation. Pattern Recognition Letters, 18(9):913–922.
Tesauro, G., Touretzky, D., and Leen, T., editors (1995). Neural Network Ensembles, Cross Validation, and Active Learning. The MIT Press, London.
Tikhonov, A. N. and Arsenin, V. Y. (1977). Solutions of Ill-Posed Problems. V. H. Winston and Sons, Washington.
Tresp, V. and Taniguchi, M. (1995). Combining estimators using non-constant weighting functions. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems, volume 7. MIT Press.
Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. J. of Cognitive Neuroscience, 3:71–86.
Turk, M. and Pentland, A. (1993). Experiments with eigenfaces. Looking At People Workshop, IJCAI'93, pages 1–6.
Ullman, S. (1989). Aligning pictorial descriptions: an approach to object recognition. Cognition, 32:193–254.
Ullman, S. (1995). Sequence-seeking and counter-streams: a model for information flow in the cortex. Cerebral Cortex, 5:1–11.
Valentin, D., Abdi, H., O'Toole, A. J., and Cottrell, G. W. (1994). Connectionist models of face processing: A survey. Pattern Recognition, 27:1209–1230.
Vapnik, V. N. (1995). Statistical Learning Theory. Springer, New York.
Wahba, G. (1990). Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia.
Wiskott, L. and von der Malsburg, C. (1993). A neural system for the recognition of partially occluded objects in cluttered scenes. Int. J. of Pattern Recognition and Artificial Intelligence, 7(4):935–948. Also in NIPS 7, 1994.
Wiskott, L., von der Malsburg, C., Fellous, J.-M., and Kruger, N. (1997). Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):775–779.


Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5:241–259.
Yang, H. and Amari, S. (1997). Adaptive on-line learning algorithms for blind separation – maximum entropy and minimum mutual information. Neural Computation, 9(7):1457–1482.
Yaroslavsky, L. and Eden, M. (1996). Fundamentals of Digital Optics. Birkhauser, Boston.
You, Y.-L. and Kaveh, M. (1996). A regularization approach to joint blur identification and image restoration. IEEE Transactions on Image Processing, 5(3):416–427.
Yow, K. C. and Cipolla, R. (1996). A probabilistic framework for perceptual grouping of features for human face detection. In Proceedings of the Second IEEE International Conference on Automatic Face and Gesture Recognition, pages 16–21, Washington, Brussels, Tokyo.
Yuille, A. L., Cohen, D., and Hallinan, P. W. (1989). Feature extraction from faces using deformable templates. In Proc. Computer Vision and Pattern Recognition, pages 104–109, Washington. IEEE Computer Society Press.
Zemel, R. (1993). A Minimum Description Length Framework for Unsupervised Learning. PhD thesis, University of Toronto.