ICPR’98

Invited paper

Learning Multidimensional Signal Processing

Hans Knutsson, Magnus Borga, Tomas Landelius
Computer Vision Laboratory, Linköping University


Abstract

This paper presents our general strategy for designing learning machines as well as a number of particular designs. The search for methods allowing a sufficient level of adaptivity is based on two main principles: 1. simple adaptive local models and 2. adaptive model distribution. Particularly important concepts in our work are mutual information and canonical correlation. Examples are given of learning feature descriptors, modeling disparity, synthesis of a global 3-mode model, and a setup for reinforcement learning of online video coder parameter control.

1 Introduction

The need for a generally applicable method for learning in high-dimensional signal spaces is evident in problems involving vision, where the dimensionality of the input data often exceeds 10^6. In practice, vision problems are handled by reducing the dimensionality to typically < 10 by throwing away almost all available information in a basically ad hoc manner. This approach is, however, likely to fail if, as is frequently the case for vision problems, the mechanisms by which the necessary information can be extracted are not well understood. For this reason, designing systems capable of learning the relevant information extraction mechanisms appears to be the only possible way to proceed. The development in the field of neural computation provides consistent evidence that incorporation of knowledge gained in more mature fields of research can help speed up progress in understanding important learning mechanisms. Much of the knowledge developed within the areas of information theory, signal theory, control theory and computer science is in fact at the core of learning, and many researchers in the field, including us, are seeking to integrate pertinent theory and principles from these areas. Particularly important concepts in our recent work are mutual information [1] and canonical correlation [2]. Incorporation of dynamic programming and optimal control methods has resulted in a first proof of convergence for an important class of systems (LQR systems) [?]. Further examples of our work along these lines are given by [?, ?, ?].

2 Designing World Models

In a high-dimensional input space, not everything will occur. There is simply not time enough for everything to happen. The number of actual events will leave the space almost empty. As an example, consider an industrial robot with 10 degrees of freedom, each with a resolution of 4 bits. This leads to 2^40 ≈ 10^12 possible states for the robot. Imagine that each state is visited at a speed of 1 state/ms, never visiting a state twice: it will take 10^9 s, which is more than 30 years, to visit all states! And yet, in practice, such a robot would be able to perform well and with high speed in a wide range of situations. This shows that the distribution of real-world event samples will necessarily, even for relatively low-dimensional systems, be very sparse indeed.
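The state-count arithmetic in the robot example above is easy to verify. A quick sketch (the variable names are ours, chosen for illustration):

```python
# An industrial robot with 10 joints and 4 bits of resolution per joint.
bits_per_joint = 4
joints = 10
states = 2 ** (bits_per_joint * joints)  # 2^40, about 1.1e12 states

# Visiting one state per millisecond, never repeating a state:
seconds = states / 1000
years = seconds / (365.25 * 24 * 3600)   # about 35 years, i.e. more than 30
print(states, round(years, 1))
```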

Constraint manifolds What then, if anything, can in general be said about the distribution of real-world samples? Fortunately it turns out that there is, as a rule, a manifold, defined by nonlinear constraints, on which the values are jointly restricted to lie [?]. (Would a human cope otherwise?) Our hope is that it is possible to find a model-generating structure adaptive enough to model only this manifold.

Figure 1: Figure showing the inter-dependence of model order and model distribution. For a given accuracy, fewer more complex models can be traded for a higher number of simpler models. Also note that the shape of the signal space for which the model is valid can become quite complex, a problem that grows rapidly with increasing dimensionality.


2.1 Distributed Local Models

A commonly used approach for finding relations in complex systems is correlation analysis. However, the correlation measured over all signals can be zero even if there are strong correlations locally in the parameter space. Consider, for example, the correlation between x and x^2 when x has a symmetric distribution around zero. The correlation measured over all x will tend to zero. If, however, the correlation is measured in an interval, e.g. x ∈ [1, 1.5], the correlation is significantly stronger (0.999); see figure 4. This example illustrates the fact that local models can indicate correlations that are impossible to find using a global model.
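The x versus x^2 example can be reproduced numerically. In the sketch below (helper names ours), x is drawn uniformly from a symmetric interval and the global correlation is compared with the correlation restricted to x ∈ [1, 1.5]:

```python
import random
import statistics

def pearson(xs, ys):
    # sample Pearson correlation coefficient
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(50000)]

# Globally, x and x^2 are uncorrelated: the distribution is symmetric
# around zero, so positive and negative x cancel.
r_global = pearson(xs, [x * x for x in xs])

# Locally, on [1, 1.5], x^2 is almost a linear function of x.
loc = [x for x in xs if 1.0 <= x <= 1.5]
r_local = pearson(loc, [x * x for x in loc])
print(round(r_global, 3), round(r_local, 3))
```

The local value comes out around 0.999, matching the figure.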

Our approach naturally decomposes into two subtasks. 1. Simple local models: find a mapping that describes the change of coordinates from the original high-dimensional signal space to coordinates on a low-dimensional manifold. 2. Adaptive model distribution: since the models are only locally applicable, they must be distributed over the manifold and also be incorporated in some sort of data structure to provide access given a signal vector in the original space. Although this approach is not uncommon for low-dimensional problems, e.g. fitting of one-dimensional splines, no general methods are known for higher-dimensional spaces. Our investigations have resulted in a neural tree algorithm for model distribution [?].

Figure 4: Local models can indicate dependencies that are impossible to find using a global model. In this example no global correlation can be found; local correlation, however, measures 0.999. (The plot shows x^2 against x.)

Figure 2: Figure showing adaptability in the distribution of local models to the distribution of the signal. The signal is uniformly distributed along the curve drawn with a heavy black line. The validity regions of the local models are indicated by circles.


Mutual information The most general measure of the interdependency between input and output is the amount of mutual information. A system consisting of several local models can approximately find the global mutual information between the input and output spaces if each local model is valid and finds the mutual information in its region. For continuous signals locally linear models will suffice, and for a linear system the mutual information is maximized when correlation is maximized. It is crucial, however, that the local models are base independent. Standard correlation analysis will serve to exemplify this fact.
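The link between maximal correlation and maximal mutual information can be made explicit in the jointly Gaussian case, via a standard identity (not stated in the paper itself):

```latex
% Mutual information of two jointly Gaussian scalar variables
% with correlation coefficient rho:
I(x;\,y) \;=\; -\tfrac{1}{2}\,\log\bigl(1-\rho^{2}\bigr)
```

Since this expression is monotonically increasing in |ρ|, maximizing the correlation of a linear (Gaussian) system also maximizes its mutual information.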


Figure 3: The global model is formed by weighting together a number of local models (left) according to their applicability functions (right).

Base independent models Correlation analysis only measures correlations in fixed directions in the parameter space, i.e. the projections of the signal onto the basis vectors in a given coordinate system. Consider the following simple example, where the process is described by eight parameters: Suppose that a linear combination of the first four parameters correlates fully with a linear combination of the last four. In the coordinate system given by the process parameters, the correlation matrix looks like the first matrix below. It is hard to see that the first four parameters have much to do with the last four. However, if we make the analysis in a coordinate system where the two linear combinations define two of the basis vectors, the relation is obvious (and the correlation equals one) according to the second matrix below.

It is important to note that the two issues are in fact intimately related, as the choice of local model complexity will determine the appropriate distribution and applicability region shapes for the local models. A simple example of this interdependence is shown in figure 1. The distribution and applicability region of local models also need to adapt to the signal distribution. This is demonstrated in figure 2.

2.2 Simple Local Models

Local models For our purposes a useful model of the environment is one that relates input signals to output signals.

    1.00 0.00 0.00 0.00 0.25 0.25 0.25 0.25
    0.00 1.00 0.00 0.00 0.25 0.25 0.25 0.25
    0.00 0.00 1.00 0.00 0.25 0.25 0.25 0.25
    0.00 0.00 0.00 1.00 0.25 0.25 0.25 0.25
    0.25 0.25 0.25 0.25 1.00 0.00 0.00 0.00
    0.25 0.25 0.25 0.25 0.00 1.00 0.00 0.00
    0.25 0.25 0.25 0.25 0.00 0.00 1.00 0.00
    0.25 0.25 0.25 0.25 0.00 0.00 0.00 1.00

    1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
    0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00
    0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00
    0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00
    1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
    0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
    0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00
    0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

Relations of this kind, where linear combinations of parameters are correlated and a standard correlation analysis gives weak indications of correlation, can be found if we instead look for canonical correlations [?, ?]. Canonical correlation analysis (CCA) is base independent and finds correlations between inner products, i.e. Corr(x^T wx, y^T wy). We have developed an efficient algorithm for estimation of canonical correlation [?].

3 Four examples

Learning feature descriptors Descriptors for higher order features are in practice impossible to design by hand due to the overpowering amount of possible signal combinations. In [?] it is shown how canonical correlation analysis can be used to find models that represent local features in images. The basic idea behind the CCA approach, illustrated in figure 5, is to analyze two signals where the feature that is to be represented generates dependent signal components. The signal vectors fed into the CCA are image data mapped through a function f. If f is the identity operator (or any other full-rank linear function), the CCA finds the linear combinations of pixel data that have the highest correlation. In this case, the canonical correlation vectors can be seen as linear filters. In general, f can be any vector-valued function of the image data, or even different functions fx and fy, one for each signal space. The choice of f is of major importance as it determines the representation of input data for the canonical correlation analysis.

Figure 5: A symbolic illustration of the method of using CCA for finding feature detectors in images. The desired feature (orientation, here illustrated by a solid line) is varying equally in both image sequences while other features (here illustrated with dotted curves) vary in an uncorrelated way. The input to the CCA is a function f of the image.

It is shown in [?] and [?] that if f is an outer product and the image pairs contain sine wave patterns with equal orientations but different phase, the CCA finds linear combinations of the outer products that convey information about local orientation and are invariant to local phase. The combinations resemble a set of quadrature filters, see figure 6.

Figure 6: Spectra for the eigenimages interpreted as complex quadrature filter pairs.

Figure 7: The result of the stereo algorithm for two random dot images corresponding to two semi-transparent crossing planes. (Axes: vertical position versus disparity.)
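The base-independence effect that CCA exploits can be seen in a toy setting. The sketch below (two-dimensional signals rather than image data; the construction and helper names are ours) finds the best pair of projection directions by brute-force search. A real CCA solves an eigenvalue problem instead of scanning directions, but the quantity being maximized is the same:

```python
import math
import random

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

random.seed(1)
n = 500
x, y = [], []
for _ in range(n):
    u = random.gauss(0, 1)           # shared source
    e, f = random.gauss(0, 1), random.gauss(0, 1)
    x.append((u + e, u - e))         # x1 + x2 = 2u exactly
    y.append((u + f, u - f))         # y1 + y2 = 2u exactly

# Per-coordinate correlation is diluted by the noise (about 0.5).
coordinate_r = corr([a for a, _ in x], [c for c, _ in y])

# Scanning projection directions recovers the full correlation (~1.0)
# along the (1, 1) directions, regardless of the original basis.
best = 0.0
for i in range(36):
    for j in range(36):
        tx, ty = math.pi * i / 36, math.pi * j / 36
        px = [a * math.cos(tx) + b * math.sin(tx) for a, b in x]
        py = [c * math.cos(ty) + d * math.sin(ty) for c, d in y]
        best = max(best, abs(corr(px, py)))
print(round(coordinate_r, 2), round(best, 2))
```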


Modeling disparity An important problem in computer vision that is suitable to handle with CCA is stereo vision, since data in this case naturally appear in pairs. In [?] a novel stereo vision algorithm that combines CCA and phase analysis is presented. Canonical correlation analysis is used to create adaptive linear combinations of quadrature filters. These linear combinations are new quadrature filters, adapted in frequency response and spatial position so as to maximize the correlation between the filter outputs from the two images. The disparity estimate is then obtained by analyzing the phase of the scalar product of the adapted filters. It is demonstrated that the algorithm can handle traditionally difficult problems such as 1. producing multiple disparity estimates for semi-transparent images, see figure 7, 2. maintaining accuracy at disparity edges, and 3. handling differently scaled images. See [?] for more results.
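The phase mechanism can be illustrated in one dimension. This is a minimal sketch, not the algorithm of the paper: a single fixed Gabor-type quadrature filter, a pure sinusoid, and no CCA adaptation. The phase of the product of one filter response with the conjugate of the other is proportional to the shift:

```python
import cmath
import math

def quadrature_response(signal, center, freq, sigma):
    # complex response of a Gabor-type quadrature filter at `center`
    acc = 0j
    for k, v in enumerate(signal):
        t = k - center
        acc += v * cmath.exp(-1j * freq * t) * math.exp(-t * t / (2 * sigma * sigma))
    return acc

freq = 0.5
true_disparity = 3.0
left = [math.cos(freq * x) for x in range(64)]
right = [math.cos(freq * (x - true_disparity)) for x in range(64)]

qL = quadrature_response(left, 32, freq, 6.0)
qR = quadrature_response(right, 32, freq, 6.0)

# Phase difference equals freq * disparity (valid while it stays below pi),
# so dividing by the filter frequency recovers the shift.
disparity = cmath.phase(qR.conjugate() * qL) / freq
print(round(disparity, 2))  # close to 3.0
```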


Figure 9: Top left: Evolution of the components of the state vector (left tank level dashed and right tank level solid). Top right: Two of the estimated system parameters. Bottom left: State vector distribution on the 1D manifold and hierarchical applicability functions. Bottom right: Interpolated system mode.

Synthesis of a global 3-mode model A simple experiment was performed to show some of the principles behind the local model approach. This experiment involves the idea of finding and merging local models. The system under investigation lives, due to its dynamics, on a 1D manifold in a 2D state space and is a simple example of a locally linear system. Consider a fluid bi-tank where the state vector is given by the fluid levels in each of the two tanks, see figure 8. The two compartments of height H are separated by a wall of height h < H. Water is fed into the left compartment (input signal) and both tanks are emptied at a rate proportional to their fluid level. The system can be modeled with three linear models: 1. the level in the left tank is below h; 2. the level in the left tank is at h and water flows over into the right compartment; 3. both levels are above h and the two tanks appear as one. In figure 9 the essence of our experiment is presented.

Figure 8: Bitank system modes. Left: mode 1. Middle: mode 2. Right: mode 3.

To the top left the two components of the state vector, i.e. the fluid levels, are plotted. The system starts out in mode 1, then enters mode 2 and mode 3. It then switches back to mode 2 and finally mode 1. The system is put into these modes by changing an offset in the input flow at iterations 100, 200, 300 and 400. Two of the six system parameter estimates are shown at the top right. The algorithm detects the different models as the system state moves through different parts of the state space. The distribution of the state vectors is shown in the lower left of the figure.

4 The Future

The concept of mutual information provides a solid and general basis for the study of a broad spectrum of problems including signal operator design and learning strategies. A broadly applicable general approach, illustrated in figure 11, is to maximize mutual information subject to constraints given by a chosen model space. This could be done by varying not only the linear projections, i.e. the CCA part, but also the functions fx and fy.

Learning to supervise a video coder Since the advent of television, obtaining high perceived quality using a limited bandwidth has been an important issue. As a consequence a lot of effort has been spent to develop efficient compression techniques. However, no objective technique for determining image or image sequence quality has yet emerged, and it is becoming increasingly apparent that further development of compression techniques will require objective quality estimation methods. The goal of the project is to develop principles and methods for the objective measurement of image sequence quality that are in good agreement with human perceived quality. The work plan for this project is as follows: 1. Implement a suitable experimental coder allowing interactive parameter settings. 2. Identify a collection of human visual perception models which can be deemed potentially relevant. 3. Run the coder interactively to obtain human quality assessments. 4. Produce features that can be expected to have some correlation with both coding parameter changes and perception model responses. 5. Employ learning techniques to find the relation between sequence type, coding parameters, image sequence features and perceived quality. 6. Use the learned relations to produce on-line coding parameters for optimal expected perceived quality. See figure 10. No doubt it will put our ideas to the test.
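The three bi-tank modes described above can be sketched as a piecewise linear simulation. Everything concrete below, the outflow constant k, the time step, and the switching conditions, is our own minimal reading of the description, not the authors' code:

```python
def step(x1, x2, u, h=1.0, k=0.2, dt=0.05):
    """One Euler step of the bi-tank; (x1, x2) are the two fluid levels."""
    if x1 < h and x2 < h:
        # mode 1: left tank filling, both tanks drain independently
        d1, d2 = u - k * x1, -k * x2
    elif x2 < h:
        # mode 2: left level held at the wall, surplus inflow spills right
        d1, d2 = 0.0, (u - k * x1) - k * x2
    else:
        # mode 3: wall submerged, the two tanks behave as one
        d1 = d2 = (u - k * (x1 + x2)) / 2.0
    return x1 + dt * d1, x2 + dt * d2

# Small input: the system stays in mode 1 (the right tank stays empty).
x1, x2 = 0.0, 0.0
for _ in range(2000):
    x1, x2 = step(x1, x2, u=0.1)
print(round(x1, 2), round(x2, 2))  # 0.5 0.0

# Large input: the state passes through modes 1, 2 and 3 and settles
# with both levels above the wall, constrained to a 1D manifold.
x1, x2 = 0.0, 0.0
for _ in range(4000):
    x1, x2 = step(x1, x2, u=0.5)
print(round(x1, 2), round(x2, 2))
```

In the large-input run both levels settle near u/(2k) = 1.25, since the coupled tanks drain like a single tank of level x1 + x2.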

Figure 10: Setup for reinforcement learning of online video coder parameter control. (Block labels in the original: 'raw' video, coder, decoder, monitor, human, supervisor; signals include the subjective quality QS, the estimated quality QE, and the coder parameters P.)

Finding suitable function classes and efficient parameterizations/implementations for these functions is still the central issue and will be an important theme in our continued investigations.

Figure 11: A general approach for finding maximum mutual information. (Block labels: signals sx and sy, functions fx and fy, a CCA stage, and the canonical correlation ρ.)

4.1 Simplicity

Since simplicity has been a main theme in our presentation it may be appropriate to conclude with a comment on this topic. The assumption that simple models of the world are more likely to be useful or robust can be traced back to the year 1320 A.D. and the monk William of Occam [?] and is often referred to as Occam's razor. It should be noted, however, that simplicity of a model implies that the model is simple to describe in a certain language. But what is simple to express in one language may be complex to express in another, and vice versa.

Occam's razor or simply evolution? What then, are the implications of simplicity? Viewing Occam's razor in an evolutionary setting provides an interesting perspective. It is plausible that languages competing for efficiency, i.e. being beneficial for the group of individuals using them to communicate, have evolved in such a way that frequently useful statements have become easy to express. The evolution of this simplicity can be seen as making efficient use of a common context which is implicitly understood by all individuals involved. The pertinent question then becomes: Are we yet advanced enough to express the mechanisms of learning in a simple way?

Acknowledgment We would like to thank Kenneth Andersson for helping with material on the coding project. We would also like to thank The Swedish Fund for Strategic Research (SSF), The Swedish Research Council for Engineering Sciences (TFR) and The Swedish National Board for Industrial and Technical Development.

References

[1] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 1948. Also in N. J. A. Sloane and A. D. Wyner (eds.), Claude Elwood Shannon: Collected Papers, IEEE Press, 1993.

[2] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.

[3] T. Landelius. Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linköping University, S–581 83 Linköping, Sweden, 1997. Dissertation No 469, ISBN 91–7871–892–9.

[4] M. Borga. Learning Multidimensional Signal Processing. PhD thesis, Linköping University, S–581 83 Linköping, Sweden, 1998. Dissertation No 531, ISBN 91–7219–202–X.

[5] H. Knutsson, M. Borga, and T. Landelius. Learning Canonical Correlations. Report LiTH-ISY-R-1761, Computer Vision Laboratory, S–581 83 Linköping, Sweden, June 1995.

[6] C. Bregler and S. M. Omohundro. Surface learning with applications to lipreading. In Advances in Neural Information Processing Systems 6, pages 43–50, San Francisco, 1994. Morgan Kaufmann.

[7] T. Landelius and H. Knutsson. The Learning Tree, a New Concept in Learning. In Proceedings of the 2nd Int. Conf. on Adaptive and Learning Systems. SPIE, April 1993.

[8] H. D. Vinod and A. Ullah. Recent Advances in Regression Methods. Marcel Dekker, 1981.

[9] M. Borga, H. Knutsson, and T. Landelius. Learning Canonical Correlations. In Proceedings of the 10th Scandinavian Conference on Image Analysis, Lappeenranta, Finland, June 1997. SCIA.

[10] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam's razor. Information Processing Letters, 24:377–380, 1987.