In:
Neural Networks, 17(8-9):1311-1326, 2004.
Maplets for Correspondence-Based Object Recognition
Junmei Zhu∗, Computer Science Department, University of Southern California
Christoph von der Malsburg, Institut für Neuroinformatik, Ruhr-Universität Bochum
Abstract We present a correspondence-based system for visual object recognition with invariance to position, orientation, scale and deformation. The system is intermediate between high-dimensional and low-dimensional representations of correspondences. The essence of the approach lies in higher-order links, called maplets here, which are specific to narrow ranges of mapping parameters (position, scale and orientation), which interact cooperatively with each other, and which are assumed to be formed by learning. While being based on dynamic links, the system overcomes previous problems with that formulation in terms of speed of convergence and range of allowed variation. We perform face recognition experiments, comparing our system to other published systems. We see our work as a step towards a reformulation of neural dynamics that includes rapid network self-organization as an essential aspect of brain state organization.
Keywords: Object recognition, correspondence, dynamic link, map formation, self-organization, maplet.
1 Introduction
The classical way to model the data structure of the brain is based on a vector of neural activity values Vi(t), i = 1, ..., N. It has been argued previously that this data structure is deficient in leaving open the binding problem (Legéndy, 1970; von der Malsburg, 1981; Hummel and Biederman, 1992; Ajjanagadde and Shastri, 1991; von der Malsburg, 1999). In (von der Malsburg, 1981) it is proposed that a better interpretation of dynamic brain states must contain the equivalent of dynamic links between neural units. These could be binary, involving pairs of elements, (i, j), i, j = 1, ..., N, or links could be of higher order, e.g., involving triplets (i, j, k), i, j, k = 1, ..., N, and would have weights changing on the psychological time scale of 100 msec or less. In this paper, we present a concrete application of those general ideas, using invariant object recognition as an example. We see this as an important paradigm for brain state organization in general. We place particular emphasis here, first, on the efficiency of the process, efficiency measured in terms of the number of iterations necessary in a fully parallel implementation, and, second, on the enlargement of the useful search space, permitting invariance to scale and orientation in addition to translation. Although we pay attention to neural constraints (especially such as locality of information), we stop short of formulating a neural implementation in any detail (see the discussion). Our general approach is similar to that of (Olshausen et al., 1993), but goes beyond it in essential ways, being able to cope with variation in scale and orientation and opening a learning pathway. For preliminary results on the learning of maplets, see (Zhu and von der Malsburg, 2003).

∗ Current address: Computer Science, University of Memphis, Dunn Hall 373, Memphis, TN 38152. Email: [email protected]. Phone: 901-678-1539. Fax: 901-678-2480.
1.1 Invariant Object Recognition by Correspondence
The visual perception of objects, even the mere recognition of objects, is made difficult by the tremendous variance in their retinal images, variance in terms of image transformation (translation, scale and orientation), depth-rotation, deformation, occlusion, change in illumination and noise. Computational models for invariant recognition come in two main types, feature-based and correspondence-based. Both start with the extraction of features defined by templates with definite parameter values for position etc., resulting in graded or binary responses. In feature-based recognition systems, invariance to position, scale and so on is achieved feature-wise, with the help of a logical OR, that is, fan-in connections from parameter-specific feature detectors to parameter-independent master units which thus represent parameter-invariant feature types. Object recognition is achieved by comparing the list of activated master units to stored lists for known objects and picking the best match. The characteristic of this approach is that information on the original parameter values, position, scale etc., and consequently also on relative position etc., is given up. Examples of feature-based systems include the Neocognitron (Fukushima et al., 1983), VisNet (Elliffe et al., 2002), SEEMORE (Mel, 1997) and the system of Edelman (Edelman, 1995). Although in principle feature-based systems leave the door open for the confusion of objects that agree summarily in feature types but differ in the features' relative position, scale or orientation, it is argued, e.g., in (Elliffe et al., 2002; Mel, 1997), that this loophole can be closed with the help of combination-coding units of modest complexity, such that any re-arrangement of features in the image shows up as the pop-up or drop-out of some other features, establishing a non-ambiguous representation.
Examples of correspondence-based systems include (Olshausen et al., 1993; Wiskott and von der Malsburg, 1996; Wiskott et al., 1997; Hummel and Biederman, 1992; Ullman, 1989). These systems represent objects explicitly, in terms of models that describe objects as ordered arrays of local features. Models are matched to the image by the establishment of an organized set of point-to-point correspondences (or "maps" or "mappings") between points in the image and points in the object model. Mappings are represented by sets of connections, also called links, that run between model and image units with similar features and that are consistent with each other in terms of relative position, scale and orientation. The distinction between feature-based and correspondence-based systems is somewhat obscured by the fact that in both approaches features are usually extracted by a correspondence-based method, and that in feature-based systems the features may be global patterns, see, for instance, the chorus of prototypes of (Edelman, 1995). The distinguishing mark of feature-based systems is their neglect of the explicit representation of relations within the image or between image and model. Although feature-based object recognition may play an important role in our visual system, we proceed here on the assumption that the ability to explicitly represent feature relations and to establish precise correspondences is nevertheless a necessary ingredient for visual perception. Certain operations, such as reading text or manipulating objects, require the ability to navigate purposefully within the image, which is impossible without the
explicit representation of spatial relations between features and relations between points on the object model and points in the image. A further problem for the feature-based method, which requires appropriate sets of feature types tuned to the particular properties of the objects to be distinguished, is the learning deadlock: without the ability to distinguish objects a system cannot select features to support the process, and without such features it cannot distinguish objects. (Interestingly, the feature-based system VisNet (Elliffe et al., 2002) resorts to a correspondence mechanism as the basis for learning.) Correspondence-based object recognition, on the other hand, can proceed immediately on the basis of elementary features such as Gabor-based wavelets, which can be developed by plausible mechanisms (Olshausen and Field, 1996; Bell and Sejnowski, 1997). Correspondence-based object recognition opens the avenue to one-shot learning of new objects (Loos and von der Malsburg, 2002), and once an object type can be recognized, it is possible to select complex feature recognizers as are required for the feature-based method (Wersing and Körner, 2003). We therefore believe that the correspondence-based method is more fundamental. Moreover, psychophysical experiments (Biederman et al., 1999) have shown convincingly the competence of a model based on correspondences of fields of Gabor responses (Lades et al., 1993) when applied to the discrimination between objects (such as faces or Shepard blobs) that are perceived holistically and require quantitative distinctions. (When dealing with recognition of composite entry-level objects, a one-step correspondence-based method is no longer adequate, and objects will have to be represented as structured arrays of object parts, as proposed and experimentally supported by Biederman as Recognition by Components (Biederman, 1987).
For the whole two-tiered process the same arguments in favor of representing relations and establishing correspondences apply, though.)
1.2 The Correspondence Problem
In addition to object recognition, the correspondence problem is also invoked in the context of stereo fusion (Julesz, 1971; Dev, 1975; Marr and Poggio, 1976) and motion estimation (Horn and Schunck, 1981; Ullman, 1979) and, more generally, of the detection of isomorphy between structures. A convenient starting point for formulating the correspondence problem is subgraph isomorphy. Two graphs G = (V, E) and G′ = (V′, E′), where V is a set of vertices and E = {(u, v) | u, v ∈ V} is a set of edges, or links, are isomorphic if there exists a one-to-one mapping f : V → V′, called an isomorphic mapping, such that (u, v) ∈ E iff (f(u), f(v)) ∈ E′. A graph G_s = (V_s, E_s) is a subgraph of G if V_s ⊆ V and E_s ⊆ E. The general problem of subgraph isomorphy is to decide if there is a subgraph of one graph that is isomorphic to a subgraph of another graph. One speaks of labeled graph matching if, on the basis of some labels attached to nodes in G and G′, a similarity S : (u, u′) → R is defined, and the compound similarity is to be optimized by f. Subgraph isomorphy in its general basic form is NP-complete (Garey and Johnson, 1979), a fact, however, that is totally irrelevant for practical applications, where approximate solutions are sought, and where labels and specific graph structures reduce the complexity of the problem decisively. In vision applications, the structures formalized as graphs are representations of images, and the mapping between structures is supposed to connect image elements that correspond to the same points in the external world. In such graphs, vertices are points in the image, labeled by features, such as Gabor filter responses at that point, and edges are neighborhood relationships between adjacent points, labeled by the geometrical distance vectors between the two points. The link arrangements within structures therefore typically run between neighbors in two-dimensional layouts, expressing continuity in space, and the correspondence mappings tend to be smooth. In practical cases, similarity relationships are ambiguous and subject to noise and loss of information (an element in one structure potentially having many strong similarities to elements of the other, the correct one in the above sense not necessarily being the strongest), so that on the basis of between-structure similarities alone a unique mapping cannot be defined (one speaks of ill-posed problems (Poggio and Koch, 1985)). If, however, approximate isomorphy is required, linked elements connecting to linked elements, a unique mapping may result. In this paper, mapping, map, and correspondence are used interchangeably.
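The definition of an isomorphic mapping can be exercised with a minimal brute-force sketch (toy graphs, names, and the exhaustive search strategy are ours, for illustration only):

```python
from itertools import permutations

def is_isomorphic_mapping(V, E, Ep, f):
    # f must be one-to-one and preserve edges both ways:
    # (u, v) in E  iff  (f(u), f(v)) in E'
    if len(set(f.values())) != len(f):
        return False
    return all(((u, v) in E) == ((f[u], f[v]) in Ep)
               for u in V for v in V if u != v)

V, E = [0, 1, 2], {(0, 1), (1, 2), (2, 0)}            # a directed 3-cycle
Vp, Ep = ['a', 'b', 'c'], {('a', 'b'), ('b', 'c'), ('c', 'a')}

# brute-force search over all bijections; the general decision problem is
# NP-complete, but labels and specific graph structure prune it in practice
matches = [dict(zip(V, p)) for p in permutations(Vp)
           if is_isomorphic_mapping(V, E, Ep, dict(zip(V, p)))]
print(matches)   # the cyclic rotations of the matching
```

Only the rotations of the cycle survive; the reflections reverse the edge direction and violate the "iff" condition.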
1.3 Neural Dynamics
Our brain is a self-organizing system. It has elements (neurons, synapses, ultimately molecules), with interactions arranged such that it falls into organized states. The natural way to formulate such systems is in terms of rules or differential equations for the interaction of variables (for such formulations relating to the correspondence problem see (Dev, 1975; Häussler and von der Malsburg, 1983; Hummel and Zucker, 1983; Julesz, 1971; Marr and Poggio, 1976; Pollard et al., 1985; Wiskott and von der Malsburg, 1996; Olshausen et al., 1993; Ullman, 1979; Nicolescu and Medioni, 2002)). Alternatively, and for the sake of simplicity of analysis, self-organizing systems are often formulated in terms of the constrained optimization of a global objective function (for relevant examples see (Yuille et al., 1989; Olshausen et al., 1993; Bienenstock and von der Malsburg, 1987; Aonishi and Kurata, 2000)). In the context of map formation, the constraints of uniqueness (only one connection per element) and smoothness can then be implemented kinematically, by admitting only mappings that fully conform to the constraints, in which case one speaks of a low-dimensional representation (Horn and Schunck, 1981; Miller and Younes, 2001; Aonishi and Kurata, 2000). This restriction to a low-dimensional representation is artificial, the more fundamental "high-dimensional" representation attributing one variable to each of the full set of all-to-all connections, in which case the dynamic interactions conspire to create final states that conform approximately to the constraints; for examples see (Julesz, 1971; Dev, 1975; Marr and Poggio, 1976; Olshausen et al., 1993; Wiskott and von der Malsburg, 1996; Ullman, 1979; Nicolescu and Medioni, 2002). An example of high- and low-dimensional representations of mappings between a one-dimensional image and model is shown in figure 1. This figure shows identical image and model, each 128 pixels long, laid out horizontally and vertically, respectively.
For convenience of viewing, the patterns are extended in the perpendicular direction. Let r be the position coordinate in the image, and t that in the model. Fig. 1(a) shows a high-dimensional mapping, as represented by a two-dimensional weight function W(t, r) (a weight matrix W = (W_tr) if r and t are discrete). For each link in the mapping, white represents high weight values, indicating strong connections; black represents small values and weak connections. The values as shown in the figure are local feature similarities. The ideal mapping for identical image and model is W = I, where I is the identity matrix. Strong connections off the diagonal, as can be seen in the figure, indicate pairs of image and model points that look similar locally but are not correct correspondences. Fig. 1(b) is a low-dimensional mapping, represented by a one-dimensional function f(t). The ideal map in this case is f(t) = t, as shown in the figure. For object recognition, the mapping needs to be point-to-point. This form is expressed naturally in the low-dimensional representation, and can be made explicit from high-dimensional mappings easily (section 2.4 describes a way to do this). Our formulation here will be intermediate between the two extremes of high and low
Figure 1: Representations of a mapping between a 1D image and a 1D model. The 1D patterns are extended in the perpendicular direction, for convenience of viewing. Variables r and t are position coordinates in the image and the model, respectively. (a) "High dimensional" representation by the 2D matrix (w_tr). (b) "Low dimensional" representation by a 1D function f(t).
dimensionality. Using the language and methodology of stability analysis (see, e.g., (Haken, 1977)), a mapping can be seen as a superposition of “modes,” of connectivity patterns that are solutions to linearized dynamic equations and that one by one preserve their form while growing or decaying exponentially as dictated by a positive or negative growth coefficient (modes and growth coefficients being obtained as eigenpatterns and eigenvalues of the linearized dynamic system). In self-organizing systems this formulation has the advantage of directly reflecting the emergence of ordered states on a hierarchy of temporal and spatial scales. It further has the advantage, as will become evident, that the factors controlling the speed of convergence to the desired organized state can be explicitly analyzed, understood and modified.
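The two representations can be illustrated in a few lines. The sketch below builds a high-dimensional weight matrix from a toy squared-difference similarity on a random 1D pattern (not the paper's Gabor features) and collapses it to a low-dimensional map by winner-take-all, in the spirit of the read-out mentioned in section 2.4:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 128
image = rng.normal(size=N)               # 1D "image"; the model is identical
model = image.copy()

# high-dimensional representation: one weight per (t, r) pair, here
# initialized from a toy local feature similarity (the paper uses
# Gabor responses instead)
W = np.exp(-(model[:, None] - image[None, :])**2)

# collapse to the low-dimensional representation r = f(t):
# winner-take-all over r for each model point t
f = W.argmax(axis=1)
print((f == np.arange(N)).mean())        # fraction of correct correspondences
```

For identical patterns the diagonal of W carries the unique maximum of each row, so the extracted f(t) = t recovers the ideal map; with noise or ambiguous features, off-diagonal similarities would make this naive read-out fail, which is the point of the cooperative dynamics developed below.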
1.4 Efficient Map Formation
If dynamical graphs — arrangements of dynamic links — are indeed an important aspect of the data structure of the brain's state, it is important to understand their organization. We are assuming that organized graph structures are originally formed, in the infant brain, by basic and slow mechanisms of self-organization, as modeled, for instance, in (Willshaw and von der Malsburg, 1976; Wiskott and von der Malsburg, 1996; Bienenstock and von der Malsburg, 1987), that these graph structures leave memory traces in the form of subgraphs that are common elements of many link patterns, and that adult graph formation consists more in the retrieval of ordered arrays of subgraphs than in de novo self-organization. In our present context we would like to know how correspondence maps can be formed rapidly, within fractions of a second, and flexibly, that is, in spite of variations in position, scale, orientation and deformation in the patterns to be matched. The subgraphs that form common elements of many ordered mappings are high-order links coherently connecting small disks in the image domain to small disks in the model domain. Let us call these link arrangements maplets; they correspond closely to the control neurons of (Olshausen et al., 1993). (For neural implementation issues see section 5.) Each maplet is definite in terms of the map parameters that specify relative position, orientation and scale of image and model. Different maplets are connected according to their degree of mutual consistency in terms of these parameters, as explained in more detail in section 2. Figure 2 shows links and a maplet example between a model and an image, both two-dimensional. Tiny circles on the image and model planes are graph nodes (vertices). Each link connects a node in the model and a node in the image. The group of links enclosed in the dotted ellipse is a maplet. For maplets differing in relative position, orientation and scale, see Fig. 7.
From a dynamical point of view, the essence of efficient organization lies in adapting the underlying interactions of the system such as to give a selective advantage to the desirable patterns at the expense of irrelevant patterns. In the language of stability analysis, the growth coefficients of the desirable modes must be much larger than those of the irrelevant modes, so that the former can quickly outgrow the latter. The naturally given link interactions cannot possibly be required to prefer consistency in terms of map parameters, so that a large number of mutually and internally inconsistent link patterns compete with almost equal growth rates, leading to very long symmetry-breaking and decision times. But once consistent correspondence maps have been repeatedly formed in early development, link interactions can be modified such as to favor those link patterns. The formation of maplets
Figure 2: Links and maplets. Tiny circles on the image and model planes are graph nodes. Each link connects a node in the model and a node in the image. The group of links enclosed in the dotted ellipse is a maplet.

by learning will be the subject of a later communication (for preliminary results see (Zhu and von der Malsburg, 2003)), whereas we here show in in silico experiments that correspondence map formation can be made very efficient on the basis of interacting maplets.
2 Fast Dynamic Link Matching by Cooperating Maplets

2.1 Heuristics: From Links to Maplets
The starting point of our dynamic formulation is a set of coupled differential equations developed in (Häussler and von der Malsburg, 1983) to describe the ontogenetic development of a retinotopic map between a one-dimensional model retina of N elements and a similar model tectum, connected by an N × N matrix W = (w_tr), each positive real number w_tr being the connection weight between position r in the retina and position t in the tectum. We adapt the model by identifying retina with image and tectum with model and making the transition to two-dimensional domains, by introducing higher-order links and their interactions, and by introducing feature similarities. The dynamics of the connections are described by the set of N × N differential equations

$$\dot{w}_{tr} = f_{tr}(W) - \frac{1}{2N}\, w_{tr} \left( \sum_{t'} f_{t'r}(W) + \sum_{r'} f_{tr'}(W) \right). \tag{1}$$
The growth term f_tr(W) of link w_tr expresses the cooperation from all its neighbors, with positive rate β, plus a non-negative synaptic formation rate α:

$$f_{tr}(W) = \alpha + \beta\, w_{tr} \sum_{r',t'} c(t,t',r,r')\, w_{t'r'}, \tag{2}$$
where c(t, t′, r, r′), referred to here as the C function, describes the mutual cooperative help the link (t, r) receives from its neighbors (t′, r′). Under the assumptions of the original model (Häussler and von der Malsburg, 1983), this cooperation was due to Hebbian plasticity induced by the signal correlations caused by local connections within domains and the mutually consistent connections w_tr between domains. Correspondingly, the C function had the properties of being separable and isotropic (Häussler and von der Malsburg, 1983), see Fig. 4(a). (The introduction of direct interactions between links, advocated here, interactions that are shaped by learning, makes more task-adapted C functions possible; see Fig. 5(a).) The negative term in equation (1) expresses competition, between links that diverge from one point in the source domain (retina or image) and between links that converge to one point in the target domain (tectum or model). This dynamics is shown schematically in figure 3, which shows an image and a model, both one-dimensional and indexed by r′ and t′ respectively, and their connection matrix (w_t′r′). The "X" indicates link w_tr. The links within the shaded region (the disk) are its neighbors, from which it receives cooperative support to grow. The two dotted lines indicate all the divergent and convergent links that compete with w_tr.

Figure 3: Dynamics of connections. In the connection matrix (w_t′r′) between an image and a model, indexed by r′ and t′ respectively, "X" indicates a link w_tr. The shaded region (the disk) indicates the set of neighboring links, the support of the C function. The two dotted lines indicate the sets of divergent and convergent links that compete with link w_tr.

For fixed (t, r), c(t, t′, r, r′) is a function of (t′, r′) whose support is represented by the disk. A candidate form of this function is a Gaussian, centered at (t, r) and with standard deviation σ, such that the function is significantly greater than zero only within the disk: c(t, t′, r, r′) =
$$c(t,t',r,r') = \frac{1}{2\pi\sigma^2}\, e^{-[(t'-t)^2 + (r'-r)^2]/2\sigma^2}$$

2.2 Linear Analysis
In this section we linearize the dynamic equations (1) and (2) around a totally unorganized, homogeneous stationary initial state (where all links have equal strength) to find the dominant growth patterns as solutions to these linearized equations. This will permit us to identify and discuss the factors (essentially the shape of the C function) that influence the growth coefficients of the desirable growth patterns relative to those of non-desirable patterns. In the previous systems, most of the time needed for the correct mapping to grow was spent spontaneously breaking the symmetry with many competing other patterns. Non-isotropic C functions will break this symmetry right from the start and lead to very rapid map organization.

As in (Häussler and von der Malsburg, 1983), we assume that the two one-dimensional domains have periodic boundary conditions, thus forming rings of units, to simplify the stability analysis. The system (1) has the stationary homogeneous solution W = 1 (w_tr = 1 for all t, r). Linear expansion around this point gives a system whose eigenvectors are the complex exponentials

$$e^{k,l}_{tr} = \exp\left(i\,\frac{2\pi}{N}(kt + lr)\right) \tag{3}$$

for k, l ∈ Z_N. These eigenvectors are products of harmonic functions in t and in r, k and l being frequencies in the two domains, respectively. The eigenvalues of the linearized version of system (1) are:
$$\lambda^{k,l} = \begin{cases} -\alpha - 1 & ,\ k = l = 0 \\ -\alpha + (\gamma^{k,l} - 1)/2 & ,\ k = 0, l \neq 0, \text{ or } k \neq 0, l = 0 \\ -\alpha + \gamma^{k,l} & ,\ \text{otherwise,} \end{cases} \tag{4}$$
where γ^{k,l} are the eigenvalues of the C function on the same set of eigenfunctions. They are obtained as the Fourier transform of the C function and are real numbers. The eigenfunctions (3), also called linear modes of the linearized version of (1), have amplitudes that grow or decay exponentially in time, with the eigenvalues as growth coefficients. In figure 4 we plot an isotropic C function, as given by natural link interactions, and the corresponding eigenvalues λ^{k,l} for α = 0, sorted in descending order. The decisive quantities determining the speed of growth and convergence of the system are the margins by which the eigenvalues of the desirable modes outdistance those of the unwanted modes. With natural link interactions induced by correlation-controlled synaptic plasticity there is only one choice in shaping the C function: controlling its width (to which the width of the eigenvalue distribution is inversely related). Due to the symmetry of the system, there are four modes that have the same maximal eigenvalue. They correspond to the lowest positive frequency, k, l = ±1, the matrix w_tr taking the form of broad diagonals of either orientation, each with one of two positions (phases cos or sin). In order to develop a mapping, the system must break the symmetry between those modes, which, as shown in (Häussler and von der Malsburg, 1983), happens on the basis of the non-linearities inherent in (1), but is a slow process. The symmetry between the two map orientations cannot be broken by an isotropic C function. This problem will get much worse in the realistic case of two-dimensional domains with open boundary conditions, where many mappings of a continuous variety of relative orientations, scales and positions are competing with each other. It would therefore be of great advantage if the link interactions were selectively restricted to such sets of links as are consistent with each other in terms of map parameters.
To show the effect, we plot in figure 5 a C function that by its shape already favors a diagonal of one of the two orientations. The figure also shows the largest eigenvalues. This time the difference between the largest phase pair, belonging to the favored diagonal orientation, and the second-largest pair of eigenvalues, belonging to the disfavored orientation, leads to immediate strong map development.
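The effect can be checked numerically: since γ^{k,l} is the 2D Fourier transform of the C function, the eigenvalue margin between the two diagonal orientations follows directly from the shape of C. The sketch below uses our own construction of a Gaussian C with principal axes along the two diagonals (the specific parameterization is an assumption, chosen to mirror the isotropic and non-isotropic cases of Figs. 4 and 5):

```python
import numpy as np

def gamma_spectrum(N, sigma_along, sigma_across):
    """Eigenvalues gamma^{k,l} of the C function as its 2D Fourier transform.

    C is a Gaussian over link offsets (dt, dr) on the torus, with principal
    axes along the diagonals dt = dr and dt = -dr. sigma_along is the width
    along dt = dr (the direction supporting a retinotopic diagonal map),
    sigma_across the width across it.
    """
    d = (np.arange(N) + N // 2) % N - N // 2       # signed periodic offsets
    dt, dr = np.meshgrid(d, d, indexing="ij")
    u = (dt + dr) / np.sqrt(2.0)                   # along dt = dr
    v = (dt - dr) / np.sqrt(2.0)                   # across it
    c = np.exp(-u**2 / (2 * sigma_along**2) - v**2 / (2 * sigma_across**2))
    return np.fft.fft2(c).real                     # c is symmetric, so real

N = 32
iso = gamma_spectrum(N, 3.0, 3.0)                  # isotropic, cf. Fig. 4
aniso = gamma_spectrum(N, 9.0, 1.0)                # non-isotropic, cf. Fig. 5

# gamma at the two lowest diagonal modes: (k, l) = (1, -1) corresponds to
# stripes along the correct map diagonal, (k, l) = (1, 1) to the competitor
diff_iso = iso[1, N - 1] - iso[1, 1]
diff_aniso = aniso[1, N - 1] - aniso[1, 1]
print(diff_iso, diff_aniso)
```

For the isotropic C the two diagonal modes are degenerate (difference zero up to rounding), while the elongated C produces a large positive margin, so the favored orientation outgrows its competitor from the start.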
Figure 4: Isotropic C function and eigenvalues λ^{k,l}. (a) Contour plot of the C function c(t − t′, r − r′). The two axes are ∆t = t − t′ and ∆r = r − r′. The C function shown here is a Gaussian with σ1 = σ2 = 3, with total size 32 × 32. (b) Eigenvalues λ^{k,l}, for α = 0, sorted in descending order. The horizontal axis is the rank (index). Only the 40 largest eigenvalues are shown. For the maximal phase pairs of either diagonal orientation the eigenvalues are equal (diff = 0).
Figure 5: Non-isotropic C function and eigenvalues λ^{k,l}. Labels are the same as in Fig. 4. Here, the C function c(t − t′, r − r′) is a Gaussian with σ1 = 9 and σ2 = 1. Note that, in distinction to the isotropic C of Fig. 4, the eigenvalues for diagonal modes of different orientation now have a positive difference (diff).
2.3 Maplets
Guided by the heuristics of the last section, we now introduce formalism to describe the interaction of groups of links that are spatial neighbors and are consistent with each other in terms of map parameters. We call those groups maplets (in a preliminary communication (Zhu and von der Malsburg, 2002) we called them control units, but find the term maplet more suggestive). A maplet corresponds to a mapping between a small disk in the image domain and a small disk in the model domain. It controls the growth of a number of individual links and in turn is driven by the links in that control region. A maplet is represented by a function K^p_{TR}(t, r), where p is a scale and orientation index and T and R refer to the positions of the centers of the two disks mapped to each other. The domain of K^p_{TR}(t, r) defines a pool of individual links (that is, (t, r) pairs) that the maplet controls. This domain is compact, and its size controls the amount of map deformation that will be tolerated. K^p_{TR}(t, r) is formulated explicitly for one-dimensional image and model as follows. For one-dimensional t and r, the disks become line segments, and the overall shape of a maplet is a two-dimensional Gaussian with mean µ_c = (T, R) and covariance matrix Σ:

$$K^p_{TR}(t,r) = \mathcal{N}(\mu_c, \Sigma) = \frac{1}{Z} \exp\left\{-\frac{1}{2}\,((t,r) - \mu_c)\,\Sigma^{-1}\,((t,r) - \mu_c)^T\right\}, \tag{5}$$

where Z is a normalization factor such that $\sum_{t,r} K^p_{TR}(t,r) = 1$. The matrix Σ has two eigenvalues σ_L > σ_c. The direction of its eigenvectors is defined by the scale parameter p (which for one-dimensional domains is the only mapping parameter besides the relative translation, which is set by (T, R)). Examples of maplets with two different p values are shown in figure 6. They both have σ_c = 1 and σ_L = 3, and the slope is the only difference.
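Equation (5) can be instantiated directly. In the sketch below the long eigenvector's slope is tied to p as 2^p (our own assumption for how p sets the direction; the text only states that p defines it):

```python
import numpy as np

def maplet_1d(N, T, R, p, sigma_L=3.0, sigma_c=1.0):
    """K^p_{TR}(t, r) of eq. (5): a 2D Gaussian over (t, r), centered at
    (T, R), whose long-axis direction encodes the scale parameter p
    (taken here as slope 2**p, an illustrative assumption)."""
    e = np.array([1.0, 2.0**p])
    e = e / np.linalg.norm(e)                  # long eigenvector of Sigma
    eperp = np.array([-e[1], e[0]])            # short eigenvector
    Sigma = sigma_L**2 * np.outer(e, e) + sigma_c**2 * np.outer(eperp, eperp)
    Sinv = np.linalg.inv(Sigma)
    t, r = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    x = np.stack([t - T, r - R], axis=-1)
    K = np.exp(-0.5 * np.einsum('trb,bc,trc->tr', x, Sinv, x))
    return K / K.sum()                         # Z normalizes the sum to 1

K0 = maplet_1d(32, 16, 16, p=0.0)              # same scale: slope 1, cf. Fig. 6(a)
K1 = maplet_1d(32, 16, 16, p=0.5)              # 0.5 octave difference, cf. Fig. 6(b)
```

Both kernels sum to 1 and peak at the center (T, R); only the orientation of the elongated axis differs, exactly as in the two panels of figure 6.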
Figure 6: Maplets for 1D patterns, with different scale parameter p. Contour plot of the maplet function K^p_{TR}(t, r), a Gaussian centered at (T, R), with standard deviations σ_L = 3, σ_c = 1. The horizontal axis is r and the vertical axis is t. (a) Model and image are of the same scale. (b) Image and model have a scale difference of 0.5 octave.

The generalization of the maplet formulation to two dimensions is straightforward. The rule
is, for parameters p, T and R, to find the target point r^p_{TR}(t) to which a given t should map, and to punish the deviation of r from this correct target point by a Gaussian with standard deviation σ_c. Four examples of 2D maplets are shown in figure 7, corresponding to different parameters: (a) identical image and model; (b) different shift; (c) different rotation; (d) different size. In each case, the two planes correspond to model and image. The tiny circles indicate node positions, and the large circles the disks connected by the maplets.
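The 2D rule can be sketched as follows; the decomposition of p into an explicit (scale, angle) pair and the function names are our own illustrative choices:

```python
import numpy as np

def maplet_2d(t, T, R, scale, angle, sigma_c=1.0):
    """2D maplet weight: map model point t to its image target by rotating
    by `angle` and scaling by `scale` about the disk centers (T, R), then
    penalize deviation of r from that target with a Gaussian of width
    sigma_c. (The paper bundles scale and angle into the index p.)"""
    c, s = np.cos(angle), np.sin(angle)
    Rot = np.array([[c, -s], [s, c]])
    target = R + scale * (Rot @ (t - T))      # r^p_{TR}(t)

    def weight(r):
        d2 = np.sum((np.asarray(r) - target)**2)
        return np.exp(-d2 / (2 * sigma_c**2))

    return target, weight

# identity maplet, as in panel (a) of Fig. 7: each node maps to itself
target, w = maplet_2d(np.array([3.0, 4.0]), T=np.zeros(2), R=np.zeros(2),
                      scale=1.0, angle=0.0)
# target equals t, and the weight is maximal (1) at the target point
```

Setting σ_c = 0 corresponds to the hard assignment used in panels (a)-(c) of figure 7, where only r^p_{TR}(t) itself is linked to t.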
Figure 7: Examples of maplets between two-dimensional domains. Each small figure shows a maplet with different shift, scale, and orientation parameters. (a) identical image and model; (b) different shift; (c) different rotation; (d) different size. In each small figure, the upper layer is the model domain and the lower layer is the image domain. Each tiny circle indicates a node position, and large circles are the disks connected by the maplets. For simplicity, in (a)-(c) the maplets have σ_c = 0, namely only r^p_{TR}(t), and not its Gaussian neighbors, is linked to t. In (d), however, we show the mapping of the node in the middle to several neighboring nodes in the image, as it cannot map to exact node positions because of the size change.

Maplets cooperate if they are neighbors in terms of spatial location and transformation parameters. The cooperation coefficient s(p, p′, T, T′, R, R′) between maplets K^p_{TR} and K^{p′}_{T′R′} falls off with decreasing likelihood of their being compatible in terms of the mapping. We have modelled it as a Gaussian, falling off with variance σ_K as a function of the distance between p and p′ and with variance σ_W as a function of the Euclidean distance between (T, R)
and (T', R'):

    s(p, p', T, T', R, R') = exp{ − d²(p, p') / (2σK²) − [(T − T')² + (R − R')²] / (2σW²) }     (6)
where d(p, p') is a distance between the two transformation parameters p and p', with d(p, p') = 0 if p = p'. Figure 8 is a schematic illustration of the interactions between maplets. Each maplet is represented by an ellipse (one contour in the plot of Fig. 6), and the figure shows the cooperation strength to the filled maplet. The solid-line ellipses have the same p parameter as the filled maplet, and the dashed-line ellipses have a different p. The number of + signs indicates interaction strength. This interaction pattern looks very much like the association field in contour integration (Field et al., 1993) and the control neuron interactions in shifter circuits (Olshausen et al., 1993). A difference is that we have only cooperation and no competition.
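Equation (6) transcribes directly into code. This is a sketch; the distance function d between transformation parameters is passed in as an argument, since the paper leaves its exact form open.

```python
import numpy as np

def cooperation(p, p2, T, T2, R, R2, d, sigma_K=1.0, sigma_W=10.0):
    """Cooperation coefficient s(p, p', T, T', R, R') of equation (6).

    d(p, p') measures the distance between transformation parameters;
    compatibility in p falls off with sigma_K, and compatibility of the
    (T, R) centres falls off with sigma_W.
    """
    return np.exp(-d(p, p2) ** 2 / (2 * sigma_K ** 2)
                  - ((T - T2) ** 2 + (R - R2) ** 2) / (2 * sigma_W ** 2))
```

For identical parameters the coefficient is 1; it decays smoothly for maplets that are mutually inconsistent, implementing cooperation without competition, as described above.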
Figure 8: Maplet interactions. Each ellipse represents a maplet, and the number of + signs indicates its strength of connections to the filled unit.

Maplets provide a specific way of calculating the growth term f_tr(W) for links. The system performs the following three steps to compute an f_tr(W) replacing the original form of equation (2):

1. Compute maplet input from momentary link strengths. The direct excitation o^p_{TR} that the maplet with parameters p, T, R receives from "its" elementary links w_tr is:

    o^p_{TR} = Σ_{t,r} K^p_{TR}(t, r) w_tr .     (7)
2. Gather cooperation between maplets. Analogously to equation (2), the effective strength f^p_{TR} of maplet (p, T, R) is influenced cooperatively by other maplets and itself according to

    f^p_{TR} = α + β o^p_{TR} Σ_{p',T',R'} s(p, p', T, T', R, R') o^{p'}_{T'R'}     (8)
where β is a positive rate and the non-negative term α gives maplets a chance to be active even without any link support. The cooperation coefficient s(p, p', T, T', R, R') reflects the mutual consistency between maplets (p, T, R) and (p', T', R'), as defined in equation (6).

3. Feedback from maplets to links. The growth coefficient for a link, f_tr(W), is influenced by the maplet with maximal effective strength from among those whose domains cover the link:

    f_tr(W) = k f^{p*}_{T*R*} K^{p*}_{T*R*}(t, r) S^{p*}_{tr} ,     (9)

where k is a coefficient, f^{p*}_{T*R*} = max_{p,T,R} { f^p_{TR} | domain of K^p_{TR} includes (t, r) }, with {p*, T*, R*} being the parameter set for which the maximum is attained, and S^{p*}_{tr} is a feature similarity as defined in the next section. (In a fully dynamic formulation the maximum function would have to be replaced by some winner-take-all mechanism.)
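The three steps can be sketched as follows. This is an illustrative transcription of equations (7)-(9), not the paper's code: maplets are stored in a dictionary keyed by (p, T, R), all names are our own, and "domain covers the link" is approximated by a nonzero kernel value.

```python
import numpy as np

def growth_term(W, S, maplets, s, alpha=0.01, beta=1.0, k=50.0):
    """Three-step computation of the growth term f_tr(W), eqs. (7)-(9).

    W       : link-weight matrix w_tr, shape (n_t, n_r)
    S       : feature-similarity matrices S^p_tr, dict p -> (n_t, n_r)
    maplets : dict (p, T, R) -> kernel matrix K^p_TR over (t, r)
    s       : cooperation coefficient s(p, p', T, T', R, R')
    """
    # Step 1 (eq. 7): direct excitation of each maplet from its links.
    o = {m: float(np.sum(K * W)) for m, K in maplets.items()}
    # Step 2 (eq. 8): gather cooperation between maplets.
    f = {}
    for (p, T, R) in maplets:
        coop = sum(s(p, p2, T, T2, R, R2) * o[(p2, T2, R2)]
                   for (p2, T2, R2) in maplets)
        f[(p, T, R)] = alpha + beta * o[(p, T, R)] * coop
    # Step 3 (eq. 9): each link is driven by the strongest maplet
    # whose kernel is nonzero at (t, r).
    F = np.zeros_like(W, dtype=float)
    n_t, n_r = W.shape
    for t in range(n_t):
        for r in range(n_r):
            covering = [m for m in maplets if maplets[m][t, r] > 0]
            if covering:
                m_best = max(covering, key=lambda m: f[m])
                p_best = m_best[0]
                F[t, r] = k * f[m_best] * maplets[m_best][t, r] * S[p_best][t, r]
    return F
```

In a neural implementation the explicit maximum in step 3 would be replaced by a winner-take-all mechanism, as noted above.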
2.4 Establishing a Correspondence Mapping
Initial link weights  The initial weights are determined by the similarities S_tr of local features, defined below in equation (12): w_tr(0) = S_tr.

Iterations  Using the Euler method, we simulate the continuous dynamics of links in two steps:
• Computation of the growth term f_tr(W) for each link as mediated by maplet activity, using the three steps given in the last section. On the first iteration, to compute the activity of maplets in equation (7), we use the parameter-specific initial similarities (S^p_{tr}) instead of w_tr(0); see equation (12) for the definition.

• Update link weights according to:

    ẇ_tr = f_tr(W) − (1/N) w_tr Σ_{t'} f_{t'r}(W) .     (10)
In deviation from equation (1), we here use only a divergent competition term, involving the average of the growth terms of the links that diverge from one image unit. Leaving out convergent competition on model units allows them to link with multiple image units, which is necessary because the number of units in the image domain is usually larger than that in the model domain. Our experiments, described in section 3, will show that this dynamics is very effective in creating acceptable correspondence maps after a fairly small number of iterations.

Low-dimensional maps from high-dimensional ones  After a given number of iteration steps, the mapping W may still have a number of links going into each of the image units. For purposes of map display in two dimensions, as in Fig. 10, and for the calculation of a model-to-image similarity, it is desirable to determine a low-dimensional map, giving a unique r(t) for a given t. We compute it as a weighted average of the projected position:
    r(t) = Σ_{r'} (w_{tr'})^a r' / Σ_{r'} (w_{tr'})^a     (11)

where a is a positive number. For a = ∞, r(t) = argmax_{r'}(w_{tr'}).
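The Euler update of equation (10) and the readout of equation (11) can be sketched as follows, assuming link weights stored as a matrix of shape (model units × image units); the function names and array layout are our assumptions.

```python
import numpy as np

def update_links(W, F, dt=0.5):
    """Euler step of the link dynamics, equation (10): each link grows
    with its growth term F and is normalized by divergent competition,
    the average growth of all links diverging from the same image unit r."""
    n_t = W.shape[0]  # number of model units (the t' in the sum)
    return W + dt * (F - W * F.sum(axis=0, keepdims=True) / n_t)

def project(W, a=32.0):
    """Low-dimensional readout r(t) of equation (11): a weighted average
    of image positions with weights (w_tr')^a; for large a this
    approaches argmax over r'."""
    positions = np.arange(W.shape[1], dtype=float)
    weights = W ** a
    return (weights * positions).sum(axis=1) / weights.sum(axis=1)
```

With uniform growth terms the competition exactly cancels the growth, so uniform weights are a fixed point; for a = 32, as used in the experiments (table 2), the readout is already close to the argmax.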
3 Experiments on Map Formation and Object Recognition
We have performed a number of experiments on map formation and object recognition, using one-dimensional or two-dimensional gray-level patterns for image and model. For two-dimensional object recognition, we took input images and models from two different human portrait galleries, comparing the recognition results to those of other systems reported in the literature. In the present formulation we create a separate correspondence mapping for each model in the model gallery, extract from the final mapping a global similarity between image and model, and take the model with the best similarity as the one recognized.
3.1 Feature Similarities
As in all forms of the correspondence problem, the mapping to be found should optimize some feature similarity between linked points. In our experiments we employ Gabor wavelets as elementary features. For one-dimensional patterns we use Gabor responses with five scales; for two-dimensional images we use wavelets of five scales and eight orientations as in (Lades et al., 1993; Wiskott and von der Malsburg, 1996). In both cases, only Gabor magnitudes are used. The vector of five or 40 Gabor responses obtained at one point forms a jet. The similarity between two jets depends on the relative scale and orientation of the mapping. We here take as similarity S_tr between two jets J = (a_1, a_2, ..., a_n) at position r in the image domain and J' = (a'_1, a'_2, ..., a'_n) at position t in the model domain:

    S_tr = max_p S^p_{tr} = max_p [ Σ_i a_i a'_{f_p(i)} / sqrt( Σ_i a_i² Σ_i a'²_{f_p(i)} ) ]     (12)

where p is the index of relative scale (and orientation), (S^p_{tr}) is the similarity matrix for parameter p, and f_p(i) is the function that maps corresponding elements of the two jets at parameter difference p. In the case of a scale difference, summations run only over those elements of the jets that find a corresponding element in the other jet. For one-dimensional patterns, similarity matrices computed in this way are very ambiguous, making it difficult or impossible to extract correspondences. We reduced this problem by using an overlay of three gray-level patterns (three intensities per sample point) instead of one for image and model, computing similarities for the resulting more specific feature vectors of triple length.
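Equation (12) can be sketched as follows. This is our own transcription: the index-mapping functions f_p are supplied as a dictionary, and jet elements without a partner under a scale shift are represented by None and skipped, as in the paper.

```python
import numpy as np

def jet_similarity(J, J2, mappings):
    """Jet similarity S_tr of equation (12): the normalized dot product
    between image jet J and model jet J2, maximized over the candidate
    relative scale/orientation parameters p.

    mappings : dict p -> f_p, where f_p(i) gives the index of the element
               of J2 corresponding to element i of J, or None if the
               element has no partner (scale shifts).
    """
    best = -np.inf
    for p, f_p in mappings.items():
        # collect only the element pairs that exist under mapping f_p
        pairs = [(J[i], J2[f_p(i)]) for i in range(len(J)) if f_p(i) is not None]
        a, b = (np.array(x, dtype=float) for x in zip(*pairs))
        best = max(best, float(a @ b / np.sqrt((a @ a) * (b @ b))))
    return best
```

Because only magnitudes enter the jets, the normalized dot product lies in [0, 1] and equals 1 for identical jets under the identity mapping.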
3.2 Image Similarity from a Mapping
The similarity between an image and a model as mapped by W can be computed as follows. First, a 1-1 mapping is computed by using equation (11), so that for each t we have an r(t). Then the optimal orientation and scale p∗ at each link is estimated from the maplets, as the average of orientation and scale of all maplets looking at this link, weighted by the maplets’ activities.
The similarity value is then the sum of the jet similarities of all corresponding point pairs:

    s = Σ_t S^{p*}_{tr(t)} = Σ_t [ Σ_i a_i(r(t)) a'_{f_{p*}(i)}(t) / sqrt( Σ_i a_i²(r(t)) Σ_i a'²_{f_{p*}(i)}(t) ) ]     (13)

where S^{p*}_{tr(t)} is the jet similarity between the jets J(r(t)) = (a_1(r(t)), ..., a_n(r(t))), at position r(t) in the image, and J'(t) = (a'_1(t), ..., a'_n(t)), at position t in the model, with relative scale and orientation p*.
3.3 Recognition
For object recognition, the following steps are performed for each image in the probe gallery:

Map creation  For the image and each model in the gallery individually:

• Establish a mapping between the two, performing a predefined number of iterations.

• Compute the image similarity under this mapping (equation (13)).

Recognition  The recognized model is the one with the greatest image similarity.

This simple implementation of recognition always returns the most similar model for an input image. For simplicity we have forgone computing a confidence measure that could diagnose cases in which the image is not in the model gallery and no model should be picked.
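The recognition procedure above can be sketched as follows. This is a sketch only: build_map and image_similarity are placeholders standing for the map-formation dynamics and equation (13), and all names are our own.

```python
def recognize(image, models, build_map, image_similarity, n_iter=3):
    """Recognition procedure of section 3.3: for each model, form a
    correspondence map for a fixed number of iterations, score the
    image under that map, and return the best-scoring model.

    models           : dict mapping a model name to its representation
    build_map        : callable (image, model, n_iter) -> mapping W
    image_similarity : callable (image, model, W) -> scalar, eq. (13)
    """
    scores = {name: image_similarity(image, model,
                                     build_map(image, model, n_iter))
              for name, model in models.items()}
    # always return the most similar model (no confidence threshold)
    return max(scores, key=scores.get)
```

As noted above, this always returns some model; a rejection threshold on the best score would be needed to handle probe images that have no counterpart in the gallery.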
3.4 Map Formation: One-Dimensional Patterns
Figure 9 shows an example of map formation between two one-dimensional patterns. Fig. 9(a) shows image and model, 1D patterns 128 pixels in length. The relative local scale is non-uniform, representing deformation. Features are computed as 1D Gabor wavelet responses of three patterns at a time, as explained in section 3.1. There are n = 5 different scales, differing by 0.5 octave from each other. Fig. 9(b) shows the evolution of the system dynamics. The first row shows synaptic weights, the second maplet activities. The image coordinate runs horizontally and the model coordinate vertically. There are three groups of maplets corresponding to three relative scales. The columns indicate different time steps (initial, showing the feature similarities, after 10 iterations, and after 20 iterations). The transformation parameters of the map can easily be recognized by visual inspection from the activity of the maplets, by just paying attention to coherent chains of them. For the maplet function in equation (5), the Gaussians have variance σL = 9 and σc = 1. Their domain size in the t dimension is taken to be 13, and in the r dimension it varies with scale. The 3 scales (r/t ratio) shown are √2, 1, and 1/√2. Accordingly, the domain sizes in the r dimension are ⌈13·√2⌉, 13, and ⌈13/√2⌉, respectively. The performance is not sensitive to the exact values of the parameters in the system dynamics. In the experiments, both in one and two dimensions, we simply set them to reasonable values, and the system worked fine without any need to tune them. The values used are shown in table 2.
Figure 9: Map formation between 1D patterns. (a) Image and model, 1D patterns of 128 pixels. Note the non-uniform relative scale of the two patterns. (b) The evolution of synaptic weights and maplet activities. The three columns refer to the initial state and to the states after 10 and 20 iterations. Top row: synaptic weights w_tr. Bottom row: maplet activity o_{TR}. The horizontal axis is r (R), the vertical axis t (T). High values are shown in white, low values in black. The three maplet matrices under each weight matrix correspond to different scale (p) parameter values.
3.5 Map Formation: Two-Dimensional Patterns
The creation of a mapping between an image and a rotated copy is shown in figure 10. The first row shows model and image. The second row shows a regular grid on the model, and its mapping to the image domain, computed as in equation (11), at different time steps. It can be seen that the grid is rotated along with the image, as it should. The sampling of maplets in the transformation domain is 1 sample in scale (same size) and 3 samples in orientation (0, π/6, and π/3).
3.6 Face Recognition

3.6.1 Galleries
We use the FERET database (Phillips et al., 2000; Wiskott et al., 1997). There are two groups of faces with the same set of persons: frontal view (fa), and frontal view with a different facial expression (fb). The image size is 128 × 128 pixels. To test the robustness of our system to deformations, we also tested the Bochum database (Wiskott and von der Malsburg, 1996), where the two groups of faces are frontal (fa0) and rotated in depth by 30° (hr4). In order to test large variance in scale and rotation, we transform fb images into "Tfb" images. The transformation is a change in scale and an in-plane rotation around the center of the image. The transformed images are then reformatted as 128 × 128 images, filling in empty margins by extending grey levels from the boundaries of the transformed image outward. We get two sets of Tfb images with different transformation ranges. In Tfb-large, the new size is in the range [77, 128] pixels on a side and the rotation angle in the range [−30°, 30°]. In Tfb-small, the new size range is [110, 128] pixels and the rotation angle range [−9°, 9°]. Both size and rotation are chosen randomly and uniformly from their respective ranges.

Figure 10: A 2D example of map creation. The first row shows model and image. The second row shows a regular grid on the model, and its mapping to the image domain at different time steps.

Several images from the fa and Tfb-large galleries are shown in figure 11. The first row is from the fa gallery, and the second row is from the Tfb-large gallery.
Figure 11: Images for face recognition. First row: fa gallery (frontal views). Second row: Tfb-large gallery (frontal view with a different expression, transformed in scale and in-plane rotation).

In all experiments, the recognition is between a model gallery and a probe gallery. The model gallery is composed of the original fa or fa0 images. There are 124 images in the galleries fa and Tfb, and 110 in the galleries fa0 and hr4.

3.6.2 Results
All results were obtained after 3 iterations of map formation. A summary of the recognition rates is shown in table 1, and the parameters used in table 2.

Test 1  This is a basic test of the system. The model gallery is fa and the probe gallery Tfb-large. Maplets sample the transformation domain at relative orientation values (0, 0.1π, −0.1π) and at relative scale values (√2, 1, 1/√2). The recognition rate is 85% (= 106/124).
Test 2  With test 2 we compare our system's performance with that of other face recognition systems. These systems did not explicitly permit changes in scale and orientation. We accordingly switched off our search in scale and orientation by using only maplets with relative orientation 0 and scale 1. As model gallery we still used fa, and as probe gallery Tfb-small. We obtained a recognition rate of 96.8% (= 120/124). This is higher than in Test 1, as is to be expected given the much smaller permitted range of variation. The result is not significantly different from the recognition rate with elastic bunch graph matching (EBGM) (Wiskott et al., 1997), where in one experiment with 250 fa against 250 fb images the recognition rate was 98%.

Test 3  To explore the robustness of our system against deformations we tested it with faces rotated in depth. The model gallery this time is fa0, with probe gallery hr4. We used only one set of parameters for the maplets, as in Test 2. The recognition rate is 93.6% (= 103/110). This compares very favorably with the dynamic link matching (DLM) system (Wiskott and von der Malsburg, 1996), where a recognition rate of 66.4% (= 73/110) was obtained with the same galleries.

test | model | probe     | size | recognition rate | other systems
-----|-------|-----------|------|------------------|--------------------------------------------------
1    | fa    | Tfb-large | 124  | 85%              |
2    | fa    | Tfb-small | 124  | 96.8%            | 98% (EBGM (Wiskott et al., 1997))
3    | fa0   | hr4       | 110  | 93.6%            | 66.4% (DLM (Wiskott and von der Malsburg, 1996))

Table 1: Recognition rates
                 | parameter        | value                | notes
-----------------|------------------|----------------------|------------------
System Dynamics  | α                | 0.01                 | Eq. (8)
                 | β                | 1                    | Eq. (8)
                 | k                | 50                   | Eq. (9)
                 | time step        | 0.5                  | Euler method
Maplets          | σK               | 1                    | Eq. (6)
                 | σW               | 10                   | Eq. (6)
                 | σc               | 1                    | Eq. (5)
                 | ctl size, t dim. | 5 × 5                |
                 | ctl size, r dim. | varies               |
Feature          | a                | 32                   | Eq. (11)
Sample in space  | model            | 8 × 8 regular grid   | center 108 × 108
                 | image            | 14 × 14 regular grid | whole image

Table 2: Parameters used in the experiments
3.7 Time complexity
The speed, or time complexity, of our system is measured in units of the number of iterations. This measure is appropriate for the parallel distributed processing in neural architectures, as it does not consider the complexity of each iteration, as required on sequential computers. In contrast, complexity analysis in computer science counts the number of basic operations in a von Neumann computer architecture. The latter is irrelevant for the (highly parallel) brain, although at the moment our experiments have to be implemented on sequential computers. Our aim is not to improve the technology by reducing the number of sequential computer operations. In fact, there are many algorithms in technology that can perform the correspondence task efficiently, such as the EBGM system (Wiskott et al., 1997). Some comparisons are given in the next section.
4 Comparison with other systems

4.1 Object recognition methods
Systematic independent tests of object recognition are available almost exclusively for face recognition. Therefore we restrict our comparisons to that application. Moreover, we compare our system here only with the EBGM (Wiskott et al., 1997) and DLM (Wiskott and von der Malsburg, 1996) systems, because these systems are based on correspondence between fields of Gabor responses, so that comparative results say the most about the mapping issue, which is our focus here, and are not obscured by issues having to do with different feature types. In the world-wide FERET test by the US Army Research Laboratory, this approach outperformed all other techniques, such as feature-based methods, eigenfaces, and classification neural nets (Phillips et al., 2000). More recently, in the Face Recognition Vendor Test (FRVT, www.frvt.org), the biggest and most well-known evaluation of face recognition technologies, sponsored by DoD, DARPA, the National Institute of Standards, and others, two of the three best systems (Eyematic and Cognitec) were based on the correspondence of Gabor jet graphs, and the first (by Visionics) presumably differs only in the type of local features used. Systems based on the correspondence of Gabor fields differ in their matching process. The first algorithmic implementation of the matching process was elastic graph matching (EGM) (Lades et al., 1993). In EGM, model graphs are rectangular grids, as in our system, and the algorithm provides a systematic way of deforming the image graph to optimize the matching of its vertex and edge labels with the model graph. EBGM (Wiskott et al., 1997) further improves the efficiency of EGM by using bunch graphs, whose vertices, placed at fiducial points in the face, are labeled with a set of jets from different individuals. Our work here aims at implementing the matching process in a brain-like structure.
As the face recognition results in the last section show, our recognition rate is comparable to that of EBGM (Wiskott et al., 1997). Using fiducial points in our system might improve the performance further, but that is beyond the scope of this paper. The DLM system (Wiskott and von der Malsburg, 1996) is a fully neural realization of face recognition. In that implementation, link dynamics is controlled by temporal signal correlations. This is inherently sequential and requires a sequence of thousands of activity patterns in the image and the model. In contrast, our current system uses only 3 iterations of map formation.
4.2 Shifter circuits
Our general approach is similar to the shifter circuits of Olshausen et al. (Olshausen et al., 1993). Both approaches share the philosophy of handling mappings directly, aiming at providing a good routing model that preserves spatial relationships. Their model relies on a set of control neurons which are very much like our maplets. There are fundamental differences, however. Their dynamic variables are the coefficients (activities) of the control neurons, and are subjected to gradient descent of an energy function. In contrast, our dynamic variables are the links themselves, giving a different role to our maplets and their control neurons. The goal of maplets is only to guide the dynamics of links, in contrast to the all-or-none gating behavior of control neurons. Whereas our system is self-organized, shifter circuits require a third party, the pulvinar, to provide the control signals. The architecture also differs between the two systems. Our system has only two domains (image and model), while their model has several intermediate levels between the two domains. Multiple levels have the advantage of needing fewer connections per node, so that each level has smaller fan-in and simpler control. They are also biologically more realistic. But when using patterns stored in associative memory to guide the control variables, i.e., top-down control, there is the problem of how to propagate this information back down to the lower levels.
4.3 Correspondence methods
The core of our system is the establishment of a correspondence between two graphs. The correspondence problem has been an active area in computer vision, especially for stereo and motion analysis. In this section, we briefly discuss some major stereo correspondence methods. The earliest stereo correspondence methods are the cooperative algorithms (Julesz, 1971; Dev, 1975; Marr and Poggio, 1976). Our system belongs to this class of self-organizing systems, and inherits their advantage of robustness. Cooperative algorithms differ in their formulation of the cooperation and competition constraints, and which form is correct has been a center of debate (Marr and Poggio, 1976; Frisby, 2002; Arbib et al., 1974). In our system, the cooperation constraint is mediated by maplets. Learning maplets amounts to learning the constraint, thus eliminating the arbitrariness in defining it. Our maplets and their connections act very much like the association field for contour integration (Field et al., 1993) and the extension field in edge detection (Medioni et al., 2000). Dynamic programming is one of the most common approaches to stereo correspondence, for example (Ohta and Kanade, 1985; Geiger et al., 1995). The central idea of dynamic programming is to solve subproblems and store their results in topological order. This makes the method inherently sequential and limits it essentially to one-dimensional problems. The stereo correspondence problem is typically reduced to a 1D matching problem on independent scanlines, due to the epipolar constraint, so that dynamic programming can be readily applied. In fact, even when the inter-scanline constraint is applied, the search is only in a 3D space that stacks the 2D weight matrices of 1D patterns (Ohta and Kanade, 1985), and not in the true 4D space of general correspondences between a pair of 2D patterns.
4.4 Image registration
Image registration has been an ever-growing field, especially for medical image analysis. If, as in our current system, the variations are only translation, scale and in-plane rotation, functional approximation can be very efficient (Brown, 1992). It is easy to write down an energy function and the dynamics on these four parameters. But, as is true for all top-down constrained optimization methods, this method is in general difficult to extend to other variations, and learning the energy function from examples has never been attempted.
5 Discussion
If, as claimed in (von der Malsburg, 1981), the data format of the brain is a dynamic graph containing link variables in addition to node variables, it is an important scientific task to work out the mechanisms by which these additional variables are organized on the psychological time scale of fractions of a second. This paper aspires to contribute to that task. We take as paradigm the correspondence problem (which necessarily concerns the organization of links) in its application to object recognition. Object recognition is an important and intensively studied issue in itself. Although the correspondence problem must also be solved for stereo matching and motion estimation, those applications are less general in scope, as the patterns to be matched there have small parameter differences and are rather similar to each other. In contrast, the search space for object recognition is large, as important differences in scale, orientation and position are involved, as the correct model must be selected from a possibly very large number of competitors, and as images of the same object may differ in illumination, pose and deformation. The general ability to efficiently find structures that are isomorphic to each other may well be central to the function of the brain, a conclusion well supported by the important role search procedures play in the field of artificial intelligence. The brain has the inherent tendency to establish organized structural states by mechanisms of self-organization and learning, and it must be part of its function to stabilize such arduously found states by modifying the underlying system of interactions so that these states can be recovered efficiently and reliably in the future. Isomorphic mappings are among the organized patterns that can be established on the basis of rather naturally given simple interactions (Willshaw and von der Malsburg, 1976; Bienenstock and von der Malsburg, 1987; Wiskott and von der Malsburg, 1996).
However, this process is very slow, as the globally correct isomorphism has to compete with a very large number of nearly-optimal connectivity patterns. An estimate of the time taken by this mechanism if employed for object recognition, as worked out in (Wiskott and von der Malsburg, 1996), would amount to three to ten seconds. This may be an acceptable estimate for object recognition in the infant brain, but adult object recognition is estimated to take less than a tenth of a second (Thorpe et al., 1996). We therefore assume that isomorphic mappings formed in early life leave memory traces in the form of altered link interaction patterns, modelled here as maplets and their interactions, on the basis of which the process is accelerated by one or two orders of magnitude. If each iteration in our dynamic formulation is estimated to take a few milliseconds, then the ten or twenty iterations to form a fairly precise correspondence mapping could be accomplished well within a tenth of a second. Moreover, as shown in (Wiskott, 1999), object recognition can, at least in simple cases, be reliably performed long before the correspondence map has reached full precision, as illustrated also in our face recognition tests, in which a decision was forced already after three iterations of map development. Reaction-time arguments can therefore no longer be raised against correspondence-based object recognition. As far as object recognition (and not correspondence) is concerned, our formulation is still unrealistic in treating object models as completely disjunct and letting each of them develop its own correspondence. A more mature formulation will have to take into account the structural overlap between different object models (as is exploited in the bunch graph method (Wiskott et al., 1995)) and will have to let them collaborate in the establishment of a single mapping, as in (Olshausen et al., 1993). 
Partial structural overlap between structures in the brain can be represented either by common subsets of neurons or with the
help of high-order links analogous to maplets. How are links implemented in the brain? The basic issue has been discussed extensively elsewhere (von der Malsburg, 1999; von der Malsburg, 2002) and we will here concentrate mainly on higher-order links and their interactions, as postulated in this paper in the form of interacting maplets. The issue is complex, and much detailed work will be needed before a satisfactory solution is found. A detailed neural model for dynamic link matching on the basis of binary links is described in (Wiskott and von der Malsburg, 1996): links are identified with individual synaptic connections, which are subject to rapid reversible synaptic plasticity under the control of temporal correlations (von der Malsburg, 1981). It may be surmised that this is the original neural implementation in the infant brain. That implementation, however, is not acceptable as a model for object recognition in the adult, being too slow by one or two orders of magnitude. The reason for this slowness is twofold — the temporal correlations for synaptic control inherently need time to be expressed, and the synapto-synaptic interactions (corresponding to the C function of equation (2) in section 2.3) as induced by temporal correlations are too unspecific in terms of map parameters. There is the logical possibility of direct associative interaction routes within sets of synapses, and candidate pathways for such interactions could be afforded by astrocytes, as proposed by (Antanitus, 1998); astrocytes are very numerous relative to neurons, have many processes with an obvious affinity to synapses and especially to synaptic clefts, and astrocytes carry ionic signals that could possibly switch synapses on and off in pools. So far, however, all ionic signals observed in astrocytes are much too slow (several hundred msec) to serve the intended purpose (see (Rose et al., 2003) concerning Ca++ signals). 
Another possibility is the direct control of synaptic switching by control neurons, by way of synaptic terminals that are closely apposed to the controlled synapses, either pre-synaptically or post-synaptically (Olshausen et al., 1993; Hinton, 1981). These control neurons would, in the present context, stand for maplets, and synaptic interactions between them would implement maplet interaction. A difficulty with that mode of implementation is the open question how information on feature similarities or on momentary synaptic strength could be transported to the control neurons (see eq.(7)), unless the connections of control neurons supported bi-directional signal transport. Also, it is difficult to imagine ontogenetic pathways to the generation of such control neurons. A more conventional way to implement links is by way of neurons that are specialized to carry the intended synaptic connections (Dev, 1975; Marr and Poggio, 1976; Sejnowski, 1981). As pointed out in (Sejnowski, 1981), link-representing neurons would transmit fluctuating signals only if not in high or low saturation, thus being able to represent dynamic links if appropriately controlled. Seeming difficulties with neurons-as-links lie with economy and anatomy. It just seems very wasteful to employ whole neurons as links (which must be much more numerous than neural units), and the anatomy of cortical neurons, carrying many thousand synaptic connections on dendrites and axon, doesn’t look as if made just for connecting a few other units. Realizing, however, that only large numbers of activated synapses can fire a postsynaptic neuron (Abeles, 1982), a plausible possibility arises. Assume that the units that are to be linked are composed of many neurons each, forming multi-cellular units (MCUs). 
Then a link between MCUs A and B can be established by activating subsets a and b of A and B, respectively, if there are many synaptic connections between those subsets, positively interfering such as to effectively link a and b. Each neuron in, e.g., a has many more connections in addition to those to b, creating an ineffectual spray of single signals over many other MCUs, connections that become functional only when interfering positively in the context of other subsets a′ of A. Thus, each neuron in an MCU
would be part of a combinatorial code contributing to a large number of links to other MCUs. Implementation of links would thus stay within the realm of conventional neural networks, but would require rather specific and massive connectivity patterns together with mechanisms for their development and learning, of which the details will have to be worked out. On the basis of this implementation, direct link-to-link interactions are simply realized by further neural connections, which could be learned by Hebbian plasticity.
Acknowledgments  This work was supported by the DDR&E/ARO Multidisciplinary University Research Initiative, contract number DAAG55-98-1-0293, and by ARO-WASSP, contract number DAAD19-00-1-0356. The authors would like to thank the developers of the FLAVOR software environment, which served as the platform for this work.
References

Abeles, M. (1982). Studies of Brain Function. Vol. 6: Local Cortical Circuits: An Electrophysiological Study. Springer-Verlag, Berlin.

Ajjanagadde, V. and Shastri, L. (1991). Rules and variables in neural nets. Neural Computation, 3:121–134.

Antanitus, D. (1998). A theory of cortical neuron-astrocyte interaction. Neuroscientist, 4(3):154–159.

Aonishi, T. and Kurata, K. (2000). Extension of dynamic link matching by introducing local linear maps. IEEE Transactions on Neural Networks, 11(3):817–822.

Arbib, M., Boylls, C., and Dev, P. (1974). Neural models of spatial perception and the control of movement. In Keidel, W., Handler, W., and Spreng, M., editors, Cybernetics and Bionics, pages 216–231. Oldenbourg.

Bell, A. and Sejnowski, T. (1997). The independent components of natural scenes are edge filters. Vision Research, 37(23):3327–3338.

Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological Review, 94:115–147.

Biederman, I., Subramaniam, S., Bar, M., Kalocsai, P., and Fiser, J. (1999). Subordinate-level object classification reexamined. Psychological Research, 62:131–153.

Bienenstock, E. and von der Malsburg, C. (1987). A neural network for invariant pattern recognition. Europhysics Letters, 4:121–126.

Brown, L. (1992). A survey of image registration techniques. ACM Computing Surveys, 24(4):325–376.

Dev, P. (1975). Perception of depth surfaces in random dot stereograms: A neural model. Int. J. Man-Machine Studies, 7:511–528.

Edelman, S. (1995). Representation, similarity and the chorus of prototypes. Minds and Machines, 5:45–68.
Elliffe, M., Rolls, E., and Stringer, S. (2002). Invariant recognition of feature combinations in the visual system. Biological Cybernetics, 86:59–71.
Field, D., Hayes, A., and Hess, R. (1993). Contour integration by the human visual system: evidence for a local "association field". Vision Research, 33(2):173–193.
Frisby, J. (2002). Stereo correspondence. In Arbib, M., editor, The Handbook of Brain Theory and Neural Networks, pages 1104–1108. MIT Press, 2nd edition.
Fukushima, K., Miyake, S., and Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man and Cybernetics, 13(5):826–834.
Garey, M. and Johnson, D. (1979). Computers and Intractability. W.H. Freeman and Co., New York.
Geiger, D., Gupta, A., Costa, L., and Vlontzos, J. (1995). Dynamic programming for detecting, tracking and matching elastic contours. IEEE Trans. PAMI, 17(3):294–302.
Haken, H. (1977). Synergetics: An Introduction: Nonequilibrium Phase Transitions and Self-Organization in Physics, Chemistry, and Biology. Springer-Verlag, Berlin; New York.
Häussler, A. F. and von der Malsburg, C. (1983). Development of retinotopic projections – an analytical treatment. Journal of Theoretical Neurobiology, 2:47–73.
Hinton, G. (1981). A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the Seventh International Joint Conference on Artificial Intelligence, volume 2, pages 683–685, Vancouver BC, Canada.
Horn, B. and Schunck, B. (1981). Determining optical flow. Artificial Intelligence, 17:185–203.
Hummel, J. and Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psychological Review, 99(3):480–517.
Hummel, R. and Zucker, S. (1983). On the foundations of relaxation labeling processes. IEEE Trans. PAMI, 5(3):267–287.
Julesz, B. (1971). Foundations of Cyclopean Perception. Chicago Univ. Press, Chicago.
Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R., and Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3):300–311.
Legéndy, C. (1970). The brain and its information trapping device. In Progress in Cybernetics, volume 1, pages 309–338. Gordon and Breach, New York.
Loos, H. and von der Malsburg, C. (2002). 1-click learning of object models for recognition. In Bülthoff, H., Lee, S.-W., Poggio, T., and Wallraven, C., editors, Biologically Motivated Computer Vision 2002 (BMCV 2002), volume 2525 of Lecture Notes in Computer Science, pages 377–386, Tübingen, Germany. Springer-Verlag.
Marr, D. and Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194:283–287.
Medioni, G., Lee, M.-S., and Tang, C.-K. (2000). A Computational Framework for Segmentation and Grouping. Elsevier.
Mel, B. (1997). SEEMORE: Combining color, shape, and texture histogramming in a neurally-inspired approach to visual object recognition. Neural Computation, 9:777–804.
Miller, M. and Younes, L. (2001). Group actions, homeomorphisms, and matching: A general framework. International Journal of Computer Vision, 41(1/2):61–84.
Nicolescu, M. and Medioni, G. (2002). 4-D voting for matching, densification and segmentation into motion layers. In International Conference on Pattern Recognition, Quebec City, Canada.
Ohta, Y. and Kanade, T. (1985). Stereo by intra- and inter-scanline search using dynamic programming. IEEE Trans. PAMI, 7(2):139–154.
Olshausen, B., Anderson, C., and Van Essen, D. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. The Journal of Neuroscience, 13(11):4700–4719.
Olshausen, B. and Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609.
Phillips, P., Moon, H., Rizvi, S., and Rauss, P. (2000). The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. PAMI, 22(10):1090–1104.
Poggio, T. and Koch, C. (1985). Ill-posed problems in early vision: From computational theory to analog networks. Proceedings of the Royal Society London B, 226:303–323.
Pollard, S., Mayhew, J., and Frisby, J. (1985). PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception, 14:449–470.
Rose, C., Blum, R., Pichler, B., Lepier, A., Kafitz, K., and Konnerth, A. (2003). Truncated TrkB-T1 mediates neurotrophin-evoked calcium signalling in glia cells. Nature, 426:74–78.
Sejnowski, T. (1981). Skeleton filters in the brain. In Parallel Models of Associative Memory, pages 189–212. Lawrence Erlbaum, Hillsdale, N.J.
Thorpe, S., Fize, D., and Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381:520–522.
Ullman, S. (1979). The Interpretation of Visual Motion. MIT Press, Cambridge MA.
Ullman, S. (1989). Aligning pictorial descriptions: an approach to object recognition. Cognition, 32:193–254.
von der Malsburg, C. (1981). The correlation theory of brain function. Internal Report 81-2, MPI Biophysical Chemistry. Reprinted in E. Domany, J.L. van Hemmen, and K. Schulten, editors, Models of Neural Networks II, chapter 2, pages 95–119. Springer, Berlin, 1994.
von der Malsburg, C. (1999). The what and why of binding: The modeler's perspective. Neuron, 24(1):95–104.
von der Malsburg, C. (2002). Dynamic link architecture. In Arbib, M., editor, The Handbook of Brain Theory and Neural Networks, pages 365–368. MIT Press, 2nd edition.
Wersing, H. and Körner, E. (2003). Learning optimized features for hierarchical models of invariant object recognition. Neural Computation, 15(7):1559–1588.
Willshaw, D. J. and von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society London B, 194:431–445.
Wiskott, L. (1999). The role of topographical constraints in face recognition. Pattern Recognition Letters, 20(1):89–96.
Wiskott, L., Fellous, J.-M., Krüger, N., and von der Malsburg, C. (1995). Face recognition and gender determination. In International Workshop on Automatic Face- and Gesture-Recognition, Zürich, June 26-28, 1995, pages 92–97.
Wiskott, L., Fellous, J.-M., Krüger, N., and von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Trans. PAMI, 19(7):775–779.
Wiskott, L. and von der Malsburg, C. (1996). Face recognition by dynamic link matching. In Sirosh, J., Miikkulainen, R., and Choe, Y., editors, Lateral Interactions in the Cortex: Structure and Function, chapter 11. The UTCS Neural Networks Research Group, Austin, TX, http://www.cs.utexas.edu/users/nn/webpubs/htmlbook96/. Electronic book, ISBN 0-9647060-0-8.
Yuille, A., Cohen, D., and Hallinan, P. (1989). Feature extraction from faces using deformable templates. In Proceedings of Computer Vision and Pattern Recognition, pages 104–109, San Diego. IEEE Computer Society Press.
Zhu, J. and von der Malsburg, C. (2002). Synapto-synaptic interactions speed up dynamic link matching. Neurocomputing, 44:721–728.
Zhu, J. and von der Malsburg, C. (2003). Learning control units for invariant recognition. Neurocomputing, 52:447–453.