Computational Strategies for Model-Based Scene Interpretation

1

Computational Strategies for Model-Based Scene Interpretation

Yali Amit, Donald Geman and Xiaodong Fan

Yali Amit is with the Department of Statistics and the department of Computer Science, University of Chicago, Chicago, IL, 60637. Email: [email protected]. Supported in part by NSF ITR DMS-0219016. Donald Geman is with the Whitaker Institute and Department of Mathematical Sciences, Johns Hopkins University, Baltimore, MD 21218. Email:[email protected]. Supported in part by ONR under contract N000120210053, ARO under grant DAAD1902-1-0337, and NSF ITR DMS-0219016. Xiaodong Fan is with the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218. Supported in part by by ONR under contract N000120210053. Email:[email protected]. October 9, 2003

DRAFT

2

Abstract Scene interpretation, in the sense of detecting and localizing instances from multiple object classes, is formulated as a two-step process in which non-contextual detection primes global interpretation. During detection a list of instantiations (object identities and poses) is compiled constrained only by invariance - no missed detections - at the expense of false positives. Contextual information, such as expected relationships among poses, is incorporated afterwards to remove ambiguities. This division is motivated by computational efficiency. In addition, detection itself is organized as a coarse-to-fine search simultaneously in class and pose. This search can be interpreted as successive approximations to likelihood ratio tests arising from a simple (“naive Bayes”) statistical model for the edge maps extracted from the original images. The key to constructing efficient “hypothesis tests” for multiple classes and poses is local ORing; in particular, spread edges provide imprecise but common and locally invariant features. Natural tradeoffs then emerge between discrimination and the shape and extent of spreading. These are analyzed mathematically within the model-based framework and the whole procedure is illustrated by experiments in reading license plates.

I. I NTRODUCTION We consider detecting and localizing objects in cluttered grey level scenes when the objects may appear in many poses and there are many classes of interest. In many applications, a mere list of object instantiations, where each item indicates the generic class and approximate pose, provides a useful global description of the scene. (Richer descriptions, involving higher-level labels, occlusion patterns, etc. are sometimes desired.) The set of lists may be restricted by contextual information involving the joint configuration of poses; this is the situation in our application to reading license plates. In this paper, detection will refer to noncontextual recognition in the sense of compiling a list of object instantiations independently of any global constraints; interpretation will refer to incorporating any such constraints, i.e., relationships among instantiations. In our approach detection primes interpretation. Detection Ideally, we expect to detect all instances from all the classes of interest under a wide range of geometric presentations and imaging conditions (resolution, lighting, background, etc.). This can be difficult even for one generic class without accepting false positives. For instance, all October 9, 2003

DRAFT

3

approaches to face detection (e.g., Rowley et al. (1998), Fleuret & Geman (2001), Viola & Jones (2001)) must confront the expected variations in the position, scale and tilt of a face, varying angles of illumination and the presence of complex backgrounds; despite considerable activity, and marked advances in speed and learning, no approach achieves a negligible false positive rate on complex scenes without missing faces. With multiple object classes an additional level of complexity is introduced and subtle confusions between classes must be resolved in addition to false positives due to background clutter. Invariance will mean a null false negative rate during detection, i.e., the list of reported instantiations is certain to contain the actual ones. Discrimination will refer to false positive error – the extent to which we fantasize in our zeal to find everything. We regard invariance as a hard constraint. Generally, parameters of an algorithm can be adjusted to achieve near-invariance at the expense of discrimination. The important tradeoff is then discrimination vs computation. Hypothetically, one could achieve invariance and high discrimination by looking separately for every class at every possible pose (“templates for everything”). Needless to say, with a large number of possible class/pose pairs, this would be extremely costly, and massive parallelism is not the answer. Somehow we need to look for many things at once, which seems at odds with achieving high discrimination. Such observations lead naturally to organizing scene interpretation as a coarse-to-fine (CTF) computational process. Begin by efficiently eliminating entire subsets of class/pose pairs simultaneously (always maintaining invariance) at the expense of relatively low discrimination. From the point of view of computation, rejecting many explanations at once with a single, relatively inexpensive “test” is clearly efficient; after all, given an arbitrary subimage, the most likely hypothesis by far is “no object of interest” or “background,” and initially testing against this allows for early average termination of the search. If “background” is not declared, proceed to smaller class/pose subsets at higher levels of discrimination, and finally entertain highly discriminating procedures but dedicated to specific classes and poses. Accumulated false positives are eventually removed by more intense, but focused, processing. In this way, the issue of computation strongly influences the very development of the algorithms, rather than being an afterthought. A natural control parameter for balancing discrimination and computation is the degree of invariance of local features, not in the sense of fine shape attributes, such as geometric singuOctober 9, 2003

DRAFT

4

larities of curves and surfaces, but rather coarse, generic features which are common in a set of class/pose pairs. “Spread edges” (Amit & Geman (1999), Amit (2000), Fleuret & Geman (2001)) provide a simple example: an edge is said to be detected at a given location if the response of an edge operator is strong enough anywhere nearby. The larger the spreading (degree of local ORing), the higher the incidence on any given ensemble of classes and poses, and checking for a certain number of distinguished spread edges provides a simple, computationally efficient test for the ensemble. During the computational process, the amount of spreading is successively diminished. Interpretation The outcome of detection is a collection of instantiations - class/pose pairs. No contextual information has been employed. In particular, some instantiations may be inconsistent with prior information about the scene layout. Moreover, several classes will often be detected at roughly the same location due to the insistence on minimizing false negatives. The passage from detection to interpretation is largely based on taking into account prior knowledge about the number of objects and the manner in which they are spatially arranged. Assuming objects do not overlap, a key component of this analysis is a competition among objects or sequences of objects covering the same image region, for which we employ a likelihood ratio test motivated by our statistical model for local features. Since a relatively small number of candidate instantiations are ever involved, it is also computationally feasible to bring finer features into play, as well as templatematching, contextual disambiguation and other intensive procedures. New directions We explore three new directions: Multiple Objects: Previous work concerned coarse-to-fine (CTF) object representations and search strategies for single objects, and hence based entirely on pose aggregation. We extend this to hierarchies based on recursively partitioning both class and pose. Contextual Analysis: With multiple object classes, testing one specific explanation against another is eventually unavoidable, which means we need efficient, online tests for competing hypotheses. In particular, we derive edge-based tests for resolving one specific hypothesis (a character at a given pose) against another. October 9, 2003

DRAFT

5

Model-Based Framework: We introduce a statistical model for the edges which provides a unifying framework for the algorithms employed in all stages of the analysis, and which allows us to mathematically analyze the role of “spread edges” in balancing discrimination and computation during coarse-to-fine detection. These ideas are illustrated by attempting to read the characters appearing on license plates. Surprisingly, perhaps, there does not seem to be any published literature apart from patents. Several systems appear to be implemented in the US and Europe. For example in London cars entering the metropolitan area are identified in order to charge an entrance fee, and in France the goal is to estimate the average driving speed between two points. We have no way to assess the performance of these implementations. Our work was motivated by the problem of identifying cars entering a parking garage, for which current solutions still fall short of commercial viability, mainly due to the high level of clutter and variation in lighting. It is clear that for any specific task there are likely to be highly dedicated procedures for improving performance, for example only reporting plates with identical matches on two different photos, taken at the same or different times. Our goal instead is a generic solution which could be easily adapted to other OCR scenarios and to other object categories. In particular, we do not use any form of traditional, bottom- up segmentation in order to identify candidate object regions or jump start the recognition process. There are many well-developed techniques of this kind in the document analysis literature which are rather dedicated to specific applications; see for example the review Nagy (2000) or the work in LeCun et al. (1998). Related work on visual attention, CTF search, hierarchical templatematching and local ORing is surveyed in the following section. Our formulation of scene interpretation is given in III, followed in IV by a brief overview of the computational strategy. The statistical edge model is described in V, leading to a natural likelihood ratio test for an individual detection. Efficient search is the theme of VI- VIII. The spread edges are introduced in VI, including a comparison of spreading edges versus two natural alternatives – summing them and downsampling – and the discrimination/computation tradeoff is studied under a simple statistical model. How tests are learned from data and organized into a CTF search are discussed in VII and VIII respectively. In IX we explain how an interpretation of the scene is derived from the output of the detection stage. The application to reading license plates, including the contextual analysis, is presented in X and we conclude with a discussion of our approach in XI. October 9, 2003

DRAFT

6

II. R ELATED W ORK The division of the system into detection followed by interpretation is motivated by computational efficiency. More generally, our detection phase is a way of focusing attention, which is studied in both computer vision and in modeling biological vision. The purpose is to focus subsequent, detailed processing on a small portion of the data. Two frameworks are usually considered: task-independent, bottom-up control based on the “saliency” of visual stimuli (see e.g. Lindeberg (1993), Reisfeld et al. (1995), Privitera & Stark (2000), Itti & Koch (1998)); and task-driven, top-down control (see e.g. Ullman (1995), Amit & Geman (1999), Viola & Jones (2001), Navalpakkam & Itti (2003)). Our approach is essentially top-down in that attention is determined by the objects we search for, although the coarsest tests could be interpreted as generic saliency detectors. CTF object recognition is scattered throughout the literature. For instance, translation based versions can be found in Rowley et al. (1998), Kanade & Schneiderman (1998) and work on distance matching (Rucklidge (1995)). The version appearing in Geman et al. (1995) prefigures our work. Related ideas on dealing with multiple objects can be found in Amit (2002). In addition, CTF search motivated the face detection algorithm in Amit & Geman (1999) and was systematically explored in Fleuret & Geman (2001) based on a nested hierarchy of pose bins and in Blanchard & Geman (2003) based on an abstract theoretical framework. Variations have also been proposed in Socolinsky et al. (2002) and Viola & Jones (2001): whereas most poses are explicitly visited, computational efficiency is achieved by processing which is CTF in complexity. Whether CTF in pose or complexity or both, the end result is that intensive processing is restricted to a very small portion of the image data, namely those regions containing actual objects or object-like clutter. Work on efficient “indexing” based on geometric hashing (Lamdan et al. (1988)) and Hough transforms (Grimson (1990), Rojer & Schwartz (1992)) is also related. The issue of context is central to vision and several distinct approaches can be discerned in the literature. In ours, object detection is entirely non-contextual and is followed by global scene interpretation in conjunction with context. Other work involves “contextual priming” (Torralba (2002)) to overcome poor resolution by starting scene interpretation with an estimate of global context based on low-level features. In contrast, all scene attributes are discovered simultaneously in the compositional approach (Geman et al. (2002)) which could potentially offer a powerful

October 9, 2003

DRAFT

7

method of dealing with context and occlusion, but involves formulating interpretation as global optimization, which raises computational issues. There are strong connections between spreading local features and neural responses in the visual cortex. Responses to oriented edges are found primarily in V1, where so-called “simple” cells detect oriented edges at specific locations, whereas “complex” cells respond to an oriented edge anywhere in the receptive field; see Hubel (1988). In other words local “ORing” is performed over the receptive field region and the response of a complex cell can thus be viewed as a “spread edge.” Because of the high density of edges in natural images, the extent of spreading must be limited; too much will produce responses everywhere. Neurons in higher level retinotopic layers V2 and V4 exhibit similar properties, inspiring the work in Fukushima & Miyake (1982) and Fukushima & Wake (1991) about designing a neural-like architecture for recognizing patterns. In Amit (2000) and Amit & Mascaro (2003) spreading of more complex features is incorporated into a neural architecture for invariant detection and recognition. An extension to continuous-valued variables can be achieved with a ‘MAX’ operation, a generalization of ORing, as proposed in Riesenhuber & Poggio (1999). Recent work on hierarchical template matching using distance transforms, such as Gavrila (2003), is related to ours in several respects even though we are not doing template-matching per se. Local ORing as a device for gaining stability can be seen as a limiting, binary version of distance transforms such as the Chamfer distance (Barrow et al. (1977)). In addition, there is a version of CTF search in Gavrila (2003) (although only translation is considered based on multiple resolutions) which still has much in common with our approach, including edge features, detecting multiple objects using a class hierarchy and imposing a running null false negative constraint. Another approach to edge-based, multiple object detection appears in Olson & Huttenlocher (1997). Finally, in connection with spreading local features, Lowe (2003) has proposed another mechanism that allows for affine or 3D viewpoint changes or non-rigid deformations. The resulting “SIFT descriptor”, based on local histograms of gradient orientations, characterizes a neighborhood (in the Gaussian blurred image) around each individual detected key point, which is similar to “spreading” the gradients over a 4x4 region. A detailed comparison of the performance of SIFT with other descriptors can be found in Mikolajczyk & Schmid (2003)

October 9, 2003

DRAFT

8

Fig. 1.

Two images of back side of car from which license plate is read

III. S CENE I NTERPRETATION Consider a single, grey level image. In particular, there is no information from motion, depth or color. We anticipate a large range of lighting conditions, as illustrated in Figure 1 (see also Figure 4), as well as a considerable range of poses at which each object may be present. Moreover, we anticipate a complex background consisting partly of extended structures, such as clutter and non-distinguished objects, which locally may appear indistinguishable from the objects of interest.

be the raw intensity data on the image lattice . Each object of interest has a class and each instantiation (presentation in ) is characterized by a pose . Broadly speaking, the pose represents (“nuisance”) parameters which at least partially Let

characterize the instantiation. For example, one component of the pose of a printed character might be the font. In some contexts, one might also consider parameters of illumination. For simplicity, however, we shall restrict our discussion to the geometric presentation, and specifically (in view of the experiments on license plates) to position, scale and orientation. Much of what follows extends to affine and more general transformations; similarly, it would not be difficult

to accommodate parameters such as the font of a character.

the identity pose, namely

October 9, 2003

" #$ % &('

!

) % *,+ , and by - a reference sublattice of fits inside - . For any subset . of let the full image lattice such that any object at ./ 012(3 546 7.859 For a pose , let

be the translation,

the scale and

the rotation. Denote by

and

DRAFT

9

In particular,

-

is the “support” of the object at pose .

The set of possible interpretations for an image

is

6 where, obviously, represents the maximum number of objects in any given layout. Thus . The support of an interpretation is each interpretation has the form 80 6 6

denoted

- )

6 - 9

& % .

We write for the true interpretation, and assume it is unambiguous, i.e.,

Prior information will provide some constraints on the possible lists; for instance, in the case of the license plates we know approximately how many characters there are and how they are laid out. In fact, it will be useful to consider the true interpretation to be a random variable, , and to suppose that knowledge about the layout is captured by a highly concentrated prior distribution on

. Most interpretations have mass zero under this distribution and many interpretations in

its support, denoted by

3 & '

have approximately the same mass. Indeed

for simplicity we will assume that the prior is uniform on its support IV. OVERVIEW

OF THE

.

C OMPUTATIONAL S TRATEGY

What follows is summary of the overall recognition strategy. All of the material from this point to the experiments pertains to one of four topics: Statistical Modeling: In V we introduce a transformation of the intensity data array

/

and a likelihood model

"!#

$

for

to an edge

,

given a scene interpretation

This model motivates the definition of an image-dependent set %

'&

(

$

.

of detections

based on likelihood ratio tests. According to the invariance constraint, the tests are performed with null type I error, which in turn implies that principle). However, direct computation of %

&)%

with probability one (at least in

is highly intensive due to the loop over class/pose

pairs. Efficient Detection: The purpose of the CTF search is to accelerate the computation of % . This depends on developing a “test” *,+

the order of the test for a single pair

October 9, 2003

for an entire subset -

&

.

whose complexity is of

but which nonetheless retains some discriminating

DRAFT

10

power; see VI. The set %

-

is then found by performing * +

individual hypotheses in -

one-by-one only if * + is positive. This two-step procedure is then

first and then exploring the

easily extended (in VIII) to a full CTF search for the elements of % , and the computational gain provided by the CTF can be estimated. Spread Edges: The key ingredient in the construction of * + is the notion of a “spread edge” based on local ORing. Checking for a minimum number of spread edges provides a test for

the hypothesis

-

. The same spread edges are used for many different bins, thus

precomputing them at the start yields an important computational gain. In the Appendix the optimal domain of ORing, in terms of discrimination, is derived under the proposed statistical model and some simplifying assumptions. Global Interpretation: The final phase is choosing an estimate

competition between any two interpretations

&

%

for which

&

%

. A key step is a

- - , i.e., which

cover the same image region. The sub-interpretations must satisfy the prior constraints, namely

; see IX. A special case of this process is a competition between single detections

with different classes but very similar poses. (We assume a minimum separation between objects, in particular no occlusion.) The competitions once again involve likelihood ratio tests based on the edge model. V. DATA M ODEL We describe a statistical model for the possible appearances of a collection of objects in a scene as well as a crude model for “ background,” i.e., those portions of the image which do not belong to objects of interest. A. Edges The image data is transformed into arrays of binary edges, indicating the locations of a small number of coarsely-defined edge orientations. As usual, we begin with the local maxima of the gradient and take one of four possible directions and two polarities. There is a very low threshold on the gradient; as a result, several edge orientations may be present at the same location. However, these coarse edges have two important advantages: they are robust with respect to photometric variations and they provide the ingredients for a simple “background” model based on labeled point processes. Although the statistical models below are described October 9, 2003

DRAFT

11

in terms of the edges arrays, implicitly they determine a natural model for the original data, namely uniform over intensity arrays giving rise to the same edges. Still, we shall not be further concerned with distributions directly on .

be a binary variable indicating whether or not an edge of type is present at location . The type represents the orientation and polarity. The resulting family of . We still binary maps – transformed intensity data – is denoted by assume that / , i.e. is uniquely determined by the edge data.

Let

$

B. Probability Model

)

To begin with, we assume the random variables

,

independent given

$

are conditionally

. We offer two principal “justifications” for this hypothesis as well as

an important drawback:

1) Conditioning: In general, the degree of class-conditional independence among typical local features depends strongly on the amount of information carried in the “pose”

-

the more detailed the description of the instantiation, the more decoupled the features. In the case of printed characters, most of the relevant information (other than the font) is captured by position, scale and orientation. 2) Simplicity: In a Bayesian context, conditional independence leads to the “naive Bayes classifier,” a major simplification. When the dimensionality of the features is “large” relative to the amount of training data favoring simple over complex models (and hence sacrificing modeling accuracy) may be ultimately advantageous in terms of both computation and discrimination. 3) Drawback: The resulting “background model” is not realistic. The background is a highly complex mixture model in which nearby edges are correlated due to clutter consisting of parts of occluded objects and other non-distinguished structures. In particular, the independence assumption renders the likelihood of actual “background” data (see (4)) far

too small, and this in turn leads to the traditional MAP estimator

being unreliable. It

. Instead, we base the upcoming

is for this reason that we will not attempt to compute

likelihood ratio tests on thresholds corresponding to a fixed type I error learned from data, either by estimating background correlations or test statistics under object hypotheses.

October 9, 2003

DRAFT

12

46

"

-8

-8

Sample five

; in the image for one pose, ; the model edges, !"#$ , for the entire pose bin #&%(')"*,+.-!/ ; a partition of #0 into disjoint regions of the form 13254 ; the locations (black points) of actual edges and the domain of local ORing (grey strip), resulting in 6879 1;: = .

Fig. 2.

From left to right: The horizontal edge locations in the reference grid,

, we assume the objects have non-overlapping - - @? "A C B . Decompose the image lattice into - - D . supports, i.e. The region - ED represents “background”. Of course the image data over - D may be quite

For any interpretation

complex due to clutter and other regular structures, such as the small characters and designs which often appear on license plates. It follows that

N

/ 0 ) GFIHKJ"LM 0 6 GFOHKP : L 0 (1) . where we have written RQ for 7 ) TS for a subset S We assume that the conditional distribution of the data over each - depends only on , and hence the distribution of RFOHKP : L is characterized by the product of the individual

"!#

$

!

!#

&

(marginal) edge probabilities

(2) / + 77- 9 9 9 to indicate conditional probability given the event

where we have written

!

!

2 . Notice that (2) is well- defined due to the assumption of non-overlapping supports.

For ease of exposition we choose a very simple model of constant edge probabilities on a distinguished, class- and pose-dependent set of points. The ideas generalize easily to the case where the probabilities vary with type and location. Specifically, we make the following approximation: for each class

and for each edge type

there is a distinguished set

of locations in the reference grid at which an edge of type object

&

-

has high relative likelihood when

is at the reference pose (see Figure 2). In other words,

October 9, 2003

D

D

is a set of “model edges.” DRAFT

13

Furthermore, given object appears at pose , the probabilities of the edges at locations are given by:

/ + 1 )

where

!

if

2 D

(i.e.

-

46 ) D

otherwise

. Finally we assume the existence of a “background” edge frequency which is the

same as . From (1) and (2), the full data model is then

N

N

6

M : HKP : L

N

)

"!

H L +

N

FOH J L M N H L FIH P : L M : HKP : L

64

H L +

6 4 H L

H L +

6 4 H L

(3)

Under this model the probability of the data given no objects in the scene is

N N

0 5 )

"!

VI. D ETECTION : S EARCHING

Detection refers to compiling a list %

H L +

6 4 H L 9

FOR I NDIVIDUAL

(4)

O BJECTS

of class/pose candidates for an image

without

considering global constraints. The model described in the previous section motivates a very simple procedure for defining %

based on likelihood ratio tests. The snag is computation -

compiling the list by brute force computation is highly inefficient. This motivates the introduction of “spread edges” as a mechanism for accelerating the computation of % . A. Likelihood Ratio Tests

/, 6 6 %9 9 9 8 . We are going to compare the likelihood of the edge data under 0 to the likelihood of the same data under where is the same as except that one of the elements is replaced by the background interpretation, say 8 % %9 9 9 . Then using (3) and cancellation outside 6 : D 6 4 H L H L + / 0 N N (5) / 0 + M HKP L

Consider a non-null interpretation

!

#"

"!#

October 9, 2003

"!#

"

$

"

DRAFT

14

This likelihood ratio simplifies:

0 0

"!

where

"!

+

+

M HKP L

"

"

7

+

+

and the resulting statistic only involves edge data relevant to the class pose pair

6 6 . The log likelihood ratio test at zero type I error relative to the null hypothesis # (i.e., for class at pose ) reduces to a simple, linear test – evaluating D P 9 D P D P & + if D P D P ' otherwise where 9 P (6) D K H ; P L M and the threshold P is chosen such that P $' $' . Note that the sum is over a

*

D

D . We therefore seek the set %

*

D

!

0+ .

relatively small number of features, concentrated around the contours of the object, i.e. on the set

of all pairs

the individual object detections. Notice that

&"%

for which *

D P

) + . These are

Bayesian Inference: Maintaining invariance (no missed detections) means that we want to perform the likelihood ratio test in (5) with zero type I error. Of course computing the actual (model-based) threshold which achieves this is intractable and hence it will be estimated from

6 6

training data; see VII. Notice that threshold of unity in (5) would correspond to a likelihood ratio test designed to minimize total error; moreover,

implies that the ratio in

(5) must exceed unity. However, due to the severe underestimation of background likelihoods (due to the independence assumption), taking a unit threshold would result in a great many false positives. In other words, the thresholds that arise from a strict Bayesian analysis are far more conservative than necessary to achieve invariance. It is for these reasons that the model motivates our computational strategy rather than serving as a foundation for Bayesian inference. B. Efficient Search

/

Fix , let be a neighborhood of the identity pose We begin with purely pose-based subsets of

October 9, 2003

. Eventually we will consider more general

, 1

and put -

. Suppose we want DRAFT

15

to find all

and evaluate in -

D P

D P

for which *

)0+ . We could perform a brute force search over the set

for each element. Generally, however, this procedure will fail for all elements

since the background hypothesis is statistically dominant (relative to - ). Therefore it would

be preferable to have a computationally efficient binary test for the compound event - . If that test fails there is no need to perform the search for individual poses. For simplicity we assume that either the image contains only one instance from -

The test for '+ vs.

+

-

, or no object at all -

.

will be based on a thresholded sum of a moderate number of binary

features, approximately the same number as in equation (6). The test should be computationally efficient (hence avoid large loops and online optimization) and have a reasonable false positive rate at very small false negative rate. Note that the brute force search through as a test for the above hypothesis of the form * +

Let -

8 -

)

'

+

P / * D P

if

8

of all model edges of type

D

-

P /

8 -

for the poses in

9

(7)

This is shown on the first panel of Figure 2 for the class polarity and a set of poses

) +

otherwise

denote the set of image locations

:

can be viewed

for horizontal edges of one

consisting of small shifts and scale changes. Roughly speaking,

is merely a “thickening” of the -portion of the boundary of a template for class .

C. Sum Test One straightforward way to construct a bin test from the edge data is simply to sum all the detected model edges for all the poses in - , namely to define + P

9

P /

D

)

The corresponding test is then * +

should satisfy October 9, 2003

+

meaning of course that we choose

P /

+ if

+

* +

)$'

* +

M HKP;L

+

+

H+ L

and choose

!

if * +

(8)

(' . The threshold

& ' 9 DRAFT

16

The discrimination level of this test (type II error) +

) + 9 $

* +

!

We would not expect this test to be very discriminating. A simple computation shows that,

& +

under '+ , the probabilities of

for

8 , are all on the order of the background -

probabilities . Consequently, the null type I error constraint can only be satisfied by choosing a relatively low threshold +

, in which case +

might be rather large. In other words, in order

to capture all the objects of interest, we would need to allow many configurations of clutter (not to mention other objects) to pass the test. This observation will be examined more carefully later on. D. Spread Test A more discriminating test for -

can be constructed by replacing

H+ L

sum of “spread edges” in order to take advantage of the fact that, under imately how many on-object edges of type end, let

+

by a smaller

, we know approx-

to expect in a small subregion of

8 . To this -

be a neighborhood of the origin whose shape may be adapted to the feature type .

(For instance, for a vertical edge ,

might be horizontal strip.) Eventually the size of

depend on the size of - , but for now let us consider it fixed. For each spread edge of type

at location

to be

Thus if an edge of type

*

7

is detected anywhere in the

-shaped region “centered” at

will

, define the

.

and

it is

recorded at . (See Figure 2, fifth panel.) Obviously, this corresponds to a local disjunction of

elementary features. The spread edges stored. Define also Let

6 %9 9 9

be a set of locations whose surrounding regions

are pre-computed and

“fill”

-

in the

sense that the regions are disjoint and

6

<

!

&

8 9 -

See Figure 2. To further simplify the argument, just suppose these sets coincide; this can always be arranged up to a few pixels. In that case, we can rewrite + +

October 9, 2003

H+ L

&

6

as

9

DRAFT

17

Now replace

and +

* +

by

+

. The corresponding bin test is then +

satisfies

+

The type II error is

E. Comparison

Both * +

* +

and *

*

+

+

'

+

$

*

+

!

+

6

(9)

' 9

) + 9 !

+

require an implicit loop over the locations in

. The exhaustive test

requires a similar size loop (somewhat larger since the same location can be hit twice

by two different poses). However there is an important difference: the features

and

* + * + are significantly more efficient than * + . Since all tests are invariants for - (i.e., have null type I error for + vs ), the key issue is really one of discrimination - comparing

with + . Notice that as ! ! increases, the probability of occurrence of the features + increases, both conditional on + and conditional on . As a result, the effect of spreading can be computed offline and used for all subsequent tests. They are reusable. Thus the tests

on type II error is not entirely obvious. Henceforth we only consider rectangular sets

, which are

of length in the direction

orthogonal to the edge orientation and of length in the parallel direction. (See, for example,

6 6 , i.e. regions of just one pixel. Assume now that the set 8

panels 4,5 of Figure 2, and panels 3,4 in Figure 11.) Note that * + -

* +

if we take regions

has more or less fixed width

. In the Appendix we show, under simplifying assumptions, that: The test * +

with regions

In other words the smallest + for

6

is the most discriminating over all possible combinations .

2 7 + , and hence the optimal choice

is achieved with

is a single-pixel strip whose orientation is orthogonal to the direction of the edge type

6

and whose length roughly matches the width of the extended boundary

is very intuitive: Spreading - as opposed to summing - over a region at most one object edge for any instantiation in October 9, 2003

. This result -

that can contain

prevents off-object edges from contributing DRAFT

18

excessively to the total sum. Note that if

' , i.e. no off-object edges appear, then the two

tests are identical. For future use, for a general spread length , let refers to the optimal test using regions

6 .

7

"

9 Also *

+

now

F. Spreading vs. Downsampling A possible alternative for a bin test could be based on the same edge features, computed on blurred and downsampled versions of the original data. This approach is very common in many algorithms and architectures; see, for example, the successive downsampling in the feedforward network of LeCun et al. (1998), or the jets proposed in Wiskott et al. (1997). Indeed, low resolution edges do have higher incidence at model locations, but they are still less robust than spreading at the original resolution. The blurring operation smoothes out low-contrast boundaries and the relevant information gets lost. This is especially true for real data such as that shown in figure 4 taken at highly varying contrasts and lighting conditions. As an illustration we took a sample of the ‘A’ and produced 100 random samples from a pose bin involving shifts of pixels, rotations of

10 degrees, and scaling in each axis of

2

20%, (see Figure 3 left panel).

With spread 1 in the original resolution plenty of locations were found with high probability. For example in the middle two panels of figure 3 we show a probability map of a vertical edge type at all locations on the reference grid, darker represents higher probability. Alongside is a binary image indicating all locations where the probability was above .7. In the right two panels the same information is shown for the same vertical edge type from the images blurred and downsampled by 2. The probability maps were magnified by a factor of 2 to compare to the original scale. Note that many fewer locations are found of probability over .7. The structures on the right leg of the ‘A’ are unstable at low resolution. In general the probabilities at lower resolution without spread are lower than the probabilities at the original resolution with spread 1. G. Computational Gain

We have proposed the following two-step procedure. Given data , first compute the * the result is negative, stop, and otherwise evaluate * which must contain October 9, 2003

-

; moreover, either % +

D P

for each

7

or % +

%

-

-

+ ; if

. This yields a set % +

. DRAFT

19

Fig. 3.

Left: A sample of the population of A’s. Middle: Probability maps of a vertical edge type on the population of ‘A’s

alongside locations above probability .7. Right: Probability maps of a vertical edge type on the population of ‘A’s blurred and downsampled by 2, alongside locations above probability .7.

It is of interest to compare this “CTF” procedure to directly looping over - , which by definition results in finding % %

-

. Obviously the two-step procedure is more discriminating since % + &

. Notice that the degree to which we overestimate

-

-

will affect the amount of processing

to follow, in particular the number of pairwise comparison tests that must be performed for detections with poses too similar to co-exist in . As for computation, we make two reasonable assumptions: (i) mean computation is calculated

. (Recall that the background hypothesis is usually true.) (ii) the test has approximately the same computational cost, say , as * P . i.e., checking for a single * D hypothesis . As a result, the false positive rate of * + is then + . Consequently, direct search has cost !#- ! whereas the two-step procedure has (expected) cost !#- ! . + under the hypothesis

+

Measuring the computational gain by the ratio gives

A

+

! + !#!#-

which can be very large for large bins. In fact, typically has only two elements. There is some extra cost in computing the features

!

+

' 9 , so that we gain even if -

relative to simply detecting the original

edges . However since these features are to be used in many tests for different bins, they are computed once and for all a priori and this extra cost can be ignored. This is an important advantage of re-usability - the features that have been developed for the bin test can be reused in any other bin test. VII. L EARNING B IN T ESTS We describe the mechanism for determining a test for a general subset by

+

of

. Denote

and + , respectively, the sets of all poses and classes found in the elements of - . From

October 9, 2003

DRAFT

20

here on all tests are based on spread edges. Consequently, we can drop the superscript and simply write * + , + , etc.

for each )

8 ; the locations

For a general bin - , according to the definitions of to identify

D

.-

-

in (7) and * + in (9), we need

and the extent of the spread

1

edges appearing in + ; and the threshold + . In testing individual candidates -

(6), there is no spread and the points are given by the locations in be directly computed from the distinguished sets

D

is simple enough that we can

in the plate experiments (see X-A).

.-

D

D . This is the procedure adopted

do everything directly from the distinguished “model” sets

8 , can be difficult. It may be -

, and computing

more practical to directly learn the distinguished spread edges from a sample

P - Find all pairs ) 2

using

which we assume are derived from object

models, e.g., shape templates. In some cases the structure of -

In other cases identifying all

. These in turn can

+

of subimages

8 ' 9 . Start with spread 0+ . & + , where denotes

with instantiations from - . Fix a minimum probability , say

such that

an estimate of the given probability based on the training data

! '+

+

. If there are more than some

of these, we consider them a preliminary pool from which the ’s will

minimum number

be chosen. Otherwise, take and repeat the search, and so forth, allowing the spread to

.

increase up to some value If fewer than

such features with frequency at least

are found at

, we declare the bin

to be too heterogeneous to construct an informative test. In this case, we assign the bin trivial test * +

+ , which is passed by any data point. If, however,

the

features are found, we

prune the collection so that the spreading regions of any two features are disjoint.

5 8 and a set of feature/location pairs, say 7. , has (estimated) probability at least of being found on an

This procedure will yield a spread such that the spread edge

-

+

instantiation from the bin population. The basic assumption is that, with a reasonable choice of

5 8

and , the estimated spread -

will more or less correspond to the width of the set

+

.

Our bin test is then * +

+ +

+

H L

and + is the threshold which has yet to be determined.

/ $'

Estimating + is delicate, especially in view of our “invariance constraint” October 9, 2003

* +

!

+

(' , DRAFT

21

which is severe, and somewhat unrealistic, at least without massive training sets. There are

several ways to proceed. Perhaps the most straightforward is to estimate + based on the minimum value observed over

+

+

: + is

, or some fraction thereof to insure good generalization.

This is what is done in Fleuret & Geman (2001) for instance. An alternative is to use a Gaussian approximation to the sum and determine +

" .

the distribution of +

on background. Since the variables

correlated on background, we estimate a background covariance matrix

covariances between The matrices

and

under

!

are actually

whose entries are the

for a range of displacements !

are then used to determine + for any -

/ &0+

probabilities

+

based on

! .

as follows. First, estimate the marginal

based on background samples; call this estimate , which allows

for -dependence but is of course translation-invariant. The mean and variance of + are then estimated by

+

(10) 9 H L H L 4

Finally, we take where is, as indicated, independent of , i.e., is adjusted to obtain no false negatives

+

+

+

+

-

for every -

in the hierarchy. This is possible (at the loss of some discrimination) due to the

inherent background normalization. Of course, since we are not directly controlling the false positive error, the resulting threshold might not be in the “tail” of the background distribution. VIII. CTF S EARCH The two step procedure described in VI-B was dedicated to processing that portion of the scene determined by the bin -

1"

. As a result of imposing translation invariance, this

is easily extended to processing the entire scene in a two level search and even further to a multi-level search. A. Two-Level Search

D be the set of poses for which . For any D and any element denote by the set of class/pose pairs #35 & 1 25

Fix a small integer -

&

and let

!

!

-

-

October 9, 2003

DRAFT

22

namely all poses appearing in -

with positions shifted by . Thus +

H L

. Due to translation invariance we need only develop models for subsets of

; it is not essential that the elements of

/ D . Let

be

8 D be disjoint. In any case, assume that for each a test & ) & has been learned as in VII based on a set . of distinguished features. be the sublattice of the full image lattice based on the spacing : 6 . Let

a partition of

-

* +

+

+

+

Then the full set of poses is covered by shifts of the elements of

$

+

-

along the coarse sublattice:

9

In order to find all candidate detections we first loop over all elements -

, and for each

and perform the test where 8 6 9 For those subsets for which + , we loop over all individual explanations 1 and examine each one separately based on the likelihood ratio test P described in VI-A. D -

*,+

we loop over all

-

* +

*

B. Multi-Level Search

be a coarsening of the partition is the union of elements in , say

The extension to multiple levels is straightforward. Let of

where

D ; that is, each element -

+ &

If a subset -

-

+

-

. Perform the same loop with shifts as described above for the elements of

passes the associated test, loop over all elements in -

partition and perform the test for -

+

,

.

from the finer

, and so on. Note that the loop over all shifts in the

image is performed only on the coarse lattice for the top level of the hierarchy. A strategy of this type with six levels was carried out in Fleuret & Geman (2001) in order to detect faces for a purely pose-based hierarchy; each level was identified with a binary split of one aspect of the pose. C. Detections

on the image data. More precisely, 1 8

7 which of course depends if and only if + for every appearing

The result of such a CTF search is a set of detections % & "%

October 9, 2003

* +

-

DRAFT

23

1 . In other words, such a pair

in the entire hierarchy (i.e., in any partition) which contains

has been “accepted” by every relevant hypothesis test. If indeed every test in the hierarchy

had zero false negative error, then we would have

&

%

, i.e., the true interpretation would

only involve elements of % . In any case, we do confine future processing to % .

, then . Of course, constraints on learning and memory render this difficult when 8 is very

In general %' and % , the set of class/pose pairs satisfying the individual hypothesis test (6), are different. However, if the hierarchy goes all the way down to individual pairs &"% %

large. Hence, it may be necessary to allow the finest bins -

to represent multiple explanations,

although perhaps “pure” in class. Given that our hierarchy is a tree-structured family of classifiers - each one devoted to testing +

vs

for some -

&

- it might appear very similar to a traditional decision tree.

But there are major differences: first, in our CTF search, the input data (an image region) might reach more than one terminal node, thus resulting in multiple detections on a single image region; second, we are designing rather than learning the questions and their order of execution. IX. F ROM D ETECTION

TO I NTERPRETATION :

R ESOLVING A MBIGUITIES

We now seek the admissible interpretation &"%

with highest likelihood. In principle

we could perform a brute force loop over all subsets of %

. But this can be significantly

simplified.

6 , where 6 are two admissible interpretations whose concatenation gives 6 and that the supports of the two , and similarly let 2 6 . Assume that 6 interpretations are the same, i.e., - - , which implies that - - . Then, due to cancellation over the background and over the data associated with 6 , it follows Let

immediately from (3) that

/ 0 FOH J L 9 0 GFOH J L

"!#

"!

!#

!#

A. Individual Object Competition

%

differ by only one object, i.e., if . Thus we need to compare and , then the assumptions imply that

In the equation above if the two interpretations

October 9, 2003

DRAFT

24

the likelihoods on the two largely overlapping regions

-

and

- . This suggests that

an efficient strategy for disambiguation is to begin the process by resolving competing object detections in % with very similar poses. Different elements of % may indeed have very similar poses; after all, the data in a region can pass the sequence of tests leading to more than one terminal node of the CTF hierarchy. In principle one could simply evaluate the likelihood of the data given each hypothesis and take the largest. However, the estimated pose may not be sufficiently precise to warrant such a decision, and such straightforward evaluations tend to be sensitive to background noise. Moreover we

- , which may not coincide with - . A more robust approach is to perform likelihood ratio tests between pairs of hypotheses , and on the region - 2 - - so that the data considered is the same

are still dealing with individual detections and the data considered in the likelihood evaluation involves only the region

for both hypotheses. The straightforward likelihood ratio based on (3) and taking into account

cancellations is given by

F

F

!

!

M HKP;L

M H P L

M HKP L

M H P;L

+

+

+

+

+

+

(11)

B. Spreading the LRT

Notice that for each edge type

the sums range over the symmetric difference of the edge

D

D

supports for the two objects at their respective poses. In order to stabilize this log-ratio we restrict the two sums to regions where the two sets

and

are really different as

opposed to being slight shifts of one another. This is achieved by limiting the sums to

D

.8

respectively, where for any set , where in Figure 8.

October 9, 2003

D

. &

D

D

, we define the expanded version

(12)

. 1 3

is a neighborhood of the origin. These regions are illustrated

DRAFT

25

C. Competition Between Interpretations This pairwise competition is performed only on detections with similar poses

. It makes

no sense to apply it to detections with overlapping regions where there are large non-overlapping areas, in which case the two detections are really not “explaining” the same data. In the event of such an overlap it is necessary, as indicated above, to perform a competion between

2 6 6 9%9%9

6 6 %9%9%9

admissible interpretations with the same support. The competition between two such sequences

and

is performed using the same log likelihood

ratio test as for two individual detections. For edge type

J

$

6 D : 9

and each interpretation let

The two sums in equation (11) are now performed on

J

J

and

J

J

respectively. These regions are illustrated in Figure 9. The number of such sub-interpretation comparisons can grow very quickly if there are large chains of partially overlapping detections. In particular, this occurs when detections are found that straddle two real objects. This does not occur very frequently in the experiments reported below, and various simple pruning mechanisms can be employed to reduce such instances. X. R EADING L ICENSE P LATES Starting from a photograph of the rear of a car, we seek to identify the characters in the license plate. Only one font is modeled - all license plates in the dataset are from the state of Massachusetts - and all images are taken from more or less the same distance, although the location of the plate in the image can vary significantly. Two typical photographs are shown in Figure 1, illustrating some of the challenges. Due to different illuminations, the characters in the two images have very different stroke widths despite having the same template. Also, different contrast settings and physical conditions of the license plates produce varying degrees of background clutter in the local neighborhood of the characters, as observed in the left panel. Other variations result from small rotations of the plates and small deformations due to perspective projection. For example the plate on the right is somewhat warped at the middle and the size of the characters is about 25% smaller than the size of those on the left. Detecting the plate in the original photograph can be done using a very coarse, edge-based model for a set of characters at the expected scale and arranged on a line. We omit the details October 9, 2003

DRAFT

26

Fig. 4.

Top: The subimages extracted from the images in Fig. 1 using a coarse detector for a sequence of characters. Middle:

Vertical edges on left plate; horizontal edges on the right plate. Bottom: Spread vertical edges on the left plate; spread horizontal edges on the right plate.

because this is not the limiting factor for this application. Once the plate is detected, a subimage is extracted and processed using the proposed methodology. The subimages extracted from the two images of Figure 1 are shown at the top of Figure 4. The mean spatial density of edges in the subimage then serves as an estimate for , the background edge probability, and we estimate

in (10) by . In this way, the thresholds + for the bin tests are adapted to the data, i.e.,

image-dependent. A. The CTF Hierarchy

defined as follows: 9 ) + 9 ; +"'

Since the scale is roughly known, and the rotation is generally small, we can take

!

!

degrees; !

!

$ D ,

(i.e., confined to a

window). There are 37 classes defined by the prototypes (bit maps) shown in Figure 5. A

simple clustering of the prototypes yields the pure-class hierarchy shown in Figure 5; not shown are the root (all classes) and the leaves (individual classes). The class/pose hierarchy starts with October 9, 2003

DRAFT

27

A B C D G J O QS U 04568

J U S5

A46

A4

EFHK LMNR

6

S5

BCDG OQ08 KR

UJ

B D GO BD

Fig. 5.

H K NR

GO

I TY 12|

NH

EF LM EF

IT Y| LM

C 0 Q 8 C0

TY

PWV XZ 379

1

PZ 79

2

I|

Z7

V X

W3

P9

Q8

Top: The 37 prototypes for characters in Massachusetts license plates. Bottom: The class hierarchy. Not shown is the

root and the final split into pure classes.

D , where is a set in the class hierarchy. Each bin in the last layer is then of the form ,1" D and is split into + sub-bins corresponding to two scale ranges ( 9 + and + + 9 ) and to nine determined by . (overlapping) windows inside the the same structure - there is a bin -

corresponding to each +(

+

-

!

The spreading is determined as in Section VII and the sets

D

!

are computed directly from the

D are constructed by taking all edge/location pairs that belong to all classes in at the reference pose. The spread is not allowed under . A subsample because we can anticipate the “width” of based on the range of poses in D character templates. The tests for bins -

+

-

+

of all edge/location pairs is taken to ensure non-overlapping spreading domains. This provides the set

.

+

described in Section VII. There is no test for the root; the search commences with

the four tests corresponding to the four subnodes of the root because merging any of these four with spread The subsets

.

+

produced very small sets

.

+

. Perhaps this could be done more gradually.

for several bins are depicted in Figure 6. For the sub-bins described above,

which have a smaller range of poses, the spread is set to

. Moreover since this part of

the hierarchy is purely pose-based and the class is unique, only the highest scoring detection is retained for % . B. The Detection Stage We have set

so that the image (i.e., subimage containing the plate) is scanned at the

coarsest level every 5 pixels, totaling approximately October 9, 2003

"+ ' ' '

tests for a plate subimage of size DRAFT

28

Fig. 6.

The sparse subset

'

@*

of edge/location pairs for some of a few bins

'E#

#

in the hierarchy. From left to right: 45 degree

@*

'

)"* , 45

edges on the cluster , 90 degree edges on cluster , 0 degree edges on cluster

'E

degree edges on cluster

Fig. 7.

* , and 135 degree edges on cluster ';6 * .

Top left: coarse level detections. Top right: fine level detections. Bottom left: remaining fine level detections pruning

based on vertical alignment. Bottom right: final interpretation after competitions.

'

7+ +"' . The outcome of this stage is shown in the top left panel of Figure 7; each white dot

represents a

window for which one of the four coarsest tests (see Figure 5) is positive at that

shift. If the test for a coarse bin -

passes at shift , the tests at the children -

are performed at

shift , and so on at their children if the result is positive, until a leaf of the hierarchy is reached. Note that, due to our CTF strategy for evaluating the tests in the hierarchy, if the data

+

reach a leaf - , then necessarily *!

* ) +

the condition * +

for every ancestor

"$#

-

in the hierarchy; however,

by itself does not imply that all ancestor tests are also positive. The set

& ) +

of all leaf bins reached for which *,+

(equivalently, the set of all complete chains) then

constitutes % . Each such detection has a unique class label

(since leaves are pure in class),

but the pose has only been determined up to the resolution of the sub-bins of October 9, 2003

do

D . Also, there

DRAFT

29

Fig. 8.

Competition between

%

% )

and

at a location on the plate. From left to right: In grey

M 7 ; Same for class ; Locations in M M M 7 where an edge is detected;

white

"

"

M 7 "

M , in "

where an edge is detected; Locations in

"

can be several detections corresponding to different classes at the same or nearby locations. The set of locations in % is shown in the top right panel in Figure 7. The pose of each detection

in %' is refined by looping over a small range of scales, rotations and shifts and selecting the with the highest likelihood, that is, the highest score under *

D P .

C. Interpretation: Prior Information and Competition The set % consists of several tens to several hundred detections depending on the complexity of the background and the type of clutter in the image. At this point we can take advantage of the a priori knowledge that the characters appear on a straight line by clustering the vertical coordinates of the detected locations and using the largest cluster to estimate this global pose parameter. This eliminates some false positives created by combining part of a real character with part of the background, for example part of small characters in the word “Massachusetts” at the top of the plate; see the top right panel of Figure 7. Among the remaining detections we perform the pairwise competitions as described in IX.

6

This is illustrated in Figure 8. We show a region in a plate where both a “3” and a “5” were

detected. For one type of edge - vertical - we show in grey the regions

D

6

D

"

(left) and

, and in green

(middle). The white areas illustrate a “spreading” of these regions as defined in IX.

the locations in

D

D "

6 , where an edge isD detected.

The rightmost panel shows in blue those locations in

"

D

After the pairwise competitions there are sometimes unresolved chains of overlapping detections. It is then necessary to perform competitions, as described in IX, between valid candidate subsequences of the chain. A valid subsequence is one which does not have overlapping October 9, 2003

DRAFT

30

Fig. 9.

Sequence competition. First panel: detected classes on a subimage - a chain with labels ,R, ,K, ,R, ,I. Second panel:

7 for the subsequence “ ”. Third panel: 7 O 7 O 7 . the symmetric difference

the sets

Fig. 10.

and

for the subsequence ‘ K ’. Last panel:

Examples of false positives.

characters, and is not a subsequence of a valid subsequence. This last criterion follows simply from (5). In Figure 9 we show a region in a plate where a chain of overlapping detections was found. The regions

J J

-

for one competing subsequence (“

”) are shown in first panel,

for another (“ ! .! ”) in the second panel, and the resulting symmetric difference in the last panel. D. Performance Measures Classification rate: We have tested the algorithm on 520 plates. The correct character string is found on all but 17 plates. The classification rate per symbol is much higher - over 99%. Most of the errors involve confusions between However, there are also false positives, about

'

and ! and between

and .

in all the plates combined, including a small

number in the correctly labeled plates, usually due to detecting the symbol “ ! ” near the borders of the plate. Other false positives are due to pairs of smaller characters as in Figure 10. We have not attempted to deal separately with these in the sense of designed dedicated procedures for eliminating them. Computation time: The average classification time is 3 1Mghz laptop. Approximately October 9, 2003

+ 9

9

seconds per photograph on a Pentium

seconds is needed to obtain the set % via the CTF search. DRAFT

31

The remaining

+9

seconds is devoted to refining the pose and performing the competitions.

Of interest is the average number of detections per bin in the tree hierarchy as a function of the level, of which there are five not including the root. For the coarsest level (which has four bins) there are, on average, 183 detections per bin per plate, then 37, 29 and 18 for the next three levels, and finally 4 for the finest level. On average, the CTF search yields about 150 detections per plate. If the CTF search is initiated with the leaves of the hierarchy in Figure 5, i.e., with the pairwise clusters, the classification results are almost the same but the computation time doubles and detection takes about

seconds. Therefore, approximately the same amount of time is

devoted to the post-detection processing (since the resulting % is about the same). This clearly demonstrates the advantage of the CTF computation. XI. D ISCUSSION We have presented an approach to scene interpretation which focuses on the computational process, dividing it into two rather distinct phases: a search for instances of objects from multiple classes which is CTF, context-independent, and constrained by minimizing false negative error, followed by arranging subsets of detections into global interpretations using context and modelbased competitions to resolve ambiguities. Spread edges are the key to producing efficient tests for subsets of classes and poses in the CTF hierarchy; they are reusable, and hence efficient, common on object instantiations, and yet sufficiently discriminating against background to limit the number of false detections. Spreading also serves as a means to stabilize likelihood ratio tests in the competition phase. The experiments involve reading license plates. In this special scenario there is exactly one prototype shape for each object class, but the problem is extremely challenging due to the multiplicity of poses, extensive background clutter and large variations in illumination. The CTF approach can be extended to multiple prototypes per class, e.g., multiple fonts in OCR, as well as to situations in which templates do not exist (e.g., faces) and the tests for class/pose bins are learned directly from sample images. Furthermore, the framework could be extended from edges to more complex features having much lower background probabilities. Other issues that are currently being explored are hypothesis tests against specific alternatives (rather than “background”); inducing CTF decompositions directly from data; and sequential October 9, 2003

DRAFT

32

learning techniques such as incrementally updating CTF hierarchies, and refining the tests, as additional classes and samples are encountered. R EFERENCES Amit, Y. (2000), ‘A neural network architecture for visual selection’, Neural Computation 12, 1059–1082. Amit, Y. (2002), 2D Object Detection and Recognition, M.I.T. Press. Amit, Y. & Geman, D. (1999), ‘A computational model for visual selection’, Neural Computation 11, 1691–1715. Amit, Y. & Mascaro, M. (2003), ‘An integrated network for invariant visual detection and recognition’, Vision Research. in press. Barrow, H., Tenenbaum, J. M., Boles, R. C. & C., W. H. (1977), Parametric correspondence and chamfer matching: two new techniques for image matching, in ‘Proc. Intl. Joint Conf. on Artificial Intell.’, pp. 659–663. Blanchard, G. & Geman, D. (2003), Hierarchical testing designs for pattern recognition, Technical report, University of Paris Orsay. Fleuret, F. & Geman, D. (2001), ‘Coarse-to-fine face detection’, Inter. J. Comp. Vision 41, 85–107. Fukushima, K. & Miyake, S. (1982), ‘Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position’, Pattern Recognition 15, 455–469. Fukushima, K. & Wake, N. (1991), ‘Handwritten alphanumeric character recognition by the neocognitron’, IEEE Trans. Neural Networks 2, 355–365. Gavrila, D. M. (2003), Multi-feature hierarchical template matching using distance transforms, in ‘Proc. IEEE ICPR ’98’. Geman, S., Manbeck, K. & McClure, E. (1995), Coarse-to-fine search and rank-sum statistics in object recognition, Technical report, Brown University. Geman, S., Potter, D. & Chi, Z. (2002), ‘Composition systems’, Quarterly J. Appl. Math. LX, 707–737. Grimson, W. E. L. (1990), Object Recognition by Computer: The Role of Geometric Constraints, MIT Press, Cambridge, Massachusetts. Hubel, H. D. (1988), Eye, Brain, and Vision, Scientific American Library, New York. Itti, L. & Koch, C. amd Niebur, E. (1998), ‘A model of saliency-based visual attention for rapid scene analysis’, IEEE Trans. PAMI 20, 1254–1260. Kanade, T. & Schneiderman, H. (1998), Probabilistic modeling of local appearance and spatial relationships for object recognition, in ‘CVPR’. Lamdan, Y., Schwartz, J. T. & Wolfson, H. J. (1988), Object recognition by affine invariant matching, in ‘Proc. IEEE Conf. on Computer Vision and Pattern Recognition’, pp. 335–344. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998), ‘Gradient-based learning applied to document recognition’, Proceedings of the IEEE 86(11), 2278–2324. Lindeberg, T. (1993), ‘Detecting salient blob-like image structures and their scales with a scale space primal sketch: a method for focus-of-attention’, Inter. J. Comp. Vision 11, 283–318. Lowe, D. (2003), Distinctive image features from scale-invariant keypoints, Technical report, University of British Columbia. Mikolajczyk, K. & Schmid, C. (2003), A performance evaluation of local descriptors, in ‘Proc. IEEE CVPR ’03’, pp. 257–263. Nagy, G. (2000), ‘Twenty years of document image analysis’, IEEE PAMI 22, 38–62. Navalpakkam, V. & Itti, L. (2003), Sharing resources: buy attention, get recognition, in ‘Proc. Intl Workshop on Attention and Performance in Comp. Vision (WAPCV ’03)’. October 9, 2003

DRAFT

33

Olson, C. F. & Huttenlocher, D. P. (1997), ‘Automatic target recognition by matching oriented edge segments’, IEEE Trans. Image Processing 6(1), 103–113. Privitera, C. M. & Stark, L. W. (2000), ‘Algorithms for defining visual regions-of-interest: comparison with eye fixation’, IEEE Trans. PAMI pp. 970–982. Reisfeld, D., Wolfson, H. & Yeshurun, Y. (1995), ‘Context-free attentional operators: The generalized symmetry transform’, Inter. J. Comp. Vision 14, 119–130. Riesenhuber, M. & Poggio, T. (1999), ‘Hierarchical models of object recognition in cortex’, Nature Neuroscience 2, 1019–1025. Rojer, A. S. & Schwartz, E. L. (1992), A quotient space hough transform for scpae-variant visual attention, in G. A. Carpenter & S. Grossberg, eds, ‘Neural Networks for Vision and Image Processing’, MIT Press. Rowley, H. A., Baluja, S. & Kanade, T. (1998), ‘Neural network-based face detection’, IEEE Trans. PAMI 20, 23–38. Rucklidge, W. (1995), Locating objects using the hausdorff distance, in ‘Proc. Intl. Conf. Computer Vision’, pp. 457–464. Socolinsky, D. A., Neuheisel, J. D., Priebe, C. E., De Vinney, J. & Marchette, D. (2002), Fast face detection with a boosted cccd classifier, Technical report, Johns Hopkins University. Torralba, A. (2002), Contextual priming for object recognition, Technical report, A.I.Lab, Massachusetts Institute of Technology. Ullman, S. (1995), ‘Sequence seeking and couter streams: a computational model for bidirectional information flow in the visual cortex’, Cerebral Cortex 5, 1–11. Viola, P. & Jones, M. J. (2001), Robust real-time face detection, in ‘Proc. ICCV01’, p. II: 747. Wiskott, L., Fellous, J.-M., Kruger, N. & von der Marlsburg, C. (1997), ‘Face recognition by elastic bunch graph matching’, IEEE Trans. on Patt. Anal. and Mach. Intel. 7, 775–779.

A PPENDIX Recall from VID that our goal is to determine the optimal domain of ORing for a bin of the

1" 7

form -

under our statistical edge model.

A. Simplifying Assumptions

To simplify the analysis suppose the class is a square. In this case, there are two edge types

8 -

of interest - horizontal and vertical - and a corresponding set of model edge locations each one. Suppose also that

captures only translation in an

+

for

neighborhood of the origin;

scale and orientation are fixed. This is illustrated in Figure 11. The regions are rectangles, for

O

2 + %9 9%9 . When 2 + , a detected edge is spread to a

and

strip oriented perpendicular to the direction of the edge; for instance, for a vertical edge, an edge detected at

is spread to a horizontal strip of width 1 and length centered at . See the third

and fourth panels of Figure 11 for two different region shapes corresponding to and

October 9, 2003

.

7 +

DRAFT

34

; The range of translations of the square; The model square "%$'& with !!% and # % = , centered around points 1E: ; The same with !!% and

Fig. 11. From left to right: The model square with the region

#$

with the region

,%

&)(

#

tiled by regions

:

"

We restrict the analysis to a single , say vertical, and drop the dependence on ; the general result, combining edge types, is then straightforward. Define

3 * : 7 + 9 The thresholds are chosen to insure a null type I error and we wish to compare the type II errors, 6 6 , for different values of and . Note that and 6 . The are taken 8 ; see Figure to be disjoint and for any choice of and their union is a fixed set 8 11. Thus the smaller or the larger the number of regions. We also assume that the image either contains no object or it contains one instance of the object at some pose . Fix and and let denote the number of regions in . Since is the width of the region . , we have -, , where , is the number of regions used when 2 + . Let . , and . Note that we assume each pose hits the same number, , of regions. Conditioning on we have / H 46 L 9 + + + 0 / + ) / + + 2 9 1 0 #3 and + 129 This implies that / + * + 4 12 but the

+

6

+

+

+

-

+

!

$

$

6

October 9, 2003

-

-

!

)

61

!

+

. Furthermore

+

!

variables are not independent given 5

&

+

6 1 ) +

!

$

41/ +

71

9

DRAFT

35

Since the conditional expectation does not depend on 5 +

9 5 6

! '+

we have

610

+

61 9

$

The conditional variance is also independent of , and since variance of the conditional expec-

'

tation is :

+

9

6

& ) + + 61 + 1 9 5 1 and we have 1 , and !

+

On background the test is binomial B. The Case

0+

+ ,

This case, although unrealistic, is illuminating. Since variable added to the constant

1

+

1

9

is a non-negative random

. Thus the largest possible zero false negative threshold is

for + , since we are simply replacing parts of the sum by maxima. Since is independent of , it follows that . , assume (i) 6 and (ii) + + . Then 6 6 . As a Proposition: For

. For any fixed we have

6

result,

In particular, the test

+

+

$

6 6 6 9 is the most efficient.

Note: The assumptions are valid within reasonable ranges for the parameters

9'

, say 9 ' +

%+ ' 9 . Proof: When we have and, under , the statistic is binomial - %+ + 6 6 is binomial - , . Using the normal approximation to the binomial and the statistic 6 6 ,+ , , 6 + , 6 + 6 , + + and, similarly, + + 6 + 6 + 6 9 ,+ + + + + + Now

+ + + + ,

+ + + in the third. The where we have used (ii) in the first inequality, (i) in the second and and

$

$

$

$

$

$

$

$

$

result follows directly from this inequality. October 9, 2003

DRAFT

36 8 2.5

7

q=.01

2 k=1

5 4 3

1.5 q=.02

R(s,k)

R(s,1)

6

q=.03

1 0.5 k=2

q=.04

0

q=.05

2

−0.5

1 0

5

10

15

S

Fig. 12.

= !% )

(

Left: plot of for #

C. The Case For

0! # .

, &% &

%

5

(

"

%R= #$%R=

(

=

for

$+

10

15

S

(

) . Right: plot of 0! # , %

we cannot guarantee no false negatives. Instead, let +

+ , making the event $' very unlikely under normal approximation, the error is a decreasing function of 5 5 + + - ) 9

a

+

%>=

&

k=3

−1 0

5

+

For general

% (

"

%

and choose

. Again, using the

we do not attempt analytical bounds. Rather, we provide numerical results for

the range of values of interest:

+

+

&

9' +

9' , 9

+ , +"' , ! ' , + , and ' + 9 % 9 ' + 9 ' , and

+"' . In Figure 12, we show plots for the values , 2 + . The conclusions are the same as for + : is decreasing in in the range + and increasing for 6 6 for any $+ . The optimal test is , corresponding to + .

.

+

October 9, 2003

DRAFT

Computational Strategies for Model-Based Scene Interpretation

Computational Strategies for Model-Based Scene Interpretation

Suggest Documents

Scene Interpretation for Lifelong Robot Learning - IDA.LiU.se

A modelbased PID controller for Hammerstein

Control of Scene Interpretation - Center for Machine Perception (CMP)

Generation of Rules from Ontologies for High-Level Scene Interpretation

Toward Laser Pulse Waveform Analysis for Scene Interpretation

automatic incremental model learning for scene interpretation - KOGS

Feature Extraction and Scene Interpretation for Map-Based Navigation

Computational Auditory Scene Analysis: Principles, Algorithms, and

Modelbased analysis of skin conductance

A Computational Paradigm for Three Dimensional Scene Analysis

Computational strategies for flexible multibody systems - CiteSeerX

Parallelization Strategies for Computational Fluid Dynamics Software ...

Computational neuroimaging strategies for single ...

COMPUTATIONAL STRATEGIES FOR ANALYZING DATA IN GENE ...

Negotiation Strategies for Autonomous Computational Agents

Efficient computational strategies for Bayesian social networks

Computational strategies for dissecting the high-dimensional

Learning Optimal Dialogue Strategies - Association for Computational

LNCS 7786 - Computational Strategies for Skin

viewpoint quality and global scene exploration strategies

Towards a Computational Interpretation of Physical Theories

A Computational Interpretation of the -calculus - CiteSeerX

A Computational Model of Metaphor Interpretation

Rice Consortium for Computational Seismic Interpretation - Rice ECE