POP: Patchwork of Parts Models for Object Recognition

Yali Amit∗ and Alain Trouvé†

April 4, 2005



∗ Yali Amit is with the Department of Statistics and the Department of Computer Science, University of Chicago, Chicago, IL 60637. Email: [email protected]. Supported in part by NSF ITR DMS0219016.
† Alain Trouvé is with the CMLA at the École Normale Supérieure, Cachan, and the LAGA at Université Paris XIII. Email: [email protected].


Abstract

We formulate a deformable template model for objects with a clearly defined mechanism for parameter estimation. A separate model is estimated for each class, and classification is likelihood based - no discrimination boundaries are learned. Nonetheless, high classification rates are achieved with small training samples. The data models are defined on binary oriented edge features that are highly robust to photometric variation and small local deformations. The deformation of an object is defined in terms of the locations of a moderate number of reference points. Each reference point is associated with a part - a probability map assigning a probability for each edge type at each pixel in a window. The likelihood of the edge data on the entire image conditional on the deformation is described as a patchwork of parts (POP) model - the edges are assumed conditionally independent, and the marginal at each pixel is obtained by a patchwork operation: averaging the marginal probabilities contributed by each part covering the pixel. Object classes are modeled as mixtures of POP models that are discovered sequentially as more class data is observed. Experiments are presented on the MNIST database, hundreds of deformed LaTeX shapes, reading zipcodes, and face detection.

1 Introduction

Two directions of research - classification and detection - have dominated the field of shape and view based object recognition. The first, classification, refers to the discrimination between several object classes based on segmented data (see Vapnik (1995), Amit & Geman (1997), LeCun et al. (1998), Hastie & Simard (1998), Belongie et al. (2002)), and the second, detection, to finding instances of a particular object class in large images (see Leung et al. (1995), Rowley et al. (1998), Viola & Jones (2004), Amit & Geman (1999), Fei-Fei et al. (2003), Torralba et al. (2004)). The latter is often considered a problem of classification between object and background. Both subjects are viewed as building blocks towards more general algorithms for the analysis of complex scenes containing multiple objects.


1.1 The challenge of scene analysis

An important question is how to integrate the detectors and the classifiers into one framework when multiple objects of different classes, at times very similar in shape, are present in the image together with noise and clutter. One can imagine running detectors for several object classes at low false negative rates. This will typically yield quite a large number of false positives as well as multiple hits (for different detectors) in the same region. It is then necessary to classify among these and eliminate false positives. Furthermore, if several objects can be present in the scene, one needs to choose among multiple candidate interpretations, i.e. different assignments of labels and poses for a number of objects. It is hard to imagine performing this task based on pre-trained classifiers among interpretations, since it is impossible to anticipate all possible configurations in training. This calls for an online mechanism to determine the optimal configuration. The same issue would arise if bottom-up segmentation (Shi & Malik (2000)) or saliency detection (Itti & Koch (1998)) is used to determine candidate regions or locations of the objects of interest. Competing segmentations/classifications need to be resolved. In Amit et al. (2004) a simple example of such an approach is developed in a limited context. The basic building blocks are simple “naive Bayes” statistical models for the data around each individual object, assuming independence conditional on the pose or deformation parameters. Modularity is achieved by concatenating data models of the individual objects at their respective poses together with a simple background model, yielding a model for the entire object configuration or scene. Competitions between alternative candidate configurations are resolved by comparing the likelihood of the data under the two configurations.
There are no pre-trained decision boundaries, and yet it is possible to make very fine distinctions between very similar objects. The use of conditional independence models leads naturally to (i) an efficient coarse-to-fine mechanism for computing candidate detections for the different objects, (ii) localization of the computation needed for choosing between different interpretations to regions of ambiguity. In the application presented in Amit et al. (2004) - reading license plates - geometric variation is limited to a small range of linear variations and each class is defined in terms of a single prototype. The main challenge is the presence of clutter and a high degree of photometric variation. In view of the relative success of this approach to dealing with scenes containing multiple objects, it is of primary interest to investigate the possibility of extending these models to highly deformable objects. In this paper we propose such an extension, using the same binary edge features and within the framework of conditional independence models. This is briefly described below.
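The comparison of candidate scene interpretations by likelihood under a conditional independence model can be sketched as follows. This is a hypothetical simplification (a single edge type, a hard object/background partition of pixels, and an illustrative function name), not the implementation in Amit et al. (2004):

```python
import numpy as np

def config_log_likelihood(edges, prob_maps, p_bgd=0.05):
    """Log-likelihood of a binary edge map under a scene configuration.

    edges:     (H, W) binary array of detected edges (one edge type).
    prob_maps: list of (prob_map, (row, col)) pairs, one per placed
               object; each prob_map is an (h, w) array of marginal
               edge probabilities at the object's pose.
    Pixels not covered by any object fall back to the background rate.
    """
    H, W = edges.shape
    p = np.full((H, W), p_bgd)
    for pm, (r, c) in prob_maps:
        h, w = pm.shape
        p[r:r + h, c:c + w] = pm          # object model overrides background
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.sum(edges * np.log(p) + (1 - edges) * np.log(1 - p))
```

Two competing interpretations - different assignments of labels and poses - are then resolved by comparing their log-likelihoods on the same edge data; no pre-trained decision boundary between configurations is needed.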

1.2 Edge based deformable models

We start with the ‘Bernoulli Model’ - a general edge based deformable model proposed in Amit (2002). A class of non-linear deformations is defined together with model probability maps on a reference grid. The assumption is that given only one object is present in the image at a particular deformation, all edges in the image are independent, and the marginal probability of finding an edge of a certain type at a pixel is computed in terms of the deformation and the model probability maps. To implement this model in practice it is necessary to (i) parameterize the non-linear deformations, (ii) develop a mechanism for estimating the probability maps from training data, where the deformation is not known, (iii) devise an algorithm for computing the most likely deformation given the estimated probability maps and a test image. In this paper we propose such an implementation which lends itself to efficient training and testing. A moderate number of reference points is defined in the reference grid. Deformations are defined in terms of a mapping of these reference points into the image grid. Each reference point is associated to a part, which is simply a subwindow of the model probability map around the reference point. These subwindows/parts are not disjoint; they can have significant overlaps. The part is mapped together with the reference point into the image grid. The distribution of the edge data on the entire image given a particular mapping of the reference points and their associated parts is again described by a conditional independence model obtained with a patchworking operation: the marginal probability of an edge feature at any pixel in the image is the simple average of the probabilities proposed at that pixel by any of the parts covering it. This ties together the parts in a consistent way into a patchwork of parts (POP) model for the data. Figure 1 provides an illustration of


Figure 1: (A) The probability map for horizontal edges for the class ‘7’ (dark corresponds to high probability). In red are the reference points. (B) Three subwindows of the probability map centered at three of the reference points. (C) A sample ‘7’. (D) Detected horizontal edges. (E) The shifts of the reference points denoted by small red arrows, together with the support of the subwindows - darker pixels are covered by more subwindows. (F) The probability map on horizontal edges determined by the patchwork operation and the shifts given in (E).

these ideas. One advantage of this patchwork model is that the individual parts can be trained separately. The location of a part is assumed to vary within a restricted range in the reference grid, i.e. relative to the object center at reference scale. The specific location of the part associated to each training image is unknown and is considered as an unobserved variable. The expectation-maximization (EM) procedure is then applied to estimate the marginal probabilities at each point in the window, as well as to get the distribution on the location of the part. This yields a collection of parts - the local subwindows of the probability map. Placing these at their mean locations and performing the patchwork operation yields the full model probability map on the reference grid. Each object class is modeled as a mixture of such POP models. These are discovered sequentially by keeping track of training examples that have low likelihood with respect to all current components of the mixture for a particular class. Classification is likelihood based - the class with highest likelihood is chosen. For detection the likelihood ratio to an online adaptive background model is compared to a threshold. In this setting no training is performed for the background model. It is simply assumed that all edges are independent and the marginal probabilities of each type are the same at all pixels and determined online by the density of these edge types in the particular image region being analyzed.
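The EM estimation of a single part with an unobserved location can be sketched as below. This is a hypothetical simplification - one edge type, a uniform prior over a small set of candidate shifts, and an illustrative function name - whereas the paper also estimates the distribution on the part location:

```python
import numpy as np

def em_part_map(images, shifts, win, n_iter=10):
    """EM estimate of a part's edge-probability map when its location
    in each training image is an unobserved variable.

    images: list of (H, W) binary edge maps (one per training example).
    shifts: list of (row, col) candidate top-left positions of the part
            window (the restricted range of part locations).
    win:    (h, w) size of the part window.
    """
    h, w = win
    q = np.full((h, w), 0.5)                  # initial probability map
    for _ in range(n_iter):
        acc = np.zeros((h, w))
        for img in images:
            # E-step: posterior over shifts under the Bernoulli model
            lls = []
            for (r, c) in shifts:
                patch = img[r:r + h, c:c + w]
                lls.append(np.sum(patch * np.log(q) +
                                  (1 - patch) * np.log(1 - q)))
            lls = np.array(lls)
            post = np.exp(lls - lls.max())
            post /= post.sum()
            # M-step contribution: posterior-weighted training patches
            for p_s, (r, c) in zip(post, shifts):
                acc += p_s * img[r:r + h, c:c + w]
        q = np.clip(acc / len(images), 1e-3, 1 - 1e-3)
    return q
```

In practice the initialization of q matters: starting from patches extracted at a nominal location keeps the posterior from drifting toward empty windows in the first iterations.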

1.3 Summary of results

The proposed models allow for a simple and efficient training procedure from very small sample sizes and yield high classification rates - far higher than achievable with SVMs on the same features. For example, with 30 examples per class on the MNIST dataset we achieve 3% error on the test set (6% error with SVMs), reaching 1.52% error with 500 examples per class, where in effect only 80-100 examples were actually used to update the model parameters through sequential learning. Using a different clustering mechanism and with 1000 examples per class we achieve under 0.8% error. Preliminary experiments for reading zipcodes by combining object models into scene models (as in Amit et al. (2004)) yield a recognition rate of 82%, with the correct zipcode being in the top 10 interpretations for 94% of the zipcodes. These are very high rates considering that only 500 examples per class are used to train the models, and no presegmentation or preprocessing is performed. To test the relevance of these models to other object types we train face models from 300 Olivetti face database images. The likelihood ratio with respect to the adaptive background model is used as a filter on detections of the algorithm described in Amit (2002). The reduction in false positives is by a factor of 20-30. The resulting false positive rates at given false negative rates are about 3-5 times higher than the state of the art (see e.g. Viola & Jones (2004), Schneiderman & Kanade (2004)), but given the simplicity of the test and the lack of training on background we believe it is evidence of the generality and usefulness of the proposed models.
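The likelihood-ratio filter against the online adaptive background model can be sketched as follows. This is an illustrative simplification to one edge type; the function name and clipping constants are assumptions:

```python
import numpy as np

def log_likelihood_ratio(edges, model_p):
    """Log-likelihood ratio of an object model against an adaptive
    background model on a candidate region.

    edges:   (h, w) binary edge map in the candidate region.
    model_p: (h, w) object model marginal edge probabilities.
    The background probability is not trained: it is the empirical edge
    density of the region itself, recomputed online for each candidate.
    """
    p_bgd = np.clip(edges.mean(), 1e-3, 1 - 1e-3)
    mp = np.clip(model_p, 1e-3, 1 - 1e-3)
    ll_obj = np.sum(edges * np.log(mp) + (1 - edges) * np.log(1 - mp))
    ll_bgd = np.sum(edges * np.log(p_bgd) + (1 - edges) * np.log(1 - p_bgd))
    return ll_obj - ll_bgd
```

A detection is kept when this ratio exceeds a threshold; since the background rate adapts to the local edge density, textured clutter does not automatically produce high ratios.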


1.4 Paper organization

The paper is organized as follows. In section 2 we survey related work on object modeling and classification. In section 3 the full POP model is described in detail, followed by the training procedure in section 4. Issues involving computation are discussed in section 5. In section 6 we provide details on a number of experiments. In addition to handwritten digit classification we experiment with deformed LaTeX symbols, where there are hundreds of classes, with reading zipcodes, and with face detection. We conclude in section 7 with a discussion of some possible extensions of the algorithm.

2 Other work

We describe related work in the literature, which can be broadly divided between modeling (generative) frameworks on one hand and non-parametric discriminative frameworks on the other.

2.1 Modeling frameworks

By modeling frameworks we refer to those approaches where the basis for classification is a statistical model of the data given each object class. These are also denoted generative models, although rarely can these models generate realistic images.

2.1.1 Deformable templates

Our model falls squarely in the category of deformable template models proposed in Grenander (1978), Grenander (1993). Two attractive features of the deformable template paradigm are the formulation of a comprehensive model for the image data and the explicit modeling of the geometric variability (rigid and non-rigid) of objects within a given category. In particular, the ability to tie together a dense field of very basic and poorly informative features (grey levels, local edges) within a global model is the key to the wide-spread use of this paradigm in the medical imaging community. There has recently been significant progress in modeling large deformations (see Beg et al. (2005)), photometric variability (see Trouvé & Younes (2004)) or multimodality (see Roche et al. (2000)), providing a promising technology for detailed modeling and analysis of shape and appearance variability (see Grenander & Miller (1998)). However, in the object recognition community, modeling is not a goal per se but a path to object detection and classification. Speed, robustness, and learnability of the parameters play a major role in the birth and death of object recognition paradigms. Despite the demonstrated usefulness of the template matching approach (see Jain et al. (1998), Coughlan et al. (2000)), a persistent weakness of dense modeling approaches is the computational cost of the matching process (though one should not underestimate the computational cost of many other combinatorial exploration schemes) and, more importantly, the very incomplete statistical scheme for estimating the template and the geometric model when the deformations are unobserved. In the deformable template framework training requires simultaneous estimation of the template distribution as well as the deformation of each training image. This is of course a chicken and egg problem: one needs the template model to compute the deformations, and one needs the deformations to estimate the template distribution. In Burl et al. (1998), a part decomposition of the global grey level template is proposed, leading to a generative model for grey level images by a mechanism similar to ours: a geometric arrangement of ‘rigid’ parts whose locations are drawn from a given distribution. Conditional on this arrangement, the pixel values are assumed independent; those on the support of a part are given by a specific Gaussian appearance model, those in the background by a generic i.i.d. Gaussian model. However, to ensure a well defined distribution overlaps are forbidden. A similar type of model involving local ‘rigid’ models, i.e. parts, but based on edges can be found in Crandall et al. (2005).
The advantage here is the use of edges, which are photometrically invariant. It is very difficult to write a workable model directly on the gray level values which is photometrically invariant. Moreover an i.i.d. constant mean model for background is not very useful. However the preclusion of overlap is a drawback of both approaches in that large areas of the object are modeled as background, leading to a loss in precision and discriminatory power. Furthermore the ‘gaps’ render unsupervised training very problematic. Indeed in Crandall et al. (2005) the centers of the parts are given by the user on the training images.


With respect to these approaches the main contribution of the POP models is twofold: (i) a simple and efficient training procedure for the templates and the distribution on geometry, (ii) the formulation of a dense data model explaining all edge data on the object, allowing for fine discrimination between similar shapes.

2.1.2 Deformable nearest neighbor approaches

In Hastie & Simard (1998) a nearest neighbor approach is proposed with a distance that depends both on a deformation parameter and a gray level distance measure. There is no explicit definition of a statistical data model, but since sums of squares are used as image distance, they are implicitly using a model where, conditional on the deformation, the gray level values are independent Gaussian with fixed variance and mean provided by the template. This approach is problematic for generic gray level data since it does not address photometric variability, and the deformations are restricted to the affine class. There is an attempt to obtain more efficient classification by estimating ‘templates’ from clusters of training examples; this is not explicitly formulated as a statistical estimation problem, although many of the ingredients are present. In Wiskott et al. (1997) a distance based approach is used for face recognition, with a geometric cost defined in terms of the distortions of a graph, and a data distance defined in terms of the distance between the outputs of a library of Gabor type filters - jets. Recognition is also based on nearest neighbors. In Belongie et al. (2002) the image data is transformed to a labeled point process where each point is characterized by a rather complex set of measurements describing the distribution of edges in its neighborhood. The labels or features are designed to be robust to a wide range of deformations of the object as well as photometric variability.
Once the data is reduced to a point process, the deformation is expressed as a point match, thus greatly enriching the space of allowable deformations. Again the distance is defined as a combination of a geometric penalty and a measure of how well the labeled features match. These nearest neighbor approaches, each of which has been highly successful, can be viewed as assigning a template to each training example, thus requiring intensive computation and extensive memory. The distances are not explicitly formulated in a statistical deformable template framework and are hence somewhat ad-hoc. One conclusion of this paper is that statistical modeling and estimation procedures yield compact and efficient representations of the shape ensembles (e.g. handwritten digits, faces, etc.) where distances are defined in a principled manner in terms of likelihoods.

2.1.3 Sparse models

Our approach shares some elements with sparse models based on interest point detection (see Burl et al. (1995), Amit & Kong (1996)). In Fei-Fei et al. (2003), the authors propose a constellation model where the variability within an object class is represented by the joint distribution on the geometrical arrangement and the local appearance of a fixed small family of interest points. It is important to underline that in these approaches one is no longer modeling the distribution of a dense set of features such as local edges or gray level values. Rather the image data is transformed to a sparse collection of points using local filters that fire with low probability on generic background. The statistical model is for the transformed data, which can be viewed as a sparse labelled point process. In Fei-Fei et al. (2003) a principled probabilistic model is proposed for the transformed data together with a detection and classification mechanism based on computing the maximum posterior on constellations. Furthermore the learning algorithm is formulated assuming the interest points are unobserved. The primary difference with respect to the models proposed here is that the sparse models do not provide a generative model for the raw data but for the point process, which is computed prior to the object detection and classification steps. In particular, no top-down process driven by the current pose hypothesis in the detection step is allowed to look for missing interest points or more informative ones.
Moreover, the average number of points detected in an image as well as the total number of points involved in an object model need to stay extremely small to avoid a combinatorial explosion in the learning process and in the detection/classification steps. Such a small number of parts, and such drastic filtering of the original data, will not be sufficient for discriminating between very similar classes, as arising in character recognition problems, or to achieve very low false positive rates in detection problems.

By contrast the models proposed here directly provide a ‘dense’ data model for the observed edge maps everywhere in an image and explicitly deal with the issue of the overlap of local edge models. This allows us to make detailed discriminations. Indeed sparse data models can be viewed as approximations to the dense model that can help streamline computation.

2.1.4 Dense representations

As indicated, ours is a dense deformable template model. Since classification or detection require the estimation of the deformation, the end result is not only a class label or a location of the object in the scene but an explicit map of the model into the image. A by-product is the identification of an object support, at the level of edges, in terms of those locations in the image where the model probabilities are different from background. This is illustrated in figure 9 for zipcodes and figure 15 for faces. In Borenstein et al. (2004) object representations are explicitly learned in order to accurately define a support or a figure-ground segmentation. Their representation is also defined in terms of a collection of overlapping parts, and in each part the region corresponding to object or background is learned. The authors use a gray level data representation, but there is no explicit statistical model for the data, and multi-class problems or multi-object scene problems cannot be solved directly in terms of probability ratios. Furthermore these models are learned in a discriminative framework - object against background. In Leibe & Schiele (2003) and Leibe & Schiele (2004), a probabilistic Hough transform based on scale-invariant interest points and scale adaptive mean-shift is proposed for object detection and object/background segmentation. Each interest point casts a ‘probabilistic’ vote for different (object/location/scale) hypotheses.
The use of predetected interest points puts this algorithm in the sparse category described above in that the data is transformed to a labeled point process prior to processing. However the authors also propose a method for determining a dense object support. The interest points on a detected object use a learned ‘support probability map’ relative to the point location to cast votes for points as object supports. This approach, although based on local patches and a learned support, again does not stem directly from a clearly defined statistical model for the image data (object + background) or for the distribution of interest points and their local appearance. The statistical formulation proposed here allows us to extend such approaches to deal with competitions between scene interpretations containing several deformable classes of similar shapes, as in the zipcode experiments.

2.1.5 Comprehensive scene models

The models described above mainly describe the data around one or several known objects, assuming at best a very simple model for off-object data. Others have attempted to develop more comprehensive models for the ‘background’. In Geman et al. (2002) the authors argue for a hierarchy of compositions of increasingly complex elements - reusable parts - leading to a natural likelihood based choice of the optimal interpretation. In their proposal an interpretation involves not only the objects and their poses but an assignment of part labels to structures in the background that are not associated to any object. Despite offering a comprehensive modeling framework, there remain significant challenges from the point of view of training and efficient computation. In Tu et al. (2004) a comprehensive model for scenes is proposed within the Markov random field framework, which is richer than the conditional independence models used here. However, such models are very complex and are known to pose a significant challenge in training and in computation. The authors propose a sequence of simplifications and approximations to circumvent these obstacles, which involve the introduction of classification and detection methods as initialization techniques. Finally there have been attempts to apply statistical models in the particular context of handwritten character recognition, see for example Parizeau & Plamondon (1995), Revow et al. (1996), Kim & Kim (2003); these however are rather dedicated to the particular application.

2.2 Discriminative methods

The most successful methods reported in the literature for discrimination between highly variable object classes involve various types of non-parametric classifiers that aim directly


at estimating decision boundaries as opposed to data models for each class. For some examples covering a range of techniques see Vapnik (1995) (SVMs), Amit & Geman (1997) (randomized decision trees), LeCun et al. (1998) (feedforward neural networks), and the nearest neighbor approaches of Hastie & Simard (1998) and Belongie et al. (2002). These techniques yield very high classification rates on the MNIST dataset LeCun (2004). Their success is due to a combination of factors including novel learning algorithms, judicious choice of invariant representations (features) and, closely related, a careful choice of distance measures that incorporate certain geometric invariances. Non-parametric classification techniques that directly learn decision boundaries are typically trained in batch form. Nearest neighbor algorithms can accumulate additional training examples sequentially; however, one is forced to store all training data in memory and computation is formidable. Indeed nearest neighbor methods are used more as a proof of the power of certain sets of features or certain chosen distances. Both the kernel based (SVM) methods and the sequential boosting methods use the more difficult examples to define the decision boundaries. The former actually uses these examples - among others - to define the dividing hyper-plane in the high dimensional implicit feature space. The latter retrains classifiers, typically decision trees, with higher weight on misclassified examples. As will be seen below, the mixture of POP models provides a simple and natural framework for sequential updating without need to store previously observed examples. In terms of scalability to large numbers of classes, most experiments are performed with several tens of classes at most.
In Amit & Geman (1997) some experiments were performed with several hundred classes, but since the classifiers (decision trees in this particular case) are multi-class it is necessary to train in the presence of all the classes in the first place. On the other hand, when the classifiers are produced as multiple object/non-object classifiers (the standard form for SVMs or boosting), the definition of the non-object population is of primary importance, and the relative weighting between the different object/non-object classifiers is a sensitive issue. A closely related problem is how to integrate such classifiers as modular building blocks for resolving ambiguities between different scene interpretations involving several objects. Indeed the relative weighting in the form of likelihoods is one of the direct advantages of the statistical modeling approach. Finally it is often the case that the decision boundaries determined by these classifiers are not easily visualized, nor can they be easily adapted. The work in LeCun et al. (1998) develops a full system for reading strings of handwritten digits where segmentation and classification are fully integrated both in the training stage and in the recognition stage. However the underlying principle in training is optimizing over large numbers of parameters for the correct labeling of training images of object sequences, hence the need for very large training sets.

2.3 Discovering parts There has been interest recently in finding local features “parts” at various scales which are discriminative for objects against background. See for example Viola & Jones (2004), Ullman et al. (2002), Torralba et al. (2004), Kremp et al. (2002), and Bernstein & Amit (2004). In some of these approaches features are discovered using discriminative rules relative to a background population (see Viola & Jones (2004), Ullman et al. (2002), Schneiderman & Kanade (2004)). In Torralba et al. (2004) the features are jointly discovered for several classes again using a discriminative procedure. Note that in Ullman et al. (2002) and Schneiderman & Kanade (2004) after the features are determined, a conditional Bayes model is used to classify against background. In Bernstein & Amit (2004) the approach is entirely model based, no discriminative criterion is used to choose the features or create the object models. In these models the local features are computed in a feedforward fashion from the edges or the pixel intensities, and the distribution on the original data is never modeled, nor is the original edge data ever revisited. To some extent it is possible to view these part based models as an approximation to the full model on the edge data with the hidden deformation parameter. We believe that certain ambiguities can only be resolved with highly localized simple features at the original resolution, and therefore it is important to be able to write reasonable statistical models at this level which incorporate geometric information. Perhaps most of the analysis can be done at the level of complex features with minimal geometric information, but for some tasks the original detail is important. This is consistent with the ideas proposed in Ahissar & Hochstein (2002), where it is suggested that lower level features 14

Figure 2: A sample digit image with edge maps for four of the eight orientations. The first two are horizontal with opposite polarities, and the last two are vertical with opposite polarities. In black are the original edge locations; in gray are the locations after spreading.
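The 3 × 3 spreading shown in gray above amounts to a binary dilation of each edge map; a minimal sketch for one edge type (the function name is illustrative):

```python
import numpy as np

def spread_edges(edge_map):
    """Spread each detected edge to its 3x3 neighborhood (binary
    dilation), providing robustness to small local deformations.

    edge_map: (H, W) integer binary array for one edge type.
    """
    h, w = edge_map.shape
    out = np.zeros_like(edge_map)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            # OR the edge map shifted by (dr, dc) into the output,
            # clipping the shift at the image borders
            src = edge_map[max(0, -dr):h - max(0, dr),
                           max(0, -dc):w - max(0, dc)]
            out[max(0, dr):h - max(0, -dr),
                max(0, dc):w - max(0, -dc)] |= src
    return out
```

The same operation would be applied independently to each of the eight oriented edge maps before any model probabilities are evaluated.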

at higher spatial resolution, for example those computed in V1, are used in full detail only at a later stage for refined disambiguation. Of course in the initial bottom-up processing they are used in a feedforward fashion to compute more complex features stored on coarser grids.

3 From Bernoulli models to Patchworks of parts

The data model we propose is based on a set of binary oriented edge features (see Amit & Geman (1999)), computed at each point in the image, which is defined on a grid L. We write X = {X_e(x) | x ∈ L, e = 1, ..., E}, where E = 8, corresponding to 8 orientations at increments of 45 degrees. These features are highly robust to intensity variations. Each detected edge is spread to its immediate 3 × 3 neighborhood. This spreading operation is crucial in providing robustness to small local deformations, and greatly improves the performance of any classifier implemented on the data. In figure 2 we show edge maps for four orientations on a sample image from MNIST. In black are the original edges and in gray are the locations after spreading. The advantage of focusing on such features is their applicability to many types of problems: character recognition, detection and recognition of 3D objects, medical imaging (see Amit (2002)), as well as reading license plates from rear photos of cars (see Amit et al. (2004)). The Bernoulli model in its most general form assumes that there exists a model probability map p_{e,y}, y ∈ G, on a reference grid G, and a set Θ of admissible instantiations of the object with some distribution p(θ). Given only one object is present in the image at

instantiation θ, we assume that the edges in the image are conditionally independent with marginal probabilities at each point x given by (θp_e)(x). In different situations we can use different definitions of the instantiations θ and the operation θp. For example, if θ represents a set of mappings of the entire grid G into the image lattice, we can write

    θp_e(x) = P(X_e(x) = 1 | θ) = { p_{e,θ^{-1}x}   if θ^{-1}x ∈ G,
                                    { p_{e,bgd}       otherwise,            (1)

where p_{e,bgd} is a generic background probability for edge type e. The problem with this model is that the space of mappings is very large, and this poses serious challenges both for estimating the probability maps p_{e,y} when θ is not observed and for computing the most likely θ in testing. A major simplification occurs when we summarize the instantiation as the mapping θ of a moderate number of reference points y_1, ..., y_n in the reference grid into the image lattice. Let z_i = θy_i, i = 1, ..., n, and with some abuse of notation let θ = (z_1, ..., z_n). This does not specify a full map of the entire reference grid. Instead we use a subwindow W to define parts Q_i - subwindows of the full probability map:

    Q_{i,e,s} = p_{e,y_i+s},   e = 1, ..., E,  s ∈ W,

namely the local subwindows of the probability map around y_i. Now imagine that the part Q_i is 'moved' to the point z_i. Edges at points in the image lattice that are not covered by any of the windows z_i + W get assigned background probabilities. Edges at points covered by one or more of these parts are assigned the average of the probabilities contributed by each of them, see figure 1. Specifically, for each x ∈ L let

    I(x) = {i : x ∈ z_i + W}

be the set of shifted reference points whose W neighborhood covers x (see figure 3 (C) for examples of I). For each element i ∈ I(x) the marginal probability assigned to X_e(x) is Q_{i,e,x−z_i} = p_{e,y_i+x−z_i}. We keep the main assumption that in the global model all the variables X_e(x) are independent conditional on θ. The marginal probability at each point x is then given by

    θp_e(x) = P(X_e(x) = 1 | θ) = { (1/|I(x)|) Σ_{i∈I(x)} Q_{i,e,x−z_i}   if |I(x)| > 0,
                                    { p_{e,bgd}                             if |I(x)| = 0,       (2)

Figure 3: (A) The probability map for horizontal edges for a 2 model, with the reference points y_i in red. (B) For two different 2's, the original image. (C) The estimated map θ, shown as red arrows indicating the shift of each reference point, together with a visualization of I(x): the darker a pixel, the more patches cover it. (D) The resulting global POP model (for horizontal edges) obtained by patching the shifted local models.

where p_{e,bgd} is a generic background probability for edge type e. Thus we are proposing a patchwork of the local models using the pointwise average of all local submodels covering the point x - hence the term patchwork of parts (POP) model. The global POP model conditional on the points θ is then

    P(X|θ) = P(X|z_1, ..., z_n) = ∏_x ∏_e [θp_e(x)]^{X_e(x)} [1 − θp_e(x)]^{1−X_e(x)},        (3)

with θp_e(x) defined in (2). Let θ̄ = (y_i)_{i=1,...,n}, i.e. the original reference points. The original probability map p_{e,y}, y ∈ G is given by θ̄p_e(y); since the windows have not been moved, the probabilities in the average in (2) are all the same and equal to p_{e,y}. In the left panel of figure 3 we show the probability map estimated for class '2' (A). For sample '2'-s we show the original image (B); the overlaps of the patches - the function I(x) - and for each shifted patch a red arrow pointing from the original reference point y_i to the new location z_i (C); and the global POP model determined by θ (D). Note how a combination of shifts of local models can accommodate a variety of deformations of the original probability map.
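As a concrete sketch of the patchwork operation (2) and the likelihood (3), the following code handles a single edge type, under simplifying assumptions of our own: square part windows centered at their placements, and placements that keep every window inside the image (the paper allows a general subwindow W):

```python
import numpy as np

def patchwork_marginals(shape, parts, z, half, p_bgd):
    """Eq. (2) for a single edge type: average the probability maps of
    all shifted parts covering each pixel; background probability where
    no part covers. parts[i] is a (2*half+1)^2 window centered at z[i]."""
    prob_sum = np.zeros(shape)
    cover = np.zeros(shape)                       # |I(x)| per pixel
    for Q, (zy, zx) in zip(parts, z):
        h, w = Q.shape
        y0, x0 = zy - half, zx - half
        prob_sum[y0:y0 + h, x0:x0 + w] += Q
        cover[y0:y0 + h, x0:x0 + w] += 1
    return np.where(cover > 0, prob_sum / np.maximum(cover, 1), p_bgd)

def pop_log_likelihood(X, marg, eps=1e-9):
    """Eq. (3): log-likelihood of the binary edge map X under
    independent Bernoulli marginals marg (same shape as X)."""
    p = np.clip(marg, eps, 1 - eps)               # guard against log(0)
    return float(np.sum(X * np.log(p) + (1 - X) * np.log(1 - p)))
```

Overlapping windows simply average, so a pixel covered by two parts with probabilities 0.8 and 0.4 gets marginal 0.6.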

3.1 Modeling the deformations

We assume a joint Gaussian distribution f(θ) = f(z_1, ..., z_n) on the locations, with means given by the reference points θ̄ = (y_i)_{i=1,...,n}. These are highly correlated, and the covariance matrix is summarized through a set of 2n principal eigenvectors v_k, k = 1, ..., 2n and variances λ_1 ≥ λ_2 ≥ ... ≥ λ_{2n}. Typically f will depend on the particular class being modeled. On top of these class variations we assume a uniform distribution g(σ) on affine maps over a limited range, centered around the identity map. The distribution g is the same for all classes. The joint distribution on edges and deformations is summarized as

    P(X, σ, θ) = P(X|σθ) f(θ) g(σ),                                        (4)

where P(X|σθ) = P(X|σz_1, ..., σz_n) is given by equation (3). All together we now have a generative model for the edge features: sample a configuration θ according to f(θ), sample an affine map σ according to g(σ), and place each local model Q_i at its corresponding point σz_i. At each point x ∈ L compute σθp_e(x) according to equation (2), and draw X_e(x), x ∈ L, e = 1, ..., E independently.
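The sampling of θ can be sketched as follows, assuming the eigenvector/variance summary of the Gaussian is precomputed and omitting the affine map σ (our simplification):

```python
import numpy as np

def sample_deformation(ref_pts, eigvecs, variances, rng):
    """Sample theta = (z_1, ..., z_n): a Gaussian around the reference
    points, with covariance summarized by principal eigenvectors v_k
    (columns of eigvecs, shape (2n, K)) and variances lambda_k."""
    K = eigvecs.shape[1]
    coeffs = rng.standard_normal(K) * np.sqrt(variances[:K])
    disp = eigvecs[:, :K] @ coeffs        # 2n-dimensional displacement
    return ref_pts + disp.reshape(-1, 2)  # (n, 2) deformed locations
```

With all variances zero the sample reduces to the reference points themselves, i.e. the mean configuration θ̄.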

3.2 Class clusters and Classification

One POP model may not be sufficient to describe the population of a given class. There are qualitative differences in shape between, say, different instances of the digit 7 that cannot be accommodated through local deformations of the parts. We thus model each class c = 1, ..., C as a mixture

    P_c(X) = Σ_{d=1}^{D_c} P(d) ∫ P_{c,d}(X|σθ) f_{c,d}(θ) g(σ) dθ dσ,       (5)

of POP models P_{c,d}(·|θ), d = 1, ..., D_c, each with a different distribution f_{c,d} on deformations. One approach to classification would proceed by finding the class c that maximizes the marginal likelihood of the observed data given by equation (5). This integration can

be prohibitive; furthermore, it can be useful to have an instantiation associated with the classification. We therefore choose to classify by maximizing

    Ŷ = argmax_c max_{1≤d≤D_c} max_θ max_σ P_{c,d}(X|σθ) f_{c,d}(θ) g(σ).     (6)

In other words, assuming a uniform prior on classes, classification is obtained by taking the class label from the maximum a-posteriori class model and instantiation. In the next section we describe how the probability model for one cluster is estimated. In section 4.4 we explain the mechanism for estimating the different clusters for each class, and in section 5 we explain how to efficiently compute the above maximum.
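A sketch of the decision rule (6), assuming each cluster model is wrapped as a scoring function that already maximizes the penalized log-likelihood over θ and σ (a hypothetical interface, not the authors' implementation):

```python
def classify(X, class_models):
    """class_models[c] is a list of callables, one per cluster d,
    each returning max over theta, sigma of
    log[P_{c,d}(X|sigma theta) f_{c,d}(theta) g(sigma)].
    Returns the class achieving the overall maximum, as in eq. (6)."""
    best_c, best_score = None, float('-inf')
    for c, clusters in class_models.items():
        for score_fn in clusters:
            s = score_fn(X)
            if s > best_score:
                best_c, best_score = c, s
    return best_c
```

Note that no discrimination boundary is involved: each class model is scored independently and the largest score wins.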

4 Training

We describe the training of one POP model from a given training set. One could envisage simultaneously training the mixture of POP models for each class, but we prefer to discover the mixture components sequentially, as described in section 4.4. A priori we do not know which reference points y_i to choose, nor do we know the probability map. Intuitively a reference point is one around which similar non-trivial structures occur for different instances of the object. Furthermore we want the collection of windows z_i + W to cover all structures of interest on the object, possibly with some redundancy. We therefore start with a fixed grid of start points x_i ∈ G, i = 1, ..., n, and search in the neighborhood of each such point for a possible part, i.e. a local probability map around a reference point y_i. The full probability map is never directly estimated; rather, each y_i, Q_i is estimated separately around each start point x_i. The estimate of the full probability map is then given by equation (2), where we set z_i = y_i, i = 1, ..., n. This is denoted the mean global POP model: the estimated parts are patched together with locations at the estimated reference points y_i. We return later to the question of whether this indeed can capture the original probability map that generated the data.

Figure 4: Left: Illustration of the estimation procedure. The subwindows x + U and x + W centered at start point x for three sample training images I^{(1)}, I^{(2)}, I^{(3)}. The common underlying structure is found in image I^{(j)} at z^{(j)} + W (dashed line), where the shift is given as t^{(j)} = z^{(j)} − x ∈ V. Right: After alignment and averaging the correct model emerges.

4.1 Training one part model

Since we are dealing with one part around one start point, we remove subscripts, and to further simplify notation we assume only one binary feature type X(x), x ∈ L. Let Z denote the random variable describing the location of the point of interest in a training image X. Given a start point x we consider only data X_{U+x} from a fixed subwindow U + x, where W ⊂ U, and we assume that Z + W ⊂ x + U. Namely, the random shifts of the region Z + W always fall within x + U. Let T = Z − x; the shifts T are hidden variables taking values in a finite square neighborhood V of the origin, see for example figure 4. We

assume that given T the likelihood of the data is

    P(X_{U+x} | T, {p_s}_{s∈W}, p_bgd) = ∏_{s∈W} p_s^{X(x+T+s)} (1 − p_s)^{1−X(x+T+s)}
                                         · ∏_{s∈U\(x+T+W)} p_bgd^{X(s)} (1 − p_bgd)^{1−X(s)},      (7)

i.e. the variables X(s), s ∈ U are independent given T, and we ignore the fact that the window x + U may contain additional on-object points beyond the set Z + W. The marginal distribution on T is denoted π. The unknown parameters are therefore Q = {p_s, s ∈ W}, p_bgd, and π. The log-likelihood of a set of m training images X^{(j)} with observed shifts T^{(j)} has a unique maximizer at

    π̂(t) = (1/m) Σ_{j=1}^m 1_{{T^{(j)}=t}},   t ∈ V,

    Q̂_s = (1/m) Σ_{j=1}^m X^{(j)}(x + T^{(j)} + s),   s ∈ W,

    p̂_bgd = (1/(m(|U| − |W|))) Σ_{j=1}^m Σ_{s ∉ x+T^{(j)}+W} X^{(j)}(s).

And we can set the corresponding reference point y to be

    y = x + Σ_{t∈V} t π̂(t).

However, since we do not observe T^{(j)}, the likelihood of the observed data X^{(j)} has the form of a mixture over all possible shifts:

    P(X_{U+x} | Q, p_bgd, π) = Σ_{t∈V} π(t) P(X_{U+x} | t, Q, p_bgd).

We are now in the classical setting of estimating the parameters of a mixture distribution. An important unique feature of the present setting is that the distributions of the components of the mixture are 'shifts' of each other; thus more data can be pooled to estimate the parameters. The standard method for finding a local maximum of the likelihood is the EM algorithm, see Dempster et al. (1977), which we describe below, assuming for simplicity that the start point x = 0. We generate iterative estimates Q^{(ℓ)}, p_bgd^{(ℓ)}, π^{(ℓ)} as follows.

1. Initialize

    Q_s^{(0)} = (1/m) Σ_j X^{(j)}(s),   s ∈ W,

    p_bgd^{(0)} = (1/(m(|U| − |W|))) Σ_j Σ_{s∈U\W} X^{(j)}(s),

    π^{(0)}(t) = 1/|V|,   t ∈ V.

2. For each training point j and shift t ∈ V, compute

    P(t | X_U^{(j)}, Q^{(ℓ)}, p_bgd^{(ℓ)}, π^{(ℓ)}) = P(X_U^{(j)} | t, Q^{(ℓ)}, p_bgd^{(ℓ)}) π^{(ℓ)}(t) / Σ_{t'} P(X_U^{(j)} | t', Q^{(ℓ)}, p_bgd^{(ℓ)}) π^{(ℓ)}(t').

A simplification occurs if the numerator and denominator are divided by ∏_{s∈U} p_bgd^{X(s)} (1 − p_bgd)^{1−X(s)}, yielding

    P(t, X_U, Q, p_bgd, π) ∝ π(t) ∏_{s∈W} (p_s / p_bgd)^{X(t+s)} ((1 − p_s) / (1 − p_bgd))^{1−X(t+s)}

(where we have omitted the superscripts (j) and (ℓ)).

3. Compute

    π^{(ℓ+1)}(t) = (1/m) Σ_j P(t | X_U^{(j)}, Q^{(ℓ)}, p_bgd^{(ℓ)}, π^{(ℓ)}),

    Q_s^{(ℓ+1)} = (1/m) Σ_{t∈V} Σ_j P(t | X_U^{(j)}, Q^{(ℓ)}, p_bgd^{(ℓ)}, π^{(ℓ)}) X^{(j)}(t + s),   s ∈ W,

    p_bgd^{(ℓ+1)} = (1/(m(|U| − |W|))) Σ_{t∈V} Σ_j P(t | X_U^{(j)}, Q^{(ℓ)}, p_bgd^{(ℓ)}, π^{(ℓ)}) Σ_{s∈U\(t+W)} X^{(j)}(s).

Set ℓ → ℓ + 1, goto 2.

The corresponding reference point y is set as

    y = x + Σ_{t∈V} t π̂^{(ℓ)}(t).

After a small number of iterations the probabilities Q^{(ℓ)} converge and are recorded for each part. Typically after a few iterations, for each training image the distribution P(t | X_U^{(j)}, Q^{(ℓ)}, p_bgd^{(ℓ)}) is highly peaked at one particular shift where the common structure is found, as shown in figure 4.
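The EM iterations above can be sketched for a single feature type. To keep the sketch short we fix the background probability at 0.5 so that it cancels from the shift posterior, exactly as in the simplification of step 2 (the paper also estimates p_bgd); this is our simplification, not the authors' code:

```python
import numpy as np

def em_part(images, W=3, n_iter=10, eps=1e-3):
    """EM for one part model (section 4.1): each image is the U-window
    around a start point; a W x W structure sits at an unknown shift t
    (indexing the window's top-left corner). Returns the part
    probability map Q and the shift distribution pi."""
    m, H, Wd = images.shape
    shifts = [(ty, tx) for ty in range(H - W + 1) for tx in range(Wd - W + 1)]
    V = len(shifts)
    pi = np.full(V, 1.0 / V)                                  # uniform init
    Q = images[:, :W, :W].mean(axis=0).clip(eps, 1 - eps)     # no-shift init
    for _ in range(n_iter):
        # E-step: posterior over shifts, in the log domain for stability
        logpost = np.empty((m, V))
        for k, (ty, tx) in enumerate(shifts):
            patch = images[:, ty:ty + W, tx:tx + W]
            logpost[:, k] = (patch * np.log(Q)
                             + (1 - patch) * np.log(1 - Q)).sum(axis=(1, 2)) + np.log(pi[k])
        logpost -= logpost.max(axis=1, keepdims=True)
        post = np.exp(logpost)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: weighted averages, as in step 3 of the algorithm
        pi = np.maximum(post.mean(axis=0), 1e-12)
        Q = np.zeros((W, W))
        for k, (ty, tx) in enumerate(shifts):
            Q += (post[:, k, None, None] * images[:, ty:ty + W, tx:tx + W]).sum(axis=0)
        Q = (Q / m).clip(eps, 1 - eps)
    return Q, pi
```

On synthetic data with a common structure at two different shifts, the posterior concentrates and the aligned average recovers the structure, as illustrated in figure 4.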

Figure 5: Results of training. Probability maps for a horizontal edge type. High probability areas are darker. Top left: Mean global model for the 'A' symbol in the LATEX dataset. Top right: Mean global model without iterations. Bottom: Same for the '0' digit in MNIST. (See also figure 11.)

4.2 Mean global model (MGM) and inter-part consistency

The mean global model is obtained by applying the patchwork operation with the estimated parts Q_i at the reference points y_i. Namely, the marginal probabilities at each point in the reference grid are obtained through equation (2) using the estimated parts Q_i with z_i = y_i. This can be viewed as an estimate of the full probability map associated to the class under the Bernoulli model.

The mean global model for two characters is shown in figure 5. On the left, for one type of edge (horizontal), is the result with the EM iterations. On the right is the result with no iterations, namely each part is obtained by taking Q_s^{(0)}, i.e. with no shifting. The alignment generated by the training procedure produces far more concentrated models

Figure 6: Left: Mean global model for 7's; dark indicates high probability. Right: The local models of the seven used to compute the left panel, shown displaced for visualization.

where local variability has been factored out. As shown in the results section, this leads to significant improvements in performance since the likelihood contrasts become sharper. Similar displays for faces are shown in figure 11. Note that even though the local models are trained separately, they patch together in a consistent manner when placed at the estimated reference points - the MGM 'looks right'. As an additional illustration, in figure 6 we show the MGM for the digit '7' together with some estimated local models (slightly displaced from their original location). The darker areas are those of higher probability. For ease of presentation these models were trained with a single binary feature, which is present if the gray level value is above a threshold. Again the MGM looks exactly like a seven. Placing the parts at the estimated reference points y_i yields a consistent part configuration, in the sense that the distributions induced on the overlap regions by several overlapping parts are very similar. Had this not been the case, the model would appear very blurred and diffuse. The training algorithm is trying to explain the data in windows around two nearby start points, and the consistency in the image data across these windows forces the estimated models to patch together consistently if placed at the correct locations.

The resulting MGM is a probability map on the full reference grid and hence can be

viewed as an estimate of the underlying probability map which generated the data. An indication that this is a reasonable estimate has to do with the fact that the local models can be ‘reread’ from the MGM. In other words, instead of keeping a record of each of the estimated local models Qi , we can keep only the probability map determined by the MGM. The local models are then read off by extracting the probabilities in subwindows around the reference points yi . The resulting local models are not identical to the original ones, but are close enough, and the resulting classification rates are essentially the same. We emphasize that this would not have been the case if neighboring local models were very inconsistent on the regions of overlap. The probabilities of the MGM at each point in the overlap region would be an average of two very different probabilities. The local models we would reread would be different from the original ones.

4.3 Estimating the distribution on deformations

We estimate the joint distribution on locations f_c(θ) as follows. Re-loop through the training data with the estimated POP model. For each example and each part Q_i, compute z_i^{(j),*}, the optimal local placement of part i, as detailed in section 5 below. Use the sample θ^{(j),*}, j = 1, ..., m to estimate the covariance matrix for the joint Gaussian centered around θ̄ = (y_1, ..., y_n). As will be seen in the experimental results, adding in this distribution has a small positive effect on the error rates. A more judicious choice of distribution on the deformations and its estimation procedure may yield more interesting results.
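A sketch of this estimation step, assuming the optimal placements have been collected into an array; the eigen-decomposition provides the v_k and λ_k of section 3.1:

```python
import numpy as np

def fit_deformation_model(thetas):
    """Fit the joint Gaussian on locations from a sample of optimal
    placements theta^(j,*), shape (m, n, 2). Returns the mean (the
    estimate of the reference configuration), the principal
    eigenvectors, and the variances lambda_1 >= ... >= lambda_2n."""
    m = thetas.shape[0]
    flat = thetas.reshape(m, -1)             # (m, 2n)
    mean = flat.mean(axis=0)
    cov = np.cov(flat, rowvar=False)         # (2n, 2n) sample covariance
    lam, vecs = np.linalg.eigh(cov)          # eigh returns ascending order
    order = np.argsort(lam)[::-1]            # reorder to descending
    return mean, vecs[:, order], lam[order]
```

The returned eigenvectors and variances plug directly into the sampling sketch of section 3.1.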

4.4 Learning mixture models

Our goal is to enable any model we develop to evolve as more data gets processed. The idea is to envisage each class as a mixture of POP models, and have the number of mixture components and the parameters of the mixture components evolve as additional data is introduced. This is a non-trivial task, and very little can be found in the literature on sequential clustering. We describe a rather naive approach that has performed remarkably well.

Given data from one class we train initially on a small training set T_0 of size M_max and produce one POP model P_0 from this dataset. For this model we compute the mean and standard deviation µ_0, σ_0 of the log-likelihoods ℓ_0(X), X ∈ T_0. Any 'problematic' data point in T_0 with log-likelihood one standard deviation below the mean - ℓ_0(X) < µ_0 − σ_0 - is added to a new list T_1. Now as additional data points X arrive (not from the original set T_0), we evaluate ℓ_0(X) and add to T_1 only those for which ℓ_0(X) < µ_0 − σ_0. Once the size of this list reaches |T_1| = M_max, we estimate a POP model P_1 from this data set and estimate µ_1, σ_1. All points in T_1 are already below threshold for the model P_0. Those that also fall below threshold for P_1 are used to start a new list T_2. New data points that fall below threshold on all existing models (in this case P_0, P_1) are added to T_2, and so on. In this manner additional models are added once a sufficient number of points has accumulated whose likelihood is below threshold for each of the current models. This is summarized in the following pseudo-code.

1. Set D = 0, T_0 = ∅, j = 0.
2. While |T_D| < M_max:
     If ℓ_d(X^{(j)}) < µ_d − σ_d for all d = 0, ..., D − 1, add X^{(j)} to T_D.
     j = j + 1.
   EndWhile
3. Estimate the POP model P_D from T_D, and estimate µ_D, σ_D from T_D.
4. Initialize T_{D+1} with all elements in T_D for which ℓ_D(X) < µ_D − σ_D. Set D = D + 1 and goto 2.

Note that j is never reinitialized and the maximum number of points stored at any stage is M_max. Clearly as the number of models grows, the rate at which the 'bad' set grows is slower and slower. Effectively we are training with only a very small subset of the data points, since already after the first set most points are above threshold. In figure 7 we show two of the models estimated for the class 7. It is encouraging to see that the second cluster has picked up sevens that are qualitatively different from those represented by the first, i.e. European sevens with a cross bar.
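The sequential scheme can be sketched generically; `fit` is a stand-in for POP training plus the threshold computation (a hypothetical interface, not the paper's code):

```python
def sequential_clusters(stream, M_max, fit):
    """Sequential mixture discovery in the spirit of section 4.4.
    fit(batch) returns (loglik_fn, mu, sigma) for a model trained on
    batch. A point is 'problematic' for model d when
    loglik_d(x) < mu_d - sigma_d; a new model is trained once M_max
    points problematic for every current model have accumulated."""
    models = []     # list of (loglik_fn, mu, sigma)
    pending = []    # the current list T_D
    for x in stream:
        if all(ll(x) < mu - sd for ll, mu, sd in models):
            pending.append(x)
        if len(pending) >= M_max:
            ll, mu, sd = fit(pending)
            models.append((ll, mu, sd))
            # seed T_{D+1}: points below threshold on the new model too
            pending = [p for p in pending if ll(p) < mu - sd]
    return models
```

With the first M_max points (the condition is vacuous while no model exists), this reproduces the bootstrap of T_0 and P_0 described above.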

Figure 7: Seven models: The MGM’s for two seven clusters. Dots show the reference points y i . Left: probability maps for horizontal edges. Right: probability maps for vertical edges.

5 Computation

Computing the global maximum in equation (6) is difficult. In practice we will limit ourselves to finding a local maximum. So far we have implemented the following simple approximation. Given a particular POP model (corresponding to a particular class cluster), for each part model Q_i, start from the reference point y_i and perform a local search in its neighborhood for z_i^* maximizing the likelihood ratio of the local model Q_i(X_{z+W}) to a background model, disregarding the other parts (i.e. ignoring the overlaps and the joint distribution f(θ)).

Specifically we use the search neighborhood V of section 4 (see also figure 4) and the larger neighborhood U which contains any possible shift of W by an element of V. Let p_{e,bgd} = n_e/|U|, where n_e is the number of instances of edge type e. We view p_{e,bgd} as a local estimate of the probability of edge e outside the region z + W. Simply maximizing Q_i(X_{z+W}) over z ∈ y_i + V would imply that likelihoods over different observations are compared. Instead we write the same model for the fixed region U as in equation (7) of section 4, where on the shifted subwindow z + W we have probabilities from Q_i and on the remainder probabilities p_{e,bgd}:

    Q_i(X_U; z) = ∏_{e=1}^E [ ∏_{s∈W} p_{e,s,i}^{X_e(s+z)} (1 − p_{e,s,i})^{1−X_e(s+z)} ]
                            · [ ∏_{s∈U\(z+W)} p_{e,bgd}^{X_e(s)} (1 − p_{e,bgd})^{1−X_e(s)} ].        (8)

Dividing by a background model

    Q_bgd(X_U) = ∏_{e=1}^E ∏_{s∈U} p_{e,bgd}^{X_e(s)} (1 − p_{e,bgd})^{1−X_e(s)},                    (9)

which does not depend on z, the ratio simplifies to a product over the data in z + W alone:

    J_i(z) = Q_i(X_U; z) / Q_bgd(X_U)
           = ∏_{e=1}^E ∏_{s∈W} (p_{e,s,i} / p_{e,bgd})^{X_e(s+z)} ((1 − p_{e,s,i}) / (1 − p_{e,bgd}))^{1−X_e(s+z)}.   (10)

Thus

    z_i^* = argmax_{z ∈ y_i + V} log J_i(z).                                                        (11)

After local optimization for each point, use θ* = (z_1^*, ..., z_n^*) to compute the full likelihood under the POP model as given in equations (3) and (4). Thus the full conditional distribution involving the overlaps and the term f(θ) is computed only after the maximization. One can iteratively maximize the full likelihood, computing a local maximum for z_i given the current locations of all other points. This is much slower and does not seem to yield any improvement in performance.
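A sketch of the local search (10)-(11) for a single edge type, under assumptions of our own: the part window is anchored at its top-left corner, and all searched shifts stay inside the image:

```python
import numpy as np

def best_placement(X, Q, y_i, V_half, p_bgd):
    """Search the (2*V_half+1)^2 neighborhood of reference point y_i
    for the shift z maximizing log J_i(z) of eq. (10): the log
    likelihood ratio of the part model Q against background."""
    w = Q.shape[0]
    lr1 = np.log(Q / p_bgd)                 # per-pixel gain when edge on
    lr0 = np.log((1 - Q) / (1 - p_bgd))     # per-pixel gain when edge off
    best, best_z = -np.inf, tuple(y_i)
    for dy in range(-V_half, V_half + 1):
        for dx in range(-V_half, V_half + 1):
            zy, zx = y_i[0] + dy, y_i[1] + dx
            patch = X[zy:zy + w, zx:zx + w]
            score = (patch * lr1 + (1 - patch) * lr0).sum()
            if score > best:
                best, best_z = score, (zy, zx)
    return best_z, best
```

Because the background product over U cancels, scores at different shifts are directly comparable, which is exactly the point of dividing by (9).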

6 Experimental Results

In this section we try to illustrate the usefulness of the proposed data models and associated training procedure in a number of applications. Due to space limitations not all details of the implementation can be provided.

Edge spreading window - 3 × 3
Part size (W) - 9 × 9
Overlap parameter determining initial point grid - ν = .5
Search neighborhood (V) - 11 × 11
Number of iterations in EM - 10
Number of data points at which an additional model is estimated - M_max = 10

Table 1: Default setting of algorithm parameters.

6.1 Classifying segmented digits

We present detailed experiments on the MNIST data set to explore the dependence of the algorithm on several of the model parameters. The default parameters are defined in table 1.

Computing the affine component - normalization. Since the distribution g on the affine component does not depend on the class, with a view to reducing the computational load, we assume σ can be computed directly from the data without using any particular model. Thus instead of 'deforming' the model by σ as in equation (4), we renormalize the data to the identity pose. First the center of mass (m_x, m_y) of all detected edge locations is moved to the center of the image (c_x, c_y): for each edge location we substitute x → x − m_x + c_x, y → y − m_y + c_y. Slant is corrected by computing b_yx - the slope of the regression line of the x coordinates fitted on the y coordinates of the edge locations - and substituting x → x − b_yx(y − c_y). Finally, after deslanting, we compute the standard deviations s_x, s_y of the edge locations in each coordinate and scale separately in each coordinate by r_x/s_x, r_y/s_y, where r_x and r_y are predetermined constants describing the average width and height of the edge clouds produced by sample digits.

Remark: The above procedure is clearly sensitive to noise and clutter, and is viewed as a computational shortcut. A full minimization over σ, z is costly. On the other hand, coarse to fine approaches (see Fleuret & Geman (2001), Amit et al. (2004)) have yielded accurate affine pose estimates very efficiently; some of these ideas are used in the experiments on

Training data   Avg. clusters   Error rate   Error rate   SVM
per class       per class       with f       without f    error rate
10              1               6.5          6.05         12.61
30              2.6             3            3            6.17
50              3.4             2.46         2.58         4.18
100             4.1             1.96         2.14         3.02
500             8               1.52         1.85         1.47

Table 2: Classification rates as a function of the number of training data per class. The second column indicates the number of clusters found in each class with the sequential clustering algorithm. We report error rates with the prior on θ and without, as well as the best rates achieved with SVM's on the same edge features (using a quadratic kernel).

reading zipcodes where this normalization is not possible.

Error rates as function of training set size. The first question of interest is the evolution of the error rates with the training set size. This is summarized in table 2. The classification results are for a test set of 10,000, where the margin of error is about .17%. The error rate starts at 6% with 100 training data, i.e. 10 per class with 1 POP model per class, and reaches 1.5% with 5000 training data, i.e. 500 per class with on average 8 models per class. Note that this means that the models were estimated with about 80 samples per class of the 500 available. Ignoring the joint distribution on shifts the rate is 1.85%. With 1000 training images (100 per class) the error rate is just under 2% (1.96%); without the joint distribution on shifts it is just over 2% (2.14%). In this experiment the estimated joint distribution f seems to have a small effect.

Stability with respect to the training set. The models reported in table 2 were trained with the first (100, 300, 1000, 5000) training examples of the MNIST training set. Of interest is the stability of the results with respect to variations in the training set. For sample size 300 - 30 per class - we trained 25 classifiers

on disjoint subsets of the training set. The mean classification rate was 3.1% with standard deviation .3%. This is an encouraging finding: despite the very small training set size, the variance of the final classification rate is very small.

Comparison to non-parametric classifiers. For the smaller size datasets - 10-100 per class - the results are far better than anything we were able to achieve with non-parametric classification methods such as SVM's or boosted randomized decision trees on the same edge features; see the last column of table 2. The 3% error rate reported for 30 examples per class and the 2% error rate reported for 100 examples per class are competitive with many algorithms listed in LeCun (2004) that have been trained on 6000 per class. The results become indistinguishable as the sample sizes increase.

Non-sequential clustering. The clustering algorithm described above is appealing because of its sequential nature and the ability to update the model as more data is observed. However, for optimal results it may be preferable to estimate the clusters simultaneously from all the available data. This is difficult to do in our context because the instantiation parameter θ is unobserved. However, using a coarse approximation to the POP model in terms of a fixed library of local parts, as proposed in Bernstein & Amit (2005), we can implement an EM type algorithm to estimate a predefined number of clusters. Then, using the data assigned to each such cluster, we estimate a POP model. This improves the results for the larger training set sizes, as summarized in table 3.

Training data   No. of clusters   POP error rate   SVM error rate
100             5                 1.75             3.02
500             20                1.11             1.47
1000            30                .8               1.15

Table 3: Classification results with non-sequential clustering using all available data.

With only 1000 training points per class, using likelihood ratio based classification with no discrimination boundaries, we achieve a state of the art error rate of 0.8%.

Computation time. Computation time is about .09 per image per cluster on a Pentium IV 3GHz, for example 20 images per second with five clusters per class. Thus as the number of clusters grows, classification slows down. One remedy is to use simpler models to detect the top 2-3 classes. For example, with the models made from 30 examples per class, with 2-3 clusters per class, the top 3 classes are correctly identified for 99.6% of the data. Using some confidence threshold, the more intense computations using more cluster models can be performed on only the 'uncertain' examples. Ultimately this classification method should be incorporated in a comprehensive coarse-to-fine computation.

Varying parameter settings. At 100 examples per class we experimented with some of the parameter settings. We summarize the results in table 4, where the modified parameter value is indicated, all others being at default value.

Varied parameter:   Default   No opt.   No EM iters.   Mmax = 40 (w/wo f)   W = 6   W = 12
Error rate:         1.96      10.1      3%             2.04/2.35            2.79    2.44

Table 4: Comparing error rate with default parameters to individual parameter changes.

First we show the importance of optimizing on the locations z_i. If the likelihoods are computed directly at the mean locations, i.e. the search neighborhood V = {0}, the classification rate falls to 10.1%. Classification is thus highly dependent on maxing out the deformation nuisance parameter. It is possible to estimate the local models assuming there are no shifts: in the window x_i + W around each start point x_i, estimate the edge probabilities without any shifting. This yields cruder models (see figure 5) with higher error rates - 3% (instead of 2%) - which is

Figure 8: Examples of randomly deformed LATEX symbols. Left: a random sample from the entire training set. Right: 30 randomly deformed 4's, &'s and ['s.

still respectable for this training set size but is significantly lower than the rate obtained with the original model. It should be recalled that the moment based prealignment performed on the edge data, as described in section 6.1 already yields good local alignment. We also tried increasing the value of Mmax which determines the number of points needed to estimate a new POP model, the number of models per class dropped from 4.1 to 2.6. This led to a very slight decrease in performance. It is interesting that in this setting there was a somewhat larger drop in classification rate when the distribution on deformations is ignored. With more clusters part of the geometric variation is covered by the different models and the constraints on the deformation captured by the distribution f are redundant.

6.2 Large numbers of classes - LATEX characters

This experiment is primarily meant to test the scalability of this method to large numbers of classes. We emphasize again that each class model is learned separately and classification is based on simple likelihood comparisons. We use essentially the same type of data set as in Amit & Geman (1997) or Amit (2002). A sample of deformed LATEX symbols is shown in figure 8. We used the default parameters shown in table 1. With 10 examples per class the error rate is 7.8%. With 30 training examples per class for each of the 293 classes and 2 models per class, the error rate is 3.7%, again essentially the same as that achieved with multiple decision trees and reported in Amit (2002). In this example, since we do not apply the normalization described in section 6.1, there is a much greater increase in the error rate - up to 10.7% - if training is performed with no shifting, i.e. V = {0}.

6.3 Reading zipcodes

We have run preliminary experiments on the applicability of the digit models to parsing a 'scene', i.e. reading zipcodes. The goal is to perform a likelihood based labeling of the zipcode, avoiding any preprocessing or pre-segmentation. The digit models are trained from the isolated and segmented MNIST dataset. However, the zipcode digits appear at widely different scales (at least 2:1) - even in the same zipcode. Thus instead of estimating one POP model for each class cluster, we estimate a number of models where all the data in the cluster is simultaneously scaled or slanted. In this way we avoid the normalization step described above, which would require pre-segmentation of the individual digits and can be quite sensitive to noise. All the models are estimated with 500 digits per class from the first 5000 MNIST training digits. There are 20 clusters per class, and 4 scales and 3 slants per cluster.

An initial coarse-to-fine computation finds candidate detections for all 10 digits, using very conservative thresholds. Typically this will result in a set of 2-3 hundred detections, denoted D. Each detection comes with a class label c, a model label m, a location x, an instantiation θ and a support S - the set of image pixels for which the assigned edge probabilities θp_e(x) (see equation (2)) are different from the background edge probabilities. The model m refers to a particular cluster model for class c. For some example detections and their support on a sample zipcode, see figure 9. Note that due to the many different scales and slants at which the digits appear, there can be many detections on a particular part of the zipcode that 'make sense' unless the full context of the interpretation is taken into account.


Figure 9: The support of several detections on a zipcode.


Figure 10: Top: Optimal interpretation with supports. Bottom: Next best interpretation.

A hypothesis for the interpretation of the entire zipcode involves a quintuple: I = {c_i, m_i, x_i, θ_i, S_i, i = 1, ..., 5}. The edge data of the entire zipcode conditional on such an interpretation is modeled by concatenating the individual data models on the union of the supports S = ∪_i S_i. On the complement S^c the edges are again assumed independent with background probabilities. On regions where the supports overlap we can either use the overlap trick, or simply pick the probabilities assigned by one of the two models, according to some preassigned order (say from left to right). Let T_i = ∪_{j=1}^{i} S_j (with T_0 = ∅); then

P(X|I) = P(X|\mathrm{bgd}) \prod_{i=1}^{5} \prod_{e} \prod_{x \in S_i \setminus T_{i-1}} \left( \frac{\theta_i p_e(x)}{p_{e,\mathrm{bgd}}} \right)^{X_e(x)} \left( \frac{1 - \theta_i p_e(x)}{1 - p_{e,\mathrm{bgd}}} \right)^{1 - X_e(x)},    (12)

(see Amit et al. (2004) for more details). The goal is to find the most likely interpretation from among all quintuples of D. In our setting, since the detections have to be arranged in a linear fashion, it is possible to compute the optimal interpretation efficiently with dynamic programming on a five-node graph, where each detection is a possible state at any node in the graph. Clearly the linear constraints rule out many configurations. The top panel of figure 10 shows the optimal interpretation found by the dynamic programming.
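As a concrete illustration, here is a minimal sketch of the five-node dynamic program in Python. The representation of a detection (a log-likelihood score plus a horizontal position) and the `min_gap` spacing constraint are simplifying assumptions, not the paper's exact formulation; in the paper, a detection's score would be its contribution to equation (12).

```python
# Sketch of the five-node dynamic program over candidate detections.
# Each detection is (log_lik, x): its log-likelihood contribution and its
# horizontal position. "min_gap" is an illustrative left-to-right spacing
# constraint standing in for the linear arrangement constraint.

def best_interpretation(detections, n_digits=5, min_gap=8):
    n = len(detections)
    NEG = float("-inf")
    # score[i][d]: best total log-likelihood of a partial interpretation
    # whose i-th digit is detection d; back[i][d] stores the predecessor.
    score = [[NEG] * n for _ in range(n_digits)]
    back = [[-1] * n for _ in range(n_digits)]
    for d, (ll, _) in enumerate(detections):
        score[0][d] = ll
    for i in range(1, n_digits):
        for d, (ll, x) in enumerate(detections):
            for p, (_, xp) in enumerate(detections):
                if xp + min_gap <= x and score[i - 1][p] != NEG:
                    cand = score[i - 1][p] + ll
                    if cand > score[i][d]:
                        score[i][d], back[i][d] = cand, p
    # Trace back the optimal chain of five detections.
    d = max(range(n), key=lambda k: score[-1][k])
    total = score[-1][d]
    chain = [d]
    for i in range(n_digits - 1, 0, -1):
        d = back[i][d]
        chain.append(d)
    return total, chain[::-1]
```

The quadratic inner loop over predecessor states is affordable here because the number of candidate detections is only a few hundred.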

We tested the results on a set of 1000 zipcodes from the CEDAR data base. No segmentation or preprocessing of any kind is performed. We obtain a correct zipcode recognition rate of 82%. A simple modification of the dynamic programming can yield the top K most likely interpretations. The bottom panel of figure 10 shows the second best interpretation. For 94% of the zipcodes the correct labeling was among the top 10. Furthermore, using a simple rejection criterion comparing the likelihoods of the top two interpretations, we get 96% correct with 50% rejection. Computation time on a Pentium IV 3Ghz is 8 seconds per zipcode. There is not much recent literature on reading zipcodes. However, compared to the literature from the mid to late 1990s, this initial result is within the range of results obtained by very dedicated algorithms. Some results are presented in table 5. Note that the training and testing datasets are not the same, so it is hard to provide an accurate comparison.

Author                      n      correct at 0 rej.   % correct   % rej.
Ha et al. (1998)            436    85%                 97%         34%
Palumbo & Srihari (1996)    1566   -                   96.5%       32%
Wang (1998)                 1000   72%                 95.4%       43%
POP models                  1000   81%                 96%         50%

Table 5: Comparison of zipcode reading rates.
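The rejection criterion mentioned above - comparing the likelihoods of the top two interpretations - can be sketched as follows; the margin threshold is an illustrative parameter, not a value from the paper.

```python
def accept(interpretations, margin=5.0):
    """Accept the best interpretation only if it beats the runner-up
    by at least `margin` in log-likelihood; otherwise reject.

    interpretations: list of (log_lik, labels) pairs.
    """
    ranked = sorted(interpretations, key=lambda t: t[0], reverse=True)
    best, second = ranked[0], ranked[1]
    if best[0] - second[0] >= margin:
        return best[1]
    return None  # reject: top two interpretations are too close
```

Raising the margin trades a higher rejection rate for a higher accuracy on the accepted zipcodes, which is the trade-off reported in table 5.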

6.4 Faces

To verify whether this model is applicable to gray level objects that are not line drawings, we performed a face detection experiment. Using the first 300 images of the Olivetti data set we trained 5 face POP models at .3 of the original scale - on average 10 pixels between the eyes. As in the zipcode problem, to accommodate different scales we simultaneously scaled the images in each cluster at .27, .3 and .33 to create scaled versions of each model. Thus in total there are 15 POP models for faces. In figure 11 we show the mean global model for one of the face models at scale .3. For comparison we show the global model obtained if no EM iterations are performed; this simply records the edge probabilities at each pixel over all training faces in the cluster, with no local shifting. The former model is much sharper or, equivalently, has lower entropy.
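The entropy comparison can be made concrete with a small sketch that averages the Bernoulli entropy over a map of per-pixel edge probabilities; the function name and list-of-lists representation are illustrative.

```python
import math

def mean_bernoulli_entropy(prob_map):
    """Average entropy (in bits) of a map of per-pixel edge probabilities.

    A 'sharp' model, with probabilities near 0 or 1, has low average
    entropy; a blurred model, with probabilities near 0.5, has high one.
    """
    def h(p):
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    probs = [p for row in prob_map for p in row]
    return sum(h(p) for p in probs) / len(probs)
```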

Figure 11: Top row: Global mean model for a face cluster without EM iterations. Left, horizontal edges, right vertical edges. Bottom row: Global mean models after EM iterations. Dark means higher probability.

Using an efficient but crude face detector (see Amit (2002)) we obtain candidate locations/scales for testing the POP models. These detectors are based on the same edges and are coarse approximations of the POP models. At each candidate window we use an adaptive estimate of the background edge probabilities p_{e,bgd}, and define a homogeneous background model assuming conditional independence of the edges, with the same marginal probabilities everywhere in the window. Using a likelihood ratio test of the MGMs (no shifting) to background for each of the 5 models and each of the 3 scales, we pick the best model. This does not involve the intensive computation of optimizing the shift of each part in the POP model. Only at the chosen model do we compute the optimal instantiation θ as

Figure 12: ROC curve for face detection. X axis - false positive rate per pixel (×10−5) over the 130 test images. Y axis - fraction of detected faces.

explained in section 5. Finally the likelihood ratio of the fitted POP model to the locally adapted background model is compared with a threshold to decide if the candidate detection is a face or not. Varying this threshold yields an ROC curve, presented in figure 12. We tested on the combined CMU-MIT test sets of upright faces (testA, testB, testC). In testing we excluded a couple of upside-down card faces, two profiles and several 'caricature' or line drawing faces (not the playing cards), leaving 490 'faces'. The images containing the excluded faces remain in the test set for false positive rates, and to detect the other faces that might occur in these images. There are 130 images in the test set. The results are not as good as some reported in the literature. At a false negative rate of 12% we have approximately 250 false positives, compared to around 80 reported for example in Viola & Jones (2004) or Schneiderman & Kanade (2004). However all other models have used explicit training with large numbers of faces and even larger numbers of background images. The interest here is that the face models are trained with only 300 faces, no background, and yet a simple likelihood ratio test to an adaptive background model has so much power.
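A minimal sketch of such a likelihood ratio test, assuming binary edge maps and, for brevity, a single adaptive background probability shared across edge types (the paper estimates a background probability p_{e,bgd} per edge type); the function names and the zero threshold are illustrative.

```python
import math

def log_likelihood_ratio(edges, model_probs, bgd_prob):
    """Log-likelihood ratio of a POP-style edge model to a homogeneous
    background model for a candidate window.

    edges[e][x] in {0, 1}: observed edge of type e at pixel x.
    model_probs[e][x]: probability of that edge under the fitted model.
    bgd_prob: adaptive estimate of the background edge probability.
    """
    llr = 0.0
    for X_e, p_e in zip(edges, model_probs):
        for X, p in zip(X_e, p_e):
            if X:
                llr += math.log(p / bgd_prob)
            else:
                llr += math.log((1 - p) / (1 - bgd_prob))
    return llr

def is_face(edges, model_probs, bgd_prob, threshold=0.0):
    return log_likelihood_ratio(edges, model_probs, bgd_prob) > threshold
```

Because the background marginals are estimated per window, the same threshold adapts automatically to cluttered and clean regions of the image.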

Figure 13: Some face detections on images from the CMU-MIT dataset.

Some images with detections, a few false positives and false negatives are shown in figures 13 and 14. In addition to a location and scale we obtain a full instantiation of the face. As an example, in figure 15 (A) we show the subimage of a detected face from the upper left-hand panel of figure 13, together with the shifts of the reference points (B), and the global POP model for the horizontal edges (C). Note how the deformed probability model adjusts to the fact that the face is partially rotated. The lower panel of the figure shows the support computed for each of the faces.


Figure 14: Some additional face detections.

7 Discussion

We have introduced a new class of statistical object models with rather general applicability to a variety of data sets. These models describe the dense oriented edge maps obtained from the gray level data, and assume independence conditional on the instantiation. The advantages of statistical modeling and likelihood based classification have been demonstrated at several levels: (i) robust estimation of models from small datasets, (ii) easy scalability to large numbers of classes and to new data from existing classes, (iii) state of the art classification rates on the MNIST handwritten digit dataset, (iv) composition of object models into scene models for zipcode reading, and (v) face detection through likelihood ratios to adaptive background models. One inherent drawback of the current models is sensitivity to rotations beyond, say, ±15 degrees. We allow only shifts of the parts, but when an articulated component of the object undergoes a significant rotation or skew, the probabilities of the edges at each location can no longer be represented as a shift of the original model. Currently this can be accommodated through an additional cluster in the class. An alternative is to 'rotate' the models by estimating transition probabilities between edge types as a function of the


Figure 15: Top. (A) The subimage around a face. (B) The shifts of the reference points relative to the hypothesized center of the detected face. (C) The resulting global POP probability map for horizontal edges. Bottom. Supports of global POP models for the five faces.


rotation. Another possibility is to develop part models on even smaller windows, based on the original gray level data, where photometric invariance is explicitly modeled through the mean level in each window. It is easier to rotate an actual image than to rotate an edge map. In general it would seem useful to pursue the possibility of gray level POP models. In the current implementation a different part model is trained for each point of interest of each cluster of each class. Some preliminary experimentation has shown that many of these parts are very similar, and that the estimated parts can be replaced by representatives from a relatively small library. This raises the possibility of discovering a universal family of parts with which all models are trained. This has been partially explored in Bernstein & Amit (2004), where a set of parts is estimated from unlabeled data and then used in a feed forward framework: each pixel is labeled by the most likely part, and all the parts from each neighborhood are merged onto one pixel on a coarser grid. This is then the input for the object models and likelihood based classification. We believe that there is room for a statistical model integrating the two levels - the part based level and the edge based level - into one consistent hierarchical model. This type of model should benefit from the information contained at both levels. The coarser part based models can be viewed as the coarse step in a coarse-to-fine hierarchy, thus pointing to a possible integration of the hierarchical models with the CTF methodology.
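The feed-forward part labeling step just described can be sketched as follows, assuming per-pixel part scores have already been computed; the function name, block size, and set-valued pooling are illustrative choices, not the implementation of Bernstein & Amit (2004).

```python
def part_labels_coarse(part_scores, block=4):
    """Label each pixel with its most likely part, then merge each
    block x block neighborhood onto one coarse-grid cell, keeping the
    set of part labels present (a crude 'OR' pooling).

    part_scores[y][x] is a list of per-part scores at pixel (y, x).
    """
    H, W = len(part_scores), len(part_scores[0])
    labels = [[max(range(len(part_scores[y][x])),
                   key=lambda k: part_scores[y][x][k])
               for x in range(W)] for y in range(H)]
    coarse = []
    for by in range(0, H, block):
        row = []
        for bx in range(0, W, block):
            cell = {labels[y][x]
                    for y in range(by, min(by + block, H))
                    for x in range(bx, min(bx + block, W))}
            row.append(cell)
        coarse.append(row)
    return coarse
```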

References

Ahissar, M. & Hochstein, S. (2002), 'View from the top: hierarchies and reverse hierarchies in the visual system', Neuron 36, 791–804.

Amit, Y. (2002), 2D Object Detection and Recognition: Models, Algorithms and Networks, MIT Press, Cambridge, Mass.

Amit, Y. & Geman, D. (1997), 'Shape quantization and recognition with randomized trees', Neural Computation 9, 1545–1588.

Amit, Y. & Geman, D. (1999), 'A computational model for visual selection', Neural Computation 11, 1691–1715.

Amit, Y. & Kong, A. (1996), 'Graphical templates for model registration', IEEE PAMI 18, 225–236.

Amit, Y., Geman, D. & Fan, X. D. (2004), 'A coarse-to-fine strategy for multi-class shape detection', IEEE PAMI, to appear.

Beg, M., Miller, M., Trouvé, A. & Younes, L. (2005), 'Computing large deformation metric mappings via geodesic flows of diffeomorphisms', International Journal of Computer Vision 61(2), 139–157.

Belongie, S., Malik, J. & Puzicha, J. (2002), 'Shape matching and object recognition using shape contexts', IEEE PAMI 24, 509–523.

Bernstein, E. & Amit, Y. (2004), A statistical approach to discovering parts for multi-class object recognition, Technical report, Dept. of Statistics, University of Chicago.

Bernstein, E. J. & Amit, Y. (2005), Statistical models for object classification and detection, in 'Proceedings CVPR 2005', to appear.

Borenstein, E., Sharon, E. & Ullman, S. (2004), Combining bottom up and top down segmentation, in 'Proceedings CVPRW04', Vol. 4, IEEE.

Burl, M. C., Leung, T. K. & Perona, P. (1995), Face localization via shape statistics, in M. Bichsel, ed., 'Proc. Intl. Workshop on Automatic Face and Gesture Recognition', pp. 154–159.

Burl, M., Weber, M. & Perona, P. (1998), A probabilistic approach to object recognition using local photometry and global geometry, in 'Proc. of the 5th European Conf. on Computer Vision, ECCV 98', pp. 628–641.

Coughlan, J., Yuille, A., English, C. & Snow, D. (2000), 'Efficient deformable template detection and localization without user initialization', Computer Vision and Image Understanding 78(3), 303–319.

Crandall, D., Felzenszwalb, P. & Huttenlocher, D. (2005), Spatial priors for part-based recognition using statistical models, in 'Proceedings CVPR 2005', to appear.

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', Journal of the Royal Statistical Society, Series B 39(1), 1–38.

Fei-Fei, L., Fergus, R. & Perona, P. (2003), A Bayesian approach to unsupervised one-shot learning of object categories, in 'Proceedings of the International Conference on Computer Vision', Vol. 1.

Fleuret, F. & Geman, D. (2001), 'Coarse-to-fine face detection', International Journal of Computer Vision 41, 85–107.

Geman, S., Potter, D. F. & Chi, Z. (2002), 'Composition systems', Quarterly of Applied Mathematics LX, 707–736.

Grenander, U. (1978), Pattern Analysis: Lectures in Pattern Theory I-III, Springer Verlag, New York.

Grenander, U. (1993), General Pattern Theory, Oxford University Press, Oxford.

Grenander, U. & Miller, M. I. (1998), 'Computational anatomy: an emerging discipline', Quarterly of Applied Mathematics LVI(4), 617–694.

Ha, T. M., Zimmermann, M. & Bunke, H. (1998), 'Off-line handwritten numeral string recognition by combining segmentation-based and segmentation-free methods', Pattern Recognition 31, 257–272.

Hastie, T. & Simard, P. Y. (1998), 'Metrics and models for handwritten character recognition', Statistical Science.

Itti, L., Koch, C. & Niebur, E. (1998), 'A model of saliency-based visual attention for rapid scene analysis', IEEE Trans. PAMI 20, 1254–1260.

Jain, A. K., Zhong, Y. & Dubuisson-Jolly, M. P. (1998), 'Deformable template models: a review', Signal Processing 71(2), 109–129.

Kim, I.-J. & Kim, J.-H. (2003), 'Statistical character structure modeling and its application to handwritten Chinese character recognition', IEEE PAMI 25, 1422–1436.

Krempp, S., Geman, D. & Amit, Y. (2002), Sequential learning of reusable parts for object detection, Technical report, CIS, Johns Hopkins.

LeCun, Y. (2004), 'The MNIST database', http://yann.lecun.com/exdb/mnist/.

LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998), 'Gradient-based learning applied to document recognition', Proceedings of the IEEE 86(11), 2278–2324.

Leibe, B. & Schiele, B. (2003), Interleaved object categorization and segmentation, in 'BMVC'03'.

Leung, T., Burl, M. & Perona, P. (1995), Finding faces in cluttered scenes using labelled random graph matching, in 'Proceedings, 5th Int. Conf. on Comp. Vision', pp. 637–644.

Leibe, B. & Schiele, B. (2004), Scale invariant object categorization using a scale-adaptive mean-shift search, in 'DAGM'04 Annual Pattern Recognition Symposium', Vol. 3175, pp. 145–153.

Palumbo, P. & Srihari, S. (1996), 'Postal address reading in real time', Intr. Jour. of Imaging Science and Technology.

Parizeau, M. & Plamondon, R. (1995), 'A fuzzy syntactic approach to allograph modeling for cursive script recognition', IEEE PAMI 17, 702–712.

Revow, M., Williams, C. K. I. & Hinton, G. E. (1996), 'Using generative models for handwritten digit recognition', IEEE PAMI 18, 592–606.

Roche, A., Guimond, A., Ayache, N. & Meunier, J. (2000), Multimodal elastic matching of brain images, in 'ECCV '00: Proceedings of the 6th European Conference on Computer Vision - Part II', Springer-Verlag, pp. 511–527.

Rowley, H. A., Baluja, S. & Kanade, T. (1998), 'Neural network-based face detection', IEEE Trans. PAMI 20, 23–38.

Schneiderman, H. & Kanade, T. (2004), 'Object detection using the statistics of parts', Inter. Jour. Comp. Vis. 56, 151–177.

Shi, J. & Malik, J. (2000), 'Normalized cuts and image segmentation', IEEE PAMI 22, 888–905.

Torralba, A., Murphy, K. P. & Freeman, W. T. (2004), Sharing visual features for multiclass and multiview object detection, Technical Report AI-Memo 2004-008, MIT.

Trouvé, A. & Younes, L. (2004), 'Metamorphoses through Lie group action', Foundations of Computational Mathematics, to appear.

Tu, Z. W., Chen, X. R., Yuille, A. L. & Zhu, S. C. (2004), 'Image parsing: unifying segmentation, detection and recognition', Int'l J. of Computer Vision, to appear.

Ullman, S., Vidal-Naquet, M. & Sali, E. (2002), 'Visual features of intermediate complexity and their use in classification', Nature Neuroscience 5, 682–687.

Vapnik, V. N. (1995), The Nature of Statistical Learning Theory, Springer Verlag, New York.

Viola, P. & Jones, M. J. (2004), 'Robust real time face detection', Intl. Jour. Comp. Vis. 57, 137–154.

Wang, S. C. (1998), A statistical model for computer recognition of sequences of handwritten digits, with applications to zip codes, PhD thesis, University of Chicago.

Wiskott, L., Fellous, J.-M., Krüger, N. & von der Malsburg, C. (1997), 'Face recognition by elastic bunch graph matching', IEEE Trans. on Patt. Anal. and Mach. Intel. 7, 775–779.

