Otto-von-Guericke-University Magdeburg
School of Computer Science
Institute for Knowledge and Language Engineering

Internship Report

Design and Implementation of an Algorithm and Data Structure for Matching of Geometric Primitives in Visual Object Classification

Author: Sebastian Stober
April 28, 2005

Supervisors:
Prof. Dr. Rudolf Kruse
Otto-von-Guericke-University Magdeburg, School of Computer Science, P.O. Box 4120, D-39016 Magdeburg, Germany

A/Prof. Saman Halgamuge
The University of Melbourne, Mechanical & Manufacturing Engineering, 3010 Victoria, Australia

Stober, Sebastian: Design and Implementation of an Algorithm and Data Structure for Matching of Geometric Primitives in Visual Object Classification. Internship Report, Otto-von-Guericke-University Magdeburg, 2005.


Abstract

This report refers to work completed during my internship with the Mechatronics Research Group at the Department of Mechanical and Manufacturing Engineering at the University of Melbourne, Australia, from September 5, 2003 until March 5, 2004.

Recognition of three-dimensional objects in two-dimensional images is a key area of research in computer vision. One approach is to save multiple 2D views instead of a 3D object representation, thus reducing the problem to a 2D-to-2D matching problem. The Mechatronics Research Group is developing a novel system that focuses on artificial objects and further reduces the 2D views to symbolic descriptions. These descriptions are based on shape-primitives: ellipses, rectangles and isosceles triangles. Evidence in support of a hypothesis for a certain object classification is collected through an active vision approach.

This work deals with the design and implementation of a data structure capable of holding such a symbolic representation, and of an algorithm for comparison and matching. The chosen symbolic representation of an object view is rotation-, scaling- and translation-invariant. For the comparison and matching of two object views, a branch & bound algorithm based on problem-specific heuristics is used. Furthermore, a GA-based generalization operator is proposed to reduce the number of object views in the system database. Experiments show that the query performance scales linearly with the size of the database; for a database containing 10,000 entries, a response time of less than a second is expected on an average system.

Acknowledgments

This research internship was made possible by funding from the German National Academic Foundation. I would like to thank my supervisors, Prof. Rudolf Kruse and A/Prof. Saman Halgamuge, as well as Jun/Prof. Andreas Nuernberger, Prof. Horst Hollatz, and my family for encouragement, support, and help. Special thanks are extended to Ruby Law, who never seemed to tire of writing comments and remarks, for discussions, feedback, ideas, and for help on the literature review. I would also like to thank Christian Borgelt, who read this report prior to publication and was kind enough to offer his comments. My grateful thanks go to all the people at the Mechatronics Research Group and their friends for the pleasant co-operation and hospitality they offered during my stay, particularly Genevieve and Karl, Salim, Kenneth, Guru, and Asanga. Finally, I want to thank Matthias Steinbrecher, especially for help on C++, but also for superb culinary art and for bearing my company for almost seven months.

Contents

List of Figures
List of Tables
List of Algorithms
List of Abbreviations
1 Introduction
  1.1 Motivation
  1.2 Related work in object recognition
    1.2.1 Ideas from cognitive science
    1.2.2 Non-cognitive approaches
  1.3 Framework
  1.4 Task
  1.5 Outline of this report
2 Database
  2.1 Object representation syntax
  2.2 Building and storing the database
3 Classification matching component
  3.1 Comparison of two object views
  3.2 Processing a query
    3.2.1 Branch & bound algorithm
    3.2.2 Underlying data structure
    3.2.3 Lower bound computation
    3.2.4 Upper bound computation
    3.2.5 An error-overestimating heuristic
    3.2.6 Further extensions of the branch & bound algorithm
4 Forming generalizations in the database
  4.1 About genetic algorithms
  4.2 Representation of an individual and fitness function
  4.3 Initial population
  4.4 Selection
  4.5 Genetic operators
  4.6 Termination criterion
5 Graphical user interface
6 Test results
  6.1 Test data sets
    6.1.1 Real test data sets
    6.1.2 Artificially generated test sets
  6.2 Test results
7 Conclusion & future work
  7.1 Discussion
  7.2 Ideas for further improvement
    7.2.1 Introduction of an error threshold
    7.2.2 Optimization of the parameters for the shape error functions
    7.2.3 Incorporation of shape confidences
  7.3 Conclusion
A Definition of the shape-primitive property errors
B Overview of the source files
C Number of possible matchings between two object views
D Data structure for a (partial) matching
Bibliography

List of Figures

1.1 Overall structure of the object recognition module
2.1 Six orthogonal 2D views of a mug
2.2 Object description syntax scheme
2.3 Shape-primitives and properties
2.4 Shape-primitives detected by the detection module
2.5 Normalized object view representation of the input shown in figure 2.4
3.1 One-to-one mapping of shape-primitives of two object views α and β
3.2 Matching example
3.3 Comparison of two entries - flowchart (part 1)
3.4 Comparison of two entries - flowchart (part 2)
3.5 Optimal search-tree for the example used in section 3.1
3.6 Cases of "heaviness" of an AVL tree
4.1 Flowchart of the mutateWeak and mutateWeakMod operators
4.2 Flowchart of the nPointCrossover operator
5.1 Screenshot of GUI with display areas in shape-match mode
5.2 Screenshot of GUI displays in unlinked mode and with position labels
5.3 Screenshot of GUI displays in size-linked mode
5.4 Screenshot of GUI displays in match mode
6.1 Benchmark results for large scale databases
6.2 Changes in the lower and upper bounds for the matching error
A.1 Triangular membership function

List of Tables

2.1 Conversion of shape-primitive properties
2.2 Value ranges of shape-primitive properties
6.1 Partitioning of the first test set
6.2 Number of expanded nodes

List of Algorithms

1 Generic branch & bound algorithm
2 Computation of the estimated error
3 Computation of the unmatched error
4 Computation of an upper bound for the matching error
5 Generic genetic algorithm

List of Abbreviations

α        object view (the query object view in the context of a query)
β        object view (an object view in the database in the context of a query)
Fα       set of free shape-primitives of object view α
Fα,t     set of free shape-primitives of object view α with type t
αi       the i-th shape-primitive of object view α
|α|      the number of shape-primitives of α
m(αi)    the shape-primitive matched with the i-th shape-primitive of α (may be "unmatched", indicating that αi could not be matched)

GUI      Graphical User Interface
API      Application Program Interface
GA       Genetic Algorithm
STL      Standard Template Library
MFC      Microsoft Foundation Classes
GDI      Graphics Device Interface


Chapter 1 Introduction

This chapter starts with an explanation of the motivation and the context of this work. Then a brief overview of the problem of object recognition is given. In section 1.3 the system framework is described. Afterwards, the task is defined. The chapter closes with an outline of the remaining chapters of this work.

1.1 Motivation

Recognition of three-dimensional objects in two-dimensional images is a key area of research in computer vision. One approach is to save multiple 2D views instead of a 3D object representation, thus reducing the problem to a 2D-to-2D matching problem. The Mechatronics Research Group is currently developing a novel system that focuses on 2D views and incorporates the idea of "active vision". In active vision, a feedback loop is closed between the image generating process and an actuator module: based on hypotheses about a perceived object, commands to the actuator module are derived to change the camera position in an attempt to refine the hypothesis. To be useful, an active vision system must return its results within a fixed delay that depends on the application domain; this requires limiting the amount of data to be processed. The system limits this amount by working on geometric information extracted from images instead of the raw image data. The symbolic descriptions derived in this way are based on shape-primitives: ellipses, rectangles and isosceles triangles. This decompositional approach may yield advantages in terms of computational cost and real-time capability. On the other hand, the fundamental restriction on shape complexity will limit the differences that can be captured between objects. System performance therefore has to be examined in terms of processing time as well as classification robustness. The question of how to represent objects and the idea of decompositional approaches have been widely discussed amongst cognitive scientists, leading to many different approaches. A brief overview of related ideas is given in the following section.

1.2 Related work in object recognition

As a key task of computational vision, object recognition deals with the retrieval (classification or identification) and localization of objects of interest in an image (scene), based on object models which are known (or have to be learned) a priori. The complexity of this task depends on [Mai]:

• the number of objects that can occur within a single image (scene),
• whether the objects may partially occlude others,
• the size of the model database,
• whether images are acquired under similar conditions to those of the models, e.g. illumination, background, camera parameters and viewpoint, and
• the choice of the internal representation for the objects.

Whilst the first four factors are mainly domain-specific, the last is a matter of the chosen approach and may have a big impact on the performance of the system.

1.2.1 Ideas from cognitive science

David Marr, a cognitive scientist, divided the process of human object recognition into four stages, from low-level to high-level visual processes [MN78, Mar82]:

1. Grey level description of the intensity of light at each point in the retinal image.
2. Primal sketch - a 2-dimensional, viewpoint-dependent description.
3. 2.5-dimensional sketch - a "viewer-centered" representation (still viewpoint-dependent). (Usually, "viewer-centered" is used as equivalent to 2D, and "object-centered" as equivalent to 3D.)
4. 3-dimensional description - in this stage, the viewpoint-dependent (or "viewer-centered") sketch is remapped into a viewpoint-independent (or "object-centered") representation. Eventually, the constructed 3D model is matched with a 3D object model stored in the long term memory (the equivalent of the model database).

In 1987, Irving Biederman developed Marr's approach into the "Recognition by components" theory [Bie87], identifying 36 different "geons", i.e., fundamental shapes (arcs, wedges, spheres, cylinders, blocks etc.; examples can be found in the TarrLab Stimuli collection [Tar]) from which all real-world objects can be composed. The geons have co-occurring patterns of


lines and edges that are non-accidental and can be detected independently of the viewpoint. However, viewpoint-independent, object-centered representations have been questioned, since there is evidence that human object recognition is rather viewpoint-dependent, as shown by Bülthoff et al. [BET95] and Tarr et al. [TWHG98]. In fact, it seems as if the representation of objects in the human brain is neither purely viewer-centered nor purely object-centered [TB95]. Possibly the kind of information stored depends on the complexity of the object and the context in which it appears (e.g. everyday use) [Kos96]. As this discussion moves on at the level of cognition and neuroscience, knowledge about the processes and structures in the human brain grows, and more sophisticated ideas are introduced, such as the "chorus of fragments" approach by Edelman et al. [EIJ02, EI02, EI03]. It picks up ideas of the "Recognition by components" theory [Bie87], but instead of being restricted to a fixed set of components, a network of "what and where" neurons learns the most frequent "fragments" occurring in the database. The scheme of "what and where" neurons is justified by analogies to the receptive fields of the retina [EN98].

In viewer-centered representation systems, entries in the database describe 2D views of objects from different viewpoints, and the database entry which best resembles the description derived from a 2D input image is retrieved. Object-centered approaches, in contrast, store 3D models of objects in the database. Generally, 3D models are much more complex than 2D view representations and thus harder to acquire. They work well with 3D inputs, as a 3D-3D matching function is relatively straightforward to implement (even though it is slower than a 2D-2D matching function), but the generation of 3D input images requires special hardware such as a laser scanner or at least a stereo (or triple) vision camera setup with additional preprocessing steps (as e.g. in [DY95]). In the case of a 2D-only input, either

• a matching function between the 2D input and a 3D model has to be developed, or
• each 3D model in the database has to be projected to a viewpoint before a 2D-2D matching function is applied, as e.g. in Lowe's "viewpoint consistency constraint" [Low85] or Ullman's "recognition by alignment" approach [Ull89, Ull96], or
• a 3D description has to be derived from the 2D input (as pointed out by Marr [Mar82]) and a 3D-3D matching function is applied.

These additional computations make object-centered approaches appear computationally more expensive than viewer-centered ones. On the other hand, viewer-centered approaches require a bigger database, because multiple views are required for each object. To reduce the number of views, input and models can be normalized, or functions that interpolate or extrapolate objects of the model database can be used.

Tarr and Bülthoff [TB98] provide a more detailed overview of viewer-centered recognition, pointing out parallels between man, monkey and machine. For a survey on model-based recognition refer to [Pop94]. An overview of computational theories of


object recognition, with a discussion of the pros and cons of the different approaches, is given by Edelman [Ede97].

1.2.2 Non-cognitive approaches

Non-cognitive approaches to the object recognition problem are solutions based on feature spaces and mathematical concepts. Comparison of query and database object representations can be facilitated by mapping them into a different space. Methods such as the Minimum Description Length principle and Principal Component Analysis concentrate on significant differences between known objects and have the added benefit of reducing the dimension of the feature space. Another common way is to use transformations like the Discrete Fourier Transform. For example, Funkhouser et al. [FMK+03] implemented a search engine for 3D models using spherical harmonics. An approach based on the computation of the shape distribution as the signature of an object is presented by Osada [OFCD01], and there are many other approaches that concentrate on the shape of the object, as e.g. shown in [OMT03]. Veltkamp [Vel01] provides an overview of shape matching similarity measures and algorithms.

1.3 Framework

The complete object recognition module that is currently being developed at the Mechatronics Research Group involves three components as shown in figure 1.1.

Figure 1.1: Overall structure of the object recognition module.

The database stores the module's knowledge of objects in the world. This knowledge is encoded using a symbolic description based on shape-primitives. Apart from the object


representation (which is a viewer-centered one; refer to section 1.2.1 for a discussion of viewer-centered and object-centered representation), the following assumptions have been made regarding the complexity criteria stated at the beginning of section 1.2:

• There is only one object in each single image (scene).
• As there is only one object in each image, the subject of (partial) object occlusion will not be addressed.
• The size of the model database has not been taken into consideration.
• Slightly different lighting conditions are addressed by generalization of the models in the database. Interpolation of views can be implemented, but this is not part of this work.

The visual shape detection component (implemented by Ruby Law at the Mechatronics Research Group) provides the sensory input to the matching component. The matching component generates a set of hypotheses about the identity of the sensed object view. Based on these hypotheses, commands to the actuator module can be derived to change the camera position in an attempt to refine the hypothesis.

1.4 Task

In the context of this project, this work covers the database and the classification matching component. The original task was to design and implement:

1. a data structure capable of importing and holding all information that is extracted from an image by the detection module. This information includes:
   • type (ellipse, triangle or rectangle),
   • size (width and height),
   • rotation from the main axis, and
   • position
   of the shape-primitives detected in the image,
2. a database structure that holds information about all object views known to the system, and
3. an algorithm that finds the most similar object within the database for any given query.


The implementation is to be in C/C++ and should be capable of running in real-time, to be applicable in the active vision domain. During the process of development, the following task was added:

4. Provide an algorithm to reduce the size of the database to one entry for each of the six orthogonal 2D views of an object, i.e. a set of entries representing the same object from the same view (e.g. under slightly varying lighting conditions) has to be reduced to a single entry.

1.5 Outline of this report

The remaining chapters are organized as follows:

• Chapter 2 introduces the syntax used to encode the system's knowledge of objects in the world and covers tasks 1 and 2.
• Task 3 is addressed in chapter 3, where the classification matching algorithm is described.
• For the additional task of forming generalizations in the database, a genetic algorithm approach is presented in chapter 4.
• Chapter 5 deals with the graphical user interface (GUI) that has been developed for debugging and visualization purposes.
• Test results are presented and discussed in chapter 6.
• The last chapter concludes this work and proposes ideas for further improvement of the system.
• The appendix contains a detailed description of the implemented error functions for the shape properties, an overview of the source files, an analysis of the total number of possible matchings between two object views, and a detailed description of the data structure representing a (partial) matching.


Chapter 2 Database

Knowledge about the individual visual appearances of a collection of objects is stored in the database. Each object is represented by a maximum of six orthogonal 2D views, as exemplified in figure 2.1. (For identical views, only one representation with multiple labels is stored.)

Figure 2.1: Six orthogonal 2D views of a mug.

Each of these views is termed an "entry". Rather than storing these entries as images, geometric information is stored symbolically, using the syntax described in the following section and whose schematic is shown in figure 2.2. The symbolic representation reduces the amount of information stored in the database compared to images. Furthermore, the representation developed is size-, translation- and rotation-invariant, which benefits matching between query and database object views.


Figure 2.2: Object description syntax scheme.

2.1 Object representation syntax

Every entry is described as a geometric formation of shape-primitives. Each shape-primitive has a set of geometric properties prescribing its shape type, size, aspect ratio, rotation and coordinates relative to a reference point. The shape-primitives used in this application belong to three shape types: ellipses, isosceles triangles and rectangles. This set of shape types was chosen as they cover a wide range of shapes found in artificial objects, and each can be described by two parameters: height and width, see figure 2.3.


Figure 2.3: Shape-primitives ELLIPSE, (isosceles) TRIANGLE and RECTANGLE and their properties: width (a), height (b), diameter of circumcircle (d) and rotation angle (ϕ).

Rather than storing the height and width of each shape-primitive, a generic definition of size coupled with the aspect ratio is used for all shape types. The diameter of the circumcircle, i.e. the smallest circle encompassing the shape, is used as the size. For ellipses and rectangles, the ratio of the shorter dimension to the longer dimension is used as the aspect ratio, whilst for triangles the aspect ratio is the height divided by the


width. The relationship for converting from height/width to size/aspect ratio for each shape type is given in table 2.1.

           ellipse          triangle                          rectangle
  ratio    b/a  (a ≥ b)     b/a                               b/a  (a ≥ b)
  size     a                a · (1 + 4·ratio²) / (4·ratio)    a · √(1 + ratio²)

Table 2.1: Conversion of the properties of shape-primitives: computation of size (diameter of the circumcircle) and aspect ratio from width, a, and height, b.

To combine the shape-primitives into a description for individual entries, a Cartesian coordinate system is defined for the formation and normalized as follows:

• The center of the largest shape-primitive is used as the origin (0, 0).
• The sizes of all shape-primitives are normalized by the size of the largest shape-primitive in the entry, thus yielding size-invariant entries.
• The x-axis is defined by the line joining the largest shape-primitive and the shape-primitive farthest from it. The center of the farthest shape-primitive has coordinates (δ, 0), where δ is the distance between the largest and the farthest shape-primitive. Where the shape-primitives are concentric, i.e. δ is 0, the axis is defined such that the rotation of the largest shape-primitive is 0. (The rotation of each shape-primitive is defined as the angle between the width "a" in figure 2.3 and the positive x-axis.)
• The y-axis is defined in the usual manner, 90° counter-clockwise from the positive x-axis.
• The remaining shape-primitives are assigned coordinates relative to the origin as defined above.

Shape-primitives are ordered within each entry with the largest shape-primitive first, followed by the other shape-primitives sorted by non-ascending Euclidean distance of their circumcenters from the circumcenter of the first shape-primitive.

Given consistent detection of the largest shape-primitive and the farthest shape-primitive, this representation syntax provides a size-, translation- and rotation-invariant representation for each entry. These definitions impose natural bounds for the ranges of size, aspect ratio and rotation, as presented in table 2.2.

           ellipse    triangle    rectangle
  size     (0, 1]     (0, 1]      (0, 1]
  ratio    (0, 1]     (0, ∞)      (0, 1]
  ϕ        [0, π)     [0, 2π)     [0, π)

Table 2.2: Value ranges of shape-primitive properties: due to symmetries, only rotation angles up to π have to be considered for ellipses and rectangles. For ellipses and rectangles, ratio is defined as smaller dimension / bigger dimension, and thus its range is (0, 1]. For triangles, however, ratio and rotation angle can take any arbitrary value.
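To make this concrete, the following is a minimal C++ sketch of the conversions in table 2.1 and of the normalization rules above. All names (Shape, toSizeRatio, normalizeEntry) are illustrative assumptions and are not taken from the project's source files; the rotation of the x-axis towards the farthest shape-primitive is omitted for brevity.

#include <algorithm>
#include <cmath>
#include <vector>

enum class ShapeType { Ellipse, Triangle, Rectangle };

struct Shape {
    ShapeType type;
    double x, y;      // circumcenter coordinates
    double size;      // diameter of the circumcircle
    double ratio;     // aspect ratio as defined in table 2.1
    double rotation;  // angle between the width "a" and the positive x-axis
};

// Convert width a and height b to (size, ratio) following table 2.1.
// For ellipses and rectangles the caller must ensure a >= b.
void toSizeRatio(ShapeType type, double a, double b,
                 double& size, double& ratio) {
    ratio = b / a;
    switch (type) {
        case ShapeType::Ellipse:
            size = a; break;
        case ShapeType::Triangle:
            size = a * (1.0 + 4.0 * ratio * ratio) / (4.0 * ratio); break;
        case ShapeType::Rectangle:
            size = a * std::sqrt(1.0 + ratio * ratio); break;
    }
}

// Normalize an entry: the largest shape-primitive becomes the origin, all
// sizes are divided by its size, and the remaining shape-primitives are
// sorted by non-ascending distance from the origin.
void normalizeEntry(std::vector<Shape>& shapes) {
    if (shapes.empty()) return;
    std::iter_swap(shapes.begin(),
                   std::max_element(shapes.begin(), shapes.end(),
                       [](const Shape& l, const Shape& r) { return l.size < r.size; }));
    const double s = shapes[0].size;
    const double ox = shapes[0].x, oy = shapes[0].y;
    for (Shape& sh : shapes) {
        sh.x = (sh.x - ox) / s;
        sh.y = (sh.y - oy) / s;
        sh.size /= s;
    }
    std::sort(shapes.begin() + 1, shapes.end(),
              [](const Shape& l, const Shape& r) {
                  return std::hypot(l.x, l.y) > std::hypot(r.x, r.y);
              });
}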

2.2 Building and storing the database

The database is built by importing descriptions from the detection component: raw information about the detected shape-primitives (type, position, aspect ratio and rotation angle) is read from a text file and normalized as described in the preceding section. The imported data is stored in two flat text files: one holds the description of the entries based on the object representation syntax, and the other contains the corresponding text descriptions (or labels), e.g. "mug front 04". An object view may have multiple text descriptions, as different objects may look the same from certain points of view. Figure 2.5 shows the result of a normalized object view representation of the input shown in figure 2.4. The object view in figure 2.4 will henceforth be represented by the collection of shape-primitives, shape 0 to 5, as an entry in the database.

Table 2.2: Value ranges of shape-primitive properties: Due to symmetries only rotation angles up to π for ellipses and rectangles have to be considered. For ellipses and rectandimension gles, ratio is defined as smaller and thus its range is (0, 1]. For triangles, however, bigger dimension ratio and rotation angle can take any arbitrary value. angle) is read from a text file and normalized as described in the preceding section. The imported data is stored in two flat text files: One holds the description of the entries based on the object representation syntax and the other contains the corresponding text descriptions (or labels), e.g. “mug front 04”. For an object view there may be multiple text descriptions as different objects may look the same from certain points of view. Figure 2.5 shows the result of a normalized object view representation of the input shown in figure 2.4. The object view in figure 2.4 will henceforth be represented by the collection of shape-primitives, shape 0 to 5, as an entry in the database.

Chapter 2. Database

11

Figure 2.4: Shape-primitives detected by the detection module.

Figure 2.5: Normalized object view representation of the input shown in figure 2.4.
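As an illustration only, a reader of the flat entry file might look like the sketch below. The exact line format of the two text files is not documented in this report, so a whitespace-separated layout with one object view per line is assumed here.

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct ShapeRecord { int type; double x, y, size, ratio, rotation; };

// Read one object view (entry) per line; each shape-primitive is assumed to
// be a group of six whitespace-separated fields.
std::vector<std::vector<ShapeRecord>> loadEntries(const std::string& path) {
    std::vector<std::vector<ShapeRecord>> entries;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<ShapeRecord> entry;
        ShapeRecord r;
        while (fields >> r.type >> r.x >> r.y >> r.size >> r.ratio >> r.rotation)
            entry.push_back(r);
        if (!entry.empty()) entries.push_back(entry);
    }
    return entries;
}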



Chapter 3 Classification matching component

After the detection results from a query image have been converted into the object representation syntax, they are passed as a query to the classification matching component. The following section describes the comparison of two object views. In section 3.2, an algorithm is presented that extends this basic operation to an operation on the whole database, finding the entry most similar to the query.

3.1 Comparison of two object views

In order to compare two arbitrary entries, α and β, a similarity measure is defined, based on one-to-one mappings between the shape-primitives in entries α and β; see figure 3.1.

Figure 3.1: One-to-one mapping of shape-primitives of two object views α and β.

Such a one-to-one mapping, called a "matching", consists of several elementary mappings that each map one shape-primitive to another. Each result of such an elementary one-to-one mapping of shape-primitives, termed a "shape match", may differ in the


properties of the shape-primitives except the shape type, as it is assumed that there is no accidental cross-shape-type detection. For each shape match, the differences in size, aspect ratio, rotation angle and position are accumulated in a shape match error. The only exceptions are matches with the virtual shape-primitive "unmatched". A match of a shape-primitive of α with "unmatched" indicates that the specific shape-primitive has not been matched with any shape-primitive of β. In this case, a special shape match error (e_unmatched) is computed as a penalty, because there are no properties that could be compared. For details on the computation of the specific errors, refer to appendix A. Accumulating all elementary shape match errors finally results in an error for the whole matching.

Only the error for the aspect ratio is translation-, rotation- and scaling-invariant. For the comparison of sizes, rotation angles and positions, some alignment is necessary to achieve scaling-, translation- and rotation-invariance. This is demonstrated in the following, considering β to be the object view shown in figure 2.5 and α to be an object view artificially derived from β. Figure 3.2a shows α (left) and β (right), normalized and at the same scale (the diameter of the biggest circumcircle in each object view is 1.0). α has been constructed from β assuming, for demonstration purposes, that the biggest shape-primitive (the rectangle that covers most of the main body of the mug) has for some reason not been detected (e.g. due to different lighting conditions), although that seems rather unlikely. Furthermore, one additional small rectangle has been detected but not the triangle (which represents a shadow region), and the properties of the remaining shape-primitives have been altered slightly. This example resembles the worst case scenario for a comparison, because the complete set of transformations shown in the following has to be applied.

Figure 3.3 shows a flowchart of the first part of the comparison algorithm. The first transformations are applied the first time a shape-primitive of α is not matched with "unmatched". In the example, the first (i.e. biggest) shape-primitive of α has been matched with a corresponding shape-primitive of β, as shown in figure 3.2b. For the position and rotation comparison, the following transformation has to be made: both object views are shifted so that the circumcenters of both shape-primitives of the matched pair are at (0, 0). To attain scaling invariance, β is scaled so that the relative sizes of both shape-primitives of the corresponding pair are the same. As a result, the errors for size and position of this shape match are zero. At this stage, the coordinate systems of the two object views use the same scale and correspond at least at the origin. Obviously, the latter computations are only necessary if the first shape-primitive of α has not been matched with the first shape-primitive of β, as in the case of the example.

Figure 3.4 shows a flowchart of the remaining computations. For the following shape matches, the error computation is limited to the size and aspect ratio errors, because the rotation angles cannot be compared, and the position error is substituted by a distance error as long as the rotation of β has not been aligned. To do this final step of alignment, a second shape match has to be found, with the additional constraint that the positions of both shape-primitives must be different from (0, 0); see figure 3.2c.


Figure 3.2: Matching example: In this illustrative (worst) case the complete set of transformations has to be applied for the comparison.


Figure 3.3: Comparison of two entries - flowchart (part 1). (For details on the computation of the specific shape property errors refer to appendix A.)


Figure 3.4: Comparison of two entries - flowchart (part 2). (For details on the computation of the specific shape property errors refer to appendix A.)


This shape match is very likely to be found very early, because of the order of the shape-primitives within an object view, as explained in section 2.1. After the rotation, all positions and rotation angles refer to the same coordinate system. At this stage, all four errors mentioned above can be computed, and the errors for all previous shape matches are updated.

Continuing the example, the next two rectangles of α are matched with rectangles of β, as shown in figure 3.2d. The last remaining rectangle of α is matched with "unmatched", leaving two unmatched shape-primitives of β, because there are no more shape-primitives of α they could be matched with (figure 3.2e). Theoretically, the last shape-primitive of α could be matched with the big rectangle of β, which might result in a matching with a smaller error. In fact, this example covers only one out of an exponential number of possible matchings. A formula for the number of possible matchings between two object views is given in appendix C. The smallest error that a matching between two object views produces defines the similarity of the two views.
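The alignment transformations used in this comparison can be summarized in a small C++ sketch. The structure and function names are hypothetical; the angles passed to rotateBeta are assumed to be the directions (e.g. atan2(y, x)) of the circumcenters of the second matched pair.

#include <cmath>
#include <cstddef>
#include <vector>

struct Shape { double x, y, size, rotation; };  // simplified shape-primitive

// Shift a view so that the circumcenter of the shape-primitive at index ref
// becomes the origin (0, 0).
void shiftToOrigin(std::vector<Shape>& view, std::size_t ref) {
    const double ox = view[ref].x, oy = view[ref].y;
    for (Shape& s : view) { s.x -= ox; s.y -= oy; }
}

// Scale beta so that the matched shape-primitives have the same size.
void scaleBeta(std::vector<Shape>& beta, double betaSize, double alphaSize) {
    const double f = alphaSize / betaSize;
    for (Shape& s : beta) { s.x *= f; s.y *= f; s.size *= f; }
}

// Rotate beta so that the second matched pair points in the same direction
// in both views; all rotation angles of beta are adjusted accordingly.
void rotateBeta(std::vector<Shape>& beta, double betaAngle, double alphaAngle) {
    const double d = alphaAngle - betaAngle;
    const double c = std::cos(d), s = std::sin(d);
    for (Shape& p : beta) {
        const double x = p.x * c - p.y * s;
        const double y = p.x * s + p.y * c;
        p.x = x; p.y = y;
        p.rotation += d;
    }
}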

3.2 Processing a query

For a given query, the most similar entry within the database has to be found, based on the similarity measure introduced in the preceding section. Thus, the original task (of finding the most similar entry) can be redefined as: find the optimal matching (i.e. the one with the minimum error) out of all possible matchings of the query with an entry of the database.

In the following, it is important to differentiate between complete and partial matchings. Recalling the comparison example from section 3.1, a (partial) matching describes the shape matches and alignments at an intermediate stage of the comparison. A matching describing the end stage is a complete matching, because all shape-primitives of both object views have been matched (shape-primitives mapped to "unmatched" are regarded as matched). In contrast, partial matchings describe stages of the comparison where there are still "free" shape-primitives, i.e. shape-primitives that are matched neither with other shape-primitives nor with "unmatched". A detailed description of the data structure representing a (partial) matching is given in appendix D.

The search space for the matching problem contains all possible partial and complete matchings of the query object view with all object views in the database. Obviously, a solution of the problem is a complete matching. Hence, the solution space is the subspace of the search space that comprises only the complete matchings. This solution space is exponential: refer to appendix C for a formula to compute the number of possible matchings between two object views; the size of the solution space is the sum of the number of possible matchings of the query with each entry of the database. The search space can be structured as a tree with the empty matching (a partial matching that contains no matching data, i.e. no database entry has been assigned and no shape-primitive has been matched) as its root.


Figure 3.5: Optimal search tree for the example used in section 3.1. (Matchings with database entries other than β are not shown.)

The root has n children, where n is the number of object views in the database. Let this be the 0-th level of the search tree. To each partial matching at this level, a corresponding database entry has been assigned, but apart from that these nodes contain no matching data. All nodes in the subtree rooted at a node at level 0 correspond to matchings with the same database entry. The tree has |α| more levels, where α is the query object view and |·| denotes the number of shape-primitives. The nodes at levels 1 ≤ k ≤ |α| are all possible extensions of the partial matchings of level k − 1 that can be constructed by matching the k-th shape-primitive of the query. The leaves of the search tree (i.e. the nodes at level |α|) are complete matchings. Figure 3.5 shows the optimal search tree that leads to the matching constructed in the example used in section 3.1.

A naive exhaustive search algorithm would solve the optimal matching problem by simply enumerating all possible matchings and picking the solution with the minimum


error. An enumeration of all possible matchings can be obtained e.g. by breadth-first or depth-first traversal of the tree described above. Obviously, this is a highly inefficient algorithm. It can be observed that extending a matching cannot decrease the matching error. Thus, the error of an internal node in the search tree is a lower bound for the matching error of all complete matchings (leaves) in the subtree rooted at this node. The whole subtree can be pruned if a complete matching with a smaller error has already been found. Pruning significantly improves the algorithm's efficiency, but depends very much on the quality of the complete matching that is used for pruning. Additionally, pruning cannot be applied as long as no complete solution has been found. Using a heuristic to compute a start solution before traversing the search tree solves these problems. In addition, the algorithm's efficiency can be further improved by finding a tighter lower bound for the matching error. An algorithm that incorporates these ideas is called a "branch & bound algorithm" [LW66].

A generic branch & bound algorithm is described in section 3.2.1. The data structure holding all (partial) solutions created by the branch & bound algorithm is explained in section 3.2.2. Section 3.2.3 gives a detailed description of the lower bound that is used for pruning. The heuristic to compute the start solution is presented in section 3.2.4. As an optional replacement for the lower bound, an error-overestimating heuristic is proposed in section 3.2.5. Finally, section 3.2.6 discusses further extensions of the branch & bound algorithm.

3.2.1 Branch & bound algorithm

The main structure of the generic branch & bound algorithm is presented in algorithm 1. At the beginning, an initial solution is generated using a (greedy) heuristic and stored as the best complete solution known so far. This best known solution is used for pruning the search tree ("bound") and is updated every time a better (complete) solution is found (line 8). The search begins with the empty solution as root. In each iteration, the node that represents the "best" partial solution amongst those created so far is expanded ("branch"). Generally, the branch & bound algorithm places no constraints on the choice of the partial solution to be extended. However, for this application, the algorithm converges to the best solution faster if only those nodes that lead to good solutions are expanded. In the expansion step (line 5), the next free shape-primitive of the query (following the order of shape-primitives introduced in section 2.1) is matched with all free shape-primitives (of the same shape type) of the database entry the partial matching refers to, and with "unmatched". From the resulting (partial) solutions, only those that could lead to a better solution than the one known so far are kept, and the algorithm stops when no better solutions can be constructed.


Algorithm 1 Generic branch & bound algorithm.
Require: problem: min {f(x) | x ∈ B, B ≠ ∅, |B| < ∞}
Ensure: optimal solution: best and f(best)
 1: best ← initial solution
 2: list ← {the empty solution}
 3: while list is not empty do
 4:   x ← best partial solution from list
 5:   for all possible extensions c of x do
 6:     if f_lower_bound(c) < f(best) then
 7:       if c is a complete solution then
 8:         best ← c
 9:         for all e ∈ list do
10:           if f(e) ≥ f(best) then
11:             remove e from list
12:           end if
13:         end for
14:       else
15:         insert c into list
16:       end if
17:     end if
18:   end for
19: end while

The performance of the search is highly dependent on the quality of the lower bound of the error, as this lower bound is used, in line 6, to prune the search tree. Obviously, the greater the underestimate of the minimum error of a partial solution, the more branches are created in the search tree, thus affecting performance.


On the other hand, the error should not be overestimated, as a branch leading to the best solution could be cut off, and the algorithm could no longer be guaranteed to return the best solution. However, as a trade-off between the quality of the solution and speed, the lower bound may be replaced by a heuristic that may overestimate the minimum error.
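For illustration, the loop of algorithm 1 can be written compactly in C++ as follows. Matching and the free functions are placeholder declarations for the structures described in this chapter (sections 3.2.3 and 3.2.4, appendix D), not the project's actual interfaces, and a std::multimap keyed by the lower bound stands in for the AVL tree of section 3.2.2.

#include <map>
#include <vector>

struct Matching { /* partial matching data, see appendix D */ };

// Placeholders, implemented elsewhere:
double error(const Matching& m);           // error of a (complete) matching
double lowerBound(const Matching& m);      // lower bound, section 3.2.3
bool isComplete(const Matching& m);
std::vector<Matching> extensions(const Matching& m);
Matching initialSolution();                // greedy start solution, section 3.2.4

Matching branchAndBound(const Matching& empty) {
    Matching best = initialSolution();
    double bestError = error(best);
    std::multimap<double, Matching> open;  // sorted by lower bound
    open.emplace(lowerBound(empty), empty);
    while (!open.empty()) {
        Matching x = open.begin()->second;         // best partial solution
        open.erase(open.begin());
        for (const Matching& c : extensions(x)) {
            if (lowerBound(c) >= bestError) continue;  // bound: prune c
            if (isComplete(c)) {
                best = c;
                bestError = error(c);
                // Batch-prune queued partial solutions that cannot win anymore.
                open.erase(open.lower_bound(bestError), open.end());
            } else {
                open.emplace(lowerBound(c), c);    // branch: keep for expansion
            }
        }
    }
    return best;
}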

3.2.2 Underlying data structure

The data structure that holds all partial solutions created by the branch & bound algorithm (referred to as "list" in algorithm 1) is implemented as an "AVL tree". (In the current implementation, only the node with the smallest value has to be returned. Thus, the full functionality of the look-up operation is not needed, and the AVL tree could be replaced by a simpler data structure such as a heap; however, this would probably result in only a slight improvement of the performance of the branch & bound algorithm. During development, the additional functionality of the AVL tree had seemed to be required, and the look-up operation has been used extensively during debugging.)

Named after its inventors, Adelson-Velskii and Landis [AVL62], an AVL tree is a height-balanced binary search tree: each node within the tree has at most two child subtrees, which may differ in height by at most one. All nodes in the left subtree have smaller values, whereas those in the right subtree have bigger values. Look-up, insertion and deletion are all O(log(n)) in both the average and worst cases, where n is the number of nodes. These operations are used very frequently by the branch & bound algorithm: in each iteration there are one look-up (line 4) and multiple insertions (depending on the number of good extensions of the partial solution chosen in this iteration) (line 15). Every time a complete solution is found, a batch deletion is called to prune the search tree (line 11).

Inserting or deleting a node may result in an unbalanced subtree, but rebalancing is done in only a few operations: at most one rotation is required after an insert operation, whereas O(log(n)) rotations may be required after a delete operation, because it might be necessary to continue rebalancing back up the tree after a rotation. Figure 3.6 shows the four rebalancing operations:

• Left-left-heaviness can occur in a subtree of an AVL tree after a node has been inserted into subtree 1 or deleted from subtree 3 (in this case, subtree 2 may have height h + 1 as well). The operation labeled "LL" is a single right rotation in node B.
• Left-right-heaviness can result from deleting a node from subtree 4 or from inserting a node into subtree 2 or 3 (in this case, both subtrees of node C would have had height h − 1 before the insertion, and only one of them - it does not matter which one - would have height h after the insertion). The operation labeled "LR" is a left rotation in node A (to reduce the problem to the case of left-left-heaviness) followed by a right rotation in B. The whole operation is called a "double rotation".
• Right-right-heaviness and right-left-heaviness are basically the symmetric cases of left-left-heaviness and left-right-heaviness.


Figure 3.6: The four cases of "heaviness" that can occur in an AVL (sub)tree and the rotations required to rebalance it.

All rotations can be performed in O(1), as it is only necessary to update a fixed number of pointers. The nodes of the AVL tree are implemented as linked lists, to be able to hold multiple partial matchings with the same lower bound for the matching error. (The lower bound is used as the value of a node; it is introduced in section 3.2.3.) This is necessary because using the matchings directly as tree nodes would result in multiple nodes having the same value. Note that multiple entries for the same partial matching cannot occur, because each partial solution can be created only once (see the description of how a partial matching is extended in section 3.1). Two operations can be performed on a node: a matching can be inserted (push) or retrieved and deleted (pop). Theoretically, the pop operation may return any arbitrary element of the list. Here, a stack-like LIFO (last in, first out) behavior has been chosen, because returning the head of the list involves the least computational cost (O(1)). (Besides, it slightly biases the branch & bound algorithm towards a depth-first search.)
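As an illustration of the rebalancing operations, the sketch below shows an AVL node and the single right rotation used in the left-left-heavy case ("LL" in figure 3.6). It is simplified relative to the implementation described above, which additionally stores a linked list of matchings in each node.

struct AvlNode {
    double key;                 // lower bound of the matching error
    int height = 1;
    AvlNode* left = nullptr;
    AvlNode* right = nullptr;
};

int heightOf(const AvlNode* n) { return n ? n->height : 0; }

void updateHeight(AvlNode* n) {
    const int l = heightOf(n->left), r = heightOf(n->right);
    n->height = 1 + (l > r ? l : r);
}

// Single right rotation in node b: its left child a becomes the new root of
// the subtree, and a's former right subtree is reattached as b's left child.
AvlNode* rotateRight(AvlNode* b) {
    AvlNode* a = b->left;
    b->left = a->right;
    a->right = b;
    updateHeight(b);  // b first: it is now a child of a
    updateHeight(a);
    return a;         // new subtree root
}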


3.2.3 Lower bound computation

Pruning the branch & bound search tree requires the computation of a lower bound of the matching error for partial matchings. (For complete solutions, the matching error can be computed as presented in section 3.1.) The lower bound of the matching error can be divided into three parts:

1. The initial error is the sum of the individual shape match errors for all shape matches of the partial matching. This error is exact, as these shape matches are fixed, and it cannot decrease during further extensions of the partial solution.

2. The estimated error is a lower bound for the increase of the matching error during further extension of the partial solution; see algorithm 2. For each shape type, t, the maximum number of shape-primitives that can be matched in further extensions, nt, is calculated (lines 3-5). Then, for each free shape-primitive of the query, the minimum shape match error over all possible shape matches with shape-primitives of the database entry or "unmatched" is computed (lines 6-10). Finally, the estimated error is accumulated from the nt smallest of these shape match errors for each shape type t (lines 11-15). This procedure underestimates the real error, as the estimate permits multiple shape-primitives of the query to be matched to a single database shape-primitive.

3. The unmatched error is a lower bound for the matching error resulting from shape matches with "unmatched". The computation, shown in algorithm 3, is very similar to the one for the estimated error: firstly, for each shape type, t, the minimum number of shape-primitives that have to be matched with "unmatched", ut, is determined (lines 3-5). Depending on whether the query or the database entry has more shape-primitives of this type, shape-primitives of the query or the database entry are chosen (lines 6-10). Then the unmatched error is accumulated from the |ut| smallest shape match errors for matching free shape-primitives of this shape type with "unmatched" (lines 11-15). When all shape-primitives of the query are matched, i.e. for a complete solution, this part of the matching error holds the penalty for shape-primitives of the database entry that were matched with "unmatched".

Algorithm 2 Computation of the estimated error.
Require: Fα - set of free shape-primitives of α (the query)
Require: Fβ - set of free shape-primitives of β (the database entry)
Ensure: error_estimated
 1: error_estimated ← 0
 2: for all t ∈ {ELLIPSE, TRIANGLE, RECTANGLE} do
 3:   Fα,t ← {s ∈ Fα | type(s) = t}
 4:   Fβ,t ← {s ∈ Fβ | type(s) = t}
 5:   nt ← min{|Fα,t|, |Fβ,t|}
 6:   Et ← ∅
 7:   for all s ∈ Fα,t do
 8:     e ← min{shape match error(s, s′) | s′ ∈ Fβ,t ∪ {"unmatched"}}
 9:     Et ← Et ∪ {e}
10:   end for
11:   for 0 ≤ i < nt do
12:     e ← min{Et}
13:     Et ← Et \ {e}
14:     error_estimated ← error_estimated + e
15:   end for
16: end for

Algorithm 3 Computation of the unmatched error.
Require: Fα - set of free shape-primitives of α (the query)
Require: Fβ - set of free shape-primitives of β (the database entry)
Ensure: error_unmatched
 1: error_unmatched ← 0
 2: for all t ∈ {ELLIPSE, TRIANGLE, RECTANGLE} do
 3:   Fα,t ← {s ∈ Fα | type(s) = t}
 4:   Fβ,t ← {s ∈ Fβ | type(s) = t}
 5:   ut ← |Fα,t| − |Fβ,t|
 6:   if ut > 0 then
 7:     Et ← {shape match error(s, "unmatched") | s ∈ Fα,t}
 8:   else if ut < 0 then
 9:     Et ← {shape match error(s, "unmatched") | s ∈ Fβ,t}
10:   end if
11:   for 0 ≤ i < |ut| do
12:     e ← min{Et}
13:     Et ← Et \ {e}
14:     error_unmatched ← error_unmatched + e
15:   end for
16: end for

3.2.4 Upper bound computation

In line 1 of the branch & bound algorithm (see algorithm 1), an initial solution has to be generated. Algorithm 4 shows the basic structure of the algorithm used. It is a greedy algorithm that extends a partial matching, M, to a complete matching, M′, by matching each free shape-primitive, s, of the query object view, α, with the free shape-primitive, s′, that has the same shape type as s and produces the minimum shape match error (line 5).


Algorithm 4 Computation of an upper bound for the matching error.
Require: partial matching M of object views α (query) and β (database entry)
Ensure: complete matching M′ with error_UB
 1: M′ ← M
 2: Fα ← {s ∈ α | free(s)}
 3: Fβ ← {s ∈ β | free(s)}
 4: for all s ∈ Fα do
 5:   m(s) ← argmin{shape match error(s, s′) | s′ ∈ Fβ ∪ {"unmatched"}, type(s′) = type(s)}
 6:   M′ ← M′ ∪ {(s, m(s))}
 7:   if m(s) ≠ "unmatched" then
 8:     Fβ ← Fβ \ {m(s)}
 9:   end if
10: end for

The matching error for such an initial solution is used as an initial upper bound for the matching error. All (partial) solutions that exceed this threshold can be discarded, as a better solution is already known. Originally intended only for the initial solution, this algorithm can be applied to any partial solution, M, yielding an upper bound for the minimum matching error of all solutions in the subtree of the branch & bound search tree that is rooted at M.

3.2.5 An error-overestimating heuristic

As a trade-off between the quality of the solution and speed, the lower bound used to prune the branch & bound search tree may be replaced by a heuristic that may overestimate the minimum error. Using such a heuristic, it is no longer guaranteed that the best solution will be found, but on the other hand the algorithm can be sped up. The implemented heuristic is simply a weighted mean of the lower bound and the upper bound introduced in the preceding paragraphs (see algorithms 2, 3 and 4):

error_sum = error_init + w_UB · error_UB + (1.0 − w_UB) · (error_estimated + error_unmatched)

To use the heuristic, the module has to be compiled with the switch USE_HEURISTIC defined in config.h. The weight w_UB is defined by HEURISTIC_WEIGHT_UB in config.h and may take any value between 0 and 1. The higher the weight, the bigger the influence of the upper bound and the higher the probability that the branch of the search tree that leads to the best solution is cut off. During the test period, the heuristic has only been used to speed up the genetic algorithm that is presented in chapter 4 (the switch GA_USE_ONLY_ESTIMATION has to be defined in config.h). But as the genetic algorithm is supposed to run offline, where there is usually no need for real-time capability, the heuristic is disabled by default.
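Expressed as code, the heuristic is a single weighted sum; the constant below stands in for the compile-time HEURISTIC_WEIGHT_UB from config.h and is purely illustrative:

const double kWeightUB = 0.5;  // any value between 0 and 1

double heuristicError(double errorInit, double errorUB,
                      double errorEstimated, double errorUnmatched) {
    return errorInit
         + kWeightUB * errorUB
         + (1.0 - kWeightUB) * (errorEstimated + errorUnmatched);
}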

3.2.6 Further extensions of the branch & bound algorithm

During the evaluation process, and for the application in an active vision process, it has become useful to have information not only on the best solution but on the k best ones. Unfortunately, branch & bound algorithms are not designed to return more than the best solution. To illustrate this, assume that the initial solution is already the best one; as a result, no leaf node of the search tree (complete solution) with a bigger error would be reached. This behavior can be circumvented by either:

• running the branch & bound algorithm k times, excluding the database entry referring to the best solution of each run from the database, or
• not pruning (and letting the tree grow exponentially), or
• allowing the error of (partial) solutions to exceed the upper bound by a certain amount.

Obviously, the first approach is very inefficient and would take too much time. For the second approach, an exponential amount of space would be necessary. Thus, a derivative of the third approach has been implemented, posing only the constraint that at least k object views have to be stored in the database. The variable best in algorithm 1 is extended to an array storing k solutions, which is consequently initialized with k initial solutions; a sketch is given below. For the pruning step (line 6), the worst of all solutions in best is used. For line 8, a more complex update logic has been implemented. Additionally, the switch DONT_ALLOW_MULTIPLE_MAPPINGS can be set in config.h to prevent the algorithm from returning multiple matchings with the same database entry (such matchings are very likely, but usually the question is which database entry is the next most similar one). Obviously, more nodes will be created because of the weakened bounding criterion, which has a negative impact on the performance of the algorithm.

In every iteration, the branch & bound algorithm accesses the leftmost node in the AVL tree. The access cost for this node can theoretically be improved by an additional pointer to it, reducing the complexity of this access operation from O(log(n)) (where n is the number of nodes in the AVL tree) to O(1), i.e. constant time. This optional extension can be enabled by defining CAVLTREE_FIRSTPTR in CAVLTree.h. However, in practical use it does not improve performance but seems to slightly increase the running time (for a small database as well as for a large database with 5000 random entries). This unexpected behavior is likely to be caused by the additional logic that is needed to maintain the pointer to the leftmost node.
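The bookkeeping for the k-best extension might look like the sketch below: best becomes a small array, sorted by error, of the k best complete solutions found so far, and the pruning bound is the error of the worst of them. Solution is a placeholder type, and the duplicate-entry filtering controlled by DONT_ALLOW_MULTIPLE_MAPPINGS is omitted.

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Solution { double err; /* complete matching data */ };

class KBest {
public:
    explicit KBest(std::size_t k) : k_(k) {}

    // Worst of the k best solutions so far; used as the pruning bound.
    double pruningBound() const {
        return best_.size() < k_ ? std::numeric_limits<double>::infinity()
                                 : best_.back().err;
    }

    // Insert a new complete solution, keeping the array sorted and capped at k.
    void offer(const Solution& s) {
        auto pos = std::lower_bound(best_.begin(), best_.end(), s,
            [](const Solution& l, const Solution& r) { return l.err < r.err; });
        best_.insert(pos, s);
        if (best_.size() > k_) best_.pop_back();
    }

private:
    std::size_t k_;
    std::vector<Solution> best_;  // ascending by error
};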


Chapter 4 Forming generalizations in the database

To reduce the size of the database to one entry for each of the six orthogonal 2D views, a set of entries representing the same object from the same view (e.g. under slightly varying lighting conditions) has to be reduced to a single entry that should in some way resemble all entries in the set. As there is no operation that computes an "average" object view (the varying description length makes it even more complicated), a genetic algorithm has been applied to find such an "average" object view.

4.1 About genetic algorithms

Genetic algorithms, like neural networks, fuzzy systems and probabilistic reasoning, belong to the "soft computing" techniques. Soft computing provides tolerance of imprecision, uncertainty and partial truth, as well as low solution cost, which makes it very attractive for problems of high computational complexity or with incomplete or inaccurate input data. On the other hand, solutions are only approximated; there is no guarantee that an optimal solution will be found.

Generally, genetic algorithms (also referred to as evolutionary algorithms) can be applied to any kind of optimization problem, such as parameter optimization, path-finding problems or strategy-finding problems. They can easily be parallelized and can search spaces of hypotheses containing complex interacting parts [Mit97]. Their basic underlying idea is to simulate the process of biological evolution, which has proven to be a robust method for adaptation within biological systems: starting from an initial population of individuals, following generations are generated by random variation (mutation) and combination (crossover). During this recombination process, new features can evolve. Individuals with advantageous features are favored in the selection for the next generation: they benefit from their better "fitness" and thus have a higher probability of having offspring.

Algorithm 5 shows the structure of a generic genetic algorithm.


Algorithm 5 Generic genetic algorithm as e.g. in [GKK04].
Ensure: best hypothesis in pop_t
 1: t ← 0
 2: initialize pop_t
 3: evaluate pop_t
 4: while termination criterion is not met do
 5:   t ← t + 1
 6:   select pop_t from pop_{t−1}
 7:   alter pop_t
 8:   evaluate pop_t
 9: end while

The implementation of such an algorithm comprises:

• Representation of an individual (hypothesis, solution candidate)
The representation of the individuals defines the search space of the GA. According to Goldberg's "principle of the minimal alphabet" [Gol89], the smallest representation that permits a natural expression of the problem should be selected. Choosing an oversized representation might result in time wasted searching irrelevant regions of the search space; conversely, some (possibly good) hypotheses might not be representable if an undersized representation is selected. A very common way to code hypotheses (e.g. sets of if-then-else rules) is by bit strings, which can easily be manipulated by genetic operators. Symbolic representations, as widely used in the domain of genetic programming (e.g. in [Mit97]), require more sophisticated implementations of the genetic operators.

• Generation of an initial population
In line 2 of algorithm 5 the population is initialized, which usually means that it is generated at random. Thus a function is needed that randomly generates individuals.

• Definition of a fitness function
The fitness function is used to compute the quality of each hypothesis (lines 3 and 8 in algorithm 5). The definition of this function, together with the choice of the representation, is the most important and challenging task when implementing a GA. The fitness function is so important because the GA actually optimizes this function and not the original problem. Thus, mistakes in the definition of the fitness function will have a great impact on the quality of the result of the GA.

• Selection
Based on their fitness, individuals are selected for the next generation (line 6 of algorithm 5). There are several methods to do this, e.g.:
– fitness-proportionate selection (also known as roulette wheel selection or Monte Carlo selection): The probability that an individual is selected for the next generation is defined by the proportion of its fitness to the fitness of the whole population (the sum of the fitness values of all individuals of the population).
– rank selection: The probability that an individual is selected for the next generation is proportional to its rank (considering all individuals of the population to be sorted by their fitness).
– tournament selection: The "winner" of a tournament of k ≥ 2 randomly chosen individuals is selected. The chances for an individual to win a tournament depend on its fitness.

• Genetic operators
Genetic operators are needed to alter the population (line 7 of algorithm 5) to create individuals with new features. These operators usually correspond to those found in biological evolution. Common genetic operators are:
– mutation: An individual is randomly altered.
– crossover: Taking two individuals as input, one or two offspring are generated by recombination of the features of the parents.

• Termination criterion
GAs approximate solutions; therefore it cannot be expected that a perfect solution is found. To guarantee that the algorithm terminates, some criterion has to be defined that is checked after each iteration (line 4 of algorithm 5). Some possible termination criteria are:
– The maximum number of iterations has been reached.
– The fitness of the best individual is larger than a certain value.
– The fitness of the best individual or the average fitness did not improve by a certain amount during the last k iterations.

Each implementation detail will be addressed in the following sections; a generic skeleton of the loop itself is sketched below. For a more detailed overview of GAs refer to [Mit97] or [GKK04].
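The loop of algorithm 5 translates almost directly into code. The following is a minimal, self-contained sketch; the interface (function objects for initialization, fitness, selection, alteration and termination) is purely hypothetical and does not reflect the structure of the actual implementation:

    // Minimal sketch of algorithm 5 (hypothetical interface, not the actual code).
    #include <functional>
    #include <vector>

    template <typename Ind>
    Ind runGA(std::function<std::vector<Ind>()> initialize,
              std::function<double(const Ind&)> fitness,
              std::function<std::vector<Ind>(const std::vector<Ind>&)> select,
              std::function<void(std::vector<Ind>&)> alter,
              std::function<bool(int, const std::vector<Ind>&)> terminated)
    {
        int t = 0;
        std::vector<Ind> pop = initialize();  // line 2 (evaluation, lines 3 and 8,
                                              // is implicit in the fitness() calls)
        while (!terminated(t, pop)) {         // line 4
            ++t;                              // line 5
            pop = select(pop);                // line 6: select pop_t from pop_{t-1}
            alter(pop);                       // line 7: mutation and crossover
        }
        // return the fittest individual of the final population
        Ind best = pop.front();
        for (const Ind& h : pop)
            if (fitness(h) > fitness(best)) best = h;
        return best;
    }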

4.2 Representation of an individual and fitness function

For the representation of the individuals, the same coding as presented in section 2.1 is used. Some functions have been added to apply the genetic operators (see class CGAIndividual in appendix B). Using this representation, it is ensured that every potential hypothesis can be represented and that every individual in turn is a valid hypothesis (i.e. an object view as described in section 2.1). However, this choice of representation has a disadvantage as well: such a complex object description requires complex genetic operators, in contrast to the simple bit-manipulating operators that would have been applicable for a simple bit-string representation. This drawback is compensated by another big advantage: a fitness function that evaluates the quality of a hypothesis (i.e. the fitness of an individual) can easily be derived from the function that compares two object views (see section 3.1). As the GA has to find an "average" object view for a set of object views O, the best hypothesis minimizes the sum of the matching errors with all object views β in O:

    h_best = argmin_{h∈H} { Σ_{β∈O} matching_error(h, β) }    (4.1)

where H is considered to be the hypothesis space. An appropriate fitness function that satisfies

    h_best = argmax_{h∈H} { fitness(h) }    (4.2)

is obviously:

    fitness_1(h) = − Σ_{β∈O} matching_error(h, β)    (4.3)

This function has a negative co-domain, but to be able to apply fitness-proportionate selection the fitness values need to be in R+. Thus, the values need to be shifted, i.e. a positive constant a ∈ R+ has to be added:

    fitness_2(h) = fitness_1(h) + a    (4.4)

With

    a = − min_{h∈pop_t} { fitness_1(h) }    (4.5)

this resembles the linear dynamic scaling proposed by Grefenstette and Baker [GB89]. (pop_t denotes the population at time t as in algorithm 5.) Using this fitness function, the fitness of the "worst" individual is exactly 0 and all other individuals have positive fitness values at any time t. The fitness function resulting from the combination of equations 4.3, 4.4 and 4.5 is:

    fitness(h) = − Σ_{β∈O} matching_error(h, β) + max_{h′∈pop_t} { Σ_{β∈O} matching_error(h′, β) }    (4.6)

Defining the switch GA_SIGMA_SCALING in config.h enables a further transformation of the fitness function (equation 4.6) as follows:

    fitness′(h) = ( max(0, fitness(h) − (µ_t − b · σ_t)) )^k    (4.7)

The parameters b and k can be set in config.h using the defines GA_SIGMA_SCALING_A and GA_SIGMA_SCALING_POW. The default value of the former is 2.0; the latter can be a constant value close to 1.0 (the default is 1.005) or a dynamic value as e.g. proposed by Michalewicz [Mic96]. µ_t is the mean and σ_t the standard deviation of the distribution of fitness values in the population at time t (for the standard deviation the unbiased estimator has been used):

    µ_t = (1 / |pop_t|) · Σ_{h∈pop_t} fitness(h)    (4.8)

    σ_t = sqrt( (1 / (|pop_t| − 1)) · Σ_{h∈pop_t} (µ_t − fitness(h))² )    (4.9)

Equation 4.7 is a combination of the σ-scaling as defined by Goldberg [Gol89] and the exponential scaling as defined by Goldberg [Gol89] and Grefenstette and Baker [GB89]. In the case of k = 1 it is pure σ-scaling, which aims to lower the selection pressure when the deviation is high (in the early stage of the GA) and to increase it when the deviation is low (when convergence starts to take place). Usual values of the parameter b are in [1, 2]; for small b the pressure is higher than for bigger values. Parameter k controls the impact of the exponential scaling: values smaller than 1 decrease the selection pressure, whereas individuals with high fitness benefit from values bigger than 1.

4.3 Initial population

Several possibilities for the generation of the initial population have been implemented. The initialization method can be chosen via the parameter initMode of the GA; a dispatching sketch follows the list below. The following values are supported:

• GA_INIT_USE_RANDOM
This is the straightforward approach of generating individuals at random. A function for the random initialization of object views has been implemented. It takes as parameters value ranges for the number, size, aspect ratio and position of the shape-primitives. These values are derived from the object views that have to be approximated. Generated hypotheses will have only low fitness but cover the whole search space.

• GA_INIT_COPY_ONLY
Instead of starting with a completely random set of hypotheses, the GA is initialized with randomly chosen elements from the set of object views for which the average has to be computed. In most cases the size of this set is small in comparison to the size of the population, so there can be multiple copies of an element in the initial population. (Usually, this is not desired; therefore this initialization mode is not recommended.) Generated hypotheses will have relatively high fitness but cover only a tiny fraction of the search space, which holds the danger of getting stuck in a local optimum. However, the global optimum is very likely to be found in this small region.

• GA_INIT_ALTER_WEAK
The motivation for this mode is the same as for GA_INIT_COPY_ONLY, and they are nearly identical. The difference is that each copy is slightly altered using the mutateWeak operator that will be explained in section 4.5. This reduces the undesired extreme homogeneity of the initial population that is present in the second mode. The generated hypotheses will still have relatively high fitness but cover only a tiny fraction of the search space.

• GA_INIT_ALTER_STRONG
This mode is the default and is identical to GA_INIT_ALTER_WEAK except for the alteration operator that is applied. Here the mutateStrong operator is used (as explained in section 4.5), which leads to higher diversity in the initial population. The GA is still confined to a region of the search space, but the risk of suboptimal results is significantly reduced.

The default size of the population can be set by the GA_POPSIZE parameter in config.h.
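A sketch of how such a dispatch might look. The GA_INIT_* constants and the operator names come from the text; the class layout, the helper functions and all signatures are assumptions:

    // Sketch of the population initialization dispatch (signatures assumed).
    #include <cstddef>
    #include <vector>

    enum { GA_INIT_USE_RANDOM, GA_INIT_COPY_ONLY, GA_INIT_ALTER_WEAK, GA_INIT_ALTER_STRONG };

    struct ObjectView { /* symbolic object view, see section 2.1 */ };

    struct CGAIndividual : ObjectView {
        CGAIndividual() {}
        explicit CGAIndividual(const ObjectView&) { /* copy the set element */ }
        void mutateWeak()   { /* see section 4.5 */ }
        void mutateStrong() { /* see section 4.5 */ }
    };

    // Stubs; the real functions draw at random within the derived value ranges.
    CGAIndividual randomIndividual(const std::vector<ObjectView>&) { return CGAIndividual(); }
    const ObjectView& randomElement(const std::vector<ObjectView>& O) { return O.front(); }

    std::vector<CGAIndividual> initPopulation(int initMode, std::size_t popSize,
                                              const std::vector<ObjectView>& O) {
        std::vector<CGAIndividual> pop;
        for (std::size_t i = 0; i < popSize; ++i) {
            if (initMode == GA_INIT_USE_RANDOM) {
                pop.push_back(randomIndividual(O));    // low fitness, full coverage
            } else {
                CGAIndividual ind(randomElement(O));   // copy of a set element
                if (initMode == GA_INIT_ALTER_WEAK)   ind.mutateWeak();
                if (initMode == GA_INIT_ALTER_STRONG) ind.mutateStrong();
                pop.push_back(ind);                    // GA_INIT_COPY_ONLY: unaltered
            }
        }
        return pop;
    }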

4.4 Selection

A fitness-proportionate approach has been chosen as the selection method: on a "roulette wheel" with a fixed number of equal sectors (the parameter GA_ROULETTE_WHEEL_SIZE is defined in config.h and has a default value of 10000), each hypothesis in the population gets a certain fraction. The number of assigned sectors corresponds to the proportion of its fitness to the fitness of the whole population. With S denoting the total number of sectors:

    sectors(h) = ⌊ S · fitness(h) / Σ_{h′∈pop_t} fitness(h′) ⌋    (4.10)

Due to rounding, the total number of assigned sectors may vary (but never exceeds the maximum). Individuals of the next population are then selected by "turning" the "roulette wheel", i.e. a random integer between 1 and the total number of assigned sectors is generated and the individual that corresponds to the selected sector is copied to the next population. Optionally, an extension called "elitism" can be enabled (define GA_ELITISM in config.h, enabled by default). Elitism assures that the best hypothesis known so far will remain in the population: a copy of this hypothesis is preserved and does not undergo the process of alteration.
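A sketch of this selection scheme (hypothetical names; S corresponds to GA_ROULETTE_WHEEL_SIZE, and std::rand stands in for the project's random number generation in tools.h):

    // Sketch of fitness-proportionate selection on a sector-based roulette wheel.
    #include <cmath>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Equation 4.10: distribute S sectors proportionally to the fitness values.
    std::vector<int> assignSectors(const std::vector<double>& fitness, int S) {
        double total = 0.0;
        for (double f : fitness) total += f;
        std::vector<int> sectors(fitness.size());
        for (std::size_t i = 0; i < fitness.size(); ++i)
            sectors[i] = static_cast<int>(std::floor(S * fitness[i] / total));
        return sectors;
    }

    // "Turning the wheel": returns the index of the selected individual.
    std::size_t spin(const std::vector<int>& sectors) {
        int assigned = 0;
        for (int s : sectors) assigned += s;  // may be slightly less than S (rounding)
        int r = std::rand() % assigned + 1;   // random sector in [1, assigned]
        for (std::size_t i = 0; i < sectors.size(); ++i) {
            r -= sectors[i];
            if (r <= 0) return i;
        }
        return sectors.size() - 1;            // not reached if assigned > 0
    }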

4.5 Genetic operators

The following genetic operators have been implemented:


• mutateWeak
This operator slightly alters a single individual, i.e. it alters only one feature of one shape-primitive. Figure 4.1 shows the flowchart of the operator. At first, the shape-primitive and the feature to be altered are selected at random. (For a description of the shape properties refer to section 2.1.) If the shape type has been selected for alteration, it is simply overwritten by a new, randomly chosen value. For size, position (center), ratio or rotation, a random value is added to the old value. Bounds for these random values can be set in config.h. The object view needs to be normalized afterwards. (The normalization is described in section 2.1.)

• mutateStrong
The strong mutation operator overwrites a whole shape-primitive or adds a new one to an individual. This operator may change the description length of an individual; the input and output individuals may differ significantly in their fitness values.

• mutateWeakMod
This operator is an extension of the mutateWeak operator. The mutateWeak operator maintains the description length of the original individual, whereas in the extension, shape-primitives with a size below a certain threshold are removed and there is an additional case that creates new shape-primitives. (See the shaded regions in figure 4.1.) Thus mutateWeakMod may change the description length of an individual and may (partially or fully) replace mutateStrong.

• nPointCrossover
This operator overwrites an individual by a recombination of two other individuals; a sketch follows below. The flowchart is shown in figure 4.2. Figuratively speaking, the input individuals are broken into pieces, each piece having the size of one shape-primitive, and their order is preserved. The "child" is then assembled from the pieces as follows: for each index in the order of the shape-primitives, it is decided at random from which parent the piece is copied. If the selected parent does not have a shape-primitive with this index, the copy process is skipped. Obviously, the resulting individual will have a description length l, corresponding to its number of shape-primitives, with

    min{|α|, |β|} ≤ l ≤ max{|α|, |β|}

The crossover operator is applied before the mutation operators. The default impact of the specific operators can be controlled by the parameters GA_MUTATE_WEAK, GA_MUTATE_STRONG and GA_CROSSOVER in config.h. Different parameter values can be set when the GA is called.
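A sketch of the recombination step. The function is hypothetical; an object view is reduced here to its ordered list of shape-primitives, with TShape standing in for the type from datatypes.h:

    // Sketch of nPointCrossover on ordered lists of shape-primitives.
    #include <algorithm>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    struct TShape { /* shape-primitive, see datatypes.h */ };

    std::vector<TShape> nPointCrossover(const std::vector<TShape>& a,
                                        const std::vector<TShape>& b) {
        std::vector<TShape> child;
        const std::size_t maxLen = std::max(a.size(), b.size());
        for (std::size_t i = 0; i < maxLen; ++i) {
            // decide at random from which parent piece i is copied
            const std::vector<TShape>& parent = (std::rand() % 2 == 0) ? a : b;
            if (i < parent.size())           // skip if the chosen parent has no
                child.push_back(parent[i]);  // shape-primitive with this index
        }
        // child size l satisfies min{|a|,|b|} <= l <= max{|a|,|b|}
        return child;
    }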

4.6 Termination criterion

The GA terminates in either of the following two cases:


Figure 4.1: Flowchart of the mutateWeak and mutateWeakMod operators (extensions in mutateWeakMod are shaded).


Figure 4.2: Flowchart of the nPointCrossover operator.

1. The maximum number of iterations has been reached. The default threshold is defined by the parameter GA_MAX_EPOCHS in config.h.
2. The fitness of the best individual has not improved for the last GA_MAX_EPOCHS_WITH_NO_FITNESS_CHANGE iterations. If elitism (see section 4.4) is disabled, the current best hypothesis is compared to the best hypothesis of the preceding population instead of the all-time best. (Otherwise, this would lead to early termination if GA_MAX_EPOCHS_WITH_NO_FITNESS_CHANGE has a small value.) A sketch of the combined check follows below.
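The two tests combine into a simple check that is run once per iteration. The parameter names mirror the config.h defines; the function itself is hypothetical:

    // Sketch of the termination check run after each GA iteration.
    bool terminated(int t, int maxEpochs,                   // GA_MAX_EPOCHS
                    double bestFitness, double previousBestFitness,
                    int& epochsWithoutImprovement,
                    int maxEpochsWithoutImprovement)        // GA_MAX_EPOCHS_WITH_NO_FITNESS_CHANGE
    {
        if (t >= maxEpochs) return true;                    // case 1
        if (bestFitness > previousBestFitness)
            epochsWithoutImprovement = 0;                   // improvement observed
        else
            ++epochsWithoutImprovement;
        return epochsWithoutImprovement >= maxEpochsWithoutImprovement;  // case 2
    }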


Chapter 5

Graphical user interface

The graphical user interface (GUI) has been designed as a tool to visualize object views and matching results in the debugging process. It runs only on Microsoft Windows systems because it is MFC-based and uses GDI+. MFC [Micb] stands for "Microsoft Foundation Classes", a library of C++ classes developed by Microsoft for Windows-based applications. GDI+ is an extension of the Windows Graphics Device Interface (GDI); this API enables applications to use graphics and formatted text on both the video display and the printer without the need to access hardware directly [Mica]. Both MFC and GDI+ require a Windows system to run. This implementation has been chosen because it appeared to be less time-consuming than a platform-independent one (which was not required in the case of the GUI). Figure 5.1 shows a screenshot of the main window. The menu allows access to the main functions of the core components:

• Database - submenu
– Load the database from the hard disc - the names of the files to be loaded are defined by DB_FILE, which contains the data, and DB_DESC_FILE, which contains the string descriptions, in config.h. (Refer to section 2.2.)
– Save the database to the hard disc - refers to the same files as Load.
– Import the database from the detection module. The import file is specified by the define DB_IMPORT_FILE in config.h. (See section 2.1 for a description of the import algorithm.)
– Generate a database at random.

• Match - submenu
– Find k Best Matches within the loaded database for the loaded query object view. Parameter k is defined by NUM_BEST_MATCH in config.h. (For the description of the matching algorithm refer to section 3.2.6.)
– Match Query With Selected dbEntry - The loaded query object view is matched with the currently selected object view from the database.


Figure 5.1: Screenshot of the graphical user interface (GUI). The display areas are in shape-match mode.

Again, the NUM_BEST_MATCH best matches will be retrieved. If no object view of the database has been selected, the result is the same as for Find k Best Matches.
– Evaluation - submenu - Provides access to several functions needed to determine the performance of the matching algorithm (refer to chapter 6).

• Genetic Algorithm - submenu
– Filter Database - This has to be done as a preparation step for a manually initiated run of the GA. Usually, the database contains representations of objects from different views, but for the GA all database entries must refer to the same object and the same view.
– Run GA - Runs the GA on the currently loaded database.
– Built New Database - For each set of database entries referring to the same object from the same view, an "average" object view is computed (see section 4) and stored in a new database. File names are defined by GA_DB_FILE (data) and GA_DESC_FILE (descriptions) in config.h.

• Active Vision provides control of the active vision process. These methods are still under development and are not within the scope of this work.

• About displays information about the GUI.

Furthermore, the main window is divided into three parts:

• The upper part of the main window contains an edit field that is used to load query object views and a list control that displays the content of the database and allows selection of single database entries. In this list control the field "id" refers to the index of the entry within the database, "description" contains the descriptions from DB_DESC_FILE and "raw data" the data from DB_FILE in the object representation syntax which is used to save the database in the file. (The same format is used for the query.)

• The middle of the main window is again divided into three parts: the query display area (left), the database entry display area (center) and two list controls (right). The upper list control displays the results of Find k Best Matches (submenu Match). Fields are "#" for the rank of the matching result, "id" for the index of the corresponding database entry, "error" for the total error, "init" for the initial error and "unmatched" for the unmatched error. (Refer to section 3.2.3 for an explanation of the different errors.) Each element (row) of this list control refers to a matching result and can be selected. The lower list control shows the details of the selected matching result, i.e. the different shape matches. Fields are "#1" for the index of the shape-primitive of the query, "#2" for the index of the shape-primitive of the database entry, "error" for the shape match error and "size", "position", "ratio" and "rotation" for the errors on the specific shape properties. (Refer to section 3.1 for details on the shape match error.) A value starting with "u" in column "#1" or "#2" stands for "unmatched" and indicates that the other corresponding shape-primitive has not been matched. The query display area shows the currently loaded query object view and the database entry display area shows the currently selected database entry. The two display areas are linked with each other and support four different display modes:

1. Unlinked mode (figure 5.2): This display mode is used if either no query object view has been loaded or no database entry has been selected, i.e. one display area is empty. In this mode the displayed object view is scaled to fit exactly into the display area (maximal scaling). All shape-primitives are colored blue.

2. Size-linked mode (figure 5.3): As soon as an object view is assigned to the other (empty) display area, both displayed object views are scaled in the following way: the bigger object view fits exactly into its display area and the smaller one uses the same scaling factor, i.e. both object views are displayed at the same scale. Again, all shape-primitives are colored blue.

Figure 5.2: Screenshot of GUI object view displays in unlinked mode and with position labels.

Figure 5.3: Screenshot of GUI object view displays in size-linked mode.

3. Match mode (figure 5.4): This mode is enabled by selecting a matching result in the upper list control. The scaling is the same as in the size-linked mode. Additionally, the rotation of the database entry has been aligned in the manner described in section 3.1. Corresponding shape-primitives have the same color and each corresponding pair has a different color. (Because of the low opacity, which has been chosen to visualize occlusions, colors might mix.)

4. Shape match mode (figure 5.1): This mode is enabled by selecting a shape match in the lower list control. The shape-primitives of the selected shape match are highlighted red, all remaining shape-primitives are colored blue. Scaling and rotation alignment are the same as for match mode.


Figure 5.4: Screenshot of GUI object view displays in match mode.

Additionally, the coordinates of the circumcenters of the shape-primitives can be displayed in any of the modes, as demonstrated for the unlinked mode (figure 5.2). A click into the display toggles between showing and hiding position labels.

• The lower part is used to display log messages from the core components. By default, these messages are redirected to stdout (console). However, if USE_MFC_GUI is defined in config.h, log messages are displayed in the log window and status messages are displayed in both the status bar and the log window. The log window can be cleared (Clear button) and opened in a separate window (View Extra button). The logging behavior can be controlled in config.h by various log-level defines and a log threshold. Only log messages with levels that are higher than the log threshold are displayed. Status messages are always displayed.


Chapter 6

Test results

This chapter contains a performance analysis of the classification matching component based on empirical tests. Firstly, the construction of the test data sets that have been used for evaluation is described. The second part is the presentation and discussion of the test results.

6.1 Test data sets

For a performance analysis on real images, a large collection of images that can be processed by the detection component is required. The detection component works on greyscale images with a maximum size of 100x100 pixels. Additionally, the images should contain single objects from different orthogonal, axis-parallel views. Unfortunately, a collection containing such images could not be found: although there exist several free image collections on the web [SoCS], none of them was suitable for this application. E.g. the "Columbia University Image Library" (COIL) [COI] consists of images of single objects, but the views are not axis-parallel. For the evaluation of this work, but more importantly for the development and evaluation of the visual shape detection component, two small image data sets have been generated. They are described in the following. However, to be able to measure the performance on larger-scale data sets, an artificial test set has been generated as explained in section 6.1.2.

6.1.1 Real test data sets

First test set

The first test set was built as part of the internship on November 10th, 2003. The data set consists of 360 manually labeled images of 10 single objects. Each image contains only one object centered in the image and each object is represented by a maximum of six orthogonal views (the number of views depends on the symmetries of the object). There are 10 images with varying lighting conditions for each view. All images were captured by a Canon IXUS V2 digital camera (2.0 M pixel CCD) that was statically mounted on a tripod. The distance from the object was about 50 cm. The objects were centered in front of a white background and neither object nor camera were moved while capturing one particular view. The flash was disabled and as light source a halogen lamp was used that was moved for each view in a repeating sequence to produce 10 different lighting conditions per view. The images were captured as JPEGs at a resolution of 640x480 pixels with the quality level set to "superfine" (highest JPEG quality, to minimize the presence of artifacts). Afterwards they were converted to greyscale, cropped, resized to 100x64 pixels and stored as uncompressed greyscale TIFFs. This final image format resembles the required input format of the detection component. The captured objects were chosen from a kitchen environment, preferring objects with simple geometry. Table 6.1 shows the partitioning of the whole set of images.

    Object description   Number of images   Number of views (descriptions)
    "cascade" bottle     30                 3 (bottom, side, top)
    cordial bottle       50                 5 (back, bottom, front, side, top)
    cup                  50                 5 (back, bottom, front, side, top)
    can of fish          30                 3 (front, side, top)
    styrofoam cup        30                 3 (bottom, side, top)
    box for glasses      30                 3 (front, side, top)
    mug                  50                 5 (back, bottom, front, side, top)
    salt                 30                 3 (bottom, side, top)
    soy sauce            30                 3 (bottom, side, top)
    masking tape         30                 3 (front, side, top)

Table 6.1: Partitioning of the first test set (360 images in total).

Second test set

For the development of the active vision component, a second test set was built by Ruby Law under conditions similar to those of the first set. Both sets have been merged and are used in the following to derive parameters for the artificially generated test sets.

6.1.2 Artificially generated test sets

Whilst real test sets are useful for developing and debugging components for detection and classification matching, the "real" test sets are far too small to allow an estimation of the running time of the algorithm in real-life applications, where databases with thousands of object views may occur. To generate databases of this size, the random generation function mentioned in section 4.3 and the mutation operator discussed in section 4.5 have been used. Firstly, a database of 10000 randomly generated object views has been created. This database is the "evaluation database". The parameters for the generation of an object view were as follows:

• The number of shape-primitives has a range of [3, 10]. For the real test sets, the detection component returned values in [0, 6], but object views with only a few shape-primitives can be matched very quickly. This would have had a positive influence on the performance of the algorithm, which was not desired for the evaluation.

• The size of each shape-primitive is initially (i.e. before normalization) in [0.7, 1.5]. Very small shape-primitives were not desired (and are unlikely to be detected by the detection component). Therefore, the size had to be bounded downwards as well as upwards (because of the final normalization step).

• The position (i.e. the values of the coordinates of the circumcenter) of each shape-primitive is bounded by ±2.0 to avoid excessive scattering of the shape-primitives.

• The ratio of each shape-primitive is in [0.4, 2.0] to avoid degenerate shape-primitives.

From the evaluation database, three different test sets of object views are derived. Each set resembles one of three scenarios that are explained in the following and contains 10 query object views. The performance of the algorithm on the evaluation database for a specific scenario is then the mean value of the queries with each object view from the corresponding set.

Ideal case

This scenario, though very unlikely, helps to estimate the lower bound of the processing time. In this set, for every query object view α there is an identical object view β in the database. Thus, the matching error of α and β is 0. The query set for this scenario contains simple copies of the first 10 object views from the evaluation database.

Normal case

In this scenario, for the query object view α there is no identical object view in the database as in the previous case, but there is at least a similar object view β. "Similar" means that the matching error of α and β is below some threshold, which in this case has been chosen to be 5.0. The query set is initialized with copies of the first 10 object views from the evaluation database (as for the previous scenario). Then each object view in the set is mutated until its matching error with the evaluation database exceeds 3.0. Additionally, the entries of the evaluation database are reordered to ensure that the best matches for all query object views are among the first 1000 database entries.1 The actual errors for the queries used in this evaluation are in the range of [3.17, 4.85], with a mean of 3.84.

(Approximated) Worst case

This scenario approximates worst cases and provides an estimate of the upper bound for the processing time. The query object view is not in the database and the matching error with the database exceeds a high threshold (5.0). The query set is constructed from random object views satisfying the additional constraint that the matching error for each query object view on the evaluation database is at least 5.0. Again, the entries of the evaluation database are reordered to ensure that the best matches for all query object views are amongst the first 1000 database entries.1 The actual errors for the queries used in this evaluation are in the range of [5.09, 7.32], with a mean of 6.64.

1 The test queries are run on the first k thousand object views of the evaluation database, where k takes every integer value from 1 to 10. The reordering ensures that the best match for each query is in the evaluation database regardless of the value of k. Note that reordering, in general, has no impact on the performance.

6.2 Test results

Figure 6.1 shows the benchmark results for the three query scenarios on the evaluation database introduced in the preceding section. To demonstrate the dependency of the performance on the size of the database, the queries are run on the first k thousand object views of the evaluation database, where k takes every integer value from 1 to 10. The number of created nodes refers to the number of partial matchings that have been created during the matching process. The execution times (dotted lines) refer to an Intel Pentium M system running at 1.5 GHz with 512 MB memory. The results can be reproduced by running the command Evaluate from the submenu Match>Evaluation of the GUI. (The required database files are generated by Generate Benchmark DBs from the same submenu.) Additionally, figure 6.2 shows, as an illustrative example, the details of the matching process of one object view from the "normal case" query set (left) and one object view from the "worst case" query set (right) with the corresponding best matches from the evaluation database.

Figure 6.1: Benchmark results for large-scale databases (the dotted lines refer to the execution times).

    DB size   Ideal Case   Normal Case     Worst Case
    1000      0.0 (0.0)    311.5 (10.2)    2115.2 (424.9)
    2000      0.0 (0.0)    610.3 (17.4)    4247.6 (855.2)
    3000      0.0 (0.0)    908.5 (23.1)    6358.0 (1280.9)
    4000      0.0 (0.0)    1196.0 (27.8)   8476.5 (1706.4)
    5000      0.0 (0.0)    1477.5 (33.7)   10571.3 (2120.3)
    6000      0.0 (0.0)    1765.3 (39.3)   12690.7 (2539.4)
    7000      0.0 (0.0)    2046.4 (45.5)   14737.6 (2920.2)
    8000      0.0 (0.0)    2342.2 (51.3)   16889.1 (3352.1)
    9000      0.0 (0.0)    2633.3 (56.8)   18972.4 (3762.4)
    10000     0.0 (0.0)    2909.7 (62.4)   21098.9 (4194.4)

Table 6.2: Number of expanded nodes (values in brackets refer to the number of nodes with depth > 2).

The following observations can be made:

• For all three scenarios, the processing time and the number of created nodes scale linearly with the size of the database.

• For the "ideal case" the number of created nodes is exactly the number of entries in the database. This is the absolute minimum, considering that the branch & bound algorithm needs to be initialized with a set of nodes containing one node


for each database entry. Table 6.2, which shows the number of nodes that have been expanded, reveals that not a single node has been expanded.

• For the scenario that is considered the "normal case", the number of nodes created is about twice as high as in the previous scenario. Table 6.2 shows that more than 95% of the expanded nodes have depth ≤ 2. This indicates that the branch & bound search process becomes more directed towards the optimal solution with increasing search depth.

• For the (approximated) worst case scenario, the number of nodes created is significantly higher than in the other scenarios. Moreover, about 20% of the expanded nodes have depth > 2, indicating that the branch & bound search tree is still broad on deeper levels.

Figure 6.2: Changes in the lower and upper bounds for the matching error during the matching process of two example object views. Left: "normal case" scenario. Right: "worst case" scenario. (The lower bound is the sum of the initial, unmatched and estimated errors as described in section 3.2.3.)

The observations can be explained as follows:

• The search performance for the "ideal case" scenario can only be optimal if the initial solution is already the optimal matching, i.e. the heuristic used to generate the initial solution, which is used as the upper bound in the search process, is very efficient in this scenario.

• In the early stages of the search, the computation of the lower bound is limited. This is caused by the definition of the error functions that are used for the computation.2 As can be observed in figure 6.2, the estimated error increases significantly once the 1st and 2nd correspondence points are found, whereas the unmatched error is independent of the number of correspondence points. In the early stages of the search, the unmatched error contributes the major part of the lower bound. Consequently, the search can easily be misled. This explains the dominating percentage of expanded nodes with depth ≤ 2.

• As long as all nodes in the branch & bound search tree have nearly identical lower bound values, the selection of the node to expand is almost random, which renders the search process nearly undirected and uninformed. This is the case for the "worst case" scenario, where all matchings are bad.3 In addition, using a bad complete solution for pruning does not bound the branch & bound search tree well. This explains the significantly higher number of created and expanded nodes in the "worst case" scenario.

2 Recall that the comparison of the sizes of shape-primitives requires at least one correspondence point. For the computation of the errors for the position and rotation, two correspondence points are necessary. For details refer to section 3.1.
3 See the definition of the query set for this scenario in section 6.1.2.


Chapter 7

Conclusion & future work

This chapter summarizes the entire work and presents possibilities for further improvement of the algorithm as well as thoughts on how the work could be continued.

7.1 Discussion

There is evidence that an object's geometry is decomposed into several parts during the process of human object recognition,1 but this process is far from being fully understood. Whilst it is rather unlikely that human object recognition is based on artificial shapes such as ellipses, triangles and rectangles, a decompositional approach using this set of shape-primitives may yield advantages in terms of computational costs and real-time capability. Moreover, relying on such very basic shape-primitives, it may even be possible to migrate the detection of the shape-primitives from software to hardware. However, the fundamental restriction on the shape complexity limits the differences that can be captured between objects and possibly confines the setting to non-real-world environments with less complex shapes and structures, such as the manufacturing environment of a factory. Such an approach cannot be expected to return results on the level achieved by other object recognition approaches, e.g. those mentioned in section 1.2. Rather, it can be regarded as a possible preprocessing step in object recognition that helps to decide whether and which more sophisticated (and possibly computationally more expensive) further steps in object recognition should be taken to gain additional information about perceived objects.

1 Refer to section 1.2.1 for an overview of the discussion in the field of cognitive science.


7.2 Ideas for further improvement

7.2.1 Introduction of an error threshold

The error of a matching of two object views α and β is obviously only bounded by |α| + |β|.2 Thus, even cases worse than 5.0 are imaginable; in fact, such cases are not unlikely. It therefore seems advisable to define a certain threshold for the matching error and to implement the following behavior: if this threshold is exceeded for a query, the matching process is aborted and a message like "There is no similar object in the image." or "unknown object" is returned. The threshold should be database-specific and determined empirically.

2 |·| denotes the number of shape-primitives in an object view.

7.2.2 Optimization of the parameters for the shape error functions

The functions for the comparison of the size, aspect ratio, rotation and position of shape-primitives described in appendix A are parameterized. Currently, the parameter values are chosen based on empirical tests. An optimization of the values could improve the quality of the computed errors and have a positive impact on the performance of the classification matching component.

7.2.3 Incorporation of shape confidences

The detection component provides additional information about the quality of the detected patterns of the shape-primitives. This information is currently not used. It may be incorporated into the object representation syntax3 and used, e.g., as a measure of uncertainty in the shape error functions. (See appendix A for the current implementation of the shape error functions.)

3 Import and storage of the values is already supported.

7.3 Conclusion

In the context of an object recognition system that is based on the decomposition of 2D object views into shape-primitives,4 a symbolic object view representation has been designed. This representation is capable of holding all information that can be gathered by a detection component, such as type, size, aspect ratio, rotation and position of individual shape-primitives. It can be normalized and is rotation-, scaling- and translation-invariant, which benefits matching between object views. For the comparison of properties of shape-primitives, shape error functions based on the concept of fuzzy similarities are used. The symbolic representations of object views known to the system can be stored in a database that supports querying of other object views. To keep the size of the database as small as possible, generalizations of object views can be learned by a genetic algorithm. For querying the database, a classification matching algorithm has been implemented. It is based on a branch & bound algorithm that utilizes error estimates and heuristics that have been designed specifically for this problem. Furthermore, the branch & bound algorithm has been modified to return the k most similar database entries (instead of only the most similar database entry) for any given query. The implemented application can be accessed via the command line (platform-independent) or a graphical user interface (requires a Microsoft Windows system). The query performance scales linearly with the size of the database. For a database containing 10000 entries, a response time of less than a second is expected on an average system. (It can be further improved by using a heuristic as a trade-off between the quality of the solution and speed.5) Thus, it is possible to apply the system in the active vision domain. Based on this work, an active vision module is currently being developed.

4 Shape-primitives are basic geometries such as ellipses, rectangles and isosceles triangles.
5 Using the heuristic, it is not guaranteed that the most similar database entry is returned for any given query. The probability of this event can be influenced by a parameter.


Appendix A

Definition of the shape-primitive property errors

The implemented error functions for the comparison of the properties of two shape-primitives are based on the concept of fuzzy similarity relations [KGK94]. A fuzzy similarity relation E based on a set X is a mapping from X × X to [0, 1] satisfying the following characteristics:

    reflexivity:           E(x, x) = 1
    symmetry:              E(x, y) = E(y, x)
    pseudo-transitivity:   max{E(x, y) + E(y, z) − 1, 0} ≤ E(x, z)

E(x, y) = 1 denotes that x and y are identical, and E(x, y) = 0 denotes maximum dissimilarity of x and y. For a fuzzy similarity relation E, a corresponding distance measure can simply be defined as 1 − E.

Figure A.1: Triangular membership function used for modeling similarity and the corresponding distance function.

All fuzzy sets used in this implementation are based on triangular membership functions. Figure A.1 shows such a function modeling similarity and the corresponding distance function, which is used here for the error computation. The functions are parametrized by δ, which determines the width of the base of the "triangle" and thus can be used to control the error tolerance. For each shape-primitive property, a value defining the error tolerance can be set in config.h. The error for matching a shape-primitive with "unmatched", e_unmatched, is - for reasons of consistency - defined as the error for the shape-primitive property "size" for a matching with a zero-size shape-primitive.
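A sketch of such a triangular similarity and the derived error function (hypothetical function names; delta corresponds to the tolerance parameter δ):

    // Sketch of the triangular similarity of figure A.1 and its distance measure.
    #include <algorithm>
    #include <cmath>

    // Fuzzy similarity: 1 for identical values, falling linearly to 0 at |x - y| = delta.
    double similarity(double x, double y, double delta) {
        return std::max(0.0, 1.0 - std::fabs(x - y) / delta);
    }

    // Corresponding distance (error), defined as 1 - E.
    double propertyError(double x, double y, double delta) {
        return 1.0 - similarity(x, y, delta);
    }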


Appendix B

Overview of the source files

The following source files contain the implementation presented in this report (excluding the GUI):

• config.h - is the global configuration file.

• datatypes.h - contains definitions of the following basic datatypes:
    TShapeType          supported types of shape primitives
    TPoint              a point in the plane
    TShape              represents a shape-primitive
    TAbsoluteShape      used to export an absolute shape-primitive representation for visualization
    TAbsoluteObjectView used to export an absolute object view representation for visualization
    TMatchDetailShape   holds details of a shape-match to be displayed in the GUI
    TMap                represents a (partial) matching

• CAVLTree.h - contains the following template classes for the AVL tree used by the branch & bound algorithm:
    CListNode  template class for a node of a simple linked list
    CAVLNode   template class for a node of an AVL tree
    CAVLTree   template class for an AVL tree

• tools.h/cpp - contains auxiliary functions for logging (for the command line version as well as for the GUI), random number generation, string tokenization and simple mathematical operations.

• CObjectView.h/cpp - contains the class CObjectView that represents an object view and provides functions for import, random initialization, modification and export.


• CGAIndividual.h/cpp - contains the class CGAIndividual that is inherited from CObjectView and represents an individual for the genetic algorithm. The added functionality comprises initialization with a CObjectView and alteration operators (several implementations of mutation and crossover operators).

• CDatabase.h/cpp - contains the class CDatabase, which is the main class of this work and incorporates the functionality of the database and the classification matching component (see figure 1.1), i.e. functions for import and export of the database, the implementation of the branch & bound algorithm, error computations, heuristics and the control function for the genetic algorithm. Some functions have recently been added for communication with the active vision component and do not belong to the content of this report.

• shapes.h/cpp - contains high-level control functions for CDatabase. These functions are called by the GUI. Some functions have recently been added for communication with the active vision component and do not belong to the content of this report.

• CActiveVision.h/cpp - contains the implementation of the active vision component, which is currently being developed and does not belong to the content of this report. However, these files are needed to be able to compile the project.

The following source files are only GUI-related:

• resource.h - is an automatically generated include file. It contains the IDs of the GUI components.

• shapesGUI.rc - is an automatically generated resource script.

• shapesGUI.h - is the main header file for the GUI. (The location of the header files for MFC and GDI+ needs to be specified in this file.)

• shapesGUI.cpp - defines the class behaviors for the GUI.

• shapesGUIDlg.h/cpp - contains the implementation of the main dialog window of the GUI.

• CLogEdit.h/cpp - contains the class CLogEdit, which is an extension of the MFC class CEdit with additional functionality for logging.

• CObjectViewerCtrl.h/cpp - contains the class CObjectViewerCtrl, which inherits from the MFC class CWnd and is the implementation of a display area used to visualize a TAbsoluteObjectView (see e.g. figure 5.2).

The implementation of the database and the classification matching component is in C++ and uses only STL classes.1 Thus, the core components can run on any platform, whereas the GUI that served as a tool in the debugging process is not platform-independent (for further explanation refer to section 5).

1 The "Standard Template Library" (see e.g. [STL]) is a platform-independent collection of container classes, generic algorithms and related components that can greatly simplify many programming tasks in C++.


Appendix C

Number of possible matchings between two object views

Let α and β be two object views. In the following, a formula that computes the number of possible matchings between α and β is derived. For any shape type t, let n_{α,t} be the number of shape-primitives of this shape type in object view α and n_{β,t} the number of shape-primitives of this shape type in object view β. Obviously, the number of shape matches for this shape type, excluding shape matches with "unmatched", is bounded by min(n_{α,t}, n_{β,t}). For any integer k, 0 ≤ k ≤ min(n_{α,t}, n_{β,t}), all sets containing exactly k (non-overlapping) shape matches can be constructed as follows:

1. k shape-primitives of α are chosen to be matched (shape-primitives that have not been chosen are matched with "unmatched"). This is a combination of n_{α,t} shape-primitives, k at a time; different arrangements of the same elements do not count. The number of different choices is:

    (n_{α,t} choose k) = n_{α,t}! / ((n_{α,t} − k)! · k!)    (C.1)

2. k shape-primitives of β are chosen to be matched (again, shape-primitives that have not been chosen are matched with "unmatched"), where the i-th chosen shape-primitive is matched with the i-th chosen shape-primitive of α. Thus, different arrangements of the same elements do count. This is a permutation of k out of n_{β,t} elements. The number of different choices is:

    n_{β,t}! / (n_{β,t} − k)!    (C.2)

Combining (1) and (2) leads to the number of different sets containing exactly k shape matches:

    [ n_{α,t}! / ((n_{α,t} − k)! · k!) ] · [ n_{β,t}! / (n_{β,t} − k)! ]    (C.3)

The number of different matchings with any number of shape matches regarding only one particular shape type t is then given by:

    Σ_{k=0}^{min(n_{α,t}, n_{β,t})} [ n_{α,t}! / ((n_{α,t} − k)! · k!) ] · [ n_{β,t}! / (n_{β,t} − k)! ]    (C.4)

Taking all shape types t into consideration finally results in the total number of possible matchings between α and β, where t_max is the maximum shape type value. (In this implementation there are 3 different shape types: ellipses (0), triangles (1) and rectangles (2).)

    Π_{t=0}^{t_max} Σ_{k=0}^{min(n_{α,t}, n_{β,t})} [ n_{α,t}! / ((n_{α,t} − k)! · k!) ] · [ n_{β,t}! / (n_{β,t} − k)! ]    (C.5)
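Equation C.5 can be evaluated directly; the following sketch does so in double precision (hypothetical function names; the shape counts occurring here are small enough, at most about 10 per type, that the factorials do not overflow):

    // Sketch of equation C.5 (hypothetical names).
    #include <algorithm>
    #include <vector>

    static double factorial(int n) {
        double f = 1.0;
        for (int i = 2; i <= n; ++i) f *= i;
        return f;
    }

    // nA[t] and nB[t] hold the number of shape-primitives of type t in the
    // object views alpha and beta (t = 0: ellipses, 1: triangles, 2: rectangles).
    double numberOfMatchings(const std::vector<int>& nA, const std::vector<int>& nB) {
        double product = 1.0;
        for (std::size_t t = 0; t < nA.size(); ++t) {
            double sum = 0.0;
            const int kMax = std::min(nA[t], nB[t]);
            for (int k = 0; k <= kMax; ++k)
                sum += factorial(nA[t]) / (factorial(nA[t] - k) * factorial(k))  // eq. C.1
                     * factorial(nB[t]) / factorial(nB[t] - k);                  // eq. C.2
            product *= sum;                                                      // eq. C.4, C.5
        }
        return product;
    }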


Appendix D

Data structure for a (partial) matching

The data structure of a (partial) matching holds the following information (a consolidated sketch of the structure follows at the end of this appendix):

• unsigned short dbEntry;
This is an explicit reference to an object view in the database. (A reference to the query is not stored because there is only one query.)

• vector map;
A matching is constructed iteratively, as demonstrated in section 3.1. The shape-primitives of the query are matched according to their order (introduced in section 2.1). After each shape match, the reference to the shape-primitive of the database entry is appended to this vector, i.e. the n-th element of this vector holds the index of the shape-primitive of the database entry that has been matched with the n-th shape-primitive of the query.

• float error_sum;
For a complete matching, this value holds the exact matching error. In the case of a partial matching, this is a lower bound for the matching error of any complete matching that can be constructed from this partial one.

In addition to these attributes, others may be stored to avoid repeating computations and increase the performance of the branch & bound algorithm:

• float error_init;
This is the part of error_sum that does not need to be recomputed. It is called the initial error and is further explained in the paragraph that deals with the computation of the lower bound.


• double scale;
This is the scaling factor for the database entry to achieve scaling invariance (see section 3.1 and figure 3.3).

• TPoint shiftQ; TPoint shiftDB;
These attributes indicate how the query and the database entry have to be shifted to achieve translation invariance (see section 3.1 and figure 3.3). The TPoint structure holds separate values for the x and y axis.

• double graPhi; double graSin; double graCos;
These attributes hold the value, sine value and cosine value of the global rotation angle. This is the angle by which the database entry has to be rotated to achieve rotation and position invariance (see section 3.1 and figure 3.4).

• short unmatched[NUM_SHAPE_TYPES];

Each element of this array holds the number of shape-primitives of the corresponding shape type that cannot be matched at all. A negative number means that the query has too many shape-primitives, a value greater than 0 indicates there are too many in the database entry. (These are called ut in algorithm 3.)

• vector matched;

Each element of this vector corresponds to a shape-primitive of the database entry and indicates whether this shape-primitive is “free” or has already been matched.

• unsigned char correspondencePoints;

This attribute is a counter for the correspondence points that have been found between the query and the database entry. Only the values 0, 1 and 2 are important. These values refer to the different stages in the comparison algorithm (see section 3.1 and figures 3.3 and 3.4).
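Assembled into one declaration, the fields described above might look as follows. This is a sketch: the field names follow the report, but the element types of the two vectors and the value of NUM_SHAPE_TYPES are assumptions, since the extracted text lost the template arguments:

    // Sketch of the TMap structure described in this appendix (types partly assumed).
    #include <vector>

    const int NUM_SHAPE_TYPES = 3;   // ellipses, triangles, rectangles (see appendix C)

    struct TPoint { double x, y; };  // separate values for the x and y axis

    struct TMap {
        unsigned short dbEntry;              // reference to an object view in the database
        std::vector<unsigned short> map;     // n-th query shape-primitive -> db shape map[n]
        float error_sum;                     // exact error or lower bound (partial matching)
        float error_init;                    // initial error; never recomputed
        double scale;                        // scaling factor for the database entry
        TPoint shiftQ, shiftDB;              // shifts for translation invariance
        double graPhi, graSin, graCos;       // global rotation angle, its sine and cosine
        short unmatched[NUM_SHAPE_TYPES];    // per-type count of unmatchable shape-primitives
        std::vector<bool> matched;           // db shape-primitive already matched?
        unsigned char correspondencePoints;  // 0, 1 or 2 correspondence points found
    };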


Bibliography

[AVL62] Georgii M. Adelson-Velskii and Evgenii M. Landis. An algorithm for the organization of information. Doklady Akademii Nauk SSSR, 1962. (Russian). English translation by Myron J. Ricci in Soviet Math. Doklady, 3:1259–1263, 1962.

[BET95] Heinrich H. Bülthoff, Shimon Edelman, and Michael J. Tarr. How are three-dimensional objects represented in the brain? Cerebral Cortex, 5(3):247–260, 1995. Available from: citeseer.lcs.mit.edu/525321.html.

[Bie87] Irving Biederman. Recognition by components: a theory of human image understanding, volume 94 of Psychol. Reviews, pages 115–147. 1987.

[COI] Columbia University Image Library (COIL-100) [online, cited Jan 13th, 2005]. Available from: www1.cs.columbia.edu/CAVE/research/softlib/coil-100.html.

[DY95] Gang Dong and Masahiko Yachida. Acquiring fuzzy relational model from 3-D hierarchical structure of objects. Proceedings of the Fourth IEEE International Conference on Fuzzy Systems, 3:1367–1374, March 1995. Available from: intl.ieeexplore.ieee.org/xpl/abs free.jsp?arNumber=409859.

[Ede97] Shimon Edelman. Computational theories of object recognition. Trends in Cognitive Sciences, 1:296–304, 1997. Available from: citeseer.nj.nec.com/edelman97computational.html.

[EI02] Shimon Edelman and Nathan Intrator. Visual Processing of Object Structure. 2002. Available from: kybele.psych.cornell.edu/~edelman/arbib2e-final.pdf.

[EI03] Shimon Edelman and Nathan Intrator. Towards structural systematicity in distributed, statically bound visual representations. Cognitive Science, 27:73–110, 2003. Available from: kybele.psych.cornell.edu/~edelman/cogsci-03.pdf.

[EIJ02] Shimon Edelman, Nathan Intrator, and Judah S. Jacobson. Unsupervised learning of visual structure. 2002. Available from: kybele.psych.cornell.edu/~edelman/bmcv02longer.pdf.

[EN98] Shimon Edelman and Fiona Newell. On the representation of object structure in human vision: evidence from differential priming of shape and location. CSRP 500, 1998. Available from: citeseer.nj.nec.com/edelman98representation.html.

[FMK+03] Thomas A. Funkhouser, Patrick Min, Michael Kazhdan, Joyce Chen, Alex Halderman, David Dobkin, and David Jacobs. A search engine for 3D models. ACM Transactions on Graphics, 22(1), 2003. Available from: citeseer.ist.psu.edu/funkhouser02search.html.

[GB89] John J. Grefenstette and James E. Baker. How genetic algorithms work: A critical look at implicit parallelism. In J. David Schaffer, editor, Proceedings of the 3rd International Conference on Genetic Algorithms, pages 20–27, San Mateo, CA, June 1989. Morgan Kaufmann.

[GKK04] Ingrid Gerdes, Frank Klawonn, and Rudolf Kruse. Evolutionäre Algorithmen. Vieweg, July 2004.

[Gol89] David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Professional, 1989.

[KGK94] Rudolf Kruse, Jörg Gebhardt, and Frank Klawonn. Foundations of fuzzy systems. Wiley, Chichester, 1994.

[Kos96] Stephen M. Kosslyn. Image and Brain. MIT Press, Cambridge, MA, 1996.

[Low85] David G. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, Boston, MA, 1985.

[LW66] Eugene L. Lawler and D. E. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4):699–719, 1966.

[Mai] Introduction to computer vision and image processing [online, cited Aug 1st, 2004]. Available from: www.netnam.vn/unescocourse/computervision/computer.htm.

[Mar82] David Marr. A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, San Francisco, CA, 1982.

[Mica] Microsoft Corporation. GDI+ [online, cited Jan 24th, 2005]. Available from: msdn.microsoft.com/library/en-us/gdicpp/gdiplus/gdiplus.asp.

[Micb] Microsoft Corporation. Microsoft foundation class library (MFC) [online, cited Jan 24th, 2005]. Available from: msdn.microsoft.com/library/en-us/vcmfc98/html/mfchm.asp.

[Mic96] Zbigniew Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, Berlin, 3rd edition, March 1996.

[Mit97] Tom Mitchell. Machine Learning. McGraw Hill, 1997.

[MN78] David Marr and Herbert Keith Nishihara. Representation and recognition of the spatial organization of three dimensional structure. Proceedings of the Royal Society of London B, 200:269–294, 1978.

[OFCD01] Robert Osada, Thomas A. Funkhouser, Bernard Chazelle, and David P. Dobkin. Matching 3D models with shape distributions. In Shape Modeling International, pages 154–166. IEEE Computer Society, 2001. Available from: citeseer.nj.nec.com/373604.html.

[OMT03] Ryutarou Ohbuchi, Takahiro Minamitani, and Tsuyoshi Takei. Shape-similarity search of 3D models by using enhanced shape functions, 2003. Available from: citeseer.nj.nec.com/573301.html.

[Pop94] Arthur R. Pope. Model-based object recognition - A survey of recent research. Technical Report TR-94-04, Dept. Computer Science, Univ. British Columbia, January 1994. Available from: citeseer.nj.nec.com/pope94modelbased.html.

[SoCS] School of Computer Science, Carnegie Mellon University, Pittsburgh PA. Computer vision test images [online, cited Jan 13th, 2005]. Available from: www-2.cs.cmu.edu/afs/cs/project/cil/www/v-images.html.

[STL] STLport [online, cited Oct 1st, 2004]. Available from: www.stlport.org.

[Tar] Tarrlab stimuli [online, cited Jan 21st, 2005]. Available from: www.cog.brown.edu/~tarr/stimuli.html.

[TB95] Michael J. Tarr and Heinrich H. Bülthoff. Is human object recognition better described by geon-structural-descriptions or by multiple views? Journal of Experimental Psychology, Human Perception and Performance, 21(6):1494–1505, 1995. Available from: citeseer.ist.psu.edu/tarr95is.html.

[TB98] Michael J. Tarr and Heinrich H. Bülthoff. Image-based object recognition in man, monkey and machine. Cognition, Special issue on Image-Based Object Recognition in Man, Monkey and Machine, 67:1–20, 1998. Available from: citeseer.ist.psu.edu/tarr98imagebased.html.

[TWHG98] Michael J. Tarr, Pepper Williams, William G. Hayward, and Isabel Gauthier. Three-dimensional object recognition is viewpoint dependent. Nature Neuroscience, 1:275–277, 1998. Available from: citeseer.nj.nec.com/35937.html.

[Ull89] Shimon Ullman. Aligning pictorial descriptions: An approach to object recognition. Cognition, 32:193–254, 1989.

[Ull96] Shimon Ullman. High Level Vision: Object Recognition and Visual Cognition. MIT Press, Cambridge, MA, 1996.

[Vel01] Remco C. Veltkamp. Shape matching: Similarity measures and algorithms. In Shape Modeling International, pages 188–199. IEEE Computer Society, 2001. Available from: citeseer.nj.nec.com/veltkamp01shape.html.


Declaration of Authorship

I hereby declare that I have produced this work independently and only with permitted aids.

Magdeburg, April 28, 2005
First name Last name of the author
