Qualitative Scene Descriptions from Images for Integrated Speech and Image Understanding
A dissertation submitted to the Faculty of Technology of the University of Bielefeld in partial fulfillment of the requirements for the degree of Doktor der Ingenieurwissenschaften (Dr.-Ing.) by
Gudrun Socher
Pasadena, May 1997
Foreword

Research activities aimed at human-machine interfaces that are as "natural" as possible are moving increasingly into the foreground of computer science. In the process, terms such as "virtual reality" or "multimedia" have been coined and have become established in everyday language. Interaction between human and machine in a natural environment can thus be regarded as the ultimate goal, but given the current state of research it is not attainable, at least not in the short term. The SFB project "Situierte künstliche Kommunikatoren" focuses on the investigation of such interactions in a restricted environment, using a specific construction task as an example. To make an interaction that is as natural as possible feasible, the machine must be able to perceive its environment visually, to understand spoken hints and instructions, and, on the basis of the signals interpreted in this way, to draw sensible conclusions and to react appropriately.

The work presented by Ms. Socher is situated in this context. Its focus is on extracting information from images and on representing this information so that image and speech interpretation can support each other. The contributions to this complex research field presented in this work concentrate on model-based 3D reconstruction, the generation of qualitative object descriptions, and object identification in the interplay of image and speech data. Starting from a thorough discussion of the literature, the author develops innovative methods for each of these three areas. The core of the 3D reconstruction is a nonlinear optimization method for mapping object models onto image features. The presented 3D spatial model and the associated generation of spatial relations combine geometric methods with results from psychological experiments. For object identification, the uncertainties and variabilities of image and speech data along the features object type, shape, color, size, and spatial relation are combined and made precise with the help of Bayesian networks.

This volume thus provides a remarkable account of many important aspects of integrated image and speech understanding models and systems. The rich repertoire of methods, the results obtained, and also the detailed discussions of different approaches can serve as a foundation for further developments in human-machine interaction.

Gerhard Sagerer
Acknowledgments

First of all, I would like to thank my advisor, Prof. Gerhard Sagerer, for his open-minded and creative interaction. His way of giving me full responsibility and freedom and at the same time entire support was a challenging and very rewarding experience. I'd also like to thank him for giving me the great opportunity of spending the last year as a visiting graduate student at the California Institute of Technology.

I very much acknowledge Prof. Pietro Perona from the California Institute of Technology in Pasadena for inviting me to work at his research laboratory for the last year. I very much enjoyed the inspiring and fruitful discussions with him and the great work atmosphere in his lab. I also appreciate him being the second reviewer on my thesis committee. My stay at Caltech would not have been possible without the support of the German Academic Exchange Service (DAAD).

I would like to thank all my colleagues of the "Arbeitsgruppe Angewandte Informatik" at the University of Bielefeld. The intense interaction in the group was very helpful for my work. In particular, Franz Kummert and Thomas Fuhr always had an open ear for problems and discussions. I enjoyed the time of having Uta Naeve (see Fig. 1.1) as my officemate. The collaboration with Christian Scheering and Christian Schwarz, and especially with Torsten Merz and Sven Wachsmuth in their student projects, was very fruitful for this work. Peter Koch was a great help with all computer problems. The questionnaire in the WWW would not have been possible without his engagement.

This thesis is embedded in the work of the joint research project "Situierte Künstliche Kommunikatoren" (Sonderforschungsbereich (SFB) 360). The work in this interdisciplinary project was a very instructive and good experience. I'd like to thank my project leaders Hans-Jürgen Eikmeyer, Gert Rickheit, and Gerhard Sagerer. Thanks go also to all the members of the SFB 360. I particularly enjoyed the wonderful collaboration with Constanze Vorwerg. The SFB 360 is supported by the German Research Foundation (DFG).

I also would like to thank all members of the Vision Lab at Caltech for their joyful collaboration and friendship. The "agents" Jean-Yves Bouguet, Enrico Di Bernardo, Luis Goncalves, Mario Munich, and George Barbastathis are an immense source of help and fun. I appreciate the help of Alan Bond and Lavonne Martin in proof-reading this text.

Finally, I would like to thank Dieter Koller very much for all his love and support, especially during the intense time of finishing this thesis.

Gudrun Socher
Abstract

Human-computer interaction using means of communication which are natural to humans, like spoken instructions or gestures, has always been a challenging task. In this thesis, we address the subproblem of fusing the understanding of spoken instructions with the visual perception of the environment. We describe the design and implementation of a high-level computer vision component for the integrated speech and image understanding system QuaSI-ACE. QuaSI-ACE is a prototype of a 'situated artificial communicator', a system which aims to interact with humans in a natural way given a specific scenario or situation. A toy assembly scenario is our domain. The system QuaSI-ACE is able to identify objects intended in spoken instructions given by a human instructor, based on results from the image understanding component which visually observes the scene.

The high-level image understanding is accomplished by first reconstructing the 3D scene from uncalibrated stereo images. We use a model-based approach fitting 3D object models to 2D features extracted from images to estimate the pose of all objects and the camera parameters. In this context we developed a new method of using ellipses for 3D reconstruction based on projective invariants. The second image understanding step involves the computation of the qualitative features 'type', 'color', and 'spatial relations' from the 3D data and from 2D object hypotheses obtained from object recognition. These qualitative features are represented as fuzzified vectors assigning a likelihood value to each category in the feature space.

The identification of the intended objects in the spoken instructions is based on a Bayesian network approach. The objects with the highest joint probability of being observed in the scene and being intended in the instructions are identified using the common qualitative representation for the observed 'type', 'color', and 'spatial relations' as well as the uttered 'type', 'color', 'size', 'shape', and 'spatial relations'. Domain and prior knowledge are incorporated in the identification process. We show the performance and robustness of our image understanding and object identification modules in the context of the entire system QuaSI-ACE using real images and spoken instructions.
Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Outline

2 A Situated Artificial Communicator
  2.1 Scenario
  2.2 Knowledge-Based Interpretation
  2.3 Integration of Speech and Image Understanding
  2.4 The Image Understanding Component
    2.4.1 Architecture
    2.4.2 Module Design

3 Image Understanding – Qualitative Descriptions
  3.1 Issues of Representation
    3.1.1 Computer Vision
    3.1.2 Artificial Intelligence
    3.1.3 Cognitive Psychology and Linguistics
  3.2 Why Choose Qualitative Description?
  3.3 Related Work

4 Model-based 3D Reconstruction
  4.1 Related Work
  4.2 Object Models
  4.3 Model Primitives and their Perspective Projection
    4.3.1 Camera Model
    4.3.2 Projection of Model Points
    4.3.3 Projection of Model Line Segments
    4.3.4 Projection of Model Ellipses
  4.4 Stereo Matching
  4.5 Pose Estimation and Camera Calibration
    4.5.1 Levenberg-Marquardt Method
    4.5.2 Minimization Scheme
  4.6 Results
  4.7 Discussion

5 Qualitative Descriptions
  5.1 Representing Qualitative Descriptions
  5.2 Object Type
    5.2.1 Object Recognition
    5.2.2 Results
  5.3 Color
    5.3.1 Color Classification
    5.3.2 Results
  5.4 Spatial Relations
    5.4.1 Related Work
    5.4.2 A Computational Spatial Model
    5.4.3 Generation of Spatial Relations for 3D Objects
    5.4.4 Results
    5.4.5 Understanding Spatial Relations for 2D Object Localization
    5.4.6 Empirical Psycholinguistic Validation
    5.4.7 Discussion

6 Object Identification
  6.1 Bayesian Networks
    6.1.1 Belief Update
    6.1.2 Propagation
    6.1.3 Related Work
  6.2 Object Identification using Bayesian Networks
    6.2.1 Design
    6.2.2 Use of Prior Knowledge
    6.2.3 Questionnaire about Size and Shape in the World Wide Web
    6.2.4 Results
    6.2.5 Discussion

7 Results and Discussion
  7.1 Results
  7.2 Discussion

8 Conclusions
  8.1 Conclusions
  8.2 Future Work

A Notations
B Baufix Domain – Objects and Geometric Modeling
C Homogeneous Transformations
D Ellipses as Four Points and Cross Ratio
E Jacobians
F Covariance Matrices

Bibliography
Index
Chapter 1
Introduction
What kind of abilities must a computer or robot possess in order to be able to carry out a verbal instruction like, "Give me the red bolt which is to the right of the long and flat bar"? When we start analyzing this instruction, we find that the problem can be decomposed into several subtasks. First of all, the computer has to recognize the verbal instruction: a speech recognition component must analyze and process the digitized acoustic data which is recorded with a microphone. The subsequent speech understanding should then result in a symbolic description of the content of the utterance.

The computer can now start analyzing this symbolic description. The instruction talks about a bolt. The computer must not only know what a bolt is but also what such an object looks like, which means how the computer can distinguish and locate this object in the environment. Visual sensors, for example cameras, provide appropriate signal data for this task. With its knowledge about bolts, the computer can try to recognize and locate an object in the visual data that looks like a bolt. Color perception is required in order to find the red bolt. The instruction addresses a specific bolt, the one which is to the right of the long and flat bar. Knowledge about spatial relations as well as the ability to compute them is furthermore necessary in order to find the intended bolt.

We see from this example that a natural interaction with machines or computers requires far more abilities than the recognition of the actual command or instruction. Domain knowledge, the perception of the environment, and inference processes based on knowledge and perception are necessary. Understanding therefore encompasses recognition processes, knowledge-based interpretation, and inferences. The use of multiple sources of information, for example speech and vision, is a good strategy for interacting in natural environments.

An instruction like the one above is an example of a natural human-computer interaction. No special commands are necessary and a human can just talk to the computer. Human-computer interaction should be designed to be most convenient for humans. Ideally, the computer should operate in the user's three-dimensional world rather than forcing the user to adapt to the computer's world. Such a human-computer interaction requires in the first place means of communication that are natural for humans. Depending on the task, more convenient and natural
means of communication are speech or gestures rather than a keyboard. At the very least, a more flexible and tolerant syntax is helpful. A second requirement is the perception of the environment in order for the computer to interact and to react adequately to the given situation. As seen from the example above, the use of multiple sources of information, such as speech and vision, is necessary to manage complex interaction tasks. Verbal descriptions often cover only a minimal specification which may be enough to distinguish an object in the current but not in a general context (Herrmann & Grabowski, 1994). Vision is an excellent complementary source to speech. Humans make extensive use of their visual abilities. Furthermore, vision can be used in a very flexible way: (a) it is possible to measure or to estimate whether something fits and is suitable or not, (b) the execution of actions can be supervised, and (c) gestures, eye movements, and many other actions can be interpreted.

If we take a closer look at the subtasks just identified, we see that understanding processes require not only the processing of data and the extraction of features but also the extraction of symbolic or qualitative properties and their adequate representation. The adequate representation is an important issue as inference processes are based upon it. The understanding of complex meanings needs complex inference processes that are more than simple comparisons. Furthermore, if the inferences are able to cope with uncertainties in the extracted qualitative features, then successful understanding is possible even if intermediate results are partially erroneous due to noisy input data from real environments.

In this context, general questions about communication should also be addressed. Communication is context dependent. Therefore, most research in this area is focused on specific situations or domains (Rickheit & Strohner, 1994). A number of encouraging approaches towards a better understanding of a natural interaction between humans and computer systems can be found for restricted situations (e.g., Nagel, 1988; Wahlster, 1989; Mc Kevitt, 1994b; Thórisson, 1995). Many researchers investigate the use of different communication and perception channels. The interpretation and understanding of speech or textual input of natural language is a major issue (e.g., Lee, 1989; Mast et al., 1994; Nakatani & Itho, 1994). The visual recognition of scenes, gestures, faces, etc. is a further source of communication information (e.g., Kollnig & Nagel, 1993; André et al., 1988; Kjeldsen & Kender, 1996; Leung et al., 1995; Face & Gesture, 1996).
1.1 Motivation

At the University of Bielefeld the joint research project "Situated Artificial Communicators" (Sonderforschungsbereich (SFB) 360, supported by the German Research Foundation, DFG) has been established to study advanced human-computer interaction. The goal is to develop an integrated system where visual, linguistic, sensory-motor, and cognitive abilities interact.
Relevant aspects of natural communication and interaction are studied and implemented.
The underlying scenario is the cooperative assembly of toy airplanes using the Baufix®² construction kit. A human plays the role of an instructor and gives verbal instructions to a system, a 'situated artificial communicator'. The system QuaSI-ACE³ is developed in this context as a prototype of a 'situated artificial communicator'. The verbal instructions are recorded and digitized. QuaSI-ACE has a stereo camera to observe the scene. Using the information from both speech and visual data, the system should understand the instructions, relate them to the objects in the scene, and carry them out. Fig. 1.1 sketches QuaSI-ACE, which is currently under development.
Fig. 1.1: The goal of the joint research project "Situated Artificial Communicators" is to build a system with the ability to interact with human instructors as shown in the image above. The understanding of verbal instructions is integrated with the visual observation of the assembly process for a successful cooperative assembly.
Spoken language is a primary means of human communication. But often spoken phrases are ambiguous and highly context-dependent. Context information can be obtained through other perceptual skills, for example, vision. The visual understanding of the environment provides key knowledge and common grounding for successful communication. For humans, vision is one of the most used and best developed senses. Humans can very quickly observe a scene and then describe objects in that scene. The visual cognitive abilities of humans include not only generating descriptions but also understanding references to visible objects. It is easy for humans to visually identify what is mentioned in scene or object descriptions. 2
² Wooden toy construction kit; see Fig. 2.1 for an airplane built with Baufix and Appendix B for all objects of the domain.
³ Qualitative Speech and Image understanding in an Artificial Communication Environment.
In this thesis, we address the design and implementation of a high-level computer vision component for QuaSI-ACE, the prototype of a situated artificial communicator which is an integrated speech and image understanding system. The domain is the Baufix assembly scenario. The major requirements which guided the design of the high-level vision component are:
- Understanding which objects are referred to in the verbal instructions. This comprises the extraction of visually observable object features from images which are relevant in the communication with humans. Furthermore, the observed features have to be represented adequately for an easy identification of verbal references.
- Generating meaningful object or scene descriptions, which means the generation of qualitative descriptions from images that are useful as a response to the human instructor.
This leads to the general questions which are addressed throughout this thesis:
- What kind of information has to be extracted from images?
- What kind of information can be extracted from images?
- How should we represent this information for an integrated understanding of the verbal instructions and images?
1.2 Contributions

The contributions of this thesis are modules suggesting an approach to answering the questions just raised. We designed and implemented a high-level computer vision component for integrated speech and image understanding which includes the extraction of qualitative features from images and the identification of visually observed objects given spoken instructions. The image understanding component works fully automatically and is able to generate descriptions as well as to understand instructions about an observed scene. It is suitable for the interaction with a human instructor.

Well-known techniques in computer vision and knowledge representation are combined in a beneficial way within a general framework of different levels of abstraction. Our approach can cope with noisy image data and erroneous intermediate results. We apply computer vision techniques to use cameras efficiently as visual sensors. Insights from psychology, linguistics, and cognitive science are incorporated, and results and empirical data of psycholinguistic experiments investigating human communication in the context of the chosen scenario are taken into account. The vision component extracts the relevant visual information from images and represents it adequately for the communication with human instructors in the given scenario.
Starting with the hypotheses available from object recognition components provided by collaborators (Moratz et al., 1995; Heidemann & Ritter, 1996), we compute a model-based three-dimensional reconstruction of the observed scene from uncalibrated stereo images (see Chapter 4). The 3D data are of great advantage in the communication scenario for the following reasons: (a) Human perception is three-dimensional and the assembly task is performed in a three-dimensional environment. Therefore, 3D data are very helpful and sometimes even necessary to follow the instructions. (b) The camera view and the view of the human instructor are decoupled. The view of the instructor can be inferred from a 3D scene representation if her/his position is known. We use this primarily for the understanding of spatial relations (see Section 5.4). (c) Ambiguities in the image data which are due to the loss of depth information in the perspective projection can be resolved using the 3D model information.

The 3D reconstruction provides a numerical scene representation. This representation is stable and viewpoint invariant. However, for any verbal communication with humans, the system needs symbolic or qualitative information. For that purpose, we designed and implemented modules which compile qualitative object and scene properties from images and the numerical 3D data (see Chapter 5). The resulting qualitative descriptions form a high-level abstraction of the visual data. We introduce a vectorial fuzzified representation for the qualitative descriptions. This is an elegant way of representing multiple concurring, or even contradicting, hypotheses. A qualitative property, i.e. type, color, size, shape, or spatial relations, is characterized by a set of categories and likelihood values assigning to each category an applicability for a given object or object pair.

We furthermore designed and implemented a Bayesian network which enables the integrated speech and image understanding system QuaSI-ACE to identify those objects in the scene which are referred to by the human instructor (cf. Chapter 6). The probabilistic object identification accounts for the uncertainties in the input data as well as in the recognition and understanding processes. We support data-driven (bottom-up) as well as model-driven (top-down) data flow. The system is also designed for incremental processing; therefore, intermediate results are available as restrictions at any time. The restrictions can be propagated within the bidirectional data flow. The Bayesian object identification module supports the bidirectional data flow as well, and it is used for the exchange of restrictions between image and speech understanding.

We focus in this work on the qualitative understanding of visual data. Therefore, the algorithms are not designed to extract a specific physical quantity with the best possible accuracy, but to compile qualitative descriptions of it. Our goal is to robustly obtain qualitative quantities at increasing levels of abstraction which are suitable for a natural human-computer interaction.
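To make the vectorial fuzzified representation more concrete, the following minimal Python sketch shows how a qualitative property could map categories to likelihood values. The class name, its methods, and the example values are illustrative assumptions only, not the thesis implementation; the color categories are taken from the Baufix domain.

```python
# Minimal sketch of a fuzzified qualitative property vector (illustrative only,
# not the thesis implementation): each category gets a likelihood in [0, 1],
# so several competing or even contradicting hypotheses can coexist.

COLOR_CATEGORIES = ["white", "red", "yellow", "orange", "blue", "green", "purple"]

class QualitativeProperty:
    def __init__(self, categories, likelihoods=None):
        self.values = dict.fromkeys(categories, 0.0)
        if likelihoods:
            self.values.update(likelihoods)

    def applicability(self, category):
        """Likelihood that this category applies to the object (or object pair)."""
        return self.values.get(category, 0.0)

    def best(self):
        """Category with the highest likelihood, i.e. the preferred hypothesis."""
        return max(self.values, key=self.values.get)

# Example: a color classifier that favors 'red' but keeps 'orange' as a rival hypothesis.
color = QualitativeProperty(COLOR_CATEGORIES, {"red": 0.9, "orange": 0.4})
assert color.best() == "red" and color.applicability("blue") == 0.0
```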
1.3 Outline

In Chapter 2, we provide an overview of QuaSI-ACE. It forms the context for the image understanding component which is the subject of this thesis. We explain and define the meaning of image understanding for this work in Chapter 3. Different approaches and related work are presented and discussed. In the subsequent chapters, we address the individual modules of the image understanding component.

Chapter 4 describes our 3D reconstruction approach. We use a model-based 3D reconstruction of the observed scene as a first stable high-level numerical scene representation. Qualitative descriptions of the object type, color, and spatial relations build a high-level abstract representation of the image content. Type, color, and spatial relations are the qualitative properties that are extracted from the images and the 3D scene representation. Chapter 5 outlines and discusses the representation and the computation of these qualitative features. We have chosen the representation of the qualitative properties to be rather general, whereas the extraction of the qualitative properties is specific to the scenario. This guarantees the relevance of the extracted information in the given scenario. Chapter 6 describes the identification of the object(s) which are intended in the instructions. It is a Bayesian network approach which is based on the extracted qualitative descriptions.

Results of the entire high-level computer vision component and a discussion follow in Chapter 7. Conclusions are drawn in Chapter 8. Specific details such as notations, a description of the objects of our domain, and mathematical background are given in the appendices.
Chapter 2
A Situated Artificial Communicator
Situated artificial communicators (SAC) are computer systems which reconstruct the behavior of human communicators in relevant aspects. Human communication includes not only verbal but also non-verbal communication and the interaction with the physical environment (Rickheit & Strohner, 1994). This requires understanding and producing language and signs, the execution of actions, and the ability to report the consequences of actions. The simulation of these abilities needs strategies that can cope with complex tasks. The three principles of situated artificial communicators, namely situatedness, integration, and robustness, aim to provide suitable mechanisms (Schade, 1995). We briefly explain the three principles in the following:

Situatedness: A system that is intended to carry out actions in a natural environment must be able to find its way about a given situation. This means that the system has to capture its environment with its sensors in addition to understanding the actual task. The environment and the current situation are additional sources of information which can be used in a beneficial way. Situatedness means the ability of an intelligent system to profit as much as possible from knowledge about the current situation. This knowledge can be used to improve the robustness and efficiency of perception, communication, and action execution (Barwise & Perry, 1983; Lobin, 1993). It is much easier to understand a person who is mouthing words if we know what this person is talking about than if we are not aware of the context. A common source of misunderstandings in everyday conversation is that people think of different scenarios and relate the communication to different situations with different meanings. We also recognize objects visually more easily when we expect them in a certain context. We would rather think, for example, that something with the form of a flexible tube lying on a sidewalk is a hose and not a snake. The efficient use of situated knowledge provides restrictions for understanding, disambiguation, and inferences.

Fig. 2.1: The scenario chosen for the joint project "Situated Artificial Communicators" is the cooperative assembly of toy airplanes using the Baufix construction kit.

Integration: A good strategy to cope with the complexity of the interaction with humans is the smart combination of different sources of relevant information. Furthermore, not
only isolated capabilities are required. Integrated abilities to perceive and exploit different types of information help the system to react adequately. Understanding processes profit from mutual restrictions. Model-driven processing can then be applied using the available expectations about what to understand. The search space is restricted and the noise in the sensor data can be better compensated. For example, if the instruction "give me the red cube" is uttered and there are few red objects in the scene, then object recognition can concentrate on these objects (a toy sketch of such a restriction follows the three principles below).

Mutual restrictions are also good for error correction. Errors can occur when an instruction is produced ("errare humanum est"). Speech and object recognition errors are always possible. Thus, it is not unlikely that a SAC may not be able to find an object in the image data which has exactly the named features, according to the SAC's perception. If, for example, there is only one cubic object in the image, then an instruction referring to a cube is most likely to intend this object even if not all named and perceived features correspond. Either error (in perception or production) can be corrected. Elliptic references can only be understood with the help of additional information. For instance, the utterance "give me the object behind the long thing" is useless without spatial information and without information about the relative length of the objects. A visual channel can provide the necessary additional cues.
Robustness: Robustness and error tolerance are further requirements for situated artificial communicators. Everyday language is, in contrast to formal languages, often vague, ambiguous, and fragmented. Slips of the tongue can occur, and minor syntactical errors are contained in nearly every spoken phrase. The SAC must not only cope with irregularities in human spoken language but also with noisy and distorted sensor data. Humans can communicate very efficiently through their language. They do not need formal correctness. Irregularities, vagueness, and ambiguities are overcome through the use of additional information and domain or situated knowledge. Everyday language, in contrast to formal languages, does not work autonomously but is embedded in situation and knowledge (Herrmann, 1985).

Systems that use robust and efficient algorithms according to the paradigms of situated artificial communicators promise powerful, advanced human-computer interaction. The prototype of a situated artificial communicator (QuaSI-ACE) which is developed at the University of Bielefeld is described in the following sections. First, the scenario is sketched. We then outline the employed knowledge-based interpretation. This is followed by a section about QuaSI-ACE and an overview of QuaSI-ACE's high-level vision component, which is the subject of this thesis.
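The following toy sketch illustrates the mutual restrictions mentioned under the integration principle: an uttered color is used to narrow down the visual object hypotheses. The data layout, function name, and threshold are assumptions made only for this illustration and do not correspond to the system's actual interfaces.

```python
# Toy sketch of a mutual restriction between speech and vision (assumed data
# layout, function name, and threshold): an uttered color narrows down which
# object hypotheses are worth considering.

def restrict_by_color(hypotheses, uttered_color, threshold=0.5):
    """Keep only hypotheses whose color likelihood supports the utterance."""
    return [h for h in hypotheses if h["color"].get(uttered_color, 0.0) >= threshold]

hypotheses = [
    {"id": 1, "type": "cube", "color": {"red": 0.9, "orange": 0.2}},
    {"id": 2, "type": "bar",  "color": {"yellow": 0.8, "red": 0.1}},
    {"id": 3, "type": "cube", "color": {"red": 0.7, "orange": 0.4}},
]
candidates = restrict_by_color(hypotheses, "red")   # hypotheses 1 and 3 remain
```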
2.1 Scenario

The cooperative assembly of toy airplanes, such as the one shown in Fig. 2.1, is the chosen scenario for the joint research project "Situated Artificial Communicators" at the University of Bielefeld. The system QuaSI-ACE is developed in this context as a prototype of a situated artificial communicator which simulates the constructor in this scenario. A human gives instructions to QuaSI-ACE in a task-oriented dialog. QuaSI-ACE observes the assembly platform (the scene) with a color stereo camera, providing a source of visual information about the assembly process. The visual perception of aspects relevant to the communication should be similar to the instructor's perception. As QuaSI-ACE plays the role of the constructor, it should understand the instructions and react according to them. This means that the system should either carry out the instructions or check back in case of misunderstandings or ambiguities. Fig. 2.2 sketches the principal components of the system QuaSI-ACE.

Speech understanding is applied to recognize and to interpret the verbal instructions. The content of an instruction is represented qualitatively. We assume at the current stage of the system implementation that an instruction contains a reference to an object in the scene. We call this object the intended object. The qualitative description of the utterance therefore comprises a reference to the intended object, which is characterized in terms of the object's type, color, size, shape, and spatial relations to other objects. In case of ambiguities or non-understandings, the system checks back with the instructor.
[Fig. 2.2 block diagram: the INSTRUCTOR (MIC, CAMERAS) feeds Speech Understanding and Image Understanding, which produce Qualitative Descriptions for the Inference Machine (Object Identification, Command Execution) controlling a MANIPULATOR in the SCENE.]
Fig. 2.2: Overview of QuaSI-ACE, the prototype of a situated artificial communicator: The speech and image understanding modules derive qualitative descriptions from speech and visual input data. An inference machine is used for the object identification and command execution.
The scene is captured by cameras, and qualitative descriptions for each recognized object in the images are extracted. The qualitative information that is relevant in the communication, i.e. object type, color, and spatial relations, is computed by the image understanding component. The common qualitative representation of (intermediate) understanding results is a prerequisite for the integration of speech and image understanding. It forms the basis for the exchange of mutual restrictions and for the identification of the object(s) that are intended in the instruction(s).

The identification of the intended object(s) is driven by an inference machine. The inference machine should also infer the actual command intended by the instructor and trigger its execution. So far, the command execution is not yet fully implemented.

The representation of the situated knowledge (domain knowledge and knowledge about the communication situation) and the knowledge-based interpretation are essential for the understanding processes. We apply the semantic network formalism ERNEST (Niemann et al., 1990) for knowledge representation and knowledge-based interpretation. We give an overview of knowledge-based interpretation techniques and a description of ERNEST in the following section.
2.2 Knowledge-Based Interpretation

Knowledge-based techniques aim to generate individual symbolic descriptions of domain entities. In contrast to classical approaches in artificial intelligence which deal with symbol-to-symbol transformations, knowledge-based interpretation techniques are concerned with the transformation of numerical data into symbolic descriptions (Sowa, 1991). It is not possible to achieve this overall goal by optimizing one decision function or by adjusting weights to a given problem. The task must be decomposed into several processing steps. But due to the variability of the sensor or numerical data and the aim of individual descriptions, the sequence of processing as well as the transformation steps to be applied cannot be fixed a priori. Therefore, we have to deal with a search process where it must be decided for each step which transformation is best in the current state in order to achieve the overall goal of the analysis process.

The search process must be guided and restricted by information about the problem domain and the specific task. The knowledge about objects, events, structural properties, and constraints must be explicitly represented in such a way that it can be efficiently used in the interpretation process. A knowledge base must cover models which enable connections to be established between numerical sensor data and symbolic entities. Fig. 2.3 reflects the two main lines of knowledge acquisition for the construction of a knowledge base (Niemann, 1985).

[Fig. 2.3 diagram: the physical environment contributes time and space constraints, leading to an explicit representation of structural properties; the symbolic world contributes experience and commonsense knowledge, leading to an explicit representation of rules and algorithms; both feed the declarative and procedural knowledge of the knowledge base.]
Fig. 2.3: Knowledge-based systems use an explicit knowledge representation. The knowledge is derived from common sense knowledge and experience as well as from time and space constraints of the physical environment.
The baseline of knowledge-based interpretation is given by a state search approach (Sagerer, 1990). The initial state is given by some complex pattern f(c) and the knowledge base. This covers the modeled entities, procedures, and functions which realize transformations and inference processes. The transformations operate between and inside both the numerical and the symbolic world. An inference process provides a state transformation by generating new states or by manipulating data. If data(i) denotes the knowledge base and the already achieved intermediate results of state i, and T = {T1, ..., TN} is the set of transformations, the complete interpretation process can be outlined by a search tree as depicted in Fig. 2.4. The initial state data(0) ∋ f(c) includes the input pattern, and a final state contains its symbolic description data(M) ∋ B(f(c)). In general, several transformations can be applied to one state. They compete with each other, and the successful and optimal sequence of transformations forms a path in the search tree.

Fig. 2.4: Search tree of an interpretation process (cf. Sagerer, 1990): the root state contains f(c), competing transformations T1, ..., Ti, ..., TN expand the states, and a final state contains the symbolic description B(f(c)).
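The state search can be rendered schematically as in the following sketch. The best-first strategy stands in for the A*-controlled analysis described later in this section; the scoring function, goal test, and transformation interface are placeholders assumed for this sketch, not the actual analysis rules.

```python
import heapq

# Schematic best-first search over interpretation states (an illustration of
# the search-tree idea only; scoring, goal test, and transformations are
# placeholders, not the actual analysis rules).

def interpret(initial_data, transformations, score, is_goal):
    """initial_data plays the role of data(0) containing f(c);
    transformations plays the role of {T1, ..., TN}."""
    counter = 0                                    # tie-breaker for equal scores
    frontier = [(-score(initial_data), counter, initial_data)]
    while frontier:
        _, _, data = heapq.heappop(frontier)       # most promising state first
        if is_goal(data):                          # a data(M) containing B(f(c))
            return data
        for T in transformations:                  # competing transformations
            successor = T(data)                    # may fail and return None
            if successor is not None:
                counter += 1
                heapq.heappush(frontier, (-score(successor), counter, successor))
    return None                                    # no admissible interpretation

# Toy example with numeric states: reach the goal value 3 by adding 1 or 2.
result = interpret(0, [lambda d: d + 1, lambda d: d + 2],
                   score=lambda d: -abs(3 - d), is_goal=lambda d: d == 3)
assert result == 3
```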
For different types of knowledge, different representation mechanisms are necessary (Sagerer, 1990). Inquiries on knowledge representation (McCarthy & Hayes, 1969; Minsky, 1975; Schefe, 1986) point out that, for example, definitions, descriptions, structural relations, and constraints should be separated in their representation. Each such type provides its own inferences and consequently its own results. A serious problem is scoring when dealing with noisy and uncertain data. The efficiency of a system strongly depends on an adequate scoring calculus. It provides the hints as to which transformation should be applied next, which knowledge should be evaluated, and which intermediate results are to be processed.

As shown by Niemann et al. (1990), semantic networks are suitable for the representation of declarative and procedural knowledge as well as of the results of inference processes on the represented knowledge. The semantic network formalism ERNEST complies with the main requirements for knowledge-based interpretation that are outlined above. ERNEST is described in the following subsection. It allows us to handle a complete interpretation process as a knowledge-based state search and as an optimization problem with respect to the chosen scoring calculus.
Semantic Network Formalism ERNEST

The procedural semantic network formalism ERNEST (Niemann et al., 1990; Kummert, 1992; Kummert et al., 1993) is applied for knowledge-based speech and image understanding in the system QuaSI-ACE. It is used for knowledge representation as well as for guidance
and control of the knowledge-based understanding processes. ERNEST is characterized by a neat definition of the available data structures, especially of nodes, links, and substructures. A problem-independent control strategy based on the A*-algorithm is supplied for the purpose of its efficient use in the analysis of signal data.
Types of nodes and links

In the ERNEST semantic network language, three types of nodes (concept, modified concept, and instance) and five types of links (part, specialization, concrete, context-dependent part, and reference) exist which have well-defined semantics.

Concepts represent classes of objects, events, or abstract conceptions. Concepts may consist of links to other concepts, attributes (modeling numerical or symbolic features), and structural relations that hold between attributes or linked concepts. An instance contains information gathered from parts of the input signal. This part of the signal data is interpreted as the extension of some concept in the knowledge base. An instance is a copy of the related concept where the common property descriptions of a class are substituted by the values derived from the input signal. Instances of some concepts may not be computable in an intermediate state of the analysis because certain prerequisites are missing. Nevertheless, the available information can be used to constrain an uninstantiated concept. This is done via the node type modified concept.

Information can be propagated in the semantic network through links between concepts. The part link decomposes a concept into its natural components. The link of type specialization has an associated inheritance mechanism by which a specialized concept inherits all properties of the general one. The link type concrete is introduced for a clear distinction of knowledge into different levels of abstraction. The context-dependent part link can be used to express, for example, that an image region can only be interpreted as a hole in the context of an object containing it. Relationships that are established dynamically can be modeled by reference links. Through reference links, access is possible to valid instances created in some earlier state of the analysis.

In addition to its links, a concept is described by attributes representing mainly numerical features and restrictions on these values according to the modeled term. Attribute procedures may estimate attribute values directly from the signal input or other system components as well as calculate them from attributes of linked concepts. Scoring functions estimate the adequacy of the mapping with respect to the intensional model. Furthermore, relations defining constraints for the attributes can be specified and must be satisfied for valid instances.
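To make these node and link types more tangible, the sketch below models them as plain Python data structures. The field names and the simplified treatment of modified concepts are assumptions chosen for illustration and do not mirror ERNEST's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Illustrative data structures for ERNEST-style nodes and links (field names
# and simplifications are assumptions for this sketch, not ERNEST's API).

LINK_TYPES = ("part", "specialization", "concrete", "context_dependent_part", "reference")

@dataclass
class Concept:
    name: str
    links: Dict[str, List["Concept"]] = field(default_factory=dict)  # keyed by LINK_TYPES
    attributes: Dict[str, Callable] = field(default_factory=dict)    # attribute procedures
    relations: List[Callable] = field(default_factory=list)          # constraints on attributes

@dataclass
class ModifiedConcept:
    concept: Concept
    restrictions: Dict[str, object] = field(default_factory=dict)    # ranges constraining attributes

@dataclass
class Instance:
    concept: Concept
    attribute_values: Dict[str, object]  # values derived from the signal data
    score: float                         # adequacy of the mapping to the signal
```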
Analysis and control

The analysis starts with a modification of the goal concept chosen by the user. Top-down, part or concrete links are expanded. During the analysis, each concept of interest is transformed
into an extensional description (modified concept) representing the set of all admissible extensions. The analysis process is guided by six domain-independent rules. These inference rules successively refine and score the modified concepts by establishing new links according to the knowledge base, by calculating attribute values, and by propagating information along the links. When all prerequisites for the instantiation of a concept are fulfilled, all attributes can be computed and, as a result, scored instances are obtained. They represent single extensions of concepts and how well they map to the signal data.

As soon as a modified concept or an instance is computed, the values of its attributes can be used to restrict the further analysis as they are propagated through the network. Restrictions are either derived from the state of analysis reached so far or encoded in the knowledge base, and they give the range for the search of attributes. Restrictions are stored in modified concepts. By use of inverse attribute computations they are handed down to modified concepts. Both top-down and bottom-up flow of data is supported.

During the analysis a search tree is built. Each node represents a state of the analysis. As images and speech can be interpreted ambiguously, competing instances and thus competing states of analysis are calculated. The goal is to find an optimal mapping between concepts and the input signal data with respect to the modeled scoring functions. This mapping is constructed by a search process. The A*-algorithm is used to direct the analysis in order to focus on the most promising interpretation.
2.3 Integration of Speech and Image Understanding

We developed an integrated speech and image understanding approach in the system QuaSI-ACE according to the above-mentioned paradigms of situated artificial communicators. We start with a set of examples of instructions which QuaSI-ACE is able to understand so far. This motivates the current system architecture and implementation, which is then outlined.
Instructions

So far, the instructions the system can understand are in German, and they are simple directions for actions like "nehmen" (take) and "geben" (give, hand). Pure identifications where the system only has to show an object are possible, too. The instructions can be embedded in simple dialogs. The current implementation of QuaSI-ACE is focused on the identification of objects and the understanding of isolated instructions. Complete assemblies are not yet possible. Therefore, the aim of the simple dialogs between instructor and system is the unique identification of the intended object. A dialog ends when this goal is achieved. The object(s) that are referred to in the instructions can be described in terms of type, color, size, shape, and spatial relation relative to other objects. At any place in an utterance, fillers
can occur, which carry no meaning that is relevant right now. Not only complete sentences but also elliptical phrases are understood. Examples of possible utterances are:
- Ähm, gib mir die kleine lila Scheibe! (Eh, hand me the small violet ring!)
- Ich möchte die lange Leiste! (I want the long bar!)
- Nimm den Klotz vor der kurzen Leiste! (Take the cube in front of the short bar!)
- Mhm, den grünen! (Mhm, the green one!)
- Kannst du mir die rote Schraube geben! (Can you hand me the red bolt!)
- Ich möchte das große Ding! (I want the big object!)
These sentences or phrases can be used in simple dialogs between instructor and system. For the moment, no rules are incorporated to cope with ambiguous instructions. Therefore, QuaSI-ACE requires a specification of the intended object that makes a unique identification possible with respect to the current scene (or what the system has captured from the scene). In case of misunderstandings or ambiguities, the system checks back as long as the intended object cannot be identified uniquely. QuaSI-ACE asks for the repetition of an instruction, or it points out which set of objects recognized in the images corresponds to what it captured from the instruction.

The following is a typical example of a dialog with a successful outcome between the instructor (I) and the system (S). The information contained in all utterances throughout one dialog is accumulated, as all utterances in a dialog (the dialog-steps) should refine an instruction until a unique specification with respect to the current scene is possible. The information is discarded when explicit corrections (containing a negation) occur.

I: Gib mir den Schraubwürfel. (Hand me the threaded cube.)
S: Ich habe drei Würfel gefunden, welchen meinen Sie? (I have found three cubes, which one do you want?)
I: Den roten! (The red one!)
S: Ich habe zwei rote Würfel gefunden, welchen meinen Sie? (I have found two red cubes, which one do you want?)
I: Oh nein, ich möchte die Mutter. (Oh no, I would like the nut.)
S: Ich habe eine orange Rautenmutter gefunden. (I have found an orange rhomb-nut.)

Utterances in speech are not unique regarding the relation between interpretation and the words or structure used. The range of constructs and the choice of words is immense. Even in a restricted domain like ours, the problem of mapping utterances onto their interpretation still demands much attention.
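The accumulation of information across dialog-steps can be pictured with the following simplified sketch. The dictionary representation and the reset-on-correction rule are assumptions for illustration only; the real system propagates such restrictions through the semantic network and the identification module.

```python
# Simplified sketch of how dialog-steps refine the specification of the
# intended object (assumed representation; the actual system propagates
# restrictions in the ERNEST network and the Bayesian identification module).

def accumulate(dialog_steps):
    """Each step is a dict of uttered features; 'correction' marks a negation."""
    spec = {}
    for step in dialog_steps:
        if step.get("correction"):          # explicit correction: discard old info
            spec = {}
        spec.update({k: v for k, v in step.items() if k != "correction"})
    return spec

steps = [
    {"type": "cube"},                           # "Hand me the threaded cube."
    {"color": "red"},                           # "The red one!"
    {"correction": True, "type": "rhomb-nut"},  # "Oh no, I would like the nut."
]
assert accumulate(steps) == {"type": "rhomb-nut"}
```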
System Architecture

Fig. 2.5 shows the architecture of QuaSI-ACE. Here only the system parts relevant for speech and image understanding are depicted. The triangles in Fig. 2.5 represent the different modules. The core of this system is the homogeneous knowledge base for speech and image understanding. The knowledge base is modeled using the procedural semantic network system ERNEST. The linguistic part of the knowledge base is only shown as a shaded parallelogram for the sake of readability. Objects, the scene, linguistic entities like words and phrases, and dialog-relevant entities are modeled as concepts. The knowledge is stored hierarchically with stepwise increasing levels of abstraction. At the bottom level, the knowledge specific to the speech and object recognition processes is represented. Common qualitative descriptions of visually observed and verbally referred objects build the top level of abstraction. The different levels of abstraction are connected through concrete links.

Speech Understanding

The dialog between instructor and QuaSI-ACE is central for the chosen scenario, and it is modeled in the concept DIALOG. A dialog consists of multiple dialog-steps connected by higher-dimensional part links. Only the instructor's dialog-steps (INST_DIALOG_STEP) are shown in Fig. 2.5. The instantiation of INST_DIALOG_STEP is achieved by procedurally coupling linguistic concepts to the speech recognition module, which is based on Hidden Markov Models (Fink et al., 1994). The linguistic part of the knowledge base is an adaptation of the EVAR system described in (Mast et al., 1994). It is based on Winograd's (1983) speech understanding model and Fillmore's (1968) deep case theory.
[Fig. 2.5 semantic network sketch: the concepts SCENE, DIALOG, INST_DIALOG_STEP, DIALOG_ST_IO, DIALOG_ST_RO, OBJECT, 3D_OBJECT, and 2D_OBJECT (plus the linguistic knowledge) are connected by part, context-dependent part, concrete, and reference links; the external modules 3D Spatial Model, Probabilistic Object Identification, 3D Reconstruction, Speech Recognition (acoustic data), and Object Recognition (image data) interface with the network.]
Fig. 2.5: The architecture of the system QuaSI-ACE, a prototype of a situated artificial communicator: The triangles represent the system modules.
Image Understanding

A scene (modeled by the concept SCENE) consists of objects and relations between them. Objects are represented in three different ways. The concept 2D_OBJECT models an object that is shown in an image. As there could be more than one image of a scene (stereo images), multiple instances of the concept 2D_OBJECT can exist for one physical object. 2D_OBJECT interfaces with the object recognition component that incrementally detects objects in an image based on the integration of neural and semantic networks (see Section 5.2.1 and Heidemann et al., 1996; Sagerer et al., 1996). The concept 3D_OBJECT represents the 3D pose of an object. The model-based reconstruction module instantiates 3D_OBJECT with the pose of each object in 3D (see Chapter 4). The concept OBJECT stands for one physical scene object.
An instance of the concept OBJECT groups all projections of that object in all images and its 3D pose together. A qualitative description of that object containing its type, color, and geometric description is stored in this instance. The geometric description and the 3D pose of objects are used to determine spatial relations between objects (cf. Section 5.4). The concept OBJECT is linked with concrete links to 2D_OBJECT and 3D_OBJECT.

Generation of Linguistic Language Models

It is assumed that the instructor is cooperative and refers in the instructions to the scene. The qualitative information extracted from the image data and represented in the instances of the concept OBJECT can therefore be used to automatically generate linguistic language models for robust speech recognition (Naeve et al., 1995). If no object is detected yet, general domain-specific linguistic language models are used. As soon as at least some objects in the scene are recognized, the qualitative descriptions of these objects can restrict the speech understanding process. The restrictions are propagated in the semantic network. The automatically extracted language models (Fink et al., 1992) represent the admissible linguistic structure of constituents in an instruction that refer to objects which are recognized in the image(s).

Object Identification

QuaSI-ACE's goal is the identification of the verbally referred object(s) among the visually perceived objects. Hence, the system strives to instantiate the concept for the intended object(s) (DIALOG_ST_IO) which is referred to by the instructor in some dialog-step (INST_DIALOG_STEP). The instructor and the system are engaged in the dialog (DIALOG) about the observed scene (SCENE) (see Fig. 2.5). The concept DIALOG_ST_IO contains a qualitative description of the object, i.e. type, color, shape, size, and spatial relations to some reference object, which is intended in the instruction (noun phrase). DIALOG_ST_IO is a context-dependent part of the concept INST_DIALOG_STEP, i.e. the qualitative information about the intended object is derived from the concept INST_DIALOG_STEP. The concept DIALOG_ST_IO is linked by a multi-dimensional reference link to the concept OBJECT. It is required for each instantiation of DIALOG_ST_IO that at least one object (an instance of the concept OBJECT) is dynamically linked. The semantics of this reference link is that the linked scene object(s) is/are intended in the instruction. The inferences for the object identification are based on Bayesian networks (see Chapter 6). The probabilistic inference module is embedded in the semantic network formalism. If the instruction contains spatial relations, a reference object also has to be linked dynamically, to an instance of the concept DIALOG_ST_RO. In this case, the computation of spatial relations by the external spatial module is invoked (see Section 5.4).
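As a rough illustration of this probabilistic identification step, the sketch below scores each scene object by how well its observed qualitative vectors match the uttered description. The factorized scoring, the implicit uniform prior, and the data layout are simplifying assumptions made for this sketch; they are not the Bayesian network of Chapter 6.

```python
# Rough illustration of scoring scene objects against an uttered description
# (a naive factorized approximation with a uniform prior; the real system
# uses the Bayesian network described in Chapter 6).

def identify(scene_objects, uttered):
    """scene_objects: {object_id: {feature: {category: likelihood}}};
    uttered: {feature: {category: likelihood}} from speech understanding."""
    scores = {}
    for obj_id, observed in scene_objects.items():
        p = 1.0
        for feature, categories in uttered.items():
            if feature in observed:
                # probability that utterance and observation agree on a category
                p *= sum(lik * observed[feature].get(cat, 0.0)
                         for cat, lik in categories.items())
        scores[obj_id] = p
    best = max(scores, key=scores.get)
    return best, scores

scene = {
    "obj1": {"type": {"cube": 0.9}, "color": {"red": 0.8, "orange": 0.3}},
    "obj2": {"type": {"cube": 0.7}, "color": {"blue": 0.9}},
}
print(identify(scene, {"type": {"cube": 1.0}, "color": {"red": 1.0}}))  # obj1 wins
```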
ERNEST is used to guide the analysis and to represent the domain knowledge for the system QuaSI-ACE. Speech recognition, object recognition, 3D reconstruction, and the computation of spatial relations are performed by specialized external modules. These are shown in Fig. 2.5 as small triangles shaded in dark grey. The external modules run in parallel to the knowledge-based analysis. The communication tool DACS (Fink et al., 1995) is used for the data transfer in this heterogeneous distributed system. More details about the current system implementation are documented in (Socher & Naeve, 1996).
2.4 The Image Understanding Component

Image understanding is an important ability for QUASI-ACE. It provides the information necessary to understand the instructions in the context of the assembly scenario. It is also a fundamental precondition for generating descriptions of the scene. For this work, image understanding means both the image analysis, i.e. the computation of numerical quantities, and the image interpretation, i.e. the extraction of qualitative features. Fig. 2.6 shows the architecture of the image understanding component for QUASI-ACE. This image understanding component, excluding object recognition and color classification, is addressed throughout this thesis. The architecture is described in the following subsection. Then, the three main modules and the guidelines for their design and implementation are explained.
2.4.1 Architecture

Boxes with rounded corners in Fig. 2.6 represent algorithms or modules. Three main modules are addressed throughout this thesis for high-level image understanding. These are 3D Reconstruction (Chapter 4), the computation of Spatial Relations (Section 5.4), and Object Identification (Chapter 6). Data are symbolized in Fig. 2.6 by rectangular boxes. The text above each box describes the type of data. As image understanding is a hierarchical process, three basic levels of abstraction are modeled. Real image data, i.e. stereo images, are the source of input data for image understanding. These are two-dimensional projections of the scene. Object recognition is performed on the images and leads to object hypotheses. They are a 2D object representation. An estimation of the three-dimensional geometric scene is obtained by 3D reconstruction. Images, object hypotheses, and the geometric representation of the 3D scene are numerical data. A qualitative description is a symbolic vectorial representation of each object. It is extracted from the images and the 3D data in order to represent relevant visual information adequately for an identification of those object(s) which are verbally referred to in instructions. The generation of scene or object descriptions is also supported without the context of an instruction. Knowledge is represented in Fig. 2.6 by shaded rectangles. Three types of knowledge are necessary. Object models are needed for 3D reconstruction, knowledge about the current reference frame is required for the computation of spatial relations, and domain knowledge is
[Fig. 2.6 (content): an example utterance ("... nut below the socket") yields a qualitative description (type, color, spatial relations, each with fuzzy scores) which the Object Identification module matches against the qualitative descriptions of the scene objects; the latter are derived from the stereo images via object recognition, color classification, 3D Reconstruction, and Spatial Relations, using Domain Knowledge, the Reference Frame, and the Object Models.]
Fig. 2.6: The high-level image understanding component for a prototype of a situated artificial communicator (QUASI-ACE): Boxes with rounded corners represent algorithms or modules. Data (input to or output from the different modules) are drawn as rectangular boxes with text indicating their type. Shaded rectangles represent external knowledge. The light shaded rectangle at the top represents external data from speech understanding. Arrows indicate the data flow.
used for the object identification. The light shaded box at the top of Fig. 2.6 represents the qualitative description of an utterance. It results from speech understanding. Restrictions for this qualitative description can be inferred from the image data. The qualitative description of visually recognized objects and the qualitative description of the object(s) verbally referred to in an instruction are mapped through the object identification process. Restrictions for missing information in either source (visual or verbal) can be inferred. The data flow within the image understanding component is shown through the arrows in Fig. 2.6. The object identification module and the computation of spatial relations are designed for data-driven as well as for model-driven processing. Estimates about camera calibration and the 3D pose of objects can be incorporated in the 3D reconstruction approach. With no restrictions, the 3D reconstruction works data-driven.
2.4.2 Module Design

The three main modules and their design guidelines are outlined in the following.

3D Reconstruction

Neuropsychological findings indicate that object localization is decoupled from object recognition (Kosslyn et al., 1990). This is one reason for separating object recognition and 3D reconstruction in this image understanding component. The high-level image understanding starts from object hypotheses. Object hypotheses result from object recognition which is provided by collaborators in the joint research project "Situated Artificial Communicators" (Moratz et al., 1995; Heidemann & Ritter, 1996). It is not required that all objects are recognized, nor that all hypotheses are correct. Although object identification can be solved using sets of 2D representations, an interactive assembly task requires 3D information. 3D data considerably facilitates image understanding in our scenario. Human perception is three-dimensional and the assembly takes place in a three-dimensional environment. The understanding of spatial localizations under changing reference frames is not possible without 3D data. Furthermore, ambiguities in the image data often result from the loss of depth information in the perspective projection. Occlusions and ambiguities can therefore be resolved with 3D information. In addition, the 3D pose of an object is necessary for grasping purposes. The design and implementation of our 3D reconstruction approach is described in Chapter 4. It is a model-based approach for the estimation of the 3D poses of multiple objects from uncalibrated stereo images.

Qualitative Descriptions

We generate qualitative descriptions of the objects perceived in the scene and use them in the communication with humans. An object is characterized by its type, color, and by spatial
relations to other objects. The type of the object is extracted from the results of object recognition as well as from the 3D data. There are fewer false object hypotheses in the 3D scene representation due to the model-based verification during the 3D reconstruction. The color of the object results from a color segmentation of the image which is a preprocessing step for object recognition. We compute spatial relations using a 3D computational model (cf. Section 5.4). A vectorial fuzzified representation (see Section 5.1) is chosen for the qualitative descriptions. This representation offers the possibility to model overlapping meanings and multiple competing hypotheses. Furthermore, there is greater flexibility in the representation of the perceived object characteristics. The naming of objects and their features in the instructions varies widely; it may differ from instructor to instructor, for example colors may be named as red or orange. The chosen representation accounts for this fact. Fuzzy and overlapping restrictions can also be held for missing object or scene features. The qualitative description of the content of an instruction contains, in addition to the features for visually perceived objects, the features shape and size. These are inferred from domain knowledge for the visually perceived objects in the object identification process (cf. Section 6.2.3).

Object Identification

The object identification is based on a Bayesian network approach (Pearl, 1988) (see Chapter 6). In contrast to the deterministic knowledge-based understanding, probabilistic techniques are chosen here. We want to cope with different kinds of uncertainties resulting from recognition errors, ambiguities, incorrect instructions, etc. In our opinion, the development of systems which are able to solve complex tasks requires the study and the combination of various approaches to overcome the limitations of single techniques. We combine the two fundamental paradigms, deterministic and probabilistic techniques.

Guidelines

The implemented high-level vision component shows incremental processing at each step. 3D reconstruction is an iterative process, so information is available at any time; the results improve the more computation time is available. No time-consuming calibration is necessary at the beginning. The model-based reconstruction approach uses domain knowledge efficiently. The Bayesian network approach for object identification also fulfills any-time requirements. A probabilistic decision about which object is identified can always be made. The probability of a correct decision increases the more information has been propagated through the network. Ballard (1996, p. 15) points out that "a way of reducing [computational] cost is to have a solution that has a high probability of being the best, but may not actually be the best. Yet another way of approximating is to have a solution that is almost correct. In a biological system that needs time to implement solutions, a partial solution may be satisfactory, as the rest can be filled in later on."
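As a rough illustration of this any-time behavior, the following sketch accumulates evidence scores for candidate objects and can return its current best guess after every piece of evidence. The scoring scheme is a simplified stand-in and not the Bayesian network of Chapter 6; all names and scores are made-up example values.

    # Simplified stand-in for an any-time identification decision (not the network of Chapter 6).
    def identify(candidates, evidence):
        """candidates: {object_id: {feature: {category: score}}};
           evidence: iterable of (feature, category) pairs from the instruction."""
        belief = {obj: 1.0 for obj in candidates}
        for feature, category in evidence:
            for obj, features in candidates.items():
                belief[obj] *= features.get(feature, {}).get(category, 0.1)
            # A decision is available after every piece of evidence (any-time property).
            yield max(belief, key=belief.get), dict(belief)

    candidates = {
        "object_1": {"type": {"cube": 0.9, "bar": 0.1}, "color": {"red": 0.8}},
        "object_2": {"type": {"cube": 0.2, "bar": 0.7}, "color": {"red": 0.3}},
    }
    for best, belief in identify(candidates, [("type", "cube"), ("color", "red")]):
        print(best, belief)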
System errors are tolerated and they should be corrected verbally by check-backs of the instructor in the dialog. In human-human communication many misunderstandings are solved interactively. Furthermore, understanding and therefore communication is influenced by the subjective perception and background of each individual (Lewis, 1969; Heydrich & Rieser, 1995).
Chapter 3
Image Understanding – Qualitative Descriptions
Image understanding denotes the ability to derive and to abstract specific, non-numerical information from images. This can be, for example, motion capturing in image sequences or segmentation of regions of interest. It can also mean to verbally describe aspects of the image content. In this work, we mean by image understanding the latter case: verbal descriptions of certain objects and their properties. Image understanding can be regarded as a transformation of information which is captured in a matrix of numerical values to a different type of representation where the information of interest, which is extracted from the image data, is made explicit. Thus, image understanding results in a representation of an abstraction of the image content. This definition raises two important points:
What is a representation, and what kind of representation should be used?
What kind of abstractions are used, and what kind of information is of interest in the image data?
Palmer (1978) defines a representation as “something that stands for something else. In other words, it is some sort of model of the thing (or things) it represents. This description implies the existence of two related but functionally separate worlds: the represented world and the representing world. The job of the representing world is to reflect some aspects of the represented world in some fashion. Not all aspects of the represented world need to be modeled; not all aspects of the representing world need to model an aspect of the represented world.”
To specify a representation totally, Palmer (1978) requires us to define: 1. what the represented world is,
2. what the representing world is,
3. what aspects of the represented world are being modeled,
4. what aspects of the representing world are doing the modeling,
5. what are the correspondences between the two worlds.
The represented world is the content of the image data, and the representing world consists of the results of the image understanding processes. Palmer's requirements 3 and 4 refer to our second question. They define what kinds of information are of interest in the represented world and what the means of representation are. Understanding is always task- or context-oriented. The task or context then defines what needs to be modeled. In our scenario, the information of interest is properties of objects or of the observed scene which can be named in instructions or which can be given as feedback to the instructor. In this thesis, these are the type, color, size, and shape of objects and the spatial relations between objects. Our representation for image understanding results is motivated and outlined in this chapter. An exact definition is given in Section 5.1. The correspondences in item 5 of Palmer's (1978) requirements are in our case the algorithms used for image understanding. These are described in Chapter 5. Representations for image data and image understanding results are the subject of studies in different research areas including computer vision, artificial intelligence, cognitive psychology, and linguistics. We sketch some ideas in the next section. Our qualitative representation is defined in Section 3.2. We refer to related work in the context of the system QUASI-ACE in Section 3.3.
3.1 Issues of Representation

Images and processes of image understanding are of interest in different ways in the following fields: (a) in computer vision, in order to study how to build artificial vision systems, (b) in artificial intelligence (AI), in order to apply AI techniques such as reasoning, inference processes, knowledge representation, etc. to data from visual sensors, and (c) in psychology and linguistics, in order to study biological vision systems and the processes of naming perceived information. Due to the different nature of interests, different representational schemes are used. We show examples of representational schemes developed in the above-mentioned fields in the following subsections. We refer only to ideas which have influenced our work. Reviews of work in these fields can be found, e.g., in (Kosslyn, 1994; Ullman, 1996).
3.1.1 Computer Vision

Marr's (1982) Vision book is one of the best-known computational theories of vision. He believes that the fundamental clues of information are discontinuities of light intensities
in images, which may be correlated with physically significant features of surfaces. Hence, the raw primal sketch is a 2D representation of line segments, blobs, and contours. From this, the primal sketch is computed as a representation of the two-dimensional geometry of the field of view. The 2½D sketch is then derived from the primal sketch, and it represents some geometric aspects of the objects in 3D. It is the input to construct a 3D object-centered shape representation, the goal of computer vision. All these representations are numerical. They are perfectly adequate for this kind of task, but they do not allow reasoning about the content of images at a symbolic or qualitative level. Most other pure computer vision approaches use numerical representations, and they are based on representations of increasing levels of abstraction (Nagel, 1979; Kanade, 1980). In addition, some authors (e.g., Nagel, 1988) attempt to derive conceptual or qualitative descriptions from the numerical representations as further levels of abstraction to generate, for instance, verbal descriptions of the image content. Other approaches (e.g., Brooks, 1981; Ballard & Brown, 1992) try to explain actions without high-level, abstract representations. The argument is that one needs very little representation when behaviors are taken to be the fundamental primitive (Brooks, 1981). The claim is that learning of categories can occur by means of the storage of and later matching to particular examples rather than by means of the abstraction of attributes. However, high-level reasoning and inferences are not possible with this type of approach.
3.1.2 Artificial Intelligence

Different kinds of representations have always been an issue in artificial intelligence (AI). Representation is important for many AI fields such as reasoning, inferences, decision processes, learning, and knowledge representation. It would lead us too far afield to discuss these ideas in detail (see e.g., Russell & Norvig, 1995). Most representational schemes are non-numerical. They can be divided into analog and propositional representations. Or they can be named according to their purpose, for example, symbolic representations in the context of language processing, or conceptual or categorical representations in the context of knowledge representation or activation. We would like to use the term qualitative representation as a superordinate concept. Qualitative models are based on discretizations that are distinct in some fundamental aspect of the domain. Qualitative representations are underdetermined (i.e. they stand for a class of possible instances) and context-dependent. These representations allow for reasoning at various levels of granularity (coarse vs. fine reasoning) (Mukerjee, 1994). The term should not be confused with qualitative physics or qualitative reasoning, which is a subfield of knowledge representation concerned specifically with constructing a logical, non-numerical theory of physical objects and processes based on qualitative abstractions of physical equations (De Kleer & Brown, 1984; Forbus, 1985).
3.1.3 Cognitive Psychology and Linguistics

The interest of psychologists and linguists in different representational schemes is to study and model aspects of human perception and understanding processes. Engelkamp (1990) states that processing and representation of visual and linguistic information is central to the modeling of inferences and cognitive processes. A variety of suggestions for representations exist which range from propositional or symbolic representations to mental models. The type of representation which is used in a model is mostly influenced by the task or context of the model. Here again, we present only the ideas which inspired our work. For Harnad (1987), categories are the basic representational units: "Categories and their representations can only be provisional and approximate, relative to the alternatives encountered to date, rather than exact. There is also no such thing as an absolute feature, only those features that are invariant within a particular context of confusable alternatives."
Medin & Barsalou (1987) distinguish between two different kinds of categories: (1) all-or-none categories and (2) graded ones. There are two subtypes of all-or-none categories: (1a) In 'well-defined' categories, all members share a common set of features and a corresponding rule defines them as necessary and sufficient conditions for membership (an example would be 'student'). (1b) In 'defined' (but not well-defined) categories the features need not be shared by all members, and the rule can be an either/or one (e.g., 'vehicle', with or without wheels). Graded categories (2) are not defined by an all-or-none rule at all, and membership is a matter of degree. Categories are collections of entities/objects that include as members all entities/objects having certain properties in common (Russell & Norvig, 1995). Harnad (1987) and Medin & Barsalou (1987) state that categorical representations may be graded, which is very useful for our scenario. Categories are also called classes, collections, kinds, types, and concepts by other authors. They have little or nothing to do with the mathematical topic of category theory. The organization of objects into categories is important for knowledge representation and recognition processes. Although the interaction with the world takes place at the level of individual objects, much of the reasoning takes place at the level of categories. Furthermore, categories may serve to make predictions about objects once they are classified. One infers the presence of certain objects from perceptual input, infers category membership from perceived properties of the objects, and then uses category information to make predictions about the objects. Subclass relations organize categories into a taxonomy or taxonomic hierarchy. Symbolic representations model knowledge and complex cognitive phenomena through symbols (Fodor & Pylyshyn, 1988). Symbols and categories are often used synonymously, depending on whether linguistic categories, or categories which are somehow related to words, are used. Symbolic representations are therefore often regarded as related to language. Mental models are different from the representations introduced so far (categorical or symbolic representations). Craig (1943) introduced the term mental model and it has become a predictive, explanatory principle. It has been created as a paradigm which itself needs con-
crete interpretation. The notion is that humans form an internal, mental model of themselves and of the objects and people with whom they interact. The term mental model is often used, however, to mean different things, which leads to confusion. Johnson-Laird (1980) uses mental models in his work on cognitive speech recognition and understanding. He refines the definition to the following: "A mental model is, in contrast to propositional representations, a dynamic, mostly analog cognitive representation of the content of an utterance. This representation uses the available knowledge about the external world of a communicator. It is the prerequisite for the inference processes necessary for verbal communication."
Mental models are an interesting concept, but they show a lack of general explication and definition.
3.2 Why Choose Qualitative Description?

We use the term Qualitative Description for our representation of image and speech understanding results. We understand it in the same way as categorical or symbolic representations are defined. We have chosen a neutral term which is not related to any established paradigm or notion. However, qualitative descriptions should not be confused with qualitative reasoning. Our image and speech understanding approach is oriented towards classical computational models. The knowledge base and our understanding processes are structured in increasing levels of abstraction. A qualitative level follows underlying numerical representations. Hence, the qualitative description is the result of the transformation of numerical features into qualitative entities. Our basic qualitative entities are properties. We use the properties type, color, size, and shape to characterize objects, and spatial relations to describe scene properties. Each property (e.g., color) is characterized by a finite number of categories (e.g., white, red, yellow, orange, blue, etc.). Fig. 3.1 shows the connection of the terms property and category for a qualitative description.
[Fig. 3.1 (content): a tree in which a qualitative description has the parts property_1 ... property_n, and each property has the parts category_1 ... category_m.]
Fig. 3.1: Definition of qualitative descriptions, our representation for speech and image understanding results: The arcs in this graph characterize ‘part-of’ relations.
Our representation is graded. This means that a fuzzy score is assigned to each qualitative description, property, and category. The fuzzy score characterizes the goodness of fit between
the qualitative description/item and numerical data. The fuzzy score of a property is a vector composed from the fuzzy scores of each category. The vector of fuzzy scores of the qualitative description contains the vectors of fuzzy scores of the properties. A formal description of qualitative descriptions is given in Section 5.1.
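As a small illustration of this graded, vectorial representation (cf. Fig. 3.1; the formal definition follows in Section 5.1), the following Python sketch holds one fuzzy score per category and groups the category vectors per property. The class names, categories, and score values are made-up examples, not the actual implementation.

    # Illustrative sketch of a graded qualitative description; example values only.
    from dataclasses import dataclass, field

    @dataclass
    class Property:
        name: str                                   # e.g. "color"
        scores: dict = field(default_factory=dict)  # category -> fuzzy score in [0, 1]

    @dataclass
    class QualitativeDescription:
        properties: list

        def vector(self):
            # The score vector of the description contains the score vectors of its properties.
            return {p.name: p.scores for p in self.properties}

    obj = QualitativeDescription([
        Property("type",  {"cube": 0.9, "3-holed bar": 0.1}),
        Property("color", {"red": 0.7, "orange": 0.4}),   # overlapping meanings are allowed
    ])
    print(obj.vector())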
3.3 Related Work

This section addresses related work in the context of image and also speech understanding systems. We designed our image understanding approach not to be limited to our scenario. However, we investigated and evaluated it only in the context of the system QUASI-ACE. Therefore, our interest in related work is focused on other image understanding systems as well as on systems which combine vision and speech or natural language understanding (speech understanding means the understanding of spoken language; the understanding of typed or written language is called natural language understanding). In this section, we refer to other systems or to understanding components in the context of complete systems. Related work on the individual components of our approach, such as 3D reconstruction, spatial relations, and object identification, is reviewed in the chapters where we describe these modules in greater detail (Sections 4.1, 5.4.1, and 6.1.3, respectively). Recently, a lot of attention has been drawn to the integration and interaction of data from different sensory channels. Systems get larger and more complex, and tasks become greater challenges. Single components, for example speech recognition modules, have become very advanced and powerful, but there is common agreement that their performance can only increase up to a certain level. Beyond this level, it is necessary to incorporate other sources of information. The combination of vision and language is one possibility. A number of efforts concentrate on this issue (e.g., Mc Kevitt, 1994a; Mc Kevitt, 1994b; Maaß & Kevitt, 1996). In the following, we describe examples of systems which deal with complex image or combined image and language understanding problems. All these systems use real data and they operate in limited but real environments. The basic principle of all systems is that conceptual or symbolic descriptions are derived from numerical features (e.g., trajectories, image regions) which are extracted from images, speech, or natural language.

VITRA, CITYTOUR, SOCCER, REPLAI, ANTLIMA

One of the largest systems or conglomerates of systems in the areas of image understanding, representation of perceptions, verbalization of plans, and referential semantics for spatial prepositions has been built at the Computer Science Department of the Universität des Saarlandes, Saarbrücken, Germany, and the DFKI, the German Research Center for Artificial Intelligence. The project VITRA (VIsual TRAnslator) (e.g., Herzog et al., 1996) started in 1985
and examines the relations between speaking and seeing. CITYTOUR (André et al., 1986) and SOCCER (e.g., André et al., 1988) are two systems constructed within VITRA that transform visual perceptions into verbal descriptions. SOCCER simultaneously analyses and describes short scenes of soccer games like live radio reports. This involves perceiving the locations and movements of the ball and players, interpreting movements with respect to conventions of soccer (especially plans and intentions), and selecting which events to utter in what sequence. The spatial relations which are considered are close to (German) prepositions, and the location of an object is described relative to other objects. 'Probability clouds' are used to give meanings to relations between objects like 'being in', 'near', and 'in front of'. The density centers of such clouds mark those positions rated as good examples for the spatial relations. The system REPLAI (Retz-Schmidt, 1988a) recognizes intentions in the domain of soccer games and verbalizes the intentions. The project ANTLIMA (A Listener Model with Mental Images) (Schirra & Stopp, 1993) is another project in the context of VITRA. It simulates a speaker anticipating the listeners' understanding by means of mental images. The main assumption is that the audience expects the speaker to mean the most typical case of the described class of events or situations with respect to the communicated context. This is modeled in ANTLIMA with 'typicality distributions' of static spatial relations which are extended to restrictions of speed and temporal duration to construct dynamic mental images of the content of sports reports. The work in VITRA tries to build a bridge between insights in psychology, linguistics, and artificial intelligence. Interesting models and representational schemes are implemented. But unfortunately an overall structure or framework is missing for the entire set of systems. No general representational schemes or paradigms are used.

XTRACK

The system XTRACK (cf. Koller, 1992; Koller et al., 1993; Kollnig & Nagel, 1993; Kollnig et al., 1994; Kollnig, 1995; Kollnig et al., 1995; Nagel, 1996) has been developed at the University of Karlsruhe, Germany. The domain of XTRACK is traffic scenes which are recorded by a stationary camera. XTRACK is an example of an image understanding system that performs fully automatically all necessary steps from low-level image analysis to conceptual descriptions of moving vehicles in traffic scenes. The system works as follows: Starting from optical flow fields, hypotheses for vehicle candidates in an image are created. By means of off-line calibration, these vehicle hypotheses can be projected back into the 3D world, which results in pose estimates to initialize a Kalman filter for tracking the moving vehicles. 2D line features are extracted in each frame by projecting a hypothesized 3D polyhedral vehicle model into the image plane. The image features are fed into a state update step using a maximum a posteriori (MAP) update. Kalman filter prediction is performed by using a motion model. A set of 67 German motion verbs is associated incrementally with the extracted vehicle trajectory segments. Fuzzy sets are used to represent the connections between trajectory
attributes and motion verbs (Kollnig & Nagel, 1993). The admissible sequences of activities of an agent are modeled by hierarchical 'situation graphs' (Sowa, 1984). The representations for state and action of a situation node may be specified by a sub-situation graph. Hence, situations are modeled by specialization hierarchies.

NAOS

The system NAOS (Neumann & Novak, 1986; Neumann & Mohnhaupt, 1988) has been developed at the University of Hamburg, Germany. It creates retrospective natural language descriptions of object movements in a traffic scene (Neumann & Novak, 1986). It also uses natural language utterances for a top-down control in traffic scene analysis (Neumann & Mohnhaupt, 1988). The recognition of activities in traffic scenes is based on a hierarchy of temporal logic predicates. These predicates model events and actions. Complex events or actions are defined by simpler or elementary events and actions. The elementary events and actions are computed from image features. So far, most of this system has been tested only on synthetic data.

PLAYBOT

The PLAYBOT project (Tsotsos et al., 1997; Tsotsos & 16 others, 1995) is a long-term, large-scale research program at the University of Toronto. The goal is to provide a directable robot which may enable physically disabled children to access and manipulate toys. The robot possesses a robotic arm and hand, a stereo color vision robot head, and a communication panel. Vision is the primary sensor and the tasks are vision-directed. This requires visual attention, gaze stabilization, object recognition, object tracking, and event perception. Visual attention is based on selective tuning (Tsotsos & 16 others, 1995). The gaze stabilization is necessary to facilitate the tracking task and is accomplished using mixture models of motion. A point or object must be identified as the one to be stabilized. Object recognition (Dickinson & Metaxas, 1994) is performed by fitting deformable models to image contours. The objects are modeled as object-centered constructions of volumetric parts chosen from some arbitrary, finite set. Aspects are used to represent the volumetric parts. To find instances of a target object in the image, a Bayesian approach is employed which exploits the probabilities in the aspect hierarchy. Object tracking (Verghese & Tsotsos, 1994) is done by perspective alignment, which is an analytic method for real-time monocular model-based tracking. The event perception (Richards et al., 1996) is based on an ontology suitable for describing object properties and the generation and transfer of forces in the scene. A computational procedure tests the feasibility of event interpretations. Multiple interpretations are then scored and ordered. The disabled child communicates with the system via the communication panel. It displays actions, objects, and locations of the objects. The command language consists of verbs, nouns, and spatial prepositions. The child selects the 'words' by pointing at the communication panel. Actions are given by verbs. These are: pickup, drop, push, bring, inspect, separate,
place, locate, assemble, disassemble, use, and move-me-to. They are depicted using animated buttons. The 'nouns' of the command language refer to the objects in the scene. They are depicted using actual images. Objects may be placed in specific spatial relationships with other objects, such as on top of, to the left of, or to the right of some other object. Buttons for these three spatial relations show blocks and arrows indicating the relation. If the child sets up a command, it is carried out based on the captured visual information. The scenarios of PLAYBOT and QUASI-ACE are similar. However, PLAYBOT is focused on the visual perception of the environment. The command language is rather simple, and therefore less attention is paid to the representation of understanding results and to inference processes in order to identify the intended object or action. PLAYBOT has to be configured for the room it operates in, and a requirement is that the room is painted in bright colors with strong color patterns which can be used for calibration.

PICTION

The system PICTION (PICture and capTION) (Srihari, 1994; Srihari & Burhans, 1994; Chopra & Srihari, 1995) has been developed at the State University of New York at Buffalo. PICTION uses captions to identify human faces in an accompanying photograph and explores the interaction of textual and photographic information in image understanding. The final understanding of the picture and caption reflects a consolidation of the information obtained from each of the two sources. The problem of building a general-purpose computer vision system without a priori knowledge is very difficult. The concept of using collateral information in scene understanding is exploited here by incorporating picture-specific information. A face locator is used to segment face candidates from an image at different resolutions of the original image and the edge image. Constraints for the face recognition are then generated from the semantic processing of the caption. Picture-specific information is extracted to generate contextual (e.g., the name), characteristic (e.g., the gender), and locative or spatial identifying constraints. The constraints guide the processing of the picture to provide a semantic interpretation which is stored in the knowledge-base for future access.

VIEWS

The VIEWS project at Queen Mary and Westfield College and at the University of Sussex, England, studies advanced visual surveillance for situations where the scene structure, objects, and much of the expected behavior is known (Toal & Buxton, 1992; Howarth & Buxton, 1992; Howarth, 1995). The aim is to demonstrate the feasibility of knowledge-based computer vision for real-time surveillance of well-structured, dynamic outdoor scenes. The scenario is traffic scenes, and the concrete example is a traffic scene at a roundabout. The system consists of three components: the perception component (for recognizing vehicles and estimating tra-
jectories), the situation assessment component (for understanding the situation over time), and the control component. The roundabout which is observed in an image sequence is projected to a horizontal plane and is partitioned into regions for an analog spatial representation in the situation assessment component. Static knowledge is attached to these regions. The regions are classified according to their significance for actions in the scenario (e.g., turning lane), and they also represent dynamic vehicle histories. Calculations of actions, for example 'following' or 'passing', are based on the analog representation. The main contribution of this work is the approach for spatio-temporal reasoning to analyze occlusion behavior. Temporarily occluded vehicles are correctly relabeled after re-emerging rather than being treated as completely independent vehicles.

EVAR and others

The semantic network formalism ERNEST (Niemann et al., 1990; see also Section 2.2) has been developed by research groups at the Universities of Erlangen and Bielefeld, Germany. ERNEST is a very useful framework for various knowledge-based signal understanding tasks. Two major applications have been developed in ERNEST: an image and a speech understanding system. The first application is a knowledge-based system for the automatic interpretation of scintigraphic image sequences of the heart (Sagerer, 1988; Sagerer et al., 1990). The goal is to extract cardiac actions of the left ventricle with a retrospective interpretation of image sequences. The knowledge is structured hierarchically in the semantic network according to taxonomic and temporal decompositions. Changes of the ventricle surface in subsequent image frames form elementary motion cues which can be composed, according to fuzzy membership functions, into longer motion phases such as 'extraction', 'stagnation', and 'contraction'. Sequences of the motion phases are then described by a formal language, which can be seen as a finite state automaton. This mechanism is used to create concurrent hypotheses of the state in the periodic heart cycle, such as 'systole' or 'diastole' and their substates. The best-scored hypotheses are used together with shape characteristics and other cardiac features to build the final scored diagnostic interpretation of an image sequence. The second example is the speech understanding system EVAR (e.g., Mast et al., 1992; Kummert et al., 1993) which is able to answer train schedule inquiries. A speech recognition module performs the recognition of constituents (e.g., nominal phrase, verbal phrase) in an utterance based on Hidden Markov Models (HMM). The constituent recognition is triggered by the modeled linguistic knowledge. Predictive language models (Fink et al., 1992) are generated incrementally according to the current stage of interpretation. The structuring of the linguistic knowledge base in different levels of abstraction is oriented towards Winograd's (1983) speech-understanding model and Fillmore's (1968) deep case theory. The level of pragmatic knowledge represents task-specific concepts. A data-base query is constructed from instances in the pragmatic level specifying, e.g., departure and arrival times and locations. The user and the system follow a task-oriented dialog until all necessary information for a data-base query is available to EVAR.
Chapter 4
Model-based 3D Reconstruction
Images capture two-dimensional projections of light intensities emitted or reflected by objects in a three-dimensional scene. The projection from the three-dimensional world to two-dimensional images inherently leads to a loss of depth information. The missing depth information in images causes a number of difficulties in using them as a visual representation of the world:
– Shadows and occlusions change the appearance of objects. The separation of shadows and occlusions is not possible without further assumptions.
– The relative size and location of the objects in the scene is not explicitly represented. A metric size and distance is not available.
– It is not possible to see “behind” the objects.
3D reconstruction denotes the process of reconstructing the missing depth information and hence the three-dimensional structure of the scene/world. There are a number of well defined approaches to accomplish this task: stereo vision, laser range sensing, structure from motion, shape from shading (Faugeras, 1993; Jain & Jain, 1990; Koenderink & van Doorn, 1991; Oliensis & Dupuis, 1993; Horn & Brooks, 1989). These so called data-driven techniques rely on the image data only. Model-based approaches, however, apply additional knowledge about objects or object properties. This includes the possibility to even (partially) reconstruct the scene “behind” the objects and to get a full 3D representation of the modeled entities in the scene. A full 3D representation offers the possibility to simulate or to infer multiple views, e.g., the view of the instructor and the view of the system, in our assembly scenario. We developed a model-based approach for camera calibration and metric 3D reconstruction of known objects (Merz, 1995; Socher et al., 1995a; Socher et al., 1995b). We can reconstruct several, possibly small, and nearly planar objects using one or more images from uncalibrated
cameras. The method has been designed to use an arbitrary number of images; however, so far it has been tested only for one and two images (stereo images). For the sake of simplicity, we describe here our approach for stereo images. If only one image is used, it is modeled as the image from the left camera, and in the case of more than two images, all images other than the first one are handled as images from the right camera. In this approach, model-based 3D reconstruction and camera calibration is accomplished by fitting projections of three-dimensional object models to two-dimensional image features. The models are fitted by minimizing a cost function which measures all differences between projected model features and detected image features as a function of the objects' pose (i.e., their position and orientation) and the camera parameters. We use simple geometric models representing objects by sets of vertices, edges, and circles. We also need correspondences between model and image features, and – if using multiple images – stereo correspondences between objects in these images. Camera calibration denotes the process of estimating the so-called camera parameters which describe the projection characteristics of a camera. These are optical characteristics (internal camera parameters) such as the focal length, the principal point (the intersection of the optical axis with the image plane), and the scaling between image and camera coordinate frame. The external camera parameters describe, in addition, the position and orientation of a camera with respect to a scene or world coordinate frame. Camera calibration is an essential prerequisite for metric 3D reconstruction. It is necessary to know the projection characteristics in order to be able to exploit the geometric relations of features in images according to the chosen camera model. Most approaches require camera calibration prior to other computations. We reconstruct and calibrate in one step, and therefore, we do not require calibrated cameras. The calibration and reconstruction is obtained by direct observation of the objects in the scene. Preceding camera calibration is not necessary due to the consistent use of domain knowledge such as object models. Therefore, the problems of separate calibration (calibration errors, sensitivity to noise, mechanical or thermal influences resulting from different times of exposure) are avoided. Furthermore, our method does not require a long setup time and parameter adjustment. Therefore, the method lends itself to on-line calibration of an active vision system. The abilities of humans to adjust themselves to known situations are simulated in this way. This chapter is organized as follows: in the next section, we present related work. Then we introduce our object models (Section 4.2) and the way we compute the perspective projection of the model primitives (Section 4.3). Our stereo matching technique is briefly outlined in Section 4.4. Then, we explain our actual algorithm for 3D reconstruction (Section 4.5). Results of various experiments and a discussion of our approach conclude this chapter.
R ELATED W ORK
37
4.1 Related Work Considering related work, we refer only to approaches that estimate a full 3D reconstruction without additional constraints, like e.g., planarity of the objects. Furthermore, we are interested in methods that do not require prior camera calibration. Our approach was inspired by model-based methods and approaches applying projective geometry. Model-based approaches as suggested by Lowe (1991) and Goldberg (1993) apply three-dimensional models of single objects to features in one image to estimate the object’s pose relative to the camera. They use special simplified partial derivatives which make it difficult to extend their approach for more objects or for additional images. Preceding camera calibration is necessary if not enough significant model features are detected or to stabilize the solution. The approaches of Faugeras (1992), Hartley et al. (1992), Mohr et al. (1993), and Boufama et al. (1993) use images from uncalibrated cameras and estimate a projective reconstruction only from known point correspondences. A projective reconstruction is a non-metric reconstruction which is defined up to a projective transformation in 3D (a collineation). Additional metric information is incorporated in a second step to derive a reconstruction in Euclidean space. Inaccuracies due to noise or false matches introduced in the first step are difficult to correct in the second step, where the additional knowledge from the 3D scene is taken into account. Crowley et al. (1993) achieve robust results using a known object in the scene for camera calibration. They directly compute a metric reconstruction through linearizing the projection equations by estimating the nine entries of the rotation matrix independently. They apply additional constraints to the estimation in order to form an orthogonal rotation matrix from these nine values. Their approach works for isolated objects and it is intended to show how to calibrate with a minimum number of image points. Dhome et al. (1989) present an analytical solution for the reconstruction of one polyhedral object from monocular views. The principle of their method is based on the interpretation of a triplet of any image lines as the perspective projection of a triplet of linear ridges of an object model. They then solve for the pose parameters of the object by fitting the model ridges to the image lines. They provide an analytic solution after having reduced the number of possible solutions by special rules. The approaches mentioned so far reconstruct polyhedral objects. Only few approaches found in literature try to deal with curved features such as circles or conics. The use of circles or conics is more difficult for the following two reasons: (a) it is not possible to reconstruct circles or conics from monocular views without additional constraints (Ma, 1993), and (b) the perspective projection of circles or conics is not as straightforward as the perspective projection of points or lines. The Baufix objects (see Appendix B) have a lot of holes and therefore many conic features. Especially, the cylindric objects show only conic features. Thus, a re-
38
4
M ODEL - BASED 3D R ECONSTRUCTION
construction which can deal with curved features is important for our domain. We have found the following approaches in literature: Ma (1993) and Safaee-Rad et al. (1992) describe very well their insight in projection and reconstruction of quadric-curved features. The projection of any type of conic is formulated as the intersection of the retinal plane and a cone through the 3D conic and the optical center. Ma (1993) provides closed form solutions for a quick and accurate reconstruction using this projection model. He gives a detailed analysis of different cases from single conics up to two planar conics. Additional remarks and corrections of Ma’s method are found in (Ding & Wahl, 1993; Ding & Wahl, 1995). Safaee-Rad et al. (1992) use the same projection method and decompose the reconstruction problem in an orientation estimation and a position estimation problem. As the reconstruction of conics suffers from the problem that the reconstruction is not uniquely possible from monocular images, various additional constraints have been suggested: Ma (1993) uses stereo images. Dhome et al. (1990) use the 3D coordinates of the center of a conic as additional constraint, whereas Han & Rhee (1992) require an additional point marked on the conic which is reconstructed. Xie (1994) uses a planarity constraint. All conic approaches require prior camera calibration. However, Han & Rhee (1992) are able to estimate at least the internal camera parameters. The approaches mentioned above are all designed for the reconstruction of single or coplanar conics either from one or two images. They are difficult to extend for a combination with other types of features or for multiple objects. Gengenbach (1994) formulates the projection of conics in a different way. He provides a closed form for parallel projection. Because of a closed form projection, it can be applied together with the projection of other features such as vertices and line segments with the slight drawback that parallel and perspective projection are then mixed. We use uncalibrated images and are therefore also interested in camera calibration techniques. The minimal number of features for the full estimation of the pose from one image is four, if the features are planar (Fischler & Bolles, 1981). Tsai (1985) shows that full camera calibration is possible with five coplanar reference points. Six non coplanar points determine a unique solution as well (see Yuan, 1989). The well known calibration technique of Tsai (1985) and Lenz (1987) leads to a very accurate calibration due to the possibility of using a specially designed calibration pattern with a large number of features. The camera is calibrated through a two step closed form solution. The method of Weng et al. (1992) is another well known calibration technique. Calibration is here accomplished through a minimization approach. The influence of noise and lens distortion is very well studied by the authors. We do not account for lens distortion in our method. Some authors (e.g., Faugeras, 1995; Mohr et al., 1993) claim that these errors are negligible. Nevertheless, it is possible to model the estimation of lens distortion in a manner similar to that of Li (1994).
4.2
39
O BJECT M ODELS
4.2 Object Models

We use a simple boundary representation to model the Baufix objects geometrically in 3D (Faugeras & Hebert, 1986; Foley et al., 1990). Each object is described as a set of vertices, edges, faces, and cylinders. Vertices are the extremities of edges, and edges define faces. The direction of the surface normal of a face is defined by the counter-clockwise order of its vertices and edges. The objects are modeled in such a way that all surface normals point outside the object. Most of the Baufix objects have holes. These are either threads or bore holes. Both of them are modeled as holes, and the additional geometric primitives 'holed-face' and 'holed-cylinder' are used instead of face and cylinder. The number of occurrences of either primitive in a model description can be zero. Most of the Baufix objects have rounded corners. However,
    3-holed bar {
      1 color { wooden }
      8 vertices {
        0 (0,0,0)   1 (92,0,0)   2 (0,25,0)   3 (92,25,0)
        4 (0,0,4)   5 (92,0,4)   6 (0,25,4)   7 (92,25,4)
      }
      12 edges {
        (0,1) (0,2) (2,3) (1,3)
        (4,5) (4,6) (6,7) (5,7)
        (0,4) (1,5) (2,6) (3,7)
      }
      4 faces {
        4 vertices {0 4 6 2}
        4 vertices {2 6 7 3}
        4 vertices {3 7 5 1}
        4 vertices {1 5 4 0}
      }
      2 holed-faces {
        4 vertices {0 2 3 1}
        3 holes { (15,12.5,0) 7.5  (46,12.5,0) 7.5  (77,12.5,0) 7.5 }
        4 vertices {4 5 7 6}
        3 holes { (15,12.5,4) 7.5  (46,12.5,4) 7.5  (77,12.5,4) 7.5 }
      }
      0 cylinders {}
      center of mass { (46,12.5,2) }
      bounding box { (0,0,0) (92,0,0) (0,25,0) (0,0,4) }
    }
Fig. 4.1: Geometric model of the object ‘3-holed bar’ (Dreilochleiste).
none of these are modeled, as they are not used for 3D reconstruction. Furthermore, an object-centered coordinate frame is associated with each object. The scaling of the coordinate frame determines the base metric unit for 3D reconstruction. We use millimeters as a metric unit. Fig. 4.1 shows the object model of the object '3-holed bar'. The object models for all other objects of our domain are explained in Appendix B. The object models are represented in 3D and are independent of any viewpoint. The 3D structure of a scene can be inferred with the help of these models even when objects are partially occluded in the images. The rotation and translation of the models is computed easily by a rotation and translation of the object-centered coordinate frame.
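The following Python sketch mirrors the boundary representation of Fig. 4.1 as a plain data structure. The class and field names are hypothetical illustrations, not the actual model-file parser; the numeric values are taken from Fig. 4.1.

    # Hypothetical Python mirror of the boundary representation in Fig. 4.1.
    from dataclasses import dataclass, field

    @dataclass
    class HoledFace:
        vertex_ids: list     # indices into ObjectModel.vertices (counter-clockwise)
        holes: list          # list of ((x, y, z), radius) tuples

    @dataclass
    class ObjectModel:
        name: str
        color: str
        vertices: list       # (x, y, z) in millimeters, object-centered frame
        edges: list          # pairs of vertex indices
        faces: list = field(default_factory=list)        # each a list of vertex indices
        holed_faces: list = field(default_factory=list)  # HoledFace instances
        center_of_mass: tuple = (0.0, 0.0, 0.0)

    bar3 = ObjectModel(
        name="3-holed bar", color="wooden",
        vertices=[(0, 0, 0), (92, 0, 0), (0, 25, 0), (92, 25, 0),
                  (0, 0, 4), (92, 0, 4), (0, 25, 4), (92, 25, 4)],
        edges=[(0, 1), (0, 2), (2, 3), (1, 3), (4, 5), (4, 6),
               (6, 7), (5, 7), (0, 4), (1, 5), (2, 6), (3, 7)],
        faces=[[0, 4, 6, 2], [2, 6, 7, 3], [3, 7, 5, 1], [1, 5, 4, 0]],
        holed_faces=[
            HoledFace([0, 2, 3, 1], [((15, 12.5, 0), 7.5), ((46, 12.5, 0), 7.5), ((77, 12.5, 0), 7.5)]),
            HoledFace([4, 5, 7, 6], [((15, 12.5, 4), 7.5), ((46, 12.5, 4), 7.5), ((77, 12.5, 4), 7.5)]),
        ],
        center_of_mass=(46, 12.5, 2),
    )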
4.3 Model Primitives and their Perspective Projection

In order to fit object models to image features, we formulate the perspective projection of object models to one or more images explicitly according to the pin-hole camera model. We start this section with a brief overview of the camera model. Then, we describe the projection of model points, line segments, and ellipses. We use the following notation (see also Appendix A) in this chapter:
Vectors (i.e. points or feature vectors) are written as small bold type characters. Indices indicate the current coordinate frame. These are:
– $b_l$, $b_r$: left or right image (2D); $b$: image (2D), if left or right is not specified,
– $l$, $r$: left or right camera coordinate frame (3D),
– $o$: object-centered coordinate frame (3D).
Indices of vectors are also counters, e.g., $\mathbf{x}_{o_i}$ represents feature $i$ of an object in the object-centered coordinate frame $o$.
Matrices are denoted by capital letters. Transformation matrices are denoted by calligraphic letters, and their indices indicate the destination and source coordinate frames. The function $\varphi$ transforms Euclidean to projective coordinates, and $\varphi^{-1}$ transforms projective to Euclidean coordinates (see Appendix C).
4.3.1 Camera Model

The pin-hole camera is the simplest possible camera model (e.g., Jähne, 1991). It is a simple but sufficiently precise approximation of an optical camera. The imaging element of a pin-hole camera is an infinitesimally small hole. All incoming light rays intersect at this infinitesimally small hole, the optical center. The light rays form an inverted image on a second screen, the retinal plane. A geometric camera model can be built directly from the pin-hole camera. We use a model where the origin of the camera coordinate frame is placed in the optical center and the $z$-axis is pointing towards the retinal plane. Here the retinal plane is modeled in front of the optical center so that the image is not inverted. Fig. 4.2 illustrates two pin-hole cameras and their geometric relation with respect to an object $o$ in the scene. The retinal plane is orthogonal to the $z$-axis of the camera coordinate frame at distance $f$, the focal length. The $z$-axis intersects the retinal plane at the principal point $(C_x, C_y)^T$, and the projection of a point $(x_o, y_o, z_o)^T$ in
Fig. 4.2: The geometry of the perspective projection of an object to two images is, according to the pin-hole camera model, equivalent to the transformation from an object-centered coordinate frame $o$ to the two camera coordinate frames $l, r$ and the subsequent projection to the respective images $b_i$, $i \in \{l, r\}$. The transformation to the two camera coordinate frames can be formulated as the homogeneous transformation from $o$ to $l$ ($\mathcal{T}_{lo}$) concatenated with the homogeneous transformation from $l$ to $r$ ($\mathcal{T}_{rl}$).
3D to the retinal plane is given by
\[ x = f\,\frac{x_o}{z_o}\,, \qquad y = f\,\frac{y_o}{z_o}\,. \qquad (4.1) \]
The image is a part of the retinal plane and centered at the principal point. We use an image coordinate frame $b$ that has its origin at the lower left corner of the image (a lower left image coordinate frame requires the reflection of the $y$ coordinates when changing from retinal plane to image coordinates; this is done by using a negative focal length in the equation for $y_b$) and pixel scales $s_x$ and $s_y$ which express the pixel size compared to the camera coordinate frame in $x$ and $y$ direction,
The projection of the point (x_o, y_o, z_o)^T to image coordinates is therefore
\[ x_b = \frac{x}{s_x} + C_x = \frac{f}{s_x}\,\frac{x_o}{z_o} + C_x, \qquad y_b = \frac{-1}{s_y}\,y + C_y = \frac{-f}{s_y}\,\frac{y_o}{z_o} + C_y. \tag{4.2} \]
Equations 4.2 can be expressed in homogeneous coordinates (see Appendix C):
\[ \lambda \begin{pmatrix} \mathbf{x}_b \\ 1 \end{pmatrix} = P\,\xi(\mathbf{x}_o) = \mathcal{T}_{b\{l,r\}}\,\xi(\mathbf{x}_o) = \begin{pmatrix} f/s_x & s_{xy} & C_x & 0 \\ 0 & -f/s_y & C_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \xi(\mathbf{x}_o). \tag{4.3} \]
Vectors in projective space are only defined up to a scalar λ. The parameters f, s_x, s_y, s_xy, C_x, and C_y are the internal camera parameters. The parameter s_xy accounts for non-orthogonality of the pixel grid and is regarded here as negligible and set to zero (Faugeras, 1995). The function ξ transforms the vector x_o from Euclidean to projective coordinates.
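As an illustration of Eqs. 4.2 and 4.3, the following minimal sketch projects a point given in camera coordinates to pixel coordinates; the parameter values for f, s_x, s_y, C_x, and C_y are arbitrary example values, not the calibrated values used in this work:

    import numpy as np

    def project_to_image(p_cam, f=0.025, sx=1.0e-5, sy=1.0e-5, Cx=256.0, Cy=256.0):
        """Project a 3D point given in camera coordinates to pixel coordinates
        according to Eq. 4.2 (pin-hole model, lower-left image origin)."""
        xo, yo, zo = p_cam
        xb = (f / sx) * (xo / zo) + Cx      # x / s_x + C_x
        yb = (-f / sy) * (yo / zo) + Cy     # negative focal length for the y axis
        return np.array([xb, yb])

    # Example: a point 2 m in front of the camera, slightly off the optical axis.
    print(project_to_image(np.array([0.10, 0.05, 2.0])))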
The external camera parameters describe the rotation and translation of the camera coordinate frame with respect to a world or scene coordinate frame. We parameterize a rotation as the concatenation of rotations about the x-, y-, and z-axis, respectively, and estimate the three rotation angles (called yaw, pitch, and roll; we denote them ω_x, ω_y, and ω_z) as rotational parameters. This leads to projection functions with rather complex partial derivatives due to the multiple use of trigonometric functions in the rotational terms. However, the advantage is that only a minimum number of parameters needs to be estimated, and objects can be reconstructed if at least four model points are visible. In this work, we use the coordinate frame of the left camera l as the scene coordinate frame. Therefore, only the external parameters of cameras other than the left camera, e.g., the right camera r for stereo images, have to be estimated. We denote the transformation from a coordinate frame q into a coordinate frame p in homogeneous coordinates as
\[ \mathcal{T}_{pq} = \begin{pmatrix} R_{pq} & \mathbf{t}_{pq} \\ 0\;0\;0 & 1 \end{pmatrix} = \begin{pmatrix}
\cos\omega_y \cos\omega_z & \cos\omega_y \sin\omega_z & -\sin\omega_y & t_x \\
\sin\omega_x \sin\omega_y \cos\omega_z - \cos\omega_x \sin\omega_z & \sin\omega_x \sin\omega_y \sin\omega_z + \cos\omega_x \cos\omega_z & \sin\omega_x \cos\omega_y & t_y \\
\cos\omega_x \sin\omega_y \cos\omega_z + \sin\omega_x \sin\omega_z & \cos\omega_x \sin\omega_y \sin\omega_z - \sin\omega_x \cos\omega_z & \cos\omega_x \cos\omega_y & t_z \\
0 & 0 & 0 & 1 \end{pmatrix}, \tag{4.4} \]

where R_pq is the rotation matrix and t_pq is the translation vector (see Appendix C).
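A small sketch that assembles the homogeneous transformation of Eq. 4.4 from the three rotation angles and the translation; the function name and its usage are illustrative only:

    import numpy as np

    def homogeneous_transform(wx, wy, wz, tx, ty, tz):
        """Build T_pq of Eq. 4.4 from yaw, pitch, roll (w_x, w_y, w_z) and the
        translation (t_x, t_y, t_z)."""
        cx, sx = np.cos(wx), np.sin(wx)
        cy, sy = np.cos(wy), np.sin(wy)
        cz, sz = np.cos(wz), np.sin(wz)
        R = np.array([
            [cy * cz,                cy * sz,               -sy     ],
            [sx * sy * cz - cx * sz, sx * sy * sz + cx * cz, sx * cy],
            [cx * sy * cz + sx * sz, cx * sy * sz - sx * cz, cx * cy],
        ])
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = [tx, ty, tz]
        return T

    # Zero angles and zero translation yield the identity transformation.
    assert np.allclose(homogeneous_transform(0, 0, 0, 0, 0, 0), np.eye(4))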
4.3.2 Projection of Model Points
The projection of a model point x_o is the transformation of the point from model coordinates o to the camera coordinate frame l and the subsequent projection to the image b_l. This can be expressed in homogeneous coordinates as
\[ \mathbf{x}_{b_l} = P^p_{b_l o}(\mathbf{x}_o) = \xi^{-1}\!\left(\mathcal{T}_{b_l l}\,\mathcal{T}_{lo}\,\xi(\mathbf{x}_o)\right). \tag{4.5} \]
ξ denotes the transformation from affine to homogeneous coordinates (cf. Appendix C). The projection of the model point to a second image b_r needs one additional transformation T_rl from the reference coordinate system, which we place in the left camera coordinate frame l, to the second or right camera coordinate frame r,
\[ \mathbf{x}_{b_r} = P^p_{b_r o}(\mathbf{x}_o) = \xi^{-1}\!\left(\mathcal{T}_{b_r r}\,\mathcal{T}_{rl}\,\mathcal{T}_{lo}\,\xi(\mathbf{x}_o)\right). \tag{4.6} \]
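A sketch of the projection chain of Eqs. 4.5 and 4.6, assuming T_lo and T_rl are 4×4 homogeneous transformations (Eq. 4.4) and T_bl, T_br are the 3×4 camera matrices of Eq. 4.3; the function names (xi, xi_inv, project_model_point) are my own illustrative choices:

    import numpy as np

    def xi(p):
        """Euclidean -> homogeneous coordinates."""
        return np.append(p, 1.0)

    def xi_inv(p):
        """Homogeneous -> Euclidean coordinates."""
        return p[:-1] / p[-1]

    def project_model_point(x_o, T_lo, T_bl, T_rl=None, T_br=None):
        """Project a model point x_o to the left image (Eq. 4.5) and, if the
        right-camera transforms are given, to the right image (Eq. 4.6)."""
        x_bl = xi_inv(T_bl @ T_lo @ xi(x_o))
        if T_rl is None or T_br is None:
            return x_bl
        x_br = xi_inv(T_br @ T_rl @ T_lo @ xi(x_o))
        return x_bl, x_br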
4.3.3 Projection of Model Line Segments

We represent line segments in 3D by pairs of two model points. As the perspective projection preserves straight lines, a line segment is projected by simply projecting a pair of points. This gives us two image points x_1 and x_2. But it is well known (Deriche & Faugeras, 1990; Koller, 1992; Faugeras, 1993) that the vertex representation of line segments is not very well suited as an image feature representation in the context of matching models to image features: the endpoints of a line segment are extremely sensitive to noise. Two alternative representations are found in the literature:
• The MDL- or (c, d, θ, l)-representation (Crowley & Stelmaszyk, 1990) characterizes a line segment by the distance c of the origin to the line segment, the distance d from the perpendicular intercept of the origin to the midpoint of the segment, the orientation θ of the line segment, and its length l. Figure 4.3 illustrates this representation.
• The midpoint representation (Deriche & Faugeras, 1990) characterizes a line segment by its midpoint m = (m_x, m_y)^T, orientation θ, and length l (cf. Fig. 4.3).
We use the Mahalanobis distance (Duda & Hart, 1972) to measure the distance of projected model line segments to detected image line segments. The Mahalanobis distance is a statistical distance and requires covariance matrices Σ_o and Σ_b of the model (x_o) and the image (x_b) line segments, respectively:

\[ d^2(\mathbf{x}_b, \mathbf{x}_o) = (\mathbf{x}_b - \mathbf{x}_o)^T\,(\Sigma_b + \Sigma_o)^{-1}\,(\mathbf{x}_b - \mathbf{x}_o). \tag{4.7} \]
Fig. 4.3: The MDL parametric representation and the midpoint representation for 2D line segments (Koller, 1992).
Deriche & Faugeras (1990) show that the midpoint representation is more appropriate in this case. The MDL-representation leads to a covariance matrix Σ_s (we use the index s to refer to line segments) that depends strongly on the position of the associated line segment in the image through the parameters c and d. Therefore, two given segments with the same length and orientation will have a different uncertainty in the parameters c and d, depending on their position in the image. This is not the case for the midpoint representation, since the uncertainty associated with the midpoint m depends only on the uncertainty of the endpoints.
For this reason, we decided to use the midpoint representation, which is computed as a function of the two endpoints x_1 and x_2 as

\[ m_x = \frac{x_1 + x_2}{2}, \quad m_y = \frac{y_1 + y_2}{2}, \quad \theta = \arctan\frac{y_2 - y_1}{x_2 - x_1}, \quad l = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}. \tag{4.8} \]
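A short sketch of this conversion from the endpoint to the midpoint representation; arctan2 is used instead of the arctan of Eq. 4.8 so that vertical segments are handled as well:

    import numpy as np

    def midpoint_representation(x1, y1, x2, y2):
        """Midpoint representation (m_x, m_y, theta, l) of a 2D line segment
        given by its endpoints (Eq. 4.8)."""
        mx = 0.5 * (x1 + x2)
        my = 0.5 * (y1 + y2)
        theta = np.arctan2(y2 - y1, x2 - x1)
        length = np.hypot(x2 - x1, y2 - y1)
        return mx, my, theta, length

    print(midpoint_representation(0.0, 0.0, 4.0, 3.0))   # (2.0, 1.5, 0.6435..., 5.0)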
The covariance matrix for this representation is given in Appendix F (Eq. F.10). The projections of a model line segment to the left and right image are denoted as
\[ \mathbf{x}_{b_l} = P^s_{b_l o}(\mathbf{x}_o) = \sigma_b\!\left(\mathcal{T}_{b_l l}\,\mathcal{T}_{lo}\,\sigma_o(\mathbf{x}_o)\right), \qquad \mathbf{x}_{b_r} = P^s_{b_r o}(\mathbf{x}_o) = \sigma_b\!\left(\mathcal{T}_{b_r r}\,\mathcal{T}_{rl}\,\mathcal{T}_{lo}\,\sigma_o(\mathbf{x}_o)\right). \tag{4.9} \]
x_o is the 3D line representation in object-centered coordinates, and x_bi denotes the projected line segment in midpoint representation in the respective image. The function σ_o transforms the 3D line representation (two points x_os1, x_os2 in 3D) to homogeneous coordinates

\[ \sigma_o(\mathbf{x}_o) = \begin{pmatrix} \mathbf{x}_{os1} & \mathbf{x}_{os2} \\ 1 & 1 \end{pmatrix}. \tag{4.10} \]
The two endpoints are thus projected together by writing them as a 4 × 2 matrix. The function σ_b transforms the 3 × 2 matrix of the projected endpoints to the midpoint representation of line segments as given in Eq. 4.8.
4.3.4 Projection of Model Ellipses

The perspective projection of an ellipse, which is a planar figure, can be understood as a collineation in the projective plane IP². We have to introduce some basics of projective geometry in order to understand how we can exploit projective geometry for a correct, closed-form perspective projection of ellipses. More details about projective geometry are given in Appendices C and D. A collineation is a linear projective transformation. The cross ratio is invariant under every collineation (see Semple & Kneebone, 1952). The cross ratio is the basic invariant in projective geometry, and all other projective invariants can be derived from it (Mohr, 1993). Let A, B, C, and D be four points in a projective plane, no three of them being collinear, and P the center of a pencil of lines passing through these four points. The cross ratio k is then given as
\[ k = [PA, PB; PC, PD] = \frac{\overline{A'C'}\;\overline{B'D'}}{\overline{A'D'}\;\overline{B'C'}}, \tag{4.11} \]
with A', B', C', D' being the intersections of this pencil with some line not passing through P. The notation A'C' stands for the length of the line segment from A' to C'.

Fig. 4.4: Cross ratio of a pencil of lines on a conic.
The theorem of Chasles states that the centers P of all pencils through A, B, C, and D with the same cross ratio k lie on a conic through A, B, C, and D (see Semple & Kneebone, 1952; Mohr, 1993, and Fig. 4.4). A conic is thus uniquely defined by four points and a cross ratio. This means that the perspective projection of any conic can be formulated as the perspective
projection of four points and the cross ratio of these four points on that conic. Fig. 4.5 depicts, as an example, an ellipse and its projection using four points and the corresponding cross ratio. The quadratic form of a conic is

\[ a x^2 + 2bxy + cy^2 + 2dx + 2ey + f = 0. \tag{4.12} \]

The coefficients a, b, c, d, e, and f are determined using

\[ k\,L_{AB}\,L_{CD} + (1 - k)\,L_{AC}\,L_{BD} = 0 \tag{4.13} \]

with

\[ \begin{aligned} L_{AB} &= (x_A - x)(y_A - y_B) - (y_A - y)(x_A - x_B), \\ L_{AC} &= (x_A - x)(y_A - y_C) - (y_A - y)(x_A - x_C), \\ L_{BD} &= (x_B - x)(y_B - y_D) - (y_B - y)(x_B - x_D), \\ L_{CD} &= (x_C - x)(y_C - y_D) - (y_C - y)(x_C - x_D) \end{aligned} \tag{4.14} \]
(ref. Mohr, 1993). Eqs. 4.13 and 4.14 can be used to compute the quadratic form of the projection of the conic by inserting the projections of the points A, B, C, D as x_A, ..., y_D, where k is the corresponding cross ratio. We now consider only ellipses in order to compute k, because we find in our object models only circles, which are ellipses with equal radii (l_1 = l_2 = r). If we take the curvature extrema of an ellipse, which are the intersections of the principal axes with the ellipse, as the four points, then the cross ratio is always 2 (see Appendix D). This is due to the fact that the four curvature extrema are more than sufficient to describe an ellipse uniquely. Thus, the quadratic form of a projected model ellipse is easily computed by inserting into Eq. 4.13 four projected points on the ellipse and the corresponding cross ratio, which is 2.
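The following sketch computes the quadratic form of the conic through four projected points with a given cross ratio by expanding Eqs. 4.13 and 4.14; it is an illustration only, not the implementation used in this work:

    import numpy as np

    def line_form(P, Q):
        """Coefficients (alpha, beta, gamma) of L_PQ(x, y) = alpha*x + beta*y + gamma,
        obtained by expanding Eq. 4.14."""
        (xP, yP), (xQ, yQ) = P, Q
        return np.array([yQ - yP, xP - xQ, xQ * yP - xP * yQ])

    def conic_from_points(A, B, C, D, k=2.0):
        """Quadratic form (a, b, c, d, e, f) of Eq. 4.12 for the conic through
        A, B, C, D with cross ratio k (Eq. 4.13); k = 2 for the curvature extrema."""
        def product(l1, l2):
            a1, b1, c1 = l1
            a2, b2, c2 = l2
            # coefficients of x^2, xy, y^2, x, y, 1 of the product of two linear forms
            return np.array([a1 * a2, a1 * b2 + a2 * b1, b1 * b2,
                             a1 * c2 + a2 * c1, b1 * c2 + b2 * c1, c1 * c2])
        q = (k * product(line_form(A, B), line_form(C, D))
             + (1.0 - k) * product(line_form(A, C), line_form(B, D)))
        a, two_b, c, two_d, two_e, f = q
        return a, two_b / 2, c, two_d / 2, two_e / 2, f

    # The curvature extrema of the unit circle with k = 2 reproduce
    # x^2 + y^2 - 1 = 0 (up to a common scale factor).
    print(conic_from_points((1, 0), (0, 1), (-1, 0), (0, -1)))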
The representation of an ellipse by its center point m, radii l_1 and l_2, and orientation φ is much more convenient than the quadratic form, and it enables the component-wise comparison with a detected image ellipse. This representation is obtained from the quadratic form (Eq. 4.12) by inserting the coefficients a, b, c, d, e, f in (Wunderling & Heise, 1984)

\[ \mathbf{m} = -\begin{pmatrix} a & b \\ b & c \end{pmatrix}^{-1} \begin{pmatrix} d \\ e \end{pmatrix}, \qquad \text{where } \delta = \begin{vmatrix} a & b \\ b & c \end{vmatrix}, \tag{4.15} \]

and

\[ \varphi = \frac{1}{2}\,\arctan\frac{2b}{a - c}. \tag{4.16} \]
Fig. 4.5: The perspective projection of an ellipse can be understood as the perspective projection of four points on the ellipse using the precomputed cross ratio of these four points and the ellipse.
Let λ₁ and λ₂ be the real solutions of the polynomial λ² − (a + c)λ + δ = 0; then the radii are

\[ l_1 = \sqrt{\frac{-\Delta}{\lambda_1\,\delta}} \quad \text{and} \quad l_2 = \sqrt{\frac{-\Delta}{\lambda_2\,\delta}} \qquad \text{with} \qquad \Delta = \begin{vmatrix} a & b & d \\ b & c & e \\ d & e & f \end{vmatrix}. \tag{4.17} \]
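A sketch of the conversion from the quadratic form to the center/radii/orientation representation of Eqs. 4.15–4.17; arctan2 replaces the arctan of Eq. 4.16 to avoid the singularity at a = c:

    import numpy as np

    def ellipse_parameters(a, b, c, d, e, f):
        """Center m, radii (l1, l2), and orientation phi of the ellipse
        a x^2 + 2b xy + c y^2 + 2d x + 2e y + f = 0 (Eqs. 4.15-4.17)."""
        A0 = np.array([[a, b], [b, c]])
        m = -np.linalg.solve(A0, np.array([d, e]))             # Eq. 4.15
        phi = 0.5 * np.arctan2(2.0 * b, a - c)                 # Eq. 4.16
        delta = np.linalg.det(A0)
        Delta = np.linalg.det(np.array([[a, b, d], [b, c, e], [d, e, f]]))
        lam = np.roots([1.0, -(a + c), delta])                 # lambda^2 - (a+c) lambda + delta = 0
        l1, l2 = np.sqrt(-Delta / (lam * delta))               # Eq. 4.17
        return m, (l1, l2), phi

    # The unit circle x^2 + y^2 - 1 = 0 gives center (0, 0) and radii (1, 1).
    print(ellipse_parameters(1, 0, 1, 0, 0, -1))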
Now we have all necessary prerequisites to formulate the perspective projection of a model ellipse to the left and right image. Analogous to Eq. 4.9, we introduce the two functions Γ_o and Γ_b. Γ_o determines the four points to be projected and their cross ratio in homogeneous, object-centered coordinates and writes them in a 4 × 4 matrix; Γ_o is given in Appendix E, Eq. E.14. Γ_b computes from the four projected points (the result is a 3 × 4 matrix) the quadratic form (Eq. 4.12) using Eqs. 4.13 and 4.14 and transforms it to the ellipse representation with Eqs. 4.15–4.17. The projection of a model ellipse to the left and right image is thus
\[ \mathbf{x}_{b_l} = P^e_{b_l o}(\mathbf{x}_o) = \Gamma_b\!\left(\mathcal{T}_{b_l l}\,\mathcal{T}_{lo}\,\Gamma_o(\mathbf{x}_o)\right), \qquad \mathbf{x}_{b_r} = P^e_{b_r o}(\mathbf{x}_o) = \Gamma_b\!\left(\mathcal{T}_{b_r r}\,\mathcal{T}_{rl}\,\mathcal{T}_{lo}\,\Gamma_o(\mathbf{x}_o)\right), \tag{4.18} \]
where x_b is the image ellipse described by center point m, radii l_1 and l_2, and orientation φ, and x_o is the model ellipse characterized by center point, radii, and a normal vector in model coordinates o. This formulation of the perspective projection of a model circle allows us to easily measure the deviation of projected and detected ellipses by comparing only five parameters.
4.4 Stereo Matching

Stereo matching is another necessary prerequisite for our 3D reconstruction approach. A good overview of different stereo matching techniques can be found in (Faugeras, 1993). Whereas stereo matching is a very tough problem in most cases, it is fairly simple in our case. We assume that the object recognition process generates reasonably good object and color hypotheses from both stereo images. Our stereo matching algorithm is based on these object hypotheses. We are interested in object correspondences and therefore only have to search for corresponding objects with the same type and color. So far, we do not attempt to correct possible object recognition errors at this stage. The number of possible correspondences is rather small for one object, so that we are able to apply an exhaustive search among all possible matching candidates. The algorithm is illustrated in Fig. 4.6. We use the center of mass of all image features that are extracted for one object as an object abstraction. We first compute, for all objects of one image, the distances to the centers of mass of all possibly corresponding objects in the second image. We imagine here that the two images have the same image coordinate frame, i.e., the two images are superimposed.
Fig. 4.6: Object-based stereo matching: Objects are abstracted by the center of mass of their image features. The two images are superimposed and the distance is pairwise computed between all objects of same type in both images. An ‘image displacement’ is estimated from initial possible matches. The ‘image displacement’ and the matches are iteratively refined until no further improvement is possible. The black lines connect the final matches. The grey dotted line indicates that a correct match has been found in the second iteration, since in this example the centers of mass of the two non-matching rhomb-nuts are actually closer on the superimposed image than the correct matches. This problem can be solved using the ‘image displacement’.
We then compute a distance table, where incompatible object matches are indicated by infinitely large distances. The row and column minima in this table signal potential correspondences for objects in the left and right image, respectively. We search for these minima and finally require, for a bijective stereo correspondence, that

    minimum(objects_left[minimum(objects_right[i])]) = i .

We then compute a so-called 'image displacement' (we are fully aware that the epipolar geometry cannot be reduced to such a simple transformation), which is an approximation of the 2D translation and scaling of the image coordinates of objects in the right image with respect to corresponding objects in the left image. We apply this image displacement to the centers of mass of the objects in the right image and search again for correspondences. The search for correspondences and the displacement estimation are iteratively refined until no further improvement is possible in the search for stereo correspondences. We obtain good results with this simple algorithm as long as the object hypotheses are reasonably good. Further requirements are that the focal lengths of the two cameras do not differ too much (a difference in the range of 10mm does not pose a problem) and that the vantage points of both cameras are similar. Currently, we use only scenes with non-occluding and clearly separable objects. We obtain excellent results here; all stereo correspondences are found.
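A simplified sketch of this object-based matching; it models the 'image displacement' as a pure 2D translation (no scaling) and assumes each object hypothesis is given as a (type, color, center-of-mass) triple — both are simplifications of the procedure described above:

    import numpy as np

    def match_objects(left, right, n_iter=5):
        """Iterative object-based stereo matching on object hypotheses.
        left/right: lists of (obj_type, color, center_of_mass) tuples."""
        shift = np.zeros(2)            # crude 'image displacement' (translation only)
        matches = {}
        for _ in range(n_iter):
            # distance table; incompatible pairs (type or color differ) get infinity
            dist = np.full((len(left), len(right)), np.inf)
            for i, (ti, ci, pi) in enumerate(left):
                for j, (tj, cj, pj) in enumerate(right):
                    if ti == tj and ci == cj:
                        dist[i, j] = np.linalg.norm(pi - (pj + shift))
            # bijective correspondences: mutual row/column minima
            new_matches = {}
            for i in range(len(left)):
                if not np.isfinite(dist[i].min()):
                    continue
                j = int(np.argmin(dist[i]))
                if int(np.argmin(dist[:, j])) == i:
                    new_matches[i] = j
            if not new_matches:
                break
            matches = new_matches
            # refine the displacement estimate from the current matches
            shift = np.mean([left[i][2] - right[j][2] for i, j in matches.items()], axis=0)
        return matches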
4.5 Pose Estimation and Camera Calibration

In the previous sections we have introduced all prerequisites for 3D reconstruction and camera calibration. 3D reconstruction and camera calibration are accomplished simultaneously by minimizing a multivariate non-linear cost function which measures the deviation of projected model features from detected image features. In the absence of noise, the pose of an object and the camera parameters are estimated correctly if the projections of the model features match the detected image features exactly, and the distance measure between both feature types is zero. We apply the Mahalanobis distance (Duda & Hart, 1972), a statistical distance, to measure the deviation of projected model features from detected image features. As noise is present (pixel noise, quantization noise in the feature detection process, as well as false matches between model and image features), we obtain the best possible result if the distance between projected model and detected image features is minimal. A non-linear multivariate cost function C measures the distances of projected model and image features of all objects in all images as a function of the objects' poses and the camera parameters:

\[ C(\mathbf{a}) = \sum_{i=1}^{N} \sum_{j \in B} \left(\mathbf{x}_{b_j i} - P^i_{b_j o}(\mathbf{a}, \mathbf{x}_{oi})\right)^{T} \Sigma_i^{-1} \left(\mathbf{x}_{b_j i} - P^i_{b_j o}(\mathbf{a}, \mathbf{x}_{oi})\right) \;\rightarrow\; \min. \tag{4.19} \]
The vector a captures all unknown parameters. The index j indicates the image frame (left, right, or other), and i counts the corresponding model and image feature pairs. Σ_i is the measurement covariance associated with feature i. The type of covariance matrix depends on the feature type. The three types of covariance matrices that we use for points, line segments, and ellipses, respectively, are given in Appendix F (Equations F.2, F.10, and F.14). The projection of one object model depends on 6 + 1 + 7(n − 1) = 7n parameters, where n is the number of images that are used. The pose of an object is described by six parameters, three rotational and three translational, with respect to the scene coordinate frame. We place the scene coordinate frame in the first, or left, camera coordinate frame. Therefore, we only need to estimate one parameter, the focal length f_l, for the first camera. For all other cameras, we need to estimate the focal length and the six external camera parameters.
The vectors x_bji (image features) and x_oi (model features) contain different representations depending on the feature type. The different representations and the corresponding projection functions P^i_{bj o} are explained in the previous section. The minimization of a non-linear function requires an iterative method. The main problem here is to find a method which guarantees convergence to a global minimum. The projection from 3D to 2D is a smooth and well-behaved transformation (Lowe, 1991). But convergence of the minimization of the cost function C (Eq. 4.19) is only ensured when starting with good initial parameter values. We use uncalibrated images and therefore do not have good initial parameter values. Thus, we divide the global reconstruction problem into three steps to stepwise improve the parameter estimates. Figure 4.7 shows the different steps as grey shaded boxes. Starting with corresponding model and image features, we estimate for each object and each image separately the pose parameters and the focal length (step I, model fitting). We try to fit an object model to the corresponding image features (in one image) by adjusting the focal length and the rotational and translational parameters. We then get n estimates of the focal length for one image, where n is the number of depicted objects in this image. We determine the median of these focal lengths and fix the focal length parameter to the median. Then, in step II, we adjust all estimates of the objects' poses to the fixed focal length. We apply this for all images. Now the parameter estimates are good enough to minimize the cost function (Eq. 4.19) for all parameters (step III) after the stereo matching has been computed. We use the Levenberg-Marquardt method for minimization, which is described in the following subsection. The minimization scheme is explained in greater detail in Subsection 4.5.2.
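A compact sketch of evaluating the cost function of Eq. 4.19; the projection functions and covariance matrices are passed in as assumed callables and arrays, since their concrete form depends on the feature type (points, line segments, ellipses):

    import numpy as np

    def cost(a, correspondences, project):
        """C(a) of Eq. 4.19. `correspondences` is a list of tuples
        (x_b, x_o, Sigma, j): image feature, model feature, measurement
        covariance, and image index; `project(a, x_o, j)` is the projection
        function for the respective feature type."""
        C = 0.0
        for x_b, x_o, Sigma, j in correspondences:
            r = x_b - project(a, x_o, j)          # residual in the image plane
            C += r @ np.linalg.solve(Sigma, r)    # Mahalanobis-weighted squared error
        return C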
4.5.1 Levenberg-Marquardt Method

The Levenberg-Marquardt minimization method (Scales, 1985; Press et al., 1988) is a combination of Newton's minimization method and a gradient descent. Most approaches for minimizing multivariate functions without constraints are descent methods.
Fig. 4.7: Multi-step 3D reconstruction from uncalibrated images through stepwise minimization of Equation 4.19: The reconstruction includes the camera parameter estimation and the pose estimation of all objects in a scene. Each minimization step (grey shaded boxes) minimizes Eq. 4.19 with respect to a subset of parameters.
Their general form is (Spelluci, 1993):

\[ \mathbf{a}_{k+1} = \mathbf{a}_k - s_k\,\mathbf{d}_k, \tag{4.20} \]

where d_k is the descent direction and s_k is the step width of the k-th iteration. The different minimization methods differ mainly in their approach for computing d_k and s_k. One of the simplest methods is the gradient descent. The descent direction is the gradient of the function f to be minimized, and the step width is kept constant for each step:
\[ \mathbf{a}_{k+1} = \mathbf{a}_k - s\,\frac{\partial}{\partial\mathbf{a}} f(\mathbf{a}_k). \tag{4.21} \]
This method always converges to a (local) minimum, but the convergence might be slow (Lowe, 1991). Newton's method is another minimization approach (Fröberg, 1985). It is based on a Taylor expansion of the function to be minimized at the point a:
\[ f(\mathbf{a} + \mathbf{h}) = f(\mathbf{a}) + \sum_i h_i\,\frac{\partial}{\partial a_i} f(\mathbf{a}) + \frac{1}{2}\sum_{i,j} h_i h_j\,\frac{\partial^2}{\partial a_i\,\partial a_j} f(\mathbf{a}) + \ldots \approx f(\mathbf{a}) + \frac{\partial}{\partial\mathbf{a}} f(\mathbf{a})\,\mathbf{h} + \frac{1}{2}\,\mathbf{h}^T A\,\mathbf{h} \quad\text{with}\quad [A]_{ij} := \frac{\partial^2}{\partial a_i\,\partial a_j} f(\mathbf{a}). \]
The function is minimal at the zero crossing of the first derivative:

\[ \frac{\partial}{\partial\mathbf{a}} f(\mathbf{a} + \mathbf{h}) \approx A\,\mathbf{h} + \frac{\partial}{\partial\mathbf{a}} f(\mathbf{a}) = 0 \;\Rightarrow\; \mathbf{h}_{\min} = -A^{-1}\,\frac{\partial}{\partial\mathbf{a}} f(\mathbf{a}). \]

Newton's method converges quickly; however, it does not always converge (Press et al., 1988). The Levenberg-Marquardt method uses a modified Hessian matrix A' with
\[ a'_{ii} = (1 + \lambda)\,a_{ii}, \qquad a'_{ij} = a_{ij} \;\text{ for } i \neq j, \tag{4.22} \]
and the minimization rule is

\[ \mathbf{a}_{k+1} = \mathbf{a}_k - {A'}^{-1}\,\frac{\partial}{\partial\mathbf{a}} f(\mathbf{a}_k). \tag{4.23} \]
The parameter λ is adjusted according to the convergence of the function to be minimized. When λ is large, the diagonal of the matrix A' is dominant and Eq. 4.23 behaves like a gradient descent, whereas for a small λ it behaves like Newton's method. The Levenberg-Marquardt algorithm is outlined in Fig. 4.8.
    compute f(a);
    pick a modest value for λ, e.g., λ = 0.001;
    do {
        compute Δa = −A'⁻¹ ∂/∂a f(a);
        if ( f(a + Δa) ≥ f(a) )
            increase λ, e.g., by a factor of 10;
        else {
            decrease λ, e.g., by a factor of 10;
            a := a + Δa;
        }
    } until (condition for stopping, e.g., ‖Δa‖ < ε);

Fig. 4.8: The Levenberg-Marquardt algorithm for minimizing f(a) (Press et al., 1988).
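A minimal Python sketch of the loop in Fig. 4.8, assuming the gradient and (approximate) Hessian of f are available as callables:

    import numpy as np

    def levenberg_marquardt(f, grad, hess, a, lam=1e-3, eps=1e-6, max_iter=100):
        """Levenberg-Marquardt minimization following Fig. 4.8."""
        fa = f(a)
        for _ in range(max_iter):
            A = hess(a).copy()
            A[np.diag_indices_from(A)] *= (1.0 + lam)     # modified Hessian, Eq. 4.22
            step = -np.linalg.solve(A, grad(a))           # Eq. 4.23
            if f(a + step) >= fa:
                lam *= 10.0                               # behave more like gradient descent
            else:
                lam /= 10.0                               # behave more like Newton's method
                a = a + step
                fa = f(a)
                if np.linalg.norm(step) < eps:
                    break
        return a

    # Example: minimizing f(a) = ||a - 1||^2 from a = 0 converges to a = (1, 1, 1).
    print(levenberg_marquardt(lambda a: np.sum((a - 1) ** 2),
                              lambda a: 2 * (a - 1),
                              lambda a: 2 * np.eye(a.size),
                              np.zeros(3)))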
4.5.2 Minimization Scheme

We have to cope with non-optimal initial parameter estimates since we use uncalibrated images. This is overcome by dividing the global estimation problem into three steps in order to monitor and stepwise enhance the minimization process. In each step we estimate a subset of parameters. The Jacobians ∂/∂a P^i_{bj o}(a, x_oi) necessary for the minimization are provided in Appendix E (Equations E.2 and E.3, E.10, E.26).
Step I (Model Fitting): In the first step, the poses of all objects are reconstructed individually and separately for each image of the scene. The individual reconstructions are performed very quickly since only a few parameters are being estimated. However, the minimizations have to be monitored in order not to converge to false local minima caused by inappropriate initial values. Four visible planar model points contain enough information to estimate all unknown pose parameters for one object (Fischler & Bolles, 1981). The initial value for the focal length f is chosen to be a commonly used length (e.g., 15mm). For the translation in z-direction (t_z) we take a typical object distance (e.g., 2m). The initial values for t_x and t_y (x- and y-translation) are calculated from the assumed focal length and t_z by tracing the view ray through an image point and a model point. Rotation parameters can be set to any values. During minimization, the focal length is monitored. If it is no longer in an admissible range (10–100mm in our case), the object model must be rotated by negating two rotational parameters, and the minimization is restarted with all other parameters reset to their original initial values. The cost function is also monitored during minimization. If the process converges to a local minimum with inadmissibly high costs, t_z is modified according to a predefined scheme. To improve the speed of convergence, it is furthermore useful to adjust t_x and t_y, which should be consistent with the current t_z and focal length f. This monitored Levenberg-Marquardt iteration is terminated if either the change of the parameter estimates from one iteration step to the next is less than a given threshold, or if the model
fitting does not succeed, i.e., if a maximum number of iterations is exceeded or if the same local minimum is found despite modified parameter values. There are basically three reasons for a failure of the model fitting: (a) the object hypothesis is wrong as a result of incorrect decisions in the object recognition process, so that a wrong object model was chosen which does not fit; (b) too many non-corresponding image features are matched to the model features; and (c) the feature detection was too inaccurate. So far, no strategies are incorporated to cope with these problems. Therefore, we cannot reconstruct the pose of an object in these cases.

Step II (Pose Enhancement): By applying step I to each detected object, we obtain several estimates of the focal length for each camera and an estimate of the pose of each object relative to each camera. In this step, better initial estimates for the objects' poses relative to each camera are derived using the same focal length for all objects depicted in one image. For each camera, the focal length is fixed to the median of all focal length estimates from step I for this camera. The poses of all objects in an image are refined according to the fixed focal length.

Step III (3D Reconstruction): The parameter estimates computed so far are good enough as initial values for the final 3D reconstruction. The median focal length and the resulting object poses of step II are used as initial values for global model fitting. The relative pose between different cameras is roughly estimated from the estimated objects' poses relative to each camera. The minimization process is robust against the initialization of the external camera parameters. Monitoring this global minimization is not necessary because sufficiently good initial parameter values are now available.
4.6 Results

We conducted various experiments with the reconstruction approach outlined in this chapter. We used real images as well as synthetic data with simulated noise and report qualitative as well as quantitative results.
Qualitative Results

We first show three examples of scenes which were reconstructed from stereo images (Fig. 4.9). Qualitatively, the results are very satisfying. The different views of the reconstructed scenes show that the 3D poses are very well estimated. The reconstruction of the three scenes is based on points and ellipses in stereo images. Object recognition was performed manually for the stereo images of the three scenes in order to focus only on the results of the 3D reconstruction.
Fig. 4.9: Three examples of 3D reconstructions: the left column depicts the input images of the left camera. The middle and the right column show different views of the reconstructed scenes.
Accuracy of 3D reconstruction

We do not have the possibility to measure the exact 3D coordinates of the objects and the cameras in a world coordinate frame for the acquisition of ground truth data. Therefore, we restrict the quantitative evaluation of our approach to comparisons of the relative poses of objects in scenes. The Baufix objects are taken from a children's toolkit, and they are manufactured within certain tolerances; we measured inaccuracies in the range of 1mm. Fig. 4.10 shows two scenes in which we measured the relative distances between the objects prior to taking the images. Table 4.1 lists a pairwise comparison of the measured and the estimated poses. The pose of an object relative to another is uniquely determined by the distances from two points of the first object to one point of the second object and the angle between the two surface normals. We have furthermore calibrated our cameras with our implementation of the algorithm of Tsai (1985) using a planar calibration pattern with 49 circles. The estimates of the focal lengths achieved with our approach are very similar to the focal lengths estimated during camera calibration.
Fig. 4.10: Images of two scenes with objects with known relative poses: objects in image (b) exhibit only ellipses as image features.
The reconstruction accuracy reflects the fact that the more features are available for one object, the better the accuracy of the estimated pose. The reconstruction of the scene in Fig. 4.10a is based on image points and ellipses. The points are the intersections of line segments at the corners of the rhomb-nut (object 3 in Fig. 4.10a) and at the rounded corners of the two bars. For the two holed bars (objects 1 and 2 in Fig. 4.10a), four vertices and three or seven model circles are visible on the top surface. The estimated relative poses are very close to the measured ones. The rim (object 4 in Fig. 4.10a) is reconstructed using only the detected ellipse of the hole. The results show that the largest errors for this scene occur for that object.
Scene  Objects  Δφ [°]  d0 [mm]  Δd0 [mm]  Δd0/d0 [%]  d1 [mm]  Δd1 [mm]  Δd1/d1 [%]
(a)    1, 2     2.7     293.5    1.9       0.7         191.9    2.1       1.1
(a)    1, 3     3.6     351      4.8       1.3         325.9    5         1.5
(a)    1, 4     1.4     111.7    3.7       3.4         313.9    2         0.6
(a)    2, 1     2.7     293.5    1.9       0.7         177.7    6.3       3.4
(a)    2, 3     3.5     191.2    6.8       3.4         317.1    3.2       1
(a)    2, 4     2.4     198      2.6       1.3         171.8    1.8       1
(a)    3, 1     3.6     351      4.8       1.3         175.7    2.2       1.3
(a)    3, 2     3.5     191.2    6.8       3.4         190.1    3.5       1.8
(a)    3, 4     4.6     306.3    9.6       3           282.2    9.6       3.3
(a)    4, 1     1.4     111.7    3.7       3.4         -        -         -
(a)    4, 2     2.4     198      2.6       1.3         -        -         -
(a)    4, 3     4.6     306.3    9.6       3           -        -         -
(b)    1, 2     2.3     81.2     2.0       2.5         -        -         -
(b)    1, 3     8.0     82.3     1.5       1.8         -        -         -
(b)    1, 4     9.8     101.8    3.5       3.4         -        -         -
(b)    1, 5     6.3     144.0    3.8       2.6         -        -         -
(b)    1, 6     9.0     110.1    2.8       2.6         -        -         -
(b)    2, 3     7.1     51.5     2.9       5.6         -        -         -
(b)    2, 4     8.3     142.3    5.4       3.8         -        -         -
(b)    2, 5     2.3     113.2    5.3       4.7         -        -         -
(b)    2, 6     6.1     117.8    4.7       4.0         -        -         -
(b)    3, 4     3.1     100.0    2.6       2.6         -        -         -
(b)    3, 5     5.0     67.9     2.6       3.9         -        -         -
(b)    3, 6     2.9     67.4     1.8       2.6         -        -         -
(b)    4, 5     6.2     110.5    2.1       1.9         -        -         -
(b)    4, 6     2.2     50.4     1.0       1.9         -        -         -
(b)    5, 6     4.0     60.1     1.2       2.0         -        -         -

Table 4.1: Accuracy of the 3D reconstruction of the scenes shown in Fig. 4.10: The table lists the differences between estimated and measured poses of all objects in both scenes, respectively. Scene (b) contains cylindric and rounded objects only. The reconstruction of scene (b) is therefore based on ellipses only. For the reconstruction of scene (a), points and ellipses were used. The relative pose between two objects is here described by the distance between a first point of the first object and one point on the second object, d0, the distance between a second point on the first object and the point on the second object, d1, and Δφ, which measures the difference between the surface normals of the two objects. The relative pose of cylindric objects is uniquely determined by one distance (d0) and the difference of the surface normals (Δφ).
Scene Fig. 4.10b consists of cylindric objects and a cube with rounded corners. The reconstruction of scene Fig. 4.10b is therefore based on image ellipses only. We used the image ellipses at the holes of the objects, which are more accurately detected than the ellipses at the object boundaries. We employ Taubin's (1991) method for ellipse detection. The camera parameters can only be estimated based on at least two ellipses where the relative poses of the corresponding circles are known. This is only applicable for the cube (object 1). Thus, the camera parameters are estimated on the basis of only two ellipses per image. The reconstruction results are less accurate for scene Fig. 4.10b than for scene Fig. 4.10a. However, the distance deviation is still less than 6%, and the angular deviation of the surface normals is less than 10°. These results show the range of performance of our approach. We are unable to provide exact ground truth. Furthermore, the objects are manufactured imprecisely, and the object models are built by averaging over a number of objects. Thus, we cannot clearly separate the reconstruction error from other inaccuracies. The reported results can therefore be taken as a worst-case analysis.
Sensitivity to Noise

The sensitivity to noise of our approach is evaluated using synthetic data. We first show that the number of features and the number of images affect the reconstruction accuracy in the presence of noise. In a second experiment, we investigate the influence of radial lens distortion.

Accuracy depending on the Number of Features and the Number of Images

The accuracy of the 3D reconstruction and camera calibration is mainly influenced by the accuracy of the image feature detection. The influence of noise or inaccuracies is less grave the more features are used for 3D reconstruction and camera calibration. The reconstruction accuracy is also better when using stereo images rather than monocular images. We demonstrate this with the following experiment based on simulated data. Equally distributed noise in the range of ±0.5 pixel is added independently to all projected model features of two 3-holed-bars. All six experiments reported in Fig. 4.11 are based on 1000 runs. The two 3-holed-bars are reconstructed independently from each of the 1000 sets of features. We compute the distance between the centers of mass of the reconstructed objects and compare it to the true distance. We use this as an indication of the reconstruction accuracy. The histograms in Fig. 4.11a-d show the results of reconstructions from monocular images. Notice the different scalings of the histograms. The distribution of the reconstructed distances between the two centers of mass of the two objects is depicted in the histograms. The true distance is 168.8mm, whereas the reconstructed distances vary in experiment (a) from 138mm to 561mm. For experiment (a) we used only four coplanar points per object in one image. Four coplanar points are the minimum necessary number of features for reconstruction (Fischler & Bolles, 1981).
Fig. 4.11 (panel summaries): (a) one image, two objects with 4 points each: distance μ = 178.6mm, σ = 38.3; (b) one image, two objects with 4 points each: focal length μ = 28.1mm, σ = 12.7 (true 25.0mm); (c) one image, four objects with 4 points each: distance μ = 173.11mm, σ = 23.3; (d) one image, two objects with 4 points and 3 circles each: distance μ = 172.5mm, σ = 16.6; (e) two images, two objects with 4 points each: distance μ = 169.2mm, σ = 4.8; (f) two images, two objects with 4 points and 3 circles each: distance μ = 168.9mm, σ = 3.5. The true distance between the objects is 168.8mm.

Fig. 4.11: Experiments with synthetic data indicate the sensitivity to errors in image point coordinates when one (a-d) or two (e,f) images are used. The number of objects and the number of features per object affect the reconstruction accuracy as well.
The distribution of the estimated focal lengths for the same experiment is shown in Fig. 4.11b. The true focal length is 25mm. The histogram in Fig. 4.11b shows a similar deviation as Fig. 4.11a. In experiments (c) and (d), we again used monocular images, however with more features. The calibration and reconstruction is based on four objects with four points each (experiment (c)) or on two objects with four points and three ellipses each (experiment (d)). For experiment (c), noise was simply added twice to the two projected object models, and so four objects in an image were simulated. Four points on four objects lead to 32 constraint terms in the cost function. Four points and three ellipses of two objects result in 2 · (4 · 2 + 3 · 5) = 46 constraint terms. The number of parameters to be estimated is 25 in experiment (c) and 13 in experiment (d). The mean of the reconstructed distances is closer to the true distance in experiment (c) than in experiment (a). The standard deviation is also much smaller. This means that the accuracy can be improved by simply adding more objects with only the minimum number of features. The results are again improved in experiment (d). The ratio of constraint terms to estimated parameters is better in this experiment, which results in a much smaller standard deviation (σ = 16.6).

The accuracy is dramatically improved through stereo images (experiments (e) and (f)). The same number of features is used as in experiment (a). The data from the projection to two images leads to an average distance estimate (μ = 169.2) which is pretty close to the true distance, even though now 2 · 6 (objects' poses) + 2 (focal lengths) + 6 (external camera parameters) = 20 parameters are being estimated rather than 13 in experiment (a). The best results are achieved using stereo images with more than the minimum number of features per object. The mean in Fig. 4.11f (μ = 168.9) is very close to the true distance, and the standard deviation is only σ = 3.5.

Radial Lens Distortion

Another experiment with synthetic data shows the influence of radial lens distortion on calibration and reconstruction. Radial distortion with a maximum displacement of r_max = 4.5 pixel at the corners of an image is added to the coordinates on a synthetic image. The image again contains two 3-holed-bars. These are reconstructed from the distorted images, and we again use the deviation of the reconstructed distance of the two centers of mass from the true distance as an indication of the reconstruction accuracy. A radial distortion of r_max = 4.5 pixel at the corners of an image is common in off-the-shelf lenses. Table 4.2 shows a maximal difference between the true object distance and the reconstructed distance of 1.7%. The influence of radial distortion becomes smaller if more features or more images are used for reconstruction and calibration. Weng et al. (1992) report similar results.
# images   # objects   # features   Δd0/d0 [%]
1          2           24           1.7
1          2           27           0.9
2          2           24           0.7
2          2           27           0.5

Table 4.2: Influence of radial lens distortion on the accuracy of reconstruction and calibration, r_max = 4.5 pixel.
4.7 Discussion

We believe that our 3D reconstruction approach is well suited for our application. The objective is (a) to obtain a 3D scene representation for the extraction of qualitative descriptions (see Chapter 5), and (b) to provide an initialization for grasping processes which themselves are able to refine the pose parameters. We cannot achieve accuracy as high as that reported by Crowley et al. (1993) or by approaches that use calibrated images (Gengenbach, 1994; Yuan, 1989). But our accuracy is good enough for the extraction of qualitative descriptions and for the initialization of other processes. Our method allows us to use monocular, binocular, or more images of a scene with any number of objects. We use three different feature types. Other features can easily be added by formulating their projection functions together with their Jacobians. Thus, a wide variety of shapes can be modeled. The reconstruction accuracy is better when we use stereo images rather than monocular images. Another advantage of stereo images is the possibility to uniquely reconstruct cylindric objects and partially occluded objects which have fewer than four visible points. All parameters that are to be estimated are modeled explicitly according to the pin-hole camera model. This allows us to estimate all pose parameters and all camera parameters with a minimum number of visible image features. Constraints, such as the location of features other than those encoded in the object models or planarity constraints, can be easily incorporated. Furthermore, calibration hints or assumptions about the pose of objects can be integrated by simply initializing or fixing the respective parameters during minimization. Our object models do not account for threaded holes, which do not have a perfect circular ending. Also, the models of the bolt threads are rather poor so far (see Appendix B). A more adequate modeling would certainly improve the accuracy and the success of the model fitting here. Currently, we use fixed object models. The use of generic or parametric models would be a step towards extending the domain to other objects without modeling each of them explicitly. Furthermore, redundancies could be avoided which appear, for example, in the modeling of the bars, which differ only in the number of holes and their length.
Further work should also include better strategies to cope with object hypotheses where the object model can not be fitted successfully. The reasons are not always false hypotheses but also false feature matches or problems due to inaccurate image features such as broken line segments or badly fitted image ellipses. An interaction with the object recognition and feature detection processes would be very helpful. A crucial issue in calibration and 3D reconstruction is the representation of 3D rotations (Stuelpnagel, 1964). There are a number of parameterizations, for example,
• Rotation matrix (9 parameters),
• Quaternions (4 parameters),
• Yaw, pitch, and roll (3 parameters),
• Euler angles (3 parameters),
• Skew-symmetric matrix or rotation vector (3 parameters).
The number of parameters ranges from 3 to 9, but it has been known since Euler that the three-dimensional rotation group is itself a three-dimensional manifold. It has been shown, however, that it is topologically impossible to have a global three-dimensional parameterization of the rotation group without singular points (Stuelpnagel, 1964; Faugeras, 1993). We modeled rotations using the three angles yaw, pitch, and roll. This parameterization has the advantage that the minimum number of parameters is needed and that the composition of the rotation matrix from these three angles is straightforward. But this representation also has the disadvantage that the singular points appear cyclically. Zhang & Faugeras (1992) claim advantages for the rotation-vector parameterization in this respect. Further work should include investigations in this direction with our approach.
Chapter 5
Qualitative Descriptions
Qualitative descriptions form a non-numerical, abstract representation of the signal data content which is relevant for a given situation. In the system QUASI-ACE, we use the same type of qualitative descriptions for the results from speech and image understanding. This common representation is used for inferences to identify the intended object(s) and for the exchange of restrictions between the two understanding processes. In this chapter, we first explain how we represent qualitative descriptions. Then, we present the computation of qualitative descriptions for the type of an object, for the color of an object, and for spatial relations. The qualitative descriptions for object type and color are computed from the results of the object recognition and color classification modules. We only sketch these modules briefly, since object recognition and color classification are not part of this thesis. However, the development of a three-dimensional computational model for spatial relations is an important part and is hence described in greater detail. We also present results from a psychological evaluation of the computational model and discuss the model in the context of these experiments.
5.1 Representing Qualitative Descriptions

Qualitative entities result from discretizations of the continuous signal space into categories (Harnad, 1987). The boundaries between categories should be placed in such a way that there are qualitative resemblances within each category and qualitative differences between them. Nevertheless, a qualitative property is always more or less true, or applicable, to a physical object or a set of objects. The boundaries between categories may show fuzziness. Graded categories are not defined by an all-or-none rule, and membership is a matter of degree (Medin & Barsalou, 1987). Furthermore, categories are not necessarily disjoint; they may overlap: for example, an object can be colored bluish green.
A qualitative representation is chosen that can model fuzziness to account for these facts. We use a vectorial fuzzified representation (a) to represent information about the uncertainty of the recognition or detection processes using real data, (b) to represent each qualitative property, which is characterized by a set of categories, by the vector of the fuzzy degrees of membership of each category. Each qualitative object property is described by a function Q:
\[ Q_u(\mathbf{p}, o) = \mathbf{u} \qquad \text{or} \qquad Q_b(\mathbf{p}, IO, RO) = \mathbf{u}, \tag{5.1} \]

for unary relations (Q_u) and for binary relations (Q_b) such as spatial relations, respectively. u is a vector which represents the fuzzy measurements that are assigned to each category of a property space p. It means that u_i characterizes how well category p_i of the property fits the given object o or object pair (IO, RO). Here are two examples:
1. The property color contains the following categories:

   p_color = (red, yellow, orange, blue, green, purple, wooden, white)^T.

   Then u is a vector of dimension 8, and

   Q_u(color, rhomb-nut) = (0.4, 0.3, 0.8, 0.1, 0.09, 0.2, 0.15, 0.05)^T.

   This characterizes that the object rhomb-nut is most likely to be orange. The color orange is also somehow red as well as somehow dark yellow, and thus the degrees of membership for the categories red and yellow are higher than for the other color categories.

2. The property spatial relation consists of the projective relations left, right, above, below, behind, and in-front, and therefore

   p_spatial relation = (left, right, above, below, behind, in-front)^T.

   Then, for example,

   Q_b(spatial relation, rhomb-nut, 3-holed-bar) = (0.45, 0, 0.14, 0, 0.49, 0)^T.

   This example is taken from Table 5.6 (IO = 5, RO = 1). It describes that the rhomb-nut is behind and left of a 3-holed-bar.
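A small sketch of how such fuzzy membership vectors can be handled in code; the threshold used to read off the dominant categories is an arbitrary illustrative choice, not part of the representation itself:

    import numpy as np

    P_COLOR = ("red", "yellow", "orange", "blue", "green", "purple", "wooden", "white")
    P_SPATIAL = ("left", "right", "above", "below", "behind", "in-front")

    def dominant_categories(p, u, threshold=0.3):
        """Categories of property space p whose fuzzy membership in u exceeds a
        threshold; the vector u itself remains untouched (no hard decision)."""
        return [cat for cat, ui in zip(p, u) if ui > threshold]

    # Q_u(color, rhomb-nut) from example 1 and
    # Q_b(spatial relation, rhomb-nut, 3-holed-bar) from example 2:
    u_color = np.array([0.4, 0.3, 0.8, 0.1, 0.09, 0.2, 0.15, 0.05])
    u_spatial = np.array([0.45, 0.0, 0.14, 0.0, 0.49, 0.0])

    print(dominant_categories(P_COLOR, u_color))      # ['red', 'orange']
    print(dominant_categories(P_SPATIAL, u_spatial))  # ['left', 'behind']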
This representation has the following advantages:

• Overlapping meanings and competing hypotheses can be represented.
• It captures the degree of membership for each category of a property space.
• No irreversible, strict decisions have to be made about the qualitative value of a property at an early stage of the understanding process. The decision which qualitative value/category of a property space is chosen can be postponed to a later stage. Thus, other information that might become available later, e.g., about other properties or the scene context, can be taken into account when making the decision for the final qualitative system output.
• Object specifications can be of a wide variety. Using a decision calculus, the understanding of object specifications is rather flexible in reacting to the choice of properties which are specified in an utterance. Even partially false specifications can be handled.
5.2 Object Type

The most obvious description of an object is naming its type or object class. In the Baufix domain, there are objects of 20 different types. Appendix B shows a list of all Baufix objects. The recognition of Baufix objects is performed independently on single (stereo) image frames of an image sequence, without taking into account knowledge from previous frames. The resulting hypotheses are used to bootstrap the 3D reconstruction process and to initialize the qualitative descriptions of the types of all recognized objects in the images. The qualitative descriptions are refined depending on whether the model fitting which is accomplished in the 3D reconstruction process turns out to be successful or not. The hybrid object recognition approach, which was developed jointly in the project "Situated Artificial Communicators", and some results are presented briefly in the following subsections.
5.2.1 Object Recognition

Object recognition is carried out by a hybrid approach combining neural and semantic networks (cf. Heidemann et al., 1996; Heidemann & Ritter, 1996; Sagerer et al., 1996; Moratz et al., 1995). The neural network generates object hypotheses, which are either verified or rejected by an ERNEST network. Thus, holistic recognition and structural analysis are combined. Fig. 5.1 shows the structure of the hybrid knowledge base for object recognition. Holistic object recognition is performed by a special form of neural networks called Local Linear Maps (LLM) (Ritter et al., 1992). A color segmentation algorithm, which is implemented on the special hardware platform Datacube (MV200), provides blob centers as 'focus points'. At each focus point, a feature vector is extracted by 16 Gabor filter kernels which
are applied to an edge-enhanced intensity image. The feature vectors form the input for the LLM-network, which calculates up to three competing object hypotheses (Heidemann & Ritter, 1996).
Fig. 5.1: The hybrid knowledge base for object recognition (Sagerer et al., 1996).
For each competing LLM-hypothesis, an instance of the concept I_OBJECT is created on the first level of the ERNEST network (image level, prefix I in Fig. 5.1). All competing instances are stored in competing search tree nodes. An ERNEST concept from the second network level (perceptual level, prefix PE) is selected which corresponds to the hypothesis from the LLM-network. The choice of the ERNEST concept depends on the type of the LLM-hypothesis. The structural knowledge stored in the semantic network is then used to verify (or discard) the object hypothesis. This means, for example, that if an instance with the object type 'bolt' assigned to the attribute 'type' is created in the image level, then a modified concept representing a bolt (PE_BOLT) is built on the perceptual level. PE_BOLT is connected with an inherited concrete link to the image-level instance of type bolt. The ERNEST control algorithm tries to find the parts of the modified concept in the perceptual level as they are modeled in the semantic network. This yields, for our bolt example, instances for 'bolt head' (PE_BOLT_HEAD) and 'bolt thread' (PE_BOLT_THREAD). The concepts representing parts of objects can be instantiated if image regions are found which show the region characteristics (position, color, shape) that are modeled or propagated as restrictions. Such an instance is then linked to a concept representing the corresponding image region (I_REGION) by a concrete link. The region segmentation is performed by a polynomial classifier of sixth degree using intensity, hue, and saturation as features for color classification (see Section 5.3.1). Restrictions on position, color, and shape are propagated in a model-driven way. Additionally, the restrictions of the current focus – which represents the object for which currently three competing hypotheses are under verification – are taken into account. If a successful instance
of a perceptual object (PE_OBJECT) is created, it is then added as a part of a modified concept (PE_SCENE). The concept PE_SCENE refers to all objects in the scene that have been detected so far. After this step, the focus is adapted according to the newly holistically detected object, and the next object hypotheses are processed. The next object hypotheses are again created by the LLM-network and yield instances of an object in the image level of the ERNEST network.
5.2.2 Results

Two different sets of images were used to evaluate the performance of the hybrid object recognition approach (I would like to acknowledge Gernot Fink and Franz Kummert for help in providing these results). In the first set, there are 11 HSI-encoded images from 11 different scenes. The images were taken under controlled lighting conditions but with varying focal lengths of the cameras (Socher et al., 1996). A size estimation based on parameters of characteristic regions detected in the images enables the object recognition process to adapt to the different sizes of the projected objects in each of the images. The scenes contained a total of 156 known Baufix objects. The scene complexity ranged from scenes with as few as 5 to 10 objects to scenes containing 25 to 35 different objects. Three of the latter, most complex scenes also contain some out-of-domain objects, e.g., a human hand or a toy car. However, objects are not supposed to overlap, as occlusions cannot yet be handled correctly by the recognition approach. The second set contains 50 images which were again taken under controlled lighting conditions (Kummert, 1997). Here, the focal lengths of the cameras were fixed. The images were taken from scenes with Baufix objects that were placed by 5 naive users. The average number of objects depicted in these images is 12. The objects do not overlap in the scene, but occlusions occur due to the chosen camera view. The recognition results from both sets of images are reported in Table 5.1. Recognition errors might be due to false classifications, additional objects, or objects that were not classified. We compute a detection accuracy (DA) to quantify the recognition performance. The detection accuracy is defined analogously to the well-known word accuracy (Lee, 1989):
\[ DA = \frac{\text{total} - (\text{false} + \text{nothing} + \text{additional})}{\text{total}} \cdot 100\% \tag{5.2} \]
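A one-line check of Eq. 5.2 against the counts reported in Table 5.1 below:

    def detection_accuracy(total, false, nothing, additional):
        """Detection accuracy DA of Eq. 5.2 in percent."""
        return (total - (false + nothing + additional)) / total * 100.0

    print(round(detection_accuracy(156, 11, 23, 13), 1))   # first test set: ~69.9 -> 70%
    print(round(detection_accuracy(599, 10, 28, 1), 1))    # second test set: 93.5%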
For the first test set, we achieve a DA of 70%. The object recognition is mainly based on size and shape characteristics of colored blobs which result from the color classification. These blobs vary a lot for the same object type because of the changing focal lengths. For the second test set, the focal lengths are fixed, and a DA of 93.5% is achieved. Furthermore, no out-of-domain objects were contained in the second set. The results reported in Table 5.1 can be considered as examples of a good and a bad case. They show the range of results achieved with the hybrid recognition approach. Fig. 5.2 shows two of the scenes with the object recognition results superimposed. All objects in the left image are recognized correctly. The complex image on the right contains out-of-domain objects.
set / images / objects         correct       false      nothing     additional   DA
set 1 / 11 img. (156 obj.)     122 (78%)     11 (7%)    23 (15%)    13 (8%)      70%
set 2 / 50 img. (599 obj.)     561 (93.7%)   10 (1.6%)  28 (4.7%)   1 (0.2%)     93.5%

Table 5.1: Object recognition results on (a) the first test set with 11 images, which were taken with varying focal lengths and may contain out-of-domain objects, and (b) the second test set with images which were taken with a fixed focal length.
Errors occur here for out-of-domain objects, for objects where image regions of two different objects were mistakenly classified as belonging to a compound object of type bolt, and for two of the small bolts. These images belong to the first test set.
Fig. 5.2: Two scenes from the first test set: All objects of the simple scene (left image) are recognized correctly, whereas the object recognition module has more difficulties with the complex scene (right image). This scene contains out-of-domain objects.
5.3 Color

Color is a dominant feature for object descriptions. Subjects prefer to specify the visually most salient feature (Herrmann & Deutsch, 1976; Herrmann, 1982), and this is often the color of
an object. In the Baufix scenario, the objects are colored with bright and clearly distinct elementary colors. Mangold-Allwinn et al. (1995) show in different investigations that a striking color provides very useful information for object specifications. Furthermore, the number of object specifications using color attributes is higher in instructions at the first naming of an object than in pure object descriptions (Mangold-Allwinn et al., 1992). A short quantitative study using transcriptions of object descriptions and assembly dialogs in the Baufix scenario shows that these results apply here, too.

Color classification and color segmentation is a difficult problem. Different approaches, ranging from classification techniques to sophisticated models that account for lighting conditions and material reflectance properties, can be found in the literature (e.g., Klinker et al., 1988; Priese & Rehrmann, 1993; Perez & Koch, 1994). We use a rather simple color classification approach in the system QUASI-ACE, which is outlined in Section 5.3.1. Currently, the lighting conditions are fixed, and the limited set of Baufix colors is pretty well distinguishable. Results are presented in Section 5.3.2. The computed color classifications for the image region of an object are assigned to the color attribute of an object instance. So far, no classification score is recorded. Therefore, the qualitative description for the color of an object is initialized with the score 1 for the category of the classified color and with the score 0.01 for all other color categories. The fuzzy vector is then normalized.

We now report the results of our study on the use of color, size, and shape adjectives in object descriptions and assembly dialogs while constructing an airplane like the one shown in Fig. 2.1.

Object Descriptions

In order to obtain natural object descriptions from naive subjects (i.e., people who are not familiar with the Baufix domain), a series of experiments was conducted. In the first part, 27 subjects were asked to verbally describe 34 Baufix objects separately (Arbeitsmaterialien B1a, 1994). The scene context varied (isolated objects vs. context); however, no difference was observed. The descriptions were often very detailed, and the subjects tended to name many (even more than applicable) object characteristics. The verbal descriptions were recorded and orthographically transcribed. For a first evaluation, two reviewers classified key terms for the object specifications in the given scenario. Table 5.2 shows the most frequently named color, shape, and size adjectives. The total number of words used in this sample is 13845. 8.3% of the words are color adjectives, 6.3% are shape adjectives, and 7.4% of the words are adjectives characterizing the size of the objects. In these detailed descriptions, the numbers of occurrences for adjectives of the three classes are similar.
Color (total: 1144 = 8.3%):
gelb (yellow) 182, rot (red) 181, blau (blue) 121, weiß (white) 110, grün (green) 77, hell (light) 63, orange (orange) 56, lila (purple) 42, naturfarben (natural colored) 34, holzfarben (wooden) 33, orangefarben (orange colored) 26, beige (beige) 13, lilafarben (purple colored) 9, violett (violet, purple) 8

Shape (total: 875 = 6.3%):
rund (round) 177, abgerundet (rounded) 68, sechseckig (hexagonal) 66, flach (flat) 62, rechteckig (rectangular) 37, hohl (hollow) 37, rautenförmig (diamond-shaped) 34, länglich (elongated) 24

Size (total: 1028 = 7.4%):
lang (long) 268, groß (big) 165, klein (small) 148, kurz (short) 124, breit (large, wide) 98, hoch (high) 84, dick (thick) 65, schmal (narrow) 25

Table 5.2: The most frequently named color, shape, and size adjectives in a set of 22 × 34 (784) descriptions of single objects. The total number of words used in the descriptions is 13845. Within the whole set, we count 1144 color adjectives, 875 shape adjectives, and 1028 size adjectives.
Assembly Dialogs (Human-Human)

The second sample consists of 18 dialogs in which pairs of subjects had to construct a Baufix airplane jointly (Arbeitsmaterialien B1b, 1994). For each dialog, one subject was assigned the role of the instructor and the second one that of the constructor. The experimental setting varied the instructor's view of the assembly platform (sight blocked; limited sight, i.e. only of the constructor but not of the assembly process; fully visible) as well as the model of the airplane available to the instructor (assembly plan vs. prebuilt plane). The number of occurrences of each of the key terms listed in Table 5.2 was counted in the whole sample and is listed in Table 5.3. Only the key terms actually used in the dialogs are listed.

Assembly Dialogs (Human-Computer)

For the third sample, human-computer communication was simulated in a Wizard-of-Oz scenario (Brindöpke et al., 1995; Brindöpke et al., 1996). The subjects took the role of the instructor. Instructions from 22 dialogs were orthographically transcribed. Table 5.4 lists the occurrences of the color, shape, and size key terms in this sample. The total number of words that are used is greater than in human-human dialogs. This reflects the fact that human-human communication can be successful with fewer words, i.e. it is more concise. The subjects tend to be more eloquent and to use more precise specifications than
Color (listed: 532 = 3.9%):
rot (red) 119, gelb (yellow) 98, orange (orange) 85, grün (green) 72, blau (blue) 58, weiß (white) 39, lila (purple) 15, violett (violet, purple) 4, holz (wooden) 3

Shape (listed: 76 = 0.6%):
rund (round) 32, eckig (angular) 30, sechseckig (hexagonal) 9, rautenförmig (diamond-shaped) 3, länglich (elongated) 1, flach (flat) 1

Size (listed: 99 = 0.7%):
lang (long) 52, kurz (short) 15, klein (small) 10, groß (big) 7, dick (thick) 5, breit (large, wide) 3, hoch (high) 2, schmal (narrow) 1

Table 5.3: Number of occurrences for the key terms listed in Table 5.2 in the human-human assembly dialogs. The total number of words of this sample is 13726. The dominance of color adjectives is obvious.
in communications with human partners. In both kinds of instructions the object color is the most frequently specified attribute. There is more variety in the adjectives used in descriptions than in instructions.

Color (listed: 2465 = 7.5%):
rot (red) 556, gelb (yellow) 528, blau (blue) 430, orange (orange) 330, weiß (white) 232, grün (green) 223, lila (purple) 90, violett (violet, purple) 52, holz (wooden) 18, hell (light) 6

Shape (listed: 318 = 1%):
rund (round) 128, eckig (angular) 106, sechseckig (hexagonal) 45, rautenförmig (diamond-shaped) 17, rechteckig (rectangular) 9, länglich (elongated) 5, viereckig (quadrangular) 4, flach (flat) 2

Size (listed: 192 = 0.6%):
klein (small) 104, lang (long) 30, groß (big) 19, dünn (thin) 11, dick (thick) 8, schmal (narrow) 8, kurz (short) 6, breit (large, wide) 2

Table 5.4: Number of occurrences for the color, shape, and size key terms in human-computer instructions. The total number of words of this sample is 32450. Again, color is the object property named most often in this sample.
5.3.1 Color Classification

Currently, eleven different classes of colors are distinguished: background color, black, wooden, red, yellow, blue, green, orange, white, purple, and ivory.
The HSI (Hue-Saturation-Intensity) color space shows a color ordering which is based on human color perception. Especially the hue values are favorable for color segmentation (Perez & Koch, 1994). A pixelwise color classification is performed by a polynomial classifier of 6th degree on HSI color images. Subsequent smoothing operations and region labeling lead to color-segmented images. Polynomial classification is described in the context of statistical pattern classification in the following.

Statistical pattern recognition techniques are characterized by considering patterns as high dimensional random variables (Duda & Hart, 1972; Niemann, 1981; Sagerer et al., 1996). Given the measurement of an entity to be classified, a feature vector is computed. This point in the $N$-dimensional feature space is used as the argument of a decision function. In our color classification task, we use feature vectors $\mathbf{c}$ in the three-dimensional HSI-space for each pixel. The task is then to construct a mapping from feature vectors into a set of indices characterizing the classes $\Omega_k$. The decision function $D$ yields

    D(\mathbf{c}) = \Omega_k, \quad k \in \{1, \ldots, K\}.     (5.3)

In order to optimize this function with respect to a given learning sample, it is convenient to use a decision vector in the following way:

    \mathbf{d}(\mathbf{c}) = (d_1(\mathbf{c}), \ldots, d_K(\mathbf{c}))  \quad \text{with} \quad  \sum_{k=1}^{K} d_k(\mathbf{c}) = 1.     (5.4)

The choice of an optimization criterion determines the functions $d_k$. There are two classical approaches: minimizing a cost function and approximation of the perfect decision function.

1. To minimize the cost of a decision it is required that the density functions $p(\mathbf{c}\,|\,\Omega_k)$, the a priori probabilities $p_k$, and the pairwise error classification losses $r_{kl}$, with $0 \le r_{kk} < r_{kl} \le 1$, are known. $r_{kl}$ denotes the loss incurred for classifying a pattern belonging to class $\Omega_l$ into class $\Omega_k$. The average cost evoked by the decision function is therefore given by

    V(\mathbf{d}) = \int \sum_{k=1}^{K} p_k \sum_{l=1}^{K} r_{lk}\, p(\mathbf{c}\,|\,\Omega_k)\, d_l(\mathbf{c})\; d\mathbf{c}.     (5.5)

The cost is minimal if the decision function is chosen to be:

    d_k(\mathbf{c}) = \begin{cases} 1 & \text{if } k = \arg\min_l \big\{ \sum_{j=1}^{K} r_{lj}\, p_j\, p(\mathbf{c}\,|\,\Omega_j) \big\} \\ 0 & \text{otherwise} \end{cases}, \qquad D(\mathbf{c}) = \arg\max_k \{ d_k(\mathbf{c}) \}.     (5.6)
Based on this general decision optimization, special variants of classifiers can be derived. By restricting the loss to $r_{kk} = 0$ and $r_{kl} = 1,\; l \neq k$, we obtain the Bayes classification rule of a maximum a posteriori probability. Fixing $p(\mathbf{c}\,|\,\Omega_k)$ to be Gaussian results in the normal distribution decision rule.

2. The perfect decision function

    \delta_k(\mathbf{c}) = \begin{cases} 1 & \text{if } \mathbf{c} \in \Omega_k \\ 0 & \text{otherwise} \end{cases}     (5.7)

is approximated by a polynomial decision rule. This function is defined according to the learning sample. The decision functions $d_k$, which approximate the perfect decision, make use of polynomial expansions of the feature vectors $\mathbf{c}$. Given an arbitrary but fixed polynomial expression $\mathbf{x}(\mathbf{c})$ over the coefficients of $\mathbf{c}$, the decision functions are expressed by

    d_k(\mathbf{c}) = \mathbf{a}_k^T\, \mathbf{x}(\mathbf{c}).     (5.8)

Rewriting in vector form leads to

    \mathbf{d}(\mathbf{c}) = \mathbf{A}^T\, \mathbf{x}(\mathbf{c}).     (5.9)

Learning or adjusting the classification rule is equivalent to the estimation of the parameter matrix $\mathbf{A}$. According to the Weierstrass theorem, arbitrary functions can be approximated, where the accuracy only depends on the degree of the polynomial $\mathbf{x}(\mathbf{c})$. The optimal matrix $\mathbf{A}$ is the one which minimizes the error between the perfect and the estimated decision rule. Therefore, it has to meet the criterion:

    \varepsilon(\mathbf{A}) = \min_{\mathbf{A}} E\{ (\boldsymbol{\delta}(\mathbf{c}) - \mathbf{A}^T \mathbf{x}(\mathbf{c}))^2 \}.     (5.10)

A closed-form solution can be achieved, resulting in the simple expression

    \mathbf{A} = \Big( \frac{1}{N} \sum_{j=1}^{N} \mathbf{x}(\mathbf{c}_j)\, \mathbf{x}(\mathbf{c}_j)^T \Big)^{-1} \Big( \frac{1}{N} \sum_{j=1}^{N} \mathbf{x}(\mathbf{c}_j)\, \boldsymbol{\delta}(\mathbf{c}_j)^T \Big).     (5.11)

The only assumption is that the matrix $\frac{1}{N} \sum_{j=1}^{N} \mathbf{x}(\mathbf{c}_j)\, \mathbf{x}(\mathbf{c}_j)^T$, which has to be inverted, is not singular. This is not a serious problem if a representative learning sample, and thus a sufficient number of feature vectors and their corresponding classes, are available.

Both classification rules depend on the learning sample. The semantics of a domain is reflected by the training sample and the perfect decision rule. An implicit distributed representation of the entities that are classified is used. The parameters of the decision rule result from an optimization process. In the formalism above, we presented an off-line estimation. However, there
exist recursive estimation procedures for both approaches. They can be applied in supervised or unsupervised training. In the latter case, a sufficiently precise initial estimation is required. A new feature vector is then classified according to the present parameter estimation. The parameters for a class are updated with each new feature vector classified as belonging to this class. Besides the direct classification, the mechanism of randomized decision is also commonly found. The randomized decision takes the values of the decision vector $\mathbf{d}(\mathbf{c})$ into account as probabilities for choosing the actual class. Direct classification and randomized decision are both optimal with respect to the chosen criteria.
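The least-squares estimation of Eq. 5.11 and the subsequent direct classification can be summarized in a short sketch. The following Python fragment is an illustration of the principle only, not the classifier used in QUASI-ACE; the quadratic feature expansion, the toy HSI values, and the class labels are assumptions chosen for brevity (the classifier described above uses a 6th-degree expansion of the HSI values).

import numpy as np

def expand(c):
    # Polynomial expansion x(c) of an HSI feature vector c = (h, s, i).
    # A quadratic expansion is used here for brevity.
    h, s, i = c
    return np.array([1.0, h, s, i, h*h, s*s, i*i, h*s, h*i, s*i])

def train(features, labels, num_classes):
    # Estimate the parameter matrix A of Eq. 5.11:
    # A = (1/N sum x(c_j) x(c_j)^T)^(-1) (1/N sum x(c_j) delta(c_j)^T).
    X = np.array([expand(c) for c in features])     # N x M design matrix
    D = np.eye(num_classes)[labels]                 # N x K perfect decisions delta(c_j)
    N = len(features)
    M1 = X.T @ X / N
    M2 = X.T @ D / N
    # lstsq is used instead of a plain inverse so that the tiny toy sample below
    # (for which M1 is singular) still works; the text above assumes a
    # representative learning sample for which M1 is invertible.
    return np.linalg.lstsq(M1, M2, rcond=None)[0]   # M x K

def classify(A, c):
    # Direct classification: the class with the maximal decision value d_k(c).
    d = A.T @ expand(c)                             # decision vector d(c) = A^T x(c)
    return int(np.argmax(d))

# toy usage with made-up HSI pixel values of two hypothetical color classes
features = [(0.02, 0.9, 0.5), (0.03, 0.8, 0.6), (0.60, 0.7, 0.5), (0.62, 0.8, 0.4)]
labels = [0, 0, 1, 1]                               # 0 = 'red', 1 = 'blue' (assumed labels)
A = train(features, labels, num_classes=2)
print(classify(A, (0.61, 0.75, 0.45)))              # expected: 1

In the same spirit, the lookup table mentioned in the next section can be filled by running classify over a discretized grid of HSI values once after training.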
5.3.2 Results

The polynomial color classifier was trained on images of 9 different setups. (The polynomial classifier was implemented and trained by Franz Kummert.) The images depict non-overlapping objects on a homogeneous dark grey background, and they were taken under controlled lighting conditions. For each setup, three images with a slight lighting variation were shot. The light of two metal halide (HQI) lamps was either directly reflected by a white screen above the scene, or the lamps were tilted to the left or to the right so that less light was directly reflected. With three images of size 502 × 566 pixels per setup, a total of 27 images was used for training. Shadows and specularities occurred even under the controlled lighting conditions.

All images of the training sample were segmented and labeled manually. The labeling was accomplished by assigning the object's color to the image region of each object. Each colored region may therefore contain shadows as well as specularities or even portions of the background when it was hard to separate object and background exactly. For a quick application, a lookup table was computed after the training, which contains the corresponding color class for each possible feature vector.

Due to time constraints, no separate test sample was taken to verify the classification performance quantitatively. The qualitative results of the color classification are good on images taken under the lighting conditions of the training sample. Misclassifications occur especially between the colors 'ivory' and 'wooden'. Other colors are classified very well. The color classification is performed pixelwise by using the computed lookup table. A subsequent smoothing operation with a median filter and a region segmentation lead to color-segmented images which finally yield very satisfying results.

A quantitative evaluation was only run on the training sample. (I would like to thank Christian Bauckhage for carrying out this evaluation.) Only those pixels which were manually classified as non-background were used for this test. This test can give an impression of how well the polynomial classifier is adjusted to the training sample. Table 5.5 shows the results. We see that the colors 'ivory' and 'wooden' are very often confused. This indicates the difficulty of separating the feature vectors of these two beige-brownish colors in HSI-space.
Color    #Pixel   white    red      yellow   orange   blue     green    purple   wooden   ivory    OK [%]
white    226050   152640   0        0        1        0        23       27       13088    238      68
ivory    106806   14155    0        310      2        0        2        40       71346    4        0
red      208973   0        168852   101      15176    8        0        16428    3653     0        80
yellow   160695   0        25       146777   2215     1        162      188      8984     0        91
orange   163524   0        24656    893      130326   4        0        1792     4351     0        80
blue     164452   16       2        6        3        152222   554      1113     90       0        93
green    135220   18       3        43       2        294      123397   73       307      1        91
purple   100196   0        906      0        23       489      7        79542    209      0        79
wooden   341070   932      18       857      319      0        29       1958     321919   65       94

Table 5.5: Results for color classification on the training sample. Each line indicates, for all pixels of one color, the number of classifications into each of the color classes. The column entitled OK indicates the percentage of correct classifications.
In Bayesian networks, a conditional probability table $P(X\,|\,\mathrm{parent}(X))$ is associated with each arc. The conditional probability table $M_{c_i|o}$ (arc Object $\rightarrow$ Color_image, see Chapter 6 and Fig. 6.3) is estimated in our implementation from the results which are reported here.
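A row-wise normalization of the counts in Table 5.5 is one straightforward way to obtain such a table. The following sketch is illustrative only; the chosen color order, the handling of pixels that fall into none of the listed classes, and the variable names are assumptions, not the exact procedure used for Fig. 6.3.

# Estimate P(observed image color | true object color) from the confusion
# counts of Table 5.5 by normalizing each row (a hypothetical simplification).
COLORS = ["white", "red", "yellow", "orange", "blue", "green", "purple", "wooden", "ivory"]

# counts[true_color] = classifications into the classes listed in COLORS
counts = {
    "white": [152640, 0, 0, 1, 0, 23, 27, 13088, 238],
    "ivory": [14155, 0, 310, 2, 0, 2, 40, 71346, 4],
    # ... remaining rows of Table 5.5
}

def conditional_probability_table(counts):
    cpt = {}
    for true_color, row in counts.items():
        total = sum(row)                  # pixels assigned to any listed class
        cpt[true_color] = [n / total for n in row]
    return cpt

cpt = conditional_probability_table(counts)
print(cpt["white"])                       # P(image color | object is white)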
5.4 Spatial Relations

Describing the spatial location of an object is another means of specifying an object. Type, color, size, and shape are unary relations. They can only be used to describe object properties but not to describe objects in the context of others. Spatial relations can be used to discriminate an object uniquely by its location in a scene, which is important in our scenario. Linguistically, spatial relations are expressed either by prepositions or by adverbial constructions.

Psycholinguistic investigations about the use and interpretation of spatial prepositions show that they can be rather flexible and varying for one and the same spatial configuration (cf. Klein, 1979; Herrmann, 1990; Grabowski et al., 1993). Following Hayward & Tarr (1995), spatial prepositions show fuzziness, they do not exclude each other, and they overlap. Their use varies continuously depending on influencing determinants (Herrmann & Grabowski, 1994), and most of the perceptively available information is irrelevant for the use of spatial prepositions (Talmy, 1983).

Spatial relations describe qualitatively the spatial location of the intended object (IO) relative to a reference object (RO). Besides the two objects involved, a reference frame is essential for an adequate communication (Retz-Schmidt, 1988b; Herrmann, 1990). Up to three different reference frames must be taken into account in our scenario: two deictic reference frames representing the field of view of each partner in the assembly scenario and an intrinsic reference frame if the RO has an intrinsic orientation. Furthermore, the objects of our domain do not show ego-motion, so the scene will be rather static over a certain period of time, whereas
the speaker could easily move around in the scene and change reference frames whenever he or she wants to.

The assembly scenario is a 3D scenario, and object localizations in 3D can be addressed by the instructor. As the instructor can move around in the scene, the views of QUASI-ACE and the instructor are decoupled. The view of the instructor can be inferred from 3D data of the scene. According to these fundamentals, we developed a three-dimensional computational framework for spatial relations (Scheering, 1995; Fuhr et al., 1995; Fuhr et al., 1997). It can be used to generate qualitative spatial information about the scene from numerical 3D data as well as to understand the qualitative spatial information given in instructions, i.e. given the 3D description of one object, the admissible 2D image region of the other object can be inferred. The structure of the computational spatial model is shown in Fig. 5.3.
Fig. 5.3: Structure of the computational model for spatial relations.
We are aware of the fact that the meaning of spatial prepositions can usually not be modeled by single relations (Herskovits, 1986). However, our spatial relations are a first approximation to the ‘ideal meaning’ of the corresponding prepositions. So far, we have concentrated on the spatial relations left, right, above, below, behind, and in-front. The qualitative description of the ‘spatial relation’ of an object pair is represented by a 6-dimensional vector, where each component of the vector represents a degree of applicability for one of the six spatial relations for that object pair. It is left to the Bayesian inference machine (see Chapter 6) to determine the spatial relations which are finally used or applied.
5.4.1 Related Work

Spatial relations are investigated in a variety of contexts, e.g., geographical information systems, qualitative reasoning, linguistics, visualization of spatial constellations, and human-computer interaction (HCI). Our approach has been designed to contribute to a system in the context of HCI. Most work here either investigates computational models or cognitive and linguistic aspects of spatial relations.

A number of computational spatial models can be found. A good survey is given by Mukerjee (1996). We want to mention just a few examples which are relevant for our computational model: Abella & Kender (1993) present a computational model for spatial relations between image regions, Peuquet & Ci-Xiang (1987) model the orientation of polygons in a plane, Olivier & Tsujii (1994) designed the WIP-system for the visualization of the spatial relations left, right, bottom, top, front, and back, and Gapp (1994, 1996) presents a system for the generation of these relations for 2D and 3D scenes. The goal of the systems of André et al. (1988) and Wazinski (1993) is to derive symbolic descriptions from 2D or 3D spatial configurations.

Our approach is mainly influenced by the work of Hernández (1993). In the realization of his purely qualitative 2D model he uses 20 'abstract' qualitative relations to model different sets of orientation relations at different levels of granularity. Each orientation relation is defined a priori as some specific set of the abstract relations. This 'common grounding' allows the system to detect overlapping or contradicting meanings of relations of different granularity. We adapted this idea by defining spatial relations as sets of acceptance relations. In contrast to Hernández (1993), however, these definitions are dynamically computed accounting for the current reference frame.

Abstractions of objects may be used for reasons of computational efficiency, and they are found in all approaches. They should preserve the extension and at least basic features of the shape of both the reference object (RO) and the intended object (IO). There is currently no strong psychological evidence that the objects' shape and extension do not influence the applicability of prepositions in certain configurations, although, e.g., Landau & Jackendoff (1993) are often interpreted that way. Gapp (1994, 1996) and Olivier & Tsujii (1994) abstract an object by its center of mass only. Abella & Kender (1993) take bounding boxes collinear to the image axes for calculating the relations above and below. This is a rather coarse object abstraction. For the modeling of the 2D relation aligned they use the actual object's shape since it captures the main directions of the object's extension.

Several systems (André et al., 1988; Hernández, 1993; Gapp, 1994) account for the fact that spatial models which support human-computer interaction must be able to deal with different reference frames to allow for an adequate communication with the human partner (Retz-Schmidt, 1988b; Herrmann, 1990).

Gapp (1994, 1996) discusses the aspect of object identification on the basis of given spatial relations. He only considers a unidirectional image interpretation. He starts from the assumption that all objects are recognized and that their 3D poses are estimated. The localization of
an object with spatial relations is then based on the efficient search in the set of recognized objects according to the given spatial relation. Unlike Gapp's approach, however, we envision a computational model where a bidirectional control-flow between lower-level recognition tasks and higher-level interpretation tasks is possible. Consequently, we support the identification of objects by deriving the region where the object is to be expected in the image. The benefit we expect is that the low-level efforts can be much better tailored to what is actually needed by the higher-level interpretation processes.

Herskovits (1985) and Landau & Jackendoff (1993) address spatial relations from a linguistic point of view, and they stress the complexity of the meaning of spatial relations in general situations. The model of Herskovits (1986) assumes an 'ideal meaning' for each spatial relation which is transformed according to the context into the current meaning of that relation. Using limited domains, the problem becomes less complex.

Abella & Kender (1993), André et al. (1988), Gapp (1994, 1996), and Wazinski (1993) correctly emphasize that spatial relations must be graded. In the work of Gapp (1994, 1996) and Olivier & Tsujii (1994), the relative orientation between objects and its scoring is derived from measuring the angle between the line connecting the RO and the IO and the reference axis of interest. Gapp (1994, 1996) determines this angle in an RO-specific coordinate system, thus reflecting the relative size of the objects. The larger the angle, the smaller the score of the corresponding spatial relation.
5.4.2 A Computational Spatial Model

Input to our computational model (see Fig. 5.3) are the 3D coordinates of objects which are estimated in the 3D reconstruction (see Chapter 4). An object is abstracted by a bounding cuboid collinear to the object's principal axes. Fig. 5.4 shows an example of a reconstructed scene (cf. Chapter 4) and the abstraction of the objects.

We start by constructing a qualitative spatial representation of the scene that is independent of any reference frame. This is based on object-specific acceptance relations that are induced by acceptance volumes partitioning the 3D space in an object-specific way. Then, reference frames are used to compute meaning definitions for spatial relations as sets of acceptance relations. These meaning definitions are finally applied to the qualitative spatial representation to compute graded degrees of applicability for spatial relations.

Our model is suited for bidirectional processing. The localization of the IO in the 2D images according to a spatial relation can be inferred from the 3D pose of the RO and a given meaning of that spatial relation. With this information, we can determine those acceptance volumes where the IO must be positioned according to the given meaning of the spatial relation. The projection of these acceptance volumes leads to image regions that restrict the localization of the IO in the image.
Fig. 5.4: Model-based 3D reconstruction from a stereo image and the object abstraction by the bounding cuboid collinear to the object’s principal axes. The numbers in one of the stereo image frames are used to refer to the corresponding objects later in this section.
5.4.3 Generation of Spatial Relations for 3D Objects

Object-specific Acceptance Relations

Let OBJ denote the set of objects in the scene. Each object $O \in OBJ$ is assumed to partition the 3D space in its own way. Based on the object's bounding box $B_O$, the 3D space is divided into 3D acceptance volumes $AV_i^O$ ($0 \le i \le n$, $n \in \mathbb{N}$). In our current approach we use 79 acceptance volumes for each object. The box itself forms an acceptance volume; examples of the other 78 are shown in Fig. 5.5. The 3D space is not uniformly partitioned by the object because we assume that object displacements near edges (vertices) affect the applicability of spatial relations more strongly than changes of object positions along box sides (edges). A direction vector $\mathbf{d}(AV_i^O)$ is associated with each acceptance volume. It approximates the direction to which the side (edge, vertex) corresponding to $AV_i^O$ faces in space. The vectors we currently apply are shown in Fig. 5.5.

Each acceptance volume $AV_i^O$ induces a binary object-specific acceptance relation $r_i^O$: $(P, O) \in r_i^O \Leftrightarrow B_P \cap AV_i^O \neq \emptyset$. For each $P \in OBJ$ and acceptance relation $r_i^O$, a degree of containment $\delta(P, r_i^O) \in [0, 1]$ is defined as

    \delta(P, r_i^O) = \frac{\mathrm{vol}(B_P \cap AV_i^O)}{\mathrm{vol}(B_P)},     (5.12)

the relative volume of the part of the bounding box $B_P$ of object $P$ lying in $AV_i^O$.
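For axis-aligned boxes, the degree of containment of Eq. 5.12 reduces to an intersection-volume computation. The sketch below is a simplified illustration under the assumption that an acceptance volume can be represented by (possibly unbounded) axis-aligned intervals; the real acceptance volumes of Fig. 5.5 are more general infinite volumes, so this is not the exact geometry used in the system.

import math

# A box or acceptance volume is given as ((xmin, xmax), (ymin, ymax), (zmin, zmax));
# acceptance volumes may use -math.inf / math.inf for unbounded extents.
def overlap(interval_a, interval_b):
    lo = max(interval_a[0], interval_b[0])
    hi = min(interval_a[1], interval_b[1])
    return max(0.0, hi - lo)

def degree_of_containment(box_p, acceptance_volume):
    # delta(P, r_i^O): relative volume of B_P lying inside AV_i^O (Eq. 5.12).
    inter = 1.0
    vol_p = 1.0
    for axis in range(3):
        inter *= overlap(box_p[axis], acceptance_volume[axis])
        vol_p *= box_p[axis][1] - box_p[axis][0]
    return inter / vol_p

# toy example: an IO bounding box partly inside a half-space "behind" an RO box
io_box = ((0.0, 2.0), (0.0, 1.0), (0.0, 1.0))
av_behind = ((1.5, math.inf), (-math.inf, math.inf), (-math.inf, math.inf))
print(degree_of_containment(io_box, av_behind))   # 0.25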
Fig. 5.5: Infinite 3D acceptance volumes attached to an object's bounding box $B_O$: (a) the six acceptance volumes bound to a vertex, (b) the two acceptance volumes at an edge, and (c) the acceptance volume defined by a side of the bounding box.
Qualitative Spatial Representation

Now we have introduced all necessary prerequisites to represent a scene independently of the reference frames usually applied in human communication. Fig. 5.6 shows an example of such a representation constructed in a 2D scene for simpler illustration. From Fig. 5.6 we observe that object P lies in three acceptance volumes of O. This representation explicitly reflects the physical constellation of the objects in 3D. Since usually only few objects move at the same time, only few acceptance relations must be updated from image frame to image frame of the same scene. Changes of reference frames yield no changes of this representation.

Meaning Definition for Spatial Relations

For the spatial relations in 3D, left, right, behind, in-front, above, and below, we use a reference frame which is defined by three distinct axes: left-right axis, front-back axis, and bottom-top axis. From these axes we derive 6 vectors $\mathbf{r}_{rel}$, each describing the direction of one spatial relation.

We define the meaning $def(ref, rel, RO)$ of a spatial relation as a set of acceptance relations, where an acceptance relation $r_i^{RO}$ is incorporated in the definition set if the inner product $\langle \mathbf{d}(AV_i^{RO}) \,|\, \mathbf{r}_{rel} \rangle > 0$. Each acceptance relation $r_i^{RO}$ chosen for the definition is associated with a degree of accordance

    \gamma(ref, rel, r_i^{RO}) = 1 - \frac{2}{\pi} \arccos\!\big( \langle \mathbf{d}(AV_i^{RO}) \,|\, \mathbf{r}_{rel} \rangle \big)     (5.13)

that expresses how well it is compatible with the meaning of rel. As an example, we regard the definition set for behind in Fig. 5.7.
Fig. 5.6: (a) Object-specific partitioning of 2D space using 13 acceptance volumes; (b) P lies in three different acceptance volumes. The degrees of containment are written in brackets.
We chose the degree of accordance $\gamma$ to linearly decrease with an increasing inner angle between $\mathbf{d}(AV_i^{RO})$ and $\mathbf{r}_{rel}$. At the current stage this yields satisfying results. However, the effects of non-linear functions that prefer small inner angles need more attention.

Generating Spatial Relations

Based on the dynamically calculated meaning definitions, the degree of applicability $\mu(ref, rel, IO, RO)$ of a spatial relation rel for a given intended object IO w.r.t. a reference object RO is determined in the following way:

    \mu(ref, rel, IO, RO) = \sum_{r_i^{RO} \in def(ref, rel, RO)} \gamma(ref, rel, r_i^{RO}) \cdot \delta(IO, r_i^{RO}).     (5.14)

Fig. 5.8 shows the degree of applicability for the spatial relation behind for the given IO, RO, and reference frame.
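A direct transcription of Eqs. 5.13 and 5.14 can be sketched in a few lines. Representing the meaning definition as a list of (volume id, accordance) pairs is a simplification assumed here for illustration; in the system the containments come from the qualitative spatial representation and the definition set is computed from the current reference frame.

import math

def degree_of_accordance(direction, r_rel):
    # gamma = 1 - (2/pi) * arccos(<d(AV_i^RO) | r_rel>) for unit vectors (Eq. 5.13).
    dot = sum(a * b for a, b in zip(direction, r_rel))
    dot = max(-1.0, min(1.0, dot))            # guard against rounding errors
    return 1.0 - (2.0 / math.pi) * math.acos(dot)

def degree_of_applicability(definition_set, containments):
    # mu = sum over the definition set of gamma * delta (Eq. 5.14).
    # definition_set: list of (volume_id, gamma) pairs, gamma from Eq. 5.13;
    # containments: dict volume_id -> delta(IO, r_i^RO) from Eq. 5.12.
    return sum(gamma * containments.get(vid, 0.0) for vid, gamma in definition_set)

print(degree_of_accordance((0.0, 1.0, 0.0), (0.0, 1.0, 0.0)))   # 1.0 for parallel vectors

# toy reproduction of the example of Fig. 5.8
meaning_behind = [("v1", 0.45), ("v2", 0.8), ("v3", 0.8)]
delta_io = {"v1": 0.15, "v2": 0.65, "v3": 0.2}
print(degree_of_applicability(meaning_behind, delta_io))        # 0.7475, about 0.75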
Fig. 5.7: Meaning definition for the spatial relation behind using the indicated reference frame: A 2D scene is chosen for better visibility. RO corresponds to O in Fig. 5.6. The acceptance volumes of the acceptance relations chosen for the meaning definition set are shaded. The darker the shading the higher the degree of accordance of the corresponding acceptance relation. The degrees of accordance are written in brackets.
5.4.4 Results

We ran various tests with the computational model on different sets of images and on simulated data. Here, we show as an example the qualitative results we get for five objects in the scene depicted in Fig. 5.4. The chosen reference frame aligns with the position of the camera as vantage point to allow for an easy verification of the results from the images of the scene. Each of the five numbered objects in Fig. 5.4 is used as intended object and as reference object. The degrees of applicability of the six computed spatial relations are presented in Table 5.6 for every object pair. The value of the best-scored spatial relation for an object pair is marked with an asterisk. The computed spatial relations correspond to our expectations.

We are dealing with real and therefore noisy data. The reconstructed 3D data are for that reason slightly erroneous. In particular, the reconstruction process does not take into account
Fig. 5.8: Degree of applicability for the spatial relation behind for the IO w.r.t. the RO using the indicated reference frame: $\mu(ref, behind, IO, RO) = 0.15 \cdot 0.45 + 0.65 \cdot 0.8 + 0.2 \cdot 0.8 \approx 0.75$. The degrees of containment are printed in bold type in brackets and the degrees of accordance are indicated in non-bold in brackets.
IO  RO   left    right   above   below   behind   in-front
 2   1   0.05    0.21    0.05    0.01    0.71*    0
 3   1   0.64*   0       0.06    0.08    0.30     0
 4   1   0.29    0       0.10    0.05    0.67*    0
 5   1   0.45    0       0.14    0       0.49*    0
 3   2   0.39    0       0.08    0.07    0        0.55*
 4   2   0.26    0       0.15    0.07    0        0.64*
 5   2   0.24    0       0.21    0       0        0.66*
 4   3   0       0.42    0.12    0.03    0.53*    0
 5   3   0       0.72*   0.15    0.01    0        0.20
 5   4   0       0.20    0.04    0.07    0        0.75*
 1   2   0.25    0       0.14    0       0        0.68*
 1   3   0       0.58*   0.04    0.01    0        0.41
 1   4   0       0.25    0.05    0       0        0.74*
 1   5   0.01    0.36    0       0.01    0        0.63*
 2   3   0       0.56*   0.13    0.01    0.40     0
 2   4   0       0.60*   0.02    0.02    0.39     0
 2   5   0       0.34    0       0.02    0.65*    0
 3   4   0.36    0       0.01    0.03    0        0.63*
 3   5   0.78*   0       0.01    0.11    0.11     0.05
 4   5   0.18    0.04    0.03    0.06    0.74*    0

Table 5.6: Degrees of applicability for the six computed spatial relations for the objects 1 to 5 shown in Fig. 5.4. Each object functions as reference object and as intended object. The value of the best-scored spatial relation for each object pair is marked with an asterisk.
that all objects are lying on a planar table. Slight differences in the alignment of the objects may therefore occur. This explains the fact that we get positive degrees of applicability for the spatial relations above and below for objects lying in the same plane.

Spatial relations are not always invertible. The rhomb-nut (object 4) is in-front of the 5-holed-bar (object 2) if we follow the degrees of applicability that are scored best. Using the rhomb-nut as reference object, however, we obtain the best score for the spatial relation right. This example shows the difficulty of choosing the spatial relation that applies to an object pair. The qualitative description of the spatial relation of an object pair is represented by the 6-dimensional vector of the degrees of applicability of all 6 computed relations. The Bayesian inference machine (see Chapter 6) decides which of the 6 relations are finally used.

No notion of distance is included in the spatial model so far. It is computed that object 1 is very well in-front of object 2, whereas a human who would refer to the object in-front of object 2 would probably rather mean the upright bar or the rhomb-nut.

Another set of experiments with the computational model is focused on the degree of applicability for varying IO and RO positions and investigates the behavior of $\mu(ref, rel, IO, RO)$. As object pair we picked the lying 3-holed-bar (object 1) and the tire (object 3) in the scene shown in Fig. 5.4. The first experiment simulates a rotation of the tire (object 3) around the 3-holed-bar (object 1). Object 1 is the reference object (RO) and object 3 (the IO) is rotated on a full circle in the plane spanned by $\mathbf{r}_{left}$ and $\mathbf{r}_{behind}$. We chose again the camera view as reference frame. The radius of rotation is the distance between RO and IO, i.e. the distance is kept constant. The degrees of applicability for the spatial relations left, right, behind, and in-front are plotted in Fig. 5.9 as functions of the rotation angle. The graded variation of the degrees of applicability is obvious. The discontinuities in the functions for left and right are due to the discrete acceptance volumes causing discontinuities in the variation of the degrees of accordance. The functions for left and right are wider than those for behind and in-front because of the shape and the position of
Fig. 5.9: Behavior of $\mu(ref, rel, 3, 1)$ for $rel \in \{$left, right, behind, in-front$\}$ in the simulation of a 360° rotation of the IO around the RO: The frame of reference remains stable. The scene and the numbering of the objects are chosen as in Fig. 5.4.
the RO. With the reference frame being used, the left and right sides of the bar are larger than the front and rear sides. The maximum applicability for one relation is at about 0.8; none of the direction vectors $\mathbf{d}(AV_i^1)$ is parallel to the vectors defining the selected reference frame.

In the second experiment, we investigated the effect of the object abstraction and its shape and extension on the computed degrees of applicability. We look again at the object pair (1,3). We start by addressing the abstraction of a cylindrical object as a cuboid. For this abstraction, there is one degree of freedom: the abstracting bounding cuboid can be rotated around the normal through the center of mass of the cylindrical object. The effect of the rotation of a cylindrical RO is shown in Fig. 5.10c as a function of the rotation angle. The relative heights of the degrees of applicability are not affected. In Fig. 5.10a the object pair (1,3) is shown in the initial position with the acceptance volumes that contain parts of the IO and the direction vectors of these acceptance volumes. If RO and IO are switched, i.e. the bar is the RO which is rotated, we see in Fig. 5.10d that the functions of the degrees of applicability intersect depending on the rotation angle. This result indicates that shape and extension of the RO have a significant influence on the degrees of applicability in our model.
5.4.5 Understanding Spatial Relations for 2D Object Localization

The spatial information contained in the instructions can be used to restrict the location of the IO. The acceptance relations contained in the meaning definition can be inferred for a spatial relation, a reference object in 3D, and a reference frame. For each acceptance relation $r_i^{RO} \in def(ref, rel, RO)$ with a degree of accordance above some threshold $\theta \in [0, 1]$, the corresponding acceptance volume $AV_i^{RO}$ is projected onto the image. All resulting polygons are then merged to yield that image region in which the IO must be localized. Fig. 5.11 shows the projected acceptance volumes (darker regions) for the spatial relations right and in-front
Fig. 5.10: Effect of the object abstraction and its shape and extension on the computed degrees of applicability: (a) The object abstraction of the object pair (1,3) with the acceptance volumes containing parts of the IO and the direction vectors of these acceptance volumes; the partitioning of the IO through the acceptance volumes is marked. (b) The object abstraction of the object pair (3,1); now object 3 is the RO. The acceptance volumes relevant for this object pair and their direction vectors are shown. (c) The degrees of applicability as a function of the rotation angle of the RO (object 3), which is rotated around its center of mass along its z-axis. (d) The degrees of applicability for a rotation of object 1 as the RO.
w.r.t. the socket (4). The origin of the reference frame is again the vantage point of the camera.
Fig. 5.11: Projected acceptance volumes $AV_i^4$ corresponding to the relations $r_i^4 \in def(ref, right, 4)$ with $\gamma(ref, right, r_i^4) > 0.2$ and $r_i^4 \in def(ref, in\text{-}front, 4)$ with $\gamma(ref, in\text{-}front, r_i^4) > 0.4$; the maximum value of $\gamma$ is found in the darkest region. The numbering of the objects is according to Fig. 5.4.
Using the threshold $\theta$, it is possible to control whether a spatial relation is intended in a stronger or weaker sense. The threshold prevents the projection of those acceptance volumes that only weakly coincide with the meaning of the spatial relation in the intended sense. Since the projection from 3D to 2D is not one-to-one, object positions according to the spatial relation as well as object positions not according to the spatial relation may be projected onto the same image point (see Fig. 5.11b).
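A minimal sketch of this restriction step might look as follows, assuming a pinhole camera given by a 3x4 projection matrix and acceptance volumes clipped to a finite working volume; returning per-volume 2D bounding rectangles instead of merged polygons is a simplification chosen for illustration only.

import numpy as np

def project(cam, point3d):
    # Project a 3D point with a 3x4 camera projection matrix (pinhole model).
    x, y, w = cam @ np.append(point3d, 1.0)
    return np.array([x / w, y / w])

def restriction_regions(cam, definition_set, accordances, theta, clip):
    # For every acceptance volume of the definition set whose degree of accordance
    # exceeds the threshold theta, clip it to a finite working volume, project its
    # eight corners, and return the 2D bounding rectangle of the projection.
    regions = []
    for volume_id, volume in definition_set.items():
        if accordances[volume_id] <= theta:
            continue
        bounds = [(max(lo, c_lo), min(hi, c_hi))
                  for (lo, hi), (c_lo, c_hi) in zip(volume, clip)]
        corners = np.array([project(cam, (x, y, z))
                            for x in bounds[0] for y in bounds[1] for z in bounds[2]])
        regions.append((corners.min(axis=0), corners.max(axis=0)))
    return regions    # list of (upper-left, lower-right) image rectangles

Merging these regions (in the system, the projected polygons themselves are merged) yields the image region in which the 2D object recognition has to search for the IO.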
5.4.6 Empirical Psycholinguistic Validation

For an empirical evaluation of the computational model, two controlled psychological experiments were run (Vorwerg et al., 1996; Vorwerg et al., 1997; Vorwerg, in prep.). In these experiments, subjects had to either generate spatial terms for describing the spatial relationship between object pairs, or they had to rate the applicability of spatial relations that were computed on the basis of the spatial model. The experiments are focused on investigating the organization of visual space in non-canonical spatial configurations, whereas most research in spatial cognition has focused on the canonical directions constituting the spatial framework
using reference objects in canonical orientations (e.g., Carlson-Radvansky & Irwin, 1993; Hayward & Tarr, 1995).

In both experiments, distance, orientation of the reference object, and position of the intended object (according to the computed acceptance volumes) varied systematically. Reference and intended objects with no intrinsic parts were used to avoid conflicting perspectives. All objects were located in one horizontal plane. The use of the German horizontal projective terms rechts (right), links (left), hinter (behind), and vor (in-front) was investigated. Stimuli in all experiments consisted of two objects, one of which was always in the center of the picture (the reference object: a 5-holed-bar) in one of four different orientations, and the other (the intended object: a cube) at one of 24 positions around the reference object and at one of three different distances. Fig. 5.12a-d illustrates a top view of the four different orientations of the RO and all positions of the IO that were displayed. The four orientations of the RO that we used included two orientations collinear with the reference system (one horizontal, see Fig. 5.12a, and one sagittal, see Fig. 5.12b) and two rotated orientations where the inertia axes deviate by 45° (see Fig. 5.12c,d). Stimuli were displayed in stereo mode on a Silicon Graphics Workstation using a pair of Crystal Eyes Stereo Glasses.

Before running these two experiments, we carried out a preliminary study in which subjects (N=36) had to name spatial relations in free speech in order to find out, among other things, what namings would be used. In 99.5% of the utterances, projective prepositions or adverbs were used. In German, two prepositions can be syntactically combined (e.g., links vor). Those combined namings were used in 61.5% of the utterances.

In experiment 1, 40 German native speakers rated the applicability of spatial prepositions. The chosen prepositions correspond to the spatial relation which was scored best by the computational model. When no single relation was scored best, the best two (equally scored) relations were combined. The IO/RO configurations explained above were used. To avoid answer bias and to have a reference system to evaluate the results, an equal number of distractors ("wrong answers") was introduced. Distractors had been constructed systematically by adding a second, either neighboring or contrasting relation, leaving out the best or second-best applicable relation, or by substituting a neighboring or contrasting relation. The displayed relation naming could be rated from 0 to 1 by the subjects using a variable scale.

Results show a generally high acceptance of the computed relations (mean $= 0.90$, $\sigma^2 = 0.17$; cf. distractors: mean $= 0.34$, $\sigma^2 = 0.39$). For some positions, we find distractors being rated the same as the computed relations (in case of combination with the second applicable relation). A multivariate Analysis of Variance (ANOVA) reveals a significant influence of orientation and distance on the rating of the computed relations, with no interaction between the two factors. By post hoc mean comparisons, we find that the largest distance is rated slightly (but significantly) better and that a sagittal orientation of the reference object is rated better than a horizontal one, which again is rated better than the two rotated orientations. The ANOVA yielded a significant influence of position on the rating as well. To investigate this dependency in detail, we subdivided the 24 positions into 6 position groups according
Fig. 5.12: (a) to (d): Prepositions chosen most often in experiment 2 (see text) for each configuration used (winner-take-all). A surrounding line indicates that two namings have been chosen equally often. (e) and (f): Exemplary results of the Kohonen Maps for the collinear and the rotated conditions. The dashed lines enclose the positions belonging to one cluster. Numbers indicate positions grouped together for an ANOVA.
Fig. 5.13: Exemplary comparison of the applicability degrees computed by the computational model and the relative frequency of the prepositions used in the forced-choice experiment (experiment 2, see text): (a) horizontal orientation (see Fig. 5.12a), (b) horizontal orientation rotated by 45° (see Fig. 5.12c).
to the six positions contained in each quadrant around the reference object. In the collinear conditions, we find that position groups are rated better the nearer they are to the left-right and front-back axes (1, 2 > 6 > 3, 4 > 5; see Fig. 5.12e). In the rotated conditions there are some position groups rated better than others as well (5 > 6, 3 > 1, 4 > 2; see Fig. 5.12f). Altogether, subjects' ratings show a significant correlation with the degrees of applicability computed by the computational spatial model.

In experiment 2, 20 German native speakers named the direction relations themselves by choosing the spatial prepositions on buttons which fitted the displayed spatial configuration best. They could choose from right, left, behind, in-front, below, above and their combinations, respectively. The results of experiment 2 again show that subjects tend to use combined direction terms; single direction expressions seemed to be used only in unambiguous cases. As for the collinear reference objects (see Fig. 5.12a,b), the positions in which most subjects use single expressions are located in the center of the computed relation applicability areas. Regarding the rotated reference objects (see Fig. 5.12c,d), two combined direction terms are used significantly more often, depending on the orientation of the reference object (e.g., left & in-front and right & behind in case of orientation (c); see Fig. 5.13b), indicating a stronger influence of the reference system on the partitioning of space. Otherwise (when the surrounding space is partitioned exclusively by the reference object) we would expect a symmetric distribution of spatial terms.
A more detailed analysis shows the overlap of applicability regions, taking into account the distribution of different projective terms over the positions of the intended objects. For the collinear orientations of the reference object, the peaks in the frequency of naming a certain preposition correspond well to the output of the computational model (see Fig. 5.13a). Regarding the rotated reference objects, we find a small systematic shift of the empirical peak and a larger proportion of one combined preposition, in the same way as in the winner-take-all results (see Fig. 5.13b). A clustering of the data obtained by means of a Self-Organizing Kohonen Map reveals that only for the non-rotated reference objects is it possible to find eight clear-cut applicability areas of the four prepositions and their respective combinations, whereas it is more difficult to distinguish applicability regions in the case of rotated reference objects. This result is supported by an Analysis of Variance using the information entropy obtained for each distance and RO orientation. There is a significantly greater entropy (that follows from uncertainty in naming) in the case of rotated reference objects as well as for the nearest distance between RO and IO (with no interactions between the two factors).
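The information entropy used in this analysis can be computed directly from the naming frequencies per condition. The sketch below shows this calculation under the assumption that the namings of one condition are available as a simple list; it is only an illustration, not the statistical software actually used for the analysis, and the example namings are invented.

import math
from collections import Counter

def naming_entropy(namings):
    # Information entropy (in bits) of the distribution of spatial terms
    # produced by the subjects for one distance/orientation condition.
    counts = Counter(namings)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# hypothetical namings for one rotated-RO condition: more spread, hence higher entropy
rotated   = ["links vor", "links", "vor", "links vor", "links", "vor", "links hinter"]
collinear = ["links", "links", "links", "links", "links vor", "links"]
print(naming_entropy(rotated), ">", naming_entropy(collinear))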
5.4.7 Discussion

The general high rating of the computed spatial relations shows the usability of the computational spatial model. In general, we find a good correspondence between the central tendencies of the subjects' namings and the computed spatial relations, particularly good for the collinear reference objects. Production data, especially the greater entropy found for the rotated reference objects (experiment 2), can explain the rating of these orientations as less applicable in experiment 1. There are obviously more uncertainties about how to name spatial relations in case of rotated reference objects. This interpretation is supported by the results of the Kohonen Maps, which show that subjects have no clear-cut applicability regions for the rotated reference objects.

The empirical findings provide evidence for the assumption of the computational model that the reference object's extensions in space (see Vorwerg, in prep.), as well as its rotation, must be taken into account. For a rotated reference object, there is obviously a conflict between the object's axes and the reference system's axes, which seems to be resolved by a tradeoff resulting in using the nearest part (corner or edge) of the reference object as a reference point. Subjects' data can be interpreted as using the axes imposed by the reference system, but shifting those half-line axes (left, right, front, and back) towards the nearest part of the reference object. The computational model's rotation of acceptance volumes may serve as an approximation for these cognitive processes in most positions. In some positions around rotated reference objects, however, the assumed half-line axes intersect acceptance volumes, resulting in a discrepancy between computed spatial relations and the terms used by the subjects. This could possibly be compensated by introducing a dynamic adaptation that distorts acceptance volumes in dependence on the reference system.

Subjects' frequent use of combined direction terms, which might in part be due to this special
syntactic possibility in German, can easily be fitted into the computational model by changing the criterion according to which two relations are combined (e.g., take all relations which are scored better than 0.3 and combine them in case there are two). Nevertheless, as the vector of the degrees of applicability of all computed relations is handed over to the Bayesian inference machine, the combination can be handled there. The fact that several positions exist where subjects rated combined and single terms equally indicates some flexibility in assigning spatial relations. Regarding the empirical evaluation of computational models, it must be taken into consideration that the uncertainty of subjects in naming a spatial relation has some influence on the applicability rating of presented spatial terms.

Through the investigations, questions about the adequacy of the computational model arise. There are strong hints that acceptance volumes which are bound to the RO do not correspond entirely to the human notion of spatial relations and prepositions. There are certain adjustments (dynamic adaptation, a non-linear degree of accordance, etc.) that can be made. It has also been shown in the experiments that there is a lot of uncertainty in the naming and rating of spatial prepositions, which is a subjective process. Therefore, even a perfectly adjusted computational model will receive both better and weaker ratings.

The guidelines for the design of the computational model were not only influenced by cognitive aspects of spatial relations but also by technical requirements for the system QUASI-ACE. A new contribution of our computational model is the introduction of an intermediate spatial representation which is independent of reference frames. It explicitly captures the physical object constellation in 3D space. The meaning definitions of spatial relations have another advantage: they allow us to test symbolically for overlapping or shared meanings of spatial relations used in the communication. This is of interest since the speaker may change the reference frame during the communication and refer to the same physical spatial configuration using different spatial relations. Also the uniform object abstraction and the simplicity of the computation of the degrees of applicability are advantages of the computational model.

A good trade-off between technical benefits and cognitive adequacy of the model must be accomplished. We believe that this is achieved as long as we use the results of the computational model for the general qualitative description of objects and have a decision calculus which infers the spatial relation(s) that are finally applicable in the context of the scene and the other object properties.
Chapter 6
Object Identification
The stages of image understanding, which have been described so far, are all carried out by deterministic functions. They form a transition between numerical input data and symbolic, qualitative descriptions. The actual identification of the object(s) which is/are intended in an utterance is based on these qualitative descriptions. Qualitative data are not as exact as numerical data. Furthermore, errors in the recognition process, ambiguities, slips of the tongue, insufficient specifications, and other inaccuracies may corrupt the qualitative descriptions. Exact deterministic decisions are therefore hard to make in real, noisy environments.

We have chosen a probabilistic approach for object identification to cope with these problems. Bayesian networks are used to identify the intended object(s) based on the fuzzified qualitative descriptions which result from speech and image understanding. In this chapter, we first introduce Bayesian networks in Section 6.1. We describe the general concept, propagation, and update mechanisms as well as related work. The design and implementation of our Bayesian network for object identification is then explained in Section 6.2. The acquisition of the represented knowledge is shown, and we conclude this chapter with results and a discussion of our approach.
6.1 Bayesian Networks

Bayesian networks are directed acyclic graphs in which nodes represent random variables and arcs signify the existence of direct causal influences between the linked variables (Pearl, 1988). Bayesian networks are an explicit representation of the joint probability distribution of a problem domain (a set of random variables $X_1, \ldots, X_n$), and they provide a topological description of the causal relationships among variables. If an arc $X_i \rightarrow X_j$ is established, then the probability of each state of $X_j$ depends on the state distribution of $X_i$. Bayesian networks offer a mathematically sound basis for inferences under uncertainty (Forbes et al., 1995). We use the terms node and random variable synonymously.
Associated with each arc is a conditional probability table (CPT) which provides the conditional probabilities of a node's possible states given each possible state of the parent which is linked by this arc. The CPTs therefore express the strength of the causal influences between the linked variables. If a node has no parents, the prior probabilities for each state of the node are given in the CPT.

The knowledge of causal relationships among variables is expressed by the presence or absence of arcs between the corresponding nodes. Furthermore, the conditional independence relationships implied by the topology of the network allow the joint probability distribution of all the variables in the network to be specified with exponentially fewer probability values than in the full joint distribution. When specific values are observed for some of the nodes in a Bayesian network, posterior probability distributions can be computed for any of the other nodes (Pearl, 1988).

Three parameters are attached to each node, representing the belief in the state as well as the diagnostic and causal support from incoming and outgoing links, respectively. Beliefs are updated taking into account both parents and children. The belief update is described in detail in the next subsection. Not only the belief update but also the propagation of evidence and support through the network is an important process in order to achieve global evidence, or an approximation of the global evidence, in the network. Some propagation mechanisms are sketched in Subsection 6.1.2. Related work, i.e. examples of implemented approaches which use Bayesian networks, is presented in Subsection 6.1.3.
6.1.1 Belief Update

We use a causal tree as the Bayesian network topology for object identification. Therefore, we consider here only this type of network. Bayesian networks in general are very well explained in (Pearl, 1988). We describe here the belief update for trees as given in (Pearl, 1988) and in
Fig. 6.1: A part of a generic causal tree.
(Russell & Norvig, 1995). Fig. 6.1 shows a generic causal tree. Node $X$ has a parent $U$ and children $\mathbf{Y} = \{Y_1, \ldots, Y_m\}$. In a tree there is no connection from $U$ to the $Y_j$ except through $X$.
We are assuming that $X$ is the query variable. We denote the total evidence by $\mathbf{e}$, which is regarded as emanating from a set of instantiated variables. Evidence means both bottom-up propagated evidence as well as top-down propagated virtual evidence or prior probabilities. The aim for each node is to compute $P(x\,|\,\mathbf{e}) = BEL(x)$, which is the belief of the node or random variable $X$ given the total evidence within the network. The fundamental property of Bayesian networks is that each random variable represented by a node must be conditionally independent of its non-descendants in the graph, given its parents. Therefore, $P(x\,|\,\mathbf{e})$ depends only on the evidence $\mathbf{e}_X^+$ connected to $X$ through its parents and on the evidence $\mathbf{e}_X^-$ connected to $X$ through its children, where $\mathbf{e}_X^+$ is called the causal support and $\mathbf{e}_X^-$ is called the diagnostic support. Thus, we have

    P(x\,|\,\mathbf{e}) = P(x\,|\,\mathbf{e}_X^-, \mathbf{e}_X^+).     (6.1)

To separate the contributions of $\mathbf{e}_X^+$ and $\mathbf{e}_X^-$, we apply Bayes' rule keeping $\mathbf{e}_X^+$ as fixed background evidence:

    P(x\,|\,\mathbf{e}_X^-, \mathbf{e}_X^+) = \frac{P(\mathbf{e}_X^-\,|\,x, \mathbf{e}_X^+)\, P(x\,|\,\mathbf{e}_X^+)}{P(\mathbf{e}_X^-\,|\,\mathbf{e}_X^+)}.     (6.2)

Because $\mathbf{e}_X^-$ and $\mathbf{e}_X^+$ are conditionally independent given $x$, and $1/P(\mathbf{e}_X^-\,|\,\mathbf{e}_X^+)$ can be treated as a constant, we obtain, with $\alpha$ as normalizing constant,

    P(x\,|\,\mathbf{e}) = BEL(x) = \alpha\, P(\mathbf{e}_X^-\,|\,x)\, P(x\,|\,\mathbf{e}_X^+) = \alpha\, \lambda(x)\, \pi(x).     (6.3)

$\lambda(x)$ and $\pi(x)$ can be recursively computed in the following way:

    \lambda(x) = P(\mathbf{e}_X^-\,|\,x) = P(\mathbf{e}_{Y_1}^-, \ldots, \mathbf{e}_{Y_m}^-\,|\,x) = \prod_{i=1}^{m} P(\mathbf{e}_{Y_i}^-\,|\,x) = \prod_{i=1}^{m} \lambda_{Y_i}(x),     (6.4)

where $\lambda_{Y_i}(x)$ is the diagnostic support from node $Y_i$. And with

    \pi(x) = P(x\,|\,\mathbf{e}_X^+) = \sum_u P(x\,|\,\mathbf{e}_X^+, u)\, P(u\,|\,\mathbf{e}_X^+) = \sum_u P(x\,|\,u)\, P(u\,|\,\mathbf{e}_X^+).
$P(x|u)$ is the conditional probability table associated with the arc $U \rightarrow X$. It is also denoted by $M_{X|U}$. $P(u \mid e_X^+)$ can be calculated by $U$ and delivered to $X$ as the message
\[
\pi_X(u) = P(u \mid e_X^+), \tag{6.5}
\]
yielding
\[
\pi(x) = \sum_{u} P(x \mid u)\, \pi_X(u) = \pi_X(u)\, M_{X|U}. \tag{6.6}
\]
Substituting Eqs. 6.4 and 6.6 in Eq. 6.3 gives
\[
P(x|e) = BEL(x) = \alpha \left( \prod_{i=1}^{m} \lambda_{Y_i}(x) \right) M_{X|U}\, \pi_X(u). \tag{6.7}
\]
Thus, node X can calculate its own belief once it has received the messages $\lambda_{Y_i}(x)$ from its children $Y_i$ and the message $\pi_X(u)$ from its parent $U$.
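The belief update of Eq. 6.7 is compact enough to be stated directly in code. The following Python sketch is only a minimal illustration under the assumption that messages are stored as NumPy vectors and the CPT as a matrix with rows indexed by the parent state; the example numbers are made up and this is not the implementation used in the system.

import numpy as np

def belief(lambda_msgs, M_x_given_u, pi_from_parent):
    """Belief update at a node X of a causal tree (cf. Eqs. 6.3-6.7).

    lambda_msgs    : list of lambda_{Y_i}(x) vectors received from the children
    M_x_given_u    : CPT M_{X|U}, rows indexed by parent state u, columns by x
    pi_from_parent : message pi_X(u) = P(u | e_X^+) received from the parent
    """
    lam = np.prod(np.vstack(lambda_msgs), axis=0)   # lambda(x) = prod_i lambda_{Y_i}(x)
    pi = pi_from_parent @ M_x_given_u               # pi(x)     = sum_u pi_X(u) P(x|u)
    bel = lam * pi                                  # unnormalized BEL(x)
    return bel / bel.sum()                          # alpha normalizes to sum 1

# Hypothetical two-state example:
M = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(belief([np.array([0.6, 0.3])], M, np.array([0.5, 0.5])))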
6.1.2 Propagation

As each node needs to receive messages from its children and its parent in order to calculate its belief, it also has to send out messages to its own children and parent. In this way new information can spread through the network.
\[
\lambda_X(u) = \sum_{x} P(e_X^- \mid u, x)\, P(x \mid u) = \sum_{x} P(e_X^- \mid x)\, P(x \mid u) = \sum_{x} \lambda(x)\, P(x \mid u) = M_{X|U}\, \lambda(x). \tag{6.8}
\]
Thus, the message $\lambda_X(u)$ going to the parent $U$ can be calculated from the messages received from the children and the conditional probability table $M_{X|U}$. And the message $\pi_{Y_k}(x)$ which is sent to $Y_k$ is
\[
\begin{aligned}
\pi_{Y_k}(x) &= P(x \mid e_{Y_k}^+) = P(x \mid e_{Y_1}^-, \ldots, e_{Y_{k-1}}^-, e_{Y_{k+1}}^-, \ldots, e_{Y_m}^-, e_X^+)\\
&= \alpha\, P(e_{Y_1}^-, \ldots, e_{Y_{k-1}}^-, e_{Y_{k+1}}^-, \ldots, e_{Y_m}^- \mid x)\, P(x \mid e_X^+)\\
&= \alpha \left( \prod_{i \neq k} P(e_{Y_i}^- \mid x) \right) P(x \mid e_X^+)
= \alpha\, \pi(x) \prod_{i \neq k} \lambda_{Y_i}(x). \tag{6.9}
\end{aligned}
\]
Once the messages are computed, they can be sent to the children and the parent. Every new incoming piece of evidence is followed by its propagation through the network. Fig. 6.2 shows six successive stages of belief propagation through a simple binary tree as an example. It is assumed that the update is triggered by changes in the belief parameters of connected nodes. Initially, the tree is in equilibrium, and all terminal nodes are anticipatory (see Fig. 6.2a). As soon as data nodes are activated, they send diagnostic support to their parents (Fig. 6.2b). In the next phase, the parents absorb the diagnostic support and send messages to their own parents and children (Fig. 6.2c). The propagation continues until a new equilibrium is reached.
Fig. 6.2: The impact of new data propagates through a tree by a message-passing process (Pearl, 1988).
Various approaches to propagation in Bayesian networks have been proposed. They range from exact and full propagation to approximations like stochastic simulation (Pearl, 1986). For our approach, propagation is not a major issue: the Bayesian network for object identification is a rather small and simple tree. Evidence may therefore arrive at any time and is propagated through the network.
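For completeness, the two outgoing messages of Eqs. 6.8 and 6.9 can be sketched in the same style. Again, this is only an illustrative Python sketch with assumed array conventions and made-up numbers, not the implementation used in the system.

import numpy as np

def lambda_to_parent(lambda_msgs, M_x_given_u):
    """Message lambda_X(u) sent to the parent U (Eq. 6.8)."""
    lam = np.prod(np.vstack(lambda_msgs), axis=0)      # lambda(x)
    return M_x_given_u @ lam                           # sum_x P(x|u) lambda(x)

def pi_to_child(k, lambda_msgs, M_x_given_u, pi_from_parent):
    """Message pi_{Y_k}(x) sent to child Y_k (Eq. 6.9)."""
    pi = pi_from_parent @ M_x_given_u                  # pi(x)
    others = [l for i, l in enumerate(lambda_msgs) if i != k]
    msg = pi * (np.prod(np.vstack(others), axis=0) if others else 1.0)
    return msg / msg.sum()                             # alpha normalizes

# Hypothetical numbers for a node with one parent and two children:
M = np.array([[0.9, 0.1], [0.2, 0.8]])
lams = [np.array([0.6, 0.3]), np.array([1.0, 1.0])]   # child 1 observed, child 2 anticipatory
print(lambda_to_parent(lams, M))
print(pi_to_child(1, lams, M, np.array([0.5, 0.5])))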
6.1.3 Related Work

The basic concept of Bayesian networks is very well introduced in (Pearl, 1988). Algorithms for belief update and propagation are given for basically all possible network types. The learning of conditional probabilities is also covered. Another good overview of Bayesian networks is given by Lauritzen & Spiegelhalter (1988). Recently, Bayesian networks have attracted a large number of researchers who work on a wide variety of aspects of Bayesian networks, probabilistic models, and reasoning under uncertainty. Chrisman (1996) has collected and classified many references in this area. However, our approach for object identification is a pure application of Bayesian networks. Our network is a rather small tree, and therefore the use of Bayesian networks is straightforward. We use Bayesian networks as outlined in (Pearl, 1988) and do not apply
more advanced propagation or probability learning techniques. The following brief review of Bayesian networks is therefore mainly focused on example applications.

Forbes et al. (1995) describe a decision-theoretic architecture which uses probabilistic networks for the application of driving an autonomous vehicle in normal traffic. The driving problem is modeled as a partially observable Markov decision process. At each state of the Markov process the belief state is updated and decisions are made according to it. Each state is therefore modeled with a Bayesian network, where the output of different sensors (such as speed or visual sensors (Huang et al., 1994; Koller et al., 1994)) and the current general decision (e.g., passing a car) are taken into account to estimate the overall belief of a state. The decision process over time is not modeled by Bayesian networks but by Markov models; the 'temporally invariant network', which is a Bayesian network, is applied within a state only. The performance of the sensors is modeled by conditional probability tables; their exact acquisition is not described.

Another application of Bayesian networks was developed by Rimey & Brown (1994). Their goal is to simulate a selective active vision system. The system TEA-1 sequentially collects evidence to answer a specific question with a desired level of confidence. The chosen scenario is a dinner table scene, and the task is to visually locate items on the table. Efficiency comes from processing the scene only at necessary locations, to the necessary level of detail, and using only the necessary operators. Bayesian networks are used for representation and for a benefit-cost analysis of the control of visual and non-visual actions. A composite Bayesian network is structured into separate nets. These nets model the physical structure of a scene, geometric relations, a taxonomic hierarchy representing relationships, as well as task-specific knowledge. The TEA-1 system tries to solve a complex task. This does not seem possible with the formalism of Bayesian networks alone: a number of goodness functions are used for decision making, in addition to the belief propagation in the Bayesian networks.

A third example of an application of Bayesian networks is the system PRACMA (Jameson et al., 1995). It is a rather classical decision-making application in which a salesperson for used cars is modeled. The system tries to collect as much evidence as possible and uses it to give advice on which car to buy. The conditional probability tables are fixed manually, and therefore the whole approach is rather artificial and not suitable for use with real data.

Approaches to probabilistic reasoning for computer vision applications are collected in (PAMI 15(3), 1993). All approaches in the collection are designed to be rather specific for the chosen application. Unfortunately, we did not find applications similar to ours, nor applications which could give inspiration to our approach.
6.2 Object Identification using Bayesian Networks

The identification of the object(s) in the visible part of the scene which are referred to in an utterance is an important task in our speech and image understanding system QuaSI-ACE.
The identification process requires inferences on a common representation of the results from speech and image understanding. The common representation is formed by the qualitative descriptions which are described in Chapter 5. But our goal is not only the identification but also the inference of restrictions for missing information in either type of qualitative description. Bayesian networks are very well suited for this task. They offer the possibility of bottom-up, top-down, and mixed-mode reasoning under uncertainty. This allows us to infer restrictions and to deal with data from real and noisy environments. In this section, we first describe the design of our Bayesian network for object identification. Crucial for Bayesian networks is the modeling of the conditional probabilities. We estimate them based on prior knowledge of the domain (see Subsection 6.2.2). We carried out a questionnaire on the World Wide Web to investigate the use of size and shape properties in our domain; it is reported in Subsection 6.2.3. Results and a discussion of our approach conclude this section.
6.2.1 Design

The design of the Bayesian network for object identification was guided by the following requirements:
- Decisions should account for uncertainty in the data as well as uncertainties in the recognition and interpretation processes.
- Data and results from psycholinguistic experiments should be easy to integrate.
- It should be possible to model performance deficits of the recognition modules.
- The system should be able to identify objects from unspecific and even partially false instructions/object descriptions.
- It should be possible to infer restrictions for missing information of any type.
These requirements are satisfied best when we determine from all detected objects of a scene the one(s) which are most likely referred to in an instruction. This means that the object(s) with the highest joint probability of being part of the scene and of being referred to are the intended object(s). In this way, we can incorporate the uncertainties which are involved in the recognition as well as the understanding processes. We can also account for erroneous specifications, as we do not require a perfect match of observed and uttered features but only the highest probability of a match of these features among all detected objects. Furthermore, we identify an object in the context of the scene and of all uttered features; we do not just pairwise compare uttered and observed features.
We designed our Bayesian network (cf. Fig. 6.3) according to these guidelines. It is a tree-structured network. The root node intended Object represents the intended object. The dimension of this node is 23: there are 23 different objects in our domain (we now distinguish between the cubes of different colors; see Appendix B for a complete list of all objects). Thus, we estimate for each object of the domain the probability or likelihood of being intended. The children of the root node represent the Scene and the properties $Type_{speech}$, $Color_{speech}$, $Size_{speech}$, and $Shape_{speech}$ which may be uttered in an instruction. The dimension of the node Scene is again 23. This represents for each object the probability of being part of the current scene. The dimensions of the speech nodes ($Type_{speech}$, $Color_{speech}$, $Size_{speech}$, and $Shape_{speech}$) are chosen according to the number of different categories per property. The speech nodes are instantiated with the qualitative descriptions which result from speech understanding. If a property is not specified in an utterance, then the diagnostic support of the corresponding node is set to the all-ones vector $\mathbf{1}_n$, where $n$ is the dimension of the property, i.e. the vector contains a 1 in every component.
Fig. 6.3: Structure of the Bayesian network for object identification: The conditional probability tables (CPT) are marked at the arcs. Ip is a unit matrix where the zero entries are replaced by very small probabilities. The dimension of each node is written in brackets.
The observed objects are represented by the nodes $Object_1, \ldots, Object_m$. Each node has again dimension 23 and represents for an observed object all possible object categories and their likelihood of characterizing it. Typically, one component has a high belief, and the belief for all other components is low. The qualitative type and color descriptions are used to instantiate the nodes $Type_{image}$ and $Color_{image}$ for each observed object. The conditional probability tables $M_{t_i|o}$ and $M_{c_i|o}$ model the transitions between type and color and all object categories having this type and color, respectively. The beliefs of the nodes $Object_1, \ldots, Object_m$ are then derived from $Type_{image}$ and $Color_{image}$, the qualitative type and color descriptions of an observed object. The arcs between Scene and $Object_1, \ldots, Object_m$ are shaded in grey as their semantics differs from that of ordinary arcs in Bayesian networks. The scene node represents for each object the belief
that this object is observed in the scene and that it is referred to. In contrast to the usual modeling of arcs in Bayesian networks, a joint probability of all evidence coming from the children of a node and the causal support of the parents is not adequate here. The identification of an object should not be influenced by the number of occurrences of an object or of a specific object property in the scene. Instead, it is of interest to model the existence of an object or an object category. Therefore, we do not compute $P(e_{Scene}^- \mid scene) = \lambda(scene) = \prod_j \lambda_{Y_j}(scene)$, where the $Y_j$ are the children of the scene node which represent the objects in the scene. Rather, the diagnostic support for the scene node $P(e_{Scene}^- \mid scene)$ is obtained as follows:
\[
P(e_{Scene}^- \mid scene) = \lambda(scene) = 1 - \prod_j \left( 1 - \lambda_{Y_j}(scene) \right). \tag{6.10}
\]
This represents for each object class the probability of it being observed in the scene.

Object Identification

The intended object(s) are identified in the following way. First, after the initialization of the network, only $\lambda(scene)$ and $\pi(io)$ (the vector of prior probabilities for the node intended Object, set to 1/23 in every component) are propagated through the network. The resulting beliefs for the object nodes, taken prior to normalization, characterize the certainty of detection in the context of the scene. We call this value offset and compute it as
\[
offset_j = \lambda_{init}(object_j)\, \pi_{init}(object_j). \tag{6.11}
\]
Incoming evidence is then propagated bottom-up and top-down through the network. The belief of the scene node results from incoming evidence as well as from top-down propagated messages from the node intended Object. Hence, the belief of the scene node represents for each domain object the joint probability of being part of the scene and being intended. Messages from the scene node are propagated top-down to the nodes $Object_1, \ldots, Object_m$. After the propagation of evidence obtained from speech and image understanding, the beliefs of the nodes $Object_1, \ldots, Object_m$ are again taken prior to normalization. We define for each object $j$ the possibility $\phi_j$ of being intended as
\[
\phi_j = \lambda(object_j)\, \pi(object_j). \tag{6.12}
\]
For each object, a likelihood value is taken:
\[
\ell_j = \max_i \left( (\phi_j)_i \right) - \max_i \left( (offset_j)_i \right). \tag{6.13}
\]
$\ell_j$ is the difference of the maximal component of $\phi_j$ and the maximal component of $offset_j$. This gives us one likelihood value for each observed object. The identified object should be
the most likely intended one. Therefore, we use a simple statistical analysis to find this object or these objects. We compute the mean $\mu$ and the standard deviation $\sigma$ of the set of likelihood values $\ell_j$. Objects with likelihood values below $\mu - \sigma$ are excluded from the statistical analysis. We define the selection criterion as
\[
\text{object } j \text{ identified if } \begin{cases} \ell_j > 0 & \text{if } \sigma < \text{threshold}, \\ \ell_j > \mu + \sigma & \text{if } \sigma \ge \text{threshold}. \end{cases} \tag{6.14}
\]
This gives us all outliers with a level of confidence of 0.69. Thus, we get all those objects which have a significantly higher belief of being intended than the other objects. A level of confidence of 0.69 is not too low for our scenario: we have to cope with noisy and erroneous data, and we want the system to identify an object rather than to report "no objects found" even if a lot of uncertainty is involved. The instructor can correct false identifications, and it is more convenient to correct a false identification once in a while than to repeat instructions multiple times until the system has finally found the intended object.
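The selection procedure of Eqs. 6.10-6.14 can be summarized in a short sketch. The following Python fragment uses the symbols introduced above ($offset_j$, $\phi_j$, $\ell_j$); the helper names, the fixed threshold value, and the exact handling of the exclusion below the mean minus one standard deviation are assumptions of this illustration rather than the thesis implementation.

import numpy as np

def identify(offsets, phis, threshold=0.1):
    """Sketch of the selection criterion around Eqs. 6.11-6.14.

    offsets : list of offset_j vectors (beliefs after propagating only the scene evidence)
    phis    : list of phi_j vectors (unnormalized beliefs after all evidence is in)
    Returns the indices of the object nodes reported as intended.
    """
    ell = np.array([phi.max() - off.max() for phi, off in zip(phis, offsets)])  # Eq. 6.13
    kept = ell[ell >= ell.mean() - ell.std()]        # drop values below mu - sigma
    mu, sigma = kept.mean(), kept.std()
    limit = 0.0 if sigma < threshold else mu + sigma                            # Eq. 6.14
    return [j for j, l in enumerate(ell) if l > limit]

# The diagnostic support of the scene node combines the object nodes as in Eq. 6.10:
def scene_lambda(lambdas_from_objects):
    return 1.0 - np.prod(1.0 - np.vstack(lambdas_from_objects), axis=0)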
Spatial Relations

So far, we have only described how we identify objects from the unary properties type, color, size, and shape. Binary spatial relations are another means of referring to objects. Especially when there are multiple objects of the same type in the scene, they might be the only way to uniquely specify an object. But a tree-structured Bayesian network is not well suited to handle binary relations. Furthermore, we envision modeling the spatial configuration of a scene as a relation graph such as proposed by Fuhr (1997). We think that a graph structure is the best way to handle the complexity of binary or even higher-dimensional relations. At the current state of the system implementation, we therefore propose a very simple method for handling object identifications that use spatial relations. $spat\_rel_{speech}$ is the qualitative description of the uttered spatial relation(s). The following steps are executed (a code sketch follows the list):

1. Identify all candidates for possible intended objects (IO) based on type, color, size, and shape, if named.
2. Identify all candidates for possible reference objects (RO) based on type, color, size, and shape, if named.
3. Compute spatial relations for all IO/RO candidate pairs as explained in Section 5.4.
   (a) Compute for each IO/RO candidate pair the Euclidean distance between the vector of computed spatial relations and $spat\_rel_{speech}$:
       $r_d = \| spat\_rel(ref; IO, RO) - spat\_rel_{speech} \|_2$.
   (b) Compute $s = \ell_{IO}\, \ell_{RO} / r_d$.
4. The IO/RO candidate pair with the greatest $s$ is identified.
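The IO/RO scoring of steps 3 and 4 could look as follows; spatial_relations() stands in for the computation of Section 5.4 and ell for the likelihood values of Eq. 6.13, both of which are assumed interfaces for this sketch and not part of the thesis code.

import numpy as np

def best_pair(io_candidates, ro_candidates, ell, spat_rel_speech, spatial_relations):
    """Score all IO/RO pairs and return the pair with the greatest s = ell_IO * ell_RO / r_d.

    spatial_relations(io, ro) is assumed to return the vector of computed spatial
    relations for the pair; ell maps an object to its likelihood value.
    """
    best, best_s = None, -np.inf
    for io in io_candidates:
        for ro in ro_candidates:
            if io is ro:
                continue
            r_d = np.linalg.norm(spatial_relations(io, ro) - spat_rel_speech)
            s = ell[io] * ell[ro] / max(r_d, 1e-9)   # guard against a perfect match
            if s > best_s:
                best, best_s = (io, ro), s
    return best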
6.2.2 Use of Prior Knowledge

An important requirement for our object identification approach is the incorporation of data and results from psycholinguistic experiments. We also want to model the results of error analyses of the understanding modules. These types of information are represented in the conditional probability tables.

In a first experiment, 10 subjects named objects verbally. Images of scenes with Baufix objects were presented to the subjects on a computer screen. In each image, one object was marked with a pink arrow and the subjects were told to name this object using an utterance in the form of an instruction such as "give me the object" or "take the object". From this experiment, 453 verbal object descriptions were collected. The properties named most often in these utterances are the type and the color of the objects. We estimated from this data
\[
P(type_{speech}[j] \mid intended\ object[i]) = \frac{\#\ type_j \text{ was named when } object_i \text{ was shown}}{\#\ object_i \text{ was shown}},
\]
\[
P(color_{speech}[j] \mid intended\ object[i]) = \frac{\#\ color_j \text{ was named when } object_i \text{ was shown}}{\#\ object_i \text{ was shown}}.
\]
We denote the $i$th component of a vector $\mathbf{x}$ as $\mathbf{x}[i]$; with this notation, known from programming languages, we want to avoid conflicting indices. The second experiment was a questionnaire in the World Wide Web on the size and shape of the objects (see Section 6.2.3). It is used for the following conditional probabilities:
\[
P(size_{speech}[j] \mid intended\ object[i]) = \frac{\#\ size_j \text{ was named when } object_i \text{ was shown}}{\#\ object_i \text{ was shown}},
\]
\[
P(shape_{speech}[j] \mid intended\ object[i]) = \frac{\#\ shape_j \text{ was named when } object_i \text{ was shown}}{\#\ object_i \text{ was shown}}.
\]
The performance of the speech understanding system was not evaluated here. Therefore, confusions which may occur in speech understanding are not yet modeled.

Another series of experiments tests the performance of the image understanding modules. Object recognition was performed on 11 images of different scenes which were taken under constant lighting conditions but with different focal lengths (see Section 5.2.2). The conditional probabilities $P(type_{image} \mid object_k)$ are estimated as
\[
P(type_{image}[j] \mid object_k[i]) = \frac{\#\ type_j \text{ was detected when } object_i \text{ was depicted}}{\#\ object_i \text{ depicted}}.
\]
$k$ denotes the $k$th object in the scene. The conditional probability tables are the same for all objects.
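All of the estimates above are relative frequencies obtained from count tables. A minimal sketch follows; the function name, the smoothing constant, and the example counts are assumptions for illustration, not taken from the thesis.

import numpy as np

def estimate_cpt(counts, smoothing=0.0):
    """Estimate P(category_j | object_i) from a count matrix.

    counts[i, j] = how often category j was named (or detected) when object i was shown.
    Each row is normalized to a probability distribution; an optional small smoothing
    constant avoids zero probabilities.
    """
    counts = np.asarray(counts, dtype=float) + smoothing
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical counts for 2 objects and 3 type categories:
print(estimate_cpt([[8, 2, 0], [1, 0, 9]], smoothing=0.1))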
The color classification performance was also evaluated (see Section 5.3.2). This gives us a conditional probability table
\[
P(\text{color classif.} \mid \text{pixel color}) = \left[ P(color_j\ \text{classif.} \mid \text{pixel } color_i) \right]_{i,j},
\]
with $P(color_j\ \text{classif.} \mid \text{pixel } color_i) = \#\ color_j \text{ classified} \,/\, \#\ \text{pixels with } color_i$. But we need the conditional probabilities $P(color_{image}[j] \mid object_k[i])$. We estimate these by using an ideal transition matrix $P(color \mid object)$ and multiplying it with $P(\text{color classif.} \mid \text{pixel color})$. We set $P(color_j \mid object_i) = \eta$ if the color of $object_i$ is $color_j$ and $\varepsilon$ otherwise, where $\eta$ is a probability near 1, normalized such that $\sum_j P(color_j \mid object_i) = 1$, and $\varepsilon$ is a very small probability. Therefore,
\[
P(color_{image}[j] \mid object_k[i]) = \left[ P(color \mid object) \cdot P(\text{color classif.} \mid \text{pixel color}) \right]_{i,j}.
\]
The transitions between intended Object and Scene as well as between Scene and all objects $Object_k$ are considered unbiased, and so the conditional probabilities were set to unit matrices where the zero entries are replaced by very small probabilities:
\[
P(scene[j] \mid intended\ object[i]) = \begin{cases} \eta & \text{if } i = j, \\ \varepsilon & \text{otherwise}; \end{cases}
\qquad
P(object_k[j] \mid scene[i]) = \begin{cases} \eta & \text{if } i = j, \\ \varepsilon & \text{otherwise}. \end{cases}
\]
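The construction of the color table and of the near-unit matrices can be sketched as follows; the symbols eta and eps follow the reconstruction above, and the dimensions and the example confusion matrix are assumptions made only for illustration.

import numpy as np

def ideal_color_cpt(object_colors, n_colors, eta=None, eps=1e-3):
    """Build P(color | object): eta on the object's true color, eps elsewhere."""
    n_objects = len(object_colors)
    if eta is None:
        eta = 1.0 - (n_colors - 1) * eps          # rows sum to 1
    cpt = np.full((n_objects, n_colors), eps)
    cpt[np.arange(n_objects), object_colors] = eta
    return cpt

# P(color_image | object) = P(color | object) . P(color classif. | pixel color)
def color_image_cpt(object_colors, color_confusion):
    n_colors = color_confusion.shape[0]
    return ideal_color_cpt(object_colors, n_colors) @ color_confusion

# Hypothetical: 3 objects, 2 colors, and a measured color confusion matrix.
confusion = np.array([[0.95, 0.05],
                      [0.10, 0.90]])
print(color_image_cpt([0, 1, 0], confusion))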
6.2.3 Questionnaire about Size and Shape in the World Wide Web

So far, we described how to extract the qualitative properties type, color, and spatial relations from numerical data. But what about the size and shape of objects? Size and shape are not named in the instructions as often as the color (see Section 5.3), but nevertheless, QuaSI-ACE should be able to cope with size and shape specifications. This leads to the question whether there are any classification schemes for size and shape. Are there functions of the metric size and shape of the objects (e.g., volume, diameter)? Or do other mechanisms apply? We were unable to answer these questions and therefore decided to start a questionnaire in the World Wide Web (WWW) in order to collect empirical data about which size and shape categories subjects associate with the objects in our domain. The World Wide Web is a means to reach many people and to acquire large sets of data. The main hypotheses or questions for the design of the questionnaire were:
(a) there is an intrinsic size and shape model for each object in our domain;
(b) the size and shape depend on the context of the scene, but how?
Fig. 6.4: One page from the questionnaire in the World Wide Web: The objects are presented in the context of others. The subjects were asked to select all size and shape categories that characterize the numbered object.
After an introduction, 20 WWW pages were presented to the subjects, one page for each object of our domain (here we do not distinguish between cubes of different colors, which reduces the number of objects to 20). On each page, an image of an object and buttons with size and shape categories were shown, and the subjects were asked to select all those categories that characterize the depicted object (see Fig. 6.4 for an example). We designed two different versions of the questionnaire according to our hypotheses. In the first version, we presented images of objects in the context of others. The object in question was marked with a number. The second version contained only images of isolated objects. Both versions were randomly distributed among the subjects. We also translated the original German questionnaire into English to open this study to international participants. We did not study the use of size and shape adjectives in English, and we purely translated the German adjectives according to a dictionary. Thus, the usage of
certain adjectives that we chose may be different in English than in German. For each object, the subjects had the choice between the 18 size and shape adjectives shown in Table 6.1. We selected the German adjectives from the empirical data reported in Section 5.3.

Size:
  klein (small), groß (big), kurz (short), lang (long), mittellang (medium-long),
  mittelgroß (medium-sized), dick (thick), dünn (thin), schmal (narrow), hoch (high)

Shape:
  rund (round), eckig (angular), länglich (elongated), sechseckig (hexagonal),
  viereckig (quadrangular), rautenförmig (diamond-shaped), flach (flat), rechteckig (rectangular)

Table 6.1: We asked subjects to select all of these 18 size and shape adjectives that are applicable to each of the Baufix objects.
426 subjects with over 20 different native languages completed the questionnaire. The German version was selected by 274 subjects; 96% of them are native speakers. The English version was chosen by 152 subjects; 53% of them are native English speakers. 78% of all subjects are male and only 22% are female; 1% did not specify their gender. The age of the subjects ranges between 14 and 62, and the average age is 29.4 years. We evaluate the data with descriptive statistics and a $\chi^2$-test.

Descriptive Statistics

Descriptive statistics characterize a data set in general terms: its mean and variance. The mean is computed as
\[
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\quad \text{or, for vectorial variables,} \quad
\boldsymbol{\mu} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i. \tag{6.15}
\]
The variance of a data set is
\[
\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N-1} \left( \sum_{i=1}^{N} x_i^2 - N \mu^2 \right) \tag{6.16}
\]
or, for vectorial variables,
\[
\Sigma = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T = \frac{1}{N-1} \left( \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T - N \boldsymbol{\mu} \boldsymbol{\mu}^T \right). \tag{6.17}
\]
Fig. 6.5 shows the means (or relative frequencies) for all size and shape categories for all objects. We use a visualization by greyvalues. The darker the square the greater the value of the mean. Each row corresponds to one category. There are four squares per object. These stand for the relative frequencies in the data sets from the four versions of the questionnaire (German with context, German without context, English with context, and English without context). It is obvious that the choice of applicable categories is similar for the four different versions of the questionnaire.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Fig. 6.5: Relative frequencies for all size and shape categories and all objects in the four versions of the questionnaire: The darker a square the higher the mean or relative frequency of selecting that category. All objects are shown in Appendix B. The categories are numbered as shown in Fig. 6.6.
Fig. 6.6 shows only the data of the questionnaires in German. Here again the means or relative frequencies of selecting the size and shape categories are visualized as greyvalues for all objects. The first column for each object depicts the relative frequencies from the context version, and the second column shows the results from the version without context. We see again that the relative frequencies are qualitatively very similar. Data sets are not only characterized by their means but also by their variances or covariance matrices for vectorial random variables. We computed the covariance matrices for the four data sets and used the Frobenius norm (Golub & van Loan, 1991)
v u n X m X u jm2i;j kM kF = t i=1 j =1
(6.18)
to compare the covariance matrices. Fig. 6.7 shows a 3D histogram plot of the Frobenius norms of the covariance matrices. Despite minor differences we observe a similar pattern for the uncertainty in the selection of the categories.
Hexa g.
Hexa b.
110
1 small 2 big 3 short 4 long 5 m.-long 6 m.-sized 7 thick 8 thin 9 narrow 10 high 11 round 12 angular 13 elongated 14 hexagonal 15 quadrang. 16 diamond flat 17 18 rectang.
Fig. 6.6: The relative frequencies for all size and shape categories and all objects for all questionnaires in German, respectively: The darker a square the higher the mean or relative frequency of selecting that category. The size and shape adjectives are translated here into English.
Fig. 6.7: The Frobenius norm of the covariance matrices for all objects and all four versions of the questionnaire, respectively.
$\chi^2$-Test

We want to test whether the variables language (German vs. English) and context (with context vs. without context) influence the subjects' selections and hence the distribution of the data sets. A well known test for differences between binned distributions (i.e., distributions over a discrete set of possible values) is the $\chi^2$-test (e.g., Nie et al., 1975). $\chi^2$ is a test of statistical significance, and it helps us to find out whether a systematic relationship exists between two instances of a variable. The data is organized in tables, and each cell is determined by an instance of the variable 'context', an instance of the variable 'language', the object type, the category, and the subjects' selection (yes/no). The $\chi^2$-test compares observed cell frequencies to the expected frequencies which would occur if the investigated variable did not influence the distribution. The expected frequencies are computed as
\[
f_{e_i} = \frac{c_i\, r_i}{N}, \tag{6.19}
\]
where $c_i$ is the sum of frequencies of the column cells (e.g., $f(context_{true}$, German, $category_l$, $object_k$, yes) $+\, f(context_{false}$, German, $category_l$, $object_k$, yes)) and $r_i$ is the sum of frequencies of the row cells (e.g., $f(context_{true}$, German, $category_l$, $object_k$, yes) $+\, f(context_{true}$, German, $category_l$, $object_k$, no)) in the data table. $N$ stands for the total number of valid cases. The test is computed as
\[
\chi^2 = \sum_i \frac{(f_{o_i} - f_{e_i})^2}{f_{e_i}}, \tag{6.20}
\]
where the $f_{o_i}$ are the observed frequencies. The greater the discrepancies between the expected and the observed frequencies, the larger $\chi^2$ becomes. We consider only the German questionnaires. We have seen from the descriptive statistics that the English questionnaires lead to similar results, but because of the inhomogeneity of the data (no prior analysis of chosen categories and only 53% native speakers among the subjects) we do not analyze them in greater detail. Furthermore, we make use only of the data from the German questionnaires for the object identification. Table 6.2 shows for all objects those size and shape categories for which a significant difference was found between the German context version and the German version without context. The context varied in the questionnaires; thus, we compare the subjects' answers for an averaged context versus an isolated presentation. However, the title page of every questionnaire contained the image of a Baufix scene. Table 6.2 displays the value of $\chi^2$, the probability $p$ for a significant difference (for one degree of freedom in our case), the relative frequencies of the selection of the category given a context (C), and the relative frequencies of the selection of the category without context (w/out). We see that differences occur only for a few categories per object. Furthermore, there are no
Object 3-holed-bar 5-holed-bar 7-holed-bar
Category mittelgroß (medium-sized) rechteckig (rectangular) groß (big) schmal (narrow) flach (flat) Cube klein (small) groß( big) dick (thick) d¨unn (high) viereckig (quadrangular) Rhomb-nut klein (small) kurz (short) rautenf¨ormig (diamond-sh.) rechteckig (rectangular) Rim klein (small) mittelgroß (medium-sized) Tire groß (big) dick (thick) Washer, purple kurz (short) Washer, flat klein (small) mittelgroß (medium-sized) Round, red kurz (short) rund (round ) Round, yellow kurz (short) Round, blue klein (small) kurz (short) l¨anglich (elongated) Hexagonal, orange mittelgroß (medium-sized) sechseckig (hexagonal) Hexagonal, blue mittelgroß (medium-sized) Hexagonal, green d¨unn (thin) rund (round) flach (flat) Table 6.2: Objects and those categories for which the scene context on the subjects’ answers.
2 13.0 4.0 4.5 .01 4.4 8.3 24.0 5.0 16.2 7.7 5.5 6.4 29.5 12.1 8.7 7.4 22.1 4.6 8.5 12.3 6.2 11.4 6.7 4.6 6.8 6.7 6.1 4.5 7.7 5.0 4.4 5.6 4.0
p 0 .043 .033 .025 .035 .004 0 .025 0 .005 .018 .011 0 0 .003 .007 0 .031 .003 0 .013 .001 .009 .031 .009 .010 .013 .034 .005 .025 .035 .017 .045
O BJECT I DENTIFICATION
C [%] w/out[%] 36.6 18.5 69.7 57.7 50.0 36.9 34.1 47.7 93.9 86.2 3.8 13.8 40.9 13.8 65.9 52.3 36.4 14.6 50 33.1 53.8 39.2 8.3 1.5 97.7 74.6 .8 10.8 1.5 10 57.6 40.8 62.1 33.1 17.4 8.5 12.9 3.1 87.1 69.2 0 4.6 77.3 92.3 93.9 83.8 78.8 66.9 2.3 10 1.5 8.5 55.3 40 28.8 17.7 71.2 54.6 45.5 33.8 18.2 9.2 51.5 36.9 3 0
2 -test indicates a significant influence of the
systematic differences. For example, a difference has been found for the round-headed bolt in the category short, whereas there is no difference for the bolt of the same length but with the hexagonal head. The difference could therefore be due to the appearance of the objects in the images. Differences occur rather often for the categories 'medium-long' and 'medium-sized', which could be caused by the vagueness of these categories themselves. The selection frequencies of most of the differing categories are in a range of 60% or less. This means that the subjects agree better on the characteristic categories. It also seems that the differing categories are selected with higher frequencies when the context of a scene is given than without it. This higher certainty could be caused by the possibility of direct comparisons to other objects. We conclude from this analysis that the influence of the context is not dominant for the choice of size and shape categories in our scenario. But we are not able to draw firm conclusions; more detailed experiments are necessary. We consider the observed similarity in the data sets as a hint for intrinsic size and shape models for our domain and our scenario. A Rolls-Royce car is always a big car even when it stands next to a truck. We think that this is true for the Baufix objects in the assembly scenario, too. Furthermore, the relevant context consists not only of the visible objects but also of personal experience, background, association, etc. The Rolls-Royce is a big car in the context of the knowledge about cars and other car types. We use the data from the context version of the German questionnaire (137 questionnaires) to estimate the conditional probability tables for our Bayesian network for object identification as described in Subsection 6.2.2. We treat all scene contexts in the same way, which leads to satisfying results.

Discussion

Our way of modeling the size and shape of objects does not attempt to model cognitive processes. So far, we model only one aspect of shape and size perception, which can be captured by an intrinsic size and shape model. But obviously there must be cognitive processes which are activated in order to compare objects within a specific context. At the beginning of this subsection we raised the question of how the size and shape of an object depend on the context of the scene. We are unable to answer it. Our approach leads to convincing results, which means that we solved the problem of modeling a tool which is able to interact successfully with humans without trying to explain complex cognitive mechanisms.
6.2.4 Results

The Bayesian network approach for object identification has been tested in various ways. We first show some examples and demonstrate how objects are identified taking into account an
utterance and the objects in a scene. Then, we simulated image data with one object per scene and investigated the propagated beliefs of the speech nodes without the information of an utterance. This gives us an overview of the knowledge modeled for each object. An analysis with simulated data is also used to evaluate our Bayesian network: we generated scenes and utterances randomly and analyze the identification results. This is followed by experiments with data in which subjects named specific objects that were presented in images; we compare the identified and the named objects. Here, we use only transcribed utterances and manually generated scene descriptions in order to concentrate on the object identification. Sources of errors in earlier processing and understanding steps are eliminated. This subsection is concluded with an analysis of our approach over time.

Examples

Our Bayesian network approach for object identification is first illustrated with three examples. We take a simple scene with Baufix objects which is depicted in Fig. 6.8. This scene consists of five cubes, a 3-holed-bar, and a rhomb-nut.
Fig. 6.8: A scene.
First, we use as input data only information about the scene, which means that we instantiate only the nodes for the objects in the image (Objecti) and propagate the evidence through the network. The resulting beliefs are shown in Fig. 6.9. The object nodes (Objecti) are not depicted here. We see that beliefs in the node Scene for the categories of objects that occur in the scene are higher than for objects which are not there. The belief for the object category
Fig. 6.9: The belief distribution in the Bayesian network without an instruction.
rim is rather high because of a high confusion probability between the red rim and the red cube. The beliefs of the node intended Object are identical to those of the node Scene as no incoming evidence from the speech nodes is available. The propagated beliefs in the speech nodes report the constellation of the objects in the scene. Most of the objects are cubes and therefore the belief for cube is the highest for the property ‘type’. The belief for bolt is high as well. This is due to the fact that most of the objects in the Baufix domain are bolts. Hence, the type bolt has a high prior probability. This example illustrates that the propagated beliefs are joint probabilities of observed evidence and modeled knowledge.
Fig. 6.10: The belief distribution in the Bayesian network after specifying the categories object and blue in an utterance: The beliefs are shown after a complete bottom-up and top-down propagation of all evidence.
Fig. 6.10 is a screen-shot of the object identification result using again the scene shown in Fig. 6.8, but now including an utterance which names the categories object and blue. In the node intended Object, the beliefs for all blue objects are higher than for the others. The belief for the blue cube is the highest because an object of this category is part of the scene. The beliefs for the other blue objects in the node Scene are due to propagation within the network. The blue cube is clearly the identified object. In this scene, there are two blue cubes (see Fig. 6.8), and those two objects are identified. The third example (Fig. 6.11) shows the object identification result for the scene shown in
Fig. 6.11: The belief distribution in the Bayesian network after specifying the categories cube and blue in an utterance: The belief distribution is shown after the propagation of all evidence.
Fig. 6.8 and an utterance which specifies cube and blue. Here again the two blue cubes are identified. The belief distribution is similar to that in Fig. 6.10, but here the belief for the object category cube blue is even higher due to the unambiguous description. In the node Typespeech the dominating belief is the belief for cube, whereas in the node Colorspeech small beliefs for the colors red, yellow, and green can still be observed. This is due to the fact that there are cubes of all four colors in the scene. This demonstrates well the interaction of evidence from speech and image data in the network.

Modeled Knowledge

The knowledge modeled in our Bayesian network is mainly encoded in the conditional probability tables. In order to check the plausibility of the modeled knowledge, we simulated data of scenes which each contain a different object. The evidence from these scenes was sequentially put into our Bayesian network and propagated. No utterances were used. The propagation of the evidence leads to propagated beliefs in the speech nodes. These beliefs indicate for each object the prior probabilities for the naming of properties in order to refer to that object. We visualize these prior probabilities as greyvalues (white = 0, black = 1) for all objects and all properties in Fig. 6.12. This shows very well the properties of the different objects.

Simulated Data

Another set of two experiments with simulated data was carried out to evaluate the Bayesian network more generally than with single trials. We randomly created 1000 scenes with Baufix objects. For each object and each scene, we tossed a coin to decide whether this object belongs to the scene or not. Every object occurs in 500 scenes on average, and there is an average number of 11.5 objects per scene. First, we used the same 23 utterances, each describing one object class uniquely, for all 1000 scenes. The properties named in the 23 utterances are reported in Table 6.3. Each utterance describes one object uniquely, but they are more or less specific. For example, 'Schraube gelb rund' (bolt yellow round) is more specific than 'Schraube kurz rund' (bolt short round). The objects were very well identified. The first two lines of Table 6.4 report the results. All objects except the socket and the red, round bolt were always uniquely identified when they were present in a scene. The problem for the socket is that its color is not well defined. It is made of some dark white plastic, and the reflections caused by this material irritate the color classification. As already shown in Section 5.3.2, the classification results are very bad. The color is between white and wooden but obviously not separable clearly enough to be reliably classified into its own color class ivory. Therefore, the socket is sometimes detected as white and sometimes as wooden. This fact is modeled in the conditional probability tables. Furthermore, the object recognition as a whole is not very stable for this object. Most subjects characterized the socket as white. This is also modeled in the conditional probability tables.
Object Bar Bolt 3-h-bar 5-h-bar 7-h-bar Cube Rhomb-nut Rim Tire Socket Round-headed Hexagonal Washer white red yellow orange blue green purple wooden ivory small big short long medium-long medium-sized thick thin narrow high round angular elongated hexagonal quadrangular diamond-shaped flat rectangular
Fig. 6.12: Prior probabilities which are modeled in the Bayesian network for all properties and all objects.
1. Dreilochleiste holz (3-holed-bar wooden)
2. Fünflochleiste holz (5-holed-bar wooden)
3. Siebenlochleiste holz flach (7-holed-bar wooden flat)
4. Schraubwürfel rot (cube red)
5. Schraubwürfel gelb (cube yellow)
6. Schraubwürfel blau (cube blue)
7. Schraubwürfel grün (cube green)
8. Rautenmutter orange klein (rhomb-nut orange small)
9. Felge rot rund (rim red round)
10. Reifen rund (tire round)
11. Buchse weiß (socket white)
12. Scheibe hoch violett (washer high purple)
13. Scheibe flach holz (washer flat wooden)
14. Schraube kurz rund (bolt short round)
15. Schraube gelb rund (bolt yellow round)
16. Schraube orange rund (bolt orange round)
17. Schraube blau rund (bolt blue round)
18. Schraube lang rund (bolt long round)
19. Schraube sechseckig rot (bolt hexagonal red)
20. Schraube eckig gelb (bolt angular yellow)
21. Schraube eckig orange (bolt angular orange)
22. Schraube sechseckig blau (bolt hexagonal blue)
23. Schraube sechseckig lang (bolt hexagonal long)
Table 6.3: Each line contains the properties which are named in an utterance describing one Baufix object uniquely. These descriptions are later referred to as easy descriptions.
Although these confusions are captured and modeled, the joint belief for detecting socket, white, wooden, or ivory and naming socket, white will never be as high as it would be for objects without confusions. The tire is such an object. Therefore, when naming socket and white, the belief for the object tire is greater than for the object socket, because the correct specification of only the color of the tire leads to a better result than the approximate specification of properties of the socket.
Hexa blue
Hexa green
1 0 0 0
Hexa orange
1 0 0 0
Hexa yellow
1 1 .43 0 0 .66 .74 .42 .63 .27 1.21 .86
Hexa red
1 0 .55 .61
Round green
1 0 .55 .38
Round blue
1 0 .85 .36
Round orange
1 0 .84 0
Round yellow
Rim
1 0 .87 .34
Round red
Rhomb-nut
1 0 .93 .06
Washer purple Washer flat
Cube green
1 0 .79 .16
Socket
Cube blue
1 0 .65 .38
Tire
Cube yellow
random: other
Cube red
random: correct
7-h-bar
easy: other
5-h-bar
easy: correct
3-h-bar
The other object which was not always uniquely identified is the red, round bolt. This is due to the fact that the categories bolt, short, and round which were named in the utterance are not specific enough for this object. This shows that the less specific an utterance the smaller is the chance to identify an object.
.69 .21 .71 .60
1 0 .28 .92
1 0 .53 .47
1 0 .97 .07
1 0 .96 0
1 0 .60 .82
1 0 .18 .75
1 0 .56 .59
1 0 .77 .29
1 0 .63 .20
Table 6.4: Line 1 and 3: Relative frequency of correct object identifications from the 23 easy descriptions or the 200 random descriptions, respectively. All descriptions were applied to 1000 randomly generated scenes. Line 2 and 4: Relative frequency of additionally or wrongly identified objects when the intended object was present in the scene. These results are obtained when applying the easy and the random descriptions, respectively. A relative frequency greater than 1 indicates that multiple confusions may occur.
Table 6.4 describes the identification results if the object which is intended in an utterance is part of the scene. But it is also of interest to see what happens if the intended object is not there. This means that the set of properties which are specified do not apply to any of the objects in the scene. Table 6.5 shows the relative frequency of identifications of non-intended objects when the intended object is not recognized in the scene. The rows indicate the intended objects and the columns stand for the identified objects. Table 6.5 shows very well that confusions occur only between objects which have certain properties in common; for example, the red cube and the rim have the same color. The confusion rates are in a range of 0 to about 0.5 because on average each object is contained in half of the scenes. These types of confusions are only possible when the intended object is not part of the scene. For a larger, second experiment, we randomly generated 200 utterances. These utterances contain for each of the four property classes (type, color, size, and shape) one or no specification. It is obvious that not all of these randomly generated utterances make sense, and sometimes
122
Hexa green
Hexa g.
Hexa blue
Hexa b.
Hexa orange
Hexa o.
Hexa yellow
Hexa y.
Hexa red
Hexa r.
Round green
Round g.
Round blue
Round b.
Round orange
Round o.
Round yellow
Round y.
Round red
Round r.
Washer flat
Washer f.
Washer purple
Washer p.
Socket
Socket
Tire
Tire
Rim
Rim
Rhomb-nut
Rhomb
Cube green
Cube g.
Cube blue
Cube b.
Cube yellow
Cube y.
Cube red
Cube r.
7-h-bar
7-h-bar
5-h-bar
5-h-bar
O BJECT I DENTIFICATION
3-h-bar 3-h-bar
6
0 .44 .43 0 0 0 0 0 0 0 .15 0 .27 0 0 0 0 0 0 0 0 0 0
.46 0 .47 0 0 0 0 0 0 0 .14 0 .21 0 0 0 0 0 0 0 0 0 0
.41 .43 0 0 0 0 0 0 0 0 .16 0 .27 0 0 0 0 0 0 0 0 0 0
0 0 0 0 .50 .46 .48 0 .26 0 0 .06 0 0 0 0 0 0 0 0 0 0 0
0 0 0 .45 0 .48 .46 0 0 0 0 .07 0 0 0 0 0 0 0 .23 0 0 0
0 0 0 .49 .54 0 .51 0 0 0 0 .07 0 0 0 0 0 0 0 0 0 0 0
0 0 0 .46 .47 .46 0 0 0 0 0 .08 0 0 0 0 0 0 0 0 0 0 0
0 0 0 .39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .04 0 0
0 0 .03 .47 0 0 0 0 0 .47 0 .23 .04 0 0 0 0 0 .04 0 0 0 0
0 0 0 0 0 0 0 0 0 0 .44 .01 .01 0 0 0 0 0 0 0 0 0 0
.06 .01 .06 .21 .01 0 0 .10 0 .03 0 .24 .03 0 .05 0 0 0 .04 .07 0 0 0
0 0 0 0 0 .02 0 .11 0 .25 0 0 .53 0 0 0 .03 0 0 0 0 0 0
.12 .13 .12 0 0 0 0 .11 0 .03 .27 .49 0 0 0 0 0 0 0 0 0 0 0
0 0 0 .52 0 0 0 0 .49 .32 0 0 0 0 .29 .17 .19 .27 .27 0 0 0 0
0 0 0 0 .49 0 0 0 0 .36 0 0 0 .52 0 .19 .22 .25 0 .47 0 0 0
0 0 0 .01 0 0 0 .52 .13 .27 0 0 0 .24 .23 0 .19 .22 .13 0 .56 0 0
0 0 0 0 0 .52 0 0 0 .10 0 0 0 .03 .15 .04 0 .28 0 0 0 .53 .01
0 0 0 0 0 0 .49 0 0 .05 0 0 0 .01 .10 .05 .06 0 0 0 0 0 .34
0 0 0 .48 0 0 0 0 .05 0 0 0 0 .10 0 .02 0 0 0 .16 .18 .22 .11
0 0 0 0 .51 0 0 0 0 0 0 0 0 .12 .49 .02 0 .01 .13 0 .21 .23 .11
0 0 0 .01 0 0 0 .49 .02 0 0 0 0 .03 0 .54 0 0 .48 .15 0 .21 .12
0 0 0 0 0 .53 0 0 0 0 0 0 0 0 0 .01 .52 .02 .06 .12 .09 0 .45
0 0 0 0 0 0 .47 0 0 0 0 0 0 0 0 .01 0 .46 .04 .12 .10 .18 0
Table 6.5: Confusion table for false identifications that occur when the intended objects is not detected in the scene: The identifications are based on the 23 “easy” object descriptions which are applied to the 1000 randomly generated scenes. The rows indicate the intended objects and the columns stand for the identified objects.
it is hard to distinguish which object is intended. Therefore, two semi-naive7 subjects independently classified the meaningful utterances and provided a reference to the object which is in their opinion denoted by the utterance. Those utterances where the subjects agreed on the same object are considered as intending the respective object. 88 utterances were classified as intending a Baufix object. All others are considered as nonsense. The named properties of some sample utterances are shown in Table 6.6. Classified as Bolt y. (2) Hexa b. Cube b. 5-h-bar 3-h-bar bar (3) -
Specified Properties Schraube gelb mittellang l¨anglich bolt yellow medium-long elongated Schraube mittelgroß sechseckig bolt medium-sized hexagonal F¨unflochleiste blau mittellang 5-holed-bar blue medium-long Sechskantschraube orange d¨unn viereckig hexagonal-bolt orange thin quadrangular Schraubw¨urfel blau flach cube blue flat F¨unflochleiste holz schmal viereckig 5-holed-bar wooden narrow quadrangular Leiste holz dick eckig bar wooden thick angular Objekt holz schmal viereckig object wooden narrow quadrangular gr¨un schmal viereckig green narrow quadrangular
Table 6.6: Examples of randomly generated utterances: The first column states the object(s), which are classified as intended by two subjects independently. The uttered properties are contained in the second column. A number in parentheses indicates the number of possible intended objects. These descriptions are referred to as random descriptions.
The 200 randomly generated utterances were used for all 1000 scenes. We again distinguish the identification results whether the intended object(s) is/are part of the scene or not. The relative frequency of correct identifications per object are shown in line 3 of Table 6.4. These results apply when the intended object is part of the scene. The relative frequency of additionally or wrongly identified objects is reported in line 4 of Table 6.4. Here, a relative frequency greater than 1 indicates that multiple confusions occur. Utterances which are intending the two washers were not contained in the randomly generated test set. The results are not as good as for the easy descriptions (see Table 6.3) but still reasonably good. For the three worst cases there are easy explanations. The yellow hexagonal bolt is only referred to by the properties “Schraube gelb mittellang l¨anglich” (bolt yellow medium-long elongated). The named size and shape categories do not really fit for the yellow bolt which is rather short and they do furthermore not especially address the hexagonal bolt. The correct identification rate for the round yellow bolt is not very good either. For this object, the two utterances “Schraube gelb mittellang l¨anglich” (bolt yellow medium-long elongated) and “Rundkopfschraube holz schmal” (round-headed bolt wooden narrow) are not very specific as well. The tire is a third example which is not so good in this test set. But again this object is only referred to by one utterance which is “Reifen gelb” (tire yellow) where the color is not correctly specified. If the intended object is not part of the scene, then we obtained the confusions which are 7
Subjects who are familiar with Baufix objects but not with our object identification approach.
124
6
O BJECT I DENTIFICATION
reported in Table 6.7. The rows indicate the intended objects and the columns denote the identified objects. Here again the range of confusions is wider than for the "easy descriptions", which is very much due to the fact that the randomly generated descriptions are not consistent within the named properties. Not all specified properties really apply to the intended object.
Hexa g.
Hexa green
Hexa b.
Hexa blue
Hexa o.
Hexa orange
Hexa y.
Hexa yellow
Hexa r.
Hexa red
Round g.
Round green
Round b.
Round blue
Round o.
Round orange
Round y.
Round yellow
Round r.
Round red
Washer f.
Washer flat
Washer p.
Washer purple
Socket
Socket
Tire
Tire
Rim
Rim
Rhomb
Rhomb-nut
Cube g.
Cube green
Cube b.
Cube blue
Cube y.
Cube yellow
Cube r.
Cube red
7-h-bar
7-h-bar
5-h-bar
5-h-bar
3-h-bar
3-h-bar
The two examples, the easy and the randomly generated descriptions, can be considered as a good and a bad example. They show the range of results which are obtained with our Bayesian network approach for object identification.
0 .32 .18 .01 .05 .01 .03 .09 0 0 .15 0 0 0 .07 .04 .02 .02 0 0 .15 .03 .03
.18 0 .16 .01 .05 .03 .03 .11 0 0 .16 0 0 0 .17 .01 .01 0 0 0 .07 .01 .05
.14 .23 0 .01 .06 .03 .03 .11 0 0 .16 0 0 0 .16 0 .01 0 0 0 .04 .01 .09
.05 .05 .14 0 0 .10 0 .11 .49 0 .02 0 0 .13 0 .13 0 0 .06 0 .08 .01 0
.10 .05 .22 .16 0 .09 0 .10 0 .48 .02 0 0 0 0 .04 0 0 0 0 .09 .01 0
.07 .05 .15 .17 0 0 0 .03 0 0 .03 0 0 0 0 .04 .14 0 0 0 .10 .10 0
.07 .05 .15 .15 0 .09 0 .05 0 0 .03 0 0 0 0 .04 0 .01 0 0 .08 .01 .08
.07 .09 .01 .15 0 .01 0 0 .07 0 .07 0 0 .01 0 .09 .02 0 0 0 .02 0 0
.02 .06 .04 .16 0 .01 .04 .06 0 .08 0 0 0 .13 0 .03 0 0 .18 0 .01 .01 0
.07 .07 .03 0 0 .04 .07 .06 .03 0 .33 0 0 0 0 .16 0 0 0 0 0 0 .08
.04 .03 .03 .04 0 .01 0 .05 .01 0 0 0 0 .01 .01 .01 .03 .03 0 0 .01 .07 .03
.01 .01 .04 .01 0 .26 0 .15 0 0 .07 0 0 0 0 0 .08 0 0 0 0 .02 0
.12 .14 .13 .01 .01 .01 .01 .11 0 0 .26 0 0 0 .03 .02 0 0 0 0 .01 .01 .03
.02 0 .02 .28 0 .01 .01 .06 .25 0 .03 0 0 0 .01 0 .02 .09 .21 0 0 .01 0
.06 0 .10 0 0 .01 0 .04 0 .43 .02 0 0 .04 0 0 .03 .09 .04 0 .01 .01 .01
.05 .02 .01 .18 0 0 0 .09 .18 0 0 0 0 .18 .03 0 .05 .12 .15 .04 .26 .03 .02
.04 .04 .01 .01 0 .25 0 0 0 0 0 0 0 .02 .15 .13 0 .25 0 .29 .07 .24 .09
.01 .01 .06 0 0 .01 .25 0 0 0 0 0 0 .06 .04 .09 .17 0 0 .06 .02 .05 .21
.01 0 .02 .20 .01 .01 .01 .05 .14 0 .02 0 0 .14 0 .01 .03 0 0 0 .02 .06 .01
.04 0 .09 .01 .02 .01 .01 .05 0 .29 .01 0 0 0 0 0 .03 .01 .18 0 .03 .07 .02
.03 .02 .01 .07 .01 .01 .01 .10 .13 0 0 0 0 .10 .01 .19 .05 .02 .18 .03 0 .16 .04
.02 .04 .01 .01 .01 .20 .01 .03 0 0 0 0 0 .01 .06 .01 .13 .13 .04 .12 .12 0 .12
.01 0 .06 0 0 0 .11 .02 0 0 0 0 0 0 .01 0 .08 .15 .01 .03 .05 .10 0
Table 6.7: Confusion table for false identifications that occur when the intended object(s) is/are not detected in the scene: The identifications are based on the randomly generated utterances. Here only the 88 utterances were considered which were classified as specifying a Baufix object. The rows indicate the intended objects and the columns denote the identified objects.
Real Idealized Data

We also carried out experiments with real data. In the data set described in Subsection 6.2.2, subjects referred to objects which were shown on a computer screen (the subjects described in each utterance one object which was marked in an image by an arrow). We use the transcriptions of these utterances and manually generated scene descriptions as input for the system QuaSI-ACE to evaluate our Bayesian network. We use 'idealized' real data in order to avoid errors in the understanding processes. Our Bayesian network approach is evaluated by comparing the object identification results with the originally referred-to, intended objects. Unfortunately, we also used these data for the estimation of the conditional probability tables in the Bayesian network, and due to time constraints we were not able to acquire another test set. An identification is correct if the intended object is found. If additional objects were identified besides the intended object, then it is still considered correct but counted for the class additional as well.

# utterances   correct        additional   false        nothing
412            381 (92.5%)    34 (8.3%)    31 (7.5%)    0 (0%)
Table 6.8: Object identification results for 412 utterances on 11 different scenes: Correct identifications (may contain additional objects besides the intended object), identifications with additional objects, false identifications, and cases where no object was identified (nothing).
Our system identified the intended object correctly for 92.5% of the utterances. False or additional identifications occur mainly because of inaccurate or imprecise specifications by the subjects. Another reason for errors (false/additional) is that the identification criterion ($\ell_j > \mu + \sigma$) is not adequate for a given utterance and scene. It is a dynamic threshold, but thresholds may fail.

Spatial Relations

We collected a second set of utterances under the same conditions as described in the previous paragraph and in Subsection 6.2.2. This set is used only for testing. Six subjects had to name marked objects from 5 different scenes which were presented on a computer screen. This time, the subjects were explicitly asked to use spatial relations in every instruction. We 'idealized' the data again to evaluate the performance of our identification approach when spatial relations are used. The evaluation is carried out in the same way as described in the previous paragraph. In addition, we require for this experiment that the object which is identified by the system must be exactly the same, i.e. have the same image coordinates, as the one which was named by a subject. It is possible to specify an object uniquely with spatial relations,
The subjects described in each utterance one object which was marked in an image by an arrow.
126
6
O BJECT I DENTIFICATION
which wasn’t the case with only the description of the type, color, size, or shape of an object. The results are reported in Table 6.9. # utterances 98
correct 82 (83.5%)
additional false 2 (2%) 16 (16.5%)
nothing 0 (0%)
Table 6.9: Object identification results for 98 on 5 different scenes from instructions with spatial relations: Correct identifications (may contain additional objects besides the intended object), identifications with additional objects, false identifications, and cases where no object was identified (nothing).
The number of false identifications is now higher than for the experiment shown in Table 6.8. This is due to the more severe criterion for a correct identification, but also because of discrepancies between the computation of spatial relations and the use of spatial relations by the subjects. Image Sequences In our targeted applications, the scenes are not changing very much. The assembly is rather slow compared to the image frame rate. Furthermore, no actions in the scene might happen during the time an instruction is uttered. Therefore, changes in the object hypotheses are more likely due to recognition errors than to changes in the scene. This is especially true when the image region characteristics (i.e. center of mass), which correspond to the recognized object, are basically constant in subsequent image frames. We can compensate these errors easily with our Bayesian network. We simply use the beliefs computed for the object nodes (Object1,..., Objectm) at time t ? 1 as causal support for these nodes at time t. Thus, we predict that the object category will not change. Fig. 6.13 shows the results of two experiments with simulated data. Both experiments run over 80 time frames. One object node is observed during this time. Random noise is added to the object hypotheses. The probability that the object hypothesis does change is set to 0.25. Hence, the object hypothesis may vary in a quarter of all time frames. Fig. 6.13 shows four components of the belief vector of the object node. The true object category is represented by one of these four components. Fig. 6.13a shows the results when recursive prediction is active. We see the noise in the data, but the belief of the true object hypothesis is clearly distinguishable from the others. In Fig. 6.13b the results are shown of the same experiment but without prediction. The signal to noise ratio is here almost equal to 1. The noise distorts the the belief of the true object hypothesis significantly.
6.2
O BJECT I DENTIFICATION
USING
127
BAYESIAN N ETWORKS
belief
(a) 1.2 1 0.8 0.6 0.4 0.2 0 0
10
20
30 40 50 number of cycles
60
70
80
0
10
20
30
60
70
80
belief
(b) 1.2 1 0.8 0.6 0.4 0.2 0 40 50 number of cycles
Fig. 6.13: The behavior of the beliefs in an object node over time: Four components of the belief vector are plotted in different greyvalues. White noise was added to the diagnostic support. (a) With prediction: The belief of the true object hypothesis is always greater than the others, (b) Without prediction: the signal to noise ratio is close to 1.
6.2.5 Discussion There are various approaches for reasoning with uncertainty (Russell & Norvig, 1995). We decided to use Bayesian networks for a number of reasons:
Bayesian networks are a natural way to represent a problem domain and their conditional independence information. The links between nodes represent the qualitative aspects of the domain, and the conditional probability tables represent the quantitative aspects. A Bayesian network is a complete representation for the joint probability distribution for the domain, however it is often exponentially smaller in size. Inference in Bayesian networks means computing the probability distribution of a set of query variables, given a set of evidence variables. In our case, the evidence variables represent the results from speech and image understanding and the query variable stands for the intended object(s). Reasoning in Bayesian networks can be causal, diagnostical, in mixed mode, or intercausal. No other uncertain reasoning mechanism can handle all these modes (Russell
128
6
O BJECT I DENTIFICATION
& Norvig, 1995). This property is very important for our application. The goal is not only to identify the intended object(s) but also to infer mutual restrictions for the understanding processes in order to enable the interaction between those processes. Various alternative systems for reasoning with uncertainty have been suggested. All the truthfunctional systems have serious problems with mixed or intercausal reasoning.
The complexity of Bayesian network inferences depend on the network structure. In polytrees, the computation time is linear in the size of the network (Russell & Norvig, 1995).
The results have shown that our approach is very well suited for the given problem. Despite all the advantages, however, there are also limitations and drawbacks of our approach. Some of them are subject to future work.
The Bayesian network approach is a decision calculus. It determines the object with the highest joint probability of the named and detected features. Knowledge is modeled mainly through the conditional probability tables. There are no other generalizing or reasoning rules. The system is not able to generalize or to abstract from named properties. No hierarchies like
(frectangular, diamond-shapedg 2 quadrangular) [ fhexagonalg 2 fangularg are implemented. The knowledge implemented in the Bayesian network results from error analyses and from experiments with human subjects. Those subjects do not seem to apply the hierarchies consistently, therefore phrases like: “die schmale rechteckige holzfarbene Scheibe (the narrow rectangular wooden washer)”
do not lead to the identification of an object. Two human subjects, however, both identified a wooden bar from this phrase.
The system is more likely to identify an object than not. If there is at least one object property which matches an uttered property then the object will be identified if no other object fits better. It can be assumed that a serious instructor does not want to fool the system. Instructions should refer to objects in the scene, and they somehow should make sense. In this case, it is a good strategy for the system to identify the best matching object without requiring a perfect match. Furthermore, it is more likely for the system to identify only the best fitting object rather than all possible fits. If, for example, in a scene with several bars, an utterance specifying ‘bar’ is given, then the best fitting bar will be identified and not all of them. In the current implementation, for example, the most preferred bar is the 5-holed-bar, since it was best recognized in the image training set and not confused in the training utterances.
6.2
O BJECT I DENTIFICATION
USING
BAYESIAN N ETWORKS
129
The Bayesian network approach takes the scene context into account. However, no topological structure is considered. Local neighborhoods of objects or subscenes can not be represented. Therefore, an explicit focus on an object and its local neighborhood can not be modeled nor recognized. A possibility could be to model the scene topology with Markov Random Fields (Geman & Geman, 1984) and to attach them to the nodes in the Bayesian networks. Then, evidence would not be derived from the qualitative descriptions only, but from the qualitative descriptions within the context of the scene topology. The identification criterion (j > + ) is a dynamic threshold. As for all hard decisions, also this threshold may fail. Future work should investigate learned identification criteria, for example, neural networks or other classifiers.
130
6
O BJECT I DENTIFICATION
Chapter 7
Results and Discussion
This chapter covers results and a discussion of the complete image understanding component in the context of the integrated speech and image understanding system Q UA SI-ACE. Results of the isolated modules can be found in previous chapters: The performance of the 3D reconstruction is evaluated in Section 4.6. Examples for the extraction of the object type and the object color are given in Subsection 5.2.2 and 5.3.2, respectively. Results of the computational model for spatial relations are discussed in Subsection 5.4.4. For the object identification based on Bayesian networks, the results are shown in Subsection 6.2.4.
7.1 Results Our image understanding and object identification component is integrated in the system Q UA SI-ACE. We evaluate the entire image understanding and object identification component in form of an end-to-end evaluation of the system Q UA SI-ACE. Input are spoken utterances or textual input and object hypotheses which result from object recognition (cf. Section 5.2.1). We use spoken utterances as well as textual input to obtain a measure of the performance of speech recognition and understanding. The textual input are orthographic transcriptions of the spoken utterances. We evaluate Q UA SI-ACE using two sets of data. These are the same data sets which we used in Subsection 6.2.4. There, we idealized the data in order to concentrate on the object identification results. Now we are interested in the performance of the complete system and especially of the image understanding and object identification component when using real data. The first data set consists of 453 spontaneous utterances in which subjects refer to objects in 11 different scenes. Images of the scenes were presented to the subjects on a computer screen and they had to name the object indicated by an arrow. The ground truth are the originally indicated objects. The second data set includes 98 utterances where the subjects were explicitly asked
132
7
R ESULTS
AND
D ISCUSSION
to use spatial relations in every instruction. The objects were presented in the context of six different scenes. Not all of the utterances show a linguistic structure which is modeled in the speech understanding component. These incomprehensible utterances are rejected by the system. Q UA SI-ACE’s performance on the remaining utterances in the context of the different scenes is listed in Table 7.1. The number of processed utterances is shown in column 3 of Table 7.1. The speech recognition component was adapted from a different system and not especially trained for our domain. This explains that only 133 of 453 and 21 of 98 utterances are recognized, respectively. The performance is much better for the textual input. Here, 413 and 84 utterances are valid, respectively. The results for the remaining utterances from the first data set are presented in line 1 and 2 of Table 7.1 and in line 3 and 4 for the utterances using spatial relations (second data set). data set
source # utterances
correct
no spatial rel. speech
133
93 (70%)
no spatial rel.
text
417
spatial rel.
speech
21
16 (76%)
spatial rel.
text
84
66 (78.6%)
additional
false
nothing
14 (10.5%) 25 (18.8%) 15 (11.2%)
360 (86.3%) 34 (8.1%)
40 (9.6%)
17 (4%)
12 (57%)
5 (24%)
0 (0%)
7 (8.3%)
18 (21.4%)
0 (0%)
Table 7.1: Q UA SI-ACE’s results for 453 utterances with spoken and textual input and for 98 utterances where subjects explicitly used spatial relations to specify the intended object.
Column correct of Table 7.1 shows the number of utterances for which the actually intended object was among the ones identified by Q UA SI-ACE. However, additional ones may have been identified, either due to ambiguities in the utterances, speech understanding deficiencies, or due to multiple recognized objects with similarly well applicable properties. The number of utterances listed in column 2 are subsets of the correct identifications. The false identifications are either caused by speech or object recognition errors. The last column lists the number of utterances for which no object was identified at all. In most cases, this is due to errors from object recognition. The intended object might not have been recognized. The relative frequencies are based on the number of utterances per experiment which were not rejected by Q UA SI-ACE. The comparison of results for spoken and textual input shows a much better performance for textual input. We use spontaneous utterances which do not show a uniform linguistic structure and which contain hesitations, repairs, etc. A better use of restrictions from image data is here subject to further work, besides a better training of the speech recognition component. The evaluation criteria for utterances with spatial relations are more severe than for others. A spatial relation specifies an object uniquely in most cases, and Q UA SI-ACE’s task is to identify exactly the intended object. From the first data set, any object with the same type, color,
7.2
D ISCUSSION
133
size, and shape could be selected. Therefore, the number of false identifications is higher when spatial relations are involved. Most of the false identifications are here due to errors in speech understanding or object recognition, or caused by discrepancies between the computation of spatial relations and the use of locative terms by the subjects.
7.2 Discussion In the discussion of the image understanding and object identification component and the system Q UA SI-ACE, we concentrate on the issues:
integration of speech and image understanding, representation of qualitative features, the system Q UA SI-ACE: expectations and results.
Discussions of the individual modules are found in the previous chapters: The 3D reconstruction is discussed in Section 4.7, the computational model for spatial relations is discussed in Subsection 5.4.7, and the Bayesian network approach for object identification is discussed in Subsection 6.2.5. Integration of speech and image understanding is a key issue in the design of the system Q UA SI-ACE. The combination and integration of different sensory channels and understanding abilities are, according to the paradigms of situated artificial communicators, essential for a communication in real environments. Additional sensory channels provide additional information which may narrow the context and lead to more robust recognition and understanding results. Additional abilities can help to overcome non optimal interpretations, and mutual restrictions are very useful for error correction. A first step towards integrated speech and image understanding is that Q UA SI-ACE can automatically derive linguistic language models from image understanding results. The common qualitative descriptions of speech and image understanding results are an excellent means for information exchange. Our Bayesian network for object identification works bidirectional. Thus, results from image understanding can be propagated down as restrictions for speech understanding and used for the generation of language models which represent possible and likely utterances in the context of the current scene. Another example for bidirectional processing is our computational model for spatial relations. It is designed not only to generate spatial relations between object pairs, but also to understand spatial relations together with a given reference object and reference frame, in order to restrict the admissible image region for the possibly intended object.
134
7
R ESULTS
AND
D ISCUSSION
All modules in our system are designed for incremental processing and for a bottom-up as well as top-down flow of information. These are important prerequisites to further extend the interaction of speech and image understanding in the future. Most related work in the field concentrates either on the generation of scene descriptions from visual data or on the visualization of spoken phrases. Only recent attempts try to integrate vision and natural language processing (see, e.g., Mc Kevitt, 1994b; Wahlster, 1994; Maaß & Kevitt, 1996). Kollnig & Nagel (1993) describe a system which analyzes traffic scenes in image sequences and generates symbolic descriptions of movements in the scene which correspond to motion verbs in natural language. The systems of Andr´e et al. (1988), Herzog & Wazinski (1994) and others produce textual output describing football scenes. Olivier & Tsujii (1994) visualize spatial prepositions supporting different reference frames. Other systems are designed to understand spatial prepositions (Gapp, 1994) or emotional expressions concerning size or color as in (Nakatani & Itho, 1994). But none of these systems aims to integrate other sensory information or to implement a bidirectional processing strategy. The representation of qualitative features is a second important issue. Investigations about cognitive mechanisms of mastering language, i.e. to use and to understand language, are of great interest for our scenario. Problems such as: ‘Is there a uniform representation for all cognitive mechanisms or do different mechanisms need different representations? How do different mechanisms interact? How do cognitive mechanisms work?’ are addressed in cognitive psychology and linguistics (Schnelle, 1991; Engelkamp & Pechmann, 1993). But those investigations tend to be rather general and they contain only a few hints how to actually implement representational schemes in the context of a concrete system. In contrast to that, the representations which are used in most computational systems are fairly simple and specifically adapted to the actual task (e.g., Kollnig & Nagel, 1993; Herzog & Wazinski, 1994). The goal is, in most cases, to extract some specific symbols. Reasoning on these representations or inferring other results from them is often not considered. Whereas typical reasoning approaches mostly do not start from sensory data but already from high-level special purpose representations (e.g. Peuquet & Ci-Xiang, 1987; Egenhofer, 1991; Hern`andez, 1993). We developed qualitative descriptions as a uniform representation for speech and image understanding results. This representation can also be used for other types of qualitative features and other domains. The vectorial representation has the advantage of being suitable for multiple and even contradicting results, as well as for overlapping categories. A variety of information can be represented which does not require that hard decisions have to be made at early stages of processing. This form of representation has been proven to work successfully in our system, and it forms the basis for the integration of speech and image understanding. It lends itself for probabilistic reasoning upon the represented entities and is not specifically adapted to specific underlying computational routines. Bayesian networks provide a formalism for reasoning about partial beliefs under conditions of uncertainty. 
They are very useful in combining the different extracted properties and weigh the influence of them upon the final object identification accounting for uncertainties in the
7.2
D ISCUSSION
135
data, the detection processes, and the decision process. Empirical data and results from psycholinguistic experiments can be easily incorporated. Our object identification approach is rather flexible and well suitable for the variety of different instructions humans can give to the system. We expect computer systems to work fast and always correct and to behave according to the users expectations. These are tough requirements and obviously hard to accomplish. The overall results of Q UA SI-ACE (see Section 7.1) are very satisfying as a first attempt of a speech and image understanding system which works full automatically on real data. But we also see that improvements are necessary, especially when using speech input. Efforts in the future will certainly lead to better results, but on the other hand, the system will never react always correct and perfectly adapted to the instructor. Therefore, it is necessary to incorporate strategies for correcting errors and misunderstandings and for adapting to the instructor. Humans make quite a lot of mistakes in natural communication or visual observations. But humans are also very good in adapting to a given situation and in learning from corrections. So far, Q UA SIACE tries to uniquely identify an intended object and it accepts refinements or corrections of the object reference until a unique identification is possible. Corrections are possible at any time, and correcting utterances should start with “No, ...”. More strategies for adapting to and interacting with the instructor should be part of further work besides the improvement of the individual modules.
136
7
R ESULTS
AND
D ISCUSSION
Chapter 8
Conclusions
8.1 Conclusions In this thesis, we addressed the problem of image understanding in the context of integrated speech and image understanding. We designed and implemented components for high-level image understanding and the identification of verbally described object(s) from information which we extracted from image data. Our components work full automatically, and they are suitable for real speech and image data.
The design is based on the idea of integrated speech and image understanding. The information is extracted incrementally and at any time as many mutual restrictions as possible are available for both understanding channels.
For this work understanding means both, (a) the analysis of image data, which is the computation of numerical quantities, and (b) the interpretation of image data, which means the extraction of qualitative features. Image understanding is a hierarchical and context dependent process. Therefore, our image understanding component includes several steps working at increasing levels of abstraction. It is embedded in a knowledge-based system in order to exploit efficiently the available domain knowledge. The image understanding and object identification steps which are addressed in this thesis are: 1. Model-based 3D reconstruction of the scene: We assume that color segmentation and object recognition have been carried out on the images which we use for 3D reconstruction. We then extract features of the recognized objects and fit object models to the image features simultaneously for all objects in all images. The model-based reconstruction approach efficiently uses domain knowledge and does not require prior camera calibration. The 3D reconstruction is an iterative process. So at any time there is information available and the results improve with time.
138
8
C ONCLUSIONS
2D information would be sufficient for simple object identification, but an interactive assembly task requires 3D information. 3D data facilitates image understanding in our scenario. The human perception is three-dimensional and the assembly takes place in a three-dimensional environment. Spatial relations or localizations under changing reference frames, for example, if the instructor moves, can only be understood using 3D information. Furthermore, ambiguities in the image data are often due to the loss of depth information in the perspective projection. Hence, occlusions and ambiguities can be resolved using 3D information. The 3D pose of an object is, in addition to that, necessary for grasping purposes. 2. Computation and understanding of spatial relations: The computation and understanding of spatial relations in 3D is based on the reconstructed 3D scene. The naming of spatial relations is important for the exact specification of objects. Therefore, we developed a 3D model for the computation of spatial relations which supports changing reference frames. It is a two layered approach with the aim to represent a scene in terms of qualitative relations on a first level of abstraction. Most of this representation remains stable over time and needs only to be updated for those objects, for which the position changed. As soon as spatial relations are uttered in an instruction, a degree of applicability for all modeled spatial relations, i.e. left, right, above, below, behind, and in-front, is computed for each possibly intended object (IO) and corresponding reference object (RO). The spatial model is also suitable to generate restrictions for possibly intended objects given a spatial relation, a reference object, and a reference frame. The computational model has been evaluated with psycholinguistic experiments. The overall performance of the model is very well accepted by the subjects, but some questions, especially regarding space partitioning given a reference object, a reference frame, and a spatial relation, are subject to further work. 3. Qualitative descriptions form a uniform representation of speech and image understanding results. We developed a new, fuzzified, non-numerical representational scheme for results of understanding processes. We abstract from the underlying numerical information by categorizing it. Not only the pure categories are represented but also a grade or degree of applicability of each category. Furthermore, we do not assume a distinct partitioning of the categorical space. Instead, we assign to each property, which is a set of categories, a vector of fuzzy degrees of applicability, where each component characterizes one category within the categorical space of this property. The question of how qualitative properties should be computed and represented is a major issue in cognitive science (Palmer, 1978). Some ideas are implemented in this work. The advantages of our qualitative descriptions are: (a) They are flexible enough to react according to a wide variety of verbal descriptions. (b) Overlapping and ambiguous understanding results can be represented. Any of the meanings inferred from the results can then be addressed by the instructor.
8.1
C ONCLUSIONS
139
(c) They form a uniform representation for speech and image understanding results and provide a means for the mutual exchange of restrictions for both speech and image understanding. (d) No decisions have to be made at the stage of categorization. The actual object identification is carried out when most of the relevant information for speech and images is available. This representation is not limited to speech and image understanding results. It can be used for any type of categorization. Furthermore, this representation is not specific for the computational model which we use. The categorization algorithms could easily be substituted. Even the number of possible categories can be changed easily. 4. Object identification with Bayesian networks: The object identification with a Bayesian network approach is based on qualitative descriptions of speech and images. We think that an adequate decision calculus is very important. We want to make as few hard decisions as possible during the categorization processes. During these processes there is only information available about certain properties. But the final object identification and therefore the final assignment of qualitative properties to objects or groups of objects should be done by jointly using as much information from speech and images as possible. We even think that some false or inadequate categorizations can be compensated through the decision calculus when it operates on all information which is available. We provide an approach to combine the two fundamental paradigms, deterministic and probabilistic techniques. Most of the domain knowledge is represented in semantic networks using the ERNEST formalism. The fuzziness of the categorization process during the extraction of qualitative features is modeled through probabilistic values. Bayesian networks are used as decision calculus. The combination of different techniques compensate limits of the single techniques which is reflected in our results. All modules are designed for data driven (bottom-up) as well as for model driven (top-down) processing. Our high-level computer vision component shows incremental processing at each step. In this way, restrictions are available at any time and they can be used and propagated top-down as well as bottom-up. A further advantage of our approach is that it can be evaluated entirely by psychological experiments. Then, evaluation results as well as empirical data can be incorporated into the categorization and decision processes. The results show that our image understanding and object identification components can very well be used as a step towards a more natural human-computer interaction.
140
8
C ONCLUSIONS
8.2 Future Work The system Q UA SI-ACE, and within it the image understanding and object identification components, are a first attempt to build a system for natural human-computer interaction. In this context there are a lot of open questions of which only a few could be touched in this work. Many interesting points could not be addressed due to the complexity of the field. Some steps of future work are summarized in the following. Open problems and future work which are related to the isolated modules are addressed throughout the discussions of these modules (see Section 4.7 for the 3D reconstruction, Subsection 5.4.7 for the computation of spatial relations, and Subsection 6.2.5 for the object identification).
The experimental-simulative method is the key principle of the work in the joint research project “Situated Artificial Communicators” at the University of Bielefeld. The system Q UA SI-ACE provides a first simulation of some aspects of human-computer interaction. The computational model for spatial relations has been evaluated experimentally. The next steps are to evaluate other components as well as to augment and to improve the computational model for spatial relations according to the insights gained from the experiments. A closer interaction of speech and image understanding is one of the next steps to do. The design of the system Q UA SI-ACE was focused on this issue, however, the potential of the designed mechanisms, such as incremental processing, common qualitative representation, homogeneous knowledge base, bidirectional data flow etc., is not yet fully exploited. Now, we have a working system and we can really experiment with these mechanisms. The control strategy should be adapted so that as many mutual restrictions as possible are exchanged. The scene context is taken into account for the object identification in a limited way. Here, further improvements are necessary. The topological structure of the scene should be modeled. This is possible with Markov-Random-Fields (MRF) (Geman & Geman, 1984). They could be attached to the nodes in the Bayesian network to represent the scene topology as well as local neighborhoods of objects. MRFs could be used only for single properties, for example, color. Then, for example, a color which does not often occur in the scene is more salient than the color of many objects in the scene. With this kind of a more explicit scene modeling, evidence for the object identification would not only be derived from the qualitative descriptions, but also from the qualitative descriptions within the context of the scene topology. So far, we concentrated on the identification of objects from instructions. But we did not address the execution of the instructions. This would include addressing the following issues: – The understanding and representation of actions in the instructions.
8.2
F UTURE W ORK
141
– The visual observation of the scene over time: This would imply the tracking of objects which has the advantage of using information recursively. This would also imply to represent and to model observed actions, for example as episodes. Here again, a hierarchical modeling would be adequate. Hidden Markov Models (HMM) could decide at each level of abstraction what kinds of information are emitted as having changed significantly. HMMs could trigger when and which qualitative descriptions must be updated due to changes in the scene. – Actions have to be represented and implemented for the object identification. Not all actions can be executed with all objects and vice versa.
Dialog-strategies, strategies for error corrections, and for the adaption to the instructor should be focused in the efforts of an overall improvement of Q UA SI-ACE.
142
8
C ONCLUSIONS
Appendix A
Notations x
Throughout this work, we write vectors as small bold type characters, e.g., , and matrices as capital letters. Homogeneous transformations are denoted by calligraphic capital characters T and P with subscripts indicating destination and source coordinate frame of the transformation. For example, the homogeneous transformation from the right-handed, Cartesian coordinate frame q into the right-handed, Cartesian coordinate frame p is denoted as Tpq . We use the symbols bl , br for the left and right image coordinate frame, l, r for the left and right camera coordinate frame, and o for the object centered coordinate frame. The following table lists most of the symbols and their meaning which we use throughout this thesis. We split the tabular in three parts, one for 3D reconstruction, one for qualitative descriptions and spatial relations, and one for Bayesian networks and object identification.
xu
h
j
i
0nm ; 1nm Inm pose
f (Cx; Cy )T sx ; sy Rpq !x !y !z
tpq xb = (xb; yb)T xbl xbr xbji
general scalar product n m vector or matrix with all entries zero or one n m unit matrix 3D Reconstruction position and orientation of an object in 3D focal length principal point pixel scale factor in x and y direction rotation matrix for the rotation from coordinate frame q to frame p yaw: rotation angle (x-axis) pitch: rotation angle (y -axis) roll: rotation angle (z -axis) translation vector point in image coordinates image feature (point, line segment or ellipse) in the left image image feature (point, line segment or ellipse) in the right image image feature i in image j 2 fl; rg
144
A
xo = (xo; yo; zo)T xoi m = (mx; my )T
?1 o b ?o ?b
P
a i (a; xo ) i bj o i
Qu Qb BO AViO riO
(P; riO) (ref ; rel; riRO ) (ref ; rel; IO; RO) capital letters
e e+X? ; (x) eX ; (x) BEL(x)
X (u) Yk (x) P (x u); MX jU j
io scene objectj offsetj
j j rd s "
N OTATIONS
point in object centered coordinates feature i in object centered coordinates midpoint (ellipse or line segment) in image coordinates angle (ellipses or line segments) transformation: Euclidean ! projective coordinates transformation: projective ! Euclidean coordinates transformation in frame o: 3D line seg. Euclidean ! projective space transformation in image (b): 2D projective rep. of line seg. ! midpoint rep. transformation in frame o: 3D circle ! 4 projective points + cross ratio transformation in image (b): 2D points + cross ratio ! ellipse ( ; l1; l2; ) vector of objects’ pose and camera parameters projection of feature i from coordinate frame o to image j using covariance matrix covariance matrix of measurement noise in Equation 4.19 for feature i Qualitative Descriptions and Spatial Relations qualitative description of a unary property qualitative description of a binary relation bounding box of object O acceptance volume acceptance relation degree of containment: containment of object P in acceptance rel. riO degree of accordance: accordance of the meaning with a spatial relation rel under a reference frame ref and an acceptance relation riO degree of applicability: applicability of a spatial rel. rel to a IO/RO pair Bayesian Networks and Object Identification node in Bayesian network total evidence in the network causal support of node X diagnostic support of node X belief of node X diagnostic message from node X to node U causal message from node X to node Yk conditional probability table associated to arc U ! X random variable represented by node intended Object random variable represented by node Scene random variable represented by node Objectj certainty of detection of object j in the context of the current scene belief of object j prior to normalization likelihood of being intended for object j Euclidean distance of computed and uttered spatial relations inverse of rd weighted by IO RO probability near 1 but not 1 due to normalization very small probability, e.g., 0.01
m a
Appendix B
Baufix Domain – Objects and Geometric Modeling
Baufix is a trademark for a wooden toy construction kit. The domain for Q UA SI-ACE consists of 23 objects from the Baufix construction kit. These objects are necessary to assemble a toy airplane like the one which is shown in Fig. 2.1. The 23 objects can be categorized to seven principal types which are described in the following. We explain the object characteristics and their geometric modeling.
Bars The basic construction parts are wooden bars with 3, 5, or 7 holes. They are 2.5cm wide and 9.2cm, 15.4cm, and 21.5cm long, respectively. They are named 3-holed-bar, (Dreilochleiste), 5-holed-bar (F¨unflochleiste), and 7-holed-bar (Siebenlochleiste) throughout the text. The bars are depicted in Fig. B.1 and numbered 1, 2, and 3, respectively. Even though the corners of the bars are rounded, they are geometrically modeled as cuboids having holes on the top and on the bottom side. Each bar is described through 8 corners, 12 vertices, 4 faces without holes, and 2 faces with holes (see Section 4.2 for the model of the 3-holed-bar). Fig. B.2 shows the geometric models of the 3-holed-bar and the 5-holed-bar. The origin of the model-centered coordinated frame is per default located in the first corner.
Cubes The cube (Schraubw¨urfel) is a cube with rounded corners and with a length of each edge of 3.1cm. It has a hole in each side, i.e three holes are going through the cube. Two of them are threaded. One is a boring. The cube comes in the four colors red, yellow, green, and blue.
146
B
BAUFIX D OMAIN – O BJECTS
G EOMETRIC M ODELING
AND
2 9 19 6 8
5
4
3 17 14
12
18
7 1
13
17
19
20
9
13
12
7 5 10
18 15
16 14 6
8 11
Fig. B.1: The Baufix objects.
4
B
BAUFIX D OMAIN – O BJECTS
AND
G EOMETRIC M ODELING
147
Fig. B.2: The geometric model of a 3-holed bar and a 5-holed-bar.
Object 4 in Fig. B.1 shows a cube. A cube is modeled similar to a bar as a perfect cube having a hole in each side (cf. Fig. B.3). Except for the radius of the holes no difference is made for threaded and unthreaded holes.
Fig. B.3: The geometric model of a cube.
Rhomb-nut The nut is needed to tighten the bolts. It is diamond-shaped (see object 5 in Fig. B.1) and therefore called rhomb-nut (Rautenmutter). It is wooden and painted orange. The length of its principal axes are 3.3cm and 2cm, respectively. A rhomb-nut is modeled as a polyhedron with hole (see Fig. B.4).
Fig. B.4: The geometric model of a rhomb-nut.
148
B
BAUFIX D OMAIN – O BJECTS
AND
G EOMETRIC M ODELING
Cylindrical Objects Four cylindrical objects are found in the Baufix domain. A wheel is composed of a rim and a tire. The rim (Felge) (object 6 in Fig. B.1) is wooden and painted red. It is 1cm thick and of diameter 4.3cm. The tire (Reifen) (object 7 in Fig. B.1) is a white rubber tire which is 1.2cm thick and of diameter 6.2cm. Two washers (Beilagscheiben) are available. Both are wooden rings with diameter 2.3cm and a hole of diameter 1.4cm in their center. One, which is painted purple (object 9 in Fig. B.1), is 8mm high and the smaller one (object 10 in Fig. B.1), which is wooden plain, is 4mm high. All cylindrical objects are modeled as cylinders with a hole in the center. In terms of solid modeling (M¨antyl¨a, 1988) it is a cylinder subtracted by a smaller cylinder in the center which has the radius of a hole (see Fig. B.5).
Fig. B.5: Geometric model of the cylindrical objects: tire, washers, rim
Socket A plastic socket (Mitnehmerbuchse) is used to guarantee smoothly rotating wheels. Its color is a dark white which is similar to ivory-colored. It is object number 8 in Fig. B.1. It has a height of 1.45cm and a diameter of 1.8cm at its top and a diameter of 2.1cm at its bottom. The socket is modeled geometrically as two cylinders on top of each other. Both cylinders have a hole of the same diameter. A model of the socket is shown in Fig. B.6.
B
BAUFIX D OMAIN – O BJECTS
AND
G EOMETRIC M ODELING
149
Fig. B.6: The geometric model of a socket.
Bolts Bolts (Schrauben) are available in 5 different lengths of the thread (13mm, 16mm, 22mm, 30mm, and 48mm) and with two different heads (round, hexagonal). They are wooden and the head is painted. The color of the head encodes the length of the thread. The red bolt has a 13mm thread, the thread of the yellow bolts is 16mm long, orange encodes a thread of 22mm length, blue stands for a 30mm thread, and the thread of the green bolts is 48mm long. All bolts with round heads are numbered 11 to 15 in Fig. B.1. The higher the number the longer is the thread. The bolts with hexagonal heads are assigned with the numbers 16 to 20 in Fig. B.1 (again with numbers increasing with the thread length). The round-headed bolt is geometrically modeled by two cylinders. One cylinder represents the head and one the thread. This coarse thread modeling leads to sufficient accuracy, so far. The head of the hexagonal-headed bolt is a hexagonal polyhedron. The geometric models of the two types of bolts are depicted in Fig. B.7. Due to difficulties in recognizing the difference between the round-headed and the hexagonal-headed bolts, all bolts are currently considered as round-headed.
Fig. B.7: The geometric models of a round-headed and a hexagonal-headed bolt.
150
B
BAUFIX D OMAIN – O BJECTS
AND
G EOMETRIC M ODELING
Appendix C
Homogeneous Transformations
The understanding of coordinate frame transformations is essential for the understanding of the perspective projection performed by cameras. A transformation means, in general, changing geometrical objects or their pose by manipulating their coordinates. There are multiple ways of denoting coordinate manipulations. Euclidean transformations: In Euclidean space, the coordinate frames are Cartesian. All transformations preserve lines, parallels, lengths, and angles. A transformation of coordinate frames is denoted by the multiplication of a vector to a rotation matrix and the addition of a translation vector. Affine transformations: The affine space is the vector space of parallel projections. Ratios, parallels and straight lines are preserved. A transformation of coordinate frames is also denoted by the multiplication of a vector to a rotation matrix and the addition of a translation vector. Projective or homogeneous transformations: The projective space is the vector space of all projections. The use of projective spaces has been made popular in three-dimensional computer graphics and robotics because these spaces permit a compact representation of all changes of coordinate systems by 4 4 matrices instead of a rotation matrix and a translation vector (Faugeras, 1995; Foley et al., 1990). Such changes are special cases of linear projective transformations, or homogeneous transformations called collineations. We use homogeneous coordinates and homogeneous transformations in this work to represent all transformations for 3D reconstruction and camera calibration. We first introduce some basics of projective geometry and show then the well known procedure of computing homogeneous transformations.
152
C
H OMOGENEOUS T RANSFORMATIONS
Projective Geometry The n + 1 dimensional space IRn+1 ? f(0; :::; 0)g with equivalence relation (x1; :::; xn+1)T (x01; :::; x0n+1)T , 9 6= 0 : (x01; :::; x0n+1)T = (x1; :::; xn+1)T is called projective space IPn . A collineation IPn ! IPn is denoted by a reel (n + 1) (n + 1) matrix C
C : IPn ! IPn; x 7! x0; x0 = C x:
(C.1)
(C.2)
Collineations are line preserving and invertible. They have (n + 1) (n + 1) ? 1 = n2 + 2 n degrees of freedom and the cross ratio is invariant under any collineation. All geometric transformations can be represented as collineations. Euclidean or affine coordinates are transformed into homogeneous or projective coordinates and vice versa as:
0x1 0x1 ByC : @ y A 7! B A ; @zC z
1
0a1 0 1 a d C B b ?1 : B A 7! @ db A : @cC d
c d
(C.3)
Homogeneous Transformations The transformation Tpq of a right-handed, Cartesian coordinate frame q into a right-handed, Cartesian coordinate frame p is composed by a rotation and a translation and represented by a 4 4 matrix
R t Tpq = pq pq ; 013 1
t t
(C.4)
where Rpq is an orthonormal 3 3 rotation matrix, and pq is a 3-dimensional translation vector. The inverse of Tpq is computed as T T Tpq?1 = Tqp = 0R1pq3 (?Rpq1 pq ) = 0R1qp3 1qp : (C.5)
t
x x tpq xq = Tpq xq ; 1 1 1
The rotation and translation of an Euclidean or affine vector q to p is then in homogeneous coordinates:
x R pq p = 1
which is equivalent to
013
xp = Rpq xq + tpq
(C.6)
(C.7)
in Euclidean or affine coordinates. Vectors in projective space are only defined up to a scalar (cf. Eq. C.1).
C
153
H OMOGENEOUS T RANSFORMATIONS
Rotations There are multiple ways of parameterizing and computing rotation matrices. We have chosen the parameterization by the three angles yaw !x , pitch !y , and roll !z . The rotation matrix Rpq is composed by (cf. Fig. C.1)
a rotation about the x-axis
a rotation about the y -axis
01 Rx = @ 0
1 A;
(C.8)
0 cos ! y Ry = @ 0
1 A;
(C.9)
1 A:
(C.10)
0 0 cos !x sin !x 0 ? sin !x cos !x 0 ? sin !y 1 0 sin !y 0 cos !y
and a rotation about the z -axis
0 cos ! z @ ? sin ! Rz = z 0
sin !z 0 cos !z 0 0 1
z !z
x
!x
!y y
Fig. C.1: Yaw !x , pitch !y , and roll !z and their rotations in positive directions.
We compute Rpq as Rpq
Rpq =
= Rx Ry Rz and obtain (cf. Eq. 4.4)
!y cos !z sin !x sin !y cos !z ? cos !x sin !z cos !x sin !y cos !z + sin !x sin !z cos
!y sin !z sin !x sin !y sin !z + cos !x cos !z cos !x sin !y sin !z ? sin !x cos !z cos
? sin !y
sin
cos
!x cos !y !x cos !y
!
:
154
C
H OMOGENEOUS T RANSFORMATIONS
Appendix D
Ellipses as Four Points and Cross Ratio
The theorem of Chasles (Semple & Kneebone, 1952) states that a conic is uniquely defined by four points on that conic and a cross ratio. The cross ratio is the basic projective invariant. We explain it in detail in this appendix and derive that the cross ratio of all ellipses is 2 when using the curvature extrema as the four points. O A’
L’
L’’
D’ B’ C’
D’’
B’’
D
C’’
A’’ B C
A
Fig. D.1: Cross ratio of a pencil of lines.
Let us consider a pencil of four lines Li ; i = 1; :::; 4 (see Fig. D.1), defined by the points A; B; C; D; and the origin O. Any line L0 intersecting the pencil gives us four points A0; B 0; C 0; D0 . As the mapping of a point A0 onto a point A00, which results from the intersection of the pencil with a second line L00 (see Fig. D.1), is a projective mapping, the cross ratios [A0; B 0; C 0; D0 ] and [A00 ; B 00; C 00 ; D00 ] are the same. Thus the cross ratio of a pencil of lines can be defined as:
[L1; L2; L3; L4] = [A0; B 0; C 0; D0]; and it is computed as
0 0 0 0 sin(OA0 ; OC 0 ) sin(OB 0 ; OD0 ) k = [L1; L2; L3; L4] = A0 C 0 B 0D0 = sin( OA0; OD0 ) sin(OB 0; OC 0) : AD BC
(D.1)
156
D
E LLIPSES
AS
F OUR P OINTS
AND
C ROSS R ATIO
A more convenient way of computing a cross ratio is given by Mohr (1993)
OA0C 0j jOB 0D0j ; k = [L1; L2; L3; L4] = jjOA 0 D0 j jOB 0 C 0j
where jOA0 C 0j is the determinant of the the points O; A0; B 0 as columns.
(D.2)
3 3 matrix with the homogeneous coordinates of
We use Eq. D.2 to compute the cross ratio of four points on an ellipse. Since these four points can be any points on the ellipse, we choose the curvature extrema of an ellipse as the four points A; B; C; and D (see Fig. D.2). In general, an ellipse is uniquely defined by any five points. However, when taking the curvature extrema, i.e. the intersections of the principal axes with the ellipse, then these four points1 contain already enough information to characterize an ellipse uniquely. Therefore, the representation of an ellipse by these four points and a cross ratio is overdetermined which leads to a constant cross ratio of 2 for all ellipses (cf. Eq. D.4).
D
l2 l1
A
m C
B
m
Fig. D.2: An ellipse, its representation ( ; l1; l2; )T , and principal axes with the ellipse (curvature extrema).
The points
A; B; C; D, the four intersections of the
x of an ellipse are computed from the (m; l1; l2; )T -representation as cos ? sin l cos t x = m + sin cos l12 sin t ; with 0 t 2 : A =
cos l
B =
2
Thus, the curvature extrema are
1 sin l1
? sin l
cos l2 ? cos l C = ? sin l 1 1 sin l D = ? cos 2l : 2 1
Already three curvature extrema define an ellipse uniquely.
(D.3)
D
E LLIPSES
AS
F OUR P OINTS
AND
157
C ROSS R ATIO
when we assume the center of the ellipse at the origin of the ellipse plane coordinate frame. According to the theorem of Chasles, the origin O of a pencil of lines with a same cross ratio can be any point on the ellipse. We choose O therefore arbitrarily by inserting any t in Eq. D.3 and assuming = 021
m
O =
cos l
1 cos t ? sin l2 sin l1 cos t + cos l2
sin t : sin t
cos l1 cos t ? sin l2 sin t sin l1 cos t + cos l2 sin t 1
cos l1 cos t ? sin l2 sin t sin l1 cos t + cos l2 sin t 1
Applying these points to Eq. D.2 leads to
cos l1 ? cos l1 sin l1 ? sin l1 jOAC j = 1 1 = ?2 l1 l2 sin t (sin )2 ? 2 l1 l2 sin t (cos )2 = ?2 l1 l2 sin t
cos l1 sin l2 sin l1 ? cos l2 jOADj = 1 1 = ?l1l2 ((sin )2 + (cos )2) + l1l2 cos t ((sin )2 + (cos )2) ?l1l2 sin t ((sin )2 + (cos )2) = ?l1 l2 (1 ? cos t + sin t)
jOBDj =
cos l1 cos t ? sin l2 sin t ? sin l2 sin l2 sin l1 cos t + cos l2 sin t cos l2 ? cos l2 1 1 1
= 2 l1 l2 cos t (cos )2 + 2 l1 l2 cos t (sin )2 = 2 l1 l2 cos t
jOBC j =
cos l1 cos t ? sin l2 sin t ? sin l2 ? cos l1 sin l1 cos t + cos l2 sin t cos l2 ? sin l1 1 1 1
= l1l2 ((sin )2 + (cos )2) + l1l2 cos t ((sin )2 + (cos )2) ?l1l2 sin t ((sin )2 + (cos )2) = l1 l2 (1 + cos t ? sin t)
158
D
E LLIPSES
AS
F OUR P OINTS
AND
C ROSS R ATIO
?2 l1 l2 sin t 2 l1 l2 cos t jOAC j jOBDj = jOADj jOBC j ?l1 l2 (1 ? cos t + sin t) l1 l2 (1 + cos t ? sin t) ?4 l12l22 sin t cos t = ?l12l22(1 ? (cos t)2 + 2 cos t sin t ? (sin t)2) sin t cos t = 42 sin t cos t = 2:
(D.4)
Appendix E
Jacobians
The minimization of non-linear systems is not possible in closed form. Common methods are iterative descent approaches which require the computation of the gradient of the minimized function. We have chosen the Levenberg-Marquardt method (cf. Section 4.5.1) for this purpose. In order to minimize Eq. 4.19
N X T X xbji ? Pbij o(a; xoi ) i ?1 xbji ? Pbij o(a; xoi ) ; C(a) = i=1 j 2B
we have to compute the Jacobians
@ Pbijo(a; xoi ) : @a
(E.1)
Pbij o has a different form for each feature type i (point, line segment, or ellipse) and each
image j (see Equations 4.5, 4.6, 4.9, and 4.18). The differences are due to the different feature representations and the differences in the projection to different image coordinate frames. We computed all partial derivatives using the linear algebra software packet MAPLE (e.g. Redfern, 1996). The partial derivatives are rather complex because of the frequent use of trigonometric functions. Our approach has the disadvantage of seemingly complicated Jacobians but the advantage of an explicit modeling and the use of a minimum number of parameters.
Points
x
An image point is represented by its x and y coordinate with bl = (x; y )T . Applying the generalized chain rule, the Jacobian for points projected to the left image is
160
E
JACOBIANS
@ Pbplo(a; xo ) @ xbl @ xl @ xbl = @x @a + @a : @a l
(E.2)
The projected points are a function of the points in camera coordinates and the focal length contained in the vector . The vector includes all pose parameters and the camera parameters which are estimated during minimization of Eq. 4.19. Let n be the dimension of . The Jacobian for the projection to the right image is @ Pbpr o( ; o) @ br @ r @ l @ r @ br : + = + (E.3) @ @ r @ l @ @ @
a
a
x x x x
ax a
Where
@ xbl = 1 @ xl zl2
a
x a
x a
x a
f =s z l x l
0 ?fl=sx xl ; 0 ?fl=sy zl fl=sy yl and @ xbr =@ xr is composed analog to @ xbl =@ xl substituting l by r, @ xbl = 1 sx 01n ; @ xbr = 1 sx 01n?1 ; @ a zl ?sy 01n @ a zr ?sy 01n?1
@ xl = 0 ::: I 31 33 @a
@ xr = 0 ::: I 31 33 @a and
x ::: 031 ; @ Trl x @ Trl x @ Trl x ::: 0 ; o o o 3 1 @!x @!y @!z x
x
@ Tlo @ Tlo @ Tlo @!x o @!y o @!z o
@ xr = T : @ xl rl
a
(E.4)
(E.5)
(E.6)
(E.7)
(E.8)
The vector contains all pose parameters and all camera parameters. Only subsets of them are used for the actual projection of single points. Therefore, @ bl =@ , @ br =@ , @ l =@ , and @ r =@ contain many entries which are 0 (see Eqs. E.5, E.6, and E.7).
x a x
x a
a x a
Line Segments A line segment is represented in the image by its midpoint m, orientation and length l with xbl = (mx; my ; ; l)T . The line segment in 3D is described by two points xos1 , xos2 . They are projected, e.g., to the left image by x x x x
1
2
1 1
= Tbl l Tlo
os1
1
os2
1
:
E
161
JACOBIANS
Vectors in projective space are only defined up to a scalar (cf. Eq. C.1). transformed to the midpoint representation of line segments (cf. Eq. 4.8) with
x1 and x2 are
mx = x1 +2 x2 ; my = y1 +2 y2 ;
y2 ? y1
= arctan x ? x ; 2 1 p 2 l = (x2 ? x1) + (y2 ? y1)2 :
x x
x
x
Hence, the projection P of a line segment is a function P ( 1; 2 ), where 1 and 2 are functions of the actually estimated parameters . Applying the generalized chain rule, the Jacobian of P is
a
2 @P @x @P = X i @ a i=1 @ xi @ a
(E.9)
This leads to the Jacobian for the projection of model line segments to the left image
@ Pbslo(a; xo) @ xbl @ x1 @ xl @ x1 @ xbl @ x2 @ xl @ x2 = @ x @ x @ a + @ a + @ x @ x @ a + @ a : (E.10) @a 1 l 2 l and @ Pbslo (a; xo )=@ a for the projection to the right image using the additional terms @ xr =@ xl and @ xr =@ a as in Eq. E.3. @ xbl =@ x1 and @ xbl =@ x2 are computed as
x = x
@ bl @ 1
0 B B @
1 2
0
y2 ?y1 l2 ? x2?l x1
0
1 x22?x1
? y l?2y ? 2l 1
1 C C A
and
x = x
@ bl @ 2
0 1 2 B 0 B @ ? y2 ?y1 2 x2 ?lx1 l
0
1 2 x2 ?x1 2 y2 l?y1 l
1 C C A:
(E.11)
Ellipses
m
An image ellipse is represented by its midpoint , radii l1 and l2 , and orientation with bl = (mx ; my ; l1; l2; )T In 3D, an ellipse is described by the midpoint , the radii l1 and l2, the orientation and the normal . We project ellipses by projecting the curvature extrema (see Section 4.3.4). The curvature extrema in object-centered coordinates ( A ; B ; C ; D )
x
n
m x x x x
162
E
JACOBIANS
are derived from the 3D ellipse representation by the function ?C ,
0x x x x 1 A B C D B y yB yC yD C A ?o (m; l1; l2; n) = B @ zA zB zC zD C A: 1
1
1
(E.12)
1
The curvature extrema A; B; C; D have the following coordinates in the plane of the ellipse assuming that the midpoint of the ellipse is the plane origin
A = B = C = D =
0 cos l 1 @ sin l11 A 1 0 ? sin l 1 @ cos l2 2 A 1 0 ? cos l 1 @ ? sin l11 A 1 0 sin l 1 @ ? cos 2l2 A : 1
The rotation of the plane of the ellipse relative to the object-center coordinate frame is described by the matrix R,
0 cos ! y R = @ sin !x sin !y
1
0 0 cos !x 0 A : cos !x sin !y ? sin !x 0
Using the normal vector
cos !y =
(E.13)
n = (nx; ny ; nz )T of the plane of the ellipse, the entries of R are
p1 ? n 2 ; q nx2y
cos !x = 1 ? 1?n2x ; ? sin !x = p?1n?yn2 : x
sin !x sin !y = p?n1y?nnx2 ;
qx
cos !x sin !y = ?nx 1 ? 1?nyn2x ; 2
E
163
JACOBIANS
xA; xB ; xC ; xD are then computed by 0x x x x 1 A B C D A B C D B C y y y y A B C D B C ?o = @ z z z z A = R 1 1 1 1 + m A B C D 1
1
1
(E.14)
1
x x x x x x x x
The points A ; B ; C ; D are projected which results in a 3 4 matrix representing the projected points a ; b ; c ; d . These are inserted in 2 LAB LCD ? LAC LBD = 0 using Eq. 4.14 (see Section 4.3.4). The coefficients of x and y are collected and they from the coefficients a; b; c; d; e; f of the quadratic form of an ellipse
ax2 + 2bxy + cy2 + 2dx + 2ey + f = 0: Thus, a; b; c; d; e; f are functions of the projected points xa ; xb ; xc ; xd
a(xa; xb; xc; xd) = 2 (yb ? ya)(yd ? yc) ? (yc ? ya)(yd ? yb); b(xa; xb; xc; xd) = (xa ? xb)(yd ? yc) + (yb ? ya)(xc ? xd) ? 21 (xa ? xc)(yd ? yb) ? 21 (yc ? ya)(xb ? xd);
(E.15)
(E.16)
c(xa; xb; xc; xd) = 2 (xa ? xb)(xc ? xd) ? (xa ? xc)(xb ? xd);
(E.17)
d(xa; xb; xc; xd) = 2 (xa(ya ? yb) ? ya(xa ? xb))(yd ? yc) +2 (yb ? ya)(xc(yc ? yd) ? yc(xc ? xd)) ?(xa(ya ? yc) ? ya(xa ? xc))(yd ? yb) ?(yc ? ya)(xb(yb ? yd) ? yb(xb ? xd));
(E.18)
e(xa; xb; xc; xd) = 2 (xa(ya ? yb) ? ya(xa ? xb))(xc ? xd) +2 (xa ? xb)(xc(yc ? yd) ? yc(xc ? xd)) ?(xa(ya ? yc) ? ya(xa ? xc))(xb ? xd) ?(xa ? xc)(xb(yb ? yd) ? yb(xb ? xd));
(E.19)
f (xa; xb; xc; xd) = 2 (xa(ya ? yb) ? ya(xa ? xb))(xc(yc ? yd) ? yc(xc ? xd)) ?(xa(ya ? yc) ? ya(xa ? xc))(xb(yb ? yd) ? yb(xb ? xd)): (E.20) Using these coefficients, the (mx; my ; l1; l2; )-representation of the ellipse is computed using Eqs. 4.15–4.17 which are equivalent to the following:
164
E
? cd mx = be ac ? b2 ? bd my = ? ae ac ? b2 l1 =
l2 =
v u u u 2 ? b2 f + 2 bde ? d2 c u acf ? ae u r u t (ac ? b2) ? a ? c ? (a+c)2 ? ac + b2 2 2 4 v u u u 2 ? b2f + 2 bde ? d2 c u acf ? ae u r u t (ac ? b2) ? a ? c + (a+c)2 ? ac + b2 2 2 4
= arctan(22b; a ? c)
JACOBIANS
(E.21) (E.22)
(E.23)
(E.24)
(E.25)
The Jacobian of the projection of ellipses to the left image is then
$$
\frac{\partial P_{bel}(\mathbf{a},\mathbf{x}_o)}{\partial \mathbf{a}}
  = \sum_{i=a}^{d} \frac{\partial \mathbf{x}_{bl}}{\partial \mathbf{x}_{il}}
    \left(\frac{\partial \mathbf{x}_{il}}{\partial \mathbf{x}_l}\,\frac{\partial \mathbf{x}_l}{\partial \mathbf{a}}
        + \frac{\partial \mathbf{x}_{il}}{\partial \mathbf{a}}\right).
\qquad\text{(E.26)}
$$

For the Jacobian of the projection of ellipses to the right image, the additional terms $\partial \mathbf{x}_r/\partial \mathbf{x}_l$ and $\partial \mathbf{x}_r/\partial \mathbf{a}$ must be inserted as in Eq. E.3.
The derivatives $\partial \mathbf{x}_{bl}/\partial \mathbf{x}_{il}$, $i \in \{a, b, c, d\}$, can easily be derived from Eqs. E.15–E.20 and E.21–E.25. Because of the number of terms in this Jacobian, we decided not to write it down here explicitly.
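Since the analytic form of this Jacobian is not spelled out, a central-difference approximation is a convenient way to verify any hand-derived or machine-generated version. The sketch below is generic; it only assumes a callable mapping a parameter vector to a feature vector, which is a hypothetical stand-in for the projection functions of this appendix.

```python
import numpy as np

def numerical_jacobian(func, a, eps=1e-6):
    """Central-difference Jacobian of func: R^n -> R^m, e.g. for checking
    hand-derived Jacobians such as the one of the ellipse projection."""
    a = np.asarray(a, dtype=float)
    f0 = np.asarray(func(a), dtype=float)
    J = np.zeros((f0.size, a.size))
    for k in range(a.size):
        step = np.zeros_like(a)
        step[k] = eps
        J[:, k] = (np.asarray(func(a + step)) - np.asarray(func(a - step))) / (2.0 * eps)
    return J
```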
Appendix F
Covariance Matrices
We require that detected image features are matched to corresponding object model features as a prerequisite for model-based 3D reconstruction. We then estimate the best pose of all objects in the scene and the cameras based on a fitting of the object models to image features. The goal is to estimate the vector $\mathbf{a}$ which represents the pose and camera parameters. We therefore assume that

$$
\mathbf{x}_{bji} = P_{bj_o}(\mathbf{a}, \mathbf{x}_{oi}) + \text{noise}
\qquad\text{(F.1)}
$$

models the projection of the objects to images and thus models the transition between the true state of the scene (= the pose of all objects and the relative pose of the cameras) and the measurements taken from the images (= the image features). But, of course, our measurements are corrupted by noise. We assume this noise to be Gaussian and consider the reconstruction problem as a maximum a-posteriori estimation, which we accomplish by minimizing Eq. 4.19,

$$
C(\mathbf{a}) = \sum_{i=1}^{N} \sum_{j \in B}
  \bigl(\mathbf{x}_{bji} - P_{bj_o}(\mathbf{a}, \mathbf{x}_{oi})\bigr)^T\,
  \Sigma_i^{-1}\,
  \bigl(\mathbf{x}_{bji} - P_{bj_o}(\mathbf{a}, \mathbf{x}_{oi})\bigr)
  \;\rightarrow\; \min.
$$
The distance between measured image features and model features is computed using the Mahalanobis distance. It is a statistical distance which accounts for the noise in the measurements and models it by the covariance matrices $\Sigma_i$. So far, we distinguish between three different types of image features: points, line segments, and ellipses. Each type of image feature has different measurement noise characteristics, which results in different type-specific covariance matrices.
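The cost of Eq. 4.19 is a sum of squared Mahalanobis distances and is simple to evaluate once the projection and the covariances are available. The sketch below illustrates this; `project`, the layout of `features`, and all names are hypothetical placeholders, not the interfaces of the thesis software.

```python
import numpy as np

def reconstruction_cost(a, features, project, covariances):
    """Mahalanobis-weighted fitting cost in the spirit of Eq. 4.19.
    features[i][j] : measured feature of model feature i in image j
    project(a,i,j) : hypothetical projection of model feature i into image j
    covariances[i] : type-specific covariance Sigma_i of feature i"""
    cost = 0.0
    for i, per_image in features.items():
        inv_cov = np.linalg.inv(covariances[i])
        for j, x_meas in per_image.items():
            r = np.asarray(x_meas) - np.asarray(project(a, i, j))   # residual
            cost += float(r @ inv_cov @ r)                          # squared Mahalanobis distance
    return cost
```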
Points

An image point or vertex is described by an $x$ and a $y$ coordinate. It can be assumed that the measurement noise is not correlated for $x$ and $y$. The covariance matrix for points $(x_b, y_b)^T$ is therefore modeled as a $2\times 2$ diagonal matrix

$$
\Sigma_p = \begin{pmatrix} \sigma_x^2 & 0\\ 0 & \sigma_y^2 \end{pmatrix}.
\qquad\text{(F.2)}
$$
$\sigma_x^2$ and $\sigma_y^2$ denote the variance in $x$ and $y$, respectively.

Line Segments

The other image features, i.e. straight line segments and ellipses, are more complex features than points. A key issue here is to find a parameterization of the measured (detected) features which allows us to formulate the statistical correlation of the feature components with as few parameters as possible. Deriche & Faugeras (1990) propose the following for line segments: assuming no correlation of the noise between the endpoints $\mathbf{x}_1$ and $\mathbf{x}_2$ and the same covariance matrix $\Lambda$ for both endpoints leads to a covariance matrix

$$
\Sigma_s = \begin{pmatrix} \Lambda & 0\\ 0 & \Lambda \end{pmatrix},
\qquad\text{(F.3)}
$$

where $\Lambda$ is the $2\times 2$ matrix associated to each endpoint:

$$
\Lambda = \begin{pmatrix} \sigma_x^2 & \sigma_{xy}^2\\ \sigma_{xy}^2 & \sigma_y^2 \end{pmatrix}.
\qquad\text{(F.4)}
$$

Let us now assume that $\Lambda$ is diagonal in a feature-centered coordinate system defined by $\mathbf{u}_{\parallel}$ and $\mathbf{u}_{\perp}$, two unit vectors parallel and perpendicular to the line segment. In this coordinate system we have

$$
\Lambda_{\parallel\perp} = \begin{pmatrix} \sigma_{\parallel}^2 & 0\\ 0 & \sigma_{\perp}^2 \end{pmatrix},
\qquad\text{(F.5)}
$$

which is especially desirable, since $\sigma_{\parallel}^2$ and $\sigma_{\perp}^2$ are very distinct for image line segments (cf. Fig. F.1). The origin of the $(\parallel,\perp)$-coordinate system is the image origin. We are, however, interested in finding $\Lambda$ in the given image coordinate frame.

If $\mathbf{x}_q$ is a feature vector in the feature space $X_q$ with covariance matrix $\Sigma_q$ and if $\mathbf{x}_n = \mathbf{v}(\mathbf{x}_q)$ is a linear transformation of the feature space $X_q$ to the feature space $X_n$, then the covariance matrix $\Sigma_n$ is computed as (Hinderer, 1988):

$$
\Sigma_n = \frac{\partial \mathbf{v}(\mathbf{x}_q)}{\partial \mathbf{x}_q}\,
  \Sigma_q\,
  \left(\frac{\partial \mathbf{v}(\mathbf{x}_q)}{\partial \mathbf{x}_q}\right)^T.
\qquad\text{(F.6)}
$$
Fig. F.1: Position and orientation of the $(\parallel,\perp)$-coordinate system for a line segment or an ellipse, respectively.
We are interested in a transformation of the $(\parallel,\perp)$-feature space to the image coordinate frame and then a subsequent transformation to the midpoint-representation of line segments. This leads to the following equation,

$$
\Sigma_s = \frac{\partial (m_x, m_y, \phi, l)^T}{\partial(\mathbf{x}_1, \mathbf{x}_2)}\;
  R_s\, \Sigma_s\, R_s^T\;
  \left(\frac{\partial (m_x, m_y, \phi, l)^T}{\partial(\mathbf{x}_1, \mathbf{x}_2)}\right)^T,
\qquad\text{(F.7)}
$$

where $R_s$ transforms $\Sigma_s$ (the endpoint covariance of Eq. F.3 with $\Lambda = \Lambda_{\parallel\perp}$ from Eq. F.5) with respect to the image coordinate frame. A line segment is then still represented by its two endpoints. The midpoint-representation describes a line segment as the vector $(m_x, m_y, \phi, l)^T$ (midpoint, angle, length), whose components are functions of the two endpoints (see Eq. 4.8). The matrix $R_s$ is a block-diagonal matrix

$$
R_s = \begin{pmatrix} R & 0_{2\times 2}\\ 0_{2\times 2} & R \end{pmatrix}
\qquad\text{(F.8)}
$$

with

$$
R = \begin{pmatrix} \cos\phi & -\sin\phi\\ \sin\phi & \cos\phi \end{pmatrix}
\qquad\text{(F.9)}
$$

and describes the rotation of the $(\parallel,\perp)$-coordinate system with respect to the image coordinate system. The matrix $\partial (m_x, m_y, \phi, l)^T / \partial(\mathbf{x}_1, \mathbf{x}_2)$ is composed as (cf. Eqs. E.11 and 4.8):

$$
\frac{\partial (m_x, m_y, \phi, l)^T}{\partial(\mathbf{x}_1, \mathbf{x}_2)} =
\begin{pmatrix}
\frac{1}{2} & 0 & \frac{1}{2} & 0\\
0 & \frac{1}{2} & 0 & \frac{1}{2}\\
\frac{y_2-y_1}{l^2} & -\frac{x_2-x_1}{l^2} & -\frac{y_2-y_1}{l^2} & \frac{x_2-x_1}{l^2}\\
-\frac{x_2-x_1}{l} & -\frac{y_2-y_1}{l} & \frac{x_2-x_1}{l} & \frac{y_2-y_1}{l}
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{2} & 0 & \frac{1}{2} & 0\\
0 & \frac{1}{2} & 0 & \frac{1}{2}\\
\frac{\sin\phi}{l} & -\frac{\cos\phi}{l} & -\frac{\sin\phi}{l} & \frac{\cos\phi}{l}\\
-\cos\phi & -\sin\phi & \cos\phi & \sin\phi
\end{pmatrix}.
$$
The covariance matrix for line segments in the midpoint-representation $(m_x, m_y, \phi, l)^T$ is now obvious using Eq. F.7:

$$
\Sigma_s = \begin{pmatrix}
\frac{\sigma_{\parallel}^2\cos^2\phi + \sigma_{\perp}^2\sin^2\phi}{2} &
\frac{(\sigma_{\parallel}^2 - \sigma_{\perp}^2)\sin\phi\cos\phi}{2} & 0 & 0\\
\frac{(\sigma_{\parallel}^2 - \sigma_{\perp}^2)\sin\phi\cos\phi}{2} &
\frac{\sigma_{\perp}^2\cos^2\phi + \sigma_{\parallel}^2\sin^2\phi}{2} & 0 & 0\\
0 & 0 & \frac{2\sigma_{\perp}^2}{l^2} & 0\\
0 & 0 & 0 & 2\sigma_{\parallel}^2
\end{pmatrix}.
\qquad\text{(F.10)}
$$
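Eq. F.10 can also be obtained numerically by propagating the diagonal $(\parallel,\perp)$ model through Eq. F.7. The following sketch does exactly that; names are illustrative and not taken from the thesis software, and the result can be compared entry by entry with the closed form above.

```python
import numpy as np

def line_segment_covariance(sigma_par, sigma_perp, phi, length):
    """Covariance of the midpoint representation (mx, my, phi, l) of a line
    segment, propagated as in Eq. F.7 from the diagonal (par, perp) model."""
    lam = np.diag([sigma_par**2, sigma_perp**2])            # Eq. F.5
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])             # Eq. F.9
    lam_img = R @ lam @ R.T                                  # endpoint covariance in image coords
    sigma_endpoints = np.block([[lam_img, np.zeros((2, 2))],
                                [np.zeros((2, 2)), lam_img]])
    s, c, l = np.sin(phi), np.cos(phi), length
    J = np.array([[0.5,    0.0,    0.5,    0.0  ],
                  [0.0,    0.5,    0.0,    0.5  ],
                  [s / l, -c / l, -s / l,  c / l],
                  [-c,    -s,      c,      s    ]])          # d(mx,my,phi,l)/d(x1,x2)
    return J @ sigma_endpoints @ J.T                         # should reproduce Eq. F.10
```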
Ellipses

Gengenbach (1994) proposes to derive the covariance matrix for ellipses represented as $(m_x, m_y, l_1, l_2, \phi)^T$ in an analogous way to Deriche & Faugeras (1990). A diagonal covariance matrix for ellipses in the $(\parallel,\perp)$-space is the following:

$$
\Sigma_{e\parallel\perp} = \begin{pmatrix}
\sigma_{\parallel}^2 & 0 & 0 & 0 & 0\\
0 & \sigma_{\perp}^2 & 0 & 0 & 0\\
0 & 0 & \sigma_{l_1}^2 & 0 & 0\\
0 & 0 & 0 & \sigma_{l_2}^2 & 0\\
0 & 0 & 0 & 0 & \sigma_{\phi}^2
\end{pmatrix}.
\qquad\text{(F.11)}
$$
No transformation of the ellipse representation is necessary here. Thus, Eq. F.11 is transformed to $\Sigma_e$ by

$$
\Sigma_e = I_{5\times 5}\; R_e\, \Sigma_{e\parallel\perp}\, R_e^T\; I_{5\times 5}
\qquad\text{(F.12)}
$$

using

$$
R_e = \begin{pmatrix} R & 0_{2\times 3}\\ 0_{3\times 2} & I_{3\times 3} \end{pmatrix}
\qquad\text{(F.13)}
$$

and resulting in

$$
\Sigma_e = \begin{pmatrix}
\sigma_{\parallel}^2\cos^2\phi + \sigma_{\perp}^2\sin^2\phi &
(\sigma_{\parallel}^2 - \sigma_{\perp}^2)\sin\phi\cos\phi & 0 & 0 & 0\\
(\sigma_{\parallel}^2 - \sigma_{\perp}^2)\sin\phi\cos\phi &
\sigma_{\perp}^2\cos^2\phi + \sigma_{\parallel}^2\sin^2\phi & 0 & 0 & 0\\
0 & 0 & \sigma_{l_1}^2 & 0 & 0\\
0 & 0 & 0 & \sigma_{l_2}^2 & 0\\
0 & 0 & 0 & 0 & \sigma_{\phi}^2
\end{pmatrix}.
\qquad\text{(F.14)}
$$
We use $\Sigma_e$ as the covariance matrix for ellipses. The experiments show satisfactory results. However, in order to express $\Sigma_e$ entirely in terms of $\sigma_{\parallel}^2$ and $\sigma_{\perp}^2$, we should have started from a covariance matrix describing the noise of the four projected curvature extrema of the ellipse. Subsequent transformations would then carry this covariance to the space of the quadratic form and to the $(m_x, m_y, l_1, l_2, \phi)$ representation. We do not follow this approach because of the complexity of the Jacobians necessary for the transformations, and use Eq. F.14 instead.
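The covariance actually used for ellipses (Eqs. F.12–F.14) is simple to assemble. A minimal sketch, with illustrative names and the identity Jacobian of Eq. F.12 omitted since it has no effect:

```python
import numpy as np

def ellipse_covariance(sigma_par, sigma_perp, sigma_l1, sigma_l2, sigma_phi, phi):
    """Covariance of the ellipse representation (mx, my, l1, l2, phi),
    rotated from the diagonal (par, perp) model as in Eqs. F.11-F.14."""
    sigma_pp = np.diag([sigma_par**2, sigma_perp**2,
                        sigma_l1**2, sigma_l2**2, sigma_phi**2])    # Eq. F.11
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    Re = np.block([[R, np.zeros((2, 3))],
                   [np.zeros((3, 2)), np.eye(3)]])                   # Eq. F.13
    return Re @ sigma_pp @ Re.T                                      # Eq. F.14
```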
Bibliography Abella, A. & Kender, J. R. (1993). Qualitatively Describing Objects Using Spatial Prepositions. In Proc. of the 11th National Conference on Artificial Intelligence (AAAI-93), pp. 536–540. Andr´e, E., Bosch, G., Herzog, G., & Rist, T. (1986). Characterizing Trajectories of Moving Objects Using Natural Language Path Descriptions. In Proc. of the 7th European Conference on Artificial Intelligence (ECAI-86), Volume 2, pp. 1–8. Andr´e, E., Herzog, G., & Rist, T. (1988). On the simultaneous interpretation of real world image sequences and their natural language description: The system SOCCER. In Proc. of the 8th European Conference on Artificial Intelligence (ECAI-88), pp. 449–545. Arbeitsmaterialien B1a (1994). “Dies ist ein runder Gegenstand...” Sprachliche Objektspezifikationen. Teilprojekt B1: Interaktion visueller und sprachlicher Informationsverarbeitung, Universit¨at Bielefeld, Sonderforschungsbereich 360. Arbeitsmaterialien B1b (1994). “Wir bauen jetzt also ein Flugzeug...” Konstruieren im Dialog. Teilprojekt B1: Interaktion visueller und sprachlicher Informationsverarbeitung, Universit¨at Bielefeld, Sonderforschungsbereich 360. Ballard, D. H. (1996). An Introduction to Natural Computation. Rochester, NY. Ballard, D. H. & Brown, C. M. (1992). Principles of Animate Vision. CVGIP: Image Understanding 56(1), 3–21. Barwise, J. & Perry, J. (1983). Situations and Attitudes. Cambridge, MA: MIT Press. Boufama, B., Mohr, R., & Veillon, F. (1993). Euclidean Constraints for Uncalibrated Reconstruction. In Proc. of the Int. Conf. on Computer Vision, Berlin, Germany, pp. 466–470. IEEE Computer Society Press. Brind¨opke, C., H¨ager, J., Johanntokrax, M., Pahde, A., Schwalbe, M., & Wrede, B. (1996). Darf ich dich Marvin nennen? Instruktionsdialoge in einem Wizard-of-Oz Szenario: Szenario Design und Auswertung. Report 96/16, Universit¨at Bielefeld, Sonderforschungsbereich 360. Brind¨opke, C., Johanntokrax, M., Pahde, A., & Wrede, B. (1995). Darf ich dich Marvin nennen? Instruktionsdialoge in einem Wizard-of-Oz Szenario: Materialband. Report 95/7, Universit¨at Bielefeld, Sonderforschungsbereich 360. Brooks, R. A. (1981). Symbolic Reasoning Among 3-D Models and 2-D Images. Artificial Intelligence 17, 285–348. Carlson-Radvansky, L. & Irwin, D. (1993). Frames of reference in vision and language: Where is above? Cognition 46, 223–244.
Chopra, R. & Srihari, R. K. (1995). Control Structures for Incorporating Picture-Specific Context in Image Interpretation. In Proc. of the Int. Joint Conference on Artificial Intelligence, Volume 1, Montr´eal, Canada, pp. 50–55. Chrisman, L. (1996). A Roadmap to Research on Bayesian Networks and other Decomposable Probabilistic Models. http://almond.srv.cs.cmu.edu/afs/cs.cmu.edu/usr/ldc/Mosaic/bayes-net-research/bayes-net-research.html. Craig, K. (1943). The Nature of Explanation. Cambridge: Cambridge University Press. Crowley, J., Bobet, P., & Schmid, C. (1993). Auto-Calibration by Direct Observation of Objects. Image and Vision Computing 11(2), 67–81. Crowley, J. & Stelmaszyk, P. (1990). Measurement and Integration of 3-D Structures by Tracking Edge Lines. In Proc. First European Conference on Computer Vision, pp. 269–280. Antibes, France, Apr. 23-26, O. Faugeras (Ed.), Lecture Notes in Computer Science 427, Springer-Verlag. De Kleer, J. & Brown, J. S. (1984). A Qualitative Physics Based on Confluences. Artificial Intelligence 24, 7–83. Deriche, R. & Faugeras, O. (1990). Tracking Line Segments. In Proc. First European Conference on Computer Vision, pp. 259–268. Antibes, France, Apr. 23-26, O. Faugeras (Ed.), Lecture Notes in Computer Science 427, Springer-Verlag. Dhome, M., Lapreste, J. T., Rives, G., & Richetin, M. (1990). Spatial Localization of Modeled Objects of Revolution in Monocular Perspective Vision. In Proc. First European Conference on Computer Vision, pp. 475–485. Antibes, France, Apr. 23-26, O. Faugeras (Ed.), Lecture Notes in Computer Science 427, Springer-Verlag. Dhome, M., Richetin, M., LaPreste, J., & Rives, G. (1989). Determination of the Attitude of 3-D Objects from a Single Perspective View. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-11(12), 1265–1278. Dickinson, S. & Metaxas, D. (1994). Integrating Qualitative and Quantitative Shape Recovery. International Journal of Computer Vision 13(3), 1–20. Ding, M. & Wahl, F. M. (1993). Research on Quadric Curve Based on Stereo. In 8th IEEE Workshop on Image and Multidimensional Signal Processing, Cannes, France. Ding, M. & Wahl, F. M. (1995). Correspondenceless Stereo by Quadric Curve Matching. Int. Journal of Robotics and Automation. Duda, R. O. & Hart, P. E. (1972). Pattern Classification and Scene Analysis. New York: J. Wiley. Egenhofer, M. J. (1991). Reasoning about binary topological relations. In O. G¨unther & H.-J. Schek (Eds.), Advances in Spatial Databases, 2nd Symposium, SSD’91, Berlin, pp. 143–157. Springer. Engelkamp, J. (1990). Das menschliche Ged¨achtnis: Das Erinnern von Sprache, Bildern und Handlungen. Hogrefe. Engelkamp, J. & Pechmann, T. (Eds.) (1993). Mentale Repr¨asentationen. Bern: Huber.
Face & Gesture (1996). Proceedings of the 2nd Int'l Conf. on Automatic Face & Gesture Recognition, Killington, Vermont. IEEE Computer Society Press.
Faugeras, O. (1992). What can be seen in three dimensions with an uncalibrated stereo rig ? In Proc. Second European Conference on Computer Vision, pp. 563–578. Santa Margherita Ligure, Italy, 18-23 May, G. Sandini (Ed.), Lecture Notes in Computer Science 588, Springer-Verlag. Faugeras, O. (1993). Three-Dimensional Computer Vision, A Geometric Viewpoint. Cambridge, MA and London, UK: The MIT Press. Faugeras, O. (1995). Stratification of three-dimensional vision: projective, affine, and metric representations. Journal Opt. Soc. Am. A 12(3), 465–484. Faugeras, O. & Hebert, M. (1986). The Representation, Recognition, and Locating of 3-D Objects. Intern. Journal of Robotics Research 5(3), 27–52. Fillmore, C. (1968). A Case for Case. In E. Bach & R. T. Harms (Eds.), Universals in Linguistic Theory, pp. 1–88. New York: Holt, Rinehart and Winston. Fink, G., Sagerer, G., & Kummert, F. (1992). Automatic Extraction of Language Models from a Linguistic Knowledge Base. In Proc. European Signal Processing Conference, Brussels, pp. 547–550. Fink, G. A., Jungclaus, N., Ritter, H., & Sagerer, G. (1995). A Communication Framework for Heterogeneous Distributed Pattern Analysis. In V. L. Narasimhan (Ed.), International Conference on Algorithms and Applications for Parallel Processing, Brisbane, Australia, pp. 881–890. IEEE. Fink, G. A., Kummert, F., & Sagerer, G. (1994). A Close High-Level Interaction Scheme for Recognition and Interpretation of Speech. In Proc. Int. Conf. on Spoken Language Processing, Volume 4, Yokohama, Japan, pp. 2183–2186. Fischler, M. A. & Bolles, R. C. (1981). Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24(6), 381–395. Fodor, J. A. & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition 28, 3–71. Foley, J., van Dam, A., Feiner, S., & Hughes, J. (1990). Computer Graphics — Principles and Practice. Reading, MA: Addison-Wesley. Forbes, J., Huang, T., Kanazawa, K., & Russell, S. (1995). The BATmobile: Towards a Bayesian Automated Taxi. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), Montr´eal, Canada, pp. 1878–1885. Forbus, K. D. (1985). The Role of Qualitative Dynamics in Naive Physics. In J. R. Hobbs & R. C. Moore (Eds.), Formal Theories of the Common Sense World, Chapter 5, pp. 185–226. Ablex Publishing Corporation. Fr¨oberg, C. (1985). Numerical Mathematics - Theory and Computer Applications. The Benjamin/Cummings Publishing Company. Fuhr, T. (1997). Modellieren von und mit Relationen – Integration qualitativer Modellierungstechniken in ein semantisches Netzwerksystem. Eingereichte Dissertationsschrift f¨ur die Universit¨at Bielefeld.
Fuhr, T., Socher, G., Scheering, C., & Sagerer, G. (1995). A three-dimensional spatial model for the interpretation of image data. In IJCAI-95 Workshop on Representation and Processing of Spatial Expressions, Montr´eal, Canada, pp. 93–102. Fuhr, T., Socher, G., Scheering, C., & Sagerer, G. (1997). A three-dimensional spatial model for the interpretation of image data. In P. Olivier & K.-P. Gapp (Eds.), Representation and Processing of Spatial Expressions. Hillsdale: Lawrence Erlbaum Associates. to appear. Gapp, K.-P. (1994). Basic Meanings of Spatial Relations: Computation and Evaluation in 3D space. In Proc of the 12t h National Conference on Artificial Intelligence (AAAI-93), pp. 1393–1398. Gapp, K.-P. (1996). Ein Objektlokalisationssystem zur sprachlichen Raumbeschreibung in dreidimensionalen Umgebungen – Formalisierung, Implementierung und empirische Validierung. PhD thesis, Universit¨at des Saarlandes. Geman, S. & Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans. on Pattern Recognition and Machine Intelligence PAMI6, 721–741. Gengenbach, V. (1994). Einsatz von R¨uckkopplungen in der Bildauswertung bei einem Hand-AugeSystem zur automatischen Demontage. Dissertationen zur Ku¨ nstlichen Intelligenz (DISKI 72). Sankt Augustin: infix-Verlag. Goldberg, R. R. (1993). Pose determination of parameterized object models from a monocular image. Image and Vision Computing 11(1), 49–62. Golub, G. H. & van Loan, C. F. (1991). Matrix Computations (second ed.). Baltimore and London: The John Hopkins University Press. Grabowski, J., Herrmann, T., & Weiß, P. (1993). Wenn “vor” gleich “hinter” ist – zur multiplen Determination des Verstehens von Richtungspr¨apositionen. Kognitionswissenschaft 3, 171–183. Han, M.-H. & Rhee, S. (1992). Camera Calibration for Three-Dimensional Measurement. Pattern Recognition 25(2), 155–164. Harnad, S. (1987). Introduction: Psychophysical and cognitive aspects of categorical perception: A critical overview. In S. Harnad (Ed.), Categorial perception. The groundwork for cognition, pp. 1–25. Cambridge University Press. Hartley, R., Gupta, R., & Chang, T. (1992). Stereo from Uncalibrated Cameras. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Champaign, IL, June 15-18, pp. 761–764. IEEE. Hayward, W. & Tarr, M. (1995). Spatial language and spatial representation. Cognition 55, 39–84. Heidemann, G., Kummert, F., Ritter, H., & Sagerer, G. (1996). A Hybrid Object Recognition Architecture. In International Conference on Artificial Neural Networks ICANN-96, Bochum, Germany, July 15-19. Heidemann, G. & Ritter, H. (1996). Objekterkennung mit neuronalen Netzen. Report 96/2 – Situierte Ku¨ nstliche Kommunikatoren, SFB 360, Universit¨at Bielefeld. Hern`andez, D. (1993). Qualitative Representation of Spatial Knowledge. Lecture Notes in Artificial Intelligence, 804. Berlin, Heidelberg, etc.: Springer-Verlag.
Herrmann, T. (1990). Vor, hinter, rechts und links: das 6H-Modell. Zeitschrift f¨ur Literaturwissenschaft und Linguistik 78, 117–140. Herrmann, T. (1982). Sprechen und Situation. Berlin: Springer. Herrmann, T. (1985). Allgemeine Sprachpsychologie: Grundlagen und Probleme. M¨unchen: Urban & Schwarzenberg. Herrmann, T. & Deutsch, W. (1976). Psychologie der Objektbenennung. Bern: Huber. Herrmann, T. & Grabowski, J. (1994). Sprechen: Psychologie der Sprachproduktion. Heidelberg: Spektrum. Herskovits, A. (1985). Semantic and Pragmatics of Locative Expressions. Cognitive Science 9, 341– 378. Summary. Herskovits, A. (1986). Language and spatial cognition. Cambridge University Press. Herzog, G., Blocher, A., Gapp, K.-P., Stopp, E., & Wahlster, W. (1996). VITRA: Verbalisierung visueller Information. Informatik Forschung und Entwicklung 11(1), 12–19. Herzog, G. & Wazinski, P. (1994). VIsual TRanslator: Linking Perceptions and Natural Language Descriptions. Artificial Intelligence Review Special Issue on Integration of Natural Language and Vision Processing 8(2-3), 83–95. Heydrich, W. & Rieser, H. (1995). Public Information and Mutual Error. Report 95/12 – Situierte Ku¨ nstliche Kommunikatoren, SFB 360, Universit¨at Bielefeld. Hinderer, K. (1988). Grundlagen der Stochastik f¨ur Informatiker und Ingenieure. Skriptum zur gleichnamigen Vorlesung im WS 1988/89, Institut f¨ur Mathematische Statistik, Universit¨at Karlsruhe. Horn, B. & Brooks, M. (Eds.) (1989). Shape from shading. Cambridge, MA, London, UK: MIT Press. Howarth, R. (1995). On seeing spatial expressions. In IJCAI-95 Workshop on Representation and Processing of Spatial Expressions, Montr´eal, Canada, pp. 118–132. Howarth, R. & Buxton, H. (1992). Analogical reprsentation of space and time. Image and Vision Computing 10(7), 467–478. Huang, T., Koller, D., Ogasawara, B., Rao, B., Russell, S., & Weber, J. (1994). Automatic Symbolic Traffic Scene Analysis Using Belief Networks. In Proc. American Association Conference of Artificial Intelligence (AAAI), pp. 966–972. J¨ahne, B. (1991). Digital Image Processing. Berlin, Heidelberg: Springer-Verlag. Jain, R. & Jain, A. (1990). Analysis and Interpretation of Range Images. Berlin, Heidelberg: Springer-Verlag. Jameson, A., Sch¨afer, R., Simons, J., & Weis, T. (1995). Adaptive Provision of Evaluation-Oriented Information: Tasks and Techniques. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), Montr´eal, Canada, pp. 1886–1893. Johnson-Laird, P. N. (1980). Mental Models in Cognitive Science. Cognitive Science 4, 71–115. Kanade, T. (1980). Signal vs. Semantics. Computer Graphics and Image Processing 13, 279–297.
Kjeldsen, R. & Kender, J. (1996). Towards the Use of Gesture in Traditional User Interfaces. In Proceedings of the 2nd Int. Conf. on Automatic Face and Gesture Recognition, Killington, Vermont, pp. 151–156. IEEE Computer Society Press. Klein, W. (1979). Wegausk¨unfte. Zeitschrift f¨ur Literaturwissenschaft und Linguistik 33, 9–57. Klinker, G. J., Shafer, S. A., & Kanade, T. (1988). The Measurement of Highlights in Color Images. International Journal of Computer Vision 2, 7–32. Koenderink, J. J. & van Doorn, A. J. (1991). Affine structure from motion. Journal Opt. Soc. Am. A 8, 377–385. Koller, D. (1992). Detektion, Verfolgung und Klassifikation bewegter Objekte in monokularen Bildfolgen am Beispiel von Straßenverkehrsszenen. Dissertationen zur Ku¨ nstlichen Intelligenz (DISKI 13). Sankt Augustin: infix-Verlag. Koller, D., Daniilidis, K., & Nagel, H.-H. (1993). Model-Based Object Tracking in Monocular Image Sequences of Road Traffic Scenes. International Journal of Computer Vision 10(3), 257– 281. Koller, D., Weber, J., Huang, T., Malik, J., Ogasawara, G., Rao, B., & Russell, S. (1994). Towards Robust Automatic Traffic Scene Analysis in Real-Time. In Proc. International Conference on Pattern Recognition, Jerusalem, Israel, October 9-13, pp. 126–131. Kollnig, H. (1995). Ermittlung von Verkehrsgeschehen durch Bildfolgenauswertung. Dissertationen zur Ku¨ nstlichen Intelligenz (DISKI 88). Sankt Augustin: infix-Verlag. Kollnig, H., Damm, H., Nagel, H.-H., & Haag, M. (1995). Zuordnung nat¨urlichsprachlicher Begriffe zu Geschehen an einer Tankstelle. In G. Sagerer, S. Posch, & F. Kummert (Eds.), 17. DAGMSymposium Mustererkennung, Bielefeld, Sept. 13-15, pp. 236–243. Springer-Verlag. Kollnig, H. & Nagel, H.-H. (1993). Ermittlung von begrifflichen Beschreibungen von Geschehen in Straßenverkehrsszenen mit Hilfe unschafer Mengen. Informatik Forschung und Entwicklung 8, 186–196. Kollnig, H., Otte, M., & Nagel, H.-H. (1994). The Association of Motion Verbs with Vehicle Movements Extracted from Dense Optical Flow Fields. In Proc. 3rd European Conference on Computer Vision. Stockholm, Sweden, May 2-6, J.-O. Eklundh (Ed.), Lecture Notes in Computer Science 801, Springer-Verlag. Kosslyn, S. M. (1994). Image and Brain: The Resolution of the Imagery Debate. Cambridge, MA and London, UK: The MIT Press. Kosslyn, S. M., Flynn, R. A., Amsterdam, J., & Wang, G. (1990). Components of high-level vision: a cognitive neuroscience analysis and accounts of neurological syndromes. Cognition 34, 203– 277. Kummert, F. (1992). Flexible Steuerung eines sprachverstehenden Systems mit homogener Wissensbasis. Dissertationen zur Ku¨ nstlichen Intelligenz (DISKI 12). Sankt Augustin: infix-Verlag. Kummert, F. (1997). Interpretation von Bild- und Sprachsignalen – Ein hybrider Ansatz –. Eingereichte Habilitationsschrift f¨ur die Universit¨at Bielefeld.
Kummert, F., Niemann, H., Prechtel, R., & Sagerer, G. (1993). Control and Explanation in a Signal Understanding Environment. Signal Processing, Special Issue on ‘Intelligent Systems for Signal and Image Understanding’ 32, 111–145. Landau, B. & Jackendoff, R. (1993). “What” and “where” in spatial language and spatial cognition. Behavioral and Brain Sciences 16, 217–265. Lauritzen, S. L. & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B 50(2), 157–224. Lee, K.-F. (1989). Automatic Speech Recognition: the Development of the SPHINX System. Boston: Kluwer Academic Publishers. Lenz, R. (1987). Linsenfehlerkorrigierte Eichung von Halbleiterkameras mit Standartobjektiven f¨ur hochgenaue 3D-Messungen in Echtzeit. In E. Paulus (Ed.), DAGM-Symposium Mustererkennung 1987, Braunschweig, pp. 212–215. Springer-Verlag. Leung, T., Burl, M., & Perona, P. (1995). Finding Faces in Cluttered Scenes using Random Labeled Graph Mathing. In Proc. of the Int. Conf. on Computer Vision, Boston, MA, June 20-123, pp. 637–644. Lewis, D. (1969). Convention: A philosophical study. Cambridge, MA: Harvard University Press. Li, M. (1994). Camera Calibration of a Head-Eye System for Active Vision. In Proc. 3rd European Conference on Computer Vision, Volume I, pp. 543–554. Stockholm, Sweden, May 2-6, J.-O. Eklundh (Ed.), Lecture Notes in Computer Science 801, Springer-Verlag. Lobin, H. (1993). Situiertheit. KI 1, p. 61. Lowe, D. G. (1991). Fitting Parameterized Three-Dimensional Models to Images. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-13(5), 441–450. Ma, S. D. (1993). Conics-Based Stereo, Motion Estimation, and Pose Determination. International Journal of Computer Vision 10(1), 7–25. Maaß, W. & Kevitt, P. M. (Eds.) (1996). Workshop on Processes and Representations between Vision and Natural Language, Budapest, Hungary. ECAI’96. Mangold-Allwinn, R., Barattelli, S., Kiefer, M., & Koelbing, H. (1995). W¨orter f¨ur Dinge. Von flexiblen Konzepten zu variablen Benennungen. Opladen: Westdeutscher Verlag. Mangold-Allwinn, R., von Stutterheim, C., Barattelli, S., U.Kohlmann, & Koelbing, H. G. (1992). Objektbenennung im Diskurs. Eine interdisziplin¨are Untersuchung. Kognitionswissenschaft 3, 1–11. M¨antyl¨a, M. (1988). An Introduction to Solid Modeling. Rockville, MD: Computer Science Press. Marr, D. (1982). Vision. San Francisco: W. H. Freeman and Co. Mast, M., Kompe, R., Kummert, F., Niemann, H., & N¨oth, E. (1992). The Dialog Module of the Speech Recognition and Dialog System EVAR. In Proc. Int. Conf. on Spoken Language Processing, Volume 2, Banff, Alberta, Canada, pp. 1573–1576. Mast, M., Kummert, F., Ehrlich, U., Fink, G., Kuhn, T., Niemann, H., & Sagerer, G. (1994). A Speech Understanding and Dialog System with a Homogeneous Linguistic Knowledge Base. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-16(2), 179–194.
Mc Kevitt, P. (Ed.) (1994a). Proc. of AAAI-94 Workshop on Integration of Natural Language and Vision Processing, Seattle, WA. Mc Kevitt, P. (Ed.) (1994b). Special Issue on Integration of Natural Language and Vision Processing, Volume 8 of Artificial Intelligence Volume. Kluwer Academic Publishers. McCarthy, J. & Hayes, P. J. (1969). Some Philosophical Problems from the Standpoint of Artificial Intelligence. In B. Meltzer & D. Michie (Eds.), Machine Intelligence 4, Chapter 26, pp. 463– 502. New York: American Elsevier. Medin, D. L. & Barsalou, L. W. (1987). Categorization processes and categorical perception. In S. Harnad (Ed.), Categorial perception. The groundwork for cognition, pp. 455–490. Cambridge University Press. Merz, T. (1995). 3D-Rekonstruktion und Kamerakalibrierung aus Bildern bekannter Objekte. Diplomarbeit, Universit¨at Bielefeld, Technische Fakult¨at, AG Angewandte Informatik. Minsky, M. (1975). A Framework for Representation Knowledge. In P. H. Winston (Ed.), The Psychology of Computer Vision, pp. 211–277. New York: McGraw-Hill. Mohr, R. (1993). Projective Geometry and Computer Vision. In C. H. Chen, L. F. Pau, & P. S. P. Wang (Eds.), Handbook of Pattern Recognition and Computer Vision, Chapter 2.4, pp. 369–393. World Scientific Publishing Company. Mohr, R., Veillon, F., & Quan, L. (1993). Relative 3D Reconstruction Using Multiple Uncalibrated Images. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), New York, NY, pp. 543–548. IEEE. Moratz, R., Heidemann, G., Posch, S., Ritter, H., & Sagerer, G. (1995). Representing procedural knowledge for semantic networks using neural nets. In Proc. 9th Scandinavian Conference on Image Analysis, Uppsala. Mukerjee, A. (1994). Metric-less modeling of one, two and three-dimensional metric spaces. In Working notes of the AAAI-94 Workshop on Spatial and Temporal Reasoning, Seattle, WA. Mukerjee, A. (1996). Neat vs Scruffy: A review of Computational Models for Spatial Expressions. Draft Paper. Naeve, U., Socher, G., Fink, G. A., Kummert, F., & Sagerer, G. (1995). Generation of Language Models Using the Results of Image Analysis. In Proc. of Eurospeech’95, 4th European Conference on Speech Communication and Technology, Madrid, Spain, 18-21 Sep., pp. 1739–1742. Nagel, H. (1988). From image sequences towards conceptual descriptions. Image and Vision Computing 6(2), 59–74. ¨ Nagel, H.-H. (1979). Uber die Repr¨asentation von Wissen zur Auswertung von Bildern. In DAGMSymposium Mustererkennung 1979, Karlsruhe, J.P. Foith (Hrsg.), Informatik-Fachberichte 20, pp. 3–21. Springer-Verlag. Nagel, H.-H. (1996). Zur Strukturierung eines Bildfolgen-Auswertungssystems. Informatik Forschung und Entwicklung 11(1), 3–11. Nakatani, H. & Itho, Y. (1994). An Image Retrieval System That Accepts Natural Language. In P. Mc Kevitt (Ed.), AAAI-94 Workshop on Integration of Natural Language and Vision Processing, pp. 7–13. Twelfth National Conference on Artificial Intelligence (AAAI-94).
Neumann, B. & Mohnhaupt, M. (1988). Propositionale und analoge Repr¨asentation von Bewegungsverl¨aufen. KI K¨unstliche Intelligenz 1, 4–10. Neumann, B. & Novak, H.-J. (1986). NAOS: Ein System zur nat¨urlichsprachlichen Beschreibung zeitver¨anderlicher Szenen. Informatik Forschung und Entwicklung 1, 83–92. Nie, N. H., Hull, C. H., Jenkins, J. G., Steinbrenner, K., & Bent, D. H. (1975). SPSS: Statistical Package for the Social Sciences. New York: McGraw-Hill. Niemann, H. (1981). Pattern Analysis. Berlin, Heidelberg, New York: Springer-Verlag. Niemann, H. (1985). Wissensbasierte Bildanalyse. Informatik Spektrum 8, 201–214. Niemann, H., Sagerer, G., Schr¨oder, S., & Kummert, F. (1990). ERNEST: A Semantic Network System for Pattern Understanding. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI12(9), 883–908. Oliensis, J. & Dupuis, P. (1993). A Global Algorithm for Shape from Shading. In Proc. of the Int. Conf. on Computer Vision, Berlin, Germany, pp. 692–701. IEEE Computer Society Press. Olivier, P. & Tsujii, J.-I. (1994). Quantitative Perceptual Representation of Prepositional Semantics. Artificial Intelligence Review Special Issue of Natural Language and Vision Processing 8(2-3), 55–66. Palmer, S. E. (1978). Fundamental Aspects of Cognitive Representation. In E. Rosch & B. B. Lloyd (Eds.), Cognition and Categorization, Chapter 9, pp. 259–303. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers. PAMI 15(3) (1993, March). IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 15 Number 3. Special Section on Probabilistic Reasoning, IEEE Computer Society. Pearl, J. (1986). Fusion, propagation and structuring in belief networks. Artificial Intelligence 29(3), 241–288. Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann Publishers. Perez, F. & Koch, C. (1994). Toward Color Image Segmentation in Analog VLSI: Algorithm and Hardware. International Journal of Computer Vision 12, 17–42. Peuquet, D. J. & Ci-Xiang, Z. (1987). An Algorithm to Determine the Directional Relationship between Arbitrarily-shaped Polygons in the Plane. Pattern Recognition 20(1), 65–74. Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1988). Numerical Recipes in C. Cambridge University Press. Priese, L. & Rehrmann, V. (1993). On Hierarchical Color Segmentation and Applications. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), New York, NY, pp. 633–634. IEEE. Redfern, D. (1996). Maple V, The Maple Handbook. New York and others: Springer-Verlag. Retz-Schmidt, G. (1988a). A REPLAI of SOCCER: Recognizing Intentions in the Domain of Soccer Games. In Proc. of the 8th European Conference on Artificial Intelligence (ECAI-88), pp. 455– 457. Retz-Schmidt, G. (1988b). Various Views on Spatial Prepositions. AI Magazine 9(2), 95–105.
Richards, W., Jepson, A., & Feldman, J. (1996). Priors, Preferences and Categorial Percepts. In W. Richards & D. Knill (Eds.), Perception as Bayesian Inference, pp. 93–122. Cambridge University Press. Rickheit, G. & Strohner, H. (1994). Kognitive Grundlagen situierter k¨unstlicher Kommunikatoren. In H.-J. Kornadt, J. Grabowski, & R. Mangold-Allwinn (Eds.), Sprache und Kognition: Perspektiven moderner Sprachpsychologie, pp. 73–92. Heidelberg: Spektrum Akademischer Verlag. Rimey, R. & Brown, C. (1994). Control of Selective Perception Using Bayes Nets and Decision Theory. International Journal of Computer Vision 12(2/3), 173–207. Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural Computation and Self-organizing Maps. Reading, MA: Addison-Wesley. Russell, S. & Norvig, P. (1995). Artificial Intelligence A Modern Approach. Englewood Cliffs, NJ: Prentice Hall. Safaee-Rad, R., Tchoukanov, I., Smith, K. C., & Benhabib, B. (1992). Constraints on quadraticcurved features under perspective projection. Image and Vision Computing 10(8), 532–548. Sagerer, G. (1988). Automatic Interpretation of Medical Image Sequences. Pattern Recognition Letters 8, 87–102. Sagerer, G. (1990). Automatisches Verstehen gesprochener Sprache. Reihe Informatik, Band 74. Mannheim: Wissenschaftsverlag. Sagerer, G., Kummert, F., & Socher, G. (1996). Semantic Models and Object Recognition in Computer Vision. In K. Kraus & P. Waldh¨ausel (Eds.), International Archives of Photogrammetry and Remote Sensing, Volume XXXI, Part B3, Commission 3, pp. 710–723. Sagerer, G., Prechtel, R., & Blickle, H.-J. (1990). Ein System zur automatischen Analyse von Sequenzszintigrammen des Herzens. Der Nuklearmediziner 3, 137–154. Scales, L. (1985). Introduction to Non-Linear Optimization. London: Macmillan. Schade, U. (Ed.) (1995). Situiertheit, Integriertheit, Robustheit: Entwicklungslinien f¨ur einen Ku¨ nstlichen Kommunikator. Report 95/17 – Situierte Ku¨ nstliche Kommunikatoren, SFB 360. Universit¨at Bielefeld. Scheering, C. (1995). Modellierung von Richtungspr¨apositionen auf der Basis von objektspezifischen Partitionen des 3D Raums. Diplomarbeit, Universit¨at Bielefeld, Technische Fakult¨at, AG Angewandte Informatik. ¨ Schefe, P. (1986). K¨unstliche Intelligenz – Uberblick und Grundlagen. Mannheim: Bibliographisches Institut. Schirra, J. R. J. & Stopp, E. (1993). ANTLIMA - A Listener Model with Mental Images. In Proc. of the Int. Joint Conference on Artificial Intelligence, Chamb´ery, France, pp. 175–180. Schnelle, H. (1991). Die Natur der Sprache. Die Dynamik der Prozesse des Sprechens und Verstehens. Berlin, New York: Walter de Gruyter. Semple, J. G. & Kneebone, G. T. (1952). Algebraic Projective Geometry. Oxford Science Publication.
Socher, G., Fink, G. A., Kummert, F., & Sagerer, G. (1996). A Hybrid Approach to Identifying Objects from Verbal Descriptions. In Workshop on “Multi-Lingual Spontaneous Speech Recognition in Real Environments”, Nancy, France, June 6-7. Socher, G., Merz, T., & Posch, S. (1995a). 3-D Reconstruction and Camera Calibration from Images with known Objects. In D. Pycock (Ed.), Proc. British Machine Vision Conference, Birmingham, UK, Sept. 11-14, pp. 167–176. Socher, G., Merz, T., & Posch, S. (1995b). Ellipsenbasierte 3-D Rekonstruktion. In G. Sagerer, S. Posch, & F. Kummert (Eds.), 17. DAGM-Symposium Mustererkennung, Bielefeld, Sept. 1315, pp. 252–259. Springer-Verlag. Socher, G. & Naeve, U. (1996). A Knowledge-based System Integrating Speech and Image Understanding – Manual Version 1.0. Report 95/15 – Situierte Ku¨ nstliche Kommunikatoren, SFB 360, Universit¨at Bielefeld. Socher, G., Sagerer, G., Kummert, F., & Fuhr, T. (1996). Talking About 3D Scenes: Integration of Image and Speech Understanding in a Hybrid Distributed System. In International Conference on Image Processing (ICIP-96), Lausanne, pp. 18A2. Sowa, J. F. (1984). Conceptual Structures: Information Processing in Mind and Machine. The Systems Programming Series. Reading, MA: Addison-Wesley. Sowa, J. F. (1991). Principles of Semantic Networks. Philadelphia, Pennsylvania: Morgan Kaufmann Publishers, Inc. Spelluci, P. (1993). Numerische Verfahren der nichtlinearen Optimierung. Basel: Birkh¨auser. Srihari, R. K. (1994). Photo Understanding Using Visual Constraints Generated from Accompanying Text. In P. Mc Kevitt (Ed.), AAAI-94 Workshop on Integration of Natural Language and Vision Processing, pp. 22–29. Twelfth National Conference on Artificial Intelligence (AAAI94). Srihari, R. K. & Burhans, D. T. (1994). Visual Semantics: Extracting Visual Information from Text Accompanying Pictures. In Proc. of the American National Conf. on Artificial Intelligence, Seattle, WA. Stuelpnagel, J. (1964). On the Representation of the three-dimensional Rotation Group. SIAM Review 6(4), 422–430. Talmy, L. (1983). How language structures space. In H. Pick & L. Acredolo (Eds.), Spatial orientation: theory, research and application. New York: Plenum Press. Taubin, G. (1991). Estimation of Planar Curves, Surfaces, and Nonplanar Space Curves Defined by Implicit Equations with Applications to Edge and Range Image Segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-13(11), 1115–1138. Th´orisson, K. R. (1995). Multimodal Interface Agents and the Architecture of Psychosocial Dialogue Skills. Ph.D. Thesis Proposal, M.I.T. Media Laboratory. Toal, A. F. & Buxton, H. (1992). Spatio-temporal reasoning within a traffic surveillance system. In Proc. Second European Conference on Computer Vision, pp. 884–892. Santa Margherita Ligure, Italy, 18-23 May, G. Sandini (Ed.), Lecture Notes in Computer Science 588, Springer-Verlag.
Tsai, R. (1985). A Versatile Camera Calibration Technique for High Accuracy 3D Machine Vision Metrology using Off-the-Shelf TV Cameras and Lenses. Technical report. Tsotsos, J. K. & 16 others (1995). The PLAYBOT Project. In J. Aronis (Ed.), IJCAI’95 Workshop on AI Applications for Disabled People, Montr´eal. Tsotsos, J. K., Verghese, G., Dickinson, S., Jenkin, M., Jepson, A., Milios, E., Nuflo, F., Stevenson, S., Black, M., Metaxas, D., S.Culhane, Yet, Y., & Mann, R. (1997). PLATBOT: A VisuallyGuided Robot for Physically Disabled Children. submitted to Image and Vision Computing. Ullman, S. (1996). High-level vision : object recognition and visual cognition. Cambridge, MA: MIT Press. Verghese, G. & Tsotsos, J. K. (1994). Real-Time Model-based Tracking Using Perspective Alignment. In Proc. Vision Interface’94, Banff, pp. 202–209. Vorwerg, C. (in preparation). Categorization of spatial relations. Vorwerg, C., Socher, G., Fuhr, T., Sagerer, G., & Rickheit, G. (1997). Projective relations for 3D space: Computational model, application, and psychological evaluation. In Proc. of the 14th National Conference on Artificial Intelligence (AAAI-97), Providence, Rhode Island, pp. 159– 164. Vorwerg, C., Socher, G., & Rickheit, G. (1996). Benennung von Richtungsrelationen. In R. H. Kluwe & M. May (Eds.), Proceedings der 2. Fachtagung der Gesellschaft f¨ur Kognitionswissenschaft, KogWis96, Universit¨at Hamburg, pp. 184–186. Wahlster, W. (1989). One Word says More Than a Thousand Pictures. On the Automatic Verbalization of the Results of Image Sequence Analysis Systems. Computers and Artificial Intelligence 8, 479–492. Wahlster, W. (1994). Text and Images. In R. A. Cole, J. Ariani, H. Uszkoreit, A. Zenen, & V. Zue (Eds.), Survey on Speech and Natural Language Technology. Dordrecht: Kluwer. Wazinski, P. (1993). Graduated Topological Relations. Technical Report 54, Universit¨at des Saarlandes, Saarbr¨ucken. Weng, J., Cohen, P., & Herniou, M. (1992). Camera Calibration with Distortion - Models and Accuracy Evaluation. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-14(10), 965– 980. Winograd, T. (1983). Language as a Cognitive Process, Chapter 6.5, pp. 311–326. Reading, MA: Addison-Wesley. Wunderling, H. & Heise, H. (1984). Schuelkes Tafeln. Stuttgart: B.G. Teubner. Xie, M. (1994). On 3D Reconstruction Strategy: A Case of Conics. In Proc. International Conference on Pattern Recognition, Jerusalem, Israel, October 9-13, pp. 665–667. Computer Society Press. Yuan, J. (1989). A General Photogrammetric Method for Determing Object Position and Orientation. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-5(2), 129–142. Zhang, Z. & Faugeras, O. (1992). 3D Dynamic Scene Analysis. Berlin, Heidelberg: Springer-Verlag.
Index

The most relevant page numbers for an item are printed in bold.
2-test, 111 3D, 5, 21, 35, 80 reconstruction, 5, 17ff., 35–63, 137, 143 accuracy, see accuracy model-based, 35–63 A -algorithm, 13 abstraction, 25 level of *, 5, 16, 27, 29 acceptance relation, 81ff., 89 volume, 81ff., 86, 89 accuracy, 5 detection *, 69 of 3D reconstruction, 57–61 of object identification, 118–126 ANOVA, 90 any-time requirement, 22, 137 artificial intelligence, 11, 27 Baufix, 3, 8, 9, 39, 71–73, 80, 102, 105, 108–112, 145–149 3-holed-bar, Dreilochleiste, 145 5-holed-bar, F¨unflochleiste, 145 7-holed-bar, Siebenlochleiste, 145 bolt, Schraube, 149 cube, Schraubw¨urfel, 145 hexagonal bolt, Sechskantschraube, 149 rhomb-nut, Rautenmutter, 147
rim, Felge, 148 round-headed bolt, Rundkopfschraube, 149 socket, Mitnehmerbuchse, 148 tire, Reifen, 148 washer, Beilagscheibe, 148 Bayes classification, 75 Bayes’ rule, 97 Bayesian network, 5, 18, 22, 32, 95–104, 127, 134 belief, 96, 103 network, see Bayesian network bottom-up, see data-driven bounding cuboid, 39, 80 camera calibration, 21, 35f., 38, 50–51 model, 40–42 parameters, 36, 38ff., 50 categorical representation, see representation category, 5, 29, 66, 102 causal support, 97 chain rule, 159 Chasles, see theorem of Chasles classification, 74, see color classification, polynomial classifier cognitive processes, 113, 134 psychology, 4, 28, 89, 105 collineation, see homogeneous transformation color, see object color classification, 19, 22, 73–77 conditional probability table, 76, 95, 98, 102, 105
confidence level of *, 104 conic, see ellipse constraint, see restriction constructor, 9 convergence, 51, 53 coordinate frame, 36, 39–42, 152 coordinates affine, 42f., 152 homogeneous, 42ff., 152 covariance, 43, 44, 51, 109, 165–168 cross ratio, 45, 152, 155–158 DACS, 19 data-driven, 14, 21 degree of accordance, 82, 84 of applicability, 78, 83, 88ff. of containment, 81, 83 derivative, see Jacobian detection accuracy, see accuracy diagnostic support, 97 dialog, 14–16, 23, 71f. -step, 15ff. -stragegy, 141 direction vector, 81 ellipse, 37, 45–48, 155–158, 161, 168 quadratic form, 46 representation, 46 end-to-end evaluation, 131 ERNEST, 12–14, 16, 34, 67 concept, 13, 16–18, 68 concrete link, 13, 16, 18, 68 context-dependent part link, 13, 18 goal-concept, 13 inference rules, 14 instance, 13, 17, 18, 68 modified concept, 13, 68 part link, 13, 16 reference link, 13, 18 specialization link, 13
structural relation, 13 evidence, 97 focal length, 36, 40, 51, 54 fuzzy score, see scoring geometry projective, 37ff., 152–153 gradient descent, see minimization Hidden Markov Models (HMM), 34, 141 homogeneous, see transformation, coordinates human-computer interaction (HCI), 1ff., 14, 23, 79, 140 identification, see object identification image feature, 36, 50 ellipse, see ellipse line segment, see line segment point, 43 sequence, 126 understanding, 2ff., 10ff., 17–23, 25– 34, 65, 101, 133, 137 incremental processing, 5, 17, 22, 133 inference, 1f., 10f., 27f., 28, 101, 127 instruction, 1, 4, 14 instructor, 9ff. integration, 3, 7 intended object, 9, 14ff., 77ff., 102, 106 invariant, 45, 152, 155 Jacobian, 42, 54, 159 knowledge, 1, 9f., 19, 36, 118 base, 11, 13, 16, 29 modeling, 118 representation, 4, 12, 28, 105, 118 knowledge-based interpretation, 11f. language model, 18, 34 lens distortion, 38, 61
Levenberg-Marquardt method, see minimization likelihood value, 103 line segment, 43–45, 160, 166 MDL-representation, 43 midpoint-representation, 43, 161, 167 vertex representation, 43 linguistics, 4, 16ff., 28, 77, 134 localization, 1, 21, 80, 87 Mahalanobis distance, 43, 50, 165 Markov-Random-Fields (MRF), 140 meaning definition, 82 mental models, see representation minimal specification, 2 minimization, 36ff., 51–55, 159 gradient descent, 51 Levenberg-Marquardt, 51ff., 159 Newton’s method, 53 sheme, 54 model feature, 36, 39, 50, 159–164 ellipse, see ellipse line segment, see line segment point, 43, 159, 165 fitting, 36, 54, 55 primitive, see model feature model-driven, 8, 14, 21 network Bayesian, see Bayesian network neural, 17, 67f. probabilistic, see Bayesian network semantic, see ERNEST notations, 40, 143 object abstraction, 80 color, 9ff., 26, 29, 70–77, 102 description, 71 identification, 4f., 10ff., 95–129, 131, 139 model, 36, 39
pose, see pose recognition, 5, 17ff., 32, 67 shape, 22ff., 26, 29, 102, 106 size, 22ff., 26, 29, 102, 106 type, 9ff., 26, 29, 67, 102 optical center, 40 partial derivative, 159 pattern recognition, statistical, 74 perception, 1, 5ff., 7, 21, 23 perspective projection, 21, 40–42, 43–48, 159–161, 165 pin-hole camera, see camera model pitch, see rotation angles polynomial classifier, 68, 75 pose, 21, 36ff. estimation, see 3D reconstruction possibility, 103 principal point, 36, 40 probabilistic network, see Bayesian network projection, see perspective projection projective, see geometry, transformation, coordinates reconstruction, 37 propagation, 5, 13ff., 98ff. property, 5, 26, 29, 66, 102 psycholinguistic experiment, 4, 89–93, 101, 105f. psychology, 134, see cognitive psychology, psycholinguistic experiment qualitative description, 4ff., 9, 18ff., 27, 29, 65–94, 134, 138 Q UA SI-ACE, 3, 9, 14–20, 100, 131 reasoning under uncertainty, 101 recognition, see object recognition, speech recognition reconstruction, see 3D reconstruction reference frame, 19, 77, 82 object, 18, 77ff.
representation, 2, 5, 10, 25–30, 35, 65, 101, 134 categorical, 27ff. conceptual, 27ff. knowledge, see knowledge representation mental models, 28 numerical, 5, 25ff. qualitative, see qualitative description symbolic, 11, 27ff. vectorial fuzzified, see vectorial fuzzified representation restriction, 5, 8ff., 65, 87, 101, 128 retinal plane, 40 robustness, 9 roll, see rotation angles rotation, 39ff., 152f. angles, 42, 153 scene properties, 29 scoring, 12f., 29, 65 search, 11ff., 48, 68 segmentation, 68ff. semantic network, see ERNEST shape, see object shape situated artificial communicator, 2f., 7–10, 14, 17 situatedness, 7 size, see object size spatial preposition, 31, 77 relations, 1, 9ff., 19ff., 26ff., 77–94, 104, 125, 138 speech recognition, 1, 18 understanding, 1ff., 9ff., 14–16, 21, 65, 101, 133 stereo camera, 3, 9 image, 5, 19ff., 36, 38, 43, 59 matching, 48–50 vision, 35
theorem of Chasles, 45, 155 three-dimensional, see 3D top-down, see model-driven transformation, 41ff. affine, 151f. Euclidean, 151 homogeneous, 37, 40–48, 151–153, 159 projective, 45, see homogeneous translation, 39ff., 152 type, see object type understanding image, see image understanding natural language, 30 speech, see speech understanding vectorial fuzzified representation, 5, 22, 29, 65, 78 visual perception, 3, 9, see perception winner-take-all, 90 World-Wide-Web (WWW), 22, 105f. yaw, see rotation angles