DISSERTATION
Robust Object Detection for Robotics using Perceptual Organization in 2D and 3D

ausgeführt zum Zwecke der Erlangung des akademischen Grades eines Doktors der technischen Wissenschaften

unter der Leitung von
Ao.Univ.Prof. Dipl.-Ing. Dr.techn. Markus Vincze
Institut für Automatisierungs- und Regelungstechnik (E376)

eingereicht an der Technischen Universität Wien, Fakultät für Elektrotechnik und Informationstechnik

von
Dipl.-Ing. Andreas Richtsfeld
geb. am 22. Mai 1978
Matr. Nr.: 9826061
Waxenberg 55, 4182 Waxenberg
Wien, im Juni 2013
Abstract

Since robots conquered the assembly lines in factories, the focus of robotics research has shifted from simple pick-and-place tasks to sophisticated robotic solutions. Interest in mobile, domestic robotics has increased and received great attention in the last decade. Bringing robots from assembly lines into our households means bringing them from a clearly structured workspace into an unknown environment with many uncertainties. Hence, the need for robustly working perception methods that are able to extract information from cluttered environments has increased. The focus of this thesis lies on robust object detection from visual input data, first for 2D color image data and subsequently for range image data (RGB-D), by exploiting perceptual grouping. Perceptual grouping is a generic technique to organize visual primitives into meaningful groupings and is inspired by human perception.

Object detection for 2D image data starts with the extraction of edge primitives. A grouping framework for object detection by hierarchical data abstraction of visual input is introduced. Edge primitives extracted from 2D color images are grouped and parametrized at several levels of data abstraction. Incremental indexing is used to iteratively group lower-level primitives into higher-level entities according to perceptual grouping rules and finally ends in the detection of proto-objects, such as cuboids, cones, cylinders and spheres. Incremental indexing leads to anytime processing, the behavior of a system that delivers the best results generated so far whenever processing is stopped. It additionally avoids the use of parameters and thresholds. Furthermore, the proposed indexing method allows an attention mechanism to be easily integrated into the vision system. Processing of the grouping system ends with a 3D model reconstruction that exploits knowledge about the environment (supporting planes). An application of the system in a generally applicable computer vision framework for mobile robotics is shown.

Object detection for range image data (RGB-D) is again implemented in a hierarchical framework with several levels of data abstraction, but starts from initially extracted surface primitives. After pre-segmentation of the image, surface patches are modeled as planes or B-spline surfaces. Model selection with Minimum Description Length (MDL) is used to find for each surface patch the model that fits, and therefore represents, the data best. Inspired by the rules of perceptual grouping, relations between surface patches are defined and organized in a structured feature vector. To support the detection of a wide range of object types, feature vectors between surface models are learned with a support vector machine (SVM). With this method the importance of each relation for grouping is learned, and after a training stage the SVM predicts the grouping of surface models. To satisfy global scene properties, a graph-cut algorithm is finally employed to produce object hypotheses. The generality of the approach is shown using different datasets with objects of different size, shape and appearance. The implemented hierarchical framework structure allows the developed system to be used in different types of applications and in different environments and is therefore well suited for use in mobile and domestic robotics.
Contents

1 Introduction
  1.1 Computer Vision for Mobile Robotics
  1.2 A Computer Vision Architecture
  1.3 Problem Statement
  1.4 Thesis Outline and Contribution

2 Gestalt Psychology and Perceptual Organization
  2.1 Gestalt Psychology
  2.2 A Structure for Perceptual Organization
  2.3 Perceptual Organization in Computer Vision
    2.3.1 Perceptual Grouping of 2D sensor data
    2.3.2 Perceptual Grouping of 3D sensor data
  2.4 Discussion

3 Object Detection in 2D image data
  3.1 Detection of Basic Object Shapes
  3.2 Concept and Classificatory Structure
  3.3 Process Flow
  3.4 Incremental Indexing and Anytimeness
  3.5 Gestalt Principles and Primitives
  3.6 Adding Attention
  3.7 From 2D Shapes to 3D Objects
  3.8 Experiments
  3.9 Discussion
    3.9.1 Limitations
    3.9.2 BLORT - The Blocks World Robotic Vision Toolbox

4 Object Detection in 3D image data
  4.1 Detection of Objects
  4.2 State of the Art
  4.3 Concept and Classificatory Structure
  4.4 The Object Segmentation Database (OSD)
  4.5 Pre-segmentation
    4.5.1 Normals calculation
    4.5.2 Normals clustering
    4.5.3 Experiments
  4.6 Parametrization and Model Selection
    4.6.1 Plane fitting
    4.6.2 B-spline fitting
    4.6.3 Model Selection
    4.6.4 Experiments
  4.7 Parametric Surface Grouping
    4.7.1 Relations at the structural level
    4.7.2 Relations at the assembly level
    4.7.3 Support Vector Machine (SVM) Classification
    4.7.4 Learning and Testing
    4.7.5 Experiments
  4.8 Global decision making
  4.9 Evaluation
    4.9.1 Evaluation on the object segmentation database
    4.9.2 Comparison with state-of-the-art methods
  4.10 Discussion

5 Conclusion
  5.1 Recent Research Work
  5.2 Outlook

Bibliography
List of Figures

1.1 Computer vision sub-architecture in a robotics framework: Interplay between detection, tracking, learning and recognition of objects.
1.2 Intermediate results of the detection module for RGB-D data: original image; segmented image areas which have already been assigned to complete entities; reconstructed parametric object models; and the estimated pose of objects with respect to the camera position.
1.3 2D edge image grouping: original image, extracted edges, assignments during the grouping process, and resulting basic object shapes.
2.1 The primarily discussed principles of perceptual organization.
2.2 Bottom-up classificatory structure for perceptual organization by Sarkar and Boyer [109, 14] for 2D and 3D data.
3.1 2D perceptual grouping over four levels of data abstraction and associated processing methods.
3.2 Process flow of shape primitives for basic object shape detection.
3.3 Definition of normal (n) and tangential (t) search lines (first row) and types of junctions between search lines (second row). Collinearities, T-junctions and L-junctions appear between lines, arc-junctions between arcs, and ellipse-junctions between ellipses and lines.
3.4 Construction of basic object shapes: First row: Cuboid from three flaps or from flap and L-junction. Second row: Cylinder from two extended ellipses and Cone from extended ellipse and L-junction.
3.5 Incremental grouping: Line search lines (first row), arc search lines (second row) and resulting basic shapes (third row) after 150, 200, 300 and 500 ms processing time.
3.6 Precision-recall curve with varying processing time.
3.7 Search lines and detected basic shapes when attention (region of interest) is set to the salt box and the tape, respectively.
3.8 More detected shape primitives: Office and living room scenes with detected boxes and cylinders (red) as well as the lower-level primitives rectangles (yellow) and closures (blue).
3.9 BLORT overview: First row: Detect and track tea box. Second row: Learn appearance features on shape model. Third row: Sequential recognition of learned objects in a cluttered scene.
4.1 3D perceptual grouping over four levels of data abstraction and associated processing methods.
4.2 Processing example of a complex scene: Original image, pre-segmented patches, parametric surfaces and extracted object hypotheses.
4.3 Example of each scene type from the object segmentation database (OSD).
4.4 Standard deviation of plane normals with respect to kernel radius kr and depth with a Microsoft Kinect. (The red points show the selected kernel radius values.)
4.5 Mean and standard deviation of the distance (∆d) between neighboring plane points over the depth distance between camera and plane point, measured with a Kinect sensor.
4.6 Assignment of pre-segmented patches. Ground truth, segmented patches P1 and P2 and resulting true positives tp, false positives fp and false negatives fn.
4.7 Pre-segmentation of table top scene from OSD. Parameters of the left images: ωc = −0.004 and c = 0.58; for the right images: ωc = −0.004 and c = 0.52. See discussion in text.
4.8 Left: Initialisation of a B-spline surface (green) using PCA. Right: The surface is fitted to the point cloud (black) by minimizing the closest point distances (red) (m = n = 3, p = 2, wa = 1, wr = 0.1).
4.9 Model selection: Original image; neighborhood network of pre-segmented patches; selection of neighboring patches; best combination of parametrized surface models (planes and B-splines).
4.10 Curvature calculation on patch border. Left: Estimation of neighboring pixels. Right: Projection of the surface normals and curvature estimation (see text for details).
4.11 Annotation example: Original image with two stacked boxes, annotation for training at the structural level, and annotation for training at the assembly level.
4.12 Graph-cut segmentation example: Crop of original image, pre-segmented surface patches, constructed graph, correctly segmented image even for wrong binary SVM classification (see text for detailed discussion).
4.13 Precision-recall for each segmented object. (a-d) with Mishra's [71] approach, (e-h) with Ückermann's [126] approach, (i-l) with our approach. Plots a, e and i show results from the OSD dataset and plots b, f and j show in more detail the upper right corner of the first plot. Plots c, g and k show results from the Willow Garage database and d, h and l again the upper right corner of c, g and k.
4.14 Examples from the OSD database. From left to right: Original image, results of Mishra, results of Ückermann, and results of our approach (SVMst+as).
4.15 Examples from the Willow Garage database. From left to right: Original image, results of Mishra, results of Ückermann, and results of our approach.
4.16 Limitations of the approach. Original image, pre-segmented surface patches and final object segmentation with errors.
5.1 Two boxes occluding each other and the table plane. Left: 3D points do not match the given color information, especially at edges of occlusion. Ellipses labeled with a indicate points broken at edges of occlusion, whereas b marks points where the color is not consistent with the depth. Right: Depth values are corrected using B-splines for representing surfaces and contours.
5.2 The point cloud obtained from the Kinect sensor, the segmentation results with the method shown in Chapter 4, the object hypotheses generated and the final objects selected by the hypothesis verification stage.
5.3 Overlap of a surface of the existing model (yellow) with a new surface (cyan). The overlap in image space is evaluated taking into account the surface normals (2nd image). Two quadratic B-spline patches (3rd image) are finally substituted by a single B-spline cylinder model (4th image) (images from [73]).
List of Tables

3.1 Average true positive detection rate for objects shown in Fig. 3.5 with different processing times, and results when using an attention point for the sought-after object.
4.1 Object Segmentation Database (OSD): Number of images and objects in the learn- and test-sets.
4.2 Over-segmentation, under-segmentation and the number of produced patches after pre-segmentation for different values of ωc and c.
4.3 Fscore and balanced error rate (BER) on the training set of the OSD-0.2 database [87] for the structural level.
4.4 Fscore and balanced error rate (BER) on the training set of the OSD-0.2 database [87] for the assembly level.
4.5 Results on the OSD database [87] for the structural level.
4.6 Results on the OSD database [87] for the assembly level.
4.7 Precision and recall on the OSD and Willow Garage dataset for the approach by Mishra et al. [71], Ückermann et al. [126] and for our approach, when using the SVM of the structural level (SVMst) and when using both data abstraction levels (SVMst+as).
Chapter 1

Introduction

“I can't define a robot, but I know one when I see one.”
Joseph Engelberger

Since mankind started to build machines, it has been a dream of inventors, researchers and engineers to build self-operating machines capable of performing jobs autonomously. Some early attempts at building such machines describe simple automata for certain tasks, e.g. coin-operated machines, fire engines or wind organs. After the industrial revolution started, records of more complex machines can be found, at first mainly to improve the art of war, but in the last century also machines resembling animals and humans.

Karel Čapek introduced the word robot in R.U.R. (Rossum's Universal Robots), a play published in 1920. Later, the scientist and science fiction author Isaac Asimov coined the term robotics in his famous science fiction story 'Liar!'. While there is no generally applicable definition of the term 'robot', we understand a robot as a machine able to perform complex tasks and to interact with humans without the necessity of any human control during execution of the assigned tasks.

While electronics, telecommunication and computer research celebrated great success in the last decades, robot technology and in particular mobile (humanoid) robotics was not able to fulfill the perhaps too high expectations, presumably due to the underestimated complexity of the necessary interrelationship of technologies from many different research areas. The missing concepts of cognition in combination with artificial intelligence (AI), which would enable robots to act autonomously, are perhaps another reason, but other research areas such as mechanics are also not yet far enough along to come up with a general-purpose robot that reaches market maturity. Instead, specialized robots that are used to perform a certain task have had more success. Industrial robots have been widely used in the automated production of goods for several decades, but these robots are mainly separated from humans to perform tasks in a reserved workspace without any interplay with humans. Their success is related to the high operational availability and high accuracy as well as the low costs compared to humans.

In contrast to industrial robotics, mobile robotics technology is often experimental and only a few simple robots were placed on the market and were successful for many years, such as vacuum cleaning robots, lawn mowing robots or pool cleaning robots. Nevertheless,
it seems that there is a trend towards more domestic robots, capable of taking over more and more tasks to relieve people from daily housework. Instead of having many robots where each one is specialized for a certain task, robots with many capabilities are desirable and would satisfy the needs of customers. Such autonomous robots should be able to serve humans in many different ways by having the skills to solve a broad range of tasks and problems as well as the social skills to interact with humans. They should act as companion butlers and facilitate life for humans by performing, e.g., fetch-and-carry tasks for household items.

Bringing robots from automated manufacturing into our households is tantamount to bringing them from their well-defined and deserted environment into an unknown and inhabited environment where they have to deal with clutter and uncertainties. A generic, universally usable robot is supposed to perceive and understand its environment, even if the environment is unknown or if it changes over time. The process of becoming aware of something through the senses is called perception and includes hearing, seeing and tactile sensing. This work is focused on making robots see, but without ignoring that perception is more than seeing alone.
1.1 Computer Vision for Mobile Robotics
The dream in AI and (cognitive) robotics research of finding a generic solution for computer vision is still far from being fulfilled. After years of research without a significant breakthrough, researchers started to split the problem of building a generic machine vision system into specific sub-problems and tried to solve these simpler problems, finally using the developed techniques in products with restricted functionality. This trend brought a wide range of efficient algorithms for the extraction of visual information and with them also lots of highly discriminative and invariant image descriptors. Examples of computer vision applications are known from state-of-the-art products in digital photography, such as cameras with face recognition which are also able to detect closed eyes. Other examples are image stitching software, which autonomously creates a large panorama image from several images taken side by side, toll charge systems, or access systems for parking garages which read the license plates of cars.

We place the following work in the mobile robotics research field, where autonomous robots should be able to serve humans and to execute a wide range of given tasks. An important research domain of mobile robotics is perception, the extraction of meaning from sensory input, allowing autonomous robots to operate in complex environments performing user-defined tasks. Seeing, which is a part of perception, is essential for understanding an environment and therefore important for a mobile robot in several aspects to solve specific tasks. The term Computer and Machine Vision incorporates all techniques to let machines see and to make sense of what they see. While the term computer vision is more related to the extraction of information from images, the term machine vision is used to indicate that visual information is extracted to influence robotic systems. We will not distinguish between these definitions and will apply Computer Vision as a
generic term for all methods which are needed to extract information from the visual input of a robotic system. This includes all methods and algorithms for acquiring, processing and analyzing images or higher-dimensional data from the real world in order to extract numerical or symbolic information.

All computer vision algorithms used by the previously mentioned applications are highly task dependent and goal oriented. Algorithms for computer vision in mobile robotics are usually also task dependent and only a few generic solutions can be found. Before we discuss some typical problems of computer vision, we would like to start with an illustration of a mobile service robot in a future household to outline the problem we would like to tackle:

Imagine a future household service robot, called Liza, which was just bought by Tom, a mechatronics engineer and avowed robot geek. Tom is always a bit lazy and prefers to spend more time behind the computer than on housework, so he is excited to see whether Liza can take over his household duties and serve him like a butler would do. After unpacking and activation, Tom takes Liza around the flat to familiarize her with the rooms of the household. Liza is a well-developed cognitive robot with some prior world knowledge, and during the guided tour she learns the allocation of rooms and the location of furniture and other household objects in the flat. In the kitchen Liza discovers an unknown object on the table, leading her to ask Tom what it is:

Liza: Master Tom, what is that object?
Tom: Well, this is my coffee cup which is dedicated to my morning coffee.

Liza learns to recognize Tom's cup with the picture of the Android Andy on it. After inspecting the cup by rotating it, Liza is able to find it later if necessary. Tom and Liza return to the living room and Tom is now quite curious whether Liza is able to fetch his cup from the kitchen:

Tom: Liza, would you please bring me my coffee from the kitchen.

Liza proceeds to the kitchen, reliably grabs Tom's coffee cup with the Android Andy on it from the kitchen table and returns to Tom.

Liza: Master Tom, here is your coffee.

Tom knows that it will take some time to get Liza familiar with the environment and all the household items within it, but she will learn day by day to better understand the new environment, and Tom is happy to now have Liza as an attendant.

From the computer vision point of view this example reveals some typical problems of computer vision for mobile robotics, which require a reliably working computer vision system to gather visual information in order to handle an assigned job. In the following list of problems some robotics domains are deliberately disregarded, such as human-robot interaction (HRI), navigation or grasping; instead, computer vision problems are listed which arise when a robot has to deal with an unknown situation and/or environment:

Detection of novel objects – If a robot operates in an unknown environment it will encounter novel objects. While detection of a known object is technically more easily
achievable, the detection of a novel and unknown object in a cluttered environment is a much tougher task. The robot has to be able to identify a new object and to separate it from the background in order to investigate its properties.

Learning of objects – Once an object is detected, the robot should be able to learn the properties of the object. Visual features have to be extracted and a representation (model) of the object has to be built. The representation of the object has to be stored together with the extracted visual features to enable recognition of the object.

Recognition of known objects – To perform fetch-and-carry tasks as shown in the previous example, a robot has to be able to search for previously seen objects. This entails the ability to recognize known or learned objects by matching the learned representation with the extracted visual features.

It is illusory to claim that we can find a generic solution to all these problems at once. Instead, we would like to develop generic methods and algorithms to efficiently extract visual information from image data in a hierarchical system, to form a generic basis for a wide range of applications. Instead of putting lots of effort into solving some individual problems straightforwardly, a fundamental understanding of all these applications and their problems should be obtained.
1.2 A Computer Vision Architecture
A simple and loosely defined computer vision architecture for a mobile robotics system is proposed, able to handle the problems quoted in the previous section.
Figure 1.1: Computer vision sub-architecture in a robotics framework: Interplay between detection, tracking, learning and recognition of objects.
Figure 1.1 shows the architecture of this component embedded in a larger framework of robotics components. Together with navigation, planning, manipulation, spatial reasoning, human-robot interaction (HRI) and possibly many other components, it forms a mobile robotics system. The computer vision component of the system is responsible for the extraction of visual information from the data of the imaging devices. These devices are color cameras (RGB), depth image cameras (RGB-D) and possibly other imaging sensors, such as laser range scanners or ultrasonic sensor arrays. The component incorporates a working memory to save the gathered and extracted information. This memory is primarily used to store abstracted information from the images, such as observed objects and their pose in the real world, rather than to store visual input data (e.g. images or depth images). The information in the working memory is distributed via communication channels between the neighboring components in the system. This knowledge transfer allows other components of the system to also use the visual information about the surrounding environment extracted from the imaging sensors.

The first module in the presented computer vision architecture is placed after the image acquisition, namely Detection. The detection component is not only responsible for detecting an object; it also incorporates attention mechanisms, image segmentation, data extraction, object modeling and pose estimation. The vision component processes raw visual input data from the sensors and, after processing, exports object models and their position in the real world. The object models and their location can be used to initialize a Tracking module. Object tracking has to be robust against occlusion and has to detect when objects get lost, e.g. when they are out of the robot's view or when they are completely occluded. The Learning module has to extract appearance features during tracking of an object and has to associate these features with the already detected geometric model. The learning module has to be able to extend the object's appearance model when unlearned views arise. Finally, the Recognition module closes the loop between learning and tracking. Using the learned appearance features and the geometric model of an object allows the object to be recognized when it returns into view or when the tracking module has lost it. The recognition module is afterwards able to re-initialize tracking by reporting the current pose of the recognized object to the tracking module. A minimal sketch of this interplay is given below.
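To make the interplay of Figure 1.1 more tangible, the following minimal sketch outlines how such a detect-track-learn-recognize loop around a shared working memory could be wired up. It is only an illustration of the architecture described above, not an existing implementation; all class and method names (WorkingMemory, Detector.detect, Tracker.track and so on) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectEntry:
    """Abstracted object information kept in the working memory."""
    object_id: int
    geometry: object           # parametric shape or surface model
    pose: object               # estimated pose w.r.t. camera/robot
    appearance: object = None  # appearance features added by the learner
    lost: bool = False

@dataclass
class WorkingMemory:
    """Stores abstracted results (objects, poses) instead of raw images."""
    objects: dict = field(default_factory=dict)

    def update(self, entry: ObjectEntry):
        self.objects[entry.object_id] = entry

def vision_cycle(camera, detector, tracker, learner, recognizer, memory):
    """One cycle of the detection/tracking/learning/recognition interplay."""
    image = camera.grab()                       # RGB or RGB-D frame
    # Detection: segment the image, build object models and estimate poses.
    for entry in detector.detect(image):
        memory.update(entry)
        tracker.initialize(entry)               # detection bootstraps tracking
    # Tracking: update poses of known objects, report lost ones.
    lost_objects = tracker.track(image, memory)
    # Learning: attach appearance features to the tracked geometric models.
    learner.learn(image, memory)
    # Recognition: re-find lost objects and re-initialize tracking with them.
    for entry in recognizer.recognize(image, lost_objects, memory):
        tracker.initialize(entry)
```

The detect-track-learn-recognize cycle and the working memory mirror the module responsibilities described above; only abstracted object entries, never raw images, are exchanged between components.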
1.3 Problem Statement
In the previous section an architecture for a computer vision system in a robotics framework was proposed and its typical modules were shown. Single modules of the proposed framework are already well studied, but a generic solution combining object detection, tracking, learning and recognition is still not available out of the box. The theory and methods shown in this work are mainly concerned with the detection module, where the focus lies on the detection of novel objects, but we prove the usability and correctness of the methods by experimentation in the larger computer vision framework of the robotics system.

Figure 1.2: Intermediate results of the detection module for RGB-D data: original image; segmented image areas which have already been assigned to complete entities; reconstructed parametric object models; and the estimated pose of objects with respect to the camera position.

Figure 1.1 depicts the importance of Detection for further processing in the proposed architecture. Whenever this first component fails, the success of the following components is in jeopardy, since wrong information is transferred. The main task of the detection module is to extract information about objects in the environment of a robot. As already mentioned in the previous section, the task of the detection module is the estimation of a geometric model of the detected objects and the estimation of their pose to initialize the tracking module. The task of the detection module can be divided into sub-tasks to describe the problems in a more technical manner:
• Identification of objects
• Segmentation of image data
• Estimation of models
• Building object hypotheses
• Pose estimation of objects
Figure 1.2 shows intermediate results of the outlined tasks. The original image is initially segmented into salient image regions. The next processing step estimates parametric object models, and finally the pose of these models is calculated with respect to the camera or the robot pose. In the following, a closer look at the sub-tasks is given:

Identification of objects – Identification of objects means discrimination of objects from the background, also called figure-ground segregation. This task can be done before an explanation of the image is found or at the end of the processing. Some computer vision algorithms require this information already before the processing starts (e.g. active vision algorithms), making this task more difficult. Otherwise, the detection process is simpler when it is already known where to start forming object hypotheses.

Segmentation of image data – Segmentation splits image data into basic visual features. These features may be, e.g., clusters of uniform areas (intensity, color, texture) or discontinuities between clusters such as edges on the border of two regions. This initial segmentation splits the image into visual primitives which will subsequently be used to explain the content of the image.
Estimation of models – Estimation of parametric geometry models and estimation of appearance models of image parts is essential, because this information is used to explain image parts and to find coherence relations between parts. It is important that the used models are able to represent task-relevant object properties.

Building object hypotheses – With the estimated models and their properties, a definition of coherence relations between parts is possible. Applying a policy consisting of generic rules allows segmented parts to be combined into bigger entities and finally object hypotheses.

Estimation of the pose of objects – The pose of objects is usually calculated from the geometric description of the objects when using a calibrated sensor delivering depth (RGB-D) or 3D data. For 2D image sensors certain constraints or prior knowledge are required to get the pose of objects. This may be knowledge about the shape and size of the object and geometric knowledge about the background (e.g. the pose of a dominant ground plane); a sketch of this idea is given below.

The key problem of the detection module is the estimation of parametric models and the subsequent building of object hypotheses. The problem of the first is to find generic models which describe all kinds of structured data and which are still computable in a reasonable time on real input data. The second problem relies heavily on the definition of the models in the previous step. Coherence relations and a set of rules have to be defined to form a policy for building object hypotheses from the estimated models.
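As a concrete illustration of the last sub-task for a 2D sensor: with a calibrated pinhole camera and a known dominant ground or supporting plane, the 3D contact point of a detected object can be recovered by intersecting the viewing ray through a detected image point with that plane. The sketch below only illustrates this geometric idea under these assumptions; the function name and the numeric values are made up for the example.

```python
import numpy as np

def backproject_to_plane(u, v, K, n, d):
    """Intersect the viewing ray through pixel (u, v) with the plane n.X = d.

    K -- 3x3 camera intrinsic matrix
    n -- plane normal in camera coordinates (unit length)
    d -- plane offset, i.e. n.X = d holds for all points X on the plane
    Returns the 3D intersection point in camera coordinates, or None if the
    viewing ray is (almost) parallel to the plane.
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # direction of viewing ray
    denom = float(n @ ray)
    if abs(denom) < 1e-9:
        return None
    t = d / denom                                   # scale along the ray
    return t * ray                                  # X = t * ray lies on the plane

# Example: focal length 525 px, principal point (320, 240); table plane 0.5 m
# below the optical centre (camera Y axis points down, plane normal points up).
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
point_on_table = backproject_to_plane(350, 400, K, np.array([0.0, -1.0, 0.0]), -0.5)
```

Combined with known shape and size of an object, such a ray-plane intersection is one way in which prior knowledge about the background can substitute for missing depth data.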
1.4 Thesis Outline and Contribution
The goal of the thesis is to find a methodology to detect unknown objects and to describe their shape and appearance properties as a 3D model. Parametrization of the object models to form an object representation is a key issue for running the introduced vision architecture, because the subsequent vision modules rely on an exact description of detected objects. To solve the outlined problems we take up the knowledge from the well-studied Gestalt theory, which was introduced about one hundred years ago. It tries to explain how humans perceive their surroundings and how the human brain processes visual information. More precisely, it is an attempt to describe how people tend to organize visual input in the form of basic elements into wholes which represent items of the physical environment. The theory is grounded on the fact that Gestalt principles describe the rules for grouping visual input into larger groupings to form object hypotheses. Even if this does not always hold for natural structures or objects, it usually holds for all man-made structures which can be found in our households. This theory is therefore appropriate for all computer vision systems for mobile indoor robotics where the robot operates in a man-made environment, e.g. a household.
Figure 1.3: 2D edge image grouping: original image, extracted edges, assignments during the grouping process, and resulting basic object shapes.

Gestalt Psychology and Perceptual Organization (Chapter 2) – An introduction to Gestalt psychology and its applications to computer vision is presented in Chapter 2. The applicability of an organizational structure to group visual features is discussed and an overview of related work for 2D and 3D input data is given according to the presented classificatory structure. We close the chapter with a discussion about possible research gaps and open problems and give an outlook on the problems which we would like to solve.

Object Detection in 2D image data (Chapter 3) – We introduce a grouping framework for edge primitives, extracted from 2D image data, which we proposed in [92, 93, 91, 95]. Figure 1.3 shows intermediate structures from the bottom-up grouping process, which is developed by following the organizational structure discussed in Chapter 2. Grouping is designed as an incremental process which provides the best grouping results at any processing time, while at the same time avoiding the need to set parameters and thresholds (a schematic sketch of such an anytime loop follows below). The framework yields the detection of basic object shapes, such as cuboids, cylinders, cones or spheres, as well as their parametrization when using a constrained environment. At the end of the chapter, we review achievements of the proposed algorithms and present implementation results for the proposed computer vision framework of Fig. 1.1, which we presented in [74].

Object Detection in 3D image data (Chapter 4) – The vision framework presented in Chapter 4 also follows the organizational structure of Chapter 2, but is designed to group visual features extracted from 3D data. We review in that chapter the whole body of work which we presented in [88, 94, 89, 90]. Compared to the vision framework of Chapter 3, this grouping approach avoids restricting the detection process to certain basic shapes. After pre-segmentation and parametrization of visual features in a hierarchical framework, a machine learning algorithm is used to automatically train the framework to segment objects based on the principles of perceptual organization. The framework delivers parametrized object models based on surface representations gathered from a single view of the objects, as already shown in Fig. 1.2. The evaluation shows the capabilities of the proposed object segmentation method and a comparison to other state-of-the-art algorithms.
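The anytime behaviour mentioned for the 2D grouping framework can be pictured as a loop that keeps extending its search structures and always keeps the hypotheses found so far, so that stopping at any moment still returns a usable result. The following sketch only illustrates this control flow in a generic way; it is not the implementation of Chapter 3, and all function names are placeholders.

```python
import time

def anytime_grouping(primitives, extend_search, form_hypotheses, budget_s):
    """Incrementally group primitives until the time budget is exhausted.

    extend_search(p, step) -- grows the search lines of primitive p by one
                              step and returns any newly found junctions
    form_hypotheses(j)     -- turns new junctions into higher-level groupings
                              (closures, flaps, basic object shapes, ...)
    budget_s               -- processing time in seconds; more time simply
                              yields more (and better) hypotheses
    """
    hypotheses = []                          # best results generated so far
    deadline = time.monotonic() + budget_s
    step = 0
    while time.monotonic() < deadline and primitives:
        # Round-robin over primitives: extend one search structure at a time.
        primitive = primitives[step % len(primitives)]
        new_junctions = extend_search(primitive, step)
        # Newly found junctions may complete higher-level groupings.
        hypotheses.extend(form_hypotheses(new_junctions))
        step += 1
    # Whenever processing stops, the hypotheses found so far are returned.
    return hypotheses
```

The only control parameter is the processing time itself, which is exactly what makes such a loop free of grouping thresholds.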
Conclusion (Chapter 5) – Finally, the thesis concludes with a discussion about open problems and recently developed solutions to some of these problems, which we presented in [75] and [4]. The applicability of the 2D and 3D grouping methods and possible extensions to other open research questions are pointed out. Furthermore, we give an outlook on future research work.
Chapter 2

Gestalt Psychology and Perceptual Organization

“The whole is greater than the sum of its parts.”
Aristotle, Metaphysica

Gestalt psychology is a branch of psychology concerned with the description of human perception, in particular with finding structures and principles for the processing of sensory stimuli. The different emerging research fields are all based on the work of Christian von Ehrenfels [130]. More research based on the theory by Ehrenfels was done in the early twentieth century to identify and describe principles for the grouping of visual primitives in order to understand human vision. Later in the century, researchers in technical areas exploited the outcomes of the theory for vision systems by using the principles to organize visual input and understand the content of images. After research on vision systems based on perceptual organization peaked at the end of the last century, interest decreased but the topic never passed into oblivion.

In the following Section 2.1 a short overview of the main ideas of Gestalt psychology is given and representative work is referenced. In Section 2.2 a classificatory structure for perceptual organization is discussed, before related work on perceptual organization for computer vision is reviewed in Section 2.3. Section 2.4 concludes the chapter with a discussion about the applicability of the theory, research gaps and the possible usage of perceptual grouping for robotic vision systems.
2.1 Gestalt Psychology
Wertheimer, Köhler, Koffka and Metzger were the pioneers of the study of Gestalt psychology, when they started to investigate this theory about a hundred years ago. Wertheimer [134, 135] first introduced Gestalt principles, and Köhler [55], Koffka [54] and Metzger [68] further developed his theory. A summary and more recent contributions can be found in the modern textbook presentation of Palmer [80]. Gestalt principles (also called Gestalt laws) aim to formulate the regularities according to which the perceptual input is organized into unitary forms, also referred to as wholes, groups, or Gestalten [123]. In visual perception, such forms are the regions of the visual field whose portions are perceived as grouped or joined together, and are thus segregated from the rest of the visual field. These phenomena are called laws, but a more accurate term is principles of perceptual organization. The principles are much like heuristics, which are mental short-cuts for solving problems. Perceptual organization can be defined as the ability to impose structural organization on sensory data, so as to group sensory primitives arising from a common underlying cause [14].

Figure 2.1: The primarily discussed principles of perceptual organization.

There is no definite list of Gestalt principles defined in the literature. The first discussed and mainly used ones are proximity, continuity, similarity and closure, defined by Wertheimer [134], Köhler [55], Koffka [54] and Metzger [68]. Simple examples of the mainly used principles are shown in Figure 2.1. Other discussed principles are common fate, considering similar motion of elements, past experience, considering former experience, and good Gestalt (form), explaining that elements tend to be grouped together if they are part of a pattern which describes the input as simple, orderly, balanced, unified, coherent and as regular as possible. Common region and element connectedness were later introduced and discussed by Rock and Palmer [97, 79, 81] and are also presented in Figure 2.1. For completeness we also have to mention the concept of figure-ground articulation (also figure-ground segregation), introduced by Rubin [101]. It describes a fundamental aspect of field organization but is usually not referred to as a Gestalt principle, because this term is mostly used for describing rules of the organization of somewhat more complex visual fields. Some of these rules are stronger than others and may be better described as tendencies, especially when principles compete with each other.
2.2 A Structure for Perceptual Organization
Perceptual organization in human vision was already a research topic for a long time before it was introduced into computer vision. Witkin and Tenenbaum [136], Lowe [65] as well as Marr [67] were the first to point out the importance and usefulness of perceptual organization and its grouping processes for computer vision. They claimed that grouping processes may be part of various vision tasks and may operate on different processing levels. Zucker [143, 145] also suggested using perceptual grouping, but for segmentation and to estimate properties of the objects.

Vision tasks in cognitive science are usually classified into low-level (also early), mid-level and high-level vision, according to their input and output data types as well as the size of the considered neighborhood [136]. Marr [67] introduced a representational framework for the derivation of shape information in computer vision. At the basis of his hierarchical framework is a gray-level image which represents the intensity at each point. He suggested processing image information over three stages: The primal sketch (or 2D sketch) represents the properties of the two-dimensional image, primarily the intensity changes, their geometrical distribution and organization. The 2½D sketch makes the orientation and depth of the visible surfaces and the contours of the discontinuities explicit. And finally the 3D model representation (or 3D sketch) describes the shapes and their spatial organization using a modular hierarchical representation that includes volumetric primitives as well as surface primitives. Jaynes [51] investigated aerial images more than ten years later and developed in principle the same hierarchical system as the one proposed by Marr.

From the computer vision point of view, the classificatory structure introduced by Sarkar and Boyer [109, 14] is the most prominent one. It is widely used in computer vision to classify perceptual grouping methods [114, 84, 60]. Sarkar and Boyer suggested classifying perceptual grouping processes according to the complexity of image primitives (features) over the dimensionality of the input data. In their review [109] they listed representative work for each category at that time and updated it later in [14] to show domains without related work and possible potential in certain categories of the structure. Each category in the classificatory structure is defined basically by the visual input and output features and does not consider the methods which are used for processing, making their structure generic, since the same methods for data abstraction can be found at various levels of abstraction and for different dimensionalities of the input data. Figure 2.2 shows the classificatory structure to group two- and three-dimensional input data over four levels of abstraction into more and more meaningful visual features. The part with the temporal structure in [14] is omitted, because it does not contribute to this work. The proposed structure of Sarkar and Boyer is not "more correct" than other classificatory structures, but we use it for the rest of this work because, compared to others, their structure is more detailed and also considers higher-dimensional input data (RGB-D and 3D). This allows us to classify and discuss related work and the parts of our work with respect to their structure.

The two-dimensional (2D) cue starts with intensity pixels or dots of images for the grouping process, as shown in Figure 2.2. In each category, features of the underlying level are processed to create new higher-order visual features.
At the signal level, continuous and non-continuous features are extracted: Continuous features are, e.g., clusters of pixels with similar intensity, similar color or similar texture, while non-continuous features are, e.g., edges on the border of two uniform regions.
Figure 2.2: Bottom-up classificatory structure for perceptual organization by Sarkar and Boyer [109, 14] for 2D and 3D data.

At the primitive level, contour segments are created from edge chains and surface faces from image regions or patches. The structural level constructs structures such as closed contours, ribbons and corners which lead to three-dimensional surface and surface boundary groupings. Finally, the assembly level constructs large regular arrangements from the underlying abstracted data to represent objects and regions which belong together.

Processing of three-dimensional (3D) input data starts with point clouds or depth images. Compared to point clouds, depth images are organized as two-dimensional images which provide a depth value for each pixel, allowing their relationship in the two-dimensional image space to be exploited. RGB-D images combine color (RGB: Red, Green, Blue) and depth in one data structure. At the signal level, continuous surface patches as well as local discontinuities and point clusters are extracted from the input data. From this input, parametric surfaces can be constructed and occlusion may be detected at the primitive level. At the structural level, parallel and continuous patches as well as other regular structures such as tetrahedral vertex combinations are built. The purpose of the assembly level is again to group the underlying structures into large arrangements, representing objects or regions which belong together.
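The classificatory structure can also be read as a recipe for organizing a bottom-up grouping pipeline: each level consumes the features produced by the level below it. The following small sketch expresses this idea in code; the level names follow Sarkar and Boyer, while the pipeline function itself is purely illustrative and not taken from an existing system.

```python
from enum import Enum, auto
from typing import Callable, Dict, List

class Level(Enum):
    SIGNAL = auto()      # edge chains, pixel clusters / surface patches, point clusters
    PRIMITIVE = auto()   # contour segments, surface faces / parametric surfaces
    STRUCTURAL = auto()  # ribbons, closed regions / parallel, continuous patches
    ASSEMBLY = auto()    # large arrangements: object and region hypotheses

def bottom_up_grouping(raw_data, stages: Dict[Level, Callable]) -> List:
    """Run grouping level by level; each stage abstracts the output below it."""
    features = raw_data                      # pixels/dots or 3D points/depth image
    for level in Level:                      # SIGNAL -> PRIMITIVE -> STRUCTURAL -> ASSEMBLY
        features = stages[level](features)   # produce higher-order features
    return features                          # object and region hypotheses
```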
2.3 Perceptual Organization in Computer Vision
Perceptual organization already has a long tradition in computer vision and has been used since the beginning of research on vision systems. Perceptual grouping in computer vision aims to extract organized structures from visual data and groups visual primitives arising from a common underlying cause. These visual primitives are structured according to basic geometric relations, called Gestalt principles or principles of perceptual organization. In the beginning, perceptual organization was solely used for the grouping of 2D visual features to extract 3D structure, first introduced by Roberts [96]. Marr [67], Witkin and Tenenbaum [136], Zucker [143] and Lowe [65] later emphasized the importance of perceptual organization for computer vision. Grouping of 3D data primitives has not been as popular as grouping of 2D primitives, and 3D perceptual organization is therefore still in its infancy.

In this section an overview of representative work on perceptual organization in computer vision is given. Following the classificatory structure of the previous section, related work on 2D perceptual grouping is discussed first; then related work on perceptual grouping of 3D sensor data is reviewed for the different levels of data abstraction.
2.3.1 Perceptual Grouping of 2D sensor data
The extraction of basic visual features from the sensory input is the task of the signal level. These visual features are regions of the image such as pixel clusters and patches, or edge chains representing discontinuities between regions. Regions are uniform areas of intensity, color, texture or other properties of the sensor input. These basic features are subsequently used as basic elements for further processing at the primitive level. The task of the primitive level is to organize pixels or edges into salient extended contours (also parametric segments) and to extract regions which represent surface faces. Contours are, e.g., edge chains, straight lines, curves and circular, elliptical or super-elliptical arcs. Separating representative work of the signal and primitive level is difficult for 2D input data, because many approaches exist which treat these two levels together. Related work of the signal and primitive level is therefore addressed together to avoid repetitions.

Ahuja and Tuceryan [1, 2, 124] extracted structures from dot patterns by integrating region, boundary and component Gestalt, using Voronoi tessellation to associate dots with their neighborhood. Shashua and Ullman [115] presented a saliency measure based on curvature and curvature variation to extract salient structures from edge images for figure-ground discrimination. Gutfinger and Sklansky [43] as well as Herault and Horaud [47] later also used the evaluation of saliency maps for figure-ground discrimination.

There is plenty of work on grouping and segmenting edge points or edge chains into curves, lines or other shape primitives. Zucker [144, 145, 146] was one of the first to exploit orientation information in images by early orientation selection to create tangent fields, due to the fact that this information precedes the formation of contours. Parent and Zucker [82] used position, orientation and curvature constraints combined with a smoothness criterion to fit optimal curves to edge points, and later Montesinos and Alquier [72] additionally used co-circularity and the grey levels of the pixels along the curve. In the work of Shashua and Ullman [116] and Guy and Medioni [45, 44] a saliency map for the image is computed to subsequently derive smooth contour segments. The saliency of an image point is related to the length, continuation and smoothness of the best edge curve at this position. Shashua computes the map recursively, while Guy uses extension fields to employ a directional convolution. The approach of Boldt et al. [12] addresses the problem of grouping edgels (edge elements) into straight line segments, which was later extended by the approach of Dolan and Riseman [29]. They group initially extracted edgels using the principles of proximity and good continuation to form strands, classified as straight lines, corners, cusps, inflections or conics. The approach of Urago et al. [127, 128] constructs straight line segments from initially detected edges using a Markov Random Field (MRF). Cox et al. [26, 25] introduced a method of curve partitioning into smooth segments by employing a Bayesian multiple-hypotheses tree. Rosin and West [98] showed an algorithm to split connected points into a combination of straight lines and arcs. Later they improved their method [99] to extract the best combination of straight lines, polynomials and circular, elliptical or super-elliptical arcs while being non-parametric (without using thresholds). Amir and Lindenbaum [7] presented a generic grouping approach for contours where they construct a graph representation of the available perceptual grouping evidence to find the best partition of the graph into groups using maximum likelihood graph clustering.

Sanocki et al. [106] discussed whether edges are sufficient for object recognition and come to the conclusion that edges are far from sufficient and that alternative approaches to the problem of interpreting information have to be found. They suggest extracting image regions which represent surfaces or intermediate-level structures. Early work in region segmentation was done by Geman et al. [41], presenting a method for partitioning uniform regions and for finding boundaries between these regions on texture collages and natural scenes. Liou et al. [63] introduced a combined region segmentation on gray-level and depth images by describing regions with a regression model. Wu and Lin [138] proposed a novel graph-based algorithm for image region segmentation and boundary estimation. Instead of using the full adjacency graph, their approach achieves segmentation by effectively searching closed contours of edge elements on subgraphs. Ishikawa and Geiger [49] showed segmentation of gray-value images yielding closed boundaries by grouping previously found junctions with a novel use of the maximum-flow algorithm in a directed graph. Shi and Malik [118], in an approach related to the one by Sarkar [110], proposed a graph partitioning technique to find the so-called normalized cut by maximizing the dissimilarity between different groups of pixels and the similarity within groups. To overcome the poor computational performance of previous graph-based segmentation ([118, 138, 110]), Felzenszwalb and Huttenlocher [35] proposed an efficient graph-based algorithm for image segmentation which satisfies global properties and runs in nearly linear time in the number of graph edges. An approach by Comaniciu and Meer [23] employed recursive mean shift, a feature space analysis method for pattern recognition. Pixels are represented by concatenating spatial coordinates and color values. The non-parametric clustering approach forms uniform regions in the image by grouping. Dollar et al. [30] presented a method called Boosted Edge Learning (BEL) for supervised learning of edges and object boundaries. They attempt to learn an edge classifier to decide on edge points at each location in the image, combining a large number of features across different scales using an extended version of the Probabilistic Boosting Tree algorithm.
They show an application of the framework for learning Gestalt laws for edge completion as well as the applicability of the approach to edge detection in natural images. Work on scene image segmentation by Cheng et al. [22] uses a perceptual organization model to capture the non-accidental structural relations among the constituent parts of an object. A boundary energy model encodes a list of Gestalt laws and forms the perceptual organization model. This model allows the boundaries of various salient objects to be detected without any prior knowledge about the objects. Loss et al. [64] presented a tensor voting approach for perceptual grouping, considering the problem of grouping oriented segments in highly cluttered images. Iterative multi-scale tensor voting uses Gestalt principles of visual perception to iteratively remove background segments. Extensive evaluation on synthetic images and on publicly available databases of real-world images shows the usability of the approach. Recent work by Arbelaez et al. [9] investigates both contour detection and region segmentation. A globalization framework based on spectral clustering is used on multiple local cues to detect contours. The segmentation algorithm transforms the output of any contour detector into a hierarchical region tree, reducing the problem of region segmentation to contour detection. In [8] they successfully show how to combine the approach with top-down part detectors for semantic segmentation.

The structural level includes algorithms for grouping and completion of structures to estimate closed contours whose inner regions may represent surfaces. This is important when boundaries are occluded or invisible in an image. The approach by Huttenlocher and Wayner [48] is able to extract open and closed convex groups of straight lines, likely to result from the same convex object in a scene. They used constraint triangulation to construct a convexity graph where a convex polygonal chain corresponds to a path in the graph. The method of Henricson and Stricker [46] extracts complex structures in aerial images from previously extracted straight lines without an a-priori known template. Besides similarity in position and orientation, their method also incorporates photometric and chromatic attributes to estimate a score between two lines. Locally significant structures are used to construct a graph where closed cycles represent closed contours in the image. Jaynes et al. [50] also extracted closed contours from aerial images, but additionally used projections of orthogonal edges to delineate buildings. Image primitives are stored in a graph and weighted with local information. Virtual features are hypothesised for the perceptual completion of partial occlusions to construct cycles in the graph which correspond to possible building rooftop hypotheses. Elder and Zucker [31] introduced an algorithm for computing the contour closure of an edge image. Their approach aims at connecting smooth curves to find closed contours, but does not consider the properties of the inner region. The technique employed by Cox et al. [27], called 'ratio region', considers this inner region by finding the optimal relation between the cost of closing boundaries and the benefit of the interior region. Their method is suitable for single boundary finding, in particular for medical images. An algorithm for finding perceptually closed paths in hand-drawn sketches and line art is presented by Saund [112]. Salient, compact closed region structure is identified by bidirectional best-first search with backtracking.
The approach delineates the roles of the principle of good continuation versus maximally turning paths and considers global figural saliency measures such as compactness and closedness. A drawback of the method is the large number of hand-set parameters and thresholds needed to control the system. Wang et al. [133] extend
a graph-based method for extracting salient closed boundaries from their previous work [132], called ratio contour, with pruning of non-convex edges and edge links to extract convex contours. This method, called 'Convex Ratio Contour' (CRC), finds globally optimal convex boundaries with good continuity and proximity in polynomial time, in the worst case O(n^3) with n edges. The approach by Estrada and Jepson [32] uses a measure of affinity between pairs of lines to group line segments into perceptually salient contours in complex images. The affinity measure is based on the quality of intersections and the uncertainty of endpoints to guide group formation and to limit the branching factor of the contour search procedure. The approach is able to handle cluttered and textured regions, but many parameters have to be selected manually or determined experimentally. Zhu et al. [140] introduced a grouping criterion, called untangling cycles, to segment salient closed and also open 1D contours. A graph formulation is used to define a measure for topological classification that is robust to clutter and broken edges. The algorithm is based on the insight that any 1D structure can be put into a specific ordering, whereas in the 2D image clutter violations of that ordering lead to entanglements. Mapping the edge graph to a circle (circular embedding) and calculating the top complex eigenvectors of the random walk matrix leads to a solution for the combinatorial problem of finding groups of contours. The method of Chen and Gao [20] for image region and shape detection uses perceptual contour grouping to find contour closures. They introduce Generic Edge Tokens (GET), a set of perceptually distinguishable edge segment types which are subsequently used to build a graph for cycle search. Their approach works on simple toy blocks as well as on cluttered real-world images. Mahamud et al. [66] presented a method for identifying smooth closed contours bounding objects of unknown shape. A saliency measure based on the global property of contour closure is used, incorporating the Gestalt principles of proximity and good continuation. Compared to previous work, contour closure is implemented by finding the largest positive real eigenvalues of the transition matrix of the graph. The transition matrix consists of conditional probabilities that a contour containing one edge also contains another given edge. Finding strongly connected components in the edge graph then corresponds to finding closed contours. Segmentation can be stopped after any number of detected contours due to the successive segmentation of contours according to their saliency. However, drawbacks of the approach are its computational cost and that it does not scale well to larger problems. The method of Zillich et al. [141, 142] uses an incremental indexing approach in the image space for parameter-free grouping of straight lines, merely controlled by the desired processing time. Incrementally extending search lines are used to find intersections of lines. Shortest path search is employed to identify closed convex contours whenever a new junction between lines appears. The method delivers results at any processing time and allows grouping without setting certain parameters or thresholds. Another approach for finding closed contours is based on active contour optimization. Montesinos et al. [72] presented a method of perceptual organization applied to the extraction of thin networks.
Perceptual grouping is considered as an optimization problem where the quality of a grouping is defined with a class of functions involving curvature, co-circularity, grey-levels and orientation. Such functions can be optimized from a local
to a global level before a selection procedure rates and extracts the principal groupings. They show the validity of the approach on synthetic as well as aerial and medical data.
Grouping approaches at the assembly level aim to find object and region hypotheses as large regular arrangements of underlying visual primitives. Building object hypotheses without limiting a vision system to detect pre-defined models is one of the tough problems at the assembly level. Sala and Dickinson [104, 103] introduced a method for contour grouping to identify a pre-defined vocabulary of simple part models. They construct a region boundary graph from image patches obtained by over-segmentation of the image. Subsequently they train a family of classifiers on the estimated vocabulary. Finding shape candidates then corresponds to finding cycles in the graph. To avoid combinatorial explosion when searching for part models from the vocabulary in the boundary graph, a technique called consistent path is introduced. A path from a starting point is followed as long as the shape is consistent with at least one shape of the vocabulary. In [105] they extended the proposed system to spatio-temporal grouping to improve the precision of the method. The proposed method describes objects using pre-defined parts, but the approach is limited by the pre-defined vocabulary and therefore does not scale well.
Learning algorithms are widely used in object recognition systems to detect objects or object classes. Object recognition methods by Nelson and Selinger [76] and by Opelt et al. [78] used the spatial arrangement of 2D boundary fragments to detect 3D object shapes. Ferrari et al. [37, 36] generated a codebook of contour segments to recognize objects and Ommer and Malik [77] used a hierarchical approach based on a sparse representation of object boundaries. While learning was mainly employed for object recognition, Sarkar and Soundararajan [108, 111] introduced a methodology to learn the importance of Gestalt principles for grouping large salient edge groups. They introduced a framework to group large salient groups of visual low-level primitives that are likely to come from a single object. They employed a learning process to segregate objects from background based on the trained relative importance of the basic salient relationships, such as proximity, parallelism, continuity, junctions and common region. Parameters of the grouping process are used in a Bayesian network that has to be trained. A so-called scene structure graph is built and graph partitioning is used for the grouping process to form large groups. Robust performance of the approach is demonstrated on several cluttered real-world images. The problem of finding salient groupings of image primitives without the use of any pre-defined model was also tackled by Song et al. [119]. They proposed a novel definition of the Gestalt principle Prägnanz based on Koffka's definition that image descriptions should be both stable and simple. A grouping algorithm appealing to the Gestalt principles of proximity and common region is shown, using straight lines as grouping primitives and color image regions to estimate the common region principle. Benchmark results are shown on the Berkeley Segmentation Dataset to demonstrate the value of their method.
2.3.2 Perceptual Grouping of 3D sensor data
While perceptual grouping on 2D data, in particular with edge-based algorithms, is well studied, perceptual grouping of 3D primitives has not been investigated as thoroughly. There are a few attempts to use perceptual grouping already for the stereo correspondence matching problem. Ambrosio and Gonzáles [6] proposed constructing perceptual groups with the proximity, collinearity, parallelism and closure principles to use them as higher-level primitives for the guidance of stereo matching at lower levels in a hierarchical stereo vision system. Unfortunately, they could not show in the experimental evaluation how this approach performs in comparison to other stereo vision systems. Pugeault et al. [85] also used perceptual groups for stereo matching. They proposed to use a multi-modal affinity measure (previously introduced by Krüger et al. [59]) in addition to the geometric information to reduce the ambiguity of stereo matching. The multi-modal similarity measure is composed of phase, color and optical flow measurements and is combined with a classical good continuation criterion to form a novel multi-modal definition of the affinity between primitives. They showed on different sequences that the resulting groups adequately follow the contours of the image. Since cheap and powerful active ranging sensors, such as the Microsoft Kinect or Asus Xtion, became available, the stereo correspondence problem has moved into the background. Hence, instead of focusing on perceptual grouping work for the construction of 3D information, we review work according to the classificatory structure of Figure 2.2 and assume that 3D information is already provided in the form of 3D point clouds or as depth images (RGB-D images).

Methods at the signal level of the classificatory structure group points into point clusters and detect discontinuities between neighboring points. Point clusters usually represent smooth surface patches and the discontinuities their boundaries. The signal level is therefore concerned with all kinds of range image segmentation work. Algorithms of the primitive level encompass methods to parametrize the extracted surface patches of the signal level. Similar to the 2D grouping algorithms, methods of these two levels are closely related, which is why we again review them together. Range image segmentation can be split into two main groups: region-based methods [11, 62, 52, 53, 121, 86, 131, 28, 113, 56, 57, 58, 10, 17, 16, 120] and edge-based methods [33, 125, 126, 122, 107]. Region-based methods segment the image first into initial regions. Some approaches over-segment the image and then try to merge or extend the region segments. Edge-based methods start with finding jump edges (discontinuities) in the image. Subsequently, functions are fitted to the surface patches to describe their properties. A problem of the edge-based methods is that discontinuities are often difficult to find. This usually results in under-segmentation of the image, which cannot be corrected in the following processing steps. In contrast, region-based methods often lead to distorted boundaries compared to the edge-based methods and tend more towards over-segmentation. An early region-based method was introduced by Besl and Jain [11]. They split images into piecewise smooth surfaces before surface fitting is used to segment the image
into surface models. An initial coarse image segmentation is achieved with surface curvature sign labeling, which is refined by iterative region growing with variable-order surface fitting. Their approach simultaneously segments a large class of images into regions of arbitrary shape and approximates the image data with bivariate functions. The algorithm by Boyer et al. [13] is also an approach to the simultaneous parametrization and organization of surfaces in noisy functional range data. Seed points are estimated first as possible surface candidates and the best approximating model from a given set of competing models (planar and bi-quadratic) is chosen with a modified Akaike Information Criterion [3]. The surface model is subsequently expanded from its seed over the entire image. This procedure is repeated for all seeds. Outliers with respect to the growing model are not included in the surface, which allows the creation of disconnected surfaces (e.g. partly occluded surfaces). Noise, outliers or coincidental surface alignment lead to points appearing in more than one surface. Such ambiguities are resolved by a weighted voting scheme in a decision window around the point. Isolated point regions are left over after the resolution stage, and any missing points in the data are filled. Leonardis et al. [62] proposed a paradigm for the segmentation of range images into piecewise continuous surfaces by fitting variable-order bivariate polynomials using iterative regression. Compared to Besl and Jain, Model Selection based on Minimum Description Length (MDL) is employed to select the best solution with a winner-takes-all technique. In contrast to the method by Leonardis et al., the approach by Jiang and Bunke [52, 53] makes use of high-level features (line and curve segments) as segmentation primitives. In [52] they segment range images into planar regions by partitioning the surfaces in the depth image. Straight lines in the scan lines are detected and used as segmentation primitives for a subsequent region growing approach. Three neighboring line segments are selected to serve as a seed for the region growing algorithm. Jiang and Bunke later modified their approach in [53] to also fit curved surface patches in order to detect quadratic surfaces. Their approach is more efficient because of the pre-segmentation of the depth image into primitives instead of using pixels for region growing. Another region-based segmentation algorithm is introduced by Taylor and Cowley [121], who propose a processing scheme to parse the scene into a collection of salient planar surfaces. Fast color segmentation is used to pre-segment the image into coherent regions, employing randomized hashing of feature vectors. The groupings suggested by this procedure are used to inform a RANSAC-based interpretation process, which is significantly faster than without this prior knowledge. However, the approach is limited to fitting planes. The method by Rabbani et al. [86] is not limited to a certain surface model and, compared to the previous method, uses region growing. Their approach is completely based on surface normals. They recursively cluster neighboring points into smooth surface patches, limited by the deviation between neighboring surface normals and a maximum value for the residual, which is estimated during surface normal calculation. Wan et al. [131] showed a region growing approach for range image segmentation based on coplanarity.
RGB-D images are over-segmented using superpixel segmentation on the mixed intensity and range image. Region growing is employed to group coplanar superpixels into a set of initial planes. They claim that a representation at the superpixel level is more natural compared
to a representation at the pixel or line level (as shown by Rabbani et al.). Results on a publicly available database confirm their statement. The method by Dellen et al. [28] first over-segments the color image in a three-stage image pyramid. Subsequently, quadratic surface models are fitted to the segments of the hierarchy using the available depth data. Segments which minimize the fitting error are selected within a region merging and growing stage. The constructed surface models are finally used for data completion if depth data is partially missing. Similar to 2D image segmentation, several graph-based methods exist for the segmentation of 3D image data. All of these methods yield only a segmentation of depth images or point clouds without a parametrization into certain surface models and are therefore completely dedicated to the signal level. Sedlacek and Zara [113] presented an interactive unstructured point-cloud segmentation approach based on a graph-cut algorithm. They use only the Euclidean distance as cost function and employ user interactions to improve the segmentation of the terrain from the objects of interest. Results are shown for a complex miniature model of Prague, where the terrain is segmented from the buildings on it. Recent work by Kootstra et al. [56, 57, 58] introduces a symmetry detector to initialize object segmentation. Similar to Wan et al., pre-segmented superpixels are used, but grouping is implemented by minimizing the energy function of a Markov random field (MRF) defined over a graph. They developed a quality measure based on Gestalt principles to rank segmentation results for finding the best segmentation hypothesis. Their approach, comprising both detection and segmentation, was modified by Bergström [10] to overcome under-segmentation when objects are stacked or placed side by side. Bergström formulates an objective function where it is possible to incrementally add constraints generated through human-robot interaction, in addition to an appearance model computed from color and texture, which is commonly used to better distinguish foreground from background. A volumetric graph-cut algorithm to automatically detect 3D objects from multiple views is introduced by Campbell et al. [17, 16]. Instead of interactive user input, their method relies on the camera fixating on the object of interest during the sequence. First, a color model of the object is learned around the fixation point and then image edges are extracted. An MRF is constructed and an energy function with a volumetric and a boundary term is minimized. Strom et al. [120] presented another graph-based segmentation approach for colored 3D laser point clouds without the necessity of any user interaction. They combine color information from a wide field of view camera with a point cloud from an actuated planar laser scanner (LIDAR). The proposed approach extends the graph-based segmentation algorithm by Felzenszwalb [35] by using the surface normal deviation in addition to the color deviation of neighboring points as edge weight. Neighboring points get merged as long as the color weight and the normal weight are below their thresholds. The evaluation shows better results compared to segmentation on the 2D color image or on the depth image alone, but the approach still has limitations, as it tends to over-segment images with heavily textured objects. An early edge-based segmentation method is introduced by Fan et al. [33].
Surface patches are extracted by detection of jump-edges (depth-edges) and creases (surface orientation discontinuities). This approach yields detection of partial boundaries of surface
patches which describe boundaries of surfaces, but are not necessarily closed contours that would allow a direct extraction of surface patches. Completion by extension is used on the detected boundary fragments to get regions which correspond to elementary surface patches. They evaluated their approach on real images with single objects on a ground plane from a time-of-flight camera. Taylor et al. [122] introduced a method to segment geometric primitives based on the initial detection of depth discontinuities, creases and changes in surface type. A surface type classification algorithm uses analysis of the Gaussian image and the convexity of surface patches to avoid approximating a geometric function to the surface patches. Modeling of geometric primitives is implemented for basic shapes, such as planes, spheres, cylinders and cones. Sappa [107] presented an algorithm to extract closed contours from edge points in range images. It is assumed that these edge points are already given as input for the algorithm. After a partially connected graph is generated from the input points, a minimum spanning tree is constructed, before a single path through regions is generated by removing noisy links and closing open contours. The approach reduces the contour closure problem to a minimum spanning tree partitioning problem plus a cost function minimization stage to generate closed contours. Pre-segmentation in the recent work by Ückermann et al. [125, 126] is quite similar to Fan et al. [33]. Surface normals are calculated from a smoothed depth image and edges are detected from the scalar product of adjacent surface normals. Results for each image point are averaged from the eight scalar products in the different directions of the neighborhood. The prior smoothing of the curvature values and the thresholding with a small value ensure closedness of the extracted contours on object boundaries. Region growing is finally employed to find regions of smooth surfaces, separated by the detected edges. A drawback of the approach is its insensitivity to small surface patches and to regions with high curvature variability.

According to the structure in Figure 2.2, methods of the structural level group parallel or continuous patches and tetrahedral vertex combinations from the parametric surface patches. Fisher [39] was a pioneer in the construction of objects from surfaces and in model-based object recognition. He showed object recognition based on surface models (with a maximum of two principal curvatures) extracted from 3D data, where he assumed the surface to be the primitive object model feature. Surfaces are clustered based on heuristic rules to achieve figure-ground segregation and to match object models with objects in the database of the recognition process. Fan et al. [34] also presented a surface-model-based method to learn and recognize objects, which is more advanced than the approach by Fisher. Pre-segmentation is based on their previous work in [33]. After finding planar and quadratic surface models in the point cloud, a graph is constructed with surface models as nodes and relations between the surfaces as edges of the graph. The relation value between two surfaces, called possibility value p, depends on the type of edge: for a convex crease p is 1, for a concave crease 0.75, and in the case of a jump edge a value between 0 and 0.5, depending on the distance between the surface models.
By removing edges below a certain threshold the graph is then split into sub-graphs, which represent object hypotheses. Subsequently, multiple views are combined into one representing
graph model and recognition is implemented by matching the learned graph with a detected graph; we do not go into the details of the recognition algorithm here. The approach of Lee and Schenk [61] uses perceptual organization for grouping 3D laser altimetry data to detect buildings and structures in landscapes. They also considered the structure by Boyer and Sarkar and established a framework to process 3D data up to the structural level. Raw 3D points are organized into spatially coherent surface patches at the signal level. Patches are merged into co-parametric surfaces at the primitive level and occlusions are detected. Finally, at the structural level, useful surface combinations such as polyhedral structures are derived, again using perceptual grouping principles. They suggest selecting several cues from various choices, such as proximity, connectedness, continuity, similarity, parallelism, symmetry, common region and closure, but showed results only when using adjacency. However, their approach explicitly describes processing at the signal, primitive and structural levels, represented as segmentation, merging and grouping, respectively. Ückermann et al. [126] improved on the simple heuristics of their previous work [125] by building a bi-directed, weighted graph modeling the topological neighborhood structure. The edge weights are the probabilities that two adjacent patches (in 3D space) belong to the same object region and are initially set according to the number of neighboring surfaces. Subsequently, fully connected node triples are assigned and the common probability is updated. Object regions are finally formed greedily by considering surface groups of maximal weight first. The algorithm is iterated until all surfaces are uniquely assigned to an object region. This graph-based approach based on heuristic rules delivers good segmentation results for compact shapes consisting of a few surfaces and runs in nearly real-time, but suffers, like other approaches, when objects are partly occluded or self-occluded, or when objects are not reasonably compact.

Methods of the assembly level finally construct large arrangements of lower-level primitives to form object and region hypotheses. In Chapter 4 we assign grouping of neighboring surface patches to the structural level and, consequently, grouping of non-neighboring surface patches to the assembly level. This enables forming object hypotheses even when object parts are spatially separated, either because of occlusion by other objects or because of self-occlusion. Unfortunately, more than twenty years after Boyer and Sarkar emphasized the importance of research on 3D perceptual organization, no work is known which tackles the last level of their classificatory structure.
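As an illustrative aside to the graph-based surface grouping reviewed above (e.g. the possibility values of Fan et al. [34]), the following sketch shows the basic mechanism: surfaces become graph nodes, relations become weighted edges, and object hypotheses are the connected components that remain after weak edges are removed. The threshold value and all names are assumptions made for illustration, not values from the original papers.

```python
# Illustrative sketch: possibility values as edge weights between surface patches,
# object hypotheses as connected components after removing weak edges.
from collections import defaultdict

def possibility(edge_type, jump_distance=0.0, max_distance=0.1):
    """Edge weight following the values reported for Fan et al. [34]."""
    if edge_type == "convex_crease":
        return 1.0
    if edge_type == "concave_crease":
        return 0.75
    # jump edge: a value between 0 and 0.5, decreasing with distance
    return max(0.0, 0.5 * (1.0 - min(jump_distance / max_distance, 1.0)))

def object_hypotheses(surfaces, edges, threshold=0.5):
    """edges: list of (surface_i, surface_j, p); returns groups of surfaces."""
    adjacency = defaultdict(set)
    for i, j, p in edges:
        if p >= threshold:                  # keep only sufficiently strong relations
            adjacency[i].add(j)
            adjacency[j].add(i)
    seen, groups = set(), []
    for surface in surfaces:                # connected components = sub-graphs
        if surface in seen:
            continue
        stack, component = [surface], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(adjacency[node] - seen)
        groups.append(component)
    return groups
```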
2.4 Discussion
A hierarchical implementation of the grouping processes that abstracts data in several layers is important for a generic design of a computer vision system. Data abstraction in a hierarchical manner produces features at different representation levels, which can be used by different applications to solve various tasks. Flat system architectures lack this flexibility, even if such systems lead to impressive results for specific tasks. The structure by Sarkar and Boyer, shown in Section 2.2, is a proposal for such a hierarchical
perceptual grouping design. Each level in their structure includes different visual features of the same representation level. The implementation of generic rules is a challenging task when constructing a computer vision system. The principles of perceptual organization are such generic rules, suitable to group input data into wholes representing complete entities and to build object hypotheses. In the following we give an overview of our contribution, which we present in more detail in the next two chapters.

2D object detection: One problem of many 2D grouping algorithms is the large number of parameters and thresholds which have to be set to control the grouping process. These parameters influence the process and are usually set empirically. In Chapter 3 we show an incremental processing algorithm, turning the problem of parameter setting into a problem of setting only one parameter, the processing time. This approach brings another advantage to the system, namely anytime processing, the ability of the system to deliver results whenever processing is stopped. In addition, we show a simple and natural extension to incorporate attention mechanisms into the grouping system.

3D object detection: Compared to grouping of 2D data, perceptual grouping of 3D data has not been researched as thoroughly. While much research on image segmentation exists for the lower two levels of data abstraction, only a few approaches address the structural level, and to the best of our knowledge there exists no work which comprises all four levels of data abstraction to extract objects from images. In Chapter 4 we introduce a framework which abstracts range data into planar and B-spline surfaces before estimating relations between the surfaces, inferred from the rules of perceptual organization. Subsequently, a learning algorithm is employed to learn the grouping of objects based on the estimated relations between surface models. The proposed grouping process is implemented at the structural and assembly levels and allows us to tackle the problem of segmenting self-occluded or partly occluded objects.
Chapter 3

Object Detection in 2D image data

“The human mind has first to construct forms, independently, before we can find them in things.” – Albert Einstein

In Chapter 2 we have shown that 2D perceptual grouping is a well-studied area which still has its merits even in the age of powerful object recognizers, namely when no prior object knowledge is available. Often perceptual grouping mechanisms struggle with the runtime complexity stemming from the combinatorial explosion when creating larger assemblies of features, and simple thresholding for pruning hypotheses leads to cumbersome tuning of parameters. In the following an incremental approach is proposed instead, which leads to an anytime method, where the system produces more results with longer runtime. Moreover, the proposed approach lends itself easily to the incorporation of attentional mechanisms. We show how basic 3D object shapes can thus be detected using a table plane assumption.
3.1 Detection of Basic Object Shapes
Recognition methods based on powerful feature descriptors have led to impressive results in object instance recognition and object categorization. In some scenarios, however, no prior object knowledge can be assumed and more generic object segmentation methods are required. This is the realm of perceptual grouping, the application of generic principles for grouping certain image features into assemblies that are likely to correspond to objects in the scene, as shown in Chapter 2. In the following we review related work once more, discuss pros and cons and show how we contribute. The perceptual grouping literature has largely focused on grouping of edges, especially detecting the enclosing contours of objects. While learning algorithms have been mainly used for object recognition, Sarkar and Soundararajan [108, 111] employ a methodology to learn the importance of Gestalt principles to build large salient edge groupings of visual low-level primitives. Grouping principles, such as proximity, parallelism, continuity,
junctions and common region are trained to form large groups of edge primitives that are likely to come from a single object. Many grouping algorithms suffer from the tuning of too many parameters and from a combinatorial explosion with a rising number of edge primitives. The approach by Estrada and Jepson [32] uses a measure of affinity between pairs of lines to group line segments into perceptually salient contours in complex images. Compact closed region structure is also identified by bidirectional best-first search with backtracking by Saund [112], but, similar to the approach by Estrada and Jepson, many parameters and thresholds have to be set. Graph-based methods to extract closed contours were introduced by Wang et al. [133] and Zhu et al. [140]. Unfortunately both methods suffer from the problem of combinatorial explosion, leading to polynomial runtime complexity in the number of edge segments. Focusing on the runtime behavior of perceptual grouping, Mahamud et al. [66] presented an anytime method for finding smooth closed contours bounding objects of unknown shape, using the most salient edge segments for incremental processing. Their approach is able to stop processing at any time and delivers the best results (most salient closures) detected up to that time. However, drawbacks of the approach are the high computational cost and that it does not scale well to larger problems. Zillich et al. [142] introduced incremental indexing in the image space (as opposed to the more common indexing into a model database using geometric features) for parameter-free perceptual grouping. Incrementally extending search lines are used to find intersections of straight lines and shortest path search is then used to identify closed convex contours, leading again to an anytime approach that yields the most salient closed convex contours at any processing time. Their approach, however, only uses straight edges. Song et al. [119] propose a novel definition of the Gestalt principle Prägnanz based on Koffka's definition that image descriptions should be both stable and simple. They show a grouping mechanism based on the Gestalt principles of proximity and common region, using straight lines as grouping primitives and color image regions to estimate the common region principle. Their approach does not abstract line primitives into higher-order models; instead a bipartite graph is built to segment the primitives into line groupings. Benchmark results are shown on the Berkeley Segmentation Dataset to demonstrate their method. Sala and Dickinson [104, 103] introduce a method for contour grouping by construction of a region boundary graph and use consistent path search to identify a pre-defined vocabulary of simple part models. These models correspond to projections of 3D objects, but the approach stays at the 2D image level. The focus of the work presented in this chapter is on anytime processing, as we believe that having control of the runtime behavior of a method is of high importance if that method is to be used within a larger system context. This is especially true for methods prone to suffering from high runtime complexity (e.g. combinatorial explosions), which is often the case with perceptual grouping approaches. In particular, we extend the framework by Zillich et al. [142] to support a wider range of feature primitives. Moreover, we show how attention quite naturally fits into this framework and leads to attended objects popping out earlier.
Finally, we construct 3D objects from 2D features for a limited class of basic object shapes using geometric constraints of the environment.
[Figure 3.1: 2D perceptual grouping over four levels of data abstraction and associated processing methods. Image pixels are abstracted to edgels and edge chains at the signal level (edge extraction), to lines and arcs at the primitive level (parametric model fitting), to closures, rectangles, flaps, arc groups, ellipses and similar structures at the structural level (parameter-free perceptual grouping), and finally to cuboids, cylinders, cones and spheres at the assembly level (basic object shape grouping).]
3.2 Concept and Classificatory Structure
According to the classificatory structure presented in Section 2.2, our approach follows the four-level structure, see Figure 3.1. Edge extraction delivers edgels (edge segments) and edge chains at the signal level, extracted from image pixel intensities (grey-level or RGB image). Parametric image features, such as straight lines and arcs, are subsequently estimated at the primitive level with parametric model fitting. Incremental processing is initiated by employing incrementally extending search lines in the image space. Whenever search lines intersect, new junctions are created and subsequent modules are triggered to form new higher-level primitives. Closed regions, polygons, arc groups, ellipses and other shape primitives are then constructed at the structural level. Bottom-up grouping continues until basic object shapes, such as cuboids, cylinders, cones and spheres, appear at the assembly level.
3.3 Process Flow
Processing does not happen in a traditional bottom-up pipeline, where each level of primitives is constructed one after another. Instead, following the principle of anytimeness, processing is incremental. A processing module for a primitive (e.g. finding closures) is always triggered when the module gets informed about a new lower-level primitive (e.g. a junction).
[Figure 3.2: Process flow of shape primitives for basic object shape detection — from images and edge segments at the signal level, over lines and arcs at the primitive level, to line junctions, closures, rectangles, flaps, arc junctions, convex arc groups, ellipses, ellipse junctions and extended ellipses at the structural level, and finally cuboids, cylinders, cones and spheres at the assembly level.]
This in turn leads to processing at the next higher level (e.g. rectangles), and stops at a level where no more grouping principles can be satisfied, or at the assembly level with the creation of an object hypothesis. Processing starts with edge extraction, which is the only non-incremental part, as we use an off-the-shelf edge detection method to detect all edges (lines and arcs) at once, see Figure 3.2. All subsequent processing is controlled by extending search lines of shape primitives (see Sec. 3.4) to find junctions between primitives. All created primitives are ranked according to significance values derived from geometric constraints (e.g. line length for straight lines). Ranking serves two purposes. First, primitives at the lower levels, i.e. those triggering higher-level processing, are selected for processing according to rank. For example, a line is chosen to extend its search line by one pixel. To this end we randomly select ranked primitives using an exponential distribution. This leads to the most salient structures popping out first. Second, ranking allows masking to prune unlikely primitives. A higher-ranked primitive masks a lower-ranked primitive if the two disagree about the interpretation of a common lower-level element (e.g. two overlapping closures sharing an edge). Masking prunes results with poor significance and limits combinatorial explosions in processing modules higher up the hierarchy.
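A minimal sketch of the rank-based selection step, under the assumption that an exponentially decaying weight over rank positions is used; the function and attribute names are illustrative, not taken from the actual implementation:

```python
# Illustrative sketch: primitives are kept ranked by significance and one is picked
# per processing step with exponentially decaying probability over rank positions,
# so salient primitives are preferred but low-ranked ones are not excluded.
import math
import random

def pick_ranked(primitives, significance, decay=0.5):
    """Return one primitive, favoring those with higher significance."""
    ranked = sorted(primitives, key=significance, reverse=True)
    weights = [math.exp(-decay * rank) for rank in range(len(ranked))]
    return random.choices(ranked, weights=weights, k=1)[0]

# Example: choose which line gets to grow its search line by one pixel,
# assuming each line carries a 'length' attribute as its significance value.
lines = [{"id": 1, "length": 40.0}, {"id": 2, "length": 12.5}, {"id": 3, "length": 25.0}]
chosen = pick_ranked(lines, significance=lambda line: line["length"])
```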
3.4 Incremental Indexing and Anytimeness
Incremental indexing is used at the structural level to form junctions. Search lines emanating from primitives are used to find junctions between primitives. Search lines are defined in the image space for line, arc and ellipse primitives, as shown in Figure 3.3. We define tangential and normal search lines at the start and end points of straight lines and arcs, and normal search lines at the vertex points of ellipses. These search lines are drawn into the so-called vote image one pixel at a time, whenever the respective primitive is selected for processing.
Figure 3.3: Definition of normal (n) and tangential (t) search lines (first row) and types of junctions between search lines (second row). Collinearities, T-junctions and L-junctions appear between lines, arc-junctions between arcs, and ellipse-junctions between ellipses and lines.

Each search line is drawn with the originating primitive's label. Whenever a growing search line intersects another or hits a primitive, a new junction emerges. The different types of junctions for different types of intersecting search lines are shown in the second row of Figure 3.3. Emanating search lines for different processing times can be seen in the experiments section in Figure 3.5.
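A minimal sketch of the vote-image mechanism, assuming a simple integer label image and a single-pixel step per selected primitive; the names and data layout are assumptions for illustration, not the original code:

```python
# Illustrative sketch: each primitive grows a search line one pixel per step into a
# label image (the vote image); hitting a pixel already labeled by a different
# primitive yields a junction candidate between the two primitives.
import numpy as np

def grow_search_line(vote_image, primitive_id, position, direction):
    """Advance one search line by a single pixel; return (new_position, junction or None)."""
    y, x = position[0] + direction[0], position[1] + direction[1]
    if not (0 <= y < vote_image.shape[0] and 0 <= x < vote_image.shape[1]):
        return position, None                          # search line left the image
    existing = int(vote_image[y, x])
    if existing != 0 and existing != primitive_id:
        return (y, x), (primitive_id, existing)        # junction between two primitives
    vote_image[y, x] = primitive_id                    # claim the pixel for this primitive
    return (y, x), None

vote = np.zeros((480, 640), dtype=np.int32)
pos, junction = grow_search_line(vote, primitive_id=7, position=(100, 100), direction=(0, 1))
```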
3.5 Gestalt Principles and Primitives
Primitives are grouped using implicitly and explicitly implemented Gestalt principles. For example, proximity is implemented implicitly by the search lines for finding junctions, while closure, parallelism or connectedness are implemented explicitly as geometric constraints within the respective processing modules. In the following we briefly describe the generation of each primitive:

Edge Segments – Edge segments are constructed from the raw color image data with a Canny edge extractor.

Lines and Arcs – Edge segments are split into sub-segments and straight lines and arcs are fitted to them using the method by Rosin and West [98].

Line Junctions – Junctions between lines are T-junctions, L-junctions and collinearities, as shown in Figure 3.3. T-junctions are substituted by two L-junctions and a collinearity for further processing in the grouping framework.

Closures (closed convex contours) – Lines form the vertices V_l, and junctions the edges E_l, of a graph G_l = (V_l, E_l), which is constantly updated as new junctions are created. Whenever new junctions appear, Dijkstra's algorithm for shortest path search is run on the updated graph G_l. A constraint on similar turning directions during the search ensures creation of convex contours.
ensures creation of convex contours. Rectangles – With rectangles we refer to geometric structures including trapezoids and parallelograms (i.e. perspective projections of 3D rectangles under one-point projection [18]). Four dominant changes of the direction (L-Junctions) and at least one parallel opposing line pair is mandatory to create a rectangle. Flaps – A flap is a geometric structure built from two non-overlapping rectangles where the two rectangles share one edge, see Figure 3.4. All cuboidal structures under generic views consist of flap primitives. Arc junctions – Arc junctions are created when two arc search lines with same convexity intersect, as shown in Figure 3.3. Convex arc groups – Again a graph Ga (Va , Ea ) is constructed from arcs and their junctions, and updated whenever a new arc junction comes in. Path search on the graph Ga (Va , Ea ) leads to convex arc groups, i.e. groups of pairwise convex arcs. Ellipses – Ellipses are fitted to convex arc groups using least squares fitting [40] as implemented in OpenCV [15]. Ellipse junctions – Ellipses trigger initialization of ellipse search lines, as shown in Figure 3.3), with the goal of finding lines connected to the ellipse’s major vertices. Extended ellipses – Ellipses with attached lines (possibly themselves extended via collinearities) form so called extended ellipses. They are created whenever a new ellipse junctions appear. Cuboids – Figure 3.4 shows the two options to construct a cuboid. First, from three flaps sharing three different rectangles and second, from an L-junction and two line primitives connecting to a flap. Cones – A cone is constructed when finding a L-junction between the connected lines from the ellipse junctions, see Figure 3.4, as shown in Figure 3.4. Cylinders – Cylinders are also build from extended ellipses with ellipse junctions at each vertex by finding a connection between the ellipse junctions, see again Figure 3.4. 32
Spheres – Spheres are inferred from circles, which represent a special type of ellipse whose major and minor axes have similar length.
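The closure search described above can be summarized as a shortest path problem on the line–junction graph. The following sketch shows the idea, assuming lines as nodes and junctions as unit-cost edges; the convexity (turning-direction) constraint is omitted, and all names are illustrative rather than taken from the implementation:

```python
# Illustrative sketch: when a new junction (line_a, line_b) appears, a shortest path
# from line_b back to line_a over the existing junction edges closes a cycle, which
# is a closure candidate. Convexity checks are left out for brevity.
import heapq
from collections import defaultdict

def shortest_path(graph, start, goal):
    """Dijkstra over graph[node] = list of (neighbor, cost); returns a node list or None."""
    queue, best, previous = [(0.0, start)], {start: 0.0}, {}
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            path = [goal]
            while path[-1] != start:
                path.append(previous[path[-1]])
            return list(reversed(path))
        if cost > best.get(node, float("inf")):
            continue
        for neighbor, edge_cost in graph[node]:
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    return None

def on_new_junction(graph, line_a, line_b, cost=1.0):
    """Look for a closure through the new junction, then add the junction edge."""
    cycle = shortest_path(graph, line_b, line_a)
    graph[line_a].append((line_b, cost))
    graph[line_b].append((line_a, cost))
    return cycle

graph = defaultdict(list)
```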
3.6 Adding Attention
In Section 3.3 ranking of primitives was based solely on their significance. Sometimes, however, we might have external cues. The user of our system might provide us with regions of interest (ROIs), perhaps deduced from saliency maps or change detection. Also, higher-level knowledge might be available, such as the gaze direction of a human or the position of a grasping hand. Including such attentional cues is quite straightforward. To this end we weight the significance of a primitive with its distance to the provided region(s) of interest, using a Gaussian distribution located at the center of the ROI with a sigma equal to 0.25 times the image width. Processing is thus concentrated around the region of interest. The ROIs allow us to specify where to look first and more “carefully”. Note that we do not exclude other regions of the image, as would be the case if we simply cut out the ROI and then worked on the sub-image. Strongly prominent structures outside the ROI will still be detected, albeit a bit later.
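A minimal sketch of this weighting, assuming a single circularly symmetric Gaussian around the ROI center; the function and parameter names are illustrative:

```python
# Illustrative sketch: scale a primitive's significance by a Gaussian of its distance
# to the ROI center, with sigma = 0.25 * image width as described above.
import math

def attended_significance(significance, position, roi_center, image_width):
    sigma = 0.25 * image_width
    dx, dy = position[0] - roi_center[0], position[1] - roi_center[1]
    weight = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
    return significance * weight

# A primitive far from the ROI keeps a small but non-zero significance, so prominent
# structures outside the ROI are still found, just later.
print(attended_significance(1.0, position=(600, 60), roi_center=(320, 240), image_width=640))
```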
3.7 From 2D Shapes to 3D Objects
All primitives so far are groups of 2D image features. But the primitives at the assembly level are highly non-accidental configurations corresponding to projections of views of 3D objects. So with a few additional assumptions we should be able to create 3D object hypotheses from the 2D shapes. To this end we assume that the pose of the calibrated camera with respect to the dominant plane on which objects are resting is known. For the indoor robotics scenario we are targeting, this knowledge comes from the known tilt angle and elevation of the camera, together with the assumed (typically standardized) table height. We are then able to calculate the 3D properties of rectangular cuboids and of upright-standing cones and cylinders. We simply intersect view rays through the lowermost junctions with the ground plane and thus obtain the 3D position on the plane as well as the unambiguous size of the basic object shapes. This restriction to simple symmetric shapes also allows completing the unseen backside of the object.
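A minimal sketch of the ray/ground-plane intersection used here, with illustrative camera and plane values (the interface is an assumption, not the actual implementation):

```python
# Illustrative sketch: intersect a view ray through a lowermost image junction with
# the supporting plane to obtain the 3D footprint of a detected basic shape.
import numpy as np

def intersect_ray_with_plane(camera_center, ray_direction, plane_point, plane_normal):
    """Return the 3D point where the ray hits the plane, or None if it misses."""
    denominator = np.dot(plane_normal, ray_direction)
    if abs(denominator) < 1e-9:
        return None                                   # ray is parallel to the plane
    t = np.dot(plane_normal, plane_point - camera_center) / denominator
    if t <= 0:
        return None                                   # plane lies behind the camera
    return camera_center + t * ray_direction

# Example with a camera 1.2 m above a horizontal table plane at z = 0.
camera = np.array([0.0, 0.0, 1.2])
ray = np.array([0.0, 0.7, -0.7])                      # e.g. back-projected lowermost junction
footprint = intersect_ray_with_plane(camera, ray, np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))
```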
3.8 Experiments
We demonstrate results on a table-top scene containing six objects of different shapes. Figure 3.5 shows the growing search lines of lines (first row) and arcs (second row) and the resulting 3D shapes of the assembly level after 150 ms, 200 ms, 300 ms and 500 ms processing time (Intel i7, 4x 2.5 GHz).
Figure 3.5: Incremental grouping: line search lines (first row), arc search lines (second row) and resulting basic shapes (third row) after 150, 200, 300 and 500 ms processing time.

As can be seen, with the proposed incremental approach object detection depends on the processing time: the longer the search lines grow, the more primitives are connected and hence the more shapes are found. Adjusting the processing time influences not only the detection rate, but also the number of false positive detections. Figure 3.6 shows precision and recall for different processing times.
Figure 3.6: Precision-recall curve with varying processing time.
Table 3.1: Average true positive detection rate for objects shown in Fig. 3.5 with different processing times and results when using an attention point for the sought-after object.
                 Mug    Salt   Cube   Ball   Peanuts  Tape   Average
150ms            69%    0%     33%    61%    3%       65%    38.5%
150ms with ROI   76%    0%     71%    46%    6%       72%    45.2%
200ms            82%    2%     95%    89%    28%      78%    62.3%
200ms with ROI   83%    1%     94%    91%    24%      84%    62.8%
300ms            85%    4%     99%    97%    44%      85%    69.0%
300ms with ROI   86%    3%     100%   100%   47%      87%    70.5%
500ms            86%    6%     100%   99%    53%      92%    72.7%
500ms with ROI   88%    3%     100%   100%   46%      83%    70.0%
1000ms           82%    11%    100%   99%    51%      84%    71.2%
1000ms with ROI  84%    21%    100%   100%   61%      90%    76.0%
Figure 3.7: Search lines and detected basic shapes when attention (region of interest) is set to the salt box and the tape, respectively.

Short processing time leads to high precision and low recall, while longer processing times increase the recall at the cost of a lower precision. It is the user's decision to adjust the processing time to achieve the desired performance behaviour. Figure 3.7 shows two examples when using attention by specifying a region of interest (ROI) with 300 ms processing time. Search lines in the ROI are preferred for extension, leading on average to earlier detection. Table 3.1 shows the average detection rate (over 100 images) for the different objects shown in Fig. 3.5. The detection rate is estimated with different processing times, with and without a region of interest (ROI) centered on the object. Note that we are not concerned here with how attention is provided. It could be based on color (“the yellow ball”), generic saliency operators or any other means providing salient locations in the image. Surprisingly, it can be observed that the detection rate actually decreases for some objects when using attention. This can be explained by the texture on the objects which, being nearest to the attention point, grabs too much attention and lets the system hallucinate objects into the texture. Note that the actual object is typically still found, but masked by the hallucinated texture object, and thus not reported. The main impact of using ROIs can be observed when using short processing times.
Figure 3.8: More detected shape primitives: office and living room scenes with detected boxes and cylinders (red) as well as the lower-level primitives rectangles (yellow) and closures (blue).
3.9 Discussion
We presented an anytime system for detecting 3D basic object shapes from 2D images. Anytimeness is implemented by incremental indexing, which also drives the bottom-up perceptual grouping process. We have shown how attention can quite naturally be integrated into anytime processing. Experimental evaluation shows encouraging results on example images of moderate complexity with objects of different shapes, though a more extensive evaluation on a broader range of scenes and objects is needed. Two prototypical real-world scenes are shown in Fig. 3.8, demonstrating the occurrence of the basic object shapes in natural scenes.
3.9.1 Limitations
One limitation of the system is the fact that we rely solely on edge primitives from a standard Canny edge detector. Recent approaches have shown that edge extraction and region segmentation deliver the best results when done simultaneously [9]. Input from more sophisticated edge detectors would certainly lead to improved results and should be considered in further investigations. Another limitation of the approach is the restricted number of detectable shapes at the assembly level (much like [104, 103]), but this restriction on the other hand provides the heuristics to generate 3D shapes from the 2D assemblies. To overcome the problem of defining basic shapes, learning algorithms could be employed at the assembly level to group primitives of the structural level into hypotheses according to the principles of perceptual organization, as shown in [108, 111].
Figure 3.9: BLORT overview. First row: detect and track a tea box. Second row: learn appearance features on the shape model. Third row: sequential recognition of learned objects in a cluttered scene.
3.9.2 BLORT – The Blocks World Robotic Vision Toolbox
It is time to return to the computer vision architecture which we presented in Figure 1.1 in Section 1.2, to review the achievements of our work. The previously defined architecture allows detecting, tracking, learning and recognizing objects and provides this information to other components in a robotics framework. The introduced perceptual grouping framework is able to detect novel objects if they appear as one of the defined basic object shapes. The grouping framework delivers not only the shape of the object, but also the pose and the 3D properties of the shape, which are essential for subsequent tracking, learning and recognition. In [91] we first showed how to combine detection and tracking of objects. Figure 3.9 shows in the first row the detection of a tea box and its subsequent tracking. After shape detection and estimation of the 3D model, the tracker additionally uses the texture on the object to track it robustly. In [74] we subsequently demonstrate the functionality of the whole system architecture, again using detection followed by tracking. Scale-invariant
appearance features (SIFT) are calculated and projected onto the 3D model to allow the recognition module to learn the object appearance from different viewpoints. The second row in Figure 3.9 shows two different viewpoints for learning while tracking the object. Features from different viewpoints are assigned to the model when turning it. The more viewpoints are trained, the better the later detection by the object recognition module. The last row of Figure 3.9 finally shows the sequential recognition of already learned objects in a cluttered scene and, in the last image of the row, a 3D scene representation with the recognized objects. The whole framework was published on the internet for dissemination (http://www.acin.tuwien.ac.at/?id=290). Videos, code, installation instructions and other additional material can be found on that page.
Chapter 4

Object Detection in 3D image data

“One new feature or fresh take can change everything.” – Neil Young

In the last chapter we presented a method for the detection of basic object shapes from 2D image data, namely cuboids, cylinders, cones and spheres. In this chapter we extend the problem statement of Chapter 3 to get rid of the basic shape constraint. Instead, we focus on the detection of objects of arbitrary shape and aim for a generic reconstruction of objects composed from surface models. To achieve this goal, the input data is extended to depth images (RGB-D images), which provide, in addition to color, also a depth value for each pixel of the image.
4.1 Detection of Objects
In the early days of computer vision, object detection was mainly concerned with the detection of known objects. Originally it meant finding a given object in visual input, such as images and videos, where the object and its properties are already known. These properties are geometric and/or appearance models of an object, which appear different when the object is seen from other viewpoints. But what if objects are not previously known? Such situations are typical in domestic robotics. Object recognition is certainly important, but a vision system cannot know all objects beforehand. Methods have to be found which are able to extract object hypotheses from the image without the need for knowledge about specific objects. During the process of perception a description of the image content should be constructed, even when the meaning of the content is still unknown. And this is where perceptual grouping processes enter the picture. Perceptual grouping processes work without prior knowledge of the possible image content. Groupings of visual features are built because features are likely to belong to the same physical object and not because they match a known model. Gestalt principles describe the rules to find groupings without assigning a meaning to a certain group, in contrast to recognition, where a meaning is assigned when a known object gets recognized.
Many flat hierarchies have been used for task-specific computer vision systems. Instead of designing a flat hierarchy, we again design a deep hierarchical system, which segments a depth image over several levels of data abstraction. The process of segmentation is not implemented at a single level of the hierarchy; instead, it is implemented at each level of the proposed structure, delivering a different type of segmentation for each level. The image is initially pre-segmented into planar patches, before these patches are merged and parametrized into uniform patches without discontinuities. Two more data abstraction levels are responsible for segmenting the image into object hypotheses. This implementation, with several different feature representations at various abstraction levels, allows more than one application to be built on top, e.g. object reconstruction, object recognition or manipulation of an object.
4.2 State of the Art
Since 2010, cheap and powerful sensors, such as the Microsoft Kinect or Asus Xtion, have become available and sparked a renewed interest in 3D methods throughout all areas of computer vision. Making use of RGB-D data can greatly simplify the grouping of scene elements, as structural relationships are more readily observable in the data rather than needing to be inferred from a 2D image. In the following we review global and local methods. We primarily focus on work based on RGB-D data which is related to our proposed method, and on state-of-the-art algorithms with which we compete in the evaluation section. Range segmentation algorithms can be roughly classified into region- and edge-based methods. Both are global segmentation methods which usually segment the whole scene at once. Region-based segmentation methods first segment the image into initial regions (over-segmentation) and then try to merge or extend the region segments. Early work by Besl and Jain [11], Boyer et al. [13] and Leonardis et al. [62] is based on fitting surface models to the 3D data. Besl and Jain [11] pre-segment the image with curvature sign labeling into smooth surfaces and subsequently fit variable-order surface models. Boyer et al. [13] and Leonardis et al. [62] skip pre-segmentation and directly fit geometric models to the data, starting at pre-defined seed points with region-growing algorithms. Both use planar and bi-quadratic surface models and choose the best of the competing models with model selection (AIC and MDL, respectively). Wan et al. [131] show another region growing approach, where RGB-D images are over-segmented using superpixel segmentation on the mixed intensity and range image, and the superpixels are afterwards merged into a set of planes according to their coplanarity. All these methods are limited to segmenting the scene into a set of pre-defined surface models and are not able to segment objects as wholes. Graph-based methods typically are also region-based approaches, initially developed to segment 2D image data. The theory of graph cuts for computer vision was introduced by Greig et al. [42]. Several other algorithms have been proposed, such as the Minimum Spanning Tree by Zahn et al. [139], Mean Shift by Comaniciu and Meer [23], the Normalized Cut by Shi and Malik [118] or the interactive Grab-Cut algorithm by Rother et al. [100]. Recently, the approach based on the gPb contour detector, developed
by Arbelaez et al. [9] has shown state-of-the-art results, outperforming the former ones on publicly available databases. To overcome the poor computational performance of graph-based segmentation approaches, Felzenszwalb and Huttenlocher [35] propose an efficient graph-based algorithm for image segmentation based on super-pixel clustering which satisfies global properties and runs in nearly linear time in the number of graph edges. Several attempts employ RGB-D or 3D data to boost the results of graph-based segmentation methods. Sedlacek and Zara [113] present an interactive unstructured point-cloud segmentation approach. They use only the Euclidean distance as cost function and employ user interactions to improve segmentation of the objects of interest from the terrain. A volumetric graph-based algorithm for automatic detection of 3D objects from multiple views was introduced by Campbell et al. [17, 16]. In the proposed method, a camera is fixated on the object of interest during the sequence of movements. A color model of the object is learned around the fixation point and image edges are extracted. Segmentation is done by minimizing an energy function composed of a volumetric term and a boundary term. However, the described approach requires complete 3D views, which is typically not possible in robotic tasks. Strom et al. [120] presented a graph-based approach without the necessity of any user interaction. They extend the algorithm of Felzenszwalb [35] by using the deviation of surface normals in addition to the color deviation of neighboring points as weight. Their method outperforms algorithms which are purely based on color information, but it still suffers when scenes contain textured areas.

Early edge-based segmentation methods based on the detection of jump edges (depth edges) and creases (surface orientation discontinuities) have been introduced by Fan et al. [33] and Taylor et al. [122]. [33] uses completion by extension to estimate closed contours to form segments and [122] proposes a surface type classification algorithm to identify basic shapes, such as planes, spheres, cylinders and cones. State-of-the-art edge-based segmentation by Ückermann et al. [126] detects edges using surface normals and employs region growing to find regions of smooth surfaces, separated by the detected edges. This pre-segmentation algorithm is very similar to the approach by Fan et al. [33]. Subsequently, an adjacency graph is built from the detected surfaces and heuristic rules are used to split the graph to segment objects. A problem of edge-based methods is that discontinuities are difficult to identify. This usually results in under-segmentation, which cannot be corrected in subsequent processing steps. However, all work in this direction concentrated on segmenting the whole scene at once.

In contrast, active segmentation methods segment the image into foreground (object hypothesis) and background by starting from a fixation on the foreground. This fixation is derived from user input or from attention cues. The idea to investigate the scene step by step was first proposed by Aloimonos et al. [5], who presented the concept of active vision. Later, Mishra et al. [69] describe a framework for active segmentation with fixations based on that concept. They argue that the human visual system investigates or observes the scene by a set of fixations that are followed by segmentation.
Their approach therefore requires only a fixation point on the object, and segmentation itself is done by minimizing the energy function in the log-polar coordinate space of the detected edges.
Mishra et al. extended their work in [70] with the concept of "simple object" and border ownership, which was defined using depth or motion information about the scene. They also proposed a new strategy to calculate the fixation point as points situated closer to the object border. These points are assumed to capture the main properties of the object. One of the most significant differences to other active segmentation algorithms is the use of motion and depth information to boost segmentation. Kootstra et al. [56] proposed an attention-driven graph-based segmentation approach. Objects are localized with fixation points extracted from 2D symmetry saliency maps. An energy minimization function is applied to depth and color along with a support plane constraint for segmentation. It is worth mentioning that both of the above segmentation algorithms [56, 70] were developed specifically for table-top scenes and require information about the support plane. However, both approaches fail to segment objects if the scene consists of multiple occluded and cluttered objects with several colors and textures. These types of scenes are common in domestic robotic tasks and need to be resolved correctly to enable applications such as grasping or other manipulation tasks with objects.

For the proposed object detection and segmentation approach we again consider the structure of Sarkar and Boyer [109]. A hierarchical framework is implemented, using a region-based pre-segmentation and a learning algorithm to train the importance of relations between pre-segmented patches. The definitions of the relations are derived from generally valid perceptual grouping rules. The main contribution of our work is the combination of perceptual grouping with support vector machine (SVM) learning within a designated hierarchical structure. The learning approach of the framework enables segmentation of unknown objects of reasonably compact shape and allows segmentation of a wide variety of different objects in cluttered scenes, even if objects are partially occluded.
4.3 Concept and Classificatory Structure
The problem of generating object hypotheses can also be seen as a segmentation problem. Starting with maximum over-segmentation, meaning that each pixel represents a single segment, a grouping process assigns these visual segments to larger entities. We propose a hierarchical grouping process over several levels of data abstraction. We again use the structure of Sarkar and Boyer [14, 109], which was already introduced and discussed in Section 2.2. Input data is organized in a bottom-up fashion, stratified by layers of abstraction: signal, primitive, structural and assembly level, see Figure 4.1. Raw sensor data, arriving as RGB-D data, is grouped at the signal level into point clusters, before the primitive level estimates parametric surface models and the associated boundaries. At the structural level, grouping of parametric surfaces produces combinations of neighboring patches. Finally, the assembly level generates arrangements of surfaces, which are enclosed entities representing object hypotheses. Figure 4.1 shows on the right side the bottom-up processing steps over the different levels of data abstraction. Compared to the four levels of the data structures, processing is split into five steps; responsible for this is the final global decision making, which
happens at the end of processing, based on the results of the structural and the assembly level. In the following, the processing at the different levels is explained in more detail.

Signal level - Raw RGB-D images are pre-clustered based on depth information. The relation between the 2D image space and the associated depth information of RGB-D data is exploited to group neighboring pixels into patches.

Primitive level - The task at the primitive level is to create parametric surfaces and boundaries from the extracted pixel clusters of the signal level. Plane and B-spline fitting methods are used in combination with Model Selection to merge pre-segmented patches and to estimate their parametric surface representations. Model Selection is employed to find the best representation and therefore the simplest set of parametric models for the given data.

Structural level - Features derived from Gestalt principles are calculated between neighboring surface patches (in 3D Euclidean space) and a feature vector is created. During a training period, feature vectors and ground truth data are used to train a support vector machine (SVM) classifier to distinguish between patches belonging to the same object and patches which do not belong to the same object. The SVM then provides, for each feature vector of a neighboring patch pair, a value representing the probability that the two neighboring patches belong to the same object.
Figure 4.1: 3D perceptual grouping over four levels of data abstraction and associated processing methods.
Figure 4.2: Processing example of a complex scene: original image, pre-segmented patches, parametric surfaces and extracted object hypothesis.

Assembly level - Groups of neighboring parametric surfaces are available for processing. Feature vectors are again constructed from relations derived from Gestalt principles, but now between non-neighboring surface patches of different parametric surface groups. A second SVM is trained to classify this type of feature vector. Creating object hypotheses directly from the assembly level is difficult, as the estimated probability values from the SVM are only available between single surfaces, but not between whole groupings of surface patches from the structural level. Wrong classifications by the SVMs (which after all only make local decisions) pose a further problem, because even a few errors usually lead to severe under-segmentation of the scene.

Global Decision Making - To overcome these problems, the decision about the optimal segmentation has to be made at a global level. To this end we build a graph where parametric surfaces from the primitive level represent nodes and the above relations implementing Gestalt principles represent edges. We then employ graph-based partitioning, using the probability values from the SVMs of the assembly level as well as of the structural level as energy terms of the edges, to finally segment the most likely connected parts, forming object hypotheses.

Figure 4.2 shows intermediate processing results of a complex scene, processed with the proposed hierarchical framework. Besides the image segmentation, the system provides a parametric model for each object, enabling efficient storage and convenient further processing of the segmented structures.
4.4 The Object Segmentation Database (OSD)
For evaluation of the following perceptual grouping approaches a database was created, consisting of 111 table-top scenes with diverse kinds of typically man-made objects, mainly found in households. These objects are boxes, cups, bowls and different product packagings. For the whole dataset, ground truth annotation was produced by hand-labeling the objects with polygons on the color image. Table 4.1 shows an overview of the database, which is split into six subcategories according to the complexity of the scene and with a variable number of objects within. The database is organized into a learn set and a test set.
Table 4.1: Object Segmentation Database (OSD): number of images and objects in the learn and test sets.

                          Learn set                       Test set
  Scene type          Nr. of images  Nr. of objects   Nr. of images  Nr. of objects
  Boxes                    17             38               16             36
  Stacked Boxes             8             20                8             21
  Occluded Objects          8             16                7             14
  Cylindric Objects        12             38               12             42
  Mixed Objects             -              -               12             81
  Complex Scenes            -              -               11            166
  TOTAL                    45            108               66            360
The learn set consists of 45 images from the four simpler types of scenes, while the test set encompasses all six categories with 66 images. Figure 4.3 shows an example image of each scene type of the OSD dataset.
Figure 4.3: Example of each scene type from the object segmentation database (OSD).
4.5 Pre-segmentation
3D cameras provide RGB-D data consisting of a color image and the associated depth information for each pixel. This data is organized in an image plane, allowing to exploit the relationship between the 2D image space and the associated depth information. According to the introduced classificatory structure, pre-segmentation is the first processing step and groups points into clusters. The estimated clusters represent surface patches, i.e. uniform areas without discontinuities. The task of the pre-segmentation module is twofold: first, surface normals are calculated, and second, neighboring pixels in the image space are clustered into uniform patches based on the calculated surface normals. To achieve a good trade-off between the level of detail and insensitivity to sensor noise, the characteristics of the sensor are modeled and considered during processing.
4.5.1 Normals calculation
Surface normals are important properties of geometric surfaces and their calculation for 3D data is a standard task. Since an acquired point cloud represents a set of samples on a surface, a direct calculation of the perpendicular surface normal is not possible and is by definition an ill-posed problem. The solution is a local approximation of the surface directly from the point cloud by using the neighboring points. This makes it possible to compute the surface normal at each point by fitting a plane to the neighboring points. Neighborhood is usually defined by the maximum allowed Euclidean distance between neighboring points. For an organized point cloud, a pre-defined kernel size can be used to efficiently gather the neighboring points, because it avoids calculating the Euclidean distance between all points in the dataset. Consequently, two parameters influence the normal calculation: the kernel radius kr and the Euclidean inlier distance din. The former defines the number of points used and thus the smoothing of the normals, the latter the maximum allowed Euclidean distance of the neighboring points to the center point of the kernel. The adjustment of these two values accounts for high deviations caused by sensor noise that would distort the local plane and therefore the calculated normals. An optimal adjustment of the two values depends on the noise of the sensor. In our case we are using a Microsoft Kinect, matching the recordings of the object segmentation database (OSD) which is later used for evaluation of the pre-segmentation. To obtain the characteristics of the sensor, the spatial distribution of points from planes at different distances and with different angles to the depth camera has been estimated. 126 images with an unambiguous dominant plane were recorded and the parameters of the dominant plane were estimated with a RANSAC plane fitting approach [38]. The error of the plane normals and the error of the depth of the points with respect to the estimated plane have been analyzed. Figure 4.4 shows the standard deviation of the plane normals for different kernel radii kr with respect to different distances between camera and plane points (depth).
Figure 4.4: Standard deviation of the plane normals with respect to kernel radius kr and depth, measured with a Microsoft Kinect. (The red points show the selected kernel radius values.)

To limit the noise of the normals to a certain value without losing too much detail, the kernel radius kr is adjusted depending on the depth interval, compensating the noise of the sensor data which increases with the distance to the sensor. Figure 4.5 shows the mean and standard deviation of the Euclidean distance between neighboring plane points over the depth. The values of the histogram allow us to approximate the inlier distance for the normal calculation:

din(z) = κg · dk · z        (4.1)
Figure 4.5: Mean and standard deviation of the distance (∆d) between neighboring plane points over the depth (distance between camera and plane point), measured with a Kinect sensor.
The maximum allowed inlier distance din(z) is defined by the distance threshold for neighboring points κg, multiplied by the distance to the kernel center dk. Finally, the inlier distance depends on the distance z of the actual point to the camera, to account for the rising neighborhood distances shown in Figure 4.5.
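To make the depth-adaptive neighborhood concrete, the following Python sketch shows how such a normal estimation could be implemented for an organized point cloud. It is a minimal illustration under the assumptions stated in the comments, not the thesis implementation; the function name estimate_normal and the array layout are chosen here for illustration only.

```python
import numpy as np

def estimate_normal(points, u, v, kernel_radius, kappa_g=0.005125):
    """Estimate the surface normal at pixel (u, v) of an organized point cloud
    (H x W x 3, metric coordinates, NaN for invalid depth).

    A neighbor at pixel distance d_k to the kernel center is accepted only if
    its Euclidean distance to the center point is below
    d_in = kappa_g * d_k * z (cf. Eq. 4.1)."""
    h, w, _ = points.shape
    center = points[v, u]
    if np.any(np.isnan(center)):
        return None
    z = center[2]
    neighbors = [center]
    for dv in range(-kernel_radius, kernel_radius + 1):
        for du in range(-kernel_radius, kernel_radius + 1):
            if dv == 0 and du == 0:
                continue
            vv, uu = v + dv, u + du
            if not (0 <= vv < h and 0 <= uu < w):
                continue
            p = points[vv, uu]
            if np.any(np.isnan(p)):
                continue
            d_k = np.hypot(du, dv)          # pixel distance to the kernel center
            d_in = kappa_g * d_k * z        # depth-adaptive inlier distance
            if np.linalg.norm(p - center) < d_in:
                neighbors.append(p)
    pts = np.asarray(neighbors)
    if len(pts) < 3:
        return None
    # plane fit: the normal is the eigenvector of the smallest eigenvalue
    cov = np.cov((pts - pts.mean(axis=0)).T)
    normal = np.linalg.eigh(cov)[1][:, 0]
    # orient the normal towards the camera (viewpoint at the origin)
    if np.dot(normal, -center) < 0:
        normal = -normal
    return normal
```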
4.5.2 Normals clustering
After calculation of the surface normals for all points in the image, the points are clustered to form planar surface patches. Greedy clustering of the normals is again controlled by two parameters: first, the maximum allowed angle γcl between normals, and second, the maximum allowed normal distance dcl of points to the plane. The normal of the planar cluster is estimated as the mean of the clustered point normals and its origin as the mean position of the clustered points. These values are updated whenever a new point is added to the cluster. Since Equation (4.1) does not reduce the noise in the data itself, but only reflects the limit of measurability of discontinuities, the distance z of a point still influences the normals clustering. For this reason the maximum allowed angle and the maximum allowed distance between neighboring points are modeled as linear functions of the depth,

γcl(z) = εg z + εc        (4.2)
dcl(z) = ωg z + ωc        (4.3)

where εc, ωc are the offsets and εg, ωg the slopes of the functions.
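A minimal sketch of such a greedy clustering with depth-dependent thresholds is given below; it is an illustrative Python re-implementation, not the original code, and the names eps_g, eps_c follow the reconstruction of Eq. (4.2) above.

```python
import numpy as np
from collections import deque

def cluster_normals(points, normals, eps_g=0.1, eps_c=0.58,
                    omega_g=0.015, omega_c=-0.004):
    """Greedy clustering of pixels into planar patches (Eqs. 4.2 and 4.3).
    points, normals: H x W x 3 arrays; returns an H x W label image (-1 = unassigned)."""
    h, w, _ = points.shape
    labels = -np.ones((h, w), dtype=int)
    next_label = 0
    for sv in range(h):
        for su in range(w):
            if labels[sv, su] != -1 or np.any(np.isnan(normals[sv, su])):
                continue
            # start a new cluster and keep running means of plane normal and origin
            mean_n, mean_p, size = normals[sv, su].copy(), points[sv, su].copy(), 1
            labels[sv, su] = next_label
            queue = deque([(sv, su)])
            while queue:
                v, u = queue.popleft()
                for vv, uu in ((v - 1, u), (v + 1, u), (v, u - 1), (v, u + 1)):
                    if not (0 <= vv < h and 0 <= uu < w) or labels[vv, uu] != -1:
                        continue
                    p, n = points[vv, uu], normals[vv, uu]
                    if np.any(np.isnan(n)):
                        continue
                    z = p[2]
                    angle = np.arccos(np.clip(np.dot(mean_n, n), -1.0, 1.0))
                    dist = abs(np.dot(mean_n, p - mean_p))
                    # depth-dependent thresholds gamma_cl(z) and d_cl(z)
                    if angle <= eps_g * z + eps_c and dist <= omega_g * z + omega_c:
                        labels[vv, uu] = next_label
                        size += 1
                        mean_p += (p - mean_p) / size
                        mean_n += (n - mean_n) / size
                        mean_n /= np.linalg.norm(mean_n)
                        queue.append((vv, uu))
            next_label += 1
    return labels
```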
4.5.3 Experiments
Adjusting the parameters for pre-segmentation depends on the deployed RGB-D sensor, in our case a Microsoft Kinect. Modeling with linear functions is also suitable for other depth sensors. The parameters of the linear functions have to be adjusted such that the sensitivity threshold exceeds the sensor's noise level, but the environment is still modeled with maximum precision. For the following experiments all images from the object segmentation database (OSD) are used. An exhaustive evaluation of all six parameters for pre-segmentation is inefficient because of the huge parameter space. Instead, the parameters for normal calculation are derived from the measurements in Figures 4.4 and 4.5. The kernel size kr(z) and the slope κg for normal calculation are therefore set to:

• kr(z): piecewise constant over depth, starting with kr = 3 for the closest interval and increasing at depths of 2.0, 2.5, 3.0 and 3.5 m (the selected values are marked red in Figure 4.4)
• κg = 0.005125
Figure 4.6: Assignment of pre-segmented patches: ground truth, segmented patches P1 and P2, and resulting true positives tp, false positives fp and false negatives fn.

To optimally adjust the remaining four parameters for normals clustering, all learn and test sets of the object segmentation database with the associated ground truth annotation are used. While a correctly pre-segmented patch covers only one object or only background, an incorrectly pre-segmented patch may cover background and objects, or different objects. Consequently, the quality of pre-segmentation is judged by counting the correctly and wrongly assigned pixels of all pre-segmented patches, after assigning each patch to one certain object in the image, see Figure 4.6. Defining true positives tp as correctly assigned pixels, false positives fp as wrongly assigned pixels and false negatives fn as unassigned pixels of the object's ground truth, over-segmentation Fos and under-segmentation Fus are defined as:

Fos = fn / (tp + fn)        (4.4)

Fus = fp / (tp + fn)        (4.5)
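Assuming label images for the ground truth and for the patch-to-object assignment, the two measures can be computed as in the following sketch (a simplified reading of the evaluation protocol, not the original evaluation code):

```python
import numpy as np

def over_under_segmentation(gt, assigned):
    """F_os and F_us of Eqs. (4.4) and (4.5).
    gt:       ground-truth object label per pixel (0 = background)
    assigned: object label each pixel received after assigning every
              pre-segmented patch to one object (0 = background/unassigned)"""
    tp = np.sum((gt > 0) & (assigned == gt))        # correctly assigned object pixels
    fp = np.sum((assigned > 0) & (assigned != gt))  # wrongly assigned pixels
    fn = np.sum((gt > 0) & (assigned != gt))        # object pixels not assigned to their object
    f_os = fn / float(tp + fn)
    f_us = fp / float(tp + fn)
    return f_os, f_us
```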
Note that with the assignment of each pre-segmented surface patch to an object based on the ground truth annotation, a fake object segmentation is produced. Hence, the resulting over- and under-segmentation values are not comparable with the final object segmentation results; they are only used to compare the quality of different pre-segmentations among each other. Equivalent to the normal calculation, the parameters for clustering are set according to the noise of the sensor. The gradients of the angle and distance functions are empirically set to the following values, while the offsets of the functions are adjusted during the subsequently following experiments to achieve optimal pre-segmentation results:

• εg = 0.1
• ωg = 0.015

When tuning the clustering parameters below the noise level of the sensor, the system tends to produce more and more (smaller) surface patches. This leads to more true positive and fewer false positive pixel assignments and hence to better values of over- and under-segmentation, but with the disadvantage of getting more clusters.
Figure 4.7: Pre-segmentation of table-top scenes from the OSD. Parameters of the left images: ωc = −0.004 and εc = 0.58; of the right images: ωc = −0.004 and εc = 0.52. See discussion in the text.

The goal is therefore to find an optimal trade-off between over- and under-segmentation and the number of produced surface patches. Figure 4.7 shows two examples of pre-segmentation with different sets of parameters for the left and right images. While the parameters of the left images are optimal for the image in the first row, the same parameters cause under-segmentation for the left image in the second row (see the blue patch). The parameters of the right images solve this under-segmentation problem of the second example, but cause over-segmentation for the example in the first row (see the patches at the edges of the top box). Over- and under-segmentation and the number of patches for different values of the parameters ωc and εc are shown in Table 4.2. A significant increase of the number of clustered patches can be observed when the parameters approach the noise level (e.g. εc = 0.46 or ωc = −0.006). The improvement of over- and under-segmentation is not proportional to the number of patches produced. When tuning the parameters in the other direction, the number of produced patches decreases, but at the cost of higher over- and under-segmentation. In the following experiments and in the evaluation section εc = 0.58 and ωc = −0.004 are used, which represents a fair trade-off between accuracy and the number of produced patches.
4.6 Parametrization and Model Selection
In the last section planar patches were extracted from raw RGB-D data by clustering neighboring pixels according to their surface normals. Parametrizing these patches with certain surface models allows for a more compact representation. Two parametric models are chosen: a plane model to represent simple planar patches and B-spline surfaces allowing
Table 4.2: Over-segmentation, under-segmentation and the number of produced patches after pre-segmentation for different values of ωc and εc (cell format: Fos / Fus / number of patches).

    ωc       εc = 0.46                εc = 0.50                εc = 0.54
  -0.000   4.797% / 1.902% / 4692   4.846% / 1.949% / 3990   4.897% / 2.001% / 3440
  -0.002   4.791% / 1.896% / 4765   4.837% / 1.941% / 4083   4.875% / 1.980% / 3540
  -0.004   4.787% / 1.891% / 4984   4.832% / 1.936% / 4299   4.864% / 1.969% / 3825
  -0.006   4.778% / 1.882% / 6230   4.820% / 1.924% / 5611   4.846% / 1.950% / 5122

    ωc       εc = 0.58                εc = 0.62                εc = 0.66
  -0.000   4.917% / 2.022% / 3046   4.939% / 2.044% / 2788   4.949% / 2.053% / 2582
  -0.002   4.903% / 2.007% / 3153   4.919% / 2.023% / 2900   4.929% / 2.033% / 2690
  -0.004   4.883% / 1.987% / 3460   4.895% / 1.999% / 3210   4.906% / 2.011% / 3020
  -0.006   4.860% / 1.964% / 4865   4.875% / 1.980% / 4654   4.887% / 1.991% / 4519
to represent curved surfaces of higher polynomial order. Note that B-splines could also represent planes and thus explicit plane models could be skipped, but planes are more efficient in terms of data size, fitting and processing (e.g. calculating intersections or distances).
4.6.1 Plane fitting
Although planes are just a special case of B-splines and could thus be estimated using the B-spline fitting procedure described in the next section, we chose a more direct approach for this simplest type of surface, because the iterative optimization in B-spline fitting is computationally expensive. To this end we use the linear least-squares implementation of the Point Cloud Library (PCL) [102] to obtain an optimally fitted plane model for each pre-segmented surface patch.
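A least-squares plane fit of this kind can be written compactly with an eigen decomposition of the patch covariance; the sketch below is a generic stand-in for the PCL routine and does not reproduce its API.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit to an (n, 3) array of patch points.
    Returns (normal, d, rmse) with the plane defined by normal . x + d = 0.
    The centroid plus eigen decomposition is equivalent to the linear
    least-squares fit and is numerically well behaved."""
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]                  # direction of smallest variance
    d = -np.dot(normal, centroid)
    rmse = np.sqrt(np.mean((points @ normal + d) ** 2))
    return normal, d, rmse
```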
4.6.2 B-spline fitting
A common way to model free-form curves and surfaces in computer-aided design (CAD), computer graphics and computer vision are B-splines and their generalization, the so-called Non-Uniform Rational B-Splines (NURBS). The reason for their popularity is their ability to represent all conic sections, i.e. circles, cylinders, ellipsoids, spheres and so forth. They are convenient to manipulate and possess useful mathematical properties, such as refinement through knot insertion, C^(p−1)-continuity for curves of degree p and the convex hull property. A good overview of the characteristics and strengths of NURBS is given in [24]. To reduce the number of parameters to be estimated we set the
weights of the control points to 1, that is, we actually fit B-spline surfaces. We assume that the reader is sufficiently familiar with the basic concepts of B-splines and refer to the well-known book by Piegl et al. [83]. We start from their mathematical definition of B-spline surfaces in Chapter 3.4:

S(ξ, η) = Σ_{i=1}^{u} Σ_{j=1}^{v} N_{i,d}(ξ) M_{j,d}(η) B_{i,j}        (4.6)
The basic idea of this formulation is to manipulate the B-spline surface S : R² → R³ of degree d by changing the entries of the control grid B. The (i, j)-element of the control grid is called control point B_{i,j} ∈ R³; it defines the B-spline surface in its region of influence, determined by the basis functions N_{i,d}(ξ) and M_{j,d}(η). (ξ, η) ∈ Ω are called parameters, defined on the domain Ω ⊂ R². Given a set of points p_k ∈ R³ with k = 1 ... n, we want to fit a B-spline surface S with u > d, v > d and d ≥ 1. A commonly used approach is to minimize the squared Euclidean shortest distance e_k from the points to the surface:

f = (1/2) Σ_{k=1}^{n} e_k + w_s f_s,    e_k = ||S(ξ_k, η_k) − p_k||²        (4.7)

For regularisation we use the weighted smoothing term w_s f_s to obtain a surface with minimal curvature:

f_s = ∫_Ω ||S''(ξ, η)||² dξ dη        (4.8)
The weight w_s strongly depends on the input data and its noise level; in our implementation it is set to w_s = 0.1. Minimizing the functional in Eq. (4.7) requires the parameters (ξ_k, η_k). We compute them by finding the closest point S(ξ_k, η_k) on the B-spline surface to p_k using Newton's method. The surface is initialized by performing a principal component analysis (PCA) on the pre-segmented surface patch, see Figure 4.8.
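The PCA initialisation can be sketched as follows; the 3 x 3 control grid size and the function name are assumptions for illustration only, and the subsequent iterative refinement of Eq. (4.7) is omitted.

```python
import numpy as np

def init_bspline_control_grid(points, grid_u=3, grid_v=3):
    """PCA-based initialisation of a B-spline control grid: the control points
    are placed on a regular grid in the principal plane of the patch.
    points: (n, 3) array of patch points."""
    centroid = points.mean(axis=0)
    # principal axes of the patch, sorted by descending variance
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    axes = vt                       # axes[0], axes[1] span the plane, axes[2] ~ normal
    # extent of the patch in the two in-plane directions
    proj = (points - centroid) @ axes[:2].T
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    us = np.linspace(lo[0], hi[0], grid_u)
    vs = np.linspace(lo[1], hi[1], grid_v)
    grid = np.array([[centroid + u * axes[0] + v * axes[1] for v in vs] for u in us])
    return grid                     # (grid_u, grid_v, 3), later refined by minimising Eq. (4.7)
```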
4.6.3 Model Selection
Neighboring patches are merged after parametrization if a joint parametric model fits the data better than the two individual models. To come to a decision, model selection with the Minimum Description Length (MDL) criterion [62] is used. The savings

S_H = N/A_m − κ_1 S_m − (κ_2/A_m) Σ_{k=1}^{N} (1 − p(f_k | H))        (4.9)

of the models of two neighboring patches, S_{H,i} and S_{H,j}, are compared with the savings S_{H,ij} of a model fitted to the merged patch, and in case S_{H,ij} is larger the individual patches are substituted. N is the number of data points explained by the hypothesis H, S_m are the costs for coding the different models, and p(f_k | H) is the probability that a data point f_k belongs to H.
Figure 4.8: Left: Initialisation of a B-spline surface (green) using PCA. Right: The surface is fitted to the point cloud (black) by minimizing the closest-point distances (red) (m = n = 3, p = 2, w_a = 1, w_r = 0.1).

The probability p(f_k | H) is modeled with a Gaussian error model. A_m is a normalization value representing the size of the merged patches, and κ_1 and κ_2 are constants weighting the different terms. Model selection is thereafter used in a merging procedure to optimally represent the data with a minimal set of parameters. First, the clustered patches are represented with planes or B-splines, depending on the savings computed with Equation (4.9). To account for the complexity of the surface model, S_m is set to the number of parameters of the model, i.e. three times the number of B-spline control points.

Algorithm 1: Modeling of surface patches
  Detect piecewise planar surface patches
  for i = 0 → number of patches do
    Fit B-spline to patch i
    Compute MDL savings S_{i,B-spline} and S_{i,plane}
    if S_{i,B-spline} > S_{i,plane} then
      Substitute the model H_{i,plane} with H_{i,B-spline}
    end if
  end for
  Create Euclidean neighborhood pairs P_{ij} for surface patches
  for k = 0 → number of neighbors P_{ij} do
    Greedily fit B-spline to neighboring patches P_{ij}
    Compute MDL savings S_{ij} of the merged patches
    if S_{ij} > S_i + S_j then
      Substitute the individual models H_i and H_j with the merged B-spline model H_{ij}
    end if
  end for
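The savings of Eq. (4.9) and the merging test of Algorithm 1 can be illustrated with the following sketch, which assumes that per-point fitting residuals are available for each hypothesis; the Gaussian error model and the constants follow the values given in the experiments of this section.

```python
import numpy as np

def mdl_savings(residuals, n_params, area, kappa1=0.003, kappa2=0.9, sigma2=0.003):
    """Savings S_H of Eq. (4.9) for one surface hypothesis.
    residuals: per-point fitting errors of the N points explained by H
    n_params:  number of model parameters S_m (3 x number of control points for B-splines)
    area:      normalisation value A_m (size of the patch)"""
    n = len(residuals)
    p = np.exp(-np.asarray(residuals) ** 2 / (2.0 * sigma2))   # Gaussian error model p(f_k | H)
    return n / area - kappa1 * n_params - kappa2 / area * np.sum(1.0 - p)

def merge_if_better(s_i, s_j, s_ij):
    """Eq. (4.10): substitute the two individual models if the merged model saves more."""
    return s_ij > s_i + s_j
```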
Figure 4.9: Model selection: original image; neighborhood network of pre-segmented patches; selection of neighboring patches; best combination of parametrized surface models (planes and B-splines).

Then the savings S_i and S_j of neighboring patches are compared to the savings S_{ij} of a model fitted to the merged patch, and in case

S_{ij} > S_i + S_j        (4.10)

the individual patches are substituted with the merged patch. Algorithm 1 summarizes the proposed surface modeling pipeline and an example is shown in Figure 4.9. After pre-segmentation, the neighborhood between patches is estimated (second image). Neighboring patch pairs are selected (third image) and are substituted by B-spline models if Model Selection determines higher savings for the merged B-spline model (fourth image).
4.6.4 Experiments
During the parametrization and model selection procedure, neighboring patches are merged if the merged model fits better than the two single models. To assess whether patches have been merged correctly, the same evaluation procedure as for pre-segmentation is used: patches are assigned to objects according to the ground truth annotation and, again, over- and under-segmentation as well as the number of surface patches are calculated and compared against the pre-segmentation results. A perfectly working merging algorithm reduces the number of surface patches while keeping over- and under-segmentation constant. This happens if patches from the same object get merged, while patches of different objects are not considered for merging. The weights for Model Selection have been estimated empirically. The weight of the model costs is set to κ_1 = 0.003, the weight for the error costs to κ_2 = 0.9 and the variance of the Gaussian error model is set to σ² = 0.003. Experimental evaluation shows that the number of patches decreases from 3825 to 2798 (26.85%), while over-segmentation increases insignificantly by 0.04% to 4.90% and under-segmentation also by 0.04% to 2.01%. Note that the reduction of patches through merging is highly dependent on the used images, because merging usually happens on curved surfaces, when planar pre-segmented patches get connected to one bigger B-spline surface. However, the results show that typically neighboring patches of the same object get merged, while neighboring patches of different objects stay apart.
4.7 Parametric Surface Grouping
Pixels have been clustered into planar surface patches at the signal level and these patches have been merged and parametrized at the primitive level. After processing of the first two data abstraction levels the whole image is separated into parametric surface patches which are now available for further grouping. At the structural and assembly level the estimated parametric surfaces are grouped into enclosed entities which will finally represent object hypotheses. While feature vectors most often describe the properties of individual surface patches and are used to connect patches with similar properties, in our approach feature vectors describe the relationship between pairs of surface patches. A feature vector is defined by several relations between surface patches. The relations are based on the principles of perceptual grouping, which act as generic rules for composing objects by grouping surface patches. The grouping rules are not defined by setting thresholds; instead they are learned by training a support vector machine (SVM) classifier with the calculated feature vectors. Processing at the structural level is responsible for grouping neighboring surface patches and processing at the assembly level is responsible for grouping non-neighboring surface patches. The split into two different abstraction levels has two reasons. One reason is the different relations which can be used for neighboring and non-neighboring surface patches: neighboring surface patches have, e.g., local relations on the boundary between the surfaces, which are by definition not available for non-neighboring surface patches. Using different relations leads to different feature vectors at the structural and assembly level and therefore requires two support vector machines for classification. The second reason is the different importance of a relation, depending on whether it is used to assign neighboring or non-neighboring surface patches. In the experiments section the different importance of the same relation for neighboring and non-neighboring patches is shown (e.g. for color similarity or texture similarity). In the following, the relations of the structural and the assembly level and the definition of the constructed feature vectors are introduced. Afterwards SVM classification is explained, before the section ends with an experimental evaluation of the relations and a discussion of the results.
4.7.1 Relations at the structural level
Relations at the structural level are the basis for learning grouping principles for neighboring surface patches. When using only neighboring surface patch pairs it is possible to investigate local properties, utilizing the border between two surface patches. The relations based on local properties on the border, together with the global relations based on global surface properties, form the feature vector of neighboring patch pairs. In the following, the globally and locally defined relations and their dependency on the perceptual grouping rules are explained in more detail.
Similarity of patch color rco – The following relations are all inferred from the similarity principle, which can be implemented in many different ways. Similarity of patch color rco is implemented by comparing 3D histograms in the YUV color space. Each histogram is constructed with four bins per dimension, leading to 64 bins in the three-dimensional array. The Fidelity distance, also known as Bhattacharyya coefficient, d_Fid = Σ_i √(P_i · Q_i), is then calculated to obtain a single color similarity value between two surface patches.

Similarity of patch size rrs – Similarity of patch size rrs is also based on the similarity principle and is defined as the relative difference of the patch area (number of pixels) of two neighboring surface patches.

Similarity of texture quantity rtr – Relations comparing texture similarity are realized in three different ways: as difference of texture quantity rtr, as Gabor filter match rga and as Fourier filter match rfo. The texture quantity is the ratio of Canny edge pixels on a surface patch to all pixels of the surface patch. The texture quantity relation rtr is then the relative difference of these values for two surface patches.

Similarity of texture: Gabor filter rga – The Gabor filter and the subsequently described Fourier filter are implemented as proposed in [129]. For the Gabor filter six different orientations (in 30° steps) with five different kernel sizes (17, 21, 25, 31, 37) are used. A feature vector g with 60 values is built from the mean and the standard deviation of each filter response. The Gabor filter match rga is then the minimum difference between these two vectors,

d(g_1, g_2) = min_{k=0..5} Σ_{i=1}^{60} √((µ_{1,i} − µ_{2,i+10k})² + (σ_{1,i} − σ_{2,i+10k})²),

where one feature vector is shifted such that different orientations of the Gabor filter values are matched against each other. This guarantees a certain level of rotation invariance for the filter.

Similarity of texture: Fourier filter rfo – The Fourier filter match rfo is calculated as the Fidelity distance of five histograms, each consisting of eight bins filled with the normalized absolute values of the first five coefficients of the Discrete Fourier Transform (DFT).

Similarity of color on patch border rco3 – The subsequent relations are derived from local properties along the common border of two patches, exploiting the relationship between 2D image space and 3D space. Color similarity rco3 is calculated along the 3D patch border of the surface patches. The following definition of 3D neighborhood is used for all following relations to calculate the 3D patch border: the 3D patch border consists of all neighboring pixels in 3D space. These points are efficiently estimated as a subset of all neighboring pixels in the 2D image space by checking their 3D neighborhood. Neighborhood of pixels in 3D space is defined relative to the distance to the camera. The maximum distance between points is defined as dmax = 0.007 · z, according to the measurements shown in Fig. 4.5. The maximum distance of 7 mm per meter absolute distance of the point to the camera also accounts for the quantization noise of the sensor, which increases with distance.
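As an example of how such a relation is computed, the color similarity rco could be implemented as in the following sketch; the YUV conversion and the extraction of the patch pixels are assumed to have happened beforehand.

```python
import numpy as np

def color_similarity(yuv_a, yuv_b, bins=4):
    """Similarity of patch color r_co: 3D YUV histograms with 4 bins per channel
    (64 bins in total), compared with the Fidelity (Bhattacharyya) coefficient
    d_Fid = sum_i sqrt(P_i * Q_i).
    yuv_a, yuv_b: (n, 3) arrays of YUV pixel values in [0, 1] of the two patches."""
    def normalized_histogram(yuv):
        h, _ = np.histogramdd(yuv, bins=(bins, bins, bins),
                              range=((0, 1), (0, 1), (0, 1)))
        return h.ravel() / h.sum()
    p, q = normalized_histogram(yuv_a), normalized_histogram(yuv_b)
    return float(np.sum(np.sqrt(p * q)))
```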
Figure 4.10: Curvature calculation on patch border. Left: Estimation of neighboring pixels. Right: Projection of the surface normals and curvature estimation (see text for details).
Mean curvature on patch border rcu3 – The mean curvature rcu3 between two patches is also calculated along the 3D patch border. Neighboring pixels of the patches are estimated in the 2D image space (4-neighborhood), indicated by short black lines in the left image of Fig. 4.10. The surface normals of each point pair are projected onto a plane constructed from the two neighboring points P1 and P2 and the camera position C. The curvature between the normals n1 and n2 is calculated as the sum of α and β.

Variance of curvature on patch border rcv3 – The variance of curvature rcv3 is calculated simultaneously with the mean value. Mean and variance of curvature represent relations inferred from a mixture of continuity and closure. Closure is a principle which is usually defined in 2D space, but in 3D space closure can also be interpreted as compactness.

Mean depth on patch border rdi2 – The mean depth rdi2 is defined on the 2D border of neighboring patches and describes, together with the variance of depth, the spatial separation of neighboring patches. Note that even if the 2D border is used to estimate these relations, the patches have to be neighbors in 3D space as well, otherwise they would be considered at the assembly level and not at the structural level.

Variance of depth on patch border rvd2 – The variance of depth rvd2 is calculated together with the mean depth along the 2D patch border in the image space. The two relations are inferred from the continuity principle as well as from the proximity principle.

3D-2D boundary ratio r3d2 – The 3D-2D boundary ratio describes the ratio of the 3D to the 2D boundary length (r3d2 = l3D / l2D) between two surface patches. This relation cannot be assigned to a single grouping principle; instead it corresponds again to a mixture of principles, such as continuity, good form and proximity.
From all introduced relations one feature vector is constructed, which is subsequently used to train a support vector machine (SVM), described in more detail later in this section:

rst = {rco, rrs, rtr, rga, rfo, rco3, rcu3, rcv3, rdi2, rvd2, r2d3}        (4.11)

Feature vectors rst are calculated between all combinations of neighboring parametric surfaces in 3D space. These vectors are then classified as indicating patches belonging to the same object or to different objects by the support vector machine classifier, which we train during a learning period. When using just the SVM classification of the structural level, a majority of the objects in the OSD can already be segmented, because they are compact and mostly not occluded (neither partly nor self-occluded). The problem of handling occlusion, i.e. grouping spatially separated structures, is tackled by the introduction of the assembly level.
4.7.2 Relations at the assembly level
The assembly level is the last grouping level and is responsible for grouping spatially separated surface groupings. Similar to the structural level, relations between pairs of surface patches are estimated, which are derived from the perceptual grouping principles. The first five relations are equal to the relations used at the structural level and are again based on the similarity between patches; the reader is referred to the definitions in the last section. The remaining relations, which complete the feature vector of the assembly level, are introduced in the following.

Minimum distance between patches rmd – The minimum distance between patches rmd is inferred from the proximity principle. At the structural level this principle was given implicitly, since only neighboring surface patches were considered for building patch pairs. At the assembly level the principle is explicitly considered as a relation between non-neighboring patches.

Similarity of mean surface normal direction rnm – The difference of the mean value of the normal directions rnm represents the similarity of orientation of two surface patches.

Similarity of variance of surface normal direction rnv – The difference of the variance of the surface normals rnv roughly indicates the similarity of the shape of two surface patches.

Difference of normal direction at the nearest contour points rac – For this and the next relation the nearest twenty percent of contour (boundary) points of the two patches are estimated. rac compares the mean direction of the normals on the contours of the surface patches. Together with rdn this relation describes the continuity of patches at their common border.
Mean distance in normal direction from the nearest contour points rdn – Again the nearest twenty percent of contour points are used, here to calculate the mean distance between the contours in normal direction. Together with rac, a description of the continuity principle is established for the border of surface patches.

The last five relations at the assembly level are based on the boundaries of the surfaces. We partly reuse the framework introduced in Section 3 and in [95, 92, 141], but instead of edges from the color image, edges from the estimated 3D surface patches are used as primitives at the signal level. The contours of the surfaces are projected into the image space and lines are fitted to the contour image. With the concept of search lines, intersections between line segments from different boundaries can be found and categorized as L-junctions or as collinearities. Shortest path search is then used to find closures in the image.

Collinearity continuity rcs – Collinearities are estimated between contours in the 2D image space, which have been abstracted to lines, by finding intersections of extended search lines. The collinearity continuity rcs is estimated as the sum of the angle and the normal distance between the two lines, both calculated in 3D space. If more than one collinearity between two surface patches can be found, the collinearity with the lowest value is chosen.

Mean collinearity occlusion roc – Because the assembly level processes non-neighboring surface patches, there is always a gap between the end-points of collinearities. Hence, a hypothesized line can be constructed between the end-points of the two lines. The relation roc measures the mean depth between the line points and the points of the point cloud. A positive value of roc indicates a possible occlusion of the gap between the non-neighboring surface patches by other surfaces.

Closure line support rls – The remaining boundary relations are based on closures (closed convex contours), found with shortest path search when considering the L-junctions and collinearities as connections between lines. The line support is the ratio of the line length on the two patches to the overall line length of the closure. This relation is based on the closure principle as well as on the proximity principle.

Closure area support ras – The closure area support describes the ratio between the area of the closure and the summed area of the two surface patches. The smaller the gap between the surfaces, the higher the similarity of the two areas.

Closure line gaps rgl – The gap-to-lines relation is the ratio between the gaps and the lines of a closure. The gaps between the lines result from the extension of the search lines and bridge missing contour boundaries when constructing closures. This relation is once more based on the proximity of surfaces.
Similar to the structural level, a representative feature vector is defined from the introduced relations. The feature vector describes the relation between non-neighboring surface patches from different surface groupings of the structural level. The feature vector is again used to train an SVM classifier. After a training period this classifier provides a probability value indicating that two patches belong to the same object, but now for all pairs of non-neighboring surface patches:

ras = {rco, rrs, rtr, rga, rfo, rmd, rnm, rnv, rac, rdn, rcs, roc, rls, ras, rgl}        (4.12)
Depending on the number of surface patches in the groupings, there are several probability values between two groupings of the structural level, and optimal object hypotheses cannot be created by simply thresholding these values. Instead, a globally optimal solution has to be found.
4.7.3 Support Vector Machine (SVM) Classification
After calculating the relations and building the feature vectors, one has to decide whether two surface patches belong together or not. This decision is based on the relations in the feature vector. Setting thresholds for classification becomes more complex the more relations are used and would not be manually adjustable any more. A solution to this problem lies in learning the grouping rules with a method which maps feature vectors to single decision values. Learning algorithms which may be used are, e.g., support vector machines, neural networks (multi-layer perceptrons), k-nearest neighbors, boosting or random forests. Support vector machines have some advantages over the other algorithms. Compared to k-nearest neighbors, support vector machines are computationally expensive during training, but prediction is computationally efficient. Simple neural network classifiers are linear and may fail on non-linearly separable problems, while SVMs support non-linear classification through kernels. The main reason, however, is the output of probability estimates instead of binary decision values, which we exploit in the final global decision making (see Section 4.8).

Support vector machines are maximum margin classifiers. The idea is to find a separating hyperplane (defined by a normal vector ω and a bias b) between the different classes in the data with maximum margin. The margin is the distance from the hyperplane to the nearest data points (support vectors). For training vectors x_k ∈ Rⁿ, k = 1, ..., m, given in two classes, and a vector of labels y ∈ Rᵐ such that y_k ∈ {1, −1}, an SVM solves the quadratic optimization problem:

min_{ω,b,ξ} (1/2) ωᵀω + C Σ_{k=1}^{m} ξ_k
subject to  y_k(ω · x_k + b) ≥ 1 − ξ_k,  ξ_k ≥ 0,  k = 1, ..., m        (4.13)

C is a penalty parameter on the training error and the slack variable ξ_k measures the degree of misclassification of the data point x_k.
Figure 4.11: Annotation example: original image with two stacked boxes, annotation for training at the structural level, and annotation for training at the assembly level.

SVMs support non-linear classification by using the kernel trick. Input data is mapped from a general set S into an inner product space V of higher dimension than the input space, in the hope that the data gains meaningful linear structure there. The function φ(x_k) maps the training data x_k to the higher-dimensional space. A kernel function which can be expressed as an inner product of the form k(x, x′) = φ(x)ᵀφ(x′) may then be used. For all experiments in the rest of this work the radial basis function (RBF) is used as kernel:

K(x, x′) = e^(−γ ||x − x′||²)        (4.14)
The decision function (predictor) is then for any testing instance x: f (x) = sgn(ω·φ(x)+b). When using the RBF kernel, C and γ are determined in the SVM model.
4.7.4 Learning and Testing
The classical output of a two-class SVM is a binary decision value voting for one of the two classes. It is, however, also possible to provide not only a binary decision same or not-same for each feature vector r, but also a probability value p(same | r) for each decision, based on the theory introduced by Wu et al. [137]. Before training and testing, the training and testing data has to be scaled to the range [−1, 1] for the SVM classification. One reason is to avoid domination of attributes with a greater numeric range over those with a smaller numeric range. Another reason is to avoid numerical overflow during processing, because the kernel values depend on the inner products of feature vectors. The SVMs at the structural as well as at the assembly level are trained in an offline phase. For training of the SVM at the structural level the feature vectors rst are used, and for training at the assembly level the feature vectors ras. Note that training at the structural level and at the assembly level differs in terms of ground truth annotation. For training at the structural level all objects are annotated as ground truth, see the middle image of Figure 4.11. Feature vectors of patch pairs from the same object represent positive training examples and vectors of pairs from different objects or from objects and background
represent negative examples. With this strategy, not only the affiliation of patches to the same object, but also the disparity of object patches to other objects or background is learned. For training at the assembly level all smooth surfaces are annotated as ground truth, even if they are partly occluded, see the right image of Figure 4.11. Positive training examples are patch pairs from the same surface. Negative training examples are patch pairs from different surfaces, and patch pairs with one surface on the object and one surface in the background.
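A possible realisation of this training stage is sketched below using scikit-learn's SVC, which wraps libsvm and offers the same C-SVC solver with RBF kernel and probability estimates; the file names for the training data are placeholders, not part of the original implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data: one row per surface-patch pair, columns are the
# relations of Eq. (4.11) or (4.12); labels are 1 (same object) / -1 (not same).
X_train = np.load("relations_train.npy")      # placeholder file name
y_train = np.load("labels_train.npy")         # placeholder file name

# Scale every relation to [-1, 1] so no relation dominates the RBF kernel.
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)

# C-SVC with RBF kernel; probability=True yields p(same | r) via probability
# estimates in the spirit of Wu et al. instead of a plain binary decision.
svm = SVC(kernel="rbf", C=1.0, gamma=1.0 / X_train.shape[1], probability=True)
svm.fit(scaler.transform(X_train), y_train)

def p_same(feature_vector):
    """Probability that the two patches of this pair belong to the same object."""
    x = scaler.transform(np.asarray(feature_vector).reshape(1, -1))
    return svm.predict_proba(x)[0, list(svm.classes_).index(1)]
```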
4.7.5 Experiments
For the offline training and the online testing phase we use the freely available libsvm package [19]. As kernel we use the radial basis function and as solver C-support vector classification (C-SVC) with C = 1, γ = 1/n and n = 9. The introduced relations are evaluated by calculating the Fscore of each relation. The Fscore measures the discrimination between two sets of real numbers and is usually used for feature selection, see Chen et al. [21]. Given training vectors r_k, k = 1, ..., m, and the numbers of positive and negative instances n_+ and n_-, respectively, the Fscore of the i-th feature is defined as:

Fscore(i) = [ (r̄_i(+) − r̄_i)² + (r̄_i(−) − r̄_i)² ] / [ (1/(n_+ − 1)) Σ_{k=1}^{n_+} (r_{k,i}(+) − r̄_i(+))² + (1/(n_- − 1)) Σ_{k=1}^{n_-} (r_{k,i}(−) − r̄_i(−))² ]        (4.15)
where r̄_i, r̄_i(+) and r̄_i(−) are the averages of the i-th feature over the whole, the positive and the negative data set, respectively, r_{k,i}(+) is the i-th feature of the k-th positive instance and r_{k,i}(−) is the i-th feature of the k-th negative instance. The numerator indicates the discrimination between the positive and negative sets, and the denominator indicates the discrimination within each of the two sets. The larger the Fscore, the more discriminative this feature is likely to be; unfortunately, the Fscore does not reveal mutual information among features. Proximity is implicitly implemented through the split of the surface grouping procedure into a structural and an assembly level. While the structural level is responsible for connecting neighboring patches, the assembly level is responsible for connecting spatially separated surface patches. The likelihood that two neighboring patches belong to the same object is far higher than for two non-neighboring patches. Hence, a quantitative analysis of the SVM classification for each relation is done by measuring the Balanced Error Rate (BER). The BER is the average of the error rates on the positive and negative class labels and is therefore more meaningful than the True Positive Rate (TPR) in the case of an unbalanced distribution of positive and negative examples. For the following experiments a single relation is always used as feature vector to investigate the relevance of each relation for classification. The balanced error rate is computed from the true positive tp, false positive fp, true negative tn and false negative fn decisions of the SVM (we assume a decision boundary at 0.5):

BERsvm = (1/2) · ( fp / (tp + fp) + fn / (tn + fn) )        (4.16)
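Both measures are straightforward to compute; the following sketch assumes that the per-relation values of the positive and negative pairs and the confusion counts of the SVM are already available.

```python
import numpy as np

def f_score(pos, neg):
    """F_score of Eq. (4.15) for one relation.
    pos, neg: 1D arrays with the values of this relation for the positive
    (same object) and negative patch-pair examples."""
    r_all, r_p, r_n = np.concatenate([pos, neg]).mean(), pos.mean(), neg.mean()
    numerator = (r_p - r_all) ** 2 + (r_n - r_all) ** 2
    # sample variances correspond to (1/(n-1)) * sum of squared deviations
    denominator = pos.var(ddof=1) + neg.var(ddof=1)
    return numerator / denominator

def balanced_error_rate(tp, fp, tn, fn):
    """BER of Eq. (4.16), computed from the SVM decisions at a 0.5 boundary."""
    return 0.5 * (fp / float(tp + fp) + fn / float(tn + fn))
```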
Table 4.3: Fscore and balanced error rate (BER) on the training set of the OSD-0.2 database [87] for the structural level.

            rco     rrs     rtr     rga     rfo     rco3    rcu3    rcv3
  Fscore    0.163   0.185   0.094   0.183   0.206   0.191   1.418   0.001
  BER       37.5%   38.7%   43.1%   40.1%   43.2%   38.4%   19.6%   49.9%

            rdi2    rvd2    r2d3    ALL
  Fscore    0.453   0.687   0.342   -
  BER       27.9%   27.3%   27.0%   16.7%

Table 4.4: Fscore and balanced error rate (BER) on the training set of the OSD-0.2 database [87] for the assembly level.

            rco       rrs       rtr       rga       rfo       rmd       rnm       rnv
  Fscore    30.5e-3   8.40e-3   3.33e-3   9.08e-3   7.91e-3   19.7e-3   20.4e-3   1.03e-3
  BER       50.0%     50.0%     50.0%     50.0%     50.0%     50.0%     50.0%     50.0%

            rac       rdn       rcs       roc       rls       ras       rgl       ALL
  Fscore    6.22e-3   15.8e-3   40.4e-3   15.5e-3   6.21e-3   8.19e-3   5.72e-3   -
  BER       50.0%     50.0%     50.0%     50.0%     50.0%     50.0%     50.0%     48.75%

The Fscore is calculated from all training images and the BER from all test sets of the OSD-0.2. (The results are not comparable with the results in [94, 89, 90], because of the changed pre-segmentation procedure and the evaluation on the newer version 0.2 of the OSD.) For the structural level the prior probability of a positive classification is 0.575 (and hence 0.425 for a negative classification). Table 4.3 shows the evaluation results at the structural level. It can be seen that a higher Fscore typically indicates fewer wrong decisions of the SVM, resulting in a lower BER. For the assembly level the prior probability of positive examples is 0.0545 (and hence 0.9455 for negative examples). Table 4.4 shows that the SVM classification always decides negative when using a single relation in the feature vector, due to this low prior. This results in a balanced error rate of 50%. When using all relations, the SVM starts to classify into both the negative and the positive class. The classification results still seem weak (48.75%), but later in this chapter we will see the positive influence on the overall segmentation results.
4.8 Global decision making
After the SVM classification at the structural and assembly level, some probability estimates may contradict each other when trying to form object hypotheses. Figure 4.12 shows a segmentation example with a cereal box on a table (crop of the scene). The second image in Fig. 4.12 shows the segmentation result after the primitive level: the scene is split into four surface patches. Patch 3 is remarkable, because this small patch in the corner of the cereal box emerges solely due to noise of the sensor.
Figure 4.12: Graph-cut segmentation example: crop of the original image, pre-segmented surface patches, constructed graph, and correctly segmented image despite a wrong binary SVM classification (see text for a detailed discussion).

If the SVM classification produces the following classification probabilities and therewith the following binary decisions (decision boundary = 0.5),

• p01 = p(same | r01) = 0.013 ⇒ FALSE
• p02 = p(same | r02) = 0.002 ⇒ FALSE
• p03 = p(same | r03) = 0.512 ⇒ TRUE (wrong classification)
• p12 = p(same | r12) = 0.964 ⇒ TRUE
• p13 = p(same | r13) = 0.521 ⇒ TRUE
• p23 = p(same | r23) = 0.008 ⇒ FALSE
then segmentation would end with maximum under-segmentation: all surface patches would be connected to one object hypothesis, including the table plane. To overcome such vague or wrong local predictions from the SVMs at the structural and assembly level, a globally optimal solution has to be found. To this end we define a graph, where surface patches represent nodes and edges are weighted by the classification values of the SVMs, as shown in Figure 4.12, and employ a graph partitioning algorithm to find an optimal solution.

Several algorithms exist for graph partitioning in computer vision, mainly used for image segmentation in early vision. Zahn [139] presents a segmentation method based on the minimum spanning tree (MST) of a graph, where edge weights are based on the intensity of the pixels. The criterion used in his method is to break edges with large weights. A drawback of the method is the inadequate splitting of regions with high variability. The theory of graph cuts for computer vision was introduced by Greig et al. [42]. They show how the maximum a posteriori estimate of a binary image can be obtained exactly by maximizing the flow through an associated image network. Their method was primarily used for smoothing noisy or corrupted images. The cut criterion of Wu et al. [138] minimizes the similarity between pixels that are being split, but it has a tendency for finding small components. This issue was addressed by Shi et al. [117, 118] with the introduction of the normalized cut criterion, which takes the self-similarity of regions into account. In contrast to early methods, non-local properties of the image are considered for segmentation. The
The advantage of the normalized cut method comes at the cost of long runtimes, since the underlying optimization problem is NP-hard. All these graph-cut methods, which are based on energy minimization (maximum-flow/minimum-cut problem), are computationally NP-hard and suffer from long processing times. An efficient algorithm to segment images was introduced by Felzenszwalb and Huttenlocher [35]. Their graph-based image segmentation technique adaptively adjusts the segmentation criterion based on the degree of variability in neighboring regions of the image. This results in a method that, while making greedy decisions, can be shown to obey certain non-obvious global properties. Their technique merges image pixels into components such that the resulting segmentation is neither too coarse nor too fine. The algorithm by Felzenszwalb et al. is computationally efficient compared to the previously mentioned graph-based segmentation algorithms. For that reason we employ their method, using the probability values from the SVMs as pairwise energy terms between the surface patches to find the global optimum for object segmentation (wi,j = 1 − p(same|rij)). Initially each node is placed in its own component and the internal difference of a component Int(R) is defined as the largest weight in the minimum spanning tree of R. When considering edges in order by weight, each step of the algorithm merges components R1 and R2 connected by the current edge if the edge weight w1,2 satisfies

w1,2 < min(Int(R1) + τ(R1), Int(R2) + τ(R2))    (4.17)

where τ(R) = k/|R| and k is a scaling parameter used to set a preference for the component size. When using the probability values from the SVMs as weights, the scaling parameter is set to k = 0.5. For the example shown in Figure 4.12, the object is correctly segmented from the background when using the graph partitioning algorithm of Felzenszwalb et al., because the weak probabilities between the small patch and the neighboring patches lead to only one assignment of the small patch, either to the object or to the background (table). One could claim that the method by Felzenszwalb et al. could be used directly on the RGB-D data for segmentation, but experiments based on curvature and color (similar to Strom et al. [120]) have shown that such an approach strongly depends on low-noise sensor readings. Another limitation is the weakness of the algorithm when large textured areas are present in the image.
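As an illustration of how the SVM probability outputs drive this merging step, the following sketch re-implements a Felzenszwalb-style merging with the criterion of Eq. (4.17) on the toy example of Figure 4.12. It is a simplified stand-in, not the thesis implementation; the weights wi,j = 1 − p(same|rij) and k = 0.5 follow the description above, while all names and the final print statement are illustrative.

# Illustrative sketch: Felzenszwalb-style merging of surface patches with
# SVM probabilities as edge weights (patch indices as in Figure 4.12).
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.internal = [0.0] * n   # Int(R): largest weight merged so far
        self.size = [1] * n         # |R|

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

def segment(num_patches, probs, k=0.5):
    """probs: dict {(i, j): p(same|r_ij)} from the SVM stages."""
    ds = DisjointSet(num_patches)
    edges = sorted(((1.0 - p, i, j) for (i, j), p in probs.items()))
    for w, i, j in edges:                         # edges in ascending weight order
        a, b = ds.find(i), ds.find(j)
        if a == b:
            continue
        if w < min(ds.internal[a] + k / ds.size[a],
                   ds.internal[b] + k / ds.size[b]):   # Eq. (4.17)
            ds.parent[b] = a                      # merge the two components
            ds.size[a] += ds.size[b]
            ds.internal[a] = max(ds.internal[a], ds.internal[b], w)
    return [ds.find(i) for i in range(num_patches)]

# 0 = table, 1/2 = cereal-box faces, 3 = small noisy patch
probs = {(0, 1): 0.013, (0, 2): 0.002, (0, 3): 0.512,
         (1, 2): 0.964, (1, 3): 0.521, (2, 3): 0.008}
print(segment(4, probs))   # -> [0, 1, 1, 0]: the box faces stay one object,
                           #    the small patch 3 is assigned to only one side

With these weights the wrong local decision p03 = 0.512 cannot pull the whole scene into a single component, which mirrors the behavior described above.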
4.9 Evaluation
After all parts of our framework have been introduced, this section presents the experimental evaluation of the proposed object segmentation method. The proposed framework is evaluated on the previously introduced object segmentation database (OSD), see Section 4.4, as well as on the Willow Garage dataset², which was originally composed for the 'Solutions in Perception Challenge'³. Furthermore, a comparison with state-of-the-art methods in RGB-D image segmentation is presented in the second part of this section.

² http://vault.willowgarage.com/wgdata1/vol1/solutions in perception/Willow Final Test Set/
³ http://opencv.willowgarage.com/wiki/SolutionsInPerceptionChallenge
4.9.1 Evaluation on the object segmentation database (OSD)

Table 4.5: Results on the OSD database [87] for the structural level.

                 Fscore   BERsvm   P        R        P*       R*
rst = {rco}      0.163    37.5%    19.89%   92.38%   89.97%   93.90%
rst = {rrs}      0.185    38.7%    18.11%   92.39%   87.27%   93.80%
rst = {rtr}      0.094    43.1%    22.99%   93.56%   90.76%   93.89%
rst = {rga}      0.183    40.1%    25.01%   93.55%   90.85%   93.90%
rst = {rfo}      0.206    43.2%    39.40%   93.94%   90.33%   93.88%
rst = {rco3}     0.191    38.4%    33.33%   89.04%   91.40%   93.89%
rst = {rcu3}     1.418    19.6%    81.81%   94.05%   63.71%   93.61%
rst = {rcv3}     0.001    49.9%     5.83%   97.53%   91.29%   93.82%
rst = {rdi2}     0.453    27.9%    27.21%   93.83%   91.29%   93.81%
rst = {rvd2}     0.687    27.3%    26.10%   94.01%   90.71%   93.80%
rst = {r2d3}     0.342    27.0%    33.35%   93.12%   90.70%   93.70%
rst (all)        -        16.7%    90.85%   93.88%   -        -
Table 4.5 shows evaluation results of the structural level. Each row shows results when a single relation is used to construct the feature vector. The last row of Tab. 4.5 finally shows results when all relations are used. This evaluation reveals the importance of each introduced relation for the whole segmentation algorithm. The first two columns again show the F-score and the balanced error rate BERsvm (see Tab. 4.3) of the SVM classification to show their influence on the segmentation results. The following two columns show precision P and recall R of segmentation summed up over the whole test set, and P* and R* finally show precision and recall when using the whole feature vector rst without considering the given relation. It can be seen that a higher F-score typically leads to fewer wrong decisions of the SVM, which results in a lower BER. The balanced error rate is significantly lower when using all relations (see last row) than when using a single relation. The lower the BER, the higher are precision P and recall R of the whole object segmentation algorithm. P* and R* finally show the importance of each relation to the overall performance of the segmentation framework, when comparing the numbers with the results of the complete feature vector, shown in the last row of the table.

Table 4.6: Results on the OSD database [87] for the assembly level.

                     Fscore     BERsvm    P        R        P*       R*
rst, ras = {rco}     30.5e-3    50.0%     90.85%   93.88%   87.16%   94.49%
rst, ras = {rrs}     8.40e-3    50.0%     90.85%   93.88%   86.89%   94.52%
rst, ras = {rtr}     3.34e-3    50.0%     90.85%   93.88%   90.19%   94.58%
rst, ras = {rga}     9.08e-3    50.0%     90.85%   93.88%   83.82%   95.00%
rst, ras = {rfo}     7.91e-3    50.0%     90.85%   93.88%   90.93%   94.87%
rst, ras = {rmd}     19.7e-3    50.0%     90.85%   93.88%   92.33%   94.46%
rst, ras = {rnm}     20.4e-3    50.0%     90.85%   93.88%   91.08%   94.21%
rst, ras = {rnv}     1.03e-3    50.0%     90.85%   93.88%   86.19%   95.21%
rst, ras = {rac}     6.22e-3    50.0%     90.85%   93.88%   93.78%   94.13%
rst, ras = {rdn}     15.9e-3    50.0%     90.85%   93.88%   83.52%   94.42%
rst, ras = {rcs}     40.4e-3    50.0%     90.85%   93.88%   89.83%   94.81%
rst, ras = {rod}     15.5e-3    50.0%     90.85%   93.88%   83.98%   94.76%
rst, ras = {rls}     6.21e-3    50.0%     90.85%   93.88%   83.49%   94.85%
rst, ras = {ras}     8.19e-3    50.0%     90.85%   93.88%   82.76%   94.60%
rst, ras = {rgl}     5.72e-3    50.0%     90.85%   93.88%   59.97%   67.97%
rst + ras            -          48.75%    89.95%   95.00%   -        -

Table 4.6 shows the same evaluation as Tab. 4.5, but for relations of the assembly level (when using them in addition to the structural level). A comparison of the F-scores with the results of Tab. 4.5 shows significantly lower values for relations of the assembly level,
indicating a lower discrimination compared to the relations of the structural level. Hence, neighboring patches can be connected correctly more easily than non-neighboring patches, which also indicates the strength of the proximity principle, implemented implicitly in the hierarchical framework structure by only considering neighboring patches at the structural level and non-neighboring patches at the assembly level. It is noticeable that a single relation in the feature vector of the assembly level never leads to a decision that two non-neighboring patches belong together. The low prior probability (0.0545) of positive decisions leads to permanently negative decisions of the SVMas. This is shown by the balanced error rate of 50.0% for the SVMas and also by the identical values of precision and recall across these rows. When using more than one relation for ras, the SVMas starts to sometimes classify positive and therefore starts to assign non-neighboring patches to each other. When using all relations, this finally leads to the overall results shown in the last row of Tab. 4.6. Usage of the assembly level leads to better recall R, because partially occluded and non-compact object shapes may now be segmented correctly, but the chance of wrongly connecting surface patches increases, which leads to a lower precision P. The decision whether to use the assembly level is left to the user, who decides which error is more important for a certain application.
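For reference, the following is a hedged sketch of how per-object precision and recall, as reported in Tables 4.5 to 4.7, could be computed from label images. The exact matching rule is defined earlier in the thesis and is not restated in this section, so the sketch assumes the common convention that each ground-truth object is matched to the predicted segment with the largest pixel overlap; names and the background convention are illustrative.

# Hedged sketch: per-object precision/recall from ground-truth and predicted
# label images (label 0 is assumed to be background).
import numpy as np

def per_object_precision_recall(gt_labels, pred_labels):
    """gt_labels, pred_labels: integer label images of the same shape."""
    results = []
    for obj in np.unique(gt_labels):
        if obj == 0:                               # skip background
            continue
        gt_mask = (gt_labels == obj)
        overlapping = pred_labels[gt_mask]
        overlapping = overlapping[overlapping != 0]
        if overlapping.size == 0:                  # object completely missed
            results.append((obj, 0.0, 0.0))
            continue
        best = np.bincount(overlapping).argmax()   # best-overlapping segment
        pred_mask = (pred_labels == best)
        tp = np.logical_and(gt_mask, pred_mask).sum()
        results.append((obj, tp / pred_mask.sum(), tp / gt_mask.sum()))
    return results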
Table 4.7: Precision and recall on the OSD and Willow Garage dataset for the approach by Mishra et al. [71], Ückermann et al. [126] and for our approach, when using the SVM of the structural level SVMst and when using both data abstraction levels SVMst+as.

             Mishra            Ückermann         SVMst             SVMst+as
             P        R        P        R        P        R        P        R
Boxes        76.87%   75.86%   97.12%   94.72%   96.47%   97.91%   96.47%   97.91%
Stacked      70.57%   74.61%   95.61%   93.26%   86.70%   96.23%   86.72%   97.54%
Occluded     67.37%   55.81%   94.53%   74.76%   94.18%   78.23%   94.00%   91.62%
Cylindric    69.81%   87.38%   96.47%   92.50%   96.21%   97.11%   87.35%   97.71%
Mixed        62.99%   76.29%   95.27%   93.42%   91.21%   95.90%   91.21%   95.90%
Complex      61.06%   54.61%   93.14%   83.49%   87.50%   91.49%   86.78%   92.09%
OSD          66.10%   67.91%   94.91%   88.79%   90.85%   93.88%   89.95%   95.00%
Willow       77.51%   83.82%   98.69%   98.83%   98.11%   98.82%   98.10%   98.81%

4.9.2 Comparison with state-of-the-art methods
We compare our object segmentation method with two state-of-the-art methods. The method of Mishra et al. [71] is an attention-driven active segmentation algorithm, designed to extract boundaries of (freestanding) simple objects. The recently published method by Ückermann et al. [126] is an edge-based segmentation approach, which uses pre-defined heuristics to arrive at object hypotheses.
Table 4.7 shows precision P and recall R of segmentation on the OSD for the algorithms of Mishra and Ückermann and for both of our methods, when using the support vector machine of the structural level SVMst or when using both data abstraction levels SVMst+as. In addition, all segmentation algorithms have been evaluated on the Willow Garage database, for which we provide the created ground truth data at [87]. Our system is trained with the four learning sets of the OSD for all experiments in Tab. 4.7, even for the evaluation on the Willow Garage dataset. Figure 4.13 shows in more detail the precision over recall for each segmented object from the OSD and from the Willow Garage dataset. These graphs give a qualitative overview and show the distribution of single-object segmentation results. Figure 4.14 presents seven selected examples from the OSD and Fig. 4.15 from the Willow Garage dataset. In contrast to the other segmentation results, object segments of Mishra are represented by boundaries, because their method allows overlapping object hypotheses. This is by definition not possible for the method of Ückermann and also not for our method.
The results in Tab. 4.7 show that our approach works significantly better than the approach by Mishra for all sets of the OSD as well as for the Willow Garage dataset. In contrast, the results of Ückermann are close to those of our approach. A closer look at the values shows a higher precision P, but at the same time a lower recall R. This indicates
that their approach avoids wrong assignments of surfaces, but at the cost of sometimes over-segmenting the objects. An example is shown in the first row of the example images of Fig. 4.14.
The benefit of using the assembly level can be seen for the occluded object set of the OSD. Recall is much higher when additionally using the assembly level, while precision remains almost constant at a high level. This demonstrates that occluded parts have been connected without wrongly assigning surface patches, which can be seen in the examples of the second and fifth row of Fig. 4.14. When scenes become more complex, precision decreases, because the SVM of the assembly level connects non-neighboring surface patches of different objects and hence produces under-segmentation. However, the benefit of the assembly level when having partly occluded or self-occluded objects is significant. Another advantage of processing at the assembly level can be seen in the third and seventh row of Fig. 4.14. Self-occluding objects, such as mugs and bowls, get split into two separate parts. These examples can be solved by our system, even though some of them are tough (see image row six in Fig. 4.14).

Figure 4.13: Precision-Recall for each segmented object. (a-d) with Mishra's [71] approach, (e-h) with Ückermann's [126] approach, (i-l) with our approach. Plots a, e and i show results from the OSD dataset and b, f and j show in more detail the upper right corner of these plots. Plots c, g and k show results from the Willow Garage database and d, h and l again show the upper right corner of c, g and k.
Figure 4.14: Examples from the OSD database. From left to right: original image, results of Mishra, results of Ückermann, and results of our approach (SVMst+as).
Figure 4.15: Examples from the Willow Garage database. From left to right: original image, results of Mishra, results of Ückermann, and results of our approach.
Evaluation of the method by Mishra on the Willow Garage dataset shows better performance compared to the evaluation on the OSD database, because of the reduced complexity of the scenes. Objects in the dataset are mainly freestanding on a ground plane and there are no occluded objects. Our approach also performs well on such examples (see Fig. 4.15), but the benefit of using the assembly level vanishes, because there are no partly occluded or self-occluded objects. For that reason, precision and recall of the segmentation remain almost unchanged when using the assembly level. However, the evaluation on the Willow Garage dataset demonstrates the generalization of our approach with respect to objects and scenes not seen during training, because our framework was trained with the OSD learning sets and therefore with different objects. This is evidence that perceptual grouping rules act in a generic manner and transfer to different situations with different types of objects.
4.10 Discussion
We presented a framework for segmenting novel objects in cluttered table-top scenes of RGB-D images. Raw input data is abstracted in a hierarchical framework by initially clustering pixels to surface patches at the primitive level. Parametric surface models are estimated from the surface clusters, represented as planes and B-spline surfaces, and Model Selection finds the combination which explains the input data best. At the structural and assembly level, relations between neighboring and non-neighboring surface patches are estimated, which we infer from Gestalt principles. Instead of matching geometric object models, more general perceptual grouping rules are learned with a SVM. With this approach we address the problem of segmenting objects when they are stacked, side by side or partially occluded, as shown in the previous section.
The presented object segmentation approach works well for many scenes with stacked or jumbled objects, but there are still open issues which are not yet handled or could be revised. A major limitation of our approach is the inability of the grouping approach to split wrongly pre-segmented surface patches. If objects are stacked or side by side and surface parts of different objects are aligned to one co-planar plane, pre-segmentation will wrongly detect a single planar patch, which the subsequent grouping procedure is not able to split. Such an example is shown in the first row of Fig. 4.16. Considering color as an additional cue during pre-segmentation would be one solution to overcome this issue. Another, rather obvious limitation of our approach is the resolution of the sensor, causing errors when objects or object parts cannot be abstracted to surfaces because of their size. This problem also includes falsely calculated relations when only minimal connections between surface patches are available. This occurs, e.g., for handles of mugs, as shown in Fig. 4.16.
The current implementation of relations delivers better segmentation results for convex objects compared to concave objects, due to the fact that concave objects may show self-occlusion, which leads to splitting of surface patches of the same object into non-neighboring patches.
Cylindrical objects such as mugs and bowls typically show this nicely, when the inner and outer part are decomposed into separate patches. Therefore, concave objects have to be treated similarly to occluded objects, but the evaluation results have shown that relations of the assembly level are far weaker, which causes more errors for these types of objects.
The recently published approach by Ückermann uses edge-based pre-segmentation without any modeling of the data. Pre-segmentation is implemented on the GPU and therefore runs in real time. Their method avoids separating small parts, such as handles, from objects and therefore does not face the problem of grouping these parts together. Their method delivers almost the same segmentation results, but the hierarchical data abstraction and parametrization in our method delivers parametrized object models at the output of the system. This is important for further processing, especially when data storage or multi-view reconstruction plays a role.
However, the presented grouping framework demonstrates that learning of perceptual grouping rules is a generic method which enables segmentation of previously unknown objects when data is initially abstracted to meaningful parts. The examples shown in Figure 4.15 demonstrate that the knowledge about the learned rules can be transferred to other object shapes, and the rotated camera pose shows that no prior assumptions about the camera pose are needed. Evaluation of the proposed framework has shown that the approach is promising due to the expandability of the relations in the framework. The proposed method is suitable for several indoor robotic tasks where identifying or grasping unknown objects plays a role.
Figure 4.16: Limitations of the approach. Original image, pre-segmented surface patches and final object segmentation with errors.
Chapter 5

Conclusion

Detection and segmentation of objects in images is by definition an ill-posed problem when no prior knowledge about the objects and the corresponding situation is available. The data abstraction methods and algorithms of this thesis were developed to detect unknown objects in cluttered environments by learning generic rules. Furthermore, object models are estimated by parametrization of surface patches as planes or B-spline surfaces. Learning generic perceptual grouping rules enables segmentation of objects, even if the shapes of the objects are previously unknown. We have shown that the trained system generalizes to other scenes.
To achieve wide dissemination of the proposed methods and algorithms, all software developments have been published on the internet. These comprise the Blocks World Robotic Vision Toolbox¹ (BLORT), the Object Segmentation Database² (OSD) and the RGB-D Object Segmentation³ algorithm. Feedback on these publications has shown that these applications are in demand and that the robotics research community welcomes such out-of-the-box software developments.

¹ http://www.acin.tuwien.ac.at/?id=290
² http://www.acin.tuwien.ac.at/?id=289
³ http://www.acin.tuwien.ac.at/?id=316
5.1 Recent Research Work
There are several open issues for improving the results of the object segmentation and reconstruction framework presented in this thesis. The hierarchical implementation at several abstraction levels allows algorithms to be added or substituted easily. Especially the relations at the structural and assembly level make it possible to extend the system whenever the requirements for the segmentation change. In the following, a solution to one of these problems is shown by exploiting our framework, as well as an application of our system within an object recognition system.
One problem of color and depth capturing devices, such as the Microsoft Kinect and Asus Xtion, is the noisy range data together with a misalignment of color and depth data, as shown in Fig. 5.1. Noise and misalignment influence our object detection algorithm through the relations defined at the border of surface patches, e.g. the color similarity
or depth variance on the patch border. In [75] we tackle this problem and propose to fit B-spline curves to the boundaries of the extracted surface patches. A novel formulation of B-spline surfaces, where the inverse image is defined in the image space, provides a direct mapping to the 3D space. Starting at the contour of the surface patches, Canny edges extracted from the color image are used to iteratively minimize the distance. After convergence, a more accurate representation of the region is found and points outside of the fitted B-spline curve can be identified. Subsequently, these points get reassigned to surface patches in the neighborhood and the depth values are adapted to the new surface patch. This method allows refining poor segmentations, especially when color and depth information are not consistent. Figure 5.1 shows an example where disturbances and misalignment on the border of the box get minimized.

Figure 5.1: Two boxes occluding each other and the table plane. Left: 3D points do not match the given color information, especially at edges of occlusion. Ellipses labeled with (a) indicate points broken at edges of occlusion, whereas (b) marks points where the color is not consistent with the depth. Right: depth values are corrected using B-splines for representing surfaces and contours.
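A much-simplified sketch of the boundary-refinement idea is given below. It does not reproduce the novel inverse-image B-spline formulation of [75] or the iterative minimization towards Canny edges; it merely fits a closed B-spline curve to a patch contour and flags pixels falling outside the fitted curve as candidates for reassignment. Function names, the smoothing factor and the use of SciPy are assumptions for illustration.

# Hedged sketch: fit a closed B-spline to a 2D patch contour and test which
# pixels remain inside the fitted boundary (simplified stand-in for [75]).
import numpy as np
from scipy.interpolate import splprep, splev
from matplotlib.path import Path

def refine_patch_boundary(contour_xy, candidate_pixels, smoothing=50.0):
    """contour_xy: (N, 2) ordered boundary pixels of one surface patch.
    candidate_pixels: (M, 2) pixels currently assigned to the patch.
    Returns a boolean mask of pixels inside the fitted boundary; pixels
    outside are candidates for reassignment to a neighboring patch."""
    # Fit a closed (periodic) cubic B-spline to the contour points.
    tck, _ = splprep([contour_xy[:, 0], contour_xy[:, 1]], s=smoothing, per=1)
    u_dense = np.linspace(0.0, 1.0, 200)
    bx, by = splev(u_dense, tck)
    boundary = Path(np.column_stack([bx, by]))
    return boundary.contains_points(candidate_pixels)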
The hierarchically implemented segmentation framework delivers segmented objects as groups of parametrized visual features, which can be used for other applications too. In [4] we show an extension to object recognition. Our proposed segmentation method is used as a parallel processing cue next to 2D and 3D appearance feature extraction. While the 2D and 3D appearance features are used to generate object hypotheses, the global segmentation output of our framework is used to verify global properties in the hypothesis verification stage. This implementation boosts the overall recognition performance significantly. Figure 5.2 shows results of the recognition algorithm on examples from the Willow Garage dataset.

Figure 5.2: The point cloud obtained from the Kinect sensor, the segmentation results with the method shown in Chapter 4, the object hypotheses generated and the final objects selected by the hypothesis verification stage.
5.2 Outlook
To conclude this thesis, we come back once more to the proposed computer vision architecture presented in Section 1.2. This system has been realized with the Blocks World Robotic Vision Toolbox (BLORT). It is able to initially detect basic object shapes, such as cubes, cones, cylinders and spheres, which triggers and initializes an edge-based tracking algorithm for subsequent learning of appearance features on the hull of objects. This allows the object to be recognized, if it got lost during tracking or was out of sight of the robot, and the tracker to be re-initialized. The limitation of this simple vision system is the usage of the basic object shape detector, which we presented in Chapter 3. The object segmentation and reconstruction of Chapter 4 solves this task in a more generic way, but there are still open issues.
One problem that needs to be solved is the complete reconstruction of the object shape. With the presented segmentation method, objects are represented as parametrized surface patches, constructed from uniform areas of the hull of the object. This representation is already suitable to initialize the tracker, but has to be extended as soon as the object gets turned or the robot drives around the object and obtains other views from different viewpoints. One solution to this problem is to add and merge new parts of objects as soon as they become visible. Figure 5.3 shows an example where a new surface appears with a new view after processing with the proposed segmentation framework. Simply adding these surfaces to the existing object model would fill the memory and would lead to a collapse of the system. Instead, the surfaces get re-projected into the 2D image space and
overlapping areas are identified. New surfaces get merged with the existing ones, if the overlapping area as well as the new surface area is large enough. To overcome problems arising in B-spline fitting when more complex models are needed, higher-order B-spline models are used and again Model Selection is employed to choose the most suitable model for the data. Figure 5.3 shows an example where two B-spline surfaces are substituted by a B-spline cylinder model. With this method it is possible to incrementally extend the object models while learning the appearance features on the hull for object recognition. Hence, the loop can be closed again, but this time with a generic approach that allows successively extending the knowledge about the environment and the objects therein, which is a key demand in cognitive robotics.

Figure 5.3: Overlap of a surface of the existing model (yellow) with a new surface (cyan). The overlap in image space is evaluated taking into account the surface normals (2nd image). Two quadratic B-spline patches (3rd image) are finally substituted by a single B-spline cylinder model (4th image) (images from [73]).
Bibliography [1] Narendra Ahuja. Dot pattern processing using Voronoi neighborhoods. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 4(3):336–343, 1982. [2] Narendra Ahuja and Mihran Tuceryan. Extraction of Early Perceptual Structure in Dot Patterns: Integrating Region, Boundary, and Component Gestalt. Computer Vision, Graphics, and Image Processing, 48:304–356, 1989. [3] Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974. [4] Aitor Aldoma, Frederico Tombari, Johann Prankl, Andreas Richtsfeld, Luigi Di Stefano, and Markus Vincze. Multimodal Cue Integration through Hypotheses Verification for RGB-D Object Recognition and 6DOF Pose Estimation. In IEEE International Conference on Robotics and Automation (ICRA), pages 1–8, 2013. [5] John Aloimonos, Isaac Weiss, and Amit Bandyopadhyay. Active Vision. International Journal of Computer Vision, 1(4):333–356, 1988. [6] Gregorio Ambrosio and Javier Gonzalez. Extracting and Matching Perceptual Groups for Hierarchical Stereo Vision. In International Conference on Pattern Recognition (ICPR), pages 542–545, 2000. [7] Arnon Amir and Michael Lindenbaum. A Generic Grouping Algorithm and Its Quantitative Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 20(2):168–185, 1998. [8] Pablo Arbelaez, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, and Jitendra Malik. Semantic Segmentation using Regions and Parts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3378–3385, 2012. [9] Pablo Arbel´aez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour Detection and Hierarchical Image Segmentation. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 33(5):898–916, May 2011. 79
Bibliography [10] N. Bergstr¨om, M. Bj¨orkman, and Danica Kragic. Generating object hypotheses in natural scenes through human-robot interaction. In Intelligent Robots and Systems (IROS), pages 827–833. IEEE, 2011. [11] Paul J. Besl and Ramesh C. Jain. Segmentation through variable-order surface fitting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(2):167– 192, March 1988. [12] Michael Boldt, Richard Weiss, and Edward Riseman. Token-based extraction of straight lines. In Systems, Man and Cybernetics, IEEE Transactions on, volume 19, pages 1581–1594, 1989. [13] Kim L. Boyer, Muhammad J. Mirza, and Gopa Ganguly. The Robust Sequential Estimator : A General Approach and its Application to Surface Organization in Range Data. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 16(10):987–1001, 1994. [14] Kim L Boyer and Sudeep Sarkar. Perceptual organization in computer vision: status, challenges, and potential. Computer Vision and Image Understanding, 76(1):1–5, 1999. [15] Gary Bradsky. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000. [16] N.D.F. Campbell, G. Vogiatzis, C. Hern´andez, and R. Cipolla. Automatic 3D object segmentation in multiple views using volumetric graph-cuts. Image and Vision Computing, 28(1):14–25, January 2010. [17] Neill Campbell, George Vogiatzis, Carlos Hern´andez, and Roberto Cipolla. Automatic 3D object segmentation in multiple views using volumetric graph-cuts. In British Machine Vision Conference, volume 28, pages 530–539, 2007. [18] Ingrid Carlbom and Joseph Paciorek. Planar Geometric Projections and Viewing Transformations. ACM Computing Surveys, 10(4):465–502, December 1978. [19] Chih-chung Chang and Chih-jen Lin. LIBSVM : A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1—27:27, 2011. [20] Huiqiong Chen and Qigang Gao. Efficient Image Region and Shape Detection by Perceptual Contour Grouping. In IEEE International Conference on Mechatronics & Automation, number July, pages 793–798, 2005. [21] Yi-wei Chen and Chih-jen Lin. Combining SVMs with Various Feature Selection Strategies. In Isabelle Guyon, Masoud Nikravesh, Steve Gunn, and Lotfi A Zadeh, editors, Feature Extraction, volume 324 of Studies in Fuzziness and Soft Computing, chapter 12, pages 315–324. Springer, 2006. 80
Bibliography [22] Chang Cheng, Andreas Koschan, David L. Page, and Mongi. A. Abidi. Scene image segmentation based on Perceptual Organization. In 2009 16th IEEE International Conference on Image Processing (ICIP), pages 1801–1804. Ieee, November 2009. [23] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 24(5):603–619, 2002. [24] J A Cottrell, T J R Hughes, and Y Bazilevs. Isogeometric Analysis. Continuum, 199(5-8):355, 2010. [25] Igemar J. Cox, James M. Rehg, and Sunita Hingorani. A Bayesian multiplehypothesis approach to edge grouping and contour segmentation. International Journal of Computer Vision (IJCV), 11(1):5–24, 1993. [26] Ingemar J. Cox. A Bayesian multiple hypothesis approach to contour grouping. In European Conference on Computer Vision (ECCV), pages 72–77, 1992. [27] Ingemar J. Cox, Satish B. Rao, and Yu Zhong. Ratio Regions: A Technique for Image Segmentation. In Proceedings of the 13th International Conference on Pattern Recognition (ICPR), pages 557–564, 1996. [28] Babette Dellen, Guillem Alenya, Sergi Foix, and Carme Torras. Segmenting color images into surface patches by exploiting sparse depth data. In IEEE Workshop on Applications of Computer Vision (WACV), pages 591–.598, 2011. [29] John Dolan and Edward Riseman. Computing curvilinear structure by token-based grouping. Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Conference on, pages 264–270, 1992. [30] P Dollar and S Belongie. Supervised Learning of Edges and Object Boundaries. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1964–1971. Ieee, 2006. [31] James H. Elder and Steven W. Zucker. Computing contour closure. In Proceedings of the 4th European Conference on Computer Visiono (ECCV), pages 399–412, 1996. [32] Francisco J. Estrada and Allan D. Jepson. Perceptual grouping for contour extraction. In International Conference on Pattern Recognition (ICPR), pages 32–35 Vol.2. Ieee, 2004. [33] Ting-Jun Fan, Gerard Medioni, and Ramakant Nevatia. Segmented descriptions of 3-D surfaces. IEEE Journal on Robotics and Automation, 3(6):527–538, December 1987. [34] Ting-Jun Fan, Gerard Medioni, and Ramakant Nevatia. Recognizing 3-D objects using surface descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 11(11):1140–1157, 1989. 81
Bibliography [35] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient Graph-Based Image Segmentation. International Journal of Computer Vision, 59(2):167–181, September 2004. [36] V Ferrari, L Fevrier, F Jurie, and C Schmid. Groups of adjacent contour segments for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1):36–51, 2008. [37] Vittorio Ferrari, Tinne Tuytelaars, and Luc Van Gool. Object Detection by Contour Segment Networks. In Ale Leonardis, Horst Bischof, and Axel Pinz, editors, European Conference on Computer Vision (ECCV), volume 3 of Lecture Notes in Computer Science, pages 14–28. Springer, 2006. [38] Martin A Fischler and Robert C Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cortography. Communications of the ACM, 24(6), 1981. [39] R B Fisher. From Surfaces to Objects: Computer Vision and Three Dimensional Scene Analysis, volume 7. John Wiley and Sons, 1989. [40] Andrew W. Fitzgibbon and Robert B. Fisher. A Buyer’s Guide to Conic Fitting. In Procedings of the British Machine Vision Conference (BMVC), pages 513–522. British Machine Vision Association, 1995. [41] Donald Geman, Stuart Geman, Christine Graffingne, and Ping Dong. Boundary detection by constrained optimization. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 12(7):609–628, July 1990. [42] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact Maximum A Posteriori Estimation for Binary Images. Journal of the Royal Statistical Society. Series B (Methodological), 51(2):271–279, 1989. [43] Dan Gutfinger and Jack Sklansky. Robust classifiers by mixed adaptation. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 13(6):552–567, 1991. [44] Gideon Guy and Gerard Medioni. Inferring Global Perceptual Contours from Local Features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number c, pages 786–787. IEEE Comput. Soc. Press, 1993. [45] Gideon Guy and Girard Medioni. Perceptual grouping using saliency-enhancing operators. In In Proceedings 11. Int. Conference on Pattern Recognition (ICPR), number 90, pages 99–103, 1992. [46] Olof Henricsson and Markus Stricker. Exploiting Photometric and Chromatic Attributes in a Perceptual Organization Framework. In Asian Conference on Computer Vision (ACCV), pages 258–262, 1995. 82
Bibliography [47] Laurent Herault and Radu Horaud. Figure-Ground Discrimination : A Combinatorial Optimization Approach. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 15(9):899–914, 1993. [48] D.P. Huttenlocher and P.C. Wayner. Finding convex edge groupings in an image. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 406–412. IEEE Comput. Sco. Press, 1991. [49] Hiroshi Ishikawa and Davi Geiger. Segmentation by Grouping Junctions. In Proceedings of Computer Vision and Pattern Recognition (CVPR), number June, pages 125–131, 1998. [50] Christopher O. Jaynes, Frank Stolle, and Robert T. Collins. Task driven perceptual organization for extraction of rooftop polygons. In Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, pages 152–159. IEEE Comput. Soc. Press, 1994. [51] Christopher O. Jaynes, Frank R. Stolle, Howard Schultz, Robert T. Collins, Allen R. Hanson, and Ed M. Riseman. Three-Dimensional Grouping and Information Fusion for Site Modeling from Aerial Images. In In Proc. Arpa Image Understanding Workshop, pages 479–498, 1996. [52] Xiaoyi Y. Jiang and Horst Bunke. Fast Segmentation of Range Images into Planar Regions by Scan Line Grouping. In Machine Vision and Applications, pages 115–122, 1994. [53] Xiaoyi Y. Jiang, U. Meier, and Horst Bunke. Fast Range Image Segmentation Using High-Level Segmentation Primitives. In Proceedings of the 3rd IEEE Workshop on Applications of Computer Vision, pages 83–88, 1996. [54] Kurt Koffka. Principles of Gestalt Psychology, volume 20 of International library of psychology, philosophy, and scientific method. Harcourt, Brace and World, 1935. [55] Wolfgang K¨ohler. Gestalt Psychology Today. American Psychologist, 14(12):727– 734, 1959. [56] Gert Kootstra, Niklas Bergstr¨om, and Danica Kragic. Fast and Automatic Detection and Segmentation of Unknown Objects. In IEEE-RAS International Conference on Humanoids Robotics (Humanoids), pages 442–447, 2010. [57] Gert Kootstra, Niklas Bergstr¨om, and Danica Kragic. Gestalt Principles for Attention and Segmentation in Natural and Artificial Vision Systems. In Semantic Perception, Mapping and Exploration (SPME), ICRA 2011 Workshop, pages 1–8, Shanghai, 2011. 83
Bibliography [58] Gert Kootstra and Danica Kragic. Fast and bottom-up object detection, segmentation, and evaluation using Gestalt principles. In International Conference on Robotics and Automation (ICRA), pages 3423–3428, 2011. [59] Norbert Kr¨ uger, Michael Felsberg, Christian Gebken, and Martin P¨orksen. An Explicit and Compact Coding of Geometric and Structural Information Applied to Stereo Processing. Pattern Recognition Letters, 25(8):665–673, 2004. [60] Norbert Kr¨ uger and Florentin W¨org¨otter. Statistical and Deterministic Regularities: Utilisation of Motion and Grouping in Biological and Artificial Visual Systems. Advances in Imaging and Electron Physics, 31:82–147, 2004. [61] Impyeong Lee and Toni Schenk. 3D perceptual organization of laser altimeter data. International Archives of Photogrammetry and Remote Sensing, XXXIV(3):57–65, 2001. [62] Aleˇs Leonardis, Alok Gupta, and Ruzena Bajcsy. Segmentation of range images as the search for geometric parametric models. International Journal of Computer Vision, 14(3):253–277, April 1995. [63] SP Liou, AH Chiu, and RC Jain. A parallel technique for signal-level perceptual organization. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 13(4):317–339, 1991. [64] Leandro Loss, George Bebis, Mircea Nicolescu, and Alexei Skurikhin. An iterative multi-scale tensor voting scheme for perceptual grouping of natural shapes in cluttered backgrounds. Computer Vision and Image Understanding, 113(1):126–149, 2009. [65] D. G. Lowe. Perceptual Organization and Visual Recognition. Springer, 1985. [66] S. Mahamud, L.R. Williams, and K.K. Thornber. Segmentation of multiple salient closed contours from real images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25(4):433–444, April 2003. [67] David Marr. Vision. W. H. Freeman, 1982. [68] Wolfgang Metzger. Laws of Seeing. The MIT Press, 1 edition edition, 1936. [69] Ajay K Mishra and Yiannis Aloimonos. Active Segmentation. International Journal of Humanoid Robotics, 6(3):361–386, 2009. [70] Ajay K Mishra and Yiannis Aloimonos. Visual Segmentation of Simple Objects for Robots. In Robotics: Science and Systems VII, pages 217–224, 2011. [71] Ajay K Mishra, Ashish Shrivastava, and Yiannis Aloimonos. Segmenting Simple Objects Using RGB-D. In International Conference on Robotics and Automation (ICRA), pages 4406–4413, 2012. 84
Bibliography [72] Philippe Montesinos and Laurent Alquier. Perceptual Organization of Thin Networks with Active Contour Functions Applied to Medical and Aerial Iinages. In Proceedings of the 13th International Conference on Pattern Recognition (ICPR), pages 647–651, 1996. [73] Thomas M¨orwald. Object modelling for cognitive robotics. PhD thesis, Vienna University of Technology, 2013. [74] Thomas M¨orwald, Johann Prankl, Andreas Richtsfeld, Michael Zillich, and Markus Vincze. BLORT - The Blocks World Robotic Vision Toolbox. Best Practice in 3D Perception and Modeling for Mobile Manipulation in conjunction with ICRA 2010, 2010. [75] Thomas M¨orwald, Andreas Richtsfeld, Johann Prankl, Michael Zillich, and Markus Vincze. Geometric data abstraction using B-splines for range image segmentation. In IEEE International Conference on Robotics and Automation (ICRA), pages 0–6, 2013. [76] Randal C. Nelson and Andrea Selinger. A Cubist Approach to Object Recognition. In Sixth International Conference on Computer Vision (ICCV), number TR689, pages 614–621. Dept. of Computer Science, Univ. of Rochester, IEEE, 1998. [77] Bj¨orn Ommer and Jitendra Malik. Multi-Scale Object Detection by Clustering Lines. In IEEE 12th International Conference on Computer Vision (ICCV), number Iccv, pages 484–491, 2009. [78] Andreas Opelt, Axel Pinz, and Andrew Zisserman. A Boundary-Fragment-Model for Object Detection. In European Conference on Computer Vision (ECCV), pages 575–588, 2006. [79] S E Palmer. Common region: a new principle of perceptual grouping. Cognitive Psychology, 24(3):436–447, 1992. [80] Stephen Palmer. Photons to Phenomenology. A Bradford Book, 1999. [81] Stephen Palmer and Irvin Rock. Rethinking perceptual organization: The role of uniform connectedness. Psychonomic Bulletin & Review, 1(1):29–55, 1994. [82] P. Parent and S.W. Zucker. Trace inference, curvature consistency, and curve detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 11(8):823–839, 1989. [83] Les Piegl and Wayne Tiller. The NURBS Book. Computer-Aided Design, 28(8):665– 666, 1997. [84] Stefan Posch and Daniel Schl¨ uter. Perceptual Grouping using Markov Random Fields and Cue Integration of Contour and Region Information. Technical report, 1998. 85
Bibliography [85] Nicolas Pugeault, Florentin Wörgötter, and Norbert Krüger. Multimodal Scene Reconstruction using Perceptual Grouping Constraints. In Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), pages 195–203, 2006. [86] T. Rabbani, F. A. van den Heuvel, and G. Vosselman. Segmentation of point clouds using smoothness constraint. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 36(5):248–253, 2006. [87] Andreas Richtsfeld. The Object Segmentation Database (OSD), http://www.acin.tuwien.ac.at/?id=289, 2012.
[88] Andreas Richtsfeld, Thomas M¨orwald, Johann Prankl, Jonathan Balzer, Michael Zillich, and Markus Vincze. Towards Scene Understanding Object Segmentation Using RGBD-Images. In Proceedings of the 2012 Computer Vision Winter Workshop (CVWW), Mala Nedelja, Slovenia, 2012. [89] Andreas Richtsfeld, Thomas M¨orwald, Johann Prankl, Michael Zillich, and Markus Vincze. Segmentation of Unknown Objects in Indoor Environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4791 – 4796, 2012. [90] Andreas Richtsfeld, Thomas M¨orwald, Johann Prankl, Michael Zillich, and Markus Vincze. Learning of Perceptual Grouping for Object Segmentation on RGB-D Data. Journal of Visual Communication and Image Representation, April 2013. [91] Andreas Richtsfeld, Thomas M¨orwald, Michael Zillich, and Markus Vincze. Taking in Shape: Detection and Tracking of Basic 3D Shapes in a Robotics Context. Computer Vision Winder Workshop, pages 1–8, 2010. [92] Andreas Richtsfeld and Markus Vincze. 3D Shape Detection for Mobile Robot Learning. In Torsten Kr¨oger and Friedrich M. Wahl, editor, Advances in Robotics Research, pages 99–109, Braunschweig, 2009. Springer Berlin Heidelberg. [93] Andreas Richtsfeld and Markus Vincze. Basic Object Shape Detection and Tracking Using Perceptual Organization. In Advanced Robotics, 2009. ICAR 2009. International Conference on, pages 1–6, Munich, 2009. [94] Andreas Richtsfeld, Michael Zillich, and Markus Vincze. Implementation of Gestalt Principles for Object Segmentation. In 21st International Conference on Pattern Recognition (ICPR), Tsukuba, JAPAN, November 2012. [95] Andreas Richtsfeld, Michael Zillich, and Markus Vincze. Anytime Perceptual Grouping of 2D Features into 3D Basic Shapes. In to appear in ’International Conference on Computer Vision Systems’, 2013. [96] L. G. Roberts. Machine perception of three-dimensional solids. In J. T. Tippett, editor, Optical and Electro-Optical Information Processing, pages 159–197. MIT Press, Cambridge, MA, 1965. 86
Bibliography [97] I Rock and S Palmer. The legacy of Gestalt psychology. Scientific American, 263(6):84–90, 1990. [98] Paul L. Rosin and Geoff A. W. West. Segmenting Curves into Elliptic Arcs and Straight Lines. In Proceedings Third International Conference on Computer Vision (ICCV), pages 75–78. IEEE Comput. Soc. Press, 1990. [99] Paul L. Rosin and Geoff A. W. West. Nonparametric Segmentation of Curves into Various Representations. Pattern Analysis and Machine Intelligence (PAMI), 17(12):1140–1153, 1995. [100] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. ”GrabCut”: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (SIGGRAPH), 23(3):309–314, August 2004. [101] E Rubin. Visuell Wahrgenommene Figuren. Copenhagen Gyldendals, 1921. [102] RB Rusu and S. Cousins. 3d is here: Point cloud library (pcl). In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1–4. IEEE, 2011. [103] Pablo Sala and Sven Dickinson. Contour Grouping and Abstraction Using Simple Part Models. In European Conference on Computer Vision (ECCV), volume 6315 of Lecture Notes in Computer Science, pages 603–616, 2010. [104] Pablo Sala and Sven J Dickinson. Model-based perceptual grouping and shape abstraction. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8, 2008. [105] Pablo Sala, Diego Macrini, and Sven Dickinson. Spatiotemporal Contour Grouping using Abstract Part Models. In Proceedings of the 10th Asian Conference on Computer Vision (ACCV), pages 539–552, Queenstown, 2011. Springer-Verlag. [106] Thomas Sanocki, Kevin W. Bowyer, Michael D. Heath, and Sudeep Sarkar. Are edges sufficient for object recognition? Journal of Experimental Psychology: Human Perception and Performance, 24(1):340–349, 1998. [107] Angel Domingo Sappa. Unsupervised contour closure algorithm for range image edge-based segmentation. IEEE Transactions on Image Processing, 15(2):377–384, February 2006. [108] Sudeep Sarkar. Learning to Form Large Groups of Salient Image Features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 780–786, 1998. [109] Sudeep Sarkar and Kim L Boyer. Perceptual organization in computer vision - A review and a proposal for a classificatory structure. IEEE Transactions On Systems Man And Cybernetics, 23(2):382–399, 1993. 87
Bibliography [110] Sudeep Sarkar and Kim L. Boyer. Quantitative Measures of Change Based on Feature Organization: Eigenvalues and Eigenvectors. Computer Vision and Image Understanding, 71(1):110–136, July 1998. [111] Sudeep Sarkar and Padmanabhan Soundararajan. Supervised learning of large perceptual organization: graph spectral partitioning and learning automata. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(5):504–525, 2000. [112] E. Saund. Finding perceptually closed paths in sketches and drawings. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 25(4):475–491, April 2003. [113] David Sedlacek and Jiri Zara. Graph Cut Based Point-Cloud Segmentation for Polygonal Reconstruction. In 7th International Conference on Computer Vision Systems, pages 218–227, 2009. [114] Andrea Selinger and Randal C Nelson. A Perceptual Grouping Hierarchy for Appearance-Based 3D Object Recognition. Computer Vision and Image Understanding CVIU, 76(1):83–92, 1999. [115] Amnon Sha’ashua and Shimon Ullman. Structural Saliency: The Detection of Globally Salient Structures Using a Locally Connected Network. In Second Internation Conference on Computer Vision (ICCV), pages 321–327, 1988. [116] Amnon Shashua and Shimon Ullman. Grouping Contours by Iterated Pairing Network. In Advances in neural information processing systems (NIPS), pages 335–341, 1990. [117] Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 731 – 737, 1997. [118] Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 22(8):888–905, 2000. [119] Yi-Zhe Song, Bai Xiao, Peter Hall, and Liang Wang. In Search of Perceptually Salient Groupings. IEEE Transactions on Image Processing, 20(4):935–947, April 2011. [120] Johannes Strom, Andrew Richardson, and Edwin Olson. Graph-based segmentation for colored 3D laser point clouds. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2131–2136. Ieee, October 2010. 88
Bibliography [121] Camillo J. Taylor and Anthony Cowley. Fast Scene Analysis Using Image and Range Data. In IEEE International Conference on Robotics and Automation (ICRA), pages 3562–3567. Ieee, May 2011. [122] Geoffrey Taylor and Lindsay Kleeman. Robust Range Data Segmentation Using Geometric Primitives for Robotic Applications. In Signal and Image Processing (SIP), pages 467–472, 2003. [123] Dejan Todorovic. Gestalt principles. Scholarpedia, 3(12):5345, 2008. [124] Mihran Tuceryan, Anil K. Jain, and Narendra Ahuja. Supervised Classification of Early Perceptual Structure in Dot Patterns. In Proceedings of the 11th IAPR International Conference on Pattern Recognition (ICPR), volume 2, pages 88–91. IEEE Comput. Soc. Press, 1992. [125] Andre Ückermann, Christof Elbrechter, Robert Haschke, and Helge Ritter. 3D scene segmentation for autonomous robot grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1734–1740. Ieee, October 2012. [126] Andre Ückermann, Robert Haschke, and Helge Ritter. Real-Time 3D Segmentation of Cluttered Scenes for Robot Grasping. In 12th IEEE-RAS International Conference on Humanoid Robots, 2012. [127] Sabine Urago, Josiane Zerubia, and Marc Berthod. A Markovian model for contour grouping. In International Conference on Pattern Recognition (ICPR), number 33, pages 556–558, 1994. [128] Sabine Urago, Josiane Zerubia, and Marc Berthod. A markovian model for contour grouping. Pattern recognition, 28(5):683–693, 1995. [129] Ahsan Ahmad Ursani, Kidiyo Kpalma, and Joseph Ronsin. Texture features based on Fourier transform and Gabor filters: an empirical comparison. In 2007 International Conference on Machine Vision, pages 67–72, December 2007. [130] Christian von Ehrenfels. Über Gestaltqualitäten. Vierteljahresschrift für wissenschaftliche Philosophie, 14:249–292, 1890.
[131] Ji Wan, Tian Xia, Sheng Tang, and Jintao Li. Robust Range Image Segmentation Based on Coplanarity of Superpixels. In 21st International Conference on Pattern Recognition (ICPR 2012), number Icpr, pages 3618–3621, 2012. [132] Sang Wang and Toshiro Kubota. Salient closed boundary extraction with ratio contour. IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 27(4):546–561, 2005. 89
Bibliography [133] Song Wang, Joachim S. Stahl, Adam Bailey, and Michael Dropps. Global Detection of Salient Convex Boundaries. International Journal of Computer Vision (IJCV), 71(3):337–359, June 2007. [134] M. Wertheimer. Untersuchungen zur Lehre von der Gestalt. II. Psychological Research, 4(1):301–350, 1923. [135] M Wertheimer. Principles of perceptual organization. In David C Beardslee and Michael Wertheimer, editors, A SourceBook of Gestalt Psychology, pages 115–135. Van Nostrand, Inc., 1958. [136] Andrew P. Witkin and Jay M. Tenenbaum. On the role of structure in vision. In Jacob Beck, Barbara Hope, and Azriel Rosenfeld, editors, Human and Machine Vision, pages 481–543. Academic Press, Orlando, 1983. [137] Ting-Fan Wu and Chih-Jen Lin. Probability estimates for multi-class classification by pairwise coupling. The Journal of Machine Learning Research, 5:975–1005, 2004. [138] Zhenyu Wu and Richard Leahy. An optimal graph theoretic approach to data clustering: theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 15(11):1101–1113, 1993. [139] C T Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Transactions on Computers, C-20(1):68–86, 1971. [140] Qihui Zhu, Gang Song, and Jianbo Shi. Untangling Cycles for Contour Grouping. In International Conference on Computer Vision (ICCV), number c, pages 1–8. Ieee, 2007. [141] Michael Zillich. Incremental Indexing for Parameter-Free Perceptual Grouping. In 31st Workshop of the Austrian Association for Pattern Recognition, pages 25–32, 2007. [142] Michael Zillich and Markus Vincze. Anytimeness avoids parameters in detecting closed convex polygons. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1–8. Ieee, June 2008. [143] Steven W. Zucker. Computational and psychophysical experiments in grouping: Early orientation selection. In Human and Machine Vision, pages 545–567. 1983. [144] Steven W. Zucker. Early orientation selection: Tangent fields and the dimensionality of their support. Computer Vision, Graphics, and Image Processing, 32(1):74–103, October 1985. [145] Steven W. Zucker. The computational connection in vision: Early orientation selection. Behavior Research Methods, Instruments, & Computers, 18(6):608–617, November 1986. 90
Bibliography [146] Steven W. Zucker, Chantal David, Allan Dobbins, and Lee Iverson. The Organization Of Curve Detection: Coarse Tangent Fields And Fine Spline Coverings. In Second International Conference on Computer Vision (ICCV), pages 568–577. Ieee, 1988.
Curriculum Vitae

Personal Information
First name / Surname: Andreas Richtsfeld
Academic degree: Dipl.-Ing.
Citizenship: Austrian
Date of Birth: 22. May 1978
Address: Waxenberg 55, 4182 Waxenberg, Austria
Languages: German (mother tongue), English
Homepage: http://users.acin.tuwien.ac.at/arichtsfeld
Education
1992 to 1997: Technical College HTL II, Paul-Hahn-Straße, Linz; Electrical Engineering / Automation, power and drive engineering
1998 to 2007: Vienna University of Technology, Master program Electrical Engineering / Computer Technology; Master Thesis: Scenario Recognition by Symbolic Processing with Fuzzy Logic; 15.06.2007: Diploma examination with distinction
Since Sep. 2007: Doctoral program of Electrical Engineering at Vienna University of Technology, Austria; Automation and Control Institute (ACIN), Vision for Robotics Group (V4R)
Work Experience
2002-2010: Night operator at CPB Software AG, Vienna. Support of IBM AS/400 computers at the infrastructure department: system operation, supervision and system monitoring; end-of-day processing, data saving, backup and storage
Since Sep. 2007: Research assistant at the Automation and Control Institute (ACIN) at Vienna University of Technology
Research projects
XPERO: Learning by Exploration (EU funded)
CogX: Cognitive Systems that Self-Understand and Self-Extend (EU funded)
InSitu: Integrated Visual Scene and Natural Language Understanding for Human-Robot Interaction (FWF funded)
Publications
A. Richtsfeld, T. Mörwald, J. Prankl, M. Zillich and M. Vincze: Learning of Perceptual Grouping for Object Segmentation on RGB-D Data; Journal of Visual Communication and Image Representation (JVCI), Special Issue on Visual Understanding and Applications with RGB-D Cameras, July 2013
A. Richtsfeld, M. Zillich and M. Vincze: Anytime Perceptual Grouping of 2D Features into 3D Basic Shapes; to appear in: International Conference on Computer Vision Systems (ICVS), July 2013
T. Mörwald, A. Richtsfeld, J. Prankl, M. Zillich and M. Vincze: Geometric data abstraction using B-splines for range image segmentation; IEEE International Conference on Robotics and Automation (ICRA), 2013
A. Aldoma, F. Tombari, J. Prankl, A. Richtsfeld, L. Di Stefano and M. Vincze: Multimodal Cue Integration through Hypotheses Verification for RGB-D Object Recognition and 6DOF Pose Estimation; IEEE International Conference on Robotics and Automation (ICRA), 2013
A. Richtsfeld, M. Zillich and M. Vincze: Implementation of Gestalt Principles for Object Segmentation; IAPR 21st International Conference on Pattern Recognition (ICPR), 2012
A. Richtsfeld, T. Mörwald, J. Prankl, M. Zillich and M. Vincze: Segmentation of Unknown Objects in Indoor Environments; IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012
A. Richtsfeld, T. Mörwald, J. Prankl, J. Balzer, M. Zillich, M. Vincze: Towards Scene Understanding - Object Segmentation Using RGBD-Images; Computer Vision Winter Workshop (CVWW), 2012
K. Zhou, K. Varadarajan, A. Richtsfeld, M. Zillich, M. Vincze: From holistic scene understanding to semantic visual perception: A vision for mobile robots; Workshop at the International Conference on Robotics and Automation (ICRA), 2011
K. Zhou, A. Richtsfeld, M. Zillich, M. Vincze, A. Vrecko, D. Skocaj: Visual Information Abstraction for Interactive Robot Learning; International Conference on Advanced Robotics (ICAR), 2011
K. Zhou, A. Richtsfeld, M. Zillich, M. Vincze: Coherent Spatial Abstraction and Stereo Line Detection for Robotic Visual Attention; International Conference on Intelligent Robots and Systems (IROS), 2011
A. Richtsfeld, T. Mörwald, M. Zillich and M. Vincze: Detection and Tracking of Basic 3D Shapes in a Robotics Context; Computer Vision Winter Workshop (CVWW), 2010
K. Zhou, A. Richtsfeld, K. Varadarajan, M. Zillich, M. Vincze: Combining Plane Estimation With Shape Detection For Holistic Scene Understanding; Advanced Concepts for Intelligent Vision Systems (ACIVS), 2011
K. Zhou, A. Richtsfeld, K. Varadarajan, M. Zillich, M. Vincze: Accurate Plane Estimation Within A Holistic Probabilistic Framework; The Austrian Association for Pattern Recognition (AGM/AAPR) Workshop, 2011
T. Mörwald, J. Prankl, A. Richtsfeld, M. Zillich, M. Vincze: BLORT - The Blocks World Robotic Vision Toolbox; Best Practice Algorithms in 3D Perception and Modeling for Mobile Manipulation, International Conference on Robotics and Automation Workshop (ICRAW), 2010
A. Richtsfeld and M. Vincze: Basic Object Shape Detection and Tracking Using Perceptual Organization; International Conference on Advanced Robotics (ICAR), 2009
A. Richtsfeld and M. Vincze: 3D Shape Detection for Mobile Robot Learning; German Workshop on Robotics (GWR), 2009