Identification and Recognition of Objects in Color Stereo Images Using a Hierarchical SOM
Giovanni Bertolini, Stefano Ramat
Dip. Informatica e Sistemistica, University of Pavia
E-mail:
[email protected]
Abstract
Identification and recognition of objects in digital images is a fundamental task in robotic vision. Here we propose an approach based on clustering of features extracted from the HSV color space and depth, using a hierarchical self-organizing map (HSOM). Binocular images are first preprocessed using a watershed algorithm; adjacent regions are then merged based on HSV similarities. For each region we compute a six-element feature vector: median depth (computed as disparity), median H, S and V values, and the X and Y coordinates of its centroid. These are the input to the HSOM network, which is allowed to learn on the first image of a sequence. The trained network is then used to segment other images of the same scene. If, on the new image, the same neuron responds to regions that belong to the same object, the object is considered recognized. The technique achieves good results, recognizing up to 82% of the objects.
1. Introduction
Object identification and recognition play a fundamental role in human interactions with the environment. Building artificial systems able to automatically understand images is one of the greatest challenges of robotic vision [1]. Image segmentation is the first important process in many vision tasks, since it is responsible for dividing an image into homogeneous regions such that merging two adjacent regions would produce a non-homogeneous region [2]. Therefore, segmenting an image into regions that represent meaningful objects depends upon the homogeneity criterion chosen for the segmentation and upon the features used for evaluating such homogeneity (e.g. color or grayscale). Humans often achieve recognition using semantic characteristics that allow grouping parts of complex objects; an
approach that goes well beyond the homogeneity of a single feature of the pixels in the image. Many different methods for image segmentation can be found in the literature, ranging from edge detection to region-based approaches and clustering in feature spaces. The algorithm described here represents a hybrid approach to image segmentation based on edge detection, merging and clustering, but it can broadly be considered as belonging to the class of region-merging processes. Once the sought objects are identified and correctly segmented, the problem of recognition is often formulated as that of finding a description suitable for comparing the objects across different frames or for building models that the objects in subsequent images have to match [3]. The aim of this work is to present an algorithm that processes binocular images to distinguish meaningful objects from the background and stores a description of these objects useful for recognizing them in subsequent frames. This can be seen as a two-step problem: segmentation and recognition. The key idea of our system is to store the description of the objects as the rules learned for segmenting them in the first image, and then to use these rules to segment the subsequent frames. In this way it is possible to combine segmentation and recognition.
The paper is organized as follows. Section 2 gives a brief overview of the algorithm we developed, introducing the main ideas that led to our choices. Section 3 describes the methods used to extract features from the images and the algorithm used to produce the first segmentation. Section 4 describes the HSOM neural network and its use as a clustering tool within our algorithm, as well as its role in the recognition of objects. Section 5 reports the experimental results obtained by applying the suggested algorithm to sequences of images captured using our binocular system. Section 6 draws some
conclusions and considers possible future developments of this work.
2. Algorithm overview
Despite the importance of segmentation and the large amount of literature on this topic, there is currently no optimal solution to it. The reason for this lack is well explained by Fu and Mui in [4]: “the image segmentation problem is basically one of psychophysical perception and therefore not susceptible to a purely analytical solution”. Thus we reasoned that trying to mimic some of the putative processing used by the Central Nervous System (CNS) in building such psychophysical perception could provide interesting hints for solving the problem at hand. It is currently believed that the CNS processes the information in retinal images through parallel neural pathways that build up a description of different features of the visual scene. These can then be combined at a higher level to obtain the visual perception of the world as we are used to seeing it [5]. Such processing allows the CNS to dynamically build different interpretations of the environment, varying the weights it assigns to the different features.
The segmentation procedure described here combines information on color, edges and depth of the image, each one being computed through a separate algorithm, to build a first segmentation. The combined output of the above procedures is used to train a hierarchical self-organizing map, which builds the final segmentation by merging the regions that it assigns to the same cluster. Using this approach the network also stores a distributed model of the observed world, which works as an “expected image” that is used in the processing of subsequent frames. Overall, the suggested algorithm produces a segmentation while simultaneously recognizing the objects present in the observed scene.
To acquire the binocular images, we developed a vision system made of two cameras (two commercial
web-cams) with the optical axes aligned so that the two images differ only by a translation perpendicular to the axes themselves. Since depth information is available only for the part of the visual scene that is visible in both images, we apply all the processes described in the following only to this common region. For ease of explanation, in the subsequent paragraphs we refer to the common part of a pair of binocular images simply as ‘image’, unless otherwise specified.
3. Feature extraction
Current hypotheses on how the CNS processes retinal information argue for different features being processed in parallel. The selection of these features is critical to the performance of any segmentation and object recognition algorithm. Clearly, using different features that add meaningful information about the observed scene can help achieve a better understanding of its structure in space. Based on these considerations we chose to combine color space information both with information on edges, derived from the grayscale version of the images, and with depth information derived from the comparison of each image pair acquired by the binocular camera system. As we reasoned that it could be profitable to imitate the CNS approach to processing visual information, the extraction of each of these three features is performed by a different algorithm, so that the overall algorithm could be run in parallel.
Figure 1. Flowchart of the feature extraction process

3.1. HSV color space
Although color is usually represented in terms of its intensity in the red, green and blue wavelengths (RGB), such a coding may not be the most appropriate for image segmentation. The main problem with RGB coding is that the distance between two colors, when computed in this space, does not resemble that perceived by humans. For instance, the distance between two different colors with low intensity (e.g. [30,0,0] and [0,30,0]) may be lower than that between two intensities of the same color (e.g. [50,0,0] and [200,0,0]) [2]. RGB coding therefore does not allow the similarity between two colors to be established reliably, which may instead be a useful criterion for determining whether two regions pertain to the same object in the scene. We therefore decided to transform the data into the HSV color space (Hue, Saturation, Value), which can be obtained through a non-linear transformation of the RGB coding. HSV color representation separates color and intensity information, making it especially efficient in representing similarities when non-uniform illumination creates differences between pixels of the same surface. In HSV space, color information is represented by the hue (the dominant wavelength in the spectral distribution) and the saturation (the purity of the color), while the value is the intensity. Thus, by comparing hue values it is possible to evaluate color differences directly, in a way which may be closer to human perceptual processing.
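As an illustration of this transformation, here is a minimal sketch, assuming scikit-image and NumPy are available; the function names are our own and the circular hue distance is one simple choice among several.

```python
import numpy as np
from skimage.color import rgb2hsv  # returns H, S, V as floats in [0, 1]

def hue_distance(h1, h2):
    """Circular distance between two hue values expressed in [0, 1)."""
    d = abs(h1 - h2)
    return min(d, 1.0 - d)

def region_hsv_medians(rgb_image, region_mask):
    """Median H, S and V over the pixels selected by a boolean mask."""
    hsv = rgb2hsv(rgb_image)          # shape (rows, cols, 3)
    pixels = hsv[region_mask]         # (n_pixels, 3)
    return np.median(pixels, axis=0)  # [median_H, median_S, median_V]
```

Because hue is an angle, its distance must wrap around, which is why a circular distance is used above.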
3.2. Edge detection and watershed
Edge detection is based on finding discontinuities in the gray level of the pixels in the image, assuming that points of abrupt change represent the separation between regions. The suggested algorithm detects edges using a Sobel filter applied to the grayscale image. This returns an image-sized map of gradients of the gray levels. The resulting map is then used as the input of the watershed algorithm, which returns the over-segmented map. The watershed algorithm is well described by Beucher in [6]. To explain how it works, we can imagine the gradient image as a topographic relief map being flooded. The algorithm exploits the idea of a “catchment basin”, which is a set of connected edge pixels (representing the border of a region) and of all the pixels encircled by it. Each catchment basin therefore contains a local minimum of gray-level intensity, so that a drop of water falling within the basin would trickle down to that minimum. During the flooding process the water rises level by level within the catchment basins until the water coming from two different basins begins to come into contact. This occurs when a local maximum in the gradient is reached. To prevent the merging, the algorithm builds the equivalent of a dam and then goes on with the flooding. At the end of the process the entire topographic surface is submerged and only the dams emerge from the water. These dams represent the boundaries of the regions identified by the watershed
algorithm. An efficient implementation of the watershed algorithm can be found in [7]. The output of the edge detection process is an early, raw segmentation of the image that will be the basis for all subsequent processing. Since the following steps merge the identified regions, it is important that this processing step builds a representation of the image made of its elementary regions. Thus we accept an over-segmented output, keeping all the edges, including non-meaningful ones, to avoid erroneous merging at this stage. The reason for the over-segmentation lies in the watershed algorithm, which identifies all the boundaries in the image and returns them as having all the same height.
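The edge-detection and watershed step can be sketched as follows. This is only an illustrative pipeline assuming scikit-image (our own implementation follows [7]); calling the watershed without markers deliberately yields the over-segmented label map discussed above.

```python
from skimage.color import rgb2gray
from skimage.filters import sobel
from skimage.segmentation import watershed

def oversegment(rgb_image):
    """Sobel gradient map flooded by the watershed transform."""
    gray = rgb2gray(rgb_image)     # grayscale version of the image
    gradient = sobel(gray)         # image-sized map of gray-level gradients
    labels = watershed(gradient)   # no markers: one region per catchment basin
    return labels                  # integer label map (over-segmented)
```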
3.3. Depth estimate
Knowledge of the distance from the camera of every point in the image can provide additional valuable information for the image analysis process and for identifying the objects in the scene. The information carried by the position in the 3D environment is significantly different from that provided by color and grayscale, and can supply important clues for understanding whether adjacent regions belong to the same or to different objects. The problem faced by the algorithms that compute depth from visual information (stereo vision algorithms) is that they need to know the correspondences between the pixels of the two images, i.e. the two projections needed for 3D reconstruction. Many techniques for searching for these matches can be found in the literature, exploiting different constraints and similarity criteria. They can be grouped into two categories: global and local methods. Global methods try to find the correspondences using constraints or searching strategies that involve at least an entire scan line, or the entire image. Local methods instead rely on solutions that consider a small number of pixels surrounding the pixel of interest. Whatever the method used, the output of a depth estimating algorithm is a map of almost the same size as each of the two images (i.e. the common region discussed in Section 2). In this map each element represents a disparity value, that is, the difference between the positions of two corresponding pixels in the two images: the larger the disparity, the closer the object is to the cameras. We implemented a local algorithm similar to that proposed by Di Stefano and colleagues in [8]. This attempts to estimate the disparity of a point in the reference image by computing the sum of absolute
differences (SAD), in gray level, between a square window centered on the pixel of interest (the reference window) and an equal-sized window (the sliding window) sliding along the corresponding scan line of the other image (the search image). In addition, following the suggestions in [9], we developed a multiple-window approach, combining information over areas larger than a single window, which allows us to determine the matching pixel with greater confidence. For each pixel in the reference image we first compute the median gray level of the pixels in a square window surrounding it (swm). We then begin to look for matching pixels one by one, scanning each line of the reference image. For each pixel in the reference image we select the three adjacent pixels whose swm is closest to that of the pixel of interest, and add their three windows to the reference window, increasing its size and, consequently, that of the sliding windows in the search image. This is a modified version of the approach proposed by Hirschmüller in [9], which suggests choosing the supporting windows based on the highest correlation values. The advantage of our implementation over that proposed in [9] is that we need to choose the supporting windows only at the beginning of the disparity search process, thus reducing computational time.
The other significant difference with respect to [8] is in the post-processing of the disparity map obtained by the SAD computation. Although we implemented the tests of sharpness and distinctiveness to evaluate the SAD values as suggested in [8], we chose to use an additional criterion for evaluating the uniqueness of each match. For each pixel in the reference image we keep the best four matching pixels in the search image, excluding those that did not pass the two tests mentioned above, creating a subset of possible candidates. For each scan line we then apply an iterative procedure that finds the collisions (multiple matches, i.e. groups of pixels of one image that correspond to the same pixel of the other image), keeps only the pair of pixels with the best SAD value and replaces the “losing” matches with their next candidate. This step can be seen as implementing a global constraint over each row of pixels, yet it is applied only to a subset of locally established matches.
Once the disparity map is obtained, two more procedures were applied to remove unreliable correspondences pertaining to low-texture areas and missing matches caused by occlusions (the absence of a corresponding pixel due to its occlusion, in one of the two images, by an interposed object). These procedures scan the disparity map, filling the holes and
the low-texture areas with an estimate of the disparity derived from the surrounding reliable values and from the position of the edges in the depth and grayscale images.
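For reference, the core SAD matching can be sketched as below. This is a minimal, single fixed-window version in plain NumPy; it omits the multiple-window extension, the sharpness/distinctiveness tests, the collision handling and the hole filling described above, and the window size and disparity range are illustrative assumptions.

```python
import numpy as np

def sad_disparity(ref, search, max_disp=64, half_win=3):
    """Basic fixed-window SAD matching (ref = left image, search = right image).
    Returns an integer disparity map; border pixels keep disparity 0."""
    ref = ref.astype(np.float32)
    search = search.astype(np.float32)
    rows, cols = ref.shape
    disp = np.zeros((rows, cols), dtype=np.int32)
    w = half_win
    for y in range(w, rows - w):
        for x in range(w, cols - w):
            ref_win = ref[y - w:y + w + 1, x - w:x + w + 1]
            best_d, best_sad = 0, np.inf
            # slide an equal-sized window along the same scan line of the search image
            for d in range(0, min(max_disp, x - w) + 1):
                sl_win = search[y - w:y + w + 1, x - d - w:x - d + w + 1]
                sad = np.abs(ref_win - sl_win).sum()
                if sad < best_sad:
                    best_sad, best_d = sad, d
            disp[y, x] = best_d   # larger disparity means the point is closer to the cameras
    return disp
```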
3.4. Merging of regions
Eventually, the feature maps obtained through the described algorithms will represent the input to the HSOM network, which produces the final output of the recognition process. Each identified region will be summarized as a six-element feature array containing the coordinates of its centroid and the median of each of the four computed features: hue, saturation, value and disparity. Yet, before feeding the data to the recognition network, a further processing step is needed to reduce the over-segmentation of the region map produced by the watershed algorithm. One of the problems of using the watershed is that some of the regions generated by the algorithm are very small and carry unreliable information. This led us to develop a simple region merging procedure for preprocessing the region data before the HSOM merging. We considered that a first merging of the small regions generated by the watershed does not need complex rules derived from the context or from semantic properties of the objects, but can be based on a simple similarity measure. We therefore implemented a slightly modified version of the algorithm proposed by Navon and colleagues in [10], which merges adjacent regions following a sequence established by a measure of their dissimilarity. This is a local measure, based on two parameters that take into account the difference in color and the mean intensity of the edges that separate two regions. Such a measure is computed for every existing region with respect to all its adjacent regions. The process then merges the regions one by one, starting from the most similar. At each step the variance of the gray level (i.e. the Value in HSV coding) of the pixels within the new region is calculated and a dissimilarity threshold is derived locally and dynamically from the merging history of the region itself. Such a threshold is then used to decide when to exclude a region from the merging process.
In the original algorithm [10] the color difference between two regions is measured as the distance between their hue levels and is combined with the edge intensity, weighing them 80% and 20% respectively. However, the HSV color space has some singularities that make the hue level less reliable as the saturation diminishes [2]. To avoid this pitfall, we developed an adaptive algorithm for
selecting the weights of the two components of the similarity measure. This allows us to take into account the reliability of the hue information by increasing the weight of the edge intensity component when the saturation is low. Before beginning the merging process we subtract the background by joining all the pixels that lie farther than a chosen depth threshold. This forces the algorithm to consider the background as a single element, speeding up the merging. The new region map obtained from the described merging process is made up of regions that represent homogeneous and possibly meaningful parts of the objects in the scene. Each region in this map will represent one input to the HSOM. Thus, for each region, we compute the corresponding feature array, as sketched below.
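A minimal sketch of this step follows, assuming NumPy and scikit-image; the linear way of shifting weight from hue to edge intensity as saturation drops is our own illustrative choice, not the exact rule used in the system.

```python
import numpy as np
from skimage.color import rgb2hsv

def adaptive_dissimilarity(hue_dist, edge_strength, saturation, w_color=0.8):
    """Dissimilarity between two adjacent regions (after [10]): a weighted mix of
    hue distance and boundary edge intensity. When saturation is low the hue is
    unreliable, so weight is shifted toward the edge term (illustrative rule)."""
    w_c = w_color * saturation      # fade the color term as saturation decreases
    w_e = 1.0 - w_c
    return w_c * hue_dist + w_e * edge_strength

def region_features(rgb_image, disparity_map, label_map):
    """Six-element feature array per region: median H, S, V and disparity,
    plus the X and Y coordinates of the centroid (input to the HSOM)."""
    hsv = rgb2hsv(rgb_image)
    features = {}
    for label in np.unique(label_map):
        ys, xs = np.nonzero(label_map == label)
        h, s, v = np.median(hsv[ys, xs, :], axis=0)
        d = np.median(disparity_map[ys, xs])
        features[label] = np.array([h, s, v, d, xs.mean(), ys.mean()])
    return features
```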
4. HSOM neural network
The Self-Organizing Map [11] belongs to the class of unsupervised neural networks. SOMs are basically a clustering tool, able to build a map of the distribution of the input data, grouping them into clusters topologically ordered within the SOM structure. A typical SOM structure consists of a single layer of neurons, connected to each other in a fixed, regular geometric configuration. Each neuron is also connected with all the inputs, and each connection is associated with a vector of weights. These vectors have the same size as the input vectors, so we can say that each neuron represents a position in the input space, i.e. the centre of a cluster. The learning procedure is repeated at each presentation of an input vector. It is made up of two phases that can be described as follows:
1 - The distances between the input vector and the weight vectors of the neural units are computed, and the neuron associated with the vector that is closest to the input vector is declared the winner (competitive phase).
2 - The weight vector of the winner and those of the neural units in its topological neighborhood are moved toward the input vector (cooperative phase).
The amount of change of a weight vector W in the updating process of the cooperative phase is expressed by Eq. 1:
dW = lr(t) ∗ a(t) ∗ (P − W),   (Eq. 1)
where P is the input vector, lr(t) is the learning rate and a(t) is a coefficient that is usually determined using a bell-shaped function centered on the winner neuron. During the learning process the learning rate lr(t) and
the radius of the bell-shaped function are progressively reduced to ensure convergence. There are two phases in this process: the ordering phase, in which both lr(t) and the radius are large and decrease quickly, and the tuning phase, which uses a low lr(t) and a radius reduced to a few (eventually one) neurons. The ordering phase is fast and is used to position the neurons in the denser regions of the distribution of input vectors, while the tuning phase is usually longer and refines the positions of the neurons to generate the final map.
SOMs are a popular choice for the clustering problems associated with image segmentation [12]. The main reason is that the segmentation itself can be seen as a clustering process in which each cluster encloses the portion of the feature space that represents a homogeneous region. Moreover, SOMs naturally produce topological maps, an important point in image segmentation, as it helps take into account the spatial aspects of the segmentation process. On the other hand, one of the main shortcomings of SOMs in our context is that they require the user to define the number of neural units before the segmentation begins. In a SOM-based segmentation process, a low number of neurons limits the number of regions, while a large one could lead to over-segmentation. Because the desired number of regions is a priori unknown, this shortcoming sets a significant limit on the use of SOMs in segmentation.
In our work we use a Hierarchical Self-Organizing Map, an evolution of the classical SOM that tries to overcome the aforementioned shortcoming by building a hierarchical structure in which each layer is a single-layer SOM. The main idea is that it is possible to segment an image by grouping its features at different levels of analysis, supposing that each layer of the HSOM achieves a viewpoint at a different scale [13]. This behavior can be very useful in object identification, as it allows the regions belonging to each object to be grouped progressively, without defining a homogeneity criterion a priori. Such a hierarchical structure grows while the segmentation process unfolds, thus partially overcoming the limit imposed by the fixed number of neurons of classical SOMs.
The inputs to the first layer of the network are the six-element feature vectors built from the regions identified by the merging process, as described in the previous section. The first four elements are the medians of the H, S and V components of the HSV color representation and of the depth over the pixels of each region, while the last two are the X and Y coordinates of the centroid of the region. Thus, half of the feature space accounts for the color description, while the other half represents the position of the regions in 3D space.
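To make the competitive and cooperative phases and the update rule of Eq. 1 concrete, here is a minimal NumPy sketch of a single training step on an N×N lattice; the Gaussian neighborhood is one common choice for the bell-shaped coefficient a(t), and all names are ours.

```python
import numpy as np

def som_step(weights, grid, p, lr, radius):
    """One SOM update (Eq. 1): weights has shape (N*N, dim), grid (N*N, 2)
    holds each neuron's lattice coordinates, p is one input vector."""
    # competitive phase: the neuron whose weight vector is closest to p wins
    winner = np.argmin(np.linalg.norm(weights - p, axis=1))
    # cooperative phase: bell-shaped (Gaussian) coefficient a(t) around the winner
    lattice_dist = np.linalg.norm(grid - grid[winner], axis=1)
    a = np.exp(-(lattice_dist ** 2) / (2.0 * radius ** 2))
    # dW = lr(t) * a(t) * (P - W), applied to the winner and its neighborhood
    weights += lr * a[:, None] * (p - weights)
    return winner

# Example: a 4x4 lattice trained on stand-ins for six-element region feature vectors
N, dim = 4, 6
rng = np.random.default_rng(0)
weights = rng.random((N * N, dim))
grid = np.array([(i, j) for i in range(N) for j in range(N)], dtype=float)
for p in rng.random((100, dim)):
    som_step(weights, grid, p, lr=0.5, radius=2.0)
```

In a full training run, lr and the radius would be decreased over the ordering and tuning phases described above.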
The size of each SOM layer is NxN where N depends upon the number n of input vectors (i.e. the number of regions for the first layer) through the formula:
N = round(0.8 ∗ √n)   (Eq. 2)

One of the main drawbacks of this approach is that it does not take into account the size of the regions when building clusters. This could lead to errors in cluster formation, because small regions are usually less reliable than larger ones. Our goal is to obtain a few large regions that identify the objects in the images. For these reasons we suggest that it may be more useful to use the small regions to fine-tune the clusters formed around the larger ones. Therefore, we decided to enhance the role of large regions by increasing the number of times that their feature vectors are presented to the HSOM, according to the following equation:

k = ⌈10 ∗ tansig(a / (0.05 ∗ A))⌉   (Eq. 3)

where ⌈·⌉ represents the ceiling function, k is the
number of times the vector that represents a region with area a is present in the data set, and A is the total area of each image. Such preprocessing of the training data set is performed after determining the size of the subsequent layer through Eq. 2; thus it does not affect the number of regions n or the number of neurons in the layer.
Once the first layer has been trained, the weight vectors of the neurons that won for at least one input become the input vectors for the second layer. The size of the second layer is also determined using Eq. 2, where n is now the number of winning neurons in the first layer. The network grows dynamically, adding new layers until the desired size, which we set to a 3×3 neuron lattice, is reached. The choice of nine neurons in the final layer, allowing up to nine different objects to be recognized, is empirical and depends on the structure of our images, which usually contain at most three objects.
Following this learning process, the HSOM progressively builds clusters by grouping simpler clusters. Therefore, the new clusters are not necessarily spherical anymore, but can have any shape that can be obtained by merging the shapes of the clusters that are joined together. Using this approach the HSOM achieves a better partition of the feature space than a classical SOM.
At the end of the process the trained network has learned a description of the visual scene that can be used to recognize the same objects in other images of the same scene.
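A small sketch of the layer sizing (Eq. 2) and of the replication rule (Eq. 3) follows; tansig is taken to be the hyperbolic tangent sigmoid (i.e. tanh), and the reconstruction of the two formulas above reflects our reading of the text.

```python
import math

def layer_size(n):
    """Eq. 2: lattice side N from the number n of input vectors (layer is N x N)."""
    return int(round(0.8 * math.sqrt(n)))

def repetitions(region_area, image_area):
    """Eq. 3: how many times a region's feature vector is replicated in the
    training set; tansig is the hyperbolic tangent sigmoid, i.e. tanh."""
    return math.ceil(10.0 * math.tanh(region_area / (0.05 * image_area)))

# A region covering 5% of the image is presented ceil(10 * tanh(1)) = 8 times,
# while a very large region saturates at 10 presentations.
```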
To this end, the following pair of images in the sequence is chosen and processed using the described segmentation algorithms, but the final merging is performed using the previously trained HSOM. The regions identified in the new pair of images are therefore grouped according to the clusters defined in the segmentation process on the first pair of images. In a scenario in which the cameras are motorized to allow scanning of a larger visual scene, camera motion information may be available from sensors or as a copy of the driving command. Such information may be used to modify, for each neuron, the weights representing the position of the identified object, as sketched below. This would help avoid recognition errors caused by the different position that an object has in the scene because of the movement of the camera. We therefore consider that if the regions belonging to the same object in the two pairs are grouped by the same neurons, the object has been recognized.
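The following sketch illustrates, under assumed conventions, how the stored position weights could be shifted using the known camera displacement and how the new regions would then be assigned to the trained neurons. The weight layout (H, S, V, disparity, X, Y) mirrors the feature vectors described above, and the sign of the shift depends on the direction of the camera motion.

```python
import numpy as np

def shift_position_weights(neuron_weights, displacement, baseline):
    """Shift each neuron's stored X-centroid weight by the parallax predicted
    from the known camera displacement; disparity is assumed to be stored in
    the fourth component of the weight vector (sign convention assumed)."""
    w = neuron_weights.copy()                 # shape (n_neurons, 6): H, S, V, disparity, X, Y
    w[:, 4] += w[:, 3] * displacement / baseline
    return w

def assign_regions(neuron_weights, region_vectors):
    """Group each region of the new frame with its nearest neuron; objects whose
    regions fall on the same neuron as in the training frame are recognized."""
    dists = np.linalg.norm(region_vectors[:, None, :] - neuron_weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)           # winning neuron index per region
```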
5. Experimental results
Many publications dealing with the problem of object identification and recognition can be found in the scientific literature, approaching the issue with a large variety of solutions. Despite such interest, a definition of metrics allowing the results and performance of each technique to be evaluated uniformly is still lacking [14]. This lack, which is shared with the results of segmentation algorithms, is probably due to the difficulty of defining the critical aspects for evaluating the results of intrinsically subjective processes. The absence of shared performance evaluation criteria has led to a large number of qualitative evaluations that make it impossible to compare the results achieved by different algorithms. However, in the field of recognition it is possible to specify some key points that could lead to more quantitative evaluations. According to Smith et al. [14], an object shown in different images is recognized if it is correctly labeled in all the images and if its position in the different images is correctly estimated. Other parameters that could contribute to the scoring of an object recognition process are the correct estimates of the shapes, areas or orientations of the recognized objects, depending on the needs of the specific system.
The suggested approach exploits different algorithms, and therefore a complete evaluation of its performance should score both the algorithms computing the different features and the quality of the final segmentation. However, a quantitative evaluation
comprising all the intermediate results of the processing is rather difficult due to the non-homogeneity of the different features and, therefore, of their evaluation criteria. Moreover, assessing the quality of any feature evaluation process does not ensure a better judgement of the overall performance. Therefore we chose to evaluate the whole system in terms of its object recognition performance only, as this is the overall goal of the presented approach.
The metric we defined to evaluate the results of our system aims at verifying whether, when the trained SOM is applied to a new image pair, each neuron still identifies the same object it responded to in the training pair. In practice this is accomplished by determining whether each neuron responds to parts of the same object in all image pairs. We therefore acquired sequences of stereo images of a fixed visual scene while moving the binocular system along a straight line perpendicular to the optic axes of the cameras. In this setting we assumed that the displacement of the binocular system between two frames is known and is provided to the recognition algorithm. Therefore, when an object is detected in one image, the estimate of its position in another image of the sequence can be obtained by multiplying the disparity value of the object by the displacement of the binocular system and dividing by the distance between the two cameras. Such an estimate is used for evaluating the correctness of the position of the recognized objects, acting as an automatically calculated ground truth for the centroids. According to [14] this is one critical point for asserting that an object has been recognized. To strengthen such a comparison we also consider that, in our specific setting, the area of each object cannot vary much between the different frames. A significant variation in area between what the system considers the same object in two frames is therefore a clue to some mismatch in the identified object, i.e. an error in recognition. These considerations led to the suggested “recognition test”, which can be summarized as follows:
1. Train the HSOM with one image of the sequence.
2. Modify the coordinates stored in the network weights using the information about the cameras’ displacement.
3. Use the trained network for the segmentation of another image of the sequence.
4. Identify the main region generated by each neuron in each image.
5. Compute a dissimilarity measure (Eq. 4) between each possible pair of regions belonging to the two different images.
6. Match each region in the training image with the region having the lowest dissimilarity measure in the test image.
7. Consider as recognized only those regions of the training image that match regions of the test image identified by the same neuron and having a dissimilarity measure I below a predefined threshold.
Note that the first three points represent the mixed identification/recognition process described in the paper, while the subsequent steps represent the evaluation process aimed at assessing the effectiveness of the recognition. This procedure is repeated cyclically, training the network on each image in the sequence and testing the trained network on all the other images in the sequence.
The dissimilarity measure I, used for matching regions, is made up of two parts that take into consideration the variation of area and the error in the centroid position, respectively. The formula can be written as:

I = |Area(R_i) − Area(R'_j)| / Area(R_i) + ||Centroid_e(R_i) − Centroid(R'_j)|| / ||Centroid_e(R_i) − Centroid(R_i)||   (Eq. 4)
where R represents regions in the training image and R' regions in the test image, i and j denote the i-th and j-th neuron of the network, respectively, while the subscript e refers to the expected centroid position based on camera motion information. We consider that an object represented by R_i is recognized if the minimum of I is less than 2 and is obtained for j = i. The recognition performance is then evaluated as the percentage of recognized objects over the total number of expected objects (based on the number identified in the training image), and by the mean errors in area and in centroid position.
We tested our system on three sequences of six binocular images each; two sequences show three objects in the foreground while the other shows only two. Due to the arrangement of our binocular system, the foreground is defined as the zone of the observed scene between 50 cm and 150 cm from the cameras. Altogether we tested our system on a total of 18 training images, 90 test images and 240 objects to be recognized. The system recognized 82% of the objects, with an average error of 14% in terms of the area of the objects and an average error of 25% in estimating the position of their centroids.
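As a concrete illustration of the recognition test, the following NumPy sketch evaluates Eq. 4 and the acceptance rule described above; the expected centroid is obtained from the disparity, the camera displacement and the baseline as explained earlier, and the simple data layout is an assumption.

```python
import numpy as np

def expected_centroid(centroid, disparity, displacement, baseline):
    """Training-image centroid shifted along X by the parallax predicted from
    the known camera displacement (sign depends on the motion direction)."""
    cx, cy = centroid
    return np.array([cx + disparity * displacement / baseline, cy])

def dissimilarity(area_i, area_j, expected_ci, centroid_j, centroid_i):
    """Eq. 4: relative area variation plus centroid error normalised by the
    expected centroid displacement."""
    area_term = abs(area_i - area_j) / area_i
    pos_term = (np.linalg.norm(expected_ci - centroid_j)
                / np.linalg.norm(expected_ci - centroid_i))
    return area_term + pos_term

def is_recognized(I, i):
    """Region i of the training image is recognized if the lowest dissimilarity
    in row i of the matrix I is below 2 and is achieved for j == i."""
    j = int(np.argmin(I[i]))
    return j == i and I[i, j] < 2.0
```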
Figure 2. Images 1, 2 and 5 of the 1st sequence (a, b, c) and the respective output of the algorithm (e, f, g). Each color represents a label, i.e. an object. The network is trained only on the first image (a).

6. Conclusion
We have developed a binocular vision system for detecting unknown objects in the foreground of the visual scene and recognizing them in other views of the same scene. This is achieved by the use of an HSOM network that combines information on color space, edges and depth derived from independent algorithms. The HSOM shows good clustering ability, almost always correctly grouping the regions that compose the objects. The information stored in the network allows recognition of the objects in the other frames of the sequence when they fit the stored descriptions. Thus, the network weights represent an “expected image”, helping both segmentation and recognition of the objects. This approach proves to be valuable both in recognition and in estimating the positions of objects in 3D space. Future developments of this work will consider continuously adjusting the description of the scene by moving the HSOM neurons, partially retraining the network at each new image pair, and using camera motion information provided by inertial sensors. The ultimate goal will be to detect and track unknown objects in an unconstrained environment.

7. References
[1] D. Kragic, M. Björkman, H.I. Christensen and J.-O. Eklundh, “Vision for robotic object manipulation in domestic settings”, Robotics and Autonomous Systems 52, 2005, pp. 85-100.
[2] H.D. Cheng, X.H. Jiang, Y. Sun and J. Wang, “Color image segmentation: advances and prospects”, Pattern Recognition 34, 2001, pp. 2259-2281.
[3] S.D. Roy, S. Chaudhury, S. Banerjee, “Active recognition through next view planning: a survey”, Pattern Recognition 37(4), 2004, pp. 429-446.
[4] K.S. Fu, J.K. Mui, “A survey on image segmentation”, Pattern Recognition 13, 1981, pp. 3-16.
[5] E.R. Kandel, J.H. Schwartz, and T.M. Jessell, Principles of Neural Science, Elsevier, 1999.
[6] S. Beucher, “The watershed transformation”, Scanning Microscopy, 1992.
[7] L. Vincent, P. Soille, “Watersheds in digital spaces: an efficient algorithm based on immersion simulations”, IEEE Transactions on Pattern Analysis and Machine Intelligence 13, 1991, pp. 583-598.
[8] L. Di Stefano, M. Marchionni, S. Mattoccia, G. Neri, “A fast area-based stereo matching algorithm”, Image and Vision Computing, vol. 22, n. 12, 2004, pp. 983-1005.
[9] H. Hirschmüller, “Real-time correlation-based stereo vision with reduced border errors”, International Journal of Computer Vision 47, 2002, pp. 229-246.
[10] E. Navon, O. Miller, A. Averbuch, “Color image segmentation based on adaptive local thresholds”, Image and Vision Computing 23, 2005, pp. 69-85.
[11] T. Kohonen, “Self-organized formation of topologically correct feature maps”, Biological Cybernetics 43, 1982, pp. 59-69.
[12] Y. Jiang, Z.-H. Zhou, “SOM ensemble-based image segmentation”, Neural Processing Letters 20, 2004, pp. 171-178.
[13] S.M. Bhandarkar, J. Koh, M. Suk, “Multiscale image segmentation using a hierarchical self-organizing map”, Neurocomputing 14, 1997, pp. 241-272.
[14] K. Smith, D. Gatica-Perez, J.-M. Odobez, S. Ba, “Evaluating Multi-Object Tracking”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2005.