Computer Vision and Image Understanding 95 (2004) 334–353 www.elsevier.com/locate/cviu
Dynamic learning from multiple examples for semantic object segmentation and search

Yaowu Xu a, Eli Saber a,b, A. Murat Tekalp a,c,*

a Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14627, USA
b Xerox Corporation, 800 Phillips Road, Webster, NY 14580, USA
c College of Engineering, Koç University, Sariyer, Istanbul, Turkey

Received 11 June 2003; accepted 19 April 2004
Available online 20 July 2004
Abstract

We present a novel ‘‘dynamic learning’’ approach for an intelligent image database system to automatically improve object segmentation and labeling without user intervention, as new examples become available, for object-based indexing. The proposed approach is an extension of our earlier work on ‘‘learning by example,’’ which addressed labeling of similar objects in a set of database images based on a single example. The proposed dynamic learning procedure utilizes multiple example object templates to improve the accuracy of existing object segmentations and labels. Multiple example templates may be images of the same object from different viewing angles, or images of related objects. This paper also introduces a new shape similarity metric called the normalized area of symmetric differences (NASD), which has desirable properties for use in the proposed ‘‘dynamic learning’’ scheme and is more robust against the boundary noise that results from automatic image segmentation. The performance of the dynamic learning procedures is demonstrated by experimental results.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Learning by examples; Dynamic learning; Shape matching; Segmentation
* Corresponding author. Fax: 1-716-273-4919.
E-mail addresses: [email protected] (Y. Xu), [email protected], [email protected] (E. Saber), [email protected] (A.M. Tekalp).
URLs: http://www.ece.rochester.edu/~yaxu, http://www.ece.rochester.edu/~saber, http://www.ece.rochester.edu/~tekalp.

1077-3142/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.cviu.2004.04.003
1. Introduction

Humans navigate through and retrieve samples from large image/video databases by means of semantic concepts, such as objects, people, etc. However, most current multimedia systems can only process low-level visual features, such as color, texture, shape, etc. [1–6], in an automatic fashion. ‘‘Learning’’ approaches have been proposed in order to automatically compute high-level semantic concepts from low-level visual features. These approaches can be classified as: (1) learning from interactive user feedback, and (2) learning from examples without run-time user interaction.

‘‘Learning from user feedback,’’ i.e., relevance feedback [7–12], requires user responses indicating relevant or irrelevant items in a search in order to: (1) establish either positive or negative links between retrieved images and query objects [7]; (2) update the weights of various feature dimensions in a given vector space [8,9]; or (3) refine the probability distribution of a proposed Bayes model for the images in the database [10]. More recently, several algorithms were introduced to improve ‘‘relevance feedback’’ learning mechanisms. In particular, He et al. [11] proposed inferring a semantic space to improve long-term performance. Muneesawang and Guan [12] introduced the self-organizing tree map (SOTM) to minimize the user interaction required by relevance feedback. Tieu and Viola [13] proposed the use of a ‘‘boosting’’ algorithm in interactive query learning. Tong and Chang [14] introduced support vector machine active learning to search a given image database with relevance feedback. Wang et al. [15] employed a neural network model with a back-propagation through structure (BPTS) learning algorithm to learn a tree-structured representation of the image content. Potential drawbacks of the ‘‘relevance feedback’’ approach include: (1) slow convergence; (2) sensitivity to user subjectivity; and (3) lack of persistence, i.e., ‘‘knowledge’’ is stored only during the current query session and does not propagate to later queries.

‘‘Learning from example,’’ on the other hand, attempts to create semantic abstractions for images when regions in images are determined to match user-provided examples based on similarity of low-level visual features [16–19]. An attractive benefit of the ‘‘learning from example’’ scheme is automatic abstraction of semantic concepts, and segmentation and labeling of semantic objects, without user intervention. The concept of ‘‘learning from a single example’’ and the associated image representation, data structures, search procedures, and querying were introduced in [18]. We later proposed a partial shape matching procedure as a means for faster partial-match-guided searching [19].

The purpose of this paper is to extend our methods to ‘‘dynamic learning from multiple sequential examples.’’ Our earlier work presented the concept of ‘‘learning from example’’ [18] and a contour-based partial shape matching algorithm [19]. This paper focuses on a ‘‘dynamic learning’’ concept designed to overcome new challenges that were not addressed in those papers. Hence, the contributions of this paper include: (1) dynamic learning to resolve potential conflicts between different semantic abstractions resulting from different example templates; (2) a new data structure and query strategy that enable the ‘‘dynamic learning’’ scheme; and (3) a new similarity measure for shape matching. The dynamic learning
process refines the segmentation mask of the object and updates the semantic abstractions when a new example provides a better match than the existing one(s).

The constraints, context, and quantifiable benefits of the proposed ‘‘dynamic learning’’ technique can be summarized as follows:

(1) Constraints and context: (a) The dynamic learning concept rests on the premise that an object is more similar to its own template than to the template of another object. (b) The shape matching depends on a ‘‘reasonable’’ segmentation. It is somewhat robust to minor errors, as will be demonstrated in the results, but becomes less effective as these errors cause the shape to become highly distorted. (c) The matching is invariant to relative translations, planar rotations, reflections, and zooming. Perspective distortions, however, would require the use of additional example templates with similar object pose. (d) The initial search can be expensive, depending on the number of regions in the segmentation map. This process, however, can be executed offline.

(2) Quantifiable benefits: (a) ‘‘Knowledge’’ gained from past or current queries can be propagated to future queries. (b) Dynamic learning provides the ability to self-correct erroneous matches as new templates become available; i.e., inaccurate semantic abstractions can be automatically corrected (possibly in off-line mode) during the process of ‘‘dynamic learning,’’ provided that the constraints/context discussed above are satisfied. (c) No manual intervention or handling of images is required. This provides time and resource benefits while maintaining repeatability and objectivity of the results, and eliminates the user subjectivity commonly encountered in relevance feedback schemes.

The ‘‘dynamic learning’’ scheme imposes new requirements on the similarity measure for the low-level visual features. Section 2 presents a new shape similarity metric that has the desired properties. Section 3 introduces the proposed data representation and query strategies for the dynamic learning system. Section 4 presents the concept and procedure for ‘‘dynamic learning.’’ Experimental results and analysis are presented in Section 5. Conclusions are drawn in Section 6.
2. A new similarity measure for shape matching

The ‘‘learning from example’’ concept requires that similarity measures for low-level visual features, such as color or shape, be defined. In our previous work [18], the color histogram intersection measure and the Hausdorff distance [21–23] were used as the color and shape similarity measures, respectively, between an example template and a candidate image region. However, the directed Hausdorff distance is sensitive to noise; in particular, the distance between two point sets can be significantly affected by a single outlier [22]. The proposed dynamic learning procedure requires ranking of similarities between many image regions and multiple object templates, i.e., we need to find not only the best match to a given object template among many image regions, but also the best match between a given image region and many object templates.
Therefore, it is desirable that the similarity measure be: (1) a metric, i.e., suitable not only for a threshold test but also for ranking; and (2) symmetric, i.e., the similarity measure should be the same whether it is computed in the image or the template domain.

2.1. Normalized area of symmetric differences

We propose a new shape similarity metric, called the normalized area of symmetric differences (NASD), which satisfies the above requirements and is normalized to remove the effect of the size of the candidate region or template on the similarity measure. It is given by

\[ d(A, B) = \frac{(A - B) + (B - A)}{(A + B)}, \]

where A and B represent two shapes, (A + B) is the area of the union of the two regions, (A − B) denotes the area covered by A but not by B, and (B − A) is defined conversely. Suppose A is the example template and B is a candidate region. Then (A − B) represents the area of the FALSE NEGATIVE region, and (B − A) is the area of the FALSE POSITIVE region. The properties of the normalized area of symmetric differences measure include:

1. It is a metric. That is, it satisfies the following: nonnegativity, d(A, B) ≥ 0 for any pair of shapes involved in a matching; identity and uniqueness, d(A, A) = 0 if and only if the two shapes are identical; symmetry, d(A, B) = d(B, A); and the triangle inequality, d(A, C) + d(B, C) ≥ d(A, B) for any A, B, and C.
2. It is robust against small changes in the shapes A and B. The NASD is not sensitive to boundary noise, and it is also robust against small distortions, cracks, occlusions, extrusions, etc.
3. Invariance to rotation, translation, and scaling. The NASD is invariant to translation, rotation, and scaling, provided that it is computed after the two shapes are registered in the same domain. This is explained next.

2.2. Computation of normalized area of symmetric differences

Before computing the NASD, the contours of the shapes A and B are approximated by B-splines, and then registered in either the image or the template domain. Note that since the measure is symmetric, the registration can be done in either domain. It is well known that B-spline representation and modal matching can suppress contour noise due to low-level segmentation errors [20–23]. Hence, we adopt the modal matching approach [20] to establish feature correspondences between the template and the candidate region. These correspondences are then used to estimate the affine transform parameters between the two shapes through a least-squares approach [23]. Upon computation of the affine transform parameters, the image region and the template are registered, and the NASD is calculated.
Fig. 1. Similarity matching of an object and an example template.
The proposed process for modal matching and computation of the NASD is illustrated in Fig. 1.
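For concreteness, the following Python sketch (not part of the paper) computes the NASD between two binary shape masks that are assumed to be already registered in a common domain, as described above; the array sizes and the example shapes are purely illustrative.

```python
import numpy as np

def nasd(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Normalized area of symmetric differences between two binary shape masks.

    Both masks are assumed to be boolean arrays of the same size, already
    registered in a common domain (i.e., after the affine alignment described
    in Section 2.2). The value lies in [0, 1]: 0 for identical shapes,
    1 for disjoint shapes.
    """
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:                                # both shapes empty: treat as identical
        return 0.0
    false_negative = np.logical_and(a, ~b).sum()  # area of A - B
    false_positive = np.logical_and(b, ~a).sum()  # area of B - A
    return float((false_negative + false_positive) / union)

# Illustrative example: a square template vs. the same square with a missing corner.
template = np.zeros((100, 100), dtype=bool)
template[20:80, 20:80] = True
candidate = template.copy()
candidate[20:50, 20:50] = False                   # simulate a segmentation error
print(nasd(template, candidate))                  # 0.25: one quarter of the union differs
```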
3. Data structure and querying

In this section, we present the data structures and query strategy used in the proposed dynamic learning framework.

3.1. Data structure for static learning

We start with a very brief review of the data structure used in our original learning method (referred to as static learning here). We represent images by a ‘‘scene graph’’ which consists of a tree that indicates the parent–child relationships between
high-level objects and low-level (elementary) image regions, and an adjacency matrix that captures the spatial relationships between these elementary regions [18]. An example object-region tree is illustrated in Figs. 2A–C. Elementary regions (nodes) are automatically constructed based on low-level color or texture segmentation, with each node denoting a uniform region [24]. ‘‘Learning from example’’ refers to storing combinations of regions, similar to the example object, in the form of composite nodes with specific indexing information attached. The implementation of the learning process requires searching all valid combinations of elementary regions (as determined by the adjacency matrix) in an image for shape and/or color similarity to a user-provided example template. A match is established when the similarity measure between a particular combination of elementary nodes and the example template is less than a pre-determined threshold. Then, a composite node is formed containing the matching combination of elementary nodes [18]. The composite node provides a level of semantic knowledge over and above the
Fig. 2. Dynamic learning concept: (A) original image; (B) low-level segmentation with each homogeneous color region enumerated; (C) initial content hierarchy; (D) object template; (E) matching result; (F) content hierarchy after initial learning; (G) new object template; (H) matching result to new template; and (I) content hierarchy after dynamic learning.
original scene graph containing only low-level nodes. As a result, subsequent searches using the same example template would immediately identify the composite node as a match using its shape and/or color attributes, without processing its lower level.

3.2. Data structure for dynamic learning from multiple examples

In the case of dynamic learning from multiple examples, we introduce a new data structure for each database image, called the object vs. template similarity table (OTST), in addition to the scene graph with composite nodes. The structure of the OTST is depicted in Table 1, where each row corresponds to a potential object (composite node) and each column corresponds to an example template. The OTST provides a means for ranking the similarity of each composite node against each example template.

Table 1
Object-template similarity table for image I

i/j      Template 1     Template 2
C1       d(C1, T1)      d(C1, T2)
C2       d(C2, T1)      d(C2, T2)

Before any learning takes place, the database images contain only low-level regions; hence the OTST is a null table. For each image I, the OTST is populated as learning takes place with the introduction of each example template Tj, according to the following procedure (an illustrative sketch is given at the end of this subsection):

Procedure 1: Initial population of the OTST
    If (image I already has composite nodes) {
        For (each composite node Ci) {
            If d(Ci, Tj) < Reject Threshold
                Insert d(Ci, Tj) in the OTST at row i and column j
        }
        Perform similarity search between Tj and the rest of the elementary nodes of I [18]
    } else {
        Perform similarity search between Tj and the elementary nodes of I [18]
    }

Matches found during the similarity search between Tj and combinations of elementary nodes lead to the creation of new composite nodes. We enter them in the OTST as new rows and also compute their similarity to the other example templates. The reject threshold, which is introduced only to keep the size of the OTST reasonable, can be set rather loosely; it is the only threshold involved in the proposed dynamic learning procedure.
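As a rough illustration of how the OTST and Procedure 1 might be realized, the sketch below (not the paper's implementation) uses a simple dictionary-of-dictionaries table; the names `populate_otst` and `similarity_search`, the dict fields, and the threshold value are assumptions introduced for illustration only.

```python
from typing import Callable, Dict, List

REJECT_THRESHOLD = 0.5          # assumed value; it only bounds the size of the table

class OTST:
    """Object vs. template similarity table: rows = composite nodes, columns = templates."""
    def __init__(self) -> None:
        self.table: Dict[str, Dict[str, float]] = {}

    def insert(self, node_id: str, template_id: str, score: float) -> None:
        self.table.setdefault(node_id, {})[template_id] = score

def populate_otst(otst: OTST,
                  composite_nodes: List[dict],
                  template: dict,
                  distance: Callable[[object, object], float],
                  similarity_search: Callable[..., None]) -> None:
    """Procedure 1: initial population of the OTST for one image and one new template."""
    if composite_nodes:                          # the image has been learned before
        for node in composite_nodes:
            score = distance(node["shape"], template["shape"])
            if score < REJECT_THRESHOLD:
                otst.insert(node["id"], template["id"], score)
        # search the remaining elementary nodes as in [18] (placeholder call)
        similarity_search(template, scope="remaining_elementary_nodes")
    else:                                        # first template seen by this image
        similarity_search(template, scope="all_elementary_nodes")
```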
3.3. Queries and query resolution

The query engine supports three types of queries using the OTST. In all cases, it returns the best N matching database images for the query (example) template.

1. Query by known example: In this case, the query template is already in the OTST. Since all similarity values have already been computed, only ranking of the values needs to be done at query time (a sketch follows this list).
2. Query by new example: When a new example/query template is introduced, the system first updates the OTST as described in Section 3.2. The updated OTST is then used to generate the rankings.
3. Query by keyword: No labeling information is stored in the OTST, since keywords and labels may be subjective. Links between object templates and keywords may be established at the query user interface, after which the system performs query-by-template.
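Continuing the sketch above, query by known example reduces to ranking images by the best NASD recorded in their OTSTs for the given template column; all field names remain illustrative assumptions.

```python
from typing import List

def query_by_known_example(images: List[dict], template_id: str, n: int = 5) -> List[dict]:
    """Rank database images against a template already recorded in their OTSTs.

    Each image is assumed to be a dict carrying an 'otst' field (the OTST sketched
    in Section 3.2). An image's score is the smallest NASD among its composite
    nodes for the given template column; the n best-scoring images are returned.
    """
    scored = []
    for image in images:
        scores = [row[template_id]
                  for row in image["otst"].table.values()
                  if template_id in row]
        if scores:                               # skip images without a candidate object
            scored.append((min(scores), image))
    scored.sort(key=lambda pair: pair[0])        # smaller NASD = better match
    return [image for _, image in scored[:n]]
```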
4. Dynamic learning

In this section, we first compare the static and dynamic learning concepts in Section 4.1 in order to establish the need for dynamic learning procedures. Specific dynamic learning procedures for updating the rows of the OTST, guided searching around existing composite nodes, the computational complexity of the search procedures, and the use of a color similarity measure are then discussed in Sections 4.2–4.5.

4.1. ‘‘Static’’ versus ‘‘dynamic’’ learning

In our original static learning method [18], composite nodes (groupings of elementary nodes) remain permanent once they are created. Depending on which example template is presented first and on the similarity measure used, it is possible that a composite node is formed by a non-optimal grouping of elementary nodes; i.e., the grouping may have missed a part of the object or may include extra parts that do not belong to the object, yet the similarity measure did not exceed the threshold for node formation. In static learning, there is no possibility of updating a composite node with new examples.

The ‘‘dynamic learning’’ concept rests on the assumption that a portion of a real-life object is less similar to a given template than a complete object is to its own template. In the proposed ‘‘dynamic learning,’’ composite nodes established in the first learning step (when the first example template is presented) can be updated later, when new example templates become available (self-correction). Hence, the grouping of elementary nodes is dynamic and the learning never stops. This process is explained in the next section.

4.2. Dynamic learning procedure

The strategy of updating existing composite nodes can be summarized as follows: when an existing composite node is found to also match a new (later) example
template, the low-level regions making up the composite node and all neighboring regions are re-searched to determine whether a better match to the new example template exists. When such a search yields a slightly different grouping of elementary nodes that matches the new object template better than the existing composite node matches any of the existing templates, the existing composite node is destroyed and a new composite node is created. This procedure can be summarized as follows (an illustrative sketch is given at the end of this subsection):

Procedure 2: Updating composite nodes in the OTST
    If (image I has composite nodes) {
        For (each composite node Ci) {
            If d(Ci, Tj) < Reject Threshold {
                Perform a similarity search between Tj and perturbations of Ci to find the best match Ci*
                If (Ci* ≠ Ci) {Replace Ci by Ci* in the OTST}
                else {Insert d(Ci, Tj) in the OTST at row i and column j}
            }
        }
        Perform similarity search between Tj and the rest of the elementary nodes of I [18]
    } else {
        Perform similarity search between Tj and the elementary nodes of I [18]
    }

The main concepts of ‘‘static vs. dynamic learning’’ are illustrated by the example in Fig. 2. Figs. 2A and B show an ‘‘SUV’’ image and its corresponding segmentation map, made up of eight regions, respectively. The initial scene graph, constructed from Fig. 2B, is illustrated in Fig. 2C. Once the image is searched using the ‘‘Sedan’’ example template shown in Fig. 2D, the shape matching algorithm identifies the seven regions (regions 2–8) shown in Fig. 2E as the best match to the template displayed in Fig. 2D. This results in the formation of a composite node that consists of the grouping of regions 2–8, as shown in Fig. 2F. Clearly, this newly formed composite node missed ‘‘region 1,’’ a small region that belongs to the ‘‘SUV’’ object. In ‘‘static learning,’’ the incomplete grouping of the elementary nodes stays permanent, yielding a sub-optimal and somewhat undesirable result. The concept of ‘‘dynamic learning’’ is therefore illustrated in Figs. 2G–I, where Fig. 2G represents an ‘‘SUV’’ template introduced at a later time to search the image shown in Fig. 2B after the formation of the composite node depicted in Fig. 2F. Figs. 2H and I represent the corresponding shape matching result and the updated composite node graph, respectively.

In general, when a new object template is introduced into an image database system, all the images in the database are searched for the object as a phase of the ‘‘learning’’ process. The search can be performed either online, directly as the user retrieves images through ‘‘query by new example,’’ or offline by employing the user search profile. Either way, each image in the system is searched for the new object template by applying the hierarchical content matching strategy described above.
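A possible rendering of Procedure 2, continuing the OTST sketch of Section 3.2, is given below. Here `guided_search` stands for the routine of Section 4.3, `union_of` is a hypothetical helper that merges region masks, and the replacement test follows the condition stated above: the new grouping must match the new template better than the existing node matches any previously seen template.

```python
def update_composite_nodes(image: dict,
                           template: dict,
                           distance,
                           guided_search,
                           similarity_search) -> None:
    """Procedure 2: re-examine existing composite nodes when a new template arrives."""
    otst = image["otst"]
    if image["composite_nodes"]:
        for node in image["composite_nodes"]:
            score = distance(node["shape"], template["shape"])
            if score < REJECT_THRESHOLD:
                # perturb the node within {F} + {P} (Section 4.3) to find the best grouping C*
                best_regions, best_score = guided_search(node, template)
                # best similarity of the existing node to any previously seen template
                existing_best = min(otst.table.get(node["id"], {}).values(),
                                    default=float("inf"))
                if best_regions != node["regions"] and best_score < existing_best:
                    node["regions"] = best_regions           # rebuild the composite node
                    node["shape"] = union_of(best_regions)   # hypothetical mask-merging helper
                    otst.insert(node["id"], template["id"], best_score)
                else:
                    otst.insert(node["id"], template["id"], score)
        similarity_search(template, scope="remaining_elementary_nodes")
    else:
        similarity_search(template, scope="all_elementary_nodes")
```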
4.3. Guided search procedure for dynamic learning

Guided search refers to finding the best match C* to a new example template T in the neighborhood of an existing composite node C, taking advantage of the established match between T and C. The first step is to set up the search scope based on the information provided by the existing match. As presented in [18,19], correspondences between the image region and the template have been established in the matching process (either a full or a partial match). These correspondences are employed to estimate the affine transform that maps the object template into the image domain. Once the mapping has been completed, we use the projection of the template in the image domain to classify all elementary nodes into three categories: (1) the elementary node is fully covered by the template; this set is denoted {F}; (2) the elementary node is partially covered by the template; this set is denoted {P}; and (3) the elementary node does not intersect with the template at all; this set is denoted {N}. We limit the scope of the search to {F} + {P}, excluding all the nodes in {N}. This significantly reduces the computational complexity of the search.

The second step is to find the best match to the template within the search scope {F} + {P}. To this end, all nodes in {F} are pre-determined to be part of any potential match to be tested, thereby taking full advantage of the previously known best match. Hence, the procedure reduces to determining whether each of the elementary nodes in {P} should be incorporated into the existing composite node to form a better-suited match. This is accomplished by computing a matching score, using the techniques discussed in Section 2, for each combination made up of all the elementary nodes in {F} plus one or more of the elementary nodes in {P}. The ‘‘closest’’ combination to the object template is compared against the similarity of the existing composite node, and the composite node is rebuilt with the new grouping of elementary nodes if that grouping yields the better similarity measurement. The above procedure, which searches around an existing composite node to find a better-suited match to a given example template, is illustrated in Fig. 3.

There are two main advantages to dynamic learning with the above search strategy: (1) it automatically corrects inaccurate groupings of elementary nodes stored in the existing hierarchical content descriptions of images, so ‘‘dynamic learning’’ yields a better and more accurate semantic abstraction of the images; and (2) it significantly lowers the computational cost of finding the best match to the new example template by taking advantage of the match between the existing composite node and the new example template. A detailed analysis of the computational complexity of searches with new examples is given in Section 4.4.
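The guided search itself might be sketched as follows; `project_template`, `area_of_overlap`, `area_of`, and `shape_of` are hypothetical helpers for the affine projection and region geometry, and `nasd` is the metric of Section 2.

```python
from itertools import combinations

def guided_search(node: dict, template: dict):
    """Guided search within {F} + {P} around an existing composite node (illustrative)."""
    # Project the template into the image domain using the affine transform
    # estimated from the existing match (project_template is a hypothetical helper).
    projected = project_template(template, node)
    F, P = [], []
    for region in node["image"]["elementary_nodes"]:
        overlap = area_of_overlap(region, projected)   # hypothetical geometry helper
        if overlap == area_of(region):
            F.append(region)          # fully covered: always part of the candidate grouping
        elif overlap > 0:
            P.append(region)          # partially covered: tested in combinations
        # regions with no overlap form {N} and are excluded from the search

    best_grouping, best_score = None, float("inf")
    for k in range(len(P) + 1):       # 2^p candidate groupings in total
        for extra in combinations(P, k):
            grouping = F + list(extra)
            score = nasd(shape_of(grouping), template["shape"])  # NASD of Section 2
            if score < best_score:
                best_grouping, best_score = grouping, score
    return best_grouping, best_score
```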
Fig. 3. Process of guided search for dynamic learning.
4.4. Computational complexity

In this section, we analyze the computational complexity of the guided search procedure by evaluating example cases and comparing it to the approach described in [18]. Let us assume that the image is made up of K elementary nodes, and that a given object in the image consists of m elementary nodes. We can safely assume that m < f + p < K, where f and p are the numbers of elementary nodes in the sets {F} and {P}, as defined in Section 4.3, respectively. The number of combinations to be tested in the above search procedure is

\[ C = \sum_{l=0}^{p} \frac{p!}{l!\,(p-l)!} = 2^{p}. \]

However, if the image is searched in a hierarchical fashion prior to the formation of composite nodes, the number of valid combinations can be as large as 2^K − 1 [18].
Table 2
Computational complexity of example cases

                     K = 10, m = 6,    K = 15, m = 10,   K = 15, m = 10,   K = 20, m = 16,
                     f = 3, p = 3      f = 8, p = 2      f = 2, p = 8      f = 8, p = 8
Dynamic learning     8                 4                 256               256
All combinations     1023              32,767            32,767            1,048,575
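As a quick arithmetic check (not part of the paper), the entries of Table 2 follow directly from the two counts above: 2^p groupings for the guided search versus up to 2^K − 1 combinations for an unconstrained search.

```python
# Reproduce the Table 2 entries from the two counting expressions.
for K, p in [(10, 3), (15, 2), (15, 8), (20, 8)]:
    print(f"K={K:2d}, p={p}:  dynamic learning = {2**p:4d},  all combinations = {2**K - 1:,}")
# K=10, p=3:  dynamic learning =    8,  all combinations = 1,023
# K=15, p=2:  dynamic learning =    4,  all combinations = 32,767
# K=15, p=8:  dynamic learning =  256,  all combinations = 32,767
# K=20, p=8:  dynamic learning =  256,  all combinations = 1,048,575
```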
Since {P} is a subset of all the elementary nodes (in practice p ≪ K), it follows that 2^p ≪ 2^K − 1. As a result, the number of combinations tested in a dynamic learning guided search, 2^p, is much smaller than the total number of combinations tested in a hierarchical content search, 2^K − 1. Table 2 provides some practical examples demonstrating this significant reduction in computational complexity.

4.5. Dynamic learning based on color similarity

This section describes the potential use of histogram intersection [25] as a color similarity measure for dynamic learning. Properties of histogram intersection include: (1) low computational complexity, and (2) robustness to noise. It should be noted, however, that color histogram intersection is not a metric (see Section 2.1); it satisfies all criteria for a metric except the uniqueness requirement. If T and C denote the color histograms of an example template and a composite node, respectively, their histogram intersection is defined as

\[ H(C, T) = \frac{\sum_{j=1}^{N} \min(C_j, T_j)}{\sum_{j=1}^{N} T_j}, \]

where it is assumed that both histograms contain N bins. We note that histogram intersection is symmetric, i.e., H(C, T) = H(T, C), when both histograms are normalized to the same total mass, and that it can be used for color similarity ranking. As such, histogram intersection can be used as a color similarity measure for dynamic learning, either together with or instead of the normalized area of symmetric differences metric for shape similarity.
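A minimal sketch of this measure, assuming NumPy arrays for the histograms and illustrative example data, is:

```python
import numpy as np

def histogram_intersection(hist_c: np.ndarray, hist_t: np.ndarray) -> float:
    """Histogram intersection H(C, T) between two N-bin color histograms.

    When both histograms are normalized to the same total mass, the value lies
    in [0, 1] and the measure is symmetric in its arguments.
    """
    return float(np.minimum(hist_c, hist_t).sum() / hist_t.sum())

# Illustrative example with two 4-bin histograms normalized to unit mass.
c = np.array([0.40, 0.30, 0.20, 0.10])
t = np.array([0.25, 0.25, 0.25, 0.25])
print(histogram_intersection(c, t))   # 0.8
print(histogram_intersection(t, c))   # 0.8 (symmetric because both sum to 1)
```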
5. Experimental results

In this section, we demonstrate the advantages of dynamic vs. static learning on a set of real-life 8 bits/pixel RGB color images. We begin by providing a comparison of shape similarity matching using the Hausdorff distance versus the NASD, clearly highlighting the advantages of the latter in achieving robust similarity ranking for images in the presence of noise and segmentation errors. We then demonstrate the advantages of dynamic learning in identifying objects in images and constructing more accurate composite nodes.
5.1. Shape matching comparison using Hausdorff vs. NASD

As discussed in Section 2, the NASD similarity measurement has several distinct advantages over the previously utilized Hausdorff distance [18]. Here, in Fig. 4, we present a direct comparison between these two metrics. The car template utilized in the shape matching process is shown in Fig. 4A. The original image and its automatically and manually generated segmentations are shown in Figs. 4B–D, respectively. Note the segmentation errors found in Fig. 4C as compared to the manually generated ‘‘ground truth’’ segmentation found in Fig. 4D; these are especially visible in the roof and the front windshield of the car. Table 3 provides the results for the manual versus automatic segmentations using the Hausdorff and NASD metrics. From the table, it can be easily seen that both measures increased (indicating lesser similarity) for the automatic segmentation (Fig. 4C) when compared with the ‘‘ground truth’’ (Fig. 4D); a reasonable expectation, since most automatic segmentations are much less accurate than their human-prepared counterparts. However, the Hausdorff distance increased much more drastically than the NASD, as demonstrated by the percentage difference in Table 3. This drastic increase is due to the boundary noise and segmentation errors found in the automatic segmentation. The NASD measurement, on the other hand, was much less sensitive to boundary noise and segmentation errors, and as such more robust for similarity measurement. This is also demonstrated in the image portrayed in Fig. 5, where the composite node created using the NASD was more accurate than the one generated with the Hausdorff distance.
Fig. 4. Noise sensitivity of similarity measurements: (A) template; (B) input image; (C) automatic segmentation; and (D) manual segmentation.
Table 3
Comparison of similarity measurements

                              Hausdorff distance    Normalized area of symmetric differences
Semi-manual segmentation      8.663                 0.276
Automatic segmentation        14.623                0.300
Difference in percentage      68.8%                 8.7%
Fig. 5. Object segmentation using Hausdorff distance and NASD: (A) template; (B) input image; (C) composite node and content hierarchy created using Hausdorff distance; and (D) Composite node and content hierarchy created using NASD.
5.2. Dynamic learning experiments

To test the performance of the dynamic learning scheme for object-based image labeling, experiments were performed on a large dataset using multiple object templates. Fig. 6A shows a subset of 20 images, numbered consecutively in a raster scan fashion from left to right and top to bottom. Images 1–10 and 11–17 represent ‘‘Sedans’’ and ‘‘SUVs,’’ respectively. Image 18 is a scene with multiple ‘‘Sedans.’’ Images 19 and 20 are representative of other images in the database. Note that Images 9, 11, 14, 15, and 17 depict slight rotations in/out of the image plane, and Images 8 and 16 portray vehicles parked in the opposite direction compared to the other images. Fig. 6B provides a profile view of Template 1, a typical ‘‘Sedan’’ template. The results of searching the images in Fig. 6A using Template 1 are shown in Fig. 6C, arranged in a fashion corresponding to Fig. 6A, with the top five matches depicted in Fig. 6D. As can be seen from Fig. 6C, the search yielded not only ‘‘Sedans’’ but also ‘‘SUV’’ type objects, since they have a close similarity to Template 1, the ‘‘Sedan’’ template, shown in Fig. 6B. Subsequently, a composite node, capturing the initial regions in the segmentation map that correspond to the object, is introduced into the content hierarchy for each image. The node serves as a representation of the object, i.e., an entity that corresponds to the grouping of all the regions that together depict a shape similar to the template utilized in the search process. However, a close examination of Images 14 and 16 reveals that their corresponding composite nodes succeeded in capturing the regions of the ‘‘SUV’’ that match those of a ‘‘Sedan’’ but did not incorporate the ‘‘hatchback’’ portion of the ‘‘SUV,’’ since it does not match a ‘‘Sedan’’ well.

5.2.1. Search using new template without dynamic learning

Fig. 6E shows a second template (Template 2), an ‘‘SUV’’ shape, that will now be employed to search the database shown in Fig. 6A after it has been searched using Template 1 (Sedan) as described above. The search is performed as portrayed in Procedure 1, where the ‘‘permanent’’ composite nodes formed using Template 1 (Sedan) were first examined and scored for similarity to Template 2, followed by the remaining
Fig. 6. Experiments for dynamic learning.
regions in the scene. At the completion of the search, all the composite nodes were found to represent a ‘‘similar’’ match to Template 2 (SUV) with varying degrees of similarity. The top five most similar matches are depicted in Fig. 6F.
Table 4 provides the NASD similarity results to Templates 1 and 2, respectively, for all the images in Fig. 6A. Note that the table follows the format described by Table 1. By examining Table 4, we can see that the composite nodes in Images 1–10, 14, 16, and 18 are more similar to Template 1, while those in Images 11–13, 15, and 17 are much closer to Template 2. Images 19 and 20 contain neither object. This table now serves as the mechanism for any subsequent searches or rankings within the 20 images for either Template 1 or 2, without the need to re-compute the NASD similarity measure. However, upon a close examination of Images 14 and 16, we can see that the composite node formed subsequent to the Template 1 search (see Fig. 6C for the corresponding ‘‘semantic’’ segmentation) is made up of only those regions that are most similar to Template 1. In other words, the ‘‘hatchback’’ regions of both SUVs are totally missing from the final object, yielding a higher similarity to Template 1. This is confirmed in the results shown in Table 4 for Images 14 and 16, where the similarity to Template 1 is ‘‘closer’’ than that to Template 2. In essence, since the composite nodes generated after searching for Template 1 are ‘‘permanent,’’ there is no opportunity to correct this scenario using this static structure.
Table 4
Experimental results for static learning (top five matches in bold)

Composite nodes    Template 1    Template 2
Image1/C1          0.095         0.230
Image2/C1          0.141         0.178
Image3/C1          0.300         0.444
Image4/C1          0.255         0.528
Image5/C1          0.208         0.314
Image6/C1          0.121         0.197
Image7/C1          0.179         0.248
Image8/C1          0.128         0.216
Image9/C1          0.231         0.346
Image10/C1         0.148         0.264
Image11/C1         0.316         0.269
Image12/C1         0.237         0.162
Image13/C1         0.193         0.125
Image14/C1         0.202         0.408
Image15/C1         0.347         0.247
Image16/C1         0.181         0.293
Image17/C1         0.182         0.138
Image18/C1         0.161         0.274
Image18/C2         0.310         0.578
5.2.2. Dynamic learning

To improve the scenario described above, we utilize the concept of dynamic learning described earlier and demonstrate its effectiveness on the same set of images, focusing our attention on Images 14 and 16. Fig. 6G shows the content hierarchies with static and dynamic learning, respectively. In contrast to the static learning described above, Procedure 2 is followed to determine whether a given composite node possesses within its hierarchy all the regions that are relevant to the object. More specifically, once Template 2 is introduced, the neighborhood of the car composite node found in Images 14 and 16 (see Fig. 6G) is searched as described in Procedure 2. The search confirms that incorporating the ‘‘hatchback’’ regions (the previously missing regions) yields a better similarity to Template 2 than the previously formed node (see Fig. 6G) has to Template 1. At this point, the composite node hierarchy is reconstructed to reflect the above. Furthermore, the OTST is updated accordingly, as shown in Table 5, where the newly computed NASD values indicate a better similarity to Template 2 than to Template 1. In particular, Images 1–10 and 18 (the ‘‘Sedans’’) are more similar to Template 1, while Images 11–17 (the ‘‘SUVs’’) are much closer to Template 2. This is a definite improvement over what was originally depicted in Table 4. Finally, the newly computed top five matches resulting from the dynamic search are shown in Fig. 6H.

Table 6 provides a comparison of the computational complexity of the ‘‘dynamic learning’’ mechanism and the normal hierarchical content matching. The latter produces the same object labeling as the proposed ‘‘dynamic learning’’ by searching all possible region combinations. From the table, it can be easily seen that the ‘‘dynamic learning’’ guided search using the existing composite node requires significantly less computation than the alternative.
Table 5
Experimental results for dynamic learning (top five matches in bold)

Composite nodes    Template 1    Template 2
Image1/C1          0.095         0.230
Image2/C1          0.141         0.178
Image3/C1          0.300         0.444
Image4/C1          0.255         0.528
Image5/C1          0.208         0.314
Image6/C1          0.121         0.197
Image7/C1          0.179         0.248
Image8/C1          0.128         0.216
Image9/C1          0.231         0.346
Image10/C1         0.148         0.264
Image11/C1         0.316         0.269
Image12/C1         0.237         0.162
Image13/C1         0.193         0.125
Image14/C1*        0.211         0.123
Image15/C1         0.347         0.247
Image16/C1*        0.234         0.176
Image17/C1         0.182         0.138
Image18/C1         0.161         0.274
Image18/C2         0.310         0.578
Table 6
Computational complexity of dynamic learning

Image No.          Number of           Combinations to test in        Combinations to test in      Similarity (NASD) after
(as in Fig. 6A)    elementary nodes    normal hierarchical search     dynamic learning process     dynamic learning (template used)
Image 14           14                  2426                           16                           0.123 (2)
Image 16           10                  304                            8                            0.176 (2)
6. Conclusion

This paper presented a novel approach to dynamically improve object segmentation and labeling, without user intervention, for object-based indexing. In contrast to ‘‘static learning,’’ the proposed ‘‘dynamic learning’’ process updates the links between composite nodes and groupings of elementary nodes as new examples are introduced and searches are performed. The results portray the effectiveness of this approach for object labeling and learning from examples.

The main advantage of ‘‘dynamic learning’’ is that inaccurate semantic abstractions of images are automatically corrected in the process of learning from new examples; the accuracy of the content description of images improves over time as new examples are introduced. In essence, ‘‘dynamic learning’’ examines and updates existing composite nodes for potentially better matches to newly introduced object templates. Because the process takes advantage of existing matches, it is computationally more effective in establishing an accurate object labeling than
the normal hierarchical search. In addition, ‘‘dynamic learning’’ does not require any user intervention; the process can be performed offline as new example templates are introduced. The process is dependent on the initial segmentation map. While it is somewhat robust to minor segmentation errors, as shown in Fig. 4, it becomes less and less effective as these errors become more severe. It is also invariant to translations, planar rotations, reflections, and uniform scaling (zooming).
Acknowledgment This material is based upon work supported by the National Science Foundation under Grant IIS-9820721 to the University of Rochester.
References

[1] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Mach. Intell. 22 (12) (2000) 1349–1380.
[2] Y. Rui, T.S. Huang, S.-F. Chang, Image retrieval: current techniques, promising directions and open issues, J. Vis. Commun. Image Rep. 10 (4) (1999) 39–62.
[3] F. Idris, S. Panchanathan, Review of image and video indexing techniques, J. Vis. Commun. Image Rep. 8 (2) (1997) 146–166.
[4] M. De Marsico, L. Cinque, S. Levialdi, Indexing pictorial documents by their content: a survey of current techniques, Image Vision Comput. 15 (2) (1997) 119–141.
[5] E. Chang, K. Goh, G. Wu, CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines, IEEE Trans. Circ. Syst. Video Technol. 13 (1) (2003) 26–38.
[6] T. Gevers, A.W.M. Smeulders, PicToSeek: combining color and shape invariant features for image retrieval, IEEE Trans. Image Process. 9 (1) (2000) 102–119.
[7] H. Muller, W. Muller, S. Marchand-Maillet, T. Pun, Strategies for positive and negative relevance feedback in image retrieval, in: Internat. Conf. on Pattern Recognition (ICPR'00), Barcelona, Spain, 2000.
[8] Y. Wu, A. Zhang, A feature re-weighting approach for relevance feedback in image retrieval, in: IEEE Internat. Conf. on Image Processing (ICIP'02), Rochester, New York, USA, 2002.
[9] Y. Rui, T.S. Huang, M. Ortega, S. Mehrotra, Relevance feedback: a power tool for interactive content-based image retrieval, IEEE Trans. Circ. Syst. Video Technol. 8 (5) (1998) 644–655.
[10] I.J. Cox, M.L. Miller, T.P. Minka, P.N. Yianilos, An optimized interaction strategy for Bayesian relevance feedback, in: Internat. Conf. on Computer Vision and Pattern Recognition (CVPR'98), Santa Barbara, CA, USA, 1998.
[11] X. He, O. King, W.-Y. Ma, M. Li, H.-J. Zhang, Learning a semantic space from user's relevance feedback for image retrieval, IEEE Trans. Circ. Syst. Video Technol. 13 (1) (2003) 39–48.
[12] P. Muneesawang, L. Guan, Minimizing user interaction by auto and semi-auto feedback for image retrieval, in: IEEE Internat. Conf. on Image Processing (ICIP'02), Rochester, New York, USA, 2002.
[13] K. Tieu, P. Viola, Boosting image retrieval, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'00), Hilton Head Island, South Carolina, USA, 2000.
[14] S. Tong, E. Chang, Support vector machine active learning for image retrieval, in: Proc. ACM Multimedia 2001, Ottawa, Canada, 2001.
[15] Z. Wang, Z. Chi, D. Feng, A.C. Tsoi, Content-based image retrieval with relevance feedback using adaptive processing of tree-structure image representation, Int. J. Image Graphics 3 (1) (2003) 119–144.
[16] R. Zhao, W.I. Grosky, From features to semantics: some preliminary results, in: IEEE Internat. Conf. on Multimedia and Expo, New York, New York, USA, 2000.
[17] V. Gudivada, V.V. Raghavan, Content based image retrieval systems, IEEE Comput. 28 (9) (1995) 18–22.
[18] Y. Xu, E. Saber, A.M. Tekalp, Object segmentation and labeling by learning from examples, IEEE Trans. Image Process. 12 (6) (2003) 627–637.
[19] Y. Xu, E. Saber, A.M. Tekalp, Partial shape recognition by sub-matrix matching with application to object based image labeling, Pattern Recogn., submitted.
[20] L. Shapiro, J. Brady, Feature based correspondence: an eigenvector approach, Image Vision Comput. 10 (6) (1992) 283–288.
[21] B. Gunsel, A.M. Tekalp, Shape similarity matching for query by example, Pattern Recogn. 31 (7) (1998) 931–944.
[22] R.C. Veltkamp, Shape matching: similarity measures and algorithms, in: Internat. Conf. on Shape Modeling and Applications (SMI 2001), Genoa, Italy, 2001.
[23] E. Saber, A.M. Tekalp, Region based shape matching for automatic image annotation and query by example, J. Vis. Commun. Image Rep. 8 (1) (1997) 3–20.
[24] E. Saber, A.M. Tekalp, G. Bozdagi, Fusion of color and edge information for improved segmentation and edge linking, Image Vision Comput. 15 (10) (1997) 769–780.
[25] M. Swain, D. Ballard, Color indexing, Int. J. Comput. Vision 7 (1) (1991) 11–32.