Object Segmentation and Labeling by Learning From Examples

Yaowu Xu, Eli Saber, Senior Member, IEEE, and A. Murat Tekalp, Fellow, IEEE

Manuscript received June 8, 1999; revised November 14, 2002. This work was supported by the National Science Foundation under Grants IIS-9820721 and INT-9731007. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Tsuhan Chen. Y. Xu is with the Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14627 USA (e-mail: yaxu@ece.rochester.edu; http://www.ece.rochester.edu/~yaxu). E. Saber is with the Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14627 USA, and also with Xerox Corporation, Webster, NY 14580 USA (e-mail: [email protected]; http://www.ece.rochester.edu/~saber). A. M. Tekalp is with the Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14627 USA, and also with the College of Engineering, Koç University, Istanbul, Turkey (e-mail: tekalp@ece.rochester.edu; http://www.ece.rochester.edu/~tekalp). Digital Object Identifier 10.1109/TIP.2003.810595

Abstract—We propose a system that employs low-level image segmentation followed by color and two-dimensional (2-D) shape matching to automatically group low-level segments into objects, based on their similarity to a set of example object templates presented by the user. A hierarchical content tree data structure is used for each database image to store matching combinations of low-level regions as objects. The system automatically initializes the content tree with only "elementary nodes" representing homogeneous low-level regions. The "learning" phase refers to the labeling of combinations of low-level regions that have resulted in successful color and/or 2-D shape matches with the example template(s); these combinations are labeled as "object nodes" in the hierarchical content tree. Once learning is performed, the speed of second-time retrieval of learned objects in the database increases significantly. The learning step can be performed off-line, provided that example objects are given in the form of user interest profiles. Experimental results are presented to demonstrate the effectiveness of the proposed system, with hierarchical content tree representation and learning by color and 2-D shape matching, on collections of car and face images.

Index Terms—Color matching, learning from examples, object annotation, semantic object segmentation, shape matching.

I. INTRODUCTION

EFFICIENT management of image databases requires indexing of large amounts of visual information for effective searching, browsing, and retrieving. Multimedia content can be classified as professionally produced content, such as movies and TV shows; raw content, such as consumer pictures and home videos; and real-time content, such as surveillance videos. While professionally produced content will likely be indexed off-line by its producer using manual or semi-automatic tools, automatic analysis and indexing of images and videos will play an important role in the latter two. Recently, prototype systems have become available that allow the user to employ simple textual descriptions, or complex visual features such as color, texture, shape, and/or motion, to search large databases (see [1]–[3] for recent surveys on image and video indexing and retrieval technologies).

These systems may be classified as special- and general-purpose systems. Special-purpose systems include Xenomania [4] for face image retrieval based on query by example; the Trademark image database system [5] for trademark retrieval using shape information; and the Center of Excellence for Document Analysis and Recognition (CEDAR) system [6] for indexing and retrieving documents by understanding captions. General-purpose systems include QBIC [7], Photobook [8], VisualSEEK [9], WebSEEK [10], Chabot [11], Illustra/Virage [12], FourEyes [13], MARS [14], RetrievalWare (Excalibur Technologies Corp.) [15], Netra [16], and PicHunter [17]. Some of the recent systems feature "learning behavior" for improved performance. For example, FourEyes selects the most suitable models and similarity measures for retrieval from a "society of models," based on user-provided positive and negative examples. MARS and PicHunter exploit relevance feedback from user interactions, exhibiting learning behavior; however, they require users to input their preferences manually through graphical user interfaces. The main difficulty with these systems is that they can automatically describe content using only low-level descriptors (color, texture, shape, etc.), and not the semantic descriptors (objects, people, places, etc.) that humans prefer. The goal of this paper is to provide an effective method that can automatically analyze and label images at the semantic-object level by introducing the concept of "learning from examples." A potential application scenario is that a user manually segments and labels the main objects of interest in just a few sample images in a large database; the rest are then segmented and labeled automatically by learning from the example objects in the manually segmented and labeled images. The extracted information is stored in a hierarchical content tree data structure. To this effect, this paper proposes methods for:


• Hierarchical content description and matching: We propose a hierarchical content representation and a procedure to determine similarity of content by searching this representation in a top-down fashion, i.e., by first matching composite nodes (objects), if they are already formed, to reduce search time, and thereafter matching combinations of elementary nodes if no match has been established at higher levels.
• Object extraction and labeling by learning from examples: We propose a procedure to automatically construct composite nodes by combining a set of elementary nodes according to given examples or user profiles. This learning behavior is loosely analogous to Lempel-Ziv coding, where combinations of symbols are declared as new symbols as they occur.




Fig. 1. Hierarchical object description. (a) Original image; (b) low-level segmentation with each homogeneous color region enumerated; (c) adjacency matrix; (d) parent-child relationship tree with only elementary nodes, where the big square represents the whole image and small squares represent elementary nodes; and (e) parent-child relationship tree with two intermediate (semantic) level composite nodes, where composite nodes are represented by rectangles.

Section II describes the proposed hierarchical representation of visual content and the formation of the lowest-level (leaf) nodes by means of automatic low-level image analysis. The proposed hierarchical matching procedure using color and two-dimensional (2-D) shape cues is discussed in Section III. Section IV addresses high-level node (object) formation by learning from past user behavior. Section V demonstrates the proposed concepts with experimental results. Conclusions are drawn in Section VI.

II. HIERARCHICAL VISUAL CONTENT REPRESENTATION

Most visual objects are composed of sub-objects (or regions); e.g., a car is made of a body, glass windows, doors, tires, etc., which are rigidly connected, while a human body consists of parts that are flexibly connected. Section II-A defines such a content hierarchy, where the lowest-level nodes are homogeneous (in some sense, e.g., color or texture) regions. Section II-B presents a low-level image analysis procedure that enables automatic generation of the lowest-level (leaf) nodes.

A. Content Hierarchy

We propose to represent an image by a "content tree" [18], [19] indicating the parent-child relationships between various image regions (nodes), and an adjacency matrix [18], [19] capturing the spatial relationships between these regions. Let $N$ denote the number of regions in the segmentation map of the image, and let $L$ be the number of levels within the tree. The root node of the tree, at level $l = 1$, corresponds to the whole image. Each leaf node $n^{p}_{l,i}$ (also called an elementary node) represents a homogeneous image region with uniform color or texture, as indicated by the image segmentation map, where $p$, $l$, and $i$ denote the index of the parent node, the level within the tree ($l \in \{1, \ldots, L\}$), and the index of the node within level $l$ ($i \in \{1, \ldots, N_l\}$, with $N_l$ the number of nodes at level $l$), respectively. Furthermore, an intermediate node (also called a composite node) $n^{p}_{l,i}$, at a level between the root and the leaves, represents a semantic object in the image, where $i$ is the object index within level $l$. For example, the root node can be denoted by $n^{0}_{1,1}$, indicating that it has no parent ($p = 0$), is at level one ($l = 1$), and is the first node within that level ($i = 1$). Composite nodes consist of groupings of elementary nodes or other composite nodes, hence the content hierarchy.

The content tree is adequate to represent the containment relationship between objects and their corresponding elementary regions. The spatial relationships, on the other hand, are captured by the $N \times N$ adjacency matrix $A$, where

$$A(i,j) = \begin{cases} 1, & \text{if region } i \text{ is a neighbor of region } j \\ 0, & \text{otherwise.} \end{cases}$$

This representation is illustrated in Fig. 1. Fig. 1(a) shows a synthetic image with three objects on a uniform background. Fig. 1(b) shows the segmentation map of the image in Fig. 1(a), with its low-level regions enumerated. Fig. 1(c) shows the adjacency matrix given the segmentation map in Fig. 1(b).


Fig. 2. Low-level image analysis for content hierarchy construction.

Fig. 1(d) illustrates the content tree prior to the formation of composite nodes, where each low-level region is depicted as an elementary node. The root node, denoted by $n^{0}_{1,1}$, is the parent of all elementary nodes $n^{1}_{2,i}$, $i \in \{1, \ldots, N_2\}$, which are at level $l = 2$. Fig. 1(e) depicts the content tree including the "house" and "flag" composite nodes. Both composite nodes have the root as parent; their children are elementary nodes 1, 2, 3 and 4, 5, 6, respectively.

Initially, the content tree consists of only the root and elementary nodes. Elementary nodes are automatically constructed based on the segmentation map, with each node denoting a uniform region in the map. Each elementary node may be associated with a number of features and descriptors that characterize its content, e.g., the color, texture, and shape of the underlying region. For example, the color attribute for an elementary node may consist of the color space and the mean and variance of each color channel, while the shape attribute may be given by a B-spline contour representation as in [20]. Construction of higher-level nodes usually requires either user interaction or learning from examples; we describe a method for generating composite nodes by learning in Section IV. Composite nodes may also be associated with features and descriptors. For example, the color attribute of a composite node can be more accurately described by a color histogram. This hierarchical description (together with the adjacency matrix) allows access to individual parts of objects (e.g., the tires or doors of a car), as well as partial object matching.
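For concreteness, the following minimal sketch shows how the initial content tree and adjacency matrix might be realized in code. It is illustrative only (the class and method names, such as Node and ContentTree, are our own, not part of the paper): a labeled segmentation map yields one elementary node per region under the root, and two regions are declared neighbors in A if they share a 4-connected boundary.

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Node:
    """A node of the content tree: the root, an elementary node
    (homogeneous region), or a composite node (semantic object)."""
    level: int                      # depth in the tree (root = 1)
    index: int                      # index within its level
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)  # e.g., mean color, shape

class ContentTree:
    """Hypothetical realization of the hierarchical content description:
    a root node, elementary leaf nodes taken from the segmentation map,
    and an adjacency matrix of region neighborhoods."""
    def __init__(self, seg_map: np.ndarray):
        self.root = Node(level=1, index=1)
        regions = np.unique(seg_map)
        self.elementary = []
        for i, label in enumerate(regions, start=1):
            node = Node(level=2, index=i, parent=self.root,
                        attributes={"label": int(label)})
            self.root.children.append(node)
            self.elementary.append(node)
        self.adjacency = self._adjacency(seg_map, regions)

    @staticmethod
    def _adjacency(seg_map, regions):
        n = len(regions)
        idx = {int(label): k for k, label in enumerate(regions)}
        A = np.zeros((n, n), dtype=np.uint8)
        # Two regions are neighbors if labels differ across a 4-connected edge.
        right = seg_map[:, :-1] != seg_map[:, 1:]
        down = seg_map[:-1, :] != seg_map[1:, :]
        for a, b in zip(seg_map[:, :-1][right], seg_map[:, 1:][right]):
            A[idx[int(a)], idx[int(b)]] = A[idx[int(b)], idx[int(a)]] = 1
        for a, b in zip(seg_map[:-1, :][down], seg_map[1:, :][down]):
            A[idx[int(a)], idx[int(b)]] = A[idx[int(b)], idx[int(a)]] = 1
        return A
```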

B. Formation of Low-Level Nodes

We now address the computation of the proposed content-based hierarchical description, starting at the lowest level. We first perform a series of low-level image analysis operations to extract i) a color edge map [21] and ii) a color segmentation map [22]. Traditional color segmentation methods typically create either too many or too few regions, i.e., over-segmentation or under-segmentation, and standard edge detection methods yield unconnected edge maps. To deal with these difficulties, we employ a split-and-merge procedure to enhance the segmentation map [23]. Finally, the regions in the enhanced segmentation map are taken as the elementary nodes of the content tree, whose color, texture, and shape can be analyzed and recorded as specific attributes. A block diagram of the proposed procedure to initialize the content tree is shown in Fig. 2. In the following, we elaborate further on each step of the low-level image analysis.

Color Edge Detection: The color edge detection step computes the gradient of multichannel images, as described in [21]. The square root of the largest eigenvalue of the spatial gradient matrix of a multichannel image [21] and its corresponding eigenvector represent the equivalents of the magnitude and direction of the gradient of a single-channel image. An edge map is then estimated by appropriately thresholding the magnitude of the multichannel gradient vector at each pixel (see the sketch below).
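Since the spatial gradient matrix is 2 x 2, its largest eigenvalue has a closed form, and the edge magnitude and direction can be computed in a few lines. The sketch below is our own illustration of this step, not the implementation of [21]:

```python
import numpy as np

def multichannel_gradient(image: np.ndarray):
    """Eigenvalue-based gradient of an H x W x C image: returns the
    equivalents of single-channel gradient magnitude and direction."""
    fy, fx = np.gradient(image.astype(float), axis=(0, 1))
    # Entries of the 2x2 spatial gradient matrix, summed over channels.
    gxx = np.sum(fx * fx, axis=2)
    gyy = np.sum(fy * fy, axis=2)
    gxy = np.sum(fx * fy, axis=2)
    # Closed-form largest eigenvalue of [[gxx, gxy], [gxy, gyy]].
    root = np.sqrt((gxx - gyy) ** 2 + 4.0 * gxy ** 2)
    lam_max = 0.5 * ((gxx + gyy) + root)
    magnitude = np.sqrt(np.maximum(lam_max, 0.0))
    # Angle of the corresponding eigenvector (direction of maximal change).
    direction = 0.5 * np.arctan2(2.0 * gxy, gxx - gyy)
    return magnitude, direction

# The edge map follows by thresholding: edges = magnitude > tau
```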


Color Segmentation: Many color segmentation methods have been proposed [24], including those based on histograms, minimum-volume ellipsoids, fuzzy clustering, competitive learning, and Bayesian estimation using Gibbs random fields (GRFs). Here, we adopt the Bayesian approach, since GRF modeling of the segmentation field facilitates the formation of spatially smooth regions. In particular, we maximize the a posteriori probability of the segmentation field subject to a multidimensional (three-channel) Gaussian image model with space-varying mean and variance. An adaptive "small" class elimination step, designed to obtain a smooth segmentation map, has also been implemented. The reader is referred to [23] for details.

Region Formation by Splitting and Merging: Both the region splitting and edge field generation modules use the magnitude of the gradient map of the three-channel color image field. Uniform color regions that have at least one edge segment within their boundary are split into multiple regions to provide consistency between the edge map and the color segmentation map [23]. We first threshold the color gradient and label all contiguous regions therein as seeds. The seeds are then allowed to grow in the direction of increasing gradient until all pixels are assigned to one of the seeds. Next, this edge-based region map is overlaid on the color segmentation map, and color regions containing more than one edge-based region are split accordingly. The color-edge integration module then merges neighboring regions using a highest-confidence decision method. A stability measure, computed from the edge strength found between two neighboring regions, indicates whether a region should be merged with one of its neighbors. The proposed procedure [23] merges two neighboring regions when their local means are "similar" and there are no edge elements between them, resulting in a final segmentation map for the image.

Initialization of the Content Hierarchy: Each region in the final segmentation map constitutes an elementary node of the content tree (see Fig. 1). The node attributes (e.g., color, shape, etc.) are computed accordingly, and the spatial relationships between the nodes are recorded in an adjacency matrix. This content tree containing the lowest-level nodes can be constructed automatically (in an unsupervised manner), without the need for specific domain knowledge.

III. HIERARCHICAL VISUAL CONTENT MATCHING AND SIMILARITY MEASURES

This section discusses color and 2-D shape matching based on the hierarchical content representation given in Section II. After an introduction to the modes of query in Section III-A and the hierarchical matching procedure in Section III-B, we elaborate on particular color and 2-D shape matching measures in Sections III-C and III-D, respectively. The proposed color and 2-D shape-based matching processes form an integral part of the learning process introduced in Section IV.

A. Query Modes

The proposed system enables two types of visual queries: low-level queries at the region level and high-level queries at the semantic-object level. High-level queries are possible only if image representations already contain composite (intermediate) nodes. Otherwise, the system allows only low-level queries, which attempt to match the query template to all neighboring combinations of elementary nodes in a given database image. The composite nodes are designed to store the results of such low-level searches for selected objects, enabling fast searching of the same (or a similar) object subsequently (see Section IV).

B. Hierarchical Content Matching

One of the advantages of the proposed hierarchical image description is that it enables matching images/objects at various levels of the hierarchy. Consider performing query-by-example in an image database that supports the proposed hierarchical description with composite nodes. The matching of the query template with any database image proceeds in a top-down fashion: the template is first matched against all highest-level (composite) nodes (i.e., objects) in the description of the database image. If no match is established at the highest level, then combinations of all next-lower-level nodes (which may still be composite nodes corresponding to parts of an object that consist of multiple homogeneous regions) are considered. The search terminates when all combinations of elementary (lowest-level) nodes (i.e., homogeneous regions) have been considered. The matching at each level of the content hierarchy is accomplished by utilizing the appropriate descriptors and similarity measures at that level. Clearly, if database images do not contain composite nodes, as they initially do not, then only low-level searching can be performed. A sketch of this procedure follows.
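A minimal sketch of this top-down search is given below, reusing the hypothetical ContentTree/Node structures from the sketch in Section II. The `similar` argument stands in for the color and 2-D shape tests of Sections III-C and III-D, `max_group` bounds the size of the region groupings examined, and the adjacency matrix A is assumed to be indexed consistently with the node list; all names are our own.

```python
from itertools import combinations

def match_top_down(tree, template, similar, max_group=4):
    """Try previously learned composite nodes (objects) first; fall back
    to combinations of neighboring elementary nodes only if no
    higher-level match exists."""
    composites = [c for c in tree.root.children if c.children]
    for node in composites:                       # highest level first
        if similar(template, [node]):
            return [node]
    elementary = [e for e in tree.root.children if not e.children]
    for group in neighboring_groups(elementary, tree.adjacency, max_group):
        if similar(template, group):              # low-level fallback
            return group
    return None

def neighboring_groups(nodes, A, max_size):
    """Enumerate groups of elementary nodes that form a connected
    subgraph of the adjacency matrix A, up to max_size regions."""
    for k in range(1, max_size + 1):
        for group in combinations(range(len(nodes)), k):
            if connected(group, A):
                yield [nodes[i] for i in group]

def connected(group, A):
    """Depth-first check that the chosen regions are mutually reachable
    through neighbor relations."""
    seen, stack = {group[0]}, [group[0]]
    while stack:
        i = stack.pop()
        for j in group:
            if j not in seen and A[i][j]:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(group)
```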

We now discuss the computation of color and shape similarity measures for low-level matching.

C. Color Similarity Matching

Since elementary nodes generally correspond to regions of uniform color, we quantify the color similarity between two elementary nodes by the distance between the mean colors of the corresponding regions. To measure color similarity between two composite nodes, or between a template and a combination of elementary nodes, we employ the histogram intersection method of Swain and Ballard [25]. If $H_T$ and $H_C$ denote the color histograms of the template and of the selected composite node or grouping of elementary nodes, respectively, their histogram intersection is defined as [25]

$$S(H_T, H_C) = \frac{\sum_{j=1}^{B} \min\left(H_T(j), H_C(j)\right)}{\sum_{j=1}^{B} H_T(j)}$$

where we assume both histograms contain $B$ bins. While this similarity measure is fairly simple, it is remarkably effective in determining color similarity between images of multicolored objects.
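In code, the measure is a one-liner. The sketch below assumes unnormalized bin counts and normalizes by the template histogram, as in [25]:

```python
import numpy as np

def histogram_intersection(h_t: np.ndarray, h_c: np.ndarray) -> float:
    """Histogram intersection of template histogram h_t and candidate
    histogram h_c (same B bins): the fraction of the template's mass
    matched by the candidate."""
    return float(np.minimum(h_t, h_c).sum() / h_t.sum())

# e.g., coarse 8x8x8 RGB histograms of the template pixels and of the
# pixels covered by a candidate grouping of regions:
# h_t = np.histogramdd(template_pixels, bins=(8, 8, 8))[0].ravel()
# h_c = np.histogramdd(candidate_pixels, bins=(8, 8, 8))[0].ravel()
```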

D. Two-Dimensional Shape Similarity Matching

A variety of 2-D shape representation and matching techniques are available, based on local features, template matching, Fourier descriptors, the generalized Hough transform, moment invariants, modal matching, finite-element analysis, and local and differential invariants, which are invariant to size, position, and/or orientation (see [20], [26], and [27] for comprehensive surveys). Here we employ modal matching [20] to locate all translated, rotated, and scaled instances of a template within a given image, where each region or combination of neighboring regions (corresponding to elementary or composite nodes) constitutes a "potential" object. To this effect, the boundary of the query template, as well as that of each potential region or group of regions, is approximated by B-splines. A modal shape description [20], [28] is then employed to establish correspondences between the template boundary points and those of the potential object. These correspondences are used to compute affine parameters, which, in turn, are employed to map the image onto the template domain. Finally, the Hausdorff distance between the template contour and that of the mapped region or group of regions under investigation is computed to determine their similarity. The outcome is a table in which each region or group of regions is assigned a similarity measure with respect to the template under consideration. This table is arranged in increasing order, from the most similar region or group of regions, as indicated by the smallest Hausdorff distance, to the least similar. Those regions or groupings of regions with a Hausdorff distance below a certain threshold are pronounced "similar" matches.
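The final scoring step, the Hausdorff distance between the template contour and the mapped candidate contour, can be sketched as follows (a brute-force version for contours sampled as point sets; the modal correspondence and affine mapping of [20], [28] are assumed to have been applied already):

```python
import numpy as np

def hausdorff_distance(P: np.ndarray, Q: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two contours sampled as
    point sets P (N x 2) and Q (M x 2): the largest distance from any
    point of one set to the closest point of the other."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)  # N x M
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```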


IV. OBJECT EXTRACTION AND LABELING BY LEARNING FROM EXAMPLES

In this section, we discuss the formation of composite nodes from elementary ones by learning, and we elaborate on the advantages of composite nodes for automatic segmentation and fast searching of the learned semantic objects.

A. Formation of Composite Nodes

In the context of this work, "learning" refers to storing frequently occurring (user-specified) combinations of regions, which correspond to meaningful objects, in the form of composite nodes with specific indexing information attached. We demonstrate the concept of learning by means of a specific example, where a user is interested in retrieving all images that contain "red cars." Suppose that the user provides an example image of a "red car" and states that the attributes to be used in similarity matching are "color" and "2-D shape." We assume that low-level segmentation maps of all database images have been computed. If the database has not been processed for a "red car" before, then the search begins by matching the color and 2-D shape of the example car template against those of the elementary regions of each image, and proceeds to match against combinations of neighboring elementary nodes, until the most similar combination of regions matching in both color and shape has been identified. An important outcome of this procedure is the grouping of regions in database images whose color and 2-D shape match those of the "red car" template. A composite node captures each instance of such a match; it contains the appropriate color and shape attributes of the "red car," as well as references to the constituent elementary nodes. The color histogram is selected as the color descriptor for composite nodes, and the shape descriptor is recomputed for the region obtained by grouping the regions that form the composite node. In addition, the spatial relationships of the elementary nodes beneath a composite node are stored in an adjacency matrix. This process constitutes a learning step that would not have taken place had we not performed the match (a sketch of this step follows at the end of this subsection). As a result, subsequent "red car" searches in this database would immediately identify the composite node as a match using its shape and/or color attributes, without the need to process the shape and color attributes of the lower-level nodes. Ultimately, the second "red car" search in this database will be orders of magnitude faster than the first, due to the previously "learned" information.

We can extend the concept of learning from "an example" to "multiple example templates." Suppose a user keeps an interest profile, say for cars, that consists of a representative set of car image templates characterizing the content that s/he is interested in searching or has searched in the past. Then, each labeled template in this profile can be used as an example template, as in the above scenario, to generate hierarchical descriptors tailored to this interest profile.
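The learning step itself amounts to re-parenting the matched elementary nodes under a new labeled node. The following sketch reuses the hypothetical Node class from Section II; the argument names are illustrative, and the level bookkeeping is deliberately simplified:

```python
def learn_composite(tree, group, label, color_hist, shape_desc):
    """Store a combination of elementary nodes that matched an example
    template in color and/or 2-D shape as a labeled composite node."""
    composite = Node(level=2, index=len(tree.root.children) + 1,
                     parent=tree.root,
                     attributes={"label": label,
                                 # histogram recomputed over the merged region
                                 "color_histogram": color_hist,
                                 # shape recomputed for the grouped boundary
                                 "shape": shape_desc})
    for node in group:                    # re-parent the matched regions
        tree.root.children.remove(node)
        node.parent = composite
        composite.children.append(node)
    tree.root.children.append(composite)
    return composite
```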


B. Advantages of Forming Composite Nodes

The first benefit is semantic object formation without user intervention. For instance, consider the image shown in Fig. 1(a) and its low-level segmentation map portrayed in Fig. 1(b). This initial segmentation entails a "low-level" grouping of pixels with similar colors into various regions, but falls short of identifying any objects within the scene. However, the composite nodes in Fig. 1(e), formed as a result of the search, capture the notions of the "house" object and the "flag" object within the scene, providing a level of semantic knowledge that would not have been available had we not performed the initial search. Clearly, this description provides a semantic level of segmentation over and above the original low-level one.

Another advantage of the composite node representation is a significant reduction in the number of combinations to be tested during subsequent retrieval operations for the same object. For instance, during the initial search for the house template in Fig. 1, each region or group of neighboring regions has been tested in order to identify the most similar combination. However, once the "house" composite node has been formed, any subsequent search for the house within this image proceeds in a top-down fashion, where the composite nodes are searched first, followed by the elementary nodes and their respective combinations. Since the "house" object has been captured in a composite node, the combinations of the elementary nodes "beneath" this composite node are eliminated from the search, significantly reducing the number of combinations to be tested.

Here, we provide a worst-case analysis of the reduction in computational complexity. Given a complete graph [18] with $n$ elementary nodes, corresponding to the $n$ regions in the segmentation map, the maximum number of possible links is $e = n(n-1)/2$. In the initial representation of the image, prior to the learning step, the graph consists entirely of elementary nodes. Therefore, the initial search for "car" objects within this scene would examine all valid combinations of neighboring regions, seeking similarity to a user-specified template. For a complete graph (the worst-case scenario), where all elementary nodes are interconnected, the maximum number of combinations that can be examined is

$$C_{\text{initial}} = \sum_{k=1}^{n} \binom{n}{k} = 2^{n} - 1.$$

However, once a composite node has been formed, the number of combinations to be tested is greatly reduced for subsequent searches. To this effect, assume that the initial search has yielded a composite node made up of $m$ of the $n$ regions within the segmentation map. In the worst case, the combinations of the composite node and the remaining elementary nodes still need to be tested, since the user could be utilizing a "tighter" similarity threshold than the one originally used to perform the initial search. Then, the number of possible combinations for a complete graph with one composite node made up of $m$ elementary nodes is

$$C_{\text{subsequent}} = 2^{\,n-m+1} - 1.$$

This accounts for the composite-node combination in addition to all the valid combinations among the remaining nodes within the graph. As can easily be seen from these two equations, the formation of the composite node has provided a reduction of $C_{\text{initial}} - C_{\text{subsequent}} = 2^{n} - 2^{\,n-m+1}$ in the number of combinations to be tested, relative to the initial case.


TABLE I PERFORMANCE EVALUATION OF RETRIEVAL BEFORE AND AFTER LEARNING FOR “CAR” IMAGES

TABLE II PERFORMANCE EVALUATION OF RETRIEVAL BEFORE AND AFTER LEARNING FOR “FACE” IMAGES

The above analysis covers the absolute worst case, where the scene is represented by a complete graph with $n$ nodes and $n(n-1)/2$ links. In general, however, the actual neighboring relations in images are local and, assuming that one-point connections are not allowed, the "scene graph" may be more practically modeled by a planar graph [18] instead of a complete graph. As a result, the maximum number of links is at most $3n - 6$, instead of $n(n-1)/2$, resulting in a much smaller set of valid combinations than in the above worst-case analysis. Tables I and II in Section V provide the actual number of combinations searched for real-life examples (a numerical check of the worst-case counts follows).
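As a quick numerical check, using the worst-case counts as reconstructed above:

```python
def combinations_before(n: int) -> int:
    """Worst case (complete graph): every nonempty subset of the n
    elementary nodes is a connected, hence testable, grouping."""
    return 2 ** n - 1

def combinations_after(n: int, m: int) -> int:
    """After a composite node absorbs m elementary nodes, the worst case
    is a complete graph on the composite plus the n - m remaining nodes."""
    return 2 ** (n - m + 1) - 1

# e.g., n = 10 regions with a learned object covering m = 6 of them:
# 1023 combinations initially, 31 afterwards, a reduction of 992.
assert combinations_before(10) == 1023
assert combinations_after(10, 6) == 31
```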

C. Applications

Two possible application scenarios for the proposed hierarchical representation are as follows.
1) Suppose a user shoots a number of digital still pictures. S/he may want to index this content automatically according to his/her interest profile (with examples of his/her favorite objects) as s/he enters it into a personal database. Each image is then individually analyzed into its elementary regions, and particular groupings of regions are defined as composite nodes with appropriate attributes, as dictated by the example templates in the profile. Consequently, the results "learned" by this procedure and entered into the database may later be utilized for more expedient searches of desired content.
2) Suppose a user wishes to search remote databases faster using his/her interest profile. Then, a search engine may process these databases off-line in order to generate a local copy of the proposed hierarchical descriptions, tailored according to the examples in the profile. Actual searches may later be conducted based on these customized hierarchical descriptor files.

V. EXPERIMENTAL RESULTS

We provide experimental results on collections of car and face images (see Figs. 3–5) to demonstrate the system on real-life examples. Each color image is in RGB format, with each color component quantized to 8 bits/pixel. Fig. 5 demonstrates the performance of the method with multiple search templates.

A. Experiments With Single Example Template

The main objective of these experiments is to demonstrate the concept of learning by example. Figs. 3 and 4 outline the use of our proposed approach using 2-D shape and color similarity, respectively, on two sets of images. The first set is made up of five car images suitable for 2-D shape matching, while the second set includes four portrait images for color matching. At the top of each set, we indicate the "example template" used in the search process. Note that the example templates are derived from the first car and portrait images, respectively. In each set, we show (a) the original image, (b) the low-level segmentation map, (c) the initial content tree before any learning takes place, (d) the matching combination of regions, and (e) the modified content tree depicting the composite node structure obtained by the learning process. Figs. 3(b) and 4(b) provide the low-level segmentation results that form the basic building blocks for the semantic search and retrieval operations. Note that the original elementary nodes are moved beneath the composite nodes as they are formed, as shown in Figs. 3(e) and 4(e). Any subsequent queries for "cars" and "faces" performed on the images shown in Figs. 3(a) and 4(a) will ultimately utilize the composite node structure shown in Figs. 3(e) and 4(e), yielding a significant reduction in the number of combinations to be tested.


Fig. 3. Results on car images using shape matching: (a) The original image; (b) Final low-level segmentation; (c) Parent-child relationship tree with elementary nodes only; (d) Semantic segmentation matching the query template; and (e) Parent-child relationship tree with an intermediate semantic-layer, where the rectangle (i.e., the composite node) represents the segmented car.

B. Evaluation Criteria

Tables I and II summarize the performance of our proposed algorithms on the "car" and "face" images, respectively. In each table, we give the image name followed by the number of initial elementary nodes found within the scene. We then provide a comparison between the initial and subsequent search results, where "initial search" and "subsequent search" refer to the search prior to and after the formation of the composite nodes, respectively. In the initial- and subsequent-search portions of the tables, we show both the theoretical number of combinations, computed using the formulas in Section IV-B, and the practical number, obtained directly from the adjacency matrix. From the tables, it can easily be seen that the number of combinations examined after the learning step represents a significant reduction over the initial search. This can be observed from both a theoretical and a practical perspective: the theoretical numbers are computed using complete graphs, as described in Section IV-B, while the practical results are obtained from Figs. 3 and 4, respectively.


Fig. 4. Results on face images using color matching: (a) The original image; (b) Final low-level segmentation; (c) Parent-child relationship tree with elementary nodes only; (d) Semantic segmentation matching the query template; and (e) Parent-child relationship tree with an intermediate semantic-layer, where the rectangle (i.e., the composite node) represents the segmented face.

Table III provides an evaluation of the semantic segmentation results, documented by the formation of the composite node within each of the figures. In the table, we present the normalized object size, given by the ratio of the number of pixels in the ground-truth object to the total number of pixels in the image; the true positives (TP), normalized by the object size; and the precision, defined as TP/(TP+FP), where FP stands for false positives. Note that the normalized TP and precision values are always between 0 and 1. TP and FP are computed by comparing the automatic segmentation of the "car" and "face" regions, as captured by the composite nodes, with the ground-truth segmentations, which are obtained manually. The manual segmentation results are depicted by the red line overlaid on the original images. A sketch of these measures follows.

Fig. 5. Experiments on a collection of images using multiple templates.

C. Experiments With Multiple Example Templates

Fig. 5 shows the results of utilizing multiple templates on a collection of "car" images. These experiments demonstrate the robustness of our proposed labeling technique with different object templates. Fig. 5(a) provides a collection of "Sedan" and "Sport Utility Vehicle (SUV)" images; note that the last image in the collection contains multiple cars. Fig. 5(b) and (c) display the results of searching these images with the "Sedan" and "SUV" templates, respectively, where the resulting segmentations are arranged in decreasing similarity as we travel from left to right and top to bottom. The resulting images, also arranged in decreasing similarity, are displayed in Fig. 5(d), and their similarity measurements are tabulated in Table IV.

Both the "Sedan" and "SUV" templates are abstractions of the "Car" object; hence, either can be utilized to label "Car" images. However, as seen from Table IV, the similarity measurement generally improves as the object shape approaches that of the template. This can be validated to first order by examining the average similarity measures of the "Sedan" and "SUV" images under Templates 1 and 2. Images 1–10 and 16 represent "Sedans," while images 11–15 are "SUVs."


Fig. 5. (Continued.) Experiments on a collection of images using multiple templates.

The average similarity measures (tabulated in Table IV) bear this out: images 1–10 and 16 are more similar to Template 1, while images 11–15 are best represented by Template 2. Hence, either template can be utilized to represent all "Cars" in an initial search with a high threshold. This will result in a large set of true positives, at the expense of potential false positives. On the other hand, if a higher degree of accuracy is required, then the "Sedan" or "SUV" template can be utilized to search for "Sedans" or "SUVs," respectively, with a lower threshold, resulting in a potentially smaller but more accurate true-positive set, while minimizing false positives.


TABLE III PERFORMANCE EVALUATION OF SEMANTIC SEGMENTATION RESULTS


TABLE IV SIMILARITY MEASURING IN LEARNING

VI. CONCLUSIONS

The primary benefits of the proposed learning-by-example method are: 1) the ability to produce a semantic level of segmentation without user intervention and 2) the capability to significantly improve the speed of subsequent searches as a consequence of capturing semantic object information in composite nodes. Both are demonstrated in the tables and figures presented in the results section. Some concluding observations are in order.
1) The quality of the resulting semantic object segmentation depends on the granularity of the initial low-level segmentation. An initial segmentation with perfectly uniform low-level regions will yield a more accurate potential semantic-object boundary and, hence, improved matching performance. However, there is a tradeoff between the number of regions in the initial segmentation and the time required to perform the initial search.
2) The proposed learning-by-example concept can employ any object matching method. The 2-D shape matching technique employed in this work has been shown to handle translation, rotation in the image plane, and isometric scaling, but not general affine or perspective transformations. Inaccuracies in low-level matching affect the performance of learning (labeling), as quantified by the TP and precision measures. Also, limited by the object matching methods, the current system does not handle partially occluded objects.
3) The initial search is generally expensive for real-life images, as can be seen from Tables I and II. However, given a user search profile, e.g., a collection of example objects similar to those in Figs. 3 and 4, this initial search can be done off-line. We are working on improving the speed of the initial search through hierarchical low-level segmentation and partial-match-guided search procedures.

REFERENCES

[1] Y. Rui, T. S. Huang, and S.-F. Chang, "Image retrieval: Current techniques, promising directions and open issues," J. Vis. Commun. Image Represent., vol. 10, no. 4, pp. 39–62, Apr. 1999.
[2] F. Idris and S. Panchanathan, "Review of image and video indexing techniques," J. Vis. Commun. Image Represent., vol. 8, no. 2, pp. 146–166, June 1997.
[3] M. De Marsico, L. Cinque, and S. Levialdi, "Indexing pictorial documents by their content: A survey of current techniques," Image Vis. Comput., vol. 15, no. 2, pp. 119–141, Feb. 1997.
[4] J. R. Bach, S. Paul, and R. Jain, "A visual information management system for the interactive retrieval of faces," IEEE Trans. Knowledge Data Eng., vol. 5, Aug. 1993.
[5] A. Jain and A. Vailaya, "Shape-based retrieval: A case study with trademark image databases," Pattern Recognit., vol. 31, no. 9, pp. 1369–1390, 1998.
[6] R. K. Srihari, "Automatic indexing and content-based retrieval of captioned images," Computer, vol. 28, no. 9, pp. 49–56, Sept. 1995.
[7] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz, "Efficient and effective querying by image content," J. Intell. Inform. Syst., vol. 3, no. 3–4, pp. 231–262, July 1994.
[8] A. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases," Int. J. Comput. Vis., vol. 18, no. 3, pp. 233–254, June 1996.
[9] J. R. Smith and S.-F. Chang, "VisualSEEK: A fully automated content-based image query system," in Proc. ACM Multimedia 96, 1996.
[10] J. R. Smith and S.-F. Chang, "Visually searching the Web for content," IEEE Multimedia Mag., vol. 4, no. 3, pp. 12–20, 1997.
[11] V. E. Ogle and M. Stonebraker, "Chabot: Retrieval from a relational database of images," Computer, vol. 28, no. 9, pp. 40–48, Sept. 1995.
[12] C. Okon, "Image recognition meets content management for authoring, editing and more," Adv. Imag., vol. 10, no. 7, July 1995.
[13] T. P. Minka and R. W. Picard, "Interactive learning using a society of models," Pattern Recognit., vol. 30, no. 4, pp. 565–581, Apr. 1997.
[14] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: A power tool in interactive content-based image retrieval," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 644–655, Sept. 1998.
[15] J. Dowe, "Content-based retrieval in multimedia imaging," in Proc. SPIE Storage and Retrieval for Image and Video Databases, 1993.
[16] W. Y. Ma and B. S. Manjunath, "Netra: A toolbox for navigating large image databases," in Proc. IEEE Int. Conf. Image Processing, 1997.
[17] I. J. Cox, M. L. Miller, T. P. Minka, and P. N. Yianilos, "An optimized interaction strategy for Bayesian relevance feedback," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1998.
[18] M. Swamy and K. Thulasiraman, Graphs, Networks, and Algorithms. New York: Wiley, 1981.
[19] R. Sedgewick, Algorithms in C. Reading, MA: Addison-Wesley, 1990.
[20] E. Saber and A. M. Tekalp, "Region-based shape matching for automatic image annotation and query by example," J. Vis. Commun. Image Represent., vol. 8, no. 1, pp. 3–20, Mar. 1997.
[21] H. C. Lee and D. R. Cok, "Detecting boundaries in a vector field," IEEE Trans. Acoust., Speech, Signal Processing, vol. 39, pp. 1181–1194, May 1991.
[22] E. Saber and A. M. Tekalp, "Integration of color, edge, shape, and texture features for automatic region-based image annotation and retrieval," J. Electron. Imag., vol. 7, no. 3, pp. 684–700, July 1998.
[23] E. Saber, A. M. Tekalp, and G. Bozdagi, "Fusion of color and edge information for improved segmentation and edge linking," Image Vis. Comput., vol. 15, no. 10, Oct. 1997.
[24] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognit., vol. 26, no. 9, pp. 1277–1294, 1993.
[25] M. Swain and D. Ballard, "Color indexing," Int. J. Comput. Vis., vol. 7, no. 1, pp. 11–32, 1991.
[26] S. Marshall, "Review of shape coding techniques," Image Vis. Comput., vol. 7, pp. 281–294, 1989.
[27] W. Grimson, Object Recognition by Computer: The Role of Geometric Constraints. Cambridge, MA: MIT Press, 1990.
[28] L. Shapiro and J. Brady, "Feature-based correspondence: An eigenvector approach," Image Vis. Comput., vol. 10, pp. 283–288, June 1992.


Yaowu Xu received the B.S. degree in physics from Tsinghua University, Beijing, China, in 1993, and the M.S. and Ph.D. degrees in reactor engineering and reactor safety from the Institute of Nuclear Energy and Technology, Tsinghua University, in 1998. He also received the M.S. degree in electrical and computer engineering from the University of Rochester, Rochester, NY. In 1999, he joined the Algorithm Development Group of On2 Technologies, Inc., where he is currently a Senior Engineer. His current research interests include image and video compression, real-time video encoding, post-processing of low-bit-rate video, object-based image representation, image and video segmentation, indexing, and retrieval.

Eli Saber (S’91–M’96–SM’00) received the B.S. degree in electrical and computer engineering from the University of Buffalo, Buffalo, NY, in 1988, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Rochester, Rochester, NY, in 1992 and 1996, respectively. He joined Xerox in 1988 and is currently a Product Development Scientist and Manager heading the Image Science, Analysis, and Evaluation area in the Print Engine Development Unit. He is an Adjunct Faculty Member in the Electrical and Computer Engineering Departments of the University of Rochester and the Rochester Institute of Technology, responsible for teaching graduate coursework in signal, image, and video processing and for performing research in digital libraries and watermarking. His research interests include color image processing, image/video segmentation and annotation, content-based image/video analysis and retrieval, computer vision, and watermarking. He has authored a number of conference and journal publications in the field of signal, image, and video processing. He is an Associate Editor for the Journal of Electronic Imaging and for the IEEE TRANSACTIONS ON IMAGE PROCESSING. Dr. Saber was the recipient of the Gibran Khalil Gibran Scholarship and of several prizes and awards for outstanding academic achievement from 1984 to 1988, as well as the Quality Recognition Award in 1990 from The Document Company, Xerox. He is a member of the electrical engineering honor society Eta Kappa Nu and of the Imaging Science and Technology Society. He was the Finance Chair for the IEEE International Conference on Image Processing (ICIP) 2002 in Rochester, NY, and the General Chair for the Western New York Imaging Workshop in 1998.


A. Murat Tekalp (S’80–M’84–SM’91–F’03) received the M.S. and Ph.D. degrees in electrical, computer, and systems engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1982 and 1984, respectively. From December 1984 to August 1987, he was with Eastman Kodak Company, Rochester, NY. He joined the Electrical and Computer Engineering Department, University of Rochester, Rochester, NY, in September 1987, where he is currently a Distinguished Professor. Since June 2001, he has also been with Koç University, Istanbul, Turkey. He authored the book Digital Video Processing (Englewood Cliffs, NJ: Prentice-Hall, 1995) and holds five U.S. patents. His group contributed technology to the ISO/IEC MPEG-4 and MPEG-7 standards. His research interests are in the area of digital image and video processing, including video compression and streaming, video filtering for high resolution, video segmentation, object tracking, content-based video analysis and summarization, multicamera surveillance video processing, and protection of digital content. Dr. Tekalp was an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING (1990–1992), the IEEE TRANSACTIONS ON IMAGE PROCESSING (1994–1996), and the Journal of Multidimensional Systems and Signal Processing (1994–1999). He was an Area Editor for Graphical Models and Image Processing (1995–1998) and was on the Editorial Board of Visual Communication and Image Representation (1995–1999). He is currently the Editor-in-Chief of the journal Image Communication. He served as Technical Program Chair for the 1991 IEEE Signal Processing Society Workshop on Image and Multidimensional Signal Processing, Special Sessions Chair for the 1995 IEEE International Conference on Image Processing, Technical Program Co-Chair for IEEE ICASSP 2000 in Istanbul, Turkey, and General Chair of the IEEE International Conference on Image Processing (ICIP) in Rochester, NY, in 2002. He is the Founder and first Chairman of the Rochester Chapter of the IEEE Signal Processing Society and was elected Chair of the Rochester Section of the IEEE for 1994–1995. He received the National Science Foundation Research Initiation Award in 1988, was named a Distinguished Lecturer by the IEEE Signal Processing Society in 1998, and was awarded a Fulbright Senior Scholarship in 1999. He chaired the IEEE Signal Processing Society Technical Committee on Image and Multidimensional Signal Processing from January 1996 to December 1997.