Context Dependent Segmentation and Matching in Image Databases

Hayit Greenspan, Faculty of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel
Guy Dvir, Faculty of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel
Yossi Rubner, Applied Materials, Israel
[email protected]

July 31, 2003

Abstract

The content of an image can be summarized by a set of homogeneous regions in an appropriate feature space. When exact shape is not important, the regions can be represented by simple "blobs". Even for similar images, the blob representations of the two images might vary in shape, position, the number of blobs, and the represented features. In addition, separate blobs in one image might correspond to a single blob in the other image, and vice versa. In this paper we present the BlobEMD framework as a novel method to compute the dissimilarity of two sets of blobs while allowing for context-based adaptation of the image representation. This results in representations that represent the original images well but, at the same time, are best aligned with respect to the representations of the context images. We compute the blobs by using Gaussian mixture modeling and use the Earth Mover's Distance (EMD) to compute both the dissimilarity of the images and the flow matrix of the blobs between the images. The BlobEMD flow matrix is used to find optimal correspondences between source and target image representations and to adapt the representation of the source image to that of the target image. This allows for similarity measures between images that are insensitive to the segmentation process and to different levels of detail of the representation. We show applications of this method to content-based image retrieval, image segmentation, and matching models of heavily dithered images with models of full-resolution images.
1 Introduction
Many content-based retrieval works rely on an initial segmentation of the input and archived images. Yet image segmentation remains one of the more challenging problems in computer vision, and it is often not well defined, as different contexts entail different segmentations of the same image. For example, in some contexts it is more appropriate to segment together all the trees in an image of a forest, while in other contexts each tree should stand on its own. In this work we address the challenge of comparing similar images that are segmented differently and/or are represented at varying levels of resolution, as is the case in dithered images.

The "BlobEMD" framework is proposed in this work as a simultaneous solution to both the image representation problem and the estimation of the distance between images. This coupling allows for context-based model adaptation, where the representation of one image is adjusted based on the representation of a second image - the context. The framework combines an initial transition from image pixels to representative image regions (segments or blobs) via Gaussian mixture modeling (GMM) [2], followed by the Earth Mover's Distance measure (EMD) [19], which finds the optimal correspondences between regions in the two images and extracts an overall matching measure between the two input images. The correspondences between the regions in the two images are used to merge and split the regions, so that they still represent the images well but at the same time bring the two representations to a common context. For example, the problem of image segmentation is treated here as an image-pair (source-target) task; thus, an image will be segmented differently based on the target image. The suggested framework provides image representations that are more uniform and best aligned between the two images to be matched. The overall framework of the image representation and matching phases is presented in Figure 1.

In Section 2 we review related work and motivate the proposed scheme. The BlobEMD framework is presented in Section 3. In addition to the distance between two sets of blobs, the BlobEMD results in a flow matrix with correspondences between blobs. In Section 4 we focus on the flow matrix and provide a set of rules for extracting region correspondences between images and for image model adaptation. Experimental evaluation of the BlobEMD framework, along with its application to context-based image segmentation and robust image matching, is presented in Section 5.
Figure 1: A block diagram of the BlobEMD matching system.
2 Related Work
Histograms are the classical means of representing image content and are widely used as the chosen image representation [8, 1]. A histogram is a discrete representation of the continuous feature space, generated by a partitioning of the feature space. The partitioning is determined by the feature space chosen (e.g., the color space representation), the quantization scheme chosen (such as uniform or vector quantization), as well as computational and storage considerations. The advantages and disadvantages of color histograms are well studied [23], and many variations exist [16, 22, 13].

Several measures have been proposed for the dissimilarity between two histograms. In general they can be divided into two categories [20, 17]: "bin-by-bin" measures, which compare the contents of corresponding histogram bins, and "cross-bin" measures, which enable comparisons across non-corresponding bins. The first category includes the Minkowski-form distance, the histogram intersection (H.I.) measure [23, 20], the χ2 statistic, the Kullback-Leibler (KL) divergence [14, 4], and others. "Cross-bin" measures also incorporate the feature-space information of the bins (e.g., the dissimilarities between the colors represented by the histogram bins).
Such measures include the quadratic-form distance [11], in which a similarity matrix is included to represent the similarity between bins. The Earth Mover's Distance measure [19] extracts dominant modes from a histogram, as a signature, and defines a measure of similarity between signatures. Additional distance measures between histogram representations in an image matching task are evaluated and compared in [19, 17, 20].

The histogram representation has recently been extended to include additional features as well as spatial information. In [16] each entry of a "joint" histogram contains the number of pixels in the image that are described by a particular combination of feature values. In [22] local information is included by dividing an image into five fixed overlapping blocks and extracting the first three color moments of each block to form a feature vector for the image. In [13] correlograms are proposed to take into account the local color spatial correlation as well as the global distribution of the spatial correlation.

Other works in image representation include "region-based" approaches. Image regions are the basic building blocks in forming the visual content of an image, and thus have great potential for representing the image content and enabling image matching. In [21] Smith and Chang store the location of each color that is present in a sufficient amount, in regions computed using histogram back-projection. Ma and Manjunath [15] perform retrieval based on segmented image regions; the segmentation is not fully automatic, as it requires some parametric tuning and hand pruning of regions. Unsupervised segmentation of an image into homogeneous regions in feature space, such as the color and texture space, can be found in the "Blobworld" image representation [2, 3]. In [2] a naive Bayes algorithm is used to learn image categories from the blob representation in a supervised learning scheme. The framework suggested entails learning blob rules per category; thus, one may argue that there is a shift to a high-level image description (image labeling). Each query image is then compared with the extracted category models and associated with the closest matching category. In [3] the user composes a query by viewing the Blobworld representation and selecting the blobs to match, along with possible weighting of the blob features. A query may include a combination (conjunction) of two blobs. In essence, the image matching problem is shifted to a (one- or two-) blob matching problem. Each blob is compared with all blobs in each database image. Spatial information is thus included, yet in a very concise manner.
It should be noted that each blob is represented by a color histogram; thus the representation is a discrete one (in the image plane as well as in feature space). An extension of the Blobworld system, termed the "GMM-KL" framework, has recently been proposed [10]. The set of regions in an image is represented by a continuous Gaussian mixture model (GMM), and images are then compared and matched via the continuous and probabilistic KL distance between distributions. The GMM-KL framework achieves strong matching results between images while addressing the problem of 'multiple-blob' to 'multiple-blob' matching. In the current work we similarly extend the Blobworld system to address the 'multiple-blob' matching problem. The continuous GMM representation is used in the image representation stage, following which we utilize the EMD distance measure in the matching stage. In addition to providing a distance measure between multiple blob sets, the BlobEMD framework generates a flow matrix that provides correspondences between individual source and target blobs. The BlobEMD flow matrix thus addresses the region-correspondence problem between the two images. This information is used for context-based image model adaptation, as will be exemplified in the following sections.
3 The BlobEMD Framework
In order to measure similarities between images that are represented by homogeneous regions, we need to define an appropriate dissimilarity measure. The problem is harder when the two sets of regions do not have clear correspondences; often, a region in one image matches the union of several regions, or parts of regions, in the second image. An example of this can be seen in Figure 8(a). Both images show a lake and two trees; however, in the left image the lake is represented by a single region, while in the right image it is represented by three regions. Similarly, the treetops in the right image are combined into a single region. In order for the dissimilarity measure to perform properly, it should handle these cases. This is done by the BlobEMD framework.

The BlobEMD framework [9] consists of three main steps (see Figure 1). First, each input image is modeled as a Gaussian mixture distribution in a selected feature space. The EMD is next utilized for measuring similarity between the respective models of two images; in addition to the similarity measure between sets of regions, the EMD also returns the correspondence (flow) between them.
The third step uses these correspondences to adapt one (source) image model based on the model of the second (target) image. Adaptation of the image models achieves context-based modeling and segmentation, and provides better overall image similarity measures. The three steps are described in more detail in the following sections.
3.1 Image representation via Gaussian mixture modeling
In the representation phase, each homogeneous region in the image is represented by a Gaussian distribution, and the set of regions in the image is represented by a Gaussian mixture model (GMM). Pixels are grouped into homogeneous regions in the image plane by grouping feature vectors in a selected feature space. We use the five-dimensional feature space of color and space, (L, a, b, x, y), where (L, a, b) is the 3-dimensional CIE-Lab color space [24] and (x, y) is the spatial image plane. We use the CIE-Lab color space because it was designed so that (short) Euclidean distances between two colors match perceptual similarity. The underlying assumption is that the image colors and their spatial distribution in the image plane are generated by a mixture of Gaussians. It should be noted that the representation model is general and can incorporate any desired feature space (such as color, texture, shape, etc.) or combination thereof.

The distribution of a random variable X ∈ R^d is a mixture of k Gaussians if its density function is

\[
f(x \mid \theta) = \sum_{j=1}^{k} \frac{\alpha_j}{\sqrt{(2\pi)^d |\Sigma_j|}} \exp\Bigl\{-\tfrac{1}{2}(x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j)\Bigr\}, \qquad (1)
\]

such that the parameter set \(\theta = \{\alpha_j, \mu_j, \Sigma_j\}_{j=1}^{k}\) consists of \(\alpha_j > 0\) with \(\sum_{j=1}^{k} \alpha_j = 1\), \(\mu_j \in R^d\), and \(\Sigma_j\) a d × d positive-definite matrix. Given a set of feature vectors \(x_1, \ldots, x_n\), the maximum-likelihood estimate of θ is

\[
\theta_{ML} = \arg\max_{\theta} f(x_1, \ldots, x_n \mid \theta). \qquad (2)
\]
Since a closed-form solution for this maximization problem is not possible, we utilize the Expectation-Maximization (EM) algorithm [5] as an iterative method to obtain \(\theta_{ML}\) (similar to [3]).
The iterative EM algorithm is initialized via the K-means algorithm [7] and is repeated until the log-likelihood measure increases by less than a predefined threshold (1%) from one iteration to the next. The MDL principle [4] is used to select the number of mixture components (the number of means), k, that best suits the natural number of groups present in the image. Once we associate a Gaussian mixture model with an image, the image can be viewed as a set of independently, identically distributed samples from the Gaussian mixture distribution. Examples of images with their respective models are shown in Figures 8-11. Each localized Gaussian mixture is shown as a set of ellipsoids, with each ellipsoid representing the support, mean color, and spatial layout of a particular Gaussian in the image plane. The variability in the number of regions, their layouts, and their colors for similar-content input images is evident in the GMM representation as well as in the image plane.
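As an illustration of this step, the following is a minimal sketch of blob extraction, assuming scikit-learn and scikit-image are available; the function name fit_image_gmm is ours, and BIC is used as a convenient stand-in for the MDL model-selection criterion described above.

```python
import numpy as np
from skimage.color import rgb2lab
from sklearn.mixture import GaussianMixture

def fit_image_gmm(rgb_image, k_candidates=range(2, 9)):
    """Fit a GMM in the joint (L, a, b, x, y) feature space (a sketch)."""
    h, w, _ = rgb_image.shape
    lab = rgb2lab(rgb_image).reshape(-1, 3)
    ys, xs = np.mgrid[0:h, 0:w]                       # pixel coordinates
    feats = np.column_stack([lab, xs.ravel(), ys.ravel()]).astype(float)

    best_gmm, best_score = None, np.inf
    for k in k_candidates:
        # K-means initialization and EM refinement; `tol` stops EM when the
        # log-likelihood gain is small (the paper uses a 1% relative threshold).
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              init_params="kmeans", tol=1e-2, random_state=0)
        gmm.fit(feats)
        score = gmm.bic(feats)   # lower BIC = better fit/complexity trade-off
        if score < best_score:
            best_gmm, best_score = gmm, score
    return best_gmm
```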
3.2 The Earth Mover's Distance (EMD)
In [19] the concept of the Earth Mover's Distance is introduced as a flexible similarity measure between multidimensional distributions, and is described in detail therein. Intuitively, given two distributions represented by sets of weighted features, one can be seen as a mass of "earth" properly spread in the feature space, and the other as a collection of "holes" in that same space. The EMD measures the least amount of work needed to fill the holes with earth. Here, a unit of work corresponds to transporting a unit of earth by a unit of ground distance, which is a distance in the feature space. The EMD is based on the transportation problem [12] and can be solved efficiently by linear optimization algorithms that take advantage of its special structure.

Formally, let \(S = \{(s_1, w_{s_1}), \ldots, (s_m, w_{s_m})\}\) be the first set with m regions, where \(s_i\) is the region descriptor and \(w_{s_i}\) is the weight of the region; let \(T = \{(t_1, w_{t_1}), \ldots, (t_n, w_{t_n})\}\) be the second set with n regions; and let \(DIST = [dist(s_i, t_j)]\) be the ground-distance matrix, where \(dist(s_i, t_j)\) is the distance between regions \(s_i\) and \(t_j\). The EMD between sets S and T is then

\[
\mathrm{EMD}(S,T) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, dist(s_i,t_j)}{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}}, \qquad (3)
\]

where \(F = [f_{ij}]\), with \(f_{ij} \ge 0\) the flow between \(s_i\) and \(t_j\), is the optimal admissible flow from S to T that minimizes the numerator of (3) subject to the following constraints:

\[
\sum_{j=1}^{n} f_{ij} \le w_{s_i}, \qquad
\sum_{i=1}^{m} f_{ij} \le w_{t_j}, \qquad
\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij} = \min\Bigl(\sum_{i=1}^{m} w_{s_i},\; \sum_{j=1}^{n} w_{t_j}\Bigr).
\]
Notice that the two sets can have different total weights; this allows for partial matches [19]. The EMD yields both a distance measure and the actual flow, and both are used in our framework.
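For concreteness, a minimal sketch of the EMD as the linear program above, assuming SciPy; the function name emd is ours, and a production system would use a dedicated transportation solver rather than a generic LP.

```python
import numpy as np
from scipy.optimize import linprog

def emd(ws, wt, dist):
    """EMD and flow matrix for weight vectors ws (m,), wt (n,) and an
    m x n ground-distance matrix, per equation (3) and its constraints."""
    m, n = dist.shape
    c = dist.ravel()                       # minimize sum_ij f_ij * d_ij
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):                     # row sums:    sum_j f_ij <= ws_i
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                     # column sums: sum_i f_ij <= wt_j
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([ws, wt])
    # Total flow equals the smaller total weight (allows partial matches).
    A_eq = np.ones((1, m * n))
    b_eq = [min(ws.sum(), wt.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None))
    flow = res.x.reshape(m, n)
    return (flow * dist).sum() / flow.sum(), flow
```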
3.3 Combining the EMD distance with the GMM representation
The EMD distance is combined with the GMM image representation in the BlobEMD framework. The source and target sets (S and T) are the blob sets (GMMs) of the source and target images, and the EMD is used to find correspondences between the blobs, or regions. These correspondences are optimal in the sense that they minimize the overall EMD distance (equation (3)) between the images. Figure 2 shows the bipartite graph with which the EMD problem is defined and solved. The source and target images yield two sets of blobs, \(\{s_1, \ldots, s_m\}\) and \(\{t_1, \ldots, t_n\}\). The source blobs comprise the vertices of the left-hand side of the bipartite graph, and the target blobs comprise the right-hand vertices. Note that each of the two images can be represented by a different number of blobs. Each connecting arc is weighted by the ground distance between the corresponding source and target blob pair. This ground distance, dist(s, t), can be defined in several ways. Here we use the Fréchet distance [6], which is a closed-form solution to the EMD in the case of two equal-weight Gaussians and is therefore a natural distance for the Gaussian blob representation (see Appendix A). In the EMD algorithm, each vertex has a description and a weight. In our case the vertex description corresponds to the feature vector (blob description), and the weight of a vertex is defined by the relative weight of the corresponding Gaussian, in other words, the relative number of pixels that correspond to the Gaussian (blob). The source and target weights determine how much flow can be transferred from the source blob and to the target blob, respectively. The EMD provides an optimal solution to the minimization problem defined on the bipartite graph, with the constraint that the maximum possible flow is transferred from the source to the target image.
Figure 2: Feature vector (blob) correspondence using a fully-connected bipartite graph.

The generated solution yields the best match between the source and target blobs of the corresponding images, along with an overall minimal distance between the images, as defined by equation (3). Solving the minimization problem produces a flow matrix, which represents the amount of flow on each arc of the fully-connected bipartite graph. Examples of flow matrices can be seen in Figures 8-11. Flow values lie in the interval [0, 1], where 0 indicates that no flow passes through an arc and 1 indicates that the entire weight of the source image is transferred through the arc (this can occur in the trivial situation in which the source and the target images each consist of a single region). The flow matrix shows the transformation of each blob in the source image (rows) to blobs of the target image (columns).
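To illustrate how the pieces combine, the following sketch builds the ground-distance matrix with the Fréchet distance of equation (9) (Appendix A) and feeds it to the emd sketch of Section 3.2; the attribute names (means_, covariances_, weights_) follow the scikit-learn GMM of the Section 3.1 sketch, and the function names are ours.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_s, cov_s, mu_t, cov_t):
    """Fréchet distance between two equal-weight Gaussians (Appendix A, eq. 9)."""
    cross = np.real(sqrtm(cov_s @ cov_t))   # sqrtm may leak tiny imaginary parts
    d2 = np.sum((mu_s - mu_t) ** 2) + np.trace(cov_s + cov_t - 2.0 * cross)
    return np.sqrt(max(float(d2), 0.0))

def blob_emd(gmm_s, gmm_t):
    """BlobEMD distance and flow matrix between two fitted GMMs; blob
    weights are the GMM mixing coefficients."""
    dist = np.array([[frechet_distance(ms, Cs, mt, Ct)
                      for mt, Ct in zip(gmm_t.means_, gmm_t.covariances_)]
                     for ms, Cs in zip(gmm_s.means_, gmm_s.covariances_)])
    return emd(gmm_s.weights_, gmm_t.weights_, dist)
```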
3.4 Image model adaptation
Adaptation of an image model is useful when images are represented in inconsistent ways, for example, under- and over-segmented images in the spatial domain, or dithered images in the color domain. The flow matrix is next used for context-dependent image model adaptation. Model adaptation can be applied in one of two possible modes: (1) adapt the representation model of a source image with respect to a second, target image, while still maintaining similarity to the original model; here only the source image representation is modified while the target image is unaffected.
We hereon refer to this mode as "source-to-target adaptation". (2) Adapt both image models to reach the best common mutual representation, keeping their similarity to the respective images; this mode will be referred to as "mutual adaptation". The model adaptation is performed by an iterative process on the GMM models of the two images, applying a series of merging and splitting steps on the source image GMM, or on both the source and target image GMMs, depending on the mode used. The rules for blob merging and blob splitting are based on the BlobEMD flow matrix and are defined in detail in the following section. In general, two blobs from one image will be considered for merging if they flow (almost) entirely to a single blob in the other image. A blob will be considered for splitting if it flows to several blobs in the other image, and these blobs also receive flow from other blobs in the first image. Without the second condition, the merging rule would be applicable in the opposite direction: merge the blobs in the other image to match the blob in the first image. Merging is always preferred over splitting, to simplify the resulting models.
4 Model Adaptation Rules
The candidate blobs for merging and splitting are chosen based on the flow matrix that results from the BlobEMD computation. Candidate blobs for a merge are characterized by rows (or columns) with a single large value in the same column (or row) of the flow matrix. A candidate blob for splitting is characterized by a row (column) with multiple values such that, for each value, its respective column (row) contains additional non-zero entries. For blobs in the candidate list to qualify for merging or splitting, three additional conditions need to be met:

1. Similarity in feature space. The BlobEMD finds correspondences between all blobs in the source and target images in a way that minimizes the global distance between the two sets of blobs. However, since the EMD process is forced to match all blobs, it often needs to compromise and match blobs, or parts of blobs, that are rather dissimilar from each other. We require the respective candidate blobs in the two images to exhibit good similarity in the feature space. For that we use the same ground distance, \(GD_F(\cdot,\cdot)\), that was used for the BlobEMD computation. In this work we usually use the Fréchet distance in the (L, a, b) color space (see Appendix A). In the case of dithered images the Fréchet distance is used in (x, y) space (as will be shown in Section 5.4).

2. Significant spatial overlap. Even when respective candidate blobs are similar in the feature space, they might not be spatially close enough. Merging and splitting require significant spatial overlap of the blobs. For this purpose we define a second ground distance, \(GD_S(\cdot,\cdot)\), which ignores the similarity in the feature space and measures only the spatial overlap. We require that this measure return zero when, spatially, one blob completely contains the other (i.e., a small blob inside a large blob). Given two blobs s and t, consider the corresponding sets of pixels \(\{p_i\}_{p_i \in s_{2\sigma}}\) and \(\{p_j\}_{p_j \in t_{2\sigma}}\), where \(s_{2\sigma}\) and \(t_{2\sigma}\) are the 2σ projections of the Gaussian blobs on the (x, y) plane (i.e., all the pixels within Mahalanobis distance 2σ of the Gaussian blobs). We define this distance as

\[
GD_S(s,t) = 1 - \frac{|\{p_i\} \cap \{p_j\}|}{\min\bigl(|\{p_i\}|, |\{p_j\}|\bigr)}, \qquad (4)
\]

where \(|\cdot|\) denotes the size of the set (a code sketch of this measure follows Table 1).

3. Significant flow. For a merge, we require that nearly all the weights of the candidate blobs flow to the corresponding target blob. To split a candidate blob, we require that the resulting blobs are not too small, i.e., that the candidate blob has a significant flow to each of the corresponding target blobs.

The conditions for merging and splitting are summarized in Table 1. The weight of blob \(s_i\) is denoted by \(w_{s_i}\), and the flow between source blob \(s_i\) and target blob \(t_j\) by \(f(s_i, t_j)\). The conditions involve several empirical thresholds that are application and domain dependent (see examples in Section 5). Notice that for the spatial similarity condition, the threshold for the merge, \(C_{S1}\), differs from the threshold for the split, \(C_{S2}\): for the merge we demand that the target blob overlap the two source blobs, while for the split we require only partial overlap, so in general \(C_{S2} < C_{S1}\). This reasoning also applies to the threshold of the significant-flow condition. For the merge we want \(C_{flow1}\) to be close to 1, meaning that all the weights of the source blobs flow to the target blob. For the split we require each of the target blobs to carry a significant amount of the source blob; therefore, \(C_{flow2} < C_{flow1} < 1\).
Table 1: Merge and split conditions.

Merge (source blobs \(s_i, s_j\) into the blob matching target blob \(t_k\)):
- Feature-space similarity: \(GD_F(s_i, t_k) < C_F\) and \(GD_F(s_j, t_k) < C_F\)
- Spatial similarity: \(GD_S(s_i, t_k) < C_{S1}\) and \(GD_S(s_j, t_k) < C_{S1}\)
- Significant flow: \(f(s_i, t_k)/w_{s_i} > C_{flow1}\) and \(f(s_j, t_k)/w_{s_j} > C_{flow1}\)

Split (source blob \(s_i\) toward target blobs \(t_k, t_l\)):
- Feature-space similarity: \(GD_F(s_i, t_k) < C_F\) and \(GD_F(s_i, t_l) < C_F\)
- Spatial similarity: \(GD_S(s_i, t_k) < C_{S2}\) and \(GD_S(s_i, t_l) < C_{S2}\)
- Significant flow: \(f(s_i, t_k)/w_{s_i} > C_{flow2}\) and \(f(s_i, t_l)/w_{s_i} > C_{flow2}\)

The model adaptation process consists of several consecutive merging and splitting steps conducted on the source and target images. Next we describe the merging and splitting steps in detail; a description of the entire process will follow.
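First, as referenced in condition 2 above, here is a minimal sketch of the spatial ground distance \(GD_S\) of equation (4), assuming each blob's 2σ support is available as a boolean mask over the image plane; the function name is ours.

```python
import numpy as np

def spatial_ground_distance(mask_s, mask_t):
    """GD_S of equation (4): one minus the overlap of the 2-sigma supports,
    normalized by the smaller support. Returns 0 when one blob's support
    is fully contained in the other's."""
    overlap = np.logical_and(mask_s, mask_t).sum()
    return 1.0 - overlap / min(mask_s.sum(), mask_t.sum())
```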
4.1 Blob merging
In the merging process the mixture model is updated, resulting in a smaller set of blobs with updated feature characteristics. The process is iterative, passing through the entire list of merge candidates and terminating when no additional merging is possible.
Figure 3: Synthetic example of the source-to-target merging process. (a) A cross image is the source image (left) that is matched to the target, line image (right); (b) initial image models (representation layer); (c) final image models following source model adaptation. Notice that the two blobs in the source image that match the line in the target image were merged together.

The merging process replaces pairs of blobs from the source image with a single new blob. The new blob's spatial position and statistics are based on the original source blobs. Given two blobs \(b_i = (w_i, \mu_i, \Sigma_i)\) and \(b_j = (w_j, \mu_j, \Sigma_j)\), the merged blob parameters \(b = (w, \mu, \Sigma)\) are calculated as follows:

\[
w = w_i + w_j, \qquad (5)
\]
\[
\mu = \frac{w_i}{w}\,\mu_i + \frac{w_j}{w}\,\mu_j, \qquad (6)
\]
\[
\Sigma = \frac{w_i}{w}\,(\Sigma_i + \mu_i\mu_i^T) + \frac{w_j}{w}\,(\Sigma_j + \mu_j\mu_j^T) - \mu\mu^T. \qquad (7)
\]
The derivations of these equations can be found in Appendix B. Figure 3 shows an example of the context-based merging process. An image of a cross is the source image (left) that is matched to the target, an image of a line (right). The initial image models are shown in the center row, and the resulting image models, following source model adaptation, are shown in the bottom row. Perceptually, the image models look more similar following the merging process.
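The merge step itself is a direct transcription of equations (5)-(7), as sketched below under the assumption that blob parameters are held as NumPy arrays; the function name is ours.

```python
import numpy as np

def merge_blobs(w_i, mu_i, cov_i, w_j, mu_j, cov_j):
    """Merge two Gaussian blobs into one, per equations (5)-(7)."""
    w = w_i + w_j                                         # (5)
    mu = (w_i * mu_i + w_j * mu_j) / w                    # (6)
    # (7): second moments combine linearly; subtract the new mean's outer product.
    cov = (w_i * (cov_i + np.outer(mu_i, mu_i))
           + w_j * (cov_j + np.outer(mu_j, mu_j))) / w - np.outer(mu, mu)
    return w, mu, cov
```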
4.2 Blob splitting
Splitting occurs, for example, in images with a large uniform background that is represented by a single large blob, or when the segmentation process results in a small number of segments (under-segmentation). Often, splitting blobs enables the blob parts to be merged with other blobs in a follow-up merging process. Hereon we term the set of target blobs to which the source blob flows the "target-blobs" set. Once the target-blobs set is defined per source blob, we wish to split the source blob into a set of smaller blobs, each corresponding to one of the target blobs in the set. The splitting process is done as follows (a code sketch is given after the figure description below):

1. Randomly sample the source blob according to its Gaussian distribution.
2. Probabilistically affiliate each sample x with each target-blob distribution \(g_j(x \mid \bar\theta_j)\), j = 1, ..., N.
3. For each target blob j, collect the set of M samples from the source blob with the highest affiliation to blob j.
4. Learn a Gaussian from each set of M samples.
5. Update the source image mixture model accordingly.

Figure 4 shows an example of source-to-target context-based splitting on synthetic images. The representation of the input image (top left) is updated according to the given target image (top right). The input representation layer is shown in the center row. The resulting output models, following one step of source model adaptation, are shown in the bottom row. Note that if mutual adaptation were pursued in this case, a merging of the target model would have preceded the splitting of the source model.
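As promised above, a minimal sketch of the five-step splitting procedure, assuming NumPy and SciPy; all names are ours, and the division of the source weight among the new blobs (e.g., in proportion to the EMD flow) is left to the caller.

```python
import numpy as np
from scipy.stats import multivariate_normal

def split_blob(mu_s, cov_s, targets, n_samples=2000, m=200, seed=0):
    """Split a source blob toward its target-blobs set.

    `targets` holds (mean, covariance) pairs of the target blobs.
    """
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu_s, cov_s, size=n_samples)   # step 1
    new_blobs = []
    for mu_t, cov_t in targets:
        # Step 2: affiliation of every sample with this target distribution.
        p = multivariate_normal(mean=mu_t, cov=cov_t).pdf(samples)
        top = samples[np.argsort(p)[-m:]]                            # step 3
        # Step 4: learn a Gaussian from the M most-affiliated samples.
        new_blobs.append((top.mean(axis=0), np.cov(top, rowvar=False)))
    return new_blobs  # step 5: replace the source blob with these in the mixture
```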
4.3 The complete adaptation process
Figure 5 shows a flow chart of the complete adaptation process. The process is an iterative merging and splitting process that modifies either the source model or both the source and the target models, according to the adaptation mode (source-to-target or mutual adaptation).
The update loop terminates once no change is found in the source model (source-to-target adaptation) or in both the source and the target models (mutual adaptation mode). An optional post-processing step follows the main update loop; it includes an additional source-target merging step followed by an intra-merging step. Intra-merging is an additional blob-merging step, pursued in the mutual adaptation mode, for each of the source and target models; in effect it acts as a smoothing filter on the model. The blob set of each image is checked for pairs of highly similar blobs: two blobs \(b_i\) and \(b_j\) within an image may be merged if they are close in feature space and exhibit spatial overlap. We use the following criteria: \(GD_F(b_i, b_j) < 0.05\) and \(GD_S(b_i, b_j) < 1.0\). The intra-merging step was found to be helpful in cases that result in many small blobs, i.e., when the optimal match still entails a very large set of blobs (such a case may occur if we start with a large set of blobs in each image). The outcome of the adaptation process is a set of newly segmented source and target models with a final updated distance measure between them.
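The loop of Figure 5 can be summarized by the following skeleton; merge_step and split_step are hypothetical routines that apply the Table 1 rules to the current flow matrix and report whether the model changed, and blob_emd is the sketch from Section 3.3.

```python
def adapt_models(source, target, mode="source-to-target"):
    """Skeleton of the iterative adaptation loop of Figure 5 (a sketch)."""
    changed = True
    while changed:
        _, flow = blob_emd(source, target)     # current distance and flow matrix
        # Apply the Table 1 merge/split rules to the source model.
        changed = merge_step(source, flow) or split_step(source, flow)
        if mode == "mutual":
            # In mutual mode the target model is adapted as well.
            changed = (merge_step(target, flow.T) or
                       split_step(target, flow.T) or changed)
    # Optional post-processing: a final source-target merge, then intra-merging.
    return source, target
```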
5 Experimental Results of the BlobEMD Framework
We have described the BlobEMD framework, which consists of three main steps: first, each input image is modeled as a Gaussian mixture distribution in the joint (L, a, b, x, y) feature space; next, the EMD is utilized for measuring similarity between the respective models of two images, returning, in addition to the similarity measure between sets of regions, the correspondence (flow) between them; the third step uses these correspondences to adapt the source and target models according to the adaptation mode chosen. In this section we present an investigative analysis of the BlobEMD framework. We start with the combination of the first two steps, the GMM representation and the EMD distance, without the merging and splitting steps, and investigate the framework's robustness in the image matching task and its application to the image retrieval task. We then illustrate the utilization of the flow matrix for model adaptation within several application domains, including context-based image segmentation and dithered-image matching.
5.1 Robustness to fragmentation in the image representation
Images with semantically similar content may be represented by differing numbers of regions via the Gaussian mixture model (the parameter k). The goal is to compare and match images regardless of this variability, i.e., to show robustness to it. In [10] we introduced a novel intra-inter class statistical evaluation methodology as a benchmarking measure. The intra-class set corresponds to pairs of image samples with similar content, and the inter-class set corresponds to pairs of images with different content. We use the inter-intra evaluation scheme to evaluate the robustness of the BlobEMD framework to fragmentation in the image representation. In this experiment we use a random set of 245 images extracted from the COREL database. The ground truth is generated by choosing four mixture representations (four values of k: k = 3, 4, 5, 6) per input image. The "intra-class" distance set is computed as the distances between all combinations of representation models per image. Note that the similarity of the models within the "intra-class" set is an objective one and does not depend on subjective labeling. Overall we have a set of 12 non-zero distances per image. This process is repeated for each of the 245 images in the database, for an overall 12 × 245 distances. A second set of distances is computed across images, with each image represented by the MDL-chosen mixture representation (the optimal k value). We term this set of distances (with 245 × 244 distances) the "inter-class" distance set. A histogram of the "intra-class" and "inter-class" distances is plotted in each of the two graphs presented in Figure 6. The graph on the left shows results in the color-only feature space, while the graph on the right shows the distances between images when compared in the combined color and spatial (L, a, b, x, y) domain. Two distinct modes are present in both graphs, demonstrating the clear separation between the sets. The "intra-class" distances are narrowly spread at the lower end of the axis (close to zero), as compared to the widely spread and larger distance values of the "inter-class" set. The results indicate the strong similarity between same-class models (same image with different values of k), regardless of the variability in the representation. The BlobEMD framework is thus robust to fragmentation in the representation space.
5.2 Statistical performance evaluation
We next demonstrate the applicability of the presented framework to the image retrieval task. In addition to the random set of 245 images, an additional set of 70 images was hand-picked as comprising 6 different classes or categories (10 images per class). The labeled categories are: "car", "desert", "field", "monkey", "snow", and "waterfall". Each image in the database is processed to extract the localized Gaussian mixture representation. The BlobEMD with the Fréchet ground distance is then computed between each of the images and an input query image. The images are sorted by distance, and the closest ones are presented as the retrieval results. Retrieval results are evaluated by precision versus recall (PR) curves. Recall measures the ability to retrieve all relevant or similar information items in the database; it is defined as the ratio between the number of relevant or perceptually similar items retrieved and the total number of relevant items in the database (in our case, 10 relevant images for each of the labeled classes). Precision measures the retrieval accuracy and is defined as the ratio between the number of relevant or perceptually similar items and the total number of items retrieved (a sketch of this computation follows the observations below). PR curves are extracted for each of the 6 categories. A comparison is conducted with the global histogram representation under several histogram distance measures, as well as with our earlier work on the GMM-KL framework [10]. In the GMM-KL framework the continuous KL distance is used to measure the distance between two continuous distributions, the two GMMs representing the two input images. The definition of the continuous KL distance is given in Appendix A. Histogram measures include the bin-by-bin Euclidean distance (Euc.), the histogram intersection measure (H.I.), and the discrete KL measure (Disc. KL) [23, 20, 17]. A binning of 8 × 8 × 8 is used in the histogram representation. This resolution (512 quantization levels) is commonly found in the literature and is of the same order of magnitude as (and favorable to) the GMM representation. The curves are presented in Figure 7; each plot is an average of the results of the 10 query images in the class. We note the following points:

1. In most cases retrieval results are better when using color-only features (dashed black line) and slightly worse when adding spatial features (dashed red line). This is in correspondence with the earlier results shown in Figure 6 and agrees with previous works (e.g., [23]).
2. The BlobEMD framework provides results very similar to those of the GMM-KL framework. In some cases the BlobEMD is better, and in others the GMM-KL framework obtains better results. This behavior is to be expected, as the two schemes are closely related (with the advantage of the BlobEMD being its support for model adaptation).

3. In all cases, the BlobEMD method provides better performance than the histogram-based methods.
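For reference, precision and recall at every rank of a retrieval list can be computed as below; a minimal sketch with names of our choosing, where sorted_relevance flags whether each retrieved image shares the query's class, ordered by ascending BlobEMD distance.

```python
import numpy as np

def precision_recall_curve(sorted_relevance, n_relevant=10):
    """Precision and recall at every rank of the sorted retrieval list."""
    hits = np.cumsum(sorted_relevance)             # relevant items retrieved so far
    ranks = np.arange(1, len(sorted_relevance) + 1)
    return hits / ranks, hits / n_relevant         # precision, recall
```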
5.3 Context-based image segmentation
In this and the following sections we focus on the model adaptation task. The challenge of image segmentation is treated in this work as an image-pair (source-target) task: an image will be segmented differently based on the context, as reflected by the target image. The model adaptation is performed by an iterative process on the GMM models of the two images, applying a series of merging and splitting steps on the source image GMM, or on both the source and target image GMMs, depending on the adaptation mode used. The rules for blob merging and blob splitting are based on the BlobEMD flow matrix, as defined in Table 1. In the experiments presented in this section the following thresholds were used. Merging rule thresholds: \(C_F = 0.2\), \(C_{S1} = 0.75\), and \(C_{flow1} = 0.6\). Splitting rule thresholds: \(C_F = 0.2\), \(C_{S2} = 0.75\), and \(C_{flow2} = 0.01\). Thresholds were selected heuristically, based on experimentation. In Figure 8 we illustrate context-based image model adaptation for adaptive segmentation and image-pair matching on the Lake image example. In this example, similar semantic content ("trees next to a lake") is represented by a different number of regions and region colors (a). The treetops are separate in one image and merged in the other, while the lake appears as separate blobs in one image and as a single blob in the other. The initial source and target image models are shown in (b), with the corresponding flow matrix shown in (c). The updated source and target image segmentation maps, image models, and corresponding flow matrix are shown in (d), (e), and (f), respectively. Note the resemblance of the two updated image models in (e) vs. the initial representation in (b). The context-based model update results in updated image distances.
In this example, the BlobEMD distance is 0.08 in the initial representation phase and 0.04 in the final representation; a decrease of 50% in the distance is achieved via the update process. A second example is shown in Figure 9. In (a) we show two similar images of a red car. Due to different segmentation processes, they result in very different segmentations, as shown in (b) and (c), top. The corresponding GMM models are also significantly different, as shown in (b) and (c), bottom. Using the model adaptation process (source-to-target adaptation), the final modeling and segmentation results are shown in (d). The region-correspondence process, along with merging and splitting, provides an updated model that results in very similar segmentations of the two images (compare (b) and (d)). Note also that the model adaptation results in smoother regions and similar-looking object (car) silhouettes.
5.4 Matching dithered images
Dithered images are images with reduced resolution in color space: due to limitations of the display or printing device, or because of a compression process, only a limited set of discrete colors is used. The perceived color relies on our ability to blend a mixture of sometimes very different colors into coherent colors that are not in the given set, as in the example of the monkey in Figure 10(a). When a dithered image is modeled using only the limited set of colors, the resulting model is very different from the model of the original, non-dithered image. Classical techniques such as histograms fail to identify the similarity of the two models. Using the BlobEMD framework, we can adapt the dithered image representation according to the target image representation and enable a comparison between them. The following algorithm characteristics apply for dithered images. The similarity in feature space, \(GD_F\), is the Fréchet ground distance on (x, y) space only. Here we do not use the color information for the ground distance, as the distance between dithered image colors and their original image colors may be large, while the mixture of the dithered colors may closely resemble the desired color at that location. The merging process in the color feature space is thus critical in this application domain. The criteria for the merging process are in the spatial domain: the blobs to be merged overlap in space (two colored blobs in the dithered image overlap and flow to the same blob in the target image).
The thresholds used are the following. Merging rule thresholds: \(C_F = 1.0\), \(C_{S1} = 0.6\), and \(C_{flow1} = 0.6\). Splitting rule thresholds: \(C_F = 0.2\), \(C_{S2} = 0.75\), and \(C_{flow2} = 0.01\). Thresholds were selected heuristically, based on experimentation. Figure 10 shows an example of comparing a target image (top left) with a dithered version (27 colors) as the source (query) image (top right). A zoom-in window is shown and clearly demonstrates the differences between the two input images. Source-to-target adaptation is used. The initial models extracted for the two images are shown at the bottom of (b) and (c), with the corresponding segmentation maps at the top. The differences between the images are again evident in their respective models. Using the BlobEMD framework enables a model adaptation process, with a final updated model that fits the target model in both color and spatial layout (d). Note the strong resemblance between the models of (b) and (d), especially as contrasted with (c). A second example is presented in Figure 11. The target image is shown top left, and a dithered version (27 colors) as a query image is shown top right. An extension to mutual adaptation is shown in (d): here the target image model is adapted as well, for a final result that is a more compact representation of both source and target images. The updated representation results in updated image distances. In this example the BlobEMD distance in the initial representation phase is 0.1. Following source-to-target adaptation the distance is reduced to 0.05, and in the final mutual adaptation stage the distance is 0.036. A decrease of more than 50% in the distance is achieved via the update process.
6 Discussion
In this work we present the BlobEMD framework as a simultaneous solution to both the image region-correspondence problem and the estimation of an image-pair distance. This coupling allows for context-based model adaptation, where the representation of one image is adjusted based on the representation of a second image - the context. We present a different approach to the image segmentation problem: rather than trying to estimate the "true" segmentation of an image, the BlobEMD framework provides context-dependent image segmentation. The segmentation problem is treated in conjunction with the image matching problem.
An image may be segmented differently according to the target image it is being compared with. Context-based image segmentation and image matching are enabled via the EMD flow. In the BlobEMD framework the image is represented in the continuous domain using GMM statistical modeling. The EMD optimization enables matching of individual model components (Gaussians, or blobs) while providing an overall distance measure between the image distributions. There are interesting distinctions from earlier work: the image is represented via a continuous and probabilistic representation, as opposed to the well-known discrete histogram representation, and global image matching is achieved along with a correspondence mapping of the individual representation components. This mapping is not available in global matching techniques such as the recently proposed GMM-KL framework. A comparison between the BlobEMD and GMM-KL methods was presented in the experimental section. The results demonstrate a strong correlation between the performance of the two approaches. The two approaches share the same representation of the image space and differ in the distance measures used for image matching. The GMM-KL framework is a continuous probabilistic framework throughout, with the continuous KL distance measure used for statistically comparing two GMM distributions. The BlobEMD framework provides the global distance measure along with insight into the correspondences found between individual mixture components, or image regions. This mapping is essential for model adaptation and for any other application that relies on region correspondences. The price paid for this inside view is a slight decrease in the accuracy of the global distance measure. An open theoretical issue for investigation is the definition of an appropriate ground distance for Gaussian, or blob, comparison. Both the KL distance and the Fréchet distance are defined for equal-weight Gaussians; a challenge remains to find a more exact mathematical formalism for the comparison between non-equal-weight Gaussians, as is the case at hand. Using the BlobEMD framework, we solve the region-correspondence problem across an image pair. The correspondences between the regions in the two images are used to merge and split the regions so that they still represent the images well but at the same time bring the two representations to a common context. The suggested framework provides image representations that are more uniform and best aligned between the two images to be matched.
We view this work as a first step in an extensive research effort in which we will augment the region representation vector to include features such as texture, size, and shape, in addition to the color features chosen here. A definition of a hierarchical matching framework is under way. Region correspondences based on low-level features such as color and texture may provide a semantically plausible image segmentation, thus enabling the extension of the feature space to include higher-level, more semantic region characteristics, such as region sizes and shapes. In Figure 9 we see that the model adaptation results in smoother regions and similar-looking object (car) silhouettes. The BlobEMD methodology may provide the means for the much-desired transition from regions to silhouettes and shapes.
A Fréchet ground distance
The Fréchet distance is a special case of the Monge-Kantorovich mass transference problem [18], which is the basis of the EMD. The general Monge-Kantorovich problem is defined as

\[
\inf\left\{ \int_{U \times U} c(s,t)\, P(ds, dt) \;:\; P \in \mathcal{P}(P_1, P_2) \right\}, \qquad (8)
\]

where \(P_1\) and \(P_2\) are two Borel probability measures given on a separable metric space (U, d), and \(\mathcal{P}(P_1, P_2)\) is the space of all Borel probability measures P on U × U with fixed marginals \(P_1(\cdot) = P(\cdot \times U)\) and \(P_2(\cdot) = P(U \times \cdot)\). \(P_1\) and \(P_2\) are the initial and final distributions, and P is the optimal transference plan, or flow, as used in this work. c(s, t) is the cost function, which in our work is the Euclidean distance.

The Fréchet distance [6] solves the general Monge-Kantorovich problem for the case where s and t are normal distributions with means \(\mu_s, \mu_t\) and covariance matrices \(\Sigma_s, \Sigma_t\), respectively:

\[
d^2(s,t) = \lVert \mu_s - \mu_t \rVert^2 + \mathrm{tr}\Bigl[\Sigma_s + \Sigma_t - 2\,(\Sigma_s \Sigma_t)^{1/2}\Bigr]. \qquad (9)
\]

It is a closed-form solution to the EMD in the case of two equal-weight Gaussians and is a natural distance for the Gaussian blob representation. Unfortunately, when two Gaussian blobs have different weights, the Fréchet distance is not valid. An extension for the non-equal-weights case is yet to be investigated.
B Merging blob statistics
Let \(b_i = (w_i, \mu_i, \Sigma_i)\) and \(b_j = (w_j, \mu_j, \Sigma_j)\) be two blobs to be merged, where \(w_i, w_j\) are the weights, \(\mu_i, \mu_j\) the means, and \(\Sigma_i, \Sigma_j\) the covariance matrices of the blobs. We look for the blob \(b = (w, \mu, \Sigma)\) that represents the statistics of the union of the two sets of pixels represented by the two blobs. Let \(n_i\) and \(n_j\) be the numbers of pixels represented by blobs \(b_i\) and \(b_j\), respectively, and let n be the total number of pixels in the image, so that \(w_i = n_i/n\) and \(w_j = n_j/n\). We have

\[
\mu_i = \frac{1}{n_i}\sum_{p \in b_i} p, \qquad \mu_j = \frac{1}{n_j}\sum_{p \in b_j} p.
\]

Combining the two sets of pixels \(b_i \cup b_j\), we get the combined mean

\[
\mu = \frac{1}{n_i+n_j}\sum_{p \in b_i \cup b_j} p
= \frac{1}{n_i+n_j}\Bigl(\sum_{p \in b_i} p + \sum_{p \in b_j} p\Bigr)
= \frac{n_i \mu_i + n_j \mu_j}{n_i+n_j}
= \frac{n_i/n}{(n_i+n_j)/n}\,\mu_i + \frac{n_j/n}{(n_i+n_j)/n}\,\mu_j
= \frac{w_i}{w}\,\mu_i + \frac{w_j}{w}\,\mu_j,
\]

where \(w = w_i + w_j\). Similarly, for the covariance matrices we have

\[
\Sigma_i = \frac{1}{n_i}\sum_{p \in b_i} p\,p^T - \mu_i \mu_i^T, \qquad
\Sigma_j = \frac{1}{n_j}\sum_{p \in b_j} p\,p^T - \mu_j \mu_j^T.
\]

Combining the two sets of pixels \(b_i \cup b_j\), we get the combined covariance

\[
\Sigma = \frac{1}{n_i+n_j}\sum_{p \in b_i \cup b_j} p\,p^T - \mu\mu^T
= \frac{1}{n_i+n_j}\Bigl(\sum_{p \in b_i} p\,p^T + \sum_{p \in b_j} p\,p^T\Bigr) - \mu\mu^T
= \frac{n_i(\Sigma_i + \mu_i\mu_i^T) + n_j(\Sigma_j + \mu_j\mu_j^T)}{n_i+n_j} - \mu\mu^T
= \frac{w_i}{w}\,(\Sigma_i + \mu_i\mu_i^T) + \frac{w_j}{w}\,(\Sigma_j + \mu_j\mu_j^T) - \mu\mu^T.
\]
References

[1] J. R. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Jain, and C.-F. Shu. Virage image search engine: an open framework for image management. In R. Jain (ed.), Symposium on Electronic Imaging: Science and Technology - Storage and Retrieval for Image and Video Databases IV, volume IV, pages 76-87, 1996.
[2] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Region-based image querying. In Proc. of the IEEE Workshop on Content-Based Access of Image and Video Libraries (CVPR'97), pages 42-49, 1997.
[3] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:1026-1038, August 2002.
[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.
[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. B, 39(1):1-38, 1977.
[6] D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, volume 12, 1982.
[7] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, 1973.
[8] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, et al. Query by image and video content: the QBIC system. IEEE Computer, 28(9):23-32, 1995.
[9] H. Greenspan, G. Dvir, and Y. Rubner. Region correspondence for image matching via EMD flow. In Proceedings of the CVPR 2000 Workshop on Content-Based Access of Image and Video Libraries, pages 27-31, 2000.
[10] H. Greenspan, J. Goldberger, and L. Ridel. A continuous probabilistic framework for image matching. Computer Vision and Image Understanding, 84:384-406, December 2001.
[11] J. Hafner, H. Sawhney, W. Equitz, M. Flickner, and W. Niblack. Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. Pattern Analysis and Machine Intelligence, 17(7):729-739, 1995.
[12] F. L. Hitchcock. The distribution of a product from several sources to numerous localities. J. Math. Phys., 20:224-230, 1941.
[13] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih. Image indexing using color correlograms. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pages 762-768, 1997.
[14] S. Kullback. Information Theory and Statistics. Dover, 1968.
[15] W. Ma and B. Manjunath. NeTra: A toolbox for navigating large image databases. In Proceedings of the IEEE Int. Conf. on Image Processing, pages 568-571, 1997.
[16] G. Pass and R. Zabih. Comparing images using joint histograms. Multimedia Systems, 7:234-240, 1999.
[17] J. Puzicha, Y. Rubner, C. Tomasi, and J. M. Buhmann. Empirical evaluation of dissimilarity measures for color and texture. In IEEE International Conference on Computer Vision, pages 1165-1172, 1999.
[18] S. T. Rachev. The Monge-Kantorovich mass transference problem and its stochastic applications. Theory of Probability and its Applications, XXIX(4):647-676, 1984.
[19] Y. Rubner and C. Tomasi. Perceptual Metrics for Image Database Navigation. Kluwer Academic Publishers, Boston, December 2000.
[20] J. R. Smith. Integrated Spatial and Feature Image Systems: Retrieval, Analysis and Compression. PhD thesis, Columbia University, 1997.
[21] J. R. Smith and S.-F. Chang. Integrated spatial and feature image query. Multimedia Systems, 7:129-140, 1999.
[22] M. Stricker and A. Dimai. Spectral covariance and fuzzy regions for image indexing. Machine Vision and Applications, 10(2):66-73, 1997.
[23] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.
[24] G. Wyszecki and W. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. Wiley, 1982.
Figure 4: Synthetic example of the source-to-target splitting process. (a) The source image (left) is matched to the target image (right); (b) initial image models (representation layer) and their flow matrix from source to target; (c) image models after splitting; (d) final image models after merging, following source model adaptation.
Figure 5: Model adaptation flow chart
Figure 6: Statistical analysis of intra-class distances (black) vs. inter-class distances (white). (a) (L, a, b) feature space; (b) (L, a, b, x, y) feature space. The x-axis is the BlobEMD distance and the y-axis is the frequency of occurrence of the respective distance in each of the two feature spaces.
Figure 7: Precision vs. recall; 315 images in the database. The six panels correspond to the classes "field", "desert", "car", "snow", "monkey", and "waterfall"; each plot is an average of the results of the 10 query images in the class. Dashed curves are the BlobEMD results: dashed black for color only, dashed red for color and (x, y). Solid lines are for comparison: black is the PR curve of the GMM-KL framework, and the purple, red, and green curves correspond to the histogram representation with the Euc., H.I., and Disc. KL distance measures, respectively.
Figure 8: Context-based image representation and matching via BlobEMD. (a) An image pair example; (b) source and target image models; (c) corresponding flow matrix; (d) updated source and target image segmentation maps; (e) updated source and target image models; (f) updated flow matrix.
Figure 9: Context-based model adaptation for segmentation. (a) Input images; (b) target image: segmentation map (top) and GMM representation (bottom); (c) source image: segmentation map (top) and GMM representation (bottom); (d) source image after context-based segmentation: adapted segmentation map (top) and adapted GMM representation (bottom).
Figure 10: Context-based model adaptation for dithered image representation. (a) Target image (left) and a dithered version (27 colors only) as a query image (right); a zoom-in window is shown at the bottom. (b) and (c) Target and source image models are shown at the bottom, with the corresponding segmentation maps on top; (d) final updated model using source-to-target adaptation.
Figure 11: Context-based model adaptation for dithered image representation. (a) Original images; (b) initial image models; (c) source model adaptation according to the target model (source-to-target adaptation); (d) mutual adaptation of both source and target models.