Research Track Paper

Mining Images on Semantics via Statistical Learning

Jianping Fan
Dept of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA
[email protected]

Hangzai Luo, Yuli Gao
Dept of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA
hluo, [email protected]

Mohand-Said Hacid
UFR Informatique, Universite Claude Bernard Lyon 1, Lyon, FRANCE
[email protected]

ABSTRACT

In this paper, we propose a novel framework to enable hierarchical image classification via statistical learning. By integrating the concept hierarchy for semantic image concept organization, a hierarchical mixture model is proposed to enable multi-level modeling of semantic image concepts and hierarchical classifier combination. Thus, learning the classifiers for the semantic image concepts at the high level of the concept hierarchy can be effectively achieved by detecting the presence of the relevant base-level atomic image concepts. To effectively learn the base-level classifiers for the atomic image concepts at the first level of the concept hierarchy, we propose a novel adaptive EM algorithm to achieve more effective model selection and parameter estimation. In addition, a novel penalty term is proposed to effectively eliminate the misleading effects of the outlying unlabeled images on semi-supervised classifier training. Our experimental results in a specific image domain of outdoor photos are very encouraging.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - image databases.

General Terms
Algorithms, Measurement, Experimentation

Keywords
Image classification, hierarchical mixture model, adaptive EM algorithm.

1. INTRODUCTION

As high-resolution digital cameras become more affordable and widespread, the number of high-quality digital images on the Internet has exploded. With the exponential growth of high-quality digital images, the need for mining image databases on semantics is becoming increasingly important to enable semantic image retrieval via keywords [1-5]. Semantic image classification is a promising approach for mining large-scale image databases on semantics and has attracted the interest of researchers from a variety of fields [1-16, 28].

To enable image mining on semantics, it is very important to achieve a middle-level understanding of image semantics (i.e., bridging the semantic gap between image semantics for human interpretation and low-level visual features extracted by computers for image content representation) [10-11]. Thus, the following inter-related problems should be addressed jointly: (1) What are the suitable image patterns that are able to achieve a middle-level understanding of the semantics of image contents? (2) What are the underlying concept models for accurately interpreting the semantic image concepts of particular interest? (3) What is the basic vocabulary of semantic image concepts of particular interest in a specific image domain? (4) Given the basic vocabulary of semantic image concepts, how can we learn the underlying concept models more accurately?

To address the first issue, two approaches are widely used for image content representation and feature extraction: (a) image-based approaches that use the whole images for feature extraction; (b) region-based approaches that take the homogeneous image regions as the underlying image patterns for feature extraction. One common weakness of the region-based approaches is that the homogeneous image regions have little correspondence to the image semantics, and thus they are not effective in supporting semantic image classification [12]. Without performing image segmentation, the image-based approaches may not work very well for images that contain individual objects, because only the global visual features are used for image content representation [15]. To enable more accurate interpretation of various semantic image concepts, we have recently developed a novel technique for semantic image classification and automatic annotation by using salient objects [5].

To address the second issue, the Gaussian mixture model (GMM) has been widely used for semantic image concept interpretation with a pre-defined model structure [7-8]. However, different semantic image concepts may relate to different numbers and types of image blobs, and thus automatic techniques for model selection are strongly needed.

Unfortunately, there are no existing works in the literature that effectively address the third and fourth issues. Most existing techniques for semantic image classification ignore the hierarchical relationships between the semantic image concepts at different semantic levels [4]. They independently learn a set of flat classifiers for various semantic image concepts of particular interest. However, the semantic image concepts at the high level of the concept hierarchy may have larger hypothesis variance, and directly learning the flat classifiers for these high-level semantic image concepts may result in low prediction accuracy. Some pioneering work has shown that the accuracy of text classifiers can be significantly improved by taking advantage of a concept hierarchy such as WordNet and training the classifiers for multiple concepts hierarchically [21-26]. This observation also encourages us to integrate the concept hierarchy to enable more effective image classification.

Another major difficulty for most existing image classification techniques is that a large number of labeled images are required to learn the concept models accurately. Unfortunately, labeling a large number of training images is very expensive and time-consuming. Given this costly labeling problem, it is very attractive to design semi-supervised classifier training techniques that can take advantage of unlabeled samples [29-32]. However, there are two basic assumptions for most existing semi-supervised classifier training techniques: (a) each unlabeled image originates from one of the known image context classes (i.e., existing mixture components that have already been used for semantic image concept interpretation); (b) all these relevant image context classes can be effectively learned from the available labeled images. When only a limited number of labeled images are available for classifier training, these two basic assumptions are not satisfied because of concept uncertainty (i.e., presence of new concepts, outliers, and unknown image context classes). Considering that many image context classes and new concepts have not even occurred in a limited number of available labeled images, using the outlying unlabeled images will corrupt the density estimation and lead to worse performance rather than improvement when the model structure is incorrect [29].

Based on these observations, we propose a novel framework to enable hierarchical image classifier training. This paper is organized as follows: Section 2 presents a novel framework for hierarchical organization and modeling of semantic image concepts; Section 3 proposes a novel algorithm for hierarchical image classifier training; Section 4 introduces our technique for semantic image classification via partial matching; Section 5 shows our work on algorithm evaluation; we conclude the paper in Section 6.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD '05, August 21-25, 2005, Chicago, IL, USA Copyright 2005 ACM 1-59593-135-X/05/0008...$5.00.

[Figure 1 diagram labels: Higher-Level Semantic Image Concepts 1, ..., k, ..., Nc; Atomic Image Concepts 1, ..., i, ..., Ne; Image Blob Types 1, ..., j, ..., Ns; Concept Hierarchy]

Figure 1: The hierarchical framework for semantic image concept organization and modeling.

[Figure 2 diagram labels: root node; Sports, Outdoor; Prairie, Mountain View, Garden, Beach, Sailing, Skiing; first-level labels as extracted: rock, forest, sky, elephant, zebra, horse, grass, flower, building, water, sand, field, sailing boat, cloth, snow, human]

Figure 2: The concept hierarchy for organizing outdoor photos in our experiments, where the first level represents the underlying atomic image concepts.

2. HIERARCHICAL IMAGE CONCEPT ORGANIZATION AND MODELING

The success of most existing techniques for semantic image classification is often limited and largely depends on the discrimination power of the low-level visual features for image content representation (i.e., attributes for image description) [10]. On the other hand, the discrimination power of the low-level visual features also depends on the capability of the underlying image patterns to capture sufficient semantics of image contents. Because the image blobs have the capability to characterize the relevant dominant image compounds, they are able to capture the middle-level image semantics, and thus extracting the visual features from the image blobs can enhance the discrimination power of visual features and may result in more effective image classification [10].

To enable more effective image classification, the concept hierarchy is used for semantic image concept organization and hierarchical classifier combination as shown in Fig. 1 and Fig. 2. The concept hierarchy defines the basic vocabulary of the semantic image concepts and their contextual and logical relationships. The lower the level of a semantic image concept node, the narrower its coverage of the subjects, and the semantic image concepts at a lower level of the concept hierarchy characterize more specific aspects of image semantics. Thus, the hypothesis variances for the semantic image concepts at the lower level of the concept hierarchy may be smaller, and these concepts can be interpreted effectively by using the low-level visual features. As a result, the concept nodes at the first level of the concept hierarchy are named atomic image concepts, as shown in Fig. 1. One example of the concept hierarchy, which is used in our experiments for a specific image domain of outdoor photos, is given in Fig. 2. As shown in Fig. 2, a given atomic image concept $C_j$ at the first level of the concept hierarchy can be interpreted accurately by using a finite mixture model (FMM) to approximate the class distribution of the relevant image blobs:

\[ P(X, C_j, \Theta_{c_j}) = \sum_{l=1}^{\kappa_j} P(X \mid C_j, S_l, \theta_{s_l})\, \omega_{s_l} \tag{1} \]

where $\sum_{l=1}^{\kappa_j} \omega_{s_l} = 1$, $\Theta_{c_j} = \{\kappa_j, \omega_{c_j}, \theta_{c_j}\}$ is the parameter set for model structure, weights and model parameters, $P(X \mid C_j, S_l, \theta_{s_l})$ is the $l$th mixture component to characterize the class distribution for the $l$th type of image blobs, $\kappa_j$ is the model structure (i.e., the optimal number of mixture components), $\omega_{c_j} = \{\omega_{s_1}, \cdots, \omega_{s_{\kappa_j}}\}$ is the weight set for the $\kappa_j$ mixture components, $\omega_{s_l}$ is the relative weight for the $l$th mixture component to characterize the relative importance of the $l$th type of image blobs for accurately interpreting the given atomic image concept $C_j$, $\theta_{c_j} = \{\theta_{s_1}, \cdots, \theta_{s_{\kappa_j}}\}$ is the set of model parameters, $\theta_{s_l}$ is the set of model parameters for the $l$th mixture component, and $X = (x_1, \cdots, x_n)$ is the $n$-dimensional vector of visual features that characterize the visual properties of the relevant image blobs.
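As a rough illustration of Eq. (1), the following sketch evaluates a finite mixture density at a single feature vector. It assumes Gaussian mixture components, and the names `gaussian_pdf` and `mixture_density` as well as the toy weights, means and covariances are hypothetical choices made only for this example.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian at a single point x."""
    n = mean.shape[0]
    diff = x - mean
    # Solve a linear system instead of inverting the covariance explicitly.
    maha = diff @ np.linalg.solve(cov, diff)
    log_norm = -0.5 * (n * np.log(2.0 * np.pi) + np.linalg.slogdet(cov)[1])
    return float(np.exp(log_norm - 0.5 * maha))

def mixture_density(x, weights, means, covs):
    """Eq. (1): sum over components of P(X | C_j, S_l, theta_sl) * omega_sl."""
    assert np.isclose(np.sum(weights), 1.0), "mixture weights must sum to 1"
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs))

# Toy concept model (assumed values): kappa_j = 2 components, n = 2 features.
weights = np.array([0.7, 0.3])
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
density_at_origin = mixture_density(np.zeros(2), weights, means, covs)
```

In this toy setting the first component dominates the density at the origin, which mirrors how the weight $\omega_{s_l}$ expresses the relative importance of one type of image blob.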

At the second level of the concept hierarchy, the class distribution for a given semantic image concept $C_i$ can be interpreted by using a hierarchical mixture model to approximate the underlying class distributions for the relevant sibling atomic image concepts:

\[ P(X, C_i, \Theta_{c_i}) = \sum_{j=1}^{\kappa_i} \omega_{c_j} \sum_{l=1}^{\kappa_j} P(X \mid C_j, S_l, \theta_{s_l})\, \omega_{s_l} \tag{2} \]

where $\sum_{j=1}^{\kappa_i} \omega_{c_j} = 1$, $\Theta_{c_i} = \{\kappa_i, \omega_{c_i}, \theta_{c_i}\}$ is the parameter set for model structure, weights and model parameters, and $\omega_{c_i}$ is the set of weights for the gate networks that are used to define the relative importance of the $\kappa_i$ sibling atomic image concepts for accurately interpreting their parent node $C_i$ [19]. Through a similar hierarchical approach, the class distribution for a given semantic image concept $C_k$ at the $n$th level of the concept hierarchy can be interpreted by using a hierarchical mixture model to approximate the class distributions of the $(n-1)$th sibling semantic image concepts:

\[ P(X, C_k, \Theta_{c_k}) = \sum_{m=1}^{\kappa_n} \omega_{c_m} \cdots \sum_{j=1}^{\kappa_i} \omega_{c_j} \sum_{l=1}^{\kappa_j} P(X \mid C_j, S_l, \theta_{s_l})\, \omega_{s_l} \tag{3} \]

where $\sum_{m=1}^{\kappa_n} \omega_{c_m} = 1$, $\Theta_{c_k} = \{\kappa_m, \omega_{c_m}, \theta_{c_m}\}$ is the parameter set for model structure, weights, and model parameters, and $\omega_{c_m}$ is the set of weights for the gate networks that are used to define the relative importance of the sibling semantic image concepts at the $(n-1)$th level of the concept hierarchy [19].

Our proposed framework for hierarchical classifier training has the following advantages: (a) By using the hierarchical mixture model for classifier combination, it enables more effective learning of the classifiers for the high-level semantic image concepts. Thus, the concept models for the high-level semantic image concepts can be adapted by the observations of the relevant base-level atomic image concepts. In addition, learning the high-level semantic image concepts hierarchically reduces the size of the covariance matrices being inverted. (b) It provides a natural approach for achieving discriminative learning of finite mixture models by jointly learning the concept models for the sibling semantic image concepts under the same parent node, and thus the positive images for a given semantic image concept can be treated as the negative images for its sibling semantic image concepts under the same parent node. By using the negative images to maximize the margins among the sibling semantic image concepts, our proposed hierarchical classifier training technique can achieve higher prediction accuracy. (c) By using the concept hierarchy for semantic image concept organization, our proposed framework facilitates more effective image database indexing, searching and navigation. To support concept-oriented image database indexing, the finite mixture models for semantic image concept interpretation can be used to enable density-based database node representation [33]. After all the images are classified into the semantic image concepts at different semantic levels, the underlying concept hierarchy can further be incorporated to construct the hierarchical image database indexing structure, where the concept nodes become the relevant database nodes at different semantic levels, upon which the root node of the image database can be constructed automatically.

3. HIERARCHICAL IMAGE CLASSIFIER TRAINING

To accurately learn the hierarchical mixture model for semantic image concept interpretation, the semantic labels for a set of training images are manually assigned for each atomic image concept. We use the one-against-all rule to organize the labeled images $\Omega_{c_j} = \{X_l, C_j(S_l) \mid l = 1, \cdots, N_L\}$ into positive images and negative images for a given atomic image concept $C_j$, where $X_l$ is the set of attributes (i.e., visual features) that are used to describe the training image $S_l$. The unlabeled images $\bar{\Omega}_{c_j} = \{X_n, S_n \mid n = 1, \cdots, N_u\}$ can be used to improve the density estimation by reducing the variance of the mixture density and discovering the unknown image context classes. For the given atomic image concept $C_j$, we then define the mixture training image set as $\Omega = \Omega_{c_j} \cup \bar{\Omega}_{c_j}$ [29-32]. The labeled images for the sibling atomic image concepts are further combined as the joint training images for their parent node in the concept hierarchy.

The visual features for image content representation include the 1-dimensional coverage ratio (i.e., density ratio) for a coarse shape representation, 6-dimensional locations of image blobs (i.e., 2 dimensions for the blob center and 4 dimensions to indicate the rectangular box for a coarse shape representation of the image blob), 7-dimensional LUV dominant colors and color variances, 7-dimensional Tamura texture, and 12-dimensional wavelet texture features.

The atomic image concepts at the first level of the concept hierarchy have low hypothesis variances, and thus they can be accurately interpreted by using the relevant image blobs. Based on this understanding, we propose a bottom-up approach for hierarchical image classifier training.
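The one-against-all organization of labeled blobs and the feature layout described in this paper can be sketched as follows. This is only an illustration under stated assumptions: the dictionary keys, the function name `one_against_all`, the random features and the concept labels are hypothetical, and only the per-group dimensions (1, 6, 7, 7, 12) come from the text.

```python
import numpy as np

# Feature layout from the text: 1 + 6 + 7 + 7 + 12 dimensions per image blob.
FEATURE_DIMS = {
    "coverage_ratio": 1,
    "location": 6,
    "luv_color": 7,
    "tamura_texture": 7,
    "wavelet_texture": 12,
}
TOTAL_DIM = sum(FEATURE_DIMS.values())

def one_against_all(features, labels, concept):
    """Split labeled samples into positives for `concept` and negatives (all others)."""
    mask = np.asarray(labels) == concept
    return features[mask], features[~mask]

# Toy data (assumed): six blobs with random features and hypothetical labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, TOTAL_DIM))
labels = ["sky", "sand", "water", "sky", "sand", "sky"]
positives, negatives = one_against_all(features, labels, "sky")
```

Under this rule the positives of one atomic image concept automatically serve as negatives for its siblings, which is what the joint training of sibling concepts relies on.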

3.1 Base-Level Image Classifier Training

To learn the model-based classifier for the given atomic image concept $C_j$, the maximum likelihood criterion can be used to determine the underlying model parameters. To avoid the overfitting problem [20], a penalty term is added to determine the underlying optimal model structure. The optimal parameters (i.e., model structure, weights, and model parameters) $\hat{\Theta}_{c_j} = (\hat{\kappa}_j, \hat{\omega}_{c_j}, \hat{\theta}_{c_j})$ for the given atomic image concept $C_j$ are then determined by:

\[ \hat{\Theta}_{c_j} = \arg\max_{\Theta_{c_j}} \left\{ L(C_j, \Theta_{c_j}) \right\} \tag{4} \]

where $L(C_j, \Theta_{c_j}) = -\sum_{X_i \in \Omega_{c_j}} \log P(X_i, C_j, \Theta_{c_j}) + \log p(\Theta_{c_j})$ is the objective function, $-\sum_{X_i \in \Omega_{c_j}} \log P(X_i, C_j, \Theta_{c_j})$ is the likelihood function, and

\[ \log p(\Theta_{c_j}) = -\frac{n + \kappa_j + 3}{2} \sum_{l=1}^{\kappa_j} \log \frac{N \omega_l}{12} - \frac{\kappa_j}{2} \log \frac{N}{12} - \frac{\kappa_j (N + 1)}{2} \]

is the minimum description length (MDL) term to penalize complex models with a large number of mixture components [20], $N$ is the total number of training samples, and $n$ is the dimension of the visual features $X$. The maximum likelihood estimation described in Eq. (4) can be achieved by using the EM algorithm with a pre-defined model structure $\kappa_j$ [17-18]. However, pre-defining the model structure $\kappa_j$ is not acceptable for semantic image classification. Thus, there is an urgent need to develop new techniques that are able to select the optimal model structure automatically. To automatically select the optimal model structure and estimate the accurate model parameters, we have proposed

an adaptive EM algorithm for classifier training. To incorporate the negative images for discriminative learning of finite mixture models, we have taken advantage of the concept hierarchy for hierarchical classifier training, where the finite mixture models for interpreting the sibling atomic image concepts under the same parent node are learned jointly. Thus, the positive images for the given atomic image concept $C_j$ can be used as the negative images for the sibling atomic image concepts under the same parent node. Learning these sibling atomic image concepts jointly achieves discriminative learning of finite mixture models by incorporating the negative images to maximize the margins among different concept models.

To achieve more accurate model selection and parameter estimation, our adaptive EM algorithm performs automatic merging, splitting, and elimination to re-organize the distribution of mixture components and modify the optimal number of mixture components according to the class distribution of the available training images [5]. To exploit the most suitable image context classes for accurately interpreting the atomic image concept $C_j$, our adaptive EM algorithm starts from a large value of $\kappa_j$ and takes the major steps shown in Algorithm 1. With the given $\kappa_j$, the k-means clustering technique is used to select reasonable and robust initial values for the model parameters (i.e., mean and covariance for each cluster). To determine the underlying optimal model structure, we use two criteria to perform automatic splitting, merging, and elimination of mixture components: (a) fitness between one specific mixture component and the distribution of the relevant training images; (b) overlapping between the mixture components from the same atomic image concept or the mixture components from the sibling atomic image concepts under the same parent node.

Algorithm 1: Adaptive EM Algorithm
Inputs: training images $\Omega_{c_j}$, $\kappa_j = \kappa_{max}$
Outputs: $\hat{\Theta}_{c_j}$
Initialization is done by k-means clustering;
for each $\kappa_j$ do
    $J_m(i, k, \theta_{s_{ik}}) = JS(C_j, \theta_{s_{ik}}) + \varphi\, JS(C_j, \theta_{s_i}, \theta_{s_k})$
    $J_s(i, m, \theta_{s_i}) = \frac{\varphi\, JS(C_j, C_h, \theta_{s_i}, \theta_{s_m})}{JS(C_j, \theta_{s_i})}$, $J_e(i, \theta_{s_i}) = \frac{\varphi}{JS(C_j, \theta_{s_i})}$
    $\sum_{i=1}^{\kappa_j} J_e(i, \theta_{s_i}) + \sum_{i=1}^{\kappa_j} \sum_{k=i+1}^{\kappa_j} J_m(i, k, \theta_{s_{ik}}) + \sum_{i=1}^{\kappa_j} \sum_{m=i+1}^{\kappa_h} J_s(i, m, \theta_{s_i}) = 1$
    for each image $\{X_l, S_l\} \in \Omega_{c_j}$ do
        E-step: $P(C_j \mid X_l, S_l, \Theta_{c_j}) = \frac{P(X_l \mid C_j, S_l, \theta_{s_l})\, \omega_{s_l}}{\sum_{l=1}^{\kappa_j} P(X_l \mid C_j, S_l, \theta_{s_l})\, \omega_{s_l}}$
        M-step: $\omega_i^{t+1} = \frac{1}{N_l} \sum_{l=1}^{N_l} P(C_j \mid X_l, S_l, \Theta_{c_j})$
                $\mu_{c_j}^{t+1} = \frac{\sum_{l=1}^{N_l} X_l\, P(C_j \mid X_l, S_l, \Theta_{c_j})}{\sum_{l=1}^{N_l} P(C_j \mid X_l, S_l, \Theta_{c_j})}$
                $\sigma_{c_j}^{t+1} = \frac{\sum_{l=1}^{N_l} (X_l - \mu_{c_j}^{t+1})(X_l - \mu_{c_j}^{t+1})^T P(C_j \mid X_l, S_l, \Theta_{c_j})}{\sum_{l=1}^{N_l} P(C_j \mid X_l, S_l, \Theta_{c_j})}$
    end for
end for

Our adaptive EM algorithm uses the symmetric Jensen-Shannon (JS) divergence (i.e., intra-concept JS divergence) $JS(C_j, \theta_{s_l}, \theta_{s_k})$ to measure the divergence between two mixture components $P(X \mid C_j, S_l, \theta_{s_l})$ and $P(X \mid C_j, S_k, \theta_{s_k})$ from the same concept model $C_j$:

\[ JS(C_j, \theta_{s_l}, \theta_{s_k}) = H\big(\pi_1 P(X \mid C_j, S_l, \theta_{s_l}) + \pi_2 P(X \mid C_j, S_k, \theta_{s_k})\big) - \pi_1 H\big(P(X \mid C_j, S_l, \theta_{s_l})\big) - \pi_2 H\big(P(X \mid C_j, S_k, \theta_{s_k})\big) \tag{5} \]

where $H(P(\cdot)) = -\sum P(\cdot) \log P(\cdot)$ is the well-known Shannon entropy, and $\pi_1$ and $\pi_2$ are the weights. In our experiments, we set $\pi_1 = \pi_2 = \frac{1}{2}$. If the intra-concept JS divergence $JS(C_j, \theta_{s_l}, \theta_{s_k})$ is small, these two mixture components are strongly overlapped and may overpopulate the relevant training images; thus they are merged into a single mixture component $P(X \mid C_j, S_{lk}, \theta_{s_{lk}})$. In addition, the local JS divergence $JS(C_j, \theta_{s_{lk}})$ is used to measure the divergence between the merged mixture component $P(X \mid C_j, S_{lk}, \theta_{s_{lk}})$ and the local density of the training images $P(X, C_j, \theta)$. The local density $P(X, C_j, \theta)$ is modified as the empirical distribution weighted by the posterior probability [5]. Our adaptive EM algorithm tests $\frac{\kappa_j(\kappa_j - 1)}{2}$ pairs of mixture components that could be merged, and the pair with the minimum value of the local JS divergence is selected as the best candidate for merging.

Two types of mixture components may be split: (a) the elongated mixture components which underpopulate the relevant training images (i.e., characterized by the local JS divergence); (b) the tailed mixture components which overlap with the mixture components from the concept models for the sibling atomic image concepts (i.e., characterized by the inter-concept JS divergence). To select the mixture component for splitting, two criteria are combined: (1) the local JS divergence $JS(C_j, S_i, \theta_{s_i})$ to characterize the divergence between the $i$th mixture component $P(X \mid C_j, S_i, \theta_{s_i})$ and the local density of the training images $P(X, C_j, \theta)$; (2) the inter-concept JS divergence $JS(C_j, C_h, \theta_{s_i}, \theta_{s_m})$ to characterize the overlapping between the mixture components $P(X \mid C_j, S_i, \theta_{s_i})$ and $P(X \mid C_h, S_m, \theta_{s_m})$ from the sibling atomic image concepts $C_j$ and $C_h$. If one specific mixture component is supported by only a few training images, it may be removed from the concept model for $C_j$.
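The JS divergence of Eq. (5) can be illustrated on discretized distributions. The sketch below is a simplification under stated assumptions: each mixture component is represented by a hypothetical histogram over four bins, and the pair scan ranks pairs by the intra-concept JS divergence only, whereas the paper selects the final merge candidate with the local JS divergence of the merged component.

```python
from itertools import combinations
import numpy as np

def shannon_entropy(p):
    """H(P) = -sum P log P, skipping zero-probability bins."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return float(-np.sum(p[nonzero] * np.log(p[nonzero])))

def js_divergence(p, q, pi1=0.5, pi2=0.5):
    """Eq. (5) for two discrete distributions with weights pi1 and pi2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mixture = pi1 * p + pi2 * q
    return shannon_entropy(mixture) - pi1 * shannon_entropy(p) - pi2 * shannon_entropy(q)

def most_overlapping_pair(components):
    """Scan all kappa*(kappa-1)/2 pairs; return the pair with the smallest JS divergence."""
    pairs = list(combinations(range(len(components)), 2))
    scores = [js_divergence(components[i], components[k]) for i, k in pairs]
    return pairs[int(np.argmin(scores))], len(pairs)

# Hypothetical discretized components (each row sums to 1).
components = [
    np.array([0.70, 0.20, 0.10, 0.00]),
    np.array([0.65, 0.25, 0.10, 0.00]),
    np.array([0.00, 0.10, 0.20, 0.70]),
    np.array([0.25, 0.25, 0.25, 0.25]),
]
pair, n_pairs = most_overlapping_pair(components)
```

With equal weights the divergence is zero for identical distributions and reaches log 2 for distributions with disjoint support, so a small value signals strongly overlapping components.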
To determine the unrepresentative mixture component for elimination, our adaptive EM algorithm uses the local JS divergence JS(Cj , θsi ) to characterize the representation of the mixture component P (X|Cj , Si , θsi ) for the relevant training images. The mixture component with the maximum value of the local JS divergence is selected as the candidate for elimination. To jointly optimize these three operations of merging, splitting and elimination, their probabilities are defined as:

\[ \begin{cases} J_m(i, k, \theta_{s_{ik}}) = JS(C_j, \theta_{s_{ik}}) + \varphi\, JS(C_j, \theta_{s_i}, \theta_{s_k}) \\ J_s(i, m, \theta_{s_i}) = \dfrac{\varphi\, JS(C_j, C_h, \theta_{s_i}, \theta_{s_m})}{JS(C_j, \theta_{s_i})} \\ J_e(i, \theta_{s_i}) = \dfrac{\varphi}{JS(C_j, \theta_{s_i})} \end{cases} \tag{6} \]

where $\varphi$ is a normalization factor determined by:

\[ \sum_{i=1}^{\kappa_j} J_e(i, \theta_{s_i}) + \sum_{i=1}^{\kappa_j} \sum_{k=i+1}^{\kappa_j} J_m(i, k, \theta_{s_{ik}}) + \sum_{i=1}^{\kappa_j} \sum_{m=i+1}^{\kappa_h} J_s(i, m, \theta_{s_i}) = 1 \tag{7} \]

The acceptance probability to prevent poor operation of merging, splitting or elimination is defined by:

\[ P_{accept} = \min \left\{ \exp\left( -\frac{|L(C_j, \Theta_1) - L(C_j, \Theta_2)|}{\tau} \right),\ 1 \right\} \tag{8} \]

where $L(C_j, \Theta_1)$ and $L(C_j, \Theta_2)$ are the objective functions for the models $\Theta_1$ and $\Theta_2$ (i.e., before and after performing the merging, splitting or elimination operation) as described in Eq. (4), and $\tau$ is a constant that is determined experimentally. In our current experiments, $\tau$ is set as $\tau = 9.8$. By optimizing these three operations jointly, our adaptive EM algorithm is able to automatically select the optimal model structure to capture the essential structure of the image context classes. In addition, our adaptive EM algorithm is able to escape local extrema by re-organizing the distribution of mixture components and modifying the optimal number of mixture components according to the class distributions of the training images. By integrating the negative samples to maximize the margin among the concept models for the sibling atomic image concepts, our adaptive EM algorithm also enables discriminative learning of finite mixture models and results in high prediction accuracy.

3.2 Training with Unlabeled Images

When only a limited number of labeled images are available for classifier training, it is difficult to select the optimal model structure and estimate the accurate model parameters. In addition, incorporating the outlying unlabeled images for classifier training may lead to worse performance rather than improvement. Thus, it is very important to develop new techniques that are able to eliminate the misleading effects of the outlying unlabeled images. The weak classifier for the given atomic image concept $C_j$ is first learned from a limited number of available training images, and the Bayesian framework is used to achieve "soft" classification of the unlabeled images. For the given atomic image concept $C_j$, the confidence score for an unlabeled image is defined as:

\[ \psi(X_l, C_j, t) = \sqrt{\psi_\alpha(X_l, C_j, t)\, \psi_\beta(X_l, C_j, t)} \tag{9} \]

where $\psi_\alpha(X_l, C_j, t)$ is the posterior probability for the unlabeled image $\{X_l, S_l\}$ with the given atomic image concept $C_j$, and $\psi_\beta(X_l, C_j, t) = -\log P(X_l, C_j, \Theta_{c_j})$ is the log-likelihood value of the unlabeled image $\{X_l, S_l\}$ with the given atomic image concept $C_j$. For the unlabeled image $\{X_l, S_l\}$, its confidence score $\psi(X_l, C_j, t)$ can be used as the criterion to indicate its possibility of being taken as an outlier of $C_j$. For the given atomic image concept $C_j$, the unlabeled images are first categorized into two classes according to their confidence scores: (a) certain unlabeled images with high confidence scores may originate from the known image context classes (i.e., mixture components) that have already been used for interpreting the given atomic image concept $C_j$; (b) uncertain unlabeled images with low confidence scores may originate from outliers and unknown image context classes that have not occurred in a limited number of available labeled images.

The certain unlabeled images are first incorporated to improve the estimation of the mixture density (i.e., regular updating of model parameters without changing the model structure) incrementally by reducing the variance of the mixture density. By integrating the certain unlabeled images to update the statistical model, the confidence scores for some uncertain unlabeled images may change over time if they originate from the unknown image context classes that cannot be directly learned from the available labeled images. Thus, the changing scale of the confidence score for the unlabeled image $(X_l, S_l)$ is defined as:

\[ y_l = |\psi(X_l, C_j, t+1) - \psi(X_l, C_j, t)| \]

where $y_l \geq 0$, and $\psi(X_l, C_j, t)$ and $\psi(X_l, C_j, t+1)$ indicate its confidence scores with the atomic image concept $C_j$ before and after model updating. The informative unlabeled images with a large value of $y_l$ may originate from the unknown image context classes, and thus they should be incorporated for classifier training. On the other hand, the outlying unlabeled images with a $y_l$ value close to zero may originate from outliers and should be removed from the training set. In order to incorporate the informative unlabeled images for discovering the unknown image context classes, one or more new mixture components are added to the residing areas for the informative unlabeled images with a large value of $y_l$ (i.e., birth):

\[ P(X, C_j, \Theta_{c_j}) = \omega_{s_{\kappa_j+1}} P(X \mid C_j, S_{\kappa_j+1}, \theta_{s_{\kappa_j+1}}) + (1 - \omega_{s_{\kappa_j+1}}) \sum_{l=1}^{\kappa_j} P(X \mid C_j, S_l, \theta_{s_l})\, \omega_{s_l} \tag{10} \]

where $\omega_{s_{\kappa_j+1}}$ is the weight for the $(\kappa_j+1)$th mixture component $P(X \mid C_j, S_{\kappa_j+1}, \theta_{s_{\kappa_j+1}})$ to characterize the appearance of an unknown image context class. To eliminate the misleading effects of the outlying unlabeled images, a penalty term $\gamma_l$ is defined as:

\[ \gamma_l = \begin{cases} 1, & \text{certain unlabeled images} \\ \dfrac{e^{y_l} - e^{-y_l}}{e^{y_l} + e^{-y_l}}, & \text{uncertain unlabeled images} \end{cases} \tag{11} \]

where $0 \leq \gamma_l \leq 1$, and $\gamma_l = 0$ if $y_l = 0$. The penalty term $\gamma_l$ is able to select only the certain unlabeled images and informative unlabeled images for semi-supervised classifier training; thus the joint likelihood function is defined as:

\[ -\sum_{X_l \in \Omega_{c_j}} \log P(X_l, C_j, \Theta_{c_j}) - \lambda \sum_{X_n \in \bar{\Omega}_{c_j}} \gamma_n \log \sum_{m=1}^{\kappa_j} P(X_n \mid C_j, \theta_{s_m})\, \omega_{s_m} \]

where the discount factor $\lambda = \frac{N_u}{N}$ is used to control the relative contribution of the unlabeled images for semi-supervised classifier training, and $N = N_u + N_L$ is the total number of training images (i.e., $N_u$ unlabeled images and $N_L$ labeled images). Using the joint likelihood function to replace the likelihood function in Eq. (4), our adaptive EM algorithm is performed on the mixture training image set, both originally and probabilistically labeled, to learn the image classifiers accurately.
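The changing scale $y_l$, the penalty term of Eq. (11) and the joint likelihood can be sketched numerically. This is a minimal illustration under stated assumptions: the function names, the toy confidence scores and the log-probabilities are hypothetical, and the discount factor is passed in as a plain argument `lam` rather than derived from the sample counts. The penalty for uncertain images is the hyperbolic tangent of $y_l$, which is how the ratio of exponentials in Eq. (11) is computed here.

```python
import numpy as np

def changing_scale(psi_before, psi_after):
    """y_l = |psi(t+1) - psi(t)| for each unlabeled image."""
    before = np.asarray(psi_before, dtype=float)
    after = np.asarray(psi_after, dtype=float)
    return np.abs(after - before)

def penalty(y, is_certain):
    """Eq. (11): gamma_l = 1 for certain images, tanh(y_l) for uncertain ones."""
    y = np.asarray(y, dtype=float)
    return np.where(np.asarray(is_certain, dtype=bool), 1.0, np.tanh(y))

def joint_objective(log_p_labeled, log_p_unlabeled, gamma, lam):
    """Negative labeled log-likelihood minus the discounted, penalized unlabeled term."""
    labeled_term = -np.sum(log_p_labeled)
    unlabeled_term = -lam * np.sum(np.asarray(gamma) * np.asarray(log_p_unlabeled))
    return float(labeled_term + unlabeled_term)

# Toy confidence scores (assumed) before and after one model update.
psi_before = np.array([0.90, 0.20, 0.30])
psi_after = np.array([0.92, 0.20, 0.80])
is_certain = np.array([True, False, False])
y = changing_scale(psi_before, psi_after)
gamma = penalty(y, is_certain)
```

In this toy run the second image has an unchanged confidence score, so its penalty is zero and it contributes nothing to the joint likelihood, while the third image changes substantially and is kept as informative.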

3.3 Higher-Level Image Classifier Training

To learn the model-based classifier for the given second-level semantic image concept $C_i$, we still use the maximum likelihood criterion to determine the underlying model structure and model parameters. Thus, the optimal parameter set (i.e., model structure, weights, and model parameters) $\hat{\Theta}_{c_i} = (\hat{\kappa}_i, \hat{\omega}_{c_i}, \hat{\theta}_{c_i})$ is then determined by:

\[ \hat{\Theta}_{c_i} = \arg\max_{\Theta_{c_i}} \left\{ L(C_i, \Theta_{c_i}) \right\} \tag{12} \]


Figure 3: Our experimental results for statistical image modeling and semantic image concept modeling: (a) original images with “ocean view”; (b) individual statistical models for image blobs; (c) statistical model for semantic image concept “ocean view”.

Figure 4: Our experimental results for statistical image modeling and atomic image concept modeling: (a) original images with “sunset”; (b) individual statistical models for image blobs; (c) statistical model for atomic image concept “sunset”.

where the objective function L(Ci , Θci ) for the second-level semantic image concept Ci is defined as:

\[
\begin{aligned}
L(C_i, \Theta_{c_i}) &= -\sum_{n=1}^{N} \log\left( \prod_{j=1}^{\kappa_i} \omega_{c_j} \sum_{l=1}^{\kappa_j} P(X_n, C_j \mid S_l, \theta_{s_l})\,\omega_{s_l} \right) + \log(\Theta_{c_j}) \\
&\approx \sum_{j=1}^{\kappa_i} \log \omega_{c_j}\, L(C_j, \Theta_{c_j}) = \sum_{j=1}^{\kappa_i} \omega_{c_j}\, L(C_j, \Theta_{c_j})
\end{aligned}
\tag{13}
\]

where L(Cj , Θcj ) is the objective function for Ci's children node Cj that has been obtained by using Eq. (4), N is the total number of the joint training images from all the sibling children nodes (i.e., sibling atomic image concepts under the same parent node Ci ), and ωcj is the weight parameter that defines the relative importance of the atomic image concept Cj for accurately interpreting the second-level semantic image concept Ci . Based on this understanding, we use the posterior probability to infer the logarithmic weight parameter ωcj . Because the model structures and the model parameters for the sibling atomic image concepts have already been obtained, a simple but effective solution is developed to determine the finite mixture model for accurately interpreting their parent node at the second level of the concept hierarchy. Our classifier combination framework takes the following steps:
(a) The mixture components from the κi sibling atomic image concepts are combined to achieve a better approximation of the underlying class distribution of their parent node at the second level of the concept hierarchy (i.e., the second-level semantic image concept Ci ). In addition, the training images for these sibling atomic image concepts are also combined as the joint training samples for their parent node Ci .
(b) Based on the available finite mixture models for interpreting the κi sibling atomic image concepts, our adaptive EM algorithm is used to select the optimal model structure and estimate the model parameters for the given second-level semantic image concept Ci by performing automatic merging, splitting, and elimination of mixture components.
(c) The mixture components with less prediction power on the joint training images are eliminated. The overlapped mixture components from different sibling atomic image concepts are merged into a single mixture component. The elongated mixture components that underpopulate the joint training images are split into multiple representative mixture components.

If one mixture component, P (X|Cj , Sm , θsm ), is eliminated, the concept model for accurately interpreting the given second-level semantic image concept Ci is then refined as:

\[
P(X, C_i, \Theta_{c_i}) = \frac{1}{1 - \omega_m} \sum_{l=1}^{\kappa-1} P(X \mid C_j, S_l, \theta_{s_l})\,\omega_{s_l}, \qquad m \neq l
\tag{14}
\]

where $\kappa = \sum_{j=1}^{\kappa_i} \kappa_j$ is the total number of the mixture components for Ci , κi is the number of the relevant sibling atomic image concepts, and κj is the optimal model structure for its children node Cj . If two mixture components P (X|Cj , Sm , θsm ) and P (X|Ch , Sl , θsl ) from two sibling atomic image concepts Cj and Ch are merged as a single mixture component P (X|Cj , Sml , θsml ), the concept model for accurately interpreting the second-level semantic image concept Ci is refined as:

\[
P(X, C_i, \Theta_{c_i}) = \sum_{h=1}^{\kappa-2} P(X \mid C_j, S_h, \theta_{s_h})\,\omega_{s_h} + P(X \mid C_j, S_{ml}, \theta_{s_{ml}})\,\omega_{s_{ml}}
\tag{15}
\]

where ωsml is the weight parameter for the merged mixture component P (X|Cj , Sml , θsml ). If one mixture component, P (X|Cj , Sh , θsh ), is split into two new mixture components, P (X|Cj , Sr , θsr ) and P (X|Cj , St , θst ), the concept model for accurately interpreting the second-level semantic image concept Ci is refined as:

\[
P(X, C_i, \Theta_{c_i}) = \sum_{h=1}^{\kappa-1} P(X \mid C_j, S_h, \theta_{s_h})\,\omega_{s_h} + P(X \mid C_j, S_r, \theta_{s_r})\,\omega_{s_r} + P(X \mid C_j, S_t, \theta_{s_t})\,\omega_{s_t}
\tag{16}
\]
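The three structural operations behind Eqs. (14) to (16) can be illustrated on a toy one-dimensional Gaussian mixture. The following is a minimal sketch, not the authors' implementation: the tuple layout `(weight, mean, variance)`, the moment-matching rule used in `merge`, and the half-weight rule with a hypothetical `offset` parameter used in `split` are all assumptions made for illustration. Only the 1/(1 − ωm) rescaling in `eliminate` is taken directly from Eq. (14).

```python
import math

def eliminate(components, m):
    """Drop component m and rescale the remaining weights by 1 / (1 - w_m), as in Eq. (14)."""
    w_m = components[m][0]
    return [(w / (1.0 - w_m), mu, var)
            for i, (w, mu, var) in enumerate(components) if i != m]

def merge(components, a, b):
    """Replace components a and b by one component (assumed rule: moment matching)."""
    (wa, ma, va), (wb, mb, vb) = components[a], components[b]
    w = wa + wb
    mu = (wa * ma + wb * mb) / w
    var = (wa * (va + (ma - mu) ** 2) + wb * (vb + (mb - mu) ** 2)) / w
    rest = [c for i, c in enumerate(components) if i not in (a, b)]
    return rest + [(w, mu, var)]

def split(components, h, offset=0.5):
    """Replace component h by two half-weight components (assumed rule: shifted means)."""
    w, mu, var = components[h]
    shift = offset * math.sqrt(var)
    rest = [c for i, c in enumerate(components) if i != h]
    return rest + [(w / 2, mu - shift, var), (w / 2, mu + shift, var)]

# Toy mixture: two overlapping components and one distant component.
mixture = [(0.5, 0.0, 1.0), (0.3, 0.2, 1.5), (0.2, 5.0, 0.5)]

after_elimination = eliminate(mixture, 2)   # kappa - 1 components remain
after_merge = merge(mixture, 0, 1)          # kappa - 1 components remain
after_split = split(mixture, 2)             # kappa + 1 components result
```

In every case the weights still sum to one, which is the property that the rescaling factor in Eq. (14) preserves after a component is removed.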

Figure 5: The classification results for the second-level semantic image concept “garden” and the relevant first-level atomic image concepts.

Figure 6: The classification results for the second-level semantic image concepts “ocean view”, “prairie” and the relevant first-level atomic image concepts.

After the finite mixture models for the sibling second-level semantic image concepts are available, they are then integrated to obtain the finite mixture model for their parent node at the third level of the concept hierarchy. Through a hierarchical approach, the hierarchical mixture models for accurately interpreting the higher-level semantic image concepts can be obtained effectively.
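The bottom-up combination described above can be sketched as pooling the children's mixture components, with each component weight scaled by its concept weight ωcj, and as a weighted sum of the children's objective values as in the last form of Eq. (13). This is an illustrative sketch under simplifying assumptions: features are one-dimensional, components are Gaussian, and the function names, concept weights, and objective values below are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, components):
    """Density of a finite mixture given as (weight, mean, variance) tuples."""
    return sum(w * gaussian_pdf(x, mu, var) for w, mu, var in components)

def combine_children(children):
    """Pool the children's components, scaling each weight by its concept weight."""
    return [(w_c * w, mu, var) for w_c, comps in children for w, mu, var in comps]

def combined_objective(children_objectives):
    """Weighted sum of the children's objective values (last form of Eq. (13))."""
    return sum(w_c * value for w_c, value in children_objectives)

# Two hypothetical sibling atomic concepts: (concept weight, mixture components).
child_a = [(1.0, 0.0, 1.0)]
child_b = [(0.5, 3.0, 1.0), (0.5, 6.0, 2.0)]
children = [(0.6, child_a), (0.4, child_b)]

parent = combine_children(children)
parent_objective = combined_objective([(0.6, 10.0), (0.4, 20.0)])
```

The pooled mixture is only a starting point; in the framework described here it would then be refined by the merging, splitting, and elimination steps before being used for the parent concept.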

4. PARTIAL MATCHING FOR SEMANTIC IMAGE CLASSIFICATION

Given a test image Ii , its image blobs Ii = {S1 , · · · , Sl , · · · , Sn } and the relevant visual features X = {X1 , · · · , Xl , · · · , Xn } are detected automatically by using statistical image modeling, where the class distribution for each dominant image compound (i.e., image blob) is characterized by one or multiple mixture components. Some experimental results on statistical image modeling are given in Fig. 3 and Fig. 4. The lth image blob Sl in the test image Ii is first classified into the most relevant atomic image concept Cj with the maximum value of posterior probability:



\[
P(C_j \mid X_l, S_l, \Theta_{c_j}) = \max \left\{ \frac{P(X \mid C_h, S_l, \theta_{s_l})\,\omega_{s_l}}{P(X, C_h, \Theta_{c_h})} \;\middle|\; h = 1, \cdots, N_e \right\}
\tag{17}
\]

After the image blobs in the test image Ii are classified into the most relevant atomic image concepts, the statistical models for interpreting these atomic image concepts are integrated to characterize the semantics of the test image Ii . By combining different numbers of these atomic image concepts and their statistical models to approximate the real class distribution of the test image Ii , our framework achieves a multi-level representation of image semantics with different levels of detail. Because semantically similar images may consist of different numbers and types of atomic image concepts, our framework for multi-level image representation provides a natural way to perform partial image matching. Thus, semantic image classification is finally treated as a model matching problem, i.e., matching the statistical model for interpreting the semantics of the test image Ii with the finite mixture models for interpreting the semantic image concepts of particular interest. To achieve partial image matching effectively, the Bayesian approach is used to calculate the posterior probability between the statistical model for interpreting the semantics of the test image Ii and the finite mixture models for interpreting the semantic image concepts of particular interest. The test image Ii is then classified into the most relevant semantic image concept Ch with the maximum value of posterior probability.

Our current experiments focus on mining the image semantics on the first level and the second level of the concept hierarchy, such as “sunset”, “ocean view”, “beach”, “garden”, “mountain view”, “flower view”, “water way”, “sailing”, and “skiing”, which are widely distributed in a specific image domain of outdoor photos. Some semantic image classification results are given in Fig. 5, Fig. 6, Fig. 7 and Fig. 8, where the unknown atomic image concepts are shown in white in the images.

It is important to note that the text keywords for semantic image concept interpretation can be used to support multi-level image annotation effectively. The text keywords for interpreting the first-level atomic image concepts (i.e., dominant image compounds) provide the annotations of the images at the content level. The text keywords for interpreting the relevant high-level semantic image concepts provide the annotations of the images at the concept level.
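The blob-level step of Eq. (17), assigning each image blob to the atomic image concept with the largest posterior, can be sketched as follows. This is a simplified illustration rather than the paper's exact formula: it uses the standard Bayes normalization over all candidate concepts, one-dimensional Gaussian components, and a hypothetical `min_posterior` threshold below which a blob is labeled "unknown" (the paper shows unknown concepts in white but does not state how they are detected). The concept names and parameters are made up for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def concept_likelihood(x, components):
    """Mixture likelihood of feature x under one concept model."""
    return sum(w * gaussian_pdf(x, mu, var) for w, mu, var in components)

def classify_blob(x, concept_models, priors, min_posterior=0.5):
    """Return (label, posterior) for the concept with the maximum posterior."""
    scores = {name: priors[name] * concept_likelihood(x, comps)
              for name, comps in concept_models.items()}
    total = sum(scores.values())
    if total == 0.0:
        return "unknown", 0.0
    best = max(scores, key=scores.get)
    posterior = scores[best] / total
    if posterior < min_posterior:
        return "unknown", posterior
    return best, posterior

# Hypothetical concept models over a scalar feature.
models = {
    "sky": [(1.0, 0.0, 1.0)],
    "grass": [(0.5, 5.0, 1.0), (0.5, 7.0, 1.0)],
}
priors = {"sky": 0.5, "grass": 0.5}

label_near_sky, _ = classify_blob(0.2, models, priors)
label_near_grass, _ = classify_blob(6.0, models, priors)
label_ambiguous, _ = classify_blob(2.5, models, priors, min_posterior=0.9)
```

The labels returned for the blobs of one image could then serve as its content-level keywords, with the concept-level keywords coming from the matched higher-level concept.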

5. PERFORMANCE EVALUATION

Our experiments are conducted on two image databases: the image database from the Google image search engine and the Corel image database. The image database from the Google image search engine consists of 9,000 pictures. The Corel image database includes more than 3,800 pictures with different image concepts. Our algorithm evaluation focuses on: (a) evaluating the performance of our adaptive EM algorithm with different combinations of merging, splitting and elimination; (b) evaluating the performance of our classifier training technique by using different sizes of unlabeled images; (c) comparing the performance of our hierarchical classifier with that of flat classifiers under the same classification objective (i.e., classifying the images into a set of semantic image concepts with or without using the concept hierarchy for classifier training).


Table 1: The average performance of our classifiers for some atomic image concepts at the first level of the concept hierarchy (precision ρ versus recall ϱ).

concept         | precision ρ | recall ϱ
brown horse     | 95.6%       | 100%
grass           | 92.9%       | 94.8%
purple flower   | 96.1%       | 95.2%
red flower      | 87.8%       | 86.4%
rock            | 98.7%       | 100%
sand field      | 98.8%       | 96.6%
water           | 86.7%       | 89.5%
human skin      | 86.2%       | 85.4%
sky             | 87.6%       | 94.5%
snow            | 86.7%       | 87.5%
sunset/sunrise  | 92.5%       | 95.2%
waterfall       | 88.5%       | 87.1%
yellow flower   | 87.4%       | 89.3%
forest          | 85.4%       | 84.8%
sail cloth      | 96.3%       | 94.9%
elephant        | 85.3%       | 88.7%
cat             | 90.5%       | 87.5%
zebra           | 87.2%       | 85.4%
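Per-concept precision and recall values such as those in Table 1 are ratios of simple counts. The sketch below assumes the usual definitions: precision is the share of images assigned to a concept that truly belong to it, and recall is the share of images belonging to the concept that were assigned to it. The function names and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
def count_outcomes(predicted, actual, concept):
    """Count true positives, false positives and false negatives for one concept."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == concept and a == concept)
    fp = sum(1 for p, a in pairs if p == concept and a != concept)
    fn = sum(1 for p, a in pairs if p != concept and a == concept)
    return tp, fp, fn

def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy ground truth and classifier output for five images.
actual = ["sky", "sky", "grass", "sky", "grass"]
predicted = ["sky", "grass", "grass", "sky", "sky"]

tp, fp, fn = count_outcomes(predicted, actual, "sky")
precision, recall = precision_recall(tp, fp, fn)
print(f"sky: precision={precision:.3f}, recall={recall:.3f}")
```

Averaging these two numbers over repeated test runs would give per-concept figures of the kind tabulated above.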

The benchmark metric for algorithm evaluation includes classification precision ρ and classification recall ϱ. They are defined as:

\[
\rho = \frac{\vartheta}{\vartheta + \gamma}, \qquad \varrho = \frac{\vartheta}{\vartheta + \nu}
\tag{18}
\]

where ϑ is the number of true positive images that are related to the given semantic image concept and are classified correctly, γ is the number of false positive images that are irrelevant to the given semantic image concept but are classified into it, and ν is the number of false negative images that are related to the given semantic image concept but are misclassified. The average performance (precision vs. recall) of our classifiers for some atomic image concepts is given in Table 1. The atomic image concepts at the first level of the concept hierarchy can be directly interpreted by using the relevant image blobs. The average performance of our classifiers for some second-level semantic image concepts is given in Table 2. The semantic image concepts at the second level of the concept hierarchy can be interpreted by using the relevant atomic image concepts at the first level of the concept hierarchy. For each atomic image concept, we currently use 100 labeled images for weak classifier training.

Table 2: The average performance of our classifiers for some semantic image concepts at the second level of the concept hierarchy.

concept        | precision ρ | recall ϱ
mountain view  | 71.7%       | 74.3%
beach          | 70.5%       | 74.7%
garden         | 70.6%       | 80.6%
water way      | 72.5%       | 75.5%
ocean view     | 74.3%       | 75.1%
prairie        | 73.2%       | 76.3%
sailing        | 77.6%       | 75.5%
skiing         | 75.4%       | 73.7%
desert         | 79.6%       | 72.8%
flower view    | 72.5%       | 77.1%
wedding        | 69.2%       | 68.7%
city           | 74.2%       | 76.1%

Figure 7: The classification results for the second-level semantic image concepts “waterway”, “mountain view” and the relevant atomic image concepts.

Figure 8: The classification results for the second-level semantic image concepts “flower view”, “work ship” and the relevant atomic image concepts.

In our adaptive EM algorithm, multiple operations, such as merging, splitting, and elimination, have been integrated to re-organize the distributions of mixture components, select the optimal number of mixture components, and construct more flexible decision boundaries among sibling semantic image concepts according to the real class distributions of the training images. Thus, our adaptive EM algorithm is expected to have better performance than the traditional EM algorithm and its recent variants. In order to evaluate the real benefits of the integration of these three operations (i.e., merging, splitting, and elimination), we have tested the performance differences of our adaptive EM algorithm with different combinations of these three operations. As shown in Fig. 9 and Fig. 10, we have tested the performance of the classifiers under different combinations of the three operations: SM + Neg represents combining the three operations of merging, splitting and elimination of mixture components, SM indicates combining the two operations of merging and splitting, Split is for the splitting operation only, Merge is for the merging operation only, and Borman is for the EM algorithm developed by Bouman et al. [27]. From Fig. 9 and Fig. 10, one can find that integrating negative images for discriminative learning of finite mixture models (i.e., splitting by using negative images) can improve the classifiers' performance significantly.

Figure 9: The relationship between the classifier performance (i.e., precision 100ρ) and our multi-class EM algorithm with different operations of merging, splitting and elimination for the semantic image concept “ocean view”.

Figure 10: The relationship between the classifier performance (i.e., precision 100ρ) and our multi-class EM algorithm with different operations of merging, splitting and elimination for the atomic image concept “rock”.

For the same purpose of classifying the images into a set of pre-defined semantic image concepts, we have also compared the performance differences between our hierarchical classifier and the flat classifiers (i.e., training the classifier for each semantic image concept independently). The test results are given in Fig. 11. Above the yellow line, our hierarchical classifier has better performance than the flat classifiers. Below the yellow line, our hierarchical classifier has worse performance than the flat classifiers. By using the hierarchical mixture model for concept interpretation and classifier training, our hierarchical classifier has improved the performance significantly, and the high-level semantic image concepts can be determined by detecting the presence of the relevant atomic image concepts. For some high-level semantic image concepts, our hierarchical image classification technique may have worse performance than the corresponding flat classifiers (under the yellow line as shown in Fig. 11). The reason is that the classification errors for some low-level semantic image concepts may be transmitted to the relevant high-level semantic image concepts. Shrinkage may be used to address this problem.

Figure 11: The classification error rate (percentage) 100(1 − ρ) for our hierarchical image classifier and the flat classifiers for the same classification purpose.

Given a limited number of labeled images, we have tested the performance of our classifiers by using different sizes of unlabeled images for classifier training (i.e., with different size ratios λ = Nu /NL between the unlabeled images Nu and the labeled images NL ). The average performance differences for some semantic image concepts are given in Fig. 12. One can find that the unlabeled images can improve the classifier's performance significantly when only a limited number of labeled images are available for classifier training. The reasons are:
(a) The certain unlabeled images, which originate from the existing image context classes for concept interpretation, are able to improve the density estimation by reducing the variances of the mixture density.
(b) The informative unlabeled images, which originate from the unknown image context classes, have the capability to provide additional image context knowledge to learn the concept models more accurately. By modifying the concept models to be more representative of the data resources, the concept models that are learned incrementally are able to yield classifiers with higher prediction accuracy.
(c) The outlying unlabeled images, which originate from outliers, can be predicted and their misleading effects on classifier training can be eliminated automatically by using a novel penalization framework.

Figure 12: The classifier performance (i.e., precision ρ) with different ratios λ = Nu /NL between the unlabeled images Nu and the labeled images NL : (a) waterfall; (b) rockform; (c) sunset; (d) sky; (e) flowerview; (f) elephant.

6. CONCLUSIONS AND FUTURE WORKS

To support semantic image retrieval via keywords, we have proposed a novel framework for hierarchical image classification. By integrating the concept hierarchy and negative images for discriminative classifier training, our proposed framework has achieved very convincing results in a specific image domain of outdoor photos. In addition, an adaptive EM algorithm is proposed to select the optimal model structure and estimate the accurate model parameters, and thus the misleading effects of the outlying unlabeled images can be eliminated effectively. Our proposed classifier training technique may also be attractive for other data domains. Our future work will focus on addressing the error transmission problem for hierarchical image classifier training.

7. REFERENCES

[1] O.R. Zaiane, J. Han, Z.-N. Li, S.H. Chee, J.Y. Chiang, “MultimediaMiner: A system prototype for multimedia data mining”, ACM SIGMOD, 1998.
[2] S.J. Simoff, C. Djeraba, O.R. Zaiane, “MDM/KDD 2002: multimedia data mining between promises and problems”, SIGKDD Explorations, vol.4, no.2, 2002.
[3] C. Djeraba, “When image indexing meets knowledge discovery”, ACM MDM/KDD, pp.73-81, 2000.
[4] C. Breen, L. Khan, A. Ponnusany, “Image classification using neural networks and ontologies”, DEXA workshop, pp.98-102, 2002.
[5] J. Fan, Y. Gao, H. Luo, “Multi-level annotation of natural scenes using dominant image compounds and semantic concepts”, ACM Multimedia, New York, Oct. 10-15, 2004.
[6] G. Sheikholeslami, W. Chang, A. Zhang, “Semantic clustering and querying on heterogeneous features for visual data”, ACM Multimedia, Bristol, UK, 1998.
[7] E. Chang, K. Goh, G. Sychay, G. Wu, “CBSA: Content-based annotation for multimodal image retrieval using Bayes point machines”, IEEE Trans. CSVT, 2002.

[8] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D.M. Blei, M.I. Jordan, “Matching words and pictures”, Journal of Machine Learning Research, vol.3, pp.1107-1135, 2003.
[9] R. Zhang, Z. Zhang, S. Khanzole, “A data mining approach to modeling relationships among categories in image collections”, ACM SIGKDD, 2004.
[10] C. Carson, S. Belongie, H. Greenspan, J. Malik, “Blobworld: Image segmentation using expectation-maximization and its application to image querying”, IEEE Trans. PAMI, vol.24, no.8, 2002.
[11] Y. Wu, A. Zhang, “Adaptive pattern discovery for interactive multimedia retrieval”, IEEE CVPR, 2003.
[12] A. Natsev, M. Naphade, J.R. Smith, “Semantic representation: search and mining of multimedia content”, ACM SIGKDD, 2004.
[13] W. Wang, Y. Song, A. Zhang, “Semantic-based image retrieval by region saliency”, Proc. CIVR, 2002.
[14] J. Pun, H. Yan, C. Faloutsos, P. Duygulu, “Automatic multimedia cross-modal correlation discovery”, ACM SIGKDD, 2004.
[15] J. Huang, S.R. Kumar, R. Zabih, “An automatic hierarchical image classification scheme”, ACM Multimedia, Bristol, UK, 1998.
[16] A. Aslandogan, C. Their, C. Yu, J. Zon, N. Rishe, “Image retrieval using WordNet”, ACM SIGIR, 1997.
[17] G. McLachlan, T. Krishnan, The EM algorithm and extensions, New York, John Wiley & Sons, 2000.
[18] N. Ueda, R. Nakano, Z. Ghahramani, G.E. Hinton, “SMEM algorithm for mixture models”, NIPS, 1998.
[19] M. Jordan, R. Jacobs, “Hierarchical mixtures of experts and the EM algorithm”, Neural Computation, vol.6, pp.181-214, 1994.
[20] M. Figueiredo, A.K. Jain, “Unsupervised learning of finite mixture models”, IEEE Trans. PAMI, vol.24, pp.381-396, 2002.
[21] A. McCallum, R. Rosenfeld, T. Mitchell, A.Y. Ng, “Improving text classification by shrinkage in a hierarchy of classes”, Proc. ICML, 1998.

[22] T. Hofmann, “The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data”, Proc. IJCAI, 1999.
[23] D. Koller, M. Sahami, “Hierarchically classifying documents using very few words”, Proc. ICML, 1997.
[24] S. Chakrabarti, B. Dom, R. Agrawal, P. Raghavan, “Using taxonomy, discriminants, and signatures for navigating in text databases”, Proc. VLDB, 1997.
[25] C. Fellbaum, WordNet: An electronic lexical database, MIT Press, 1998.
[26] G.A. Miller, “WordNet: A lexical database for English”, Comm. of ACM, vol.38, no.11, 1995.
[27] C.A. Bouman, M. Shapiro, G. Cook, C. Atkins, H. Cheng, “Cluster: An unsupervised algorithm for modeling Gaussian mixtures”, Technical Report, Purdue University.
[28] C. Djeraba, Multimedia Mining: A highway to intelligent multimedia documents, Kluwer Academic Publishers, 2003.
[29] F. Cozman, I. Cohen, “Unlabeled data can degrade classification performance of generative classifier”, TR-HPL-2001-234, 2001.
[30] M.R. Naphade, X. Zhou, T.S. Huang, “Image classification using a set of labeled and unlabeled images”, Proc. SPIE, 2000.
[31] K. Nigam, A. McCallum, S. Thrun, T. Mitchell, “Text classification from labeled and unlabeled documents using EM”, Machine Learning, vol.39, no.2, 2000.
[32] M. Szummer, T. Jaakkola, “Information Regularization with Partially Labeled Data”, Proc. NIPS, 2002.
[33] K.P. Bennett, U. Fayyad, D. Geiger, “Density-based indexing for approximate nearest-neighbor queries”, ACM SIGKDD, pp.233-243, 1999.
