2010 International Conference on Pattern Recognition
Semi-supervised and Interactive Semantic Concept Learning for Scene Recognition

Xian-Hua Han, Yen-Wei Chen
Ritsumeikan University, Japan
[email protected]
Abstract

In this paper, we present a novel semi-supervised and interactive concept learning algorithm for scene recognition based on local semantic description. Our work is motivated by the continuing effort in content-based image retrieval to extract and model the semantic content of images. The basic idea of semantic modeling is to classify local image regions into semantic concept classes such as water, sunset, or sky [1]. However, manually labeling concept samples for training the semantic model is fairly expensive, and the labeling results are, to some extent, subjective to the operators. In this paper, using the proposed semi-supervised and interactive learning algorithm, training samples and new concepts can be obtained accurately and efficiently. Through extensive experiments, we demonstrate that the image concept representation is well suited for modeling the semantic content of heterogeneous scene categories, and thus for recognition and retrieval. Furthermore, higher recognition accuracy can be achieved by updating the training samples and concepts obtained with the proposed algorithm.
1. Introduction

Semantic understanding of scenes remains an important research challenge for the image and video retrieval community. Some even argue that there is an "urgent need" to gain access to the content of still images, because techniques for organizing, indexing and retrieving digital image data are lagging behind the exponential growth in the number of digital images. In general, most approaches concisely convey information through collections of global low-level features such as color, texture and edge position, which are used to measure similarity among images. This has long been recognized as the semantic-gap problem in natural scene retrieval, because a similarity metric based on these

1051-4651/10 $26.00 © 2010 IEEE  DOI 10.1109/ICPR.2010.746
Xiang Ruan
Omron Corporation, Japan
features is insufficient for image retrieval. The semantic gap between the image understanding of the user and the image representation of the computer therefore still hampers progress in modeling high-level semantic content for image browsing and retrieval. In early work on scene classification, semantics are often found only in the definition of the scene classes, e.g. indoor vs. outdoor, or waterfalls vs. mountains [2]. Recently, several systems have been proposed that address global as well as local image annotation [3]. In general, these approaches aim at learning the correspondence between global annotations and images or image regions, respectively, a promising trend in image understanding. Nevertheless, the fact that global annotations are more general than pure region naming, and consequently that a semantic correspondence between keywords and image regions does not necessarily exist, is often neglected. This is especially true for the correspondence between category labels and category members. In this paper, we propose a novel representation of natural scenes based on local semantic concept description. Local region descriptions are combined into a global image representation that can be used for scene recognition, retrieval, and ranking. Training a good concept model, however, requires many labeled training regions, and effective concepts must be decided in advance. Deciding the concept types a priori may not suit a specific scene classification task, and manually labeling the training concept samples is fairly expensive and subjective to the operators. We therefore propose a semi-supervised [4] and interactive learning algorithm for actively updating the concept types and the training samples of the concept model. At first, we decide the concept types for representing scene images and roughly label only dozens of regions for each concept to learn the concept model.
The trained concept model is then applied to unlabeled samples. Samples classified into one concept with high probability are taken as candidates for training
the next concept model. At the same time, we interactively inspect the high-probability candidates and the concept types of the classification model to update the final training samples and concept types. Through extensive experiments, we demonstrate that the image concept representation is well suited for modeling the semantic content of heterogeneous scene categories, and thus for recognition and retrieval. Furthermore, a more accurate classification model is obtained after updating the training samples and concepts with the proposed semi-supervised and interactive learning algorithm.
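The self-training loop sketched above can be written compactly. The sketch below is ours, not the paper's implementation: the paper does not fix a particular concept classifier, so a minimal nearest-centroid classifier stands in for it, and the confidence threshold and round count are illustrative hyperparameters. The point where a human would interactively relabel candidates is marked by a comment.

```python
import numpy as np

class NearestCentroid:
    """Stand-in concept classifier (the paper does not specify one).
    predict_proba is a softmax over negative distances to class centroids."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        e = np.exp(-d)
        return e / e.sum(axis=1, keepdims=True)

def self_train(clf, X_labeled, y_labeled, X_unlabeled,
               threshold=0.9, max_rounds=5):
    """Self-training: repeatedly move high-confidence predictions on
    unlabeled patches into the labeled training set and re-train."""
    X, y, pool = X_labeled, y_labeled, X_unlabeled
    for _ in range(max_rounds):
        clf.fit(X, y)
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        conf, idx = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= threshold
        if not keep.any():
            break
        # High-confidence candidates join the training set; in the
        # paper's interactive variant a human checks them here and
        # redistributes misclassified regions to the right concepts.
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, clf.classes_[idx[keep]]])
        pool = pool[~keep]
    return clf.fit(X, y)
```

In each round, only points above the confidence threshold are promoted, which keeps the label noise added to the training set low; the interactive check then removes the residual errors.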
Figure 1. Sample image for each scene class.

Figure 2. Semantic model and the semi-supervised, interactive learning method.
2. Semantic model

In this section, we introduce the semantic concept representation of scene images [1]. Scene images are described by the concept frequencies or global concept probabilities of local regions. The image analysis therefore proceeds in two stages. In the first stage, local image regions are classified by concept classifiers into semantic concept classes. In order not to depend on the largely varying quality of automatic segmentation, the local image regions are extracted on a regular grid of 10x10 regions. In our experiments, the target scene classes are: Beach, Snow, Night landscape, Firework and Sunset. According to the sample images of the five scene classes (one sample of each class is shown in Fig. 1), we initially decided on 10 local semantic concept types (Water, Tree, Sunset, Snow, Sky, Sand, Black, Rock, Firework and Night), shown in the top row of Fig. 2. We then roughly label 50 local training regions from the scene sample images for each semantic concept as the training samples of the concept model. Each region (patch) of a test scene thus has a 10-dimensional probability vector. In the second stage, the region-wise information of the concept classifiers is combined into a global image representation. Three ways of combining the local semantic concept information into a global representation are used. The first is the concatenated probability vector
(1000-D in our case) of all regions in the scene image. The second is the concatenated row-sum probability vector (100-D in our case), obtained by summing the probability vectors of the 10 regions in each row of the image. The third is the summed probability vector (10-D in our case) over all 100 regions in the image. The combination methods are illustrated in Fig. 3. The advantages of semantic modeling are manifold. Only through the use of named concept classes in the first stage of the image understanding system can the semantic detail of scene images be effectively modeled and used for description. In addition, the semantic content of local image regions is far less complex than that of full images, making the acquisition of the ground truth required for training and testing much easier. Since the local semantic concepts correspond to "real-world" concepts, the method can also be used for descriptive image search. In the following, however, we only use the global image representation obtained through semantic modeling.
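The three combination schemes can be sketched as follows, assuming each of the 100 grid patches already carries an N-dimensional concept-probability vector (N = 10 here); the function and variable names are ours, not the paper's.

```python
import numpy as np

def combine(patch_probs):
    """patch_probs: (10, 10, N) grid of per-patch concept probabilities.
    Returns the three global image representations described above."""
    # (1) concatenated vector of all 100 patches: 100*N-D (1000-D for N=10)
    concat_all = patch_probs.reshape(-1)
    # (2) per-row sums over the 10 patches of each row, concatenated: 10*N-D
    row_sums = patch_probs.sum(axis=1).reshape(-1)
    # (3) sum over all 100 patches: N-D
    global_sum = patch_probs.sum(axis=(0, 1))
    return concat_all, row_sums, global_sum
```

The first representation preserves full spatial layout, the second keeps only row-wise (vertical) structure, and the third is a purely global concept histogram; this trade-off between spatial detail and dimensionality is what Fig. 4 later compares.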
3. Semi-supervised and interactive learning algorithms

Figure 3. Combination methods for the global representation of concept probabilities.

Figure 4. Experimental results with the different combination methods of concept probabilities: (a) with the concatenated probability vector of all regions in one image; (b) with the concatenated row-sum probability vector of the 10 regions in each row of the image; (c) with the summed probability vector over all 100 regions in the image.

As argued above, training a good concept model requires many labeled training regions, and effective concepts must be decided in advance. However, concept types decided subjectively may not suit a specific scene classification task, and manually labeling the training concept samples is fairly expensive and subjective to the operators. We therefore propose a semi-supervised and interactive learning algorithm for actively updating the concept types and the training samples of the concept model. At first, we decide the concept types for representing scene images and roughly label only dozens of regions for each concept to learn the concept model. A semi-supervised learning method is then used to obtain more labeled training samples for the concept models. In this paper, we use self-training [4], a commonly used technique for semi-supervised learning. In self-training, a classifier is first trained with the small amount of labeled data and is then used to classify the unlabeled data. Typically, the most confident unlabeled points, together with their predicted labels, are added to the training set; the classifier is re-trained and the procedure is repeated. Because the concept types are only an initial human choice and the training samples are only roughly labeled, it is necessary to refine them with the high-confidence concept regions after semi-supervised learning. We interactively check the high-confidence regions from semi-supervised learning. If regions with high confidence
are misclassified to other concepts, we redistribute them to the right concepts as training samples for the next round. At the same time, we analyze the global statistics of the high-confidence candidate samples. We found that the number of regions classified to the rock concept is small and that a large proportion of them are misclassified. We therefore remove rock from the concept set and add sunset-water and others, as shown in the bottom row of Fig. 2. After the semi-supervised and interactive learning, the concept set is S = {Water, Tree, Sunset, Snow, Sky, Sand, Black, Firework, Night, Sunset-water, Others}, and the number of available labeled training samples for each concept has also increased. We then train the new concept model for the semantic model with the updated labeled samples and concepts. After obtaining the semantic concept model, each
patch of a scene image is evaluated to obtain its probabilities of belonging to the different concepts. For each image, 100 N-dimensional probability vectors (N: number of concepts; 100: number of regions per image) are available for image representation. We then combine the patch probabilities into an image representation in three ways. The first is the concatenated probability vector (100 × N-D in our case) of all regions in the scene image. The second is the concatenated row-sum probability vector (10 × N-D in our case) of the 10 regions in each row of the image. The third is the summed probability vector (N-D in our case) over all 100 regions in the image. The combination methods are illustrated in Fig. 3.
Table 1. Confusion matrix (%) of the recognition results with the concatenated probability vector and the χ2-distance Gaussian kernel, using our proposed method.

Classes    Beach   Snow    Firework  Night   Sunset
Beach      89.8    9.237   0         0       1.004
Snow       8.285   90.8    0.35      0       0.58
Firework   0.083   0       89.759    8.826   1.332
Night      0       0       2.82      96.5    0.704
Sunset     0.729   0.146   1.75      0.58    96.8
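The χ2-distance Gaussian kernel referred to above is commonly written K(x, y) = exp(-χ2(x, y) / σ), with χ2(x, y) = Σᵢ (xᵢ - yᵢ)² / (xᵢ + yᵢ). The paper does not spell out its exact form or bandwidth, so the sketch below is one standard variant with σ as a free hyperparameter; the small eps guard is ours.

```python
import numpy as np

def chi2_distance(x, y, eps=1e-10):
    """Chi-squared distance between two nonnegative histograms."""
    return np.sum((x - y) ** 2 / (x + y + eps))

def chi2_gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-chi2(X[i], Y[j]) / sigma).
    Can be passed to an SVM that accepts a precomputed kernel."""
    K = np.empty((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.exp(-chi2_distance(x, y) / sigma)
    return K
```

With scikit-learn, for example, such a Gram matrix can be used via `SVC(kernel='precomputed')`, fitting on `chi2_gaussian_kernel(X_train, X_train)` and predicting on `chi2_gaussian_kernel(X_test, X_train)`.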
4. Experimental results

In this section, we provide experimental results for the proposed algorithm. We carried out experiments using five scene classes (Beach: 698 images, Snow: 1066, Firework: 342, Night landscape: 1401 and Sunset: 886) and the 10 initial high-level concepts mentioned in Sec. 2. A small sample of this set is depicted in Fig. 1. We initially label 50 samples per high-level concept for training the concept model, and then use 50 images of each scene class for semi-supervised and interactive learning. The new concept model is trained with the updated samples and concept set. The semantic representation of each scene image is then obtained by combining the output probabilities of the image regions under the concept model, as shown in Fig. 3. For a fair comparison, 50 samples per concept, the same number as for the first concept model, are randomly selected for training the new one.

Scene recognition was implemented with an SVM classifier using the image semantic representation as features. For the SVM classifier, we use three types of kernel function: linear, Gaussian with Euclidean distance, and Gaussian with χ2 distance. We randomly select 100 scene images of each class for training, and the remainder are used for testing. The categorization results are shown in Fig. 4(a)-(c), corresponding to the different combination methods of the output probabilities of the trained concept models (see Fig. 3). For detailed categorization information, Table 1 shows the confusion matrix obtained with the concatenated probability vector of all regions and the χ2-distance Gaussian kernel using our proposed method. From Fig. 4, it is clear that the categorization accuracy can be greatly improved with the semi-supervised and interactive learning method.

5. Conclusions

In this paper, we presented a novel semi-supervised and interactive concept learning algorithm for accessing scene images by local semantic description. The semantic concept representation proved to be an efficient feature for image understanding. However, manually labeling concept samples for training the semantic model is fairly expensive and subjective to different operators. We therefore proposed self-training semi-supervised and interactive learning algorithms for obtaining accurate training samples and new concepts during the concept model learning procedure. Through extensive experiments, we demonstrated that the scene categorization accuracy can be greatly improved by the proposed semantic learning methods.

References

[1] J. Vogel and B. Schiele, "Semantic Scene Modeling and Retrieval for Content-Based Image Retrieval", International Journal of Computer Vision, Vol. 72, No. 2, pp. 133-157, Apr. 2007.
[2] M. Szummer and R. W. Picard, "Indoor-outdoor image classification", Proc. of IEEE Workshop on Content-Based Access of Image and Video Databases, 1998.
[3] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope", International Journal of Computer Vision, Vol. 42, 2001.
[4] X. Zhu and A. Goldberg, "Introduction to Semi-Supervised Learning", Morgan & Claypool Publishers, 2009.