Automatic and precise extraction of generic objects

Meeting on Image Recognition and Understanding (MIRU2010), July 2010

Automatic and precise extraction of generic objects using saliency-based priors and contour constraints

Gurbachan SEKHON†,††, Ken FUKUCHI†††, and Akisato KIMURA††

† Department of Mechanical Engineering, University of British Columbia, Canada
†† NTT Communication Science Laboratories, NTT Corporation, Japan
††† Department of Information and Communication Systems Engineering, Okinawa National College of Technology, Japan
E-mail: †[email protected], ††[email protected], †††[email protected]

Abstract  This paper deals with automatic video segmentation without supervision or interaction. We examine a method for automatic noise reduction in segmented video frames, dubbed the Contour-Classification method, which uses information about the contours of the segmented image mask to accurately reduce noise. We also examine a second noise reduction technique we have developed, called the Erosion-Dilation method. Our proposed method is composed of these two fundamental techniques: Contour-Classification and Erosion-Dilation. Test results indicate that our proposed method precisely removes noise regions from videos with a low error rate when compared with both the original unaltered segmentation result and the Erosion-Dilation method alone.

Key words  Video segmentation, saliency, graph cuts, contour, erosion, dilation

1. Introduction

Fig. 1 Example of noise in automatic segmentation results

In standard image segmentation processes, segmentation seeds are provided manually and carefully in order to accurately extract important regions from images and videos; one example is the Interactive Graph Cuts method proposed by Boykov et al. [1]. In many applications, however, manual labeling is often infeasible. To this end, we previously proposed a method for automatic saliency-based segmentation, called Saliency Graph Cuts (SGC) [2], using human visual attention models [3], [4]. This process has been shown to work accurately, consistently, and quickly over a wide range of videos [5].

One of the applications of video segmentation is in the field of object recognition, annotation, and retrieval. The method described in [2] has been shown to work well in this field, as seen in [6]. However, a problem with the automatic segmentation process is that there is occasionally some residual noise in the segmented images, as shown in Fig. 1. When applied to an object recognition, annotation, and retrieval system, noise within the final images will be treated as part of the object. Features of the noise become mixed with features of the actual object, resulting in noisy annotated images as well as noisy query images. In this paper we describe a method for reducing the noise in automatically segmented video frames through the classification of the different contours present in the video frame, which we have named the Contour-Classification method. We will also describe another method, which we have named the Erosion-Dilation method, which uses a combination of erosion and dilation image processing techniques for noise reduction. The rest of this paper is organized as follows: first, we provide a brief description of the SGC method for automatic segmentation in Section 2. This is followed by an explanation of the Contour-Classification and Erosion-Dilation noise reduction methods, along with a combination of these two techniques, which forms the main contribution of this paper.


Fig. 2 Framework of the Saliency Graph Cuts (SGC) method

Finally, we present the results of our quantitative and qualitative analysis of the three methods presented in this paper.

2. Saliency-Based Video Segmentation

This section will provide a description of the previously proposed SGC method for automatic video segmentation.

2.1 Framework

Fig. 2 describes the framework of the SGC method. Given an input video, the first step is to estimate the rough positions of objects in each frame based on visual saliency. We used a stochastic model of human visual attention detailed in [3], [4]; a brief description of this model is given in Section 2.2. We then construct a Markov random field (MRF) model for segmentation, where hidden states are given by labeling each position as "object" or "background", and each frame of the input video is an observation. Each successive step uses information from the previous step (the density of the previous step and the regions extracted in previous steps) in order to estimate the priors of objects and background, as well as the feature likelihood of the MRF. After the MRF has been constructed, the maximum a posteriori (MAP) solution of the MRF provides the object regions. A brief description of how to construct the MRF model and derive the MAP solution is given in Section 2.3.

2.2 Human Visual Attention Model

The framework for estimating human visual attention using our stochastic model described in [3], [4] is shown in Fig. 3. For each input frame of a video, a saliency map is first calculated using the method proposed by Itti et al. [7].

Fig. 3 Estimation of human visual attention using a stochastic attention model

Then, we obtain a stochastic representation of the saliency map (called the stochastic saliency map) through a Kalman filter, where the saliency map is the observation of the filter. Since each pixel of the stochastic saliency map is expressed by a Gaussian density, we can calculate the visual attention density for each pixel as the probability that the saliency is maximum at that pixel. Our model also incorporates the property that eye movements are affected by a cognitive state. Cognitive states are represented in this model by an eye movement pattern, which is defined as either passive or active. Active eye movement corresponds to the eyes constantly scanning an object for information, similar to what might occur while someone is watching a film. Passive eye movement corresponds to the eyes focusing on one position, as might occur while sitting idly in a bus or a train. Through the eye movement patterns, we can model eye movement with a hidden Markov model. Finally, we integrate the density coming from the stochastic saliency map and the eye movement pattern in order to obtain the final density of visual attention, which we call the eye focusing density map (EFDM).
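To make the stochastic saliency map concrete, the following is a minimal sketch of the per-pixel Kalman update under a simple random-walk state model. The noise variances q and r are illustrative assumptions, not the values used in [3], [4]:

import numpy as np

def update_stochastic_saliency(mean, var, observed_saliency, q=0.01, r=0.1):
    # Predict step: a random-walk model leaves the mean unchanged
    # and inflates the per-pixel variance by the process noise q.
    var_pred = var + q
    # Update step: blend the prediction with the deterministic saliency
    # map (the observation) using the per-pixel Kalman gain.
    gain = var_pred / (var_pred + r)
    new_mean = mean + gain * (observed_saliency - mean)
    new_var = (1.0 - gain) * var_pred
    # Each pixel is now described by a Gaussian N(new_mean, new_var).
    return new_mean, new_var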

2.3 Priors and Likelihoods

Fig. 4 shows the concept of providing prior densities. Our method is built upon the supervised image segmentation technique based on graph cuts proposed by Boykov et al. [1]. While Boykov's method requires manually provided labels, our method utilizes the human visual attention model described in Section 2.2 to automatically provide prior densities indicating the probability that a certain position is an object, instead of utilizing labels.

Fig. 4 Updated prior densities using previous frame's prior density and current segmentation result

Given a set I of coordinates, we define a set of random variables A = {A_x}_{x∈I}. Each random variable A_x takes a value a_x from the set L = {0, 1}, corresponding to background (0) and object (1). MRF estimation is then formulated as an energy minimization problem, in which the energy of a configuration a is the negative log of the joint posterior density of the MRF, E(a|D) = −log p(A = a|D), where D represents the input image. Equation (1) defines the energy function:

E(a|D) = \sum_{x \in I} \Bigl\{ \psi_1(D|a_x) + \xi_1(a_x) + \sum_{y \in N_x} \bigl( \psi_2(D|a_x, a_y) + \xi_2(a_x, a_y) \bigr) \Bigr\}    (1)

where N_x is a neighbourhood system of the position x, ψ_i(D|·) (i = 1, 2) is a likelihood term, and ξ_i(·) is a prior term. The first prior term in Equation (1) is the negative log of the prior density obtained from the EFDM (Section 2.2). We assume that the EFDM can be described by a Gaussian mixture model (GMM), whose parameters are estimated with the Expectation-Maximization (EM) algorithm. The estimated GMM density represents the first prior density p(A_x = 1). The first likelihood term ψ_1(D|A_x) is the negative log likelihood of the RGB values, obtained in a way similar to Interactive Graph Cuts [1]. While Interactive Graph Cuts gathers samples from manually labeled pixels, our method utilizes all pixels for estimating the likelihood ψ_1(D|A_x), where samples are weighted by the first prior density p(A_x = 1). ψ_2 and ξ_2 are almost the same as the ones described in Boykov et al. [1]. Our method also introduces a way to update the prior and likelihood terms according to the segmentation result from the previous frame and the visual attention density of the current frame. In [5], we showed that using sequentially updated priors greatly improves the performance of the segmentation process.
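As a sketch of how the first prior density can be obtained in practice, the snippet below fits a GMM with EM to pixel coordinates drawn in proportion to the EFDM and evaluates the fitted density over the whole frame. The component count, sample size, and use of scikit-learn are our illustrative assumptions, not the paper's exact implementation:

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_prior_from_efdm(efdm, n_components=2, n_samples=5000, seed=0):
    # efdm: non-negative 2-D array (assumed to have a positive sum).
    h, w = efdm.shape
    rng = np.random.default_rng(seed)
    # EM here runs on samples rather than a weighted density, so draw
    # pixel coordinates with probability proportional to the EFDM.
    p = (efdm / efdm.sum()).ravel()
    idx = rng.choice(h * w, size=n_samples, p=p)
    xy = np.column_stack((idx % w, idx // w)).astype(float)
    gmm = GaussianMixture(n_components=n_components).fit(xy)
    # Evaluate the fitted GMM density on the full pixel grid.
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    grid = np.column_stack((xs.ravel(), ys.ravel())).astype(float)
    prior = np.exp(gmm.score_samples(grid)).reshape(h, w)
    # Simple normalization so the map can serve as p(A_x = 1).
    return prior / prior.max()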

3. Main Contributions



This section details noise reduction using the Contour-Classification and Erosion-Dilation methods. We also describe our proposed method, which combines these two techniques.

3.1 Contour-Classification Method

Under the Contour-Classification approach to noise reduction, the contours of the segmentation image mask are calculated first. In our application, we utilized the OpenCV library for contour-related functions and used the Teh-Chin algorithm [8] for contour detection. In general, noise in the segmented image is not physically connected to the actual object of interest. As a result of this property, the contours of the object of interest and the contours of noise will be separate. After finding all of the image contours, the next step is to calculate the area of each contour and classify it as either part of the object or noise according to one of the following two criteria:
(1) Contour Area Threshold criterion. A certain threshold is determined to define contours that are "objects". Each contour is checked against the threshold, and the contours with area larger than the threshold are used for creating the final segmentation result. In general, for any given segmentation result, a contour of noise will be significantly smaller than the contour of an object, so a suitable threshold is usually easy to determine.
(2) Largest Contour Area criterion. Only the contour with the largest area is considered to be the object of interest; all other contours are disregarded. As indicated earlier, contours of noise are significantly smaller than the contours of an object, so it is reasonable to expect that the largest contour within a target image will be the object of interest.
Both of these criteria have their uses in practice, as illustrated by the sketch below.
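The following is a minimal sketch of both criteria using OpenCV (4.x) in Python. The CHAIN_APPROX_TC89_L1 flag selects a Teh-Chin approximation, in line with [8]; the helper name and default parameters are our own illustrative choices:

import cv2
import numpy as np

def reduce_noise_by_contours(mask, area_threshold=None):
    # mask: binary uint8 segmentation mask (0 = background, 255 = object).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_TC89_L1)
    if not contours:
        return np.zeros_like(mask)
    if area_threshold is not None:
        # Contour Area Threshold criterion: keep every contour whose
        # area exceeds the threshold.
        kept = [c for c in contours if cv2.contourArea(c) > area_threshold]
    else:
        # Largest Contour Area criterion: keep only the biggest contour.
        kept = [max(contours, key=cv2.contourArea)]
    # Redraw the kept contours, filled, as the cleaned mask.
    cleaned = np.zeros_like(mask)
    cv2.drawContours(cleaned, kept, -1, color=255, thickness=cv2.FILLED)
    return cleaned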


Fig. 5 Two examples comparing the two criteria described

Depending on the contents of a given input image, the segmentation result may contain more than one region which can be identified as an "object". In certain applications, the desired result may be to have only one object in the final image; in this case, the Largest Contour Area criterion would be useful. In other applications, the desired result may be to have multiple objects appear in the final image; in this case, the Contour Area Threshold criterion would work better. To illustrate the difference between the criteria, take for example the images shown in Fig. 5. The first row shows an example where the Contour Area Threshold criterion gives the more accurate result. In this case, both criteria have removed all of the noise; however, the Largest Contour Area criterion has also removed the horses in the background, as it assumes that there is only one object of interest. The Contour Area Threshold criterion, on the other hand, has a predetermined threshold to evaluate whether a certain contour is noise or an object. A downside of the Contour Area Threshold criterion can be seen in the second row of Fig. 5. Here, it can be seen that if noisy areas are large, the Contour Area Threshold criterion will incorrectly evaluate them as another object. If the threshold were increased to compensate for larger noise, we would run the risk of removing useful object areas as well. As such, there are trade-offs in using either criterion to remove the noise of segmentation results. In practice, we have implemented both criteria and manually select which one to use based on the type of result we are expecting.

3.2 Erosion-Dilation Method

The Erosion-Dilation approach to noise reduction uses the erosion and dilation image processing techniques under certain criteria to reduce image noise and improve object continuity, respectively. In our application, we utilized the OpenCV library for the erosion and dilation functions, and we used the linear decomposition method [9] to perform the process.

Fig. 6 Two examples showing application of the Erosion-Dilation technique

The main indicator used to determine which image processing technique to apply, erosion or dilation, is the number of pixels labeled "object" in a certain sub-region of the segmentation result. For each sub-region, we calculate the number of "object" pixels divided by the total number of pixels in the sub-region, denoted by a. Once this value has been calculated, the following criteria apply:

If 0 ≤ a ≤ θ_e, perform erosion.
If θ_d ≤ a ≤ 1, perform dilation.
If θ_e < a < θ_d, do nothing.

where θ_e and θ_d are the thresholds for erosion and dilation, respectively, and must satisfy 0 ≤ θ_e ≤ θ_d ≤ 1. This process is iterated over all of the sub-regions in the image. A sub-region has some fraction of the original width and height of the image; from experimental analysis, we found that sub-regions with one-half the original image's width and height provided the best results. In addition, each subsequent sub-region should overlap the previous sub-region by half. This handles the case in which the border of an object lies only slightly within one sub-region: if the sub-regions did not overlap, the border information would be lost and the object of interest would lose clarity. Before applying this noise reduction method to an image, we also downsize the image in order to aid in the application of the criteria above. A final parameter is the number of iterations of the erosion and dilation operations applied to each sub-region. A minimal sketch of this procedure is given below.
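The sketch below applies the per-sub-region criterion with OpenCV's plain structuring-element erosion and dilation (rather than the decomposed operators of [9]); the threshold values, 3×3 kernel, and simplified handling of window borders and overlap are illustrative assumptions:

import cv2
import numpy as np

def erosion_dilation(mask, theta_e=0.05, theta_d=0.4, iters=3, scale=0.5):
    # Downsize first, as described above (nearest neighbour keeps the
    # mask binary).
    small = cv2.resize(mask, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_NEAREST)
    h, w = small.shape
    sh, sw = h // 2, w // 2          # sub-region: half the image size
    kernel = np.ones((3, 3), np.uint8)
    out = small.copy()
    # Step by half a sub-region so consecutive windows overlap by half.
    for y in range(0, h - sh + 1, sh // 2):
        for x in range(0, w - sw + 1, sw // 2):
            sub = out[y:y + sh, x:x + sw]
            a = np.count_nonzero(sub) / sub.size   # "object" pixel ratio
            if a <= theta_e:
                out[y:y + sh, x:x + sw] = cv2.erode(sub, kernel,
                                                    iterations=iters)
            elif a >= theta_d:
                out[y:y + sh, x:x + sw] = cv2.dilate(sub, kernel,
                                                     iterations=iters)
    # Restore the original resolution.
    return cv2.resize(out, (mask.shape[1], mask.shape[0]),
                      interpolation=cv2.INTER_NEAREST)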


Fig. 6 shows an example of the Erosion-Dilation method being applied to a test image. The Erosion-Dilation method reduces some of the noise, but does not get rid of it completely. It is a more conservative technique than the Contour-Classification method in that, if regions are incorrectly labeled as "object", not all of the information in those regions will be lost. At the same time, noise effects are still reduced. However, Fig. 6 also shows some limitations of the Erosion-Dilation method: while large patches of noise will be reduced in size, it is difficult to eliminate such regions completely.

3.3 Combination of Contour-Classification and Erosion-Dilation

This combination method applies Contour-Classification (using either criterion) and the Erosion-Dilation method in series. In general, the SGC method will trim part of the object area's outermost border, because the border between "object" and "background" is a difficult area to distinguish precisely. As such, there are two ways an image can be made to more closely match the ground truth image: decreasing the amount of noise, or increasing the object area in order to compensate for the trimming done by the SGC method. The Contour-Classification method addresses the first possibility, that is, decreasing the total amount of noise. The Erosion-Dilation method, however, attempts both to reduce the noise of the image (through erosion) and to increase the total object area (through dilation). This is the main benefit of combining these two methods. For our purposes, the combination is done in series, with the Contour-Classification method applied first and the Erosion-Dilation method applied to its result. The order of application is not an arbitrary choice. As was demonstrated in Section 3.1, choosing a certain threshold for the contour area does not guarantee that all of the noise will be removed. If we consider the reverse application of the two methods (Erosion-Dilation first, then Contour-Classification), we might hope to first reduce the area of noisy regions so that they fall within the classification of "noise" when we apply the Contour-Classification method. However, noisy regions vary greatly in size, and as such we cannot rely on the Erosion-Dilation method in this regard. On the other hand, in our proposed order, since the Erosion-Dilation method's criteria rely on counting the number of "object" pixels in an image, applying the Contour-Classification method first gives the Erosion-Dilation method more accurate results.
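Under the same illustrative assumptions as the two sketches above, the combination then reduces to applying them in series:

def combined_noise_reduction(mask, area_threshold=None):
    # Contour-Classification first (either criterion, depending on
    # whether a threshold is given), then Erosion-Dilation on its
    # result, as argued above.
    cleaned = reduce_noise_by_contours(mask, area_threshold)
    return erosion_dilation(cleaned)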

Fig. 7 Samples of input videos and corresponding ground truths created by hand


4. Evaluation Results

We conducted both quantitative and qualitative evaluations of the methods described in this paper in order to compare their results. This section is organized as follows: first, we describe the methods and parameter choices used for evaluation. Then, we compare the parameter choices of each method against the segmentation result obtained from the SGC method in order to identify optimal operating parameters. This is followed by a comparison between the two individual methods and their combination. Finally, we show a qualitative evaluation of the three methods outlined in this paper.

4.1 Quantitative Evaluation

The noise reduction methods outlined in this paper were evaluated by doing a pixel-by-pixel comparison with a "ground truth" video. Each method was first used to reduce the noise of the segmentation results derived from 8 sample videos, each approximately 7 seconds in length at 12 fps with a resolution of 352 × 288. Then, the results of the noise reduction were compared pixel-by-pixel against the ground truth video, which was created manually. Fig. 7 shows some samples of input videos and the corresponding ground truths. We used four measures of performance, with pixels classified as True Positive (TP), False Positive (FP), or False Negative (FN): precision, recall, error, and F-Measure, where:


Fig. 8 Precision, recall, and F-Measure of NNR, CL, and CT

Fig. 10 Precision, recall, and F-Measure of the Erosion-Dilation methods

Fig. 9 Error of NNR, CL, and CT

Fig. 11 Error of the Erosion-Dilation methods



\text{Precision} = \frac{TP}{TP + FP}    (2)

\text{Recall} = \frac{TP}{TP + FN}    (3)

\text{Error} = \frac{FP + FN}{\text{TotalPixels}}    (4)

\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (5)
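For reference, a minimal sketch of computing Equations (2)-(5) from a binary segmentation result and its ground-truth mask, with our own guards against empty denominators:

import numpy as np

def segmentation_metrics(result, truth):
    # result, truth: same-shaped arrays interpretable as binary masks.
    result = result.astype(bool)
    truth = truth.astype(bool)
    tp = np.count_nonzero(result & truth)    # object in both
    fp = np.count_nonzero(result & ~truth)   # object only in result
    fn = np.count_nonzero(~result & truth)   # object only in truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    error = (fp + fn) / truth.size
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, error, f_measure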

F-Measure and error were the two primary measures used to compare each of the methods.

4.2 Contour-Classification Evaluation (CL and CT)

In our evaluation of the Contour-Classification method, we considered and compared the following methods:
• No Noise Reduction (shortened as NNR). This is the unaltered segmentation result obtained from the SGC method.
• Contour-Classification using the Largest Contour Area criterion (shortened as CL).

• Contour-Classification using the Contour Area Threshold criterion (shortened as CT).
Fig. 8 and Fig. 9 show the results of the evaluation of the CL and CT methods. As can be seen from Fig. 8, there exists a trade-off between precision and recall when using either criterion. While CL provides an increase in precision and a decrease in recall, CT provides the opposite: an increase in recall combined with a decrease in precision. However, from the F-Measure in Fig. 8, it can be seen that the CL method performs better on average and more consistently than the CT method. Neither method provides any advantage over the other in terms of error, as seen in Fig. 9.

4.3 Erosion-Dilation Evaluation

In evaluating the Erosion-Dilation method, we identified two parameters of interest which can affect performance: the number of iterations of erosion and dilation, and the amount of downsizing before the erosion and dilation are applied. The following parameter combinations were considered for this method:
• 3 iterations, half downsizing (shortened as ERD3.5 in the results)


Fig. 12 Precision, recall, and F-Measure of the CL, ERD3.5, and CLERD methods

• 5 iterations, half downsizing (shortened as ERD5.5)
• 1 iteration, half downsizing (shortened as ERD1.5)
• 1 iteration, no resizing (shortened as ERD1.0)
• 1 iteration, quarter downsizing (shortened as ERD1.25)
Fig. 10 and Fig. 11 show the results of the evaluation of the Erosion-Dilation methods. When considering the downsizing parameter, similarly to the Contour-Classification method, we see that there is a trade-off in the amount of downsizing. Downsizing to a quarter yields high precision with low recall, while not downsizing at all yields higher recall than precision, as seen through ERD1.25 and ERD1.0 respectively. The number of iterations has a similar trade-off: through ERD5.5, ERD3.5, and ERD1.5, we can see that as the number of iterations increases, precision increases while recall decreases. Overall, it is difficult to determine the best parameterization, as there is overlap between all of the F-Measures (Fig. 10). The parameters for the Erosion-Dilation method depend on the application and the image being considered, and as such it is difficult to determine a single parameter combination that will work over a wide range of image types.

4.4 Effectiveness of the Combination of Contour-Classification and Erosion-Dilation

For our evaluation, the combination method was built by combining the CL and ERD3.5 methods described previously (shortened as CLERD in the results). Fig. 12 and Fig. 13 show the results of our evaluation of the combination method.

Fig. 13 Error of the CL, ERD3.5, and CLERD methods

Fig. 14 Qualitative comparison of the noise reduction techniques outlined in this paper

In Fig. 12, we see that there is an increase in F-Measure when comparing CLERD to both CL and ERD3.5. Fig. 13 shows that this is accompanied by some decrease in average error, although there is some overlap in this figure. Overall, we see that the combination of Contour-Classification and Erosion-Dilation further strengthens the already strong results of the Contour-Classification method.

4.5 Qualitative Evaluation

In this section, we compare the three methods outlined in this paper qualitatively. For the Contour-Classification and combination methods, we use the Contour Area Threshold criterion, while the Erosion-Dilation method, both on its own and within the combination, uses the ERD3.5 parameters. Fig. 14 shows the noise reduction results for part of a video that was used in the quantitative evaluation above. We can see that the Contour-Classification results have only removed the smallest portions of the noise, due to the limitations imposed by the threshold criterion. In the fourth row, we see that the Erosion-Dilation results have improved relative to the Contour-Classification results.


However, when there is a high concentration of noise, the Erosion-Dilation method fails to remove the noise totally. Finally, we see that the combination of Contour-Classification and Erosion-Dilation yields little to no noise in most of the video frames. The last video frame seems to contain too much noise for either method to remove completely; however, overall the combination of the two methods performs very well in this situation.


5. Conclusion

This paper proposed a method for automatically extracting object-like regions by using Contour-Classification. As shown in Section 4.2, the effectiveness of our proposed method is substantially improved when compared with the NNR result. We have also considered the combination of our proposed method with another method that erodes or dilates segmented regions, and our results show that this combination further improves the effectiveness of our proposed method. From our results, we observed that using the Largest Contour Area criterion provides better and more consistent results than using the Contour Area Threshold criterion. However, as shown in Section 3.1, there are certain situations in which the Contour Area Threshold criterion may provide better results than the Largest Contour Area criterion. As such, automatic selection of the criterion should be considered as future work. In addition, automatic selection of the Erosion-Dilation parameters should also be considered. We found in Section 4.3 that it was difficult to determine a single parameter combination that performs best over a wide range of images; by selecting the Erosion-Dilation parameters automatically, the performance of this method could improve greatly. This could also be expanded to include the automatic selection of θ_d and θ_e for the Erosion-Dilation criteria. In our tests, we chose the values of θ_d and θ_e that gave the best average results over all of the samples. However, performance could be improved if the best θ_d and θ_e were selected for each individual video. This was not done in our analysis, as it is impractical for our application.


References
[1] Y. Boykov and G. Funka-Lea, "Graph cuts and efficient N-D image segmentation," IJCV, vol. 70, no. 2, pp. 109-131, 2006.
[2] K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, and J. Yamato, "Saliency-based video segmentation with graph cuts and sequentially updated priors," Proc. ICME 2009, pp. 638-641, June 2009.
[3] D. Pang, A. Kimura, T. Takeuchi, J. Yamato, and K. Kashino, "A stochastic model of selective visual attention with a dynamic Bayesian network," Proc. ICME 2008, pp. 1073-1076, June 2008.
[4] K. Miyazato, A. Kimura, S. Takagi, and J. Yamato, "Real-time estimation of human visual attention with MCMC-based particle filter," Proc. ICME 2009, pp. 250-257, June 2009.
[5] K. Fukuchi, K. Miyazato, A. Kimura, K. Akamine, S. Takagi, K. Kashino, and J. Yamato, "Saliency-based video segmentation with graph cuts and sequentially updated priors," IEICE Transactions on Information and Systems, vol. J93-D, no. 8, August 2010, to appear (in Japanese).
[6] A. Kimura, K. Kashino, K. Fukuchi, K. Akamine, and S. Takagi, "Cognitive developmental approach to the realization of sophisticated visual scene understanding," IEICE Technical Report, PRMU2009-144, December 2009 (in Japanese).
[7] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. PAMI, vol. 20, no. 11, pp. 1254-1259, November 1998.
[8] C.-H. Teh and R. T. Chin, "On the detection of dominant points on digital curves," IEEE Trans. PAMI, vol. 11, no. 8, pp. 859-872, August 1989.
[9] J. Pecht, "Speeding up successive Minkowski operations," Pattern Recognition Letters, vol. 3, no. 2, pp. 113-117, March 1985.