
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 9, NO. 4, JUNE 2007

A Rule Based Technique for Extraction of Visual Attention Regions Based on Real-Time Clustering Zhiwen Yu, Student Member, IEEE, and Hau-San Wong, Member, IEEE

Abstract—Recently, the detection of visual attention regions (VAR) has become increasingly important due to its useful applications in the area of multimedia. Although many approaches exist for detecting visual attention regions, few of them consider the semantic gap between the visual attention regions and high-level semantics. In this paper, we propose a rule based technique for the extraction of visual attention regions at the object level based on real-time clustering, such that VAR detection can be performed in a very efficient way. The proposed technique consists of four stages: 1) a fast segmentation technique which is called the real time clustering algorithm (RTCA); 2) a refined specification of VAR which is known as the hierarchical visual attention regions (HVAR); 3) a new algorithm known as the rule based automatic detection algorithm (RADA) to obtain the set of HVARs in real time; and 4) a new adaptive image display module and the corresponding adaptation operations using HVAR. We also define a new background measure, which combines both feature contrast and the geometric property of the region to identify the background region, and a confidence factor, which is used to extract the set of hierarchical visual attention regions. Compared with existing techniques, our approach has two advantages: 1) it detects the visual attention region at the object level, which bridges the gap between traditional visual attention regions and high-level semantics; 2) it is efficient and easy to implement.

Index Terms—Clustering, knowledge extraction, real time processing, visual attention regions, visualization.

I. INTRODUCTION

WITH the increasing popularity of terminal client devices, the wide availability of wireless networks, and the development of multimedia technology, more and more client users in a heterogeneous environment enjoy all kinds of multimedia services. However, there is a wide variety of terminal client devices with different screen sizes, such as personal digital assistants (PDA), hand-held computers (HHC), mobile phones, pocket PCs, and so on. As a result, one of the key issues in multimedia services is how to transcode and display images on the screens of the client devices, which is also known as the image adaptation problem. The MPEG-21 Standard

Manuscript received June 2, 2006; revised October 4, 2006. This work was supported by City University of Hong Kong Project 7001766. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Alan Hanjalic. The authors are with the Department of Computer Science, City University of Hong Kong, Hong Kong (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2007.893351

Part 7 [9]–[12] not only provides a standardized framework to adapt format-dependent and format-independent multimedia resources with respect to the capability of terminal devices, but also provides the description tools for region-of-interest (ROI) specification based on perceptual criteria.

The main approaches to solve the image adaptation problem can be divided into two categories: resolution based and perception based. Most of the traditional approaches belong to the first category: they focus on the direct reduction of image and color resolution. For example, Li et al. proposed a new multimedia data representation called InfoPyramid [45], which represents contents in different modalities, at different resolutions, and at multiple abstraction levels. Smith et al. [42]–[44] represented the significance of the regions/blocks of an image by an importance value within [0,1]. The highly important regions with values close to 1 are compressed with lower compression factors, while the less important regions with values close to zero are compressed with higher compression factors. Most of the approaches in the first category rely on the direct reduction of the resolution of large images to fit the size of the screen of the terminal client devices. Unfortunately, this is not the best way for image adaptation due to the possible loss of important information.

The approaches in the second category are based on the perception of the human visual system. For example, Lee et al. [46] proposed a new image adaptation scheme based on perceptual criteria of the human visual system; they deliver only the important image regions to the client when the screen size is small. Chen et al. [38] described an image attention model based on the perception of the human visual system. Each entity in the model consists of three attributes: the region of interest (ROI), an attention value, and a minimal perceptible size. They applied a branch-and-bound algorithm to find the optimal adaptation. Hu et al. [7] generated and enhanced a saliency map by associating a weight with each pixel in the image based on a Gaussian distribution. They also designed an image adaptation engine to automatically detect the ROI of images using an enhanced visual attention model. Unfortunately, most of the approaches in the second category have two drawbacks: 1) the detected ROI only contains a part of the salient object rather than the whole object, as shown in Fig. 1; 2) a single ROI cannot satisfy the requirements of different users with various terminal devices.

In this paper, we focus on 1) detecting the ROI at the object level efficiently and 2) adapting images to the limited screen size of the terminal devices. We extend the definition of region-of-interest (ROI) to include hierarchical visual attention regions (HVAR), which correspond to a set of nested visual attention regions of different sizes.



Fig. 1. Detection of ROI based on perception-based approaches.

These visual attention regions satisfy the nesting relationship VAR_1 \subset VAR_2 \subset \cdots, so that users with different terminal devices can select the suitable VAR_i from HVAR.

In order to extract the set of HVARs at the object level, we propose a rule based automatic detection algorithm (RADA). Specifically, RADA first segments the image by a grid-based real-time clustering algorithm which performs K-means at the grid level instead of the data level. Then, it removes the noise from each segmented region by a denoising algorithm. Next, the major background is separated from the image based on a new background measure which combines feature contrast and geometric properties of the regions. Finally, a set of hierarchical visual attention regions (HVAR) is generated based on a confidence factor. We also design an adaptive image display module using HVAR.

The contribution of the paper is fourfold. First, we propose RTCA, which performs K-means at the grid cell level. Second, we propose a new definition called hierarchical visual attention regions, together with a new background measure and a confidence factor. Third, we design a rule based automatic detection algorithm to detect HVAR at the object level efficiently. Fourth, we design a new adaptive image display module and the corresponding adaptation operations using HVAR.

The remainder of the paper is organized as follows. Section II describes related work on VAR. Section III introduces the rule based automatic detection algorithm, the real time clustering algorithm, the new background measure, the confidence factor, the denoising algorithm, and the adaptive image display module using HVAR. Section IV evaluates the performance of our approach through a user investigation. Section V concludes the paper and describes possible future work.

II. RELATED WORK

Fig. 2. Rule based automatic detection algorithm.

Detecting visual attention regions has become an important topic in recent years [1]–[8], since visual attention analysis can potentially provide a link between the low-level features and the high-level representations required by many multimedia applications, such as content-based retrieval [13]–[15], [35]–[38], adaptive content delivery, and pattern recognition [2], [3]. The study of visual attention requires knowledge from different areas [16]–[34], [39]–[41], such as biology, psychology, neuroscience, cognitive science, and computer vision.

One of the most important works related to visual attention was proposed by Itti et al. [4]. They adopt a saliency map based visual attention model for rapid scene analysis, which is inspired by the neural architecture of the primate visual system. The whole process of their approach consists of three steps: 1) extraction of the low-level features, which include intensity, color and orientation, from the images; 2) generation of a saliency map based on a pyramid structure consisting of the feature maps; 3) determination of the visual attention regions by a "winner-take-all" neural network. Ahmad [19] proposed a visual attention model (VISIT) which fully simulates the visual attention system of humans. VISIT consists of several modules: a gating network, the gated feature map, a priority network, a control network and a working memory. Each module corresponds to a processing unit in the human visual system. Ma and Zhang [3], [20] proposed a new framework by contrasting the difference between the image block at the center and the neighboring blocks in the HSV color space.


Fig. 3. Selection of the seeds. (a) Searching the grid cells; (b) selecting the seeds.

The framework not only simulates human perception by a fuzzy growing method, but also provides a three-level attention analysis (attended view, attended areas, attended points). However, the efficiency of their method is sensitive to the size of the blocks. A more recent work was proposed by Hu et al. [1], who described a visual attention analysis approach based on subspace estimation and analysis: the image in 2-D space is first transformed to a 1-D linear subspace using a polar transformation; then, generalized principal component analysis is used to estimate the subspace; finally, a set of visual attention regions is extracted by a new region attention measure which combines the feature contrast and the geometric properties of the regions.

While the previous approaches can detect visual attention regions at different levels of effectiveness, there still exist two limitations: 1) few of them detect visual attention regions at the object level, which is the prerequisite for providing a semantic interpretation of the image; and 2) the specification of a single visual attention region cannot satisfy the requirements of different users with a wide variety of terminal devices. In order to perform detection at the object level, we propose the following rule based automatic detection algorithm.

III. RULE BASED AUTOMATIC DETECTION ALGORITHM

A. An Overview of the Algorithm

Fig. 2 provides an overview of our proposed rule based automatic detection algorithm (RADA), which extracts hierarchical visual attention regions (HVAR) at the object level from images. The algorithm first extracts the features of each pixel in the CIE Lab color space. We adopt the CIE Lab color features here since this color space is one of the most widely adopted color models for describing the colors visible to the human eye. Color and color contrast are important cues for humans to identify the salient objects in an image, and our approach is based on this observation. If the color contrast between the object and the background is very low, the detection result will not be satisfactory, but humans will also have difficulties in these cases.

B. Real Time Clustering Algorithm

In this section, we introduce a real time clustering algorithm (RTCA) to separate the image into regions.

RTCA is a variant of the K-means algorithm [21]–[23] in a low-dimensional space, and its objective is to reduce the number of iterations and the distance computation cost of K-means. The main idea of RTCA is to perform K-means at the grid cell level instead of the data point level. In this case, a data point is a pixel with its L, a, and b values, and the Euclidean distance is adopted as the metric between two points. RTCA first selects the minimum and maximum values of the data points in each dimension, and normalizes these values within [0,1]. Then, all the data points are hashed into a grid whose cells have side length 1/m in each dimension (m is a pre-specified value which denotes the number of grid cells in each dimension, and d is the number of dimensions, which is equal to 3 in our case). The hash function for grid cell assignment is

h(x_{ij}) = \begin{cases} \lfloor m \, x_{ij} \rfloor, & \text{if } x_{ij} < 1 \\ m - 1, & \text{if } x_{ij} = 1 \end{cases}    (1)

where h(x_{ij}) is the index of the grid cell along the jth dimension, x_{ij} is the value of the ith data point in the jth dimension, and j = 1, ..., d, with d the number of dimensions.

We then introduce a grid cell density based method (GCDBM) to find "good" initial centers of the clusters. Based on our observation of the property of the Gaussian distribution, in which the closer a region is to the center of the distribution, the higher the point density in the region will be, we design GCDBM to identify all the grid cells which satisfy the following condition: the cardinality of the grid cell is not smaller than that of any of its neighboring cells. We denote this set of grid cells as G*. Then, GCDBM clusters the data points in each grid cell of G*, and the centers of these clusters are viewed as the seeds. If the number of clusters k is given, GCDBM will only consider those grid cells which belong to G* and whose cardinalities correspond to the first k largest values in G*. If k > |G*| (where |G*| denotes the number of grid cells in G*), GCDBM selects the grid cells with the first k - |G*| largest cardinalities from the remaining grid cells. Fig. 3(a) provides an illustration of GCDBM.


Fig. 4. Illustrating the max–min algorithm and the distribution algorithm.

Since there are three grid cells whose cardinalities are larger than those of their neighboring cells, three seeds are obtained, which are shown as triangles in Fig. 3(b).

We then determine the seed candidate list of each non-empty grid cell by a newly proposed max–min algorithm. The seed candidate list L(g) of a non-empty grid cell g is defined as follows.

Definition 1 (Seed candidate list of a grid cell): Let S denote the seed set, and let

d_t(g) = \min_{s_j \in S} d_{\max}(s_j, g)    (2)

If a seed s_j satisfies

d_{\min}(s_j, g) \le d_t(g)    (3)

it is a seed candidate of the grid cell g. The seed candidate list L(g) of the grid cell g contains all the seed candidates which satisfy this condition:

L(g) = \{ s_j \in S : d_{\min}(s_j, g) \le d_t(g) \}    (4)

where d_{\min}(s_j, g) and d_{\max}(s_j, g) are the minimum Euclidean distance and the maximum Euclidean distance between the seed s_j and the grid cell g respectively, as shown in Fig. 4(a).

Fig. 5. Real time clustering algorithm.

Fig. 4(a) illustrates the max–min algorithm. In this algorithm, the non-empty grid cells are considered one by one. The first non-empty grid cell is the cell g_1, and d_t is equal to the maximum distance d_{\max}(s_1, g_1) between the cell and the seed s_1. Then, the algorithm calculates d_{\min}(s_j, g_1) for each seed s_j. Since all the d_{\min} values are greater than d_t except that of s_1, only s_1 is added to L(g_1). Let us also consider the grid cell g_2. In this case, d_t is equal to d_{\max}(s_2, g_2), which is the maximum distance between the grid cell g_2 and the seed s_2. Since d_{\min}(s_1, g_2) \le d_t and d_{\min}(s_2, g_2) \le d_t, the seed candidate list contains two seeds, s_1 and s_2. The max–min algorithm terminates when the seed candidate lists of all the non-empty grid cells have been determined.

Finally, we introduce the distribution algorithm to assign the points in the grid cells to the seeds which are closest to them. The distribution algorithm is motivated by the following lemma.

Lemma 1: If the grid cell g contains the point p, the seed which is closest to p is in the seed candidate list L(g) of the grid cell g.

The proof is given in Appendix I. Based on Lemma 1, it is not necessary to compute the distance between the point p and all the seeds in order to obtain the seed closest to p: only the seeds in the seed candidate list L(g) of the grid cell g which contains the point p need to be considered. As a result, the distribution algorithm assigns the point p to the nearest seed by only considering the seeds in the seed candidate list L(g) of the grid cell which contains p. Fig. 4(b) shows an example of applying the distribution algorithm.

An overview of RTCA is provided in Fig. 5. In order to determine when the algorithm should terminate, an evaluation function is proposed which calculates the gap between the current centers of the clusters and the centers of the clusters in the previous iteration:

\Delta^{(t)} = \sum_{i=1}^{k} \delta_i^{(t)}    (5)

\delta_i^{(t)} = \sqrt{ \sum_{j=1}^{d} \left( c_{ij}^{(t)} - c_{ij}^{(t-1)} \right)^2 }    (6)

where c_{ij}^{(t)} is the jth coordinate of the ith cluster center in the tth iteration, k denotes the number of clusters, and d denotes the number of dimensions.


Fig. 6. Examples of applying the denoising algorithm. (a) Noise-corrupted images; (b) segmentation by RTCA; (c) the binary image; (d) eliminate noise; (e) the final result.

If the algorithm converges, \Delta^{(t)} = 0 in the tth iteration. This implies that the current centers of the clusters in the tth iteration are the same as the centers of the clusters in the (t-1)th iteration. After assigning all the points to the seeds, the algorithm performs the following steps: 1) it recomputes the centers of the clusters and uses these as the new seeds; 2) it assigns all the points to the clusters by the max–min algorithm and the distribution algorithm again; and 3) it calculates the gap by (5) and (6). The algorithm terminates when \Delta^{(t)} = 0.

As a result, the total time consumption of RTCA consists of three parts:

T = T_1 + I (T_2 + T_3)    (7)

where T_1 is the time for determining the number of clusters and the initial seeds by GCDBM, T_2 is the time for determining the seed candidates for each grid cell, T_3 is the time for assigning the data points to the clusters which are closest to them, and I is the number of iterations. The time complexity of RTCA is therefore sensitive to the number of data points n, the number of clusters k, the number of grid cells, the number of dimensions d, the average number of seed candidates in each grid cell \rho, and the number of iterations I. Since \rho is typically much smaller than k, the time complexity of RTCA is approximately O(I n \rho d), compared with O(I n k d) for the standard K-means algorithm.

The major space consumption is for the storage of the data points. For the grid cells, we only store their ids and their cardinalities; the vertices of the grid cells can be calculated by the reverse hashing function. Since the number of grid cells is smaller than the number of data points, the space complexity of RTCA is O(n).

Based on the property of RTCA and the property of the seed candidate list, the following lemma is obtained.

Lemma 2: RTCA obtains the same clusters as K-means.

The proof is given in Appendix II.
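To make the clustering stage concrete, the following Python sketch follows the RTCA steps described above (grid hashing, GCDBM seeding, max–min candidate lists, and the distribution step). It assumes the pixels have already been converted to normalized three-dimensional feature vectors; the grid resolution, the helper names, and the NumPy-based implementation are illustrative choices rather than the authors' code.

```python
import numpy as np
from itertools import product

def rtca(points, k, m=8, max_iter=50, tol=0.0):
    """Sketch of the real-time clustering algorithm (RTCA).

    points : (n, d) array of features already normalized to [0, 1].
    k      : desired number of clusters.
    m      : grid cells per dimension (illustrative default).
    """
    n, d = points.shape
    # --- hash every point into a grid cell, as in Eq. (1) ---
    cell_idx = np.minimum((points * m).astype(int), m - 1)
    cells = {}
    for i, key in enumerate(map(tuple, cell_idx)):
        cells.setdefault(key, []).append(i)

    # --- GCDBM: keep cells whose cardinality is not smaller than any neighbour ---
    counts = {key: len(idx) for key, idx in cells.items()}
    def is_local_peak(key):
        for offset in product((-1, 0, 1), repeat=d):
            if any(offset):
                nb = tuple(key[j] + offset[j] for j in range(d))
                if counts.get(nb, 0) > counts[key]:
                    return False
        return True
    peaks = sorted((key for key in cells if is_local_peak(key)),
                   key=lambda key: -counts[key])
    peak_set = set(peaks)
    rest = sorted((key for key in cells if key not in peak_set),
                  key=lambda key: -counts[key])
    seed_cells = (peaks + rest)[:k]
    seeds = np.array([points[cells[key]].mean(axis=0) for key in seed_cells])
    k = len(seeds)                      # guard against fewer non-empty cells than k

    for _ in range(max_iter):
        assign = np.empty(n, dtype=int)
        for key, idx in cells.items():
            lo, hi = np.array(key) / m, (np.array(key) + 1) / m   # cell bounding box
            # min/max distances from every seed to the box (max-min pruning)
            d_min = np.linalg.norm(seeds - np.clip(seeds, lo, hi), axis=1)
            d_max = np.linalg.norm(np.maximum(np.abs(seeds - lo), np.abs(seeds - hi)), axis=1)
            cand = np.where(d_min <= d_max.min())[0]              # seed candidate list
            # distribution step: points only compare against the candidates
            dist = np.linalg.norm(points[idx][:, None, :] - seeds[cand][None, :, :], axis=2)
            assign[idx] = cand[dist.argmin(axis=1)]
        new_seeds = np.array([points[assign == c].mean(axis=0) if np.any(assign == c)
                              else seeds[c] for c in range(k)])
        gap = np.linalg.norm(new_seeds - seeds, axis=1).sum()     # evaluation function (5)-(6)
        seeds = new_seeds
        if gap <= tol:
            break
    return assign, seeds
```

On a 256 × 384 image this operates on roughly 10^5 Lab triples, and the per-cell candidate lists typically contain far fewer seeds than k, which is where the speed-up over plain K-means comes from.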

We adopt the Davies–Bouldin index (DBI) [47] to measure the quality of the resulting clusters and to choose the value of k. Given a set of k regions, the index is calculated as follows:

DBI = \frac{1}{k} \sum_{i=1}^{k} D_i    (8)

D_i = \max_{j \ne i} R_{i,j}    (9)

R_{i,j} = \frac{S_i + S_j}{M_{i,j}}    (10)

where S_i is the mean distance between the points in the ith region and the centroid of the region, and M_{i,j} corresponds to the distance between the centroids of the regions i and j. If the clusters are compact and far from each other, the value of the DBI will be small. The best k value is calculated as follows:

k^{*} = \arg\min_{k_{\min} \le k \le k_{\max}} DBI(k)    (11)

where k_{\min} and k_{\max} are the minimum and maximum k values respectively. The segmentation result based on the best k value is adopted. Compared with K-means, RTCA achieves a low running time without sacrificing the quality of the clusters, which represents one of its main advantages.

C. Denoising Algorithm

We then apply the denoising algorithm, which removes the discontinuous small regions whose areas are smaller than a threshold. Our denoising algorithm is motivated by the observed properties of the noise and outliers, which are essentially a set of discontinuous and scattered pixels occupying small areas. The denoising algorithm considers the segmented regions one by one, and applies morphological operations [48], [49] to eliminate all connected components whose areas are smaller than a threshold, which is set at 0.5% of the total image area. Fig. 6 shows examples of applying this algorithm to the images.
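The denoising step can be sketched as follows; `scipy.ndimage` connected-component labelling is used here as one possible stand-in for the morphological operations of [48], [49], and only the 0.5% area threshold is taken from the text.

```python
import numpy as np
from scipy import ndimage

def remove_small_regions(labels, min_fraction=0.005):
    """Suppress connected components smaller than a fraction of the image area.

    labels : (H, W) array of non-negative integer cluster labels from the segmentation.
    Small components are merged into the most common neighbouring label, a simple
    stand-in for the morphological clean-up described in the paper.
    """
    out = labels.copy()
    h, w = labels.shape
    min_area = int(min_fraction * h * w)
    for value in np.unique(labels):
        mask = labels == value
        comps, n_comps = ndimage.label(mask)        # connected components of one region
        sizes = ndimage.sum(mask, comps, index=np.arange(1, n_comps + 1))
        for comp_id, size in enumerate(sizes, start=1):
            if size < min_area:
                comp_mask = comps == comp_id
                ring = ndimage.binary_dilation(comp_mask) & ~comp_mask
                if ring.any():
                    out[comp_mask] = np.bincount(out[ring]).argmax()
    return out
```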


TABLE I STATISTICAL OBSERVATIONS OF THE PROPERTIES OF DIFFERENT REGION TYPES

Fig. 7. Extracting the background. (a) Original image; (b) major background; (c) secondary background; (d) salient objects.

Fig. 8. Synthetic examples.

TABLE II VALUES OF THE BACKGROUND MEASURE (B)

D. New Background Measure

In the next step, RADA calculates the background measure for each region. In order to obtain the properties of the background, we 1) perform image segmentation on 1000 images which are selected randomly from the Corel image library, 2) identify the salient object and the major background in each image, and 3) compare the properties of the salient object with those of the major background. These properties include the extent of the region, the standard deviation with respect to the centroid of the region, the standard deviation with respect to the image center, and the luminance. For each property, we record the number of images in which the value associated with the background is larger than that associated with the salient object, and the number of images in which it is smaller, and obtain several interesting observations from these statistical results in Table I.

Observation 1: Most of the major background regions have larger extents along the x-axis or y-axis than those of the salient objects in the same image, since these regions have elongated shapes.

Observation 2: Most of the major background regions have larger standard deviations with respect to their centroids than those of the salient objects in the same image, since the shapes of these regions are far from approximately circular.

Observation 3: Most of the major background regions have larger standard deviations with respect to the center of the image than those of the salient objects in the same image, since these regions are far away from the center.

Observation 4: Most of the major background regions have smaller values of luminance than those of the salient objects in the same image, since these regions are usually darker than the salient objects.

If a region possesses all the properties of the above four observations at the same time, there is a high probability that the region will be identified as a background region. As a result, the definition of the background measure is based on these four observations. The formulation of the new background measure for the region R_i (i = 1, ..., n, where n is the number of regions) is as follows:


B_i = w_1 E_i + w_2 M_i + w_3 D_i + w_4 (1 - L_i)    (12)

E_i = \max\!\left( \frac{E_x^{i}}{W}, \frac{E_y^{i}}{H} \right)    (13)

M_i = \frac{1}{|R_i|} \sum_{(x,y) \in R_i} \left[ (x - \bar{x}_i)^2 + (y - \bar{y}_i)^2 \right]^{p/2}    (14)

D_i = \sqrt{ \frac{1}{|R_i|} \sum_{(x,y) \in R_i} \left[ (x - x_c)^2 + (y - y_c)^2 \right] }    (15)

L_i = \frac{1}{|R_i|} \sum_{(x,y) \in R_i} l(x, y)    (16)

w_1 + w_2 + w_3 + w_4 = 1    (17)

where w_1, w_2, w_3, and w_4 are the weight parameters, E_x^{i} and E_y^{i} denote the maximum extent of the region along the x axis and the y axis respectively, W and H are the width and the height of the image respectively, |R_i| is the cardinality of the ith region, E_i is the maximum value between E_x^{i}/W and E_y^{i}/H, M_i is the discrete form of the normalized moment for the ith region, (x, y) are the coordinates of a pixel, (\bar{x}_i, \bar{y}_i) is the centroid of the ith region, p is the order of the normalized moment (the default value of p is 2), D_i is the standard deviation between the pixels in the ith region and the center (x_c, y_c) of the image, l(x, y) is the pixel luminance, and L_i is the average luminance of the ith region (so that darker regions receive a larger contribution, in line with Observation 4). The value B_i of the background measure is normalized within the range [0,1] by the following equation:

\hat{B}_i = \frac{B_i - \min_j B_j}{\max_j B_j - \min_j B_j}    (18)

According to the definition of the new background measure B, the following rule is obtained.

Rule 1: If a region has the maximum value of B, the region forms part of the background.

The background often consists of several regions. In order to identify these regions, a threshold is obtained automatically by the algorithm. We perform the binary K-means algorithm to divide the set of background measure values into two subsets, and calculate the mean values \mu_1 and \mu_2 of the two subsets. The threshold is then calculated as follows:

T_B = \frac{\mu_1 + \mu_2}{2}    (19)

Rule 2: If the measure value of a region is not smaller than T_B, the region forms part of the background.

Fig. 7 illustrates the background area which is extracted from the image by the new background measure. The regions which form the background occupy a large part of the image area, while the salient object is found near the image center.

The background measure not only considers feature contrast, but also the effect of the extent, the shape and the location of the region. Fig. 8 shows three synthetic examples which illustrate the influence of these three factors. In Fig. 8(a), a gray region, which is surrounded by a black region, in turn contains a white region at the center of the image. In Fig. 8(b), a gray region contains a white region and a black region at different locations. In Fig. 8(c), a small white region is located at the corner of the image, which is otherwise occupied by a black region and a gray region. Table II lists the corresponding background measure values of all the regions in the synthetic images (with the associated weights w_1, w_2, w_3, and w_4). The black region in Fig. 8(a) is the background according to Rule 1 since it has the maximum B value. The threshold is selected automatically by the algorithm, and the gray region is also identified as part of the background according to Rule 2.

Fig. 9. Example of applying Rule 3 and Rule 4.

Fig. 10. Example of applying Rule 5.
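A sketch of the background measure and of the Rule 1/Rule 2 decision is given below. The region statistics mirror Observations 1–4, but the exact functional forms, the extra normalization of the spread terms by the image diagonal (added only to keep the four terms on comparable scales), and the (1 − luminance) term follow the reconstruction of (12)–(19) above and should be read as assumptions rather than the authors' exact formulation.

```python
import numpy as np

def background_measures(labels, luminance, weights=(0.25, 0.25, 0.25, 0.25)):
    """Compute a normalized background measure B_i for every segmented region.

    labels    : (H, W) integer region labels.
    luminance : (H, W) array in [0, 1].
    """
    h, w = labels.shape
    yc, xc = (h - 1) / 2.0, (w - 1) / 2.0
    diag = np.hypot(h, w)
    w1, w2, w3, w4 = weights
    raw = []
    for r in np.unique(labels):
        ys, xs = np.nonzero(labels == r)
        extent = max((xs.max() - xs.min() + 1) / w,
                     (ys.max() - ys.min() + 1) / h)                              # Obs. 1
        spread = np.sqrt(np.mean((xs - xs.mean()) ** 2 + (ys - ys.mean()) ** 2)) / diag   # Obs. 2
        centre_dev = np.sqrt(np.mean((xs - xc) ** 2 + (ys - yc) ** 2)) / diag             # Obs. 3
        lum = luminance[ys, xs].mean()                                           # Obs. 4
        raw.append(w1 * extent + w2 * spread + w3 * centre_dev + w4 * (1.0 - lum))
    raw = np.asarray(raw)
    return (raw - raw.min()) / (raw.max() - raw.min() + 1e-12)                   # Eq. (18)

def background_threshold(B):
    """Rule 2 threshold: binary two-means split of the measure values, Eq. (19)."""
    t = B.mean()
    for _ in range(100):
        lo, hi = B[B < t], B[B >= t]
        new_t = (lo.mean() + hi.mean()) / 2.0 if len(lo) and len(hi) else t
        if abs(new_t - t) < 1e-9:
            break
        t = new_t
    return t
```

Rule 1 then corresponds to `B.argmax()`, and Rule 2 to selecting every region with `B >= background_threshold(B)`.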


Fig. 11. Hierarchical visual attention regions.

In Fig. 8(b), the background (the gray region) and the salient objects (the white square and the black square) are identified based on Rule 1 using the threshold 0.71. Although the white region in Fig. 8(c) is located at the edge of the image, it is not identified as part of the background, since the extent and the size of the region are small, and the corresponding value of B is smaller than those of the other regions. The correct background regions are extracted by Rule 2 based on the threshold T_B.

E. Detecting Hierarchical Visual Attention Regions by Confidence Factor Ranking

In the next step, we compute a confidence factor for each region. To perform this effectively, a fuzzy membership function is designed to rank the probabilities of the regions to be considered as visual attention regions. Based on the set of background measure values, the fuzzy membership function u_i for the ith region is defined as follows:

u_i = \frac{1}{\sum_{j=1}^{n} \left( B_i / B_j \right)^{2/(q-1)}}    (20)

where B_i and B_j denote the background measures of the ith region and the jth region respectively, n is the number of regions, and q is the fuzziness exponent (here q is set to 3 in the experiment). If the value of B_i with respect to the other values of B_j is small, the value of u_i will be large, and vice versa. In other words, if u_i is large, the ith region has a high probability to be identified as a visual attention region.

In order to further evaluate the significance of visual attention regions, a confidence factor is defined based on the gaps in the membership values. We first sort the membership values of all the regions in descending order and reorder the labels of the regions. Then, the confidence factor CF_i for the ith region, which is used to determine whether the ith region contains the salient object, is defined as follows:

CF_i = \frac{u_i - u_n}{u_1 - u_n}    (21)

where n is the number of regions and u_1 \ge u_2 \ge \cdots \ge u_n are the sorted membership values. If the value of CF_i is large, the ith region has a high probability to be identified as the salient object. Otherwise, the ith region will be identified as the background. The first region, with the maximum membership value, has CF equal to 1, while the nth region, with the smallest membership value, has CF equal to 0. The range of the other CF values is within [0,1].

Rule 3: If the CF value of the region is 1, the region forms part of the visual attention region.

Rule 4: If the CF value of the region is 0, the region forms part of the background.

Fig. 9 shows examples in which Rule 3 and Rule 4 are applied. The images in the second column are the background extracted by Rule 4, while the images in the third column are the visual attention regions selected by Rule 3.

Rule 5: If a suitable threshold \theta for the confidence factor is selected, the regions with CF_i \ge \theta form the visual attention regions and the regions with CF_i < \theta form the background.

The threshold \theta in Rule 5 is obtained automatically as follows: we first perform binary K-means to separate the set of the CF values into two clusters. Then, \theta is set equal to (c_1 + c_2)/2, where c_1 and c_2 are the centers of the clusters. Fig. 10 shows an example of applying Rule 5. Based on the threshold \theta obtained by the algorithm, the visual attention regions and the background can be distinguished easily.

Motivated by Rule 5, we propose a new definition for a set of visual attention regions with a hierarchical relationship. The definition of hierarchical visual attention regions (HVAR) is as follows.

Definition 2 (Hierarchical visual attention regions): We assume that 1) there exist n regions, 2) the CF values of the regions are sorted in descending order, and 3) the labels of the regions are reordered according to their corresponding positions in the ranked list of CF values. We can define a set of hierarchical visual attention regions HVAR = \{VAR_1, VAR_2, \ldots, VAR_{n-1}\} as follows:

VAR_j = \bigcup_{i=1}^{j} R_i, \qquad j = 1, \ldots, n-1    (22)

so that VAR_1 \subset VAR_2 \subset \cdots \subset VAR_{n-1}.
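Eqs. (20)–(22) and Rules 3–5 can be sketched as follows; the FCM-style membership and the linear rescaling of the confidence factor are assumptions consistent with the stated boundary behaviour (CF = 1 for the top-ranked region and CF = 0 for the last one).

```python
import numpy as np

def hierarchical_vars(B, q=3.0):
    """Rank regions by confidence factor and build the nested HVAR index sets.

    B : array of (normalized) background measures, one per region.
    Returns (order, cf, hvar, threshold), where hvar[j] lists the region
    indices forming VAR_{j+1}.
    """
    B = np.asarray(B, dtype=float) + 1e-12
    # fuzzy membership (20): a small B relative to the others gives a large u
    u = 1.0 / ((B[:, None] / B[None, :]) ** (2.0 / (q - 1.0))).sum(axis=1)
    order = np.argsort(-u)                       # regions in descending membership
    u_sorted = u[order]
    # confidence factor (21): rescaled so that CF_1 = 1 and CF_n = 0
    cf = (u_sorted - u_sorted[-1]) / (u_sorted[0] - u_sorted[-1] + 1e-12)
    # Rule 5 threshold: binary two-means split of the CF values
    t = cf.mean()
    for _ in range(100):
        lo, hi = cf[cf < t], cf[cf >= t]
        if len(lo) == 0 or len(hi) == 0:
            break
        new_t = (lo.mean() + hi.mean()) / 2.0
        if abs(new_t - t) < 1e-9:
            break
        t = new_t
    # nested regions (22): VAR_j is the union of the first j ranked regions
    hvar = [order[: j + 1].tolist() for j in range(len(order) - 1)]
    return order, cf, hvar, t
```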


Fig. 12. More examples of hierarchical visual attention regions. (a) Original image; (b) primary salient object; (c) secondary salient object; (d) tertiary salient object or background; (e) hierarchical region of interest.

Fig. 11 illustrates an example of hierarchical visual attention regions. The most salient object detected by our approach is the sun. The next salient object is the red cloud. The remaining objects (dark blue sky, ground) are regarded as background regions.

Fig. 12 provides more examples of hierarchical visual attention regions (HVAR) detected by RADA (in the figure, the red rectangles denote the individual visual attention regions, while the green rectangles denote the set of HVAR).


The relative importance of the set of visual attention regions is correctly characterized in most of the cases. However, there are exceptions, as in the case of the image in the 13th row. In this example, the algorithm assigns a lower saliency to the big symbol "3", while it assigns a high saliency to the symbol "day". Since the symbol "day" has greater luminance and a smaller area than the symbol "3", our algorithm identifies the symbol "day" as the more salient object first. This can be improved if we also take into account the relative strengths of the object edges and corners in determining saliency.

Since the major time consumption of RADA is at the stage of image segmentation, which takes O(n) time, and the computation of the background metric, which also takes O(n) time, the time complexity of RADA is O(n). The space complexity of RADA is also O(n).

F. Adaptive Image Display Module

By combining RADA with the MPEG-21 standard framework, we can develop an adaptive image display module to automatically extract HVAR and determine the corresponding adaptation operation on the screen of the terminal devices. The objective of the adaptive image display module is to preserve the important information in the image as much as possible under the constraint of the screen. Fig. 13 illustrates the process of adaptive image display (here we assume the aspect ratio of the image is the same as that of the screen).

IV. EXPERIMENT

A. Experimental Setting and Data Set

We experimentally evaluate the performance of the proposed approach in this section. All the experiments are performed on a Pentium 3.2 GHz CPU with 1 GByte of memory. There are 1834 images used in our experiments: 1000 images are collected from the Corel Dataset (CD), 722 images come from the Butterflies and Birds archive (BB) of Ponce's research group [25], [26], and 112 images are collected from our personal albums (PA). Each pixel in the image is viewed as a data point with its corresponding L value, a value, and b value in the CIE Lab color space. The Euclidean distance is used to measure the similarity between two data points.

In the following experiments, we first compare RADA based on RTCA (RADA(RTCA)) with RADA based on K-means (RADA(K-means)) to illustrate the efficiency of RADA(RTCA). Then, a user investigation is carried out to evaluate the effectiveness of our approach. Next, we compare RADA with the saliency map based visual attention model [4], which detects visual attention regions at the pixel level, to show the advantage of detecting visual attention regions at the object level. We then compare RADA with NN-GPCA (a subspace estimation algorithm based on generalized principal component analysis and the nearest neighbor method) [1], which also detects visual attention regions at the object level, to illustrate the efficiency of RADA. Finally, we investigate the sensitivity of the results to the parameters.
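For completeness, the per-pixel feature extraction assumed throughout the experiments can be set up as below; `skimage.color.rgb2lab` is one common way to obtain CIE Lab values, and the rescaling of the a and b channels to [0, 1] is an approximation, not a detail taken from the paper.

```python
import numpy as np
from skimage.color import rgb2lab

def lab_feature_points(rgb_image):
    """Turn an RGB image into an (n, 3) array of CIE Lab features scaled to [0, 1].

    rgb_image : (H, W, 3) uint8 or float array.
    """
    lab = rgb2lab(rgb_image)               # L in [0, 100], a/b roughly in [-128, 127]
    L = lab[..., 0] / 100.0
    a = (lab[..., 1] + 128.0) / 255.0      # approximate rescaling (assumption)
    b = (lab[..., 2] + 128.0) / 255.0
    points = np.stack([L, a, b], axis=-1).reshape(-1, 3)
    return np.clip(points, 0.0, 1.0)
```

The Euclidean distance between these normalized triples is the similarity metric fed to RTCA.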


Fig. 13. Adaptation operation.

B. Comparison of RTCA and K-Means

In this section, we compare RADA based on RTCA (RADA(RTCA)) with RADA based on K-means (RADA(K-means)). The number of grid cells is kept fixed across all experiments, and the size of the images in the experiments is either 256 × 384 or 384 × 256. Fig. 14 illustrates the segmented regions, the detected salient object and the detected background obtained by RADA(RTCA) and RADA(K-means). It can be observed that the visual results obtained by RADA(RTCA) and RADA(K-means) are indistinguishable, while RADA(RTCA) results in a significant reduction in computation time, as shown in Fig. 15(a). There are two reasons which lead to the reduction of computation time. One reason is that GCDBM in RADA(RTCA) finds "good" initial centers of the clusters, which decreases the number of iterations as shown in Fig. 15(b). The other reason is that the max–min algorithm and the distribution algorithm reduce the number of distance computations between the points and the centers of the clusters, as shown in Fig. 15(c).

We further search for the best k value using the Davies–Bouldin index. Due to Lemma 2, both RTCA and K-means attain the same DB index for different images. We then adopt this index to select the number of clusters for segmentation. Fig. 16 illustrates three examples and their segmentation results by RTCA with respect to the best k value, which corresponds to the smallest DBI value in each row of Table III. We can see that the optimal k value correctly characterizes the image structure in each of the examples.

C. User Investigation

As is well known, there does not exist a standard measure to evaluate the correctness of visual attention regions. Therefore, we carry out a user investigation to evaluate the effectiveness of our approach. The investigation consists of three parts which correspond to the sequence of image detail scanning by the users: 1) the major visual attention region, which highlights the salient objects of the image; when the users scan the image, this region is the first that the user pays attention to; 2) the secondary visual attention region, which contains objects of secondary importance; after the users finish scanning the major visual attention region, they turn their attention to the secondary


Fig. 14. Comparison of visual results obtained by RADA(RTCA) and RADA(K-means).

Fig. 15. Performance comparison between RADA(RTCA) and RADA(K-means). (a) The running time; (b) number of iterations; (c) number of distance computation.

Fig. 16. Segmented results based on the best k value.

TABLE III VALUES OF THE DAVIES-BOULDIN INDEX (DBI) FOR DIFFERENT NUMBER OF SEGMENTED COMPONENTS

visual attention region; and 3) the major background, which corresponds to the maximum value of the background measure and indicates the location where the image was taken. The set of hierarchical visual attention regions summarizes the sequence corresponding to 1)–3).

There are ten users who participated in the investigation, and six of the ten users are familiar with photography. They are asked to consider the images one by one in the investigation. The users first look at the original image and determine the following three regions: the major

attention region, the secondary attention region, and the major background, by specifying their boundaries. Our system then generates the corresponding three regions in the same image. By comparing the two results, the users are required to evaluate the effectiveness of our results as A (Perfect), B (Good), C (Pass), or D (Fail).

Since our approach detects visual attention regions at the object level, the major task of the user investigation is to evaluate whether the results have semantic meaning. If the salient object is a flower, and the visual attention region detected by our


approach does not contain the flower, our approach fails. Regardless of the differences between the users in their interests, their inherent skills, the amount of training, etc., they can still unambiguously identify the failure case (D), which occurs when the degree of overlap between the user-selected regions and the generated regions is smaller than a threshold. For the other three cases, A (Perfect), B (Good), and C (Pass), different users have different points of view. As a result, the effectiveness of our approach is mainly measured by the failure case (D).

The results of our approach are separated into three categories according to the categories of the original images: CD (1000 images), BB (722 images), and PA (112 images). We use the following voting result VR to define the degree of satisfaction of the users:

VR = \frac{N_V}{N}    (23)

where N_V is the number of images which receive the grade V in each category, and N is the total number of images in that category. Fig. 17 and Fig. 18 illustrate some example results.

Table IV shows the voting results corresponding to the identification of the major visual attention region. Since the Corel dataset is a general data set which consists of all kinds of photos, the result based on this data set can represent the average performance of our approach. As shown in Table IV, our approach works well on CD: only 3% of the results are marked as D (Fail), while most of the results are marked as B. In particular, 8% of the results are marked as A. Our approach works well on BB as well, since the themes of the photos in BB are well defined. Due to the difficulty of distinguishing the major objects in the images of PA from the secondary objects, PA is the least satisfactory one among the three categories in terms of the detection results.

Table V illustrates the evaluation results on the secondary attention region. It is observed that the overall performance of detecting the secondary visual attention region is not as satisfactory as that of detecting the major visual attention region. Only 5% of the images are marked as A, while 7% of the images are considered as D in the Corel data set. The number of images marked B (48%) is close to that of images marked C (40%) in the Corel data set. This could be due to the confusion of the major visual attention region with the secondary visual attention region for some images, as indicated by the example in the second column of Fig. 18. In this image, the human faces should be the major salient objects and the shirts should be the secondary salient objects, but the algorithm confuses the order of the faces and the shirts. Due to weak contrast or poor exposure, the correct detection rate corresponding to PA decreases as well. Fig. 19 illustrates more unsatisfactory cases of our approach. In the first row of Fig. 19, the major salient object (the flower) is confused with the secondary salient object (the butterfly), since the butterfly, which is at the center of the image, has a smaller value of the background measure. On the other hand, in the second and third rows of Fig. 19, the salient object is confused with the background due to the low color contrast.

Table VI illustrates the statistical evaluation results for the major background region. The percentage of the failed cases is

Fig. 17. Examples of hierarchical visual attention regions in CD and BB. (a) Original image; (b) primary salient object; (c) secondary salient object; (d) tertiary salient object or background; (e) hierarchical region of interest.


Fig. 18. Examples of hierarchical visual attention regions in PA.

Fig. 19. Unsatisfactory cases.

TABLE IV MAJOR VISUAL ATTENTION REGION (VOTING RESULTS)

TABLE VI MAJOR BACKGROUND (VOTING RESULTS)

TABLE V SECONDARY VISUAL ATTENTION REGION (VOTING RESULTS)

as small as 1% in CD and BB, and 2% in PA. This indicates that the background metric works well in these three categories.

Table VII shows the overall evaluation results for the detection of hierarchical visual attention regions. As a whole, the detected hierarchical visual attention regions conform to user opinion, since there are only 5% failed cases in CD, 3% failed cases in BB, and 6% failed cases in PA. The ranked order of the salient objects is similar to the order in which users pay attention to the objects during the perceptual process. The evaluation results indicate that the set of hierarchical visual attention regions can potentially relate the low-level features to higher-level representations.


Fig. 20. Comparing RADA with SMVA.

TABLE VII HIERARCHICAL VISUAL ATTENTION REGIONS (VOTING RESULTS)

We also investigate the preference of the users between single-level VAR and hierarchical VAR in the different image categories. As shown in Table VIII, they prefer HVAR due to its more accurate representation of the image and the additional amount of information available. In addition, they find that HVAR also provides useful contextual information in the form of a set of hierarchical regions when compared with traditional VAR. As a result, the user investigation provides an effective way to evaluate the correctness of visual attention regions, and the effectiveness of our approach has been verified on a variety of images through these investigations.

In addition, based on the HVAR representation, we can develop an adaptive image display module as described in Section III-F, which selects different VARs for different screen sizes while maintaining the accuracy. If the screen size is small, only the major attention region will be sent to the user. On the other hand, if the screen size is greater than the major attention region, the secondary attention region will be sent. If the screen is large enough, the major background region will also be sent.
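One possible sketch of this adaptation policy is given below, assuming each level of the hierarchy is summarized by its bounding box and that the image and the screen share the same aspect ratio, as in Fig. 13; the function and parameter names are illustrative.

```python
def select_adapted_view(hvar_boxes, screen_w, screen_h):
    """Pick the largest nested VAR whose bounding box still fits the screen.

    hvar_boxes : list of (x0, y0, x1, y1) boxes ordered VAR_1, VAR_2, ... (nested,
                 increasing in size), with the full image appended as the last entry.
    Returns the chosen box; if even VAR_1 is too large, it is returned anyway and
    the caller is expected to downscale it to the screen.
    """
    chosen = hvar_boxes[0]
    for box in hvar_boxes:
        x0, y0, x1, y1 = box
        if (x1 - x0) <= screen_w and (y1 - y0) <= screen_h:
            chosen = box          # this level still fits; try the next, larger one
        else:
            break
    return chosen
```

A small phone screen thus receives only the major attention region, while a sufficiently large display receives the whole image.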

TABLE VIII PREFERENCE OF THE USER (VOTING RESULTS)

D. Comparison With the Saliency Map Based Visual Attention Model


We next compare RADA with the saliency map based visual attention model (SMVA) [4]. Fig. 20 shows the segmentation results and the detected visual attention regions (red rectangles) by RADA and SMVA. It is observed that the segmentation results of RADA are more accurate than those of SMVA. This in turn results in a more accurate set of detected visual attention regions, as can be observed when we compare the detection results of the flower region in the first, third and sixth column. We can notice that, in the case of SMVA, one quarter of the flower in the first, third and sixth rows of Fig. 20 is excluded from the detected region, while this is included in the RADA results. In order to illustrate the robustness of RADA, we show that our approach can detect the visual attention regions in different cases where the salient objects are at the center of the image or near the boundary of the image, while SMVA fails in some of these cases. Case 1) The salient object is close to the center of the image. Fig. 21 shows two examples in which the red rectangle indicates the visual attention region. The salient object in the first example is the butterfly. It


Fig. 21. Examples of Case 1.

Fig. 22. Examples of Case 2.

is observed that RADA detects this salient object accurately, while the VAR detected by SMVA does not contain the whole butterfly. The salient object in the second example is the cow. RADA again detects the complete salient object, while the VAR of SMVA only includes part of the cow.

Case 2) The salient object is close to one of the edges of the image. Fig. 22 shows two examples of Case 2. SMVA fails to detect the salient object (horses) in the first example, while the VAR of RADA accurately encapsulates the horses. In the second example, the VAR


Fig. 23. Examples of Case 3.

Fig. 24. Examples of Case 4.

of SMVA only contains small parts of the rose, while RADA fully detects the rose. Although the salient objects are not at the image center in both examples, their small extents and small standard deviations with respect to their centroids ensure their successful detection. Case 3) The salient object is close to the corner of the image. Fig. 23 shows two examples of Case 3. SMVA fails in both examples, while RADA successfully detects the salient objects: the sun in the first image and the man in the second image. Again, in this case, their successful detection is due to their small extents and small standard deviations with respect to their centroids. Case 4) More than one salient object is in the image.

Fig. 24 illustrates two examples of Case 4. In the first example, we can see that the VAR of RADA correctly encapsulates most of the flowers, while that of SMVA includes a large portion of the leaf region. In the second example, RADA correctly detects the four men in the image, while SMVA finds only three of them. As a result, our approach is robust for the detection of the visual attention regions in many different situations.

E. Comparison With NN-GPCA

In order to provide a fair comparison with NN-GPCA, we adopt the same experimental conditions as in [1], in which the data points correspond to the mean of the features over a 2 × 2 block in the image. Fig. 25 shows the segmentation results and the detected visual attention regions by RADA and NN-GPCA


Fig. 25. Comparing RADA with NN-GPCA.

for some example images. It is observed that the segmentation results of the two approaches are comparable, while RADA results in a significant reduction in computation time, as shown in Fig. 26.

Fig. 26. Efficiency comparison.

F. Effect of the Parameters

In this section, we discuss how to determine the parameters w_1, w_2, w_3, and w_4 for calculating the background measure, and the effect of these parameters on the performance. We first select 1000 images randomly from the Corel image library for parameter selection. These images are then removed from the library, and the image set used in Sections IV-A to IV-E is selected from the remaining entries. Since our choice of the parameters based on the first set of reference images results in good performance on the second set, which has substantially different characteristics, the selected parameters are suitable for a large variety of image types and the approach can generalize to unseen data.

The set of reference images is first segmented by the real-time clustering algorithm. The segmented regions are separated into the background class and the salient object class manually to form the ground truth. The suitable parameters w_1, w_2, w_3, and w_4 are then selected to minimize the total classification error E:

E = \frac{1}{K} \sum_{i \ne j} c_{ij}    (24)

(w_1^{*}, w_2^{*}, w_3^{*}, w_4^{*}) = \arg\min_{w_1 + w_2 + w_3 + w_4 = 1} E    (25)

where c_{ij} is the (i, j)th entry of the confusion matrix C, and K is the number of segmented regions. Since there are two classes (background class and salient object class) in our case, C is a 2 × 2 matrix. We enumerate all the possible combinations of w_1, w_2, w_3, and w_4 with fixed increments of 0.005 to evaluate the minimum value of E. In our experiment, the combination of w_1, w_2, w_3, and w_4 which minimizes the objective function is adopted.

We then investigate the sensitivity of the results to the weight parameters. Specifically, each of the weight parameters is removed in turn, while the others are adjusted to calculate the values of the background measures and the confidence factors. Based on the confidence factors, the algorithm separates the salient objects from the background. By comparing with the ground truth classes, we obtain the total classification error through (24) and (25). These errors are listed in Table IX. The baseline value corresponds to the case where all weight parameters have the same value 0.25. In the remaining four cases, one of the weight parameters is set to 0, while the others are set to 1/3. As shown in Table IX, the values in the last four rows are all greater than the values in the first and second rows. This implies that all the weight parameters are essential to the algorithm. If one of the weight parameters is removed, the accuracy of the algorithm will be affected.
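The weight selection can be sketched as an exhaustive search over the weight simplex with the 0.005 step mentioned above; the error function follows the reconstruction of (24)–(25), and `predict` is a placeholder hook standing in for the background/salient-object decision of Section III-D.

```python
import numpy as np
from itertools import product

def classification_error(pred_is_background, true_is_background):
    """Total classification error E: fraction of regions assigned to the wrong class."""
    return np.mean(np.asarray(pred_is_background) != np.asarray(true_is_background))

def search_weights(reference_regions, predict, step=0.005):
    """Brute-force enumeration of weight combinations w1..w4 that sum to 1.

    reference_regions : iterable of (region_features, true_is_background) pairs built
                        from the manually labelled reference images.
    predict           : callable(region_features, weights) -> bool, the background
                        decision (e.g., thresholding the background measure).
    """
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best = (None, np.inf)
    for w1, w2, w3 in product(grid, repeat=3):       # w4 is fixed by the simplex constraint
        w4 = 1.0 - w1 - w2 - w3
        if w4 < -1e-9:
            continue
        weights = (w1, w2, w3, max(w4, 0.0))
        pred, true = [], []
        for feats, is_bg in reference_regions:
            pred.append(predict(feats, weights))
            true.append(is_bg)
        err = classification_error(pred, true)
        if err < best[1]:
            best = (weights, err)
    return best
```

Note that the full enumeration is intentionally brute force; in practice the search can be coarsened first and refined around the best coarse solution.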


TABLE IX TOTAL CLASSIFICATION ERROR (E)

In addition, we also notice that the algorithm is more sensitive to three of the weight parameters than to the fourth, since the value of E increases rapidly when one of the former is removed, while the increase is less abrupt when the fourth is dropped.

V. CONCLUSION AND FUTURE WORK

In this paper, we propose a rule based approach for the extraction of visual attention regions at the object level by a real-time clustering algorithm. Although there exist a number of approaches to detect visual attention regions, few of them address the issue of the semantic gap between the visual attention regions and high-level semantics. Our major contribution is a rule based automatic detection algorithm (RADA) which detects hierarchical visual attention regions at the object level in real time. RADA consists of four stages. First, RADA performs image segmentation on an image by a real time clustering algorithm. Then, a denoising algorithm is applied to perform noise removal. In the next step, RADA calculates the value of the background measure for each region. This measure combines feature contrast with the geometric property of the region to identify the background effectively. Finally, RADA detects a set of hierarchical visual attention regions at the object level based on a confidence factor. Our experiments show that RADA bridges the gap between visual attention regions and high-level semantics when applied to different types of images.

In the future, we shall further investigate the performance of the algorithm based on different features, the effect of the aspect ratio on the choice of which subset of visual attention regions is to be displayed, and how RADA can be integrated into a content based image retrieval system.

APPENDIX I
PROOF OF LEMMA 1

Proof: If a seed s satisfies d_min(s, g) > min_{s_j \in S} d_max(s_j, g), it is not the closest seed of the point p when p falls into the grid cell g, since there exists at least one seed s' which satisfies d_max(s', g) \le d_min(s, g), and the seed s' is closer to the point p than the seed s. After removing all the seeds which satisfy this condition, the closest seed of the point p must be one of the remaining seeds. Since 1) the remaining seeds satisfy d_min(s, g) \le min_{s_j \in S} d_max(s_j, g), and 2) the seed candidate list L(g) of the grid cell g stores exactly the candidates which satisfy this condition (Definition 1), all the remaining seeds will be added to the seed candidate list of the cell g. As a result, the seed which is closest to the point p is in the seed candidate list, and Lemma 1 is proved.

APPENDIX II
PROOF OF LEMMA 2

Proof: According to Lemma 1, the closest seed which is obtained by calculating the distances between the point p and all the centers of the clusters in K-means is the same as the closest seed which is obtained by calculating the distances between the point p and the seed candidates in the seed candidate list of the grid cell in RTCA. In other words, RTCA only accelerates the process of finding the closest seed for each point p. As a result, RTCA obtains the same clusters as K-means.

REFERENCES

[1] Y. Hu, D. Rajan, and L.-T. Chia, "Robust subspace analysis for detecting visual attention regions in images," in ACM Multimedia 2005, 2005, pp. 716–724. [2] U. Rutishauser, D. Walther, C. Koch, and P. Perona, "Is bottom-up attention useful for object recognition?," in Proc. 2004 IEEE Comput. Soc. Conf. Computer Vision and Pattern Recognition, Washington, DC, Jul. 2004, vol. 2, pp. 37–44. [3] Y.-F. Ma and H. J. Zhang, "Contrast-based image attention analysis by using fuzzy growing," in ACM Multimedia 2003, 2003, pp. 374–381. [4] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998. [5] A. P. Bradley and F. W. Stentiford, "Visual attention for region of interest coding in jpeg2000," J. Vis. Commun. Image Represent., vol. 14, no. 3, pp. 232–250, Sep. 2003. [6] Y. Hu, X. Xie, Z. Chen, and W.-Y. Ma, "Attention model based progressive image transmission," in ICME 2004, 2004, pp. 1079–1082. [7] Y. Hu, L.-T. Chia, and D. Rajan, "Region-of-interest based image resolution adaptation for mpeg-21 digital item," in Proc. 12th Annu. ACM Int. Conf. Multimedia, New York, 2004, pp. 340–343. [8] K. Cheoi and Y. Lee, "Detecting perceptually important regions in an image based on human visual attention characteristic," in SSPR/SPR 2002, 2002, pp. 329–338. [9] J. Bormans and K. Hill, "MPEG-21 overview V.5," ISO/IEC JTC1/SC29/WG11/N5231, Oct. 2002. [10] A. Vetro and C. Timmerer, "ISO/IEC 21000-7 FCD—Part 7: Digital item adaptation," ISO/IEC JTC 1/SC 29/WG 11/N5845, Jul. 2003. [11] D. Mukherjee, G. Kuo, S. Liu, and G. Beretta, "Motivation and use cases for decision-wise BSDLink, and a proposal for usage environment descriptor-adaptationQoSLinking," ISO/IEC JTC 1/SC 29/WG 11, Hewlett Packard Laboratories, Apr. 2003. [12] G. Panis, A. Hutter, J. Heuer, H. Hellwagner, H. Kosch, C. Timmerer, S. Devillers, and M. Amielh, "Bitstream syntax description: A tool for multimedia resource adaptation within MPEG-21," Signal Process.: Image Commun., vol. 18, no. 8, pp. 721–747, 2003. [13] Y.-Y. Lin, T.-L. Liu, and H.-T. Chen, "Semantic manifold learning for image retrieval," in ACM Multimedia 2005, 2005, pp. 249–258. [14] K.-S. Goh, E. Y. Chang, and W.-C. Lai, "Multimodal concept-dependent active learning for image retrieval," in ACM Multimedia 2004, 2004, pp. 564–571. [15] C.-H. Hoi and M. R. Lyu, "A novel log-based relevance feedback technique in content-based image retrieval," in ACM Multimedia 2004, 2004, pp. 24–31. [16] W. James, The Principles of Psychology. Cambridge, MA: Harvard Univ. Press, 1890.




[17] D. E. Broadbent, Perception and Communication. Oxford, U.K.: Pergamon, 1958.
[18] J. Deutsch and D. Deutsch, "Attention: Some theoretical considerations," Psychol. Rev., vol. 70, pp. 80–90, 1963.
[19] S. Ahmad, "VISIT: A neural model of covert attention," in Advances in Neural Information Processing Systems. San Mateo, CA: Morgan Kaufmann, 1991, vol. 4, pp. 420–427.
[20] Y. Li, Y. F. Ma, and H. J. Zhang, "Salient region detection and tracking in video," in Proc. IEEE Int. Conf. Multimedia & Expo 2003, Jul. 2003, vol. 2, pp. 72–75.
[21] T. Kanungo, D. M. Mount, and N. S. Netanyahu et al., "An efficient K-means clustering algorithm: Analysis and implementation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 881–892, Jul. 2002.
[22] M.-C. Su and C.-H. Chou, "A modified version of the K-means algorithm with a distance based on cluster symmetry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 674–680, Jun. 2001.
[23] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inform. Theory, vol. IT-28, pp. 129–137, 1982.
[24] K. Lee, H. S. Chang, S. S. Chun, L. Choi, and S. Sull, "Perception based image transcoding for universal multimedia access," in Proc. 8th Int. Conf. Image Processing (ICIP-2001), Thessaloniki, Greece, Oct. 2001, vol. 2, pp. 475–478.
[25] S. Lazebnik, C. Schmid, and J. Ponce, "A maximum entropy framework for part-based texture and object recognition," in Proc. IEEE Int. Conf. Computer Vision, Beijing, China, Oct. 2005, vol. 1, pp. 832–838.
[26] ——, "Semi-local affine parts for object recognition," in Proc. British Machine Vision Conf., Sep. 2004, vol. 2, pp. 959–968.
[27] A. Treisman and S. Gormican, "Feature analysis in early vision: Evidence from search asymmetries," Psychol. Rev., vol. 95, pp. 15–48, 1988.
[28] F. Crick and C. Koch, "Some reflections on visual awareness," in Proc. Cold Spring Harbor Symp. Quantitative Biology, 1990, vol. 55, pp. 953–962.
[29] J. M. Wolfe and K. R. Cave, "Deploying visual attention: The guided search model," in AI and the Eye, A. Blake and T. Troscianko, Eds. New York: Wiley, 1990, ch. 4, pp. 79–103.
[30] A. Treisman, "Perception of features and objects," in Visual Attention. New York: Oxford Univ. Press, 1998.
[31] C. Koch and S. Ullman, "Shifts in selective visual attention: Towards the underlying neural circuitry," Human Neurobiol., vol. 4, pp. 219–227, 1985.
[32] J. K. Tsotsos, S. M. Culhane, and W. Y. K. Wai et al., "Modeling visual attention via selective tuning," Artif. Intell., vol. 78, pp. 507–545, 1995.
[33] E. Niebur and C. Koch, "Computational architectures for attention," in The Attentive Brain, R. Parasuraman, Ed. Cambridge, MA: MIT Press, 1998, pp. 163–186.
[34] S. Baluja and D. A. Pomerleau, "Expectation-based selective attention for visual monitoring and control of a robot vehicle," Robot. Auton. Syst., vol. 22, no. 3–4, pp. 329–344, Dec. 1997.
[35] S. Kim, Y. Tak, Y. Nam, and E. Hwang, "CLOVER: Mobile content-based leaf image retrieval system," in ACM Multimedia 2005, 2005, pp. 215–216.
[36] C. Yang, M. Dong, and F. Fotouhi, "Semantic feedback for interactive image retrieval," in ACM Multimedia 2005, 2005, pp. 415–418.
[37] F. Jing, M. J. Li, H. J. Zhang, and B. Zhang, "An efficient and effective region-based image retrieval framework," IEEE Trans. Image Process., vol. 13, no. 5, pp. 699–709, 2002.
[38] L. Q. Chen, X. Xie, X. Fan, W. Y. Ma, H. J. Zhang, and H. Q. Zhou, "A visual attention model for adapting images on small displays," ACM Multimedia Syst. J., vol. 9, no. 4, pp. 353–364, 2003.
[39] D. Walther, U. Rutishauser, C. Koch, and P. Perona, "On the usefulness of attention for object recognition," in Proc. 2nd Int. Workshop on Attention and Performance in Computational Vision, Prague, Czech Republic, May 2004, pp. 96–103.
[40] ——, "Selective visual attention enables learning and recognition of multiple objects in cluttered scenes," Comput. Vis. Image Understand., pp. 745–770, 2005.

[41] A. Bamidele, F. W. Stentiford, and J. Morphett, "An attention-based approach to content based image retrieval," Brit. Telecommun. Adv. Res. Technol. J. Intell. Spaces (Perv. Comput.), vol. 22, no. 3, pp. 151–160, Jul. 2004.
[42] S. Paek and J. R. Smith, "Detecting image purpose in world-wide web documents," in Proc. SPIE/IS&T Photonics West, Document Recognition, Jan. 1998, vol. 3305, pp. 151–158.
[43] R. Mohan, J. R. Smith, and C.-S. Li, "Adapting multimedia internet content for universal access," IEEE Trans. Multimedia, vol. 1, no. 1, pp. 104–114, Mar. 1999.
[44] J. R. Smith, R. Mohan, and C.-S. Li, "Content-based transcoding of images in the internet," in Proc. IEEE Int. Conf. Image Processing, Oct. 1998, vol. 3, pp. 7–11.
[45] C.-S. Li, R. Mohan, and J. R. Smith, "Multimedia content description in the InfoPyramid," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, May 1998, vol. 6, pp. 3789–3792.
[46] K. Lee, H. S. Chang, S. S. Chun, and H. Choi, "Perception-based image transcoding for universal multimedia access," IEEE Trans. Image Process., vol. 2, pp. 475–478, 2001.
[47] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-1, no. 4, pp. 224–227, 1979.
[48] H. J. A. M. Heijmans, "Connected morphological operators for binary images," Comput. Vis. Image Understand., vol. 73, no. 1, pp. 99–120, 1999.
[49] ——, Morphological Image Operators. Boston, MA: Academic, 1994.

Zhiwen Yu (S’06) received the B.Sc. and M.Phil. degrees from Sun Yat-Sen University, China. He is currently pursuing the Ph.D. degree in the Department of Computer Science, City University of Hong Kong. His research interests include multimedia, image processing, machine learning, and data mining.

Hau-San Wong (S’96–M’00) received the B.Sc. and M.Phil. degrees in electronic engineering from the Chinese University of Hong Kong, and the Ph.D. degree in electrical and information engineering from the University of Sydney, Australia. He is currently an Assistant Professor in the Department of Computer Science, City University of Hong Kong. He has also held research positions at the University of Sydney and Hong Kong Baptist University. His research interests include multimedia information processing, multimodal human-computer interaction and machine learning. He is the co-author of the book Adaptive Image Processing: A Computational Intelligence Perspective (CRC/SPIE Press), and a guest co-editor of the special issue on “Information Mining from Multimedia Databases” for the EURASIP Journal on Applied Signal Processing. Dr. Wong was an organizing committee member of the 2000 IEEE Pacific-Rim Conference on Multimedia and 2000 IEEE Workshop on Neural Networks for Signal Processing, both held in Sydney, and has co-organized a number of conference special sessions, including the special session on “Image Content Extraction and Description for Multimedia” in 2000 IEEE International Conference on Image Processing, Vancouver, BC, Canada, and “Machine Learning Techniques for Visual Information Retrieval” in 2003 International Conference on Visual Information Retrieval, Miami, Florida.

