Region-of-Interest based Image Resolution Adaptation for MPEG-21 Digital Item

Yiqun Hu, Liang-Tien Chia, Deepu Rajan
Center for Multimedia and Network Technology, School of Computer Engineering
Nanyang Technological University, Singapore 639798
[email protected], [email protected], [email protected]

ABSTRACT

The upcoming MPEG-21 standard proposes a general framework for the augmented use of multimedia services across different network environments, for various users with various terminal devices. In the context of image adaptation, terminals with different screen size limitations require the multimedia adaptation engine to adapt image resources intelligently. Saliency-map-based visual attention analysis provides a way of finding the attention area within an image. In this paper, we improve the standard MPEG-21 metadata-driven adaptation engine with an enhanced saliency-map-based visual attention model, which provides a means to intelligently adapt the resolution of JPEG2000 images for terminal devices with different screen sizes according to human visual attention.

Categories and Subject Descriptors H.3.5 [Online Information Services]: Data Sharing

General Terms Standardization

Keywords MPEG-21, Image Adaptation, Saliency Map, Intelligent Resolution

1. INTRODUCTION

With the development of multimedia and network technology, multimedia resources can be accessed by a variety of terminal devices under different network conditions. The most critical limitation of a terminal device is its screen size. Directly reducing the resolution of a large image to fit the terminal screen is not always an optimal adaptation, because fine details are lost once the image is scaled below a certain fraction of its original size. An ideal image adaptation in this context should provide a better user experience by displaying only the area that most captures human visual attention, so that the available screen space is used for the most attentive information.

MPEG-21 Part 7, Digital Item Adaptation (DIA) [6], describes a standardized framework to adapt format-dependent and format-independent multimedia resources according to terminal capability. For image resolution adaptation, the standard adaptation engine only supports direct resolution reduction; although it provides a description tool for region-of-interest (ROI) information, the engine itself does not provide a method to automatically detect the ROI of the requested image. In this paper, we improve the standard MPEG-21 image adaptation engine to automatically detect the ROI of an image using an enhanced visual attention model. The engine also auto-generates the adaptation decision from the image ROI and the terminal capability information, and finally adapts the resolution of the image in a way that is compatible with human visual attention. Using our image adaptation engine, the time-consuming work of manually outlining image ROIs in an image database is automated in real time, and users of mobile devices with extremely small screens obtain a better experience by viewing the most attentive area within the limited screen size.

The rest of this paper is organized as follows. Section 2 briefly introduces the improved saliency-map-based visual attention model. Section 3 presents the MPEG-21 framework and the architecture of the BSD adaptation engine. Section 4 describes our improved image adaptation engine using the enhanced visual attention model. Experimental evaluation is given in Section 5, and we conclude in Section 6.

2. VISUAL ATTENTION MODEL

Attempts to understand the human visual system have resulted in visual attention models such as the saliency-map-based visual attention model [2] and the contrast-based attention model [3]. In our system, we utilize the saliency-map-based visual attention model [2] because of its biological plausibility. While the objective in [2] is to track several visually attentive objects dynamically using a "winner-take-all" neural network, our aim is to extract a single region that captures all the attention objects. The saliency map is a topographically combined feature map that indicates the possible attention areas of an image. In the next two subsections, we briefly review the process of generating the saliency map, as described in [2].

2.1 Feature Extraction

The first step in generating the saliency map is low-level feature extraction. Three low-level features that characterize an image are used: intensity, color and orientation. Each feature is subsampled and filtered with Gaussian pyramids. The difference between fine and coarse scales is implemented by a "center-surround" difference operator "⊖", which interpolates the coarse scale to the fine scale and then performs point-by-point subtraction. The center is a pixel at scale c ∈ {2, 3, 4} and the surround is the corresponding pixel at scale s = c + δ, δ ∈ {3, 4}.

For intensity, an intensity image I is calculated as I = (r + g + b)/3, where r, g and b are the red, green and blue channels of the input image. This intensity image is used to create a Gaussian pyramid I(σ), where σ ∈ [0..8] is the scale, and six maps are calculated as I(c, s) = |I(c) ⊖ I(s)|.

For color, four broadly-tuned color channels are created: R = r − (g + b)/2 for red, G = g − (r + b)/2 for green, B = b − (r + g)/2 for blue, and Y = (r + g)/2 − |r − g|/2 − b for yellow, from which four Gaussian pyramids R(σ), G(σ), B(σ) and Y(σ) are created. Maps RG(c, s) (1) and BY(c, s) (2) simultaneously account for red/green, green/red and blue/yellow, yellow/blue double opponency:

RG(c, s) = |(R(c) − G(c)) ⊖ (G(s) − R(s))|    (1)

BY(c, s) = |(B(c) − Y(c)) ⊖ (Y(s) − B(s))|    (2)

For orientation, Gabor filters are used to extract the orientation information. The orientation maps are obtained from I using oriented Gabor pyramids O(σ, θ), where θ ∈ {0°, 45°, 90°, 135°} is the preferred orientation. Orientation features O(c, s, θ) are calculated for each orientation θ as a group of six maps:

O(c, s, θ) = |O(c, θ) ⊖ O(s, θ)|    (3)

In all, 42 feature maps are created: 6 for intensity, 12 for color and 24 for orientation.
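For concreteness, the sketch below shows one way the six intensity maps I(c, s) could be computed with a Gaussian pyramid and the center-surround operator. It assumes an OpenCV/NumPy environment and a BGR input image; it is an illustration, not the authors' implementation.

```python
# Minimal sketch of the centre-surround intensity feature maps (our assumptions).
import cv2
import numpy as np

def gaussian_pyramid(channel, levels=9):
    """Build a 9-level Gaussian pyramid, scales 0..8."""
    pyr = [channel]
    for _ in range(1, levels):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(pyr, c, s):
    """|pyr[c] (-) pyr[s]|: upsample the coarse scale s to scale c, then subtract point-wise."""
    coarse = cv2.resize(pyr[s], (pyr[c].shape[1], pyr[c].shape[0]),
                        interpolation=cv2.INTER_LINEAR)
    return cv2.absdiff(pyr[c], coarse)

def intensity_feature_maps(img_bgr):
    """Six intensity maps I(c, s) for c in {2, 3, 4} and s = c + {3, 4}."""
    img = img_bgr.astype(np.float32)
    b, g, r = img[..., 0], img[..., 1], img[..., 2]   # OpenCV stores channels as BGR
    intensity = (r + g + b) / 3.0
    pyr = gaussian_pyramid(intensity)
    return [center_surround(pyr, c, c + d) for c in (2, 3, 4) for d in (3, 4)]
```

The color and orientation maps follow the same pattern, with opponent-color channels and Gabor-filtered pyramids in place of the intensity channel.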

2.2 Saliency Map Generation

The saliency map is generated from the 42 feature maps. A normalization operator N(·) [2] is used to globally enhance maps with a few strong contrast peaks and to suppress maps with numerous comparable peaks. It consists of three steps: 1) normalize the map to a fixed range [0..M]; 2) find the global maximum M and compute the average m̄ of all the other local maxima; 3) multiply the map by (M − m̄)². A large difference between the global maximum and the average of the local maxima indicates that the most active location stands out; a small difference indicates that the map contains nothing unique, and the map is suppressed.

The feature maps are combined into three "conspicuity maps", one per feature type: Ī for intensity (4), C̄ for color (5) and Ō for orientation (6). They are calculated through across-scale addition, "⊕", which is implemented by reducing each map to scale four and performing point-by-point addition:

Ī = ⊕_{c=2..4} ⊕_{s=c+3..c+4} N(I(c, s))    (4)

C̄ = ⊕_{c=2..4} ⊕_{s=c+3..c+4} [N(RG(c, s)) + N(BY(c, s))]    (5)

For orientation, the six maps for each orientation θ are first combined within that orientation; the four resulting maps are then combined into a single orientation conspicuity map:

Ō = Σ_{θ ∈ {0°, 45°, 90°, 135°}} N( ⊕_{c=2..4} ⊕_{s=c+3..c+4} N(O(c, s, θ)) )    (6)

Finally, the three conspicuity maps are normalized and summed into the final saliency map as S = (1/3) (N(Ī) + N(C̄) + N(Ō)).
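A rough sketch of the normalization operator N(·) and of the final combination is given below. The way local maxima are found (a maximum filter with a fixed window) and the window size are our own assumptions, not details given in the paper.

```python
# Hedged sketch of N(.) and the final saliency combination (assumptions noted above).
import numpy as np
from scipy.ndimage import maximum_filter

def normalize_map(fmap, M=1.0, win=7):
    """Scale the map to [0, M], then promote maps with one dominant peak by (M - m_bar)^2."""
    fmap = fmap - fmap.min()
    if fmap.max() > 0:
        fmap = fmap / fmap.max() * M
    # local maxima: points equal to the maximum of their neighbourhood (assumed detection scheme)
    local_max = (fmap == maximum_filter(fmap, size=win)) & (fmap > 0)
    peaks = fmap[local_max]
    m_bar = peaks[peaks < M].mean() if (peaks < M).any() else 0.0
    return fmap * (M - m_bar) ** 2

def saliency(I_bar, C_bar, O_bar):
    """S = (N(I) + N(C) + N(O)) / 3 for the three conspicuity maps."""
    return (normalize_map(I_bar) + normalize_map(C_bar) + normalize_map(O_bar)) / 3.0
```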

2.3 Enhancement of Visual Attention Model

It is assumed that a small object at the edge of an image is unlikely to be the main attention region, and that an attention region closer to the center of the image is perceptually more important in human vision. We therefore assign a weight to each pixel of the image. Without further restriction, we assume the weight surface follows a Gaussian distribution along both the horizontal and vertical directions ((7), (8)), and that the total weight is the arithmetic mean of the two directions:

N(µ_x, σ_x²) = 1/(√(2π) σ_x) · exp[ −(1/2) ((x − µ_x)/σ_x)² ]    (7)

N(µ_y, σ_y²) = 1/(√(2π) σ_y) · exp[ −(1/2) ((y − µ_y)/σ_y)² ]    (8)

Both Gaussian curves are centered at the center of the image by setting µ_x to half the width (Width/2) and µ_y to half the height (Height/2). σ_x and σ_y are fixed to 10 so that the Gaussian curves remain smooth, avoiding a sharp weighting that would consider only a small central region of the image. These weights modify the saliency map as in (9):

S̄_{x,y} = S_{x,y} · ( N(µ_x, σ_x²) + N(µ_y, σ_y²) ) / 2    (9)

where S̄_{x,y} is the weighted value of the saliency map at location (x, y).
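The center-weighting of (7)-(9) could be implemented as in the following minimal sketch, assuming a NumPy environment and σ_x = σ_y = 10 as stated above.

```python
# Sketch of the centre-weighted saliency map of eq. (9); not the authors' code.
import numpy as np

def center_weighted(saliency, sigma=10.0):
    h, w = saliency.shape
    x = np.arange(w)
    y = np.arange(h)
    # 1-D Gaussian weights centred at Width/2 and Height/2, eqs. (7) and (8)
    gx = np.exp(-0.5 * ((x - w / 2.0) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    gy = np.exp(-0.5 * ((y - h / 2.0) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    weight = (gx[None, :] + gy[:, None]) / 2.0   # arithmetic mean of the two directions
    return saliency * weight
```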

By weighting the saliency map according to position in the image, tiny attention points at the edges of the image are skipped and the focus is kept on the most important attention region. Our experimental results show that this simple factor is effective for noise reduction.

The modified saliency map assigns a value to each point according to its attention topology. In our image adaptation model, a simple region-growing algorithm, whose similarity threshold is defined as 30% of the gray-level range of the saliency map, is used to generate the smallest bounding rectangle that includes the identified attention area(s). First, the pixels with the maximum value (one or several) are taken as seeds and the region-growing algorithm is executed. In each growing step, the 4-neighbor points are examined; if the difference between a point and the current seed is smaller than the threshold (30% of the gray-level range), the point is added to the seed queue and grown later. The algorithm continues until the seed queue is empty. The output is one or several separate regions, and the smallest rectangle enclosing these regions is generated.
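One possible reading of this region-growing and bounding-box step is sketched below; the queue-based 4-neighbor growing and the helper name are our own, and the threshold is taken as 30% of the map's value range as described above.

```python
# Rough sketch of seeded region growing on the (weighted) saliency map.
from collections import deque
import numpy as np

def attention_bounding_box(saliency, threshold_ratio=0.30):
    """Grow from the maximum-valued pixels; return the smallest enclosing rectangle."""
    s = saliency.astype(np.float64)
    thresh = threshold_ratio * (s.max() - s.min())
    visited = np.zeros(s.shape, dtype=bool)
    queue = deque(map(tuple, np.argwhere(s == s.max())))   # one or more seed pixels
    for seed in queue:
        visited[seed] = True
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # 4-neighbourhood
            ny, nx = y + dy, x + dx
            if (0 <= ny < s.shape[0] and 0 <= nx < s.shape[1]
                    and not visited[ny, nx]
                    and abs(s[ny, nx] - s[y, x]) < thresh):
                visited[ny, nx] = True
                queue.append((ny, nx))
    ys, xs = np.nonzero(visited)
    return xs.min(), ys.min(), xs.max(), ys.max()   # x_min, y_min, x_max, y_max
```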

3. MPEG-21 DIGITAL ITEM ADAPTATION

The upcoming MPEG-21 framework considers heterogeneous network environments, different terminal devices and the personal characteristics of users in order to provide an open framework for all participants in the multimedia consumption chain [1]. The multimedia resource is combined with metadata describing the network environment, terminal capability and user characteristics into the fundamental unit of distribution and transaction, called the Digital Item. The MPEG-21 multimedia standard defines the technology needed to support Users in exchanging, accessing, consuming, trading and otherwise manipulating Digital Items in an efficient, transparent and interoperable way [1].

Digital Item Adaptation is an important part of the MPEG-21 standard (Part 7) [6]. It aims to achieve interoperable, transparent access to (distributed) advanced multimedia content by shielding Users from network and terminal installation, management and implementation issues. The Final Committee Draft of Digital Item Adaptation [6] proposes an adaptation engine architecture using the Bitstream Syntax Description (BSD) [5]. When an image is input to the adaptation engine, the BinToBSD engine analyses the multimedia resource and generates a format-dependent BSD or a format-independent generic BSD (gBSD) indicating the high-level structure of the corresponding image bitstream. The BSD (gBSD) description of the Digital Item is then adapted with the help of an XML Stylesheet Transformation (XSLT) according to all related information, including the network situation, terminal capability and user preferences. Finally, BSDToBin creates the new adapted multimedia resource from the transformed BSD (gBSD) description. More details can be found in [5].
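As a purely illustrative example of the transformation step (not the reference software and not part of the standard), a (g)BSD description could be adapted with an XSLT stylesheet as in the sketch below; the file names are placeholders.

```python
# Hypothetical sketch: applying an XSLT stylesheet to a (g)BSD description with lxml.
from lxml import etree

gbsd = etree.parse("image_gbsd.xml")                   # (g)BSD produced by BinToBSD
stylesheet = etree.parse("resolution_adaptation.xsl")  # transform selected by the decision engine
transform = etree.XSLT(stylesheet)
adapted = transform(gbsd)                              # transformed (g)BSD description
with open("image_gbsd_adapted.xml", "w") as f:
    f.write(str(adapted))                              # BSDToBin then rebuilds the bitstream
```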


4. INTELLIGENT RESOLUTION ADAPTATION ENGINE

Current image adaptation frameworks either do not provide ROI-based resolution adaptation or require users to manually outline the ROI. Neither is always convenient or feasible, for example on mobile devices where it is difficult to accurately outline an ROI. In our work, by combining the MPEG-21 standard framework with the enhanced visual attention model, we improved the standard image adaptation engine to automatically detect the visual attention region and decide the adaptation operation in a standardized, dynamic and intelligent way. The advantage of our intelligent resolution adaptation engine is that it preserves, as far as possible, the most attentive (important) information of the original image while satisfying the terminal screen constraints.

The engine utilizes the Structured Scalable Meta-formats (SSM) for Fully Content Agnostic Adaptation [4], proposed as an MPEG-21 reference software module by HP Research Labs, whose architecture is shown in Figure 1. The SSM module adapts the resolution of JPEG2000 images according to their ROIs and the terminal screen constraints of the viewers. The BSD description of the JPEG2000 image is generated by the BSDL module [5]. The ROI is automatically detected using our enhanced visual attention model, and the adaptation operation is dynamically decided by considering both the ROI and the terminal screen size constraint. The resolution of the JPEG2000 image is changed by directly adapting the JPEG2000 bitstream in the compressed domain.

The whole adaptation procedure is as follows. The BSD description and the ROI information are combined with the image itself as a Digital Item. When a user requests the image, the terminal constraint is sent to the server as a context description (XDI). Combining the XDI, the BSD description and the ROI information, the Adaptation Decision-Taking Engine decides on the adaptation process for the image [4]. Finally, the adapted image and its corresponding BSD description are generated by the BSD Resource Adaptation Engine [5]. The description can be updated to support multi-step adaptation. A snapshot of BSD Digital Item adaptation is shown in Figure 2.

Figure 1: MPEG-21 Digital Item Adaptation Architecture (XDI/CDI extraction, Adaptation Decision-Taking Engine, AdaptationQoS and BSDLink processing, and the (g)BSD-based Resource Adaptation Engine producing the adapted Content Digital Item).


Figure 2: Example of Digital Item BSD Adaptation; (a) Adaptation Decision Description; (b) JPEG2000 BSD Adaptation (green: original BSD, blue: adapted BSD).

The intelligent ROI adaptation is decided according to the relationship between the image size I_size, the ROI size R_size and the terminal screen size C_size (a code sketch of this decision follows the list):

• If C_size > I_size: no adaptation; the original image is sent to the user directly.

• If R_size < C_size < I_size: crop the ROI according to the result of the visual attention analysis, removing non-attention areas.

• If C_size < R_size: crop the attention region first, then reduce its resolution to the terminal screen size (a further adaptation can be performed by the adaptation engine).
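A minimal sketch of this decision logic is shown below; the function name is hypothetical, sizes are (width, height) tuples, and the actual decision in the engine is taken from the XDI, BSD and ROI descriptions rather than from such a helper.

```python
# Illustrative decision sketch (not the reference software).
def decide_adaptation(image_size, roi_size, screen_size):
    iw, ih = image_size
    rw, rh = roi_size
    cw, ch = screen_size
    if cw >= iw and ch >= ih:
        return "none"                   # screen fits the whole image
    if cw >= rw and ch >= rh:
        return "crop_roi"               # crop to the attention region only
    return "crop_roi_and_scale"         # crop, then reduce resolution to the screen size
```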

5. EXPERIMENTAL EVALUATION

600 test images were selected from different categories of the standard Corel Photo Library. Several output examples of our intelligent visual-attention-based adaptation are shown in Figure 3 and Figure 4. Because visual attention is inherently subjective, we applied the user-study methodology of [3] to test the effectiveness of the proposed algorithm. Eight human subjects were invited to evaluate 40 adapted images in each of the 4 categories, grading the adapted images from 1 (failed) to 5 (good).

Table 1: User study evaluation (fraction of images per grade, for each category)

Category   Failed   Bad     Acceptable   Medium   Good
Animal     0.02     0.09    0.22         0.33     0.34
People     0.01     0.11    0.22         0.30     0.36
Scenery    0.03     0.13    0.22         0.40     0.22
Others     0.01     0.10    0.26         0.41     0.22
Average    0.017    0.108   0.23         0.38     0.29

From the evaluation results shown in Table 1, we found that, across the different image categories, close to 87% of the cases on average are acceptable, of which 67% are better than acceptable. Only about 10% are bad and about 1% failed. Bad results arise mainly when the cropped image does not include the whole visual object (e.g. only the head of an animal); the roughly 1% failure rate is due either to the wrong visual object being identified as the attention region, or to images such as scenery shots that may contain no specific visual object. The framework works reasonably well for a general set of natural images.

6. CONCLUSION

Saliency-map-based visual attention analysis provides an intelligent content understanding mechanism for multimedia adaptation. In this paper, we designed an improved MPEG-21 image adaptation engine for JPEG2000 that uses the enhanced visual attention model to provide intelligent ROI-based image resolution adaptation for different terminal devices according to human visual attention. The advantages of this engine over others are its automatic ROI detection and its dynamic adaptation decisions, which combine the image ROI with the terminal screen size constraint. This work can be extended to provide Universal Multimedia Access (UMA) services compatible with the MPEG-21 standard.

7. REFERENCES

Figure 3: Example of good intelligent adaptation; (a) Original Image; (b) Saliency Map; (c) Cropped Image.


Figure 4: Example of bad and failed intelligent adaptation; (a) Original Image; (b) Saliency Map; (c) Cropped Image.


[1] J. Bormans and K. Hill. MPEG-21 Overview V.5. ISO/IEC JTC1/SC29/WG11/N5231, October 2002.
[2] L. Itti, C. Koch, and E. Niebur. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1998.
[3] Y. Ma and H. Zhang. Contrast-based Image Attention Analysis by Using Fuzzy Growing. In Proc. ACM Multimedia, Berkeley, CA, USA, November 2003.
[4] D. Mukherjee, G. Kuo, S. Liu, and G. Beretta. Motivation and Use Cases for Decision-wise BSDLink, and a Proposal for Usage Environment Descriptor-AdaptationQoS Linking. ISO/IEC JTC 1/SC 29/WG 11, Hewlett-Packard Laboratories, April 2003.
[5] G. Panis, A. Hutter, J. Heuer, H. Hellwagner, H. Kosch, C. Timmerer, S. Devillers, and M. Amielh. Bitstream Syntax Description: A Tool for Multimedia Resource Adaptation within MPEG-21. Signal Processing: Image Communication, EURASIP, 18(8), 2003.
[6] A. Vetro and C. Timmerer. ISO/IEC 21000-7 FCD Part 7: Digital Item Adaptation. ISO/IEC JTC 1/SC 29/WG 11/N5845, July 2003.
