2012 IEEE International Conference on Multimedia and Expo
IMPACT OF REGION-OF-INTEREST VIDEO CODING ON PERCEIVED QUALITY IN MOBILE VIDEO
Ivan Himawan, Wei Song, Dian Tjondronegoro
Science and Engineering Faculty, Queensland University of Technology, Brisbane, Australia
{i.himawan, w1.song, dian}@qut.edu.au
ABSTRACT
Many studies have evaluated the relationship between human perceived quality, content type (e.g. news, sports, music), and the way visual media is presented on mobile devices. For example, [2] studies how video resolution affects the viewing experience on small screens. The balance between video resolution, content, shot types and user acceptability is further explored in [3]. In [4], the authors concluded that a reduction in frame rate does not affect the acceptability of sports videos as significantly as a reduction in frame quality does. These studies help content producers present standard TV content on mobile devices more efficiently without sacrificing content. Another method of presenting visual media on mobile devices efficiently is video retargeting, which adapts high-resolution video to a smaller screen size. This includes cropping the footage outside the area of interest and zooming in on that area [5, 6]. It is also possible to visually enhance the region of interest [3, 4]. In addition, some parameters (i.e. resolution, frame rate, quantisation scale) may be adjusted to reduce the bitrate while maintaining user acceptability. ROI enhancement requires information on the regions that attract the viewer’s attention in a video. One approach to determining the ROI is to employ saliency detectors, which predict the location of the viewer’s attention [7, 8]. Information on the region of interest can be used to help human editors prepare mobile TV content [3]. For example, in a videoconferencing application, the human face is more important than the background. In this case a face detector can be employed to detect faces and, in conjunction with zooming and cropping, display the face in the centre of the screen.
Given arbitrary videos, detecting the region of interest is still a challenging task, since it is difficult to determine which objects should be considered salient. Considerable research has been done on region-of-interest video coding. Some research focuses on improving perceptual quality in important areas using the latest video codecs [9, 10], or on using ROI coding as an error resilience tool [11]. The applications seem limited to video conferencing and surveillance, with improvement in Peak Signal-to-Noise Ratio (PSNR) or subjective comparison as the measure of success. More user
Effective streaming of video can be achieved by providing more bits to the most important region in the frame at the cost of fewer bits in the less important regions. This strategy can be beneficial for delivering high quality videos to mobile devices, especially when the available bandwidth is low and limited. While state-of-the-art video codecs such as H.264 may have been optimised for perceived quality, it is hypothesised that users give more attention to interesting regions or objects when watching videos. Therefore, giving a higher quality to the region of interest (ROI) while reducing the quality of other areas may improve the overall perceived quality without necessarily increasing the bitrate. In this paper, the impact of ROI-based encoded video on perceived quality is investigated by conducting a user study for various target bitrates. The results from the user study demonstrate that ROI-based video coding has superior perceived quality compared to normal encoded video at the same bitrate in the lower bitrate range.
Index Terms: Subjective Quality, User Study, Video Analysis, Region of Interest
1. INTRODUCTION
The advent of mobile technologies has created a new trend of consuming visual media on mobile devices. More and more people spend their time watching TV on mobile devices as an alternative to conventional TV, owing to their pocket-size compactness and portability. Video-on-demand, which gives users the freedom to watch anytime and anywhere they want, is predicted to be very popular among mobile TV users [1]. Increasing numbers of users streaming visual media on their mobile devices will put strain on the existing communication networks. For service providers, it is necessary to utilise bandwidth efficiently to maintain a high quality of service while meeting customers’ demand for quality. This means meeting customer expectations without necessarily incurring the cost of network expansion.
To achieve this goal, providers need to know the minimum acceptable video quality customers find comfortable to watch.
978-0-7695-4711-4/12 $26.00 © 2012 IEEE DOI 10.1109/ICME.2012.126
studies are needed to assess the impact of ROI on perceived quality for various video contents. For example, the study in [12] found that ROI can enhance perceived quality for images encoded at low bitrates. The study in [13] shows that normal encoded video is preferred to ROI encoded video because of severe visible coding artifacts at low bitrates, while [14] shows that ROI enhancement improves user perceived quality in low bitrate sports video. However, the minimum bitrate, the quality level, and the effect of content type are still not clear, especially in the scenario where people are watching videos on mobile devices. In order to study the impact of ROI on perceived quality further, a user study is conducted to assess the quality of ROI-based and normal encoded video for various target bitrates in four video contents. Section 2 presents the materials and the methodology used in the user study. The user study findings are presented and discussed in Section 3. This is followed by the conclusion and further work in Section 4.
of macroblocks, on average the ROI area occupies between 20% and 31% of the total macroblocks in a frame. The normal quality video was encoded with an average target bitrate encoding mode. The same target bitrate is imposed on the ROI-based video by giving higher quality to the region of interest and lower quality to the non-ROI area. Figure 2 shows snapshots from Normal and ROI encoded videos. There are seven pairs of video for each content type, with bitrates of 200, 300, 400, 500, 700, 900, and 1100kbps. Because it is very difficult to hit the specified target bitrate exactly, a deviation of ±3kbps from the target is allowed.
2.2. Test Procedures
The test procedure is conducted in a similar fashion to the double-stimulus continuous quality-scale (DSCQS) method recommended in ITU-R BT.500-12 [16], in order to evaluate the quality of ROI encoded video with respect to normal encoded video, or vice versa, under different target bitrates. The procedure takes approximately 25 minutes: 5 minutes of briefing and 20 minutes of user study. During the briefing, a demo video was presented with instructions on how the assessment would be conducted, and any questions that arose were answered until the subject felt comfortable doing the rest of the experiment by himself/herself. The experiment was conducted indoors and there was no restriction on how the subject handled and viewed the screen. The outline of the subjective video quality assessment is as follows:
2. USER STUDY
The user study is conducted using an iPhone 4 with a customised application to play videos and collect participants’ responses. An instruction sheet summarising the test procedures is given to each subject at the start. The details of the test materials and procedures are presented in the following subsections.
2.1. Test Material
1. The user starts with the first video content (i.e. cartoon).
In order to investigate the perceived quality of a video in which different qualities exist between the ROI area and the rest of a frame, it is necessary to define the salient regions. We manually define this region for four different video contents: 3D animation (labelled as cartoon), music clip, news clip, and soccer match (labelled as sports). Figure 1 shows the region of interest from representative frames in those four videos. Test videos were formatted to a resolution of 960x640 pixels (the native iPhone resolution); each has a length of 10 seconds and no audio. For each video content (e.g. news), two versions of the clip were prepared. The first video clip was encoded in the usual manner for a specified target bitrate using first pass encoding. The second video clip was encoded with the region of interest at a higher quality than the rest of the frame. This is achieved by encoding the video with a higher than specified target bitrate in the first pass, then reducing the quality of the non-ROI area in the second pass by setting its quantisation parameter (QP) 5 levels higher. The encoding was done with a custom-developed x264 codec¹. Since the H.264 [15] video coding standard divides each frame into macroblocks, the region of interest is specified to the encoder as macroblock positions within a frame. No B-frames were used and the maximum GOP size was set to 50. In terms
2. Each content type contains 7 (seven) pairs, where each pair consists of a normal encoded video and a ROI-based encoded video with a specific target bitrate. The 7 video pairs are presented in random order. The normal encoded video and the ROI-based encoded video in a pair are also presented in random order.
3. The subject is asked to rate the normal encoded video (e.g. A signal) and the ROI-based encoded video (e.g. B signal). For each pair, the user is able to switch between the A and B signals until he/she has made his/her decision.
4. When the user has made a decision, he/she simply taps the screen twice to rate the video he/she is currently watching (see Figure 3.c.). A five-point Mean Opinion Score (MOS) scale, with 1 (bad), 2 (poor), 3 (fair), 4 (good), and 5 (excellent), is used to judge the quality. Once he/she rates the first video (e.g. A signal), the second video (e.g. B signal) is presented for the user to rate.
5. Once a pair of videos has been rated, a 3 (three) second mid-grey mask appears on the screen before the next pair of videos is displayed. The assessment is continued
¹ Original source code available online at http://www.videolan.org
Fig. 1. The region of interest (areas within the bounding box) from representative frames in cartoon (a and b) followed by music (c and d), news (e and f), and sports (g and h).
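The relationship between a bounding box such as those in Fig. 1 and the macroblock positions handed to the encoder can be illustrated with a minimal sketch. It assumes 16x16 macroblocks, a single rectangular ROI, and a QP offset of +5 outside the ROI as described in Section 2.1. The function names and the example box coordinates are hypothetical, and the sketch does not reproduce the custom x264 modification used in the study.

```python
import math

MB_SIZE = 16  # H.264 macroblock size in luma pixels


def roi_qp_offsets(width, height, box, non_roi_offset=5):
    """Return a row-major grid of per-macroblock QP offsets.

    box is (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    Macroblocks that overlap the box get offset 0; all others get
    non_roi_offset (a higher QP means coarser quantisation).
    """
    mb_cols = math.ceil(width / MB_SIZE)
    mb_rows = math.ceil(height / MB_SIZE)
    x0, y0, x1, y1 = box
    grid = []
    for r in range(mb_rows):
        row = []
        for c in range(mb_cols):
            mb_x0, mb_y0 = c * MB_SIZE, r * MB_SIZE
            overlaps = (mb_x0 < x1 and mb_x0 + MB_SIZE > x0 and
                        mb_y0 < y1 and mb_y0 + MB_SIZE > y0)
            row.append(0 if overlaps else non_roi_offset)
        grid.append(row)
    return grid


def roi_fraction(grid):
    """Fraction of macroblocks in the frame that belong to the ROI."""
    total = sum(len(row) for row in grid)
    roi = sum(1 for row in grid for offset in row if offset == 0)
    return roi / total


# Hypothetical ROI box on a 960x640 frame (60 x 40 macroblocks).
grid = roi_qp_offsets(960, 640, (240, 160, 720, 480))
fraction = roi_fraction(grid)
```

With this hypothetical box the ROI covers 600 of the 2400 macroblocks, which falls inside the 20% to 31% range reported for the test material.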
Fig. 2. Screenshots from Normal (a and c) and ROI (b and d) encoded video at 300kbps for cartoon and music.
until all seven pairs of video are rated before moving on to the next content (i.e. music video). There were 21 participants in total (16 male and 5 female), most of whom were university students. Evaluation responses are recorded on the iPhone device for data analysis, as shown in Figure 3.b.
3. RESULTS AND DISCUSSIONS
The average MOS for the four video contents is shown in Figure 4. Overall, ROI encoded videos have better perceptual video quality than normal encoded videos. Table 1 shows that the MOS improvement is greater for videos encoded in the lower bitrate range (200 to 500kbps), with the highest MOS difference of 0.71 for the News video. The Wilcoxon signed-rank test shows that in this bitrate range, the ROI encoded videos have MOS scores significantly higher than those of Normal encoded videos (p