IEEE TRANSACTIONS ON BROADCASTING, VOL. 56, NO. 3, SEPTEMBER 2010


Full-Reference Video Quality Metric for Fully Scalable and Mobile SVC Content

Hosik Sohn, Hana Yoo, Wesley De Neve, Cheon Seog Kim, and Yong Man Ro, Senior Member, IEEE

Abstract—Universal Multimedia Access (UMA) aims at enabling a straightforward consumption of multimedia content in heterogeneous usage environments. These usage environments may range from mobile devices in a wireless network to high-end desktop computers with wired network connectivity. Scalable video content can be used to deal with the restrictions and capabilities of diverse usage environments. However, in order to optimally tailor scalable video content along the temporal, spatial, or perceptual quality axis, a metric is needed that reliably models subjective video quality. The major contribution of this paper is the development of a novel full-reference quality metric for scalable video bit streams that are compliant with the H.264/AVC Scalable Video Coding (SVC) standard. The scalable video bit streams are intended to be used in mobile usage environments (e.g., adaptive video streaming to mobile devices). The proposed quality metric allows modeling the temporal, spatial, and perceptual quality characteristics of SVC-compliant bit streams by taking into account several properties of the compressed bit streams. These properties include the temporal and spatial variance of the video content, the frame rate, the spatial resolution, and PSNR values. An extensive number of subjective experiments have been conducted to construct and validate our quality metric. Experimental results show that the average correlation coefficient for the video sequences tested is as high as 0.95 (compared to a value of 0.60 when only using the traditional PSNR quality metric). The proposed quality metric also shows uniformly high performance for video sequences with different temporal and spatial characteristics.

Index Terms—QoS, quality measurement, quality metric, SVC.

I. INTRODUCTION

The demand for ubiquitous consumption of multimedia resources is steadily increasing. In order to have access to multimedia resources, consumers are relying on a plethora of networks and terminals, ranging from mobile devices in a wireless network to high-end desktop computers with wired network connectivity. All of these different networks and terminals come with particular restrictions and capabilities, such as bandwidth availability, display resolution, energy consumption, and computational power. Moreover, dependent on their physical capabilities, end-users may have different preferences on how to consume multimedia content. Universal Multimedia Access


Manuscript received January 11, 2009; revised February 11, 2010; accepted May 05, 2010. Date of publication June 07, 2010; date of current version August 20, 2010. The authors are with the Image and Video Systems Laboratory, Korea Advanced Institute of Science and Technology, Daejeon 305-732, Republic of Korea (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TBC.2010.2050628

(UMA) [1], [2] strives to enable a straightforward consumption of multimedia content in diverse usage environments. Scalable coding can be seen as one of the most important tools to realize the UMA vision, as it allows optimizing the End-to-End Quality of Service (E2E QoS) [3]–[6] in systems for multimedia delivery and consumption. Scalable Video Coding (SVC) is a new standard developed by the Joint Video Team (JVT) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The SVC specification enables the creation of video bit streams that can be adapted along the temporal, spatial, and Signal-to-Noise Ratio (SNR) scalability axes, respectively resulting in an adjustment of the frame rate, the spatial resolution, and the perceptual quality. Consequently, using a combination of the aforementioned adaptation possibilities, it is possible to create video bit streams that offer a diverse set of spatial resolutions, frame rates, and perceptual quality levels for a given target bit rate, without requiring complicated transcoding operations [7], [8]. As such, it should be clear that SVC-compliant bit streams are of particular interest for realizing a UMA-enabled multimedia system. To satisfy the high E2E QoS requirements of present-day and future multimedia systems, it is essential to have a thorough understanding of subjective video quality, i.e., video quality as experienced by end-users. Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR) are two frequently used methods for the objective assessment of video quality. Their computation is simple and straightforward. However, it is also well-known that significant discrepancies can often be observed between MSE or PSNR values on the one hand and the perceived video quality on the other hand.
Furthermore, when the frame rate, spatial resolution, and perceptual quality are jointly adjusted for a scalable video bit stream, the computed MSE and PSNR values often do not reflect the subjective quality [9], [10]. A significant number of research efforts have been dedicated to the construction of objective video quality metrics that aim at modeling subjective quality better than MSE and PSNR do. The Video Quality Experts Group (VQEG) is the most well-known group of experts working in the field of video quality assessment [11]. Their work has been used by the ITU as the basis for several recommendations. An example of such a recommendation is ITU-T J.247, which deals with objective perceptual multimedia video quality measurement in the presence of a full reference [12]. S. Winkler, E. Ong, and others have been working on video quality assessment techniques using particular properties of the Human Visual System (HVS), such as human color perception and the theory of opponent colors [13]–[16]. The National Telecommunications and Information Administration (NTIA) [16] has proposed a video quality metric that provides

0018-9316/$26.00 © 2010 IEEE


an estimate of the subjective quality of video content by comparing several parameters between the original and the distorted video sequence. These parameters include edge, texture, color, and angle information. Quality assessment in broadcasting systems has also been investigated [17], [18]. Using the reduced-reference quality metric discussed in [16], M. Pinson and S. Wolf discuss a broadcasting system in which the quality of the video content is measured in the terminal responsible for the playback of the video content [18]. A low-complexity quality metric had to be devised, as delay needs to be minimized in real-time broadcasting systems. In [19], a comparison can be found of different quality metrics, as well as a system that classifies quality metrics according to their objective or subjective nature. In their conclusions, the authors claim that the construction of a "general-purpose" quality metric might be too complex, and that application-specific quality evaluation is more sensible. The authors also conclude that more research should be focused on quality evaluation of recent image and video coding formats. Quality metrics have also been proposed that directly make use of coded bit stream characteristics, such as frame and bit rate, instead of relying on properties of the human visual system [9], [10], [20]. Video quality is also strongly dependent on the coding format used. Hence, in the scientific literature, quality metrics have been proposed that are format specific. In [21]–[23], quality metrics are proposed for H.262/MPEG-2 Video and H.264/AVC. However, none of the aforementioned quality metrics is able to take into account both the spatial resolution and the frame rate of video bit streams. In practice, this is a significant problem when measuring the quality of SVC bit streams, because these bit streams can be adapted along the temporal, spatial, and perceptual quality axis.
As video content is increasingly consumed in mobile usage environments, several quality metrics have been proposed that take into account a low spatial resolution or wireless connectivity [24]–[26]. The quality metric in [9] is for instance parameterized in terms of frame rate, motion magnitude, and PSNR. That way, it is possible to deal with joint adjustments of the quantization parameter and the frame rate when a given target bit rate has to be maintained. However, information regarding the spatial resolution is not incorporated in the quality metric, although the importance of spatial resolution is high in the context of mobile video content [27]. In [10], a quality metric for fully scalable SVC bit streams has been proposed, taking into account the frame rate, motion information, PSNR, and the spatial resolution. However, the effect of the spatial resolution is only partially considered, as the quality metric does not consider the influence of video characteristics such as edge information. Also, in the experiments, video sequences are used that are similar in terms of spatial detail. The spatial and temporal characteristics of video sequences exert a dominant influence on video quality [9], [16], [23], [28]. Therefore, it is necessary to take into account these characteristics in order to improve the performance of a quality metric. In [29], [30], a quality metric has been proposed that allows dealing with several video genres, like action and drama. In this method, the use of the quality metric is divided into two steps: the first step consists of genre classification of the consumed


video content, while the second step measures the video quality by taking into account the classified genre. This paper discusses a full-reference quality metric for video bit streams compliant with SVC. The proposed quality metric takes into account the frame rate and the standard deviation of the motion magnitude, as well as information regarding the spatial resolution, the PSNR, and the edge complexity of each picture. Further, we also target a mobile usage environment in this paper. Therefore, the spatial resolution of the video bit streams varies between QCIF (Quarter Common Intermediate Format; 176×144) and CIF (Common Intermediate Format; 352×288). Note that this paper does not address the impact of packet loss on video quality, which is also an important characteristic of mobile usage environments. The potential use of video quality metrics is diverse. Video quality metrics can be used for determining optimal compression parameters during encoding [30] and for monitoring the video quality in digital broadcasting networks [31]. In addition, video quality metrics can be used for guiding the decision-making process in extraction software for scalable bit streams. The latter functionality makes it possible to achieve a maximum E2E QoS when a bit stream extractor has several adaptation possibilities to meet a particular target bit rate. This paper is organized as follows. Section II discusses the overall setup of the experiments that were conducted to collect subjective quality data, as well as the methodology used to measure PSNR. This section also discusses how the spatial and temporal variance of video content affects the quality of the video content, and how to quantify their influence. The analysis of the collected quality data and the process of quality metric modeling are presented in Section III. Finally, a performance evaluation of the proposed quality metric is provided in Section IV, while Section V concludes this paper.
II. SUBJECTIVE QUALITY ASSESSMENT

In order to construct a metric that is able to reflect video quality in a reliable way, several subjective quality assessments need to be performed. The obtained experimental results are used for two purposes in our research: for the construction of a quality metric in a step-by-step approach on the one hand (where each step relies on experiments using different settings and video sequences), and for the independent verification of the reliability of our quality metric on the other hand. Therefore, the overall setup of our subjective experiments is described first in this section. Further, this section also pays attention to the methodology used for measuring PSNR (as PSNR measurement is more complicated for scalable video content than for non-scalable video content). Finally, this section also discusses how quality is influenced by the spatial and temporal characteristics of video content, as well as how to quantify the spatial and temporal complexity of a particular video sequence.

A. Experimental Environment

The settings for our subjective experiments were in line with the requirements of ITU-R recommendation BT.500-11 [32]. The total number of human observers per experiment was 16. A 30 inch LCD monitor was used (DELL3007WFB), with the viewing distance between the participants and the monitor



Fig. 2. Scalable bit stream creation.

Fig. 1. Video sequences used for quality metric construction and validation.

fixed to six times the height of a video sequence. To avoid constructing a quality metric that can only deal with video content of a specific nature, twelve video sequences were used, covering a wide range of spatial and temporal characteristics: Harbor, Mobile, Silent, and Crew, as well as video content originating from regular TV programs. Harbor, Mobile, Silent, and Crew are available online [33]. Snapshots of the video sequences used are shown in Fig. 1: (a) Harbor, (b) Soccer Game, (c) Mobile, (d) Interview, (e) Snow Forest, (f) Silent, (g) Child, (h) Crew, (i) Rain, (j) Soccer, (k) Mountain, and (l) Soccer Ground. The original version of each video sequence has CIF resolution, a fixed frame rate of 30 frames per second (fps), and a total duration of eight seconds. The video sequences were encoded and decoded using the Joint Scalable Video Model (JSVM) 9.16 reference software. Given a particular video sequence and a Quantization Parameter (QP) value, a scalable bit stream was generated with six spatial layers, making it possible to vary the spatial resolution between QCIF and CIF. To be more precise, the following spatial resolutions were used: 176×144, 224×176, 256×208, 288×224, 320×256, and 352×288. While a number of these spatial resolutions are not frequently used in practice for the coding of mobile video content (i.e., 256×208 and 288×224), the higher amount of variation in terms of spatial resolution allowed us to more accurately model the spatial component of the quality metric proposed in this paper.

Also, the need for fine-grained spatial scalability during the construction of our quality metric implied that our bit streams are compliant with the Scalable High Profile of SVC. Indeed, in the Scalable Baseline Profile, which explicitly targets low-complexity decoding in mobile usage environments, support for spatially scalable coding is restricted to resolution ratios of 1.5 and 2 between successive spatial layers in both the horizontal and vertical directions. This restriction does not hold true for the Scalable High Profile. It is also important to note that, while the Scalable High Profile is used to create bit streams needed for constructing our quality metric, the Scalable Baseline Profile is used to encode bit streams for verifying the reliability of the proposed quality metric (see Section IV). Further, each spatial layer consisted of three temporal layers, making it possible to offer three different frame rates for a given spatial resolution: 7.5 fps, 15 fps, and 30 fps. Due to a restriction of the reference software regarding the maximum number of layers in a scalable bit stream, SNR scalability was realized by encoding a video sequence six times using the spatial and temporal configuration as previously discussed, each time using a fixed QP value for all spatial layers. Encoding settings were also selected in order to reflect the typical requirements of mobile usage environments (e.g., use of CAVLC instead of CABAC, a base layer compatible with single layer H.264/AVC, use of an IPPPPPPP coding pattern in the original bit streams, etc.). The way the scalable bit streams were created is also summarized in Fig. 2. The construction of our quality metric consisted of several steps. For each step, a subset of the total number of available video sequences was used. This selection process is described in more detail in Section III.
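As a small illustration of the layer configuration described above, the snippet below enumerates the extraction points offered per sequence (the enumeration itself is ours; the six QP values are the set {20, 25, 30, 35, 40, 45} used in the experiments of Section III):

```python
# Enumerate the extraction points offered by the scalable bit streams described
# above: six spatial layers, three temporal layers each, and six separately
# encoded QP values (SNR scalability was realized via separate encodings).
from itertools import product

spatial_layers = [(176, 144), (224, 176), (256, 208), (288, 224), (320, 256), (352, 288)]
frame_rates = [7.5, 15.0, 30.0]
qp_values = [20, 25, 30, 35, 40, 45]

operating_points = list(product(spatial_layers, frame_rates, qp_values))
print(len(operating_points))  # 6 * 3 * 6 = 108 extraction points per sequence
```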



When conducting subjective tests, the way the video quality is graded may depend on "context effects". Context effects are observed when the perceived video quality of one video sequence is influenced by the perceived video quality of the other video sequences included in the subjective experiment. As explained in [34], among several double stimulus methods, the influence of contextual effects is minimal when the Double Stimulus Continuous Quality Scale (DSCQS; [32]) is used. Further, DSCQS was also used in [9] for grading test sequences with a low spatial resolution of 320×192 (the quality metric discussed in [9] was used as a starting point for the construction of the quality metric proposed in this paper). Therefore, in order to model and evaluate quality metrics in our research, we have selected DSCQS using the Differential Mean Opinion Score (DMOS) for measuring video quality using subjective experiments. Note that a single stimulus method could also have been used for the purpose of grading low-resolution and low-quality videos [35]. Further, it is worth mentioning that the perceived quality may also depend on the order in which the video sequences are shown to the participants, an observation known as "the order effect". Order effects, which are present with all subjective methodologies, are typically addressed by randomizing the presentation order of the video sequences in subjective experiments. The DMOS method consists of randomly displaying two video sequences A and B, where one of the video sequences is the reference sequence and where the other video sequence is the impaired sequence. An opinion score is subsequently assigned to the video sequences A and B. This opinion score may range from 0 to 100. The differential mean opinion score (DMOS) for the perceived quality of the impaired sequence is obtained by subtracting the opinion score of the reference sequence from the opinion score of the impaired sequence.
In this paper, a DMOS value close to zero means that the impaired sequence is similar to the reference one. On the other hand, a DMOS value far from zero means that the impaired sequence is distorted. The worse the quality of the impaired sequence, the farther from zero the DMOS value. As such, a DMOS value of '−100' is considered to represent the worst quality possible. The score of the quality metric proposed in this paper ranges from 0 to 100, where a quality score of '100' is considered to represent the best Subjective Quality (SQ). Therefore, for convenience of modeling, DMOS is converted to SQ using (1):

SQ = DMOS + 100    (1)

That way, SQ ranges from 0 (the lowest possible quality) to 100 (the highest possible quality). Strictly speaking, the differential score could also have been determined by subtracting the opinion score of the impaired sequence from the opinion score of the reference sequence. This would have avoided using the transformation in (1). Further, the obtained DMOS data were not scanned for unreliable and inconsistent results. As such, outliers were not removed. However, despite the fact that no outliers were removed, the average size of the 95% confidence intervals is 2.13 on the 0-100 scale used, indicating a feasible level of agreement between observers.
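Assuming the conversion in (1) is the simple offset SQ = DMOS + 100 (a reconstruction consistent with the ranges stated above, since the original equation was not recoverable from the extraction), it can be sketched as:

```python
def dmos_to_sq(dmos: float) -> float:
    """Convert a differential score in [-100, 0] (0 = indistinguishable from the
    reference, -100 = worst possible quality) to SQ in [0, 100], per (1)."""
    return dmos + 100.0

print(dmos_to_sq(-100.0))  # worst quality -> 0.0
print(dmos_to_sq(0.0))     # indistinguishable from reference -> 100.0
```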

Fig. 3. Process for measuring PSNR.

B. PSNR Measurement

SVC bit streams may offer an arbitrary combination of spatial, temporal, and SNR scalability. Adjusting the spatial resolution or frame rate of an SVC bit stream results in problems when measuring PSNR, as the use of this video quality metric assumes that the spatial resolution and the frame rate of the original and the impaired video sequence are the same. Therefore, a number of measures had to be taken in order to make the computation of PSNR values possible for arbitrarily adjusted scalable bit streams. To achieve spatial scalability, encoding by the JSVM reference software is done using different versions of a particular video sequence, where each version has a different spatial resolution. These different versions of the original video sequence can be generated using a downsampling tool that is part of the JSVM reference software (in our experiments, the downsampling method used was based on the Sine-windowed Sinc-function). As such, in our experiments, the PSNR was measured between a spatially downsampled version of the original video sequence and the impaired video sequence. That way, both the reference and the impaired video sequence have the same spatial resolution. As previously discussed, for ease of display during the subjective quality tests, frame copy operations were applied to video sequences that were temporally down-sampled to 15 fps and 7.5 fps. This approach also makes it possible to perform PSNR measurements in a straightforward way. The way PSNR is measured is also visualized in Fig. 3. Note that the PSNR values in our research are computed in the luma domain, relying on the same formulas as used in [36]. In particular, the PSNR values in our research are computed using (2):

PSNR = 10 · log10(MAX² / MSE)    (2)


where MAX denotes the maximum luma value (MAX is equal to 255 when the pixel depth is equal to eight bits per pixel component). The MSE is computed using (3):



TABLE I: SV VALUES FOR THE VIDEO SEQUENCES USED

MSE = (1 / (w · h)) · Σ_{y=1..h} Σ_{x=1..w} [ f_O(x, y) − f_E^(w×h, q, t)(x, y) ]²    (3)

where x and y respectively represent the coordinate of a pixel along the horizontal axis and the vertical axis, and where h and w respectively represent the height and width of the video sequence. As such, f_O(x, y) denotes a pixel value in the original video sequence at coordinate (x, y), where the video sequence in question has a spatial resolution of w×h. Similarly, the notation f_E^(w×h, q, t)(x, y) denotes a pixel value in an experimental video sequence, having a spatial resolution of w×h, encoded with a QP value equal to q, and having a frame rate equal to t (where t is the frame rate of the adapted video sequence, i.e., the video sequence obtained after adaptation, but before applying frame copy operations).
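A minimal sketch of the luma-domain PSNR computation of (2) and (3), for two equally sized 8-bit frames (pure Python, frames represented as lists of rows; the helper name is ours):

```python
import math

def psnr_luma(original, impaired, max_val=255.0):
    """PSNR in dB between two luma frames of identical resolution, per (2)-(3)."""
    h, w = len(original), len(original[0])
    squared_error = 0.0
    for y in range(h):
        for x in range(w):
            d = float(original[y][x]) - float(impaired[y][x])
            squared_error += d * d
    mse = squared_error / (w * h)
    if mse == 0.0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(max_val * max_val / mse)

# A frame that is maximally wrong at every pixel yields 0 dB:
print(psnr_luma([[255, 255], [255, 255]], [[0, 0], [0, 0]]))  # 0.0
```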

C. Spatial Characteristics

The spatial complexity of a video sequence affects the subjective quality [16], [28]. If pictures have the same distortion measured in terms of MSE or PSNR, then the subjective quality increases when the spatial complexity increases [16]. Moreover, people have a stronger preference for the use of higher spatial resolutions in the case of pictures with a high spatial complexity [28]. The spatial complexity of a picture can be computed using several approaches. In [16], the luma values of the original and impaired pictures are first processed with horizontal and vertical edge filters. These filters allow enhancing the edges in the pictures in question, while reducing noise. The enhanced edges are then used for computing an estimate of the spatial complexity. In [12], [13], another technique called spatial texture masking is used. In our research, the spatial complexity of a video sequence is computed by relying on a metric known as Spatial Variance (SV). This spatial complexity metric is derived from the MPEG-7 edge histogram algorithm [36], [38]. The SV value that is computed using this algorithm is resolution invariant [36]. Consequently, only one SV value needs to be calculated for a scalable video sequence encoded at different spatial resolutions. In particular, the SV value is computed using the formula below:

SV = (1 / (N · M)) · Σ_{i=1..N} Σ_{j=1..M} Ē_j(i)    (4)

where i and j respectively represent the frame index and the type of edge histogram value, and where N and M respectively represent the total number of frames and the total number of types of edge histogram values. In the MPEG-7 edge histogram algorithm, one picture is divided into 16 sub-blocks. For each sub-block, 5 types of edge histogram values (vertical, horizontal, 45°, 135°, and non-directional) are calculated [36]. Ē_j(i) denotes the average edge histogram value of the j-th type of edge histogram value over all sub-blocks in the i-th frame. The SV values for the video sequences used in our research are presented in Table I. The higher the SV value, the higher the spatial complexity of a video sequence.

D. Temporal Characteristics

Besides spatial characteristics, the subjective quality of a video sequence is also affected by temporal characteristics, such as the speed of motion and the variation of the motion speed. For example, the difference in subjective quality at 7.5 fps and 30 fps is negligible for static video sequences, as such video sequences are typically characterized by slow changes in motion. However, discrepancies in subjective quality are often easily noticeable for dynamic video sequences, which are usually characterized by fast changes in motion. Indeed, the subjective quality perceived for dynamic video sequences is highly sensitive to dropped or repeated frames. The temporal complexity of a video sequence can be determined by investigating its motion vectors. In this paper, we use a temporal complexity metric known as Temporal Variance (TV). Similar to the spatial complexity metric, our temporal complexity metric is derived from an MPEG-7 concept known as motion activity, which denotes the intuitive notion of 'intensity of action' or 'pace of action' in a video sequence [39], [40]. Motion activity is computed using the standard deviation of the motion magnitude, where the motion magnitude denotes the average magnitude of the motion vectors. A high value for the motion magnitude indicates high activity, while a low value for the motion magnitude indicates low activity. Both the motion magnitude and its standard deviation can be used to reflect how the subjective quality is influenced by the frame rate. However, the standard deviation of the motion magnitude reflects motion slightly better than the motion magnitude [39]. The TV values for the video sequences used in our research are presented in Table II. The higher the TV value, the higher the temporal complexity of a video sequence.

III. QUALITY METRIC CONSTRUCTION

The subjective quality of a video sequence is affected by SV and TV [12], [13], [16], [28]. Hence, our proposed quality metric considers both SV and TV in order to quantify the subjective quality.
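The temporal-complexity computation described in Section II-D can be sketched as below, assuming motion vectors have already been parsed from the bit stream and pooled over the sequence (the helper name and data layout are ours):

```python
import math

def temporal_variance(motion_vectors):
    """Standard deviation of the motion-vector magnitude, the quantity our TV
    metric is derived from (cf. the MPEG-7 motion-activity descriptor)."""
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    mean = sum(magnitudes) / len(magnitudes)
    variance = sum((m - mean) ** 2 for m in magnitudes) / len(magnitudes)
    return math.sqrt(variance)

print(temporal_variance([(3, 4), (3, 4)]))  # uniform motion -> 0.0
print(temporal_variance([(0, 0), (6, 8)]))  # strongly varying motion -> 5.0
```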
The construction of our quality metric is done using the step-by-step approach summarized in Fig. 4.



TABLE II: TV VALUES FOR THE VIDEO SEQUENCES USED

TABLE III: CORRELATION COEFFICIENTS FOR THE QUALITY METRICS OF [9] AND [10]

Fig. 4. Quality metric construction.

First, an experiment is discussed in which we quantify our proposal to replace the motion magnitude by the standard deviation of the motion magnitude in the quality metric presented in [9]. This modification makes it possible to better take into account the temporal variance of different video sequences. Second, we investigate the influence of the spatial variance on the subjective quality and the PSNR. To quantify this influence, an experiment is conducted that uses video sequences with different QP values, but with a fixed frame rate and a fixed spatial resolution. Third, a quality metric is constructed that is able to take into account both spatial variance and spatial scalability. To quantify the influence of the spatial variance on the subjective quality and the spatial resolution, video sequences are used with different spatial resolutions, but with a fixed frame rate and a fixed QP. Finally, the quality metrics constructed in the different steps are integrated. To realize this, video sequences are used having different QP values, frame rates, and spatial resolutions. The construction of our quality metric is described in more detail in the sections below.

A. Improved QM for Temporal and SNR Scalability

For quality metrics that do not consider the spatial resolution of a video sequence, [9] indicates that the subjective quality is linear with the PSNR at a fixed frame rate. This linear relation is defined by (5):

QM(PSNR, f, σ) = PSNR + a · ((30 − f) / 30) · σ^b    (5)

In (5), QM denotes Quality Metric, f represents the frame rate, and σ represents the standard deviation of the motion magnitude. The coefficients a and b are used to maximize the correlation between QM and the subjective quality; their precise values are experimentally determined. The second term, which is parameterized on f and σ, is used to compensate for the deviation from the linear PSNR-SQ relation that occurs when the frame rate is lowered. In this paper, the motion term is derived from the standard deviation of the motion magnitude. This is in contrast to [9], where the motion term is derived from the motion magnitude itself. To quantify the improvement that comes with using the standard deviation of the motion magnitude instead of the motion magnitude, we compare the correlation coefficients between the subjective quality and the scores as obtained by the quality metrics defined in [9] and [10]. More precisely, the Pearson product-moment correlation coefficient is used for measuring the correlation [41]. The quality metrics in [9] and [10] only take into account the PSNR and the frame rate. The difference between the two quality metrics is completely determined by the definition of the motion term. The coefficients a and b are different as well. However, the values of these coefficients are optimized according to the definition of the motion term used (in other words, changing the definition of the motion term in the quality metric defined in [9] also requires recalculating the values of the coefficients a and b).

The subjective experiment, performed to quantify the improvement that comes with using the standard deviation of the motion magnitude instead of the motion magnitude, relied on the use of the video sequences Harbor, Soccer Ground, and Akiyo. These video sequences were not used in [9] and [10]. The video sequences in question were encoded at two spatial resolutions, QCIF (176×144) and CIF (352×288), at a frame rate of 7.5, 15, and 30 fps at each spatial resolution, and with a QP value of 25, 35, and 45 at each frame rate, resulting in a total of 54 video sequences used. The results obtained for all video sequences are presented in Table III. Note that the correlation coefficients in Table III were measured using values for a and b that best fit the subjective quality, making use of multiple regression analysis based on the Least Squared Method (LSM).

B. Influence of SV on SQ-PSNR Relation

In addition to TV, the relation between SQ and PSNR is also affected by SV [12], [13], [16]. To incorporate this observation in the modeling of our quality metric, an understanding is
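The LSM fitting of the coefficients described above can be sketched as follows. The metric form QM = PSNR + a · (1 − f/30) · σ^b is our reconstruction of the metric of [9] (the exact form was lost in extraction), and the data below are synthetic; the sketch only illustrates the fitting idea: a closed-form least-squares solution for a at each candidate b, combined with a grid search over b.

```python
def fit_ab(samples, b_grid):
    """samples: (psnr, frame_rate_fps, sigma, subjective_quality) tuples.
    Returns (a, b) minimizing the squared error of
    QM = psnr + a * (1 - f/30) * sigma**b against the subjective quality."""
    best = None
    for b in b_grid:
        # For fixed b the model is linear in a, so a has a closed form.
        xs = [(1.0 - f / 30.0) * (sigma ** b) for _, f, sigma, _ in samples]
        ys = [sq - psnr for psnr, _, _, sq in samples]
        denom = sum(x * x for x in xs)
        a = sum(x * y for x, y in zip(xs, ys)) / denom if denom else 0.0
        err = sum((y - a * x) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, a, b)
    return best[1], best[2]

# Synthetic check: data generated with a=2, b=1.5 is recovered exactly.
data = [(30.0, 15.0, 4.0, 38.0), (30.0, 7.5, 1.0, 31.5)]
print(fit_ab(data, [1.0, 1.5, 2.0]))  # -> (2.0, 1.5)
```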



PSNR_SV takes into account the SV of the video sequence in question. As a result of the conducted subjective experiments, the SQ-PSNR relation as affected by SV is described in (6):

PSNR_SV = 100 · P_n^(1.54 − 7.61·SV)    (6)

Fig. 5. Influence of SV on the SQ-PSNR relation.

needed of how influences the -PSNR relation. The resolution and frame rate of the video sequences used in this experiment are fixed to CIF and 30 fps. Further, six different QP values (20, 25, 30, 35, 40, 45) are used, as described in Section II-A. Consequently, the total number of video sequences used is 72, all having the same spatial resolution and frame rate, but a difon the -PSNR relation is ferent PSNR. The influence of illustrated in Fig. 5. value is divided into three levels: Low, In Fig. 5, the Normal, and High. The video sequences used in the experiment are classified according to these three levels. Each line value as obtained for the in Fig. 5 denotes the average level. The video sequences that are classified into the same criterion used for the classification of the video sequences is the quantization table that comes with the computation of an MPEG-7 edge histogram [42]. This quantization table is presented in Appendix A. The range of each level is given as fol, , lows: . For instance, a video sequence with an value of 0.13 is assigned to Normal . According to this classification, Harbor, Soccer Game, and Mobile are assigned , Interview, Snow Forest, Silent, Child, and Crew to High , while Rain, Soccer, Mountain, and are assigned to Normal Soccer Ground are assigned to Low . -PSNR relation deAs shown in Fig. 5, the shape of the value notes a log function that is fully saturated when the is increasing, the slope is equal to 100. When the value of of the curve in Fig. 5 is decreasing. This means that an obof server grades an experimental video sequence with a high better quality, although the PSNR value for all video sequences is the same, an observation also made in [13] and [16]. For exvalue ample, when the PSNR value is close to 30 dB, the , Normal , and Low is respectively equal at High to 85, 70, and 50. 
Therefore, the quality metric as shown in (5), which assumes a linear relationship between the PSNR and the SQ, will not perform well for video sequences with a high SV. In particular, the error will be enlarged for video sequences with a high SV, as the SQ-PSNR relation is far from linear in these cases (as can be seen in Fig. 5, the SQ-PSNR relation is only linear for video sequences having a low SV). To solve this problem, we propose to make use of PSNR_M instead of the traditional PSNR metric. PSNR_M is able to better reflect the subjective quality of a particular video sequence by taking into account the SV-PSNR relation.

In (6), PSNR_N represents a normalized PSNR value, mapped from [20, 45] to [0, 1]. When a PSNR value exceeds a value of 45, we have found that an observer cannot distinguish the experimental video sequence from the original video sequence. The SQ value is considered to be equal to 100 in that particular case. Further, we also assume that the underlying network guarantees a minimum picture quality that is equal to a PSNR value of 20. In (6), the term 1.54 − 7.61 · SV is derived from the relation between the SV values and the shape of the SQ-PSNR relation (using the FindGraph v1.87 graphing tool [43], minimizing the standard deviation error). Finally, taking into account our observations regarding the use of the standard deviation of the motion magnitude and the influence of SV on the SQ-PSNR relation, the quality metric that considers temporal scalability, SNR scalability, and spatial variation is defined as (7).

Note that our perceptual quality metric is built on top of PSNR, taking advantage of its ease of computation and the fact that this quality metric is well understood and widely used by the video coding community. The proposed quality metric addresses a number of shortcomings of PSNR that become apparent when targeting quality measurement in the context of fully scalable video content (that is, support for temporal, spatial, and SNR scalability). In particular, our quality metric uses PSNR as a parameter that reflects the visual distortion that results from the use of quantization, whereas other parameters in our perceptual quality metric take into account the temporal and spatial variance of the video content, as well as the frame rate and the spatial resolution of the video content.

C. Influence of SV on the SQ-Spatial Resolution Relation

The subjective quality of a particular video sequence is affected by both the spatial resolution and SV [28]. In order to add support for varying spatial resolutions to our quality metric, an understanding is needed of the relation between SQ and the spatial resolution, and how this relation is influenced by the spatial variance. The video sequences used in the resulting subjective experiment have the following characteristics: a fixed frame rate of 30 fps, a fixed QP with a value of 35, and six different spatial resolutions uniformly ranging from QCIF to CIF. In total, 72 video sequences were used in this subjective experiment. The outcome of this subjective experiment is shown in Fig. 6.

In Fig. 6, the criterion for classifying SV is again the quantization table of the MPEG-7 edge histogram, which is the same criterion as used in Fig. 5 [42]. We have normalized the height value of the spatial resolution using (8), which implies that the range 144 to 288 is replaced on the X-axis by the range 20 to 0:

H_N = 20 · (288 − H) / 144  (8)
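The PSNR normalization described above can be sketched as follows: PSNR values are clipped to [20, 45] dB and mapped to [0, 1]. The SV-dependent mapping to a 0-100 score uses the term 1.54 − 7.61 · SV mentioned in the text as an exponent; this exact functional form of (6) is an assumption of the sketch, not a confirmed reproduction of the paper's formula.

```python
def psnr_n(psnr_db: float) -> float:
    """Normalize PSNR from [20, 45] dB to [0, 1]."""
    clipped = min(max(psnr_db, 20.0), 45.0)
    return (clipped - 20.0) / 25.0

def psnr_m(psnr_db: float, sv: float) -> float:
    """Assumed form of (6): a power law in PSNR_N, saturating at 100."""
    return 100.0 * psnr_n(psnr_db) ** (1.54 - 7.61 * sv)

# At or above 45 dB the sequence is indistinguishable from the original,
# so the score saturates at 100 regardless of SV.
print(psnr_m(50.0, 0.13))  # -> 100.0
```

With this form, a higher SV yields a smaller exponent and hence a higher score at the same PSNR, matching the behavior reported for Fig. 5.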


IEEE TRANSACTIONS ON BROADCASTING, VOL. 56, NO. 3, SEPTEMBER 2010

D. QM for Temporal, Spatial, and SNR Scalability

This section describes the construction of our proposed quality metric, having support for temporal, spatial, and SNR scalability. Using (7) and (9), our quality metric is defined as:

(10)

Fig. 6. Influence of SV on the SQ-spatial resolution relation.

In (8), H represents the height, while H_N represents the normalization of H. We assume that a video sequence makes use of a meaningful aspect ratio. Therefore, we also assume that the height is sufficient to represent the spatial resolution of a particular video sequence, without needing information about the width of the video sequence. Normalization is performed in order to simplify our video quality metric.

In Fig. 6, each point indicates the average value of the observed subjective quality for each SV level, while the dotted and dashed lines (Low QM, Normal QM, High QM) are the result of modeling the subjective quality using (9). For comparison purposes, the quality metric proposed in [10], which has support for spatial scalability, is presented as a bold black line (denoted as Default in Fig. 6). Note that the quality metric proposed in [10] is not able to take into account the spatial variance of a video sequence.

where the PSNR, as defined in [36], is dependent on the spatial resolution and the frame rate, and where the remaining two terms respectively indicate the quality metric defined in (5) and the quality metric defined in (7). Multiple regression analysis using the Least Squared Method (LSM) was used to compute the four coefficients in (10). These coefficients determine the correlation between the assessed SQ and the quality score obtained by evaluating our proposed quality metric. In order to collect subjective quality data for several combinations of different types of scalability, Harbor, Crew, and Soccer Ground were used. The three selected video sequences have diverse SV and TV values. Further, each selected video sequence was coded using three different QP values (25, 35, 45), three different frame rates (7.5, 15, and 30 fps), and three different spatial resolutions (QCIF, 256x208, CIF), resulting in a total of 81 video sequences used in this experiment. The values of the temporal, SNR, and spatial components are computed for each video sequence using information about the SV, TV, PSNR, frame rate, and spatial resolution of the video sequence in question. The SQ value for each video sequence was experimentally determined. As previously mentioned in this section, LSM-based multiple regression analysis was applied using the obtained QM and SQ values. The final shape of the proposed quality metric is presented below, consisting of an SNR, temporal, and spatial component:

(9)

In (9), QM_S denotes our quality metric with support for spatial scalability. By giving the coefficients a value of 50, it is possible to obtain a range of 0 to 100 for QM_S, which is similar to the range of the final quality metric proposed in this paper. As shown in Fig. 6, the shape of the different curves is similar to the left half of a Gaussian function, with zero being the position of the center of the peak. Hence, QM_S is modeled using a Gaussian function. The width of the Gaussian function is dependent on the value of SV: a low SV results in a broad Gaussian function. This observation makes clear that a video sequence with a high SV gets a lower QM_S than a video sequence with a low SV, although both video sequences have the same spatial resolution. In (9), the constants in the exponent are computed using the relation between the width of the Gaussian function and SV. As shown in Fig. 6, the shape and position of Default are similar to the shape and position of the curve representing a Normal SV. However, a clear gap can be observed between the Default curve and the curves denoting a Low and a High SV. The reason for this is the fact that the quality metric proposed in [10] is not able to take into account the spatial variance of a video sequence.
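The spatial model described above can be illustrated as follows: the frame height is normalized per (8) (144 maps to 20, 288 to 0, assumed linear), and the score follows the left half of a Gaussian centered at zero whose width grows as SV decreases. The width function `2.0 / sv` and the 50/50 scaling are assumptions of this sketch, not the paper's fitted constants.

```python
import math

def h_n(height: int) -> float:
    """Normalized height per (8): maps [144, 288] onto [20, 0]."""
    return 20.0 * (288.0 - height) / 144.0

def qm_s(height: int, sv: float) -> float:
    """Left-half Gaussian centered at H_N = 0 (i.e., CIF resolution)."""
    sigma = 2.0 / sv  # assumption: lower SV -> broader Gaussian
    return 50.0 + 50.0 * math.exp(-h_n(height) ** 2 / (2.0 * sigma ** 2))

# CIF (height 288) scores the maximum; QCIF (height 144) scores lower, and
# a high-SV sequence is penalized more at QCIF than a low-SV one.
print(qm_s(288, 0.2), qm_s(144, 0.2) < qm_s(144, 0.05))
```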

(11)

The QM value is affected by the content characteristics and the bit rates used. Specifically, the SV and TV values are highly dependent on the content characteristics, while the PSNR has a high dependency on the bit rate. The dependency between the parameters in (11) can be summarized as follows: the second term on the right side of (11) implies that the perceptual video quality becomes higher as SV increases at the same bit rate. The third term on the right side of (11) is used to compensate for the linearity between PSNR and SQ. In particular, this term compensates for the influence of frame copying on PSNR. Finally, the last term on the right side of (11) implies that a video sequence with a high spatial resolution gets a higher QM value at the same PSNR.

Note that SV and TV seem to resemble the Spatial Information (SI) and Temporal Information (TI) terms used in the SITI metric proposed in [44]. However, the motivation and usage of

SOHN et al.: FULL-REFERENCE VIDEO QUALITY METRIC FOR FULLY SCALABLE AND MOBILE SVC CONTENT


Fig. 7. Video sequences used in the verification experiment.

SI and TI on the one hand, and SV and TV on the other hand, are different in the respective quality metrics. In [44], SI and TI are used to measure the perceptual impairments that are the result of spatial and temporal artifacts of digitally compressed video. As such, Sobel-filtered images and difference images between successive frames are used to extract the SI and TI values, respectively. On the other hand, the SV and TV values are used in our research to reflect the spatial and temporal characteristics of scalable video content, where the scalable video content may have been the subject of adaptations along the spatial, temporal, or SNR quality axis. Strictly speaking, the second term and the third term in (11) are more similar to the use of SI and TI than to the use of SV and TV.

IV. VERIFICATION EXPERIMENT

This section describes an experiment that was used to verify the reliability of our video quality metric, as defined by (11). Three video sequences were used in this experiment. Note that these video sequences were not used for the actual construction of our quality metric (discussed in Section III). Each of the video sequences was coded using three different QP values (25, 35, 45), three different frame rates (7.5, 15, 30), and three different resolutions (QCIF, 264x216, CIF). As such, 81 video sequences were used for verification purposes. Further, since our quality metric is targeting deployment in mobile usage environments, the generated video sequences are strictly compliant with the Scalable Baseline Profile of SVC.

The three original video sequences have significantly different SV and TV values. This makes it possible to show that our quality metric is not specialized for specific types of video content. The video sequences used are Akiyo, Foreman, and Fall Road. Representative screenshots for the three video sequences can be found in Fig. 7.
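For comparison with SV and TV, a minimal sketch of the SI/TI measures discussed above in the context of [44]: SI is taken as the maximum over time of the standard deviation of a Sobel-filtered frame, and TI as the maximum over time of the standard deviation of successive-frame differences. The Sobel filter is implemented here with simple array shifts; the aggregation by `max` follows common SI/TI practice and is an assumption of this sketch.

```python
import numpy as np

def _sobel(img: np.ndarray) -> np.ndarray:
    """Sobel gradient magnitude via array shifts (border pixels cropped)."""
    p = img.astype(np.float64)
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)

def si_ti(frames):
    """SI: max std of Sobel-filtered frames; TI: max std of frame differences."""
    si = max(float(_sobel(f).std()) for f in frames)
    ti = max(float((b.astype(np.float64) - a.astype(np.float64)).std())
             for a, b in zip(frames, frames[1:]))
    return si, ti

static = [np.full((64, 64), 100.0)] * 3                    # no edges, no motion
rng = np.random.default_rng(1)
moving = [rng.uniform(0, 255, (64, 64)) for _ in range(3)]  # busy, fast-changing
print(si_ti(static))  # -> (0.0, 0.0)
```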
Note that the three sequences cover different combinations of low and high SV and TV values; for instance, Akiyo has both a low SV and a low TV (the SV and TV values can also be found in Tables I and II). The experimental conditions were similar to the ones described in Section II.

The results of the verification experiment are summarized in Fig. 8 and in Tables IV and V. Fig. 8 plots the QM values as computed by the proposed quality metric, defined in (11), versus the assessed subjective quality, defined in (1). It can be seen that the points on each chart in Fig. 8 are closely distributed around a straight line, which implies that a high correlation exists between the estimated subjective quality and the proposed QM. Table IV shows the correlation coefficient between the quality metric score and the subjective quality. Four metrics were used in the verification experiment, including the traditional PSNR metric, the metric of [9] (a quality metric that does not take into account spatial resolution


Fig. 8. The proposed QM versus the assessed subjective quality: (a) Akiyo; (b) Foreman; and (c) Fall Road.

TABLE IV
CORRELATION WITH SQ

and spatial variance), the metric of [10] (a quality metric that does not consider the influence of SV on the subjective quality), and the quality metric proposed in this paper. In order to measure the PSNR of scalable video content, frame copying



TABLE V CORRELATION WITH SQ AT CIF RESOLUTION

bit streams. It can be expected that the construction and the experimental validation of the proposed quality metric would benefit from the use of a higher number of video sequences with diverse values for SV and TV. The number of viewers per experiment can also be increased in order to obtain more subjective quality data.

was used to deal with temporal scalability and down-sampling was used to deal with spatial scalability (see Fig. 3). These two operations explain the low correlation between PSNR and SQ. Although it is natural that the other quality metrics used in the verification experiment outperform PSNR, the correlation between PSNR and SQ was included in the comparative experiment to illustrate the minimum baseline performance of the proposed quality metric. Note that the quality metric in Table IV is different from the quality metric in Table III: the former is parameterized on the temporal, spatial, and SNR properties of a bit stream, while the latter is only able to take into account the temporal and SNR characteristics of a bit stream.

As shown in Table IV, PSNR cannot efficiently deal with changes in frame rate and spatial resolution. The low correlation between PSNR and SQ can be attributed to frame copying and down-sampling operations that alter the video signal. Further, the metric of [9] cannot take into account the spatial resolution of a video sequence. Consequently, as shown in Table IV, the metric of [10] is characterized by a higher performance than PSNR and the metric of [9], as [10] is parameterized in terms of frame rate and spatial resolution. However, the metric of [10] shows a decreased performance for Fall Road. Indeed, the performance of [10] abruptly decreases for video sequences with a high SV. This is due to the fact that the construction of [10] relied on video sequences that were similar to Foreman in terms of SV (the SV value of Foreman is equal to 0.12, which is Normal SV). As shown in Figs. 5 and 6, the discrepancy between subjective quality and PSNR, as well as between subjective quality and Default [10], is higher for SV levels other than Normal SV. PSNR is the main factor used in matching the subjective quality in [9]. Hence, the performance of [9] was the worst for video sequences with a high SV. On the other hand, our quality metric considers both SV and TV, and remains accurate when the SV value varies from a low to a high value.
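The figures reported in Tables IV and V are correlation coefficients between metric scores and subjective quality; assuming the linear (Pearson) correlation coefficient discussed in [41], the computation can be sketched as follows (the two score lists are made-up illustrations, not data from the paper):

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson linear correlation coefficient between two score lists."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xd, yd = x - x.mean(), y - y.mean()
    return float((xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum()))

qm_scores = [62.0, 71.5, 80.2, 88.9]  # hypothetical metric outputs
sq_scores = [60.0, 70.0, 82.0, 90.0]  # hypothetical subjective scores
print(round(pearson(qm_scores, sq_scores), 2))
```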
Hence, the proposed quality metric shows a uniformly high performance for all tested video sequences, regardless of their temporal and spatial characteristics. Table V shows the correlation between the quality metric score and the subjective quality at CIF resolution. Since the metric of [9] does not take into account spatial scalability, the spatial resolution was set to CIF in order to allow for a fair comparison. As shown in Table V, the results are in line with the results provided in Table IV. The correlation for the proposed metric is higher, since [9] does not take SV into account. When SV is high, the proposed metric better reflects the subjective quality.

In this paper, three video sequences were used to determine the coefficients in (11) and to validate the proposed quality metric, mainly for practical reasons, as the scalable coding of the three video sequences already resulted in a total of 81 test

V. CONCLUSIONS

Multimedia resources are increasingly consumed in diverse usage environments. The use of SVC-compliant bit streams allows taking into account the constraints and capabilities of heterogeneous usage environments. However, in order to optimally adapt scalable video bit streams along the spatial, temporal, or perceptual quality axis, a quality metric is needed that reliably models subjective quality.

In this paper, we have proposed a quality metric that offers support for fully scalable SVC bit streams that are suited for delivery in mobile usage environments. The proposed quality metric allows modeling the temporal, spatial, and perceptual quality characteristics of SVC-compliant bit streams by taking into account several properties of the compressed bit streams, such as the temporal and spatial variance of the video content, the frame rate, the spatial resolution, and PSNR values. Our experimental results show that the average correlation coefficient for the video sequences tested is as high as 0.95 (compared to a value of 0.60 when only using the traditional PSNR quality metric). The proposed quality metric also shows a uniformly high performance for video sequences with different temporal and spatial characteristics.

Future work will investigate the use of the proposed quality metric for the optimal adaptation of SVC bit streams. The influence of packet losses and bandwidth fluctuations on the behavior of the proposed quality metric will also be analyzed.

REFERENCES

[1] S.-F. Chang and A. Vetro, "Video adaptation: Concepts, technologies, and open issues," Proc. IEEE, vol. 93, no. 1, pp. 148-158, Jan. 2005.
[2] A. Perkis, Y. Abdeljaoued, C. Christopoulos, T. Ebrahimi, and J. F. Chicharo, "Universal multimedia access from wired and wireless systems," Circuits, Syst., Signal Process., vol. 20, no. 3, pp. 387-402, Feb. 2005.
[3] R. V. Babu and A. Perkis, "Evaluation and monitoring of video quality for UMA enabled video streaming systems," Multimedia Tools Appl., vol. 32, no. 2, pp. 211-231, Apr. 2008.
[4] A. T. Campbell and G. Coulson, "QoS adaptive transports: Delivering scalable media to the desktop," IEEE Netw. Mag., vol. 11, no. 2, pp. 18-27, Mar. 1997.
[5] J. W. Kang, S.-H. Jung, J.-G. Kim, and J.-W. Hong, "Development of QoS-aware Ubiquitous Content Access (UCA) testbed," IEEE Trans. Consum. Electron., vol. 53, no. 1, pp. 197-203, Feb. 2007.
[6] L. Skorin-Kapov and M. Matijasevic, "End-to-end QoS signaling for future multimedia services in the NGN," in Proc. Next Gen. Teletraffic Wired/Wireless Adv. Netw., LNCS, Sep. 2006, pp. 408-419.
[7] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1103-1120, Sep. 2007.
[8] Joint Draft 9 of SVC Amendment, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Jan. 2007.
[9] R. Feghali, F. Speranza, D. Wang, and A. Vincent, "Video quality metric for bit rate control via joint adjustment of quantization and frame rate," IEEE Trans. Broadcast., vol. 53, no. 1, pp. 441-446, Mar. 2007.
[10] C. S. Kim, D. Suh, T. M. Bae, and Y. M. Ro, "Quality metric for H.264/AVC scalable video coding with full scalability," Proc. SPIE, pp. 64921P-1-64921P-12, Jan. 2007.


[11] "Final report from the video quality experts group on the validation of objective models of video quality assessment, phase II," VQEG, Aug. 2003.
[12] Objective Perceptual Multimedia Video Quality Measurement in the Presence of a Full Reference, ITU-T Recommendation J.247, Aug. 2008.
[13] E. Ong, X. Yang, W. Lin, Z. Lu, and S. Yao, "Perceptual quality metric for compressed videos," in Proc. Int. Conf. Acoust., Speech, Signal Process., Mar. 2005, pp. 581-584.
[14] E. Ong, W. Lin, Z. Lu, S. Yao, X. Yang, and F. Moschetti, "Low bit rate video quality assessment based on perceptual characteristics," in Proc. Int. Conf. Image Process., Sep. 2003, pp. 182-192.
[15] S. Winkler, "A perceptual distortion metric for digital color video," Proc. SPIE, pp. 175-184, Jan. 1999.
[16] S. Wolf and M. Pinson, "Video quality measurement techniques," NTIA Report 02-392, 2002.
[17] G.-M. Muntean, P. Perry, and L. Murphy, "Subjective assessment of the quality-oriented adaptive scheme," IEEE Trans. Broadcast., vol. 53, no. 3, pp. 1-11, Sep. 2005.
[18] M. Pinson and S. Wolf, "A new standardized method for objectively measuring video quality," IEEE Trans. Broadcast., vol. 50, no. 3, pp. 312-446, Sep. 2004.
[19] U. Engelke and H.-J. Zepernick, "Perceptual-based quality metrics for image and video services: A survey," in Proc. 3rd Euro-NGI Conf. Next Gen. Internet Netw., May 2007, pp. 190-197.
[20] E. C. Reed and F. Dufaux, "Constrained bit-rate control for very low bit-rate streaming-video applications," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 7, pp. 882-888, Jul. 2001.
[21] P. Cuenca, L. Orozco-Barbosa, A. Carrido, and F. Quiles, "Study of video quality metrics for MPEG-2 based video communications," in Proc. IEEE Pacific Rim Conf. Commun., Comput. Signal Process., Aug. 1999, pp. 280-283.
[22] E. Ong, W. Lin, Z. Lu, S. Yao, and M. H. Loke, "Perceptual quality metric for H.264 low bit rate videos," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2006, pp. 677-680.
[23] O. Nemethova, M. Ries, M. Rupp, and E. Siffel, "Quality assessment for H.264 coded low-rate and low-resolution video sequences," in Proc. Conf. Internet Inf. Technol. (CITT), Nov. 2004, pp. 136-140.
[24] S. Winkler and F. Dufaux, "Video quality evaluation for mobile applications," in Proc. Visual Commun. Image Process. Conf. (VCIP), Jul. 2003, pp. 593-603.
[25] D. Gill, J. P. Cosmas, and A. Pearmain, "Mobile audio-visual terminal: System design and subjective testing in DECT and UMTS networks," IEEE Trans. Veh. Technol., vol. 49, no. 4, pp. 1378-1391, Jul. 2000.
[26] M. Ries, O. Nemethova, and M. Rupp, "Video quality estimation for mobile H.264/AVC video streaming," J. Commun., vol. 3, no. 1, pp. 41-50, Jan. 2008.
[27] H. Knoche, J. D. McCarthy, and M. A. Sasse, "Can small be beautiful? Assessing image resolution requirements for mobile TV," in Proc. ACM Multimedia, Nov. 2005, pp. 829-838.
[28] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. London: Prentice Hall, 2002, pp. 61-63.
[29] Y. S. Kim, Y. J. Jung, T. C. Thang, and Y. M. Ro, "Bit-stream extraction to maximize perceptual quality using quality information table in SVC," in Proc. SPIE, Jan. 2006, pp. 607723-1-607723-11.
[30] H. Koumaras, A. Kourtis, D. Martakos, and J. Lauterjung, "Quantified PQoS assessment based on fast estimation of the spatial and temporal activity level," Multimedia Tools Appl., vol. 34, no. 3, pp. 355-374, Sep. 2007.
[31] N. Montard and P. Bretillon, "Objective quality monitoring issues in digital broadcasting networks," IEEE Trans. Broadcast., vol. 51, no. 3, pp. 269-275, Sep. 2005.
[32] Methodology for the Subjective Assessment of the Quality of Television Pictures, ITU-R Recommendation BT.500-11, 2002.
[33] "FTP directory," [Online]. Available: ftp://ftp.tnt.uni-hannover.de/pub/
[34] P. Corriveau, C. Gojmerac, B. Hughes, and L. Stelmach, "All subjective scales are not created equal: The effects of context on different scales," Signal Process., vol. 77, no. 1, pp. 1-9, Aug. 1999.
[35] Q. Huynh-Thu and M. Ghanbari, "A comparison of subjective video quality assessment methods for low-bit rate and low-resolution video," in Proc. IASTED Int. Conf. Signal Image Process., Aug. 2005, pp. 70-76.
[36] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, "Rate-constrained coder control and comparison of video coding standards," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688-703, Jul. 2003.


[37] T. Sikora, "The MPEG-7 visual standard for content description-an overview," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, pp. 696-702, Jun. 2001.
[38] C. S. Won, D. K. Park, and S.-J. Park, "Efficient use of MPEG-7 edge histogram descriptor," ETRI Journal, vol. 24, no. 1, pp. 23-30, Feb. 2002.
[39] B. S. Manjunath, P. Salembier, and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface. Hoboken, NJ: John Wiley & Sons Ltd., 2002.
[40] A. Divakaran, "An overview of MPEG-7 motion descriptors and their applications," in Proc. 9th Int. Conf. Comput. Anal. Images Patterns, LNCS, 2001, pp. 29-40.
[41] J. L. Rodgers and A. W. Nicewander, "Thirteen ways to look at the correlation coefficient," The American Statistician, vol. 42, no. 1, pp. 59-66, Feb. 1988.
[42] Information Technology - Multimedia Content Description Interface - Part 3: Visual, ISO/IEC 15938-3, First edition, May 2002, pp. 63-65.
[43] "FindGraph: quick and easy," [Online]. Available: http://www.uniphiz.com/findgraph.htm
[44] A. A. Webster, C. T. Jones, M. H. Pinson, S. D. Voran, and S. Wolf, "An objective video quality assessment system based on human perception," in Proc. SPIE, Feb. 1993, pp. 15-26.

Hosik Sohn received the B.S. degree from Korea Aerospace University, Goyang, South Korea, in 2007 and the M.S. degree from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2009. He is currently working toward the Ph.D. degree at KAIST. His research interests include video adaptation, visual quality measurement, bio-cryptography, multimedia security, Scalable Video Coding (SVC), and JPEG XR.

Hana Yoo received the B.S. degree from the Department of Electronics at Inha University, Incheon, South Korea. She worked as an engineer for the Department of Liquid Crystal Displays at Samsung Electronics from 2005 to 2006. In 2008, she received the M.S. degree from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. She is currently working as a research staff member for the Advanced Infotainment Research Team at Hyundai Motors. Her research interests include Scalable Video Coding (SVC) and video quality measurement in mobile environments.

Wesley De Neve received the M.Sc. degree in computer science and the Ph.D. degree in computer science engineering from Ghent University, Ghent, Belgium, in 2002 and 2007, respectively. He is currently working as a senior researcher for the Image and Video Systems Lab (IVY Lab), in the position of assistant research professor. IVY Lab is part of the Department of Electrical Engineering of KAIST, the Korea Advanced Institute of Science and Technology (Daejeon, South Korea). Prior to joining KAIST, he was a post-doctoral researcher at both Ghent University - IBBT in Belgium and the Information and Communications University (ICU) in South Korea. His research interests and areas of publication include the coding, annotation, and adaptation of image and video content, GPU-based video processing, efficient XML processing, and the Semantic and the Social Web.


Cheon Seog Kim received the B.S. degree from the Department of Electrical Engineering at Hong-Ik University, Seoul, South Korea, in 1981 and the M.S. degree from the School of Engineering at Korea University, Seoul, South Korea, in 1983. He received the Ph.D. degree from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2009. In 1984, he worked for Taihan Electric Wire Co., Ltd. From 1992 to 2008, he worked as a chief scientist for security service provider ADT CAPS and for multimedia solutions provider Curon. He currently works as a CTO for Woori CSt. He participated in the MPEG-21 international standardization effort, contributing to the definition of the MPEG-21 DIA visual impairment descriptors and modality conversion. His major research interests are video quality measurement, video coding, Scalable Video Coding (SVC), and the design of multimedia systems.


Yong Man Ro (M’92–SM’98) received the B.S. degree from Yonsei University, Seoul, South Korea, and the M.S. and Ph.D. degrees from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. In 1987, he was a visiting researcher at Columbia University, and from 1992 to 1995, he was a visiting researcher at the University of California, Irvine and KAIST. He was a research fellow at the University of California, Berkeley and a visiting professor at the University of Toronto in 1996 and 2007, respectively. He is currently holding the position of full professor at KAIST, where he is directing the Image and Video Systems Lab. He participated in the MPEG-7 and MPEG-21 international standardization efforts, contributing to the definition of the MPEG-7 texture descriptor, the MPEG-21 DIA visual impairment descriptors, and modality conversion. His research interests include image/video processing, multimedia adaptation, visual data mining, image/video indexing, and multimedia security. Dr. Ro received the Young Investigator Finalist Award of ISMRM in 1992 and the Scientist Award in Korea in 2003. He served as a TPC member of international conferences such as IWDW, WIAMIS, AIRS, and CCNC, and he was the program co-chair of IWDW 2004.