VIDEO-QUALITY ESTIMATION BASED ON REDUCED-REFERENCE MODEL EMPLOYING ACTIVITY-DIFFERENCE

Toru YAMADA, Yoshihiro MIYAMOTO, and Masahiro SERIZAWA
Common Platform Software Research Laboratories, NEC Corporation, Japan

ABSTRACT

This paper presents a Reduced-Reference based video-quality estimation method suitable for individual end-user quality monitoring of IPTV services. With the proposed method, activity values (spatial-frequency levels) for individual given-size pixel blocks of an original video are transmitted to end-user terminals. At the end-user terminals, the video quality of a received video is estimated on the basis of the activity-difference between the original video and the received video. Psychovisual weighting with respect to the activity-difference is also applied to improve estimation accuracy. In addition, low-bit-rate transmission is achieved by using temporal sub-sampling and by transmitting only the lower six bits of each activity value. The proposed method achieves accurate video-quality estimation using only low-bit-rate original video information (15 kbps for SDTV). The correlation coefficient between actual subjective video quality and estimated quality is 0.901 with 15 kbps side information.
1. INTRODUCTION

IPTV appears promising as an improvement over conventional TV broadcasting. With IPTV services, however, since network conditions vary for individual users, end-user video-quality monitoring is an important issue, and such monitoring must be automatic since subjective video-quality evaluation by human observers is impractical. The need for objective video-quality metrics having a high correlation to subjective video quality has been considered by the Video Quality Experts Group (VQEG) [1]. In ITU-T recommendation J.143 [2], objective video-quality metrics are categorized into the following three types: 1) Full Reference (FR) models: evaluation of video quality by means of a comparison between an original video and a processed video.
[Fig. 1. Video Quality Monitoring based on an RR model: a video server extracts original video information from the original video and transmits it over the network together with the video stream; the end-user terminal estimates the quality of the received video from this information.]
2) No Reference (NR) models: evaluation of video quality on the basis of processed frames alone. 3) Reduced-Reference (RR) models: evaluation of video quality using both a processed video and a small amount of information extracted from an original video. The FR model has been described in specific terms in ITU-T recommendation J.144 [3], but since end-user terminals in IPTV applications would not be able to refer to original frames on the spot, this model is not suitable for real-time end-user video-quality monitoring. By way of contrast, although the NR model can evaluate video quality without reference to the original video and its system implementation is relatively easy, it is difficult for it to achieve accurate quality estimation. In response to this NR-model drawback, ITU-T recommendation J.147 [4] presents a method for inserting invisible markers into the original video and determining degradation of those markers at end-user terminals. Unfortunately, the insertion itself of invisible markers can lead to video-quality degradation. With regard to the RR model, since it transmits feature parameters extracted from the original video to end-users at low bit rates (see Figure 1), it is not necessary to transmit the original video itself, as it would be with the FR model, and it can be expected to achieve more accurate quality estimation than an NR model. Typical RR models extract a small number of pixels from the original video at a video server and transmit information with respect to those pixels to end-user
terminals. Error (as reflected by PSNR) between the original pixels and corresponding processed pixels is then calculated at the end-user terminals, and the average PSNR for an entire frame can be estimated from the calculated partial PSNR. For example, ITU-T recommendation J.240 [5] attempts accurate PSNR estimation by using spread-spectrum and orthogonal-transform techniques. This approach estimates PSNR rather than subjective video quality. In this paper, we propose an RR-based video-quality metric for estimating subjective quality. With it, activity values for individual given-size pixel blocks are transmitted to end-user terminals. These values indicate spatial-frequency levels and are used as original video information. Video quality is estimated on the basis of the activity-difference between the original video and the received video. Psychovisual weighting operations with respect to the activity-difference are also applied to improve estimation accuracy. In addition, low-bit-rate transmission of the feature parameters is achieved by using temporal sub-sampling and by transmitting only the lower bits of each activity value. The subsequent sections of this paper are organized as follows: Section 2 describes the proposed algorithm for estimating subjective video quality using activity-difference values; Section 3 discusses an evaluation of the performance of the algorithm; and Section 4 summarizes our work.

2. PROPOSED METHOD

The proposed method first calculates activity-difference values, and then psychovisual weighting operations are applied one by one. In this section, we first describe the basic concept of the activity-difference and then explain the psychovisual weighting operations.

2.1. PSNR Calculation Based on Activity-Difference

To calculate PSNR, it is necessary to calculate the mean square error (MSE) of the luminance values between the original video and the received video. Let $X_i$ be a luminance value in a 16x16 pixel block of the original video, $Y_i$ be the luminance value of the received video at the same position as $X_i$, and $e_i$ be the induced noise, i.e.,

$$Y_i = X_i + e_i. \tag{1}$$

We now assume that $e_i$ is independent of $X_i$ and that

$$E(e_i) = 0, \tag{2}$$

where $E$ is a function to calculate an average value. From this assumption, the following relation is obtained:

$$\bar{Y} = E(Y_i) = E(X_i + e_i) = E(X_i) + E(e_i) = E(X_i) = \bar{X}, \tag{3}$$

where $\bar{X}$ and $\bar{Y}$ are the average values of the luminance values in the blocks. For an RR approach, since all pixels cannot be used, we must use a smaller amount of information. We now consider using the standard deviation of the luminance values. The standard deviation for each 16x16 pixel block is defined as:

$$\sigma(X) = \sqrt{\mathrm{Var}(X_i)} = \sqrt{\frac{1}{256}\sum_{i=0}^{255}\left(X_i - \bar{X}\right)^2} = \sqrt{E\left[(X_i - \bar{X})^2\right]}. \tag{4}$$
The square error (SE) of the standard deviation is calculated as:

$$\begin{aligned}
SE_{\sigma} &= \left(\sigma(X) - \sigma(Y)\right)^2 \\
&= \left(\sqrt{E[(X_i - \bar{X})^2]} - \sqrt{E[(Y_i - \bar{Y})^2]}\right)^2 \\
&= \mathrm{Var}(X_i) + E[(X_i + e_i - \bar{X})^2] - 2\sqrt{\mathrm{Var}(X_i)\,E[(X_i + e_i - \bar{X})^2]} \\
&= 2\,\mathrm{Var}(X_i) + 2E[(X_i - \bar{X})e_i] + E[e_i^2] \\
&\qquad - 2\sqrt{\mathrm{Var}(X_i)\left\{\mathrm{Var}(X_i) + 2E[(X_i - \bar{X})e_i] + E[e_i^2]\right\}} \\
&= 2\,\mathrm{Var}(X_i) + E[e_i^2] - 2\sqrt{\mathrm{Var}(X_i)\left\{\mathrm{Var}(X_i) + E[e_i^2]\right\}} \\
&= 2\,\mathrm{Var}(X_i) + E[e_i^2] - 2\,\mathrm{Var}(X_i)\sqrt{1 + \frac{E[e_i^2]}{\mathrm{Var}(X_i)}}.
\end{aligned} \tag{5}$$
Generally, since $E[e_i^2]/\mathrm{Var}(X_i)$ is small enough in compressed video sequences, the SE of the standard deviation can be described as:

$$SE_{\sigma} \approx 2\,\mathrm{Var}(X_i) + E[e_i^2] - 2\,\mathrm{Var}(X_i) = E[e_i^2]. \tag{6}$$

This shows that the SE of the standard deviation can approximate the MSE of the luminance values in the case of $E[e_i^2]/\mathrm{Var}(X_i) \ll 1$.

2.2. Psychovisual Weighting Operations

To improve the correlation with subjective quality, psychovisual weighting operations are applied to the activity-difference value $E_{i,j}$ of each block and, for sequence-level artifacts, to the video-quality score $VQ$ calculated from these values.

2.2.1. Weighting for Spatial Frequency

Degradation in blocks with high spatial frequency is less noticeable to human observers, so the activity-difference value for each block is weighted according to the block activity $Act_{i,j}$:

$$E_{i,j} \Leftarrow \begin{cases} E_{i,j} \times W_{SF}, & Act_{i,j} > Th_{SF} \\ E_{i,j}, & \text{otherwise} \end{cases}. \tag{11}$$
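As an illustration, the following is a minimal NumPy sketch of the per-block activity computation and the spatial-frequency weighting of eq. (11), assuming (as Section 2.1 suggests) that the activity value is the per-block standard deviation of eq. (4) and that $E_{i,j}$ is the squared activity-difference, which the paper uses to approximate the block MSE (eq. (6)). All function names are ours:

```python
import numpy as np

BLOCK = 16    # block size used in Section 2.1
W_SF = 0.6    # spatial-frequency weight (Table II)
TH_SF = 25    # spatial-frequency threshold (Table II)

def block_activities(luma):
    """Per-block standard deviation, eq. (4), used as the activity value.

    luma: 2-D array of luminance values whose height and width are
    multiples of BLOCK; returns one activity value per 16x16 block.
    """
    h, w = luma.shape
    blocks = luma.astype(np.float64).reshape(h // BLOCK, BLOCK,
                                             w // BLOCK, BLOCK)
    return blocks.std(axis=(1, 3))

def weighted_activity_difference(act_org, act_rcv):
    """Squared activity-difference per block, weighted as in eq. (11)."""
    e = (act_org - act_rcv) ** 2          # approximates block MSE, eq. (6)
    return np.where(act_org > TH_SF, e * W_SF, e)
```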
2.2.2. Weighting for Difference in Specific Color Region

A human observer tends to gaze more at video regions in which humans are present. We define blocks whose pixels mainly consist of colors close to human skin colors as Region of Interest (ROI) blocks and apply a weighting operation to the activity-difference of each ROI block. Preliminary experiments indicated that the color range close to that of human skin can usefully be defined as $48 \le Y \le 224$, $104 \le Cb \le 125$, and $135 \le Cr \le 171$. Naturally, ROI blocks include not only human objects but also other objects colored in the same range. For a given block and its adjacent eight blocks, if the number of pixels within the above color range ($NumROIPixels$) is more than 175, the block is defined as an ROI block. We then apply the following operation:

$$E_{i,j} \Leftarrow \begin{cases} E_{i,j} \times W_{CR}, & NumROIPixels > Th_{CR} \\ E_{i,j}, & \text{otherwise} \end{cases}. \tag{12}$$
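A minimal sketch of this ROI weighting, assuming the pixel count is taken over the 16x16 block and its eight neighbours as described above (all names are ours):

```python
import numpy as np

# Skin-like color range from the preliminary experiments
Y_MIN, Y_MAX = 48, 224
CB_MIN, CB_MAX = 104, 125
CR_MIN, CR_MAX = 135, 171
W_CR, TH_CR = 2.0, 175    # Table II

def apply_roi_weight(E, y, cb, cr, block=16):
    """Eq. (12): weight E[i, j] when the block plus its adjacent eight
    blocks contain more than TH_CR pixels in the skin-like color range."""
    skin = ((Y_MIN <= y) & (y <= Y_MAX) &
            (CB_MIN <= cb) & (cb <= CB_MAX) &
            (CR_MIN <= cr) & (cr <= CR_MAX))
    h, w = skin.shape
    counts = skin.reshape(h // block, block,
                          w // block, block).sum(axis=(1, 3))
    out = E.copy()
    rows, cols = counts.shape
    for i in range(rows):
        for j in range(cols):
            # pixels in the block and its adjacent eight blocks
            n = counts[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].sum()
            if n > TH_CR:
                out[i, j] *= W_CR
    return out
```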
2.2.3. Weighting for Blockiness Artifacts

Generally, since blockiness is the most annoying artifact, subjective video quality tends to be low for video sequences with a high blockiness level. In the proposed method, a weighting operation for the blockiness level is therefore incorporated for more accurate video-quality estimation. Since detected blockiness artifacts tend to affect overall subjective video quality, the weighting operation is applied not to the activity-difference values for each block ($E_{i,j}$) but to the calculated video-quality score ($VQ$). To estimate the blockiness level, activity values for 8x8 pixel blocks in the received video sequences are used. As may be seen in Figure 2, the two activity values of horizontally adjacent blocks ($Act_{Block1}$, $Act_{Block2}$) are calculated, and the average of the two activity values ($Act_{Ave}$) is calculated by

$$Act_{Ave} = \frac{1}{2}\left(Act_{Block1} + Act_{Block2}\right). \tag{13}$$

[Fig. 2. Information for calculating blockiness level: two horizontally adjacent 8x8 blocks with activities $Act_{Block1}$ and $Act_{Block2}$, and the pixels along their shared boundary, $Y_{1,0}$-$Y_{1,7}$ and $Y_{2,0}$-$Y_{2,7}$.]

Next, the absolute difference of the luminance values along the boundary between the two blocks is calculated. When, as illustrated in Figure 2, $Y_{1,i}$ represents a luminance value in the left block along the boundary and $Y_{2,i}$ represents one in the right block, the average value of the absolute luminance difference $DiffBound$ may be expressed as:

$$DiffBound = \frac{1}{8}\sum_{i=0}^{7}\left|Y_{1,i} - Y_{2,i}\right|. \tag{14}$$

The blockiness level ($BL$) is defined by the ratio between $Act_{Ave}$ and $DiffBound$, i.e.,

$$BL = \frac{DiffBound}{Act_{Ave} + 1}. \tag{15}$$

The average value of $BL$ is calculated by

$$BL_{Ave} = \frac{1}{N \times M}\sum_{i=0}^{N-1}\sum_{j=0}^{M-1} BL_{i,j}, \tag{16}$$

where $N$ is the number of frames and $M$ is the number of blocks per frame. For the rightmost blocks, the $BL$ value is set to zero. If $BL_{Ave}$ is larger than a predetermined threshold, the video sequence is considered to include a large level of blockiness, and a weighting operation is applied to the calculated video-quality value:

$$VQ \Leftarrow \begin{cases} VQ / W_{BL}, & BL_{Ave} > Th_{BL} \\ VQ, & \text{otherwise} \end{cases}. \tag{17}$$
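A sketch of the blockiness computation of eqs. (13)-(17) for a single frame (names are ours; averaging over all $N$ frames is left to the caller):

```python
import numpy as np

W_BL, TH_BL = 1.15, 1.0    # Table II

def frame_blockiness(luma, block=8):
    """Mean blockiness level of one frame, eqs. (13)-(16).

    For each pair of horizontally adjacent 8x8 blocks, the mean of the
    two block activities (eq. (13)) and the mean absolute luminance
    difference across the shared boundary (eq. (14)) give the per-pair
    blockiness BL (eq. (15)); rightmost blocks keep BL = 0.
    """
    luma = luma.astype(np.float64)
    h, w = luma.shape
    acts = luma.reshape(h // block, block, w // block, block).std(axis=(1, 3))
    bl = np.zeros_like(acts)
    for i in range(acts.shape[0]):
        for j in range(acts.shape[1] - 1):
            act_ave = 0.5 * (acts[i, j] + acts[i, j + 1])            # eq. (13)
            left = luma[i * block:(i + 1) * block, (j + 1) * block - 1]
            right = luma[i * block:(i + 1) * block, (j + 1) * block]
            diff_bound = np.abs(left - right).mean()                 # eq. (14)
            bl[i, j] = diff_bound / (act_ave + 1.0)                  # eq. (15)
    return bl.mean()    # eq. (16), restricted to one frame

def apply_blockiness_weight(vq, bl_ave):
    """Eq. (17): penalize the video-quality score for strong blockiness."""
    return vq / W_BL if bl_ave > TH_BL else vq
```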
2.2.4. Weighting for Local Impairments

Local impairments generated by transmission errors are also annoying and result in low subjective video quality. In the proposed method, a weighting operation for local impairments is therefore incorporated. Since the annoyance of local impairments tends to affect overall subjective video quality, the weighting operation is applied not to the activity-difference values for each block ($E_{i,j}$) but to the calculated video-quality score ($VQ$). When a transmission error occurs, error concealment is applied to the impaired regions to conceal the video-quality degradation. If the error concealment is not effective, video quality is largely degraded, and the correlation between the original video and the received video is lost in the locally impaired regions. To detect this loss of correlation, the proposed method uses the difference of the variance values of the activity in the neighboring blocks. For a given block and its adjacent eight blocks, the variance of the activity values is calculated for both the original video ($ActVar_{Org}$) and the received video ($ActVar_{Deg}$), and the absolute difference is calculated as

$$\Delta ActVar = \left|ActVar_{Org} - ActVar_{Deg}\right|. \tag{18}$$

Next, the average of these absolute difference values is calculated for each frame. The ratio of the maximum ($\Delta ActVar_{Max}$) and minimum ($\Delta ActVar_{Min}$) of these per-frame averages is calculated as

$$LI = \begin{cases} \Delta ActVar_{Min} / \Delta ActVar_{Max}, & \Delta ActVar_{Max} \neq 0 \\ 1, & \Delta ActVar_{Max} = 0 \end{cases}. \tag{19}$$

This value is used for detecting local impairments. If this value is smaller than a pre-determined threshold, the video sequence is considered to include large local impairments, and a weighting operation is applied to the calculated video-quality value:

$$VQ \Leftarrow \begin{cases} VQ / W_{LI}, & LI < Th_{LI} \\ VQ, & \text{otherwise} \end{cases}. \tag{20}$$
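A sketch of the local-impairment detection of eqs. (18)-(20), assuming per-block activity maps for every frame are available for both videos (all names are ours):

```python
import numpy as np

W_LI, TH_LI = 1.15, 1.67    # Table II

def local_impairment(act_org, act_rcv):
    """LI value of eq. (19) for a whole sequence.

    act_org, act_rcv: activity maps of shape (frames, rows, cols),
    one activity value per block, for the original and received video.
    """
    per_frame = []
    frames, rows, cols = act_org.shape
    for f in range(frames):
        diffs = []
        for i in range(rows):
            for j in range(cols):
                # variance of the activity over the block and its
                # adjacent eight blocks, for both videos
                win = (slice(max(i - 1, 0), i + 2),
                       slice(max(j - 1, 0), j + 2))
                d = abs(act_org[f][win].var() - act_rcv[f][win].var())  # eq. (18)
                diffs.append(d)
        per_frame.append(np.mean(diffs))    # per-frame average
    d_max, d_min = max(per_frame), min(per_frame)
    return d_min / d_max if d_max != 0 else 1.0    # eq. (19)

def apply_local_impairment_weight(vq, li):
    """Eq. (20): penalize VQ when LI falls below the threshold."""
    return vq / W_LI if li < TH_LI else vq
```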
2.3. Bit-Rate Control for Original Video Information
For RR models, it is necessary to reduce the bit rate for the original video information. The VQEG has specified original-video-information bit rates for SDTV of 256, 80, and 15 kbps [6]. With the proposed method, since it is only necessary to extract a single activity value from each block, bit rates can be greatly reduced without difficulty. To meet the respective bit rates, we employ the following protocol:
1) 256 kbps: transmit activity values as is.
2) 80 kbps: apply temporal sub-sampling every 4 frames.
3) 15 kbps: apply temporal sub-sampling every 11 frames and transmit the lower 6 bits of each activity value.
To achieve 15 kbps, we apply not only temporal sub-sampling but also partial-bit transmission: only the lower six bits of each eight-bit activity value are transmitted. In general, although the received video contains impairments, the original and the received video will still be highly correlated. Since it is therefore extremely likely that the higher bits of the activity values in the received video will be the same as those of the original video, only the lower bits need to be transmitted, as sketched below.
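A minimal sketch of the side-information extraction and of the partial-bit scheme, under the assumption (ours) that the receiver simply reuses the two higher bits of the activity value computed from the received video:

```python
def extract_side_info(frame_activities, bitrate_kbps):
    """Select the activity values to transmit (Section 2.3).

    frame_activities: list of per-frame NumPy arrays of 8-bit
    activity values; returns the values to send at the given rate.
    """
    if bitrate_kbps == 256:
        return frame_activities                     # transmit as is
    if bitrate_kbps == 80:
        return frame_activities[::4]                # every 4th frame
    if bitrate_kbps == 15:
        return [a & 0x3F for a in frame_activities[::11]]  # lower 6 bits
    raise ValueError("unsupported side-information bit rate")

def restore_activity(low6, local_activity):
    """Receiver side: rebuild a full 8-bit original activity value by
    combining the received lower six bits with the two higher bits of
    the activity computed from the received video (assumed equal to
    those of the original, as argued above)."""
    return (int(local_activity) & 0xC0) | int(low6)
```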
TABLE I
SUBJECTIVE VIDEO QUALITY TEST CONDITIONS

Test Methodology:    ITU-T P.910 (ACR-HR) [7]
Number of Subjects:  18
Video Codec:         MPEG-2 and H.264
Video Bitrate:       1-6 Mbps (CBR)
Video Duration:      5 seconds
Resolution:          720x480
Frame Rate:          29.97 fps
Video Sequences:     Training set: ballet, bus, cheer, flower
                     Test set: football, hockey, mobile, tennis
Packet Length:       1288 bytes
Transmission Error:  Packet loss ratio 0.1-0.5% (random)
3. EXPERIMENTAL RESULTS
We have applied the proposed method to the estimation of video quality and have evaluated its correlation to actual subjectively determined video quality. Subjective testing was conducted under the conditions shown in Table I. We first determined the parameters for the weighting operations. The training set of video sequences shown in Table I was used to determine the parameters: we calculated Pearson correlation coefficient values while varying the parameters and adopted the parameter set giving the best correlation coefficient. Table II shows the parameter values for the weighting operations. Table III shows the resulting correlation coefficients for the training set of video sequences. As may be seen, the proposed method provides better correlation at all bit rates, even at 15 kbps, than that provided by a PSNR-based method employing full reference to the original video. Since J.240, a conventional approach, estimates PSNR, its correlation to subjective quality is comparable to that of the actual PSNR. Table IV shows the resulting correlation coefficients for the test set of video sequences. As may be seen in Table IV, the correlation coefficients are higher than those of J.240 and even of PSNR, which is calculated by a full-reference approach. Table IV also shows the root mean square error (RMSE) between the actual subjective video quality and the estimated video quality for the test set. The RMSE values of the proposed
TABLE II
PARAMETERS FOR THE WEIGHTING OPERATIONS

Weighting Operation Type       Parameter  Value
Spatial Frequency Weighting    W_SF       0.6
                               Th_SF      25
Specific Color Weighting       W_CR       2.0
                               Th_CR      175
Blockiness Weighting           W_BL       1.15
                               Th_BL      1.0
Local Impairment Weighting     W_LI       1.15
                               Th_LI      1.67
TABLE III
EXPERIMENTAL RESULTS FOR TRAINING SET

Method                                           Correlation Coefficient
PSNR (FR model)                                  0.831
J.240 (RR model, 396 kbps)                       0.833
Proposed 256 kbps
  (a) Activity-Difference                        0.899
  (b) (a) + Spatial Freq. Weight.                0.906
  (c) (b) + Specified Color Weight.              0.922
  (d) (c) + Blockiness Weight.                   0.925
  (e) (d) + Local Impairment Weight.             0.940
Proposed 80 kbps
  (f) (e) + Sub-sampling (4 frames)              0.939
Proposed 15 kbps
  (g) (e) + Sub-sampling (11 frames)
      + Partial-Bit Transmission                 0.932
TABLE IV
EXPERIMENTAL RESULTS FOR TEST SET

Method                       Correlation Coefficient   RMSE
PSNR (FR model)              0.836                     0.695
J.240 (RR model, 396 kbps)   0.831                     0.703
Proposed 256 kbps            0.925                     0.481
Proposed 80 kbps             0.914                     0.512
Proposed 15 kbps             0.901                     0.566
method are smaller than those of PSNR and J.240. These results show that the proposed method can estimate subjective video quality more accurately and that it can be used for end-user video-quality monitoring of actual IPTV services.

4. CONCLUSION
We have proposed an RR-based video-quality estimation method that employs activity-difference values. The use of temporal sub-sampling and partial-bit transmission of activity values helps to achieve accurate subjective video-quality estimation with only a small amount of extra information. A correlation coefficient of 0.901 is achieved with the addition of only a 15 kbps transmission. This method is suitable for end-user quality monitoring of IPTV services.

5. REFERENCES

[1] The Video Quality Experts Group Web Site, http://www.its.bldrdoc.gov/vqeg/.
[2] ITU-T Recommendation J.143, "User requirements for objective perceptual video quality measurements in digital cable television," May 2000.
[3] ITU-T Recommendation J.144, "Objective perceptual video quality measurement techniques for digital cable television in the presence of a full reference," 2004.
[4] ITU-T Recommendation J.147, "Objective picture quality measurement method by use of in-service test signals," 2002.
[5] ITU-T Recommendation J.240, "Framework for remote monitoring of transmitted picture signal-to-noise ratio using spread-spectrum and orthogonal transform," 2004.
[6] VQEG RRNR-TV Group Test Plan Version 2.1, ftp://vqeg.its.bldrdoc.gov/Documents/Projects/rrnrtv/RRNR-tv_draft_2.1_changes_highlighted.doc, 2007.
[7] ITU-T Recommendation P.910, "Subjective video quality assessment methods for multimedia applications," 1996.