REGION OF INTEREST BASED SCALABLE IMAGE AND VIDEO CODING: A SUPERLATIVE STUDY
Yogananda Patnaik, Electrical Engineering Dept., NIT Rourkela.
[email protected]
Rajkumar Maharaju Electrical Engineering Dept. NIT Rourkela.
[email protected]
Abstract— A video consists of many frames, each of which carries valuable information in different ways. The region of interest (ROI) depends on what is to be extracted from the frames of the video and is therefore application oriented. For example, if the aim is to detect the face of a person in a video, the ROI is the face only. ROI scalability is of growing importance in fields such as video surveillance and hand-held devices, where some visual data is more significant and interesting than the non-ROI parts of the video. ROI coding can be applied to both image and video coding; it favours the regions that are more important so that they retain high visual quality. The main aim of scalable video coding (SVC) is to provide the three basic scalabilities, namely temporal, spatial and quality scalability, in an encoded bit stream. The ROI is a region that is important to a particular user, who may be using any of a range of heterogeneous display devices. Transmitting or decoding only the ROI at high quality requires less bandwidth than transmitting or decoding the entire frame at that quality. In this paper we extract the required ROI from the frames of a video sequence and carry out the coding process for scalability.
Index Terms—ROI, MCTF, SVC, EBCOT, SSIM.
I. INTRODUCTION
Recent developments in the field of scalable video coding have led to a renewed interest in ROI. ROI scalable video coding has been discussed in a few works [1], [2], [3], [4]. A signal that carries information can be compressed by removing redundancy from it. Statistical redundancy can be removed by a lossless compression system, in which case the signal can be reconstructed exactly at the receiver. A survey of the state of the art in SVC describes ROI operation as well suited to scalable video coding [5]. ROI functionality can be provided by macroblock-based selective enhancement. In [6], ROI support for SVC is obtained by ordering the blocks for coding and then remapping the layers. Some papers build an ROI scalable video coding framework on leaky prediction (LP) in order to send video over an error-prone network [7], [8]. Here a leaky factor is used to weight two predictions: the first is obtained by motion
978-1-4799-6986-9/14/$31.00©2014 IEEE
Dipti Patra, Sr.Member, IEEE Electrical Engineering Dept. NIT Rourkela.
[email protected]
estimation with the ROI layer of the reference frame, and the second by unrestricted motion estimation (ME) over the entire reference frame. ROI coding can be applied to either image or video coding, and it enhances the regions that are of greater importance in terms of visual quality. In [6], arbitrarily shaped ROIs are supported in SVC by a prioritized block coding and a layer remapping technique. ROI detection can be done in two ways: (a) in the pixel domain and (b) in the compressed domain [2]. The information for a particular ROI is determined by first calculating the actual position of the ROI within a frame of the video (e.g. by finding the macroblocks that lie at the upper-left and bottom-right corners of a rectangular ROI). The traditional method for ROI-based SVC is to lower the quantization parameter (QP) for the macroblocks inside the ROI and to assign a higher QP to the less important macroblocks that lie outside the ROI. This paper discusses how to find a rectangular ROI in a video and then process it into a scalable bit stream. Uniform encoding at maximum quality uses more bandwidth, whereas ROI-based encoding takes less bandwidth than the maximum-quality video because it lowers the quality in most areas of the frame. ROI-based coding therefore maintains quality where it matters and at the same time saves bit rate. The paper is organized as follows: Section II deals with ROI-based scalable video coding, Section III with the ROI detection scheme based on motion, Section IV with performance evaluation, and Section V with the simulation results of ROI-SVC for the foreman image and video. Section VI gives the conclusion. Finally, Section VII outlines the future scope.
II. ROI BASED SCALABLE VIDEO CODING
ROI-based SVC passes broadly through four phases: segmentation, ROI shape identification, ROI processing and, lastly, embedded block coding with optimized truncation (EBCOT). Fig. 1 shows the functional layout of the ROI-based SVC system. As the first step, segmentation and tracking are performed to obtain the ROI for each frame of the video. Motion compensated temporal filtering (MCTF) is then applied to the ROI and to the background separately, which substantially reduces the temporal correlation; as a result, a large number of bits can be assigned to the ROI, which gives better visual quality. After each MCTF step the low-frequency frames are passed to the next level of filtering, and the high-frequency frames form the output of that level.
Fig.1.ROI Based Scalable video coding model
The shape of an ROI is defined using a mask, and the ROI can comprise one or many objects in the video. Motion is then estimated using any of the available block matching techniques. Following the motion vectors, pixels are mapped from the current macroblocks to the reference samples. The samples of the non-ROI region are assigned the value of the motion vectors. Further details on MCTF can be found in [9]. MCTF is carried out within the ROI along the temporal direction using 5/3 or other lifting filters, as depicted in Fig. 2.
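As a rough illustration of the 5/3 lifting structure mentioned above, the sketch below applies one level of temporal lifting along a short sequence of frames. It is a simplification and not the scheme of [9]: motion compensation is omitted (zero motion vectors are assumed), the filters are left unnormalized, and the sequence is mirrored at its boundaries. The function names `mctf_53_forward` and `mctf_53_inverse` are hypothetical and chosen only for this example.

```python
def mctf_53_forward(frames):
    """One level of 5/3 temporal lifting WITHOUT motion compensation (assumption).

    frames: list with an even number of frames, each a flat list of pixel values.
    Returns (low, high): the low-pass and high-pass temporal frames.
    """
    n = len(frames)
    if n < 2 or n % 2:
        raise ValueError("need an even number of frames (at least 2)")
    even, odd = frames[0::2], frames[1::2]
    half = n // 2

    # Predict step: each odd frame is predicted from its two even neighbours.
    high = []
    for k in range(half):
        nxt = even[k + 1] if k + 1 < half else even[k]  # mirror at the end
        high.append([o - 0.5 * (a + b) for o, a, b in zip(odd[k], even[k], nxt)])

    # Update step: each even frame is corrected with the neighbouring residuals.
    low = []
    for k in range(half):
        prv = high[k - 1] if k > 0 else high[k]  # mirror at the start
        low.append([e + 0.25 * (p + h) for e, p, h in zip(even[k], prv, high[k])])
    return low, high


def mctf_53_inverse(low, high):
    """Undo mctf_53_forward by reversing the update and predict steps."""
    half = len(low)
    even = []
    for k in range(half):
        prv = high[k - 1] if k > 0 else high[k]
        even.append([l - 0.25 * (p + h) for l, p, h in zip(low[k], prv, high[k])])
    frames = []
    for k in range(half):
        nxt = even[k + 1] if k + 1 < half else even[k]
        odd = [h + 0.5 * (a + b) for h, a, b in zip(high[k], even[k], nxt)]
        frames.extend([even[k], odd])
    return frames
```

Applying `mctf_53_forward` again to the returned `low` frames would give the next decomposition level. Because every lifting step is undone exactly by `mctf_53_inverse`, the decomposition itself loses no information; any loss comes from the later quantization and truncation stages.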
The MCTF output is then transformed by the DWT for spatial decorrelation. MCTF is frequently used to achieve temporal scalability. The SVC extension, like regular H.264/AVC, uses the DCT for the remaining frames. The 5/3 wavelet is applied to perform the MCTF, whether MCTF is used in combination with the DCT or with wavelet transforms. 5/3 MCTF consists of prediction from two reference frames, one just before and one just after the target frame. Fig. 2 shows the structure of three-level 5/3 MCTF. The next step is to encode the wavelet coefficients into the bit stream using embedded block coding with optimized truncation (EBCOT) [10]. Bit-plane lifting within the ROI is applied to each spatio-temporal sub-band, so that the ROI secures more bits and achieves better visual quality, which also realizes content scalability. SNR scalability is achieved through EBCOT.
III. ROI DETECTION BASED ON MOTION
ROI encoding is mainly carried out using the YUV (YCbCr) color model. The method of [11] rests on the assumption that most of the essential objects are located at the center of the frame throughout the video sequence. The technique divides a given video into groups of frames (scenes) based on the change in the number of pixels between consecutive frames. The next task is to calculate the statistics of the luma changes for all frames in an entire scene and to locate the rectangular region where the change is mostly seen.
Fig.2. Structure of three-level 5/3 MCTF
Fig.3. The system flow of ROI extraction
The center of the rectangular region is the center of the ROI. The technique consists of the following steps:
1. Scene change detection: the change in the luma components between consecutive frames is checked against a threshold value. This threshold is the maximum distance allowed between two consecutive frames of the same scene; when it is exceeded, a scene change is declared.
2. Maximum change location detection: for each frame, the rectangular region in which the maximum number of pixel changes occurs is found. This is carried out for every frame of the scene, and a histogram is created that gives the frequency with which each rectangular area contains the maximum number of pixel changes.
3. Clustering: the rectangular areas found in the preceding step are grouped into clusters.
4. ROI center detection: the center of the ROI is the cluster with the maximum frequency value. Once the ROI center is given, an ROI of any size and shape can be chosen around it.
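Steps 2 to 4 above can be sketched in a few lines. The version below is only an illustration under several assumptions that are not taken from [11]: the rectangular regions are fixed, non-overlapping blocks on a regular grid (so each block acts as its own cluster), all frames are taken to belong to a single scene (step 1 is skipped), and the names `changed_pixel_counts`, `roi_center`, `block` and `pixel_thresh` are hypothetical.

```python
from collections import Counter


def changed_pixel_counts(prev, curr, block, pixel_thresh):
    """Count, per grid block, the pixels whose luma changed by more than pixel_thresh."""
    counts = Counter()
    for r, (prev_row, curr_row) in enumerate(zip(prev, curr)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            if abs(q - p) > pixel_thresh:
                counts[(r // block, c // block)] += 1
    return counts


def roi_center(frames, block=8, pixel_thresh=10):
    """Return the (row, col) centre of the block that is most often the busiest.

    frames: list of luma frames, each a 2-D list of pixel values.
    Returns None if no pixel changes by more than pixel_thresh.
    """
    histogram = Counter()  # how often each block had the most changed pixels
    for prev, curr in zip(frames, frames[1:]):
        counts = changed_pixel_counts(prev, curr, block, pixel_thresh)
        if counts:
            busiest, _ = counts.most_common(1)[0]
            histogram[busiest] += 1
    if not histogram:
        return None
    (block_row, block_col), _ = histogram.most_common(1)[0]
    return (block_row * block + block // 2, block_col * block + block // 2)
```

A rectangular ROI of any chosen size can then be placed around the returned center, as described in step 4.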
Fig.4. Results for ROI extraction for the foreman sequence: (a) original frame in gray scale with ROI mask, (b) binary mask of the region, (c) new image with the mask attached to the image, (d) outside region masked, (e) inside region masked, (f) cropped image
ROI Algorithm
1. Read the video from the data set.
2. Extract the frames.
3. Read an image from the extracted frames.
4. Convert it into a binary image.
5. Find the connected components of the binary image.
6. Define the region or mask.
7. Calculate the maximum number of objects.
IV. PERFORMANCE EVALUATION
Performance measures/Quality measures
The ROI-based SVC method is validated with different performance metrics. Peak signal to noise ratio (PSNR) is considered the most important indicator of quality. Other measures that quantify the effect of the compression applied in the process are mean square error (MSE), compression ratio (CR) and SSIM. PSNR and MSE are inversely related to each other.
Mean Square Error (MSE): It is defined as the average of the squared error between two images or frames. For two monochrome $N \times N$ images $X$ and $Y$, the MSE is given by
$$\mathrm{MSE} = \frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N} \bigl( X(i,j) - Y(i,j) \bigr)^{2} \tag{1}$$
The color image MSE is given by
$$\mathrm{MSE} = \frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N} \Bigl[ \bigl( r(i,j) - r^{*}(i,j) \bigr)^{2} + \bigl( g(i,j) - g^{*}(i,j) \bigr)^{2} + \bigl( b(i,j) - b^{*}(i,j) \bigr)^{2} \Bigr] \tag{2}$$
Signal to Noise Ratio (SNR): The mean square error is usually expressed in terms of a signal to noise ratio (SNR), specified in decibels (dB), as the ratio of the signal variance to the reconstruction error variance:
$$\mathrm{SNR} = 10 \log_{10} \frac{\sigma^{2}}{\sigma_{e}^{2}} \tag{3}$$
where $\sigma^{2}$ is the variance of the desired image and $\sigma_{e}^{2}$ is the average error variance.
Peak Signal to Noise Ratio (PSNR): PSNR, expressed in dB, is more often used as a quality measure in video coding. PSNR is defined as
$$\mathrm{PSNR} = 10 \log_{10} \frac{f_{\max}^{2}}{\mathrm{MSE}} \tag{4}$$
where $f_{\max}$ is the maximum (peak) intensity value of the video signal. For the luminance component, a PSNR value higher than 40 dB indicates excellent image reproduction, 30–40 dB indicates a good image (acceptable, but with visible distortion), 20–30 dB is poor, and a PSNR below 20 dB indicates an objectionable level of reproduction.
Mean Absolute Difference (MAD): It is closely related to the sum of absolute differences (SAD), from which it differs only by the normalizing factor, and it is sometimes adopted because it is easier and quicker to compute. It is given by
$$\mathrm{MAD}(d_{1}, d_{2}) = \frac{1}{N_{1} N_{2}} \sum_{n_{1}=0}^{N_{1}-1} \sum_{n_{2}=0}^{N_{2}-1} \bigl| f(n_{1}, n_{2}, t) - f(n_{1} - d_{1}, n_{2} - d_{2}, t-1) \bigr| \tag{5}$$
Compression Ratio (CR): The compression ratio is the ratio of the original image size to the compressed image size:
$$\text{Compression Ratio} = \frac{\text{original image size}}{\text{compressed image size}} \tag{6}$$
Correlation Quality (CQ):
$$\mathrm{CQ} = \frac{\sum_{j=1}^{M} \sum_{k=1}^{N} F(j,k)\, \hat{F}(j,k)}{\sum_{j=1}^{M} \sum_{k=1}^{N} F(j,k)} \tag{7}$$
Laplacian Mean Square Error (LMSE):
$$\mathrm{LMSE} = \frac{\sum_{j=1}^{M} \sum_{k=2}^{N-1} \bigl[ O\{F(j,k)\} - O\{\hat{F}(j,k)\} \bigr]^{2}}{\sum_{j=1}^{M} \sum_{k=2}^{N-1} \bigl[ O\{F(j,k)\} \bigr]^{2}} \tag{8}$$
where $O\{\cdot\}$ denotes the Laplacian operator.
Image Fidelity (IF):
$$\mathrm{IF} = 1 - \left( \sum_{j=1}^{M} \sum_{k=1}^{N} \bigl[ F(j,k) - \hat{F}(j,k) \bigr]^{2} \Big/ \sum_{j=1}^{M} \sum_{k=1}^{N} \bigl[ F(j,k) \bigr]^{2} \right) \tag{9}$$
Structural SIMilarity (SSIM) Index:
$$\mathrm{SSIM}(x, y) = \frac{(2 \mu_{x} \mu_{y} + C_{1})(2 \sigma_{xy} + C_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + C_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2})} \tag{10}$$
where $\mu_{x}$ and $\mu_{y}$ are the means, $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ the variances and $\sigma_{xy}$ the covariance of $x$ and $y$, and $C_{1}$ and $C_{2}$ are small stabilizing constants.
V. SIMULATION RESULTS
This section presents the experimental results obtained with different threshold values for the foreman image; the experiment is then extended to the foreman video sequence. In the simulation, a foreman image of size 256×256 is first considered, the corresponding MSE and PSNR values are calculated, and plots are prepared for the different threshold values. Table I and Figs. 5-9 show these results. PSNR increases as the threshold decreases, and MSE increases as the threshold increases. Similarly, SSIM, IF and CQ generally decrease as the threshold value increases. This implies that the image quality is better when the PSNR is high and the threshold value is low.
Fig.5. Results for PSNR with different threshold values
Fig.6. Results for MSE with different threshold values
Fig.7. Results for SSIM with different threshold values
Fig.8. CQ plot with different threshold values
Fig.9. IF plot with different threshold values
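The pixel-wise measures plotted in these figures follow directly from the definitions in Section IV. The sketch below shows one possible way to compute MSE, PSNR, CR, CQ and IF for a pair of images stored as 2-D lists. It is not the MATLAB code used for the experiments; the function names are hypothetical, 8-bit data with a peak value of 255 is assumed, and PSNR is taken as $10 \log_{10}(f_{\max}^{2}/\mathrm{MSE})$.

```python
import math


def _pairs(ref, test):
    """Yield corresponding pixel pairs from two equally sized 2-D lists."""
    for ref_row, test_row in zip(ref, test):
        yield from zip(ref_row, test_row)


def mse(ref, test):
    """Mean square error between a reference and a test image."""
    n = sum(len(row) for row in ref)
    return sum((x - y) ** 2 for x, y in _pairs(ref, test)) / n


def psnr(ref, test, peak=255.0):
    """PSNR in dB; peak=255 assumes 8-bit samples. Infinite for identical images."""
    err = mse(ref, test)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)


def compression_ratio(original_size, compressed_size):
    """Ratio of original size to compressed size (same unit for both)."""
    return original_size / compressed_size


def correlation_quality(ref, test):
    """CQ: sum of F * F_hat divided by the sum of F."""
    return sum(x * y for x, y in _pairs(ref, test)) / sum(x for x, _ in _pairs(ref, test))


def image_fidelity(ref, test):
    """IF: one minus the error energy divided by the reference energy."""
    num = sum((x - y) ** 2 for x, y in _pairs(ref, test))
    den = sum(x ** 2 for x, _ in _pairs(ref, test))
    return 1.0 - num / den
```

For a threshold sweep such as the one reported here, each function would be called once per threshold with the original frame as `ref` and the decoded frame as `test`.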
TABLE I
PERFORMANCE COMPARISON FOR FOREMAN
Quantization Level   PSNR (dB)   MSE     SSIM     LMSE       CQ          IF
8                    51.466      0.466   0.6605   1.223376   89.279083   0.747797
16                   50.62       0.563   0.6468   1.178086   89.184453   0.747132
32                   50.002      0.649   0.6190   1.109815   89.023391   0.745893
64                   49.323      0.759   0.5922   1.103999   88.815267   0.744147
128                  46.182      1.564   0.5621   1.162803   88.335104   0.740190
256                  45.612      1.785   0.5630   1.104733   87.975935   0.737073
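As a quick consistency check on Table I, the PSNR column can be recomputed from the MSE column. The sketch below assumes 8-bit data (peak value 255) and the definition $\mathrm{PSNR} = 10 \log_{10}(f_{\max}^{2}/\mathrm{MSE})$; the names `TABLE_I`, `psnr_from_mse` and `max_psnr_gap` are hypothetical.

```python
import math

# (quantization level, reported PSNR in dB, reported MSE), copied from Table I
TABLE_I = [
    (8, 51.466, 0.466),
    (16, 50.62, 0.563),
    (32, 50.002, 0.649),
    (64, 49.323, 0.759),
    (128, 46.182, 1.564),
    (256, 45.612, 1.785),
]


def psnr_from_mse(mse, peak=255.0):
    """PSNR in dB for a given MSE, assuming the stated peak value."""
    return 10.0 * math.log10(peak ** 2 / mse)


def max_psnr_gap(rows):
    """Largest absolute gap between recomputed and reported PSNR."""
    return max(abs(psnr_from_mse(mse) - psnr) for _, psnr, mse in rows)


if __name__ == "__main__":
    for level, reported, mse in TABLE_I:
        print(level, reported, round(psnr_from_mse(mse), 3))
```

Under these assumptions the recomputed values stay within 0.05 dB of the reported column, so the two columns are mutually consistent up to rounding of the MSE.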
Fig. 10 confirms that the images appear blurred at high threshold values. Different performance measures, namely MSE, PSNR, SSIM, IF, CQ and LMSE, are calculated for quality measurement. Only a single frame is considered in our simulation; the experiment can be carried out in the same way for the other frames of the video. The experiments were carried out using MATLAB-7 (2012) on a Pentium Intel CORE i7 processor, 2.4 GHz CPU, with 4 GB RAM.
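Of the measures listed above, SSIM is the least direct to compute. The sketch below evaluates Eq. (10) once over the whole image, which is a simplification: the index is more commonly averaged over local windows, and the paper does not state which variant or which constants were used. The constants $C_{1} = (0.01 L)^{2}$ and $C_{2} = (0.03 L)^{2}$ with $L = 255$ are assumed defaults, and the name `ssim_global` is hypothetical.

```python
def ssim_global(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM of Eq. (10) from global image statistics (single-window simplification).

    x, y: equally sized 2-D lists of pixel values with at least two pixels.
    """
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    # Unbiased variance and covariance estimates
    var_x = sum((v - mu_x) ** 2 for v in xs) / (n - 1)
    var_y = sum((v - mu_y) ** 2 for v in ys) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / (n - 1)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical images give a value of 1, and the value falls as the structure of the test image departs from that of the reference.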
Fig.10. Results for ROI extraction for the foreman sequence with different thresholds: (a) original frame in gray scale with ROI mask, (b) threshold=8, (c) threshold=16, (d) threshold=32, (e) threshold=64, (f) threshold=128, (g) threshold=256
VI. CONCLUSION
This paper gives an overview of ROI scalable video coding. The basic operational design of ROI scalable video coding is discussed in detail, and the ROI scalability features and functionality of a scalable coder in the existing standards are explained. ROI extraction is performed first, followed by ROI processing and then coding to produce the scalable bit stream. To achieve spatial and temporal decomposition of the input video sequence, the wavelet transform is combined with motion compensated temporal filtering (MCTF): 5/3 MCTF is carried out, followed by a two-dimensional discrete wavelet transform. The motion vectors and wavelet coefficients are then compressed to remove redundancy, and the result is arranged into a layered bit stream representation. Our experiment shows that ROI scalable coding should be carried out with a low threshold value so that the quality of the image or video is not degraded extensively. This work can be extended towards the development of better MCTF techniques and lifting schemes, and towards incorporating further performance indices such as FSIM and normalized cross correlation.
VII. FUTURE SCOPE
This work can be extended further by considering different video sequences and by verifying different scalability techniques.
VIII. REFERENCES
[1] Q. Chen, L. Song, Y. Xiaokang, and Z. Wenjun, “Robust Video Region-of-Interest Coding Based on Leaky Prediction,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 9, pp. 1389–1394, 2009.
[2] K. E. Grois Dan, Hadar Ofer, “ROI Adaptive Scalable Video Coding for Limited Bandwidth Wireless Networks,” Wirel. Days (WD), 2010 IFIP, pp. 0–4, 2010.
[3] T. Huang, “Region of Interest Extraction and Adaptation in Scalable Video Coding,” in Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2010, pp. 2320–2323.
[4] M. P. Karunakar A.Pai, “Interactive region of interest scalability for wavelet based scalable video coder,” J. Real Time Image Process., vol. 6, no. 2, pp. 93–100, 2011.
[5] N. Adami, A. Signoroni, and R. Leonardi, “State-of-the-Art and Trends in Scalable Video Compression With Wavelet-Based Approaches,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1238–1255, Sep. 2007.
[6] W. Peng, T. Chiang, and H. Hang, “Adding Selective Enhancement in Scalable Video Coding for Region-of-Interest Functionality,” in IEEE International Symposium on Circuits and Systems, 2006. ISCAS 2006. Proceedings, 2006, pp. 3089–3092.
[7] D. Grois and O. Hadar, “Efficient Region-of-Interest Scalable Video Coding with Adaptive Bit-Rate Control,” Adv. Multimedia, Hindawi Publ. Corp., vol. 2013, p. 17, 2013.
[8] T. E. Slowe and I. Marsic, “Saliency-based visual representation for compression,” in International conference on Image Processing, 1997, pp. Vol–2, 554–557.
[9] Zoghlami, M. Marzougui, M. Atri, and R. Tourki, “High-level implementation of Video compression chain coding based on MCTF lifting scheme,” in 10th International Multi-Conferences on Systems, Signals & Devices 2013 (SSD13), 2013, pp. 1–6.
[10] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Trans. Image Process., vol. 9, no. 7, pp. 1158–70, Jan. 2000.
[11] S. Azad, W. Song, and D. Tjondronegoro, “Measuring Bitrate and Quality Trade-Off in a Fast Region-of-Interest Based Video Coding,” Springer-Verlag Berlin Heidelberg, LNCS 6524, pp. 442–453, 2011.
[12] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[13] A. M. Eskicioglu and P. S. Fisher, “Image Quality Measures and Their Performance,” IEEE Trans. Commun., vol. 43, no. 12, pp. 2959–2965, 1995.