High Frame Rate Screen Video Coding for Screen Sharing Applications

Dan Miao*
University of Science and Technology of China, Hefei, China
Email: [email protected]

Jingjing Fu, Yan Lu, Shipeng Li
Media Computing Group, Microsoft Research, Beijing, China
Email: {jifu,yanlu,spli}@microsoft.com

Chang Wen Chen
State University of New York at Buffalo, Buffalo, NY, USA
Email: [email protected]
Abstract—In this paper, we propose a high frame rate screen video compression scheme that aims to improve the interactive user experience in screen sharing applications. The proposed scheme performs two-layer coding: base layer coding with a conventional video codec and enhancement layer coding with the proposed open-loop coding scheme. For efficient frame-level layer selection and compression, the content update of each frame is evaluated through global motion detection. A screen frame with significant content update is fed to the conventional video encoder in the base layer. In contrast, a frame with little update is compressed in the enhancement layer, in which duplicate content is indicated by a global motion vector and a skip flag, while updated content is encoded by distinct intra modes chosen according to its local features. Experimental results demonstrate that, for screen video containing interaction, the proposed coding scheme achieves an encoding time of 3.09 ms/frame and a decoding time of 2.33 ms/frame with efficient rate-distortion performance.
I. INTRODUCTION
The rapid development of computing hardware and transmission technologies makes it possible for users to share computing and storage resources among multiple devices through real-time screen sharing, in which the screen interface of one computing device is compressed as a sequence of images and transmitted to another device for display and control. Based on screen sharing, many systems have been established to facilitate user interaction with remotely running applications, including remote desktop, online editing and so on. During interaction with screen content, users expect an instant response to their operations. To achieve smooth interaction, the screen images have to be updated at a frame rate higher than the regular 30 fps. High frame rate screen video acquisition and display are already supported in many screen sharing systems thanks to the development of screen capture and rendering techniques. Hence, to improve system performance, it is necessary to develop an efficient high frame rate screen video coding scheme.

One existing solution for screen content compression is to adopt standard image and video codecs. For instance, JPEG is employed for screen image compression in Virtual Network Computing (VNC) [1], while H.264 is integrated in Onlive [2]. Standard video codecs are widely employed in video transmission systems since they are well supported by optimized implementations in software (e.g., X264, FFMPEG, MSFT) and hardware. Although real-time encoding and decoding can be achieved with hardware acceleration, coding at higher frame rates remains challenging because of the high complexity of the hybrid coding framework. Moreover, screen video contains considerable text/graphic content with strong anisotropic features, and transform-based video coding schemes cannot handle the anisotropic correlation very well.

Considering the distinct properties of text/graphic content, some coding schemes have been proposed to extend the H.264/AVC coding scheme. Ding et al. [3] introduce a new intra mode to better exploit spatial correlation in text regions. Zaghetto and de Queiroz [4] propose a rate allocation scheme between image regions and text regions to improve the quality of text content. Both methods are integrated with H.264 and thus suffer from its high computational complexity.

To achieve efficient coding with low complexity, some new screen coding frameworks have been proposed. Since screen content usually mixes text, graphics and natural images, these schemes segment the distinct content types and compress them with different algorithms at pixel level [5] or block level [6] [7]. Even though some of the proposed schemes can achieve real-time coding, it is still challenging to encode screen video at a high frame rate. Moreover, existing standard coding resources cannot be fully utilized in these schemes. To keep compatibility with standard codecs, Wang et al. [8] propose a screen coding scheme that integrates a video codec directly, in which natural video content is detected and compressed by the video codec. However, the video codec is not fully utilized if no video region is detected, and it is difficult to reconstruct the whole screen with only the video codec available.

*This work was done while the author was with Microsoft Research as a research intern.
978-1-4799-3432-4/14/$31.00 ©2014 IEEE
To improve the user interaction experience in screen sharing applications, we propose a layered screen video coding scheme that achieves high frame rate coding without sacrificing coding efficiency. Based on content update evaluation results obtained through global motion detection, each screen video frame is assigned to the base layer or the enhancement layer. More specifically, a screen frame with significant content update is identified as a key frame and compressed in the base layer by a conventional video encoder, to fully utilize existing standard codec resources. In the enhancement layer, in order to reduce coding complexity, stable content with global motion is coded in skip/inter mode, while new content is compressed by three intra coding modes explicitly designed to fully exploit the spatial correlation.

The main contributions of the proposed coding scheme are threefold. First, screen video coding at a high frame rate is supported by the layered coding scheme, which greatly improves the user experience during interaction. Second, coding efficiency is maintained thanks to the adaptation to screen content variations in the enhancement layer. Finally, high compatibility with conventional video codecs is achieved, so hardware-based codec implementations can be directly integrated into the coding scheme of a screen sharing system.

Fig. 1. Framework of layered screen video coding. [Block diagram: the screen video passes through frame level selection; key frames go to base layer encoding (video encoding) and non-key frames to enhancement layer encoding (mode decision, intra mode encoding, skip/inter mode encoding); after transmission, base layer decoding (video decoding) and enhancement layer decoding (intra mode decoding, skip/inter mode decoding) feed layer content merging, which outputs the reconstructed screen video.]

II. PROPOSED FRAMEWORK
A. Analysis on Screen Video in Temporal Domain

As a typical user interface, the screen conveys information to users through a combination of text, graphics and images, and changes its content in response to the user’s input operations. The layout of screen content is often pre-defined with little variation. When the content does change, the update frequency and ratio are mainly determined by the user’s operations as well as the screen content. For instance, while a video is playing, a large area of the screen updates at a regular video frame rate. During user editing, most of the screen content remains unchanged between neighboring frames except for a small region under editing, which needs to be updated at a high frame rate to respond promptly to the user’s request.

Two illustrative examples are shown in Fig. 2. The “Stable block referring co-located region” curve represents the ratio of blocks with the same content as the co-located region in the previous frame, while the “Stable block referring flexible region” curve represents the ratio of blocks with the same content as either the co-located region or a region of the previous frame indicated by a motion vector. We can observe that screen video is more stable than natural video: a large region of each screen frame is identical to the co-located region in the neighboring frame. In addition, scrolling up and down or moving a window typically produces content updates that appear as large areas of global motion. The gap between the two curves for screen video corresponds to the ratio of blocks covered by global motion compensation, whose content is the same as a region indicated by a global motion vector. Apart from these stable blocks, the content update in screen video is limited. In contrast, the two curves for natural video coincide, since there is nearly no global motion compensation between neighboring frames, and the content is updated frequently, frame by frame.

Fig. 2. An example of stable block ratio comparison. Top: natural video; bottom: screen video.
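The two stable block ratios plotted in Fig. 2 could be measured along the following lines. This is a minimal sketch, not the authors' measurement code: the names `blocks_match` and `stable_block_ratios` are hypothetical, frames are plain lists of pixel rows for a single channel, one global motion vector is assumed to be known in advance, and the vector `(dy, dx)` is taken to point from a block in the current frame to its match in the previous frame.

```python
from typing import List, Tuple

Frame = List[List[int]]  # one channel of a frame, stored as rows of pixel values


def blocks_match(cur: Frame, prev: Frame, y: int, x: int, size: int,
                 dy: int = 0, dx: int = 0) -> bool:
    """True if the block at (y, x) in cur equals the block at (y+dy, x+dx) in prev."""
    height, width = len(prev), len(prev[0])
    if y + dy < 0 or x + dx < 0 or y + dy + size > height or x + dx + size > width:
        return False  # the displaced block would fall outside the previous frame
    return all(
        cur[y + r][x:x + size] == prev[y + dy + r][x + dx:x + dx + size]
        for r in range(size)
    )


def stable_block_ratios(cur: Frame, prev: Frame, size: int = 16,
                        global_mv: Tuple[int, int] = (0, 0)) -> Tuple[float, float]:
    """Return (co-located stable ratio, flexible stable ratio) for one frame pair."""
    colocated = flexible = total = 0
    for y in range(0, len(cur) - size + 1, size):
        for x in range(0, len(cur[0]) - size + 1, size):
            total += 1
            same_place = blocks_match(cur, prev, y, x, size)
            moved = blocks_match(cur, prev, y, x, size, *global_mv)
            colocated += same_place
            flexible += same_place or moved
    return colocated / total, flexible / total


# Toy example: an 8x8 frame scrolled up by two rows, measured with 2x2 blocks.
prev_frame = [[row * 8 + col for col in range(8)] for row in range(8)]
cur_frame = prev_frame[2:] + [[1000 + col for col in range(8)] for _ in range(2)]
ratios = stable_block_ratios(cur_frame, prev_frame, size=2, global_mv=(2, 0))
print(ratios)
```

In the toy scroll, no block matches its co-located position, but three quarters of the blocks match once the global motion vector is applied. This is the kind of gap between the two curves that the text attributes to global motion in screen video.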
B. Framework for Layered Screen Video Coding

Based on the above analysis, we propose a screen video compression scheme that achieves high frame rate video coding through frame-level layered coding, consisting of base layer coding and enhancement layer coding. The framework is illustrated as a block diagram in Fig. 1. In this open-loop coding structure, temporal variation is detected between neighboring original frames at block level. A frame with significant changes is treated as a key frame and fed to the base layer, where a standard video codec is directly employed for key frame coding. In the enhancement layer, blocks with newly updated content are compressed in intra mode. Based on distinct content properties, three intra coding modes are designed to compress text regions, image edge regions and image smooth regions, respectively. The inter correlation in stable content is fully exploited by skip/inter mode with global motion compensation: stable blocks with global motion are processed in skip/inter mode with a copying command. Since global motion compensation is performed among the original frames, there is no need to access previous frames reconstructed in a closed loop, and the encoding complexity is reduced accordingly. In the decoder, the complete key frame is reconstructed by the video decoder in the base layer. For a non-key frame, intra mode blocks are decoded directly, while blocks in skip/inter mode are copied from the corresponding regions in the previously reconstructed frame.
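The frame-level routing described above can be illustrated with a rough sketch. It rests on several simplifying assumptions that are not taken from the paper: only co-located blocks are compared (global motion is ignored), the first frame is always treated as a key frame, and the function names `changed_block_ratio` and `route_frames` are hypothetical. The default threshold of 0.2 mirrors the 20% setting reported in the experiments.

```python
from typing import List, Sequence

Frame = List[List[int]]  # one channel of an original (not reconstructed) frame


def changed_block_ratio(cur: Frame, prev: Frame, size: int = 16) -> float:
    """Fraction of size x size blocks that differ from the co-located block in prev."""
    total = changed = 0
    for y in range(0, len(cur) - size + 1, size):
        for x in range(0, len(cur[0]) - size + 1, size):
            total += 1
            if any(cur[y + r][x:x + size] != prev[y + r][x:x + size] for r in range(size)):
                changed += 1
    return changed / total


def route_frames(frames: Sequence[Frame], threshold: float = 0.2,
                 size: int = 16) -> List[str]:
    """Label each frame 'base' (key frame) or 'enhancement' (non-key frame)."""
    layers = ["base"]  # assumption: the first frame has no predecessor to compare with
    for prev, cur in zip(frames, frames[1:]):
        ratio = changed_block_ratio(cur, prev, size)
        layers.append("base" if ratio > threshold else "enhancement")
    return layers


# Toy example with 4x4 frames and 2x2 blocks (four blocks per frame).
frame0 = [[0] * 4 for _ in range(4)]
frame1 = [row[:] for row in frame0]
frame1[0][0] = 9                      # one block changes: ratio 0.25
frame2 = [[1] * 4 for _ in range(4)]  # every block changes: ratio 1.0
layers = route_frames([frame0, frame1, frame2], threshold=0.5, size=2)
print(layers)
```

Because the comparison uses only original frames, the routing decision does not depend on any reconstructed frame, which matches the open-loop structure described in the text.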
III. FRAME LEVEL SELECTION FOR LAYERED CODING
For a screen video sequence, the coding frame rate is determined by the coding complexity of the two layers as follows:

$$T = T_b \cdot \alpha + T_e \cdot (1 - \alpha) + T_0 \tag{1}$$

where $T$ is the average encoding time per frame, $T_b$ and $T_e$ are the average encoding times of base layer coding and enhancement layer coding, respectively, $T_0$ denotes the complexity cost of the frame selection algorithm, and $\alpha$ represents the key frame ratio.
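As a quick numerical illustration of Eqn. (1), the sketch below evaluates the average encoding time and the corresponding frame rate. The function names and the timing values are illustrative assumptions, not measurements from the paper.

```python
def average_encoding_time(t_base_ms: float, t_enh_ms: float, alpha: float,
                          t_select_ms: float = 0.0) -> float:
    """Eqn. (1): average per-frame encoding time for key frame ratio alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return t_base_ms * alpha + t_enh_ms * (1.0 - alpha) + t_select_ms


def frames_per_second(ms_per_frame: float) -> float:
    """Convert an average time per frame in milliseconds into a frame rate."""
    return 1000.0 / ms_per_frame


# Illustrative numbers only (not measurements from the paper).
t_avg = average_encoding_time(t_base_ms=24.0, t_enh_ms=2.0, alpha=0.1)
print(t_avg, frames_per_second(t_avg))
```

With these hypothetical timings, sending one frame in ten to the base layer gives an average of about 4.2 ms per frame. The same function shows how quickly the average grows as the key frame ratio rises.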
Since enhancement layer coding is designed as an open-loop structure that does not refer to the reconstructed previous frame, its coding complexity mainly comes from coding the small fraction of intra blocks. The average encoding time of the enhancement layer $T_e$ is therefore much smaller than that of the base layer $T_b$, which involves complicated mode decision. According to Eqn. (1), the relationship between the frame rate and the key frame ratio is shown in Fig. 3, where the complexity of frame selection $T_0$ is small enough to be ignored, and $f_b$ and $f_e$ are the frame rates of pure base layer coding and pure enhancement layer coding, respectively. By adjusting the key frame ratio, we can achieve the required coding frame rate $f_t$. Two problems have to be considered: one is how to design a frame selection rule to control the key frame ratio; the other is how to achieve efficient coding performance in the enhancement layer.

Fig. 3. The relationship between frame rate and key frame ratio. [Plot: frame rate $f$ versus key frame ratio from 0 to 1, with the levels $f_e$, $f_t$ and $f_b$ marked on the frame rate axis.]
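A short derivation makes the curve of Fig. 3 explicit and gives the key frame ratio needed for a target frame rate. It is a sketch that follows from Eqn. (1) under the stated assumption $T_0 \approx 0$, writing $f = 1/T$, $f_b = 1/T_b$ and $f_e = 1/T_e$, and assuming $f_b \le f_t \le f_e$ with $T_b \ne T_e$:

```latex
\begin{aligned}
f(\alpha) &= \frac{1}{\alpha T_b + (1-\alpha)\,T_e}
           = \frac{f_b f_e}{\alpha f_e + (1-\alpha)\,f_b},
           \qquad f(0) = f_e,\; f(1) = f_b, \\
f(\alpha) = f_t \;&\Longrightarrow\;
\alpha = \frac{1/f_t - T_e}{T_b - T_e}
       = \frac{f_b\,(f_e - f_t)}{f_t\,(f_e - f_b)} .
\end{aligned}
```

For instance, with hypothetical rates $f_b = 40$ fps and $f_e = 400$ fps, a target of $f_t = 200$ fps requires $\alpha = 40 \cdot 200 / (200 \cdot 360) \approx 0.11$, so roughly one frame in nine may be coded as a key frame.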
In the proposed scheme, each frame is assigned to a layer according to the amount of content update, which is evaluated by the difference between neighboring frames. When computing the difference, the co-located position is checked first, and then global motion compensation is considered. In the implementation, the difference is evaluated at block level with 16x16 blocks. For stable regions without motion, the current block is compared with the co-located block in the previous original frame; if there is no difference, the block is marked as a skip block. To detect blocks subject to global motion, we employ the global motion detection algorithm based on feature detection and mapping [9]. A unique and easily identifiable feature is first located in the current block. Then the matching feature point is searched for in the previous frame, and the matching block is grown around the feature point. If the current block is identical to the matching block, it is identified as an inter block and its motion vector is recorded. All remaining blocks are considered intra blocks with newly updated content. For a video with $N$ frames, the key frame ratio is

$$\alpha = \frac{1}{N} \sum_{i=1}^{N} f(B_i - B_0) \tag{2}$$

where $B_i$ represents the intra block ratio of the $i$-th frame, and

$$f(\cdot) = \begin{cases} 1 & \text{if } B_i > B_0 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $B_0$ is the frame selection threshold, which can be set based on the screen content, bandwidth condition, complexity constraint, etc. For each frame, if the intra block ratio exceeds the threshold, the frame is identified as a key frame; otherwise it is a non-key frame.

IV. SCREEN VIDEO CODING IN ENHANCEMENT LAYER

In this research, we intend to leverage existing coding resources, including both software and hardware. Therefore, key frames in screen video are compressed at full size in the base layer by a state-of-the-art standard video codec (e.g., H.264, MPEG-2). Non-key frames, which have limited content update, are compressed in the enhancement layer. We aim at the dual goals of high frame rate coding and efficient rate-distortion performance. To this end, we design three coding modes for intra block coding that efficiently exploit spatial correlation based on the properties of screen content.

A. Mode Decision

In enhancement layer coding, five coding modes are designed for each 16x16 block: three intra modes, one inter mode and one skip mode. To reduce computational complexity, the skip/inter blocks detected in the frame level selection module are skipped without compression. For a skip block, only the block type is transmitted to the decoder, while for an inter block, both the block type and the motion vector are delivered.

Screen content in edge regions shows strong anisotropic features, especially in text/graphic parts. An example is given in Fig. 4. We can observe that the edges of text/graphic content are much sharper than those of image regions, and their geometries are more complicated and irregular. Moreover, downsampling of the chroma channels, which is usually adopted in standard video codecs for efficient representation, may introduce chromatic aberration in edge regions. Fig. 4(b) shows an example in which the original block is converted from YUV444 to YUV420; a noticeable grey shadow is observed around the boundaries. Due to its high contrast, the distribution of luminance values in a text/graphic region is discontinuous and sparse. From Fig. 4(c), we can see that the histogram is almost continuous for the image region, while there are large gaps between the non-zero counts in the text/graphic block.

Fig. 4. From columns (a)-(c): original block, block with chroma downsampling, histogram of luminance component.

Based on these observations, we classify intra blocks into three types: text/graphic region, image edge region and image smooth region. In the implementation, the intra block is first quantized as follows. The colors with peak histogram values are selected as the base colors. Then, an equal-size window in the histogram is used to cluster the colors near the major ones, and all pixels within a window are quantized to the base color of that window. Some pixels may fall outside all clustering windows. If the number of such escaped pixels is within a threshold, the block is considered a text/graphic block; otherwise it is an image block. An image block is further classified as a smooth or edge region block based on its gradient value, which is the sum of the absolute differences between neighboring pixels. If the value exceeds a threshold, the block is considered an edge region block; otherwise it is a smooth region block.

Fig. 5. The first frames of test sequences. From left to right: Bing, Amazon, Pdf-editing, Movie.

B. Intra Mode Encoding

An intra block in a text/graphic region is encoded in the pixel domain for all YUV channels. After quantization, the block can be represented by base colors with an index map and the remaining escaped pixels. The base colors of each block are entropy encoded after prediction from the corresponding ones in the previous frame. The index map and escaped pixels are compressed using variable length coding. Compared with text/graphic content, the edges in image parts are more regular, and their spatial correlation can be exploited efficiently by a transform-based coding scheme. Therefore, JPEG-based intra coding is employed for image block compression. Considering the chromatic aberration caused by chroma downsampling, all three YUV components in image edge regions are encoded without downsampling. For efficient representation and compression, the chroma channels in smooth regions are downsampled by two along the horizontal and vertical directions before compression.
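The intra block classification of Section IV-A and the base-color representation used for text/graphic blocks can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation. The window half-width, the escaped-pixel threshold and the gradient threshold are hypothetical values, since the paper does not report them. The greedy peak selection, the rule that windows must not overlap, and the use of horizontal and vertical neighbors in the gradient are also assumptions. Only the default of four base colors comes from the experimental settings.

```python
from collections import Counter
from typing import Dict, List, Tuple

Block = List[List[int]]  # one channel of a block, values in 0..255


def pick_base_colors(block: Block, num_base: int = 4, window: int = 8) -> List[int]:
    """Pick up to num_base histogram peaks whose +/-window ranges do not overlap."""
    histogram = Counter(v for row in block for v in row)
    bases: List[int] = []
    for value, _count in histogram.most_common():
        if all(abs(value - b) > 2 * window for b in bases):
            bases.append(value)
        if len(bases) == num_base:
            break
    return bases


def quantize(block: Block, bases: List[int], window: int = 8
             ) -> Tuple[List[List[int]], Dict[Tuple[int, int], int]]:
    """Return an index map (-1 marks an escaped pixel) and the escaped pixel values."""
    index_map: List[List[int]] = []
    escaped: Dict[Tuple[int, int], int] = {}
    for y, row in enumerate(block):
        indices = []
        for x, v in enumerate(row):
            idx = next((i for i, b in enumerate(bases) if abs(v - b) <= window), -1)
            if idx < 0:
                escaped[(y, x)] = v
            indices.append(idx)
        index_map.append(indices)
    return index_map, escaped


def gradient(block: Block) -> int:
    """Sum of absolute differences between horizontal and vertical neighbors."""
    h, w = len(block), len(block[0])
    horizontal = sum(abs(block[y][x] - block[y][x + 1]) for y in range(h) for x in range(w - 1))
    vertical = sum(abs(block[y][x] - block[y + 1][x]) for y in range(h - 1) for x in range(w))
    return horizontal + vertical


def classify_intra_block(block: Block, num_base: int = 4, window: int = 8,
                         max_escaped: int = 16, edge_gradient: int = 3000) -> str:
    """Return 'text', 'edge' or 'smooth' for one intra block."""
    bases = pick_base_colors(block, num_base, window)
    _index_map, escaped = quantize(block, bases, window)
    if len(escaped) <= max_escaped:
        return "text"
    return "edge" if gradient(block) > edge_gradient else "smooth"


# Synthetic 16x16 blocks: two-level strokes, a gentle ramp, and a ramp with a step.
text_block = [[0 if x % 4 == 0 else 255 for x in range(16)] for _ in range(16)]
smooth_block = [[(x + y) * 4 for x in range(16)] for y in range(16)]
edge_block = [[(x + y) * 4 + (100 if x >= 8 else 0) for x in range(16)] for y in range(16)]
labels = [classify_intra_block(b) for b in (text_block, smooth_block, edge_block)]
print(labels)
```

The index map and escaped pixels returned by `quantize` correspond to the representation that Section IV-B entropy codes for text/graphic blocks. The entropy coding itself and the JPEG-based image modes are outside the scope of this sketch.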
V. EXPERIMENTAL RESULTS
We have carried out simulation experiments on four screen video sequences, each with 400 frames at a resolution of 1280x768, shown in Fig. 5; they cover Web pages, online editing and natural video. The test machine is equipped with an Intel Core i7 870 CPU (2.93GHz x 8) and 4 GB of RAM. X264 [10] is adopted as the standard video codec for base layer coding, in its default mode with an IPPP GOP structure and a GOP size of 250. The number of base colors in text region coding is set to 4. The threshold $B_0$ for frame level selection is set to 20%. A screen video coding scheme [8] is also employed as a reference scheme, in which natural video content is partitioned from the whole screen frame and compressed by a video codec while the other content is encoded by a dedicated screen codec.

The rate-distortion performance comparison is shown in Fig. 6. We can observe that for typical screen video (Bing, Amazon and Pdfediting), the proposed scheme achieves better R-D performance than both X264 and the reference scheme over a wide range of bit rates. Compared with X264, the coding gain is mainly due to the efficient content selection for each layer and the coding modes specially designed for distinct content in the enhancement layer. Compared with the reference scheme, natural video content with small region updates is compressed by the image modes in the enhancement layer at limited bit rate cost. Moreover, more accurate mode decision with the image edge region mode also contributes to the coding gain. For the “Movie” sequence, the results of the three schemes are similar, because the screen content changes frequently in natural video and most screen frames are fed to the video codec.

Fig. 6. Rate-distortion performance comparison of schemes.

The complexity comparison is shown in Table I. On average, the proposed coding scheme saves 40.9% of the encoding time and 53.2% of the decoding time compared with X264. For typical screen video (Bing, Amazon and Pdfediting), the proposed scheme achieves an average encoding time of 3.09 ms/frame and an average decoding time of 2.33 ms/frame.

TABLE I. COMPLEXITY COMPARISON OF SCHEMES (MS/FRAME)

Sequence      X264            Proposed        Ratio*
              Enc.    Dec.    Enc.    Dec.    Enc.    Dec.
Bing          23.75   12.32    3.43    2.53   0.14    0.21
Amazon        24.12   12.15    3.77    2.52   0.16    0.21
Movie         66.11   17.48   70.54   18.09   1.06    1.03
Pdfediting    21.08   11.67    2.06    1.93   0.10    0.17
Average       33.77   13.41   19.95    6.27   0.36    0.40

*Ratio = Proposed/Ref.
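The summary figures quoted in the text can be re-derived from the per-sequence entries of Table I. The sketch below does only that arithmetic; the variable and function names are hypothetical, and the timing values are copied from the table.

```python
from typing import Dict, List, Tuple

# Per-sequence times in ms/frame from Table I:
# (X264 encode, X264 decode, proposed encode, proposed decode).
times: Dict[str, Tuple[float, float, float, float]] = {
    "Bing": (23.75, 12.32, 3.43, 2.53),
    "Amazon": (24.12, 12.15, 3.77, 2.52),
    "Movie": (66.11, 17.48, 70.54, 18.09),
    "Pdfediting": (21.08, 11.67, 2.06, 1.93),
}


def column_mean(names: List[str], column: int) -> float:
    """Average one column of the table over the given sequences."""
    return sum(times[name][column] for name in names) / len(names)


all_sequences = list(times)
typical = ["Bing", "Amazon", "Pdfediting"]  # the interactive screen sequences

# Average savings over all four sequences, relative to X264.
enc_saving = 1.0 - column_mean(all_sequences, 2) / column_mean(all_sequences, 0)
dec_saving = 1.0 - column_mean(all_sequences, 3) / column_mean(all_sequences, 1)

# Average per-frame times of the proposed scheme on the typical screen sequences.
typical_enc = column_mean(typical, 2)
typical_dec = column_mean(typical, 3)

print(f"encoding time saved: {enc_saving:.1%}, decoding time saved: {dec_saving:.1%}")
print(f"typical screen video: {typical_enc:.2f} ms enc, {typical_dec:.2f} ms dec")
print(f"implied encoding frame rate: {1000.0 / typical_enc:.0f} fps")
```

The recomputed savings and averages agree with the 40.9%, 53.2%, 3.09 ms/frame and 2.33 ms/frame stated in the text. The last line converts the typical encoding time into the corresponding frame rate.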
VI. CONCLUSION
In this paper, we have presented a layered compression scheme to achieve high frame rate coding for screen video. To reduce coding complexity, a frame level selection algorithm is proposed, and screen video frames with little content update are efficiently encoded by the designed coding modes. Simulation results show that the proposed scheme is able to achieve high frame rate coding with efficient R-D performance.

REFERENCES

[1] VNC, http://www.realvnc.com/docs/rfbproto.pdf
[2] Onlive, http://desktop.onlive.com
[3] W. Ding, Y. Lu and F. Wu, “Enable Efficient Compound Image Compression in H.264/AVC Intra Coding,” Proc. of ICIP 2007, pp. 337-340.
[4] A. Zaghetto and R. L. de Queiroz, “Segmentation-driven compound document coding based on H.264/AVC-INTRA,” IEEE TCSVT, vol. 16, pp. 1755-1760, 2007.
[5] L. Bottou, P. Haffner, P. Howard, P. Simard, Y. Bengio, and Y. LeCun, “High quality document image compression with “djvu”,” Journal of Electronic Imaging, vol. 7, p. 410, 1998.
[6] H. Shen, Y. Lu, F. Wu and S. Li, “A High-performance remote computing platform,” IEEE PerWare 2009 in conjunction with IEEE PerCom 2009.
[7] T. Lin and P. Hao, “Compound image compression for real-time computer screen image transmission,” IEEE TIP, vol. 14, pp. 993-1005, 2005.
[8] S. Wang, J. Fu, Y. Lu, S. Li and W. Gao, “Content-Aware Layered Compound Video Compression,” Proc. IEEE ISCAS 2012, pp. 145-148.
[9] B. O. Christiansen and K. E. Schauser, “Fast Motion Detection for Thin Client Compression,” Proc. of DCC 2002, pp. 332-341.
[10] X.264, http://www.videolan.org/developers/x264.html