IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 23, NO. 1, JANUARY 2013
173
Mixed Chroma Sampling-Rate High Efficiency Video Coding for Full-Chroma Screen Content

Tao Lin, Peijun Zhang, Shuhui Wang, Member, IEEE, Kailun Zhou, and Xianyi Chen
Abstract—Computer screens contain both discontinuous-tone content and continuous-tone content. Thus, the most effective way to perform screen content coding (SCC) is to use two essentially different coders: a dictionary-entropy coder and a traditional hybrid coder. Although screen content is originally in a full-chroma (e.g., YUV444) format, the current practice is to first subsample the chroma of a picture and then compress it with a chroma-subsampled (e.g., YUV420) coder. Using two chroma-subsampled coders cannot achieve high-quality SCC, but using two full-chroma coders is overkill and inefficient for SCC. To resolve this dilemma, this paper proposes a mixed chroma sampling-rate approach for SCC. An original full-chroma input macroblock (coding unit) or its prediction residual is chroma-subsampled. One full-chroma base coder and one chroma-subsampled base coder are used simultaneously to code the original and the chroma-subsampled macroblock, respectively. The coder minimizing the rate-distortion (R-D) cost is selected as the final coder for the macroblock. The two base coders are coherently unified and optimized to obtain the best overall coding performance and share coding components and resources as much as possible. The approach achieves very high visual quality with a minimal increase in computing complexity for SCC, and it has better R-D performance than a two-full-chroma-coder approach, especially at low bitrates.

Index Terms—Dictionary-entropy coding, full-chroma, hybrid coding, screen content coding.
Manuscript received April 16, 2012; revised July 18, 2012; accepted August 21, 2012. Date of publication October 10, 2012; date of current version January 9, 2013. This work was supported in part by the National Science Foundation (NSF) of China under Grants 61201226 and 61271096, the NSF of Shanghai under Grant 12ZR1433800, and the Fundamental Research Funds for the Central Universities of China under Grants 2810219002 and 2810219003. This paper was recommended by Associate Editor A. Kaup. (Corresponding author: S. Wang.) The authors are with the VLSI Laboratory, Tongji University, Shanghai 200092, China (e-mail: wangshuhui−[email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2012.2223871

I. Introduction

THE onset of cloud computing, mobile computing, and cloud-mobile computing has brought many new challenges to the video coding community. The purpose of cloud computing is to decouple user devices from computing units and to connect them through a network or communication link. First-generation cloud computing performs the decoupling at a point equivalent to the CPU peripheral bus, which has a peak data rate of 32 Gb/s. Current and future advanced network and communication infrastructures cannot provide such bandwidth to every user. To solve this problem, second-generation cloud computing performs the decoupling at a point equivalent to the GPU frame buffer or GPU screen output, which has a peak data rate of 1920 × 1200 × 60 × 24 = 3.1 Gb/s even for an ultra-HD screen resolution of 1920 × 1200 at a 60 Hz refresh rate. After screen content coding (SCC) with a compression ratio of around 100:1 to 300:1, the peak data rate of screen refreshing can be reduced to 10–30 Mb/s, which is within the bandwidth that most network or communication infrastructures can provide to every computer user. Therefore, second-generation cloud computing plus high-efficiency SCC can completely solve the cloud computing bottleneck problem and support even ultra-HD screen resolutions with uncompromised top graphics and multimedia performance [1], [2].

SCC is also an indispensable technology for a number of other rapidly growing application areas such as wireless display, WiFi display, and wireless HDMI. Driven by these applications, SCC has received increasing attention in recent years [3]–[7] and has attracted a broad mix of researchers from both academia and industry. The Joint Collaborative Team on Video Coding (JCTVC) working on high efficiency video coding (HEVC) standardization has established several SCC-related ad hoc groups to advance SCC technology and standardization.

A computer screen picture has some very special characteristics. It is a mix of two essentially different types of content: mostly computer-generated discontinuous-tone content such as text, charts, graphics, and icons, and mostly camera-captured continuous-tone content such as photographs, movie/TV clips, and natural-picture video sequences. Discontinuous-tone content features very sharp edges, uncomplicated shapes, and thin lines with few colors, even one-pixel-wide single-color lines, while continuous-tone content features relatively smooth edges, complicated textures, and thick lines with virtually unlimited colors.

It is well known that for continuous-tone content, chroma subsampling is almost visually lossless. On the other hand, chroma subsampling significantly degrades the visual quality of discontinuous-tone content [11]. For SCC, although the visual quality produced by chroma-subsampled, continuous-tone-content-oriented coders is not good, almost all applications today still use them; and because a considerable portion of screen content is continuous-tone, coding the entire picture with two full-chroma coders is overkill and inefficient. Hence, the optimal solution for achieving the best three-way balance among bitrate, coding quality, and coding complexity for SCC is to add a new full-chroma coder to a traditional chroma-subsampled coder and to mix full-chroma YUV444 coding with chroma-subsampled YUV420 coding at properly selected boundaries in an adaptive or predefined way.
This paper proposes a mixed chroma sampling-rate (MCS) coding technique for SCC that combines coding at one chroma sampling rate with coding at another chroma sampling rate at certain boundaries during the coding process. One aspect of the technique is to mix two chroma sampling-rate coders at the macroblock (or coding unit, CU, as defined in HEVC) boundary in an adaptive way. Another aspect is to mix two chroma sampling rates at the boundary between coding-chain components (such as prediction, transform, quantization, and entropy coding). The two aspects are complementary and can be used together.

It is well known that both full-chroma coding and chroma-subsampled coding have their own advantages. MCS coding can combine the merits of both. Furthermore, huge investments have been made over the past 25 years to develop chroma-subsampled codec software and hardware. MCS coding can fully utilize these assets and increase the value of past investments.

In a basic version of the MCS coding system, an original input macroblock of a full-chroma image is first chroma-subsampled. Then, a discontinuous-tone-content-oriented coder, such as a dictionary-entropy coder, is used to code the original full-chroma macroblock, and a continuous-tone-content-oriented coder, usually a traditional YUV420 hybrid coder, is used to code the chroma-subsampled version of the macroblock. Finally, the coder with the better rate-distortion (R-D) performance is selected to code the macroblock, and the corresponding coded bitstream data are put into the output bitstream.

In an improved version of the MCS coding system, the traditional chroma-subsampled YUV420 hybrid coder is slightly modified to become an MCS hybrid coder. The modification is that only the prediction is done in full-chroma YUV444 format; the prediction residual is chroma-subsampled, and the rest of the hybrid coding is done entirely in chroma-subsampled YUV420 format. The existing HEVC main profile or H.264/AVC high profile syntax and semantics can still be used, and only the prediction calculation needs to be extended from the YUV420 format to the YUV444 format. In SCC, most relative motion between pictures consists of simple pixel-position translations without any pixel-value change, resulting in all-zero prediction residuals. Therefore, coding with chroma-subsampled prediction residuals can achieve almost the same constructed picture quality as full-chroma coding at an even lower bitrate, resulting in better R-D performance, especially at low bitrates.
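For illustration, the following Python-style sketch outlines this improved hybrid path under stated assumptions: the plane dictionaries and the hybrid_backend_420 stage are hypothetical placeholders rather than the paper's implementation. Prediction is formed in YUV444, only the chroma of the residual is subsampled, and an unchanged YUV420 back end codes the result.

import numpy as np

def code_mb_improved_mcs(orig444, pred444, hybrid_backend_420):
    # orig444 / pred444: dicts of 'Y', 'U', 'V' planes of equal size (YUV444).
    # hybrid_backend_420: an existing YUV420 transform/quantization/entropy stage.
    # Prediction is done in full-chroma YUV444.
    resid = {c: orig444[c].astype(np.int16) - pred444[c].astype(np.int16)
             for c in ('Y', 'U', 'V')}

    # Only the chroma residual is subsampled (2 x 2 average); luma is untouched.
    def avg2x2(p):
        return (p[0::2, 0::2] + p[0::2, 1::2] + p[1::2, 0::2] + p[1::2, 1::2] + 2) // 4

    resid420 = {'Y': resid['Y'], 'U': avg2x2(resid['U']), 'V': avg2x2(resid['V'])}

    # The rest of the coding chain is the unchanged YUV420 hybrid coder.
    return hybrid_backend_420(resid420)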
In Section II, a general architecture of the macroblock-adaptive dual-coder MCS coding system is proposed; how the two sampling rates are mixed and how the two coders interact with each other are described. Section III presents a practical partial-lossless dual-coder MCS coding system in which one coder is a full-chroma dictionary-entropy coder and the other is a chroma-subsampled (or MCS) hybrid coder; R-D theory is used to analyze the effectiveness and complementary roles of the two coders for coding screen content. Section IV is devoted to a few critical techniques that coherently unify the dictionary-entropy coder and the hybrid coder in a macroblock-level adaptive way and significantly enhance the coding performance of both coders. Section V gives experimental results showing that an MCS coder built from a dictionary-entropy coder plus an HEVC main profile coder can achieve much higher (e.g., 40 dB higher) PSNR and much better subjective visual quality than the traditional HEVC main profile coder. Section VI concludes and outlines future work on MCS coding. This paper is partially based on the JCTVC documents [8]–[15].

II. Dual-Coder MCS Coding Architecture

A general architecture and the major components of a macroblock-adaptive dual-coder MCS coding system are shown in Fig. 1. In the encoder of the MCS coding system, Coder 1 is a discontinuous-tone-content-oriented full-chroma coder, and Coder 2 is a continuous-tone-content-oriented chroma-subsampled coder or an MCS coder that has a full-chroma part and a chroma-subsampled part. If Coder 2 is a pure chroma-subsampled coder, the full-chroma part of Coder 2 in Fig. 1(a) is null.

The full-chroma original input macroblock (or CU in HEVC) O is fed to both Coder 1 and Coder 2. If the full-chroma part of Coder 2 is null, O is sent directly to a chroma subsampler that subsamples the chroma of O but keeps its luma unchanged to generate the chroma-subsampled macroblock S. Coder 1 codes O to generate the coded bitstream b1 and the full-chroma constructed macroblock P1. The full-chroma part (if not null) of Coder 2 also codes O to generate partially coded pixels such as prediction residuals, which then go through a subsampling or equivalent process to obtain the chroma-subsampled partially coded pixels S. Finally, the chroma-subsampled part of Coder 2 codes S to generate the chroma-subsampled coded bitstream b2 and the chroma-subsampled constructed macroblock (or partially coded pixels, if the full-chroma part of Coder 2 is not null) S2.

P1 is stored in full-chroma constructed pixel buffer 1 (CPB 1), and S2 is stored in chroma-subsampled constructed pixel buffer 2 (CPB 2). CPB 1 and CPB 2 are used by Coder 1 and Coder 2 as reference, prediction, or dictionary material for coding the following pixels and macroblocks. CPB 1 may also be used for full-chroma constructed-pixel filtering or deblocking. S2 is also chroma-upsampled (and further constructed if the full-chroma part of Coder 2 is not null) to obtain the full-chroma constructed macroblock P2, mainly for R-D evaluation. O, P1, b1, P2, and b2 are sent to an R-D Cost Based Selector that calculates the R-D costs of the two coders and selects the one with the better R-D performance as the final coder for the macroblock. The corresponding coded bitstream b1 or b2 is selected and put into the output MCS bitstream.

If Coder 2 is an MCS hybrid coder with inter- and intraprediction both done in the full-chroma pixel domain, then CPB 1 is also used by Coder 2 for inter- and intraprediction, and CPB 2 is usually not used and can be ignored.
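The encoding flow just described (code each macroblock with both coders, evaluate the two R-D costs, keep the winner, and update the constructed-pixel buffers) can be summarized by the sketch below. All names here (the coder and CPB objects, sse, and the subsample/upsample helpers) are illustrative assumptions, not the paper's software.

import numpy as np

def sse(a, b):
    # Sum of squared errors over the Y, U, V planes (distortion term).
    return sum(float(np.sum((a[c].astype(np.int64) - b[c].astype(np.int64)) ** 2))
               for c in ('Y', 'U', 'V'))

def encode_mb(O, coder1, coder2, cpb1, cpb2, lam, subsample, upsample):
    # O: full-chroma (YUV444) macroblock. coder1: full-chroma coder; coder2:
    # chroma-subsampled coder. Both are assumed to expose
    # code(mb, cpb) -> (bitstream_bytes, constructed_mb).
    b1, P1 = coder1.code(O, cpb1)              # full-chroma path
    b2, S2 = coder2.code(subsample(O), cpb2)   # chroma-subsampled path
    P2 = upsample(S2)                          # back to YUV444 for R-D evaluation

    J1 = sse(O, P1) + lam * 8 * len(b1)        # R-D cost of Coder 1
    J2 = sse(O, P2) + lam * 8 * len(b2)        # R-D cost of Coder 2

    if J1 <= J2:                               # Coder 1 is the final coder
        cpb1.store(P1)
        cpb2.store(subsample(P1))              # keep the two CPBs synchronized
        return 'coder1', b1
    cpb2.store(S2)                             # Coder 2 is the final coder
    cpb1.store(upsample(S2))
    return 'coder2', b2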
Fig. 1. Macroblock-adaptive dual-coder MCS coding architecture. (a) Encoder. (b) Decoder.
In some hybrid coders, such as MPEG-4 Part 2, VC-1, and JPEG-XR, intraprediction may be done in the transform-coefficient domain. In this case, CPB 1 is used by Coder 2 only for interprediction, while CPB 2 actually stores chroma-subsampled transform coefficients and is used by Coder 2 for transform-coefficient-domain intraprediction.

To coherently unify the two coders and achieve maximum coding efficiency for both, each CPB should hold as many relevant pixels as possible. Therefore, in the encoding process, if Coder 1 is selected as the final coder, then P1 in CPB 1 is chroma-subsampled and put into CPB 2 to replace S2; otherwise, S2 in CPB 2 is chroma-upsampled and put into CPB 1 to replace P1. In this way, the two CPBs are always synchronized. No matter which coder is selected as the final coder, all constructed pixels from the final coder are available not only to the final coder itself but also, optionally, to the other coder, maximizing the coding performance of both coders.

In the decoder, the input MCS macroblock-layer bitstream is first parsed by the Macroblock Coder Type Parser to obtain the MB coder type flag, which indicates whether the current macroblock was coded by the full-chroma Coder 1. If the macroblock was coded by Coder 1, the decoding process proceeds as follows.
1) The input MCS bitstream is fed to the full-chroma Decoder 1, which decodes and constructs the full-chroma macroblock P1.
2) P1 is stored in CPB 1 to be used by Decoder 1 and the full-chroma part of Decoder 2 for reference, prediction, or dictionary, as well as for filtering or deblocking.
3) P1 is subsampled and stored in CPB 2 to be used by the chroma-subsampled part of Decoder 2. If Decoder 2 does not actually need CPB 2, this step can be skipped.
4) P1 is the final full-chroma constructed macroblock output Õ.
If the macroblock was not coded by Coder 1, the decoding process proceeds as follows.
1) The input MCS bitstream is fed to the chroma-subsampled part of Decoder 2, which decodes and constructs the chroma-subsampled macroblock (or partially decoded macroblock, if the full-chroma part of Decoder 2 is not null) S2.
2) S2 is stored in CPB 2 to be used by the chroma-subsampled part of Decoder 2 for reference, prediction, or dictionary. If Decoder 2 does not actually need CPB 2, this step can be skipped.
3) S2 is upsampled (and further constructed if the full-chroma part of Decoder 2 is not null) to obtain the full-chroma constructed macroblock P2. P2 is stored in CPB 1 to be used by Decoder 1 and the full-chroma part of Decoder 2 for reference, prediction, or dictionary, as well as for filtering or deblocking.
4) P2 is the final full-chroma constructed macroblock output Õ.
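The two decoding branches above can be condensed into a short sketch. The parser, decoder, and buffer objects are assumed interfaces introduced here for illustration only.

def decode_mb(bs, parse_flag, decoder1, decoder2, cpb1, cpb2, subsample, upsample):
    # parse_flag(bs) reads the MB coder type flag from the bitstream bs.
    # decoder1: full-chroma decoder; decoder2: chroma-subsampled decoder.
    if parse_flag(bs):                  # macroblock coded by full-chroma Coder 1
        P1 = decoder1.decode(bs, cpb1)  # step 1)
        cpb1.store(P1)                  # step 2)
        cpb2.store(subsample(P1))       # step 3); skip if CPB 2 is unused
        return P1                       # step 4): final constructed output
    S2 = decoder2.decode(bs, cpb2)      # coded by Coder 2: step 1)
    cpb2.store(S2)                      # step 2); skip if CPB 2 is unused
    P2 = upsample(S2)                   # step 3)
    cpb1.store(P2)
    return P2                           # step 4): final constructed output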
Fig. 2. MCS coding system upgraded from an existing HEVC/H.264/MPEG1/2/4 4:2:0 hybrid coder. (a) Encoder. (b) Decoder.
It should be noted that the chroma subsamplers and chroma upsamplers in Fig. 1 may be applied to constructed pixels or to partially coded pixels such as prediction residuals, and that the coding and resampling proceed on a macroblock-by-macroblock basis, usually from left to right and from the top row to the bottom row. Therefore, the subsampling and upsampling filters in the chroma subsamplers and upsamplers can involve only the left and above macroblocks that have already been constructed or partially coded. As a result, conventional symmetric filters with more than two taps cannot be used in these cases. In fact, to keep the sharp edges and lines of discontinuous-tone content, 2 × 2 average filtering for chroma subsampling and pixel duplication for chroma upsampling often perform better than multitap filtering.

In the proposed MCS coding system, since a majority of existing chroma-subsampled YUV420 hybrid coding tools can be reused in Coder 2 and Decoder 2 of Fig. 1, and the additional full-chroma coding tools in Coder 1 and Decoder 1 for discontinuous-tone content are usually simpler than Coder 2 and Decoder 2, the development cost and unit cost of a coding system for full-chroma screen content can be minimized. Typically, the computation and implementation complexity of Coder 1 is about 30% of that of Coder 2, while the PSNR improvement of the MCS dual-coder over a single Coder 2 is about 10–40 dB for typical screen content, as will be shown in Section IV.
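As a minimal illustration of the causal resampling mentioned above, the following numpy sketch implements 2 × 2 average filtering for chroma subsampling and pixel duplication for chroma upsampling; it is illustrative code, not the authors' implementation, and assumes even plane dimensions.

import numpy as np

def subsample_chroma_2x2(plane):
    # Average each 2 x 2 block of a chroma plane (YUV444 -> YUV420).
    p = plane.astype(np.uint16)
    out = (p[0::2, 0::2] + p[0::2, 1::2] + p[1::2, 0::2] + p[1::2, 1::2] + 2) // 4
    return out.astype(plane.dtype)

def upsample_chroma_dup(plane):
    # Duplicate each chroma sample into a 2 x 2 block (YUV420 -> YUV444).
    return np.repeat(np.repeat(plane, 2, axis=0), 2, axis=1)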
III. A Partial-Lossless MCS Coding System

The MCS coding system proposed here uses a lossless full-chroma YUV444 dictionary-entropy coder as the base full-chroma coder (Coder 1 in Fig. 1) and a lossy chroma-subsampled YUV420 hybrid coder, such as an HEVC main profile or H.264/AVC high profile coder, as the base chroma-subsampled coder (Coder 2 in Fig. 1). Therefore, CPB 1 in Fig. 1 becomes the dictionary for the string search and matching of Coder 1. If the prediction of Coder 2 is done in YUV444 format, CPB 1 also provides the neighboring pixels for intraprediction and the reference pictures for interprediction; if the prediction of Coder 2 is done in YUV420 format, CPB 2 in Fig. 1 provides them instead. Since Coder 1 is lossless and Coder 2 is lossy, the resulting MCS coding system is a partial-lossless (defined as partially lossless and partially lossy) coding system.

A. Block Diagrams of the MCS Coding System

Fig. 2 is the detailed block diagram of a partial-lossless MCS coding system. This dual-coder MCS coding system is upgraded from an existing chroma-subsampled hybrid coding system; the only additions in the encoder are a chroma subsampler, a pixel reorder, and a string search and matching tool, which is very effective for discontinuous-tone content coding as reported in [3], [5], and [6]. Correspondingly, the only additions in the decoder are a chroma upsampler, a pixel reorder, and a string construction tool. The existing YUV420 hybrid coder (HEVC/H.264/MPEG) can be used without any modification.

Fig. 3 shows an improved MCS coding system. The improvement over the coding system of Fig. 2 is that, in the hybrid coder, the YUV420 prediction (both intra and inter) is upgraded to YUV444 prediction. Only the prediction calculation is extended from handling YUV420 data to handling YUV444 data, i.e., the amount of chroma data in the prediction calculation quadruples. No change to the existing YUV420 hybrid coding syntax and semantics, including prediction modes and motion vector formats, is needed. All other YUV420 hybrid coding tools can be reused without any change.

The dictionary-entropy coder treats intra- and interframe coding in the same way: when the dictionary is bigger than one frame, string search and matching cross multiple frames and interframe coding is achieved (a simplified sketch of string matching is given below). The hybrid coder performs both intra- and interframe coding as usual. In particular, when coding a P or B frame, the hybrid coder first selects either an intramode or an intermode to code a macroblock as usual. Then, one of the two coders is selected as the final coder for the macroblock based on the R-D cost.
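A much-simplified picture of string search and matching over a pixel dictionary is the greedy, exhaustive-search sketch below. It is for intuition only; the actual matcher and its entropy coding are considerably more elaborate and are described in [3], [5], and [6].

def dictionary_code_mb(pixels, dictionary):
    # pixels: the macroblock's samples in coding order (e.g., after pixel reorder);
    # dictionary: previously constructed samples (the contents of CPB 1).
    # Emits ('match', distance, length) and ('literal', sample) tokens losslessly.
    tokens, i = [], 0
    while i < len(pixels):
        best_len, best_pos = 0, -1
        for pos in range(len(dictionary)):     # exhaustive search, for clarity only
            l = 0
            while (i + l < len(pixels) and pos + l < len(dictionary)
                   and dictionary[pos + l] == pixels[i + l]):
                l += 1
            if l > best_len:
                best_len, best_pos = l, pos
        if best_len >= 2:                      # code a matched pixel string
            tokens.append(('match', len(dictionary) - best_pos, best_len))
            dictionary.extend(pixels[i:i + best_len])
            i += best_len
        else:                                  # code an unmatched pixel directly
            tokens.append(('literal', pixels[i]))
            dictionary.append(pixels[i])
            i += 1
    return tokens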
Fig. 3. Upgrading 4:2:0 prediction to 4:4:4 prediction without any change to 4:2:0 syntax and semantics. (a) Encoder. (b) Decoder.
B. R-D Theory Based Effectiveness Analysis

In the hybrid coder of Figs. 2 and 3, the R-D cost is calculated by J_hybrid = D_hybrid + λR_hybrid, where D_hybrid and R_hybrid are the distortion and the bitrate of coding the macroblock with the hybrid coder, and λ is the Lagrange multiplier that controls the weight of the bit cost and usually depends on the quantization parameter (qp) of the hybrid coder: λ = 0.57 × 2^(qp/3) for intraframe coding in HM [19], and the formula is more sophisticated for interframe coding in HM. The dictionary-entropy coder is lossless, so its R-D cost is calculated by J_dict = λR_dict, where R_dict is the bitrate of coding the macroblock with the dictionary-entropy coder and λ is the same as in J_hybrid, bringing J_dict to the same scale as J_hybrid.

The cost ratio J_hybrid/J_dict indicates which coder is more efficient for a macroblock. Fig. 4 shows a typical screen picture of a webpage and its coder selection distribution map, in which red and blue squares mark macroblocks with J_hybrid/J_dict > 1 and J_hybrid/J_dict < 1, respectively. For the red and blue squares, a brighter color indicates a bigger difference between the two R-D cost values, and the brightest color indicates that the difference is more than fourfold (J_hybrid/J_dict ≥ 4 for a red square or J_dict/J_hybrid ≥ 4 for a blue square). In the map of Fig. 4, 47.97% of the macroblocks are coded by the dictionary-entropy coder. From the map, we find that: 1) the two types of screen content are consistently mapped to the two colors, and 2) a majority (84% in this picture) of the squares have the brightest color, which means that one coder performs at least 400% better than the other in terms of R-D cost. This finding reveals that the two coders are indeed quite complementary and play very different roles in effectively compressing different content. When one coder cannot compress a macroblock well, the other coder has a very high (84% in this picture) probability of coding the macroblock at least four times better. Thus, unifying the two coders into a single coding framework can improve SCC performance significantly.
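As a small numeric illustration of these cost formulas, the per-macroblock decision reduces to comparing J_hybrid and J_dict. The numbers below are made up for illustration and are not from the paper.

def lagrange_multiplier(qp):
    # Intraframe lambda quoted above for HM; the interframe formula is more involved.
    return 0.57 * 2 ** (qp / 3.0)

def rd_costs(D_hybrid, R_hybrid, R_dict, qp):
    # Returns (J_hybrid, J_dict) for one macroblock.
    lam = lagrange_multiplier(qp)
    return D_hybrid + lam * R_hybrid, lam * R_dict   # dictionary coder: distortion = 0

# Hypothetical text macroblock at qp = 28: the hybrid coder leaves visible
# distortion, so the lossless dictionary-entropy coder has the lower cost.
Jh, Jd = rd_costs(D_hybrid=5.0e4, R_hybrid=300, R_dict=200, qp=28)
print(Jh, Jd, 'dictionary wins' if Jd < Jh else 'hybrid wins')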
Fig. 4. Typical screen picture of a webpage and its coder selection distribution map.
To further confirm the above finding on a larger scale, we calculated 375 696 pairs of (J_hybrid, J_dict) from the six screen capture pictures shown in Fig. 10 and the five screen capture video sequences (first frame) [10] shown in Fig. 11. For each macroblock, six qp values of 20, 24, 28, 32, 36, and 40 are used in the hybrid coder to calculate six pairs of (J_hybrid, J_dict). Table I shows the percentages of the 375 696 ratios J_hybrid/J_dict falling in nine intervals: (0, 1/4], (1/4, 1/3], (1/3, 1/2], (1/2, 1), [1, 1], (1, 2), [2, 3), [3, 4), and [4, ∞). The first interval is the case in which J_dict is at least four times larger than J_hybrid, and the second interval is the case in which J_dict is between three and four times J_hybrid. Similarly, the fifth interval is the case in which J_dict is equal to J_hybrid, the eighth interval is the case in which J_hybrid is between three and four times J_dict, and the ninth interval is the case in which J_hybrid is at least four times larger than J_dict. Table I confirms that, overall, more than 38% of macroblocks are coded by the dictionary-entropy coder and that, in more than 70% of macroblocks, one coder is at least four times better than the other.

C. Macroblock Layer Syntax of MCS Coding

An MCS bitstream is a mix of full-chroma dictionary-entropy coded macroblock bitstream segments and chroma-subsampled hybrid coded macroblock bitstream segments. The macroblock
TABLE I
Percentages of the 375 696 Ratios J_hybrid/J_dict in Nine Intervals
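For reference, the nine-interval bucketing behind Table I can be written in a few lines. The interval edges come from the text above; the 375 696 ratio values themselves are not reproduced here.

def interval_index(r):
    # r = J_hybrid / J_dict. Intervals, in order:
    # (0,1/4], (1/4,1/3], (1/3,1/2], (1/2,1), [1,1], (1,2), [2,3), [3,4), [4,inf)
    if r <= 0.25:  return 0
    if r <= 1 / 3: return 1
    if r <= 0.5:   return 2
    if r < 1:      return 3
    if r == 1:     return 4
    if r < 2:      return 5
    if r < 3:      return 6
    if r < 4:      return 7
    return 8

def interval_percentages(ratios):
    counts = [0] * 9
    for r in ratios:
        counts[interval_index(r)] += 1
    return [100.0 * c / len(ratios) for c in counts]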