Practical operating points of multi-resolution frame compatible (MFC) stereo coding

Taoran Lu, Hariharan Ganapathy, Gopi Lakshminarayanan, Tao Chen, Peng Yin, David Brooks and Walt Husak

Dolby Laboratories, 432 Lakeside Drive, Sunnyvale, CA, USA 94085

ABSTRACT

3D content is gaining popularity, and the production and delivery of 3D video is now an active work item among video compression experts, content providers and the CE industry. Frame compatible stereo coding was initially adopted for the first generation of 3DTV broadcasting services because of its compatibility with existing 2D decoders. However, the frame compatible solution sacrifices half of the original video resolution. In 2012, the Moving Picture Experts Group (MPEG) issued a call for proposals (CfP) for solutions that improve the resolution of the frame compatible stereo 3D video signal while maintaining backward compatibility with legacy decoders. The standardization process of multi-resolution frame compatible (MFC) stereo coding was then started. In this paper, the solution submitted in response to the CfP – Orthogonal Muxing Frame Compatible Full Resolution (OM-FCFR) – is introduced. In addition, this paper provides experimental results to guide broadcasters in selecting operating points for MFC. It is observed that for typical broadcast bitrates, more than 0.5 dB PSNR improvement can be achieved by MFC over the frame compatible solution with only 15%~20% bitrate overhead.

Keywords: 3DTV, Multi-resolution frame compatible

1. INTRODUCTION

The provision of a stereoscopic (3D) user experience is a long-held goal of both content providers and CE manufacturers. In 2010, CE manufacturers launched 3DTV products, initially supported by 3D content from Blu-ray media. Nowadays, 3DTV services are offered by a number of platform providers, such as satellite and cable operators. It is believed that we will see further growth in 3DTV consumer electronics products and the delivery of 3DTV content. One practical consideration for these service providers is to utilize as much of their existing end-to-end HDTV infrastructure as possible.

One approach is to deliver one view (typically the left eye) as conventional 2D video and transmit an additional layer carrying the information needed to reconstruct the other view. This is called “service compatible stereo coding”. With this approach, broadcasters can deliver the 2D and 3D versions of a program simultaneously: a user with a 2D television/set-top box receives the 2D version, and a user with a 3DTV/set-top box receives the 3D version. However, because the production chains of 2D HDTV and 3DTV programs are quite different – a good 3D video signal follows a very different production grammar from a good 2D video signal (see the discussion in the experiment section) – service compatible stereo coding has not become an established solution in the broadcast market.

In another approach, multiplexing the left-view and right-view images into a single high-definition frame allows the service provider to reuse most of the existing production infrastructure and the whole of the existing distribution infrastructure. In most cases, the set-top boxes already deployed at the user’s premises for HD services can also be reused. This method is commonly referred to as “frame compatible stereo coding” [2][3]. The multiplexing of the two views involves a decimation method and an arrangement method.
For example, decimation could be horizontal, vertical, or quincunx, with consideration of different sampling offsets in the two views. Arrangements could include side by side, top and bottom, line-interleaved, and checkerboard packing. The most commonly used methods are horizontal decimation with side by side arrangement, usually called frame compatible side by side (FC-SbS), and vertical decimation with top and bottom arrangement (FC-TaB). Frame compatible formats are the key technology adopted in the first generation of 3DTV broadcasting services.

Typically, FC solutions sacrifice spatial resolution through low-pass filtering and sub-sampling, usually by half for each view, while maintaining compatibility with existing codecs and currently deployed HDTV infrastructure. The frame packing information is signaled in the bitstream by the frame packing arrangement SEI message [4]. Before display, the reconstructed FC signal is upsampled to a full-size image. Because low-pass filtering is applied in one dimension to create the frame compatible signal, FC-SbS preserves vertical frequencies but loses horizontal frequencies; FC-TaB, on the contrary, preserves horizontal frequencies but loses vertical frequencies. An illustration of FC-SbS and FC-TaB is shown in Figure 1. To better understand the impact of the decimation, a set of frequency spectrum visualizations is also shown in the middle of Figure 1. Even if the original image has a flat spectrum, after SbS or TaB packing the high frequency components are lost.

Since the frame compatible solution by nature sacrifices the quality of each view, and more and more devices in the market are capable of real-time decoding of full resolution stereoscopic HD video, it is expected that eventually all stereoscopic content will be delivered at full resolution. However, considering legacy infrastructure and devices, full-resolution 2D, half-resolution 3D, and full-resolution 3D solutions will co-exist in the market. This makes backward compatibility a critical issue: a solution that provides an enhancement of the frame compatible signal would give service providers the flexibility to migrate seamlessly to full resolution while remaining compatible with half-resolution services for their existing 3D customers.

Figure 1. FC-SbS, FC-TaB and frequency spectrums visualization of the original and frame compatible signals.
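As a concrete illustration of horizontal decimation with side by side packing, the following sketch uses plain Python lists; a crude 2-tap average stands in for the normative low-pass filter, and the helper names (decimate_h, pack_sbs) are hypothetical:

```python
def decimate_h(img):
    """Crude low-pass + 2:1 horizontal decimation: average column pairs."""
    return [[(row[x] + row[x + 1]) // 2 for x in range(0, len(row) - 1, 2)]
            for row in img]

def pack_sbs(left, right):
    """Side by side packing: half-width left view next to half-width right view."""
    lh, rh = decimate_h(left), decimate_h(right)
    return [lrow + rrow for lrow, rrow in zip(lh, rh)]

# toy 8x2 "views"; a real encoder would operate on full 1920x1080 pictures
left = [[x for x in range(8)] for _ in range(2)]
right = [[100 + x for x in range(8)] for _ in range(2)]
sbs = pack_sbs(left, right)
# each packed row holds the decimated left samples followed by the right ones
```

Top and bottom packing is the same idea with the roles of rows and columns swapped.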

The easiest way to enhance a frame compatible service is simulcast: the full resolution signals are encoded and delivered in completely separate channels. However, the bitrate of the extra transmission channel may be prohibitively high. To reduce the necessary bitrate, it may be desirable to use a scalable (i.e., layered) coding approach, in which a frame compatible bitstream (the base layer bitstream) can be decoded by a legacy decoder, and an additional bitstream (the enhancement layer bitstream) is used to enhance the delivered video quality beyond the base layer quality. The DVB commercial requirements [5] suggest that a constrained enhancement layer adding no more than 25% bitrate over the frame compatible base layer is desirable.

Given that evidence of commercial interest in a frame compatible enhancement solution had been presented, ISO/IEC JTC1/SC29/WG11 (MPEG) issued a call for proposals (CfP) [6] in July 2012 and evaluated the responses in October 2012. Both objective tests and formal subjective tests were performed. OM-FCFR was adopted as the specification and test model for the MFC standard. MFC, originally short for “MPEG frame compatible”, was renamed multi-resolution frame compatible stereo coding during the 102nd MPEG meeting, and the work later became a joint effort between MPEG and ITU.

This paper is organized as follows: the technical details of the MFC standard are presented in Section 2, the test results of MFC under practical operating points are shown in Section 3, and Section 4 draws the conclusions.

2. TECHNICAL DETAILS OF MFC

2.1 System Overview

MFC aims to restore the original resolution by encoding and signaling, in an enhancement layer with relatively low overhead, the information that is missing from the upsampled FC picture. The FC signal in the base layer remains decodable by legacy decoders, which can generate a pair of half-resolution views by performing the proper up-sampling on the BL pictures. The enhancement layer carries the orthogonal high frequencies lost in the base layer. The base layer and enhancement layer use orthogonal frame packing formats: if the base layer is FC-SbS, then the enhancement layer is TaB packed, so that full resolution is retained by preserving the vertical high frequencies in the base layer and the horizontal high frequencies in the enhancement layer; if the base layer is FC-TaB, the enhancement layer is SbS packed. Figure 2 shows how OM-FCFR changes the frequency spectrum of the SbS base layer signal, from Figure 2(b) to Figure 2(c), by bringing back most of the orthogonal high frequencies lost in the base layer. MFC still misses some diagonal high frequencies, but these frequencies are neither common in real-world image/video signals nor perceptually significant, and they are hard to preserve during compression anyway.

Figure 2. Frequency spectrums of (a) the original, (b) SbS base layer and (c) MFC signals.

[diagram: Left/Right View Inputs → BL/EL Video Encoders → left/right view bitstreams → BL/EL Video Decoders → Decoded Left/Right Views]

Figure 3. System diagram of MVC stereo high profile.

[diagram: Left and Right View Inputs → BL signal generation (SbS/TaB frame packing) and EL signal generation (TaB/SbS) → BL/EL Video Encoders → (base layer) Frame Compatible Bitstream and Enhancement Layer Bitstream; an RPU with RPU parameter signalling links the layers on both encoder and decoder sides → BL/EL Video Decoders → Decoded Frame Compatible 3D Output and, after full resolution reconstruction, Decoded Left and Right Views]

Figure 4. System diagram of MFC high profile.

MFC is a development of the MVC Stereo High Profile of AVC [1]. Since the base layer and the enhancement layer in MFC have orthogonal frame packing formats, the existing MVC architecture cannot take advantage of an inter-view reference picture that has a different frame packing format. To deal with this issue, MFC extends MVC by introducing an additional processing element – the Reference Processing Unit (RPU) – which pre-processes a base layer reference picture before it is used for the prediction of the corresponding enhancement layer signal. The pre-processing consists of 1D downsampling and upsampling filtering that converts the base layer frame packing format to the enhancement layer frame packing format. The block diagrams of the MVC and MFC systems are shown in Figure 3 and Figure 4, respectively.

2.2 Enhancement Layer Signal Generation

The generation of the enhancement layer input signal is outside the scope of the MFC standard, but it is very important. The general rule is that the enhancement layer signal should contain as much as possible of the orthogonal high frequency information missing in the base layer. MFC adopts the differential residue method: the enhancement layer signal is the differential residue between the original full resolution image and the upsampled original base layer FC image. To keep the coding overhead of the enhancement layer signal low, the original base layer signal is upsampled rather than the reconstructed one, so that the coding artifacts of the base layer are not carried into the EL. For the coding of the residual image, a carrier image is added to the residue image to form a new “natural” enhancement layer signal that is easier to encode. To allow better inter-layer prediction, the carrier image is generated from the original base layer signal using the RPU, following the processing steps specified in Section 2.3. More details about the carrier signal can be found in [8].

In summary, the generation of the enhancement layer signal, assuming the base layer is in FC-SbS format, proceeds as follows:

(1) Horizontally upsample the frame compatible BL input signal to full-size left and right views;
(2) Create the full-size residue signals by subtracting the upsampled BL input signal from the full-size original signal;
(3) Mux the full-size residue signal from step (2) into TaB format after vertical downsampling;
(4) If the RPU filter mode is on, generate a TaB carrier signal by converting the original BL SbS signal to TaB through the RPU; otherwise, generate a constant-value image;
(5) Add the TaB residue signal from step (3) and the carrier signal from step (4) to form the EL signal.

Figure 5 shows the diagram of enhancement layer input signal generation.

Figure 5. MFC enhancement layer input signal generation.
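The five steps above can be sketched in a few lines. In this minimal illustration, nearest-neighbour resampling stands in for the normative filters, the helper names are hypothetical, and the carrier is taken as a ready-made input (here the DC-mode constant 128) rather than produced by an RPU model:

```python
def up_h(img):
    """Step 1 helper: 1:2 horizontal upsampling by pixel repetition."""
    return [[v for v in row for _ in (0, 1)] for row in img]

def down_v(img):
    """Step 3 helper: 2:1 vertical decimation (drop every other row)."""
    return img[::2]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def el_signal(full_left, full_right, bl_sbs, carrier):
    w = len(bl_sbs[0]) // 2
    up_l = up_h([row[:w] for row in bl_sbs])   # step 1: upsample BL views
    up_r = up_h([row[w:] for row in bl_sbs])
    res_l = sub(full_left, up_l)               # step 2: full-size residues
    res_r = sub(full_right, up_r)
    tab = down_v(res_l) + down_v(res_r)        # step 3: TaB mux of residues
    return add(tab, carrier)                   # steps 4-5: add carrier

# toy 4x4 views whose content exactly matches the upsampled base layer,
# so the residue is zero and the EL signal equals the carrier
bl_sbs = [[1, 2, 9, 8] for _ in range(4)]
full_l = [[1, 1, 2, 2] for _ in range(4)]
full_r = [[9, 9, 8, 8] for _ in range(4)]
dc_carrier = [[128] * 4 for _ in range(4)]
el = el_signal(full_l, full_r, bl_sbs, dc_carrier)
```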

2.3 Reference Processing Unit

The RPU has two operational modes, filter mode and DC mode, controlled by the syntax element rpu_filter_enabled_flag. In the DC mode, the carrier image pels are set to a constant value, 128, which avoids the filtering process and thus has the minimum complexity. In the filter mode, the RPU applies a 1D downsampling filter and a 1D upsampling filter to convert the base layer FC format to the enhancement layer FC format. Figure 6 illustrates how the RPU filter mode works for an FC-SbS base layer signal: the RPU first performs vertical downsampling on the base layer picture, and then horizontal upsampling to generate a TaB prediction picture.

For field coding (interlaced) support, vertical filtering is done for each field independently. The top field and bottom field of a frame are split first, then vertical downsampling/upsampling is performed for each field separately. The top and bottom fields are then merged (interleaved) back to create the frame.
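Put together, the filter-mode conversion, including the field-wise variant, can be sketched as follows; simple 2:1 decimation and pixel-repeat upsampling stand in for the normative RPU filters, and the function name is illustrative:

```python
def rpu_sbs_to_tab(bl_sbs, field_processing=False):
    """RPU filter-mode sketch: SbS base layer -> full-size TaB prediction."""
    def down_v(img):
        return img[::2]                        # vertical 2:1 decimation
    def up_h(img):
        return [[v for v in row for _ in (0, 1)] for row in img]

    if field_processing:
        # interlaced support: filter top and bottom fields independently,
        # then interleave them back into a frame
        top = rpu_sbs_to_tab(bl_sbs[0::2])
        bot = rpu_sbs_to_tab(bl_sbs[1::2])
        out = []
        for t, b in zip(top, bot):
            out += [t, b]
        return out

    w = len(bl_sbs[0]) // 2
    half = down_v(bl_sbs)                      # half-height SbS picture
    left = up_h([row[:w] for row in half])     # full-width left view
    right = up_h([row[w:] for row in half])    # full-width right view
    return left + right                        # stack top-and-bottom

bl = [[1, 2, 9, 8], [3, 4, 7, 6], [5, 6, 5, 4], [7, 8, 3, 2]]  # toy 4x4 SbS
pred = rpu_sbs_to_tab(bl)
# pred is a full-size TaB picture: top half left view, bottom half right view
```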

[diagram: the RPU vertically downsamples the left and right halves of the SbS base layer picture, then horizontally upsamples each half-width view to form the TaB EL prediction]

Figure 6. RPU filters to do frame packing format conversion.

2.4 MFC Enhanced Resolution Picture Reconstruction

MFC has the same codec architecture as MVC except for the inter-view reference process with the RPU. After decoding, MFC needs a “reconstruction” step to generate the enhanced resolution left and right views. Assuming the base layer is FC-SbS, the reconstruction proceeds as follows:

(1) Horizontally upsample the decoded base layer FC-SbS signal to full size;
(2) Create the residue image by subtracting the RPU-processed base layer signal in TaB format from the enhancement layer signal, recovering the lost orthogonal high frequency information;
(3) Vertically upsample the residual signal from step (2);
(4) Add the upsampled base layer signal from step (1) and the upsampled residue signal from step (3).

Figure 7 shows the diagram of full resolution output reconstruction.

Figure 7. MFC full resolution output reconstruction.
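A minimal sketch of the four reconstruction steps, again with pixel/row repetition standing in for the normative upsampling filters and with illustrative helper names:

```python
def up_h(img):   # 1:2 horizontal upsampling by pixel repetition
    return [[v for v in row for _ in (0, 1)] for row in img]

def up_v(img):   # 1:2 vertical upsampling by row repetition
    return [row for row in img for _ in (0, 1)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def reconstruct(bl_sbs, el_tab, rpu_pred_tab):
    w = len(bl_sbs[0]) // 2
    up_l = up_h([row[:w] for row in bl_sbs])      # step 1
    up_r = up_h([row[w:] for row in bl_sbs])
    res = sub(el_tab, rpu_pred_tab)               # step 2
    h = len(res) // 2
    res_l, res_r = up_v(res[:h]), up_v(res[h:])   # step 3
    return add(up_l, res_l), add(up_r, res_r)     # step 4

# zero-residue toy case: the EL equals the RPU prediction, so the output
# is simply the upsampled base layer views
bl_sbs = [[1, 2, 9, 8] for _ in range(4)]
pred_tab = [[0] * 4 for _ in range(4)]
left, right = reconstruct(bl_sbs, pred_tab, pred_tab)
```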

2.5 Syntax Change over MVC

MFC requires only minimal high-level changes over the MVC syntax, all related to the RPU. A new profile_idc = 134 is defined to signal an MFC bitstream. A few syntax elements, such as mfc_format_idc and rpu_filter_enabled_flag, are introduced for the RPU processing. Since AVC allows an interlaced sequence to be coded in either frame coding mode or field coding mode, the RPU needs to be instructed whether to process an inter-layer reference picture from the base layer as a frame or as fields. Therefore, the syntax element rpu_field_processing_flag indicates whether the RPU should apply frame or field processing. These syntax elements are added at the end of the sequence parameter set MVC extension, as shown in Table 1. There is no change to the PPS, slice header, or block level. The complete description of the syntax, semantics and decoding process can be found in the DAM 5 of AVC [7].

2.6 MFC Filter Design

MFC employs a variety of filters, which can be categorized as multiplexing (muxing) filters, RPU filters, and demultiplexing (reconstruction) filters. When designing muxing filters, the goal is to retain as much information as possible from the original signal without causing aliasing. For down-sampling, a muxing filter is designed to have a very flat pass-band response and strong attenuation at the midpoint of the spectrum, where the signal is folded during down-sampling. For the base layer muxing, the SVC3D filter with cutoff frequency 0.4 specified in the CfP is used [6]. For the enhancement layer muxing, a newly designed filter with cutoff frequency 0.44 is used to keep more information in the enhancement layer without causing aliasing.

The RPU filters comprise a 1D downsampling filter and a 1D upsampling filter. Because the high frequencies in a carrier image are not used for reconstruction, and the purpose of the carrier signal is to increase coding efficiency for the enhancement layer signal, the downsampling filter is designed with a very low cutoff frequency. To maintain the integrity of the residual image, the range of the carrier signal is reduced by half; the corresponding right-shift operation is integrated into the RPU downsampling filter. The filters are also designed to have low complexity. Table 2 lists the proposed downsampling filter (OM_RPU_DOWN_F0) and upsampling filter (OM_RPU_UP_F0), and Figure 8 shows their frequency responses. OM_RPU_UP_F0 is also used as the reconstruction filter, which is the FC upsampling filter defined in the MFC CfP.

Table 1. Syntax changes of sequence_parameter_set_mvc_extension()

seq_parameter_set_mvc_extension( ) {                        C    Descriptor
  ...
  if( profile_idc = = 134 ) {
    mfc_format_idc                                          0    u(6)
    if( mfc_format_idc = = 0 | | mfc_format_idc = = 1 ) {
      default_grid_position_flag                            0    u(1)
      if( !default_grid_position_flag ) {
        view0_grid_position_x                               0    u(4)
        view0_grid_position_y                               0    u(4)
        view1_grid_position_x                               0    u(4)
        view1_grid_position_y                               0    u(4)
      }
    }
    rpu_filter_enabled_flag                                 0    u(1)
    if( !frame_mbs_only_flag )
      rpu_field_processing_flag                             0    u(1)
  }
}
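Table 1 maps directly onto fixed-length bit fields, so a reader for it is straightforward. The sketch below is not a full SPS parser: it assumes the stream position is already at these fields and that emulation-prevention bytes have been removed, and the helper class is an illustrative stand-in for a real bitstream reader:

```python
class BitReader:
    """Minimal MSB-first fixed-length bit reader."""
    def __init__(self, data):
        self.data, self.pos = data, 0

    def u(self, n):
        v = 0
        for _ in range(n):
            byte = self.data[self.pos // 8]
            v = (v << 1) | ((byte >> (7 - self.pos % 8)) & 1)
            self.pos += 1
        return v

def parse_mfc_sps_fields(r, profile_idc, frame_mbs_only_flag):
    """Follow the conditional structure of Table 1."""
    s = {}
    if profile_idc == 134:                      # MFC profile
        s['mfc_format_idc'] = r.u(6)
        if s['mfc_format_idc'] in (0, 1):
            s['default_grid_position_flag'] = r.u(1)
            if not s['default_grid_position_flag']:
                for name in ('view0_grid_position_x', 'view0_grid_position_y',
                             'view1_grid_position_x', 'view1_grid_position_y'):
                    s[name] = r.u(4)
        s['rpu_filter_enabled_flag'] = r.u(1)
        if not frame_mbs_only_flag:
            s['rpu_field_processing_flag'] = r.u(1)
    return s

# 8 bits: mfc_format_idc=0 (000000), default_grid_position_flag=1,
# rpu_filter_enabled_flag=1; frame_mbs_only_flag=1 so no field flag follows
fields = parse_mfc_sps_fields(BitReader(bytes([0b00000011])), 134, 1)
```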

Table 2. RPU downsample and upsample filters

Name             Filter 1D tap           Dynamic Range   Offset   Tap length   Cutoff Freq. (ω)
OM_RPU_DOWN_F0   [4 7 10 7 4]            6               32       5            0.29
OM_RPU_UP_F0     [3 -17 78 78 -17 3]     7               64       6            0.50

Figure 8. Frequency responses of MFC RPU filters.
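The Table 2 columns can be read as an integer filtering recipe: multiply by the taps, add the rounding offset, and right-shift by the dynamic range. Under that (assumed) reading, the DC gain of OM_RPU_DOWN_F0 is 32/2^6 = 1/2, which matches the halving of the carrier range described above, while OM_RPU_UP_F0 has DC gain 128/2^7 = 1. A sketch with edge-clamped 1D filtering (the exact phase alignment and the 2:1 decimation step are omitted):

```python
DOWN_TAPS, DOWN_SHIFT, DOWN_OFFSET = [4, 7, 10, 7, 4], 6, 32
UP_TAPS, UP_SHIFT, UP_OFFSET = [3, -17, 78, 78, -17, 3], 7, 64

def filt(sig, taps, shift, offset):
    """Integer 1D filtering with edge clamp: (sum(tap*pel) + offset) >> shift."""
    n, half = len(sig), len(taps) // 2
    out = []
    for i in range(n):
        acc = sum(t * sig[min(max(i + k - half, 0), n - 1)]
                  for k, t in enumerate(taps))
        out.append((acc + offset) >> shift)
    return out

flat = [200] * 16
down = filt(flat, DOWN_TAPS, DOWN_SHIFT, DOWN_OFFSET)  # DC gain 1/2
up = filt(flat, UP_TAPS, UP_SHIFT, UP_OFFSET)          # DC gain 1
```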

2.7 MFC Reference Software

MFC is implemented on top of the MVC framework in the AVC reference software JM18.3. All mode decision and motion estimation operations are performed as in the original JM18.3 software using the well-known Lagrangian technique, and no encoder-specific optimization is added in the simulations.

3. TEST RESULTS

3.1 MFC CfP Tests

The MFC CfP specifies experiment settings aimed mainly at the evaluation of coding artifacts. These tests were performed at the knee of the rate-distortion curve to highlight the differences MFC could bring over the test anchors (FC and SVC). Therefore, very low bitrates were used – 1.5 Mbps, 2 Mbps, 2.5 Mbps and 4 Mbps for the base layer, with 25% and 50% bitrate overhead settings for the enhancement layer. A formal subjective test was also conducted by MPEG to evaluate the subjective quality of MFC over the FC anchor. The test results clearly show that MFC achieves significant quality improvement both objectively and subjectively [9].

However, these CfP base layer bitrates are not practical operating points for broadcasters. As measured on the Astra satellite, critical 3D sports content (FC-SbS) typically has bitrates around 13 Mbps~17 Mbps. Also, the 25%~50% bitrate overhead set for the enhancement layer would be uneconomic given the limited bandwidth of the broadcast channel; the DVB commercial requirements [5] suggest that a constrained enhancement layer adding no more than 25% bitrate over the frame compatible base layer is desirable. Therefore, we designed a set of experiments to investigate MFC performance at practical base layer bitrates with no more than 25% additional bitrate, to provide a reference for service providers when setting practical operating points for MFC.

3.2 Practical Operating Points Test Configuration

The source materials are HD1080i uncompressed interlaced full resolution left-eye and right-eye pairs. Three 3D sports pairs are chosen, each 15 seconds long. These 3D clips comprise a variety of scenes in a soccer game. It should be noted that a 3D sports video is very different from a 2D sports video. To best preserve the depth perception, 3D cameras are usually placed at eye height and very close to the grass field (the farther away the camera, the less depth human eyes can perceive). By contrast, 2D cameras are usually placed higher up and further away to create a perspective view, so that a viewer can understand the relative positions of the players. This makes the video scene very different between 3D and 2D: a typical 2D soccer image usually has the green grass field as the background, while a typical 3D soccer image usually has only a small portion of green field, and the majority of the background is the crowd sitting around the field (see Figure 9).

(a) a typical 2D scene

(b) a typical 3D scene

Figure 9. 2D scene vs. 3D scene in sports

The bitrates for the frame compatible base layer are set to 13 Mbps, 15 Mbps and 17 Mbps for the three sequences, respectively. To investigate the impact of different enhancement layer bitrates, we set four target EL overheads: 10%, 15%, 20% and 25% (considering that the DVB commercial requirements suggest a desirable overhead of less than 25%). The coded video sequences were evaluated subjectively to ensure that they meet broadcast quality.

3.3 Results and Analysis

As the MFC CfP tests already showed a significant performance improvement over FC, we focus here on the impact of the additional bitrate used to encode the enhancement layer, which enhances the resolution of the final reconstructed image. As mentioned earlier, the frame compatible signal loses high frequencies when the decimation is performed. Therefore, even if the compression is lossless, the upsampled reconstructed pictures will exhibit quality degradation; measured in PSNR, this can be regarded as the theoretical upper bound of the compressed video quality. Similarly, MFC adds the high frequencies back in the dimension orthogonal to the FC base layer but still cannot recover all the diagonal high frequencies, so there is also a theoretical upper bound on the reconstruction PSNR. Table 3 shows the theoretical PSNR upper bounds for the test sequences. The delta PSNR shows that, in theory, MFC could achieve substantial quality improvements over FC, whose quality is fairly limited by the decimation process. Figure 10 illustrates the spectra of a typical 3D scene in the test sequence using FC and MFC: the horizontal high frequencies missing in FC-SbS are clearly present in MFC.

Table 3. PSNR of reconstructed video without lossy compression (theoretical upper bound for compression)

                       sequence 1   sequence 2   sequence 3
FC recon PSNR (dB)     32.98        34.92        36.69
MFC recon PSNR (dB)    37.54        39.98        41.24
delta PSNR (dB)        4.56         5.06         4.55

Figure 10. Spectrum comparison of FC and MFC: (a) original image spectrum, (b) FC-SbS, (c) MFC

After compression, we investigate the performance for the different bitrate overhead cases. The results are reported in Table 4 and Figure 11. In Figure 11, the x-axis is the additional bitrate used to code the enhancement layer and the y-axis is the PSNR improvement over the frame compatible base layer. An interesting observation is that when the overhead is larger than 20%, the PSNR improvement appears to “saturate”, while below 15% the PSNR improvement increases almost linearly. This suggests that: (1) there is little performance to be gained by increasing the MFC enhancement layer bitrate above 25%; (2) for commercial 3D sports base layer interlaced bitrates, an enhancement layer overhead of 15% to 20% provides the most cost-effective combination.

Figure 11. PSNR improvement by MFC vs. FC with different additional bitrates for EL coding

Table 4. Experimental results

sequence 1:  EL bitrate / BL bitrate (%)   11%    15%    19%    24%
             delta PSNR (dB)              0.50   0.73   0.74   0.77
sequence 2:  EL bitrate / BL bitrate (%)   10%    15%    21%    29%
             delta PSNR (dB)              0.77   1.02   1.04   1.06
sequence 3:  EL bitrate / BL bitrate (%)   13%    19%    27%    37%
             delta PSNR (dB)              0.52   0.72   0.75   0.77

4. CONCLUSIONS

The MFC standard, based on the OM-FCFR technology, became Amendment 5 of H.264/AVC and is currently under DAM ballot. MFC improves the spatial resolution on top of the frame compatible video signal and provides backward compatibility with the existing frame compatible stereo 3D video delivery infrastructure. For the deployment of MFC for 3D broadcasting, we suggest an enhancement layer overhead of 15% to 20% as the practical operating point, which also meets the DVB commercial requirements criteria defined by broadcasters.

5. ACKNOWLEDGEMENT

The authors would like to acknowledge the following people who have contributed to the development of MFC: Alexis Michael Tourapis, Peshala Pahalawatta, Athanasios Leontaris, Yuwen He, Yan Ye, Kevin Stec, Prasad Balasubramanian, Ronald Mendoza, Goldy Kathuria, Gregory Buschek, Samir Hulyalkar, James Crenshaw, Klaas Schueuer, and Ludovic Noblet.

REFERENCES

[1] Wiegand, T., Sullivan, G. J., Bjontegaard, G., and Luthra, A., “Overview of the H.264/AVC Video Coding Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 645-656 (2003).
[2] Vetro, A., “Frame compatible formats for 3D video distribution,” in Proceedings of IEEE ICIP (2010).
[3] Vetro, A., Tourapis, A., Muller, A. K., and Chen, T., “3D-TV Content Storage and Transmission,” IEEE Transactions on Broadcasting, vol. 57, no. 2 (2011).
[4] ITU-T and ISO/IEC JTC 1, “Advanced Video Coding for generic audio-visual services,” ITU-T Rec. H.264 and ISO/IEC 14496-10 (2003).
[5] DVB, “Commercial Requirements for DVB 3DTV,” 100th MPEG Meeting, Doc. m25200, Geneva, CH, May (2012).
[6] ISO/IEC JTC 1/SC 29/WG 11, “Call for Proposals on MFC,” 101st MPEG Meeting, Doc. w12961, Stockholm, SE, July (2012).
[7] Yin, P., Wang, Y. K., Sullivan, G. J., “Draft 3 of Multi-resolution Frame Compatible Stereo Coding,” 4th JCT3V Meeting, Doc. JCT3V-D1007/MPEG N13558, Incheon, KR, Apr. (2013).
[8] Lu, T., Ganapathy, H., Yin, P., Chen, T., Husak, W., “Study of the Carrier Signal in Multi-resolution Frame Compatible (MFC) extension of AVC,” 4th JCT3V Meeting, Doc. JCT3V-D0097, Incheon, KR, Apr. (2013).
[9] Lu, T., Ganapathy, H., Lakshminarayanan, G., Chen, T., Husak, W., Yin, P., “Orthogonal Muxing Frame Compatible Full Resolution Technology for Multi-resolution Frame Compatible Stereo Coding,” 2013 IEEE International Conference on Multimedia and Expo (ICME 2013), San Jose, US, Jul. (2013).
