
3DoF+ 360 Video Location-Based Asymmetric Down-Sampling for View Synthesis to Immersive VR Video Streaming

JongBeom Jeong, Dongmin Jang, Jangwoo Son and Eun-Seok Ryu *

Department of Computer Engineering, Gachon University, Seongnam 13120, Korea; [email protected] (J.J.); [email protected] (D.J.); [email protected] (J.S.)
* Correspondence: [email protected]; Tel.: +82-10-4893-2199

Received: 27 August 2018; Accepted: 14 September 2018; Published: 18 September 2018

Sensors 2018, 18, 3148; doi:10.3390/s18093148

 

Abstract: Recently, with the increasing demand for virtual reality (VR), experiencing immersive content with VR has become easier. However, a tremendous amount of computation and bandwidth is required when processing 360 videos. Moreover, additional information such as the depth of the video is required to enjoy stereoscopic 360 content. Therefore, this paper proposes an efficient method of streaming high-quality 360 videos. To reduce the bandwidth when streaming and synthesizing 3DoF+ 360 videos, which support limited movements of the user, a proper down-sampling ratio and quantization parameter are derived from the analysis of the graph between bitrate and peak signal-to-noise ratio. High-efficiency video coding (HEVC) is used to encode and decode the 360 videos, and the view synthesizer produces the intermediate-view video, providing the user with an immersive experience.

Keywords: virtual reality; 3DoF+; HEVC; view synthesis; VSRS; multi-view video coding

1. Introduction

As the virtual reality (VR) market is expanding rapidly, the need for efficient immersive VR technology has become more important. To play high-quality VR video through a head-mounted display (HMD), the minimum resolution of the video must be 4K. In this case, the amount of data to be processed by the HMD increases rapidly. Therefore, the Moving Picture Experts Group (MPEG) proposed a technology that processes the viewport of the user, called motion-constrained tile set (MCTS) [1], in 2016; further, a paper describing an MCTS implementation for VR streaming was submitted [2]. Moreover, to provide the user with high-quality 360 videos, region-wise packing [3] was proposed. It encodes a region of interest (ROI) with high quality and the other regions with low quality.

To support immersive media, the MPEG-I group, established by MPEG, divided the standardization associated with VR into three phases, namely three degrees of freedom (3DoF), three degrees of freedom plus (3DoF+), and six degrees of freedom (6DoF) [4]. In 3DoF+ and 6DoF, multi-view 360 videos are required, and they comprise texture and depth images to support 3D video [5]. Because both phases provide 360 videos in response to a user's movement, it is inevitable to synthesize intermediate views from the existing views. View Synthesis Reference Software (VSRS) for 360 videos [6], Reference View Synthesizer (RVS) [7], and weighted-to-spherically-uniform peak signal-to-noise ratio (WS-PSNR) for 360 video quality evaluation [8] were proposed to MPEG to create virtual views and evaluate them.

A large amount of bandwidth is required for transmitting 3DoF+ or 6DoF 360 videos because such videos need both high-resolution texture and depth. To overcome this problem, down-sampling or region-wise packing could be applied.



In this paper, we propose the View Location-based Asymmetric Down-sampling for View Synthesis (VLADVS) concept, a bitrate-reducing system with an appropriate down-sampling ratio and quantization parameter for 3DoF+ texture and depth in view synthesis, as shown in Figure 1. It introduces a pilot test with Super Multiview Video (SMV) [9] and 3DoF+ test sequences. Finally, it provides the rate-distortion (RD) curves of bitrate and WS-PSNR obtained from the 3DoF+ test sequences using 360lib with HEVC.

Figure 1. Viewport-dependent immersive VR service with VLADVS.

This paper is organized as follows: Section 2 introduces related work such as the MPEG 360 video standard, multi-view video coding, and view synthesis. Section 3 explains the overall experiment, including view synthesis with free viewpoint television (FTV) test sequences and 3DoF+ video test sequences. Section 4 summarizes the results of the experiment for the proposed system. Lastly, Section 5 presents our conclusions and future work.

2. Related Work

2. Related Work 2.1. 360 Video Standard in MPEG 2.1. 360 Video Standard in MPEG During the Standard 116th MPEG 2.1. 360 Video in MPEGmeeting, the MPEG-I group was established for the support of During the 116th MPEG meeting, the MPEG-I group was established for the support of immersive immersive media. They began work by standardizing the format of established immersive,for omnidirectional video During 116th meeting, the MPEG-I was omnidirectional the support of media. They the began workMPEG by standardizing the format of group immersive, video in 2017 [10]. in 2017 [10]. Figure 2 shows the standardization roadmap of MPEG. MPEG-I group divided immersive media. They began work by standardizing the format immersive, omnidirectional video the Figure 2 shows the standardization roadmap of MPEG. MPEG-Iofgroup divided the standardization standardization into three phases [11]. Phase 1a aims to provide 360 MPEG-I video and contents including in 2017 [10]. Figure shows standardization MPEG. group divided the into three phases [11].2 Phase 1athe aims to provide 360roadmap video andofcontents including stitching, projection, stitching, projection, and video encoding. standardization into three phases [11]. Phase 1a aims to provide 360 video and contents including and video encoding. stitching, projection, and video encoding.

Figure 2. MPEG standardization roadmap.


Figure 3 introduces the 3DoF, 3DoF+, and 6DoF viewing angles and degrees of freedom. If a user watches a stereoscopic video, the movement of the user is defined along three directions, namely yaw, pitch, and roll. However, in the 3DoF video, the things behind the objects cannot be represented, indicating the limited experience of VR.

Figure 3. Viewing angle and the degree of freedom: (a) 3DoF; (b) 3DoF+; and (c) 6DoF.

To overcome the limitations of 3DoF, the concept of 3DoF+, part of phase 1b in MPEG-I, was proposed. 3DoF+ provides limited movements of yaw, pitch, and roll, as described in Figure 4. Thus, it provides a more immersive experience than 3DoF.

Figure 4. Anchor generation structure of 3DoF+.

In3DoF+, 3DoF+,the theVR VRdevice devicemust mustoffer offerthe thevideo videoof ofview viewthat thatthe theuser userwatches. watches.IfIfthis thisvideo videoof ofview view In Inincluded 3DoF+, the VR must offer the video of view that the the userview watches. If this video view not included inthe thedevice original video, 3DoF+ system synthesizes the view that did did not existof before. isisnot in original video, 3DoF+ system synthesizes that not exist before. isThus, not Reference included the originalView video, 3DoF+ system synthesizes the view did notvirtual exist before. Referencein Intermediate View Synthesizer [12]isis required. Further, Further, tothat synthesize virtual views, Thus, Intermediate Synthesizer [12] required. to synthesize views, Thus, Reference Intermediate View Synthesizer [12] is required. Further, to synthesize virtual views, additionaldepth depthinformation, information,such suchas asdistances distancesbetween betweencamera cameraand andobjects, objects,must mustbe besupplied. supplied. Asitit additional As additional information, distances between camera must be supplied. As it requiresa large adepth large amount ofsuch data to be transmitted, optimization for data transmission and requires amount of data to beas transmitted, optimization forand dataobjects, transmission and compression requires a large amount of data to be transmitted, optimization for data transmission and compression must be proposed. must be proposed. compression must betoproposed. Asthe thesolutions solutions tothe theabovementioned abovementioned problems, enhanced communication technologies such As problems, enhanced communication technologies such as As the solutions to the abovementioned problems, enhanced communication technologies such as 5G mobile technology [13] and mobile data offloading [14] have been announced recently. 5G mobile technology [13] and mobile data offloading [14] have been announced recently. Moreover, as 5G mobile [13] by and offloading [14] have been in recently. Moreover, the amount ofused resources used bydata the video transmission system isannounced limited a mobile the amount of technology resources themobile video transmission system is limited a mobileinplatform. Moreover, the amount of is resources video transmission system isusing limited in aadaptive mobile platform. Since the limited resource isused a weakness to thedevice, mobile device, some solutions using Since the limited resource a weakness toby thethe mobile some solutions adaptive video platform. Since the limited resource is a weakness to the mobile device, some solutions using adaptive video transmission system [15] or interactive media [16] wereConsidering proposed. Considering transmission system [15] or interactive media system [16]system were proposed. the structure the of video transmission [15] or interactive media systemprocessing [16] were [17,18] proposed. structure of CPU insystem a mobile device, asymmetric multicore wasConsidering proposed to the use structure of CPU in a mobile device, scalable asymmetric multicore processing [17,18]layer was video proposed to use its resource efficiently. Furthermore, video coding [19,20] or multiple coding [21] its resource efficiently. Furthermore, scalable video coding [19,20] or multiple layer video coding [21] can be applied as the 3DoF+ video contains multiple videos. can be applied as the 3DoF+ video contains multiple videos.


View synthesis assumes video transmission from the server to the client. Therefore, the video must be compressed, as shown in Figure 4. The anchor views used in view synthesis should be encoded and decoded. Subsequently, phase 2 of MPEG-I deals with 6DoF, which means 3DoF+ with translational movements along the X-, Y-, and Z-axes. It supports the user's movements, including walking, as described in Figure 3.

2.2. Multi-View Video Coding

Multi-view video provides the user with an immersive 3D experience. Such video provides diverse views gained from one scene simultaneously. In particular, 3D multi-view video includes both texture and depth information, enabling users to have multiple views of what they intend to watch. MPEG defined a 3D video system [22], which is part of FTV, and it covers multi-view video acquisition, encoding, transmission, decoding, and display. To process multi-view video efficiently, multi-view video coding [23,24] is required.

Multi-view videos have common features because they contain the same scene at the same time; the difference between the views is only the point of view. That is, a multi-view video of one viewpoint can be predicted by referencing another view. Figure 5 shows the hierarchical-B-frame multi-view video encoding structure between the primary view and extended views. The blue boxes represent key frames referenced by B frames. The I frame can be reconstructed without a reference, the P frame references one frame, and the B frame references two frames during prediction. The Joint Multi-view Video Model [25], a reference software model for multi-view video coding, was proposed to compress multi-view video while maintaining compatibility with H.264.

Figure 5. Multi-view video encoding view reference structure.

2.3. View Synthesis

Although multi-view video provides some views, it cannot offer out-of-source views. Because multi-view video coding requires a large amount of data and computing power, the number of views the multi-view video can support is limited. Accordingly, view synthesis for multi-view video [26,27] was developed to overcome this limitation of multi-view video coding. When using view synthesis, the server does not need to send all the source views because it synthesizes the dropped views that were not sent. Further, if the video provider did not acquire many source views due to the limitation of resources such as cameras and the amount of data, the other views not offered by the provider can still be synthesized.


Figure 6 illustrates how to synthesize the intermediate views with RVS 1.0.2 [28]. It requires a texture video, a depth map, and camera parameters. The depth map [29,30] represents the distance between the camera and the objects shown in the texture video. The purpose of the depth map is to represent a 3D space, which is also used by haptic systems [31,32]. If the depth map format is 8-bit, the range of the depth value is between 0 and 255. The depth map can be obtained by a depth camera that uses a depth sensor; otherwise, it can be generated by depth estimation software. The MPEG-4 group proposed Depth Estimation Reference Software [33,34] to obtain the depth map from the texture video efficiently.

Figure 6. View synthesis pipeline of RVS.

Generally, camera. ItIt projects projectsthe theactual actualobject object Generally,the themulti-view multi-view video video is is obtained obtained from a pinhole camera. onto projection is is implemented implementedusing usingaaworld worldcoordinate coordinate ontoaa2D 2Dplane planeimage, image,as as shown shown in in Figure 7. The projection system systemand andcamera cameracoordinate coordinatesystem. system.The Theworld worldcoordinate coordinatesystem systempresents presentsa a3D 3Dspace. space.The Thecamera camerais is located in the world coordinate system, it also a 3D coordinate system. center point of located in the world coordinate system, andand it also hashas a 3D coordinate system. TheThe center point of the the camera represents the location of the camera in the world coordinate system. The camera camera represents the location of the camera in the world coordinate system. The camera coordinate coordinate system hasZ-axes. X-, Y-,The andX-, Z-axes. The X-, Y-, and Z-axes represent the horizontal, vertical, system has X-, Y-, and Y-, and Z-axes represent the horizontal, vertical, and optical axis and called opticalprincipal axis (alsoaxis), called principal axis), axis thecamera direction theprincipal camera (also respectively. The respectively. optical axis isThe the optical direction ofisthe ray.ofThe ray. The principal pointpoint is thebetween intersection between the principal and The the image plane. point is the intersection the point principal axis and the imageaxis plane. distance fromThe the distance fromtothe center to thefocal principal is as called focal asEach shown in Figure 8. Eachin camera center thecamera principal is called length, shown in length, Figure 8. point of the object point the object in the onto 3D space projected a 2D image plane by the camera. the 3D of space is projected a 2Disimage planeonto by the camera. To obtain the intermediate view, the point coordinates from reference To obtain the intermediate view, the point coordinates from reference views views must mustbe beconverted converted intothe thesynthesized synthesized view. view. Each Each reference reference view, into view, which which is is used used to to synthesize synthesizethe theintermediate intermediateview, view, has its own camera coordinate system. If we realize the camera parameter of reference views has its own camera coordinate system. If we realize the camera parameter of reference viewsand and intermediateview, view,the thecamera camera coordinate coordinate system system of of intermediate intermediate view intermediate view can can be be generated generatedusing usingthe the worldcoordinate coordinatesystem. system.Once Once the the conversion conversion is is complete, world complete, texture texture mapping mappingfrom fromthe thereference referenceviews views intermediateview viewisisperformed. performed. totointermediate


Figure 7. Image projection in a pinhole camera.

Figure 8. Image plane in camera coordinate system.


3. View Location-Based Asymmetric Down-Sampling for View Synthesis


This section explains VLADVS for efficient use of bandwidth in video transmission, as described in Figure 9. It allocates a down-sampling ratio to each video based on the distance between the input video and the video that needs to be synthesized. If the input video is close to the synthesized video, the proposed system assigns a low down-sampling ratio, because a video near the synthesized video has a great influence on the quality of the synthesized video. Section 3.1 explains view synthesis with FTV multi-view test sequences to decide the down-sampling ratio. Section 3.2 presents the results of source view synthesis with 3DoF+ video test sequences, which shows the impact of the number of input views and the correlation between the down-sampling ratios of texture and depth in view synthesis. Finally, Section 3.3 proposes the intermediate view synthesis method and conditions for 3DoF+ video.
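A minimal sketch of this distance-based allocation follows (an illustrative reconstruction, not the authors' implementation; the function name and the simple rank-based rule are assumptions, and the ratio set is the one later used for ClassroomVideo in Table 5):

```python
def assign_downsampling_ratios(source_positions, target_position, ratios):
    """Assign a down-sampling ratio to each source view: views closer to the
    view to be synthesized receive a lower (lighter) down-sampling ratio."""
    # Sort source views from nearest to farthest with respect to the target view
    order = sorted(range(len(source_positions)),
                   key=lambda i: abs(source_positions[i] - target_position))
    assignment = {}
    for rank, view_idx in enumerate(order):
        # Spread the available ratios over the views in order of distance
        ratio_idx = min(rank * len(ratios) // len(source_positions), len(ratios) - 1)
        assignment[view_idx] = ratios[ratio_idx]
    return assignment

# Example with 1D camera positions and the ClassroomVideo ratio set of Table 5
print(assign_downsampling_ratios(source_positions=[0.0, 1.0, 2.0, 3.0],
                                 target_position=0.5,
                                 ratios=[0.0, 0.125, 0.25, 0.375, 0.5]))
# -> {0: 0.0, 1: 0.125, 2: 0.25, 3: 0.375}; the nearest view keeps full resolution
```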



Figure 9. View Location-based Asymmetric Down-sampling for View Synthesis. Rec stands for reconstructed, and Syn stands for synthesized.

3.1. View Synthesis with FTV Multi-View Test Sequences

To reduce the bitrate when transmitting multi-view video, this paper proposes a low-complexity multi-view video transmission system including down-sampling and up-sampling. The feasibility of this method was proved by a pilot test with FTV multi-view sequences [35].

The Champagne_Tower and Pantomime sequences, as shown in Figure 10, were used. The resolution and number of frames of the Champagne_Tower and Pantomime sequences are 1280 × 960 (acquired from 80 cameras) and 300, respectively.


Figure 10. FTV test sequences from Nagoya University. (a) Champagne_tower (1280 × 960), obtained from 80 cameras with stereo distance, consists of 300 frames at 30 fps; (b) Pantomime (1280 × 960), obtained from 80 cameras with stereo distance, consists of 300 frames at 30 fps; (c) depth map of Champagne_tower; (d) depth map of Pantomime.



Figure 11 introduces the proposed system architecture with FTV multi-view test sequences. First, it selects the anchor views, i.e., the source views used to synthesize the intermediate view. The test sequences provide the depth maps of views 37, 39, and 41, i.e., the anchor views, which require both texture and depth. The combinations of view synthesis are presented in Table 1. Second, it down-samples the selected anchor views. The down-sampling ratios are 0, 20, 40, 50, and 75%, as shown in Table 2. For down-sampling and up-sampling, the DownConvertStatic executable in the Joint Scalable Video Model (JSVM) [36] was used. Third, it encodes and decodes the down-sampled views. For encoding and decoding, HEVC reference software (HM) version 16.16 [37] was used. VSRS 4.2 [38] was used to synthesize the intermediate view. Fourth, it up-samples the decoded views. Fifth, it synthesizes the intermediate view by referencing the up-sampled anchor views. Finally, it measures the PSNR between the original intermediate views and the synthesized views for objective quality evaluation. For PSNR measurement, the PSNR tool of JSVM was used.

Figure 11. Proposed system architecture with FTV test sequences.
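The last step of the pipeline in Figure 11 compares the original and synthesized intermediate views. The experiment used the JSVM PSNR tool; a minimal numpy equivalent for one 8-bit luma plane of a raw YUV 4:2:0 file could look like the following sketch (the file names are placeholders):

```python
import numpy as np

def read_y_plane(path, width, height, frame_idx=0):
    """Read the Y plane of one frame from a raw 8-bit YUV 4:2:0 file."""
    frame_size = width * height * 3 // 2              # Y + U + V samples per frame
    with open(path, 'rb') as f:
        f.seek(frame_idx * frame_size)
        y = np.frombuffer(f.read(width * height), dtype=np.uint8)
    return y.reshape(height, width)

def psnr(original, distorted, max_value=255.0):
    """Peak signal-to-noise ratio between two planes of the same size."""
    mse = np.mean((original.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_value ** 2 / mse)

# Placeholder file names: original view 38 and the view synthesized by VSRS
orig = read_y_plane('view38_original.yuv', 1280, 960)
synth = read_y_plane('view38_synthesized.yuv', 1280, 960)
print(f'PSNR_Y = {psnr(orig, synth):.2f} dB')
```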

Table 1. Combinations of view synthesis.

No. of Left View   No. of Synthesized View   No. of Right View
37                 38                        39
37                 39                        41
39                 40                        41

Table 2. Combination and resolution of down-sampling ratios.

Down-Sampling Ratio (%)   Champagne_tower   Pantomime
0                         1280 × 960        1280 × 960
20                        1024 × 768        1024 × 768
40                        768 × 576         768 × 576
50                        640 × 480         640 × 480
75                        320 × 240         320 × 240

For encoding, the quantization parameter (QP) values are 22, 27, 32, and 37. In the pilot test with the FTV multi-view sequences, the experiment was executed for every combination of down-sampling ratio, QP, and view synthesis. The pilot test results are shown in Figures 12 and 13. The figures show the RD-curves between PSNR and average bitrate with different QPs. The graph shows only the combinations from 0-0 to 20-40 because it includes only the combinations whose difference value from the original view combination (0-0) is under 1. Even though the average down-sampling ratio of the combination 0-40 (left view-right view) is equal to 20-20, the PSNR value of 20-20 was higher than 0-40. Moreover, the average bitrate of 20-20 was smaller than 0-40. Figure 12 implies that the PSNR of a uniform down-sampling ratio assignment for the left and right views is higher than that of a non-uniform assignment. Furthermore, the performance of 20-40 was better than 0-50 because the down-sampling ratio difference between the left and right views of 20-40 was lower than that of 0-50, even though the average down-sampling ratio of 20-40 was greater than 0-50.
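For reference, the full factorial design of this pilot test can be enumerated in a few lines (an illustrative sketch; the actual down-sampling, encoding, and synthesis commands are omitted):

```python
from itertools import product

down_sampling_ratios = [0, 20, 40, 50, 75]                      # Table 2
qps = [22, 27, 32, 37]
synthesis_cases = [(37, 38, 39), (37, 39, 41), (39, 40, 41)]    # Table 1 (left, synth, right)

experiments = [
    {'left': left, 'synth': synth, 'right': right, 'ldr': ldr, 'rdr': rdr, 'qp': qp}
    for (left, synth, right), ldr, rdr, qp in product(synthesis_cases,
                                                      down_sampling_ratios,
                                                      down_sampling_ratios, qps)
]
print(len(experiments))   # 3 synthesis cases x 5 LDR x 5 RDR x 4 QPs = 300 runs
```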



Figure 12. RD-curve between PSNR and average bitrate with different QPs. LDR stands for Left view Down-sampling Ratio, and RDR stands for Right view Down-sampling Ratio. (a) RD-curve with QP = 22; (b) RD-curve with QP = 27; (c) RD-curve with QP = 32; (d) RD-curve with QP = 37.

Figure 13. RD-curve between PSNR and average bitrate with different down-sampling ratio combinations.

Figure 13 shows the RD-curve between PSNR and average bitrate with different down-sampling ratio combinations. In the case of 20-20, the PSNR difference between QP = 27 and QP = 22 is 0.17, which is very low, whereas the bitrate difference is 862.6038, which is very high.


3.2. Source View Synthesis with 3DoF+ Test Sequences

For the 3DoF+ experiment, MPEG provides ClassroomVideo [39], TechnicolorMuseum, and TechnicolorHijack as test sequences, which are illustrated in Figure 14. The pilot test was conducted on ClassroomVideo. To verify whether the number of input views influences the quality of view synthesis, RVS set v0, v11, and v14 as source views, which are not encoded, and v13 as the intermediate view. Figure 15 shows the pilot test of ClassroomVideo for subjective quality evaluation. As the number of input views increased, the overlapped regions of the synthesized views decreased. That is, the subjective quality increases when RVS receives several input views. However, the texture quality of the synthesized view decreased when the number of input views increased.


Figure 14. 3DoF+ test sequences. (a) ClassroomVideo (4096 × 2048), 360° × 180° FOV ERP, consists of 15 source views, has 120 frames, 30 fps; (b) TechnicolorMuseum (2048 × 2048), 180° × 180° FOV ERP, consists of 24 source views, has 300 frames, 30 fps; (c) TechnicolorHijack (4096 × 4096), 180° × 180° FOV ERP, consists of 10 source views, has 300 frames, 30 fps; (d) depth map of ClassroomVideo; (e) depth map of TechnicolorMuseum; (f) depth map of TechnicolorHijack.


Figure 15. View synthesis of ClassroomVideo for subjective quality evaluation. (a) View v13 synthesized from view v0; (b) view v13 synthesized from view v0, v11; (c) view v13 synthesized from view v0, v11, v14.


In another experiment, view v0 was defined as the synthesized view; v1, v2, v3, v4, v5, and v6 were called near views; and v9, v10, v11, v12, v13, and v14 were called far views, as shown in Figure 16. The distances between the synthesized view and the near and far views are the same. For objective quality evaluation, the WS-PSNR tool [40] was used.

Table 3 shows the WS-PSNR of synthesized source views of ClassroomVideo. The WS-PSNR value of (6) was higher than that of (1), although (6) has fewer views. Adding more views, which are down-sampled, is not appropriate for the quality of the synthesized view. If the input views are closer to the synthesized view, the PSNR value is higher, as can be seen by comparing (1) and (3). Interestingly, the PSNR value of (1) was higher than (2) although the depth maps of (2) were not down-sampled. This implies that both the texture and the depth should be down-sampled with the same ratio.

Figure 16. ClassroomVideo viewpoint definition for objective quality evaluation.

Table 3. PSNR of synthesized views for ClassroomVideo.

Input Views (Down-Sampling Ratio: 50%)   WS-PSNR_Y (dB)   WS-PSNR_U (dB)   WS-PSNR_V (dB)
(1) nearOrg + farDown                    31.83            48.90            51.50
(2) nearOrg + farTextureDown             31.49            47.84            50.67
(3) nearDown + farOrg                    31.41            48.56            51.16
(4) nearTextureDown + farOrg             31.44            43.74            50.58
(5) nearOrg + farOrg                     32.73            49.91            52.49
(6) nearOrg                              31.83            48.91            51.50
(7) farOrg                               31.43            48.56            51.16
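WS-PSNR [40] weights each pixel's squared error by the spherical area that the pixel covers; for ERP content the weight depends only on the image row. The sketch below re-implements that standard formula for an 8-bit luma plane (for illustration only; it is not the MPEG reference tool used to produce Table 3):

```python
import numpy as np

def ws_psnr_erp(original, synthesized, max_value=255.0):
    """Weighted-to-spherically-uniform PSNR for an equirectangular (ERP) plane."""
    height, width = original.shape
    rows = np.arange(height)
    # Per-row weight: cosine of the latitude represented by each ERP row
    row_weights = np.cos((rows + 0.5 - height / 2.0) * np.pi / height)
    weights = np.repeat(row_weights[:, None], width, axis=1)
    err2 = (original.astype(np.float64) - synthesized.astype(np.float64)) ** 2
    wmse = np.sum(weights * err2) / np.sum(weights)
    return float('inf') if wmse == 0 else 10.0 * np.log10(max_value ** 2 / wmse)
```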

3.3. Intermediate View Synthesis with 3DoF+ Test Sequences

In Section 3.2, the source view synthesis with 3DoF+ test sequences was introduced. Because the 3DoF+ common test condition (CTC) requires the ability to synthesize intermediate views, which do not exist in the source views, this section introduces the view synthesis of the intermediate view. The proposed system architecture, VLADVS, includes anchor view selection, down-sampling ratio combination selection, down-sampling, encoding, decoding, up-sampling, view synthesis, and measuring WS-PSNR, as described in Figure 17.


Figure 17. Proposed system architecture for intermediate view synthesis of 3DoF+ video.

In the CTC, the QPs used for texture and depth are shown in Table 4. The difference between the texture and depth QP is 5, which was decided by an experiment [41]. Table 5 shows the resolutions of the down-sampling ratios for ClassroomVideo. Down-sampling is applied to both texture and depth. 360ConvertStatic of 360lib 5.1 was used for down-sampling. Table 6 shows the anchor-coded views per class for ClassroomVideo. Class A1 uses all views for view synthesis, whereas classes A2 and A3 use a subset of the views. To reduce the view synthesis runtime, frame ranges for view synthesis were set in the CTC, as shown in Table 7. Because the proposals for 3DoF+ are required to generate ERP video for all intermediate view positions, the experiment was designed to synthesize the intermediate views using the A1, A2, and A3 class views. Figure 18 shows the positions of the source and intermediate views.

The goal of this experiment is to reduce the bitrate while conserving the PSNR. Modifying parameters such as the down-sampling ratio, QP, and the number of input views to optimize them is included in the experiment, which is explained in Section 4.

Table 4. QPs used for texture and depth.

QP        R1   R2   R3   R4
Texture   22   27   32   37
Depth     17   22   27   32

Table 5. Resolution for each down-sampling ratio.

Down-Sampling Ratio   0%            12.5%         25%           37.5%         50%
ClassroomVideo        4096 × 2048   3584 × 1792   3072 × 1536   2560 × 1280   2048 × 1024

Table 6. Anchor-coded views per class.

Test Class   Sequence Name    No. of Source Views   No. of Anchor-Coded Views   Anchor-Coded Views
A1           ClassroomVideo   15                    15                          All
A2           ClassroomVideo   15                    9                           v0, v7, ..., v14
A3           ClassroomVideo   15                    1                           v0


Table 7. 3DoF+ test sequence view synthesis frame range.

Test Class   Sequence Name    Frames
A1           ClassroomVideo   89–120
A2           ClassroomVideo   89–120
A3           ClassroomVideo   89–120

Figure 18. Source and intermediate view positions. (a) Source views; (b) intermediate views.

4. Experimental Results

In Section 3.3, the intermediate view synthesis was introduced. As described in Section 2.3, RVS was used for view synthesis. In addition, the tool used for down-sampling and up-sampling is 360Convert in 360lib 5.1, and for HEVC encoding and decoding, the HM 16.16 encoder and decoder were used. The version of RVS used is 1.0.2 with OpenCV 3.4.1, and the server used for the experiment has two Intel Xeon E5-2687W v4 CPUs and 128 GB of memory.

Table 8 shows the summary of WS-PSNR_Y with different down-sampling ratios for regular outputs and masked outputs in synthesizing the intermediate views. It contains the WS-PSNR_Y values of the synthesized intermediate views. The results of the regular outputs were better than the masked outputs. Further, class A2 and class A3, which discarded some source views, showed low WS-PSNR. For down-sampling the anchor views, a ratio of 12.5% is reasonable. Table 9 contains the WS-PSNR_Y of synthesized views for different QPs with the A1 class. It shows that the difference in WS-PSNR_Y between R1 and R2 is not high.

Figure 19 depicts the RD-curve between WS-PSNR_Y and bitrate of A1 with 12.5%, 25%, 37.5%, and 50% down-sampling ratios. The values on the X-axis correspond to the QPs R1–R4. R2 can be used instead of R1 because the gap between R1 and R2 was not high. With the QP of R2 and a 12.5% down-sampling ratio, it saved approximately 87.81% bitrate while losing only 8% WS-PSNR, compared to the result of R1 and a 0% down-sampling ratio.
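The percentage figures quoted above are simple relative differences. Using the WS-PSNR_Y values of Table 9 for (R1, 0%) and (R2, 12.5%) reproduces the quality-loss figure; the 87.81% bitrate saving is obtained with the same formula from the measured bitstream bitrates, which are plotted in Figure 19 rather than tabulated here (illustrative sketch):

```python
def percent_reduction(reference, test):
    """Relative reduction of `test` with respect to `reference`, in percent."""
    return (reference - test) / reference * 100.0

# WS-PSNR_Y from Table 9: R1 at 0% down-sampling vs. R2 at 12.5% down-sampling
print(f'{percent_reduction(42.08, 38.62):.2f}%')   # ~8.2% quality loss

# The bitrate saving is computed the same way from the two measured bitrates,
# e.g. percent_reduction(bitrate_r1_0, bitrate_r2_125), which gave ~87.81% here.
```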


Table 8. WS-PSNR_Y of synthesized views for different down-sampling ratios.

WS-PSNR_Y (dB)

ClassroomVideo              Regular Output           Masked Output
Down-Sampling Ratio (%)     A1      A2      A3       A1      A2      A3
0                           39.46   38.32   29.16    39.34   37.35   26.70
12.5                        37.92   37.25   29.10    37.78   36.54   26.86
25                          37.33   36.71   29.03    37.21   36.11   26.84
37.5                        36.55   36.00   28.90    36.45   35.51   26.83
50                          35.30   34.85   28.70    35.22   34.49   26.86

Table 9. WS-PSNR_Y of synthesized views for different QPs with A1 class.

WS-PSNR_Y (dB)              A1 (ClassroomVideo)
QP \ DR 1                   0%      12.5%   25%     37.5%   50%
R1                          42.08   39.31   38.58   37.59   36.06
R2                          40.39   38.62   37.99   37.12   35.74
R3                          38.62   37.58   37.04   36.34   35.17
R4                          36.77   36.16   35.73   35.16   34.23

1 DR represents down-sampling ratio.


Figure 19. RD-curve between WS-PSNR_Y and bitrate of A1. (a) RD-curve with 12.5% down-sampling; (b) RD-curve with 25% down-sampling; (c) RD-curve with 37.5% down-sampling; and (d) RD-curve with 50% down-sampling.


In addition, an experiment with two down-sampling ratios was conducted. After sorting the source views by the distance between the source views and the intermediate view, the experiment assigned two down-sampling ratios to the source views. If the source views are close to the intermediate view, they receive the lower down-sampling ratio. To decide the combinations of two down-sampling ratios, the following formula is used:

C(n, r) = n! / (r! (n − r)!)    (1)

Here, n is the number of available down-sampling ratios, and r is the number of down-sampling ratios to assign. With the five ratios of Table 5 and r = 2, Equation (1) yields C(5, 2) = 10 combinations. Table 10 shows the combinations of two down-sampling ratios deduced by Equation (1).

Table 10. Combination of two down-sampling ratios (DR: down-sampling ratio).

Down-Sampling Combination   DR 1 (%)   DR 2 (%)
D1                          0          12.5
D2                          0          25
D3                          0          37.5
D4                          0          50
D5                          12.5       25
D6                          12.5       37.5
D7                          12.5       50
D8                          25         37.5
D9                          25         50
D10                         37.5       50

To decide how many source views are assigned DR1 and DR2, the following equations are used:

n(DR1) = ⌈ n(source views) / 2 ⌉    (2)

n(DR2) = n(source views) − n(DR1)    (3)

Equation (2) explains how to calculate the number of views assigned DR1: the number of source views is divided by 2 (the number of down-sampling ratios to assign) and the result is rounded up. The number of views assigned DR2 is the difference between the number of source views and the number of DR1 views, as shown in Equation (3).

Figure 20 represents the RD-curve between WS-PSNR_Y and bitrate of A1 with D1–D10. In Section 3.1, uniform down-sampling ratio assignment showed a better PSNR value than non-uniform assignment. Likewise, although the average down-sampling ratios of Figures 20d and 19b are the same, the WS-PSNR value of the latter is better. This implies that uniform down-sampling is an advantage for view synthesis. In Section 3.3, down-sampling the source views far from the intermediate view was better in WS-PSNR value than down-sampling the views near the intermediate view. Equally, the WS-PSNR value of down-sampling only the far views, as described in Figure 20a, is higher than that of Figure 19a. Although the former requires more bitrate than the latter, the difference is 23,371 Kbps when the QP is R2, which is not greatly high. This implies that down-sampling the views far from the intermediate views can be a method for saving bitrate while preserving the WS-PSNR value.
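Equations (1)–(3) can be checked in a few lines: with the five ratios of Table 5 and r = 2, Equation (1) yields exactly the ten pairs D1–D10 of Table 10, and Equations (2) and (3) split the distance-sorted source views between DR1 and DR2 (illustrative sketch; the view labels in the example are hypothetical):

```python
import math
from itertools import combinations

ratios = [0, 12.5, 25, 37.5, 50]                  # down-sampling ratios of Table 5
pairs = list(combinations(ratios, 2))             # Equation (1): C(5, 2) = 10 pairs
print(len(pairs), pairs[0], pairs[-1])            # 10 (0, 12.5) (37.5, 50) -> D1..D10

def split_views(views_sorted_by_distance):
    """Assign DR1 to the nearer half of the views (Equation (2), rounded up)
    and DR2 to the remaining views (Equation (3))."""
    n_dr1 = math.ceil(len(views_sorted_by_distance) / 2)
    return views_sorted_by_distance[:n_dr1], views_sorted_by_distance[n_dr1:]

dr1_views, dr2_views = split_views(['v7', 'v1', 'v2', 'v9', 'v13'])  # hypothetical order
print(dr1_views, dr2_views)   # ['v7', 'v1', 'v2'] ['v9', 'v13']
```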


Figure 20. RD-curve between WS-PSNR_Y and bitrate of A1. (a) RD-curve with D1; (b) RD-curve with D2; (c) RD-curve with D3; (d) RD-curve with D4; (e) RD-curve with D5; (f) RD-curve with D6; (g) RD-curve with D7; (h) RD-curve with D8; (i) RD-curve with D9; and (j) RD-curve with D10.

5. Conclusions

This paper proposes a bitrate-reducing method for 3DoF+ video synthesis and transmission. In particular, by down-sampling and up-sampling the texture and depth, the proposed method saves bitrate in the bitstream file while degrading the objective video quality very little in terms of WS-PSNR. In addition, down-sampling the far views brings a higher WS-PSNR value than down-sampling all the source views. However, because the number of parameters in the experiment was not sufficient to deduce the optimal parameters for view synthesis, experiments using video compression methods such as region-wise packing [42] must be conducted to further reduce the bitrate for immersive 360 VR video streaming. Furthermore, intensive experiments should be carried out to derive an equation that relates the distances between the source views and the intermediate views to the down-sampling ratios.

Author Contributions: Conceptualization and Data curation and Writing—original draft, J.J.; Data curation and Investigation, D.J. and J.S.; Project administration and Supervision and Writing—review & editing, E.R.

Funding: This work was supported by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00765, Development of Compression and Transmission Technologies for Ultra High-Quality Immersive Videos Supporting 6DoF).

Conflicts of Interest: The authors declare no conflict of interest.


References

1. Wang, Y.K.; Hendry; Karczewicz, M. Viewport Dependent Processing in VR: Partial Video Decoding; Technical Report ISO/IEC JTC1/SC29/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2016; MPEG116/m38559.
2. Son, J.W.; Jang, D.M.; Ryu, E.S. Implementing Motion-Constrained Tile and Viewport Extraction for VR Streaming. In Proceedings of the 28th ACM SIGMM Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV2018), Amsterdam, The Netherlands, 12–15 June 2018.
3. Oh, S.J.; Hwang, S.J. OMAF: Generalized Signaling of Region-Wise Packing for Omnidirectional Video; Technical Report ISO/IEC JTC1/SC29/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2017; MPEG2017/m40423.
4. Jung, J.; Kroon, B.; Doré, R.; Lafruit, G.; Boyce, J. Update on N17618 v2 CTC on 3DoF+ and Windowed 6DoF; Technical Report ISO/IEC JTC1/SC29/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2018; MPEG123/m43571.
5. Tanimoto, M.; Fujii, T. FTV—Free Viewpoint Television; Technical Report ISO/IEC JTC1/SC29/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2002; MPEG2002/m8595.
6. Senoh, T.; Tetsutani, N.; Yasuda, H. MPEG-I Visual: View Synthesis Reference Software (VSRSx); Technical Report ISO/IEC JTC1/SC29/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2018; MPEG2018/m42911.
7. Kroon, B.; Lafruit, G. Reference View Synthesizer (RVS) 2.0 Manual; Technical Report ISO/IEC JTC1/SC29/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2018; MPEG2018/n17759.
8. Sun, Y.; Lu, A.; Yu, L. WS-PSNR for 360 Video Quality Evaluation; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2016; MPEG2016/m38551.
9. Senoh, T.; Wegner, K.; Stankiewicz, O.; Lafruit, G.; Tanimoto, M. FTV Test Material Summary; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2016; MPEG2016/n16521.
10. WG11 (MPEG). MPEG Strategic Standardisation Roadmap; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2016; MPEG2016/n16316.
11. Champel, M.L.; Koenen, R.; Lafruit, G.; Budagavi, M. Working Draft 0.4 of TR: Technical Report on Architectures for Immersive Media; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2017; MPEG2017/n17264.
12. Doré, R.; Fleureau, J.; Chupeau, B.; Briand, G. 3DoF Plus Intermediate View Synthesizer Proposal; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2018; MPEG2018/m42486.
13. Mitra, R.N.; Agrawal, D.P. 5G mobile technology: A survey. ICT Express 2016, 1, 132–137. [CrossRef]
14. Kim, Y.; Lee, J.; Jeong, J.S.; Chong, S. Multi-flow management for mobile data offloading. ICT Express 2016, 3, 33–37. [CrossRef]
15. Kim, H.; Ryu, E.; Jayant, N. Channel-adaptive video transmission using H.264 SVC over mobile WiMAX network. In Proceedings of the 2010 Digest of Technical Papers International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 9–13 January 2010; pp. 441–442.
16. Ryu, E.S.; Yoo, C. An approach to interactive media system for mobile devices. In Proceedings of the 12th Annual ACM International Conference on Multimedia, New York, NY, USA, 10–16 October 2004; pp. 160–161.
17. Roh, H.J.; Han, S.W.; Ryu, E.S. Prediction complexity-based HEVC parallel processing for asymmetric multicores. Multimed. Tools Appl. 2017, 76, 25271–25284. [CrossRef]
18. Yoo, S.; Ryu, E.S. Parallel HEVC decoding with asymmetric mobile multicores. Multimed. Tools Appl. 2017, 76, 17337–17352. [CrossRef]
19. Dong, J.; He, Y.; He, Y.; McClellan, G.; Ryu, E.S.; Xiu, X.; Ye, Y. Description of Scalable Video Coding Technology Proposal by InterDigital Communications; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2012; MPEG2012/m26569.
20. Ryu, E.; Jayant, N. Home gateway for three-screen TV using H.264 SVC and raptor FEC. IEEE Trans. Consum. Electron. 2011, 57, 1652–1660. [CrossRef]
21. Ye, Y.; McClellan, G.W.; He, Y.; Xiu, X.; He, Y.; Dong, J.; Bal, C.; Ryu, E. Codec Architecture for Multiple Layer Video Coding. U.S. Patent No. 9,998,764, 12 June 2018.
22. International Organization for Standardization (ISO); International Electrotechnical Commission (IEC). Introduction to 3D Video. In Proceedings of the 84th SC 29/WG 11 Meeting, Archamps, France, 28 April–2 May 2008; ISO/IEC JTC1/SC29/WG11, MPEG2008/n9784.
23. Merkle, P.; Smolic, A.; Müller, K.; Wiegand, T. Multi-view video plus depth representation and coding. In Proceedings of the 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 16–19 September 2007; pp. 201–204.
24. Müller, K.; Schwarz, H.; Marpe, D.; Bartnik, C.; Bosse, S.; Brust, H.; Hinz, T.; Lakshman, H.; Merkle, P.; Rhee, F.H.; et al. 3D High-Efficiency Video Coding for Multi-View Video and Depth Data. IEEE Trans. Image Process. 2013, 22, 3366–3378. [CrossRef] [PubMed]
25. Vetro, A.; Pandit, P.; Kimata, H.; Smolic, A. Joint Multiview Video Model (JMVM) 7.0; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2008; MPEG2008/n9578.
26. Martinian, E.; Behrens, A.; Xin, J.; Vetro, A. View Synthesis for Multiview Video Compression. In Proceedings of the Picture Coding Symposium, Beijing, China, 24–26 April 2006.
27. Yea, S.; Vetro, A. View synthesis prediction for multiview video coding. Signal Process. Image Commun. 2008, 24, 89–100. [CrossRef]
28. Fachada, S.; Kroon, B.; Bonatto, D.; Sonneveldt, B.; Lafruit, G. Reference View Synthesizer (RVS) 1.0.2 Manual; Technical Report ISO/IEC JTC1/SC29/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2018; MPEG123/m42945.
29. Zhang, J.; Hannuksela, M.M.; Li, H. Joint Multiview Video Plus Depth Coding. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; pp. 2865–2868.
30. Ho, Y.S.; Oh, K.J.; Lee, C.; Lee, S.B.; Na, S.T. Depth Map Generation and Depth Map Coding for MVC; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2007; MPEG2007/m14638.
31. Ryu, Y.; Ryu, E.S. Haptic Telepresence System for Individuals with Visual Impairments. Sens. Mater. 2017, 29, 1061–1067.
32. Park, C.H.; Ryu, E.; Howard, A.M. Telerobotic Haptic Exploration in Art Galleries and Museums for Individuals with Visual Impairments. IEEE Trans. Haptics 2015, 8, 327–338. [CrossRef] [PubMed]
33. Tanimoto, M.; Fujii, T.; Suzuki, K. Depth Estimation Reference Software (DERS) with Image Segmentation and Block Matching; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2009; MPEG2009/m16092.
34. Tanimoto, M.; Fujii, T.; Tehrani, M.P.; Suzuki, K.; Wildeboer, M. Depth Estimation Reference Software (DERS) 3.0; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2009; MPEG2009/m16390.
35. International Organization for Standardization (ISO); International Electrotechnical Commission (IEC). FTV Test Material Summary. In Proceedings of the 116th SC 29/WG 11 Meeting, Chengdu, China, 17–21 October 2016; ISO/IEC JTC1/SC29/WG11, MPEG2016/n16521.
36. Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Joint Scalable Video Model; Joint Video Team: Geneva, Switzerland, 2007; Doc. JVT-X202.
37. Joint Collaborative Team on Video Coding (JCT-VC). HEVC Reference Software Version HM16.16. Available online: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.16 (accessed on 16 August 2018).
38. Senoh, T.; Yamamoto, K.; Tetsutani, N.; Yasuda, H.; Wegner, K. View Synthesis Reference Software (VSRS) 4.2 with Improved Inpainting and Hole Filling; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2017; MPEG2017/m40657.
39. Kroon, B. 3DoF+ Test Sequence ClassroomVideo; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2018; MPEG2018/m42415.
40. International Organization for Standardization (ISO); International Electrotechnical Commission (IEC). ERP WS-PSNR Software Manual. In Proceedings of the 122nd SC 29/WG 11 Meeting, San Diego, CA, USA, 16–20 April 2018; ISO/IEC JTC1/SC29/WG11, w17760.
41. Wang, B.; Sun, Y.; Yu, L. On Depth Delta QP in Common Test Condition of 3DoF+ Video; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2018; MPEG2018/m43801.
42. Oh, S.J.; Lee, J.W. OMAF: Signaling of Projection/Region-wise Packing Information of Omnidirectional Video in ISOBMFF; Technical Report ISO/IEC JTC1/WG11; Moving Picture Experts Group (MPEG): Villar Dora, Italy, 2017; MPEG2018/m39865.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
