View Synthesis with Kinect-Based Tracking for Motion Parallax Depth Cue on a 2D Display

Michal Joachimiak a, Mikolaj Wasielica b, Piotr Skrzypczyński b, Janusz Sobecki c, Moncef Gabbouj a

a Tampere University of Technology, Tampere, Finland
b Poznan University of Technology, Poznan, Poland
c Wroclaw University of Technology, Wroclaw, Poland

Abstract. Recent advancements in 3D video generation, processing, compression, and rendering increase the accessibility of 3D video content. However, the majority of 3D displays available on the market belong to the stereoscopic display class and require users to wear special glasses in order to perceive depth. In contrast, autostereoscopic displays can render multiple views without any additional equipment. The depth perception on stereoscopic and autostereoscopic displays is realized via a binocular depth cue called stereopsis. Another important depth cue, which is not exploited by autostereoscopic displays, is motion parallax, a monocular cue. To enable the motion parallax effect on a 2D display we propose to use the Kinect sensor to estimate the head pose of the viewer. Based on the head pose, the rendering system is able to synthesize the corresponding view from the 3D video stream. The real-time view synthesis software adjusts the view and creates the motion parallax effect on the 2D display. We believe that the proposed solution can enhance the content displayed on digital signage displays, kiosks, and other advertisement media where many users observe the content while moving and where the use of glasses-based 3D displays is not possible or too expensive.

Key words: 3D video, view synthesis, motion parallax, head tracking, Kinect

1 Introduction

Human beings see the surrounding environment in 3D thanks to the interpretation of a number of binocular and monocular depth cues [16] that do not have to be perceived jointly. Binocular depth cues can be seen only with both eyes and are based on stereopsis – a displacement between the two images of the same scene delivered to the left and right eye. The differences in horizontal position between corresponding points in the left and right images are called binocular disparities and are used by the visual cortex to sense depth. Monocular depth cues can be perceived using only one eye, creating the depth sensation through perspective, texture, shadows, and motion parallax [7]. Stereopsis and motion parallax are the two most important sensory cues for depth perception [7].

There exists a similarity between the motion parallax effect and the mechanism behind the stereopsis depth cue. In stereopsis, a disparity between the positions of the corresponding points perceived by both eyes simultaneously serves as a depth cue. In the case of motion parallax, the disparity occurs between the positions of an object perceived by the same eye successively over time [19]. An important advantage of the motion parallax depth cue is that it does not cause the visual fatigue and discomfort [15] induced by the vergence-accommodation conflict. This conflict occurs on 3D displays that exploit the binocular depth cue, because the mismatch between vergence and accommodation makes the fusion of the binocular stimulus difficult. The majority of 3D displays available on the market exploit stereopsis: the differences between the two images observed by the viewer create the sensation of depth. The stereoscopic image comprises two views that need to be delivered separately to each eye. The views can be separated temporally by shutter glasses or spatially by polarized glasses. In the case of shutter glasses, only one view at a time is displayed and the corresponding eye is uncovered by the shutter built into the glasses, while the shutter corresponding to the other eye is activated to block the view. This solution requires a display that can render the video stream at least at double the frame rate. In polarized-glasses-based technology, the glasses separate the views with differently polarized filters; corresponding polarization filters that separate the views pertaining to each eye are placed on top of the light source units of the display. In the case of polarized displays, the spatial separation of the views reduces the effective spatial resolution of the 3D display. In the case of time-sequential displays the resolution of the rendered view is not decreased, but the frame rate of the video is limited, since the shutter glasses alternate the views and only a single view is visible at a time instant, while the other is not displayed and the corresponding eye is obscured by the shutter. In both display types the purpose of the glasses is to block the image that does not pertain to the corresponding eye. Imperfect separation causes a small proportion of one eye's image to be seen by the other eye. This phenomenon is known as 3D crosstalk [20] and in both cases leads to decreased subjective quality of the 3D percept caused by ghosting artifacts. Autostereoscopic 3D displays can render more than two views simultaneously and do not require the user to wear any additional viewing gear. State-of-the-art autostereoscopic displays (ASDs) use light-directing mechanisms, consisting of either lenticular lenses or parallax barriers aligned on the surface of the display, to provide a different view for each eye of the viewer. Because ASDs render multiple views at a time, they share the problem of decreased resolution caused by the smaller number of display cells corresponding to a single view. Autostereoscopic displays also exhibit ghosting artifacts caused by imperfect separation of the images intended for the left and right eye [4]. The presence of the aforementioned artifacts is one of the factors that limits the application of such displays in public-space settings, such as advertising or information kiosks.

Two other factors that slow down the widespread adoption of ASDs are the high production cost and the limited spatial resolution compared to classic 2D displays. Previous research showed that the presence of head motion parallax, resulting from tracking the observer's head and rendering the view accordingly, improves the ability to discriminate depth on a 3D display. In the experiments described in [14] the subjects were asked to judge the depth of random-dot sinusoidal gratings rendered on a 3D display equipped with a head tracking system. The results corresponding to the cases in which head motion parallax was used showed a consistently lower error rate. While stereoscopic and autostereoscopic 3D displays rely on stereopsis to enable depth perception, it is impossible to re-create this effect on a typical 2D display without additional hardware and special eyewear. Considering this limitation and the fact that motion parallax is one of the most important depth cues [17][23], in this research we propose a cost-effective solution that enables depth sensation on a typical 2D display without requiring the user to wear glasses or any other devices. The structured-light RGB-D sensors, such as Kinect and Xtion Pro, were intended for computer games, but nowadays they are used in a wide range of applications, which include tracking and recording of human motion for humanoid robot programming [24] and human-robot interaction in the social context, such as gesture recognition, body and face tracking, and estimation of the user's pose [9]. These results demonstrate that RGB-D sensors provide a reliable and easy-to-use source of information about the user, which can also be adopted for determining the position of the head for the motion parallax effect, as demonstrated in our research.

2 Related work

Systems that track the user's eyes, face, or head and render the content according to the viewpoint of the observer to simulate the motion parallax effect have been reported in the literature mainly in the context of virtual reality, teleconferencing, and human-computer interaction with 3D displays. In virtual reality applications the motion of the head is often tracked by means of dedicated infra-red or magnetic sensors, which require the user to wear additional equipment. Some recent applications propose to use a color camera for head tracking. In one of the seminal works [18], a visual operating system that uses an autostereoscopic display was proposed. In this solution the head motion and gaze direction are tracked with color cameras. However, the view is not adjusted according to the head position. Instead, the head motion and gaze tracking are used to enable interaction of the user with the operating system. Since color cameras are used and the face detection is based on color, changing lighting conditions can cause noise and poor accuracy of the head tracker. Another work in which face detection operates on the video signal captured by the camera built into the computer is presented in [8].

The authors propose to render stereoscopic images with respect to the detected face position. They use face detection and tracking to detect the change of the head position and alter the viewpoint of the virtual camera. Based on the face position, the virtual scene can be rendered with one of two image rendering methods, diffuse-based or ray tracing. Changes in lighting conditions can degrade the quality of the captured video stream and introduce noise into the head pose measurement. Another disadvantage of this solution is the use of anaglyph glasses, which distort color perception. Solutions like the one described in [8], even if affordable, are very inconvenient in applications that involve occasional viewers, such as information kiosks in public places. The nature of these places and the type of content to be delivered, which is mostly for advertising purposes, limits the possibility of distributing glasses to the viewers. The 3D video teleconferencing systems deal with a problem more similar to our application, namely video delivery. The depth perception can be improved by tracking the motion of the viewer's head and re-creating the motion parallax effect on the rendered images. The use of the motion parallax effect in a teleconferencing application that operates on a standard camera embedded in the computer is presented in [12]. The proposed solution employs head tracking to adjust a pseudo-3D view of the remote location according to the viewer's position. The system segments out the person using background subtraction, similarly to [18]. The measurement of the viewer's head position is used to tilt the image, in which the foreground layer changes position with respect to the background layer. In this way a pseudo-3D effect of looking through a window is obtained. Similarly to [18], changing lighting conditions affect the accuracy of the head tracker and the segmentation; thus, operation in dim or dark conditions is not possible. Another system that utilizes motion parallax for teleconferencing was presented in [25]. The authors propose two methods to enhance the depth perception, namely box framing and layered video. In the case of box framing the image is rendered in an artificial box, which produces a window-like effect. The layered video requires segmentation of the person, similarly to [12], and the system adjusts the virtual view according to the head position of the viewer. Even though a Time-of-Flight (TOF) camera is used, the depth information helps only with the segmentation and is not used for the estimation of the head position. This solution requires computationally complex face tracking that works on the RGB camera input, which is vulnerable to environmental conditions such as changing lighting or moving background objects. Some systems utilize head tracking to enable continuous parallax for passive multiview displays. An exemplary solution [5], which removes the negative effect of the visibility zone change, uses face detection and tracking to adjust the views according to the observation angle of the viewer. Nevertheless, calibration between the face position and the observation angle is required. The system proposed in [5] has a weak point similar to the previous solutions, since the input from an RGB camera is used for head tracking.

When 3D perception relies on motion parallax, the accuracy of head pose estimation, the robustness of the tracker, and the response time to the changing head position become very important [25]. While there are many possible means to accomplish robust tracking of a single user in this type of application, most of them face practical drawbacks related to the use of a regular color camera under constantly changing lighting conditions and with a dynamic background in uncontrolled environments such as public spaces. We propose a system designed on similar principles, but employing only off-the-shelf hardware components: the Kinect sensor, a regular 2D display, and a commodity PC. The RGB-D sensor ensures fast and robust tracking of the user – not only the head motion, but also the whole body position, which is important in the context of our target applications.

3 Virtual View Synthesis with Motion Parallax

Fig. 1. Outline of the proposed system.

In contrast to the previous solutions, the proposed one enables rendering a multitude of virtual views from a 3D video sequence. As a result, the 3D effect is not simulated as in [25] or [12], and not only the originally captured views are rendered on the display as in [5]. The system takes as input a 3D video sequence stored in the Multiview Video and Depth (MVD) format. We selected the MVD format since it is supported by the recent multiview-and-depth coding extension (MVC+D) [2, 6] of the Advanced Video Coding (H.264/AVC) standard [2]. The test sequences and input views are selected according to the Common Test Conditions (CTC) [1]. Since the view change should be as smooth as possible, the number of interpolated views is increased. Because the depth signal is embedded in this format, Depth-Image-Based Rendering (DIBR) [10] is possible at dense virtual camera positions.
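
As an illustration of what the loader has to read, the sketch below parses one frame of a planar YUV 4:2:0 view from a raw .yuv file. The file names, the 1024×768 resolution, and the assumption that depth maps are stored as the luma plane of a YUV file are illustrative choices typical of MVD test data, not details taken from the paper.

```python
# Hedged sketch: read one frame of a planar YUV 4:2:0 view from a raw .yuv file.
import numpy as np

def read_yuv420_frame(path, frame_idx, width, height):
    """Return (Y, U, V) planes of frame `frame_idx` from a planar YUV 4:2:0 file."""
    y_size = width * height
    uv_size = y_size // 4
    frame_size = y_size + 2 * uv_size
    with open(path, "rb") as f:
        f.seek(frame_idx * frame_size)
        buf = f.read(frame_size)
    y = np.frombuffer(buf, dtype=np.uint8, count=y_size).reshape(height, width)
    u = np.frombuffer(buf, dtype=np.uint8, count=uv_size, offset=y_size).reshape(height // 2, width // 2)
    v = np.frombuffer(buf, dtype=np.uint8, count=uv_size, offset=y_size + uv_size).reshape(height // 2, width // 2)
    return y, u, v

# Example data block (hypothetical file names and resolution):
# texture = [read_yuv420_frame(f"texture_view{i}.yuv", 0, 1024, 768) for i in range(3)]
# depth   = [read_yuv420_frame(f"depth_view{i}.yuv", 0, 1024, 768)[0] for i in range(3)]
```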

Fig. 2. Kinect-based user tracking: estimated skeleton joints (a), Kinect’s field of view overlaid on a pseudo-color-coded depth image (b)

The outline of the system is presented in Fig. 1. To enable real-time operation, the processing is split into three threads that run independently. The View Synthesis (VS) and YUV Loader (YUVL) modules operate on a circular buffer in order to minimize thread stall time. Each data block of the circular buffer consists of camera parameters and the three texture and three corresponding depth views of a single frame. The frames of the sequence in the MVD format and the corresponding camera parameters are loaded sequentially from the hard disk into the circular buffer memory by the YUV loader thread. At any time, the viewer can move freely in front of the display. To enable operation over a longer range of distances from the display, the tracking of the viewer is realized in two modes. In the first mode, human pose tracking is utilized. The depth map captured by the Kinect sensor is processed by the Pose Estimation (PE) module, which utilizes the Microsoft Kinect SDK v1.8 library. In this step the human silhouette is extracted from a single input depth image. Then, a per-pixel body part distribution is inferred. Finally, local modes of each part distribution are estimated to give a confidence-weighted proposal for the 3D location of body joints [21]. The algorithm is able to track up to two users. Since the motion parallax effect can be simulated for only one viewer, the PE module selects the one closer to the display. The skeleton measurement data (Fig. 2a) is obtained in meters, in a 3D coordinate system whose origin is located in the center of the Kinect sensor (Fig. 2b). In our setup the optical axis of the sensor and the display's normal are parallel, and the sensor is situated in the symmetry plane of the screen. The vertical position and the elevation angle of the Kinect are unrestricted; the only requirement is that the whole silhouette of the user is captured. Since our solution is intended for information kiosks and similar publicly accessible places, we assume that the user stands in front of the display rather than sits at a desk or table. In the second mode, the PE module works at close range and tracks only the face. The scene in front of the display is captured by the depth sensor and the RGB camera of the Kinect in order to perform face tracking; a modification of the active appearance model that uses depth data is utilized [22].
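
The split into loader, pose estimation, and synthesis threads can be outlined with a minimal sketch. A bounded queue stands in for the circular buffer of preallocated data blocks, and load_frame_block, estimate_head_pose, and synthesize_view are hypothetical placeholders for the YUVL, PE, and VS functionality rather than the actual implementation.

```python
# Minimal three-thread sketch, assuming a bounded queue as a circular-buffer stand-in.
import queue
import threading
import time

def load_frame_block(i):
    # Placeholder: would read camera parameters plus 3 texture and 3 depth views.
    return {"frame": i}

def estimate_head_pose():
    # Placeholder: would query the Kinect skeleton or face tracker.
    return 0.0

def synthesize_view(block, angle_deg):
    # Placeholder: would run DIBR for the view selected from the head angle.
    return (block["frame"], angle_deg)

buffer = queue.Queue(maxsize=8)        # blocks ready for synthesis
latest_pose = {"angle_deg": 0.0}       # shared head-pose estimate
pose_lock = threading.Lock()

def yuv_loader():                      # YUVL thread: fills the buffer from disk
    i = 0
    while True:
        buffer.put(load_frame_block(i))    # blocks when the buffer is full
        i += 1

def pose_estimation():                 # PE thread: keeps the head pose up to date
    while True:
        angle = estimate_head_pose()
        with pose_lock:
            latest_pose["angle_deg"] = angle

def view_synthesis():                  # VS thread: consumes blocks, renders views
    while True:
        block = buffer.get()
        with pose_lock:
            angle = latest_pose["angle_deg"]
        synthesize_view(block, angle)

for fn in (yuv_loader, pose_estimation, view_synthesis):
    threading.Thread(target=fn, daemon=True).start()
time.sleep(0.1)                        # let the demo threads run briefly
```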

Fig. 3. Concept of the head tracking and view switching method

The PE module uses the position of the user's head in Cartesian coordinates (Fig. 3a) and calculates its angular position in the transverse plane relative to the Kinect optical axis (Fig. 3b). Thanks to the conversion to angular coordinates, the rendered view remains invariant while the user moves closer to the screen. The angular spread is limited to ±10° measured from the optical axis at the center of the display, so that the head movements and the displayed view correspond adequately. This value was adjusted experimentally to provide an optimal user experience.
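
The angular conversion amounts to a single arctangent followed by clamping. The sketch below assumes the head position is given in meters in a frame centred on the Kinect, with x pointing right and z along the optical axis; the ±10° limit is the value reported above.

```python
# Minimal sketch of head angle computation and clamping.
import math

MAX_ANGLE_DEG = 10.0

def head_angle_deg(x, z):
    """Angular head position in the transverse plane, relative to the optical axis."""
    angle = math.degrees(math.atan2(x, z))
    return max(-MAX_ANGLE_DEG, min(MAX_ANGLE_DEG, angle))

# Example: head 0.3 m to the right of the axis and 2.0 m away -> about 8.5 degrees.
print(head_angle_deg(0.3, 2.0))
```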

Fig. 4. The layout of interpolated views with respect to the captured ones.

Based on the pose of the viewer, the VS module decides whether to interpolate a view or display an originally captured one. The VS module utilizes the DIBR [10] process to interpolate the required view. The layout of interpolated views is presented in Fig. 4. The virtual views, represented by the black cameras, are interpolated from the captured views, marked as white cameras, and their corresponding depth data. The system assumes that the input data is generated by a shift-sensor camera setup [10] and is rectified [11]. Since in the shift-sensor camera setup the optical axes of the camera array are parallel, the keystone effect is avoided and vertical disparities are not present. Thanks to rectification, corresponding points lie on corresponding horizontal lines of the neighboring views. This approach simplifies the 3D image warping, which is essentially a reprojection from the source view to world coordinates followed by a projection from world coordinates to the target virtual view.
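
For a rectified shift-sensor pair this warping reduces to a per-pixel horizontal shift derived from depth, as sketched below. The 8-bit inverse-depth quantization between z_near and z_far and the parameter names are assumptions typical of MVD test data, not values taken from the paper, and the sign of the shift depends on the camera layout.

```python
# Hedged sketch of horizontal-shift warping from a depth map (DIBR for rectified views).
import numpy as np

def warp_horizontal(texture, depth8, focal_px, baseline_m, z_near, z_far):
    """Warp `texture` toward a virtual camera shifted by `baseline_m` to the right."""
    h, w = depth8.shape
    # Metric depth from the 8-bit inverse-depth map (255 = z_near, 0 = z_far).
    z = 1.0 / (depth8.astype(np.float64) / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)
    disparity = focal_px * baseline_m / z          # horizontal shift in pixels
    virtual = np.zeros_like(texture)
    z_buf = np.full((h, w), np.inf)
    for y in range(h):
        for x in range(w):
            xt = int(round(x - disparity[y, x]))   # sign convention: camera moved right
            if 0 <= xt < w and z[y, x] < z_buf[y, xt]:
                virtual[y, xt] = texture[y, x]     # nearer pixels win (occlusion handling)
                z_buf[y, xt] = z[y, x]
    return virtual                                 # remaining holes would need filling

# Example on synthetic data: a fronto-parallel plane shifts uniformly.
# tex = np.tile(np.arange(64, dtype=np.uint8), (48, 1))
# dep = np.full((48, 64), 128, dtype=np.uint8)
# out = warp_horizontal(tex, dep, focal_px=1000.0, baseline_m=0.05, z_near=1.0, z_far=10.0)
```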

The shift-sensor algorithm [10] used in our approach reduces the 3D warping to a horizontal shift, which decreases the computational complexity. Thanks to this approach, the interpolation for sub-pixel image generation is also one-dimensional, which reduces the complexity further. As depicted in Fig. 4, the viewer can move along the baseline, which is perpendicular to the camera axes. When the viewer moves outside of the limited range, the outermost original view is displayed. The number of interpolated views is limited to 18; therefore, the algorithm discretizes the angular head position into this number of positions and, as a result, obtains the actual view number to render (Fig. 3c).
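
The discretization of the clamped head angle into a view number can be sketched as follows, assuming 18 equally spaced interpolated views between the outermost captured ones; the exact indexing scheme is an illustrative assumption.

```python
# Sketch: map the clamped head angle to a rendered view index.
NUM_INTERPOLATED = 18
MAX_ANGLE_DEG = 10.0

def view_index(angle_deg):
    """0 and NUM_INTERPOLATED + 1 are the captured views; 1..18 are interpolated."""
    if angle_deg <= -MAX_ANGLE_DEG:
        return 0                              # leftmost original view
    if angle_deg >= MAX_ANGLE_DEG:
        return NUM_INTERPOLATED + 1           # rightmost original view
    t = (angle_deg + MAX_ANGLE_DEG) / (2.0 * MAX_ANGLE_DEG)   # map to 0..1
    return 1 + min(NUM_INTERPOLATED - 1, int(t * NUM_INTERPOLATED))

print(view_index(-12.0), view_index(0.0), view_index(3.5))    # 0 10 13
```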

Fig. 5. Two example images presented to the user at the limits of the field of view

Example views obtained from our system are presented in Fig. 5. These views represent the two most distant images, presented to the user at the left and right limits of the field of view.

4 Application Scenario

The 3D perception that relies on motion parallax, obtained by estimating the observer's head or body pose, may be applied in numerous digital signage applications [3]. Digital signage encompasses installations based on information and advertising screens, as well as other digital communication solutions in public places, extended by spatial actions and the search for new forms of interaction with the user. Since the Kinect can serve as a 3D video acquisition device [13], the proposed system, after further development, can be used as an affordable 3D video teleconferencing system. Capture with a single RGB and depth camera produces a 3D video with a single view only, which restricts the virtual view synthesis to a narrow angle. Nevertheless, the use of 3D video teleconferencing in a static scene environment makes it possible to use a 3D reconstruction of the background as a hole-filling procedure for the view synthesis. This approach will be developed as an extension of the proposed system. The proposed solution can also be used as an addition to a home entertainment system equipped with a stereoscopic 3D display and a Kinect.

Since the Kinect is originally intended for gaming, it is often already placed in the way required by our solution. Thus, enabling the motion parallax effect for a regular, glasses-based stereoscopic 3D display would be a seamless solution that requires only a software update of the TV.

5 Conclusions

We propose to employ an active RGB-D sensor for tracking the head and body position in order to determine the viewpoint of the 3D video viewer. The Microsoft Kinect comes with a sophisticated software API for gaming that enables relatively precise and robust tracking of body parts in real time. We employ this API and demonstrate that the availability of the observer's head and body pose estimate enables an accurate motion parallax effect on a standard PC. The proposed solution utilizes a computationally optimized shift-sensor view synthesis [10] that enables real-time virtual view rendering with the motion parallax depth cue. Since the observer's head position estimation is based on the overall posture detection, our system has the unique property of being able to track a user whose face is partly or even completely covered, e.g. by sunglasses or a beard. Depending on the observer's height, our system was able to track the silhouette from about 2 m up to 4 m from the RGB-D sensor position. Because the area of the human body is much larger than the area of the face, we achieved better detection reliability at farther distances compared to standard face detection algorithms, which require the face to be close to the camera. Therefore, tracking the whole body with the Kinect can detect the observer's head position in realistic scenarios related to signage and advertisement applications in public places.

References

1. Common test conditions of 3DV core experiments. ISO/IEC JTC1/SC29/WG11 JCT3V-E1100 (2013).
2. ITU-T and ISO/IEC JTC 1: Advanced video coding for generic audiovisual services. ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG-4 AVC) (2013).
3. Anisiewicz, J., Jakubicki, B., Sobecki, J., Wantuła, Z. (2015). Configuration of Complex Interactive Environments. In: Proc. New Research in Multimedia and Internet Systems, Springer International Publishing, 239–249.
4. Boev, A., Gotchev, A., Egiazarian, K. (2009). Stereoscopic Artifacts on Portable Auto-Stereoscopic Displays: What Matters? Proc. Workshop on Video Processing and Quality Metrics for Consumer Electronics, Scottsdale, 24–29.
5. Boev, A., Raunio, K., Georgiev, M., Gotchev, A., Egiazarian, K. (2008). OpenGL-Based Control of Semi-Active 3D Display. Proc. IEEE 3DTV-Conference: The True Vision-Capture, Transmission and Display of 3D Video, Istanbul, 125–128.
6. Chen, Y., Hannuksela, M. M., Suzuki, T., Hattori, S. (2013). Overview of the MVC+D 3D video coding standard. Journal of Visual Communication and Image Representation.
7. Cutting, J. E., Vishton, P. (1995). Perceiving Layout and Knowing Distances: The Integration, Relative Potency and Contextual Use of Different Information About Depth. In: Perception of Space and Motion. San Diego: Academic Press, 69–117.

8. Dąbała, Ł., Rokita, P. (2014). Simulated Holography Based on Stereoscopy and Face Tracking. In: Computer Vision and Graphics, LNCS Vol. 8671, Springer, 163–170.
9. Dziergwa, M., Kaczmarek, P., Kędzierski, J. (2015). RGB-D Sensors in Social Robotics. Journal of Automation, Mobile Robotics & Intelligent Systems, 9(1), 18–27.
10. Fehn, C. (2004). Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV. Proc. SPIE Conf. Stereoscopic Displays and Virtual Reality Systems XI, CA, U.S.A., 93–104.
11. Fusiello, A., Trucco, E., Verri, A. (2000). A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, 12(1), 16–22.
12. Harrison, C., Hudson, S. E. (2008). Pseudo-3D Video Conferencing with a Generic Webcam. Proc. 10th IEEE Int. Symposium on Multimedia, Berkeley, 236–241.
13. Joachimiak, M., Hannuksela, M. M., Gabbouj, M. (2013). Complete Processing Chain for 3D Video Generation Using Kinect Sensor. Proc. International Conference on 3D Imaging, Liege, Belgium.
14. Lackner, K., Boev, A., Gotchev, A. (2014). Binocular depth perception: Does head parallax help people see better in depth? Proc. IEEE 3DTV-Conference: The True Vision-Capture, Transmission and Display of 3D Video, Budapest.
15. Lambooij, M. T., Ijsselsteijn, W., Heynderickx, I. (2007). Visual Discomfort in Stereoscopic Displays: A Review. Proc. SPIE 6490, Stereoscopic Displays and Virtual Reality Systems XIV.
16. Mikkola, M., Boev, A., Gotchev, A. (2010). Relative importance of depth cues on portable autostereoscopic display. Proc. 3rd Workshop on Mobile Video Delivery, ACM, 63–68.
17. Nawrot, M. (2003). Depth from motion parallax scales with eye movement gain. Journal of Vision, 3(11), 17.
18. Pastoor, S., Liu, J., Renault, S. (1999). An Experimental Multimedia System Allowing 3-D Visualization and Eye-Controlled Interaction without User-Worn Devices. IEEE Trans. on Multimedia, 1(1), 41–52.
19. Rogers, B., Graham, M. (1982). Similarities Between Motion Parallax and Stereopsis in Human Depth Perception. Vision Research, 22, 261–270.
20. Seuntiens, P. J., Meesters, L. M., Ijsselsteijn, W. A. (2005). Perceptual attributes of crosstalk in 3D images. Displays, 26(4), 177–183.
21. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A. (2011). Real-time human pose recognition in parts from single depth images. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Providence, 1297–1304.
22. Smolyanskiy, N., Huitema, C., Liang, L., Anderson, S. E. (2014). Real-time 3D face tracking based on active appearance model constrained by depth data. Image and Vision Computing, 32(11), 860–869.
23. Ujike, H., Ono, H. (2001). Depth thresholds of motion parallax as a function of head movement velocity. Vision Research, 41(22), 2835–2843.
24. Wasielica, M., Wąsik, M., Kasiński, A., Skrzypczyński, P. (2013). Interactive Programming of a Mechatronic System: A Small Humanoid Robot Example. Proc. IEEE/ASME Int. Conf. on Advanced Intelligent Mechatronics, Wollongong, 459–464.
25. Zhang, C., Yin, Z., Florêncio, D. (2009). Improving Depth Perception with Motion Parallax and Its Application in Teleconferencing. Proc. IEEE Int. Workshop on Multimedia Signal Processing, Rio de Janeiro, 1–6.
