JOURNAL OF NETWORKS, VOL. 5, NO. 9, SEPTEMBER 2010
Depth Image-based Techniques for Compression, Transmission and Display of Auto-stereo Video

Zhuoyun Zhang
School of Electronic and Information Engineering, Tianjin University, Tianjin, China
Email: [email protected]

Chunping Hou
School of Electronic and Information Engineering, Tianjin University, Tianjin, China
Email: [email protected]

Zhigang Jin
School of Electronic and Information Engineering, Tianjin University, Tianjin, China
Email: [email protected]
Abstract—This paper proposes a novel framework for the processing of auto-stereo video. It is based on the video-plus-depth data representation and provides high compatibility with diverse displays. An effective stereo matching algorithm based on the genetic algorithm is adopted to generate the depth map streams for the first and last video streams. A layered coding scheme is proposed so that the appropriate layers can be transmitted according to customer demand and network performance. Multicast technology and QoS terminal-adaptive technology are then adopted for the network transmission approach. Finally, an appropriate depth image-based rendering solution is proposed to generate the virtual views for displays with different view numbers. The experimental results demonstrate the effectiveness of the whole framework, and remaining technical challenges are identified at the end.

Index Terms—depth map, auto-stereo video, scalable coding, depth image-based rendering
I. INTRODUCTION

Thanks to great progress in computer vision algorithms, image coding algorithms, auto-stereoscopic displays and depth image-based rendering techniques, auto-stereo video services have reached a high level of technical maturity and are now ready for the market. The video-plus-depth data representation, proposed by "the advanced three-dimensional television system technologies" (ATTEST) project, provides high scalability and flexibility for different stereo displays [1, 2]. Based on this data representation, we propose an entire framework for auto-stereo video services, aiming at effective transmission over traditional networks and diverse applications at the customer terminals.
© 2010 ACADEMY PUBLISHER doi:10.4304/jnw.5.9.1053-1059
The stereo matching algorithm is crucial for generating the depth map. Ref. [3] proposes a local "winner-take-all" (WTA) scheme at each pixel within a given window. A Markov random field algorithm is proposed in [4]. Besides, there are many other algorithms such as max-flow [5], graph-cut [6] and so on. D. Scharstein proposes evaluation criteria for the performance of stereo matching algorithms in [7]. In this paper, we adopt the genetic algorithm, which belongs to the class of global optimization approaches and performs well in our experiments. In our framework, it is assumed that all the auto-stereo video streams are obtained by a parallel imaging system.

Depth image-based rendering (DIBR) techniques are then used to generate the virtual views at the receiver side, so as to be compatible with displays with different view numbers. Refs. [8] and [9] propose an asymmetrical filter and anisotropic diffusion, respectively, to improve the quality of the rendered view. In this paper, we propose a DIBR solution for our framework based on two depth map streams. The experimental results demonstrate the effectiveness of this solution.

The remainder of this paper is organized as follows. Section 2 provides an overview of the whole framework. Sections 3, 4, 5 and 6 analyze and detail each individual part of the processing line, with the experimental results given in the corresponding sections. Finally, Section 7 concludes the paper.

II. FRAMEWORK FOR THE WHOLE PROCESSING LINE

Based on the video-plus-depth representation, there are mainly two frameworks proposed in the literature. The ATTEST project creates the depth map for one view and then renders the virtual views at the receiver by depth-image-based rendering (DIBR)
techniques. Under this proposal, only one video-plus-depth stream has to be transmitted to the receiver, so the data amount is significantly reduced. However, the number of disocclusions in the rendered views grows with the display view number M: the larger M is, the poorer the quality of the rendered images becomes. P. Kauff et al. propose instead to create the depth map for all N views, i.e. to obtain N video-plus-depth streams [10]. This renders virtual views of higher quality than the ATTEST proposal, but creating depth maps for all N original views is very time-consuming, and N video-plus-depth streams have to be transmitted to the receiver. The data amount is nearly N times that of the ATTEST proposal, which places a much higher burden on the network. With these considerations in mind, we put forward an approach that balances rendered view quality against network load. The whole framework is shown in Fig. 1. The experimental results are based on a multi-view test database with N=5 views (see Fig. 2), provided by Middlebury College's website [11, 12].

In this framework, we create the depth map only for views 1 and N, obtaining two video-plus-depth streams. Fig. 3 compares the original data amount of our framework, the P. Kauff proposal and the ATTEST project. It shows that, as N increases, the original data amount at the receiver side becomes nearly 33% larger in the P. Kauff proposal than in the ATTEST proposal, while our proposal is only slightly larger. Considering transmission, the ATTEST proposal transmits one video-plus-depth stream to the receiver; our proposal has to transmit two video-plus-depth streams, i.e. twice as much data; the P. Kauff proposal has to transmit N video-plus-depth streams, i.e. N times as much. The data transmitted in our proposal is thus slightly larger than in the ATTEST proposal, but significantly smaller than in the P. Kauff proposal.

III. DEPTH MAP CREATION

Note that the video streams in the different views are
Figure 1 Whole framework of the auto-stereo video processing line.
Figure 2 Test database with N=5 views.
obtained by the parallel imaging system. The depth map creation steps are shown in Fig. 4 and detailed in the following subsections.

A. Stereo Matching

Stereo matching is the most crucial step in the above flowchart. Many methods have been proposed in the literature, such as the Markov random field algorithm, max-flow, graph-cut and so on. In our experiment, we propose a stereo matching algorithm that combines the genetic algorithm with linear interpolation; the detailed design is given in [13]. For our test database, we create the disparity maps for views 1 and 5. The stereo matching result for view 1 is shown in Fig. 5.

Figure 5 Stereo matching result for view 1.

B. Post-processing

There are two ill-posed problems in stereo matching. One is that point correspondences exist but cannot be found uniquely. The other is that correspondences do not exist at all because of occlusion by other objects. These occluded regions appear as "black holes" in the disparity map, so a post-processing step has to be designed to fill these holes. In our experiment, we adopt nearest-neighbor interpolation to post-process the disparity map.

C. Depth Map Calculation

After the post-processing, the depth map can be calculated from the disparity map as

Z(i, j) = F B / d(i, j) .  (1)

In (1), F represents the focal length, which is the same for each view (the camera parameters are identical across views). B denotes the baseline of the stereo pair formed by views 1 and N, and d(i, j) is the disparity value of pixel (i, j) in the disparity map. The depth value Z(i, j) then follows directly from (1).
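The post-processing and depth calculation steps above can be sketched as follows. This is a minimal illustration, assuming holes are marked by zero disparity and using illustrative camera parameters; the function names and values are our own, not part of the original design:

```python
import numpy as np

def fill_holes_nearest(disparity):
    """Fill zero-valued "black holes" with the nearest valid disparity
    along each scan line (a simple nearest-neighbor interpolation)."""
    filled = disparity.astype(np.float64).copy()
    for row in filled:
        valid = np.flatnonzero(row > 0)
        if valid.size == 0:
            continue                     # entire row is a hole; skip
        holes = np.flatnonzero(row == 0)
        # for every hole, the index of the nearest valid pixel
        nearest = valid[np.argmin(np.abs(holes[:, None] - valid[None, :]), axis=1)]
        row[holes] = row[nearest]
    return filled

def depth_from_disparity(disparity, focal_length, baseline):
    """Eq. (1): Z(i, j) = F * B / d(i, j)."""
    d = np.maximum(disparity, 1e-6)      # guard against division by zero
    return focal_length * baseline / d

disp = np.array([[10., 0., 12.],
                 [11., 11., 0.]])        # toy disparity map with holes
disp = fill_holes_nearest(disp)          # holes take the nearest valid value
Z = depth_from_disparity(disp, focal_length=500.0, baseline=0.1)
```

A full implementation would interpolate in two dimensions; the per-scan-line version above only illustrates the idea.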
IV. SCALABLE CODING
Figure 3 Original data amount comparison of different proposals.
Figure 4 Depth map creation steps.
In our framework, we aim to provide video programs to various customer displays, such as 2D digital TV and auto-stereo displays with different view numbers. The ATTEST project transmits one video-plus-depth stream to the receiver, whereas P. Kauff et al. propose to transmit N video-plus-depth streams. Compared with the ATTEST proposal, the P. Kauff solution obtains better rendered view quality, since the additional video-plus-depth streams provide missing information that reduces the visibility of disocclusions. However, its depth map creation algorithm has higher computational complexity, and it has to transmit more data to the receiver, which places a higher burden on the network.

After analyzing these two solutions, we propose a scalable coding scheme in order to balance rendered view quality against network load. Our proposal is based on a layered coding scheme, with a base layer and advanced layers designed as in Fig. 6. For a 2D display, only the base layer needs to be transmitted. For a stereo display, the additional advanced layer 1 is transmitted; views 1 and N in these two layers guarantee good stereo perception on the stereo display. If the customer requests other views, advanced layer 2 is used as well. This layer contains the two depth map streams for views 1 and N, which are used to render the virtual views at the receiver.
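The layer-selection rule described above can be sketched as follows; the function and layer names are our own illustrative labels, not part of the actual coding syntax:

```python
def layers_for_display(m):
    """Select the layers to transmit for a display with M views:
    base layer (view 1 video), advanced layer 1 (view N video),
    advanced layer 2 (depth map streams for views 1 and N)."""
    layers = ["base"]                 # 2-D display needs view 1 only
    if m >= 2:
        layers.append("advanced-1")   # stereo pair: views 1 and N
    if m > 2:
        layers.append("advanced-2")   # depth maps enable DIBR of extra views
    return layers

print(layers_for_display(1))  # ['base']
print(layers_for_display(5))  # ['base', 'advanced-1', 'advanced-2']
```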
Figure 6 Layered structure in our proposal.
Figure 7 Flowchart of the supervising and prediction system.
We no longer need to transmit the other video streams; this DIBR solution is demonstrated in Section 6. Compared with the other two solutions, our proposal achieves a tradeoff between rendered view quality, data amount and the computational complexity of depth map generation.
V. TRANSMISSION APPROACH

Considering the diverse demands of consumers and the unexpected variance of the network, the network may need to provide more flexibility and scalability. Our approach aims to break through the unidirectional transmission restriction of traditional video programs and make it possible for consumers to feed their demands back to the server. To realize this function, we propose the following solutions. For the design of the network, we adopt stereo video multicast technology and QoS terminal-adaptive technology on top of supervision of the network performance.

A. Network Supervising and Prediction System

The system is based on the Pathload approach to detect and forecast the network performance on the Windows platform [14]. Both the server and the customer send probe packets to detect the end-to-end bandwidth of the network, and this bandwidth is then used as the transmission rate of the one-way delay probe packets. The bandwidth value can then be verified against the detected one-way delay. Furthermore, the mean delay jitter is obtained using a sampling and processing mechanism. Finally, an AR model is used to forecast the network parameters. The entire network supervising and prediction system is shown in Fig. 7.

B. Multicast Technology

Considering the large amount of data in auto-stereo video and the characteristics of the network, we design a novel extensible application-layer multicast algorithm, S-TAG. It uses the detected underlying topological information and the QoS parameters to construct an efficient, flexible and extensible network. Every node can use different parameters (such as the path hop count, the end-to-end delay, the end-to-end bandwidth, etc.) to choose its parent node. A node only needs to maintain local information about the correlated nodes and can ignore the structure and performance of the global network. Moreover, if the stereo video requires larger bandwidth, the S-TAG mode can conveniently switch to P2P mode and obtain the data from more than one node [15, 16].
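The AR-based forecasting step of subsection A above can be illustrated with a least-squares AR fit over recent bandwidth samples. This is a sketch under our own assumptions (model order, sample values, estimator); the paper's actual predictor may differ:

```python
import numpy as np

def ar_forecast(samples, order=3):
    """Fit an AR(order) model to measured network parameters
    (e.g. available bandwidth in Mbit/s) by least squares and
    predict the next value."""
    x = np.asarray(samples, dtype=np.float64)
    # design matrix: each row holds `order` consecutive past samples
    A = np.array([x[i:i + order] for i in range(len(x) - order)])
    y = x[order:]                      # the sample following each row
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(x[-order:] @ coeffs)

# bandwidth samples generated by x[t] = 0.1*x[t-3] + 0.2*x[t-2] + 0.7*x[t-1]
history = [10.0, 11.0, 12.0, 11.6, 11.62, 11.654, 11.6418]
next_bw = ar_forecast(history)         # ≈ 11.642
```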
C. Management of the Network Service

In order to adapt to various terminals and provide more interactivity for the users, an effective management approach is also needed to administrate the whole network. In our experiment, we adopt the LDAP protocol to store the network data obtained through the SNMP protocol [17, 18]. Meanwhile, these data reflect the users' interests and habits, and provide the basis for predicting the users' future choices of video programs. This approach integrates SNMP with the LDAP protocol and manages the network efficiently.

VI. VIEW RENDERING

At the receiver, there may be various kinds of displays, such as 2D displays, stereo head-mounted displays, and auto-stereo displays with different view numbers M. To satisfy customer needs, an appropriate DIBR solution is crucial. In our proposal, the original data are divided into several layers, and the corresponding layers are transmitted according to the customer display. The relation is shown in Table I, where M is the display view number. Based on this relationship, no virtual view has to be rendered for a 2D display or for a stereo display with two viewpoints. However, if the view number M exceeds two, we have to transmit the two depth map streams
in the advanced layer 2 and render the arbitrary virtual views at the customer side. The DIBR solution is the same for every virtual view, so we take the middle virtual view I as an example; the procedure we designed is shown in Fig. 8.

Figure 8 View rendering algorithm.

TABLE I. RELATION BETWEEN CUSTOMER DISPLAY AND TRANSMITTED LAYERS

Display type              | Corresponding layers                             | Layer content
2D display                | Base layer                                       | Video stream 1
Stereo display (M=2)      | Base layer + Advanced layer 1                    | Video streams 1 and N
Auto-stereo display (M>2) | Base layer + Advanced layer 1 + Advanced layer 2 | Video streams 1 and N; depth map streams 1 and N

A. DIBR Algorithm

In the DIBR solution, we first convert the depth map into the disparity map between the original view and the virtual view. Assume that the display view number M is 3; then one virtual view I, the middle view between views 1 and N, has to be generated. First, the depth maps of views 1 and N are converted to disparity maps according to

d(i, j) = F × B' / Z(i, j) .  (2)

In (2), F is the camera focal length, the same as at the server side. Z(i, j) denotes the depth value of pixel (i, j) in view 1, and B' refers to the baseline between the virtual view I and view 1. For M = 3,

B' = B / 2 ,  (3)

where B is the baseline between views 1 and N. After the disparity maps between the original views and the virtual view are obtained, the initial virtual views I1 and IN are generated from views 1 and N, respectively.

B. Consistency Check

In order to obtain the virtual view I, both initial virtual views I1 and IN have to be considered. The pixels of the virtual view I can be divided into the following types according to their visibility in views 1 and N.

1. Pixel (i, j) is visible in both view 1 and view N. Then both I1(i, j) and IN(i, j) exist, and the pixel value is their average:

I(i, j) = ( I1(i, j) + IN(i, j) ) / 2 .  (4)

2. Pixel (i, j) is visible only in view 1. Then I1(i, j) exists but IN(i, j) does not, so

I(i, j) = I1(i, j) .  (5)

3. Pixel (i, j) is visible only in view N. Then IN(i, j) is meaningful while I1(i, j) does not exist, so

I(i, j) = IN(i, j) .  (6)

4. Pixel (i, j) is visible in neither view 1 nor view N. Then neither I1(i, j) nor IN(i, j) exists, and the pixel value is obtained by neighborhood interpolation.

After this consistency check, we obtain the initial virtual view I. The rendered result is shown in Fig. 9(a).

C. Gaussian Filter

In the initial virtual view I, some mistakes may appear at object boundaries. In particular, a "ghost effect" may be caused by averaging the pixel values. This decreases the view quality and impairs the stereo perception. Smoothing the initial virtual view can improve the image quality, and is especially useful for removing the mistakes along object boundaries. In our experiment, we choose a Gaussian filter to realize this function. After Gaussian filtering, we get the final virtual view I. The final results with Gaussian filtering are shown in Fig. 9(b) and Fig. 9(c), where the standard deviation of the Gaussian filter is 0.5 and 10, respectively; the figure compares the rendered virtual views with and without Gaussian filtering. After the above three steps, we obtain the final rendered virtual views. The same scheme is adopted to generate the virtual views for auto-stereo displays with other view numbers.
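The warping implied by Eqs. (2)-(3) and the consistency check of Eqs. (4)-(6) can be sketched for a single scan line. This is a simplified illustration (integer-pixel forward warping, left-neighbor hole filling); all function names are our own, not the paper's implementation:

```python
import numpy as np

def warp_row(row, disp, direction):
    """Forward-warp one scan line of intensities by its per-pixel
    disparity; `direction` is +1 when shifting from view 1 towards the
    middle view and -1 when shifting from view N. Target pixels that
    receive no source pixel stay NaN (invisible in that view)."""
    out = np.full(row.shape, np.nan)
    for j, d in enumerate(disp):
        t = j + direction * int(round(d))
        if 0 <= t < out.size:
            out[t] = row[j]
    return out

def merge_views(i1, i_n):
    """Consistency check, Eqs. (4)-(6): average where both warped views
    are visible, copy the single visible one otherwise, and fill the
    remaining holes from the nearest left neighbor (case 4)."""
    out = np.where(np.isnan(i1), i_n,
          np.where(np.isnan(i_n), i1, (i1 + i_n) / 2.0))
    for j in range(out.size):          # case 4: visible in neither view
        if np.isnan(out[j]) and j > 0:
            out[j] = out[j - 1]        # crude neighborhood interpolation
    return out

line = np.array([40., 80., 120., 160.])
shifted = warp_row(line, np.array([2., 2., 2., 2.]), +1)
# pixels move right by 2: → [nan, nan, 40., 80.]

i1 = np.array([np.nan, 10., 20., np.nan])   # toy warped scan lines
i_n = np.array([5., 12., np.nan, np.nan])
middle = merge_views(i1, i_n)               # → [5.0, 11.0, 20.0, 20.0]
```

A real renderer resolves sub-pixel positions and depth-orders competing source pixels; the sketch only shows the four visibility cases.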
(a) No filtering
(b) Standard deviation of Gaussian filtering is 0.5
(c) Standard deviation of Gaussian filtering is 10

Figure 9 Final virtual view with and without Gaussian filtering.
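The Gaussian filtering step can be reproduced with a separable Gaussian convolution; the sigma values 0.5 and 10 match Fig. 9(b) and 9(c), while the toy image and function names are our own illustration:

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at ~3 sigma."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_smooth(img, sigma):
    """Separable Gaussian filtering: convolve rows, then columns."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

view = np.zeros((32, 32))
view[:, 16:] = 255.0                 # sharp object boundary in a toy view
mild = gaussian_smooth(view, 0.5)    # Fig. 9(b): light smoothing
heavy = gaussian_smooth(view, 10.0)  # Fig. 9(c): strong smoothing
```

A small standard deviation suppresses boundary mistakes with little visible blur, whereas sigma = 10 smears the whole image, which matches the visual comparison in Fig. 9.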
VII. CONCLUSIONS

In this paper, we detail and analyze the proposed framework. To provide compatibility with different displays, we choose the video-plus-depth representation, and two depth map streams are generated for views 1 and N. A layered coding algorithm is then designed for this representation, and the transmission approach, combined with the network supervising and prediction system, safeguards the effect of the coding scheme. Finally, an appropriate DIBR solution is proposed for this framework; the final virtual views show excellent stereo perception quality. The experimental results demonstrate the effectiveness of the proposed framework.

In the future, we plan to improve our framework in the following aspects. Stereo perception relies heavily on the human visual system, so we aim to incorporate its features into a redesign of the coding and transmission approach. Besides, the N auto-stereo video streams contain large inter-view redundancy, which we hope to extract and exploit in our coding algorithm.

ACKNOWLEDGMENT

This work was supported by grants from the National High Technology Research and Development Program of China (863 Program) (No. 2007AA01Z304, 2009AA01Z306, 2009AA011507), the National Natural Science Foundation of China (No. 60672077, 60932007), the Science and Technology Support Foundation of Tianjin (No. 08ZCKFJC27900), the Important National Science & Technology Specific Projects (No. 2009ZX03004-006), the National Foundation (No. 09JCZDJC17000) and a Tianjin University Project (No. TJU-YFF-08B63).

REFERENCES

[1] A. Redert et al., "ATTEST—advanced three-dimensional television system technologies," Proceedings of the First International Symposium on 3D Data Processing, Visualization and Transmission, Padova, Italy, pp. 313–319, Jun. 2002.
[2] C. Fehn et al., "Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV," Proceedings of the SPIE Stereoscopic Displays and Virtual Reality Systems XI, San Jose, CA, USA, pp. 93–104, Jan. 2004.
[3] T. Kanade and M. Okutomi, "A stereo matching algorithm with an adaptive window: Theory and experiment," IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9), pp. 920–932, 1994.
[4] S. T. Barnard, "Stochastic stereo matching over scale," International Journal of Computer Vision, 3(1), pp. 17–32, 1989.
[5] V. Kolmogorov and R. Zabih, "Computing visual correspondence with occlusions using graph cuts," Proceedings of the International Conference on Computer Vision, pp. 508–515, 2001.
[6] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11), pp. 1222–1239, 2001.
[7] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," International Journal of Computer Vision, 47(1), pp. 7–42, 2002.
[8] L. Zhang and W. J. Tam, "Stereoscopic image generation based on depth images for 3DTV," IEEE Transactions on Broadcasting, 51(2), pp. 191–199, 2005.
[9] N. Hur, W. J. Tam, F. Speranza, C. Ahn, and S. I. Lee, "Depth image-based stereoscopic image rendering considering IDCT and anisotropic diffusion," IEEE International Conference on Consumer Electronics, pp. 381–382, 2005.
[10] P. Kauff et al., "Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability," Signal Processing: Image Communication, vol. 22, pp. 217–234, Oct. 2007.
[11] http://vision.middlebury.edu/stereo/.
[12] D. Scharstein and R. Szeliski, "High-accuracy stereo depth maps using structured light," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 195–202, 2003.
[13] Z. Zhang, C. Hou and J. Yang, "A Stereo Matching Algorithm based on Genetic Algorithm with Propagation Stratagem," Proceedings of the International Workshop on Intelligent Systems and Applications, Wuhan, China, pp. 1–4, 2009.
[14] P. Agrawal, J.-H. Yeh and J.-C. Chen, "IP Multimedia Subsystems in 3GPP and 3GPP2: Overview and Scalability Issues," IEEE Communications Magazine, vol. 46, pp. 138–145, Jan. 2008.
[15] L. H. Sahasrabuddhe and B. Mukherjee, "Multicast Routing Algorithms and Protocols: A Tutorial," IEEE Network, 14(1), pp. 90–102, 2000.
[16] K. L. Cheng and K. W. Cheuk, "Implementation and Performance Measurement of an Island Multicast Protocol," Proceedings of the IEEE International Conference on Communications, Piscataway, USA, pp. 1299–1303, 2005.
[17] S. Legg, RFC 3727: ASN.1 Module Definition for the LDAP and X.500 Component Matching Rules, 2004.
[18] OpenLDAP, OpenLDAP Software 2.3 Administrator's Guide, http://www.openldap.org, 2006.
Zhuoyun Zhang was born in China in 1982. She is a PhD student at Tianjin University, Tianjin, China. She received the B.E. degree in electronic and information engineering in 2006 and the M.E. degree in communication and information engineering
in 2008, both from Tianjin University. Her major fields of study are signal and image processing and computer vision. She has published 8 research papers in various conferences and journals.

Chunping Hou received her PhD degree from the School of Electronic and Information Engineering of Tianjin University, China, in 1998. She is currently a professor at Tianjin University. Her research interests lie in the field of image coding and transmission. She has published more than 20 papers and 2 books.

Zhigang Jin is a professor in the School of Electronic and Information Engineering at Tianjin University, China. His current research interests are network management and security. He has published more than 15 papers.