Block-based Belief Propagation With In-place Message Updating for Stereo Vision

Yu-Cheng Tseng, Nelson Yen-Chung Chang and Tian-Sheuan Chang
Department of Electronics Engineering, National Chiao-Tung University, Hsinchu, Taiwan, ROC
{tyucheng, ycchang, tschang}@twins.ee.nctu.edu.tw

Abstract—The large message-memory size has been a major design issue in belief propagation based stereo matching methods. To reduce this size, we propose an in-place message updating approach that stores new messages into expired ones. Thus, this approach needs only one message memory and a small temporary message memory. Further combined with block-based belief propagation, the final architecture needs only 172 KB of memory and reduces the internal memory size by 40% compared with the traditional ping-pong buffer approach.

I. INTRODUCTION

Finding the stereo correspondence has been one of the most extensively researched topics in two-camera stereo vision. In two-view images, the difference between two corresponding points projected from the same scene point is termed the disparity. The estimated disparity can be further used to compute scene depth and is therefore necessary in multi-view video coding, intelligent surveillance, and stereo vision systems for robotics and autonomous vehicles. However, the computational complexity of dense stereo correspondence is very large, even higher than that of motion estimation in video coding, since stereo correspondence is computed for each pixel location instead of each block position. Consequently, a PC or embedded processor cannot bear the massive computational load at speeds sufficient for real applications. Therefore, it is necessary to develop a hardware implementation to speed up stereo correspondence. Stereo correspondence algorithms can be categorized into local approaches and global approaches [1]. The local approaches use various window templates for block matching to find the disparity. These approaches are simple and can be easily implemented with DSPs [2] and VLSI hardware [3] to reach video-rate processing speed for low- and high-resolution images, respectively. Although local approaches have high throughput and low complexity, they suffer from inaccurate disparity results due to their inability to handle textureless and occluded regions. Aware of this limitation of local approaches, global approaches take the whole image into consideration during disparity estimation. One global approach is dynamic programming (DP), which optimizes an objective function on

a 1-D graph model. This approach, with its regular computation, can be implemented with 60 parallel processing elements (PEs) to achieve real-time performance at 640x480 image size with 60 disparity levels [4]. However, streak-line artifacts occur in the disparity result because vertical consistency is not considered in the 1-D graph model. To take both horizontal and vertical consistency into consideration, belief propagation (BP) [5], which performs optimization on a 2-D graph model, has been proposed. BP-based disparity estimation has been reported to produce some of the most accurate disparity results. However, it also has higher computational complexity than DP. To relieve the complexity of BP and improve the quality of the disparity result, Felzenszwalb et al. [6] proposed a simplified smoothness constraint and a hierarchical belief propagation (HBP) framework, which performs the BP algorithm in a coarse-to-fine manner. Following this HBP framework, Park et al. [7] proposed a real-time hardware architecture for 320x240 image size with 32 disparity levels and implemented it on two FPGA boards. However, an extremely large internal memory of 880 KB is required in this implementation. Such a massive storage requirement becomes one of the most crucial challenges in designing efficient hardware architectures for BP-based approaches. In summary, BP has better quality but also demands high memory cost. To address the above problems, we propose a hardware architecture based on block-based BP (BBP) [8], which significantly reduces the memory size by partitioning an image into separate independent blocks. In addition, instead of the traditional ping-pong buffer approach, we propose an in-place message updating approach to reduce the massive memory usage. This approach can also be applied to all BP-based architectures. With the proposed in-place message updating approach, the memory size can be reduced by 40% compared with the storage needed in the typical ping-pong buffer approach.
The organization of this paper is as follows. Section II reviews the concept of stereo vision and the block-based belief propagation algorithm. Section III presents the BBP hardware architecture with in-place message updating. Section IV shows the memory cost and its comparisons. Finally, Section V concludes this paper.

II. OVERVIEW OF BELIEF PROPAGATION IN STEREO VISION

A. Stereo vision

Fig. 1(a) illustrates the geometry of a parallel two-camera stereo vision configuration and explains how depth information can be computed from disparity. Given the focal length f of the cameras, the distance B between the two cameras, and the disparity d, the depth of the object is Z = fB/d. Therefore, it is important to estimate disparity accurately. Fig. 1(b) illustrates the disparity map computed from the left and right images. The disparity of each pixel is represented as a gray level, with brighter levels representing larger disparities and darker levels representing smaller disparities.

B. Belief propagation

Fig. 2 shows the flow and the basic 2-D graph structure of the BP algorithm proposed by Sun [5]. First, the matching cost (D term) of each pixel over all disparities is computed by a pixel dissimilarity measure. Then, the messages of each pixel are iteratively updated from the messages of the previous iteration and the D term. Take Fig. 2(b) for example: the injecting messages m_N^t, m_S^t, m_E^t and the D term of node (x,y) at iteration t, together with the V term, which constrains the message passing, produce the new injecting message m_E^{t+1} of node (x-1,y) at iteration t+1. In this example, the message updating function can be written as

m_{(x-1,y),E}^{t+1}(d_t) = \min_{d_s} \{ D_{(x,y)}(d_s) + V(d_s, d_t) + \sum_{k=N,S,E} m_{k,(x,y)}^t(d_s) \},  (1)

where d_s and d_t are the disparities of the source and target nodes, respectively. Note that the previous injecting messages m_N^t, m_E^t, m_S^t, and m_W^t of node (x,y) produce only one new message for each neighboring node: m_E^{t+1} of node (x-1,y), m_S^{t+1} of node (x,y-1), m_W^{t+1} of node (x+1,y), and m_N^{t+1} of node (x,y+1). After several iterations, the messages and the D term of each pixel are finally accumulated to compute the belief value of each disparity, and the disparity with the minimal belief value is selected. The disparity selection of a node (x,y) can be written as

d_{(x,y)} = \arg\min_{d_s} \{ D_{(x,y)}(d_s) + \sum_{k=N,S,W,E} m_{k,(x,y)}^T(d_s) \}.  (2)
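As a concrete illustration, equations (1) and (2) can be sketched for a single node. This is a minimal sketch under our own assumptions: the truncated-linear V term (in the spirit of [6]) and the toy disparity range and costs are illustrative, not the paper's fixed-point arithmetic.

```python
# Sketch of one BP message update (eq. 1) and disparity selection (eq. 2).
# The truncated-linear V term and toy numbers are illustrative assumptions.

L = 4  # disparity range (toy value; the paper uses 32)

def V(ds, dt, lam=1.0):
    # Simple truncated-linear smoothness cost (an assumption; cf. [6]).
    return lam * min(abs(ds - dt), 2)

def update_message(D, incoming, dt):
    """New message m^{t+1}(dt) sent to one neighbor, as in eq. (1).

    D        : matching costs D_{(x,y)}(ds) for ds = 0..L-1
    incoming : messages from the three other neighbors, one list per neighbor
    """
    return min(
        D[ds] + V(ds, dt) + sum(m[ds] for m in incoming)
        for ds in range(L)
    )

def select_disparity(D, incoming):
    """Disparity with the minimal belief (eq. 2), using all four neighbors."""
    beliefs = [D[ds] + sum(m[ds] for m in incoming) for ds in range(L)]
    return min(range(L), key=lambda ds: beliefs[ds])

D = [3.0, 1.0, 0.5, 4.0]                  # toy matching costs
zero = [0.0] * L                          # all-zero messages at iteration 0
msg = [update_message(D, [zero, zero, zero], dt) for dt in range(L)]
d = select_disparity(D, [zero, zero, zero, zero])
print(msg, d)   # disparity 2 has the lowest cost here
```

With zero initial messages the belief reduces to the D term, so the selected disparity is simply the one with the lowest matching cost; the V term starts to matter once messages carry neighbor information.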

In the above algorithm flow, the message updating process has the highest computational complexity and the largest memory requirement, which are O(NL^2 T) operations and 4NL messages respectively, where N is the image size, L is the disparity range, and T is the number of updating iterations. To reduce the massive memory usage, the BBP algorithm [8] partitions an image into regular independent blocks and


Figure 1. Geometry of stereo correspondence.

Figure 2. Flow of the belief propagation algorithm and illustration of the graph model.

performs BP for each block individually. Although BBP decreases the quality of the disparity result by 0.47~3.06% due to block boundary artifacts, the memory usage in message updating is reduced from 4NL to 4BL messages, with B being the block size.
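The 4NL-to-4BL reduction can be checked with a short sketch. This is our illustration only: we count message entries, and the one-byte-per-entry assumption is ours, not the paper's word width.

```python
# Message-memory footprint of whole-image BP (4NL entries) versus
# block-based BP (4BL entries). One byte per entry is an illustrative
# assumption; the real word width is design-specific.

def bp_message_entries(num_nodes, L):
    # Four directional messages per node, each over L disparity levels.
    return 4 * num_nodes * L

W, H, L = 352, 288, 32           # CIF image, 32 disparity levels
B = 32 * 32                      # one 32x32 block

full  = bp_message_entries(W * H, L)   # whole-image BP working set
block = bp_message_entries(B, L)       # BBP working set (one block at a time)

print(full // 1024, "KB vs", block // 1024, "KB")
```

At one byte per entry the block working set is 128 KB, which matches the single MM size quoted in Section III.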

III. ARCHITECTURE

To explain the proposed hardware architecture, we assume a 100 MHz clock rate, a target image of CIF (352x288) size with a disparity range of 32 pixels, and a block size of 32x32.

A. Proposed BBP architecture

The proposed hardware architecture of the BBP algorithm is shown in Fig. 3. The architecture includes an external memory and a BBP core connected with a 32-bit data port. The external memory and internal memory can be implemented with SDRAM and SRAM, respectively. For estimating the throughput and latency of the architecture, the simplified external memory model supports burst accesses of 8 words after a latency of 6 cycles for precharge and activation. The BBP core consists of three PEs, the D term and message memories, and some buffers. Corresponding to the BP algorithm flow, the three PEs are the D term PE, the message updating PE, and the belief computing PE. The internal memories, which are single-port SRAMs with 16-byte width, include a D term memory for the matching cost, four direction message memories (MM), and one temporary message memory (TMM) for the in-place message updating schedule. The buffering registers placed between the memories and the PEs provide the parallel data access needed for the PEs' computation.

Figure 3. Overall hardware architecture of the BBP algorithm.

The operation of BBP is described below. In overview, one block of the images is read into the BBP core, and the disparity result of this block is produced and written into the external memory. For one block, the detailed computing process of the BBP core is as follows. First, the RGB pixel rows from the left and right images are loaded into the input buffer, and then the D term PE computes the matching cost of each node in this row. Completing the matching cost computation of a whole block with 32 rows takes 5,888 cycles. Then, the message updating PE, which directly implements equation (1), produces new messages using the D term and the previous messages. For each node, this PE computes one directional message at a time. After completing the computation of all four directional messages, it proceeds to the next node. These new messages then need to be stored into the MM for the next updating iteration. Updating all nodes of one block for one iteration requires 5,129 cycles. The details of the in-place message updating approach are presented in the next subsection. After several iterations, the belief computing PE gathers the final data from the message and D term memories to compute the belief values and disparity results of one block. This PE produces the disparity result of one node at a time. As soon as the output buffer has accumulated 32 disparity results, they are written into the external memory. Producing the disparity results of one block requires 2,624 cycles. Note that there is no pipelining among the three PEs because of the limited ports of the internal memories, which cannot serve multiple PEs at the same time.

B. In-place message updating approach

In the BBP architecture, the message updating process requires a large internal memory. An intuitive design for this updating process is the ping-pong buffer approach shown in Fig. 4(a). In the ping-pong buffer approach, new messages are computed from the previous messages in the first message memory and stored into the second message memory. In the next iteration, the new messages are computed from the messages in the second message memory and stored into the first message memory. For the proposed BBP architecture with a 32x32 block size, this ping-pong buffer approach requires two message memories with a total size of 256 KB.

To solve this problem, Fig. 4(b) shows the in-place message updating approach proposed in this work. The new messages are computed from the old messages in the MM and buffered in a TMM. When the old messages in the MM are no longer needed, the messages in the TMM overwrite these expired messages. In contrast with the ping-pong buffer approach, the in-place message updating approach requires only one MM of 128 KB and an additional TMM of 12 KB.

Figure 4. Comparison of the ping-pong buffer approach and the in-place message updating approach.
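The idea can be illustrated with a toy 1-D sketch. This is our simplification, not the paper's RTL: each new value depends on old neighboring values, so a new value sits in a small temporary buffer and is written back into the main memory only once its destination slot has expired, i.e. no later update still reads it.

```python
# Toy 1-D illustration of in-place updating with a small temporary buffer.
# Position i's new value reads old positions i-1..i+1, so slot i-1 expires
# only after update i has been computed. This mirrors the MM/TMM idea in a
# simplified form (our assumption), not the exact hardware schedule.

def inplace_update(mm, f):
    tmm = {}                          # temporary buffer (the "TMM")
    n = len(mm)
    for i in range(n):
        left  = mm[i - 1] if i > 0 else 0
        right = mm[i + 1] if i < n - 1 else 0
        tmm[i] = f(left, mm[i], right)
        # Slot i-1 is now expired: no later update reads it, so its new
        # value can overwrite the old one in the main memory.
        if i - 1 in tmm:
            mm[i - 1] = tmm.pop(i - 1)
    mm[n - 1] = tmm.pop(n - 1)        # flush the last pending value
    return mm

mm = [1, 2, 3, 4]
inplace_update(mm, lambda l, c, r: l + c + r)
print(mm)  # same result as a double-buffered update: [3, 6, 9, 7]
```

The temporary buffer never holds more than two pending values here, which is the point of the scheme: a small TMM replaces a full second message memory, while producing the same result as the ping-pong approach.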

Fig. 5 illustrates the architecture of the TMM. To support the in-place message updating approach using only single-port memories, the TMM is partitioned into N, S, W, and E direction message memory groups. For each group, three memory parts denoted A, B, and C are used. Fig. 6 illustrates in detail an example of how the three TMM parts work in the in-place message updating process. In the figure, each node of the 2-D graph model is represented by a circle, and the four injecting messages of a node are represented by squares. A black square marks an updated message. The squares located in the gray region represent expired messages, which will not be needed in the future. In step 1, the new messages updated from the messages of (x+1, y) are being processed. The four updated new messages are m_S^{t+1}, m_W^{t+1}, m_E^{t+1}, and m_N^{t+1}. Suppose these new messages should be written into the m_S^t position of (x+1,y-1), the m_W^t position of (x-1,y), the m_E^t position of (x+2,y), and the m_N^t position of (x+1,y+1) in the MM. However, the unused messages at these positions are still needed to produce other new messages later. Therefore, the new messages cannot immediately overwrite the memory locations storing unused messages and must be temporarily stored in the TMM, as shown in the figure. In step 2, once all the new messages of (x+1,y-1) have been collected in the TMM and all the old messages of (x+1,y-1) in the MM have expired, the new messages are written back into the MM, as shown in the figure. By doing step 1 and step 2 sequentially, concurrent read and write conflicts in the TMM are avoided. Therefore, the TMM can be implemented using only single-port memories. After node (x+1, y) has been processed,


Figure 5. Architecture of the temporary message memories.


Figure 6. An example of in-place message updating.

node (x+2, y) will be processed next in step 3. When all the nodes in row y have been processed, the TMM part A becomes empty and can then serve the next row, y+1. Therefore, only three TMM parts are required to support the in-place message updating approach.

IV. ANALYSIS AND COMPARISON

TABLE I compares the traditional ping-pong buffer approach and the proposed approach in terms of internal memory usage. In the BBP architecture, both approaches require the same D term memory of HWL bytes, where H is the block height, W is the block width, and L is the disparity range. The ping-pong buffer approach requires two MMs of the same size, while the proposed approach needs only one MM and a small extra buffering TMM. For the specification given in Section III, the proposed approach reduces the internal memory usage by 40% compared with the ping-pong buffer approach.
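The totals in Table I can be reproduced with a quick arithmetic check; this sketch simply follows the table's HWL byte accounting.

```python
# Reproduce the Table I memory totals for H = W = 32, L = 32.
H = W = L = 32
HWL = H * W * L                       # D term memory, in bytes

ping_pong = 9 * HWL                   # D term + two 4HWL message memories
in_place  = 5 * HWL + 12 * W * L      # D term + one MM + small TMM

print(ping_pong // 1024, "KB")        # 288 KB
print(in_place // 1024, "KB")         # 172 KB
print(round(100 * in_place / ping_pong, 1), "%")   # 59.7 %
```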

TABLE I. PERFORMANCE COMPARISON OF DIFFERENT HARDWARE ARCHITECTURES

                                BBP with ping-pong       BBP with in-place
                                buffer approach          message updating approach
  D term memory                 HWL byte                 HWL byte
  MM                            8HWL byte                4HWL byte
  TMM                           -                        12WL byte
  Total internal memory size    9HWL byte                (5HWL+12WL) byte
  Example with H=W=32, L=32     288 KB (100%)            172 KB (59.7%)

V. CONCLUSION

This paper proposes a hardware architecture for the BBP algorithm with in-place message updating that reduces the memory size by 40%. Though the in-place message updating is explained in the context of our BBP work, it can be applied to other BP-based architectures as well. In the future, we will implement this architecture and integrate it into a stereo vision system.

VI. ACKNOWLEDGEMENT

This research is supported by the National Science Council under grant NSC96-2220-E-009-038.

REFERENCES

[1] D. Scharstein and R. Szeliski, "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms," International Journal of Computer Vision, vol. 47, no. 1-3, pp. 7-42, 2002.
[2] N. Chang, T. M. Lin, T. H. Tsai, Y. C. Tseng, and T. S. Chang, "Real-time DSP Implementation on Local Stereo Matching," in Proc. International Conference on Multimedia & Expo (ICME), 2007.
[3] J. Diaz, E. Ros, R. Carrillo, and A. Prieto, "Real-Time System for High-Image Resolution Disparity Estimation," IEEE Trans. on Image Processing, vol. 16, no. 1, pp. 280-285, Jan. 2007.
[4] Y. S. Kim, S. C. Park, C. Chen, and H. Jeong, "Real-time Architecture of Stereo Vision for Robot Eye," in Proc. International Conference on Signal Processing (ICSP), 2006.
[5] J. Sun, N. N. Zheng, and H. Y. Shum, "Stereo Matching Using Belief Propagation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 787-800, Jul. 2003.
[6] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient Belief Propagation for Early Vision," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1261-1268, 2004.
[7] S. Park, C. Chen, and H. Jeong, "VLSI Architecture for MRF Based Stereo Matching," in Proc. International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 55-64, 2007.
[8] Y. C. Tseng, N. Chang, and T. S. Chang, "Low Memory Cost Block-based Belief Propagation for Stereo Correspondence," in Proc. International Conference on Multimedia & Expo (ICME), 2007.