GPU optimization of the SGM stereo algorithm
Istvan Haller, Sergiu Nedevschi
Department of Computer Science, Technical University of Cluj-Napoca, Cluj-Napoca, Romania
[email protected],
[email protected]
Abstract—GPU hardware architectures have evolved into a suitable platform for the hardware acceleration of complex computing tasks. Stereo vision is one such task, where acceleration is desirable for robotic and automotive systems. Much research has been invested in developing stereo vision algorithms with increased quality, but real-time implementations are still lacking. In this work we focus on creating a real-time dense stereo reconstruction system. We selected the Semi-global Matching method as the basis of our system due to its high quality and reduced computational complexity. The Census transform is selected as the matching metric because our results show that it reduces matching errors on traffic images compared to classical solutions. We also present two modifications to the original Semi-global Matching algorithm that improve the sub-pixel accuracy and the execution time. The system was implemented and evaluated on a current generation GPU, with a running time of 19 ms for images with a resolution of 512x383.

Keywords—stereo vision, GPU, real-time, sub-pixel
I. INTRODUCTION
Stereo vision has been an intensive research area in the last decades. The proposed solutions were originally split into two main categories, local and global methods [1]. Later a third category was introduced to separate some of the algorithms from the global methods [2]. This third category includes the semi-global methods, which are based on global optimizations but whose computational complexity is reduced enough to allow real-time implementations.

The group of local algorithms uses a finite support region around each point to calculate the disparities. These methods are built around the selected matching metric and usually apply some matching aggregation for smoothing. The appropriate disparity for each pixel is selected by searching for the minimum of the calculated matching costs. The main advantage of local methods is their small computational complexity, which allows for real-time implementations [3]-[4]. The main disadvantage is that only local information is used at each step; as a result these methods are not able to handle featureless regions or repetitive patterns.

978-1-4244-8230-6/10/$26.00 ©2010 IEEE

Global algorithms are able to improve the quality of the disparity map by enforcing several global constraints in the disparity selection phase. Although benchmarks show a significant improvement in disparity map quality, these methods are not applicable to real-time applications [5]. The running times are several orders of magnitude larger than those achieved by local methods, usually in the range of tens of seconds even on current hardware [6].

The group of semi-global methods was proposed in order to separate out the global methods which use only 1D optimizations. Some older algorithms belonging to this group are the dynamic programming and scan-line optimization methods, which apply the constraints only along the epipolar line. The result is a reduction of the computational complexity of the optimization problem to polynomial time. In 2005 Hirschmüller proposed Semi-global Matching (SGM) [7] as a new alternative which achieves results similar to global methods while maintaining a reduced execution time.

The Semi-global Matching algorithm performs multiple 1D energy optimizations on the image. The different 1D paths run at different angles to approximate a 2D optimization. By using multiple paths instead of a single one, it avoids the streaky behavior common with previous algorithms such as dynamic programming or scan-line optimization. The energy optimization is based on a correlation cost and a smoothness constraint. The smoothness is enforced by two components: a small penalty, P1, used for small disparity differences, and a larger penalty, P2, used for disparity discontinuities. The larger penalty is adaptive and based on intensity changes to help with object borders. The form of the energy function is:
E(D) = \sum_{p} \Big( C(p, D_p) + \sum_{q \in N_p} P_1 \, T\big[\,|D_p - D_q| = 1\,\big] + \sum_{q \in N_p} P_2 \, T\big[\,|D_p - D_q| > 1\,\big] \Big)
where D is the set of disparities, C is the cost function and Np is the neighborhood of the point p in all directions. The function T maps true to 1 and false to 0. Dp and Dq represent the selected disparities at the points p and q.

The Middlebury benchmark [5] shows the results achieved using this method compared to both global and local methods. The algorithm consistently achieves results similar to global methods while scoring much higher than the local ones. Several real-time implementations have also been proposed for smaller resolution images [8]-[10]. These results show that the method represents a good compromise between speed and accuracy for real-time systems such as automotive applications.

For development we use an NVIDIA GTX 280 GPU [11]. The GPU is a two-tier hardware architecture from both the execution and the memory hierarchy standpoint. Our GPU is composed of 10 Thread Processing Clusters with 3 Scalar Multiprocessors (SM) each. All of the Scalar Multiprocessors have fully independent resources and schedulers; from the programmer's point of view they represent a Multiple Instruction/Multiple Data architecture. Communication between these blocks is performed through the large off-chip global memory, but no synchronization mechanism is available during execution. This limits the type of problems which can be efficiently handled by this architecture. Access to the global memory is also limited by the throughput of the memory chip, which is about an order of magnitude lower than the processing speed. Eight processing cores reside in each Scalar Multiprocessor, but scheduling is performed at the SM level, similar to a Single Instruction/Multiple Data architecture. Each Scalar Multiprocessor also contains a register file and a shared memory block. Access to these resources is very fast, but they have limited size and are shared by all threads executing on the SM. The use of shared memory is critical because it allows synchronized communication between threads while having a latency similar to the register file.

In this paper we present an improvement of the SGM algorithm proposed by Hirschmüller [7]. During our research we have identified two main deficiencies of the original algorithm: the lack of real-time implementations for larger resolutions and the reduced depth accuracy of the method. We propose modifications to improve both of these characteristics.
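The energy function above can be evaluated directly for a candidate disparity map. The following sketch (using a 4-connected neighborhood and toy cost values of our own choosing; this is illustrative code, not the paper's implementation) shows the computation:

```python
def energy(D, C, P1, P2):
    """SGM energy: data term C(p, D_p) plus neighborhood smoothness penalties."""
    h, w = len(D), len(D[0])
    E = 0
    for y in range(h):
        for x in range(w):
            E += C[y][x][D[y][x]]  # data term C(p, D_p)
            # 4-connected neighborhood N_p
            for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    diff = abs(D[y][x] - D[ny][nx])
                    if diff == 1:
                        E += P1  # small penalty for a one-level disparity step
                    elif diff > 1:
                        E += P2  # larger penalty for a disparity discontinuity
    return E
```

Note that each neighboring pair contributes a penalty term from both of its pixels, since the sum over N_p is taken at every point p.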
We also present an implementation of this improved version capable of processing larger resolutions than existing solutions while maintaining real-time constraints. The paper starts by presenting existing stereo systems which we use as references. This is followed by an overview of the stereo-reconstruction pipeline, highlighting our improvements. Some optimization details are also presented. Finally, evaluations show the benefits of our proposals and the performance of the complete system.

II. RELATED WORK

The first reference system is the DeepSea system created by the company Tyzx [12]. The method uses local correlation combined with the Census transform for increased accuracy and a high data rate. It is one of the fastest methods available, able to provide a disparity map in 11 ms for images of resolution 512x383.

For improving quality, research has shifted from local methods to the Semi-global method. Gehrig et al. presented a real-time solution using an FPGA [8], but the supported resolution is limited. Although the paper specifies input images having up to 400000 points, the final disparity map size is reduced to one quarter due to downsampling.

There have also been attempts to implement the algorithm on GPUs. This hardware has the main advantages of large-scale availability and an easier programming interface. In 2008 Ernst and Hirschmüller presented an implementation of the method using hierarchical mutual information as the matching metric [9]. The solution was designed for the NVIDIA GeForce 8800 ULTRA graphics board and reached 8.8 fps for an image having the resolution 450x375 and a disparity range of 64. They were also able to achieve real-time processing on smaller resolutions.

III. STEREO PIPELINE

A. Overview

The general stereo system can be considered a pipeline of processing steps. The first step of the pipeline is the calculation of correlation scores between the two images using a selected matching metric and an optional aggregation of values. Using these correlation scores, the correct disparity value can be selected either by a simple winner-takes-all method or by applying a global optimization. Either the left or the right image, or optionally both, can be used as reference images. The latter allows a consistency check between the two results, efficiently filtering errors and occluded areas. The resulting disparities can be further refined by a sub-pixel interpolation step. In figure 1 we present an overview of our stereo pipeline.

Fig. 1. System Architecture
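The pipeline stages described above can be sketched as a simple function chain. All names and signatures here are illustrative placeholders, not the paper's actual interfaces; each stage is passed in as a function so the skeleton mirrors the structure of Fig. 1:

```python
def stereo_pipeline(left, right, max_disp,
                    match_cost, aggregate, optimize, select, refine):
    """Illustrative pipeline skeleton: matching metric -> aggregation ->
    global optimization -> disparity selection -> refinement."""
    costs = match_cost(left, right, max_disp)   # correlation scores
    costs = aggregate(costs)                    # optional window aggregation
    costs = optimize(costs)                     # e.g. SGM smoothing step
    disp_l = select(costs, reference="left")    # winner-takes-all + filtering
    disp_r = select(costs, reference="right")   # second pass, right reference
    return refine(disp_l, disp_r)               # consistency check + sub-pixel
```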
Usually the Semi-global Matching algorithm represents the core of the matching cost calculation. During our study we decided on a different application of the algorithm: in our system it is used only as an auxiliary step which helps to generate extra information in featureless regions. The core of our algorithm is a local algorithm which was specially selected and tuned to provide accurate results even without applying the Semi-global Matching step.

B. Matching score calculation and window aggregation

1) Problem proposal

As noted previously, the matching score calculation step represents the most important part of our pipeline. Both the matching selection and the sub-pixel interpolation are based on the information generated in this step, so both pixel and sub-pixel level results are directly affected by this data. The most important requirement is to limit erroneously small matching scores for invalid matches. This problem is visible in regions having limited features and for repetitive patterns. Although global algorithms can limit the effect of these problems, it is still more beneficial to use more accurate matching scores. The problem can be handled by extracting more powerful features which can help differentiate between the different regions. This is achieved by using different matching metrics or by filtering the input image to enhance the features. Another important requirement is the ability to model the correlation function, which is the basis of sub-pixel interpolation.

2) Our solution using the Census transform

For the matching metric the Census transform was chosen due to its invariance to intensity and contrast differences [12]. Evaluation papers have shown that it represents one of the best metrics for matching correlation in difficult conditions [13].
Other papers have shown that the original metrics proposed for SGM, namely Birchfield-Tomasi and Hierarchical Mutual Information, have problems with image luminosity differences and vignetting, common errors in production systems [8]. The Census transform is efficient in these cases as it is applied only locally and is indifferent to luminosity and contrast changes. To our knowledge, the Census transform has only been used with the SGM method in evaluation works: the extra cost of using this metric was considered too high for other implementations, but using the GPU architecture we were able to limit this overhead.

The Census transform is performed on a 9x9 window to provide immunity to noise and contains 2-bit values for 16 positions. The Census window setup can be seen in figure 2. Comparison of the pixels from the left and right image is performed using the Hamming distance, and the results are aggregated across a 5x5 window. This aggregation is required because the matching values are limited for each pixel. By aggregating the values over a small window we achieve better smoothness and a larger spread of the matching cost.
Fig. 2. Census window setup; black represents the selected pixels
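To make the metric concrete, here is a minimal sketch of a Census transform and Hamming-distance cost. For brevity it uses a dense 3x3 window with 1-bit comparisons rather than the paper's sparse 16-position, 2-bit 9x9 pattern, but the principle is the same:

```python
def census(img, y, x, win=3):
    """Census bit string: compare each neighbor in the window to the
    center pixel (1 if darker than the center, 0 otherwise)."""
    c = img[y][x]
    bits = 0
    r = win // 2
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue  # the center pixel is not compared to itself
            bits = (bits << 1) | (img[y + dy][x + dx] < c)
    return bits

def hamming(a, b):
    """Matching cost between two Census codes: count of differing bits."""
    return bin(a ^ b).count("1")
```

A per-pixel cost is then the Hamming distance between the left and right Census codes, which the pipeline aggregates over a small window before disparity selection.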
C. Semi-global matching

1) Application

For our implementation the Semi-global Matching algorithm is used separately from the matching selection step. It applies an optimization, based on the smoothness constraint, to the matching scores. Other methods use multi-window and adaptive-window approaches for this purpose; these methods increase the effective aggregation area to allow a correction of matching scores in featureless regions. The advantage of our approach is that it is not limited by a finite aggregation area. A global smoothness constraint is applied to the image, and the results are not affected by the size of the featureless regions. In the following we propose some modifications to the original algorithm of Hirschmüller [7]. These concern the running time and the sub-pixel accuracy of the system.

2) Improvements to the original method

The literature notes that the SGM algorithm requires a minimum of 8 directions for the optimization, while 16 are recommended for maximum quality. In order to reduce the computational cost of the algorithm we propose the use of only 4 directions. This idea was evaluated in [14], but the authors concluded that the best option is to choose 8 directions. Their results show an increase in the number of errors from 12.8% to 14% when choosing only 4 directions. This represents an increase of less than 10% relative to the number of errors. For real-time systems the significant gain in execution time compensates for this small increase in the number of errors. In the results section we will evaluate the effect of this modification on our image scenes to see if it matches the findings of that paper.

During the evaluation of the results we have discovered an important issue with the original Semi-global algorithm. In the introduction we described the two penalties used by the algorithm [7].
During the evaluation of the sub-pixel accuracy we discovered that a small value for the P1 penalty generates an increased scatter in the distance information. The scatter is also evident in textured regions, resulting in depth uncertainties. The alternative of using a large value for P1 would not be mathematically correct, as it should be smaller than the smallest value of the adaptive P2. Thus in our implementation we decided to remove this part of the algorithm. The energy function becomes:

E(D) = \sum_{p} \Big( C(p, D_p) + \sum_{q \in N_p} P_2 \, T\big[\,D_p \neq D_q\,\big] \Big)

We will evaluate the effect of this modification in the results section.
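The corresponding 1D path accumulation, with the P1 term removed, can be sketched as follows. As in the original SGM recurrence, the previous minimum is subtracted to keep the accumulated values bounded. This is illustrative code with toy values, not the GPU kernel:

```python
def sgm_path(costs, P2):
    """Accumulate matching costs along one 1D path with the simplified
    recurrence (P1 removed):
    L(p,d) = C(p,d) + min(L(p-1,d), min_k L(p-1,k) + P2) - min_k L(p-1,k)."""
    n_pix, n_disp = len(costs), len(costs[0])
    L = [list(costs[0])]  # first pixel: raw matching costs
    for p in range(1, n_pix):
        prev = L[-1]
        m = min(prev)  # minimum over all disparities at the previous pixel
        L.append([costs[p][d] + min(prev[d], m + P2) - m
                  for d in range(n_disp)])
    return L
```

With a larger P2, changing disparity along the path becomes more expensive, which is exactly the smoothness behavior the energy function encodes.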
D. Matching selection

1) Left-right/right-left matching

The winner-takes-all algorithm is used for selecting the best match. We also apply a confidence-based filtering, which is recommended as an efficient method for removing errors when not enough information is available [5]. It can also be tuned to the image characteristics and the matching method through its threshold value. It filters points based on the cost difference between the best and the third-best match; the point is invalidated if the difference is too small.

2) Left-right consistency check

Occlusion detection is performed using the left-right consistency check. This method is recommended in the literature for local methods [5], but it is also used in Semi-global Matching based implementations [8]-[9]. The method uses the uniqueness constraint, which assumes that any point in the left image has a single match in the right image. This filtering checks whether the disparity values are equal for both directions. Usually in occluded areas the two values do not match because the image region visible in the left image is missing from the right image. As a result, occlusions are removed efficiently without affecting correct matches. The algorithm requires us to perform the matching step twice, using both the left and the right image as the reference image. Even so, the overhead is limited because we reuse the same matching cost matrix calculated in the previous step.
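A minimal sketch of the two selection steps described above. The helper names, the sentinel value, and the margin are illustrative choices; the confidence filter compares the best and third-best costs as stated in the text:

```python
INVALID = -1  # sentinel for filtered-out pixels (illustrative choice)

def wta_with_confidence(costs, min_margin):
    """Winner-takes-all over one pixel's cost vector, with the confidence
    filter: invalidate when the best and third-best costs are too close."""
    ranked = sorted(range(len(costs)), key=lambda d: costs[d])
    best = ranked[0]
    if len(ranked) >= 3 and costs[ranked[2]] - costs[best] < min_margin:
        return INVALID
    return best

def lr_check(disp_left, disp_right, tol=0):
    """Left-right consistency check on one scanline: keep a disparity d at
    column x only if the right image maps x - d back to (about) the same d."""
    out = []
    for x, d in enumerate(disp_left):
        if d != INVALID and 0 <= x - d < len(disp_right) \
           and abs(disp_right[x - d] - d) <= tol:
            out.append(d)
        else:
            out.append(INVALID)  # occluded or inconsistent
    return out
```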
IV. CUDA OPTIMIZATIONS
For implementing the pipeline using the CUDA interface we had to take the parallelization possibilities into account. Due to the hardware architecture presented in the introduction, two levels of parallelism are available in the CUDA interface: a coarse level without inter-thread communication and a fine level where threads can share data. To improve performance the designer should maximize both, as otherwise the device may not be used to its maximum capability.

As a result of this limitation we chose to implement each pipeline step as a separate pass; for more complex steps, sub-steps were also used. Communication between the steps is performed using a global 3D matrix storing values for each pixel location and each disparity. Although this implementation choice increases the number of global memory accesses, it allows us to optimize each step individually by using different parallelization patterns, each suited to the step's requirements. As a result we were able to define coarse-level parallelism on one of the dimensions of the matrix and fine-level parallelism on another dimension for all of the steps except the SGM; this exception is discussed below. For the fine-scale parallelism the image columns are used, because the interface requires that all global memory accesses be grouped. Disregarding this limitation would result in a significant degradation of performance. By accessing elements on consecutive columns for any given row and disparity, we guarantee that the interface can group the accesses and maximize global memory bandwidth. The resulting large-scale parallelism is important to reach the full capacity of current devices and to scale efficiently with the future growth of the hardware.

For the SGM step we are not able to define the two separate dimensions of parallelism due to the memory access limitation. For each 1D path required by the method, the values along one direction depend on all the values previously calculated in that direction. Furthermore, the method does not allow coarse-level parallelization over the disparity values because of the minimum cost calculation performed for each pixel. We had to find a way to perform coarse-level parallelization while retaining an efficient memory access pattern. Our idea is to generate a transposed copy of the cost matrix to be used with the horizontal directions. This allows us to use the same optimized traversal for both the horizontal and the vertical directions, and it also removes the dependencies between elements on different columns. The latter is important because it helps us group the memory accesses to consecutive elements in memory.

Our solution to the parallelization problem is to use image column groups of 32 elements. Between the different groups we can apply coarse-level parallelism, while we retain grouped memory accesses through the elements in the group. The fine-level parallelism is applied between the 32 elements in each group and also between the different disparities. Our experiments have shown that at least 32 elements are required in a group, otherwise performance is reduced due to inefficient memory access operations. In order to increase the coarse-level parallelism we also run 2 directions simultaneously. This allows us to take better advantage of the inherent parallelism in the hardware.
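The transposition idea can be illustrated on a toy cost volume: running the column-wise (vertical) accumulation on a transposed volume is equivalent to running the row-wise (horizontal) pass on the original, which is what lets one optimized traversal serve both directions. This is a Python sketch of the equivalence, not the CUDA code:

```python
def accumulate(path, P2):
    # 1D SGM-style accumulation along a sequence of per-pixel cost vectors
    out = [list(path[0])]
    for c in path[1:]:
        prev = out[-1]
        m = min(prev)
        out.append([c[d] + min(prev[d], m + P2) - m for d in range(len(c))])
    return out

def transpose_volume(C):
    # swap the row and column axes of a (rows x cols x disparities) volume
    return [[C[y][x] for y in range(len(C))] for x in range(len(C[0]))]

def horizontal_pass(C, P2):
    # accumulate left-to-right along every image row
    return [accumulate(row, P2) for row in C]

def vertical_pass(C, P2):
    # accumulate top-to-bottom along every image column
    cols = transpose_volume(C)
    return transpose_volume([accumulate(col, P2) for col in cols])
```

By construction, `vertical_pass(C)` equals `transpose_volume(horizontal_pass(transpose_volume(C)))`, so a single kernel tuned for row-wise, coalesced traversal handles both directions once the transposed copy exists.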
V. RESULTS
A. Census compared to other metrics

We compared the matching performance without applying the Semi-global step. This allows a more accurate view of the raw data provided by the different matching metrics. For the comparison we chose the SAD using a Laplace-type filter, the ZSAD, and the Census transform. The filter was used to remove the slight difference in luminosity between images. All methods use a common window size of 5x5; the size of the Laplace-type filter is the same as the window size. In figure 3 we present an example image from our evaluation. These images show that the ZSAD metric is much better at extracting information on the road surface than the simpler SAD, but none of the methods can compare to the Census transform in terms of error count. The Census transform also provides a significant increase in the reconstructed point count. We attribute the difference in quality to the extra directional information extracted with the Census method. Compared to SAD, it normalizes the intensity differences while preserving the position information. This locality helps the stereo matching to handle ambiguities.

Fig. 3. Comparison of different cost functions, top left is left image of stereo pair, top right is SAD+Laplace, bottom left is ZSAD, and bottom right is Census

B. Semi-global method complexity

The use of 4 or 8 directions in the SGM algorithm was also compared on the previous image set. The same example image is presented in figure 4 using the two settings for the SGM algorithm. The results show limited differences between the images. For the image generated using 8 directions the number of errors is slightly reduced, but not statistically significantly. We believe that this is due to our study case of traffic scenes, where large surfaces are usually oriented along the x or y axis. As a consequence the diagonal directions are not as relevant for this type of image as for other scenes, and we are able to reduce the running time without noticeably affecting the quality of the disparity map.

Fig. 4. Left image generated with 4 directions, right image with 8

C. Semi-global matching evaluation

For this evaluation we used a standard image from the Middlebury set [4]. We compare the effect of the Semi-global matching using both the reference solution based on ZSAD and our solution. The reference solution uses a standard SGM implementation featuring 8 directions and the original energy formula. The ZSAD metric was chosen instead of the matching metrics proposed by Hirschmüller [7] because it was shown to generate better results for traffic images [9]. The results are presented in figure 5. Again, the difference between the ZSAD and Census metrics is significant using the simple window-based test: both the error rate and the point count are improved when using the Census transform. Using the Semi-global Matching, the differences are reduced. In both cases the number of erroneous pixels is very small, but there is a difference in the overall behavior of the two solutions. Using the ZSAD metric the local features have a strong influence and the result is noisy. In the case of the Census transform the result is smooth, but it is not able to infer depth information in the most difficult regions of the image.

Fig. 5. Tsukuba scene. Top left is ZSAD with simple window, top right Census with simple window, bottom left is SGM+ZSAD, and bottom right is our solution using SGM + Census
D. System evaluation

For a final test we evaluate our proposed system on a real traffic scene. The reference systems are the Tyzx DeepSea system and our implementation of the standard SGM method using the ZSAD metric. The images are presented in figure 6.
Fig. 6. Intersection scene. Comparison of different solutions, top left is left image of stereo pair, top right is Tyzx, bottom left is SGM+ZSAD, and bottom right is our solution using SGM + Census
The result of the Tyzx DeepSea method has a reduced number of errors, but also a limited density on the road surface. Although it is able to generate accurate matches in textured regions, the algorithm is not able to infer the correct results in featureless regions. The SGM methods present an approximately constant point density across the image, resulting in a significant increase in the number of reconstructed points. Unfortunately the ZSAD-based method has an increased number of errors in some difficult areas. Another problem with this method is the errors introduced at object boundaries, where objects are expanded significantly compared to their original size. Increasing the aggregation window size may help reduce the first type of error, but it will degrade the result by increasing the effect of the second type. Our implementation using the SGM method and the Census metric combines the advantages of both methods, resulting in a dense disparity map with a low number of errors
similar to the first case, with accurate object borders. The density is also slightly increased compared to the ZSAD metric.

E. Execution time

The final version of our implementation is able to handle the required resolution in approximately 19 ms for a disparity range of 56. As a comparison, the solution proposed by Hirschmüller [9] handles the same images in approximately 100 ms. Part of the difference in running time is due to improvements in hardware. The difference in hardware performance is approximately 2x [11], and we estimate that the execution of their variant would still take approximately 50 ms on our device. To compare our results to local methods we use the Tyzx DeepSea development board as a reference system. Although it is an industrial solution using a local algorithm, it still requires 12 ms for images of the same resolution and a disparity range of 52.

VI. CONCLUSIONS

In this paper we have presented a variant of the Semi-global stereo algorithm which can be efficiently mapped to GPU hardware for real-time execution. We reduce the computational complexity of the algorithm by eliminating the diagonal directions. This modification also allows a more efficient implementation on the GPU. We have performed extensive evaluations using traffic scenes to show that the integer disparity values are not affected by these modifications. Our variant uses a modified energy function whose role is to improve the sub-pixel quality. Although the original function may be better suited for integer disparity results, the sub-pixel values are corrupted and an increased sub-pixel spread is observable. For the matching metric we propose the use of the Census transform, as our evaluations show significant differences between this method and the alternatives. We have shown that the extra overhead is not significant for an efficient implementation when compared to the overhead introduced by the SGM.

The paper also presents the most important choices and difficulties encountered during the optimization for the underlying hardware. Using specialized devices such as the GPU, extensive hardware knowledge is required to achieve high efficiency. The parallelization is also not as straightforward as for CPU-based solutions due to the two-tier architecture. Still, the performance gains compared to the CPU are well worth the extra time and effort required during development. This platform is also better suited for prototyping and has a wider availability than traditional hardware devices such as FPGAs.

The final system was evaluated and compared with two reference designs: an industrial solution using local matching [12] and the GPU implementation of the classic SGM method proposed by Hirschmüller [9]. Our solution was able to achieve a running time closer to the local method while having the best quality of the three systems. The simplification of the SGM method and further optimizations reduced the running time by a factor of approximately 2.5 compared to the reference GPU implementation. This improvement was achieved even though our method uses a costly matching metric, the Census transform. The result of this design decision is a significant improvement in the disparity map quality. Compared to the industrial solution, point density is significantly increased while the error rate is preserved and the running time is only increased by a factor of 1.5. Our method needs only 19 ms to generate a disparity map for input images having the resolution 512x383. In conclusion, we believe that our system represents an excellent compromise between speed and quality for automotive systems which have strict real-time constraints. The increase in quality compared to currently employed methods could help driving assistance systems to better identify and track the elements of the traffic scene, especially in difficult conditions.

REFERENCES

[1] Scharstein, D., Szeliski, R., "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms", International Journal of Computer Vision, vol. 47, no. 1-3, pp. 7-42, April-June 2002.
[2] Hirschmüller, H., Scharstein, D., "Evaluation of Cost Functions for Stereo Matching", IEEE Conference on Computer Vision and Pattern Recognition, CVPR '07, pp. 1-8, June 2007.
[3] Gong, M., Yang, R., Wang, L., and Gong, M., "A Performance Study on Different Cost Aggregation Approaches Used in Real-Time Stereo Matching", International Journal of Computer Vision, vol. 75, no. 2, pp. 283-296, Nov. 2007.
[4] Scharstein, D., Szeliski, R., Middlebury stereo vision and evaluation page, http://vision.middlebury.edu/stereo
[5] Hirschmüller, H., Innocent, P. R., and Garibaldi, J., "Real-Time Correlation-Based Stereo Vision with Reduced Border Errors", International Journal of Computer Vision, vol. 47, no. 1-3, pp. 229-246, April-June 2002.
[6] Mark, W., Gavrila, D.M., "Real-time dense stereo for intelligent vehicles", IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 1, pp. 38-50, March 2006.
[7] Hirschmüller, H., "Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information", IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR '05, vol. 2, pp. 807-814, June 2005.
[8] Gehrig, S., Eberli, F., Meyer, T., "A Real-Time Low-Power Stereo Vision Engine Using Semi-Global Matching", Lecture Notes in Computer Science, Computer Vision Systems, vol. 5815, pp. 134-143, 2009.
[9] Hirschmüller, H., Ernst, I., "Mutual Information Based Semi-Global Stereo Matching on the GPU", Lecture Notes in Computer Science, Advances in Visual Computing, vol. 5358, pp. 228-239, 2008.
[10] Gibson, J., Marques, O., "Stereo depth with a Unified Architecture GPU", Computer Vision and Pattern Recognition Workshop, pp. 1-6, 2008.
[11] NVIDIA GeForce GTX 200 GPU architectural overview, second-generation unified GPU architecture for visual computing, Tech. Rep., NVIDIA, 2008.
[12] Woodfill, J.I., et al., "The Tyzx DeepSea G2 Vision System, A Taskable, Embedded Stereo Camera", Embedded Computer Vision Workshop, pp. 126-132, 2006.
[13] Hirschmüller, H., Scharstein, D., "Evaluation of Stereo Matching Costs on Images with Radiometric Differences", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1582-1599, September 2009.
[14] Hermann, S., Klette, R., and Destefanis, E., "Inclusion of a Second-Order Prior into Semi-Global Matching", 3rd Pacific Rim Symposium on Advances in Image and Video Technology, Lecture Notes in Computer Science, vol. 5414, pp. 633-644, January 2009.