Neural Disparity Computation from IKONOS Stereo Imagery in the Presence of Occlusions

E. Binaghi (a), I. Gallo (a), A. Baraldi (b) and A. Gerhardinger (b)

(a) Dipartimento di Informatica e Comunicazione, Università degli Studi dell'Insubria, Via Ravasi 2, I-21100 Varese, Italy; (b) European Commission Joint Research Centre, Via E. Fermi 1, I-21020 Ispra (Va), Italy

ABSTRACT

In computer vision, stereoscopic image analysis is a well-known technique for extracting the third (vertical) dimension. Building on this knowledge, the Remote Sensing (RS) community has devoted increasing effort to exploiting Ikonos one-meter resolution stereo imagery for high-accuracy 3D surface modelling and elevation data extraction. In previous works our team investigated the potential of neural adaptive learning to solve the correspondence problem in the presence of occlusions. In this paper we present an experimental evaluation of an improved version of the neural-based stereo matching method applied to Ikonos one-meter resolution stereo images affected by occlusion problems. Disparity maps generated with the proposed approach are compared with those obtained by an alternative stereo matching algorithm implemented in a (non-)commercial image processing software toolbox. To compare competing disparity maps, quality metrics recommended by the evaluation methodology proposed by Scharstein and Szeliski (2002, IJCV, 47, 7-42) are adopted.

Keywords: Stereo Matching, Occlusion, Disparity Map, Neural Network, Ikonos, DEM generation

1. INTRODUCTION

Extracting three-dimensional geographic information (digital elevation model, DEM) from two-dimensional images has been a very active research topic in recent years. The availability of accurate DEMs is a key issue in many application fields such as cartography, simulation, city modelling, and environmental monitoring. Several data processing techniques have been conceived to extract 3-D information from different data sources, such as satellite stereo image analysis, lidar processing, or SAR interferometry. Among these, high resolution satellite images potentially have several advantages [1], essentially related to large coverage, periodic acquisition, geopositional accuracy and their digital nature, which makes automation more feasible. However, automated DEM generation from satellite stereo images is still considered a critical task due to low accuracy, large complexity, and incompleteness. In particular, occluded areas, periodic structures, homogeneous areas and moving objects (cars, river water, etc.) often make stereoscopic image analysis ambiguous. These difficulties are accentuated when dealing with high resolution satellite images of urban areas. This imagery presents unique geometric and radiometric characteristics that make the traditional stereo matching techniques for medium or low resolution remotely sensed images inapplicable. Urban areas are very complex in the height dimension, featuring many height discontinuities, large differences in height and, in general, many occlusions (hidden parts). Moreover, the target scene consists of a great variety of objects and surface types.

Further author information: (Send correspondence to Binaghi E.) Binaghi E.: E-mail: [email protected], Telephone: +39 0332 218941

Due to these difficulties, the automatic reconstruction of urban scenes (digital surface model, DSM) has mostly been approached with scene-dependent semi-automatic solutions based on additional ancillary information and manual intervention. Recent works have proposed new 3-D data modelling approaches in an attempt to improve the performance and the applicability domain of traditional methods. In general, 3-D information extraction from satellite stereo image pairs is organized in three main steps: camera modelling, stereo matching, and interpolation. Among these steps, stereo matching is crucial to the accuracy and completeness of results, and major research efforts have been devoted to this task. It concerns the matching of points or other kinds of image primitives in the two stereo images such that the matched image points are the projections of the same point in the scene. Thus, stereo matching involves either pixel-based or feature-based recognition tasks. As output, it generates a so-called map of disparities consisting of the differences in location of matched points. Next in the processing chain, the map of disparities is employed as input to compute the 3-D positions of the scene points. In remote sensing (RS) applications, pyramidal matching [2] and least squares correlation matching have been widely employed. Lee et al. [1] have recently proposed a new image matching strategy specifically designed for linear push-broom satellite optical sensors in terms of optimized conjugate search method, correlation patch design, and match sequence determination. Originally proposed for SPOT stereo imagery, this method was recently applied to automatic DEM extraction from Ikonos stereo pairs over urban areas. Based on the existing literature, 3-D data modelling results generated from Ikonos 1 m resolution satellite imagery seem suitable for DEM generation, but call for substantial improvements, as boundaries of buildings tend to be poorly preserved.
This seems to be due to the fact that traditional stereo matching algorithms do not handle the problem of height discontinuities and occlusions explicitly. Discontinuities and/or occlusions are the major source of errors in stereo image matching almost independently of the application domain (e.g., earth observation, medical imagery, etc.), as they also occur in images featuring small disparity jumps. To summarize, discontinuities and/or occlusions drastically affect the overall accuracy of the 3-D data reconstruction process. In previous works, we investigated the potential of neural adaptive learning to solve the correspondence problem in the presence of occlusions. A novel method was proposed based on an explicit representation of occlusions within the overall matching procedure. According to the taxonomy proposed by Scharstein and Szeliski [3], the novelties concern the aggregation and disparity computation phases. Within the aggregation phase, a strategy based on a disparity space image is introduced, aimed at exploiting occlusion and/or discontinuity information. Next, this information is processed by a supervised neural network classifier which attempts to improve the reliability of disparity computation. In this paper we present an experimental assessment of an improved version of the neural-based stereo matching method applied to Ikonos 1 m resolution stereo images. Disparity maps generated with the proposed method are compared with those obtained by a competing stereo matching algorithm implemented in a (non-)commercial software toolbox. The quality metrics adopted for the disparity map comparison satisfy the evaluation criteria proposed by Scharstein and Szeliski [3].

2. STEREO MATCHING METHOD

Several authors treated occlusion as a secondary process, handled after matching is concluded [4, 5]. More recent works on stereo matching aim at mimicking the human visual system which, during binocular stereopsis, exploits occlusions to reason about spatial relationships among objects. Explicit representation of occlusions and direct processing of occlusion edges characterize these approaches [6, 7]. Bobick and Intille obtained significant results by introducing a new data structure, the Disparity Space Image (DSI), in which they explicitly model the effects of occlusion regions on the stereo solution [8]. Our approach is inspired by these works. In particular, based on the DSI, a new strategy is defined for examining the candidate matches and evaluating their support in a local neighborhood. This task sets up favourable conditions for a subsequent stage in which an adaptive neural network is employed to compute disparities.

Figure 1. Graphic representation of the Disparity Space Image (DSI). Axes: $x_r$, $y_r$ and $d$; the slices run from Disparity = 0 up to Disparity = $d_{max} - 1$.

2.1. DSI Representation

The DSI is an explicit representation of the matching space and plays an essential role in the development of the overall matching algorithm, which makes use of occlusion constraints [8]. The correspondence between a pixel $(x_r, y_r)$ in a reference image $f_r$ and a pixel $(x_m, y_m)$ in a matching image $f_m$ is defined as

$$f_r(x_r, y_r) = f_m\big(x_r + s \cdot d(x_r, y_r),\; y_r\big) + \eta(x_r, y_r) \qquad (1)$$

where $s = \pm 1$ is a sign chosen so that disparities are always positive, $d(x_r, y_r)$ is the disparity function and $\eta(x_r, y_r)$ is Gaussian white noise. According to Equation (1),

$$x_m = x_r + s \cdot d(x_r, y_r) \qquad (2)$$

Thus, the disparity function is equivalent to:

$$d(x_r, y_r) = s \cdot (x_m - x_r) \qquad (3)$$

Introducing the epipolar constraint, the following identity holds:

$$y_r = y_m \qquad (4)$$

where pixels of the matching image are assumed to move from right to left to find their matching counterpart in the reference image. The 3-D disparity space is defined as $(x_r, y_r, d(x_r, y_r))$ (see Fig. 1). Once the disparity space has been specified, the DSI can be defined as any image (e.g., a slice of the disparity space generated at a given $x_r$ or $y_r$ value) or function over the disparity space. Values of the DSI usually represent the cost of the match implied by the particular disparity function $d(x_r, y_r)$ being adopted. Fig. 1 shows a graphic representation of the DSI: each slice corresponds to a disparity level, varying from 0 to a value $d_{max}$ defined as the maximum disparity value for the pair of images at hand.
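As an illustration of the definitions above, the following minimal sketch builds a DSI as a 3-D array indexed by (d, y, x), using the absolute intensity difference as a per-pixel matching cost. The function name `build_dsi`, the choice of cost and the use of infinity to mark positions without a candidate match are assumptions made for this example, not details taken from the method described here.

```python
import numpy as np

def build_dsi(ref, match, d_max, s=1):
    """Return a DSI of shape (d_max, H, W).

    dsi[d, y, x] holds |f_r(x, y) - f_m(x + s*d, y)| for d = 0 .. d_max-1,
    i.e. the cost of Equation (1) under the epipolar constraint y_r = y_m.
    Positions where x + s*d falls outside the image are set to +inf.
    """
    ref = np.asarray(ref, dtype=float)
    match = np.asarray(match, dtype=float)
    h, w = ref.shape
    dsi = np.full((d_max, h, w), np.inf)
    xs = np.arange(w)
    for d in range(d_max):
        xm = xs + s * d                      # Equation (2)
        valid = (xm >= 0) & (xm < w)         # keep matches inside the image
        # dsi[d] is a view, so the masked assignment writes into dsi
        dsi[d][:, valid] = np.abs(ref[:, valid] - match[:, xm[valid]])
    return dsi
```

With this layout, `dsi[:, y, :]` is the DSI slice for image row `y` (disparity along one axis, x along the other), and `dsi.argmin(axis=0)` gives a plain winner-take-all disparity map.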

2.2. Growing Aggregation

According to the taxonomy proposed by Scharstein and Szeliski, the dense stereo matching process can be divided into four tasks [3]:

1. Matching Cost Computation
2. Aggregation Cost
3. Disparity Computation and Optimization
4. Disparity Refinement

The most common matching costs include squared intensity differences (SD) and absolute intensity differences (AD) [3, 7]. Aggregation is performed by summing the calculated matching costs over a square aggregation window at a constant disparity. The specific characteristics of Ikonos stereo acquisition suggest the use of normalized cross correlation [9] as the matching cost operator: in this case the first and second steps of the taxonomy are combined, since the cost is computed directly on a support region (the aggregation window). Our approach extends the conventional Aggregation Cost phase by including two novel sub-tasks: Growing Raw Cost and Growing Aggregation Cost [10]. Unlike conventional techniques, which base the further steps of the matching algorithm on the minimal aggregated cost (maximum correlation value) computed, our approach bases decisions on contextual information. In particular, for each pixel in the reference image the costs are ranked and winners are identified. Then, for each disparity associated with the selected pixel, the number of confirmations, i.e. the number of winners within a given neighborhood (the growing aggregation window), is computed; the disparity with the highest number of confirmations is finally selected.
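The confirmation-counting idea can be sketched as follows. This is a simplified reading of the description above, not the published algorithm: it assumes a score volume in which higher is better (e.g. normalized cross correlation), takes a single winner per pixel, and counts for every disparity how many winners in a square window agree with it. The function name `growing_aggregation` and the zero padding at the image border are assumptions of this example.

```python
import numpy as np

def growing_aggregation(score, win=35):
    """score: array (D, H, W) of aggregated match scores, higher = better.

    Returns an (H, W) disparity map in which every pixel takes the disparity
    confirmed by the largest number of per-pixel winners inside a
    win x win neighborhood (win must be odd).
    """
    assert win % 2 == 1, "window size must be odd"
    n_disp, h, w = score.shape
    winners = score.argmax(axis=0)            # winning disparity of each pixel
    r = win // 2
    votes = np.zeros((n_disp, h, w), dtype=int)
    for d in range(n_disp):
        # 1 where the pixel's winner is d, zero-padded by r on every side
        is_winner = np.pad((winners == d).astype(int), r)
        # summed-area table with a leading row/column of zeros
        sat = np.pad(is_winner.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
        # box sum over the win x win window centred on each pixel
        votes[d] = (sat[win:, win:] - sat[:-win, win:]
                    - sat[win:, :-win] + sat[:-win, :-win])
    return votes.argmax(axis=0)               # ties go to the lowest disparity
```

The effect is that an isolated winner that disagrees with its neighborhood is overruled by the locally dominant disparity, which is the contextual behaviour the paragraph above describes.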

2.3. Neural Disparity Computation

In our approach the disparity computation task is performed by introducing an adaptive strategy based on a neural network [11]. The idea of including neural network learning within a stereo matching algorithm was successfully tested in previous works [10, 12]. In the present study a multilayer perceptron model, trained with the supervised back propagation learning algorithm [13], was adopted to compute disparities based on specific local information extracted from the DSI. During training the network learns the mapping between information extracted from DSI slices and disparity values, based on a supervised (labeled) data set of examples. The trained network is expected to be able to generalize, i.e., to associate correct disparity values with DSI input patterns never employed as input during the training session. A salient property of this learning strategy is that data extracted from DSI slices are employed directly as input to the network, as these data implicitly contain spatial information meaningful for identifying a disparity and/or occlusion condition. As a consequence, this approach avoids an explicit formalization of spatial features. In more detail, neural input patterns are generated by means of a semi-automatic window-based procedure which makes data extraction independent of the DSI slice dimension. This interactive procedure consists of the following steps:

1. Center the moving window at a given position over the DSI slice.
2. Start from the centre of the moving window and select the best N seed points; seed points having values greater than the mean aggregation value computed over the DSI slice are ranked in increasing order and the first N values are selected.
3. Extract the rows to which the selected seed points belong.
4. Generate a set of neural input patterns consisting of the disparity values associated with each element of the selected rows and the distances between rows (refer to Fig. 2).
The supervised training set is generated by the supervisor (oracle, user), who provides each network input pattern with a continuous (real-valued) output label equivalent to the pixel's true disparity, which is assessed by visual inspection of the DSI slice to which that pixel belongs. In mathematical terms, a supervised training sample consists of a pair of elements $(a, b)$ where $a = (x_1, x_2, \ldots, x_N, d)$ and $b = (y_1, y_2, \ldots, y_N)$, such that $a$ is the input pattern vector consisting of

• the set of rows $x_i$ extracted from the DSI slice, and
• a vector $d$ of values, each one representing the distance between subsequent rows; the cardinality of $d$ equals the number of selected rows minus one.

Vector $b$ is the output pattern vector resulting from labeling; each component $y_i$ is associated with the $i$-th true disparity. The number of $y_i$ components is equal to the number of selected rows. The real value of the $y_i$ component having the maximum value is scaled to the real disparity value for the selected row.

Figure 2. Neural input and output construction. INPUT: the rows $x_i$ through the 1st, 2nd and 3rd seed points inside the moving window, together with the distances $d$ between them; OUTPUT: the computed disparity.
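The construction of the input vector $a$ can be sketched as below. This is only one possible reading of the procedure: the ranking criterion for seed points is not fully specified in the text, so the sketch assumes one candidate seed per DSI row (the best value of that row inside the moving window) and keeps the N strongest candidates above the slice mean. The function name `build_input_pattern` and its parameters are hypothetical.

```python
import numpy as np

def build_input_pattern(dsi_slice, center, half_width, n_rows):
    """dsi_slice: (D, W) array for one image line (rows = disparity levels).

    Returns (a, rows_idx): a concatenates the n_rows selected rows, cut to
    the moving window, with the n_rows - 1 distances between subsequent rows.
    """
    lo = max(center - half_width, 0)
    hi = min(center + half_width + 1, dsi_slice.shape[1])
    window = dsi_slice[:, lo:hi]                 # moving window over the slice
    threshold = dsi_slice.mean()                 # mean aggregation value
    row_best = window.max(axis=1)                # assumed: one seed per row
    candidates = np.flatnonzero(row_best > threshold)
    # assumed ranking: strongest candidate rows first
    ranked = candidates[np.argsort(row_best[candidates])[::-1]]
    rows_idx = np.sort(ranked[:n_rows])          # selected rows, by disparity
    x = window[rows_idx]                         # rows x_1 .. x_N
    d = np.diff(rows_idx).astype(float)          # distances between rows
    a = np.concatenate([x.ravel(), d])
    return a, rows_idx
```

For a window of width w the resulting vector has N·w + (N − 1) components, so the network input size depends only on the window and on N, not on the size of the DSI slice.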

3. EXPERIMENTS AND RESULTS

This experimental session aims to analyze and quantify the contribution of the occlusion handling strategy adopted by our matching algorithm in the framework of a 3-D urban modelling processing chain. In other words, the experimental results should address the following questions:

• What is the sensitivity of the proposed algorithm to changes in its free parameters?
• How does it compare with another matching procedure which does not include an explicit occlusion handling task?
• How does the performance depend upon the epipolarity assumption?

Ikonos imagery presents critical aspects in matching correspondence due to the large disparity of homologous points. This situation is highlighted in the image in Fig. 3, obtained by superimposing the matching image on the reference image. To adequately compare the proposed algorithm against existing techniques and demonstrate its superior performance, if any, the (non-)commercial Remote Sensing Graz (RSG) software toolbox (Copyright © JOANNEUM RESEARCH, Graz, Austria) was selected. The RSG software package is designed for geometric processing and quality assessment of digital multi-sensor remote sensing data, namely optical line scanner, SAR and perspective images. Display and image processing operations are not included in RSG and have to be provided by other existing systems. The present experimental session considers the “Image Correlation” RSG module, dedicated to the matching of two stereo images in order to extract the disparity map. The purpose of this module is to find homologous points in the reference and in the input image of the stereo model. The correlation can be applied either as point correlation to discrete points given in ASCII files or as areal correlation to a grid of image pixels at a specified (dense) pixel interval.

Figure 3. Reference and matching sub-images superimposed. The approximate displacement (search area for stereo matching) between the two images is about 450 pixels.

The experimental activity was supported by the tools and test data available within the implementation framework proposed by Scharstein and Szeliski [3] and made available on the web at www.middlebury.edu/stereo. Our stereo correspondence algorithm was implemented within this framework. Among the quality measures available, we adopted the RMS (root mean squared) error (measured in disparity units) between the computed disparity map $d_C(x, y)$ and the ground truth map $d_T(x, y)$,

$$R = \left( \frac{1}{N} \sum_{(x,y)} \big( d_C(x, y) - d_T(x, y) \big)^2 \right)^{1/2} \qquad (5)$$

and the percentage of bad matching pixels,

$$B = \frac{1}{N} \sum_{(x,y)} \big( |d_C(x, y) - d_T(x, y)| > \delta_d \big) \qquad (6)$$

where $N$ is the total number of pixels and $\delta_d$ is a disparity error tolerance, set to the suggested value of 1.
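The two quality measures of Equations (5) and (6) can be computed as in the short sketch below. The function names are chosen for this example; Equation (6) yields a fraction, which is multiplied by 100 here to report a percentage.

```python
import numpy as np

def rms_error(d_c, d_t):
    """Equation (5): RMS error, in disparity units, between two maps."""
    d_c = np.asarray(d_c, dtype=float)
    d_t = np.asarray(d_t, dtype=float)
    return float(np.sqrt(np.mean((d_c - d_t) ** 2)))

def bad_pixel_percentage(d_c, d_t, delta_d=1.0):
    """Equation (6), as a percentage: pixels whose absolute disparity
    error exceeds the tolerance delta_d."""
    d_c = np.asarray(d_c, dtype=float)
    d_t = np.asarray(d_t, dtype=float)
    return float(100.0 * np.mean(np.abs(d_c - d_t) > delta_d))
```

For example, a 2 × 2 map with a single pixel that is off by two disparity levels gives an RMS error of 1.0 and 25% bad pixels.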

3.1. Input Data Set and Targeted Area

Two tiles of a stereo pair of IKONOS Geo panchromatic images were acquired by the European Union's Joint Research Centre (JRC) from European Space Imaging GmbH. The two tiles, covering the city of Muzaffarabad (Kashmir zone between Pakistan and India, refer to Fig. 4), were collected on Dec. 11, 2005, i.e., shortly after the 2005 Pakistan/India earthquake. Experiments were conducted on both raw epipolar-resampled images and orthorectified images, for which the epipolar condition was lost. Two data subsets (chips), showing natural and built-up structures, were selected as representative of the overall non-stationary statistics of the stereo image pair at hand, with special emphasis on urban areas.

3.2. Experiment 1

The first application of the matching strategy dealt with orthorectified images, which lose the epipolarity constraint (see Fig. 5a). In our strategy, the neural disparity stage requires a supervised learning procedure. Within the reference image, training and test areas including a significant presence of occlusions were identified (Fig. 6). Labeled training and test data were generated by associating selected pixels with true disparities obtained by direct visual inspection of the stereo images and the corresponding DSI slices. The neural disparity computation employed 462 and 238 pixels, extracted from the same region in Fig. 5a, for training (Fig. 6a) and testing (Fig. 6b) respectively. Fourteen configurations of the matching strategy were considered, distinguished by a decreasing dimension (ranging from 31 × 31 to 5 × 5) of the aggregation window in matching cost computation. For each configuration, the parameter N, indicating the number of selected rows within the DSI slice in neural disparity computation, assumed the values 3, 5 and 7. All these configurations compute the growing aggregation phase for contextual confirmation with a fixed 35 × 35 window.

Figure 4. Red polygon (dotted line): area of interest. Black polygons (continuous line): scene tiles.

Figure 5. Data subsets from orthorectified (a) and epipolar (b) IKONOS reference images.

Figure 6. (a) Training (462 points) and (b) test (238 points) data extracted from an occlusion area in the reference orthorectified image (Fig. 5a).

Stereo matching by the RSG software employed the following main parameters:

• Size of the search window: 50 × 50 pixels.
• Definition of correlation area: 3 × 3 pixels.
• Hierarchical matching: 4 levels of a dyadic pyramid.

Fig. 7 compares the training and testing errors of the proposed approach against those obtained by the RSG software, which does not handle occlusions explicitly. Focusing on the results obtained in terms of percentage of bad matching pixels, our strategy, configured with a large aggregation window and the highest number of rows in disparity computation, outperformed the stereo matching procedure used for comparison. Even though the comparison was conducted without reproducing comparable conditions in all conceptual and operative aspects, the results support the usefulness of occlusion handling solutions within a global matching strategy. The same results, assessed in terms of the RMS error, support this conclusion, although the advantages of the proposed approach are less evident.

Figure 7. Comparison between a stereo matching procedure without occlusion handling (RSG) and our strategy when applied to the orthorectified stereo pair. Performance is measured in terms of (a) RMS error and (b) percentage of bad matching pixels (error % > 1), for the train and test sets, as a function of the aggregation window size. Results are obtained by setting the number of rows considered within the DSI slice to 3, 5 and 7 in all the 14 configurations distinguished by a decreasing dimension (ranging from 31 × 31 to 5 × 5) of the aggregation window.

The stereo matching procedure, configured with the derived optimal setting of parameters, was then applied. When orthorectified images were processed, the epipolar constraint could not be assumed, and the search for disparities was therefore conducted in two dimensions. Two disparity maps were obtained, for horizontal and vertical displacements. Fig. 8 compares the vertical disparity map obtained by our strategy (Fig. 8a) with that obtained by the RSG software (Fig. 8b). The dashed lines superimposed on the maps highlight the boundary of the roof recorded in the Ikonos reference image and identify the corresponding regions in the disparity maps.
The disparity maps obtained by our strategy show accurate, sharp edges along the roof contours, instead of the blurred cloud generated by the stereo matching procedure without occlusion handling. In this experiment the network was designed with one hidden layer, the size of which was set equal to that of the input layer. The input and output layers were configured according to the coding procedure described in the previous section. During the learning phase the weights were corrected after the presentation of each training pattern. The two interdependent parameters Learning Rate and Momentum were set to 0.2 and 0.3 respectively. For all the configurations, the network was trained for 100 epochs.
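The training setup just described (one hidden layer as wide as the input layer, weight correction after every pattern, learning rate 0.2, momentum 0.3, 100 epochs) can be sketched in plain NumPy as follows. The sigmoid activations, the squared-error loss, the uniform weight initialization and the assumption that targets are scaled to [0, 1] are choices made for this example and are not stated in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, Y, epochs=100, lr=0.2, momentum=0.3, seed=0):
    """Online back propagation with momentum for a one-hidden-layer MLP.

    X: (n_samples, n_in) input patterns; Y: (n_samples, n_out) targets,
    assumed scaled to [0, 1]. Returns the weight matrices (W1, W2), each
    with an extra last row holding the bias weights.
    """
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    n_hid = n_in                                   # hidden size = input size
    W1 = rng.uniform(-0.5, 0.5, (n_in + 1, n_hid))
    W2 = rng.uniform(-0.5, 0.5, (n_hid + 1, n_out))
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for _ in range(epochs):
        for x, y in zip(X, Y):                     # update after each pattern
            xb = np.append(x, 1.0)                 # input plus bias unit
            h = sigmoid(xb @ W1)
            hb = np.append(h, 1.0)                 # hidden plus bias unit
            o = sigmoid(hb @ W2)
            delta_o = (o - y) * o * (1.0 - o)      # output-layer error term
            delta_h = (W2[:-1] @ delta_o) * h * (1.0 - h)
            dW2 = -lr * np.outer(hb, delta_o) + momentum * dW2
            dW1 = -lr * np.outer(xb, delta_h) + momentum * dW1
            W2 += dW2
            W1 += dW1
    return W1, W2

def predict(X, W1, W2):
    """Forward pass for a batch of input patterns."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    Hb = np.hstack([sigmoid(Xb @ W1), np.ones((len(X), 1))])
    return sigmoid(Hb @ W2)
```

In the setting of this paper, each row of `X` would be an input pattern built from a DSI slice and each row of `Y` the corresponding labeled disparities rescaled to the unit interval.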

Figure 8. Vertical disparity maps obtained by our strategy (a, GA+Neural) and by the strategy without occlusion handling (b, RSG), together with the reference (c) and matching (d) images. Dashed lines highlight how accurately the object contours have been identified.

Figure 9. (a) Training (522 points) and (b) test (268 points) data extracted from the epipolar data subset (Fig. 5b).

3.3. Experiment 2

This experiment deals with epipolar stereo images (Fig. 5b). It was designed in the same way as the experiment described in the previous section. The neural disparity computation used 522 and 268 pixels for training and testing respectively (Fig. 9). The results obtained are shown in Fig. 10. The stereo matching procedure, configured with the derived optimal setting of parameters, was then applied. The results obtained in this second experiment, examined quantitatively in Fig. 10 and qualitatively in Fig. 11, agree in general with those of the first experiment, which suggests that the performance of our algorithm does not depend on the epipolarity condition.

Figure 10. Comparison between a stereo matching procedure without occlusion handling (RSG) and our strategy when applied to the epipolar IKONOS stereo pair. Performance is measured in terms of (a) RMS error and (b) percentage of bad matching pixels (error % > 1), for the train and test sets, as a function of the aggregation window size. Results are obtained by setting the number of rows considered within the DSI slice to 3, 5 and 7 in all the 14 configurations distinguished by a decreasing dimension (ranging from 31 × 31 to 5 × 5) of the aggregation window.

Figure 11. Horizontal disparity maps obtained by our strategy (a) and by the strategy without occlusion handling (b), together with the reference (c) and matching (d) images. Dashed lines highlight how accurately the object contours have been identified.

4. CONCLUSIONS

This study aimed at investigating the potential of a stereo matching approach characterized by an explicit handling of occlusions in the specific context of remotely sensed Ikonos stereo imagery. The salient aspects of the proposed approach are:

• the explicit representation of occlusions using the DSI data structure, and
• a neural network learning stage within the disparity computation phase, which takes advantage of a preliminary improved aggregation phase.

Experimental results show that the overall matching quality benefits from the combined exploitation of the DSI representation and supervised machine learning techniques, with special emphasis on occluded regions. In particular, neural network learning techniques employ contextual information extracted from the DSI slices without explicit formalization. The trained network encodes the knowledge on occlusions and uses it efficiently in generalization. In future experiments, the generalization capability of the proposed approach will be further investigated with different combinations of training and testing samples. Future experimental plans include the integration of our matching solution within complete reconstruction procedures and the evaluation of its robustness to changes in scene reconstruction parameters.

REFERENCES

1. H. Y. Lee, T. Kim, W. Park, and H. K. Lee, “Extraction of digital elevation models from satellite stereo images through stereo matching based on epipolarity and scene geometry,” Image and Vision Computing 21, pp. 789–796, 2003.
2. O'Neill and M. Demos, “Automated system for coarse-to-fine pyramidal area correlation stereomatching,” Image and Vision Computing 14(3), pp. 225–236, 1996.
3. D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International Journal of Computer Vision 47, pp. 7–42, 2002.
4. S. T. Barnard and M. A. Fischler, “Computational stereo,” ACM Computing Surveys 14, pp. 553–572, 1982.
5. U. R. Dhond and J. K. Aggarwal, “Structure from stereo - a review,” IEEE Trans. on Systems Man and Cybernetics 19, pp. 1489–1510, 1989.
6. P. Belhumeur and D. Mumford, “A Bayesian treatment of the stereo correspondence problem using half-occluded regions,” in Computer Vision and Pattern Recognition, 1992.
7. I. J. Cox, S. L. Hingorani, S. B. Rao, and B. M. Maggs, “A maximum likelihood stereo algorithm,” Computer Vision and Image Understanding 63, pp. 542–567, 1996.
8. A. F. Bobick and S. S. Intille, “Large occlusion stereo,” International Journal of Computer Vision 33, pp. 181–200, 1999.
9. Q. Chen and G. Medioni, “A volumetric stereo matching method: Application to image-based modelling,” in International Conference on Computer Vision and Pattern Recognition, pp. 29–34, 1999.
10. E. Binaghi, I. Gallo, C. Fornasier, and M. Raspanti, “Growing aggregation algorithm for dense two-frame stereo correspondence,” in First International Conference on Computer Vision Theory and Application, pp. 326–332, 2006.
11. C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
12. E. Binaghi, I. Gallo, G. Marino, and M. Raspanti, “Neural adaptive stereo matching,” Pattern Recognition Letters 25, pp. 1743–1758, 2004.
13. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” Parallel Distributed Processing, pp. 318–362, 1986.