Modeling Spatial Dependencies in High-Resolution Overhead Imagery A. M. Cheriyadat, R. R. Vatsavai and E. A. Bright Computational Sciences and Engineering Division Oak Ridge National Laboratory P.O. Box 2008 MS 6017, Oak Ridge, TN 37831
[email protected]

Abstract— Human settlement regions with different physical and socio-economic attributes exhibit unique spatial characteristics that are often evident in high-resolution overhead images. For example, the size, shape, and spatial arrangement of man-made structures are key attributes that vary with the socio-economic profile of a neighborhood. Successfully modeling these attributes is crucial to developing advanced image understanding systems for interpreting complex aerial scenes. In this paper we present three different approaches to modeling spatial context in overhead imagery. First, we show that the frequency domain of the image can be used to model the spatial context [1]: the shapes of the spectral energy contours characterize the scene and can be exploited as global features. Second, we explore a discriminative framework based on Conditional Random Fields (CRF) [2] to model the spatial context. Features derived from the edge orientation distribution computed over a neighborhood, together with the associated class labels, serve as the model's input. Our third approach groups spatially connected pixels based on low-level edge primitives to form support-regions [3]; statistical parameters generated from the support-region feature distributions characterize different geospatial neighborhoods. We apply these approaches to high-resolution overhead images and show that they successfully characterize the spatial context.

Keywords: spatial context, power spectrum, conditional random fields, line support regions, geospatial neighborhoods.
I. INTRODUCTION
Human settlements with different physical and socio-economic attributes exhibit unique spatial characteristics that are often evident in high-resolution overhead images. Automated aerial image understanding systems designed to detect and interpret different settlement neighborhoods from high-resolution imagery are key to the success of many geospatial applications. Many existing pixel-based image analysis techniques are not ideal for interpreting overhead images with sub-meter spatial resolution; advanced modeling of the spatial context is necessary to extract and represent information from such data, because the image attributes corresponding to the physical scene often need to be derived from pixels spanning a local spatial neighborhood. Previous works reported in [4,5,6,7] addressed overhead image understanding challenges by developing unique methods for detecting and classifying geospatial objects such as buildings, cars, roads, harbors, golf courses, and parking lots. In this paper we show that statistical features derived from local spatial neighborhoods can successfully characterize a wide range of geospatial entities, including urban, residential, planned, and unplanned settlements, and settlements with different income-level attributes. We present three different approaches to modeling the spatial context. Our first approach is based on our previous work reported in [1], where we characterize geospatial classes such as urban, residential, commercial, wooded, and agricultural using frequency-domain parameters; the spectral energy response at different discrete levels yields unique shape parameters characterizing the different classes. Second, following the context-based classification work reported in [2], we model spatial interactions to detect human settlement regions in high-resolution overhead imagery. Our third approach, similar to the previously explored line-statistics work [3], shows that ambiguous geospatial classes, such as planned versus unplanned settlements or settlements with different building size attributes, can be modeled using statistical parameters generated from support-region feature distributions computed over local spatial neighborhoods. The rest of the paper is organized as follows. The Background section reviews relevant work on modeling spatial context in aerial imagery. The Analysis section details the modeling approaches based on the power spectrum, CRF, and support-region statistics. Future work and conclusions are presented at the end of the paper.
II. BACKGROUND
In many of the earlier works on satellite image understanding, the focus was on per-pixel analysis. Given an image with input features x(i,j) at pixel location (i,j), the task was to map the input features to corresponding class labels y(i,j). The input features typically include spectral features, texture features, or a combination of both. These efforts led to the development of many advanced classification algorithms for mapping image pixels to
thematic classes. However, these approaches were generally developed for overhead images with 30- to 250-meter spatial resolution; the availability of much finer, sub-meter-resolution data has motivated researchers to develop new image understanding approaches. Previous works reported in [4,5,6,7] take different and unique approaches to building satellite/aerial image understanding systems. Bhagavathy and Manjunath [4] propose a texture codebook for modeling and detecting geospatial objects such as harbors, golf courses, parking lots, and housing colonies. The works reported in [6,7] model the image as a bag of visual words to detect and classify various semantic geospatial classes; the visual-word vocabulary is generated offline by clustering feature vectors computed over local spatial neighborhoods. The graph-based approach proposed in [5] models the aerial scene in a probabilistic framework by organizing geospatial components (cars, roads, buildings, trees) into hierarchical groups. This hierarchical graph description allows the method to enforce statistical constraints, such as appearance and spatial-relation attributes of the geospatial components, during detection and classification, and the authors showed that top-level reasoning can be combined with bottom-level feature extraction under the hierarchical graph model.
As reported in our previous work [1], we collected one-meter spatial resolution images, 512 × 512 pixels in size, covering different parts of the United States. We grouped these images into five broad geospatial classes: downtown, suburban, commercial complex, agricultural, and wooded. The power spectrum of each image was computed after applying a Hamming window to reduce tiling artifacts. The energy contours enclosing 20, 40, 60, and 80% of the total spectral energy illustrate the variation in power spectrum shape across the geospatial classes, as shown in Figure 1. We then modeled the power spectrum with the following form:
E[ |I(f, θ)|² ] ≈ A(θ) / f^α(θ)    (1)
where A(θ) is an amplitude scaling factor, α(θ) is the frequency exponent (rate of fall-off), and θ is the orientation with maximum spectral energy. Table 1 shows the average values of A and α for the dominant orientation θ; the values in parentheses are standard deviations. These model parameters can be used to characterize different land cover classes. Table 1. Average A and α values over 50 images for each geospatial class.
In this paper we explore different ways of modeling spatial neighborhoods in high-resolution satellite/aerial imagery. These modeling approaches can be combined with the existing works described above to develop advanced overhead image understanding capabilities.
III. MODELING SPATIAL NEIGHBORHOODS
A. Global Features

The power spectrum of an image captures the spatial arrangement of the geospatial objects in the scene. A neighborhood class such as downtown may exhibit large intra-class variations depending on the size and type of buildings or on the mix of geospatial objects that compose the scene. The frequency-based global features derived through power spectrum analysis represent the spatial patterns in the image.
Class          A              α
Downtown       0.038 (0.01)   0.685 (0.12)
Suburban       0.042 (0.01)   0.580 (0.11)
Commercial     0.023 (0.01)   0.810 (0.29)
Agricultural   0.028 (0.02)   0.846 (0.41)
Wooded         0.075 (0.04)   0.568 (0.14)
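As a rough illustration of how the model in equation (1) can be fitted, the sketch below windows a grayscale tile, computes its power spectrum, finds the dominant orientation, and estimates A and α by a log-log least-squares fit along that orientation. All function names, the orientation bin count, and the fitting details are our own assumptions; the paper does not specify its implementation.

```python
import numpy as np

def fit_spectrum_model(image, n_orient=36):
    """Sketch: fit E[|I(f, theta)|^2] ~ A(theta) / f**alpha(theta)
    along the dominant orientation of a grayscale image tile."""
    h, w = image.shape
    # 2D Hamming window to reduce tiling artifacts, as in the paper
    win = np.outer(np.hamming(h), np.hamming(w))
    power = np.abs(np.fft.fftshift(np.fft.fft2(image * win))) ** 2

    cy, cx = h // 2, w // 2
    ys, xs = np.mgrid[0:h, 0:w]
    freq = np.hypot(ys - cy, xs - cx)            # radial frequency
    theta = np.arctan2(ys - cy, xs - cx) % np.pi  # orientation in [0, pi)

    # dominant orientation = orientation bin with maximum total energy
    bins = (theta / np.pi * n_orient).astype(int) % n_orient
    energy = np.bincount(bins.ravel(), weights=power.ravel(),
                         minlength=n_orient)
    dom = int(energy.argmax())

    # least-squares fit of log E = log A - alpha * log f in that bin
    mask = (bins == dom) & (freq > 1)
    slope, intercept = np.polyfit(np.log(freq[mask]),
                                  np.log(power[mask]), 1)
    return np.exp(intercept), -slope, dom * np.pi / n_orient
```

A steeper fall-off (larger α) indicates energy concentrated at low frequencies, consistent with the larger α values for the agricultural and commercial classes in Table 1.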
B. Modeling Spatial Interactions

Classifying pixels in high-resolution overhead images based on their intensity values alone is challenging due to ambiguities in the visual data. As shown in Figure 2, the appearance attributes derived for two different image patches
Figure 1. Sample overhead images of the five geospatial classes used in the analysis and their power spectrum shape profiles. The energy contours enclosing 20, 40, 60, and 80% of the spectral energy yield different shapes for different geospatial classes and can be used as global features to characterize these broad classes. From left to right: (1) Downtown, (2) Suburban, (3) Commercial, (4) Agricultural, (5) Wooded.
belonging to two different geospatial objects, such as barren land and a building, might be similar and hence difficult to differentiate. Examining the neighboring image patches, however, can offer cues for inferring the patch's class.
Figure 2. The ambiguity in classifying an image patch belonging to two different geospatial objects based on appearance attributes alone.
In our second approach to modeling spatial dependencies, we show that spatial context can be modeled using a discriminative framework based on CRFs. We apply the model to differentiate human settlement regions from natural background regions in high-resolution overhead imagery. Given the image patch features (X) at each patch site (S) and the associated class labels (Y) as input, the model learns both the mapping between feature and label for each individual image patch and the interactions between neighboring image patches (N). For computational efficiency we operate on image patches of 16×16 pixels. The patch feature vector consists of the first three moments of the edge orientation distribution estimated over the 16×16 pixel neighborhood, the sine of the angle between the two strongest orientation peaks, and the orientation angle of the strongest peak; the edge orientations are weighted by edge magnitude. The mapping between an individual patch feature vector and its class label is based on linear regression, with the regression weights as model parameters; this mapping is termed the association potential (A in equation 2). The feature and label interactions between first-order patch neighbors are captured by a regression function that maps feature differences between neighbors to their class label disparities; this mapping is termed the interaction potential (I in equation 2). To capture feature interactions between neighboring patches, feature vectors are generated at three patch sizes: 16×16, 32×32, and 64×64 pixels. The overall probabilistic discriminative model, composed of the association and interaction potentials, is given by:
P(Y | X) = (1/Z) exp( Σ_{i∈S} A(y_i, x) + Σ_{i∈S} Σ_{j∈N_i} I_ij(y_i, y_j, x) )    (2)
The term Z represents the partition function. The regression parameters associated with the association (A) and interaction
(I) functions are learned simultaneously from training data using a pseudo-likelihood maximization technique [8]. Once the parameters are learned, the objective for a new test image is to find the optimal label configuration (Y) given the observed local image patch features (X). This is achieved through the Maximum Posterior Marginal (MPM) solution estimated using Belief Propagation (BP) [8]. We refer interested readers to [2] for a detailed analysis of the CRF-based discriminative model for detecting man-made structures in terrestrial photos. We tested this context-based classification model on an overhead image database containing high-resolution images, with spatial resolution around 50 centimeters, covering two different geographical regions. The database contained 39 images of 2048 × 2048 pixels. Settlement areas were manually outlined in all images for training and evaluation. Nineteen images were used for training and the remaining twenty for testing. The inference scheme is computationally efficient, and the model yielded 89% overall classification accuracy. Figure 3 shows a few test results. Our experimental analysis confirms that modeling spatial context with the proposed model can improve classification accuracy.
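To make the feature construction concrete, the sketch below builds the five-dimensional patch feature described above: the first three moments of the magnitude-weighted edge orientation histogram, the sine of the angle between the two strongest orientation peaks, and the orientation of the strongest peak. The gradient operator, histogram bin count, and function names are our assumptions, not the paper's.

```python
import numpy as np

def patch_features(patch, n_bins=18):
    """Sketch: 5-D feature vector for one image patch, following the
    description in the text (names and bin count are illustrative)."""
    # central-difference gradients stand in for an unspecified edge operator
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % np.pi            # orientation in [0, pi)

    # magnitude-weighted edge orientation histogram, normalized
    hist, edges = np.histogram(ori, bins=n_bins, range=(0, np.pi),
                               weights=mag)
    p = hist / max(hist.sum(), 1e-12)
    centers = (edges[:-1] + edges[1:]) / 2

    # first three moments of the orientation distribution
    mean = (p * centers).sum()
    var = (p * (centers - mean) ** 2).sum()
    skew = (p * (centers - mean) ** 3).sum()

    # two strongest orientation peaks
    top2 = np.argsort(hist)[-2:]
    peak_sin = np.sin(abs(centers[top2[1]] - centers[top2[0]]))
    return np.array([mean, var, skew, peak_sin, centers[top2[1]]])
```

For the multi-scale interaction features described above, the same function would simply be evaluated on the 16×16, 32×32, and 64×64 windows centered on the patch.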
Figure 3. Settlement detection results generated by the context-based classification model. The overlaid red transparent mask represents the settlement detections. Tests were conducted on 20 similar images; the overall detection accuracy was around 89%.
C. Statistical Features Derived from Spatial Support-Regions

In our third approach we group spatially connected pixels with similar attributes to form regions and derive features
based on these regions. In this work we use the gradient magnitude and orientation computed at each pixel as the grouping attributes. The extracted spatial regions represent structural features that are invariant to illumination variations. The approach can also be applied with other pixel grouping attributes such as color, texture, or fractal dimension. To identify spatial regions with similar orientation attributes, we first compute per-pixel edge orientations and then group spatially contiguous pixels with similar edge orientations to form support-regions. We compute the length and contrast (maximum gradient value) of each support-region, and represent the spatial neighborhood by statistical features derived from the length and contrast distributions. Similar to previous work on line support region features [3], we derive statistical features such as the entropy and mean of the length and contrast distributions. To evaluate these features, we applied the above feature extraction steps to image tiles (128×128 pixels) representing two different settlement types. The city of Damascus, Syria, was chosen as the test site based on the availability of ground truth data. The two settlement types addressed here are unplanned settlements, often occupied by the lowest- to lower-income groups and rural migrants, and planned urban settlements, often occupied by middle- to upper-income groups. Figure 4 shows the two settlement types.
Figure 4. The top and bottom rows show, respectively, unplanned settlements, often occupied by the lowest- to lower-income groups and rural migrants, and planned urban areas, occupied by middle- to upper-income groups.
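The grouping and statistics steps of this third approach can be sketched as follows. The 4-connectivity criterion, the orientation quantization, the length definition (longest spatial extent), and all names are our simplifications of the support-region extraction described in the text.

```python
import numpy as np
from collections import deque

def support_region_stats(image, n_orient=8, mag_thresh=1.0):
    """Sketch: group 4-connected pixels with matching quantized edge
    orientation into support-regions, then summarize the region
    length and contrast distributions (entropy and mean features)."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ori = ((np.arctan2(gy, gx) % np.pi) / np.pi * n_orient).astype(int) % n_orient
    valid = mag > mag_thresh

    h, w = image.shape
    labels = -np.ones((h, w), int)
    lengths, contrasts = [], []
    for sy in range(h):
        for sx in range(w):
            if not valid[sy, sx] or labels[sy, sx] >= 0:
                continue
            # BFS over same-orientation, above-threshold neighbours
            region, q = [], deque([(sy, sx)])
            labels[sy, sx] = len(lengths)
            while q:
                y, x = q.popleft()
                region.append((y, x))
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if (0 <= ny < h and 0 <= nx < w and valid[ny, nx]
                            and labels[ny, nx] < 0
                            and ori[ny, nx] == ori[sy, sx]):
                        labels[ny, nx] = labels[sy, sx]
                        q.append((ny, nx))
            ys, xs = zip(*region)
            # length ~ longest spatial extent; contrast = max gradient
            lengths.append(max(max(ys) - min(ys), max(xs) - min(xs)) + 1)
            contrasts.append(mag[ys, xs].max())

    def entropy(values, bins=16):
        p, _ = np.histogram(values, bins=bins)
        p = p / max(p.sum(), 1)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    return {"length_entropy": entropy(lengths),
            "mean_contrast": float(np.mean(contrasts)),
            "contrast_entropy": entropy(contrasts)}
```

Applied to a 128×128 tile, the three returned values correspond to the LE, MC, and CE axes of the feature space in Figure 5.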
Figure 5. Image tiles representing two different settlement types projected onto the support-region feature space. The exemplar tiles on the left and right of the plot represent planned and unplanned settlements, respectively; the blue and red data points likewise represent planned and unplanned settlements. LE, MC, and CE denote Length Entropy, Mean Contrast, and Contrast Entropy, respectively.
For each tile we computed the length entropy, mean contrast, and contrast entropy from the support-region feature distributions. Note the distinct clustering of feature points belonging to the different settlement types. Our analysis shows that the support-region features characterize local neighborhoods (e.g., 128×128 pixels) belonging to finer geospatial subclasses such as different settlement types. Spatial support-regions can also be established by applying oriented filters at different scales and discrete orientations to the image and then grouping spatially connected pixels whose filter response exceeds a predefined threshold. This has the advantage of extracting larger spatial regions that might otherwise be fragmented by pixel noise. We apply anisotropic Gaussian filters at six orientations (0, 30, 60, 90, 120, 150 degrees) to identify spatial regions composed of pixels with filter response above a preset value. Only support-regions that have at least one associated locally perpendicular support-region are considered when computing the length distribution. We compute the support-region length distribution and fit a log-normal model to it; the log-normal distribution parameters form the representative feature for the image tile. We applied this feature extraction process to several image tiles representing settlements with different building types and sizes. Surprisingly, the feature space comprising the log-normal parameters provides an approximate measure for ordering tiles by building size attributes. Figure 6 shows the projection of the image tiles onto the log-normal parameter space; the principal component axis (diagonal) characterizes the building size variations.
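Two pieces of this extension can be sketched compactly: an anisotropic Gaussian kernel oriented at one of the six discrete angles, and the log-normal (μ, σ) signature fitted to a tile's support-region lengths. The kernel shape parameters and all names are our assumptions; the paper does not give its filter specification.

```python
import numpy as np

def oriented_gaussian(theta, sigma_u=4.0, sigma_v=1.0, size=15):
    """Anisotropic Gaussian kernel elongated along orientation theta
    (sigma values are illustrative; the paper's parameters are not given)."""
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    u = x * np.cos(theta) + y * np.sin(theta)    # along-orientation axis
    v = -x * np.sin(theta) + y * np.cos(theta)   # across-orientation axis
    k = np.exp(-(u**2 / (2 * sigma_u**2) + v**2 / (2 * sigma_v**2)))
    return k / k.sum()

# the six discrete orientations used in the text: 0, 30, ..., 150 degrees
FILTER_BANK = [oriented_gaussian(np.deg2rad(a)) for a in range(0, 180, 30)]

def lognormal_signature(lengths):
    """Maximum-likelihood (mu, sigma) of a log-normal fitted to a tile's
    support-region length distribution; a 2-D feature for ordering tiles."""
    logs = np.log(np.asarray(lengths, dtype=float))
    return float(logs.mean()), float(logs.std())
```

For a log-normal, the MLE reduces to the mean and standard deviation of the log-lengths, which is why larger buildings (longer support-regions) shift a tile along the μ axis of the parameter space.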
Figure 6. Top row shows exemplar images used in the analysis, with building size decreasing from left to right. Length distributions of valid support-regions established from oriented filter responses are modeled as log-normal distributions, and the log-normal distribution parameters form the representative feature for each image tile. The log-normal parameter space provides a measure for ordering image tiles by their content.

IV. CONCLUSIONS AND FUTURE WORK

Advanced overhead image understanding systems that can automatically interpret complex aerial scenes will be an important tool for future surveillance, emergency management, and human geography applications. Modeling spatial context is one of the crucial components when dealing with high-resolution images. In this paper we examined three different approaches for modeling spatial context. The approaches presented here are diverse: a global feature representation of large spatial neighborhoods obtained by exploring the frequency space of the image, modeling of spatial interactions with a random field approach, and statistical features generated from spatial support-regions for characterizing different settlement types. The frequency response correlates with the spatial arrangement of geospatial objects and yields promising features for discriminating broad geospatial classes; this can be employed for the rapid detection and identification of spatial neighborhoods in high-resolution images. The second approach examines a popular probabilistic framework based on CRFs to model the feature and class-label interactions among local neighborhoods; the learned model can be used to build efficient classifiers, and in this paper we applied it to detect human settlement regions in high-resolution images. Our third approach explores a method for generating statistical features that model local neighborhoods. We applied the feature extraction and mapping process to image tiles and visually examined their projections onto the feature space; the feature space comprising these statistical features characterizes finer geospatial classes, including different settlement types and settlements with different building size attributes. In the future we plan to extend these approaches by incorporating the modeling techniques into advanced image understanding systems for the automated interpretation of complex overhead scenes.

ACKNOWLEDGMENT

Prepared by Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6285, managed by UT-Battelle, LLC for the U.S. Department of Energy under contract no. DE-AC05-00OR22725. This manuscript has been authored by employees of UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the U.S. Department of Energy. Accordingly, the United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

REFERENCES

[1] V. Vijayaraj, A. M. Cheriyadat, P. Sallee, B. Colder, R. R. Vatsavai, E. A. Bright, and B. L. Bhaduri, "Overhead Image Statistics," Proc. IEEE Applied Imagery and Pattern Recognition Workshop, 2008.
[2] S. Kumar and M. Hebert, "Discriminative Random Fields," International Journal of Computer Vision, vol. 68, no. 2, pp. 179-201, 2006.
[3] C. Unsalan and K. Boyer, "Classifying Land Development in High-Resolution Panchromatic Satellite Images Using Straight Line Statistics," IEEE Trans. on Geoscience and Remote Sensing, vol. 42, no. 4, pp. 907-919, 2004.
[4] S. Bhagavathy and B. S. Manjunath, "Modeling and Detection of Geospatial Objects Using Texture Motifs," IEEE Trans. on Geoscience and Remote Sensing, vol. 44, no. 12, pp. 3706-3715, 2006.
[5] J. Porway, Q. Wang, and S. C. Zhu, "A Hierarchical and Contextual Model for Aerial Image Parsing," International Journal of Computer Vision, vol. 88, pp. 254-283, 2010.
[6] M. Lienou, H. Maitre, and M. Datcu, "Semantic Annotation of Satellite Images Using Latent Dirichlet Allocation," IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 1, pp. 28-32, 2010.
[7] S. Gleason, R. Ferrell, A. Cheriyadat, R. R. Vatsavai, and S. De, "Semantic Information Extraction from Multispectral Geospatial Imagery via a Flexible Framework," Proc. IEEE International Geoscience and Remote Sensing Symposium, Hawaii, 2010.
[8] S. Z. Li, Markov Random Field Modeling in Image Analysis, Springer-Verlag, London, 2009.