URBAN BUILDING EXTRACTION VIA VISUAL GRAPHICAL TOPIC MODEL
Yansheng Li, Yihua Tan, and Jinwen Tian
National Key Laboratory of Science & Technology on Multi-spectral Information Processing, School of Automation, Huazhong University of Science and Technology, Wuhan, China
ABSTRACT
This paper addresses the problem of automatic building extraction from high-resolution remote sensing images. Buildings in remote sensing images generally exhibit different shapes (e.g., simple rectangular or complex hybrid shapes), so it is intractable to extract all buildings of different shapes in a single pass. Therefore, we adopt a hierarchical extraction scheme with an embedded visual graphical topic model, which includes two stages: the first stage detects the simple rectangular buildings and the second stage extracts the complex hybrid buildings. More specifically, the first stage is mainly responsible for detecting regular buildings and for learning both the unsupervised visual graphical topic model (i.e., the replicated softmax restricted Boltzmann machine) and the supervised discriminative model; the second stage is mainly in charge of extracting complex buildings using the learned semantic feature mapping and discriminative model. Experimental results show that the second stage clearly improves the building detection rate while only slightly increasing the false alarm rate.
Index Terms: Replicated softmax restricted Boltzmann machine, visual word, visual topic, hierarchical building extraction.
1. INTRODUCTION
Recently, automatic urban building extraction from high-resolution remote sensing images has attracted considerable research interest due to its wide applications, such as city planning, cartography, and the updating of geographic information databases. Compared with natural scenes, urban buildings in high-resolution remote sensing images generally exhibit distinctive structure, shape, and spectral information, which are the main cues for designing an automatic building extraction approach.
In [1-2], linear features are utilized to guide automatic building detection. Linear feature-based approaches [1-2] extract buildings with regular shapes very well; however, their performance degrades when the linear features they rely on are corrupted by occlusions or shadows. Exploiting the shadows that buildings cast around themselves, Jin et al. [3] take the detected shadows as contextual information to estimate the position
978-1-4799-5775-0/14/$31.00 ©2014 IEEE
and size of buildings to be detected. Based on template matching, Sirmacek et al. [4] utilize SIFT features and graph theory to detect buildings. Although a variety of automatic approaches have been proposed, how to automatically extract buildings with different shapes and appearances is still an open problem. More recently, Tao et al. [5] proposed a promising building detection approach that can detect buildings with different shapes in a hierarchical manner. The hierarchical framework [5] consists of three steps. First, the object-level elements are extracted and taken as the initial set. Second, obvious buildings and non-buildings are detected and taken as the supervised set, and the remaining elements are taken as the candidate set. Third, a Bayesian classifier is learned from the supervised set and used to recognize the complex buildings in the candidate set. Thanks to this hierarchical architecture, building detection in [5] works in an unsupervised way; however, the Bayesian classifier is vulnerable when the supervised set is small or when the building and non-building samples in the supervised set are unbalanced. One way to overcome this drawback is to map the low-level feature of each sample to a corresponding mid-level feature (i.e., a semantic feature). The initial set is underutilized in [5]; in this paper it is used to learn the mid-level feature mapping with an undirected topic model [6]. The learned mid-level feature mapping is then combined with the supervised module, which helps overcome the drawback mentioned above. Experimental results demonstrate that the hierarchical building detection framework with an embedded mid-level feature mapping module achieves better detection performance than the original hierarchical detection framework [5].
2. THE PROPOSED BUILDING EXTRACTION FRAMEWORK
The workflow of the proposed building detection approach is illustrated in Fig. 1.
The building candidates (i.e., the initial sample set), the labeled data comprising simple buildings and excluded non-buildings (i.e., the supervised sample set), and the remaining building candidates (i.e., the candidate sample set) can be extracted by the method of [5]. More specifically, as depicted in Fig. 2, the object-level elements (as shown in (b) of Fig. 2) can be extracted
IGARSS 2014
from the input image (as shown in (a) of Fig. 2) using neighborhood total variation. Next, the vegetation and shadow are extracted, as depicted in (c) of Fig. 2. After removing the vegetation and shadow from the initial building candidate set, the simple buildings with a regular rectangular shape (as shown in (d) of Fig. 2) can be easily extracted. As some buildings may merge with roads, the roads (as illustrated in (e) of Fig. 2) are detected and removed. After removing the vegetation/shadow, simple buildings, and roads from the initial building candidates, the remaining building candidate set (as shown in (f) of Fig. 2) is obtained; it may still contain some potential buildings.
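The bookkeeping of the three sample sets described above can be sketched as follows. This is only an illustration under stated assumptions: the element ids, the function name `partition_elements`, and the premise that the detected subsets are disjoint are hypothetical, and the detectors of [5] that produce the subsets are not shown.

```python
def partition_elements(initial, vegetation_shadow, simple_buildings, roads):
    """Split element ids into a labeled supervised set and a candidate set.

    All arguments are sets of hypothetical element ids. The subsets are
    assumed to be disjoint; the detectors that produce them are not shown.
    """
    # Label 1: simple rectangular buildings found in the first stage.
    supervised = {e: 1 for e in simple_buildings}
    # Label 0: excluded non-buildings (vegetation/shadow and roads).
    supervised.update({e: 0 for e in vegetation_shadow | roads})
    # Everything that was not labeled remains a building candidate.
    candidates = initial - set(supervised)
    return supervised, candidates


# Toy usage with ten made-up element ids.
initial = set(range(10))
supervised, candidates = partition_elements(
    initial, vegetation_shadow={0, 1}, simple_buildings={2, 3, 4}, roads={5}
)
```

The supervised dictionary plays the role of the labeled data, while `candidates` corresponds to the remaining building candidates shown in (f) of Fig. 2.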
This paper focuses on how to use the small supervised sample set to train a discriminative model that recognizes the complex buildings in the candidate sample set. To handle the problem that only a limited number of supervised samples may be available, a mid-level feature mapping module, learned in an unsupervised way, is added. As depicted in Fig. 1, the mid-level feature of each sample is extracted before supervised learning or supervised recognition starts. The relationship between the unsupervised and supervised learning modules is shown in Fig. 3.
Fig. 3. The discriminative process of complex buildings.
In the following, the unsupervised mid-level feature mapping module and the supervised recognition module are introduced in turn.
Fig. 1. The workflow of the proposed urban building detection framework.
Fig. 2. The visual results of different object-level elements: (a) the original image, (b) the object-level elements, (c) the detected vegetation and shadow, (d) the extracted regular buildings, (e) the detected roads, and (f) the remaining building candidates.
2.1. Unsupervised feature learning
As depicted in Fig. 1, the initial sample set $S = \{r_n \mid n = 1, 2, \ldots, N\}$ is composed of the simple building set, the excluded non-building set (including the shadow, vegetation, and road), and the remaining building candidate set. In this section, we mainly discuss how to utilize the initial sample set to learn a semantic feature mapping function. Given the initial sample set $S = \{r_n \mid n = 1, 2, \ldots, N\}$
where $r_n$ denotes an object-level element (i.e., an image patch), the visual word count feature $v_n$ of $r_n$ can be calculated by the following steps. First, each band of the image is convolved with a Gabor filter bank [7] at 4 scales and 6 orientations, and the multi-band filter responses are then integrated. Second, the integrated filter responses are clustered into textons using the k-means algorithm, and the textons form a texton dictionary of length $L$. Finally, an $L$-dimensional visual word count feature $v_n$ for $r_n$ is obtained by labeling each filter response in $r_n$ with the texton that lies closest to it in the dictionary and counting the labels. Corresponding to the sample set $S$, the visual word count feature set can be denoted as $\chi_1 = \{v_n \mid n = 1, 2, \ldots, N\}$, where $v_n \in \{1, 2, \ldots, D\}^L$, $L$ denotes the length of the texton dictionary, and $D$ denotes the number of filter responses of $r_n$.
Generally, the visual word count feature is a low-level feature. Because the undirected topic model [6] has a superior topic mining ability, this paper utilizes it to extract the mid-level feature (i.e., the semantic feature) from the low-level visual word count feature. The visual word count feature is taken as the visible layer of the undirected topic model [6], and the hidden layer of the model gives its semantic feature. Specifically, the mid-level feature mapping can be formulated as:

$$E(v, h) = -\sum_{i=1}^{L}\sum_{j=1}^{H} W_1^{i,j} h^j v^i - \sum_{i=1}^{L} v^i a^i - D \sum_{j=1}^{H} h^j b^j \qquad (1)$$

where $D = \sum_{i=1}^{L} v^i$ denotes the number of visual words in the object-level element under consideration. As computing the gradient of the log-likelihood of the training data is intractable, contrastive divergence [8] can be utilized to approximate the gradient, implemented by Gibbs sampling. Sampling from the posterior of the hidden layer is performed on the following binomial distribution:

$$p(h^j = 1 \mid v) = \sigma\left(\sum_{i=1}^{L} W_1^{i,j} v^i + D b^j\right) \qquad (2)$$

where $\sigma(x) = (1 + e^{-x})^{-1}$ is the sigmoid function. Sampling from the posterior of the visible layer is performed by sampling $D$ times from the following multinomial distribution:

$$p(v^i = 1 \mid h) = \frac{\exp\left(\sum_{j=1}^{H} h^j W_1^{i,j} + a^i\right)}{\sum_{i'=1}^{L} \exp\left(\sum_{j=1}^{H} h^j W_1^{i',j} + a^{i'}\right)} \qquad (3)$$

Let $v(k)$ denote the result of the $k$-th round of alternating Gibbs sampling. In our implementation, a single round of Gibbs sampling ($k = 1$) is adopted. The gradient of the parameters $\{W_1, a, b\}$ can be expressed by:

$$\begin{cases} W_1^{i,j} \leftarrow p(h^j = 1 \mid v(0))\, v^i(0) - p(h^j = 1 \mid v(1))\, v^i(1) \\ a^i \leftarrow v^i(0) - v^i(1) \\ b^j \leftarrow p(h^j = 1 \mid v(0)) - p(h^j = 1 \mid v(1)) \end{cases} \qquad (4)$$

Based on the estimated model parameters $\{W_1, a, b\}$, the mid-level feature (i.e., visual topic) of $v_n$ can be represented by $\sigma(W_1 v_n + D_n b)$, where $D_n = \sum_{i=1}^{L} v_n^i$.
2.2. Supervised feature learning
As mentioned before, based on the aforementioned detection process, the simple regular buildings (as shown in (d) of Fig. 2) and the non-buildings, including the vegetation/shadow (as illustrated in (c) of Fig. 2) and roads (as illustrated in (e) of Fig. 2), together constitute the supervised sample set $\chi_2 = \{(v_m, c_m) \mid m = 1, 2, \ldots, M\}$, where $c_m \in \{0, 1\}$, 1 denotes a simple building, 0 denotes a non-building sample, and $v_m$ denotes the visual word count vector defined above. Therefore, the supervised learning model can be formulated as:

$$\arg\min_{W_2, g} \sum_{m=1}^{M} \left\| \sigma\left(W_2\, \sigma(W_1 v_m + D_m b) + g\right) - c_m \right\|^2 \qquad (5)$$

As shown in Fig. 3, using the backward propagation algorithm, the model parameters $\{W_1, a, b\}$ can be refined and the parameters $\{W_2, g\}$ can be determined. Based on the mid-level feature mapping parameters $\{W_1, a, b\}$ and the supervised recognition parameters $\{W_2, g\}$, the remaining candidate sample set $\chi_3 = \{v_o \mid o = 1, 2, \ldots, O\}$ (as depicted in (f) of Fig. 2) can be discriminated by the following function:

$$p(c_o = 1 \mid v_o) = \sigma\left(W_2\, \sigma(W_1 v_o + D_o b) + g\right) \qquad (6)$$

If $p(c_o = 1 \mid v_o) > 0.5$, the object-level element $r_o$ is recognized as a building; otherwise it is recognized as a non-building. As this building detection process does not rely on shape characteristics, some buildings with hybrid shapes are extracted in this module. The buildings extracted in the first stage and in the second stage are gathered as the final building detection results.
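To make Eqs. (2)-(4) and (6) concrete, the following NumPy sketch counts visual words by nearest-texton labeling, performs CD-1 updates of a replicated softmax model, and evaluates the decision function. It is a minimal illustration rather than the authors' implementation: all function names are hypothetical, the learning rate, batch averaging, and random initialization are assumptions, samples are stored as rows (so $W_1$ has shape L x H), Gabor filtering and k-means are omitted, and $\{W_2, g\}$ are left random instead of being learned by back-propagation as in Eq. (5).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def visual_word_counts(responses, textons):
    """Count nearest-texton labels: responses (D, F), textons (L, F) -> (L,)."""
    d2 = ((responses[:, None, :] - textons[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return np.bincount(labels, minlength=len(textons)).astype(float)


def hidden_prob(V, W1, b):
    """Eq. (2) for a batch of count vectors V with shape (N, L)."""
    D = V.sum(axis=1, keepdims=True)  # number of visual words per sample
    return sigmoid(V @ W1 + D * b)


def visible_prob(Hid, W1, a):
    """Eq. (3): softmax over the L visual words for each hidden state."""
    z = Hid @ W1.T + a
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cd1_step(V0, W1, a, b, lr, rng):
    """One contrastive divergence (k = 1) update following Eq. (4)."""
    ph0 = hidden_prob(V0, W1, b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden units
    pv = visible_prob(h0, W1, a)
    D = V0.sum(axis=1).astype(int)
    # Reconstruct each sample by drawing D words from the multinomial.
    V1 = np.stack([rng.multinomial(d, p) for d, p in zip(D, pv)]).astype(float)
    ph1 = hidden_prob(V1, W1, b)
    n = V0.shape[0]
    W1 = W1 + lr * (V0.T @ ph0 - V1.T @ ph1) / n  # batch-averaged (assumption)
    a = a + lr * (V0 - V1).mean(axis=0)
    b = b + lr * (ph0 - ph1).mean(axis=0)
    return W1, a, b


def building_prob(V, W1, b, W2, g):
    """Eq. (6): probability that each element is a building."""
    return sigmoid(hidden_prob(V, W1, b) @ W2 + g)


# Toy sizes for illustration only (the experiments use L = 1024, 128 hidden units).
rng = np.random.default_rng(0)
L, H, N = 16, 4, 32
V = rng.multinomial(50, np.full(L, 1.0 / L), size=N).astype(float)
W1 = 0.01 * rng.standard_normal((L, H))
a, b = np.zeros(L), np.zeros(H)
for _ in range(10):
    W1, a, b = cd1_step(V, W1, a, b, lr=0.01, rng=rng)
W2, g = 0.01 * rng.standard_normal(H), 0.0  # placeholders, not trained here
is_building = building_prob(V, W1, b, W2, g) > 0.5
```

In a full pipeline, `visual_word_counts` would be applied to the integrated Gabor responses of every object-level element to build the count matrix `V`, and `W2` and `g` would be fitted on the supervised set before thresholding at 0.5.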
3. EXPERIMENTAL RESULTS
In our experiments, the length L of the texton dictionary is set to 1024 and the number of hidden units in the visual graphical topic model is set to 128. To evaluate the building extraction performance of the proposed approach, our test data set contains 20 high-resolution remote sensing images acquired by the IKONOS sensor. As demonstrated in the second row of Fig. 4, the first stage extracts the simple buildings with a regular rectangular shape. The third row of Fig. 4 depicts the final building detection results, which integrate the first and second stages. Benefiting from the labeled samples extracted in the first stage, the second stage can detect complex buildings with hybrid shapes. Integrating the results of the second stage clearly improves the detection rate while only slightly increasing the false alarm rate. As shown in the third and fourth rows of Fig. 4, the proposed approach performs better than [5]: it achieves a higher detection rate at roughly the same false alarm rate.
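The paper does not state how the detection rate and false alarm rate are computed. As a hedged illustration, the sketch below assumes a common object-level convention, namely detection rate = TP / (TP + FN) and false alarm rate = FP / (TP + FP); the function name and the toy element ids are hypothetical and do not correspond to any reported result.

```python
def detection_and_false_alarm(predicted, truth):
    """Object-level rates from sets of element ids (assumed definitions)."""
    tp = len(predicted & truth)   # buildings correctly detected
    fn = len(truth - predicted)   # buildings that were missed
    fp = len(predicted - truth)   # non-buildings reported as buildings
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (tp + fp) if (tp + fp) else 0.0
    return detection_rate, false_alarm_rate


# Toy ids only: three of five true buildings are found, with one false alarm.
dr, far = detection_and_false_alarm({1, 2, 3, 4}, {1, 2, 3, 5, 6})
```

Under these assumed definitions, adding second-stage detections can only keep or raise the detection rate, while the false alarm rate may move in either direction depending on how many of the added detections are correct.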
4. CONCLUSION
In this paper, the hierarchical building detection framework is adopted and refined with a mid-level feature mapping module. The hierarchical extraction strategy is consistent with human perception and helps in extracting buildings of different shapes. With the mid-level feature mapping module embedded, the hierarchical building detection framework shows more stable and robust building detection performance. In future work, we will attempt to extend the hierarchical extraction framework to other types of object detection tasks in remote sensing images.
5. ACKNOWLEDGEMENT
This research was partially supported by the National Natural Science Foundation of China under grants 41371399 and 61273279, and by the National Science and Technology Major Project of Earth Observation System.
Fig. 4. The building detection results of the two stages. The first row shows the original images, the second row shows the building detection results of the first stage, the third row shows the integrated building detection results of the first and second stages (i.e., the final results), and the fourth row shows the building detection results of [5].
6. REFERENCES
[1] J. Cha, R.H. Cofer, and S.P. Kozaitis, “Extended Hough Transform for Linear Feature Detection,” Pattern Recognition, 39(6), pp. 1034-1043, 2006.
[2] E. Simonetto, H. Oriot, and R. Garello, “Rectangular Building Extraction from Stereoscopic Airborne Radar Images,” IEEE Transactions on Geoscience and Remote Sensing, 43(10), pp. 2386-2395, 2005.
[3] X. Jin and C. Davis, “Automated Building Extraction from High-Resolution Satellite Imagery in Urban Areas Using Structural, Contextual, and Spectral Information,” EURASIP Journal on Applied Signal Processing, 14(5), pp. 2196-2206, 2005.
[4] B. Sirmacek and C. Unsalan, “Urban Area and Building Detection Using SIFT Keypoints and Graph Theory,” IEEE Transactions on Geoscience and Remote Sensing, 47(4), pp. 1156-1167, 2009.
[5] Ch. Tao, Y. Tan, and Zh. Zou, “Hierarchical Method of Urban Building Extraction Inspired by Human Perception,” Photogrammetric Engineering & Remote Sensing, 79(12), pp. 1109-1119, 2013.
[6] R. Salakhutdinov and G. Hinton, “Replicated Softmax: An Undirected Topic Model,” Advances in Neural Information Processing Systems, pp. 1607-1614, 2009.
[7] S. Bhagavathy and B. S. Manjunath, “Modeling and Detection of Geospatial Objects Using Texture Motifs,” IEEE Transactions on Geoscience and Remote Sensing, 44(12), pp. 3706-3715, 2006.
[8] G. Hinton, “Training Products of Experts by Minimizing Contrastive Divergence,” Neural Computation, 14(8), pp. 1771-1800, 2002.