IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING

Multilevel Building Detection Framework in Remote Sensing Images Based on Convolutional Neural Networks


Yibo Liu, Zhenxin Zhang, Ruofei Zhong, Dong Chen, Yinghai Ke, Jiju Peethambaran, Chuqun Chen, and Lan Sun

Abstract—In this paper, we propose a hierarchical building detection framework based on a deep learning model, which focuses on accurately detecting buildings in remote sensing images. To this end, we first construct a generation model for multilevel training samples using the Gaussian pyramid technique, in order to learn the features of building objects at different scales and spatial resolutions. Then, building region proposal networks are put forward to quickly extract candidate building regions, thereby increasing the efficiency of building object detection. Based on the candidate building regions, we establish the multilevel building detection model using convolutional neural networks (CNNs), from which the generic image features of each building region proposal are calculated. Finally, the obtained features are provided as inputs for training the CNN model, and the learned model is further applied to test images for the detection of unknown buildings. Various experiments using Datasets I–III (described in Section V-A) show that the proposed framework increases the mean average precision values of building detection by 3.63%, 3.85%, and 3.77%, respectively, compared with the best state-of-the-art method (Method IV). Besides, the proposed method is robust to buildings having different spatial textures and types.

Index Terms—Building detection, convolutional neural networks (CNNs), candidate building regions, multilevel framework, remote sensing images.

Manuscript received January 23, 2018; revised May 25, 2018 and July 13, 2018; accepted August 13, 2018. This work was supported in part by the Open Fund of Twenty First Century Aerospace Technology Co., Ltd. under Grant 21AT-2016-04, in part by the National Natural Science Foundation of China under Grants 41701533, 41371434, and 41301521, in part by the Open Fund of the Guangdong Key Laboratory of Ocean Remote Sensing (South China Sea Institute of Oceanology, Chinese Academy of Sciences) under Grant 2017B030301005-LORS1804, and in part by the Open Fund of the Key Laboratory for National Geography State Monitoring (National Administration of Surveying, Mapping, and Geoinformation) under Grant 2017NGCM06. (Corresponding authors: Zhenxin Zhang and Ruofei Zhong.)

Y. Liu, Z. Zhang, R. Zhong, Y. Ke, and L. Sun are with the Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing 100048, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

D. Chen is with the College of Civil Engineering, Nanjing Forestry University, Nanjing 210037, China (e-mail: [email protected]).

J. Peethambaran is with the Department of Mathematics and Computing, Saint Mary's University, Halifax, NS B3H 3C3, Canada (e-mail: [email protected]).

C. Chen is with the Guangdong Key Laboratory of Ocean Remote Sensing, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou 510301, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSTARS.2018.2866284

I. INTRODUCTION


IN the past few decades, owing to developments in advanced aerospace remote sensing techniques and sensor manufacturing, the quality of acquired remote sensing images has improved tremendously. Remote sensing images contain many complex and significant spatial objects. Typically, buildings constitute the most important landscape in remote sensing images and have been intensively used in various practical applications, including digital urban model construction [1], urban planning [2], environment control, and mapping [3], among others. In particular, the position, geometric shape, and orientation of buildings often form the basis for high-level building-oriented applications, such as building contouring, building model reconstruction, and cartographic generalization. Hence, efficient and accurate detection and recognition of real buildings in large-scale remote sensing images is a relevant, yet challenging task. In urban scenarios, this challenge is further intensified by the diverse geometric shapes of buildings and the inhomogeneity of their spectral information.

In this paper, we focus on the general problem of automatically and accurately detecting buildings in large-scale remote sensing imagery. Many previously published works in this domain show impressive success on building or other specific object detection. More specifically, machine learning techniques such as AdaBoost [4]–[7], support vector machines [8]–[12], and sparse coding based classifiers [13]–[17] are generally adopted in previous works. Although these methods can reasonably detect objects, the saliency and hierarchy of their object feature representations could be enhanced considerably. With the development of deep learning theory, and especially the ability of deep learning models to describe more powerful feature representations, efficient building detection in remote sensing images becomes possible.

In this paper, we aim to design an efficient and hierarchical building detection framework based on a deep learning model using remote sensing images. We first construct the multilevel training samples using the Gaussian pyramid principle, to learn the features of building objects at different scales and spatial resolutions. Then, the building region proposal networks (BRPN) are designed to determine the candidate building object locations. This process increases the efficiency of the building object detection.




Next, we establish the multilevel building detection model using convolutional neural networks (CNNs), and the features from the hierarchical images corresponding to each building region proposal are extracted. Finally, the extracted features are used to learn the CNN model, which is utilized to detect the unknown buildings in the test images. Apart from introducing this generic methodology, our paper makes the following specific novel contributions:

1) We propose a multilevel learning framework for building detection from remote sensing images based on CNNs, which can extract the features of buildings at different scales and spatial resolutions to train the deep learning model and detect buildings.

2) We establish the BRPN to generate candidate building regions, thereby improving the efficiency of building region searching and enhancing the accuracy of building positions.

The rest of the paper is organized as follows. Section II reviews related work on building and other object detection. Section III describes the proposed learning process for building detection. The test step of building detection using the learned model is described in Section IV. Section V analyzes the performance of the proposed framework. Finally, Section VI concludes the paper along with a few suggestions for future research topics.

II. RELATED WORK


Building and other object detection in remote sensing images has been an active area of research for the last few years and is still very much open. Two important considerations in remote sensing image based object detection are the construction of a hierarchical detection framework and the object (building or other object) feature representation; both are briefly reviewed in this section.


A. Construction of Hierarchical Detection Framework


A hierarchical structure can fully represent spatial hierarchy and diversity [17]. A series of publications along this line demonstrated the effectiveness of the hierarchical structure in improving object detection and/or recognition. For example, Farabet et al. [18] constructed a pyramid of training data by using the Laplacian method to enhance the ability of feature representation. Yu et al. [19] proposed a hierarchical framework, called ScSPM (sparse coding based spatial pyramid matching), to extract hierarchical features. Then, considering the advantage of spatially discriminative features, He et al. [20] equipped the networks with the strategy of spatial pyramid pooling (SPP), which can generate a fixed-length representation regardless of image size/scale. Taking the effect of object size in real space into account, Zhang et al. [21], [22] built a hierarchical structure based on an exponential curve, which works well when large- and small-sized objects coexist. In the aspect of hierarchical construction concerning texture, Gaetano et al. [23] proposed hierarchical texture-based segmentation of multiresolution remote sensing images, and similarly, Trias-Sanz et al. [24] used color and texture to achieve hierarchical segmentation for processing high-resolution remote sensing images. Kurtz et al. [25], [26] designed a hierarchical top-down methodology that can extract extremely complex patterns from multiresolution remote sensing images; for example, landslides were hierarchically extracted from multiresolution remotely sensed optical images. Lin et al. [27] exploited the multiscale pyramidal hierarchy of deep features to construct hierarchical features at marginal extra cost.


B. Object Detection and Location


Object detection and location in remote sensing images have been widely researched in recent years. Sirmacek and Unsalan [28] designed a building detection method for urban areas using scale-invariant feature transform (SIFT) keypoints and graph theory. Xu et al. [11] put forward an object classification of aerial images through the BoW (bag of words) method. Huang et al. [29] designed an effective and automatic building index to detect buildings in high-resolution imagery. After that, multifeatures [30], multiangular features [31], and a postprocessing framework [32] for remote sensing images were also explored for urban and building classification. Hu et al. [33] developed an unsupervised feature learning method via spectral clustering of patches for remotely sensed scene classification. Xia et al. [34] proposed a method for accurate annotation of remote sensing images by active spectral clustering with little expert knowledge. Recent advances in deep learning provide unprecedented opportunities to address problems such as object detection and location in a different way. For example, in [35], Cheng et al. proposed rotation-invariant CNNs for object detection in optical remote sensing images. In another method, Long et al. [36] designed an object detection framework for remote sensing images based on CNNs. Girshick et al. [37] constructed region-based convolutional networks for accurate object detection and segmentation. Hu et al. [38] transferred deep CNNs for the scene classification of high-resolution remote sensing imagery. Vakalopoulou et al. [39] proposed an automated building extraction framework using deep CNNs. Han et al. [8] designed an object detection method for optical remote sensing images based on weakly supervised learning and high-level feature learning using deep learning theory. Readers can refer to the review paper [40] for the current progress on object detection and location in optical remote sensing images. Notably, the above methods could perform better if the hierarchical information of the spatial image were considered. Although the recognition of buildings and other prominent objects from imagery has been researched for many years, accurate recognition remains unsolved due to the complexity of spatial structures and the diversity of surface textures, e.g., building occlusions and inhomogeneous building sizes in experimental scenarios.


III. LEARNING FRAMEWORK


The overview of the proposed method is shown in Fig. 1. First, we construct the multilevel training samples by using the Gaussian pyramid principle, to learn the features of building objects at different scales and spatial resolutions. Then, the BRPNs are designed to determine the candidate building object locations. Based on the building region proposals, the multilevel building detection model is established using the CNNs, and the image features from hierarchical images corresponding to each building region proposal are extracted. Finally, the extracted features are used to learn the CNN model in the training step, and the learned model is used to detect the unknown buildings in the test images. It is to be noted that the proposed framework not only identifies buildings in remote sensing images, but also provides an accurate location for each identified building.

In this section, we mainly discuss the learning framework of hierarchical building detection in remote sensing images based on the deep learning model. More precisely, the hierarchical training datasets are first constructed. Next, the candidate building areas at different levels are determined by using the BRPNs. Finally, the features of the building region proposals are extracted by the deep learning CNN model, to train the hierarchical building detection framework.

Fig. 1. Overview of our method.

A. Construction of Hierarchical Training Data



Due to the various spatial shapes, sizes, and textures of buildings and the occlusions between objects in remote sensing images, it is difficult to construct efficient building object features. To fully learn the characteristics of buildings and enhance the generalization ability of the proposed model, we design a method for automatically generating multilevel training datasets. Inspired by the Gaussian pyramid method in [41], we construct a multilevel structure of training data by resampling the remote sensing images at each level. The original remote sensing images are automatically divided into uniform patches of 300 × 300 pixels and 450 × 500 pixels, respectively, to include a wide variety of building samples. These patches are then used as basic images that are down-sampled with a Gaussian kernel convolution to gradually generate the multilevel training datasets.

Fig. 2. Multilevel image training datasets.

We take the generation of the resampled image at the (l+1)th level (l = 0, 1, 2, ..., N − 2, where N denotes the number of levels in the hierarchical training dataset, e.g., N = 6 in Fig. 2) as an example to illustrate the Gaussian kernel convolution. After the image at the lth level is smoothed by a low-pass filter, the image is sampled according to

$$G_{l+1}(i, j) = \sum_{m=-2}^{2} \sum_{n=-2}^{2} d(m, n)\, G_l(2i - m,\; 2j - n) \tag{1}$$

where $G_l(\cdot)$ represents the image at the lth level of the training dataset, and $G_{l+1}(\cdot)$ represents the image resampled from $G_l(\cdot)$. Here $d(m, n) = g(m) \cdot g(n)$ is a 5 × 5 pixel window function with a low-pass filtering characteristic that serves as the Gaussian convolution kernel, where $g(\cdot)$ is the Gaussian density distribution function, so that

$$d(m, n) = \frac{1}{2\pi\sigma^2}\, e^{-(m^2 + n^2)/2\sigma^2}. \tag{2}$$

According to the above principle, a series of images $G_0, G_1, \ldots, G_{N-1}$ over the N levels is naturally created. Together they constitute a set of multilevel training data.
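To make the sample generation concrete, the following is a minimal NumPy sketch of the pyramid construction in (1) and (2). It is an illustrative reading of this subsection rather than the authors' implementation; the kernel width σ and the default of N = 6 levels are assumptions.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """5 x 5 window d(m, n) = g(m) * g(n) of (2), normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    g = np.exp(-ax**2 / (2.0 * sigma**2))
    g /= g.sum()
    return np.outer(g, g)

def downsample(img, kernel):
    """One application of (1): smooth G_l with d(m, n), keep every other pixel."""
    pad = kernel.shape[0] // 2
    padded = np.pad(img, pad, mode="reflect")
    smoothed = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            smoothed[i, j] = np.sum(kernel * padded[i:i + 2 * pad + 1,
                                                    j:j + 2 * pad + 1])
    return smoothed[::2, ::2]  # G_{l+1} has half the spatial resolution of G_l

def build_pyramid(img, levels=6):
    """Multilevel training images [G_0, G_1, ..., G_{N-1}] of Section III-A."""
    kernel = gaussian_kernel()
    pyramid = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1], kernel))
    return pyramid
```

In this reading, each training patch would be passed through build_pyramid to yield its N resolution levels.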


B. Generation of Candidate Building Areas


We design BRPNs to generate candidate building areas. To this end, we first describe the structure of the BRPNs and then give our strategy for generating the building region proposals.

1) Building Region Proposal Networks: Many researchers have used traditional methods to determine the candidate areas, such as the sliding window [42]. However, sliding-window searching needs to traverse the entire image, which leads to high time complexity and consequently affects the efficiency of building detection. Moreover, this method requires the size and ratio of the sliding window to be set manually, so it is difficult to effectively extract building areas in remote sensing images in this way. Instead, we use a CNN model to extract candidate building areas, which can efficiently generate a small number of high-quality candidate building areas.

Fig. 3. Process of generating candidate building areas.

The proposed generation process of building region proposals is shown in Fig. 3. First, the convolutional features of the hierarchical training dataset are extracted by a set of shared convolution layers [43], such as AlexNet [44], ZF [45], VGG-16 [46], GoogLeNet [47], and ResNet [48]. The input remote sensing image is mapped to a 512-dimensional convolutional feature (shared feature map) by the last layer of the CNN model. Then, the extracted shared feature maps are used in the region proposal networks, which consist of two parts, as shown in Fig. 3: one calculates the regression values of the building region box positions and obtains the location parameters of the predicted building region proposals; the other predicts the probability that a building region box belongs to the building or nonbuilding class by calculating the intersection over union (IoU) ratio between the initial candidate building region and the labeled sample region.

Considering the variations in the spatial sizes, structures, and shapes of buildings, we set the multiscale region boxes of building detection as nine rectangular sliding windows with three pixel sizes (128, 256, and 512 pixels) and three aspect ratios (1:1, 1:2, and 2:1) for each size (see Fig. 4).

Fig. 4. Different sizes of the initial sliding windows. The red box in the upper left represents a sliding window with 128² pixels, the square blue box represents a sliding window with 256² pixels, and the square green box represents a sliding window with 512² pixels. The region boxes of each color have three aspect ratios: 1:1, 1:2, and 2:1.
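As an illustration of the nine initial windows, the sketch below enumerates the three scales and three aspect ratios at a single window position. Interpreting each pixel size s as a box of area roughly s × s is our reading of Fig. 4, so the exact rounding convention is an assumption.

```python
import numpy as np

def initial_windows(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Nine initial building detection boxes (x1, y1, x2, y2) centered at (cx, cy).

    A ratio r = h/w of 0.5, 1.0, or 2.0 gives the 1:2, 1:1, and 2:1
    rectangles of Fig. 4, while w * h stays close to s * s for scale s.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

# Example: the nine candidate windows at image position (300, 300).
print(initial_windows(300.0, 300.0).round(1))
```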


In the training of the BRPN, we use the above nine rectangular sliding windows as the initial building detection boxes. Each window size at each window position indicates a candidate building region, and the nine window sizes form overlapping areas at each window position of the image. The best candidate building area is determined by the window whose IoU value with respect to the labeled building region is higher than that of the other windows, so as to obtain a building detection box that covers the building more completely.
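The IoU criterion used to pick the best window can be written directly; the corner-coordinate convention (x1, y1, x2, y2) matches the window sketch above and is our assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A window would be kept as a positive building sample when its IoU with a labeled region exceeds a threshold such as the 0.7 noted in the caption of Fig. 5.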


Then the location and size of each obtained building detection box are modified according to the values of bounding box regression calculated by a regression mapping. The mapping is defined by the translation and scaling of the building detection box, and the mapping parameters are given as follows [49]:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a$$
$$t_w = \log(w/w_a), \quad t_h = \log(h/h_a)$$
$$t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a$$
$$t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a) \tag{3}$$

where x and y denote the coordinate values of the bounding box centroid of the detected building, and w and h are, respectively, the width and the height of the building detection box. The variables x, x_a, and x* (likewise for y) represent the center-point coordinates of the predicted building detection box, the initial building detection box, and the labeled detection box, respectively. The parameters w, w_a, and w* are the widths of the predicted building detection box, the initial building detection box, and the labeled detection box. The vector t_i = (t_x, t_y, t_w, t_h) represents the four parameterized coordinates of the predicted building detection box: (t_x, t_y) are the translation values between the predicted and initial building detection boxes, and (t_w, t_h) are the scaling parameters between them. Similarly, (t_x*, t_y*) are the translation parameters between the labeled detection box and the initial building detection box, and (t_w*, t_h*) are the corresponding scaling parameters.
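A direct transcription of the parameterization in (3) is sketched below; representing every box by its center and size (x, y, w, h) follows the text, while the function names are ours.

```python
import numpy as np

def encode(anchor, box):
    """Regression values (t_x, t_y, t_w, t_h) of (3).

    anchor: initial detection box (x_a, y_a, w_a, h_a).
    box:    predicted or labeled box (x, y, w, h) in the same convention.
    """
    xa, ya, wa, ha = anchor
    x, y, w, h = box
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(anchor, t):
    """Invert (3): recover a box (x, y, w, h) from regression values t."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])
```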

During the training of the BRPNs, the hierarchical training dataset is used in the model to generate multiscale building region proposals, which can be used to detect building objects in remote sensing images. After the parameters of translation and scaling are obtained, the final building region proposals are determined by the values of IoU; e.g., the output building region proposals are denoted by the red boxes in the right-most column of Fig. 5.

Fig. 5. Schematic diagram of generating candidate building areas. Note that the IoU values of the red rectangles in the right-most subfigures are higher than 0.7.

2) Training of BRPNs: The training process of the BRPNs is end to end. The loss function is defined as follows:

$$L(\{p_i\}, \{t_i\}) = \sum_i L_{cls}(p_i, p_i^*) + \sum_i L_{loc}(t_i, t_i^*) \tag{4}$$


where the subscript i is the index of an initial building detection box. p_i is the predicted probability that the ith building detection box belongs to a building region. p_i* represents the ground-truth label, whose value is 1 if the initial detection box belongs to the building class and 0 otherwise. t_i = (t_x, t_y, t_w, t_h) is a vector representing the coordinates of the predicted bounding box, and t_i* = (t_x*, t_y*, t_w*, t_h*) are the coordinates of the ground-truth box representing the building region. The function L_cls(p_i, p_i*) is the classification loss term, which is calculated as

$$L_{cls}(p_i, p_i^*) = -\log\bigl(p_i \cdot p_i^* + (1 - p_i)(1 - p_i^*)\bigr) \tag{5}$$

which represents the cost of inaccurate building detection. The regression loss $L_{loc}$ is calculated by the classical Smooth$_{L_1}$ loss function [42], which is defined as follows:

$$L_{loc}(t_i, t_i^*) = \mathrm{Smooth}_{L_1}(t_i - t_i^*) \tag{6}$$

where Smooth$_{L_1}$ is the nonlinear regression function

$$\mathrm{Smooth}_{L_1}(a) = \begin{cases} 0.5a^2 & \text{if } |a| < 1 \\ |a| - 0.5 & \text{otherwise.} \end{cases} \tag{7}$$

In (7), the argument a represents the regression residual $(t_i - t_i^*)$. In the optimization process, the loss function gradually approaches its minimum value and the parameters of the training network are obtained.

Similar to the Faster-RCNN (faster region convolutional neural networks) [43] method, our method employs the generation of region proposals to perform the task of object detection, because the generation of candidate regions can effectively improve the accuracy and efficiency of object detection. Unlike the RPN (region proposal networks) embedded in Faster-RCNN, the BRPN constructs the network by combining the spatial hierarchies of the multilevel image training datasets with the deep learning model, whereas the RPN does not consider hierarchical spatial information and exploits only a single image scale.

On the other hand, Faster-RCNN uses a multitask loss function to execute the training task of multiclass object detection, which is not suitable for a single-object detection task (e.g., buildings) in remote sensing images. We therefore redesigned the loss function by removing some normalization factors to form (4). In addition, our BRPN model uses features extracted at different scales, which is another difference compared with Faster-RCNN.
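The loss in (4)–(7) can be transcribed as follows. Restricting the regression term to boxes labeled as buildings follows the Faster-RCNN convention and is our assumption, since (4) as printed omits that gating; the clipping of p is purely to keep the logarithm in (5) finite.

```python
import numpy as np

def smooth_l1(a):
    """Elementwise piecewise loss of (7) on the residual a = t - t*."""
    a = np.asarray(a, dtype=float)
    return np.where(np.abs(a) < 1.0, 0.5 * a**2, np.abs(a) - 0.5)

def brpn_loss(p, p_star, t, t_star):
    """Unnormalized BRPN loss of (4): classification (5) plus regression (6).

    p:      predicted building probabilities, shape (n,)
    p_star: ground-truth labels in {0, 1},    shape (n,)
    t:      predicted regression values,      shape (n, 4)
    t_star: ground-truth regression values,   shape (n, 4)
    """
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1.0 - 1e-7)
    p_star = np.asarray(p_star, dtype=float)
    l_cls = -np.log(p * p_star + (1.0 - p) * (1.0 - p_star))           # (5)
    l_loc = smooth_l1(np.asarray(t) - np.asarray(t_star)).sum(axis=1)  # (6)
    return l_cls.sum() + (p_star * l_loc).sum()  # (4), regression gated to positives
```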

C. Feature Extraction


Fig. 6. Process of feature extraction. The training images consist of hierarchical remote sensing images. The shared convolutional layers extract the features of the training images to obtain shared feature maps. The feature extraction layers extract deeper features from the shared feature maps. The ROI pooling layer converts the extracted features into a list of feature vectors and outputs the extracted features.

One of the most remarkable characteristics of deep learning is that it automatically learns multilevel discriminative feature representations (feature maps) through the multilevel convolutional layers. These feature maps can be used to distinguish building from nonbuilding in our scenario. Deep learning models can extract robust deep features from the image at different levels of abstraction. For example, the shallow convolutional layers extract the edge contours and color-related information of building objects, while the deeper convolutional layers extract the texture and shape structures of building objects. These features are sensitive descriptions of building characteristics with different structures and textures in remote sensing images, thereby contributing to building detection accuracy.

In the process of feature extraction (see Fig. 6), the shared feature map extracted by the CNN model is used to generate building region proposals in the BRPN, and is also used to extract further features in the detection network. We set an ROI pooling layer [42] before the fully connected layers in the feature extraction network. The ROI pooling layer takes the candidate ROI list generated by the BRPN and transfers the feature maps of different sizes extracted by the CNN layers into fixed-size feature vectors. Thus, a feature vector with 7 × 7 × 512 dimensions is extracted from every candidate ROI region and fed into the fully connected (FC) layer.
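A minimal NumPy stand-in for the ROI pooling step described above is given below; it max-pools each channel of one ROI to a fixed 7 × 7 grid, and the rounding of ROI coordinates to integer feature-map cells is a simplifying assumption.

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Max-pool one ROI of a (C, H, W) feature map to (C, out_size, out_size).

    roi is (x1, y1, x2, y2) in feature-map coordinates. With C = 512 and
    out_size = 7 the result is the 7 x 7 x 512 feature described in the text.
    """
    c, h, w = feature_map.shape
    x1, y1, x2, y2 = [int(round(v)) for v in roi]
    x1, y1 = max(x1, 0), max(y1, 0)
    x2, y2 = min(max(x2, x1 + 1), w), min(max(y2, y1 + 1), h)
    xs = np.linspace(x1, x2, out_size + 1).astype(int)  # bin edges along x
    ys = np.linspace(y1, y2, out_size + 1).astype(int)  # bin edges along y
    out = np.zeros((c, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
```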


D. Training Process


In the training step, we adopt the back-propagation optimization method and the SGD (stochastic gradient descent) learning strategy. We generate the hierarchical training data with the Gaussian pyramid principle, and the generated hierarchical training data with the corresponding labels are used to train the model. The whole training process is multistep: First, we employ a pretrained model (the ImageNet [44] model) to initialize the BRPN; then, the multilevel training remote sensing images are used to train the BRPN model, and the learned BRPN model generates the building region proposals. After that, we use the same pretrained model to initialize the detection network, and train the detection network on the building region proposals obtained from the BRPN model. When the classification model and the position parameters of the building detection boxes in (4) are optimized by minimization, the parameters of the shared convolution layers are obtained and used in building detection.
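The multistep schedule can be summarized as follows. Every helper below is a hypothetical stand-in, named by us, for a full Caffe training or inference run; only the ordering of the four steps is taken from the text.

```python
# Hypothetical stand-ins for full Caffe training/inference runs.
def init_from(weights):                return dict(weights)
def train_brpn(model, images):         return model   # SGD + back-propagation
def generate_proposals(model, images): return []      # BRPN forward pass
def train_detector(model, proposals):  return model   # SGD + back-propagation

def train_pipeline(multilevel_images, imagenet_weights):
    """Multistep training schedule of Section III-D."""
    brpn = init_from(imagenet_weights)              # 1) init BRPN from ImageNet [44]
    brpn = train_brpn(brpn, multilevel_images)      # 2) train the BRPN
    proposals = generate_proposals(brpn, multilevel_images)
    detector = init_from(imagenet_weights)          # 3) init detection net the same way
    detector = train_detector(detector, proposals)  # 4) train on BRPN proposals
    return brpn, detector
```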


IV. BUILDING DETECTION

Fig. 7. Process of the building detection.

As shown in Fig. 7, the buildings in remote sensing images are detected using the parameters of the learned deep learning model. First, the test images are converted into candidate building region proposals by using the BRPN. At this point, these candidate building region proposals correspond to the shared feature maps generated by the CNN models. Then, the feature of each building region proposal is extracted through the convolutional layers, and the extracted features are mapped to the ROI pooling layer to get fixed-size feature vectors and the building region bounding box list. The feature vectors are passed to the fully connected layer and the softmax classification layer [50] to label the building objects. Finally, we use the building bounding box regression layer [49] to modify the position of each predicted building region proposal box from the bounding box list to obtain accurate building detection results.
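For illustration, the classification-and-refinement step at test time might look like the sketch below; the softmax follows [50], the box refinement inverts (3), and the acceptance threshold of 0.5 is an assumption of ours.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over per-proposal (nonbuilding, building) scores."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def refine(proposal, t):
    """Invert the parameterization of (3) to shift and scale a proposal box."""
    x, y, w, h = proposal
    tx, ty, tw, th = t
    return (tx * w + x, ty * h + y, w * np.exp(tw), h * np.exp(th))

def detect(scores, deltas, proposals, threshold=0.5):
    """Label proposals as buildings and refine their bounding boxes.

    scores:    (n, 2) FC-layer outputs per proposal.
    deltas:    (n, 4) bounding box regression values per proposal.
    proposals: (n, 4) proposal boxes as (x, y, w, h).
    """
    probs = softmax(np.asarray(scores, dtype=float))[:, 1]
    keep = probs >= threshold
    boxes = [refine(p, d) for p, d, k in zip(proposals, deltas, keep) if k]
    return probs[keep], boxes
```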


V. EXPERIMENTS AND RESULTS

In this section, we test the sensitivities of the model parameters and compare our method with five other methods, to verify the efficiency and stability of the proposed model. The test environment is as follows: Intel Xeon E5-2640 CPU and Nvidia Quadro M4000 GPU with 8 GB of RAM. The training process was performed with the Caffe [51] framework on the Ubuntu 14.04 operating system. This section has four parts: the data overview, the test results evaluation, the parameter sensitivity analysis, and the comparison with the other five methods.


A. Datasets


We used three datasets, namely Dataset I, Dataset II, and Dataset III, to test the performance of the proposed method via various experiments. Dataset I is taken from the SIRI-WHU [52] dataset, which comes from the USGS (United States Geological Survey) public test datasets. The data were collected in Montgomery, OH, USA, with a spatial resolution of 0.6 m; the scene type is primarily residential, covering an area of 5400 × 6000 m². Dataset II is taken from the public SpaceNet dataset,1 which was collected by DigitalGlobe's WorldView-2 satellite. The spatial resolution of Dataset II is 0.5 m and its area is 8000 × 6000 m². Dataset III is taken from the Inria aerial image labeling dataset [53], which consists of aerial orthorectified color imagery with a spatial resolution of 0.3 m. The dataset covers an area of 4700 × 5000 m², captured in Chicago, IL, USA.


1 https://github.com/SpaceNetChallenge/SpaceNetChallenge.github.io

Fig. 8. Training and the test areas in Dataset I.

Fig. 9. Training and the test areas in Dataset II.

Fig. 10. Training and the test areas in Dataset III.

We select the training data and test data from Datasets I, II, and III, according to the type and distribution of buildings as shown in Figs. 8–10. In Dataset I, the training data has a total of 263 buildings, and the test data contains 817 buildings. In Dataset II, the training data contains 849 buildings while the test data contains 3048 buildings. In Dataset III, the training data contains 140 buildings and the test data contains 578 buildings. Figs. 8–10, respectively, represent the training and test areas of Datasets I–III.


B. Building Detection


To test the efficacy of the proposed method, we combine the VGG-16 [46] model into our method and perform experiments on Datasets I, II, and III. The test results of building detection are shown in Figs. 11–13, respectively. It can be seen that our method detects building objects with different textures and shapes well. Despite the buildings having different structures and textures in the scene of Fig. 11(b), and even being sheltered by trees (e.g., the buildings in the dashed box of Fig. 11(b), in dashed box 1 of Fig. 12, and in dashed box 1 of Fig. 13), our model detects the buildings noticeably well. Meanwhile, buildings of different sizes are accurately detected, as evident in the dashed boxes of Fig. 11(c), dashed box 2 of Fig. 12, and dashed box 2 of Fig. 13.

Fig. 11. Results of building detection on Dataset I. (a) Detection results of some buildings. (b) Detection results of some sheltered buildings. (c) Detection results for buildings with different sizes.

Fig. 12. Results of building detection on Dataset II.

Fig. 13. Results of building detection on Dataset III.


C. Effect of Different Training Iteration Numbers


In order to analyze the effect of the number of training iterations, we set the learning rate to 0.001 and adopt the Adam learning strategy [54] to optimize our model. The parameters of the Adam learning strategy are as follows: exponential decay rates β1 = 0.9 and β2 = 0.999, and smoothing term ε = 1e−08. We use mean average precision (mAP) to evaluate the detection results and the IoU value to assess the accuracy of the building detection box position.
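For reference, with a single building class the reported mAP reduces to the average precision computed from the ranked detections; a minimal sketch is given here, in which the IoU-based matching rule that decides whether a detection is a hit is an assumption.

```python
import numpy as np

def average_precision(confidences, is_hit):
    """Area under the precision-recall curve for one class (buildings).

    confidences: score of each detection.
    is_hit:      1 if the detection matches a labeled building
                 (e.g., IoU above 0.5, an assumed rule), else 0.
    """
    order = np.argsort(-np.asarray(confidences))
    hits = np.asarray(is_hit, dtype=float)[order]
    tp = np.cumsum(hits)
    precision = tp / (np.arange(len(hits)) + 1.0)
    recall = tp / max(hits.sum(), 1.0)
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):  # rectangular integration over recall
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# Example: three detections ranked by confidence, two of them correct.
print(average_precision([0.9, 0.8, 0.3], [1, 0, 1]))  # 0.833...
```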

Here, the VGG-16 model [46] is integrated into our method to extract features under different numbers of training iterations. Tables I–III show the obtained mAP values and IoU values for the test images in Datasets I–III.

TABLE I. mAP and IoU values of building detection in the test image of Dataset I under different iteration numbers.

TABLE II. mAP and IoU values of building detection in the test image of Dataset II under different iteration numbers.

TABLE III. mAP and IoU values of building detection in the test image of Dataset III under different iteration numbers.

It can be seen from Tables I–III that the mAP values of building detection on the three datasets generally increase as the number of training iterations grows from 4000 to 10 000, and the IoU values of the test images increase synchronously, which indicates that the accuracies of building detection and positioning improve with the number of training iterations. However, the accuracy of building detection decreases when the number of iterations exceeds 8000 in Tables I and III and 9000 in Table II, which indicates that overfitting may have occurred. Our model achieves its best detection performance when the number of training iterations is set in the range 8000–9000.


D. Effects of Different Deep Learning Network Models


In this section, we use several representative deep learning models to test the performance of our method. The employed models, i.e., AlexNet [44], ZF [45], VGG-16 [46], GoogLeNet [47], and ResNet [48], have achieved excellent results in the object detection and recognition competitions of recent years; they are integrated into our method to test the performance on our test datasets (Datasets I–III) under the same parameters (e.g., the number of training iterations is 8000, and the number of levels N in the multilevel training dataset is 7). The final values of mAP and IoU are shown in Tables IV–VI. From Tables IV–VI and the precision-recall (P-R) curves of the test results in Figs. 14–16, we can observe that all five models obtain good performance in terms of mAP and IoU when used in our method, which shows the advantage of our hierarchical building detection model.

TABLE IV. Values of mAP and IoU using different models on Dataset I.

TABLE V. Values of mAP and IoU using different models on Dataset II.

TABLE VI. Values of mAP and IoU using different models on Dataset III.

Fig. 14. Precision-recall curves of different deep learning models on Dataset I.

Fig. 15. Precision-recall curves of different deep learning models on Dataset II.

Fig. 16. Precision-recall curves of different deep learning models on Dataset III.

E. Impact of Layer Number N

In order to test the influence of the level number N (see Section III-A), we set N from 1 to 10 and use the VGG-16 model [46] as the basic network to train the detection model. In the experiments, the number of training iterations is set to 8000, and the trained model is used to detect buildings in the test images. The building detection results are evaluated using the average IoU value. As shown in Fig. 17, we used Datasets I–III to conduct the experiments. It is evident from Fig. 17 that the average IoU values of building detection gradually increase with the level number N, and tend to be stable when the level number reaches 7 or 8 on Datasets I–III. This indicates the advantage of the multilevel training structure in improving the accuracy of building detection.

Fig. 17. Curves of average IoU values under different numbers of levels N.


F. Comparison With Other Methods


We compare our method with five other building detection methods: the DPM (deformable parts model) method [55] (Method I), the fast-RCNN (fast region convolutional neural networks) method [42] (Method II), the faster-RCNN [43] (Method III), the RICNN (rotation-invariant convolutional neural networks) [35] (Method IV), and the YOLO (you only look once) model [56] (Method V). Table VII shows the comparison between our method and the other five methods. Our method is characterized by three aspects: 1) the multilevel structure of the training data, 2) the deep learning model, and 3) the BRPNs. Among the compared methods, Method I uses no multilevel structure, deep learning model, or BRPNs; it detects buildings using spatial structure components, which are several high-resolution component templates extracted from the image samples. Besides, the sliding window is adopted in Method I to search for buildings in the remote sensing image.


TABLE VII. Comparisons among our method and the other five methods.

TABLE VIII. mAP values of our method and the other five methods on Dataset I.

TABLE IX. mAP values of our method and the other five methods on Dataset II.

TABLE X. mAP values of our method and the other five methods on Dataset III.

TABLE XI. Computational cost of building detection on Dataset I.

Method II uses selective search to generate building region proposals, with no multilevel structure or BRPNs. Method III extracts candidate areas by constructing region proposal networks, and the features of objects are built by the fast-RCNN model [42]. In contrast to our method, Method III targets the problem of multiobject detection and does not use multilevel training data. Method IV introduces a rotation-invariant layer on top of existing CNN models and learns a RICNN model to improve the performance of object detection. Method V proposes a neural network model that performs the object detection task as a regression problem and obtains the location and probability of each object. Methods IV and V have neither BRPNs nor a multilevel structure of training data. In the experiments, the same datasets (Datasets I–III) and hardware environment were used to test the building detection performance. Meanwhile, the VGG-16 model [46] is set as the basic network structure unit in our method. The mAP values are shown in Tables VIII–X. It can be seen from these results that the mAP values of our method reach 0.57 in Table VIII, 0.54 in Table IX, and 0.55 in Table X, which are higher than those of the other methods, improving the mAP values by 3.63%, 3.85%, and 3.77% on Datasets I–III over the best competing approach (Method IV). The results illustrate the advantages of the multilevel structure, deep learning, and the BRPN in our method.

Fig. 18. P-R curves of several building detection methods in Dataset I.

Fig. 19. P-R curves of several building detection methods in Dataset II.

As far as the time complexity of building detection is concerned (see Table XI), our method incurs a small time overhead, slightly more than that of the fastest method (Method V), since Method V does not need to generate building region proposals. However, the object detection precision of Method V is lower than that of our method (see the mAP values in Tables VIII–X), which proves the advantages of the hierarchical training dataset and the BRPN. The P-R curves in Figs. 18–20 indicate that our method performs best, which again illustrates the advantages of the multilevel training framework and the extraction of building region proposals using the BRPN.

Fig. 20. P-R curves of several building detection methods in Dataset III.

VI. CONCLUSION


In this paper, we proposed a deep learning based hierarchical framework to automatically detect buildings in remote sensing images. In the proposed framework, we design the hierarchical training model using the Gaussian pyramid principle, to extract discriminative features at different scales and spatial extents. Then, the deep learning model of hierarchical building detection is constructed. Our method has been validated on different scenes of remote sensing images. We have also compared our method with five related methods (i.e., Methods I–V in Section V-F) qualitatively and quantitatively. The experiments and comparisons with the state-of-the-art methods clearly demonstrate the superiority of our method in accurately and efficiently detecting buildings in remote sensing images. This underlines the advantages of the hierarchical training dataset, the deep learning based building detection model, and the building region proposals generated by the BRPN. In the future, we will consider the adaptive construction of hierarchical training datasets according to the content of on-ground objects [57], to more adequately extract the features of buildings.


REFERENCES

[1] L. Guan, Y. Ding, X. Feng, and H. Zhang, “Digital Beijing construction and application based on the urban three-dimensional modelling and remote sensing monitoring technology,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2016, pp. 7299–7302.
[2] M. M. Rathore, A. Ahmad, A. Paul, and S. Rho, “Urban planning and building smart cities based on the internet of things using Big Data analytics,” Comput. Netw., vol. 101, no. 4, pp. 63–80, Jul. 2016.
[3] J. F. Pekel, A. Cottam, N. Gorelick, and A. S. Belward, “High-resolution mapping of global surface water and its long-term changes,” Nature, vol. 540, no. 7633, pp. 418–432, Dec. 2016.
[4] J. Leitloff, S. Hinz, and U. Stilla, “Vehicle detection in very high resolution satellite images of city areas,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2795–2806, Jul. 2010.
[5] S. Tuermer, F. Kurz, P. Reinartz, and U. Stilla, “Airborne vehicle detection in dense urban areas using HoG features and disparity maps,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 6, no. 6, pp. 2327–2337, Dec. 2013.
[6] H. Grabner, T. T. Nguyen, B. Gruber, and H. Bischof, “On-line boosting based car detection from aerial images,” ISPRS J. Photogramm. Remote Sens., vol. 63, no. 3, pp. 382–396, May 2008.
[7] Ö. Aytekin, U. Zöngür, and U. Halici, “Texture-based airport runway detection,” IEEE Geosci. Remote Sens. Lett., vol. 10, no. 3, pp. 471–475, May 2013.
[8] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, “Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325–3337, Jun. 2015.


[9] G. Cheng et al., “Object detection in remote sensing imagery using a discriminatively trained mixture model,” ISPRS J. Photogramm. Remote Sens., vol. 85, pp. 32–43, Nov. 2013.
[10] D. Zhang, J. Han, G. Cheng, Z. Liu, S. Bu, and L. Guo, “Weakly supervised learning for target detection in remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 4, pp. 701–705, Apr. 2015.
[11] S. Xu, T. Fang, D. Li, and S. Wang, “Object classification of aerial images with bag-of-visual words,” IEEE Geosci. Remote Sens. Lett., vol. 7, no. 2, pp. 366–370, Apr. 2010.
[12] H. Sun, X. Sun, H. Wang, Y. Li, and X. Li, “Automatic target detection in high-resolution remote sensing images using spatial sparse coding bag-of-words model,” IEEE Geosci. Remote Sens. Lett., vol. 9, no. 1, pp. 109–113, Jan. 2012.
[13] Y. Zhang, L. Zhang, B. Du, and S. Wang, “A nonlinear sparse representation-based binary hypothesis model for hyperspectral target detection,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2513–2522, Jun. 2015.
[14] Y. Zhang, B. Du, and L. Zhang, “A sparse representation-based binary hypothesis model for target detection in hyperspectral images,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 3, pp. 1346–1354, Mar. 2015.
[15] N. Yokoya and A. Iwasaki, “Object detection based on sparse representation and Hough voting for optical remote sensing imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2053–2062, May 2015.
[16] Y. Zhong, R. Feng, and L. Zhang, “Non-local sparse unmixing for hyperspectral remote sensing imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 1889–1909, Jun. 2014.
[17] Y. Xu, E. Carlinet, T. Geraud, and L. Najman, “Hierarchical segmentation using tree-based shape spaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 3, pp. 457–469, Mar. 2017.
[18] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, Oct. 2013.
[19] K. Yu, Y. Lin, and J. Lafferty, “Learning image representations from the pixel level via hierarchical sparse coding,” in Proc. IEEE CVPR, Jun. 2011, pp. 1713–1720.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Jan. 2015.
[21] Z. Zhang et al., “A multilevel point-cluster-based discriminative feature for ALS point cloud classification,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3309–3321, Jun. 2016.
[22] Z. Zhang, L. Zhang, X. Tong, B. Guo, L. Zhang, and X. Xing, “Discriminative dictionary learning-based multilevel point-cluster features for ALS point cloud classification,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7309–7322, Dec. 2016.
[23] R. Gaetano, G. Scarpa, and G. Poggi, “Hierarchical texture-based segmentation of multiresolution remote-sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2129–2141, Jan. 2009.
[24] R. Trias-Sanz, G. Stamon, and J. Louchet, “Using colour, texture, and hierarchical segmentation for high-resolution remote sensing,” ISPRS J. Photogramm. Remote Sens., vol. 63, no. 2, pp. 156–168, Mar. 2008.
[25] C. Kurtz, N. Passat, P. Gançarski, and A. Puissant, “Extraction of complex patterns from multiresolution remote sensing images: A hierarchical top-down methodology,” Pattern Recog., vol. 45, no. 2, pp. 685–706, Feb. 2012.
[26] C. Kurtz, A. Stumpf, J. P. Malet, P. Gançarski, A. Puissant, and N. Passat, “Hierarchical extraction of landslides from multiresolution remotely sensed optical images,” ISPRS J. Photogramm. Remote Sens., vol. 87, no. 1, pp. 122–136, Jan. 2014.
[27] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 936–944.
[28] B. Sirmacek and C. Unsalan, “Urban-area and building detection using SIFT keypoints and graph theory,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 4, pp. 1156–1167, May 2009.
[29] X. Huang and L. Zhang, “Morphological building/shadow index for building extraction from high-resolution imagery over urban areas,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 1, pp. 161–172, Feb. 2012.
[30] X. Huang and L. Zhang, “An SVM ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 1, pp. 257–272, Jan. 2013.



[31] X. Huang, H. Chen, and J. Gong, “Angular difference feature extraction for urban scene classification using ZY-3 multi-angle high-resolution satellite imagery,” ISPRS J. Photogramm. Remote Sens., vol. 135, no. 1, pp. 127–141, 2018.
[32] X. Huang, W. Yuan, J. Li, and L. Zhang, “A new building extraction post-processing framework for high spatial resolution remote sensing imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 2, pp. 654–668, Feb. 2017.
[33] F. Hu, G.-S. Xia, Z. Wang, X. Huang, and L. Zhang, “Unsupervised feature learning via spectral clustering of patches for remotely sensed scene classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2015–2030, May 2015.
[34] G. Xia, Z. Wang, C. Xiong, and L. Zhang, “Accurate annotation of remote sensing images via active spectral clustering with little expert knowledge,” Remote Sens., vol. 7, no. 11, pp. 15014–15045, Nov. 2015.
[35] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
[36] Y. Long, Y. Gong, Z. Xiao, and Q. Liu, “Accurate object localization in remote sensing images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486–2498, Jan. 2017.
[37] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142–158, Jan. 2016.
[38] F. Hu, G.-S. Xia, J. Hu, and L. Zhang, “Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery,” Remote Sens., vol. 7, no. 11, pp. 14680–14707, Nov. 2015.
[39] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, “Building detection in very high resolution multispectral data with deep learning features,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2015, pp. 1873–1876.
[40] G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11–28, Jul. 2016.
[41] M. Suárez, V. M. Brea, J. Fernández-Berni, R. Carmona-Galán, and D. Cabello, “Low-power CMOS vision sensor for Gaussian pyramid extraction,” IEEE J. Solid-State Circuits, vol. 52, no. 2, pp. 483–495, Feb. 2017.
[42] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2015, pp. 1440–1448.
[43] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Neural Inf. Process. Syst., 2015, pp. 1137–1149.
[44] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[45] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Euro. Conf. Comput. Vis., 2014, pp. 818–833.
[46] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Sep. 2015, pp. 1–14.
[47] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 1–9.
[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2016, pp. 770–778.
[49] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 580–587.
[50] G. E. Hinton and R. R. Salakhutdinov, “Replicated softmax: An undirected topic model,” in Proc. Neural Inf. Process. Syst., Nov. 2009, pp. 1607–1614.
[51] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 1401–1410.
[52] Y. Zhong, Q. Zhu, and L. Zhang, “Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 11, pp. 6207–6222, Nov. 2015.
[53] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2017, pp. 3226–3229.
[54] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Representations, May 2015, pp. 1–15.
[55] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2009.
[56] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 779–788.
[57] Y. Liu, M. Yu, M. Yu, and Y. He, “Manifold SLIC: A fast method to compute content-sensitive superpixels,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 651–659.

Yibo Liu received the Bachelor's degree in geographical information engineering from the Henan University of Technology, Zhengzhou, China, in 2016. He is currently working toward the Master's degree at the Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing, China. His research interests include deep learning, remote sensing image analysis, and LiDAR-based urban modeling.

Zhenxin Zhang received the Ph.D. degree in geoinformatics from the School of Geography, Beijing Normal University, Beijing, China, in 2016. He is currently an Assistant Professor at the Beijing Advanced Innovation Center for Imaging Technology and the Key Laboratory of 3-D Information Acquisition and Application, College of Resource Environment and Tourism, Capital Normal University, Beijing, China, and he also works as a Cooperator with the Beijing Key Laboratory of Urban Spatial Information Engineering, Beijing Institute of Surveying and Mapping, Beijing, China. His research interests include light detection and ranging data processing, quality analysis of geographic information systems, and algorithm development.


Ruofei Zhong received the Ph.D. degree in geoinformatics from the Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, in 2005. He is currently a Professor at the Beijing Advanced Innovation Center for Imaging Technology and the Key Laboratory of 3-D Information Acquisition and Application, College of Resource Environment and Tourism, Capital Normal University, Beijing, China. His research interests include light detection and ranging data processing and data collection systems with laser scanning.


Dong Chen received the Bachelor's degree in computer science from the Qingdao University of Science and Technology, Qingdao, China, in 2005, the Master's degree in cartography and geographical information engineering from the Xi'an University of Science and Technology, Xi'an, China, in 2009, and the Ph.D. degree in geographical information sciences from Beijing Normal University, Beijing, China, in 2013. He is currently an Associate Professor with Nanjing Forestry University, Nanjing, China. He is also a Postdoctoral Fellow with the Department of Geomatics Engineering, University of Calgary, Calgary, AB, Canada. His research interests include image- and LiDAR-based segmentation and reconstruction, full-waveform LiDAR data processing, and related remote sensing applications in the field of forest ecosystems.

Yinghai Ke received the Bachelor's degree in environmental science from Wuhan University, Wuhan, China, the Master's degree in geographical information science from Peking University, Beijing, China, and the Ph.D. degree in geospatial sciences and technology from the State University of New York College of Environmental Science and Forestry, Syracuse, NY, USA. She is currently an Associate Professor with Capital Normal University, Beijing, China. Her research interests include remote sensing image classification and its application in urban environment and ecology.


Jiju Peethambaran received the Bachelor's degree in information technology from the University of Calicut, Malappuram, India, the Master's degree in computer science from the National Institute of Technology Karnataka, Mangalore, India, and the Ph.D. degree in computational geometry from IIT Madras, Chennai, India. He is currently a Postdoctoral Researcher with the Department of Computer Science, University of Victoria, Victoria, BC, Canada. His research interests include computational geometry, geometric learning, real-time geometry processing, and related applications including motion capture for VR/AR and LiDAR-based urban modeling.

Chuqun Chen received the B.S. degree in engineering of geology and exploration from the Chengdu University of Technology, Chengdu, China, in 1982, the M.Sc. degree in cartography and remote sensing from the Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, in 1992, the Ph.D. degree in physical oceanography (ocean color remote sensing) from the Graduate University of the Chinese Academy of Sciences, Guangzhou, China, in 2006, and the Ph.D. degree in water resources engineering from Lund University, Lund, Sweden, in 2008. Since 1987, he has been working on remote sensing applications and has been a Principal Investigator for more than ten projects sponsored by the National 863 Program, the National 973 Program (subproject), the National Natural Science Foundation of China, the Scientific Foundation of Guangdong Province, the Chinese Academy of Sciences, and the Ministry of Science and Technology. His research interests include marine optics theory and optical data analyses, the atmospheric correction of optical satellite data in coastal areas, remotely sensed assessment of water quality, thermal infrared remote sensing, skin temperature measurement, and validation of satellite-retrieved SST with his own developed instrument, the Buoyant Equipment for Skin Temperature (BEST).


Lan Sun received the Bachelor's degree in surveying and mapping engineering from the Beijing University of Civil Engineering and Architecture, Beijing, China, in 2012. He is currently working toward the Master's degree in cartography and geography information systems at the Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing, China. His research interests include spatial data mining, deep learning, and LiDAR-based classification and reconstruction.
