Sensors Article

Building Extraction Based on an Optimized Stacked Sparse Autoencoder of Structure and Training Samples Using LIDAR DSM and Optical Images

Yiming Yan *, Zhichao Tan *, Nan Su * and Chunhui Zhao

Department of Information Engineering, Harbin Engineering University, Harbin 150001, China; [email protected]
* Correspondence: [email protected] (Y.Y.); [email protected] (Z.T.); [email protected] (N.S.); Tel.: +86-139-3651-3116 (Y.Y.); +86-157-5451-8660 (Z.T.); +86-186-0450-4579 (N.S.)

Received: 13 July 2017; Accepted: 24 August 2017; Published: 24 August 2017

Abstract: In this paper, a building extraction method is proposed based on a stacked sparse autoencoder (SSAE) with an optimized structure and optimized training samples. Building extraction plays an important role in urban construction and planning. However, some negative effects reduce the accuracy of extraction, such as excessive resolution, poor correction and terrain influence. Data collected by multiple sensors, such as light detection and ranging (LIDAR) and optical sensors, are used to improve the extraction. Using the digital surface model (DSM) obtained from LIDAR data together with optical images, traditional methods can improve the extraction to a certain extent, but they have defects in feature extraction. Since a stacked sparse autoencoder neural network can learn the essential characteristics of the data in depth, an SSAE was employed to extract buildings from the combined DSM data and optical image. A better setting strategy for the SSAE network structure is given, and an approach to setting the number and proportion of training samples for better training of the SSAE is presented. The optical data and DSM were combined as the input of the optimized SSAE; after training with optimized samples, the appropriate network structure can extract buildings with great accuracy and good robustness.

Keywords: stacked sparse autoencoder; LIDAR DSM; remote sensing image; building extraction

1. Introduction

Nowadays, aerial optical images and digital surface models (DSM) obtained from light detection and ranging (LIDAR) are the main high-resolution data for civilian remote sensing applications [1–4]. As the most typical features of artificial landscapes, buildings play an important role in urban planning, urban development and military affairs [5]. How to extract buildings quickly and accurately from high-resolution remote sensing data has therefore become a primary research problem. With the rapid development of remote sensing technology, data from multiple sensors also show high-resolution features, which provides a good basis for the analysis of objects of interest [6]. However, complex information in the images has negative effects on building extraction.

Traditional building extraction methods for remote sensing images can be summarized in three categories [7]: (1) Line and corner extraction-based methods. These methods usually first extract straight lines, then group, merge and prune the lines, and finally screen out the exact outline of the building [8–10]; (2) Region segmentation-based methods. These methods extract image characteristics and then obtain the building areas. For example, the texture features of a building can be obtained using the gray-level co-occurrence matrix, the gray-difference matrix and Gabor filters of the image, and the building is extracted by segmentation

Sensors 2017, 17, 1957; doi:10.3390/s17091957

www.mdpi.com/journal/sensors


of the texture features [11]; (3) Methods based on auxiliary features or information [12–14]. Due to the apparent height difference between building areas and the ground, building areas are judged by the height difference between buildings and surrounding objects.

These traditional building extraction methods typically use a single optical image or DSM data, and each has its own limitations. The methods based on geometric boundaries do not make full use of texture and spatial features [15], which leads to extraction errors; they construct polyhedral building models to detect or fit buildings, but such models adapt poorly to continuous or complex building shapes, and their robustness is poor [16]. The region segmentation-based methods depend on the accuracy of segmentation, and errors caused by the stacking and shadowing of buildings are difficult to avoid. There are also classification-based methods, such as image fitting and random forest supervised hierarchical classification, which have obtained good results [17–19]. Building extraction based on auxiliary DSM data is expensive in terms of data acquisition [2,18,19]: the large-scale acquisition cost of such data is high and the update cycle is uncertain.

In addition to the above methods, some model-driven approaches have gradually been proposed. These methods are based on multi-dimensional high-resolution remote sensing data and start from a semantic model of buildings; they mainly include methods based on semantic model classification with a prior knowledge model. With the introduction of Bayesian networks and probabilistic graphical models, some researchers also use probabilistic latent semantic analysis (PLSA) [20] and latent Dirichlet allocation (LDA) topic models [21] to carry out this work.
On the other hand, the Markov random field (MRF) [22], conditional random field (CRF) [23], multi-scale random field, etc., are based on image primitives (pixels or segmented regions); the spatial relationships between building models were used for building object extraction from high-resolution remote sensing images. In addition, many researchers have attempted to use curve evolution techniques based on a priori models, including snakes or active contour models, snake deformation models, and level sets, for building extraction. For classification based on semantic models, the modeling of the spatial relationship between objects and background is still insufficient, and the precision of extraction needs to be improved. Due to the variety of artificial buildings, building models depend heavily on prior knowledge, and it is difficult to find a universal descriptive model; methods based on the "wide range of priority" visual cognitive theory currently remain at the level of simple object recognition and extraction.

In order to improve the effectiveness of building extraction, it is possible to extend and integrate existing methods or trained models by means of incremental learning. It is worth noting that machine learning algorithms can be used to optimize and adapt model parameters. With the rapid development of machine learning, methods such as manifold learning, sparse representation and deep learning have been widely used in image processing and artificial intelligence. As a result, properly combining an optical image with DSM data can be a better choice. The sparse autoencoder network has a very good feature learning ability; we present a stacked sparse autoencoder (SSAE) framework [24] to extract buildings in high-resolution remote sensing images.
However, training sample selection and the network structure settings of the SSAE have a significant impact on its accuracy. For better building extraction using optical images and DSM together, our contributions are: (1) a better setting strategy for the SSAE network structure; and (2) an approach to setting the number and proportion of training samples for better training of the SSAE. The optical data and DSM were combined as the input of the optimized SSAE; after training with optimized samples, the appropriate network structure can extract buildings with great accuracy and good robustness. The rest of the paper is organized as follows: Section 2 reports the theory and


method of the optimized SSAE. In Section 3, experimental analysis and results are shown. In Section 4, the work is concluded and discussed.

2. Methods


2.1. The Framework

The framework is shown in Figure 1. First, the three bands of optical data and the DSM data were input into a stacked sparse autoencoder; then the training samples were analyzed and studied to determine the appropriate size and proportion of training samples. After obtaining the appropriate training samples, the influences of the number of hidden neurons and the number of hidden layers on the accuracy were analyzed. The appropriate training samples and structure of the SSAE are thus obtained, which can be used for better building extraction.

Figure 1. The framework of the building extraction method.

The greedy layer-wise approach was employed to pre-train the stacked sparse autoencoder by training each layer in turn [25]. After the pre-training, the trained stacked sparse autoencoder is employed to extract building and non-building patches in the testing set. All layers were combined to form a stacked sparse autoencoder with hidden layers and a final softmax to extract building pixels from the high-resolution remote sensing images and DSM.

2.2. Stacked Sparse Autoencoder

The stacked sparse autoencoder is an unsupervised feature learning algorithm. It has all the advantages of deep networks, with a more powerful expressive capacity, and tends to learn characteristic representations of the input data. For a stacked sparse autoencoder, the first layer learns first-order features, the second layer learns second-order features, and so on.

Figure 2. A simple example of an autoencoder.
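As a concrete illustration of the autoencoder of Figure 2, the sigmoid encoding step can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation; the dimensions (3 inputs, 2 hidden units) and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation S_f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    """Encoder step: map input x to the hidden representation h = S_f(Wx + b)."""
    return sigmoid(W @ x + b)

# Toy dimensions: 3 inputs -> 2 hidden units.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))  # weight matrix W
b = np.zeros(2)                  # offset b
x = np.array([0.2, 0.5, 0.9])    # one input sample
h = encode(x, W, b)              # hidden representation, values in (0, 1)
```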


Basically, training an autoencoder means finding the optimal parameters by minimizing the discrepancy between the input x and its reconstruction x̂, as shown in Figure 2. The encoder maps the input x to the hidden representation h, expressed as

h = f(x) = S_f(Wx + b)    (1)

Here S_f(·) is the activation function, defined by the sigmoid logistic function. For the first hidden layer,

h_1^(1) = S_f(W_11^(1) x_1 + W_12^(1) x_2 + W_13^(1) x_3 + b_1^(1))
h_2^(1) = S_f(W_21^(1) x_1 + W_22^(1) x_2 + W_23^(1) x_3 + b_2^(1))
...    (2)

The final expression is represented by the weights W^(l) and offsets b^(l), where l denotes the lth layer and j the jth node:

h_{W,b}(x) = S_f(W_11^(2) h_1^(2) + W_12^(2) h_2^(2) + W_13^(2) h_3^(2) + ... + b_1^(2))    (3)

The discrepancy between the input x and its reconstruction x̂ is described with a cost function:

J(W, b) = [(1/m) Σ_{i=1}^{m} J(W, b; x^(i), y^(i))] + (λ/2) Σ_{l=1}^{n_l − 1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (W_ji^(l))^2    (4)

= (1/m) Σ_{i=1}^{m} (1/2) ‖h_{W,b}(x^(i)) − y^(i)‖^2 + (λ/2) Σ_{l=1}^{n_l − 1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (W_ji^(l))^2    (5)

The first term in the above formula is the mean square error term, and the second term is the regularization term (also called the weight attenuation term).

We use the batch gradient descent method to solve the objective function. Each iteration of gradient descent updates the parameters. The backpropagation algorithm is used to calculate the partial derivatives; after obtaining the loss function and its partial derivatives, the gradient descent algorithm solves for the optimal parameters of the network.

The stacked sparse autoencoder neural network is composed of multi-layer sparse autoencoders: the outputs of each layer are wired to the inputs of the successive layer, and the same method is used to train all the hidden layers. The output of the last hidden layer is fed, together with the original data labels, to a multi-class softmax classifier to train its parameters. Finally, these parameters serve as the parameters of the entire neural network, including all the hidden layers and a softmax output layer; the minimum value of the cost function is then found over all parameters, giving the optimal parameter values for the entire network.

As shown in Figure 3, we use such a stacked sparse autoencoder, with the optical image and DSM data as the input, to extract the buildings.

Figure 3. The illustration of the stacked sparse autoencoder used in this work.
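The cost function of Equations (4) and (5), i.e. the mean squared reconstruction error plus a weight-decay term over all layer weights, can be sketched as follows. This is an illustrative NumPy version with hypothetical names and toy dimensions, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ssae_cost(weights, biases, X, Y, lam):
    """J(W, b) of Equations (4)-(5): mean over the m samples of half the squared
    reconstruction error, plus (lambda/2) * sum of all squared weights."""
    m = X.shape[1]                     # X holds m column-vector samples
    A = X
    for W, b in zip(weights, biases):  # forward pass through every layer
        A = sigmoid(W @ A + b[:, None])
    mse = (1.0 / m) * 0.5 * np.sum((A - Y) ** 2)
    decay = (lam / 2.0) * sum(np.sum(W ** 2) for W in weights)
    return mse + decay

# Toy autoencoder (target Y = input X): 3 -> 2 -> 3 units.
rng = np.random.default_rng(1)
X = rng.random((3, 5))
weights = [0.1 * rng.standard_normal((2, 3)), 0.1 * rng.standard_normal((3, 2))]
biases = [np.zeros(2), np.zeros(3)]
J = ssae_cost(weights, biases, X, X, lam=1e-4)
```

Raising the weight-decay factor λ increases the cost for the same weights, which is the regularizing effect described above.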



2.3. Evaluation

The extraction results of different methods were evaluated by 'Accuracy', a measure of the correctly classified points among all points:

Accuracy = T / N    (6)

where T is the number of correctly extracted points, including both the building pixels and the non-building pixels that were correctly classified, and N is the total number of points in the image. The extraction results were also evaluated in terms of 'Completeness', 'Correctness' and 'Quality', defined as in Equations (7)–(9):

Completeness = TP / (TP + FN)    (7)

Correctness = TP / (TP + FP)    (8)

Quality = 1 / (1/Completeness + 1/Correctness − 1)    (9)
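Given the counts of true positives (TP), false negatives (FN) and false positives (FP), plus the total number of correctly classified pixels T and the image size N, the four metrics of Equations (6)–(9) can be computed directly. A minimal sketch; the function name and the counts below are hypothetical, not taken from the experiments:

```python
def evaluation_metrics(tp, fn, fp, t, n):
    """Accuracy, Completeness, Correctness and Quality, Equations (6)-(9)."""
    accuracy = t / n
    completeness = tp / (tp + fn)
    correctness = tp / (tp + fp)
    quality = 1.0 / (1.0 / completeness + 1.0 / correctness - 1.0)
    return accuracy, completeness, correctness, quality

# Hypothetical counts: 1000 building pixels in the ground truth (800 found),
# 100 false alarms, 9700 of 10,000 pixels classified correctly overall.
acc, comp, corr, qual = evaluation_metrics(tp=800, fn=200, fp=100, t=9700, n=10000)
```

Note that Equation (9) algebraically reduces to TP / (TP + FN + FP), so Quality penalizes both missed buildings and false alarms at once.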

In these equations, TP, FN, and FP are the numbers of true positives, false negatives, and false positives, respectively. Here the true positives are the building pixels that were correctly extracted according to the ground truth; false negatives and false positives are defined in similar ways.

3. Experiments and Analysis

3.1. The Generation of Training and Testing Sets

In this experiment, several groups of optical images and DSMs were collected as the input data to a stacked sparse autoencoder. Digital aerial images are used as experimental data; they are part of the high-resolution DMC block of the German Association of Photogrammetry and Remote Sensing (DGPF) test. They include the near-infrared (IR), red (R) and green (G) spectral bands, with a spatial resolution of 9 cm. Matching DSM data and a true orthophoto (ground truth map) are also provided; the DSM data also have a spatial resolution of 9 cm. The ground truth map has a spatial resolution of 9 cm and a data quantization of eight bits.

We selected the digital aerial images and their DSM data of Area 30 and Area 37 as the training data set. The luminance of buildings in Area 30 is in the low range, while the luminance of buildings in Area 37 is high. The size of each image is 1500 pixels × 1500 pixels. In the first (input) layer, the input is the raw pixel intensity of a square patch, represented as column vectors of pixel intensity of size 2,250,000 pixels × 4 layers. The four layers include the IR, the R, the G, and the DSM, and all are normalized.

Area 34 was selected as test data, since the buildings in Area 34 are of various luminances. The digital aerial images and DSM were also used. The size of this image is 2500 pixels × 1000 pixels, and it is represented as column vectors of pixel intensity of size 2,500,000 pixels × 4 layers. Likewise, the four layers include the IR band, the R band, the G band, and the DSM, and all are normalized. Moreover, it can be seen from the contours of each area that the terrain has ups and downs, as shown in Figures 4c and 5e,f, and some ground in the high-altitude areas has larger DSM values than the roofs of buildings in low-altitude areas. This can be used to verify the performance of our method in different cases of terrain.
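Constructing the pixel-wise input described above amounts to flattening each band and stacking IR, R, G and DSM as a four-column matrix, with each band normalized independently. A minimal sketch, where the tile size, value ranges and function name are illustrative (the real patches are 1500 × 1500 pixels):

```python
import numpy as np

def build_input_vectors(ir, r, g, dsm):
    """Stack the IR, R, G bands and the DSM into an (H*W) x 4 matrix of
    per-pixel feature vectors, each band min-max normalized to [0, 1]."""
    columns = []
    for band in (ir, r, g, dsm):
        band = band.astype(np.float64)
        lo, hi = band.min(), band.max()
        columns.append(((band - lo) / (hi - lo)).ravel())  # assumes hi > lo
    return np.stack(columns, axis=1)

# Hypothetical 4 x 4 tiles standing in for the 1500 x 1500 images.
rng = np.random.default_rng(2)
ir = rng.integers(0, 256, (4, 4))
r = rng.integers(0, 256, (4, 4))
g = rng.integers(0, 256, (4, 4))
dsm = 50.0 * rng.random((4, 4))          # heights in metres
X = build_input_vectors(ir, r, g, dsm)   # shape (16, 4)
```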


Figure 4. The Vaihingen test area: (a) Area 34; (b) the ALS DSM data of Area 34; (c) the contour of Area 34; and (d) the ground truth map of Area 34.

Figure 5. The Vaihingen test area: (a) Area 30; (b) Area 37; (c) the DSM of Area 30; (d) the DSM of Area 37; (e) the contour of Area 30; (f) the contour of Area 37; (g) the ground truth map of Area 30; and (h) the ground truth map of Area 37.

3.2. Impact of the Different Proportion of Training Samples

A total of 100,000 training samples were randomly selected from the 30th and 37th regions, and the neural networks were trained according to different proportions. Each image has 2,250,000 pixels, so the 30th and 37th regions have a total of 4,500,000 pixels. The ratio of selected building pixels to non-building pixels was varied for the next experiment, using a network structure with two hidden layers; the first and second hidden layers have 200 and 100 hidden units, respectively.

The results show a striking effect of the ratio of building pixels to non-building pixels on test performance. From Figure 6, it can be intuitively seen that when the ratio of building pixels to non-building pixels is 4 to 6, the result of the experiment is the best. Below this ratio, some building pixels are not extracted; above it, a large number of non-building pixels are classified as building pixels. At the best ratio the encoder network fully learns the characteristics of both the building and the non-building areas. Consistent results can be obtained from Table 1.


Figure 6. Different proportions of extraction results: (a) the ratio of buildings to non-buildings is 2 to 8; (b) the ratio of buildings to non-buildings is 3 to 7; (c) the ratio of buildings to non-buildings is 4 to 6; (d) the ratio of buildings to non-buildings is 5 to 5; (e) the ratio of buildings to non-buildings is 6 to 4; and (f) the ratio of buildings to non-buildings is 7 to 3.

Table 1. Different proportions of extraction results.

Ratio      2:8      3:7      4:6      5:5      6:4      7:3
Accuracy   86.79%   93.01%   93.93%   89.52%   68.33%   67.03%
Quality    48.49%   73.40%   77.07%   68.49%   43.23%   42.51%

3.3. Impact of the Different Size of Training Samples

The building and non-building training samples were randomly selected according to the ratio of 4 to 6 to constitute the training data, and the size of the training data was varied to discuss its effect on the experimental results. The experimental data were selected in different cases (1000, 6000, 10,000, 15,000, 30,000, 100,000, 300,000, and 600,000 training sample points). As shown in Figure 7 and Table 2, the size of the training sample has a certain effect on the detection. From the results for the smallest and largest training samples in Figure 8, it can be seen that excessive training samples cause some non-building pixels to be classified as building pixels, and training samples that are too small cause building pixels to be classified as non-building pixels. A suitable training sample size obtains the best results.

Table 2. The results of different sizes of training samples.

Size       1000    6000    10,000   15,000   30,000   100,000   300,000   600,000
Accuracy   75.8%   94.3%   94.5%    95.3%    95.1%    93.9%     91.4%     87.2%
Quality    6.22%   78.9%   79.7%    81.6%    82.6%    77.5%     68.7%     58.7%


Figure 7. Results of different sizes of extraction: (a) 1000 points; (b) 6000 points; (c) 10,000 points; (d) 15,000 points; (e) 30,000 points; (f) 100,000 points; (g) 150,000 points; and (h) 600,000 points.


Figure 8. The effect of the size of the training sample on the results.

3.4. 3.4.Impact Impactofofthe theDifferent DifferentNetwork NetworkStructure Structure In getget a more obvious result, the use of use a group poor results of the data,ofthethe experiment Inorder ordertoto a more obvious result, the of aofgroup of poor results data, the was carried out adverse conditions. experiment wasunder carried out under adverse conditions. As entire Area 30 30 is divided intointo twotwo parts, the the left left partpart of the are Asshown shownininthe theFigure Figure9,9,the the entire Area is divided parts, of steps the steps in accordance withwith the above stepssteps to build the training samples: thatthat is, according to the ratio 4 to4 are in accordance the above to build the training samples: is, according to the ratio 630,000 points are are selected, andand the the right partpart of the whole mapmap as aas test sample. to 630,000 points selected, right of the whole a test sample. Table thethe comparison about the the different number of layers of the of neural network. When Table3 3shows shows comparison about different number of layers the neural network. the numapoch and batchsize are the same, the greater hiddenoflayers of the neural network When the numapoch and batchsize are the same, thenumber greater of number hidden layers of the neural show better results. network show better results. Table effect of the trend of neuronal changes in theinhidden layer on theon results when Table4 4shows showsthe the effect of the trend of neuronal changes the hidden layer the results the number of hidden the same. When theWhen number neurons the hidden layer decreases when the number of layers hiddenislayers is the same. theof number ofinneurons in the hidden layer by layer, the result is the better. When the number of neurons in the hidden layer is constant, and theis decreases by layer, result is the better. When the number of neurons in the hidden layer number ofand neurons in the hidden layerinisthe increasing, the effect is not good. 
constant, the number of neurons hidden layer is increasing, the effect is not good. Table Table3.3.Different Differentproportions proportionsofofextraction extractionresults. results.

Network Structure Numepochs Network Structure Numepochs [160 100 40 20] 50 [160 100 40 20] 50 [160 100 40] 50 [160 100 40] 50 [100 40 20] 50 [100 40 20] 50 [160 40 20] [160 40 20] 50 50 [160 100] [160 100] 50 50 [160 40] [160 40] 50 50 [100 40] [100 40] 50 50 [40 20] 50 [40 20] 50 [160] 50 [160] 50 [100] 50 [100] 50 [40] 50 [40] 50 [20] 50 [20] 50

Batchsize Batchsize 100 100 100 100 100 100 100100 100100 100100 100100 100100 100100 100100 100100 100 100

Accuracy Quality Accuracy 92.12% 92.12% 70.0% 90.87% 90.87% 67.2% 90.40% 90.40% 66.3% 91.08% 91.08% 65.6% 90.39% 90.39% 65.0% 84.80% 84.80% 55.7% 90.36% 90.36% 64.7% 89.31% 62.9% 89.31% 86.93% 57.7% 86.93% 88.04% 59.6% 88.04% 87.68% 58.9% 87.68% 87.99% 60.1% 87.99%

Quality 70.0% 67.2% 66.3% 65.6% 65.0% 55.7% 64.7% 62.9% 57.7% 59.6% 58.9% 60.1%

Sensors 2017, 17, 1957

Figure 9. The area of the experiment data is selected: (a) the training data is selected on the left side of Area 30; and (b) the test data is selected on the right side of Area 30.

Table 4. Extraction results for different trends of neurons in the hidden layers.

Network Structure     Numepochs   Batchsize   Accuracy   Quality
[160 100 40 20]       50          100         92.12%     70.0%
[20 40 100 160]       50          100         89.27%     63.8%
[40 40 40 40]         50          100         88.35%     61.6%
[100 100 100 100]     50          100         88.35%     61.3%
[160 40 20]           50          100         91.08%     65.6%
[20 40 160]           50          100         89.54%     63.6%
[160 100 40]          50          100         89.87%     67.2%
[40 100 160]          50          100         88.62%     62.2%
[160 160 160]         50          100         87.07%     59.1%
[100 100 100]         50          100         89.20%     63.2%
[40 40 40]            50          100         88.15%     61.1%
[20 20 20]            50          100         88.82%     62.1%
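The decreasing-stack behaviour examined in Tables 3 and 4 can be illustrated with a minimal greedy layer-wise pretraining loop. This is an illustrative numpy sketch, not the DeepLearnToolbox code used in the paper: it assumes tied weights and a linear decoder, and omits the sparsity penalty and fine-tuning stage for brevity.

```python
import numpy as np

def pretrain_stack(X, hidden_sizes, epochs=50, lr=0.05, seed=0):
    """Greedy layer-wise pretraining of a stack of tied-weight autoencoders.
    Each layer learns to reconstruct the previous layer's codes; returns the
    encoder weights and, per layer, the (first, last) reconstruction errors."""
    rng = np.random.default_rng(seed)
    codes, weights, history = X, [], []
    for h in hidden_sizes:
        m, n = codes.shape
        W = rng.normal(0.0, 0.1, (n, h))     # encoder weights (decoder is W.T)
        b, c = np.zeros(h), np.zeros(n)      # encoder / decoder biases
        errs = []
        for _ in range(epochs):
            H = np.tanh(codes @ W + b)       # encode
            R = H @ W.T + c                  # linear tied-weight decode
            errs.append(float(np.mean((R - codes) ** 2)))
            G = 2.0 * (R - codes) / m        # gradient of squared error w.r.t. R
            dH = (G @ W) * (1.0 - H ** 2)    # back-propagate through tanh
            W -= lr * (G.T @ H + codes.T @ dH)   # decoder + encoder terms
            b -= lr * dH.sum(axis=0)
            c -= lr * G.sum(axis=0)
        history.append((errs[0], errs[-1]))
        codes = np.tanh(codes @ W + b)       # feed codes to the next layer
        weights.append(W)
    return weights, history
```

With `hidden_sizes=[160, 100, 40, 20]` this would realize the best row of Table 4; any of the other structures in the tables can be compared by passing a different list.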

3.5. Comparison of Different Methods

We compare the stacked sparse autoencoder with a deep belief network (DBN) and a support vector machine (SVM) to verify that the selected samples and network structures are appropriate. To attain these aims, the most appropriate training data were used in Area 30, Area 34 and Area 37, respectively, and the whole graph was tested for comparison.

The SVM parameters were set to c = 3.0314 and g = 84.4485. The DBN has two hidden layers, with 200 and 100 hidden units in the first and second hidden layers, respectively; with more than two hidden layers, the DBN exhibits an over-fitting phenomenon. The stacked sparse autoencoder network structure contains five hidden layers, with 160, 100, 40, 20 and 10 hidden units, respectively.
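For reference, the c and g values above correspond to libsvm's cost and RBF-kernel gamma options; a hypothetical command-line invocation (the file names are placeholders, not from the paper):

```shell
# RBF kernel (-t 2 is the default); cost and gamma as given in the text
svm-train -t 2 -c 3.0314 -g 84.4485 train.scaled building.model
svm-predict test.scaled building.model predictions.txt
```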


Using the same settings of data, the training data and test data come from the same image; the size of the training data is 30,000 points and the size of the test data is 2,500,000 points. In the training data, the ratio between building and non-building is 4 to 6.

The results for Area 30 and Area 37 are shown in Tables 5 and 6 and Figures 10 and 11. It can be seen that, with the above settings, SSAE and DBN achieve good accuracy, whereas SVM assigns the whole image to a single class even though parameter optimization was performed. The reason could be over-fitting; SVM shows this kind of unstable behaviour on such data.

Table 7 and Figure 12 show the comparison of DBN, SSAE and SVM in Area 34. We can see that the result of SVM is the best, but it takes almost five times as long as SSAE. The DBN method performs the worst. SSAE performs better than DBN and takes almost the same time.

Table 5. Extraction results of different methods in Area 30.

Method   Accuracy   Completeness   Correctness   Quality   Time
DBN      92.46%     89.49%         83.20%        75.79%    21.30 s
SSAE     94.20%     86.90%         90.73%        79.81%    23.60 s
SVM      73.63%     100%           73.63%        73.63%    3650.33 s
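The completeness, correctness and quality figures reported in Tables 5-7 are consistent with the standard per-pixel building-extraction definitions, which can be sketched as follows (TP/FP/FN/TN are the usual confusion counts; the function name is our own, not from the paper):

```python
def extraction_metrics(tp, fp, fn, tn):
    """Per-pixel building-extraction scores.

    completeness: fraction of true building pixels that were detected
    correctness:  fraction of building-labelled pixels that are correct
    quality:      penalizes both missed and falsely-labelled pixels
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    completeness = tp / (tp + fn)
    correctness = tp / (tp + fp)
    quality = tp / (tp + fp + fn)
    return accuracy, completeness, correctness, quality
```

These definitions also explain the SVM rows: with completeness at 100% (no building pixel missed), quality collapses to correctness, as seen in Tables 5 and 6.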


Table 6. Extraction results of different methods in Area 37.

Method   Accuracy   Completeness   Correctness   Quality   Time
DBN      94.36%     92.56%         86.65%        81.01%    18.73 s
SSAE     95.45%     90.71%         90.98%        83.22%    18.61 s
SVM      75.20%     100%           75.20%        75.20%    4247.86 s


Figure 10. The result of Area 30: (a) the result of DBN; (b) the result of SSAE; (c) the result of SVM.


Figure 11. The result of Area 37: (a) the result of DBN; (b) the result of SSAE; and (c) the result of SVM.



Table 7. Extraction results of different methods in Area 34.

Method   Accuracy   Completeness   Correctness   Quality   Time
SSAE     96.23%     94.71%         91.07%        86.66%    21.47 s
DBN      94.28%     85.38%         92.37%        79.76%    22.21 s
SVM      97.49%     98.18%         98.43%        96.66%    110.21 s


Figure 12. The result of Area 34: (a) the result of DBN; (b) the result of SSAE; and (c) the result of SVM.

4. Discussion and Conclusions

In this paper, we proposed a remote sensing building extraction method based on an effective stacked sparse autoencoder network, which combines the original optical image and DSM data as input and extracts the buildings from high-resolution data. Although many feature-based or fusion methods might obtain better performance than SSAE, we believe SSAE can find more abstract and unusual features. Moreover, finding an optimized idea of using SSAE is of far-reaching significance. As a result, we mainly focused on discussing two problems in remote sensing building extraction: (1) the suitable size and proportion of training samples for deep learning based methods; and (2) the better structure of SSAE based methods. According to our experiments, the appropriate training sample size was 15,000 points, and the better proportion between building and non-building was 4:6. For the structure of SSAE, the number of neurons in the hidden layers should decrease layer by layer, and the most suitable number of layers is 5, with 160-100-40-20-10 hidden units, respectively.

In our idea, the units composing different structures of SSAE are not only highly isolated abstract features, but can also be considered as a method of thinking for doing one specific thing: the layers can be the levels of thinking, and the value of each unit is the weight of some unsure factor. In the experiments, better results were obtained by the decreasing structure, so it can be deduced that a sub-total way of thinking is suitable for building detection with LIDAR DSM and optical images, and the sub-total way of thinking can be regarded as the unusual feature of this application. Moreover, five levels of thinking with corresponding weights for each factor is an ideal trade-off. Furthermore, it is important that not each isolated factor itself determines the result directly, but the way of thinking constructed by the SSAE, which can lead to an ideal processing pipeline for some specific applications, given an appropriate number and proportion of prior samples to train the way of thinking.

Acknowledgments: The Vaihingen data set was provided by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) [Cramer, 2010]: http://www.ifp.uni-stuttgart.de/dgpf/DKEPAllg.html. The experimental SVM program comes from libsvm-3.1-[FarutoUltimate3.1Mcode]: faruto and liyang, LIBSVM-faruto Ultimate Version, a toolbox with implements for support vector machines based on libsvm, 2011, software available at http://www.matlabsky.com; and Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. The experimental SAE program comes from the rasmusbergpalm-DeepLearnToolbox-9faf641 toolkit, software available at https://github.com/rasmusbergpalm/DeepLearnToolbox.

Author Contributions: Yiming Yan designed the first part of the experiments and finished the writing of the abstract and the dataset collection; Zhichao Tan wrote the methods and experiments sections and analyzed the data; Nan Su corrected Sections 3 and 4 and designed the second part of the experiments; Chunhui Zhao corrected Sections 1 and 2.

Conflicts of Interest: The authors declare no conflicts of interest.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
