Article

Learning Dual Multi-Scale Manifold Ranking for Semantic Segmentation of High-Resolution Images – Supplementary Material

Mi Zhang 1, Xiangyun Hu 1,2,*, Like Zhao 1, Ye Lv 1, Min Luo 1 and Shiyan Pang 2,3

1 School of Remote Sensing and Information Engineering, Wuhan University, 129 Luoyu Road, Wuhan 430079, China; E-Mails: [email protected]; [email protected]; Web: http://earthvisionlab.whu.edu.cn
2 Collaborative Innovation Center of Geospatial Technology, Wuhan University, Wuhan 430079, China
3 School of Resource and Environmental Sciences, Wuhan University, 129 Luoyu Road, Wuhan 430079, China
* Correspondence: [email protected]; Tel.: +86-27-6877-1528

Version May 9, 2017 submitted to Remote Sens.


1. Additional Experiments


1.1. Additional Evaluation on CamVid

Table 1. Additional evaluation on the CamVid dataset [4,5]. We compare our approaches with FCN-8s [1], SegNet [2], and DeepLabv2 [3].

| Method | Building | Tree | Sky | Car | Sign | Road | Pedestrian | Fence | Pole | Sidewalk | Bicyclist | mean IoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FCN-8s [1] (multi-stage training) | 70.5 | 63.1 | 84.8 | 61.9 | 19.1 | 89.8 | 19.8 | 30.9 | 6.5 | 70.1 | 29.3 | 49.6 |
| SegNet [2] | 75.1 | 68.8 | 91.4 | 77.7 | 52.0 | 92.5 | 71.5 | 44.9 | 52.9 | 79.1 | 69.6 | 55.8 |
| DeepLabv2 [3] (VGG-16 initialization, CRF post-processing) | 93.6 | 95.1 | 88.4 | 90.3 | 26.0 | 97.5 | 73.3 | 47.6 | 7.5 | 86.9 | 23.7 | 63.8 |
| DMSMR | 93.1 | 94.5 | 82.9 | 92.7 | 45.5 | 97.4 | 72.5 | 77.2 | 7.2 | 94.5 | 68.9 | 63.6 |
| DMSMR+CRF | 93.0 | 94.9 | 88.0 | 88.5 | 26.1 | 96.1 | 72.8 | 47.0 | 7.4 | 84.6 | 23.6 | 63.1 |


Table 1 presents additional evaluations on the CamVid dataset, comparing our approach with recent algorithms: FCN-8s [1], SegNet [2], and DeepLabv2 [3]. We also apply CRF post-processing to our DMSMR model. The proposed DMSMR approach achieves a score similar to DeepLabv2, yet our model requires no additional aids such as a pre-trained VGG-16 model or CRF post-processing. Moreover, accuracy may even decrease when CRF post-processing is adopted, because the CRF requires proper parameter settings found by trial and error; as Table 1 shows, the mean IoU score drops by approximately 0.5%. In addition, symmetric encoder-decoder structures such as SegNet help to find potentially small objects in a scene (e.g., the pole class in Table 1), but they integrate contextual information less effectively than the DMSMR and DeepLabv2 approaches.
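For concreteness, the following is a minimal sketch of the fully-connected CRF refinement step applied to per-class probability maps, assuming the widely used pydensecrf package. The kernel parameters (sxy, srgb, compat) are illustrative placeholders rather than the tuned values used in our experiments; in practice they must be chosen by trial and error, which is exactly the sensitivity discussed above.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iters=5):
    """Refine softmax probabilities with a fully-connected CRF.

    image: (H, W, 3) uint8 RGB patch.
    probs: (C, H, W) float32 per-class softmax scores from the network.
    """
    c, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(probs))   # unaries = -log(prob)
    # Smoothness kernel: nearby pixels prefer the same label.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: pixels with similar colours prefer the same label.
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_iters)
    return np.argmax(q, axis=0).reshape(h, w)     # refined label map
```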


1.2. Additional Evaluation on EvLab-SS Dataset



Table 2 reports additional comparisons with the FCN-8s [1], SegNet [2], and DeepLabv2 [3] models on the EvLab-SS benchmark. We also compare these models with our DMSMR model when CRF post-processing is adopted as an additional aid. The CRF post-processing is conducted as presented for DeepLabv2 1. In this experiment, our DMSMR model uses the same parameter settings as in the main text.

1 The CRF parameter settings of the DeepLabv2 model can be found at http://liangchiehchen.com/projects/DeepLabv2_vgg.html


Table 2. Additional evaluation on the EvLab-SS dataset. We compare our approaches with FCN-8s [1], SegNet [2], and DeepLabv2 [3].

| Method | Background | Farmland | Garden | Woodland | Grassland | Building | Road | Structures | Digging Pile | Desert | Waters | Overall Accuracy | mean IoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FCN-8s [1] (multi-stage training) | 53.63 | 17.25 | 0.0 | 34.21 | 6.32 | 49.85 | 18.16 | 28.43 | 13.33 | 1.76 | 2.19 | 49.03 | 20.83 |
| SegNet [2] | 1.99 | 0.52 | 0.0 | 14.06 | 2.27 | 34.30 | 3.18 | 3.59 | 0.01 | 0.30 | 2.50 | 27.47 | 5.71 |
| DeepLabv2 [3] (VGG-16 initialization, CRF post-processing) | 63.71 | 26.97 | 0.0 | 40.11 | 8.88 | 57.67 | 27.22 | 31.04 | 21.29 | 3.71 | 35.34 | 56.32 | 28.72 |
| DMSMR | 40.59 | 22.14 | 0.0 | 62.47 | 8.11 | 68.84 | 39.80 | 51.06 | 14.56 | 16.52 | 19.45 | 54.15 | 22.17 |
| DMSMR+CRF | 63.24 | 28.57 | 0.0 | 40.97 | 8.76 | 57.91 | 28.35 | 31.84 | 21.36 | 3.89 | 38.20 | 57.05 | 29.37 |

[Figure 1: two panels of training curves, plotting training loss (left axis) and accuracy measures (average Jaccard, average recall, mean IoU; right axis) against iteration; panel (a) SegNet, panel (b) DMSMR.]

Figure 1. Training accuracy and loss curves with respect to the number of iterations. (a) Training curves of SegNet; (b) training curves of DMSMR. The DMSMR method trains more stably than SegNet.


As Table 2 shows, the DMSMR method achieves competitive scores when CRF post-processing is applied, with improvements of approximately 3% in overall accuracy and 7% in mean IoU. This indicates that additional aids do help to improve accuracy. However, CRF post-processing may decrease the accuracy of individual classes; for example, the accuracy of the building class drops by approximately 9% because of improper parameter settings in the CRF post-processing model. Every approach in the table achieves moderate scores except the SegNet model. Figure 1 plots the training accuracy and loss curves of SegNet [2] and DMSMR. In the training process, we use the default initialization parameters of the original SegNet model. From Figure 1 and Table 2, it can be inferred that proper model initialization plays a key role when transferring a model trained on close-range images to remote sensing images: SegNet does not train stably, whereas our DMSMR model is comparatively stable. Compared with DeepLabv2, our model is slightly worse in overall accuracy and mean IoU, but it performs much better in some individual classes, such as building, road, and structures. This indicates that more contextual information, for instance orientation and spatial resolution, is needed to further improve performance in local regions.
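The overall accuracy and mean IoU reported in Tables 1 and 2 follow the standard definitions. As a reference, the short sketch below computes both from a label confusion matrix; it is a generic implementation, not code from our experiments.

```python
import numpy as np

def confusion_matrix(pred, gt, n_classes):
    """Accumulate an (n_classes x n_classes) confusion matrix,
    rows indexed by ground truth and columns by prediction."""
    valid = (gt >= 0) & (gt < n_classes)          # skip void/ignore labels
    idx = n_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def overall_accuracy_and_miou(conf):
    tp = np.diag(conf).astype(float)
    overall_acc = tp.sum() / conf.sum()
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)  # per-class IoU
    return overall_acc, np.nanmean(iou)           # NaN-safe mean over classes
```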


2. Analysis of Some Failure Cases



In this section, we analyze some typical failure cases from the challenging EvLab-SS dataset. As mentioned in the main text, high-resolution remote sensing images contain a potentially unlimited scene context and vary in spatial resolution. Figure 2 depicts some typical failure cases on the EvLab-SS benchmark. Figures 2(a) and (b) are aerial images with spatial resolutions of 1 m and 0.25 m, respectively; Figures 2(c) and (d) are satellite images with re-sampled GSDs of 0.2 m and 0.25 m.


Figure 2. Some failure examples on EvLab-SS validation images. The first row shows source patches; the second and third rows show semantic segmentation results and ground truth labellings, respectively. Four kinds of typical failure cases are illustrated, one per column: (a) buildings confused with similar structures on the ground; (b) shadow errors on oblique aerial images; (c) desert indistinguishable from structures; and (d) grassland showing little contrast with woodland.


It can be seen from Figures 2(a) and (b) that buildings in aerial images are easily confused with their surroundings when constraints on regular shapes and on the shadows of oblique buildings are lacking. Moreover, some classes, such as desert and grassland, are indistinguishable from other classes in satellite images, as shown in Figures 2(c) and (d). This phenomenon is mainly attributable to the unclear prior definitions of these classes. To address these problems, spatial resolution needs to be treated as an important factor when processing remote sensing images: by assigning different weights to different spatial resolutions, images captured by different sensors can be integrated into the training procedure (see the sketch below). In addition, prior knowledge, such as orientation, the symmetric structure of buildings, and the shadows of oblique objects, should be treated as critical cues in remote sensing images. Moreover, generative adversarial nets (GANs) [6–8] could be introduced to generate more training samples from open-source remote sensing data, such as OpenStreetMap (OSM 2) and OpenAerialMap (OAM 3).
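One possible realisation of the resolution weighting suggested above is a per-sample weight on the segmentation loss derived from each patch's ground sampling distance. The sketch below is purely illustrative: the weighting rule (ref_gsd / gsd, clamped) and the reference GSD of 0.25 m are assumptions for demonstration, not a scheme we have evaluated.

```python
import torch
import torch.nn.functional as F

def resolution_weighted_loss(logits, target, gsd, ref_gsd=0.25):
    """Cross-entropy loss with a hypothetical per-sample resolution weight.

    logits: (B, C, H, W) raw class scores.
    target: (B, H, W) integer class labels.
    gsd:    (B,) ground sampling distance of each patch, in metres.
    """
    # Coarser patches (larger GSD) receive smaller weights; clamp for stability.
    weight = (ref_gsd / gsd).clamp(0.25, 4.0)                      # (B,)
    per_pixel = F.cross_entropy(logits, target, reduction='none')  # (B, H, W)
    per_sample = per_pixel.flatten(start_dim=1).mean(dim=1)        # (B,)
    return (weight * per_sample).mean()
```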


Author Contributions: Mi Zhang designed the DMSMR network, performed the experimental analysis, and wrote the paper. Xiangyun Hu guided the algorithm design, initiated the EvLab-SS dataset production, and revised the paper. Shiyan Pang helped organize the paper. Like Zhao, Ye Lv, and Min Luo contributed to the design of the project homepage and edited the manuscript.


Conflicts of Interest: The authors declare no conflict of interest.


2 OSM data can be found at http://www.openstreetmap.org
3 OAM data can be found at https://openaerialmap.org

References

1. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR, 2015.
2. Badrinarayanan, V.; Handa, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv preprint arXiv:1511.00561, 2015.
3. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. ICLR, 2015.
4. Brostow, G.; Shotton, J.; Fauqueur, J.; Cipolla, R. Segmentation and Recognition Using Structure from Motion Point Clouds. ECCV, 2008.
5. Brostow, G.; Fauqueur, J.; Cipolla, R. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters 2009, 30, 88–97.
6. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. NIPS 2014, pp. 2672–2680.
7. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
8. Luc, P.; Couprie, C.; Chintala, S.; Verbeek, J. Semantic Segmentation using Adversarial Networks. arXiv preprint arXiv:1611.08408, 2016.

© 2017 by the authors. Submitted to Remote Sens. for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).