Image Semantic Segmentation Algorithm Based on Self-learning Super-Pixel Feature Extraction

Juan Wang1,2(✉), Hao Shi1,2, Min Liu1,2, Wei Xiong1,2, Kaiwen Cheng1,2, and Yuhan Jiang1,2

1 Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan, People's Republic of China
[email protected]
2 Hubei Collaborative Innovation Center for High-Efficiency Utilization of Solar Energy, Hubei University of Technology, Wuhan 430068, People's Republic of China
Abstract. Image semantic segmentation is a challenging task, hampered by high segmentation complexity, increasing feature-space sparseness, and inaccurate semantic expression. This paper proposes a stacked deconvolutional network (SDN) based on adaptive super-pixel feature extraction to reduce computational cost and improve segmentation effectiveness. Firstly, super-pixel segmentation is performed by simple linear iterative clustering (SLIC). Secondly, we add texture information to the evaluation function as an optimization term, to guide the super-pixel segmentation and ensure its integrity. Finally, we train a stacked deconvolutional network (SDN) on the ISPRS Potsdam and NZAM/ONERA Christchurch datasets and learn from sample data with weak annotation information to realize accurate and fast super-pixel segmentation. Segmentation tests show that the proposed method achieves accurate segmentation of image semantics.

Keywords: Image semantic segmentation · Stacked neural networks · Super-pixel · Texture extraction
1 Introduction
Facing rapidly growing volumes of image data, giving computers the ability to recognize and segment image information is a problem that needs to be solved. Image semantic segmentation is a key part of image processing and analysis and a classic research branch in the field of computer vision. Shotton et al. used semantic texton forests as a new visual feature and applied the constructed semantic texton model to image semantic segmentation. Girshick et al. proposed a target detection algorithm based on convolutional neural networks, using a top-down detection method to locate and segment objects in the image. Liu et al. used a deformable template to determine the boundaries of objects in the image. Pawan et al. combined the star structure of the object model to segment the foreground and background of an image. These research results show that the academic community has long been committed to solving the problem of image segmentation [1].
From the early N-Cut and Grab-Cut to CNN- and FCN-based approaches that draw on deep learning, image semantic segmentation has gone through a long process of development. However, due to the semantic richness and cognitive complexity of image content, existing research methods are still technically imperfect. Convolutional neural networks (CNNs) broke the performance barrier of many tasks in scene understanding, such as object segmentation, object classification, and object recognition. CNNs have driven progress in the field of recognition: they not only improve whole-image classification but also advance structured output for local tasks, including bounding-box generation in object detection and the localization of parts and key points [2]. However, classification networks with downsampling operations sacrifice the spatial resolution of feature maps to obtain invariance to image transformations. This resolution reduction results in poor object delineation and small spurious regions in segmentation outputs [3]. The advent of deconvolutional neural networks was a milestone because it showed how to train CNNs end-to-end to solve visual problems; it is the cornerstone of image semantic segmentation with deep learning [4]. Even so, such networks still lack the perception of different features, their semantic representation is inaccurate, and they cannot achieve real-time processing speed at high resolution.

A good image semantic segmentation method should have several properties. First, within each semantic region, properties such as grayscale and texture should be similar, so that the region is relatively flat. Second, it should have low segmentation complexity and high segmentation efficiency. Lastly, the boundaries of different semantic regions should be clear and regular [5]. On this basis, this paper proposes a stacked deconvolutional network (SDN) based on adaptive super-pixel feature extraction, which connects SLIC and the SDN in series through an evaluation function, so that accurate and fast image semantic segmentation can be achieved.
2 Work Process
The work process of image semantic segmentation based on self-learning superpixel feature extraction is shown in Fig. 1.
Fig. 1. The process of our method (original image → SLIC superpixel segmentation → evaluation function with feature detection → SDN, trained on the ISPRS and NZAM/ONERA datasets → segmentation result)
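Read end to end, the pipeline of Fig. 1 can be sketched as follows. This is a minimal illustration, not the authors' code: `sdn` stands for the trained network of Sect. 4 (assumed to return a C × H × W array of class scores), `refine_superpixels` stands for the texture-guided evaluation function, and fusing the two stages by a per-superpixel majority vote is our assumption, since the paper does not spell out the fusion rule.

import numpy as np
from skimage.segmentation import slic

def segment(image, sdn, refine_superpixels):
    # image: H x W x 3 uint8 array; sdn and refine_superpixels are
    # hypothetical callables standing in for the trained SDN and the
    # texture-guided evaluation function, respectively.
    labels = slic(image, n_segments=200, compactness=20)  # SLIC superpixels
    labels = refine_superpixels(image, labels)            # evaluation step
    scores = sdn(image)                                   # C x H x W scores
    pred = scores.argmax(axis=0)                          # dense prediction
    out = np.empty_like(pred)
    for sp in np.unique(labels):                          # fuse by majority
        mask = labels == sp                               # vote inside each
        out[mask] = np.bincount(pred[mask]).argmax()      # superpixel
    return out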
3 SLIC Superpixel Segmentation of the Original Image
Image segmentation is a basic step in image processing applications; its purpose is to extract the parts of an image that are of interest, providing the basis for subsequent processing and analysis. The most commonly used image segmentation methods include threshold segmentation, edge-based methods, region-based methods, and segmentation algorithms based on artificial neural networks [6]. However, most of these algorithms take too long and cannot obtain the desired segmentation results. Ren and Malik proposed the concept of superpixel segmentation: the algorithm exploits features of the pixels to divide the image into a large number of small irregular blocks (superpixels), which are then used for subsequent processing. Superpixel segmentation provides a form of image preprocessing and greatly reduces the complexity of subsequent image processing tasks [7]. Due to the importance of superpixel segmentation, a large number of algorithms for generating superpixels have been proposed. These algorithms solve some problems, but most of them are complicated and the generated superpixels match the original image poorly. On this basis, Achanta et al. proposed the simple linear iterative clustering (SLIC) algorithm, which uses an improved K-means clustering algorithm to generate superpixels. After segmentation, the superpixel borders adhere strongly to the original boundaries of the image, and its processing speed and storage efficiency are superior to other superpixel segmentation algorithms [2].
Considering the segmentation results, computational cost, and other factors, we use SLIC superpixel segmentation. It is easy to implement, and since the number of superpixels is essentially its only parameter, it is very easy to apply in practice; it divides the image into regularly shaped regions and seamlessly accommodates grayscale and color images.

First, the cluster centers are initialized: according to the set number of superpixels, the cluster centers are uniformly distributed over the image. In this paper, the total number of pixels in the image is N = 400 × 600, which is to be split into K = 200 superpixels of the same size. Each superpixel thus covers N/K pixels, and the distance between adjacent cluster centers is about S = \sqrt{N/K}. The compactness coefficient m weighs color similarity against spatial proximity: when m is large, the resulting superpixels are more compact, while when m is smaller, the superpixels adhere better to edges. Taking all factors into consideration, we set the coefficient to m = 20. In order to avoid a seed point falling on a gradient contour border and affecting the subsequent clustering, we re-select each cluster center within an n × n neighborhood of its initial position and define labels for the pixels in each neighborhood. For each pixel, the color distance to each cluster center is first calculated by Eq. (1):

d_c = \sqrt{(l_j - l_i)^2 + (a_j - a_i)^2 + (b_j - b_i)^2}    (1)
Secondly, the spatial distance from the pixel to the cluster center is calculated using Eq. (2):

d_s = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}    (2)
Lastly, the measurement distance D' is calculated by Eq. (3):

D' = \sqrt{(d_c / N_c)^2 + (d_s / N_s)^2}    (3)

where, as in the original SLIC formulation, N_s is the grid interval S and N_c is the compactness coefficient m.
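To make Eqs. (1)–(3) concrete, the following is a minimal NumPy sketch of the combined distance; treating N_c and N_s as m and S follows the note above and the original SLIC paper, and the function name is ours, not the authors'.

import numpy as np

def slic_distance(pixel, center, m=20.0, S=35.0):
    # pixel and center are (l, a, b, x, y) tuples: CIELAB color plus image
    # coordinates. m is the compactness coefficient (N_c); S is the cluster
    # grid interval (N_s), S = sqrt(N/K) (about 35 for N = 400 x 600, K = 200).
    l_j, a_j, b_j, x_j, y_j = pixel
    l_i, a_i, b_i, x_i, y_i = center
    d_c = np.sqrt((l_j - l_i)**2 + (a_j - a_i)**2 + (b_j - b_i)**2)  # Eq. (1)
    d_s = np.sqrt((x_j - x_i)**2 + (y_j - y_i)**2)                   # Eq. (2)
    return np.sqrt((d_c / m)**2 + (d_s / S)**2)                      # Eq. (3)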
Based on the similarity between pixels and cluster centers, each pixel is assigned the label of the most similar cluster center. The process is repeated until convergence; we set the number of iterations to 10. After 10 iterations, we reassign discontinuous and undersized superpixels to neighboring superpixels, traverse all pixels to give them the corresponding label, and obtain the segmented image. As can be seen from the segmentation result, as a preprocessing step in this paper the method meets our expectations in terms of edge goodness of fit, segmentation speed, and segmentation performance, as shown in Fig. 2.
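In practice the whole procedure above is available off the shelf. A minimal usage sketch with scikit-image's slic, whose connectivity enforcement corresponds to the reassignment of undersized superpixels just described, using the parameter values from this section (the file names are placeholders):

from skimage import io, img_as_ubyte
from skimage.segmentation import slic, mark_boundaries

image = io.imread("scene.png")            # placeholder input image
labels = slic(
    image,
    n_segments=200,                       # K = 200 superpixels
    compactness=20,                       # m = 20
    max_num_iter=10,                      # 10 iterations, as in the text
    enforce_connectivity=True,            # merge undersized fragments
)
overlay = mark_boundaries(image, labels)  # visualize as in Fig. 2
io.imsave("slic_overlay.png", img_as_ubyte(overlay))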
Fig. 2. Original image and the image after SLIC segmentation
4 Training a Stacked Deconvolutional Network (SDN)
The convolutional neural network is a kind of neural network that has become a research hotspot in the field of image segmentation. Its weight-sharing structure makes it more similar to a biological neural network, reducing the complexity of the network model and the number of weights [8]. Its multi-layer structure can automatically learn features at more than one level of abstraction: shallow convolutional layers have small receptive fields and learn local features, while deep convolutional layers have larger receptive fields and learn more abstract features. These abstract features are less sensitive to the size, position, and orientation of objects, which helps to improve recognition performance; they are very helpful for classification and can determine well what kinds of objects an image contains. Convolutional networks drive the advancement of recognition: whole-image convolution not only improves classification but also advances local tasks with structured output [9]. However, because some details of objects are lost, objects cannot be given a good outline indicating which pixel belongs to which object, so it is difficult to achieve accurate segmentation.

Compared with the convolutional neural network (CNN), the fully convolutional network (FCN) performs well in semantic segmentation. An FCN transforms the fully connected layers of a traditional CNN into convolutional layers [10]. It upsamples to get finer results and can accept input images of any size, without requiring all training and test images to have the same size; because it avoids the duplicate storage and convolution computation caused by processing pixel blocks, it is also more efficient. Nonetheless, the result of FCN upsampling is still blurry and smooth, insensitive to details in the image. At the same time, the relationships between pixels are not fully considered, and the spatial regularization steps used in usual pixel-based segmentation methods are ignored, so the output lacks spatial consistency [11].
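To illustrate the fully-connected-to-convolutional transformation just described, here is a minimal PyTorch sketch (a toy head with assumed sizes, not this paper's network): a classifier trained on fixed-size feature maps is rewritten as a convolution, after which it accepts any input size and emits a coarse score map.

import torch
import torch.nn as nn

# A toy classifier head: a fully connected layer over 512 x 7 x 7 features.
fc_head = nn.Linear(512 * 7 * 7, 21)               # 21 classes, fixed input

# The FCN transformation: the same weights viewed as a 7 x 7 convolution.
conv_head = nn.Conv2d(512, 21, kernel_size=7)
with torch.no_grad():
    conv_head.weight.copy_(fc_head.weight.view(21, 512, 7, 7))
    conv_head.bias.copy_(fc_head.bias)

features = torch.randn(1, 512, 14, 14)             # larger-than-training input
scores = conv_head(features)                       # -> (1, 21, 8, 8) score map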
4.1 Overview

This paper proposes the SDN, which is deeper than most deconvolutional networks but easier to optimize. A deconvolutional network is composed of deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks. In the SDN, several shallow deconvolutional networks, called SDN units, are stacked to consolidate front and back information and ensure good recovery of local information. At the same time, inter-unit and intra-unit connections are designed to facilitate network training and enhance the fusion of features, which improves information and gradient propagation throughout the network. In addition, hierarchical supervision is applied during the upsampling of each SDN unit, which guarantees discriminative feature representations and facilitates network optimization (Fig. 3).
Fig. 3. The structure of the stacked deconvolutional network (built on a DenseNet encoder)
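The stacking scheme and hierarchical supervision can be sketched as follows. This is a minimal PyTorch illustration under our own naming, not the authors' implementation: make_unit builds one SDN unit (sketched in Sect. 4.2), and the channel and class counts are placeholders; each unit is assumed to map a feat_ch-channel feature map to another feat_ch-channel feature map.

import torch.nn as nn

class StackedSDN(nn.Module):
    # Stack of shallow encoder-decoder units with per-unit supervision.
    def __init__(self, make_unit, num_units=2, feat_ch=64, num_classes=6):
        super().__init__()
        self.units = nn.ModuleList(make_unit() for _ in range(num_units))
        # One auxiliary classifier per unit realizes hierarchical supervision.
        self.heads = nn.ModuleList(
            nn.Conv2d(feat_ch, num_classes, kernel_size=1)
            for _ in range(num_units)
        )

    def forward(self, feats):
        scores = []
        for unit, head in zip(self.units, self.heads):
            feats = unit(feats)          # inter-unit connection: each unit
            scores.append(head(feats))   # consumes its predecessor's output
        return scores                    # a loss is attached to every element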
4.2 Deconvolutional Network

We design an efficient shallow deconvolutional network, called an SDN unit, and stack multiple SDN units in sequence, making the proposed SDN capture more contextual information while remaining easy to optimize. However, as the number of stacked SDN units increases, the difficulty of model training becomes a major issue. To solve this problem, we first add hierarchical supervision to the upsampling block of each SDN unit; specifically, the compression layer is mapped to pixel-wise label maps by a classification layer. With this structure, the network can be trained in a more sophisticated way without losing discriminative ability. Second, we introduce intra-unit and inter-unit connections to help optimize the network. These connections are shortcut paths from early layers to later layers, and they benefit the flow of information and gradient propagation throughout the network.
Taking into account the different purposes of information transfer, two types of inter-unit connection are used. One promotes network optimization by connecting the decoder and encoder modules of any two adjacent SDN units. The other connects the multi-scale feature maps of the encoder module of the first SDN unit to the corresponding decoder modules of each SDN unit, to preserve detail in the low-resolution predictions. For a given SDN unit, taking the output of the previous unit as input yields a low-resolution feature map with a larger receptive field. We downsample twice, to 1/16 of the spatial resolution of the input image, as shown in Fig. 4.
Fig. 4. The structure of the downsampling block
The operations can be summarized by Eq. (4):

P_n^i = \mathrm{Max}(F_n^{i-1}), \quad Q_n^i = \mathrm{Trans}([P_n^i, F_{n-1}^{i'}]), \quad F_n^i = \mathrm{Comp}(Q_n^i)    (4)

where [P_n^i, F_{n-1}^{i'}] stands for the concatenation of the feature maps P_n^i and F_{n-1}^{i'}, Max(·) is a max-pooling operation, Trans(·) is the transformation function of the densely connected structure, and Comp(·) is a 3 × 3 convolutional operation.

In the decoder, we apply upsampling to enlarge the feature maps to a much larger resolution. Two upsampling steps bring the resolution back to 1/4 of the spatial resolution of the input image, as shown in Fig. 5.
Fig. 5. The structure of the upsampling block
The operations can be calculated by Eq. (5):

O_n^i = \mathrm{Deconv}(F_n^{i-1}), \quad Q_n^i = \mathrm{Trans}([O_n^i, H_1^k]), \quad F_n^i = \mathrm{Comp}(Q_n^i)    (5)

where Deconv(·) is the deconvolutional operation. Because each unit works at a quarter of the picture resolution, GPU memory usage for a single SDN unit is reduced, so more units can be stacked. Our deconvolutional network offers significant improvements over traditional deconvolutional networks: the simple encoder-decoder architecture and the two simple upsampling and downsampling blocks make it easy to backpropagate through an SDN unit and facilitate end-to-end training.

4.3 Work Details

We stack four convolutional layers at the lowest resolution, each computing a 3 × 3 convolution with the ReLU activation, a dropout probability of 0.2, and 36 convolutional filters. The downsampling blocks perform 3 × 3 convolutions, the upsampling blocks also use 3 × 3 convolutions, and each SDN unit uses ReLU activations. Finally, we train the stacked deconvolutional network (SDN) on the ISPRS Potsdam and NZAM/ONERA Christchurch datasets and learn from sample data with weak annotation information. We use an expanded bounding box to crop windows for training examples. The class labels of each cropped area are provided based only on the centrally located objects, while all other pixels are marked as background. To augment the training data, we resize the input image to 250 × 250, crop it to 224 × 224, and optionally flip it horizontally. We also provide enough training samples to train the stacked deconvolutional network from scratch, with an initial learning rate, momentum, and weight decay of 0.01, 0.9, and 0.0005, respectively.
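The following is a minimal PyTorch sketch of one downsampling/upsampling block pair following Eqs. (4) and (5). The densely connected Trans(·) is reduced here to a single 36-filter dense layer with ReLU and dropout 0.2, per the settings of Sect. 4.3; all channel counts and class names are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class DenseTrans(nn.Module):
    # One-layer stand-in for the densely connected Trans(.) block:
    # output has in_ch + 36 channels (input concatenated with new features).
    def __init__(self, in_ch, growth=36):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, growth, 3, padding=1), nn.ReLU(), nn.Dropout(0.2)
        )
    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)

class DownBlock(nn.Module):
    # Eq. (4): max-pool, concat skip feature, dense transform, 3x3 compress.
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(2)                            # Max(.)
        self.trans = DenseTrans(in_ch + skip_ch)               # Trans(.)
        self.comp = nn.Conv2d(in_ch + skip_ch + 36, out_ch, 3, padding=1)
    def forward(self, f_prev, f_skip):
        p = self.pool(f_prev)              # f_skip must match p's resolution
        q = self.trans(torch.cat([p, f_skip], dim=1))
        return self.comp(q)                                    # Comp(.)

class UpBlock(nn.Module):
    # Eq. (5): deconvolve, concat encoder feature, dense transform, compress.
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, in_ch, 2, stride=2)  # Deconv(.)
        self.trans = DenseTrans(in_ch + skip_ch)
        self.comp = nn.Conv2d(in_ch + skip_ch + 36, out_ch, 3, padding=1)
    def forward(self, f_prev, h_skip):
        o = self.deconv(f_prev)            # doubles the spatial resolution
        q = self.trans(torch.cat([o, h_skip], dim=1))
        return self.comp(q)

The training schedule of Sect. 4.3 would then correspond to, for example, torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005).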
5 SDN for Image Semantic Segmentation
We evaluate our network on the PASCAL VOC 2012 benchmark, which contains 1456 test images covering 20 object categories [12]. Judging from the segmentation results, our method is competitive with the most advanced image semantic segmentation methods. With SLIC superpixel segmentation as preprocessing, semantic segmentation of the image becomes faster and more accurate: compared with the FCN, the mean IoU is expected to increase by 1.6%, reaching excellent accuracy.
6 Conclusion
In this paper, we propose a new method for image semantic segmentation. Firstly, the image is preprocessed by SLIC superpixel segmentation, and then imported
through the evaluation function into the trained SDN network. The stacked structure of the SDN and the connections between its units promote the optimization of the network and give it better segmentation performance, so we achieve fast and accurate segmentation of image semantics.

Acknowledgments. This research was supported by the Program of International Science and Technology Cooperation (2015DFA10940); the Science and Technology Support Program (R&D) Project of Hubei Province (2015BAA115); the PhD Research Startup Foundation of Hubei University of Technology (No. BSQD13037, No. BSQD14028); and the Open Foundation of the Hubei Collaborative Innovation Center for High-Efficiency Utilization of Solar Energy (HBSKFZD2015005, HBSKFTD2016002).
References

1. Liu, Y., Liu, J., Li, Z., Tang, J., Lu, H.: Weakly-supervised dual clustering for image semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2075–2082 (2013)
2. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274–2282 (2012)
3. Fu, J., Liu, J., Wang, Y., Lu, H.: Stacked deconvolutional network for semantic segmentation. arXiv preprint arXiv:1708.04943 (2017)
4. Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Garcia-Rodriguez, J.: A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857 (2017)
5. Shen, J., Du, Y., Wang, W., Li, X.: Lazy random walks for superpixel segmentation. IEEE Trans. Image Process. 23(4), 1451–1462 (2014)
6. Liu, M.Y., Tuzel, O., Ramalingam, S., Chellappa, R.: Entropy rate superpixel segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2097–2104. IEEE (2011)
7. Ren, C.Y., Reid, I.: gSLIC: a real-time implementation of SLIC superpixel segmentation. Technical report, University of Oxford, Department of Engineering (2011)
8. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
9. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915 (2016)
10. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528 (2015)
11. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
12. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)