An Improved Image Semantic Segmentation Method Based on Superpixels and Conditional Random Fields

Wei Zhao 1, Yi Fu 1, Xiaosong Wei 1 and Hai Wang 2,*

1 Key Laboratory of Electronic Equipment Structure Design, Ministry of Education, Xidian University, Xi’an 710071, China; [email protected] (W.Z.); [email protected] (Y.F.); [email protected] (X.W.)
2 School of Aerospace Science and Technology, Xidian University, Xi’an 710071, China
* Correspondence: [email protected]; Tel.: +86-029-8820-3115
Received: 11 April 2018; Accepted: 17 May 2018; Published: 22 May 2018
Abstract: This paper proposes an improved image semantic segmentation method based on superpixels and conditional random fields (CRFs). The proposed method can take full advantage of the superpixel edge information and the constraint relationships among different pixels. First, we employ fully convolutional networks (FCN) to obtain pixel-level semantic features and utilize simple linear iterative clustering (SLIC) to generate superpixel-level region information, respectively. Then, the segmentation results of image boundaries are optimized by the fusion of the obtained pixel-level and superpixel-level results. Finally, we make full use of the color and position information of pixels to further improve the semantic segmentation accuracy using the pixel-level prediction capability of CRFs. In summary, the improved method has advantages both in terms of excellent feature extraction capability and good boundary adherence. Experimental results on both the PASCAL VOC 2012 dataset and the Cityscapes dataset show that the proposed method achieves a significant improvement in segmentation accuracy in comparison with the traditional FCN model.

Keywords: image semantic segmentation; superpixels; conditional random fields; fully convolutional network
1. Introduction

Nowadays, image semantic segmentation has become one of the key issues in the field of computer vision. A growing number of scenarios, such as autonomous driving, human-machine interaction, and image search engines, demand the extraction of relevant knowledge or semantic information from images [1–3]. As a preprocessing step for image analysis and visual comprehension, semantic segmentation is used to classify each pixel in the image and divide the image into a number of visually meaningful regions. In the past decades, researchers have proposed various methods, including the simplest pixel-level thresholding methods, clustering-based segmentation methods, and graph partitioning segmentation methods [4], to yield image semantic segmentation results. These methods are highly efficient due to their low computational complexity and few parameters. However, their performance is unsatisfactory for image segmentation tasks without any artificial supplementary information. With the growing development of deep learning in the field of computer vision, image semantic segmentation methods based on convolutional neural networks (CNNs) [5–14] have been proposed one after another, far exceeding the traditional methods in accuracy. The first end-to-end semantic segmentation model was proposed as a CNN variant by Long et al. [15], known as FCN. They popularized CNN architectures for dense predictions without any fully connected layers. Unlike CNNs, the output of FCN during image semantic segmentation is a two-dimensional matrix instead of a one-dimensional vector.
For the first time, image classification was realized at the pixel level, which brought a significant improvement in accuracy. Subsequently, a large number of FCN-based methods [16–20] have been proposed to promote the development of image semantic segmentation. As one of the most popular pixel-level classification methods, the DeepLab models proposed by Chen et al. [21–23] make use of the fully connected CRF as a separate post-processing step in their pipeline to refine the segmentation result. The earliest version, DeepLab-v1 [21], overcomes the poor localization property of deep networks by combining the responses at the final FCN layer with a CRF for the first time. By using this model, all pixels, no matter how far apart they lie, are taken into account, rendering the system able to recover detailed structures in the segmentation that were lost due to the spatial invariance of the FCN. Later, Chen et al. extended their previous work and developed DeepLab-v2 [22] and DeepLab-v3 [23] with improved feature extractors, better object scale modeling, careful assimilation of contextual information, improved training procedures, and increasingly powerful hardware and software. Benefiting from the fine-grained localization accuracy of CRFs, the DeepLab models are remarkably successful in producing accurate semantic segmentation results.

At the same time, the superpixel method, a well-known image pre-processing technique, has developed rapidly in recent years. Existing superpixel segmentation methods can be classified into two major categories: graph-based methods [24,25] and gradient-ascent-based methods [26,27]. As one of the most widely used methods, SLIC adapts a k-means clustering approach to efficiently generate superpixels and has been proved better than other superpixel methods in nearly every respect [27]. SLIC deserves consideration for the application of image semantic segmentation due to its advantages, such as low complexity, compact superpixel size, and good boundary adherence.

Although researchers have made some achievements, there is still much room for improvement in image semantic segmentation. We observe that useful details such as the boundaries of images are often neglected because of the inherent spatial invariance of FCN. In this paper, an improved image semantic segmentation method is presented, which is based on superpixels and CRFs. Once the high-level abstract features of images are extracted, we can make use of low-level cues, such as the boundary information and the relationships among pixels, to improve the segmentation accuracy. The improved method is briefly summarized as follows: First, we employ FCN to extract the pixel-level semantic features, while the SLIC algorithm is chosen to generate the superpixels. Then, the fusion of the two obtained pieces of information is implemented to get the boundary-optimized semantic segmentation results. Finally, the CRF is employed to optimize the results of semantic segmentation through its accurate boundary recovery ability. The improved method possesses not only excellent feature extraction capability but also good boundary adherence.

The rest of this paper is organized as follows. Section 2 provides an overview of our method. Section 3 describes the key techniques in detail. In Section 4, experimental results and discussion are given. Finally, some conclusions are drawn in Section 5.
2. Overview

An improved image semantic segmentation method based on superpixels and CRFs is proposed to improve the performance of semantic segmentation. In our method, the process of semantic segmentation can be divided into two stages. The first stage is to extract as much semantic information as possible from the input image. In the second stage (also treated as post-processing steps), we intend to optimize the coarse features generated during the first stage. As a widely used post-processing technique, CRFs are introduced into the image semantic segmentation. However, the performance of CRFs depends on the quality of the feature maps, so it is necessary to optimize the first-stage results. To address this problem, the boundary information of superpixels is combined with the output of the first stage for boundary optimization, which helps CRFs recover the information of boundaries more accurately.
Figure 1 illustrates the flow chart of the proposed method, in which red boxes denote the two stages and blue boxes denote the three processing steps. In the first step, we use the FCN model to extract the feature information from the input image to obtain pixel-level semantic labels. Although the trained FCN model has fine feature extraction ability, the result is still relatively rough. Meanwhile, the SLIC algorithm is employed to segment the input image and generate a large number of superpixels. In the second step, which is also the most important, we use the coarse features to reassign semantic predictions within each superpixel. Benefiting from the good image boundary adherence of superpixels, we obtain the results of boundary optimization. In the third step, we employ CRFs to predict the semantic label of each pixel for further refining the segmentation boundaries. At this point, the final semantic segmentation result is obtained. Both the high-level semantic information and the low-level cues in image boundaries are fully utilized in our method.
Figure 1. The flow chart of the proposed method.
3. Key Techniques

In this section, we describe the theory of feature extraction, boundary optimization, and accurate boundary recovery. The key point is the application of superpixels, which affects both the optimization of the coarse features and the pixel-by-pixel prediction of the CRF model. The technical details are discussed below.

3.1. Feature Extraction

Different from the classic CNN model, the FCN model can take images of arbitrary size as inputs and generate correspondingly-sized outputs, so we employ it to implement the feature extraction in our semantic segmentation method. In this paper, the VGG-based FCN is selected due to its best performance among the various structures of the FCN model. The VGG-based FCN is transformed from VGG-16 [28] by replacing the fully connected layers with convolutional ones and keeping the first five layers. After multiple iterations of convolution and pooling, the resolution of the resulting feature map gets lower and lower. Upsampling is required to restore the coarse features to an output image of the same size as the input one. In the implementation procedure, the resolution of the feature maps is reduced by 2, 4, 8, 16, and 32 times, respectively. Then, upsampling the output of the last layer by 32 times gives the result of FCN-32s. Because the large magnification leads to a lack of image details, the results of FCN-32s are not accurate enough. To improve the accuracy, we add more detailed information from the last few layers and combine it with the output of FCN-32s. By this means, the FCN-16s and the FCN-8s can be derived.
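To make the derivation above concrete, the following sketch shows how class score maps taken from three depths of the network can be fused in the FCN-8s manner. It is only an illustrative sketch: PyTorch is assumed, plain bilinear upsampling stands in for the learned deconvolution filters of the original FCN, and the function name fuse_fcn8s is ours.

```python
import torch
import torch.nn.functional as F

def fuse_fcn8s(score_pool3, score_pool4, score_pool5, out_size):
    """Fuse per-layer class scores in the FCN-8s style.

    score_pool3/4/5 have strides 8, 16 and 32 with respect to the input,
    i.e., shapes (N, C, H/8, W/8), (N, C, H/16, W/16) and (N, C, H/32, W/32).
    out_size is the (H, W) of the input image.
    """
    # 2x upsample the coarsest prediction and add the stride-16 scores (FCN-16s).
    up = F.interpolate(score_pool5, size=score_pool4.shape[2:], mode="bilinear",
                       align_corners=False) + score_pool4
    # 2x upsample again and add the stride-8 scores (FCN-8s).
    up = F.interpolate(up, size=score_pool3.shape[2:], mode="bilinear",
                       align_corners=False) + score_pool3
    # Final 8x upsampling back to the input resolution.
    return F.interpolate(up, size=out_size, mode="bilinear", align_corners=False)

# Toy check with random score maps for a 224 x 224 input and 21 classes.
s3, s4, s5 = (torch.randn(1, 21, 28, 28), torch.randn(1, 21, 14, 14),
              torch.randn(1, 21, 7, 7))
print(fuse_fcn8s(s3, s4, s5, (224, 224)).shape)  # torch.Size([1, 21, 224, 224])
```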
An important problem that needs to be solved in semantic segmentation is how to combine “where” with “what” effectively. In other words, semantic segmentation has to classify the image pixel by pixel and combine the position and classification information together. On the one hand, due to the difference in receptive fields, the resolution is relatively high in the first few convolutional layers, and the positioning of the pixels is more accurate.
On the other hand, in the last few convolutional layers, the resolution is relatively low and the classification of the pixels is more accurate. An example of the three models is shown in Figure 2.
Figure 2. The features extracted from FCN models.
It can be observed from Figure 2 that the result of FCN-32s is drastically smoother and less refined. This is because the receptive field of the FCN-32s model is larger and more suitable for macroscopic perception. In contrast, the receptive field of the FCN-8s model is smaller and more suitable for perceiving details. From Figure 2 we can also see that the result of FCN-8s, which is closest to the ground truth, is significantly better than those of FCN-16s and FCN-32s. Therefore, we choose FCN-8s as the front end to extract the coarse features of images. However, the results of FCN-8s are still far from perfect and insensitive to the details of images. In the following, the two-step optimization is introduced in detail.

3.2. Boundary Optimization

In this section, more attention is paid to the optimization of image boundaries. There are some image processing techniques that can be used for boundary optimization. For example, the method based on graph cuts [29] can obtain better edge segmentation results, but it relies on human interaction, which is unacceptable for processing a large number of images. In addition, some edge detection algorithms [30] are often used to optimize the boundaries of images. These algorithms share the common feature that the parameters for a particular occasion are fixed, which makes them more applicable to specific applications. However, when solving general border tracing problems for images containing unknown objects and backgrounds, the fixed-parameter approach often fails to achieve the best results. In this work, the superpixel is selected for the boundary optimization purpose. Generally, a superpixel can be treated as a set of pixels that are similar in location, color, texture, etc.
According to this similarity, superpixels have a certain visual significance in comparison with pixels. Although a single superpixel has no valid semantic information, it is a part of an object that has semantic information. Besides, the most important property of superpixels is their ability to adhere to image boundaries. Based on this property, superpixels are applied to optimize the coarse features extracted by the front end.

As shown in Figure 3, SLIC is used to generate superpixels, and then the coarse features are optimized by the object boundaries from these superpixels. To some degree, this method can improve the segmentation accuracy of the object boundaries. The critical algorithm of boundary optimization is completely demonstrated in Algorithm 1.

Figure 4 shows the result of boundary optimization by applying Algorithm 1. As can be seen from the partial enlarged details in Figure 4, the edge information of superpixels is utilized effectively, and thus a more accurate result can be obtained.

It is observed from the red box in Figure 4 that a number of superpixels with sharp, smooth, and prominent edges adhere to the boundaries of objects well. Due to diffusion errors in the upsampling process, a few pixels inside these superpixels have different semantic information. A common mistake is to misclassify background pixels as another class. Figure 4 shows that this kind of mistake can be corrected effectively using our optimization algorithm.
There are some other superpixels with more types of semantic information, while the number of pixels with different classifications is about the same. These superpixels can be found in the weak edges or thin structures of images, which are easy to misclassify in a complex environment. For these superpixels, we keep the segmentation results as those delivered by the front end.
Figure 3. Boundary optimization using superpixels.

Algorithm 1. The algorithm of boundary optimization.
1. Input image I and coarse features L.
2. Apply the SLIC algorithm to segment the whole image into K superpixels R = {R_1, R_2, ..., R_K}, in which R_i is a superpixel region with label i.
3. Outer loop: For i = 1:K
   ① Use M = {C_1, C_2, ..., C_N} to refer to all pixels in R_i, in which C_j is a pixel with classification j.
   ② Get the feature of each pixel in C from the front end. Initialize the weight W_C with 0.
   ③ Inner loop: For j = 1:N
      Save the feature label of C_j as L_Cj, and update the weight W_Cj of this label in the entire superpixel:
      W_Cj = W'_Cj + 1/N, in which W'_Cj denotes the last value of W_Cj.
      If W_Cj > 0.8, then exit the inner loop.
      End
   ④ Search for W_C.
      If there is a W_Cj > 0.8, then move on to the next step.
      Else
         Search the maximum W_max and the sub-maximum W_sub.
         If W_max − W_sub > 0.2, then move on to the next step.
         Else continue the outer loop.
      End
   ⑤ Reassign the classification of the current superpixel with L_Cmax.
4. Output the image Ĩ.
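For reference, Algorithm 1 can be written compactly with NumPy and scikit-image. The sketch below is only an illustration under our own naming (boundary_optimize); it assumes the coarse label map from the front end is available as a 2-D integer array, the thresholds 0.8 and 0.2 follow the algorithm above, and the SLIC compactness value is an assumption.

```python
import numpy as np
from skimage.segmentation import slic

def boundary_optimize(image, coarse_labels, n_segments=1000,
                      t_major=0.8, t_margin=0.2):
    """Reassign the coarse per-pixel labels inside each superpixel (Algorithm 1).

    image:         H x W x 3 RGB array.
    coarse_labels: H x W integer label map produced by the front end (FCN-8s).
    """
    refined = coarse_labels.copy()
    segments = slic(image, n_segments=n_segments, compactness=10)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        labels, counts = np.unique(coarse_labels[mask], return_counts=True)
        fractions = counts / counts.sum()          # the weights W_C of Algorithm 1
        order = np.argsort(fractions)[::-1]
        top = fractions[order[0]]
        second = fractions[order[1]] if len(order) > 1 else 0.0
        # Reassign only when one class clearly dominates the superpixel;
        # otherwise the front-end labels are kept unchanged.
        if top > t_major or (top - second) > t_margin:
            refined[mask] = labels[order[0]]
    return refined
```

Since only the label histogram of each superpixel is needed, the inner loop of Algorithm 1 reduces to the fraction test shown here.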
Figure 4. Boundary optimization and partial enlarged details.
3.3. Accurate Boundary Recovery
After the above boundary optimization, it is still necessary to improve the segmentation accuracy for thin structures, weak edges, and complex superpositions. Therefore, we employ the CRF model to recover the boundaries more accurately, i.e., to perform further optimization. To clearly show the effect of the CRF model, an example is given in Figure 5.
Figure 5. An example of before/after CRF: (a) before CRF, (b) after CRF, (c) ground truth.
Consider the pixel-wise labels as random variables and the relationships between pixels as edges; these correspondingly constitute a conditional random field. These labels can be modelled after we obtain global observations, which are usually the input images. In detail, a global observation I is represented as the input image of N pixels in our method. Then, given a graph G = (V, E), V and E denote the vertices and the edges of the graph, respectively. Let X be the vector formed by the random variables X_1, X_2, ..., X_N, in which X_i is the random variable representing the label assigned to the pixel i. A conditional random field conforms to the Gibbs distribution, and the pair (I, X) can be modelled as

P(X = x | I) = \frac{1}{Z(I)} \exp(-E(x | I)),  (1)

in which E(x) is the Gibbs energy of a labeling x ∈ L^N and Z(I) is the partition function [31]. The fully connected CRF model [32] employs the energy function

E(x) = \sum_i \phi_i(x_i) + \sum_{i,j} \psi_{i,j}(x_i, x_j),  (2)

in which \phi_i(x_i) is the unary potential that represents the probability of the pixel i taking the label x_i, and \psi_{i,j}(x_i, x_j) is the pairwise potential that represents the cost of assigning labels x_i, x_j to pixels i, j at the same time. In our method, the unary potentials can be treated as the boundary-optimized feature map, which helps improve the performance of the CRF model.
The pairwise potentials usually model the relationship among neighboring pixels and are weighted by color similarity. The following expression is employed for the pairwise potentials [32]:

\psi_{i,j}(x_i, x_j) = \mu(x_i, x_j)\left[\omega_1 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\right) + \omega_2 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\right)\right],  (3)
in which the first term depends on both pixel positions and pixel color intensities, and the second term only depends on pixel positions. I_i, I_j are the color vectors, and p_i, p_j are the pixel positions. The other parameters are described in the previous work [32]. As shown in the Potts model [33], \mu(x_i, x_j) is equal to 1 if x_i ≠ x_j, and 0 otherwise. It means that nearby similar pixels assigned different labels should be penalized. In other words, similar pixels are encouraged to be assigned the same label, whereas pixels that differ greatly in “distance” are assigned different labels. The definition of “distance” is related to the color and the actual distance; thus, the CRF can segment images at the boundary as much as possible. Partial enlarged details shown in Figure 6 are used to explain the analysis of the accurate boundary recovery.
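As a worked example of Eq. (3), the snippet below evaluates the pairwise potential for a single pair of pixels. It is a toy illustration only; the parameter values are the ones reported later in Section 4.1, and the function name pairwise_potential is ours.

```python
import numpy as np

def pairwise_potential(pi, pj, Ii, Ij, same_label,
                       w1=5.0, w2=3.0, s_alpha=49.0, s_beta=3.0, s_gamma=3.0):
    """Evaluate Eq. (3) for one pixel pair (positions pi, pj and colors Ii, Ij)."""
    if same_label:                     # mu(x_i, x_j) = 0 when the labels agree
        return 0.0
    dp2 = np.sum((np.asarray(pi, dtype=float) - np.asarray(pj, dtype=float)) ** 2)
    dI2 = np.sum((np.asarray(Ii, dtype=float) - np.asarray(Ij, dtype=float)) ** 2)
    appearance = w1 * np.exp(-dp2 / (2 * s_alpha ** 2) - dI2 / (2 * s_beta ** 2))
    smoothness = w2 * np.exp(-dp2 / (2 * s_gamma ** 2))
    return appearance + smoothness

# Two adjacent pixels with nearly identical colors but different labels
# receive a large penalty, which discourages such label assignments.
print(pairwise_potential((10, 10), (11, 10), (120, 80, 60), (122, 82, 61), False))
```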
Figure 6. Accurate boundary recovery and partial enlarged details.
Figure 6. Accurate boundary recovery and partial enlarged details. 4. Experimental Evaluation
In this section, we first describe the experimental setup, including the datasets and the selection of parameters. Next, we provide a comprehensive ablation study of each component of the improved method. Then, the evaluation of the proposed method is given, together with other state-of-the-art methods. Qualitative and quantitative experimental results are presented in full, and the necessary comparisons are performed to validate the competitive performance of our method.

4.1. Experimental Setup

We use the PASCAL VOC 2012 segmentation benchmark [34], as it has become the standard dataset to comprehensively evaluate any new semantic segmentation method. It involves 20 foreground classes and one background class. For our experiments on VOC 2012, we adopt the extended training set of 10,581 images [35] and a reduced validation set of 346 images [20]. We further evaluate the improved method on the Cityscapes dataset [36], which focuses on semantic understanding of urban street scenes. It consists of around 5000 finely annotated images of street scenes and 20,000 coarsely annotated ones, in which all annotations are from 19 semantic classes.

From Figure 7, it can be seen that, as the number of superpixels increases, a single superpixel gets closer to the edge of the object. Most images in the VOC dataset have a resolution of around 500 × 500, so we set the number of superpixels to 1000. For the Cityscapes dataset, we set the number of superpixels to 6000 due to the high resolution of its images. In our experiments, 10 mean field iterations are employed for the CRF. Meanwhile, we use the default values of ω2 = σγ = 3 and set ω1 = 5, σα = 49, and σβ = 3 by the same strategy as in [22].
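For readers who want to reproduce this stage, a refinement step in the spirit of our CRF post-processing can be sketched with the third-party pydensecrf package, which implements the fully connected CRF of [32]. The mapping of (ω1, ω2, σα, σβ, σγ) onto the sxy, srgb, and compat arguments below is our own reading of that API and should be treated as an assumption rather than the exact experimental configuration.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iters=10):
    """Refine a probability map with a fully connected CRF.

    image: H x W x 3 uint8 RGB array.
    probs: C x H x W class probabilities; in our method the boundary-optimized
           feature map plays the role of the unary term.
    """
    c, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(np.clip(probs, 1e-8, 1.0)))
    # Smoothness kernel: pixel positions only (second term of Eq. (3)).
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: positions and colors (first term of Eq. (3)).
    d.addPairwiseBilateral(sxy=49, srgb=3,
                           rgbim=np.ascontiguousarray(image), compat=5)
    q = d.inference(n_iters)            # 10 mean field iterations by default
    return np.argmax(np.array(q).reshape(c, h, w), axis=0)
```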
Figure 7. Results with different superpixel numbers. The first row: SLIC segmentation results with 100, 500, and 1000 superpixels on an example image in the VOC dataset. The second row: SLIC segmentation results with 1000, 3000, and 6000 superpixels on an example image in the Cityscapes dataset.
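As a usage note, the superpixel counts above map directly onto the n_segments argument of the SLIC implementation in scikit-image; the short sketch below illustrates this (the astronaut test image is only a stand-in for a VOC-sized input, and the compactness value is an assumption).

```python
import numpy as np
from skimage import data
from skimage.segmentation import slic

img = data.astronaut()                       # 512 x 512, roughly VOC-sized
segments = slic(img, n_segments=1000, compactness=10)
print(len(np.unique(segments)))              # superpixels actually produced

# For 2048 x 1024 Cityscapes frames we would request n_segments=6000 instead.
```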
The standard Jaccard Index (Figure 8), also known as the PASCAL VOC intersection-over-union (IoU) metric [34], is introduced for the performance assessment in this paper:
Figure 8. The standard Jaccard Index, in which TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels.
According to the Jaccard Index, many evaluation criteria have been proposed to evaluate the segmentation accuracy. Among them, PA, IoU, and mIoU are often used, and their definitions can be found in the previous work [37]. We assume a total of k + 1 classes (including a background class), and p_ij is the amount of pixels of class i inferred to belong to class j. p_ii represents the number of true positives, while p_ij and p_ji are usually interpreted as false positives and false negatives, respectively. The formulas of PA, IoU, and mIoU are shown below:

Pixel Accuracy (PA):

PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}},  (4)

which computes a ratio between the number of properly classified pixels and the total number of them.

Intersection Over Union (IoU):

IoU = \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}},  (5)

which is used to measure whether the target in the image is detected.

Mean Intersection Over Union (mIoU):

mIoU = \frac{1}{k+1}\sum_{i=0}^{k} IoU_i,  (6)

which is the standard metric for segmentation purposes and is computed by averaging the IoU over all classes. mIoU computes a ratio between the ground truth and our predicted segmentation.
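Equations (4)–(6) can be computed directly from a confusion matrix; the following sketch (with names of our own choosing) illustrates this for two label maps.

```python
import numpy as np

def segmentation_scores(gt, pred, n_classes):
    """Return PA, per-class IoU and mIoU (Eqs. (4)-(6)) for two label maps."""
    # Confusion matrix: entry (i, j) counts pixels of class i predicted as class j.
    conf = np.bincount(n_classes * gt.ravel() + pred.ravel(),
                       minlength=n_classes ** 2).reshape(n_classes, n_classes)
    tp = np.diag(conf).astype(float)
    pa = tp.sum() / conf.sum()                           # Eq. (4)
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp     # per-class union size
    iou = tp / np.maximum(union, 1)                      # Eq. (5)
    return pa, iou, iou.mean()                           # mIoU is Eq. (6)

# Toy example: 2 x 2 label maps with two classes.
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(segmentation_scores(gt, pred, 2))   # (0.75, array([0.5, 0.666...]), 0.583...)
```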
4.2. Ablation Study

The core idea of the improved method lies in the utility of superpixels and the optimization of the CRF model. First, to evaluate the importance of the utility of superpixels, we directly compared the plain FCN-8s model with the boundary-optimized one. Then, the FCN-8s with CRF and our proposed method are implemented sequentially. For better understanding, the results of these comparative experiments are shown in Figure 9 and Table 1.
Figure 9. An example result of the comparative experiments. (a) The input image, (b) the result of plain FCN-8s, (c) the boundary optimization result by superpixels, (d) the result of FCN-8s with CRF post-processing, (e) the result of our method, and (f) the ground truth.
From the results, we see that methods with boundary optimization by superpixels consistently perform better than the counterparts without optimization. For example, as shown in Table 1, the method with boundary optimization is 3.2% better than the plain FCN-8s on the VOC dataset, while the improvement becomes 2.8% on the Cityscapes dataset. It can be observed that the object boundaries in Figure 9c are closer to the ground truth than those in Figure 9b. Based on the above experimental results, we can conclude that the segmentation accuracy of the object boundaries can be improved by the utility of superpixels. Table 1 also shows that, for the FCN-8s with CRF applied as a post-processing step, better performance can be observed after boundary optimization. As shown in Figure 9d,e, the results of our method are more similar to the ground truth than those without boundary optimization. The mIoU scores of the two right-most columns also corroborate this point, with an improvement of 5% on the VOC dataset and 4.1% on the Cityscapes dataset.
Table 1. The mIoU scores of the comparative experiments.

Dataset\Method | Plain FCN-8s | With BO 1 | With CRF | Our Method
VOC 2012 | 62.7 | 65.9 | 69.5 | 74.5
Cityscapes | 56.1 | 58.9 | 61.3 | 65.4

1 BO denotes the Boundary Optimization.
The purpose of using CRFs is to recover the boundaries more accurately. We compared the performance of plain FCN-8s with/without CRF under the same situation. From Table 1, it is clear that the CRF consistently boosts classification scores on both the VOC dataset and the Cityscapes dataset. Besides, we compared the performance of the boundary-optimized FCN-8s with/without CRF. In conclusion, methods optimized by CRFs outperform the counterparts by a significant margin, which shows the importance of boundary optimization and CRFs.
4.3. Experimental Results
4.3.1. Qualitative Analysis
According to the method proposed in this paper, we have obtained improved semantic segmentation results. The comparisons on the VOC dataset among FCN-8s, DeepLab-v2 [22], and our method are shown in Figure 10.
Figure 10. Qualitative results on the reduced validation set of PASCAL VOC 2012. From left to right: the original image, SLIC segmentation, FCN-8s results, DeepLab-v2 results, our results, and the ground truth.
As shown in Figure 10, our results are significantly closer to the ground truth than those by FCN-8s, especially for the segmentation of the object edges in images. By comparing the results obtained by FCN-8s with ours, it can be seen that they are the same color and they have a similar object outline, which indicates that our method inherits the feature extraction and object recognition abilities of FCN. In addition, our method can get finer details than DeepLab-v2 in the vast majority of categories, which profits from the boundary optimization of the coarse features. From the segmentation results of each image in Figure 10, we also notice the following details: (a) All these methods can identify the classification of a single large object and locate its region accurately, due to the excellent extraction ability of CNNs. (b) In complex scenes, these methods cannot work effectively due to errors in the extracted features, even when employing different post-processing steps. (c) Small objects may be misidentified or missed, due to the lack of pixels describing them.

It is worth mentioning that, in some details of Figure 10, such as the rear wheel of the first bicycle, the results of DeepLab-v2 seem better than ours. The segmentation result of DeepLab-v2 almost completely retained the rear wheel of the first bicycle. However, it cannot be overlooked that DeepLab-v2 fails to properly process the front wheels of the two bicycles. In contrast, the front wheels processed by our method are closer to those of the ground truth. For the rear wheel missing problem, some analysis is made on the difference among the bicycle body, the solid rear wheel, and the background. The solid rear wheel has a distinct color in comparison with the background, but at the same time, the same case exists between the bicycle body and the solid rear wheel, leading to a large probability of misrecognition. Additionally, the non-solid wheel is much more common than the solid one in real scenarios; therefore, the CRF model tends to predict a non-solid wheel.

The proposed semantic segmentation method is also implemented on the Cityscapes dataset. Some visual results produced by FCN-8s, DeepLab-v2, and our method are shown in Figure 11. Similar to the results on the PASCAL VOC 2012 dataset, the improved method achieves better performance than the others.
Figure 11. Visual results on the validation set of Cityscapes. From left to right: the original image, SLIC segmentation, FCN-8s results, DeepLab-v2 results, our results, and the ground truth.
4.3.2. Quantitative Analysis

PASCAL VOC 2012 Dataset

The per-class IoU scores on the VOC dataset among FCN-8s, the proposed method, and other popular methods are shown in Table 2. It can be observed that our method achieves the best performance in most categories, which is consistent with the conclusion obtained in the qualitative
analysis. In addition, the proposed method outperforms prior methods in the mIoU metric; it reaches the highest accuracy of 74.5%.
Table 2. Per-class IoU score on VOC dataset. Best performance of each category is highlighted in bold.

Class | FCN-8s | Zoom-out [38] | DeepLab-v2 | CRF-RNN [20] | GCRF [39] | DPN [40] | Our Method
areo | 74.3 | 86.6 | 85.5 | 85.6 | 87.5 | 85.2 | 87.7
bike | 36.8 | 37.3 | 37.2 | 39.0 | 43.9 | 59.4 | 40.1
bird | 77.0 | 83.2 | 82.1 | 79.7 | 83.3 | 78.4 | 83.1
boat | 52.4 | 62.5 | 65.6 | 64.2 | 65.2 | 64.9 | 66.3
bottle | 67.7 | 66.0 | 71.2 | 68.3 | 68.3 | 70.3 | 74.2
bus | 75.4 | 85.1 | 88.3 | 87.6 | 89.0 | 89.3 | 91.3
car | 71.4 | 80.7 | 82.8 | 80.8 | 82.7 | 83.5 | 82.3
cat | 76.3 | 84.9 | 85.6 | 84.4 | 85.3 | 86.1 | 87.5
chair | 23.9 | 27.2 | 36.6 | 30.4 | 31.1 | 31.7 | 33.6
cow | 69.7 | 73.2 | 77.3 | 78.2 | 79.5 | 79.9 | 82.2
table | 44.5 | 57.5 | 51.8 | 60.4 | 63.3 | 62.6 | 62.3
dog | 69.2 | 78.1 | 80.2 | 80.5 | 80.5 | 81.9 | 85.9
horse | 61.8 | 79.2 | 77.1 | 77.8 | 79.3 | 80.0 | 83.0
mbike | 75.7 | 81.1 | 75.7 | 83.1 | 85.5 | 83.5 | 83.4
person | 75.7 | 77.1 | 82.0 | 80.6 | 81.0 | 82.3 | 86.6
plant | 44.3 | 53.6 | 52.0 | 59.5 | 60.5 | 60.5 | 56.9
sheep | 68.2 | 74.0 | 78.2 | 82.8 | 85.5 | 83.2 | 86.3
sofa | 34.1 | 44.9 | 49.4 | 49.2 | 47.8 | 52.0 | 53.4
train | 75.5 | 71.7 | 79.7 | 78.3 | 77.3 | 77.9 | 80.4
tv | 52.7 | 63.3 | 66.7 | 67.1 | 65.1 | 65.0 | 69.9
mIoU | 62.7 | 69.6 | 71.2 | 72.0 | 73.2 | 74.1 | 74.5
This work proposed a post-processing method to improve the FCN-8s result. During the implementation procedure, the superpixel and the CRF are used successively to improve the coarse results extracted by FCN-8s. Therefore, the comparison with FCN-8s is the most direct way to illustrate the improvement achieved by our method. Meanwhile, to further evaluate the performance of our method, a comparison with DeepLab has been made. The IoU, mIoU, and PA scores of FCN-8s, DeepLab-v2, and our method are given in Figure 12.
(Bar chart: per-class IoU together with mIoU and PA for FCN-8s, DeepLab-v2, and our method; mIoU = 62.7, 71.2, and 74.5, and PA = 75.9, 91.6, and 93.2, respectively.)
Figure 12. IoU, mIoU, and PA scores on the PASCAL VOC 2012 dataset.
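For reference, the IoU, mIoU, and PA values reported in Table 2 and Figure 12 follow the standard definitions computed from a class confusion matrix. The snippet below is a minimal NumPy sketch of these metrics rather than the code used in our experiments; the function names, and the assumption that void pixels are labeled 255 as in PASCAL VOC, are illustrative choices.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes, ignore_label=255):
    """Accumulate a num_classes x num_classes confusion matrix for one image."""
    gt, pred = gt.flatten(), pred.flatten()
    valid = gt != ignore_label  # PASCAL VOC marks void pixels with 255
    hist = np.bincount(num_classes * gt[valid] + pred[valid],
                       minlength=num_classes ** 2)
    return hist.reshape(num_classes, num_classes)

def scores(hist):
    """Per-class IoU, mIoU, and pixel accuracy (PA) from a confusion matrix."""
    hist = hist.astype(np.float64)
    tp = np.diag(hist)
    with np.errstate(divide='ignore', invalid='ignore'):
        iou = tp / (hist.sum(axis=1) + hist.sum(axis=0) - tp)  # TP / (TP + FP + FN)
    pa = tp.sum() / hist.sum()                                  # overall pixel accuracy
    return iou, np.nanmean(iou), pa

# Accumulate over a validation set (21 classes for VOC, including background):
# hist = sum(confusion_matrix(g, p, 21) for g, p in zip(gt_maps, pred_maps))
# per_class_iou, miou, pa = scores(hist)
```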
It can be observed from Figure 12 that, compared with FCN-8s, the proposed method significantly improves the IoU score in every category, with higher mIoU and PA scores. In addition, our method is better than DeepLab-v2 in most categories (except for aeroplane, car, and chair). The detailed improvement statistics are shown in Figure 13.
(Bar chart: per-class IoU improvement of our method over FCN-8s and over DeepLab-v2; the mIoU/PA gains are 11.8/17.4 points over FCN-8s and 3.3/1.7 points over DeepLab-v2.)
Figure 13. The improvement by our method compared with FCN-8s and DeepLab-v2.
As shown in Figure 13, our method achieves a significant improvement over FCN-8s, reaching up to 11.8% in mIoU and 17.4% in PA, respectively. For DeepLab-v2, the improvements in mIoU and PA are 3.3% and 1.7%. Moreover, Figure 13 intuitively shows the improvement level of our method in comparison with FCN-8s and DeepLab: the improvement can be clearly observed in 100% (21/21) of the categories when compared with FCN-8s, and in 85.71% (18/21) of the categories when compared with DeepLab-v2. In summary, our method achieved the best performance among these three methods.
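The category-level statistics quoted above (the mIoU/PA gains and the 100% (21/21) and 85.71% (18/21) figures) follow directly from comparing the per-class IoU vectors of two methods. A small sketch of such a comparison, with illustrative names of our own choosing, could be:

```python
import numpy as np

def improvement_summary(iou_ours, iou_base, pa_ours, pa_base):
    """Gains of one method over a baseline from per-class IoU arrays and PA scalars."""
    iou_ours, iou_base = np.asarray(iou_ours), np.asarray(iou_base)
    delta_miou = iou_ours.mean() - iou_base.mean()
    delta_pa = pa_ours - pa_base
    n_improved = int(np.sum(iou_ours > iou_base))
    share = 100.0 * n_improved / len(iou_ours)  # e.g., 21/21 -> 100%, 18/21 -> 85.71%
    return delta_miou, delta_pa, n_improved, share
```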
Cityscapes Dataset

We conducted an experiment on the Cityscapes dataset, which differs from the previous one in its high-resolution images and the number of classes. We used the provided partitions of the training and validation sets, and the obtained results are reported in Table 3. It can be observed that the evaluation on the Cityscapes validation set is similar to that on the VOC dataset. Using our method, the highest mIoU score can reach up to 65.4%. The IoU, mIoU, and PA scores of FCN-8s, DeepLab-v2, and our method are given in Figure 14.

Table 3. Per-class IoU score on the Cityscapes dataset. The best performance of each category is highlighted in bold.

Class          FCN-8s   DPN [40]   CRF-RNN [20]   DeepLab-v2   Our Method
road           95.9     96.3       96.3           96.8         97.2
sidewalk       71.5     71.7       73.9           75.6         78.9
building       85.9     86.7       88.2           88.2         88.8
wall           25.9     43.7       47.6           31.1         35.1
fence          38.4     31.7       41.3           42.6         43.3
pole           31.2     29.2       35.2           41.2         40.2
traffic light  38.3     35.8       49.5           45.3         44.3
traffic sign   52.3     47.4       59.7           58.8         59.3
vegetation     87.3     88.4       90.6           91.6         93.5
terrain        52.1     63.1       66.1           59.6         61.6
sky            87.6     93.9       93.5           89.3         94.2
person         61.7     64.7       70.4           75.8         79.3
rider          32.9     38.7       34.7           41.2         43.9
car            86.6     88.8       90.1           90.1         94.1
truck          36.0     48.0       39.2           46.7         53.4
bus            50.8     56.4       57.5           60.0         66.0
train          35.4     49.4       55.4           47.0         51.8
motorcycle     34.7     38.3       43.9           46.2         47.3
bicycle        60.6     50.0       54.6           71.9         70.4
mIoU           56.1     59.1       62.5           63.1         65.4
(Bar chart: per-class IoU together with mIoU and PA for FCN-8s, DeepLab-v2, and our method on the Cityscapes dataset; mIoU = 56.1, 63.1, and 65.4, and PA = 86.0, 93.9, and 95.1, respectively.)
Figure 14. IoU, mIoU, and PA scores on the Cityscapes dataset.
It can be observed from Figure 14 that the proposed method significantly improves the IoU, mIoU, and PA scores in every category in comparison with FCN-8s. Similar to the results on the VOC 2012 dataset, our method is better than DeepLab-v2 in most categories (except for pole, traffic light, and bicycle). The detailed improvement statistics are shown in Figure 15.

As shown in Figure 15, our method achieves a significant improvement over FCN-8s, reaching up to 9.3% in mIoU and 9.1% in PA, respectively. Moreover, Figure 15 intuitively shows the improvement level of our method in comparison with FCN-8s and DeepLab-v2: the improvement can be clearly observed in 100% (19/19) of the categories when compared with FCN-8s, and in 84.21% (16/19) of the categories when compared with DeepLab-v2.
(Bar chart: per-class IoU improvement of our method over FCN-8s and over DeepLab-v2 on the Cityscapes dataset; the mIoU/PA gains are 9.3/9.1 points over FCN-8s and 2.3/1.2 points over DeepLab-v2.)
Figure 15. The improvement by our method compared with FCN-8s and DeepLab-v2.
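For readers who wish to experiment with a comparable post-processing step, the sketch below strings together the two refinement stages evaluated above: class probabilities are first pooled inside SLIC superpixels (one plausible fusion rule; our paper's exact fusion strategy may differ), and the result is then refined with the fully connected CRF of Krähenbühl and Koltun [32]. It relies on scikit-image and the pydensecrf package; the superpixel count and CRF kernel parameters are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np
from skimage.segmentation import slic
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine(image, probs, n_segments=500, crf_iters=5):
    """Refine coarse FCN probabilities with superpixels and a dense CRF.

    image: H x W x 3 uint8 RGB image
    probs: C x H x W softmax output of the segmentation network
    """
    probs = probs.astype(np.float32)
    num_classes, h, w = probs.shape

    # Stage 1: pool the class probabilities inside each SLIC superpixel so that
    # the prediction adheres to superpixel (and hence object) boundaries.
    segments = slic(image, n_segments=n_segments, compactness=10)
    sp_probs = probs.copy()
    for sp in np.unique(segments):
        mask = segments == sp
        sp_probs[:, mask] = probs[:, mask].mean(axis=1, keepdims=True)

    # Stage 2: fully connected CRF with a smoothness kernel (position only) and
    # an appearance kernel (position + color), following Krähenbühl and Koltun [32].
    d = dcrf.DenseCRF2D(w, h, num_classes)
    d.setUnaryEnergy(unary_from_softmax(sp_probs))
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=60, srgb=10, rgbim=np.ascontiguousarray(image),
                           compat=5)
    q = np.array(d.inference(crf_iters)).reshape(num_classes, h, w)
    return q.argmax(axis=0)  # final per-pixel label map
```

Calling refine(image, softmax_probs) on the softmax output of FCN-8s would yield a refined label map; in practice the kernel widths and compatibility weights need to be tuned per dataset.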
5. Conclusions

In this paper, an improved semantic segmentation method is proposed, which utilizes the superpixel edges of images and the constraint relationship between different pixels. First, our method inherits the ability to extract advanced semantic information from the FCN model. Then, to effectively optimize the boundaries of the results, our method takes advantage of the good boundary adherence of superpixels. Finally, we apply a CRF to further predict the semantic information of each pixel, making full use of the local texture features of the image, the global context information, and a smoothness prior. Experimental results show that our method achieves more accurate segmentation results: the mIoU scores reach up to 74.5% on the VOC dataset and 65.4% on the Cityscapes dataset, which are improvements of 11.8% and 9.3% over FCN-8s, respectively.

Author Contributions: Hai Wang, Wei Zhao, and Yi Fu conceived and designed the experiments; Yi Fu and Xiaosong Wei performed the experiments; Wei Zhao and Yi Fu analyzed the data; Xiaosong Wei contributed analysis tools; Hai Wang, Wei Zhao, and Yi Fu wrote the paper.

Conflicts of Interest: The authors declare no conflict of interest.
References

1. Oberweger, M.; Wohlhart, P.; Lepetit, V. Hands deep in deep learning for hand pose estimation. arXiv 2015, arXiv:1502.06807.
2. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: New York, NY, USA, 2012; pp. 3354–3361.
3. Wan, J.; Wang, D.; Hoi, S.C.H.; Wu, P.; Zhu, J.; Zhang, Y.; Li, J. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; ACM: Orlando, FL, USA, 2014; pp. 157–166.
4. Kang, W.X.; Yang, Q.Q.; Liang, R.P. The comparative research on image segmentation algorithms. In Proceedings of the 2009 First International Workshop on Education Technology and Computer Science, Wuhan, China, 7–8 March 2009; pp. 703–707.
5. Ciresan, D.; Giusti, A.; Gambardella, L.M.; Schmidhuber, J. Deep neural networks segment neuronal membranes in electron microscopy images. In Proceedings of the Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, 3–6 December 2012; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 2843–2851.
6. Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning rich features from rgb-d images for object detection and segmentation. In European Conference on Computer Vision; Springer: New York, NY, USA, 2014; pp. 345–360.
7. Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Simultaneous detection and segmentation. In European Conference on Computer Vision; Springer: New York, NY, USA, 2014; pp. 297–312.
8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; IEEE Computer Society: Washington, DC, USA, 2014; pp. 580–587.
9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
10. Luo, P.; Wang, G.; Lin, L.; Wang, X. Deep dual learning for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2718–2726.
11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
12. Ravì, D.; Bober, M.; Farinella, G.M.; Guarnera, M.; Battiato, S. Semantic segmentation of images exploiting dct based features and random forest. Pattern Recogn. 2016, 52, 260–273. [CrossRef]
13. Fu, J.; Liu, J.; Wang, Y.; Lu, H. Stacked deconvolutional network for semantic segmentation. arXiv 2017, arXiv:1708.04943.
14. Wu, Z.; Shen, C.; Hengel, A.V.D. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv 2016, arXiv:1611.10080.
15. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; IEEE: New York, NY, USA, 2015; pp. 3431–3440.
16. Ghiasi, G.; Fowlkes, C.C. Laplacian pyramid reconstruction and refinement for semantic segmentation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: New York, NY, USA, 2016; pp. 519–534.
17. Lin, G.; Shen, C.; Hengel, A.V.D.; Reid, I. Exploring context with deep structured models for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1352–1366. [CrossRef] [PubMed]
18. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Boston, MA, USA, 8–10 June 2015; pp. 1520–1528.
19. Chen, L.C.; Yang, Y.; Wang, J.; Xu, W.; Yuille, A.L. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June–1 July 2016; IEEE: New York, NY, USA, 2016; pp. 3640–3649.
20. Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.Z.; Du, D.L.; Huang, C.; Torr, P.H.S. Conditional random fields as recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; IEEE: New York, NY, USA, 2015; pp. 1529–1537.
21. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062.
22. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [CrossRef] [PubMed]
23. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
24. Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905.
25. Felzenszwalb, P.F.; Huttenlocher, D.P. Efficient graph-based image segmentation. Int. J. Comput. Vis. 2004, 59, 167–181. [CrossRef]
26. Van den Bergh, M.; Boix, X.; Roig, G.; de Capitani, B.; Van Gool, L. Seeds: Superpixels extracted via energy-driven sampling. In Proceedings of the 12th European Conference on Computer Vision-Volume Part VII, Florence, Italy, 7–13 October 2012; Springer: New York, NY, USA, 2012; pp. 13–26.
27. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Susstrunk, S. Slic superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2281. [CrossRef] [PubMed]
28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
29. Rother, C.; Kolmogorov, V.; Blake, A. Grabcut: Interactive foreground extraction using iterated graph cuts. In Proceedings of the ACM Transactions on Graphics (TOG), Los Angeles, CA, USA, 8–12 August 2004; ACM: New York, NY, USA, 2004; Volume 23, pp. 309–314.
30. Maini, R.; Aggarwal, H. Study and comparison of various image edge detection techniques. Int. J. Image Process. 2009, 3, 1–11.
31. Gadde, R.; Jampani, V.; Kiefel, M.; Kappler, D.; Gehler, P.V. Superpixel convolutional networks using bilateral inceptions. In Computer Vision, ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 597–613.
32. Krähenbühl, P.; Koltun, V. Efficient inference in fully connected crfs with gaussian edge potentials. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; Curran Associates Inc.: Red Hook, NY, USA, 2011; pp. 109–117.
33. Wu, F. The potts model. Rev. Mod. Phys. 1982, 54, 235. [CrossRef]
34. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [CrossRef]
35. Hariharan, B.; Arbelaez, P.; Bourdev, L.; Maji, S.; Malik, J. Semantic contours from inverse detectors. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE Computer Society: Washington, DC, USA, 2011; pp. 991–998.
36. Cordts, M.; Omran, M.; Ramos, S.; Scharwächter, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset. In Proceedings of the CVPR Workshop on the Future of Datasets in Vision, Boston, MA, USA, 11 June 2015; p. 3.
37. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A review on deep learning techniques applied to semantic segmentation. arXiv 2017, arXiv:1704.06857.
38. Mostajabi, M.; Yadollahpour, P.; Shakhnarovich, G. Feedforward semantic segmentation with zoom-out features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3376–3385.
39. Vemulapalli, R.; Tuzel, O.; Liu, M.-Y.; Chellapa, R. Gaussian conditional random field network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June–1 July 2016; pp. 3224–3233.
40. Liu, Z.; Li, X.; Luo, P.; Loy, C.-C.; Tang, X. Semantic image segmentation via deep parsing network. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: New York, NY, USA, 2015; pp. 1377–1385.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).