Article

Aerial-Image Denoising Based on Convolutional Neural Network with Multi-Scale Residual Learning Approach

Chong Chen and Zengbo Xu *

School of Fashion Engineering, Shanghai University of Engineering Science, Shanghai 201620, China; [email protected]
* Correspondence: [email protected]

Received: 13 May 2018; Accepted: 3 July 2018; Published: 9 July 2018
Abstract: Aerial images are subject to various types of noise, which restricts the recognition and analysis of images, target monitoring, and search services. At present, deep learning is successful in image recognition. However, traditional convolutional neural networks (CNNs) extract the main features of an image to predict directly and are limited by the requirements of the training sample size (i.e., they do not perform well on small datasets). In this paper, using a small sample size, we propose an aerial-image denoising recognition model based on CNNs with a multi-scale residual learning approach. The proposed model has the following three advantages: (1) Instead of directly learning latent clean images, the proposed model learns the noise from noisy images and then subtracts the learned residual from the noisy images to obtain reconstructed (denoised) images; (2) The developed image denoising recognition model is beneficial to small training datasets; (3) We use multi-scale residual learning as a learning approach, and dropout is introduced into the model architecture to force the network to generalize well. Our experimental results on aerial-image denoising recognition reveal that the proposed approach is highly superior to the other state-of-the-art methods.

Keywords: CNNs; multi-scale residual learning; aerial imaging; image denoising; dropout
1. Introduction

As one of the classical problems in aerial imaging, aerial images for regional scenes, searching, and military scouting are subject to various types and degrees of noise, which may occur during transmission and acquisition. The appearance of noise in aerial images interferes with the quality of the original images, which has direct or indirect influences that complicate the timely recognition, analysis, target monitoring, and search services [1]. Therefore, image denoising is a crucial issue in the field of computer vision that has been widely discussed and studied by researchers [2–4].

At present, most of the existing denoising methods mainly focus on the scenario of additive white Gaussian noise (AWGN), whereby an observed noisy image y is modeled as a combination of a clean image x and AWGN v; that is, y = x + v. Using a general model, Elad & Aharon [5] presented sparse and redundant representations over learned dictionaries, whereby the proposed algorithm simultaneously trained a dictionary on its content using the K-SVD algorithm. However, its computational complexity is one of the shortcomings of this algorithm.

In recent years, with the development of deep learning, the research results of deep architecture have shown good performance [6–9]. In the task of image denoising, Frost et al. [10] and Kuan et al. [11] proposed spatial linear filters that assume that the resulting values of image filtering are linear with respect to the original image, by searching for the correlation between the intensity of the center
Information 2018, 9, 169; doi:10.3390/info9070169
pixel in the moving window and the average intensity of the filter window. Therefore, the spatial linear filter achieves a trade-off between the constant all-pass identity filter in the edge-containing region and balance in the uniform region. However, the spatial linear filter methods are not enough to integrally preserve edges and details and are unable to preserve the average value, especially when the equivalent number of looks (ENL) of the original Synthetic Aperture Radar (SAR) image is very small.

To address this problem, Zhang et al. [12] considered that image speckle noise can be expressed more accurately through nonlinear models than through linear models and proposed a novel deep-neural-network-based approach for SAR image despeckling, learning a nonlinear end-to-end mapping between the speckled and clean SAR images by a dilated residual network (SAR-DRN), which enlarges the receptive field and maintains the filter size and layer depth. Likewise, Wei et al. [13] proposed the deep residual pan-sharpening neural network (DRPNN); that is, the concept of residual learning was introduced to form a very deep convolutional neural network (CNN) to make full use of the high nonlinearity of deep learning models, which overcomes the drawbacks of the previously proposed methods and performs high-quality fusion of Panchromatic (PAN) and Multi-spectral (MS) images. Furthermore, Yuan et al. [14] introduced multi-scale feature extraction and residual learning into the basic CNN architecture and proposed a multi-scale and -depth CNN for the pan-sharpening of remote sensing imagery that overcomes some shortcomings; for example, in the Component Substitution (CS) and Multiresolution Analysis (MRA) based fusion methods, the transformation from observed images to fusion targets is not rigorously modeled, and distortion in the spectral domain is very common. In the Model-based optimization (MBO)-based methods, the linear simulation from the observed and fusion images is still a limitation, especially when the spectral coverages of the PAN and MS images do not fully overlap, which leads to the fusion process being highly nonlinear. Very recently, Zhang et al. [15] proposed an image denoising model using residual learning of deep CNNs (feed-forward denoising CNN—DnCNN) that has provided promising performance among the state-of-the-art methods. Specifically, the model utilizes residual learning and batch normalization to speed up the training process as well as boost the denoising performance.

As mentioned above, it has been shown that if a deep learning model can be trained with a very large dataset, the deep architecture will provide a competitive result [16,17]. That is to say, these deep learning techniques require a large training dataset to establish a prediction model with good performance, which brings about a serious problem: not only do they consume a lot of precious time, but the image datasets for target region search are often also limited. Therefore, it is still an open field of research to apply deep learning to image-processing recognition problems with small datasets. In this paper, to improve the model denoising recognition performance for small training datasets, following Zhang et al. [15], with our own modified architecture, we propose an aerial-image denoising recognition model based on a CNN with an improved multi-scale residual learning approach.
Firstly, differently from [15], the image denoising recognition model developed by us is beneficial to small training datasets. Secondly, instead of directly outputting a denoised image, we separate the noise from a noisy image by a deep CNN model, and then the denoised image is obtained by subtracting the residual from the noisy observation. Finally, multi-scale residual learning is used as a learning algorithm, and batch normalization and dropout are also incorporated to speed up the training process and improve the model's denoising performance.

The rest of the paper is organized as follows. Section 2 presents related work, Section 3 explains the proposed aerial-image denoising, the experimental results and evaluation are elaborated upon in Section 4, and finally Section 5 provides the conclusion of the paper.

2. Related Work

In this section, we give a brief review of the residual learning network (ResNet). A comprehensive review on the ResNet can be found in [18–25].
Compared with shallow neural networks, deep networks have a higher training error and test error, and deep-level network structure training is also very difficult [18,19]. Supposing there exists a shallow network, then there should be such a deep network that is made up of multiple x → x mappings that are based on the shallow network, and the training error of the deep neural network should not be higher than that of the corresponding shallow network. However, it is very difficult to use multiple nonlinear layers to fit an x → x direct mapping [20–22].

To overcome the above-stated problem, the residual learning approach along with some other approaches was introduced. The ResNet can be defined as y = ℘({Wi}, x) + x, where ℘({Wi}, x) is the residual mapping function to be learned, y is the output vector of the considered layers, and x is the input vector of the considered layers. As shown in Figure 1a, ℘ = W2 Relu(W1 x), and Relu is defined as Relu(x) = max{0, x}.

Furthermore, the idea of residual learning is illustrated as follows. We assume that the target mapping H(x) is a sub-module of the neural network ϕ that we wish to learn and that H(x) may be very complex and difficult to learn. Now, instead of learning H(x) directly, the sub-module ϕ learns a residual function ℘(x) as

℘(x) = H(x) − x. (1)

The original mapping function x̌ can be expressed as

x̌ = ℘(x) + x. (2)
That is, ϕ can be composed of two parts: a linear direct mapping x → x and a nonlinear mapping ℘(x). If the direct mapping x → x is optimal, the learning algorithm can easily set the weight parameters of the nonlinear mapping ℘(x) to 0. If there is no direct mapping, it is very difficult for the nonlinear mapping ℘(x) to learn the linear mapping x → x. Additionally, the experiments of He et al. [26] showed that such a residual structure achieves significant results in extracting image texture and detail features.
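As a concrete illustration of Equations (1) and (2) and the module of Figure 1a, the sketch below builds a plain two-convolution residual module in tf.keras (TensorFlow is the framework reported later in Table 1); the filter count of 64 and the kernel size of 3 are illustrative assumptions, not values taken from the paper.

```python
import tensorflow as tf

def residual_module(x, filters=64, kernel_size=3):
    """Plain residual module (Figure 1a): output = P({Wi}, x) + x,
    with P = W2 * Relu(W1 * x). The identity shortcut assumes the
    input already has `filters` channels."""
    shortcut = x                                                           # identity branch
    h = tf.keras.layers.Conv2D(filters, kernel_size, padding="same")(x)   # W1 * x
    h = tf.keras.layers.ReLU()(h)                                          # Relu(W1 * x)
    h = tf.keras.layers.Conv2D(filters, kernel_size, padding="same")(h)   # W2 * Relu(W1 * x) = P(x)
    return tf.keras.layers.Add()([h, shortcut])                            # Equation (2): P(x) + x
```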
Figure 1. Residual learning network. (a) Residual learning module; (b) Residual module of bottleneck structure.
At present, the ResNet is mainly composed of multiple residual learning modules. However, the consumption in stacking a number of residual learning modules to build a deep neural network is still a little large. Therefore, Zhang et al. [15] presented a residual module structure called a bottleneck, as shown in Figure 1b. Additionally, with the residual learning strategy, a new method named DnCNN [15] recently showed excellent performance, implicitly removing the latent clean image in the hidden layers. This property motivated us to train a single DnCNN model to tackle several
general image denoising tasks, such as Gaussian denoising, single-image super-resolution, and JPEG image deblocking. In addition, the bottleneck structure module is the key for the ResNet to achieve hundreds or thousands of layers. In Figure 1b, there are three layers in total for the nonlinear part of the bottleneck structure, one m ∗ m (m = 3) convolution and two 1∗1 convolutions. Assuming that the dimension of the input data x is 256, the first 1∗1 convolution can reduce the dimensions; at the same time, it also implements cross-channel information fusion to some extent and reduces the input dimension to 64. The second 1∗1 convolution plays a major role in raising the dimension, which causes the output dimension to return back to 256. The input and output dimensions of the middle m ∗ m convolution fall from the original 256 to 64 by the two 1∗1 convolutions, which greatly reduces the number of parameters and also increases the module depth [23,24]. In order to boost the denoising performance and improve image denoising recognition as much as possible, the bottleneck residual learning module is used to replace the ordinary residual learning module [25].
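For the bottleneck module of Figure 1b (a 1∗1 convolution down to 64 channels, an m ∗ m convolution with m = 3, and a 1∗1 convolution back up to 256 channels), a minimal tf.keras sketch follows; it assumes the input already has 256 channels so that the identity shortcut can be added directly, and the activation placement is an illustrative choice.

```python
import tensorflow as tf

def bottleneck_module(x, reduced=64, expanded=256, m=3):
    """Bottleneck residual module (Figure 1b)."""
    shortcut = x                                                                     # identity branch (256 channels assumed)
    h = tf.keras.layers.Conv2D(reduced, 1, padding="same", activation="relu")(x)    # 1*1: 256 -> 64 (dimension reduction)
    h = tf.keras.layers.Conv2D(reduced, m, padding="same", activation="relu")(h)    # m*m convolution on 64 channels
    h = tf.keras.layers.Conv2D(expanded, 1, padding="same")(h)                      # 1*1: 64 -> 256 (dimension raising)
    h = tf.keras.layers.Add()([h, shortcut])                                         # P(x) + x
    return tf.keras.layers.ReLU()(h)
```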
3. Proposed Aerial-Image Denoising

3.1. The Proposed Method Formulation

At present, all the image denoising methods have a common formulation, as follows:

y = x + v, (3)
where y is the observed noisy image generated as the addition of the clean image x and some noise v. Most of the discriminative denoising models aim to learn a mapping function (Equation (4)) to predict the latent clean image from y:

H(y) ≈ x. (4)

Unlike previous methods, for aerial-image denoising, we follow DnCNN [15] and adopt the residual learning formulation to learn the residual mapping, as follows:
℘1(y) ≈ v. (5)

Thus, we have the original function x̌:

x̌ = y − ℘1(y). (6)
Formally, to learn the trainable parameters λ in the aerial-image denoising model, the averaged mean-square error between the desired image residuals {Δt} and the estimated residuals from the noisy input,

ℓ(λ) = (1/2N) ∑_{t=1}^{N} ‖℘ − Δt‖²_F = (1/2N) ∑_{t=1}^{N} ‖℘(yt; λ) − (yt − xt)‖²_F, (7)

can be adopted as the loss function of the residual estimation, where {(yt, xt)}_{t=1}^{N} are N pairs of noisy–clean training images (patches) and ‖·‖²_F is the squared Frobenius norm, which, for a matrix H, can be calculated by the following formula:

‖H‖²_F = ∑_{i,j} h²_{ij}. (8)
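A minimal sketch of the residual-learning objective in Equations (5)–(8): the network predicts the noise residual, the loss compares it with y − x, and the denoised image is recovered by subtraction as in Equation (6). The function names are illustrative, and the constant 1/2 factor of Equation (7) is dropped since it only rescales the gradients.

```python
import tensorflow as tf

def residual_loss(noisy_y, clean_x, model):
    """Equation (7): mean squared (Frobenius-norm) error between the
    predicted residual P(y; lambda) and the true residual y - x."""
    predicted_residual = model(noisy_y, training=True)   # P(y; lambda), an estimate of v (Equation (5))
    true_residual = noisy_y - clean_x                    # y - x
    return tf.reduce_mean(tf.square(predicted_residual - true_residual))

def denoise(noisy_y, model):
    """Equation (6): x_hat = y - P1(y)."""
    return noisy_y - model(noisy_y, training=False)
```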
The architecture of the proposed aerial-image denoising model for learning ℘(y) is shown in Figure 2, and Figure 3 illustrates that the reconstructed denoised image x̌ is obtained by subtracting the residual image ℘1(y) from the noisy observation image y. In the following, we explain the architecture of the multi-scale learning module for the aerial-image denoising model.
Figure 2. The architecture of the model for aerial-image denoising; n is the number of learning modules (the size of n is discussed in Section 4.2.3), and the module is a multi-scale residual learning module (see Figure 4).
Figure 3. Reconstruction phase of the proposed model.
3.2. Multi-Scale Learning Module

It is assumed that the output of the previous layer Xi−1 is composed of ni−1 feature mappings (i.e., ni−1 channels). Firstly, to produce a set of feature mappings zik, multi-scale filters are applied to the input data, as follows:

zik = Wik ∗ (Xi−1) + Bik, (9)

where Wik and Bik are the convolutional filters and biases of the kth type of the ith layer, respectively, and Wik includes ni filters of size ni−1 × fik × fik.

Secondly, each convolution produces ni feature maps. Therefore, the multi-scale convolution [24] is composed of K × ni feature maps. The feature maps are divided into nl nonoverlapping groups, and the tth group is made up of K feature maps, that is, zi1t, zi2t, zi3t, ..., zikt; then maximum pooling across zi1t, zi2t, zi3t, ..., zikt is performed by the maxout activation function δ(·). The function Yit(x, y),

Yit(x, y) = δ(zi1t(x, y), zi2t(x, y), zi3t(x, y), ..., zikt(x, y)), (10)

is the maxout output of the tth group at the position (x, y), where zipt(x, y) (p = 1, 2, ..., k) is the data at a particular position (x, y) in the pth feature map for the tth group.

In our paper, the multi-scale convolutional layer comprises three (K = 3) types of filters, fi1 = 5 × 5, fi2 = 3 × 3, and fi3 = 1 × 1, as shown in Figure 4. Across the feature maps, the maxout function executes maximum element-wise pooling. During each iteration of the training process, the activation function ensures that the unit with the maximum value in the group is activated. Conversely, the convolution layer feeds the feature map to the maxout activation function.

The difference is that instead of the single 3∗3 convolution kernel in the convolution module, multiple 1∗1, 3∗3, and 5∗5 convolution kernels are used, so that the convolution layer can observe the input data from different scales, so as to help the different convolution kernels converge to different values, which effectively avoids the collaborative work of the network [27,28].
Figure 4. Multi-scale residual learning module.
In Figure 4, k is the scaling parameter, and the three convolutions of unit width are 1 ∗ 1 ∗ 16, 3 ∗ 3 ∗ 16, and 5 ∗ 5 ∗ 16. Then, after the transformation by the scaling parameter k, the network width is as follows:

3 → 3 ∗ k. (11)

The corresponding three convolution kernels are as follows:

1 ∗ 1 ∗ 16 → 1 ∗ 1 ∗ 16 ∗ k,
3 ∗ 3 ∗ 16 → 3 ∗ 3 ∗ 16 ∗ k, (12)
5 ∗ 5 ∗ 16 → 5 ∗ 5 ∗ 16 ∗ k.
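The multi-scale module of Equations (9), (10), and (12) and Figure 4 can be sketched as three parallel convolutions (5∗5, 3∗3, and 1∗1), each producing 16 ∗ k feature maps, combined by an element-wise maxout over the K = 3 branches. This is an illustrative tf.keras approximation; the exact grouping of feature maps and any normalization inside the module are assumptions.

```python
import tensorflow as tf

def multi_scale_module(x, k=4, base_width=16):
    """Multi-scale convolution with maxout (Figure 4)."""
    filters = base_width * k                       # Equation (12): 16 -> 16 * k
    branches = []
    for size in (5, 3, 1):                         # K = 3 scales: 5*5, 3*3, 1*1
        z = tf.keras.layers.Conv2D(filters, size, padding="same")(x)   # z_i^k = W_i^k * X_{i-1} + B_i^k
        branches.append(z)
    # Equation (10): element-wise maximum across the K branches (maxout).
    return tf.keras.layers.Maximum()(branches)
```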
For the traditional ResNet, when the network depth reaches a certain degree, the effect of the network becomes less obvious [26]. However, the proposed multi-scale ResNet not only increases the network depth, but also makes the network training very easy. The experimental results show the improvement in performance, as detailed in Section 4.2.

3.3. Dropout

Considering the excellence of dropout [6] for CNNs, we incorporated dropout for the image denoising recognition [29,30], where dropout is explained hereunder.

We assume a neural network with L hidden layers; the feed-forward operation of a standard neural network can be described as follows and can be seen in Figure 5a:

zi(l+1) = bi(l+1) + wi(l+1) · y(l)  (i = 1, 2, 3, ...), (13)

yi(l+1) = f(zi(l+1))  (i = 1, 2, 3, ...), (14)

where l ∈ {1, 2, 3, ..., L} represents the hidden layers of the network; y(l) represents the vector of outputs from layer l (where y(0) = x is the input); b(l) and W(l) are the biases and weights of layer l, respectively; z(l) represents the vector of inputs into layer l; i is any hidden unit; and f is any activation function, such as

f(x) = 1 / (1 + exp(−x)). (15)

However, with dropout, the feed-forward operation can be expressed as follows and can be seen in Figure 5b:

rj(l) ∼ Bernoulli(p)  (j = 1, 2, 3, ...), (16)

ỹ(l) = r(l) ∗ y(l), (17)

zi(l+1) = bi(l+1) + wi(l+1) · ỹ(l)  (i = 1, 2, 3, ...), (18)

yi(l+1) = f(zi(l+1))  (i = 1, 2, 3, ...), (19)

where Bernoulli() is the binomial distribution function and p is the probability that a neuron is discarded, ranging in the interval [0, 1], where the probability of being retained is 1 − p; r(l) is a vector of independent Bernoulli random variables for any layer l, and each random variable has a probability p of being 1. To create the thinned outputs ỹ(l), the vector is sampled and multiplied with the output elements of the layer, y(l). Then the process in which ỹ(l) is used as input to the next layer is applied for each layer.
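A numpy sketch of the dropout feed-forward pass of Equations (16)–(19), following the equations' convention that rj ∼ Bernoulli(p) keeps a unit with probability p (the surrounding text also refers to p as a drop probability, so the role of p should be adapted accordingly); the layer sizes, the sigmoid of Equation (15), and the test-time rescaling are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):                                      # Equation (15)
    return 1.0 / (1.0 + np.exp(-z))

def dropout_forward(y_prev, W, b, p, rng, training=True):
    """Equations (16)-(19): thin the previous layer's output with a
    Bernoulli mask r, then apply the usual affine map and activation."""
    if training:
        r = rng.binomial(1, p, size=y_prev.shape)    # Equation (16): r_j ~ Bernoulli(p)
        y_tilde = r * y_prev                         # Equation (17): element-wise thinning
    else:
        y_tilde = p * y_prev                         # common test-time rescaling (assumption)
    z = W @ y_tilde + b                              # Equation (18)
    return sigmoid(z)                                # Equation (19)

# Illustrative usage with arbitrary layer sizes:
rng = np.random.default_rng(0)
y_prev = rng.standard_normal(64)
W, b = rng.standard_normal((32, 64)), np.zeros(32)
y_next = dropout_forward(y_prev, W, b, p=0.7, rng=rng)
```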
Figure 5. A comparison of the basic operations between the standard and dropout networks. (a) Standard network; (b) Dropout network.
In the multi-scale learning module, when the input and output neurons remain unchanged, half of the hidden neurons in the network are deleted randomly (temporarily), so that the corresponding parameters will not be updated when the network is propagating in reverse. At the time of training, dropout is equivalent to training a subnet of the whole network every time. If there are n nodes in the network, the number of available subnets should be 2^n. When n is large enough, the subnets used in each training will not be the same. Finally, the whole network can be regarded as the average of the multiple subnet models. By doing this, the training can avoid overfitting of a certain subnet and enhance the generalization ability of the network.

As described above, dropout helps the network retain robustness. Figure 6 shows the test error rates obtained for many different datasets (CIFAR-10, MNIST, etc.) as training progressed. As seen from two separate clusters of trajectories, the same datasets trained with and without dropout had significantly different test errors, and dropout brought a great improvement across all architectures. For our aerial-image denoising model, we found that incorporating dropout boosted the accuracy of the model and made the training process faster. The experimental results show that the dropout location and drop ratio p affected the network performance, as detailed in Section 4.2.2.
Figure 6. Test error for different datasets with and without dropout.
4. Experimental Results and Setting
The main purpose of the experimental work of this paper was to confirm the performance of our proposed approach in the aerial-image denoising recognition model.

4.1. System Environment

The computer configuration used in the experimental environment was as shown in Table 1.

Table 1. The machine configuration used in the experimental environment.

Processor           Intel(R) Core(TM) i7-7700HQ CPU @ 2.90 GHz
GPU                 NVIDIA GeForce GTX1050 Ti
RAM                 8 GB
Hard disk           SSD 256 GB
Operating system    Windows 7 @ 64 bit
Test framework      TensorFlow
In our experiment, the model for Gaussian denoising was trained with noise levels σ = 10, σ = 15, and σ = 25 independently. Firstly, we considered the images of the Caltech [31] database to train the model. The database is a publicly available aerial database that comprises an aircraft dataset and a vehicle dataset. We mixed 475 images from the vehicle dataset and 350 images from the aircraft dataset to train the model. Finally, 55 images were randomly selected from the two datasets; these were not included in the training period but were used to test the model.
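For this training setup (AWGN at σ = 10, 15, or 25 on 8-bit images), noisy–clean training pairs can be generated as in the following numpy sketch; the [0, 1] scaling and the clipping are assumptions made for illustration.

```python
import numpy as np

def make_noisy_pair(clean_image, sigma, rng):
    """Add AWGN with standard deviation sigma (in 8-bit intensity units)
    to a clean image and return the (noisy, clean) pair scaled to [0, 1]."""
    clean = clean_image.astype(np.float32) / 255.0
    noise = rng.normal(0.0, sigma / 255.0, size=clean.shape).astype(np.float32)
    noisy = np.clip(clean + noise, 0.0, 1.0)          # y = x + v
    return noisy, clean

# Illustrative usage on a random 64 x 64 RGB array standing in for an image:
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
noisy, clean01 = make_noisy_pair(clean, sigma=15, rng=rng)
```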
4.2.1.100th Experimental epochs, theParameters learning rate increased by a factor of 0.1, and the total number of epoch runs was
150.The experiment ran two blocks random of NVIDIA GeForcedescent GTX1050 Ti, and was mini-batch 100 50th In the experiment, theon momentum gradient method used. had At the samples for each GPU. In addition, other parameters are shown in Table 2. and 100th epochs, the learning rate increased by a factor of 0.1, and the total number of epoch runs was 150.The experiment ran Table on two blocks of NVIDIA GeForce GTX1050 Ti, and mini-batch had 100 2. The parameter in experimental environment. samples for each GPU. In addition, other parameters are shown in Table 2. Loss Function Cross-entropy
Parameter Initial Rate Drop Probabilityenvironment. Momentum TableLearning 2. The parameter in experimental 0.1 0.3 0.9
Weight Decay 0.0005
Parameter Loss Function
Initial Learning Rate
Drop Probability
Momentum
Weight Decay
Cross-entropy
0.1
0.3
0.9
0.0005
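A sketch of the schedule and the Table 2 settings (momentum SGD, initial learning rate 0.1 multiplied by 0.1 at the 50th and 100th epochs, 150 epochs in total); the model and dataset objects are placeholders, the mean-squared residual loss of Equation (7) is used here instead of the cross-entropy listed in Table 2, and the 0.0005 weight decay would have to be added via kernel regularizers or an optimizer that supports it.

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    """Multiply the learning rate by 0.1 at the 50th and 100th epochs."""
    return lr * 0.1 if epoch in (50, 100) else lr

def compile_and_train(model, train_ds, epochs=150):
    # train_ds is assumed to yield (noisy, residual) batches of 100 samples per GPU.
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
    model.compile(optimizer=optimizer, loss=tf.keras.losses.MeanSquaredError())
    callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule)]
    return model.fit(train_ds, epochs=epochs, callbacks=callbacks)
```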
4.2.2. Selecting Dropout Location

From Section 3.3, we know that dropout can effectively avoid network overfitting and improve the denoising performance. Therefore, it is necessary to discuss the influence of dropout location on the network performance. Figures 7 and 8 show the different dropout locations in the multi-scale residual learning module and the influence of parameter p (drop ratio) on the network image denoising recognition performance, respectively.
Figure 7. The module structure of different dropout locations.
From Figure 8, we can see that in Scenario 5, when p was 0.3, the error rate of the test set was the smallest, 3.88%. On the contrary, in Scenario 4, when p was 0.3, the error rate of the test set was the largest, 4.32%. Therefore, for the proposed multi-scale ResNet, we adopted Scenario 5; that is, the node discarding probability p is 0.3 and a dropout layer is added before the full connection layer.
Figure 8. The influence of parameter p and dropout for the multi-scale residual learning network (ResNet).
Network Depth 4.2.3.4.2.3. Network Depth
4.2.3. Network Depth
Instead of the network depth being based on experience, the optimal network depth was acquired through the experiments. The multi-scale ResNet structure includes 3n multi-scale residual learning modules with two layers of convolution for each learning module, and the entire network has (6n + 2) layers. In addition, there is a scaling parameter k for each convolution layer. Each layer consists of three different scale convolutions, 1∗1, 3∗3, and 5∗5, and the network width is 3 ∗ k. In order to explore the effect of different scaling parameters and depths on the network performance, we conducted some experiments on the Caltech dataset. The results are shown in Figure 9 and Table 3.

It can be seen from Table 3 that with the increase in the parameter k, the network depth decreased. However, the model became more and more complex, and more and more parameters needed to be trained. Accordingly, the recognition error rate of the model for the test set also became continually lower. In Table 3, the scaling parameter of model 8 was 5, which was larger than that of model 7, and the number of parameters was also larger by about 10 M, but the accuracy rate declined. The main reasons may be that the structure of model 8 was too complex and the number of parameters was too great, which increased the training difficulty, and a slight overfitting phenomenon occurred. In addition, we can see from Figure 9 that with the increase in training times, the error rate tended to be stable. Therefore, in the experiment, in order to obtain the best network effect, the scaling parameter was set to 4 and the network depth was set to about 26.
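The network depths in Tables 3 and 4 follow directly from the (6n + 2) rule stated above, as the following small check illustrates:

```python
def network_depth(n):
    """Three groups of n modules, two convolution layers each, plus two extra layers."""
    return 6 * n + 2

assert network_depth(4) == 26     # the configuration selected in the experiments
assert network_depth(12) == 74    # first row of Table 3
```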
Figure 9. The curve for the training times and the error rates for the test set.
Table 3. The error rate of the test set under different network structures, where k is the scaling parameter and n is the number of learning modules.

Model Number   k   n    Network Depth   Parameters   Error Rate
i              1   12   74              12.9 M       4.813%
ii             1   16   98              18.7 M       4.913%
iii            1   20   122             22.8 M       4.965%
iv             2   8    50              36.1 M       4.191%
v              3   5    32              47.8 M       4.178%
vi             3   6    38              59.1 M       4.238%
vii            4   4    26              66.4 M       3.883%
viii           5   3    20              76.4 M       4.012%
ix             6   2    14              81.3 M       4.099%
To further evaluate the performance of the proposed model, it was necessary to test the running state of the model under conditions of different scaling parameters and network depths, as shown in Table 4.

Table 4. The running state of the model under different scaling parameters and network depths, where "—" indicates that GPU resources were exhausted and the model could not be trained (blue font indicates the best).

k   n    Network Depth   Parameters (M)   Training Situation   Error Rate (%)
1   20   122             22.8             Normal training      4.965
1   21   128             25.3             —
1   22   134             26.8             —
2   8    50              36.1             Normal training      4.229
2   9    56              40.9             —
2   10   62              43.9             —
3   6    38              59.1             Normal training      4.232
3   7    44              68.0             —
3   8    50              69.5             —
4   4    26              66.4             Normal training      3.883
4   5    32              83.6             —
4   6    38              88.5             —
5   3    20              76.4             Normal training      4.012
5   4    26              105.5            —
5   5    32              108.3            —
6   2    14              81.3             Normal training      4.099
6   3    20              110.2            —
6   4    26              114.7            —
As we can see from Table 4, with the increase in the scaling parameter k, the number of parameters of the network training also increased. When the scaling parameter was 4, the network could only be extended to 26 layers, with 66.4 M training parameters; when the number of layers was increased further, the GPU resources were exhausted. In addition, when the scaling parameter increased to 5, 76.4 M parameters could be trained at most. In addition, when the number of parameters reached about 70 M, the proposed multi-scale ResNet algorithm did not have serious overfitting problems. It can also be seen from Table 4 that the proposed multi-scale ResNet (scaling parameter of 4 and network depth of 26) was the best. The error rate on the Caltech dataset was about 3.883%. To further study the impact of the network depth on the network performance, we fixed the scaling parameter to 4, slightly adjusting the number of groups of learning modules.
The multi-scale ResNet consists of three groups of learning modules, and each group contains n learning modules. To obtain the ideal number of learning modules per group, we changed the numbers of learning modules in each group so that they were not completely equal. Supposing that the numbers of learning modules in the groups are x, y, and z (x being the number of modules close to the input layer, y being the number of modules close to the output layer, and z being the fixed scaling parameter), the value of z was 4, and the values of x and y were changed. The experimental results are shown in Figure 10.
Figure 10. The error rate of the test set under different learning module numbers (where z = 4).
As can be seen from Figure 10, when x = 5, y = 3, and z = 4, the error rate was 3.693%, which was much better than for all variables being equal to 4. This indicates that appropriately increasing the number of modules that are close to the input layer can better capture the original features of the picture and help to improve the network performance. In the next part, we describe the use of these parameters to evaluate our proposed aerial-image denoising recognition methods.

4.2.4. Model Architecture

In this study, as we have shown in the model architecture in Figure 2, the model was trained by minimizing the cost function in Equation (7), and the model architecture had a depth of (6n + 2) layers. The image size of 32∗32∗3 was inputted and passed through the 3∗3 convolution kernel with 16 convolution kernels, and the output of the model was 32∗32∗16; then these data were passed through the 6n convolution layers and an average pool layer, followed by a dropout operation, and were finally passed through a full connection layer and subjected to a softmax operation. In this model, residual learning was adopted to learn the mapping ℘(y), and batch normalization was incorporated to increase the model's accuracy and speed up the training. It was seen that the proposed multi-scale ResNet for a small training dataset has promising advantages for training deep networks and further improves the network accuracy.
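Putting the pieces together, the sketch below assembles the (6n + 2)-layer architecture described above for a 32∗32∗3 input: an initial 3∗3 convolution with 16 filters, three groups of n residual modules built around the multi-scale convolution with batch normalization, an average pooling layer, dropout before the fully connected layer (Scenario 5, p = 0.3), and a softmax output. It reuses the illustrative multi_scale_module sketched in Section 3.2; the number of output classes and the exact placement of shortcuts and normalization are assumptions, so this is an approximation of, not a substitute for, the authors' implementation.

```python
import tensorflow as tf

def build_model(n=4, k=4, num_classes=10, input_shape=(32, 32, 3), drop_p=0.3):
    """(6n + 2)-layer multi-scale ResNet sketch."""
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(16, 3, padding="same")(inputs)          # 32*32*3 -> 32*32*16
    for _ in range(3):                                                   # three groups of learning modules
        for _ in range(n):                                               # n modules per group, 2 conv layers each
            shortcut = x
            h = multi_scale_module(x, k=k)                               # multi-scale convolution + maxout
            h = tf.keras.layers.BatchNormalization()(h)
            h = tf.keras.layers.Conv2D(h.shape[-1], 3, padding="same")(h)
            h = tf.keras.layers.BatchNormalization()(h)
            if shortcut.shape[-1] != h.shape[-1]:                        # project the shortcut if channels differ
                shortcut = tf.keras.layers.Conv2D(h.shape[-1], 1, padding="same")(shortcut)
            x = tf.keras.layers.Add()([h, shortcut])
    x = tf.keras.layers.GlobalAveragePooling2D()(x)                      # average pool layer
    x = tf.keras.layers.Dropout(drop_p)(x)                               # dropout before the full connection layer
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```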
4.3. Comparison with State-of-the-Art Algorithms

In this section, we use the optimal parameters that were obtained in Section 4.2 to evaluate the recognition ability of the proposed improved multi-scale residual neural network learning approach. We compared our results with other aerial-image denoising methods from the existing literature, such as PNN [32], BM3D [33], SRCNN [34], WNNM [35], and DnCNN [15], where x = 5, y = 3, and z = 4. The experimental results revealed that our model benefits from multi-scale residual learning, parameter setting, and particular image sizes, as can be seen in Table 5. Figures 11–16 show the randomly selected images reconstructed by the model trained with the training sample.
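The PSNR/SSIM values reported in Tables 5 and 6 use standard definitions; a minimal evaluation sketch with TensorFlow's built-in metrics, assuming image tensors scaled to [0, 1], is:

```python
import tensorflow as tf

def evaluate_pair(clean, denoised, max_val=1.0):
    """Return mean (PSNR in dB, SSIM) over a batch of images in [0, max_val]."""
    psnr = tf.image.psnr(clean, denoised, max_val=max_val)
    ssim = tf.image.ssim(clean, denoised, max_val=max_val)
    return float(tf.reduce_mean(psnr)), float(tf.reduce_mean(ssim))
```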
Information 2018, 9, 169
13 of 18
Information 2018, 9, x FOR PEER REVIEW Information 2018, 9, x FOR PEER REVIEW Information 2018, 9, x FOR PEER REVIEW Information 2018, 9, x FOR PEER REVIEW Information 2018, 9, x FOR PEER REVIEW
13 of 17 13 of 17 13 of 17 13 of 17 13 of 17
Figure 11. Vehicle images: model performance trained with 64 × 64 image size. (A1) The original image; (A2) the noisy image; (A3) denoised image with σ = 10.
Figure 12. Vehicle images: model performance trained with 64 × 64 image size. (A1) The original image; (A2) the noisy image; (A3) denoised image with σ = 15.
Figure 13. Vehicle images: model performance trained with 64 × 64 image size. (A1) The original image; (A2) the noisy image; (A3) denoised image with σ = 25.
Figure 14. Vehicle images: model performance trained with 180 × 180 image size. (A1) The original image; (A2) the noisy image; (A3) denoised image with σ = 10.
Figure 15. Vehicle images: model performance trained with 180 × 180 image size. (A1) The original image; (A2) the noisy image; (A3) denoised image with σ = 15.
Figure 16. Vehicle images: model performance trained with 180 × 180 image size. (A1) The original image; (A2) the noisy image; (A3) denoised image with σ = 25.
Figures 11–16, with different noise parameters, are used for comparison to illustrate our experimental results. We randomly selected some images from the test set, as shown in Figures 11–13, which show the evaluation of the proposed model trained on 825 images, each with an image size of 64 × 64. From the three figures, we can see that the reconstructed images were similar to the original images; Figure 11 shows the effect best, as the reconstructed image was more similar to the original image because of the low noise (σ = 10). A similar conclusion can be observed by comparing Figures 14–16 (Figure 14 is the most apparent).

Figures 14–16 also represent the evaluation of the proposed model trained on 825 images, each with an image size of 180 × 180. Compared with Figure 11, the reconstructed images of Figure 14 are more similar to the original image, which implies the proposed model benefited from a particular image size. Furthermore, comparing Figures 12 and 15 and Figures 13 and 16 separately, a similar conclusion can be observed. The same conclusion is also reached from Figure 17.
Figure 17. Aerial images: model performance trained with 180 × 180 image size. (A1,A4,A7,A10) Original images; (A2) noisy image with 64 × 64 image size and σ = 10; (A3) denoised image with 64 × 64 image size and σ = 10; (A5) noisy image with 64 × 64 image size and σ = 15; (A6) denoised image with 64 × 64 image size and σ = 15; (A8) noisy image with 180 × 180 image size and σ = 10; (A9) denoised image with 180 × 180 image size and σ = 10; (A11) noisy image with 180 × 180 image size and σ = 15; (A12) denoised image with 180 × 180 image size and σ = 15.
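For reference, the noisy test inputs at σ = 10, 15, and 25 shown in Figures 11–17 can be synthesized as in the minimal sketch below, which assumes 8-bit grayscale images held as NumPy arrays; it is an illustration of additive white Gaussian noise synthesis rather than the authors' exact preprocessing pipeline.

```python
import numpy as np

def add_awgn(image, sigma, seed=None):
    """Return a noisy copy of an 8-bit grayscale image: y = x + v, with v ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = image.astype(np.float32)
    v = rng.normal(loc=0.0, scale=sigma, size=x.shape).astype(np.float32)
    y = np.clip(x + v, 0.0, 255.0)   # keep the result in the valid 8-bit range
    return y.astype(np.uint8)

# Example: the three noise levels used in the experiments.
# clean = ...  # a 64 x 64 or 180 x 180 grayscale patch as a NumPy array
# noisy_10, noisy_15, noisy_25 = (add_awgn(clean, s, seed=0) for s in (10, 15, 25))
```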
Table 5. The average results of our aerial-image model, where the multi-scale residual learning network (ResNet) was the proposed aerial-image denoising method in the paper. The corresponding experimental parameters were x = 5, y = 3, scaling parameter z = 4, and network depth of 26 (blue font indicates the best). Image size of 64 × 64, with dropout; all values are PSNR/SSIM.

Noise Level (σ)  PNN            Multi-Scale ResNet  SRCNN          BM3D           WNNM           DnCNN
σ = 10           38.485/0.9000  39.482/0.9271       38.551/0.9211  38.494/0.9004  38.494/0.9001  39.466/0.9271
σ = 15           36.388/0.8881  38.607/0.9202       37.112/0.9000  36.993/0.9001  36.379/0.8746  38.007/0.9202
σ = 25           35.553/0.7003  36.998/0.9193       36.996/0.9303  36.261/0.9184  36.118/0.7839  36.767/0.9192
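Tables 5–7 summarize denoising quality as average PSNR/SSIM over the test images. The evaluation scripts are not reproduced here; the following is a minimal sketch, assuming scikit-image is available, of how such per-image scores are typically computed and averaged for 8-bit grayscale images.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(clean, denoised):
    """PSNR and SSIM for one pair of 8-bit grayscale images."""
    psnr = peak_signal_noise_ratio(clean, denoised, data_range=255)
    ssim = structural_similarity(clean, denoised, data_range=255)
    return psnr, ssim

def average_scores(image_pairs):
    """Average PSNR/SSIM over an iterable of (clean, denoised) pairs."""
    scores = np.array([evaluate_pair(c, d) for c, d in image_pairs])
    return scores[:, 0].mean(), scores[:, 1].mean()
```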
We compared model performance across images of the same size. Table 5 shows the average results of the proposed model trained on an image size of 64 × 64, with a training sample size of 825 images and with dropout; it can be seen that the proposed model improved the reconstructed image quality by about 0.167/0.001 over the DnCNN and by about 0.931/0.006 over the SRCNN at σ = 10 and outperformed the other methods in terms of PSNR/SSIM values at σ = 10, σ = 15, and σ = 25. Similarly, Table 6 reports the model performance with an image size of 180 × 180, a training sample size of 825 images, and with dropout. Again, the proposed model improved the average reconstructed image quality at σ = 10, σ = 15, and σ = 25. However, comparing the results in Tables 5 and 6, we found that the image size affected the model performance; the model trained on the image size of 180 × 180 with dropout performed better than the one trained on the image size of 64 × 64 with dropout. This conclusion is consistent with the above (Figures 11–16). Furthermore, comparing Tables 6 and 7, we also found that the incorporation of dropout for image recognition (the dropout location is critical) significantly improved the network performance. This conclusion is consistent with Section 4.2.2.

Table 6. The average results of our aerial-image model, where the multi-scale residual learning network (ResNet) was the proposed aerial-image denoising method in the paper. The corresponding experimental parameters were x = 5, y = 3, scaling parameter z = 4, and network depth of 26 (blue font indicates the best). Image size of 180 × 180, with dropout; all values are PSNR/SSIM.
Noise Level (σ)  PNN            Multi-Scale ResNet  SRCNN          BM3D           WNNM           DnCNN
σ = 10           40.114/0.9009  40.987/0.9652       40.478/0.9622  40.115/0.9195  40.114/0.9191  40.582/0.9642
σ = 15           38.572/0.9441  40.208/0.9597       39.442/0.9501  39.028/0.9447  39.005/0.9448  39.881/0.9597
σ = 25           37.316/0.9311  38.448/0.9421       37.411/0.9400  37.401/0.9338  37.189/0.9278  37.644/0.9420
Table 7. The average results of our aerial-image model, where the multi-scale residual learning network (ResNet) was the proposed aerial-image denoising method in the paper. The corresponding experimental parameters were x = 5, y = 3, scaling parameter z = 4, and network depth of 26 (blue font indicates the best). Image size of 180 × 180, without dropout; all values are PSNR/SSIM.

Noise Level (σ)  PNN            Multi-Scale ResNet  SRCNN          BM3D           WNNM           DnCNN
σ = 10           37.660/0.8796  39.481/0.9424       38.003/0.9008  38.104/0.9103  37.661/0.9001  38.117/0.9424
σ = 15           36.782/0.8982  38.115/0.9331       37.180/0.9115  37.178/0.9112  36.703/0.8954  37.380/0.9321
σ = 25           35.116/0.8716  36.662/0.9004       35.333/0.8894  35.371/0.8990  35.003/0.8541  35.773/0.9010
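To make the comparison in Tables 5–7 more concrete, the sketch below illustrates in PyTorch the two ideas the tables isolate: a multi-scale residual block that mixes receptive fields before a dropout layer, and the residual-subtraction step in which the network predicts the noise and the denoised image is obtained as the noisy input minus that prediction. The layer widths, kernel sizes, and module count here are hypothetical placeholders; they are not taken from the paper's exact configuration (x = 5, y = 3, scaling parameter z = 4, depth 26), and PyTorch is an assumed framework.

```python
import torch
import torch.nn as nn

class MultiScaleResidualBlock(nn.Module):
    """Hypothetical multi-scale block: parallel 3x3 and 5x5 convolutions,
    fused by a 1x1 convolution, with dropout and a skip connection."""
    def __init__(self, channels=64, p_drop=0.2):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.drop = nn.Dropout2d(p_drop)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.branch3(x), self.branch5(x)], dim=1)
        y = self.drop(self.act(self.fuse(y)))
        return x + y  # residual (skip) connection inside the block

class NoiseEstimator(nn.Module):
    """Predicts the noise map v from a noisy grayscale image y (y = x + v)."""
    def __init__(self, channels=64, num_blocks=4):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.body = nn.Sequential(*[MultiScaleResidualBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, noisy):
        return self.tail(self.body(self.head(noisy)))

def denoise(model, noisy):
    """Denoised image = noisy input minus the predicted residual (noise)."""
    with torch.no_grad():
        return noisy - model(noisy)

# Training target for one batch is the residual itself, e.g.:
# loss = nn.functional.mse_loss(model(noisy_batch), noisy_batch - clean_batch)
```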
5. Summary and Conclusions

In this paper, we propose an efficient deep CNN model for denoising aerial images with a small training dataset. Unlike most aerial-image denoising methods, which approximate a latent clean image from an observed noisy image, the proposed model approximates the noise in the observed noisy image. By combining multi-scale residual learning and dropout, and by accounting for the influence of the network depth and the number of learning modules in each group on the network performance, we not only speed up the training process but also improve the denoising performance of the model. More importantly, although deep architectures generally perform better with large training datasets, the proposed aerial-image denoising model was trained with a small dataset and achieved competitive results. With a small training dataset, the experimental results showed that the proposed aerial-image denoising model performs better than the existing image denoising methods. In the future, the model performance is expected to improve further by stacking more competitive multi-scale modules.

Author Contributions: C.C. designed the experiments and wrote the manuscript; Z.X. provided the instructions and helped during the design. Both authors have read and approved the final manuscript.

Funding: This article was funded by the Shanghai Municipal Committee of Science and Technology Project (No. 18030501400) and by the Shanghai University of Engineering Science Innovation Fund for Graduate Students (No. E3-0903-18-01319).

Acknowledgments: This article was funded by the Shanghai Municipal Committee of Science and Technology Project (No. 18030501400) and by the Shanghai University of Engineering Science Innovation Fund for Graduate Students (No. E3-0903-18-01319). The funder (Zengbo Xu) was involved in the manuscript writing, editing, and approval.

Conflicts of Interest: The authors declare no conflict of interest.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).