Multi-scale object detection in remote sensing

1 downloads 0 Views 11MB Size Report
(You Only Look Once) (Redmon et al., 2016; Redmon and Farhadi,. 2016), and SSD (Single ... pose various improvements to the YOLO detection method. They.
ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Contents lists available at ScienceDirect

ISPRS Journal of Photogrammetry and Remote Sensing journal homepage: www.elsevier.com/locate/isprsjprs

Multi-scale object detection in remote sensing imagery with convolutional neural networks Zhipeng Deng a, Hao Sun a, Shilin Zhou a,⇑, Juanping Zhao b, Lin Lei a, Huanxin Zou a a b

College of Electronic Science, National University of Defense Technology, Changsha, China Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China

a r t i c l e

i n f o

Article history: Received 31 May 2017 Received in revised form 19 March 2018 Accepted 16 April 2018 Available online xxxx Keywords: Object detection Deep learning Convolutional neural networks Multi-modal remote sensing images

a b s t r a c t Automatic detection of multi-class objects in remote sensing images is a fundamental but challenging problem faced for remote sensing image analysis. Traditional methods are based on hand-crafted or shallow-learning-based features with limited representation power. Recently, deep learning algorithms, especially Faster region based convolutional neural networks (FRCN), has shown their much stronger detection power in computer vision field. However, several challenges limit the applications of FRCN in multi-class objects detection from remote sensing images: (1) Objects often appear at very different scales in remote sensing images, and FRCN with a fixed receptive field cannot match the scale variability of different objects; (2) Objects in large-scale remote sensing images are relatively small in size and densely peaked, and FRCN has poor localization performance with small objects; (3) Manual annotation is generally expensive and the available manual annotation of objects for training FRCN are not sufficient in number. To address these problems, this paper proposes a unified and effective method for simultaneously detecting multi-class objects in remote sensing images with large scales variability. Firstly, we redesign the feature extractor by adopting Concatenated ReLU and Inception module, which can increases the variety of receptive field size. Then, the detection is preformed by two sub-networks: a multi-scale object proposal network (MS-OPN) for object-like region generation from several intermediate layers, whose receptive fields match different object scales, and an accurate object detection network (AODN) for object detection based on fused feature maps, which combines several feature maps that enables small and densely packed objects to produce stronger response. For large-scale remote sensing images with limited manual annotations, we use cropped image blocks for training and augment them with re-scalings and rotations. The quantitative comparison results on the challenging NWPU VHR-10 data set, aircraft data set, Aerial-Vehicle data set and SAR-Ship data set show that our method is more accurate than existing algorithms and is effective for multi-modal remote sensing images. Ó 2018 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.

1. Introduction Object detection in very high resolution (VHR) remote sensing images is to determine if a given aerial or satellite image contains one or more objects belonging to the class of interest and locate the position of each predicted object in the image (Cheng and Han, 2016). The term ‘object’ used in this paper mainly refers to manmade objects (e.g. aircrafts, vehicles, storage tanks and ships, etc.) that have sharp boundaries and are independent of background. As a fundamental problem faced for remote sensing image ⇑ Corresponding author. E-mail addresses: [email protected] (Z. Deng), [email protected] (H. Sun), [email protected] (S. Zhou), [email protected] (J. Zhao), [email protected] (L. Lei), [email protected] (H. Zou).

analysis, object detection in remote sensing images plays an important role for both military and civilian applications. However, it’s still a challenging problem due to the varying visual appearance of objects caused by occlusion, illumination, shadow, viewpoint variation, resolution, polarization, speckle noise, etc. Furthermore, the explosive growth of remote sensing images in quantity and quality creates an extremely high computational costs, which also increases the difficulties of object detection for near-real-time applications. Over the last decades, numerous detectors has been developed for detecting different types of objects in remote sensing images. Cheng and Han (2016) reviewed various object detection algorithms in optical remote sensing images and categorized them into four groups, namely template matching-based methods,

https://doi.org/10.1016/j.isprsjprs.2018.04.003 0924-2716/Ó 2018 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

2

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

knowledge-based methods, object-based image analysis-based methods, and machine learning-based methods. For synthetic aperture radar (SAR) images, El-Darymli et al. (2013) categorized the object detection methods into three major taxa: namely single-feature-based methods, multi-feature-based methods, and expert-system-oriented methods. In recent years, due to the advance of the machine learning technique, particularly the deep learning based models with powerful feature representations, many approaches consider object detection as a region-ofinterest (RoI) classification problem using deep features and have shown more impressive success for certain object detection tasks than hand-craft features based detectors (Cheng et al., 2016; Girshick et al., 2016; Ren et al., 2015; Deng et al., 2017; Tang et al., 2016). In these approaches, object detection processing is split into two distinctive stages: proposal generation and object classification. The proposal generation stage aims to generate bounding boxes of object-like targets. The most common paradigm is based on a sliding-window search in which each image is scanned in all positions with different scales. While real-time detectors are available for specific classes of objects, e.g. Constant False Alarm Rate (CFAR) based ship detector in SAR images (Kuttikkad and Chellappa, 1994) or Aggregated Channel Features (ACF) feature based vehicle detector in aerial images (Liu and Mattyus, 2015), it has proven difficult to design multi-class detectors under this paradigm. Furthermore, searching for objects in high-resolution broad-area remote-sensing images led to heavy computational costs. Another popular paradigm samples hundreds of object-like regions, using a visual saliency attention stage, to reduce the search space for the whole image. Although successful detectors are available for multiclasses of objects, e.g. ten-class VHR object detection (Cheng et al., 2016) or simultaneous airport detection and aircraft recognition (Zhang and Zhang, 2016), the RoI generation capability of those saliency analysis based methods may become limited or even impoverished under complex background. What’s more, these methods are computationally expensive, at 3 s per image (about 600  800 pixels) in a CPU implementation. The object classification stage infers each region’s category by learning a classifier. As object-like region is usually carried out from feature space, powerful feature representation is very important for constructing a high-performance object detector. Convolutional neural networks (CNNs) are among the most prevalent deep learning methods (Dean et al., 2012; Yao et al., 2016; Cheng et al., 2017; Zhang et al., 2016b; Yuan et al., 2015; Feng et al., 2016), which can be employed as an universal feature extractor. Compared with hand-crafted features or shallow-learning-based features, CNN features relying on neural networks of deep architecture are more powerful for representation, which can significantly improve the performance of object detection (Cheng et al., 2016; Girshick et al., 2016). However, all candidate regions should be cropped and scaled into a fixed size (e.g. 224  224) supported by the CNN. This pre-requisite warping technique may reduce some critical and discriminative properties for object categories with larger size, resulting in low detection accuracy. While the aforementioned deep learning based object detection methods have shown impressive success for some specific object detection tasks in remote sensing images, they are all trained in clumsy and separate multi-stage pipelines. On the one hand, how to extract good potential object-like regions is a critical task and computational bottleneck for accurate object detection in largescale remote sensing images. The existed visual saliency attention based region generation methods involve abundant human ingenuity for features design that is only effective for specific classes of object detection task. On the other hand, region classification ignores the fact that the frame of detection task is a regression problem. In addition, less concern has been given to multi-class

object detection tasks, whereas automatically identifying multiclass objects simultaneously plays a significant role for the intelligent interpretation of remote sensing images. Therefore, a good detection model should be able to unify the above two distinctive stages into one unified framework that can detect multi-class objects simultaneously, and be applicable to the variety of data sources. In the field of computer vision, object detection is one of the most fundamental and challenging problems. In 2013, a breakthrough was made by Girshick et al. (2016), who proposed region-based CNN (R-CNN) detector that improves mean average precision (mAP) by more than 50% relative to the previous best result (Felzenszwalb et al., 2010). Since then, considerable efforts have been made to improve the detector along the R-CNN based pipeline. The most successful improved detector is Faster R-CNN (Ren et al., 2015), which consists of a region proposal network (RPN) for predicting candidate regions, and an object detection network (Girshick, 2015) for classifying object proposals and refining their spatial locations. It is an end-to-end data-driven detector that takes an image as input and outputs the location and category of objects simultaneously, which can effectively overcome the aforementioned drawbacks of the existing deep leaning based object detection methods in remote sensing images. This has motivated us to move from the multi-stage pipelines to the era of unified detection framework. However, although the Faster R-CNN based methods have proven to be very successful for detecting objects such as cars, people, or dogs in nature scene images, they are not specially designed to detect small objects in large images, several challenges in remote sensing images limit their applications in earth observation community: (1) Objects often appear at very different scales in remote sensing images. On the one hand, the scale variability is caused by image resolution. On the other hand, different object categories have large size differences. As shown in Fig. 1, ship and bridge have large difference in size. However, Faster RCNN generates candidate object-like regions by sliding a fixed set of filters with a single receptive field over the topmost convolutional feature maps, which creates an inconsistency between the size variability and fixed filter receptive fields. As shown in Fig. 1, a fixed receptive field (illustrated in the shaded area) cannot match the scale variability of different objects in remote sensing images. For small size or large size objects, the detection performance tends to be particularly poor. (2) Objects in large-scale remote sensing images are relatively small in size and appear in densely distributed groups (like the storage tanks in Fig. 1). Whereas Faster R-CNN struggles with small objects, this is because CNN feature used for detection is pooled from the topmost convolutional feature map with lower resolution. After multiple downsampling, the object size in the topmost convolutional feature map is 1/16 of the original size in the input images. This may lose some important information for small objects and lead to some missing detections. (3) Remote sensing images are enormous (often hundreds of megapixels), and there’s a relative dearth of labeled training data. The available manual annotation of objects are not sufficient in number for properly training the Faster R-CNN based methods. The collection of labeled samples through photointerpretation or terrestrial campaigns is time consuming and expensive, often requires expertise background. (4) Remote sensing images are generated by different instruments (e.g., multi/hyperspectral, SAR, etc.) with different resolutions. Remarkable efforts have been made in develop-

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Fig. 1. In remote sensing images, objects often appear at very different scales, as illustrated by the bounding boxes. A single receptive field (as shown in the shaded area) cannot match this large scale variability, especially for small object appear in groups, such as storage tanks.

ing various detectors for different types of remote sensing images. This raises an important demand for developing a common method for object detection in multi-modal remote sensing images. Unfortunately, the current state-of-the-art deep CNN based object detection methods has neither been widely applied to multi-modal remote sensing images, nor been verified its effectiveness. To address these issues, in this paper, we propose a unified and effective deep CNN based approach for simultaneously detecting multi-class objects in remote sensing images with large scales variability. Similar to Faster R-CNN, our method consists of two subnetworks: a multi-scale object proposal network (MS-OPN) and an accurate object detection network (AODN). Firstly, we redesign the architecture of feature extractor by adopting some recent building blocks, such as inception module, which can increase the variety of receptive field sizes. In order to ease the inconsistency between the sizes variability of objects and fixed filter receptive fields, MS-OPN is performed with several intermediate feature maps, according to the certain scale ranges of different objects, i.e. the larger objects are proposed in deeper feature maps with highly-abstracted information, whereas the smaller objects are proposed in shallower feature maps with fine-grained details. The intuition is that deeper layers with larger receptive fields are more suited for detecting large objects. Conversely, shallower layers with smaller receptive fields are suitable to detect small objects (Cai et al., 2016). The object proposals from various intermediate feature maps are combined together to form the outputs of MSOPN. Then those object proposals are send to the AODN for accurate object detection. For detecting small objects appear in groups, AODN combines several outputs of intermediate layers to increase the resolution of feature maps, enabling small and densely packed objects to produce larger regions of strong response. For large-scale remote sensing images with limited manual annotations, it is noteworthy that we crop image blocks for training with data rotation and re-scaling augmentation to increase the robustness of the detector to varying lighting conditions, atmospheric conditions, and sensors. Comprehensive evaluations on multi-modal remote sensing images (including optical satellite and aerial images, SAR images) and comparisons with state-of-the-art deep CNN based methods, demonstrate the effectiveness of the proposed method. The main contributions of our work are presented as follows: (1) We redesigned the CNN architecture by adopting powerful inception module to increases the variety of receptive field sizes that can capture both small and large objects more effectively. While inception has been explored for scene

3

classification, it is, as far as we know, for the first time used to verify its effectiveness in remote sensing image object detection tasks. (2) We developed a CNN-based method named MS-OPN to hypothesize multi-scale object proposals at multiple feature maps with different receptive field sizes that can improve the recall rate of widely varying-sized objects. To our best knowledge, less efforts has been made towards the detection of multi-scale objects using deep CNN based methods. (3) We combined multiple feature maps so that multiple levels of details can be considered simultaneously and the resolution will be increased, enabling the detection of small and densely packed objects more accurate. (4) We investigated a number of representative deep CNN based detection methods for the task of object detection using multi-modal remote sensing images and the results are reported as a useful performance baseline. The rest of this paper is organized as follows. Section 2 discusses related works. Section 3 describes the framework of our method in detail. Section 4 reports comparative experimental results on multi-modal remote sensing image data set. Finally, Section 5 concludes this paper. 2. Related work Here, we briefly introduce deep CNN based object detection methods in both compute vision community and earth observation community. 2.1. Deep CNNs for object detection in compute vision community Deep CNNs have made impressive improvements in object detection for recent years (Krizhevsky et al., 2012; Zeiler and Fergus, 2014; Simonyan and Zisserman, 2014; He et al., 2016; Iandola et al., 2016). Inspired by recent successes of deep CNNs for image classification (Krizhevsky et al., 2012), Girshick et al. (2016) proposed the R-CNN detector, which was among the first modern incarnations of CNN based detector. This method combines an object proposal mechanism (Uijlings et al., 2013), a CNN feature extractor, a Support Vector Machine (SVM) classifier and a bounding boxes regressor. While the R-CNN surpassed previous deformable part model (DPM) based detectors (Felzenszwalb et al., 2010) by a large margin, its speed is limited by duplicated computation from overlapping proposals. In order to increase the speed and accuracy of detection further, Fast R-CNN (Girshick, 2015) trained classifiers and bounding box regression with an end-to-end solution. However, it still depends on human-designed object proposal mechanism. More recently, Faster R-CNN (Ren et al., 2015) combined object proposal and detection into an unified network, achieving near real-time rates and state-of-the-art performance. This is a particularly influential detector and has led to a large number of follow-up improvements (including PVANET (Kim et al., 2016), R-FCN (Region-based Fully Convolutional Networks) (Li et al., 2016), MS-CNN (Multi-scale CNN) (Cai et al., 2016), YOLO (You Only Look Once) (Redmon et al., 2016; Redmon and Farhadi, 2016), and SSD (Single Shot Detector) (Liu et al., 2016)). Here we mainly focus on reviewing some representative detectors. Faster R-CNN divides the framework of detection in two stages (see Fig. 2). In the first stage, called the RPN, images are processed by a feature extractor (e.g., Zeiler and Fergus (ZF) mode (Zeiler and Fergus, 2014) and VGG16 model (Simonyan and Zisserman, 2014)), and the topmost feature maps are used to predict bounding box proposals. In the second stage, these proposals are used to crop features from the topmost feature maps which are subsequently fed

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

4

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Fig. 2. The architecture of Faster R-CNN.

to the Fast R-CNN for classification and bounding box regression. While Faster R-CNN is an order of magnitude faster than Fast RCNN, its speed is limited by the CNN feature extraction in the first stage and costly per-region computation in the second stage. PVANET (Kim et al., 2016) mainly redesigns the feature extraction part by combining recent technical innovations, achieving the state-ofthe-art accuracy in multi-category object detection task while minimizing the computational cost. R-FCN (Li et al., 2016) adopts the recent state-of-the-art Residual Nets (ResNets) to construct a fully convolutional object detector (see Fig. 3). On the one hand, ResNets has deep architectures (e.g. 50 or 101 layers) that are as translationinvariant as possible to effectively detect deformable objects (He et al., 2016). On the other hand, the fully connected layers in Fast R-CNN are removed and replaced by a set of position-sensitive score maps, which can encode the spatial information for accurate object detection. Based on these improvement, R-FCN model could achieve comparable accuracy to Faster R-CNN at faster running times. However, all above three detectors struggle with small-size objects detection, mainly due to the coarseness of the feature maps used for proposal generation or detection. HyperNet (Kong et al., 2016) and MS-CNN (Cai et al., 2016) conduct detection at multiple output layers, which provide an effective framework for multi-scale objects detection, but lead to heavy computational costs. Another interesting work is YOLO1 (Redmon et al., 2016), which use a single feed-forward convolutional network to directly predict classes and bounding boxes (see Fig. 4). The network reframes object detection as a regression problem. It divides the entire image into an S  S grid and for each grid cell predicts B bounding boxes with confidence score, and C class probabilities. These predictions are encoded as an S  S  ðB  5 þ CÞ tensor. This network runs at real-time speed, but with some compromise of detection accuracy, especially struggles with small-size objects that appear in groups. In order to make a tradeoff between speed and accuracy, YOLO2 (Redmon and Farhadi, 2016) and SSD (Liu et al., 2016) propose various improvements to the YOLO detection method. They remove the fully connected layers from YOLO1 and use anchor boxes to predict bounding boxes. Additionally, SSD combines predictions from multiple feature maps with different resolutions to handle objects of various sizes (see Fig. 5). However, these multiple feature maps are still coarse for small-size object detection. 2.2. Object detection in earth observation community Compared with object detection in natural scene images, detection of objects in remote sensing images (RSIs) is a more challenging task. It has been extensively studied during the past years.

Traditional methods regard object detection as a classification problem, this is achieved by learning a classifier to classify the RoIs (sliding windows or object proposals) into object categories and background (Han et al., 2015; Zhang et al., 2015a; Han et al., 2014; Du et al., 2016; Li et al., 2015; Lu et al., 2015; Zhang and Zhang, 2016; Yokoya and Iwasaki, 2015; Qiu et al., 2017; Qiu et al., 2018). In this framework, feature extraction plays quite important roles in the performance of object detection. In early studies, hand-craft features were adopted for representation. For example, Cheng et al. (2014) and He He et al. (2018) used histogram of oriented gradient (HOG) features and latent SVM to train deformable part-based mixture (DPM) models for multi-class object detection in optical RSIs and aircraft detection in SAR images respectively. Zhang et al. (2015b) designed the appearance features, the spatial deformation features and the rotation deformation features to detect airplanes. Yao et al. (2017) adopted the inter-frame difference algorithm and Otsus algorithm to detect moving ships on the water surface. Other methods adopt machine learning techniques to learn rotation-and-scale-invariant features for detection. For example, Yu et al. (2016) constructed a Hough forest model with embedded patch orientations and scale factors to estimate airplane centroids and handle airplanes with arbitrary orientations and sizes. Han et al. (2014) proposed to detect multi-class geospatial objects based on visual saliency modeling and discriminative learning of sparse coding. Schilling et al. (2018) presented a workflow that utilizes optical and elevation data to detect vehicles using machine learning features combined with specific features. With the development of deep learning, a number of RSIs object detection tasks based on deep learning have been conducted (Ševo and Avramovic´, 2016; Zhang et al., 2016a; Zhou et al., 2016; Diao et al., 2016; Cheng et al., 2016; Tang et al., 2016; Tang et al., 2017). Diao et al. (2016) employed visual saliency to generate a small number of bounding boxes, and then extracted features using deep belief networks (DBN). This method is suitable for simple environments. Chen et al. (2014a) and Chen et al. (2014b) presented a hybrid CNN to extract multi-scale features for vehicle detection in satellite images. Similarly, Chen et al. (2013) also proposed an effective aircraft detection method by DBN. However, those methods adopted time consuming sliding-window search paradigm to locate the vehicle or aircraft, which cannot be implemented to multiclass detectors under this architecture. Cheng et al. (2016) proposed a Rotation Invariant CNN (RICNN) model for multiclass object detection in VHR remote sensing images. However, the object-like regions are generated by unsupervised selective search algorithm (Uijlings et al., 2013), which may become unstable in complex

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

5

Fig. 3. The architecture of R-FCN.

Fig. 4. The architecture of YOLO1.

Fig. 5. The architecture of SSD.

environments. Zou and Shi (2016) proposed a novel ship detection method called Singular Value Decompensation Networks (SVDNet), which generated ship-like regions in the ship probability map and verified each ship candidate by using feature pooling operation and the linear SVM classifier. While this method provided a novel and promising ship detection framework, but it was trained in clumsy and slow multi-stage pipelines. In addition, land mask was adopted as a priori knowledge guide to remove the land regions, which limited its application to other scenes without Geographic Information System (GIS) database. All above deep CNN based detectors are limited by the need for generating object-like regions and the repeated CNN calculation. Furthermore, these deeply supervised network architectures often require a large amount of training data with manual annotation, whereas the manual annotation of objects in large image sets is generally expensive. Zhong et al. (2018) proposed a position-sensitive balancing (PSB) framework

for multi-class geospatial object detection from RSIs. This work adopts the position-sensitive strategy to balance the translationinvariance in the classification stage and the translation-variance in the object detection stage, on the basis of ResNet. Zhang et al. (2016a) proposed an aircraft detection method based on coupled CNNs in a weakly supervised framework. This method combines a candidate region proposal network to extract the image-level proposals of an aircraft and a localization network to locate the aircraft. It is a promising way to alleviate the human labor cost of annotation. However, it is not a real-time approach for detection and the performance need to be further improved. 3. Multi-scale CNN for object detection Fig. 6 illustrates the detailed architecture of our proposed method, which consists of a MS-OPN and an AODN. The MS-OPN

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

6

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Fig. 6. The architecture of our proposed method.

aims to generate multi-scale object-like regions1 with different filter receptive fields through several intermediate layers. Next, these object-like regions are sent to the AODN for accurate classification and regression. In the training phase, MS-OPN and AODN can be trained jointly and effectively with batch normalization (Ioffe and Szegedy, 2015) and residual connections (He et al., 2016). Due to the MS-OPN and AODN share the same CNN feature extraction stage, in this section, we will introduce our method in three parts, i.e. the architecture of convolutional feature extractor, MS-OPN and AODN.

3.1. Details on convolutional feature extractor design The convolutional feature extractor takes an image (of any size) as input and outputs multiple level features. The design of feature extractor is of crucial importance as the types of layers and number of parameters directly affect speed, memory, and performance of the detector. Recent evidence (He et al., 2016) reveals that very deep models, with a depth of hundreds of layers, can significantly improve the performance of many visual recognition tasks, e.g. image classification, object detection, semantic segmentation. However, it is difficult to directly use very deep models for object detection in remote sensing images. This is because on the one hand, these models lead to heavy computational cost, whereas remote sensing images are enormous (often hundreds of megapixels). On the other hand, very deep models need a huge amount of training samples, however, there’s a relative dearth of labeled remote sensing images as training data. Furthermore, the output feature maps should be able to capture both small and large objects effectively. In order to fulfill these requirements, we adopt the idea from PVANET (Kim et al., 2016) and GoogleNet (Szegedy et al., 2015), which used flexible convolutional kernel filter sizes with a layer-by-layer structure to make the model has less channels and more layers. Less channels means lightweight and low computational cost, and more layers means deeper architecture for better representation. This is achieved by adopting some recent building blocks, i.e. concatenated rectified linear unit (C.ReLU) (Shang et al., 2016) and inception module. C.ReLU is inspired from an interesting observation of activation patterns in CNNs. The output nodes of lower layers (the ones close to the input of CNNs) tend to be opposite-paired, i.e. the activation of one node is the opposite side of another node. Based on this observation, C.ReLU concatenates the output of one node with 1

In this paper we only consider axis-aligned bounding boxes as regions.

negation, which reduces the number of output channels by half without losing accuracy. Fig. 7 (a) illustrates the C.ReLU module applied to a K  K convolutional layer. In order to reduce the input size and enlarge the output capacity, 1  1 convolutional layers are added before and after C.ReLU module. Both positive and negative phase information of the K  K convolutional layer are concatenated to double the number of channels. Scaling/shifting layers and ReLU activation layers are added after concatenation to allow that activations in the negated part can be adaptive. Inception module clusters multiple convolutional layers of different kernel size (i.e. 1  1 convolution, 3  3 convolution, 5  5 convolution) into groups of units, which made it possible to increase the width and depth of a network without increasing the computational cost. Fig. 7(b) illustrates the Inception module. Each module consists of four sub-sequences. The sub-sequence in dashed box is added when we need to reduce feature map size by half (i.e. feature stride = 2). The number of channels for the remaining sub-sequences is set to 1/2, 1/4, 1/4 of the previous module, respectively. 5  5 convolution layer is replaced with a sequence of two 3  3 convolution layers for efficiency (Szegedy et al., 2016). As shown in Fig. 8, a chain of Inception modules can increase the variety of receptive field sizes, so that it can learn visual patterns of widely varying-sized objects. Fig. 6 shows the structure of our convolutional feature extractor, which has 16 convolutional layers in total, where C.ReLU is applied to the first 8 layers and inception is applied to the remaining 8 layers. It’s adopted as a CNN trunk for the following MS-OPN and AODN.

3.2. Multi-scale object proposal network (MS-OPN) A good detector should be able to cover a large range of object sizes. Traditional methods rescale the input image multiple times (as illustrated in Fig. 9(a)) or apply multiple filters to a single input image (as illustrated in Fig. 9(b)) to match all possible object sizes. This tends to be costly. For CNN-based detectors, based on the generated multiple level CNN features, RPN (Ren et al., 2015) slides a fixed set of filters with a single receptive field over the topmost convolutional feature maps to generate candidate object-like regions (as illustrated in Fig. 9(c)). On the one hand, this strategy leads to a scale inconsistency between the object size variability and fixed filter receptive fields. On the other hand, the topmost convolutional feature map is coarse to cover small size object. This compromises the performance of object detection in remote

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

7

Fig. 7. (a) C.ReLU building block. (b) Inception building block.

Fig. 8. The distribution of receptive field sizes in a chain of Inception modules. Each module consists of 3 kinds of convolutional layers with different channel sizes. The latter module can broaden the range of receptive field sizes, and achieving higher level of nonlinearity.

Fig. 9. Different strategies for multi-scale detection.

sensing images. In order to ease the inconsistency, we propose a new method, named MS-OPN, to generate object-like regions through several intermediate layers with different filter size (as

illustrated in Fig. 9(d)), which is inspired by SSD (Liu et al., 2016) and MS-CNN (Cai et al., 2016). Specifically, we added smaller size filters to capture densely packed objects in remote sensing images.

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

8

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Table 1 Parameter configurations of the different data set. Data set

Layer Filter

33

55

77

33

55

77

33

55

77

NWPU VHR-10

Anchor size Anchor ratio Anchor size Anchor ratio Anchor size Anchor ratio

24  24 1:2,1:1,2:1 24  24 3:2,1:1,2:3 10  10 1:2,1:1,2:1

40  40 1:2,1:1,2:1 28  28 3:2,1:1,2:3 16  16 1:2,1:1,2:1

56  56 1:2,1:1,2:1 32  32 3:2,1:1,2:3 22  22 1:2,1:1,2:1

80  80 1:2,1:1,2:1 36  36 3:2,1:1,2:3 28  28 1:2,1:1,2:1

112  112 1:2,1:1,2:1 40  40 3:2,1:1,2:3 34  34 1:2,1:1,2:1

160  160 1:2,1:1,2:1 44  44 3:2,1:1,2:3 40  40 1:2,1:1,2:1

224  224 1:2,1:1,2:1 48  48 3:2,1:1,2:3 46  46 1:2,1:1,2:1

320  320 1:2,1:1,2:1 52  52 3:2,1:1,2:3 52  52 1:2,1:1,2:1

448  448 1:2,1:1,2:1 56  56 3:2,1:1,2:3 58  58 1:2,1:1,2:1

Aerial-Vehicle SAR-Ship

Conv3_4

Conv4_4

As shown in Fig. 6, our proposed MS-OPN predicts multi-scale object-like regions through three proposal branches, i.e. sliding windows over the output feature map of conv3_4 layer, conv4_4 layer and conv5_4 layer respectively. Each proposal branch consists of three detection layers, which implement the sliding operation of different window size (i.e. 3  3; 5  5; 7  7) with a 3  3 convolutional layer, a 5  5 convolutional layer and a 7  7 convolutional layer to extract local feature representation X i for each sliding-window location.At each sliding-window position, we pre x

y

w

h

dict one anchor box Bi ¼ bi ; bi ; bi ; bi

x

y

(where bi and bi represent w

h

the top-left coordinates of the predicted region, bi and bi denote the width and height of the predicted region) according to the filter size. Each anchor box has three kinds of ratios (e.g. 2:1,1:1,1:2) to represent different objects’ aspect ratios. More details can be found in Table 1. In order to constructing training samples Sm for each detection layer, the predicted region boxes located outside the image boundary are discarded and remaining region boxes are assigned a class label Y i 2 f0; 1; 2; . . . ; Cg. If a predicted region box Bi has the highest Intersection-over-Union (IoU) overlap ratio with a ground-truth box Bi , we assign a positive label Y i P 1 to it. Whereas, if a predicted region box’s IoU ratio is lower than 0.2 for all ground-truth boxes, we assign a negative label Y i ¼ 0 to it. Then, the remaining regions are discarded. The IoU ratio is defined as follows:

areaðBi \ Bi Þ a¼ areaðBi [ Bi Þ

ð1Þ

where areaðBi \ Bi Þ represents the intersection of the predicted region box and ground truth box, and areaðBi [ Bi Þ denotes their union. With the above definitions, each detection layer’s training samples are defined as Sm ¼ fðX i ; Y i ; Bi ÞgNi¼1 . Inspired by Ren et al. (2015), the loss of each detection layer is a combination of classification and bounding box regression, which is defined as follows. m ^ l ðX; Y; BjWÞ ¼ Lcls ðpðXÞ; YÞ þ k½Y P 1Lbbr ðB ; BÞ

ð2Þ

where W stands for the parameters of the network, classification is a cross-entropy loss, loss Lcls ðpðXÞ; YÞ ¼  log pY ðXÞ pðXÞ ¼ ðp0 ðXÞ; . . . pC ðXÞÞ is the probability confidence over C þ 1 classes, kis the balancing parameter. Besides, ½Y P 1 denotes that the background is meaningless for training bounding box regres^ stands for regressed bounding box, Lbbr denotes a smooth L1 sion, B loss defined as

^ ¼ f ðB  BÞ; ^ Lbbr ðB ; BÞ L1 ( if jxj < 1 0:5x2 ; where f L1 ðxÞ ¼ jxj  0:5; otherwise

ð3Þ

Based on the above definitions, the overall loss function of our MS-OPN is a concatenation of each detection layer’ loss, which is achieved as follow.

LMSOPN ðWÞ ¼

Conv5_4

M X X

am lm ðX i ; Y i ; Bi jWÞ

ð4Þ

m¼1i2Sm

where M is the number of detection layers, in our work, M ¼ 9, which stands for three proposal branches with three detection layers. am denotes the weight of each detection layer’s loss. The optimal parameters W  ¼ argminW LMSOPN ðW Þ are optimized by stochastic gradient descent (SGD) (Lecun et al., 2008). To prevent over-fitting, we adopted the pre-trained model (Kim et al., 2016) for 1000-class Image-Net classification to initialize the convolutional layers. Since the training of deeper network becomes troublesome, we add residual shortcut connection structures onto C.ReLU layers and Inception layers (as shown in Fig. 7) to stabilize the training process. Besides, batch normalization layers (Ioffe and Szegedy, 2015) are added before all convolutional layers to accelerate training. During each iteration, we process a batch of training samples into the network and then update the parameters. When the training of the MS-OPN is finished, it takes an image as input and outputs the location of objects through several proposal branches. Then, these results are concatenated as the final proposal detections. 3.3. Accurate object detection network (AODN) Although MS-OPN could work as a detector, it is not stronger enough for accurate detection. On the one hand, the sliding operation do not cover objects well. On the other hand, the local feature representation X i for each predicted region box Bi is not powerful for better classification and bounding box regression. In order to increase detection accuracy, AODN is added after MS-OPN. It takes an image with its predicted region boxes (generated by MS-OPN) as input, and outputs the refined category and location simultaneously. Inspired by success of combing multi-scale representation in HyperNet (Kong et al., 2016) and ION (Bell et al., 2016), AODN combines multiple layers with different resolution to achieve more informative feature maps for accurate object detection (as shown in Fig. 7). Since objects in large-scale remote sensing images are relatively small in size and appear in densely distributed groups, specifically, we choose the conv3_4 layer as a reference layer, and concatenate the conv4_4 layer and conv5_4 layer with upscaling (using deconvolution layer). This is because conv3_4 layer with higher resolution is better suited for detecting densely peaked objects. As shallower layers are more suitable for location and deeper layers are more suitable for classification, the concatenated feature maps are complementary for small size object detection as shown in our experiments. Since these predicted object-like region boxes have different sizes, we adopted a ROI pooling layer for each box to generate features of a fixed dimension (e.g. 7  7  512). Then, the features are fed to the subsequent fully connected layers and branched into two parts for further classification and bounding box regression. The loss of AODN LAODN is similar to (2), which combines a crossentropy loss for classification and a smoothed L1 loss for bounding

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

9

box regression. The training samples SMþ1 is collected as Sm . Then the multi-task loss of (4) is extended to

LðW; W d Þ ¼

M X X

am lm ðX i ; Y i ; Bi jWÞ

m¼1i2Sm

þ

X

aMþ1 lAODN ðX i ; Y i ; Bi jW; W d Þ

ð5Þ

i2SMþ1

where aMþ1 denotes the weight of AODN’s loss. W d stands for the added parameters of fully connected layers and deconvolution layers. The parameters are learned jointly, i.e. ðW  ; W d Þ ¼ arg min LðW; W d Þ, by back-propagation throughout the unified network. Since the MS-OPN and AODN share the same CNN feature extraction stage, we adopt the pre-trained model of MS-OPN to initialize the convolutional layers of AODN. The additional new deconvolution layer weights are randomly initialized by a zero-mean Gaussian distribution with a 0.01 standard deviation. When the training is completed, our method takes an image as input and outputs about 300 detection results. Then, Non-Maximum Suppression (NMS) (Neubeck and Gool, 2006; Wan et al., 2015; Dollar, 2014) is adopted to reduce redundancy. 4. Experimental results In this section, we evaluate our method for object detection of multi-modal remote sensing images. Experiments are implemented based on the deep learning framework Jia et al. (2014) and executed on a PC with an Intel single Core i7 CPU, NVIDIA GTX-1060 GPU (6 GB memory), and 32 GB RAM. The PC operating system is Ubuntu 14.04. 4.1. Data set description and experimental configuration 4.1.1. Data set description Four data sets are used in the experiments. The first data set is a challenging ten-class geospatial object detection data set, i.e. NWPU VHR-10 data set (Cheng et al., 2014; Cheng and Han, 2016). This data set contains a total of 650 VHR optical remote sensing images (each image contains at least one target), where 565 color images were acquired from Google Earth with the spatial resolution ranging from 0.5 to 2 m, and 85 pansharpened color infrared images were acquired from Vaihingen data with a spatial resolution of 0.08 m. From this data set, 757 airplanes, 302 ships, 655 storage tanks, 390 baseball diamonds, 524 tennis courts, 159 basketball courts, 163 ground track fields, 224 harbors, 124 bridges, and 477 vehicles were manually annotated with bounding boxes used for ground truth. In our work, the NWPU VHR-10 data set was divided into 60% for training, and 40% for testing. Fig. 10 gives the statistics of object size in this data set. As can be observed, the targets have various sizes. To provide further verification, we evaluated the previous trained detector on another aircraft data set (Zhang et al., 2016a). This aircraft data set contains five large-scale satellite images, which were acquired from Google Earth. The first three satellite images were collected over the airport of Berlin, Sydney and Tokyo, respectively. The last two satellite images were collected from the aircraft boneyard of Davis Monthan Air Force Base (DM-AFB), USA. The details of these data sets are shown in Table 2. In this data set, aircrafts have various sizes, colors, and orientations. Specifically, aircrafts are densely packed in the aircraft boneyard, which can effectively demonstrate the superiority of our method. The third data set is a publicly available vehicle data set, namely Aerial-Vehicle data set, which was collected over the city of Munich, Germany. It contains 20 aerial images captured from an airplane by a Canon Eos 1Ds Mark III camera with a resolution of

Fig. 10. The statistics of object size in the NWPU VHR-10 data set. The bottom, middle black line and top of each bar represent the minimum size, mean size and maximum size of each category, respectively.

5616  3744 pixels (Liu and Mattyus, 2015). These aerial images were taken at a height of 1000 meters above ground, the ground sampling distance is approximately 13 cm. By following the works of Liu and Mattyus (2015), the first ten images are used for training and the other ten images for testing. For the ground truth, the vehicles in the images are annotated manually as oriented bounding boxes with attributes (e.g. orientation and type). Furthermore, in order to verity the generality of our detection method, we collected several Sentinel-1 images for training a ship detector. In this paper, 11 images were collected (whose main information is listed in Table 3.) and annotated with the help of AIS information and the Sentinel-1 Application Platform (SNAP) software. The data set, namely SAR-Ship data set, is highly diverse containing ship targets of different conditions. 4.1.2. Evaluation metrics We adopted three widely used criteria to quantitatively evaluate the performance of detection, namely, the precision-recall curve (PRC), average precision (AP), and F1-score. The Precision measures the fraction of detections that are true positives and the Recall measures the fraction of positives that are correctly identified. The AP metric is measured by the area under the PRC. The higher the AP value, the better the performance, and vice versa. The F1-score combines the precision and recall metrics to a single measure to comprehensively evaluate the quality of an object detection method, which is defined as follows.

F1 ¼

2Precision  Recall Precision þ Recall

ð6Þ

For evaluating the aircraft detection performance in the second data set, following Zhang et al. (2016a), four commonly used criteria were computed: false positive rate (FPR), missing ratio (MR), accuracy (AC), and error ratio (ER). These criteria are defined as follows.

FPR ¼

Number of falsely detected aircraft  100% Number of detected aircraft

ð7Þ

MR ¼

Number of missing aircraft  100% Number of aircraft

ð8Þ

Number of detected aircraft  100% Number of aircraft

ð9Þ

AC ¼

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

10

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Table 2 Image information about the aircraft data set. Data set

Imaging size (pixel)

Spatial resolution (m)

Aircraft size (pixel)

Aircraft number

Berlin Tegel Airport Sydney International Airport Tokyo Haneda Airport DM-AFB-1 DM-AFB-2

8160  3456 4992  8256 6528  7488 7929  5305 10320  11238

1 1 1 1 1

47—88 28  110 48—88 29—268 32—330

31 55 67 1007 1689

Table 3 Imaging information about the sentinel-1 imagery. Satellite

Imaging mode

Wave band

Polarization

Imagery format

Resolution (m  m)

Pixel size (m  m)

Ave size (pixel  pixel)

Sentinel-1A

IW

C

VH/VV

GRD

20  20

10  10

25300  18750

ER ¼ FPR þ MR

ð10Þ

Generally, a detection result is considered to be a true positive if the IoU overlap ratio, a, between a detected bounding box and ground truth bounding box, is greater than 0.5. Otherwise the detection is considered as a false positive. Furthermore, if several detections overlap with the same ground truth, only one detection with the highest overlap ratio is considered a true positive, and others are considered false positives. 4.1.3. Compared approaches To evaluate the detection performance of our method quantitatively, we compared it with seven competitive methods whose codes are publicly available. In our experiments, we use the choices laid out in the original papers when possible.  Faster RCNN (FRCN) (Ren et al., 2015) is a particularly influential detector. In our experiments, both the ZF mode (Zeiler and Fergus, 2014) and VGG model(Simonyan and Zisserman, 2014) are adopted as the feature extractor for evaluation, namely FRCN-ZF and FRCN-VGG. The ZF model has 5 convolutional layers and VGG model has 16 convolutional layers.  YOLO1 (Redmon et al., 2016) uses a single feed-forward convolutional network to directly predict classes and bounding boxes. In our experiments, we adopt the detection network from darknet-24 (Redmon et al., 2016), which has 24 convolutional layers followed by 2 fully connected layers.  YOLO2 (Redmon and Farhadi, 2016) is an improvement of YOLO1, which removes the fully connected layers and uses anchor boxes to predict bounding boxes. In our experiments, we adopt the detection network from darknet-19 (Redmon and Farhadi, 2016), which has 19 convolutional layers.  R-FCN (Li et al., 2016) is an improvements of FRCN, which adopts the 50 layers ResNets and replaces the fully connected layers with a set of position-sensitive score maps.  PVANET (Kim et al., 2016) is an improvement of FRCN. It combines the C.ReLU and Inception module to reconstruct the CNN architecture, which is similar to our feature extractor.  SSD (Liu et al., 2016) is an improvement of YOLO1, which uses anchor boxes to predict bounding boxes from multiple feature maps with different resolutions. Following (Liu et al., 2016), we adopt the VGG-16 model as the feature extractor.  MS-CNN (Cai et al., 2016) is an improvement of FRCN. It adopts the VGG-16 model as its feature extractor and predicts region proposal on several feature maps, which is similar to our MS-OPN.

In order to demonstrate the improvement of AODN, we evaluated our method in two configurations, namely ‘Ours’ and ‘OursFuse’. ‘Ours’ represents that each region’s deep feature is pooled from the topmost convolutional feature map, whereas ‘Ours-Fuse’ represents that each region’s deep feature is pooled from the fused feature map. Furthermore, we also add two traditional methods for comparison. For the third data set, i.e. vehicle detection in the aerial image, the Aggregated Channel Features (ACF) (Dollar et al., 2014) based detector is adopted as a comparison. It was implemented on the Piotr’s Computer Vision Matlab Toolbox (Dollar, 2014). In detail, this binary detector was trained with a sliding window size of 48  48 pixels and 2048 weak classifiers. For ship detection in SAR images, two-parameter CFAR detector is adopted as a comparison, which consists of four components, including calibration preprocessing, Land-sea masking, CFAR algorithm for prescreening, and ship size based discrimination. This binary detector was implemented on the software of SNAP. 4.1.4. Implementation details Owing to the limited size of the training set, performing data augmentation to artificially increase the number of training samples is necessary to avoid over-fitting. In our implementation, for each training image in NWPU VHR-10 data set, we simply rotated it in the step of 90 from 0 to 270 , and then flipped it horizontally and vertically respectively, which expanded the number of samples sixfold. In addition, we also randomly scaled each image in hue-saturation-value (HSV) space to increase the robustness of the detector to varying sensors, lighting conditions, and atmospheric conditions. Fig. 11 shows some training images rotated and rescaled in hue and saturation. We can observe that the augmented images are reasonable for representing the rotation of target, lighting changes and the variety of sensors. For large scale training images in Aerial-Vehicle data set and SAR-Ship data set, considering GPU memory and process speed, each original aerial image or SAR image was cropped into several adjacent image blocks with the resolution of 800  600 pixels. That is intentional for two reasons: First, images blocks are easy to augment. Second, small images consume less GPU memory, which can improve the training efficiency. In detail, we set the adjacent image block overlap ratio as 0.5 and remove the annotation of targets (i.e. vehicle or ship) that cross image block boundaries. Then, the image blocks without targets (i.e. vehicle or ship) are discarded, and the remaining image blocks are rotated with four angles (i.e., 0 ; 90 ; 180 and 270 ) that expanded the number of samples fourfold.

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

11

with a higher overlap ratio (a > hNMS ) will be filtered by the NMS strategy. 4.2. Results on NWPU VHR-10 data set

Fig. 11. Training images rotated and rescaled in hue and saturation.

During the training and testing process, for region proposal based methods (i.e. FRCN, R-FCN, PVANET, MS-CNN, and our method), the input images are resized such that its shorter side has 600 pixels. For regression based methods (i.e. YOLO1, YOLO2, SSD), the input images are always resized to a fixed shape 600  600 pixels. In our training process, MS-OPN and AODN are trained jointly. The MS-OPN is first trained by processing an image with the ground truth bounding boxes as input and predicts a set of object-like regions. In order to make the training stable in the early iterations. We use a small trade-off coefficient k ¼ 0:05 for the first 10 k iterations and a large trade-off coefficient k ¼ 1 for the next 15k iterations. The learning rate is set to 0.00005, and decreases by 0.1 every 10k iterations. We use a weight decay of 0.0005 and a momentum of 0.9 for the whole network. Sine the conv3_4 layer is close to the lower layers of the CNN feature extractor, it may affects gradients more than the following conv4_4 and conv5_4 layers. Thus we set a slightly lower weight (i.e. am ¼ 0:9) for the conv3_4 proposal branch and a slightly higher weight (i.e. am ¼ 1) for the conv4_4 proposal branch and conv5_4 proposal branch. The resulting model of MS-OPN is used to initialize the AODN, which is trained by another 25k iterations with an initial learning rate of 0.0005, and a trade-off coefficient lambda ¼ 1. Following the joint training process of region proposal based methods, aMþ1 is set to 1. The whole training process takes about 36 h on a single GPU implementation. For a fair comparison, each method was evaluated with top-100 detection results (i.e. top-100 ranked ones based on the confidence score). The NMS ratio was always set as hNMS ¼ 0:3, i.e. detections

Firstly, we evaluate the performance of candidate region generation in terms of proposal quality. The recall rates of different methods under different IoU thresholds are plotted in Fig. 12. For better comparison and visualization, we semi-log plot (1-IoU) for X-axes and (1-Recall) for Y-axes, and then reverse the X-axes and Y-axes. It can be observed that (1) With IoU thresholds increasing, the recall curve drops. In detail, the recall of FRCN and YOLO based methods drops more quickly than other deep CNN based methods, which demonstrates that FRCN and YOLO based methods have limited performance for accurate object detection in remote sensing images. (2) For different object categories, the recall curves of different methods for storage tank, tennis court, basketball court and vehicle are more complex and varied than other object classes. This is due to these objects are small in size and densely peaked with complex background, so that they are more difficult for detection with deep CNN based methods. (3) For different methods, OursFuse really improves the recall rate, especially for small size objects, such as storage tank, tennis court, vehicle classes. FRCNZF and FRCN-VGG have similar lower recall rate than other deep CNN based methods. YOLO2 achieves higher recall rate than YOLO1. The remaining deep CNN based methods have comparative performance. This shows that our method can generate candidate regions with higher recall for different kinds of objects with scale variability, which is desirable in object detection. Secondly, we evaluate the performance of detection. Fig. 13 displays the PRCs of ten different methods. For better comparison and visualization, we semi-log plot (1-recall) for X-axes and (1precision) for Y-axes, and then reverse the X-axes and Y-axes. As can be seen from them, (1) For the object categories of baseball diamond and ground track field classes, all methods achieved comparable excellent detection performance. Whereas for other eight object categories, the PRCs of different methods are varied. This is due to baseball diamond and ground track field classes have relatively larger size than other object classes, which is suitable for deep CNN based methods. (2) FRCN-VGG and YOLO2 have higher precision than FRCN-ZF and YOLO1 respectively. This demonstrates that deeper network and the added anchor boxes can improve the detection performance. (3) Compared with FRCN and YOLO based methods, other deep CNN based methods achieved better performance. This shows that the improvement of FRCN is effective for object

Fig. 12. Recall vs. IoU overlap ratio on the NWPU VHR-10 data set for airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle classes, respectively.

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

12

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Fig. 13. PRCs of the proposed method and other state-of-the-art approaches on the NWPU VHR-10 data set for airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle classes, respectively.

Table 4 Performance comparisons of eleven different methods in terms of AP values and average running time per image. The bold numbers denote the optimal values in each row. The italic underlined numbers denote the suboptimal values in each row. Method

COPD

RICNN

PSB

FRCN

FRCN

YOLO1

YOLO2

R-FCN

PVANET

SSD

MSCNN

Ours

Ours Fuse

Airplane

0.6230

0.8835

0.9070

0.8345

0.8277

0.6077

0.8733

0.9607

0.9586

0.9565

0.9869

0.9971

0.9873

Ship

0.6890

0.7734

0.8110

0.7706

0.7754

0.6274

0.8472

0.9828

0.9024

0.9356

0.8433

0.8898

0.9721

Storage tank

0.6370

0.8527

0.8050

0.4660

0.5247

0.2877

0.4265

0.7245

0.6665

0.6087

0.6933

0.8009

Baseball diamond

0.8330

0.8812

0.8940

0.9350

0.9631

0.8572

0.9312

0.9937

0.9837

0.9939

0.9839

0.9806

0.8063 0.9909

Tennis court

0.3210

0.4083

0.7730

0.4646

0.6286

0.5844

0.6570

0.8053

0.8765

0.8149

0.8607

0.9734

Basketball court

0.3630

0.5845

0.8980

0.6164

0.6879

0.8216

0.8553

0.9067 0.9779

0.9202

0.9200

0.9598

0.9255

Ground track field

0.8530

0.8673

0.8710

0.9848

0.9839

0.8872

0.9709

0.9903

0.9864

0.9838

0.9860

0.9709 0.9868

Harbor

0.5530

0.6860

0.7690

0.7708

0.8245

0.7509

0.8045

0.9250

0.9887 0.8999

0.9459

0.9720

0.1480

0.6151

0.7840

0.7777

0.7876

0.7248

0.8995

0.9341

0.8957

0.9460 0.9704

0.8949

Bridge

0.9436

0.9267

Vehicle

0.4400

0.7110

0.6900

0.6122

0.6384

0.5233

0.7075

0.8842

0.8140

0.7447

0.9684 0.7296

0.8221

0.8921

Mean AP

0.5460

0.7263

0.8200

0.7232

0.7642

0.6672

0.7973

0.8939

0.8859

0.9152

0.9487

1.07

8.77

0.10

0.29

0.34

0.10

0.13

0.9280 0.15

0.8835

Mean Time(s)

0.13

0.14

0.38

0.17

0.17

detection in remote sensing images. 4) Ours-Fuse achieved higher precision than Ours, which demonstrates that the combination of multiple feature maps can improve the detection performance. Table 4 shows the quantitative comparison results of eleven different methods measured by AP values, and average running time per image. The best performances are highlighted in bold. We added the CPD (Cheng et al., 2014), RICNN (Cheng et al., 2016) and PSB(ResNet-101) (Zhong et al., 2018) for comparison and achieved the following observation: (1) Compared with RICNN, FRCN-ZF achieved similar detection performance measured in mean AP. YOLO1 degenerated the mean AP from 72.63% to 66.72%. FRCN-VGG and YOLO2 obtained slightly performance gains, whereas other deep CNN based methods achieved larger performance gains in terms of mean AP. This is due to the RICNN was trained in clumsy multi-stage pipelines that involves generating candidate regions, extracting deep features, and training SVMs. Whereas the other ten comparison approaches are unified end-toend detector, the improvement of these unified detectors can achieve higher detection performance. Not surprisingly, CPD achieved the worst performance due to the limit feature representation of HOG. (2) PSB was an improvement of R-FCN on the basis of ResNet-101. However, it achieved lower mean AP value than R-FCN on the basis of ResNet-50. This is because we trained R-FCN on the augmented images that can increase the robustness of the detector. (3) For the object categories of storage tank, tennis court and vehicle which are smaller in size, compared with other

deep CNN based methods, our method (without fuse and with fuse) improved the detection performance significantly, which shows that the existing deep CNN based methods struggle with small-size objects, whereas our method is more effective for detection small size objects in remote sensing images. (4) Ours-Fuse achieved optimal or suboptimal AP values for most class of objects. This show that our method is effective for detecting objects with various size. R-FCN obtained comparable detection performance, this is due to the deeper architecture. Compared with Ours, OursFuse obtained 3.35% performance gains in terms of mean AP, which illustrates that the combination of multiple feature maps can effectively improve the detection performance. Compared with PVANET and MSCNN, Ours achieved 3.17% and 2.93% performance gains in terms of mean AP, showing that the region boxes predicted from multiple feature maps with different receptive field sizes and the redesigned deep feature extractor by adopting recent building blocks can improve the detection performance effectively. (5) With the same computation cost, RICNN is inefficient due to the multistage pipelines. Whereas the other ten unified detectors achieved near real time detection. In detail, YOLO1 is the fastest method, but with some compromise of detection accuracy. Our method achieved a better tradeoff of detection accuracy and speed. However, although our method has achieved the best performance, the detection accuracy for the object categories of storage tank and vehicle is still relatively low. As can be seen from Fig. 13, Ours-Fuse achieved near 100% recall rate for these two object cat-

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

13

Fig. 14. Confusion matrices of object and background classification.

Fig. 15. Number of object detection results with the proposed approach.

egories, but the AP value is degenerated. This is mainly because the small size of object leads to limited feature representation for accurate object detection. In our future work, we will consider encoding context and objects together to further improve the object detection performance. The performance of object classification is evaluated by 11-class confusion matrices(i.e. 10 classes of objects and a background category). The comparison results are depicted in Fig. 14. Generally, a detection result is considered to be an object if the IoU overlap ratio (between a detected bounding box and ground truth bounding box) is greater than 0.5. Otherwise it is classified as background. Furthermore, if several detections overlap with the same ground truth, all of them are considered objects. If these detections’ label do not match the ground truth label, this ground truth bounding box is considered missing detection. The right column of the matrix represents missing detections, and the bottom row of the matrix represents false alarms. It can be observed that: (1) without considering the background, almost all methods could correctly classify 10 classes of objects. This is because the deep features have powerful representation for classification and the distinction between the 10 classes of objects is obvious. (2) Compared with other deep CNN based methods, our method achieved less missing detections and false alarms. This demonstrates that our method has good performance for classifying objects with large scales variability. Fig. 15 shows a number of object detection results with the proposed approach. Different colors2 represent different kinds of object

2 For interpretation of color in Fig. 15, the reader is referred to the web version of this article.

categories. Some objects such as storage tank, tennis court are densely peaked, vehicles are small in size with complex background, ground track field has large size. Our method has successfully detected most of these objects demonstrating the effectiveness of our method. 4.3. Results on aircraft data set In order to illustrate the generalization ability of the proposed method, we directly used the previous trained ten-class detector on this aircraft data set. For large-scale images, we crop each image into small scale image blocks. In order to detect aircrafts at the boundary area, we set an overlap of 200 pixels (larger than the average size of aircraft, see Table 2) for each adjacent divided image block. Then, the image blocks are detected separately. Afterwards, according to the order of the division, the adjacent image blocks with detection results are stitched together by adding the corresponding upper left corners coordinates. For the quantitative evaluation, the PRCs and statistical results of ten different methods are shown in Fig. 16 and Table 5, respectively. As can be seen from them, (1) For airport images, all methods achieved comparative detection performance. This demonstrates the robustness of deep CNN based methods. Specially, R-FCN obtained a slightly better performance than our method, which demonstrates that deeper architecture has better generalization ability. (2) For DM-AFB images, our method obtained the highest recall rate and higher precision, which demonstrates the high superiority of our method for detecting densely peaked objects. In detail, Ours-Fuse obtained better performance than Ours, this demonstrates the combination of multiple

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

14

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Fig. 16. PRCs of the proposed method and other state-of-the-art approaches on the aircraft data set.

Table 5 Performance comparison between different methods on aircraft data set. The bold numbers denote the optimal values in each row. The italic underlined numbers denote the suboptimal values in each row. Data set

Method

FRCN-ZF

FRCN-VGG

YOLO1

YOLO2

R-FCN

Berlin Tegel Airport

FPR MR

7.14%(2/28) 9.68%(3/31)

53.57%(15/28) 9.68%(3/31)

30.00%(9/30)

23.33%(7/30)

3.23%(1/31)

3.23%(1/31)

29.03%(9/31) 0.00%(0/31)

AC

90.32%(28/31)

90.32%(28/31)

16.82%

63.25% 38.64%(17/44)

96.77%(30/31) 26.56% 25.00%(11/44)

100.00%(31/31)

ER FPR

96.77%(30/31) 33.23% 22.73%(10/44)

20.00%(11/55) 80.00%(44/55) 58.64% 70.77%(46/65) 2.99%(2/67)

20.00%(11/55) 80.00%(44/55) 42.73% 30.65%(19/62) 7.46%(5/67)

20.00%(11/55) 80.00%(44/55) 45.00% 31.15%(19/61) 8.96%(6/67)

5.45%(3/55) 94.55%(52/55) 24.69% 44.44%(28/63) 5.97%(4/67)

97.01%(65/67)

92.54%(62/67)

91.04%(61/67)

94.03%(63/67)

73.75% 34.39% (76/221) 78.05%(786/1007) 21.95%(221/1007) 112.44% 16.99%(95/559) 66.90%(1130/1689) 33.10%(559/1689) 83.90%

38.11% 105.20%(344/327) 67.53%(680/1007) 32.47%(327/1007) 172.73% 115.15%(874/759 55.06%(930/1689) 44.94%(759/1689) 170.21%

40.10% 95.11% (350/368) 63.46%(639/1007) 36.54%(368/1007) 158.56% 81.57%(646/792) 53.11%(897/1689) 46.89%(792/1689) 134.67%

50.41% 29.95% (127/424) 57.89%(583/1007) 42.11%(424/1007) 87.85% 21.47%(207/964) 42.92%(725/1689) 57.08%(964/1689) 64.40%

Sydney International Airport

Tokyo Haneda Airport

MR AC ER FPR MR AC

DM-AFB-1

DM-AFB-2

ER FPR MR AC ER FPR MR AC ER

7.32%(3/41) 25.45%(14/55) 74.55%(41/55) 32.77% 23.44%(15/64) 4.48%(3/67) 95.52%(64/67) 27.92% 154.92%(299/193) 80.83%(814/1007) 19.17%(193/1007) 235.76% 136.25%(654/480) 71.58%(1209/1689) 28.42%(480/1689) 207.83%

29.03% 19.23%(10/52)

Data set

Method

PVANET

SSD

MSCNN

Ours

Ours-Fuse

Berlin Tegel Airport

FPR

0.00%(0/30)

0.00%(0/29)

6.67%(2/30)

3.33%(1/30)

6.67%(2/30)

Sydney International Airport

Tokyo Haneda Airport

DM-AFB-1

DM-AFB-2

MR

3.23%(1/31)

6.45%(2/31)

3.23%(1/31)

3.23%(1/31)

3.23%(1/31)

AC

93.55%(29/31)

96.77%(30/31) 9.89%

96.77%(30/31) 6.56%

96.77%(30/31) 9.89%

ER

96.77%(30/31) 3.23%

FPR MR

13.33%(6/45) 18.18%(10/55)

6.45% 5.41%(2/37) 32.73%(18/55)

28.21%(11/39) 29.09%(16/55)

19.51%(8/41) 25.45%(14/55)

14.58%(7/48)

AC

81.82%(45/55)

67.27%(37/55)

70.91%(39/55)

74.55%(41/55)

87.27%(48/55) 27.31% 6.25%(4/64)

12.73%(7/55)

ER

31.52%

38.13%

57.30%

44.97%

FPR

28.81%(17/59)

23.08%(15/65)

20.63%(13/63)

2.99%(2/67)

5.97%(4/67)

4.48%(3/67)

97.01%(65/67)

94.03%(63/67)

95.52%(64/67) 10.73%

MR

11.94%(8/67)

8.77%(5/57) 14.93%(10/67)

AC

88.06%(59/67)

85.07%(57/67)

ER

40.75%

23.70%

26.06%

FPR

16.88% (91/539)

31.43% (182/579)

26.61% 33.81% (215/636)

38.03% (259/681)

MR

21.31% (65/305) 69.71%(702/1007)

46.47%(468/1007)

42.50%(428/1007)

36.84%(371/1007)

32.27%(326/1007)

AC

30.29%(305/1007)

53.53%(539/1007)

57.50%(579/1007)

91.02%

63.36%

73.94%

63.16%(636/1007) 70.65%

67.63%(681/1007)

ER FPR

5.40%(55/1019)

6.73%(68/1011)

13.84%(172/1243)

MR

6.29%(57/906) 46.36%(783/1689)

39.67%(670/1689)

40.14%(678/1689)

26.41%(446/1689)

13.97%(236/1689)

AC

53.64%(906/1689)

60.33%(1019/1689)

59.86%(1011/1689)

73.59%(1243/1689)

86.03%(1453/1689)

ER

52.65%

45.07%

46.87%

40.24%

20.58%

feature maps is useful and effective. Not surprisingly, FRCN and YOLO based methods achieved poor detection performance. Although R-FCN has deeper architecture, it’s still unsuitable for densely peaked object detection. To visualize the detection performance, we show the detection results of Ours-Fuse in Figs. 17–19. The green boxes denote correct detections, the red boxes denote false alarms and the blue boxes denote the ground truth. For airport images, despite very small aircrafts, the proposed approach has successfully detected most of the

70.41% 6.61%(96/1453)

aircrafts. For DM-AFB images, both larger size aircrafts, such as strategic bomber, and smaller size aircrafts, such as fighter, can be effectively detected. It is noteworthy that the previous trained detector was trained on the passenger aircraft samples, whereas the test DM-AFB images only contains military aircrafts. Our method still achieved the best detection performance, which demonstrated the highly generalization ability of our method. In order to compare the outputs of each algorithm and verify the effectiveness of detecting densely peaked objects with large

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

15

Fig. 17. Aircraft detection results with the proposed approach on three airport images. A green box denotes correct detection, a red box denotes false alarm and a blue box denotes the ground truth. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

scales variability, we selected a subarea that contains numerous densely peaked aircrafts with multiple size (e.g. the size of transporter is about five times larger than that of fighter). Fig. 20 shows the comparison results with different approaches. It can be seen that, our method could correctly detect most small size and densely peaked fighters, whereas other deep CNN based methods show more wrong detections. The results demonstrate that our method is more effective for densely peaked small size objects. Furthermore, our method have achieved lower missing detection for larger size transporter or bomber than other deep CNN based methods, which shows that our method is more effective for detecting objects with scales variability. 4.4. Results on aerial-vehicle data set Detecting vehicles in aerial images plays an important role for a wide range of applications. However, it is still a challenging

problem due to the relatively small size, varying types, and the complex background. In order to test the robustness of our method, we trained a vehicle detector on ten aerial images and evaluated the detection performance on the other ten aerial images. For large-scale aerial images, we crop each image into small scale image blocks with an overlap of 50 pixels (larger than the average size of vehicles). Then, the image blocks are detected separately. Afterwards, the adjacent image blocks with detection results are stitched together. Table 6 and Fig. 21 show the numerical comparison results, Recall vs. IoU overlap ratios, and PRCs of the eleven methods on the Aerial-Vehicle data set. We can observe the following: (1) In Fig. 21(a), the recall of ACF drops more quickly than deep CNN based methods. Ours and Ours-Fuse performs well and surpasses the other deep CNN based methods by a significant margin. This further demonstrates that our method can generate candidate vehicle-like regions with a relatively high recall, which is desirable

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

16

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Fig. 18. Aircraft detection results with the proposed approach on DM-AFB-1. A green box denotes correct detection, a red box denotes false alarm and a blue box denotes the ground truth. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 19. Object detection results with the proposed approach on DM-AFB-2. A green box denotes correct detection, a red box denotes false alarm and a blue box denotes the ground truth. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

in vehicle detection. (2) In Fig. 21(b), FRCN-VGG and YOLO2 have higher precision than FRCN-ZF and YOLO1 respectively. This demonstrates that deeper network and the added anchor boxes can improve the detection performance. Furthermore, the detection performance of Ours-Fuse is significantly improved over Ours, which demonstrates that the combination of multiple feature maps is useful for improving the detection performance. (3) In Table 6, Ours-Fuse could get the most desirable result in terms of F1 score and AP value, both of which are two comprehensive metrics on recall rate and precision rate. The precision rate of PVANET reached 93.94%, which is the best among all the eleven methods, but the recall rate is rather low. Ours-Fuse harvest the highest recall rate of 83.59%, while its precision rate is not very good. YOLO1 is the fastest method, but with some compromise of detection accuracy. Our method achieved a better tradeoff of detection accuracy and speed. Fig. 22 displays the visual performance of the detection result by Ours-Fuse method, where the green box indicates the correct

location, and the red and blue box indicate the false alarm and the ground truth, respectively. To see more details of the performance, we enlarged four cropped image blocks (which are shown by four rectangles). It shows explicitly that despite some vehicles located in the shade, the proposed approach has successfully detected most of the vehicles. Specially, in the parking area(as shown in the top-right sub-figure), vehicles are densely parked, our method still achieved an excellent detection results. This demonstrates that our method is effective for densely peaked small size objects. 4.5. Results on SAR-ship data set In order to illustrate the application of the proposed method in multi-modal remote sensing images, we explore the ship detection task in large scale SAR images. With the accurately annotated SAR ship dataset available, we trained our method and other nine baseline methods separately on ten large scale SAR images. During the

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

17

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Fig. 20. Aircraft detection results with different approach on the Aircraft data set.

Table 6 Performance comparison between different methods on Aerial-Vehicle data set. The bold numbers denote the optimal values in each column. The italic underlined numbers denote the suboptimal values in each column. Method

Ground Truth

True Positive

False Positive

Recall Rate (%)

Precision Rate (%)

F1 Score

AP

Time/ Per Image (s)

ACF FRCNN-ZF

5892 5892

2319 3819

4403 734

39.36 64.82

34.50 83.88

0.37 0.73

0.27 0.59

4.37

FRCNN-VGG YOLO 1 YOLO2 R-FCN

5892 5892 5892 5892

3917 3800 4170 4492

526 6200 3072

66.48 64.49 70.77 76.24

88.16 38.00 57.58 91.62

0.76 0.48 0.64

0.63 0.58 0.67

PVANET SSD MSCNN Ours

5892 5892 5892 5892

3288 4548 4214

55.80 77.19 71.52

93.94 66.26 78.85 80.76

0.83 0.70 0.71 0.75 0.82

0.73 0.53 0.71 0.67 0.76

Ours-Fuse

5892

89.17

0.86

0.80

4912 4925

411 212 2316 1130 1170 598

83.37 83.59

3.84 6.21 3.36 4.24 4.32 4.02 5.72 8.24 4.26 4.26

Fig. 21. (a) Recall vs. IoU overlap ratio on the Aerial-Vehicle data set. (b) PRCs of the proposed method and other state-of-the-art approaches on the Aerial-Vehicle data set.

testing process, large-scale SAR image was cropped into small scale image blocks with an overlap of 50 pixels (larger than the average size of ships). Then, the image blocks are detected separately. Afterwards, the adjacent image blocks with detection results are stitched together. Table 7 and Fig. 23 show the numerical comparison results, Recall vs. IoU overlap ratios, and PRCs of the eleven methods on the SAR-Ship data set. We can observe the similar performance of vehicle detection. (1) In Fig. 23(a), Ours and Ours-Fuse perform well and surpass the other deep CNN based methods by a significant margin. This further demonstrates that our method is effective for ship detection in SAR images and can generate candidate

ship-like regions with a relatively high recall. (2) In Fig. 23(b), FRCN-VGG and YOLO2 achieved higher precision than FRCN-ZF and YOLO1 respectively, which demonstrates that the improvement is still useful in ship detection task. Furthermore, the detection performance of Ours-Fuse is slightly improved over Ours, which demonstrates that the combination of multiple feature maps is useful for improving the detection performance. (3) In Table 6, Ours-Fuse achieves the best performance in terms of F1 score and AP, respectively. The highest recall rate can be obtained by Ours-Fuse of 81.44%. Although CFAR achieved the highest precision rate, the recall rate is rather low. Not surprisingly, YOLO1 is still the fastest method, our method is also considerably faster.

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

18

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Fig. 22. Vehicle detection results with the proposed approach on an aerial image. A green box denotes correct detection, a red box denotes false alarm and a blue box denotes the ground truth. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 7 Performance comparison between different methods on SAR-Ship data set. The bold numbers denote the optimal values in each column. The italic underlined numbers denote the suboptimal values in each column. Method

Ground Truth

True Positive

False Positive

Recall Rate (%)

Precision Rate (%)

F1 Score

AP

Time/ Per Image (s)

CFAR FRCNN-ZF

1358 1358

564 683

188

41.53 50.29

75.00

0.70 0.58

0.32 0.44

2550

FRCNN-VGG YOLO1 YOLO2 R-FCN PVANET SSD MSCNN Ours

1358 1358 1358 1358 1358 1358 1358 1358

958 671 785 932 878 671 968

0.69 0.54 0.60 0.69 0.62 0.56 0.66

0.62 0.45 0.53 0.64 0.55 0.41 0.64

Ours-Fuse

1358

0.71 0.72

0.72 0.74

1065 1106

289 464 440 466 425 607 389 618 566 619

70.54 49.41 57.81 68.63 64.65 44.91 71.28 78.42 81.44

70.27 67.37 60.40 62.75 68.68 59.12 63.30 61.03 65.30 64.12

53 121 48 59 67 56 103 128 63 63

Fig. 23. (a) Recall vs. IoU overlap ratio on the SAR-Ship data set. (b) PRCs of the proposed method and other state-of-the-art approaches on the SAR-Ship data set.

Fig. 24 exhibits the qualitative result on one testing image from Hong Kong port, where the green boxes indicate the correct location, the red and blue boxes represent the false alarms and the ground truth respectively. To see the detection results more clearly, we enlarged four small regions denoted by cyan rectangles. From this figure, we can conclude briefly that (1) Whether in the areas of offshore or inshore, most of the ships has been successfully detected. Specially, ships in inland rivers or around small islands can also be correctly detected, which shows that our method is effective and useful. It is noteworthy that our method takes a large scale SAR image as input, and outputs the ship detec-

tion results directly without land-masking. As shown in this figure, there are nearly no false alarms on land area. Therefore, it has great potential for wide field application. (2) For the inshore areas, ships are small in size and densely peaked with complex background, our method still achieved satisfying detection performance, which demonstrates that our method is effective for densely peaked small size objects. (3) Although there are some false alarms (the red boxes), they look very similar to ships. Without any further information to determine they are ground truth, we consider them as false alarms, which reduces the precision rate of our method.

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

19

Fig. 24. Ship detection results with the proposed approach on a SAR image. A green box denotes correct detection, a red box denotes false alarm and a blue box denotes the ground truth. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5. Conclusions

References

In this paper, we propose a unified and effective deep CNN based approach for simultaneously detecting multi-class objects in remote sensing images with large scales variability. The detection is preformed at a redesigned CNN feature extractor and followed by two sub-networks: a MS-OPN for object-like region generation from several intermediate layers, whose receptive fields match different object scales, and an AODN for object detection based on fused feature maps. The quantitative comparison results on the challenging NWPU VHR-10 data set, aircraft data set, AerialVehicle data set and SAR-Ship data set show that (1) our method achieved better performance for detecting objects with large scale variability. (2) our method is more effective than existing algorithms for densely peaked small size objects. (3) our data augmentation strategy is effective for training multi-modal remote sensing images (i.e. optical satellite and aerial images, SAR images) with limited annotation. In our future study, we will focus on learning rotation invariant deep features for object detection. Furthermore, we will consider training object detectors without using pretrained models on Image-Net (Shen et al., 2017). Additionally, we will incorporate a multi-GPU configuration to further reduce the computation time.

Bell, S., Lawrence Zitnick, C., Bala, K., Girshick, R., 2016. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2874–2883. Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N., 2016. A unified multi-scale deep convolutional neural network for fast object detection. In: European Conference on Computer Vision. Springer, pp. 354–370. Chen, X., Xiang, S., Liu, C.L., Pan, C.H., 2013. Aircraft detection by deep belief nets. In: Asian Conference on Pattern Recognition. pp. 54–58. Chen, X., Xiang, S., Liu, C.L., Pan, C.H., 2014a. Vehicle detection in satellite images by hybrid deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 11 (10), 1797–1801. Chen, X., Xiang, S., Liu, C.L., Pan, C.H., 2014b. Vehicle detection in satellite images by parallel deep convolutional neural networks. In: Pattern Recognition. pp. 181– 185. Cheng, G., Han, J., 2016. A survey on object detection in optical remote sensing images. ISPRS J. Photogram. Remote Sens. 117, 11–28. Cheng, G., Han, J., Lu, X., 2017. Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE PP (99), 1–19. Cheng, G., Han, J., Zhou, P., Guo, L., 2014. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogram. Remote Sens. 98 (1), 119–132. Cheng, G., Zhou, P., Han, J., 2016. Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 54 (12), 7405–7415. Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al., 2012. Large scale distributed deep networks. In: Advances in neural information processing systems, pp. 1223–1231. Deng, Z., Sun, H., Zhou, S., Zhao, J., Zou, H., 2017. Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks. IEEE J. Select. Top. Appl. Earth Observ. Remote Sens. 10 (8), 3652– 3664. Diao, W., Sun, X., Zheng, X., Dou, F., Wang, H., Fu, K., 2016. Efficient saliency-based object detection in remote sensing images using deep belief networks. IEEE Geosci. Remote Sens. Lett. 13 (2), 137–141. Dollar, P., 2014. Piotr’s Computer Vision Matlab Toolbox (PMT). . Dollar, P., Appel, R., Belongie, S., Perona, P., 2014. Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 36 (8), 1532–1545. Du, B., Zhang, Y., Zhang, L., Tao, D., 2016. Beyond the sparsity-based target detector: a hybrid sparsity and statistics-based detector for hyperspectral images. IEEE Trans. Image Process. 25 (11), 5345–5357. El-Darymli, K., McGuire, P., Power, D., Moloney, C., 2013. Target detection in synthetic aperture radar imagery: a state-of-the-art survey. J. Appl. Remote Sens. 7 (1), 071598. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D., 2010. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32 (9), 1627–1645. Feng, Y., Yuan, Y., Lu, X., 2016. Deep representation for abnormal event detection in crowded scenes. In: Proceedings of the 2016 ACM on Multimedia Conference. ACM, pp. 591–595. Girshick, R., 2015. Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448. Girshick, R., Donahue, J., Darrell, T., Malik, J., 2016. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38 (1), 142–158.

Acknowledgment The authors would like to thank Junwei Han, Gong Cheng, Miaozhong Xu, Fan Zhang, Kang Liu and Gellert Mattyus, who generously provided their NWPU VHR-10 data set, aircraft data set, Aerial-Vehicle data set with the ground truth. We thank ESA for providing free Sentinel data online and Shanghai Jiao Tong University for providing labeled Sentinel-1 ship data set. The authors would like to thank the Associate Editor who handled the manuscript and the anonymous reviewers for their valuable comments and suggestions. This work was supported in part by the National Natural Science Foundation of China under Grant 61303186, in part by the Fund of Innovation of National University of Defense Technology Graduate School (No. B150406). Appendix A. Supplementary material Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.isprsjprs.2018.04. 003.

Please cite this article in press as: Deng, Z., et al. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogram. Remote Sensing (2018), https://doi.org/10.1016/j.isprsjprs.2018.04.003

20

Z. Deng et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2018) xxx–xxx

Han, J., Zhang, D., Cheng, G., Guo, L., Ren, J., 2015. Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning. IEEE Trans. Geosci. Remote Sens. 53 (6), 3325–3337. Han, J., Zhou, P., Zhang, D., Cheng, G., Guo, L., Liu, Z., Bu, S., Wu, J., 2014. Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding. ISPRS J. Photogram. Remote Sens. 89, 37–48. He, C., Tu, M., Xiong, D., Tu, F., Liao, M., 2018. Adaptive component selection-based discriminative model for object detection in high-resolution sar imagery. ISPRS Int. J. Geo-Inform. 7 (2), 72–93. He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K., 2016. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and