FAST ANIMAL DETECTION IN UAV IMAGES USING CONVOLUTIONAL NEURAL NETWORKS

Benjamin Kellenberger, Michele Volpi, Devis Tuia
MultiModal Remote Sensing, University of Zurich (Switzerland)
{benjamin.kellenberger, michele.volpi, devis.tuia}@geo.uzh.ch
ABSTRACT

Illegal wildlife poaching poses a severe threat to the environment. Measures to stem poaching have so far met with only limited success, mainly because of the effort required to keep track of wildlife stock and to monitor individual animals. Recent developments in remote sensing have led to low-cost Unmanned Aerial Vehicles (UAVs), facilitating quick and repeated image acquisitions over vast areas. In parallel, progress in object detection in computer vision has yielded unprecedented performance improvements, partially attributable to algorithms like Convolutional Neural Networks (CNNs). We present an object detection method tailored to detecting large animals in UAV images. We achieve a substantial increase in precision over a robust state-of-the-art model on a dataset acquired over the Kuzikus wildlife reserve park in Namibia. Furthermore, our model processes data at over 72 images per second, as opposed to 3 for the baseline, allowing for real-time applications.

1. INTRODUCTION

In this paper we address the task of animal detection from sub-decimeter resolution images acquired by low-cost Unmanned Aerial Vehicles (UAVs). This task is of particular interest to wildlife conservation, where accurate and cost-effective solutions for animal monitoring would enable targeted counteractions to poaching [1, 2]. Such actions are of paramount importance, as reflected by the increasing numbers of killed individuals, which rose from 13 to 668 rhinos in South Africa between 2007 and 2012 [3] and amounted to tens of thousands of African elephants in 2011 alone [2].

An answer to the need for automatic counting might be found in UAV-based monitoring systems, which allow for frequent acquisitions over large areas at sub-decimeter resolution. Animal detection has traditionally been carried out by manual annotation, which requires trained experts and large amounts of time. To offer a more efficient system, we propose a pipeline performing animal (object) detection based on Convolutional Neural Networks (CNNs) [4]. An example of the results achieved by our system is shown in Fig. 1. We demonstrate the performance of our model on a set of UAV images of the Kuzikus wildlife reserve in Namibia (http://www.kuzikus-namibia.de/).

This work has been supported by the SNSF grant PP00P2 150593. The authors would like to acknowledge the SAVMAP project and MicroMappers for providing the data and ground truth used in this work.
[Figure 1: two image panels labelled "Ground truth" and "Predictions"]
Fig. 1. Example result of animal detection using the proposed model.

2. RELATED WORK

Object detection is the task of drawing bounding boxes around objects (localization) and identifying their classes (recognition) in an image. It is one of the most investigated fields in computer vision [5]. The common principle of object detection algorithms has hardly changed and still consists of two to three steps: (i.) identifying candidate locations for objects (object proposals), (ii.) extracting expressive features for each candidate region, and (iii.) classifying each location according to these features (see the sketch below). Traditional models typically rely on hand-crafted features like HOG [6] or SIFT [7] and use classifiers like Boosting [8] for recognition. Recently, CNNs have become the state of the art in many computer vision tasks by performing end-to-end, task-specific joint learning of features and classification. Standard object detection pipelines relying on CNNs typically perform steps (ii.) and (iii.) jointly. For step (i.) there are mainly two paradigms:

Sliding window-based: These models perform recognition at every possible location and scale in the image [9], in a sliding-window fashion.

Region proposal-based: These models suggest a large number of bounding boxes (object proposals) that are likely to contain an instance of any object. A discriminator then has the task of classifying each proposal into different object classes or background. Examples of such models are R-CNN [10] and its extensions, e.g. Fast R-CNN [11]; a proposal generator typically employed is Selective Search [12].
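To make the three generic steps listed above concrete, the following Python sketch wires them together in their most basic form. It is an illustration only: propose_regions, extract_features and classify are hypothetical placeholders standing in for, e.g., Selective Search, HOG/SIFT descriptors and a boosted classifier, and do not correspond to functions of any particular library.

```python
# Generic three-step detection loop (illustration only; all three callables
# are hypothetical placeholders, not part of any specific library).
def detect(image, propose_regions, extract_features, classify, score_cutoff=0.5):
    detections = []
    for box in propose_regions(image):             # (i.) candidate locations
        descriptor = extract_features(image, box)  # (ii.) per-region features (e.g. HOG)
        label, score = classify(descriptor)        # (iii.) recognition
        if label != "background" and score >= score_cutoff:
            detections.append((box, label, score))
    return detections
```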
Fig. 2. Architecture of the proposed model. It is based on a pre-trained AlexNet and learns specific features for recognition and localization in two separate branches.

Due to the large computational complexity of sliding-window detectors, object proposal-based models have traditionally been preferred. However, recent approaches such as YOLO [13] exploit the intrinsic multi-scale properties of CNNs and show large speed-ups compared to methods relying on object proposals. These models also reduce problems related to inaccurate object-proposal generation.

Remote sensing tasks have a number of peculiarities that set them apart from traditional computer vision problems: objects are typically seen only from above and at an absolute scale, orientation is not discriminative, and there is no absolute location prior. Successful applications of object detectors to overhead imagery include the detection of seals on Greenland ice shelves [14] and of airplanes [15]. Models not relying on hand-crafted features and region proposals have so far only hesitantly been applied to remote sensing datasets. We argue that the aforementioned properties of remote sensing scenarios could actually be beneficial to object detectors using pure CNN pipelines: objects of similar classes are of comparable absolute size and the candidate search space encompasses the entire image. We propose a CNN architecture that is optimized for fast and efficient detection of small, similarly sized objects at arbitrary locations and does not require object proposals.

3. PROPOSED ARCHITECTURE

Figure 2 presents the architecture of our model. We base it on a pre-trained instance of AlexNet [4], which has given good results in natural image classification [4] and has already been used successfully in object detection [11]. Under the assumption that low-level features are similar between different image analysis tasks (e.g. color gradients or edge detectors), we fine-tune the first layers of AlexNet and add new learnable layers on top.

We adopt a two-branch strategy in our CNN. Our network performs animal recognition (i.e., assigning a local likelihood score for the presence of an animal) and localization (i.e.,
localizing plausible animals) in a hybrid-parallel fashion. By doing so, we let the two branches of the network learn complementary aspects: the first learns the local appearance of animals, while the second learns the size of animals based on the local likelihood provided by the first branch. Both branches learn directly from AlexNet features. In this implementation, we predict locally over 24 × 24 cells (from 224 × 224 pixel inputs, as explained in the next section): during the forward pass, each cell receives a confidence score for the presence of an animal (from the first branch) and an estimate of the height and width of the most likely bounding box (from the second branch). The former is learned using two convolutional blocks (blocks 1a and 2a in Fig. 2). The latter is learned using both features learned directly from the image (the 128 filters coming from block 1b in Fig. 2), stacked with the confidence score used to perform recognition in the first branch, therefore letting the localization branch know the spatial extent of detections.

The confidence in the recognition branch is normalized using a sigmoid. We observed that using a sigmoid activation function reduced the chances of exploding gradients. In turn, BatchNorm layers [16] reduce vanishing gradient effects. The constrained output range of [0, 1] facilitates setting thresholds on the final confidence map. Further studies are needed to evaluate alternative scoring functions, such as Softmax. Note that during backpropagation the two branches both contribute to the update of the shared AlexNet features, therefore performing multi-task learning.
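The following PyTorch sketch illustrates the two-branch idea under stated assumptions: it reuses the first pre-trained AlexNet convolutional layers as the shared stem and resamples their output to the 24 × 24 prediction grid. The exact layer configuration, channel widths and the way the grid resolution is obtained in Fig. 2 differ from this minimal version.

```python
# Minimal sketch of the two-branch detector described in Section 3.
# Layer counts, channel widths and the resampling to the 24x24 grid are
# assumptions for illustration; they do not reproduce the exact blocks of Fig. 2.
import torch
import torch.nn as nn
from torchvision import models


class TwoBranchDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared, pre-trained AlexNet stem (first convolutional layers, fine-tuned).
        alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(*list(alexnet.features[:6]))  # -> 192 x 13 x 13
        # Bring the shared features to the 24x24 prediction grid (assumed resampling).
        self.to_grid = nn.Upsample(size=(24, 24), mode="bilinear", align_corners=False)
        # Branch a: animal recognition -> one confidence value per grid cell in [0, 1].
        self.recognition = nn.Sequential(
            nn.Conv2d(192, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 1, kernel_size=1), nn.Sigmoid())
        # Branch b: localization -> width and height per grid cell, conditioned on
        # both the shared features and the recognition confidence.
        self.loc_features = nn.Sequential(
            nn.Conv2d(192, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), nn.ReLU())
        self.loc_head = nn.Conv2d(128 + 1, 2, kernel_size=1)  # (width, height)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        f = self.to_grid(self.stem(x))         # (B, 192, 24, 24)
        conf = self.recognition(f)             # (B, 1, 24, 24)
        loc = self.loc_head(torch.cat([self.loc_features(f), conf], dim=1))  # (B, 2, 24, 24)
        return conf, loc


model = TwoBranchDetector()
confidence, sizes = model(torch.randn(1, 3, 224, 224))  # (1, 1, 24, 24), (1, 2, 24, 24)
```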
4. EXPERIMENTS

4.1. Dataset

We evaluate our pipeline on a dataset of UAV images of the Kuzikus Wildlife Reserve park in central Namibia, acquired by the SAVMAP project (http://lasig.epfl.ch/savmap) in May 2014 [17]. The campaign resulted in 654 RGB orthorectified images; an example can be found at http://dx.doi.org/10.5281/zenodo.16445. Ground truth was established via a crowdsourcing campaign organized by
MicroMappers (https://micromappers.wordpress.com) to retrieve the positions of large animals. Some missing animals were manually added and the bounding boxes were refined to exclude animal shadows. In the end, a total of 1196 animals could be identified.

4.2. Model Training

We divide all images (of size 4000 × 3000 pixels) into 224 × 224 sub-frames to match the predefined AlexNet input size, yielding a total of 1379 frames. Out of those, 690 frames (50%) with 1004 animal bounding boxes are used for training, 276 (20%; 372 bounding boxes) for validation and 413 (30%; 509 animals) for testing. The number of training examples is relatively small and could potentially lead to overfitting. However, note that only a small portion of the weights is learned from scratch, as we employ the pre-trained AlexNet in the common branch of our model and only fine-tune it. We use extensive data augmentation, including mirroring (horizontal and vertical), rotations, shifting (horizontal and vertical), as well as adding small Gaussian noise to the images.

We train our model with stochastic gradient descent with a momentum of 0.9 for 300 epochs, gradually reducing the learning rate from 10^-4 to 10^-7. We employ a weight decay of 0.001 and, for testing, use the average of the models from the last ten epochs. For training, we backpropagate over all grid cells that score with high confidence outside ground truth bounding boxes as negative examples, and over all grid cells corresponding to ground truth bounding boxes as positives (even when there are multiple detections on a single bounding box). We expect this procedure to make the model aware of all the appearance variations of animals. We filter out overlapping bounding box predictions using Non-Maximum Suppression (NMS) as a post-processing step (see the decoding sketch below).

As a baseline, we train a Fast R-CNN model using the same pre-trained AlexNet base network with object proposals from Selective Search [12]. This model corresponds to the CaffeNet model described in [11]. For both models, the NMS threshold and detection cutoff (the threshold above which a region is detected as "animal") were selected to yield maximum F1 scores on the validation set.
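As a rough illustration of this post-processing, the sketch below turns the per-cell outputs into boxes: cells above the detection cutoff are kept, a box of the predicted width and height is centred on each remaining cell, and overlaps are removed with NMS. The box parameterisation (cell-centred, sizes in pixels) and the default thresholds are assumptions; in our pipeline both thresholds are tuned on the validation set.

```python
# Hedged sketch of the decoding and NMS post-processing step; the exact
# parameterisation of the predicted boxes is an assumption for illustration.
import torch
from torchvision.ops import nms


def decode_detections(conf, loc, image_size=224, grid=24, cutoff=0.5, nms_iou=0.3):
    """conf: (1, grid, grid) in [0, 1]; loc: (2, grid, grid) with (width, height) in pixels."""
    cell = image_size / grid                               # pixels per grid cell
    ys, xs = torch.nonzero(conf[0] > cutoff, as_tuple=True)
    scores = conf[0, ys, xs]
    # Centre each candidate box on its grid cell and apply the predicted size.
    cx, cy = (xs.float() + 0.5) * cell, (ys.float() + 0.5) * cell
    w, h = loc[0, ys, xs], loc[1, ys, xs]
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    keep = nms(boxes, scores, nms_iou)                     # suppress overlapping boxes
    return boxes[keep], scores[keep]
```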
5. RESULTS AND DISCUSSION

Table 1 presents the performance measures obtained on the held-out test set; visual examples of detections are displayed in Fig. 3. The PASCAL VOC challenge requires a predicted bounding box to have an Intersection-over-Union (IoU) score above 50% with the closest ground truth bounding box in order to be counted as a true positive [18]. However, this threshold is optimized for problems where objects occupy a large fraction of the image. Assuming a fixed image size, deviations of a prediction from its closest ground truth have a much more severe impact (in IoU terms) for small bounding boxes than for large ones. In our case, the coverage of animal bounding boxes hardly exceeds one percent of the image area. We therefore counted a prediction as a true positive if its IoU exceeded 25%. We did, however, retain the rule that multiple predictions of the same object count as only one positive hit, which is crucial when automatically counting instances.

Table 1. Detection results. A bounding box is counted as correct if its IoU with the closest ground truth exceeds 25%.

                         Fast R-CNN (baseline)   Proposed Model
Ground Truth Objects     509                     509
True Positives           429                     379
False Positives          843                     254
False Negatives          80                      130
Precision (UA)           0.34                    0.60
Recall (PA)              0.84                    0.74
F1 Score                 0.48                    0.66
Avg. Speed [Hz]          2.96                    73.62
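For reference, the following minimal Python sketch implements this counting scheme under stated assumptions: predictions are matched greedily to ground truth in order of decreasing confidence, each ground truth box can be matched at most once, and duplicate hits on an already-matched animal are therefore counted as false positives. The greedy matching order is an assumption about how the counting is implemented.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(pred_boxes, pred_scores, gt_boxes, iou_thresh=0.25):
    """Precision, recall and F1 with greedy one-to-one matching at the given IoU."""
    matched, tp = set(), 0
    for i in sorted(range(len(pred_boxes)), key=lambda k: -pred_scores[k]):
        # Best still-unmatched ground truth box for this prediction.
        candidates = [(iou(pred_boxes[i], gt), j) for j, gt in enumerate(gt_boxes) if j not in matched]
        if candidates:
            best_iou, best_j = max(candidates)
            if best_iou >= iou_thresh:
                matched.add(best_j)
                tp += 1  # further hits on the same animal remain false positives
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```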
In comparison to Fast R-CNN, we observe a significant increase in precision (User's Accuracy; UA) of about 0.25 points. This improvement is attributable to a lower number of false positives (254 vs. 843). On the one hand, our model has to evaluate far fewer potential candidates (24 × 24 = 576 against up to 5000 proposals), and thus the upper bound on the overall false alarm rate is lower. On the other hand, the object probability map from our model allows tuning the localization in one backward pass, a property that does not hold for proposal-based methods. The examples in Fig. 3 indeed confirm the lower false alarm rate. Our model struggles to detect certain animals, leading to a recall (Producer's Accuracy; PA) that is 0.1 points lower. To some extent, our model is more prone to predicting multiple, smaller hits per animal, as can be seen in the two bottom-left images of Fig. 3. This issue could be partially mitigated by employing stronger NMS, although this does not correct for bounding boxes that are too small. Instead, predicting on a slightly coarser confidence grid (i.e., smaller than 24 × 24) could lead to improvements, as it decreases the chance that single animals are spread over multiple prediction cells.

Finally, while Fast R-CNN evaluates an average of 2.96 images per second (an estimate that does not include the time required to calculate the object proposals), our model processes 72.65 images in the same amount of time, both evaluated on a GTX 980 Ti graphics card. Our model could thus be used for real-time applications, reducing latency in wildlife monitoring.
[Figure 3 panels: correct predictions by our model; multiple detections of a single animal; false positives; missed animals]
Fig. 3. Detection examples on the test set for our model (blue; with IoU scores for the predictions) and the Fast R-CNN baseline (cyan). Ground truth is shown in green. The top row shows correct detections of our model, while the bottom row shows failure cases.

6. CONCLUSION

Illegal poaching of wildlife animals remains a global threat and calls for measures enabling near real-time monitoring of wildlife stock. In this paper, we proposed an animal detection system able to operate efficiently on sub-decimeter images acquired by UAVs. Experiments have shown our proposed method to be far more precise in predicting the location of animals in images than the state-of-the-art Fast R-CNN model. Our model is able to predict sufficiently accurate bounding boxes while producing significantly fewer false positives. Moreover, our system is able to operate in near real time (73 Hz), which is a promising characteristic for more ubiquitous deployment in real-life, automated animal monitoring systems.

7. REFERENCES

[1] M. Mulero-Pázmány, R. Stolper, L. D. Van Essen, J. J. Negro, and T. Sassen, "Remotely piloted aircraft systems as a rhinoceros anti-poaching tool in Africa," PLoS One, vol. 9, no. 1, pp. 1–10, 2014.

[2] G. Wittemyer, J. M. Northrup, J. Blanc, I. Douglas-Hamilton, P. Omondi, and K. P. Burnham, "Illegal killing for ivory drives global decline in African elephants," Proc. Natl. Acad. Sci., vol. 111, no. 36, pp. 13117–13121, 2014.

[3] D. Biggs, F. Courchamp, R. Martin, and H. P. Possingham, "Legal trade of Africa's rhino horns," Science, vol. 339, no. 6123, pp. 1038–1039, 2013.

[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.

[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[6] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," CVPR, pp. 886–893, 2005.

[7] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[8] A. Torralba, K. P. Murphy, and W. T. Freeman, "Sharing features: efficient boosting procedures for multiclass object detection," CVPR, vol. 3, pp. 762–769, 2004.

[9] D. Park, D. Ramanan, and C. Fowlkes, "Multiresolution models for object detection," ECCV, vol. 6314, pp. 1–14, 2010.

[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CVPR, pp. 580–587, 2014.

[11] R. Girshick, "Fast R-CNN," ICCV, pp. 1440–1448, 2015.

[12] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, 2013.

[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," arXiv, 2016.

[14] A.-B. Salberg, "Detection of seals in remote sensing images using features extracted from deep convolutional neural networks," IGARSS, pp. 1893–1896, 2015.

[15] F. Zhang, B. Du, L. Zhang, and M. Xu, "Weakly supervised learning based on coupled convolutional neural networks for aircraft detection," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 9, pp. 1–11, 2016.

[16] S. Ioffe and C. Szegedy, "Batch normalization: accelerating deep network training by reducing internal covariate shift," arXiv, pp. 1–11, 2015.

[17] F. Ofli, P. Meier, M. Imran, C. Castillo, D. Tuia, N. Rey, J. Briant, P. Millet, F. Reinhard, M. Parkan, and S. Joost, "Combining human computing and machine learning to make sense of big (aerial) data for disaster response," Big Data, vol. 4, no. 1, pp. 47–59, 2016.

[18] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.