Rotation-Invariant Object Detection in High-Resolution Satellite Imagery Using Superpixel-Based Deep Hough Forests

Yongtao Yu, Haiyan Guan, Member, IEEE, and Zheng Ji
Abstract—This letter presents a rotation-invariant method for detecting geospatial objects from high-resolution satellite images. First, a superpixel segmentation strategy is proposed to generate meaningful and nonredundant patches. Second, a multilayer deep feature generation model is developed to generate high-level feature representations of patches using deep learning techniques. Third, a set of multiscale Hough forests with embedded patch orientations is constructed to cast rotation-invariant votes for estimating object centroids. Quantitative evaluations on images collected from the Google Earth service show that average completeness, correctness, quality, and F1-measure values of 0.958, 0.969, 0.929, and 0.963, respectively, are obtained. Comparative studies with three existing methods demonstrate the superior performance of the proposed method in accurately and correctly detecting objects that are arbitrarily oriented and of varying sizes.

Index Terms—Airplane detection, deep learning, Hough forest, object detection, rotation invariance, ship detection.

Manuscript received December 19, 2014; revised April 20, 2015; accepted May 10, 2015. Date of publication September 25, 2015; date of current version October 27, 2015. This work was supported in part by the National Natural Science Foundation of China under Grant 41501501 and in part by the Natural Science Foundation of Jiangsu Province under Grant BK20151524. (Corresponding author: Haiyan Guan.)

Y. Yu is with the Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an 223003, China (e-mail: allennessy.yu@gmail.com).

H. Guan is with the College of Geography and Remote Sensing, Nanjing University of Information Science and Technology, Nanjing 210044, China (e-mail: [email protected]; [email protected]).

Z. Ji is with the School of Remote Sensing Information and Engineering, Wuhan University, Wuhan 430079, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/LGRS.2015.2432135
I. INTRODUCTION
With the development of remote sensing technologies, currently available satellite sensors can easily capture high-resolution remotely sensed images. Benefiting from the advance of Internet services, the easy access and distribution of high-resolution remotely sensed images have provided researchers with new opportunities to analyze and interpret the contents of these images. As a result, extensive studies and applications (e.g., detection, recognition, and classification) have focused on high-resolution remotely sensed images in the fields of remote sensing, photogrammetry, intelligent transportation, security surveillance, etc. As a fundamental research topic, the detection and recognition of small man-made objects (e.g., vehicles, ships, and airplanes) have attracted significant attention from researchers for the applications of change detection [1], environmental monitoring [2], and traffic management
[3]. However, different from ground-shot images, the within-class orientation and scale variations, as well as between-class similarities, pose great challenges for robust and reliable object detection in high-resolution remotely sensed images.

In [4], a robust invariant generalized Hough transform was proposed to detect inshore ships; to handle shape deformations, an iterative training procedure was used to learn a robust shape model. A spatial sparse coding bag-of-words model was developed in [5] to detect objects with complex shapes; this model encodes the geometric information of object parts and handles rotation variations. A contour-based spatial model was proposed in [6] for accurately detecting geospatial objects; in this method, dynamic programming and multisegmentation were applied to compute the similarity of contour information. Similarly, local edge distributions [7] were explored to detect regions of salient textures and objects. In [8], a rotation-invariant parts-based model was developed to detect objects with complex shapes and varying orientations; to achieve rotation invariance, the structural information among the object parts was depicted and regulated in polar coordinates. To extract multiscale features, a hybrid deep convolutional neural network (HDNN) [9] was proposed to detect vehicles; in the HDNN, the maps of the highest convolutional layer and the max-pooling layer were divided into multiple blocks of specific sizes. In [10], an entropy-balanced bitmap tree was proposed for content-based object retrieval; objects of varying scales were extracted and encoded into bitmap shape representations. To detect objects with arbitrary orientations, a color-enhanced rotation-invariant Hough forest method [11] was developed to detect airplanes and buildings; in this method, a pose-estimation-based rotation-invariant Texton forest was trained to cast rotation-invariant votes for estimating object centroids. In [12], objects were detected using structural feature description and query expansion; the feature description combined both local and global information of objects, and the detection task was converted into a ranking query task using a ranking support vector machine (SVM). A collection of part detectors was used in [13] for detecting multiclass geospatial objects, where each part detector is a linear SVM classifier for detecting objects or recurring spatial patterns. Recently, other methods, such as the discriminatively trained mixture model [14], texture motifs [15], and saliency and gist features [16], have also been exploited to detect geospatial objects.

In this letter, we develop a rotation-invariant method for detecting geospatial objects from high-resolution satellite imagery. The proposed method can effectively handle objects with varying appearances, orientations, and sizes. The contributions include 1) a superpixel segmentation strategy for generating local patches; 2) a deep feature generation model for generating high-level feature representations of local patches; and 3) a set of multiscale Hough forests integrated with patch orientations and scale factors for estimating object centroids.
Fig. 1. (a) Superpixel-based patch generation strategy. (b) DBM for learning patch features. (c) Deep feature generation model.
Fig. 2. Illustration of a Hough forest training framework.
II. METHOD

A. Deep Feature Generation Model

Deep learning techniques have attracted significant attention for their capability of retrieving multilevel feature abstractions. Among deep learning models, the deep Boltzmann machine (DBM) [17] has proven to be a powerful and highly distinctive feature representation model. In this letter, we construct a deep feature generation model, based on a DBM, for generating high-level feature representations of local patches.

As shown in Fig. 1(a), the training samples are partitioned into a set of local patches with a size of n × n pixels. To generate meaningful and nonredundant patches, we adopt a superpixel segmentation strategy to partition the training samples. First, each training sample is segmented into superpixels using simple linear iterative clustering (SLIC) [18]. To guarantee that the patches cover the neighborhood information of the superpixels, we design each superpixel to be approximately half the size of a patch; thus, the parameter k (the number of superpixels) in SLIC is set to 2N/(n × n), where N is the number of pixels in a training sample. Then, a local patch of n × n pixels is generated centered at each superpixel. Next, the intensities of the generated patches are normalized into the range [0, 1] in each RGB color channel.
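As an illustration of this strategy, the following sketch (assuming scikit-image; the function name, default patch size, and border handling are our own choices, not from the letter) segments an image with SLIC and extracts normalized patches centered on the superpixel centroids.

```python
# A minimal sketch of the superpixel-based patch generation strategy.
import numpy as np
from skimage.segmentation import slic
from skimage.measure import regionprops

def generate_patches(image, patch_size=16):
    """Partition an RGB image into n x n patches centered on SLIC superpixels."""
    n = patch_size
    N = image.shape[0] * image.shape[1]      # number of pixels in the sample
    k = int(2 * N / (n * n))                 # k = 2N/(n x n), as in the letter
    labels = slic(image, n_segments=k, start_label=1)
    half = n // 2
    patches = []
    for prop in regionprops(labels):
        r, c = (int(round(x)) for x in prop.centroid)
        # Keep only patches that fit entirely inside the image (our choice).
        if half <= r < image.shape[0] - half and half <= c < image.shape[1] - half:
            patch = image[r - half:r + half, c - half:c + half].astype(np.float64)
            patches.append(patch / 255.0)    # normalize each channel to [0, 1]
    return patches
```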
As shown in Fig. 1(b), the normalized patches are used to train a two-layer DBM. Denote $\mathbf{v} \in [0,1]^{3n^2}$ as the visible variables representing a linear arrangement of the RGB color channels of a training patch, and denote $\mathbf{h}^1 \in \{0,1\}^{D_1}$ and $\mathbf{h}^2 \in \{0,1\}^{D_2}$ as the lower- and higher-layer binary hidden variables, respectively. Then, the energy of the joint configuration $\{\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2\}$ is defined as

$$E(\mathbf{v},\mathbf{h}^1,\mathbf{h}^2;\theta) = \frac{1}{2}\sum_{i=1}^{3n^2}\frac{v_i^2}{\sigma_i^2} - \sum_{i=1}^{3n^2}\sum_{j=1}^{D_1}\frac{v_i}{\sigma_i}\,w_{ij}^1 h_j^1 - \sum_{j=1}^{D_1}\sum_{m=1}^{D_2} h_j^1 w_{jm}^2 h_m^2 - \sum_{i=1}^{3n^2}\frac{v_i}{\sigma_i}\,b_i - \sum_{j=1}^{D_1} h_j^1 a_j^1 - \sum_{m=1}^{D_2} h_m^2 a_m^2 \tag{1}$$

where $\theta = \{\mathbf{W}^1, \mathbf{W}^2, \sigma, \mathbf{b}, \mathbf{a}^1, \mathbf{a}^2\}$ are the model parameters: $\mathbf{W}^1$ and $\mathbf{W}^2$ represent the visible-to-hidden and hidden-to-hidden symmetric interaction terms, respectively; $\sigma$ denotes the standard deviations of the visible variables; and $\mathbf{b}$, $\mathbf{a}^1$, and $\mathbf{a}^2$ represent the visible and hidden biases, respectively. The marginal distribution over the visible vector $\mathbf{v}$ takes the following form:

$$P(\mathbf{v};\theta) = \frac{\sum_{\mathbf{h}^1,\mathbf{h}^2}\exp\left(-E(\mathbf{v},\mathbf{h}^1,\mathbf{h}^2;\theta)\right)}{\int_{\mathbf{v}'}\sum_{\mathbf{h}^1,\mathbf{h}^2}\exp\left(-E(\mathbf{v}',\mathbf{h}^1,\mathbf{h}^2;\theta)\right)d\mathbf{v}'}. \tag{2}$$
The conditional distributions over the visible variables and the two sets of hidden variables are expressed as follows:

$$p\left(h_j^1 = 1\,\middle|\,\mathbf{v},\mathbf{h}^2\right) = g\left(\sum_{i=1}^{3n^2}\frac{v_i}{\sigma_i}\,w_{ij}^1 + \sum_{m=1}^{D_2} h_m^2 w_{jm}^2 + a_j^1\right) \tag{3}$$

$$p\left(h_m^2 = 1\,\middle|\,\mathbf{h}^1\right) = g\left(\sum_{j=1}^{D_1} h_j^1 w_{jm}^2 + a_m^2\right) \tag{4}$$

$$p\left(v_i = x\,\middle|\,\mathbf{h}^1\right) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\left(-\frac{\left(x - \sigma_i\left(\sum_{j=1}^{D_1} h_j^1 w_{ij}^1 + b_i\right)\right)^2}{2\sigma_i^2}\right) \tag{5}$$
where $g(x) = 1/(1 + \exp(-x))$ is the logistic function [17].

To train the model parameters effectively and rapidly, greedy layer-wise pretraining [17] is first applied to initialize the model parameters $\theta$. Then, an iterative training algorithm integrating variational and stochastic approximation approaches [17] is adopted to fine-tune the model parameters. Once the DBM is trained, the stochastic activities of the binary features in each hidden layer are replaced by deterministic real-valued probability estimates to construct a deep feature generation model [see Fig. 1(c)]. For each visible vector $\mathbf{v}$, mean-field inference [19] is used to produce an approximate posterior distribution $Q(\mathbf{h}^2|\mathbf{v})$. The deep feature generation model is then augmented with the marginal $q(\mathbf{h}^2|\mathbf{v})$ of this approximate posterior. Finally, the top layer of the deep feature generation model produces the high-level feature representation

$$\mathbf{I} = g\left(g\left(\frac{\mathbf{v}^{T}}{\sigma^{T}}\mathbf{W}^1 + q(\mathbf{h}^2|\mathbf{v})^{T}(\mathbf{W}^2)^{T} + (\mathbf{a}^1)^{T}\right)\mathbf{W}^2 + (\mathbf{a}^2)^{T}\right) \in [0,1]^{D_2}. \tag{6}$$
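To make this inference step concrete, the following sketch (our own NumPy illustration, not the authors' C++ implementation; the fixed number of mean-field iterations is an assumption) computes a patch feature via mean-field updates followed by the deterministic forward pass of (6).

```python
# A minimal sketch of the deep feature generation model, assuming a trained
# DBM with parameters W1 (3n^2 x D1), W2 (D1 x D2), sigma (3n^2,), a1, a2.
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))              # logistic function

def deep_feature(v, W1, W2, sigma, a1, a2, mf_iters=10):
    """Map a normalized patch vector v in [0,1]^{3n^2} to a feature in [0,1]^{D2}."""
    x = v / sigma                                # scaled visible vector v / sigma
    # Mean-field inference for the approximate posterior q(h2 | v), see [19].
    q2 = np.zeros(W2.shape[1])
    for _ in range(mf_iters):
        q1 = g(x @ W1 + q2 @ W2.T + a1)          # update lower-layer marginals
        q2 = g(q1 @ W2 + a2)                     # update higher-layer marginals
    # Deterministic forward pass of (6), augmented with q(h2 | v).
    h1 = g(x @ W1 + q2 @ W2.T + a1)
    return g(h1 @ W2 + a2)                       # high-level feature I in [0,1]^{D2}
```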
B. Hough Forest Model

Hough forests [20] provide a powerful probabilistic model for encoding local patch features into probabilities of the existence of object centroids. In this letter, we modify the Hough forest model by embedding patch orientations into the leaf nodes to achieve rotation-invariant voting. As shown in Fig. 2, a training sample consists of three components: an image, a mask, and a centroid. For negative samples, both the mask and the centroid are null.
Fig. 3. Illustration of the definitions of patch offset and orientation.
First, the training samples are divided into a set of local patches using the superpixel segmentation strategy. For the patches generated from positive samples, the mask is used to label them as positive or negative: patches overlapping the mask are labeled as positive patches, whereas the others are labeled as negative patches. All patches generated from negative samples are labeled as negative patches. For a positive patch, offset and orientation information is computed. As shown in Fig. 3, the offset of a patch (d) is defined as the direction vector starting from the object's centroid and ending at the patch's centroid. The orientation of a patch (α) is defined as the angle, measured anticlockwise, between the dominant gradient orientation of the patch (n_f) and the offset of the patch. Here, the dominant gradient orientation is computed using the scale-invariant feature transform (SIFT) method. Then, the generated patches are normalized and characterized by the deep feature generation model. Finally, the patches with high-level features, class labels, offsets, and orientations are used to train a Hough forest model.

Each tree in the Hough forest is constructed separately based on a group of patches $\{p_i = (\mathbf{I}_i, c_i, \mathbf{d}_i, \alpha_i)\}$, where $\mathbf{I}_i$ denotes the high-level feature representation, $c_i$ denotes the class label (0: negative patch; 1: positive patch), $\mathbf{d}_i$ denotes the offset, and $\alpha_i$ denotes the orientation. For a negative patch, $\mathbf{d}_i$ is set to (0, 0) and $\alpha_i$ is set to zero. As shown in Fig. 2, each internal node of a constructed tree represents a binary test function, which bipartitions the incoming patches into two subsets and distributes them to its left and right child nodes, respectively. In this letter, we design the binary test function as

$$B_F(\mathbf{I}; e, \tau) = \begin{cases} 0, & \text{if } I(e) < \tau \\ 1, & \text{otherwise} \end{cases} \tag{7}$$

where $I(e)$ is the $e$-th feature channel of feature $\mathbf{I}$, and $\tau \in (0,1)$ is a real threshold value. Specifically, patches with a test value of 0 are distributed to the left child node, whereas the others are dispatched to the right child node. For each leaf node, the following information is stored: the label proportion $C_L$, the offset list $D_L$, and the orientation list $R_L$. Here, $C_L \in [0,1]$ records the proportion of positive patches, and $D_L = \{\mathbf{d}_i\}$ and $R_L = \{\alpha_i\}$, respectively, store the offsets and orientations of the positive patches.

At the training stage, each tree in the Hough forest is recursively constructed starting from the root. During construction, each newly created node receives a set of patches. If the depth of the node reaches the maximal depth $d_m$ or the number of patches falls below $N_m$, the node is labeled as a leaf node, and the information $\{C_L, D_L, R_L\}$ is computed and stored at it. Otherwise, an internal node is created, and an optimal binary test function is determined to bipartition the patches; the two resulting subsets are then distributed to two newly created child nodes, respectively. To suppress the uncertainties of class labels, offsets, and orientations toward the leaf nodes, we adopt the same strategy as in [20] to design the binary test functions.
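A minimal sketch of the binary test of (7) and the per-leaf information $\{C_L, D_L, R_L\}$, using an illustrative Python layout of our own:

```python
# Illustrative node-level pieces of the modified Hough forest.
from dataclasses import dataclass, field
from typing import List, Tuple

def binary_test(I, e, tau):
    """Send a patch left (0) if its e-th feature channel is below tau, else right (1)."""
    return 0 if I[e] < tau else 1

@dataclass
class Leaf:
    C_L: float                                                    # proportion of positive patches
    D_L: List[Tuple[float, float]] = field(default_factory=list)  # offsets d_i of positive patches
    R_L: List[float] = field(default_factory=list)                # orientations alpha_i (radians)
```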
Fig. 4. Object detection framework.
C. Object Detection

The proposed object detection framework is detailed in Fig. 4. At the detection stage, a test image is first partitioned into a set of local patches using the superpixel segmentation strategy. Then, the generated patches are characterized by the deep feature generation model. Next, each patch is dispatched to the Hough forest, where each tree receives a copy of the patch. When the patch arrives at an internal node, the binary test function is used to redistribute it. Finally, once the patch arrives at a leaf node, the information $\{C_L, D_L, R_L\}$ stored at that node is used to cast probabilistic votes.

Consider a patch $p(u,v) = (\mathbf{I}(u,v), c(u,v), \mathbf{d}(u,v), \alpha(u,v))$ in a test image, and denote $E(x,y)$ as the random event corresponding to the existence of an object centered at position $(x,y)$. Then, the conditional probability $p(E(x,y)\,|\,\mathbf{I}(u,v))$ is expressed as

$$\begin{aligned} p\left(E(x,y)\,|\,\mathbf{I}(u,v)\right) &= p\left(E(x,y), c(u,v)=1\,|\,\mathbf{I}(u,v)\right)\\ &= p\left(E(x,y)\,|\,c(u,v)=1, \mathbf{I}(u,v)\right)\,p\left(c(u,v)=1\,|\,\mathbf{I}(u,v)\right)\\ &= p\left(\mathbf{d}(u,v)=(u,v)-(x,y)\,|\,c(u,v)=1, \mathbf{I}(u,v)\right)\\ &\quad\times p\left(c(u,v)=1\,|\,\mathbf{I}(u,v)\right). \end{aligned} \tag{8}$$

By this definition, the existence of an object centered at position $(x,y)$ necessarily implies $c(u,v)=1$, because only a positive patch can attest to the existence of the object. In addition, the actual offset of patch $p(u,v)$ (i.e., $\mathbf{d}(u,v)$) should exactly match the estimated offset of this patch (i.e., $(u,v)-(x,y)$). For a single tree $t$, the probability estimation is given by

$$p\left(E(x,y)\,\middle|\,\mathbf{I}(u,v);t\right) = \frac{C_L}{|D_L|}\sum_{\mathbf{d}_i\in D_L;\,\alpha_i\in R_L}\frac{1}{2\pi\sigma^2}\exp\left(-\frac{\left\|\,[(u,v)-(x,y)]-\|\mathbf{d}_i\|_2\begin{bmatrix}\cos\alpha_i & -\sin\alpha_i\\ \sin\alpha_i & \cos\alpha_i\end{bmatrix}\mathbf{n}_f\,\right\|^2}{2\sigma^2}\right) \tag{9}$$

where $\mathbf{n}_f$ is the unit vector of the dominant gradient orientation of patch $p(u,v)$.
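The following sketch (our own NumPy illustration; the `Leaf` layout follows the sketch in Section II-B) evaluates the single-tree vote density of (9) for a candidate centroid position.

```python
# A minimal sketch of the rotation-invariant single-tree vote of (9).
import numpy as np

def tree_vote(leaf, patch_xy, xy, n_f, sigma=3.0):
    """Vote density for an object centered at xy, cast by one patch through one tree."""
    if not leaf.D_L:
        return 0.0
    total = 0.0
    for d_i, alpha_i in zip(leaf.D_L, leaf.R_L):
        c, s = np.cos(alpha_i), np.sin(alpha_i)
        R = np.array([[c, -s], [s, c]])          # anticlockwise rotation by alpha_i
        # Rotate the unit gradient direction n_f by alpha_i and scale by |d_i|
        # to obtain the rotation-invariant predicted offset.
        pred = np.linalg.norm(d_i) * (R @ np.asarray(n_f))
        resid = np.subtract(patch_xy, xy) - pred
        total += np.exp(-(resid @ resid) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return leaf.C_L * total / len(leaf.D_L)
```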
TABLE I
PARAMETER CONFIGURATIONS USED AT TRAINING AND DETECTION STAGES
TABLE II
OBJECT DETECTION RESULTS AND QUANTITATIVE EVALUATIONS OBTAINED USING DIFFERENT METHODS

Fig. 5. Subset of training samples. (a) Airplanes. (b) Ships. (c) Background images of natural scenes.
For the entire Hough forest $\{t_i\}_{i=1}^{T}$, the probability estimation is

$$p\left(E(x,y)\,\middle|\,\mathbf{I}(u,v);\{t_i\}_{i=1}^{T}\right) = \frac{1}{T}\sum_{i=1}^{T}p\left(E(x,y)\,\middle|\,\mathbf{I}(u,v);t_i\right). \tag{10}$$
This forest-based probability estimation constitutes the vote cast by a single patch $p(u,v)$ about the existence of an object in its vicinity. By accumulating the votes cast by different patches, we construct a 2-D Hough voting space $HS(x,y)$, where each position $(x,y)$ contains the accumulated votes about the probability of the existence of an object at that position, i.e.,

$$HS(x,y) = \sum_{p(u,v)} p\left(E(x,y)\,\middle|\,\mathbf{I}(u,v);\{t_i\}_{i=1}^{T}\right). \tag{11}$$

Finally, the object centroids are located by ascertaining the positions with local maxima in the Hough voting space through a traditional nonmaximum suppression process.
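In a discrete implementation, (10) and (11) can be realized by letting each stored offset cast a vote at its predicted centroid and smoothing the accumulator with the Gaussian kernel of (9). The following single-scale sketch (assuming SciPy; `forest_leaves` and all parameter defaults are our own assumptions) does this and performs the nonmaximum suppression step; the multiscale extension of (12) below would simply scale `pred` by the forest's scale factor $s$.

```python
# A sketch of vote accumulation into HS(x, y) and centroid localization.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def detect_centroids(patches, forest_leaves, shape, sigma=3.0,
                     window=15, threshold=0.5):
    """patches: iterable of ((u, v), feature, n_f); forest_leaves(feature) gives one leaf per tree."""
    HS = np.zeros(shape)                         # 2-D Hough voting space HS(x, y)
    for (u, v), I, n_f in patches:
        leaves = forest_leaves(I)                # leaves reached in trees t_1..t_T
        for leaf in leaves:
            if not leaf.D_L:
                continue
            w = leaf.C_L / (len(leaf.D_L) * len(leaves))   # forest average, see (10)
            for d_i, alpha_i in zip(leaf.D_L, leaf.R_L):
                c, s = np.cos(alpha_i), np.sin(alpha_i)
                pred = np.linalg.norm(d_i) * np.array([c * n_f[0] - s * n_f[1],
                                                       s * n_f[0] + c * n_f[1]])
                x, y = int(round(u - pred[0])), int(round(v - pred[1]))
                if 0 <= x < shape[0] and 0 <= y < shape[1]:
                    HS[x, y] += w                # accumulate votes, see (11)
    HS = gaussian_filter(HS, sigma)              # spread votes with the Gaussian of (9)
    # Nonmaximum suppression: keep sufficiently strong local maxima.
    peaks = (HS == maximum_filter(HS, size=window)) & (HS > threshold)
    return np.argwhere(peaks)                    # estimated object centroids
```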
Multiscale Hough Forests: To handle objects with varying sizes, we construct a set of multiscale Hough forests with scale factors $s_1, s_2, \ldots, s_M$ at the detection stage. Each Hough forest is governed by one scale factor, and the offsets stored at a leaf node are scaled by the associated factor before casting votes. Thus, for a Hough forest with scale factor $s$, (9) becomes

$$p\left(E(x,y)\,\middle|\,\mathbf{I}(u,v);t,s\right) = \frac{C_L}{|D_L|}\sum_{\mathbf{d}_i\in D_L;\,\alpha_i\in R_L}\frac{1}{2\pi\sigma^2}\exp\left(-\frac{\left\|\,[(u,v)-(x,y)]-s\,\|\mathbf{d}_i\|_2\begin{bmatrix}\cos\alpha_i & -\sin\alpha_i\\ \sin\alpha_i & \cos\alpha_i\end{bmatrix}\mathbf{n}_f\,\right\|^2}{2\sigma^2}\right). \tag{12}$$

III. RESULTS AND DISCUSSION

Due to the lack of public data sets of high-resolution satellite images for object detection, in this study, we collected our own image data sets from the publicly available Google Earth service. At the training stage, we collected 350 airplanes from airports in Britain, the United States, Germany, Singapore, and Dubai; 350 ships from ports in China, the United States, Japan, Taiwan, Hong Kong, and Singapore; and 350 background images of natural scenes. Fig. 5 presents a subset of the training samples. At the detection stage, we collected 100 images of airport scenes from airports in Italy, the United States, Germany, the Netherlands, Japan, Switzerland, Britain, and Spain, as well as 90 inshore and ocean images from China, the United States, Japan, Taiwan, Hong Kong, and Singapore, for evaluating the proposed method. Each image contains multiple instances of objects of interest at different orientations and sizes. All the images have a spatial resolution of 0.27 m.
Fig. 6. Object detection results. (a) Airplanes. (b) Ships.
A. Object Detection

To evaluate the performance of the proposed method, we applied it to the 190 test images containing 1623 airplanes and 2321 ships. The parameters and their configurations used in this study, determined through a sensitivity analysis of computational and detection performance, are detailed in Table I. To handle scale variations, the scale factors were defined as {0.6, 0.8, 1.0, 1.2, 1.4}, resulting in Hough forests at five different scales. The object detection results are listed in Table II. As reflected in Table II, the majority of airplanes and ships were correctly detected, and only a small number of false positives were generated. Fig. 6 presents a subset of the object detection results on the test images, and Fig. 7 shows the statistical results as precision-recall curves. Owing to the use of patch orientations and multiscale Hough forests, the proposed method is able to handle objects with varying orientations and sizes. However, as shown by the green boxes in Fig. 6, some small airplanes and ships failed to be detected because such small objects received very few votes in the Hough voting space. Moreover, as shown by the blue boxes in Fig. 6, some passageways connecting airport lounges to airplanes and some long, stretched buildings were falsely detected as targets of interest; this was caused by the high similarity of these objects to the targets of interest in texture and geometry. On the whole, the proposed method achieved promising performance in detecting airplanes and ships with varying orientations and sizes.

To quantitatively evaluate the detection results, we used the following four measures: completeness (cpt), correctness (crt), quality (qat), and F1-measure (fmr). Completeness assesses the proportion of true positives in the ground truth; correctness measures the proportion of true positives among the detected objects; quality and F1-measure evaluate the overall performance. They are defined as cpt = TP/(TP + FN), crt = TP/(TP + FP), qat = TP/(TP + FN + FP), and fmr = 2 · cpt · crt/(cpt + crt), where TP, FN, and FP are the numbers of true positives, false negatives, and false positives, respectively.
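These four measures follow directly from the counts, as in the small self-contained helper below (function and variable names are our own).

```python
# Evaluation measures computed from detection counts.
def evaluate(TP, FN, FP):
    """Return (completeness, correctness, quality, F1-measure)."""
    cpt = TP / (TP + FN)                         # completeness (recall)
    crt = TP / (TP + FP)                         # correctness (precision)
    qat = TP / (TP + FN + FP)                    # quality
    fmr = 2 * cpt * crt / (cpt + crt)            # F1-measure
    return cpt, crt, qat, fmr
```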
Fig. 7. Object detection performances obtained using different methods. (a) Airplane detection performances. (b) Ship detection performances.
The quantitative evaluations are detailed in Table II. Specifically, the proposed method obtained an average completeness, correctness, quality, and F1-measure of 0.958, 0.969, 0.929, and 0.963, respectively, in detecting airplanes and ships from the 190 collected satellite images.

The proposed method was implemented in C++ and run on an HP Z820 workstation with 8 cores and 16 threads. The time costs for training the deep feature generation model and the Hough forest model were about 5.1 and 0.5 h, respectively. The computing time at the detection stage was about 21 min using a parallel computing strategy with 16 threads.

B. Comparative Study

Comparative studies were conducted to further compare the performance of the proposed method with three existing methods: Lei's method [11], Cheng's method [14], and Bai's method [12]. We applied each of these methods to the 190 collected images to detect airplanes and ships. Table II lists the object detection results obtained using the three methods, and Fig. 7 presents the statistical results obtained using the different methods. Among the three, Lei's method generated more false positives and fewer true positives than the other two, whereas Bai's method achieved a performance similar to that of the proposed method. Quantitative evaluations using completeness, correctness, quality, and F1-measure were also performed on the detection results, as shown in Table II. Comparatively, the proposed method obtained the best performance, whereas Lei's method performed worst among the four methods.

IV. CONCLUSION

This letter has presented a rotation-invariant method for detecting small man-made objects from high-resolution satellite imagery. Benefiting from the deep feature generation model and the multiscale Hough forests with patch orientations, the proposed method is able to handle objects with arbitrary orientations and sizes. Quantitative evaluations on 190 collected satellite images showed that the proposed method achieved an average completeness, correctness, quality, and F1-measure of 0.958, 0.969, 0.929, and 0.963, respectively, in detecting airplanes and ships. Comparative studies with three existing methods demonstrated that the proposed method outperformed the other methods in accurately and correctly detecting objects with varying orientations and sizes.
REFERENCES

[1] R. Qin, "Change detection on LOD 2 building models with very high resolution spaceborne stereo imagery," ISPRS J. Photogramm. Remote Sens., vol. 96, pp. 179–192, Oct. 2014.
[2] L. Durieux, E. Lagabrielle, and A. Nelson, "A method for monitoring building construction in urban sprawl areas using object-based analysis of Spot 5 images and existing GIS data," ISPRS J. Photogramm. Remote Sens., vol. 63, no. 4, pp. 399–408, Jul. 2008.
[3] L. Eikvil, L. Aurdal, and H. Koren, "Classification-based vehicle detection in high-resolution satellite images," ISPRS J. Photogramm. Remote Sens., vol. 63, no. 1, pp. 65–72, Jan. 2009.
[4] J. Xu, X. Sun, D. Zhang, and K. Fu, "Automatic detection of inshore ships in high-resolution remote sensing images using robust invariant generalized Hough transform," IEEE Geosci. Remote Sens. Lett., vol. 11, no. 12, pp. 2070–2074, Dec. 2014.
[5] H. Sun, X. Sun, H. Wang, Y. Li, and X. Li, "Automatic target detection in high-resolution remote sensing images using spatial sparse coding bag-of-words model," IEEE Geosci. Remote Sens. Lett., vol. 9, no. 1, pp. 109–113, Jan. 2012.
[6] Y. Li, X. Sun, H. Wang, H. Sun, and X. Li, "Automatic target detection in high-resolution remote sensing images using a contour-based spatial model," IEEE Geosci. Remote Sens. Lett., vol. 9, no. 5, pp. 886–890, Sep. 2012.
[7] X. Hu, J. Shen, J. Shan, and L. Pan, "Local edge distributions for detection of salient structure textures and objects," IEEE Geosci. Remote Sens. Lett., vol. 10, no. 3, pp. 466–470, May 2013.
[8] W. Zhang, X. Sun, K. Fu, C. Wang, and H. Wang, "Object detection in high-resolution remote sensing images using rotation invariant parts based model," IEEE Geosci. Remote Sens. Lett., vol. 11, no. 1, pp. 74–78, Jan. 2014.
[9] X. Chen, S. Xiang, C. Liu, and C. Pan, "Vehicle detection in satellite images by hybrid deep convolutional neural networks," IEEE Geosci. Remote Sens. Lett., vol. 11, no. 10, pp. 1797–1801, Oct. 2014.
[10] G. J. Scott, M. N. Klaric, C. H. Davis, and C. R. Shyu, "Entropy-balanced bitmap tree for shape-based object retrieval from large-scale satellite imagery databases," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 5, pp. 1603–1616, May 2011.
[11] Z. Lei, T. Fang, H. Huo, and D. Li, "Rotation-invariant object detection of remotely sensed images based on Texton forest and Hough voting," IEEE Trans. Geosci. Remote Sens., vol. 50, no. 4, pp. 1206–1217, Apr. 2012.
[12] X. Bai, H. Zhang, and J. Zhou, "VHR object detection based on structural feature extraction and query expansion," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 10, pp. 6508–6520, Oct. 2014.
[13] G. Cheng, J. Han, P. Zhou, and L. Guo, "Multi-class geospatial object detection and geographic image classification based on collection of part detectors," ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119–132, Dec. 2014.
[14] G. Cheng et al., "Object detection in remote sensing imagery using a discriminatively trained mixture model," ISPRS J. Photogramm. Remote Sens., vol. 85, pp. 32–43, Nov. 2013.
[15] S. Bhagavathy and B. S. Manjunath, "Modeling and detection of geospatial objects using texture motifs," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 12, pp. 3706–3715, Dec. 2006.
[16] Z. Li and L. Itti, "Saliency and gist feature for target detection in satellite images," IEEE Trans. Image Process., vol. 20, no. 7, pp. 2017–2029, Jul. 2011.
[17] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba, "Learning with hierarchical-deep models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1958–1971, Aug. 2013.
[18] R. Achanta et al., "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2281, Nov. 2012.
[19] R. Salakhutdinov and G. Hinton, "An efficient learning procedure for deep Boltzmann machines," Neural Comput., vol. 24, no. 8, pp. 1967–2006, Aug. 2012.
[20] J. Gall, A. Yao, N. Razavi, L. van Gool, and V. Lempitsky, "Hough forests for object detection, tracking, and action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2188–2202, Nov. 2011.