Multimed Tools Appl DOI 10.1007/s11042-015-3141-0
Occluded vehicle detection with local connected deep model
Hai Wang 1 & Yingfeng Cai 2 & Xiaobo Chen 2 & Long Chen 2
Received: 7 August 2015 / Revised: 17 November 2015 / Accepted: 1 December 2015
© Springer Science+Business Media New York 2015
Abstract Traditional vehicle detection algorithms include no targeted processing for vehicle occlusion. To address this issue, this paper proposes an occluded vehicle detection algorithm based on a locally connected deep model. First, suspected occluded vehicles are generated using a cascaded Adaboost classifier: any sub-images rejected during the last two stages of the cascade are considered suspected occluded vehicles. Then, eight types of vehicle occlusion visual models are manually established, and each suspected occluded vehicle is assigned to one of these models by color histogram matching. Finally, the sub-image of the suspected occluded vehicle is fed into a locally connected deep model of the corresponding type to make the final decision. Experiments on the KITTI dataset demonstrate that, compared with existing vehicle detection algorithms such as the cascaded Adaboost, the Deformable Part Model (DPM), Deep Convolutional Neural Networks (DCNN) and the Deep Belief Network (DBN), this algorithm achieves a much higher occluded vehicle detection rate. Additionally, the method requires little extra processing time, about 5 % more than the cascaded Adaboost.
Keywords Vehicle detection · Occluded vehicle · Deep model · Occlusion type matching · Monocular vision
1 Introduction
Road environment perception technology in onboard vision platforms is one of the core technologies in the field of automatic driving and active safety systems [3, 14, 18, 21].
* Corresponding author: Yingfeng Cai
1 School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, Jiangsu, China
2 Automotive Engineering Research Institution, Jiangsu University, Zhenjiang 212013, Jiangsu, China
Road vehicle detection technology is currently receiving considerable attention in this field from vehicle manufacturers, vehicle parts suppliers, research institutes and other related enterprises, owing to its strong impact on vehicle safety [2, 8]. Real-time capability and robustness are two of the most critical indicators of vision-based vehicle detection algorithms. Viola and Jones [19] proposed a cascaded Adaboost algorithm based on Haar features that achieved very good results for face detection. Their method trains Adaboost strong classifiers, each combining several weak classifiers built on a small number of features, and uses the cascade of strong classifiers to progressively reject non-face regions until only the facial regions remain. Inspired by their work, this paper proposes improved methods that can be applied to vehicle detection tasks. Sun et al. [17] performed a detailed study of feature selection for vehicle detection, analyzing and comparing various features, including PCA, wavelet coefficients and Gabor coefficients. Ponsa and Lopez [15] used a strategy called 'lazy evaluation' to enhance the real-time performance of boosting and applied it to vehicle detection. Withopf and Jahne [20] improved the cascaded structure of the classifier with a tree-like cascade that dramatically reduces the number of features required for training at each stage, consequently reducing the training time. In recent years, the Histogram of Oriented Gradients (HOG) feature, which is closely related to Scale Invariant Feature Transform (SIFT) descriptors, has achieved good results in pedestrian detection [9]. Moranduzzo and Melgani [12] combined SIFT features with a support vector machine (SVM) classifier to achieve good classification results for the recognition of multiclass objects.
Felzenszwalb et al. [4] brought HOG features into the Deformable Part Model (DPM) and used a latent SVM to achieve accurate detection of various objects, including vehicles, pedestrians and bicycles. Classifiers such as SVM and Adaboost are shallow learning models, because they can be modeled as a structure with one input layer, one hidden layer and one output layer. Deep learning refers to a class of machine learning techniques that exploit hierarchical architectures for representation learning and pattern classification. In contrast to shallow models, deep learning can learn multiple levels of representation and abstraction, enabling a better understanding of image data. Deep structures such as the DBN have demonstrated their success in image classification tasks such as MNIST handwritten digit recognition, and the DCNN structure works well for vehicle detection in remote sensing images [1, 13]. Although many algorithms have been proposed for vehicle detection, there are currently no algorithms specifically designed for the problem of vehicle occlusion. Consequently, although the algorithms described above achieve high detection rates when vehicles are in full view, they often perform poorly on partially occluded vehicles. In real traffic scenes, a large number of vehicles are partially occluded, so vehicle detection algorithms need to handle occlusion more effectively. Occluded vehicles are difficult to detect because current methods include the pixels of the occluding object in the decision, which introduces interference. If these pixels can be excluded from the decision process, the occluded object recognition task becomes an 'incomplete image recognition' task, and the detection rate for occluded vehicles should improve greatly.
The incomplete image recognition problem can be addressed with Zhong's method [22], which applies a nonlinear hierarchical deep structure and handles missing information through the design of a locally connected deep belief network. This method has shown superior performance for the recognition of incomplete handwritten digits and human faces. Other methods have also been proposed to deal with images with missing information; for example, Song et al. [16] proposed a model for missing-data completion that seamlessly analyzes information from multiple sources. Based on the analysis above and inspired by Zhong's method, this paper proposes an occluded vehicle detection algorithm with the following steps. First, suspected occluded vehicles are generated using Viola and Jones's object detection framework based on Haar features and a cascaded Adaboost classifier. Then, eight types of vehicle occlusion visual models are manually established, and each suspected occluded vehicle is assigned to one of them by color histogram matching. Finally, the sub-image of the suspected occluded vehicle is fed into a locally connected deep model of the corresponding type to make the final decision.
2 Suspected occluded vehicle generation
Many previous studies have demonstrated that object recognition based on Haar features and a cascaded Adaboost classifier performs well for face and vehicle detection, especially when the detection target is not occluded. This section briefly introduces the method and then proposes the suspected occluded vehicle generation strategy.
2.1 Haar feature
A Haar rectangular feature is computed from the pixel sums of two or more adjacent rectangular regions in a grayscale image. Figure 1 shows some of the Haar features used in this work. The value of a Haar feature equals the sum of the pixels in the black area minus the sum of the pixels in the white area. In [19], Viola and Jones also proposed a fast computation method called the 'integral image' to quickly calculate the sum of pixels in a rectangular area: with the integral image, the sum of the pixels within any rectangle of the original image requires only four lookups in the integral image. Figure 2 shows the feature maps of an image for the horizontal and vertical two-rectangle Haar features at 1×2 scale.
Fig. 1 Haar rectangular feature
Fig. 2 Dual rectangular Haar feature with 1×2 scale: (a) original image; (b) horizontal Haar feature; (c) vertical Haar feature
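The integral-image trick described in Sect. 2.1 can be sketched in NumPy as follows. This is a minimal illustration, assuming a zero-padded integral image and assuming which half of each two-rectangle feature is "black"; the function names are hypothetical and not taken from the paper.

```python
import numpy as np


def integral_image(img: np.ndarray) -> np.ndarray:
    """Integral image padded with a leading zero row and column,
    so that ii[r, c] is the sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii: np.ndarray, top: int, left: int, height: int, width: int) -> int:
    """Pixel sum of a rectangle using only four integral-image lookups."""
    bottom, right = top + height, left + width
    return int(ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left])


def haar_horizontal(ii, top, left, height, width):
    """Two side-by-side rectangles, each height x width.
    Assumption: the left one is 'white' and the right one is 'black'."""
    white = rect_sum(ii, top, left, height, width)
    black = rect_sum(ii, top, left + width, height, width)
    return black - white


def haar_vertical(ii, top, left, height, width):
    """Two stacked rectangles, each height x width.
    Assumption: the upper one is 'white' and the lower one is 'black'."""
    white = rect_sum(ii, top, left, height, width)
    black = rect_sum(ii, top + height, left, height, width)
    return black - white
```

For a 4×4 test image `np.arange(16).reshape(4, 4)`, the horizontal feature over the whole image is 68 − 52 = 16 and the vertical one is 92 − 28 = 64. The cost of each feature is constant regardless of rectangle size, which is what makes exhaustive sliding-window evaluation feasible.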
2.2 Weak classifier construction
A strong Adaboost classifier is built from several weak classifiers. Each weak classifier is a simple discriminative model defined as

$$g_{\mathrm{Haar}} = \begin{cases} 1, & f_j < \theta_j,\\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

where $f_j$ is the absolute value of Haar feature $j$ and $\theta_j$ is the threshold.
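The weak classifier of Eq. (1) and the search for its threshold under a weight distribution can be sketched as below. This is an illustrative sketch, not the authors' code: the candidate thresholds (midpoints between sorted absolute responses, plus one value below and one above them) are an assumption, and Eq. (1) has no parity term, so the stump always predicts "vehicle" for small responses.

```python
import numpy as np


def weak_classify(f: np.ndarray, theta: float) -> np.ndarray:
    """Eq. (1): 1 where the absolute Haar response |f_j| is below theta_j, else 0."""
    return (np.abs(f) < theta).astype(int)


def fit_threshold(f: np.ndarray, y: np.ndarray, w: np.ndarray):
    """Choose theta_j minimising the weighted error sum_i w_i |g(x_i) - y_i|.

    Candidate thresholds (an assumption of this sketch): one value below all
    |f|, the midpoints between consecutive distinct |f|, and one value above all.
    """
    u = np.unique(np.abs(f))
    candidates = np.concatenate(([u[0] - 1.0], (u[:-1] + u[1:]) / 2.0, [u[-1] + 1.0]))
    errors = np.array([np.sum(w * np.abs(weak_classify(f, t) - y)) for t in candidates])
    k = int(np.argmin(errors))
    return float(candidates[k]), float(errors[k])
```

With responses `[1, -2, 3, 10, -11]`, labels `[1, 1, 1, 0, 0]` and uniform weights, the search returns θ = 6.5 with zero weighted error; the sign of the response is ignored because Eq. (1) uses its absolute value.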
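2.3 Cascaded Adaboost algorithm
There are many variants of the Adaboost algorithm; the Discrete Adaboost is used in our work [11]. Its pseudo code is as follows:

Pseudo code of Discrete Adaboost
1: Input: N samples $(x_1, y_1), \ldots, (x_N, y_N)$ with $y_i \in \{0, 1\}$
2: Initialization: weights $\omega_{1,i} = 1/N$, $i = 1, \ldots, N$
3: for $t = 1, \ldots, T$ do
3a: normalize the weights, $\omega_{t,i} \leftarrow \omega_{t,i} / \sum_{n=1}^{N} \omega_{t,n}$
3b: train a weak classifier $g_j$ for every feature $j$
3c: compute the weighted errors $\varepsilon_j = \sum_i \omega_{t,i} |g_j(x_i) - y_i|$
3d: select the weak classifier $g_t$ with the minimal error $\varepsilon_t$
3e: update $\omega_{t+1,i} = \omega_{t,i} \beta_t^{1-e_i}$, where $e_i = 0$ if $g_t(x_i) = y_i$ and $e_i = 1$ otherwise, and $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$
end for
4: Output: $G(x) = 1$ if $\sum_{t=1}^{T} \alpha_t g_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t$, and $G(x) = 0$ otherwise, where $\alpha_t = \log(1/\beta_t)$

The cascaded Adaboost classifier is built by cascading several Adaboost output models G. In this work, a cascaded Adaboost classifier Cv with fifteen stages, trained with non-occluded vehicle samples, is used for suspected occluded vehicle generation (Fig. 3).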
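The Discrete Adaboost pseudo code can be turned into a small runnable sketch over a precomputed matrix of Haar responses. This is an illustration under stated assumptions, not the authors' implementation: weak classifiers are the threshold stumps of Eq. (1) (no parity term), the weights are renormalized at the start of every round as in Viola and Jones [19], and the error is clamped away from 0 and 1 so that β and α stay finite.

```python
import numpy as np


def _best_stump(A, y, w):
    """Best threshold stump of Eq. (1) over all features, by weighted error."""
    best = None
    for j in range(A.shape[1]):
        u = np.unique(A[:, j])
        cands = np.concatenate(([u[0] - 1.0], (u[:-1] + u[1:]) / 2.0, [u[-1] + 1.0]))
        for theta in cands:
            g = (A[:, j] < theta).astype(int)
            err = float(np.sum(w * np.abs(g - y)))
            if best is None or err < best[0]:
                best = (err, j, float(theta), g)
    return best


def train_discrete_adaboost(F, y, T):
    """F: (N, M) Haar responses, y: labels in {0, 1}. Returns [(j, theta, alpha)]."""
    A = np.abs(F)                          # Eq. (1) thresholds the absolute response
    w = np.full(len(y), 1.0 / len(y))      # uniform initial weights
    model = []
    for _ in range(T):
        w = w / w.sum()                    # keep the weights a distribution
        eps, j, theta, g = _best_stump(A, y, w)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)   # guard against division by zero / log(0)
        beta = eps / (1.0 - eps)
        e = (g != y).astype(int)           # e_i = 0 if correct, 1 otherwise
        w = w * beta ** (1 - e)            # shrink the weights of correctly classified samples
        model.append((j, theta, float(np.log(1.0 / beta))))
    return model


def predict(model, F):
    """Strong classifier G(x): weighted vote against half the total alpha."""
    A = np.abs(F)
    score = sum(alpha * (A[:, j] < theta) for j, theta, alpha in model)
    return (score >= 0.5 * sum(alpha for _, _, alpha in model)).astype(int)
```

A cascade such as Cv would chain several such strong classifiers, each trained on the samples that survived the previous stages; that bookkeeping is omitted here.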
Fig. 3 Cascaded Adaboost classifier Cv

2.4 Suspected occluded vehicle generation strategy
In the cascaded classifier, we have observed that sub-images that differ greatly from the positive samples are easily rejected in the initial stages, whereas sub-images that differ only slightly from the positive samples are often rejected in the final stages. Occluded vehicle images contain both vehicle (positive) content and occluder (negative) content, so the latter situation generally occurs and such images are rejected in the final stages. To verify this hypothesis, we conducted a group of experiments in which 1000 sub-images containing occluded vehicles were fed into the cascaded Adaboost classifier Cv, with the following results:
a) the vehicle was correctly identified in 237 sub-images;
b) 433 sub-images were rejected at the 15th stage of Cv;
c) 291 sub-images were rejected at the 14th stage of Cv;
d) 39 sub-images were rejected before the 14th stage of Cv.
These results support our hypothesis to a certain degree: most failures to detect occluded vehicles (724 of the 763 missed sub-images) are due to rejection at the 14th or 15th stage of Cv. Based on this analysis, we treat all sub-images rejected at the 14th or 15th stage of Cv as suspected occluded vehicles and mark them for further processing. For example, Fig. 4a is a typical urban road image recorded by an onboard vision platform. When this image is fed into Cv, the sub-images rejected at the penultimate and final stages are marked with green and orange boxes, respectively (Fig. 4b). The white, unoccluded car on the left is correctly identified as a vehicle, and eight further sub-images are identified as suspected occluded vehicles. Not all of these eight sub-images are occluded vehicles: some contain non-vehicle objects, such as windows, signs, curbs and road markings, which are easily mistaken for vehicles.
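The generation strategy amounts to recording which stage of the cascade rejects each window. The sketch below assumes a hypothetical stage interface (each stage is a callable that returns True when the window passes); it is not the paper's code, only an illustration of the rule "rejected at stage 14 or 15 of 15 means suspected occluded vehicle".

```python
from typing import Callable, Sequence

# Hypothetical stage interface: each stage is a callable returning True when the
# window passes (is accepted by) that strong classifier.
Stage = Callable[[object], bool]


def rejection_stage(stages: Sequence[Stage], window) -> int:
    """0 if the window passes every stage, else the 1-based index of the rejecting stage."""
    for k, stage in enumerate(stages, start=1):
        if not stage(window):
            return k
    return 0


def label_window(stages: Sequence[Stage], window, last_stages: int = 2) -> str:
    """Windows rejected in the last `last_stages` stages (stages 14 and 15 of a
    15-stage cascade) become suspected occluded vehicles."""
    k = rejection_stage(stages, window)
    if k == 0:
        return "vehicle"
    if k > len(stages) - last_stages:
        return "suspected_occluded"
    return "non_vehicle"
```

With toy stages `[lambda s, k=k: s >= k for k in range(1, 16)]`, a window scoring 13.5 is rejected at stage 14 and labeled `"suspected_occluded"`, while one scoring 5.5 is rejected at stage 6 and labeled `"non_vehicle"`.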
Fig. 4 Examples of suspected occluded vehicle generation
3 Incomplete image recognition classifier with local connected deep model
Deep learning interprets data such as images, speech and text by loosely imitating the hierarchical analysis mechanism of the human brain, and it is currently regarded as one of the closest approaches to true artificial intelligence. It introduces hierarchical information processing into feature representation and maps the original data into a new feature space through layer-by-layer feature learning. Compared with handcrafted features, hierarchical features learned from large amounts of data can better describe the rich internal structure of the data, which makes classification and prediction easier [7, 10]. Applying the feature extraction capability of deep models to incomplete image recognition, Zhong et al. proposed a classifier based on a locally connected bilinear deep belief network [22]. By setting the connections between neurons in adjacent layers as either 'connected' or 'disconnected', this classifier outperforms a fully connected deep network on incomplete image recognition tasks. The method is described in more detail in the following four subsections.
3.1 Structure of local connected deep belief network
Let $X = [X_1, X_2, \ldots, X_k, \ldots, X_K]$ be a set of incomplete image samples, where $X_k \in \mathbb{R}^{I\times J}$ is one incomplete image sample and $K$ is the number of samples. Let $F_k$ be the set of missing entries of sample $X_k$: if pixel $(X_k)_{ij}$ is missing, then $(X_k)_{ij} \in F_k$. $Y = [Y_1, Y_2, \ldots, Y_k, \ldots, Y_K]$ denotes the labels corresponding to $X$. In our work, every image is classified as vehicle or non-vehicle, so $Y_k$ is either $[1\ 0]$ or $[0\ 1]$. Based on these definitions, a locally connected DBN architecture is constructed as shown in Fig. 5. The model consists of one visible input layer H1, several hidden layers H2, …, HN, and one visible label layer La at the top. The visible input layer H1 has I×J neurons, matching the dimension of the input feature, which in this application consists of the original 2D pixel values of the training samples. The top layer La has only two units, corresponding to the two classes to be distinguished.
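3.2 Network neuron number setting
This subsection introduces a method for setting the number of neurons based on bilinear projection. For samples $X_1, X_2, \ldots, X_K \in \mathbb{R}^{I\times J}$, we seek two projection matrices $U \in \mathbb{R}^{I\times P}$ and $V \in \mathbb{R}^{J\times Q}$ and latent representations $T_{X_1}, \ldots, T_{X_K} \in \mathbb{R}^{P\times Q}$ satisfying $T_{X_s} = U^{T} X_s V$. The two projection matrices are obtained from the following objective function:

$$\arg\max_{U,V} J(U,V) = \sum_{s,t=1}^{K} \left\| U^{T}\left(X_s \odot Z^{X}_{st} - X_t \odot Z^{X}_{st}\right) V \right\|^{2} \left(\alpha B_{st} - (1-\alpha) W_{st}\right)$$
$$\text{s.t.}\quad \left(Z^{X}_{st}\right)_{ij} = \begin{cases} 0, & \text{if } (X_s)_{ij} \in F_s \text{ or } (X_t)_{ij} \in F_t,\\ 1, & \text{otherwise,} \end{cases} \qquad U^{T}U = I_P,\quad V^{T}V = I_Q \qquad (2)$$

where $\odot$ denotes element-wise multiplication by the binary mask $Z^{X}_{st}$, so that pixels missing in either sample of a pair are ignored, and $\alpha$ weights the $B_{st}$ and $W_{st}$ terms.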
Optimizing J(U,V) is a non-convex problem in the two matrices U and V. It is addressed with a strategy called Alternative Fixing (AF): U (or V) is first fixed and J(U,V) is optimized with respect to V (or U) alone; then V (or U) is fixed and J(U,V) is optimized with respect to U (or V). AF alternates between these two steps until J(U,V) converges. The resulting matrices U* and V* maximize J(U,V) and preserve the discriminative information of the original sample data X. The size of the upper layer is then determined by the numbers P and Q of positive eigenvalues associated with U* and V*, giving P×Q neurons.

Fig. 5 Locally connected DBN (feature input $X_k$ to the visible layer H1, hidden layers H2, …, HN, and label layer La with units $y_k^1$ and $y_k^2$; S marks the connection switches)
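Alternative Fixing for Eq. (2) can be sketched as follows. Writing $D_{st}$ for the masked difference and $w_{st} = \alpha B_{st} - (1-\alpha)W_{st}$, the objective equals $\sum w_{st}\,\mathrm{tr}(U^T D_{st} V V^T D_{st}^T U)$, so with V fixed the best orthonormal U consists of the top-P eigenvectors of $\sum w_{st} D_{st} V V^T D_{st}^T$ (and symmetrically for V). This is a sketch under assumptions: the weights $w_{st}$ are given, P and Q are chosen up front, V starts from the first Q identity columns, and counting positive eigenvalues is one plausible reading of the layer-size rule, not necessarily Zhong's exact procedure.

```python
import numpy as np


def masked_differences(X, missing, pairs):
    """D_st = X_s ⊙ Z_st - X_t ⊙ Z_st, with Z_st zero where either pixel is missing.
    X: (K, I, J) samples; missing: (K, I, J) boolean, True where the pixel is in F_k."""
    D = []
    for s, t in pairs:
        Z = (~(missing[s] | missing[t])).astype(float)
        D.append(X[s] * Z - X[t] * Z)
    return D


def objective(U, V, D, w):
    """J(U, V) = sum_st w_st ||U^T D_st V||_F^2, w_st standing for alpha*B_st - (1-alpha)*W_st."""
    return float(sum(wst * np.linalg.norm(U.T @ Dst @ V) ** 2 for Dst, wst in zip(D, w)))


def _top_eigvecs(M, k):
    vals, vecs = np.linalg.eigh(M)          # symmetric matrix, ascending eigenvalues
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:k]], vals[order]


def alternative_fixing(D, w, P, Q, iters=20):
    """Alternately maximise over U (V fixed) and V (U fixed); each step is the exact
    maximiser under the orthonormality constraint, so J never decreases."""
    V = np.eye(D[0].shape[1])[:, :Q]        # initialisation: an assumption of this sketch
    for _ in range(iters):
        MU = sum(wst * Dst @ V @ V.T @ Dst.T for Dst, wst in zip(D, w))
        U, vals_u = _top_eigvecs(MU, P)
        MV = sum(wst * Dst.T @ U @ U.T @ Dst for Dst, wst in zip(D, w))
        V, vals_v = _top_eigvecs(MV, Q)
    # Numbers of positive eigenvalues: candidate sizes for the next layer.
    return U, V, int(np.sum(vals_u > 0)), int(np.sum(vals_v > 0))
```

Because each half-step solves its subproblem exactly, J(U, V) is non-decreasing across iterations, which is the property that makes the alternating scheme converge even though the joint problem is non-convex.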
3.3 Pretraining of network weights
In a deep belief network, a Restricted Boltzmann Machine (RBM) is used to extract hidden representations through layer-wise reconstruction. For incomplete image recognition, Zhong et al. modified the traditional RBM into a locally connected RBM by adding a switch to each connection between the lower and upper layers. When $(X_k)_{ij} \in F_k$, the input feature is missing and the switch $(V_c)_{ij}$ is disconnected; otherwise, the switch $(V_c)_{ij}$ is connected. The weights between two adjacent layers are pretrained with the greedy layer-wise reconstruction method proposed by Hinton et al. [6]. The pretraining process is illustrated here with the visible input layer H1 and the first hidden layer H2, which together form a locally connected RBM. The energy of a joint state $(h^1, h^2)$ is
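$$E(h^1, h^2; \theta^1) = -\sum_{i=1,j=1}^{i\le I,\,j\le J}\ \sum_{p=1,q=1}^{p\le P_2,\,q\le Q_2} h^1_{ij}\, A^1_{ij,pq}\, Z^{A,1}_{ij,pq}\, h^2_{pq} \;-\; \sum_{i=1,j=1}^{i\le I,\,j\le J} h^1_{ij}\, b^1_{ij}\, Z^{b,1}_{ij} \;-\; \sum_{p=1,q=1}^{p\le P_2,\,q\le Q_2} c^1_{pq}\, h^2_{pq} \qquad (3)$$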
where I×J is the number of neurons in H1 and P2×Q2 is that of H2, and $\theta^1 = (A^1, b^1, c^1)$ are the parameters between the visible input layer H1 and the first hidden layer H2: $A^1_{ij,pq}$ is the weight from input neuron (i, j) in H1 to hidden neuron (p, q) in H2, and $b^1_{ij}$ and $c^1_{pq}$ are the (i, j)th and (p, q)th biases of H1 and H2, respectively. $Z^{A,1}_{ij,pq}$ and $Z^{b,1}_{ij}$ are the switch variables that connect or disconnect the weights, defined as

$$Z^{A,1}_{ij,pq} = \begin{cases} 0, & (X_k)_{ij} \in F_k,\\ 1, & \text{otherwise,} \end{cases} \qquad Z^{b,1}_{ij} = \begin{cases} 0, & (X_k)_{ij} \in F_k,\\ 1, & \text{otherwise.} \end{cases} \qquad (4)$$

The joint distribution of this RBM is then

$$P(h^1, h^2; \theta^1) = \frac{e^{-E(h^1, h^2; \theta^1)}}{\sum_{h^1}\sum_{h^2} e^{-E(h^1, h^2; \theta^1)}} \qquad (5)$$
As in existing deep learning models, stochastic steepest ascent on the log probability of the training data is used to update the parameters $\theta^1 = [A^1, b^1, c^1]$:

$$A^1_{ij,pq} = \vartheta A^1_{ij,pq} + \Delta A^1_{ij,pq}\, Z^{A,1}_{ij,pq} \qquad (6)$$

$$\Delta A^1_{ij,pq} = \varepsilon_A\left(\langle h^1_{ij}(0)\, h^2_{pq}(0)\rangle_{\mathrm{data}} - \langle h^1_{ij}(1)\, h^2_{pq}(1)\rangle_{\mathrm{recon}}\right) \qquad (7)$$

where $\langle\cdot\rangle_{\mathrm{data}}$ denotes the expectation with respect to the data distribution and $\langle\cdot\rangle_{\mathrm{recon}}$ the expectation with respect to the reconstruction distribution after one step. The pair (H1, H2) serves only as an example: the whole pretraining process proceeds layer by layer, from the lowest pair (H1, H2) up to the highest pair (HN−1, HN).
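A locally connected RBM trained with one-step contrastive divergence (CD-1), following Eqs. (3)-(7), can be sketched in NumPy as below. This is a minimal sketch, assuming a flattened binary visible layer, a per-pixel switch vector z (1 = observed, 0 = missing) shared by all hidden units, and hidden probabilities (rather than sampled states) in the statistics, a common CD-1 convention. The `vartheta` parameter mirrors the ϑ of Eq. (6) as written; with `vartheta=1.0` the update reduces to plain masked CD-1.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LocallyConnectedRBM:
    """Binary RBM between a flattened I*J visible layer and a hidden layer, with
    per-pixel switches z (1 = observed, 0 = missing) as in Eqs. (3)-(4)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.A = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible biases b^1
        self.c = np.zeros(n_hidden)    # hidden biases c^1

    def cd1_update(self, v0, z, lr=0.1, vartheta=1.0):
        """One CD-1 step (Eqs. 6-7) for a single visible vector v0 with switches z."""
        ZA = z[:, None]                             # Z^{A,1}: cut all weights of a missing pixel
        A_eff = self.A * ZA
        p_h0 = sigmoid(v0 @ A_eff + self.c)         # data phase
        h0 = (self.rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(A_eff @ h0 + self.b * z) * z   # reconstruct observed pixels only
        p_h1 = sigmoid(p_v1 @ A_eff + self.c)       # reconstruction phase
        dA = lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))   # Eq. (7)
        self.A = vartheta * self.A + dA * ZA        # Eq. (6): masked weight update
        self.b += lr * (v0 - p_v1) * z              # Z^{b,1} masks the visible biases
        self.c += lr * (p_h0 - p_h1)
```

The key property of the switches is visible in the update: weights and biases attached to missing pixels receive no gradient, so the occluder never influences what the hidden layer learns from that sample. Stacking such RBMs layer by layer gives the pretrained locally connected DBN.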
3.4 Global fine-tuning
In the unsupervised pretraining process described above, a greedy layer-wise algorithm was used to learn the locally connected DBN parameters, with additional information from the bilinear projection. In this subsection, the traditional back-propagation algorithm is used to fine-tune the parameters θ = [A, b, c] using the information from the label layer La. Since a good parameter initialization has already been obtained during pretraining, back-propagation only needs to adjust the parameters slightly to reach the locally optimal parameters θ* = [A*, b*, c*]. At this stage, the objective is to minimize the classification error $-\sum_k y_k \log \hat{y}_k$, where $y_k$ and $\hat{y}_k$ are the true label and the output label of sample $X_k$, respectively.
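The fine-tuning objective can be made concrete with a short sketch. The paper does not state the output nonlinearity of La; this sketch assumes a softmax over the two label units, for which the gradient of $-\sum_k y_k \log \hat{y}_k$ with respect to the label-layer inputs is simply $\hat{y} - y$, the quantity that back-propagation then pushes into the hidden layers.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def classification_error(Y: np.ndarray, logits: np.ndarray):
    """-sum_k y_k log(y_hat_k) over a batch with one-hot labels ([1 0] or [0 1]),
    and its gradient w.r.t. the label-layer inputs (softmax output assumed)."""
    Y_hat = softmax(logits)
    loss = float(-np.sum(Y * np.log(Y_hat + 1e-12)))
    grad = Y_hat - Y            # back-propagated into the hidden layers during fine-tuning
    return loss, grad
```

For a vehicle sample (label `[1, 0]`) with equal label-layer inputs, the loss is log 2 ≈ 0.693 and the gradient is `[-0.5, 0.5]`, pushing the output toward the vehicle unit.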
4 Vehicle occlusion visual model establishment and matching
The locally connected deep model based classifier has been shown to be effective for incomplete image recognition; however, it cannot be used directly for occluded vehicle detection. To perform incomplete image recognition, it needs to know in advance which pixels belong to the object to be identified and which belong to the missing (or occluded) region. In this application, however, this is not known for the pixels of the generated suspected occluded vehicle sub-images (i.e., whether each pixel belongs to the occluded vehicle or to the occluding object). A method is therefore required to specify each pixel's property in advance, so that the vehicle can be detected by the classifier described in the previous section. For this task, this section proposes a simplified vehicle occlusion visual model together with a color histogram matching method.
4.1 Vehicle occlusion visual model establishment
Through observation and analysis of a large number of occluded vehicle images (Fig. 6), we found that the visual model of partial vehicle occlusion can be simplified to one of eight geometric types, as shown in Fig. 7, where the blue region marks pixels belonging to the vehicle and the brown region marks pixels belonging to the blocking object. Once a suspected occluded vehicle image has been assigned to one of the eight occlusion models, the property of each pixel is specified, and the switch settings of the locally connected deep belief network follow directly.

Fig. 6 Examples of occlusion scenarios

Fig. 7 Occlusion visual geometry model (legend: pixels belonging to vehicles; pixels belonging to occlusion objects)
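The step from an occlusion model to the switch settings of Eq. (4) can be sketched as follows. The eight geometries exist only graphically in Fig. 7, so the four masks below are hypothetical placeholders chosen for illustration; only the mapping "occluder pixel, switch 0; vehicle pixel, switch 1" reflects the text. The 32×32 size matches the training samples of Sect. 5.1.

```python
import numpy as np


def hypothetical_occlusion_masks(n: int = 32) -> dict:
    """Illustrative stand-ins for occlusion geometries (True = occluder pixel).
    These shapes are assumptions; the paper's eight models are defined in Fig. 7."""
    r, c = np.mgrid[0:n, 0:n]
    return {
        "left": c < n // 2,
        "right": c >= n // 2,
        "bottom": r >= (2 * n) // 3,
        "lower_left": (r >= n // 2) & (c < n // 2),
    }


def switches_from_mask(occ_mask: np.ndarray, n_hidden: int):
    """Eq. (4): a pixel covered by the occluder is treated as missing (switch 0).
    Returns Z^b (one switch per visible unit) and Z^A (visible x hidden)."""
    z_b = (~occ_mask).astype(float).ravel()
    z_A = np.repeat(z_b[:, None], n_hidden, axis=1)   # same switch for every hidden unit
    return z_b, z_A
```

Because all switches of a pixel are determined by one mask bit, a model choice fixes the entire switch pattern of the first layer at once, which is why assigning the occlusion type is sufficient to configure the network.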
4.2 Occlusion model matching
Our observations show that when a vehicle on the road is occluded, it often differs considerably from the blocking object in color or color distribution. Color is therefore the primary cue for occlusion model matching. Specifically, each suspected occluded vehicle image Ic is partitioned according to each of the eight occlusion models in turn, with the blocking section marked AOcc and the vehicle section marked AVeh. The colors of regions AOcc and AVeh are then quantized into 256 color blocks in HSV (hue-saturation-value) color space, with the H component divided equally into sixteen intervals and the S and V components each divided into four (16 × 4 × 4 = 256). Counting the pixels in each block yields a 256-bin histogram for each region. Finally, the degree of difference between AOcc and AVeh is defined from the histograms as in formula (8), where AVeh(k) and AOcc(k) are the kth bins of the histograms of AVeh and AOcc, respectively:

$$\mathrm{Diff} = \sum_{k=1}^{256} \left| A_{Veh}(k) - A_{Occ}(k) \right| \qquad (8)$$
With this measure, the occlusion type of each suspected occluded vehicle image Ic is taken to be the model yielding the largest degree of difference, and the pixels of Ic are then marked as vehicle or blocking object accordingly. In Fig. 8, the degree of difference between a suspected occluded vehicle image and each of the eight occlusion models is calculated separately and shown in the bar chart on the right. The largest Diff is obtained for model b, so model b is taken as the occlusion model of this image, and the switches of the locally connected deep belief network are set according to this model.
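The 16×4×4 HSV histogram and the "largest Diff wins" rule can be sketched as below. Two assumptions are labeled in the code: the image is already converted to HSV with all channels scaled to [0, 1], and each histogram is normalized to sum to 1, since the text does not say how regions of different size are made comparable. The model masks are passed in by the caller.

```python
import numpy as np


def hsv_histogram(hsv_pixels: np.ndarray) -> np.ndarray:
    """256-bin HSV histogram: H in 16 bins, S and V in 4 bins each (16*4*4 = 256).
    Assumptions: hsv_pixels is (n, 3) with all channels in [0, 1], and the
    histogram is normalised so regions of different size are comparable."""
    h = np.minimum((hsv_pixels[:, 0] * 16).astype(int), 15)
    s = np.minimum((hsv_pixels[:, 1] * 4).astype(int), 3)
    v = np.minimum((hsv_pixels[:, 2] * 4).astype(int), 3)
    bins = h * 16 + s * 4 + v
    hist = np.bincount(bins, minlength=256).astype(float)
    return hist / max(hist.sum(), 1.0)


def diff(hsv_img: np.ndarray, occ_mask: np.ndarray) -> float:
    """Eq. (8): sum_k |A_Veh(k) - A_Occ(k)| for one occlusion model."""
    a_veh = hsv_histogram(hsv_img[~occ_mask])
    a_occ = hsv_histogram(hsv_img[occ_mask])
    return float(np.abs(a_veh - a_occ).sum())


def match_occlusion_model(hsv_img: np.ndarray, masks: dict):
    """Pick the model whose vehicle/occluder split gives the largest Diff."""
    scores = {name: diff(hsv_img, m) for name, m in masks.items()}
    return max(scores, key=scores.get), scores
```

With normalized histograms, Diff lies between 0 (identical color distributions) and 2 (disjoint ones). For an 8×8 image whose left half is red and right half is cyan, a "left half occluded" mask scores 2, while a "top half occluded" mask scores 0, so the left model is selected.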
Fig. 8 A sample of the calculation of the degree of difference and matching
5 Experiment and analysis
In this section, we first describe the preparation of the training samples for incomplete image recognition and the setting of the training parameters. The vehicle detection results of our method and other existing methods on different datasets are then presented, compared and analyzed.
5.1 Training of incomplete image recognition classifier
To train an effective incomplete image recognition classifier with the deep model, a large number of occluded vehicle samples were prepared manually. The non-occluded positive samples were partly selected from existing vehicle datasets such as Caltech1999, INRIA and TME, and partly cropped from real road images captured by our group. The non-occluded negative samples were randomly selected from road images without vehicles. In total there are 7500 non-occluded positive samples and 20,000 non-occluded negative samples, all resized to 32×32 pixels (Fig. 9). To generate occluded samples, artificial blocks were added to each non-occluded positive and negative sample according to the eight types of occlusion models illustrated in Fig. 7; some of the generated occluded samples are shown in Fig. 10. The occluded samples were then used to train the locally connected deep belief network classifier. Most of the training parameters are the same as in [7]: the number of epochs was fixed at 50, the learning rate η was set to 0.1, and the initial momentum ϑ was 0.5.
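Occluded training samples of the kind described above can be synthesized with a few lines of NumPy. The paper only says that man-made blocks were added according to the occlusion models; filling the masked region with pixels from another image (for example, a random negative sample) is an assumption of this sketch, as are the function names. Keeping the model name with each sample lets the matching switch pattern be applied during training.

```python
import numpy as np


def synthesize_occlusion(sample, occ_mask, occluder):
    """Overwrite the masked region of a sample with pixels from an 'occluder' image.
    Using another image as the artificial block is an assumption of this sketch."""
    out = sample.copy()
    out[occ_mask] = occluder[occ_mask]
    return out


def make_occluded_set(samples, masks, occluders, seed=0):
    """Give every sample a randomly chosen occlusion model and occluder; keep the
    model name so the matching switch setting can be used during training."""
    rng = np.random.default_rng(seed)
    names = sorted(masks)
    images, labels = [], []
    for x in samples:
        name = names[rng.integers(len(names))]
        occ = occluders[rng.integers(len(occluders))]
        images.append(synthesize_occlusion(x, masks[name], occ))
        labels.append(name)
    return np.stack(images), labels
```

Drawing the occluder from real background imagery rather than a constant color keeps the artificial blocks closer to the textures of real occluders, although the classifier itself ignores the masked pixels once the switches are set.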
5.2 Experiment on KITTI dataset
The experiments in this work use the KITTI dataset [5], which provides many types of road images with very detailed annotations for every object.
Fig. 9 Original sample images: (a) non-occluded positive samples; (b) non-occluded negative samples
More importantly, KITTI provides the occlusion status of every vehicle in the images, labeled as non-occluded, partially occluded or fully occluded. A typical annotated road image is shown in Fig. 11, where the three occlusion states are marked in green, yellow and red, respectively. The partially occluded and non-occluded vehicles in KITTI were chosen for our test: 2000 images were randomly selected from the KITTI dataset, containing 2175 partially occluded vehicles and 4308 non-occluded vehicles. The detailed processing of a KITTI image by the proposed method is shown in Fig. 12. In the cascaded-Adaboost-based suspected occluded vehicle generation, two sub-images passed all fifteen stages of the cascaded classifier and were identified as vehicles (green boxes). Another nine sub-images were rejected at the 14th or 15th stage and were classed as suspected occluded vehicles (brown boxes). Each of these nine sub-images was assigned to one of the eight occlusion models by HSV histogram matching, the switches S of the locally connected deep belief network were set according to the corresponding model, and each sub-image was finally classified by the deep classifier. For this image, three occluded vehicles that had been missed by the cascaded Adaboost classifier were correctly detected.
We compared our method with several state-of-the-art methods: the cascaded Adaboost, the Deformable Part Model (DPM), the Deep Convolutional Neural Network (DCNN) and the Deep Belief Network (DBN). All of the non-occluded and artificially occluded samples were used to train the cascaded Adaboost, DPM, DCNN and DBN. The detection results of these methods on the KITTI dataset are listed in Table 1.

Fig. 10 Sample images with missing information

Fig. 11 A typical image with occlusion marks in the KITTI dataset
Fig. 12 An example of detailed processing of the proposed method for a KITTI image (the nine suspected occluded sub-images are classified by the locally connected deep belief networks: three as Vehicle and six as NonVehicle)
Table 1 Detection results of multiple methods

| Classifier type | Correctly labeled occluded vehicles | Detection rate of occluded vehicles | Correctly labeled non-occluded vehicles | Detection rate of non-occluded vehicles |
|---|---|---|---|---|
| Cascaded Adaboost [19] | 1512/2175 | 69.52 % | 4098/4308 | 96.13 % |
| DPM [4] | 1596/2175 | 73.33 % | 4113/4308 | 95.47 % |
| DCNN [10] | 1679/2175 | 77.20 % | 4206/4308 | 97.63 % |
| DBN [7] | 1720/2175 | 79.08 % | 4191/4308 | 97.28 % |
| Our method (N=9) | 1938/2175 | 89.10 % | 4169/4308 | 96.77 % |
As shown in Table 1, with the number of hidden layers N set to 9, our method achieves the highest occluded vehicle detection rate of all the methods. Moreover, since the proposed method only adds an extra step for uncertain sub-images to the traditional cascaded Adaboost method, its detection rate for non-occluded vehicles is not degraded. The running times of the five methods are shown in Fig. 13. All processed images were 1240×370 pixels. The processing platform was an Intel Core i7 quad-core 2.67 GHz processor with 8 GB of memory running 64-bit Windows 7, and the programming environment was Microsoft Visual Studio 2010. Figure 13 shows that DCNN has the lowest processing time (89 ms), owing to the weight sharing of the convolution filters, while DPM takes the most time (176 ms) because it must calculate HOG features for both the root and part filters. The running time of our method was 131 ms, slightly longer (around 5 %) than that of the cascaded Adaboost (125 ms). In summary, the proposed method adds a single step to the traditional vehicle detection pipeline, in which suspected occluded vehicles are verified by the deep model. It therefore does not degrade the detection rate of non-occluded vehicles, while substantially improving the detection of partially occluded vehicles at the cost of a small amount of additional time.
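The headline numbers in this discussion follow from simple arithmetic on the counts and timings reported above; the short check below reproduces them. The helper name is illustrative only.

```python
def detection_rate(correct: int, total: int) -> float:
    """Detection rate in percent, rounded to two decimals as in Table 1."""
    return round(100.0 * correct / total, 2)


ours_occluded = detection_rate(1938, 2175)        # proposed method: 89.10 %
adaboost_occluded = detection_rate(1512, 2175)    # cascaded Adaboost: 69.52 %
gain_points = round(ours_occluded - adaboost_occluded, 2)   # 19.58 percentage points
time_overhead = 100.0 * (131 - 125) / 125         # 4.8 % extra run time vs. cascaded Adaboost
```

The 6 ms difference on a 125 ms baseline is 4.8 %, which is the "around 5 %" overhead quoted in the text, bought in exchange for a gain of roughly 19.6 percentage points in occluded vehicle detection rate.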
Fig. 13 Running time of each algorithm
6 Conclusions
Traditional vehicle detection algorithms lack dedicated processing for vehicle occlusion. To address this issue, this paper has proposed an occluded vehicle detection algorithm based on a locally connected deep model. First, suspected occluded vehicles are identified using a cascaded Adaboost classifier: the sub-images rejected in the last two stages of the cascade are considered suspected occluded vehicles. Eight types of vehicle occlusion visual models are then manually established, and each suspected occluded vehicle is assigned to one of them by color histogram matching. Finally, the sub-image of the suspected occluded vehicle is fed into a locally connected deep model of the corresponding type for the final decision. Experiments on the KITTI dataset demonstrated that, compared with existing vehicle detection algorithms such as the cascaded Adaboost, the Deformable Part Model (DPM), Deep Convolutional Neural Networks (DCNN) and Deep Belief Networks (DBN), this algorithm achieves a much higher occluded vehicle detection rate, while the extra processing time is minimal, at around 5 % more than the cascaded Adaboost.

Acknowledgments
This work has been supported by the National Natural Science Foundation of China (grants 61573171, 61403172, 61203244 and 51305167), the China Postdoctoral Science Foundation (2014M561592, 2015T80511), the Information Technology Research Program of the Transport Ministry of China (grant 2013364836900) and the Natural Science Foundation of Jiangsu Province (BK20140555).
References
1. Chen X, Xiang S, Liu CL et al (2014) Vehicle detection in satellite images by hybrid deep convolutional neural networks. IEEE Geosci Remote Sens Lett 11(10):1797–1801
2. Cheon M, Lee W, Yoon C et al (2012) Vision-based vehicle detection system with consideration of the detecting location. IEEE Trans Intell Transp Syst 13(3):1243–1252
3. Eum S, Jung HG (2013) Enhancing light blob detection for intelligent headlight control using lane detection. IEEE Trans Intell Transp Syst 14(2):1003–1011
4. Felzenszwalb PF, Girshick RB, McAllester D et al (2010) Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell 32(9):1627–1645
5. Geiger A, Lenz P, Urtasun R et al (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conf Comput Vision Pattern Recognit (CVPR). IEEE, pp 3354–3361
6. Hinton GE, Osindero S, Teh Y et al (2006) A fast learning algorithm for deep belief nets. Neural Comput 1527–1554
7. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
8. Hsieh JW, Chen LC, Chen DY (2014) Symmetrical SURF and its applications to vehicle detection and vehicle make and model recognition. IEEE Trans Intell Transp Syst 15(1):6–20
9. Kaaniche M, Bremond F (2012) Recognizing gestures by learning local motion signatures of HOG descriptors. IEEE Trans Pattern Anal Mach Intell 34(11):2247–2258
10. Krizhevsky A, Sutskever I, Hinton GE et al (2012) ImageNet classification with deep convolutional neural networks. In: Adv Neural Inform Process Syst, pp 1097–1105
11. Lienhart R, Maydt J (2002) An extended set of Haar-like features for rapid object detection. In: Proc 2002 Int Conf Image Process. IEEE, vol 1, pp I-900–I-903
12. Moranduzzo T, Melgani F (2012) A SIFT-SVM method for detecting cars in UAV images. In: 2012 IEEE Int Geosci Remote Sens Symposium (IRENA H, ed), Munich. IEEE, pp 6868–6871
13. Pauplin O, Jiang J (2012) DBN-based structural learning and optimisation for automated handwritten character recognition. Pattern Recogn Lett 33(6):685–692
14. Pedersoli M, Gonzalez J, Hu X et al (2014) Toward real-time pedestrian detection based on a deformable template model. IEEE Trans Intell Transp Syst 15(1):1–10
15. Ponsa D, Lopez A (2007) Cascade of classifiers for vehicle detection. In: The 9th Int Conf Advanced Concepts for Intelligent Vision Systems (JACQUES BT, ed). Springer, Netherlands, pp 980–989
16. Song X, Nie L, Zhang L et al (2015) Multiple social network learning and its application in volunteerism tendency prediction. In: Proc 38th Int ACM SIGIR Conf Research and Development in Information Retrieval. ACM, pp 213–222
17. Sun Z, Bebis G, Miller R (2006) Monocular precrash vehicle detection: features and classifiers. IEEE Trans Image Process 15(7):2019–2034
18. Vazquez D, Lopez A, Marin J et al (2013) Virtual and real world adaptation for pedestrian detection. IEEE Trans Pattern Anal Mach Intell 36(4):797–809
19. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: 2001 IEEE Comput Soc Conf Comput Vision Pattern Recognit (KASTURI R, ed), Kauai, USA. IEEE, pp 511–518
20. Withopf D, Jahne B (2007) Improved training algorithm for tree-like classifiers and its application to vehicle detection. In: 2007 IEEE Int Conf Intell Transport Syst (Daniel JD, ed), Seattle, USA. IEEE, pp 642–647
21. Yoo H, Yang U, Sohn K (2013) Gradient-enhancing conversion for illumination-robust lane detection. IEEE Trans Intell Transp Syst 14(3):1083–1094
22. Zhong S, Liu Y, Chung F et al (2012) Semiconducting bilinear deep learning for incomplete image recognition. In: Proc 2nd ACM Int Conf Multimedia Retrieval. ACM, p 32
Dr. Hai Wang (1983-), Male, Associate Professor. His research interests include machine vision, machine learning and their application to intelligent vehicles.
Dr. Yingfeng Cai (1985-), Female, Associate Professor. Her research interests include machine vision, machine learning and their application to intelligent transportation systems.
Dr. Xiaobo Chen (1982-), Male, Associate Professor. His research interest is machine learning.
Dr. Long Chen (1958-), Male, Professor. His research interests include vehicle dynamics and Intelligent Transportation System.